QSOD: Hybrid Policy Gradient for Deep Multi-agent Reinforcement Learning

When individuals interact with one another to accomplish specific goals, they learn from others' experiences to achieve the tasks at hand. The same holds for learning in virtual environments, such as video games. Deep multiagent reinforcement learning has shown promising results on many challenging tasks. Most algorithms guide individual agents through value decomposition, which factors the combined Q-value of the agents into individual agent Q-values. Different mixing methods built on a monotonicity assumption, such as QMIX and QVMix, can also be used. However, these methods select individual agent actions through a greedy policy and do not address the large number of training trials the agents require. In this paper, we propose a novel hybrid policy for the action selection of individual agents, known as Q-value Selection using Optimization and DRL (QSOD). A grey wolf optimizer (GWO) determines the choice of individual agents' actions. As in GWO, proper attention among the agents is facilitated through their coordination with one another. We used the StarCraft 2 Learning Environment to compare our proposed algorithm with the state-of-the-art algorithms QMIX and QVMix. Experimental results demonstrate that our algorithm outperforms QMIX and QVMix in all scenarios and requires fewer training trials.


I. INTRODUCTION
Recently, reinforcement learning (RL) has proven effective for solving problems related to cooperative multiagent systems (MAS), and the approach has garnered increased attention. Reinforcement learning has shown particular utility for complex tasks such as self-driving vehicles [1], power supply systems, logistics distribution in factories, productivity optimization [2], and cooperative multi-robot exploration systems [3], [4], all of which have commercial prospects in large-scale applications.
Customarily, centralized learning in multiagent reinforcement learning (MARL) is used to tackle the problems of cooperative MAS by considering the MAS as a single agent. Such centralized learning performs remarkably well in many scenarios. However, as the number of agents increases, the joint action table grows exponentially, and current RL algorithms may not converge on a solution. By contrast, a distributed and decentralized learning approach, in which each agent learns according to its own policy, handles these problems smoothly. This learning is based on the sum of all the agents' rewards, which is known as the global reward. Typically, independent Q-learning (IQL) is used in such cases [5]. The main shortcoming of this approach is the occurrence of non-stationarity, even for only two agents, due to the global reward [6]. To address the non-stationarity problem, it is also possible to train decentralized policies in a centralized manner. (The associate editor coordinating the review of this manuscript and approving it for publication was Yu-Da Lin.)
For the last three to four years in the RL community, this amalgam technique has become very popular [7], [8]. However, even the introduction of a hybrid approach cannot address many of the challenges still faced by MAS. Among these, the most important challenges are the convergence rate and computational power. IQL fails to address these problems because of its non-stationary nature. Although counterfactual multiagent (COMA) techniques [9] are able to address the convergence problem, they are unable to calculate the combined Q-value from the joint state-action, which is the main criticism leveled against COMA [10]. This shortcoming is attributable to COMA using on-policy learning. Value decomposition networks (VDNs) address this problem, ensuring that learning is performed in a centralized fashion, but the global Q-value function is calculated through a factored approach [11]. In training, a VDN does not utilize the state's additional information because it only represents a shallow class of action values. QMIX [13] and QVMix [36] mitigate this problem by using a neural network to convert the centralized state into the weights of a second neural network, in a manner reminiscent of hyper-networks [12]. However, QVMix requires high computational power, while QMIX requires a large number of training episodes.
StarCraft has been used by many researchers to evaluate deep MARL algorithms, such as in [6], [7], [9], and [13]. Both StarCraft and StarCraft II are registered trademarks of Blizzard. Almost all of these methods address the convergence issue, but due to their nonstationary environments and greedy policies for action selection, they require either large numbers of training episodes or very high computational power.
In this study, we used the StarCraft II Learning Environment (SC2LE) [13]. We introduce a hybrid policy gradient for deep MARL, known as Q-value Selection using Optimization and DRL (QSOD), to mitigate this problem. It relies on a grey wolf optimizer (GWO) [15], [29]. As in GWO, one agent acts as the head of the group, whereas all other agents act according to the lead agent's instruction. Due to the optimization-based policy, agents learn faster with a comparatively small mixer network.
The rest of the paper is organized as follows. Section 2 discusses prior work. In Section 3, we discuss GWO and multiagent systems. In Section 4, we discuss the proposed method. Section 5 presents the experimental results. Finally, Section 6 concludes the discussion and provides an overview of our future directions in this domain.

II. RELATED WORK AND BACKGROUND
The productivity of RL-based techniques over the last several years, particularly in solving cooperative MAS problems, cannot be overlooked. As mentioned previously, MARL uses centralized, decentralized, and hybrid approaches to accomplish its goals. Initially, MARL used centralized approaches [3], [16], which later shifted toward deep learning methods capable of handling multidimensional states and action spaces [6], [7], [17].
Q-learning is a straightforward and powerful algorithm for creating a Q-table for an agent. However, if this table is too large (e.g., in an environment with 10,000 states and 1,000 actions per state, the table reaches 10 million cells), it becomes impossible to handle the vast number of Q-values. This results in two problems. First, the amount of memory required to save and update the table increases. Second, exploring each state to create the necessary Q-table takes too much time.

Algorithm 1 Deep Q-learning
Require: Initialize replay memory M; initialize the Q-value function with random weights
while (episode = 1 to n) do
    Initialize s_1 = {x_1} and preprocessed sequence φ_1 = φ(s_1)
    for (t = 1 to T) do
        if probability < ε then
            select a random action a_t
        else
            select a_t = max_a Q*(φ(s_t), a; θ)
        end if
        Execute a_t in the emulator and observe r_t and x_{t+1}
        Set s_{t+1} = s_t, a_t, x_{t+1} and preprocess φ_{t+1} = φ(s_{t+1})
        Store {φ_t, a_t, r_t, φ_{t+1}} in M
        Sample a random minibatch b of transitions {φ_j, a_j, r_j, φ_{j+1}} from M
        Set y_j = r_j for terminal φ_{j+1}; y_j = r_j + γ max_{a'} Q*(φ_{j+1}, a'; θ) for non-terminal φ_{j+1}
        Perform a gradient descent step on (y_j − Q(φ_j, a_j; θ))^2 according to Equation (1)
    end for
end while
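As a quick check of the arithmetic above, the table size and its memory footprint can be computed directly. The 8 bytes per entry assumes 64-bit floating-point Q-values, which the text does not specify:

```python
# Back-of-envelope estimate of tabular Q-learning memory cost for the
# example above: 10,000 states x 1,000 actions per state.
n_states = 10_000
n_actions = 1_000
n_cells = n_states * n_actions      # 10 million Q-values
bytes_needed = n_cells * 8          # assuming 64-bit floats per cell
print(n_cells)                      # 10000000
print(bytes_needed / 2**20)         # roughly 76 MiB for a single table
```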
To solve these problems, we approximated these Q-values using neural networks. This technique is known as deep Q-learning (DQL).
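To make the DQL loop of Algorithm 1 concrete, the sketch below runs the same steps with a tiny linear approximator and a toy five-state chain environment standing in for a deep network and an emulator. The environment, its rewards, and every hyperparameter value here are illustrative assumptions, not from the paper:

```python
import random
from collections import deque
import numpy as np

# Toy stand-ins (assumptions): a 5-state chain where action 1 moves right,
# action 0 moves left, and reaching state 4 pays reward 1 and ends the episode.
N_STATES, N_ACTIONS, GAMMA, LR = 5, 2, 0.9, 0.05
rng = random.Random(0)
W = np.zeros((N_STATES, N_ACTIONS))   # Q(s, a) = W[s, a]; a linear "network"
M = deque(maxlen=1000)                # replay memory

def step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0), s2 == N_STATES - 1

for episode in range(300):
    eps = max(0.05, 1.0 - episode / 100)   # decaying exploration
    s = 0
    for _ in range(50):
        # epsilon-greedy action selection, as in Algorithm 1
        a = rng.randrange(N_ACTIONS) if rng.random() < eps else int(np.argmax(W[s]))
        s2, r, done = step(s, a)
        M.append((s, a, r, s2, done))      # store the transition
        # sample a minibatch and take a TD step toward y_j
        for sj, aj, rj, sj2, dj in rng.sample(M, min(len(M), 8)):
            yj = rj if dj else rj + GAMMA * float(np.max(W[sj2]))
            W[sj, aj] += LR * (yj - W[sj, aj])   # gradient step on (yj - Q)^2
        s = s2
        if done:
            break

# After training, the greedy policy should move right from every state.
print([int(np.argmax(W[s])) for s in range(N_STATES - 1)])
```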
All the symbols used in Algorithm 1 and the DQL section are listed in Table 1.
In DQL, a deep neural network with weights θ is used to represent the action-value function. In deep Q-networks (DQNs) [19], the Q-value function and the replay memory M, which stores the transition tuples, are first initialized. Then, the process repeats until convergence is achieved. First, we initialize s_1 = {x_1} and φ_1 = φ(s_1), where s_1 is the first state, x_1 is the first image, and φ_1 is the preprocessed sequence. Subsequently, at each step an action a_t is selected. The selection depends on the probability value: if ε > probability, then a_t is selected randomly; otherwise, a_t is selected through max_a Q*(φ(s_t), a; θ). After selecting a_t, we execute this action in the emulator and receive a reward r_t and the next state image x_{t+1}. We then set s_{t+1} = s_t, a_t, x_{t+1} and store the transition tuple {φ_t, a_t, r_t, φ_{t+1}} in M. Then, we sample a random batch of transitions {φ_j, a_j, r_j, φ_{j+1}} from M and set the value of y_j according to φ_{j+1}: if φ_{j+1} is a terminal state, then y_j = r_j; otherwise, y_j = r_j + γ max_{a'} Q*(φ_{j+1}, a'; θ). Through y_j, we want to minimize the squared TD error (y_j − Q(φ_j, a_j; θ))^2, as in Equation (1).

The distributed and decentralized learning approach smoothly tackles problems such as the non-convergence of algorithms and the exponential growth of the joint action table that results from an increasing number of agents. Intuitively, for scrutinizing policies for an MAS, direct learning of decentralized value functions or policies is preferred. IQL [5] trains self-directed action-value functions for individual agents using Q-learning [18]. Later, solutions to these kinds of tasks became more diverse [17], using deep neural networks through the introduction of the DQN [19]. A few other works [7], [20] have focused on the perseverance of learning to some extent; even then, extra state information cannot be considered during the training of learned decentralized value functions.
As expected, the centralized learning of collective actions can handle coordination problems and avoid non-stationarity. However, it is difficult to manage such centralized learning because the collective action space grows exponentially with increasing agent numbers. Classical approaches to scalable centralized learning make use of coordination graphs [21], which exploit conditional independencies among agents by decomposing a combined reward function into a sum of agent-local terms. The sparse cooperative Q-learning algorithm [22] uses a tabular approach that learns to coordinate a group of cooperative agents only in certain states where such coordination is mandatory, encoding these requirements in a coordination graph. These methods require the prior provision of dependencies among agents. By contrast, such prior knowledge is not required when it is instead assumed that every individual agent contributes towards the global reward and that, at every state, the agent becomes aware of its contribution.
Recent approaches for centralized learning require even more communication during execution, such as Comm-Net [23], which uses a centralized network architecture to exchange information between agents. Bic-Net [6] uses bidirectional RNNs to exchange information between agents in an actor-critic setting. This approach requires additional effort to estimate individual agent rewards.
Hybrid approaches exploit centralized learning with fully decentralized execution. COMA [9] uses centralized criteria to train decentralized actors, estimating a counterfactual advantage function for each agent to address the multiagent credit assignment. Similarly, Gupta et al. presented a centralized actor-critic algorithm with per-agent critics [24], which scales better than existing techniques for the same number of agents, but mitigates some of the advantages of centralization. In [25], the authors trained a centralized critic for each agent and applied it to competitive games with continuous action spaces. These approaches use onpolicy gradient learning, which can suffer from low sample efficiency and are prone to getting stuck in suboptimal local minima.
Sunehag et al. [11] proposed value decomposition networks (VDN) as a solution to these problems, which allowed centralized value-function learning to be accompanied by decentralized execution. Their algorithm decomposed a central state-action value function into a sum of individual agent terms. This corresponds to the use of a degenerate, fully disconnected coordination graph. However, VDN does not use additional state information during training and can represent only a limited class of centralized action-value functions.
QMIX [13] is a modern approach that uses the centralized training with decentralized execution (CTDE) [35] method. In this technique, the joint state-action value function for all agents, denoted Q_mix(s_t, u_t), is factorized as a monotonic function using a mixer network. The mixer network calculates the joint state-action value of all agents, and the monotonicity constraint ∂Q_tot/∂Q_a ≥ 0, ∀a ∈ {a_1, . . . , a_n}, ensures the individual-global-max (IGM) condition [35] for each agent. The monotonic condition is achieved through a hyper-network, which predicts strictly positive weights for the mixer network based on the current state as input. Moreover, through this hyper-network, the outputs of the mixer network depend on the current state. The same optimization procedure as in the DQN algorithm is adopted and applied to Q_mix(s_t, u_t). However, the joint action-value function class of QMIX is limited [35].
To address this limitation, QTRAN [37] introduced a novel factorization method that expresses the complete value function class with the help of IGM consistency. This method ensures a more general factorization than QMIX but requires a prohibitive amount of computational power to implement. Its approximate version requires two extra soft regularizations, yet it still performs below par in complex domains with online data collection [34].
Mahajan et al. [34] demonstrated that QMIX has limited exploration ability in certain environments. They proposed a model with a latent space to enhance the performance of all agents. Thus, for cooperative MARL, achieving effective scalability remains an open challenge, which is addressed by QPLEX [35]. For both joint and individual action-value functions, QPLEX introduced a dueling structure, which reformulates the IGM principle as advantage-based IGM. This demonstrates the ability of QPLEX to support offline training with high stability. However, although QPLEX performs well, it still requires complex networks to achieve these results. Moreover, it requires numerous training episodes for a large number of agents, as it uses a greedy policy for the action selection of each individual agent.

Algorithm 2 Grey Wolf Optimizer (GWO)
Require: Initialize agents G_i (i = 1, 2, . . . , n) and the number of iterations K; initialize a, A, and C
Ensure: Calculate the fitness of the agents and set alpha, beta, and delta according to fitness
while (iteration = 1 to K) do
    for (agent = 1 to n) do
        update the agent's position using Equation (6)
    end for
    update a, A, and C
    compute the fitness of the agents
    update alpha, beta, and delta
end while
return the value of alpha
In addition, researchers have introduced two new Deep Quality-Value (DQV)-based MARL algorithms known as QVMix and QVMix-Max [36]. These algorithms use centralized training with decentralized execution. The results show that, overall, QVMix performed better than the other algorithms because it is less susceptible to an overestimation bias of the Q function [36]. However, QVMix also requires high computational power and large amounts of training time because it likewise uses a greedy policy for the action selection of each individual agent.
Therefore, in this paper, we propose a novel, nature-inspired optimization-based hybrid policy to address these limitations. In this policy, we use GWO along with a greedy policy for the action selection of each individual agent. Optimization algorithms such as GWO (normally used for finding prey) or the Ant Colony Optimizer (normally used for finding the shortest path) require environmental information, but they perform better than the greedy policy. In GWO, agents are trained in a centralized manner, wherein the leader agent helps the other agents [29]. More detail on this topic is provided in Section III (A). Moreover, to gather information about the environment at the beginning, action selection is performed through the greedy policy. A policy is then selected with the help of the maximum reward and the learning rate alpha.

III. GREY WOLF OPTIMIZATION (GWO)
A. INTRODUCTION AND WORKING PARADIGM
The fundamental component of GWO that makes it more successful than other well-known swarm intelligence algorithms is its hierarchical chain. The governance hierarchy is framed by a specific objective function known as the goal. The objective function is expressed as a cost function, an estimated cost, or a fitness function, which is used to assess the precision of the outcome compared with the prearranged structure [28], [29].
The wolf pack is partitioned into four prevailing positions, as shown in Figure 1. Alpha, beta, and delta wolves compose the main groups. The omega wolves have no say in the swarm's decisions, even though their presence shapes the swarm intelligence. The main purpose of the social chain of command is to lead the wolves to the prey's location, and the leading wolves direct the omegas to carry out the pursuit. Different operators, such as social hierarchy, encircling the prey, hunting, attacking the prey (exploitation), and pursuing the prey (exploration), mimic the cumulative behaviors of a wolf pack. Table 2 presents the symbols used in the equations. A detailed description of GWO is provided in the subsequent sections.

Encircling Prey
In GWO, grey wolves encircle the prey to examine two points in space and amend the location of one of them to correspond to the other. The following formulas represent the grey wolf encircling methodology:

D = |C · X_P(t) − X(t)|
X(t + 1) = X_P(t) − A · D

where t indicates the current iteration, X indicates the position of a grey wolf, and X_P is the position vector of the prey.
A and C are coefficient vectors, which can be calculated as

A = 2a · r_1 − a
C = 2 r_2

where the components of a are reduced from 2 to 0 across iterations, and r_1 and r_2 are random vectors in [0, 1].
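The coefficient schedule above can be sketched directly; the linear decay of a and the random draws follow the description in the text, while the iteration count is an arbitrary example value:

```python
import random

# Sketch of the GWO coefficient schedule: a decays linearly from 2 to 0,
# A = 2*a*r1 - a, and C = 2*r2, with r1, r2 drawn uniformly from [0, 1].
rng = random.Random(1)

def coefficients(iteration, max_iter):
    a = 2.0 * (1.0 - iteration / max_iter)   # declines from 2 to 0
    A = 2.0 * a * rng.random() - a           # A lies in [-a, a]
    C = 2.0 * rng.random()                   # C lies in [0, 2]
    return a, A, C

a0, A0, C0 = coefficients(0, 100)     # early: a = 2, wide exploration range
aN, AN, CN = coefficients(100, 100)   # final: a = 0, so A = 0 (pure exploitation)
```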

Hunting
The hunting scheme of the grey wolves can be mathematically modeled as the pack approaching the location of the prey with the assistance of the alpha, beta, and delta wolves' information. Figure 2 shows the estimated updated position of agents in the GWO based on this information. The positions suggested by the alpha, beta, and delta wolves, X_α, X_β, and X_δ, are calculated as in the following equations:

D_α = |C_1 · X_α − X|,  D_β = |C_2 · X_β − X|,  D_δ = |C_3 · X_δ − X|
X_1 = X_α − A_1 · D_α,  X_2 = X_β − A_2 · D_β,  X_3 = X_δ − A_3 · D_δ

Omegas take action according to Equation (6):

X(t + 1) = (X_1 + X_2 + X_3) / 3    (6)

Attacking Prey
Advancing toward the prey requires the wolf to reduce the value of a. The variation scope of A is also reduced by a.
A is a random value in the range [−a, a], where a is reduced from 2 to 0 across the iterations. If |A| < 1, then the wolves are forced to attack the prey; otherwise, they shift toward exploration. The changeover between exploration and exploitation is created by the changing values of a and A. Algorithm 2 describes the grey wolf optimization pseudocode.
GWO first initializes the number of agents/wolves and the values of a, A, and C (where a is a vector whose value declines from 2 to 0, A and C are coefficients, and the agent's exploration or exploitation behavior depends on the value of A). Subsequently, we calculate the fitness of the agents and, according to the fitness level, select the alpha, beta, and delta wolves. Then, the position of every agent is updated according to Equation (6). We repeat these steps until the episode ends. Generally, after completing one execution, we obtain the value of alpha.
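The steps of Algorithm 2 can be sketched end to end on a toy objective. The sphere function, the agent count, and the iteration budget here are illustrative choices, not from the paper; fitness is minimized, so the alpha is simply the best wolf in the pack:

```python
import random

# Compact sketch of Algorithm 2 on a toy objective (minimizing the
# sphere function f(x) = sum of squares), to make the update order concrete.
rng = random.Random(0)
DIM, N_AGENTS, K = 2, 10, 200

def fitness(x):
    return sum(v * v for v in x)

wolves = [[rng.uniform(-10, 10) for _ in range(DIM)] for _ in range(N_AGENTS)]

def top3(pack):                      # alpha, beta, delta by fitness
    return sorted(pack, key=fitness)[:3]

alpha, beta, delta = top3(wolves)
initial_best = fitness(alpha)

for k in range(K):
    a = 2.0 * (1.0 - k / K)          # a declines from 2 to 0
    for i, X in enumerate(wolves):
        new_X = []
        for d in range(DIM):
            # one position estimate per leader, then the Equation (6) average
            estimates = []
            for L in (alpha, beta, delta):
                A = 2.0 * a * rng.random() - a
                C = 2.0 * rng.random()
                D = abs(C * L[d] - X[d])
                estimates.append(L[d] - A * D)
            new_X.append(sum(estimates) / 3.0)   # (X1 + X2 + X3) / 3
        wolves[i] = new_X
    # recompute fitness and refresh alpha, beta, delta
    alpha, beta, delta = top3(wolves)

print(fitness(alpha) < initial_best)
```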

B. OPERATIONAL EXAMPLE
A multiagent system (MAS) is a sub-discipline of distributed artificial intelligence (DAI). It is a combination of comparatively independent parts, known as agents. In an environment, these agents are designed to act as experts, and they have their own actions and behaviors in that specific area. The focus of research in MAS has been to make agents that work without human interaction. According to [33], the most suitable example of an MAS is the Internet, wherein millions of computers run independently but can communicate with each other.
In real-life scenarios, humans often work with each other towards a single goal. They achieve their goals more quickly through communication and shared attention. Similarly, agents can achieve a goal with fewer iterations through communication in the MAS. For scenarios involving fewer agents, this is not a significant issue.
To understand the advantages of the proposed algorithm (policy), let us consider a straightforward example of an MAS. In this scenario, several robots are used to explore the entire area of a building or other environment. Investigation of an unknown region starts with no prior information about the obstacles or the layout of the territory. In this example, we can compare the performance of GWO with that of distributed RL, since distributed RL is the backbone of QMIX/QVMix, whereas in the proposed policy GWO plays the vital role. In addition, this example shows how GWO leverages the advantages of centralized learning without the communication constraints and the propensity for agents to become stuck.
In Figure 3, red represents the current positions of the agents. Yellow represents cells that have not been visited by any agent but whose reward values are known to some agents. White represents explored cells, and grey represents unvisited, unknown cells. Figure 3(a) shows multiagent exploration using a simple decentralized ε-greedy policy. Agent-1 and agent-3 are near the unexplored area, whereas the other two agents are far from it. Agent-2 and agent-4 are stuck because the reward is the same for all possible next states. After a specific time period, the total explored area will be very low because at most two agents perform the exploration. Although agent-1 and agent-3 are closer to the unexplored area, they cannot help the other agents because of the decentralization constraints. Similarly, in the case of a centralized ε-greedy policy, as the number of agents increases, the length of the centralized Q-table increases exponentially; hence, these large tables require very high computational power. Figures 3(b) and 3(c) show the exploration performed using the GWO. In Figure 3(b), agent-2 has the maximum number of unvisited neighbors; therefore, it becomes the alpha. Similarly, agent-1 has three unvisited neighbors, so it acts as the beta. Agent-3 is the delta, and agent-4 is the omega. The next action is taken with the help of the combined current alpha, beta, and delta information. As the omega (agent-4) has no unvisited neighbor, it has the same reward for all possible next states. This agent could become stuck, but under GWO the other agents help the omega take the correct next step. In Figure 3(c), agent-1 has the maximum number of unvisited neighbors, and it becomes the alpha at that stage. Similarly, agent-2 and agent-4 become betas, whereas agent-3 becomes the delta. Meanwhile, no omega is available at this stage.
All the agents take actions according to either the alpha's (agent-1's) order or the best possible reward. It is evident from Figures 3(a) and 3(c) that if some agents are stuck, then after the same number of episodes, GWO performs significantly better than the ε-greedy policy. The reason is that under the ε-greedy policy, it is difficult to return a stuck agent to an operational state, whereas under GWO a stuck agent returns to the operational state very quickly with the help of the other agents, especially the alpha. Moreover, we used GWO along with the ε-greedy policy to speed up the training process. Although GWO performs better than the decentralized ε-greedy policy in some scenarios, when the environment changes, the performance of GWO decreases gradually. Therefore, we used the GWO policy along with the decentralized ε-greedy policy in our proposed algorithm.

IV. Q-VALUE SELECTION USING OPTIMIZATION AND DRL (QSOD)
In most MADRL algorithms, the focus is on upgrading the joint action-value function using different weights [7], [13]. For individual action selection, agents usually use simple Q-learning [30], [31], and attention between agents cannot be developed appropriately. Such algorithms require a large number of training sessions to learn the environment deeply. Furthermore, in many scenarios, algorithms such as IQL fail to address non-stationarity [5], [16]; hence, there is no guarantee of system convergence. Similarly, optimizers have many limitations, such as failure of the algorithm when the environment changes.
We propose a novel technique for multiagent deep reinforcement learning based on action-value selection using optimization and DRL (QSOD). In the proposed technique, learning takes place in two ways, either communicatively or non-communicatively, according to the situation, using optimization and a greedy policy. Therefore, the benefits of centralized learning can be leveraged, through optimization, without the communication constraints. Consequently, this technique saves computational power and offers substantial performance improvements, even in cases involving numerous agents, ensuring faster convergence than traditional DRL algorithms.

VOLUME 9, 2021
QSOD contains a group of recurrent neural networks (RNNs) for calculating the Q action values for the next state (as shown in Figure 4(c)). Each agent uses its own RNN [13]. In typical scenarios, the next action is selected using the ε-greedy policy [30], [31]. However, in the proposed algorithm, we select actions using two different policies. The first policy uses the traditional ε-greedy policy for the action selection of agents. In the second policy, action selection takes place according to the GWO. Within the first policy, learning takes place in a decentralized or distributed fashion [5], [31], [32], wherein each agent selects actions independently. The second policy is based on the last episode's reward, wherein we select an agent as the leader of the group, analogous to the alpha wolf in GWO [15]. This agent performs ε-greedy action selection. All the other agents then select actions according to the Q-value of the leader agent (as in GWO) [15].
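The two policies can be sketched schematically. How exactly a follower combines the leader's Q-values with its own is not fully specified in the text, so the blending rule below (averaging the two Q-vectors before the argmax) is an illustrative assumption, not the paper's exact mechanism; all names and numbers are hypothetical:

```python
import random

# Schematic of the two action-selection policies: a decentralized
# epsilon-greedy policy, and a GWO-style policy in which the leader acts
# epsilon-greedily and followers are biased toward the leader's Q-values.
rng = random.Random(3)
N_ACTIONS = 4

def eps_greedy(q_values, eps=0.1):
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def select_joint_actions(agent_qs, leader_idx, use_gwo_policy):
    """agent_qs: per-agent lists of Q-values; the leader is chosen elsewhere
    (in the paper, from the last episode's reward)."""
    actions = {}
    leader_action = eps_greedy(agent_qs[leader_idx])
    for i, q in enumerate(agent_qs):
        if not use_gwo_policy or i == leader_idx:
            actions[i] = leader_action if i == leader_idx else eps_greedy(q)
        else:
            # follower: average own Q with the leader's Q before the argmax
            # (an assumed blending rule for illustration)
            blended = [0.5 * (q[a] + agent_qs[leader_idx][a]) for a in range(N_ACTIONS)]
            actions[i] = max(range(N_ACTIONS), key=lambda a: blended[a])
    return actions

qs = [[0.1, 0.9, 0.0, 0.2], [0.8, 0.1, 0.0, 0.1], [0.2, 0.3, 0.4, 0.1]]
print(select_joint_actions(qs, leader_idx=0, use_gwo_policy=True))
```

With these example Q-values, both followers are pulled toward action 1, the action the leader ranks highest, even though agent 2's own argmax differs.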
The key to this policy is that if an agent moves far away from the other agents, then, based on the leader agent's Q-value, that agent can easily return and rejoin the team to achieve the goal. If all agents act in a centralized manner, their combined power increases; consequently, the agents achieve a higher reward. In this case, learning is performed in a centralized manner, and attention is properly developed between the agents, as in GWO [15].
Changes in the environment result in optimization failures because, for different environments, optimizers need human input. To address this problem, we used both policies and set two conditions to select a policy in each episode. According to the first condition, optimization-based selection starts when the agents' accumulated reward in the previous episode is greater than or equal to the threshold reward (R_{t−1} ≥ R_th) [28]. The value of the threshold reward R_th is calculated using Equation (7):

R_th = (α · R_max) / 2    (7)
where R_th is the threshold reward, α is the learning rate, and R_max is the maximum or target reward. The maximum or target reward can only be achieved if all goals are achieved and at least one agent is still alive. Because the learning rate α affects the convergence time, training requires a long duration for comparatively low values of α. Therefore, in the proposed algorithm, when a low-value α is used, the optimization policy starts with a lower threshold reward value, and when a high-value α is used, it starts with a higher threshold reward value. As a result, in all scenarios, the optimization policy starts after almost the same number of episodes in both cases.
The second condition is that if the current episode's reward is equal to that of the previous episode, then the traditional action-selection policy (ε-greedy) will be used in the next episode. The main reason for this is that the agents explore the entire area while using the optimization policy. After achieving the maximum reward (the optimal path), the agents always follow the same path, and thus learning stops.
However, the likelihood that the environment will change in subsequent episodes is significant, yet the agents would continue to follow the previously optimal path and thereby perform poorly. Furthermore, in some scenarios, an agent may take the selected action because of an unrealistic reward or next state. In all such cases, the optimization policy either becomes stuck or chooses a suboptimal action, which causes the game to stop and execution to be interrupted. The second condition helps to address these challenges, which constitute a major limitation of the optimizer.
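The two switching conditions can be summarized in one small function. The threshold formula reads the garbled "R_th = αR_max 2" from Algorithm 3 as (α · R_max)/2, an interpretation consistent with the low-α/low-threshold behavior described above; all variable names and numbers are illustrative:

```python
# Sketch of the two policy-switching conditions described in the text.
def threshold_reward(alpha, r_max):
    # Equation (7), read as R_th = (alpha * R_max) / 2
    return alpha * r_max / 2.0

def choose_policy(r_prev, r_prev2, alpha, r_max):
    """Return which action-selection policy to use in the next episode."""
    r_th = threshold_reward(alpha, r_max)
    # Condition 2: two identical consecutive rewards -> fall back to
    # epsilon-greedy so learning does not stall on a memorized path.
    if r_prev == r_prev2:
        return "epsilon-greedy"
    # Condition 1: previous reward reached the threshold -> optimization policy.
    if r_prev >= r_th:
        return "gwo"
    return "epsilon-greedy"

print(choose_policy(r_prev=12.0, r_prev2=5.0, alpha=0.5, r_max=20.0))  # threshold is 5.0
```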
In addition, using optimization-based action selection for all agents creates a problem in that learning may cease during that episode: at each time step, the agents would select actions purely according to the optimization value, since the optimizer only maximizes its objective. Optimizers do not use trial and error, and the concept of punishment, which is the basis of reinforcement learning (RL), is absent. Therefore, a purely optimizer-based policy for action selection is not viable for RL. Instead, we use the ε-greedy policy to select actions for the leader agent, and the leader agent's Q-value helps the other agents select their actions [28]. Except for the leader, all agents choose actions in the optimizer policy according to the GWO.
A mixer network operates after the individual Q_a values of all agents are calculated. A feed-forward neural network is used as the mixer network. It calculates the combined Q-value of the agents, Q_tot. It takes the agent network outputs as input and, by mixing those inputs monotonically, produces the value of Q_tot, which aggregates the individual value functions Q_a of the agents, as shown in Figure 4(a). The monotonicity of the network is enforced by Equation (8), which shows the relationship between Q_tot and Q_a:

∂Q_tot / ∂Q_a ≥ 0,  ∀a ∈ {a_1, . . . , a_n}    (8)

Separate hyper-networks are used to provide the weights for the mixer network, as shown in Figure 4(b). They produce the weights by using the current state as input. These networks are single-layer networks followed by an absolute-value activation function, ensuring that the weights are positive. These networks produce a vector-type output, which is later reshaped into a matrix. A two-layer hyper-network is used to produce the final bias.
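A minimal numpy sketch of this mixer follows: a hyper-network maps the global state to mixer weights, an absolute value keeps the weights positive, and the positive weights make Q_tot monotonic in every agent's Q_a. The sizes and random parameters are illustrative, the final two-layer bias hyper-network is omitted for brevity, and a ReLU (also monotone) stands in for the activation:

```python
import numpy as np

# Monotonic mixer: positive hyper-network weights guarantee dQ_tot/dQ_a >= 0.
rng = np.random.default_rng(0)
N_AGENTS, STATE_DIM, HIDDEN = 3, 4, 8

# hyper-network parameters (these would be learned in practice)
Hw1 = rng.normal(size=(STATE_DIM, N_AGENTS * HIDDEN))
Hw2 = rng.normal(size=(STATE_DIM, HIDDEN))
Hb1 = rng.normal(size=(STATE_DIM, HIDDEN))

def q_tot(agent_qs, state):
    w1 = np.abs(state @ Hw1).reshape(N_AGENTS, HIDDEN)  # positive mixing weights
    b1 = state @ Hb1
    hidden = np.maximum(agent_qs @ w1 + b1, 0.0)        # monotone activation
    w2 = np.abs(state @ Hw2)                            # positive output weights
    return float(hidden @ w2)

state = rng.normal(size=STATE_DIM)
qs = np.array([0.2, -0.1, 0.5])

base = q_tot(qs, state)
bumped = q_tot(qs + np.array([0.1, 0.0, 0.0]), state)   # raise one agent's Q_a
print(bumped >= base)   # monotonicity: Q_tot never decreases
```

Because every weight applied to the agent Q-values is non-negative and the activation is non-decreasing, increasing any single Q_a can never decrease Q_tot, which is exactly the constraint of Equation (8).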
Using the state in the hyper-network instead of the mixer network preserves the monotonicity of the system. QSOD is trained in an end-to-end fashion to minimize the overall loss L(θ), as shown in Equation (9):

L(θ) = Σ_{j=1}^{b} (y_j^tot − Q_tot(τ, u, s; θ))²    (9)

where b is the batch size, and θ⁻ parameterizes the target network used to compute y_j^tot, as in DQN. If the next state is a non-terminal state, y_j^tot = r_j + γ max_{u'} Q_tot(τ', u', s'; θ⁻). All symbols used in Algorithm 3 and Equation (9) are listed in Table 3.
Algorithm 3 presents the steps of the proposed method. First, in Lines 1 and 2, the required parameters are initialized, including the number of agents, the reward for each action, replay memory to store the current preprocessed sequence, values of alpha and gamma, reward for a win, number of episodes, and number of steps in each episode. After that in Line 3, the two main conditions are imposed. The first condition is related to the threshold reward, through which the optimization policy is activated. The second condition is related to monotonicity. In Line 4, a repeated process is started, through which we run our algorithm to a required number of times/episodes. In Line 5, before starting each episode, the first state and the first preprocessed sequence are initialized. The first state refers to the initial position of the agents. In Lines 6 and 23, a condition is imposed for activating either policy (GWO or Greedy). For Greedy, this policy is selected if the previous episode's reward is less than the threshold reward or if the previous two consecutive episodes had identical rewards; otherwise the optimization policy is selected. In Lines 7 and 24, a loop is initialized for the number of steps for each episode. In Line 7, a loop is used for the greedy policy, whereas in Line 24 a loop is used for the optimization policy. The Line 8 loop is used to compute the Q-values of all agents. From Lines 9 to 14, simple Q-learning is performed. From Lines 16 to 21, the steps for simple deep Q-learning are performed, except Line 17, in which the accumulative Q-value is computed through the mixer network. On Line 23, if the previous episode's reward was greater than or equal to the threshold reward and the previous two consecutive episodes had different rewards, then the optimization policy is triggered. In Line 25, the fitness level of the agents is calculated, and an alpha or leader agent is selected. 
In Lines 26 to 29, a loop computes the Q-value for each agent through the optimization policy. In the loop from Lines 8 to 15, an action is selected from the available actions list through IQL; in the loop from Lines 26 to 29, an action is selected from the available actions list through GWO.
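For intuition, the way a GWO-style update could guide an agent's discrete action choice toward the leader (alpha) agent can be sketched as follows. This is a hypothetical adaptation of the standard GWO position update to Q-value-based action selection; the function names, the single-leader simplification, and the fixed control parameter are all our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def gwo_select_action(q_values, leader_q, a=0.5):
    """Pick an action by moving this agent's Q estimates toward the leader's.

    q_values : this agent's Q-value for each available action
    leader_q : the alpha (leader) agent's Q-value for each action
    a        : GWO control parameter (decayed over iterations in full GWO)
    """
    q_values = np.asarray(q_values, dtype=float)
    leader_q = np.asarray(leader_q, dtype=float)
    A = a * (2.0 * rng.random(q_values.shape) - 1.0)  # A = a(2 r1 - 1)
    C = 2.0 * rng.random(q_values.shape)              # C = 2 r2
    D = np.abs(C * leader_q - q_values)               # distance to the leader
    guided = leader_q - A * D                         # step toward the alpha
    return int(np.argmax(guided))                     # best guided action
```

The key point is the coordination mechanism: instead of each agent greedily maximizing its own Q-values, the leader's estimates bias every follower's choice, which is how GWO shares the leader's experience across the group.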
After the individual Q_a values are calculated, the combined value function Q_tot is calculated. Then, a random batch of b transitions (τ, a_j, r_j, τ′) is sampled from M, and the value of y_j^tot is set according to τ′. If τ′ is a terminal state, then y_j^tot = r_j; otherwise, y_j^tot = r_j + γ max_{u′} Q_tot(τ′, u′, s′; θ⁻).
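The target construction just described can be sketched as follows (an illustrative fragment; the transition-tuple layout and the precomputed target-network maxima are assumptions):

```python
def td_targets(batch, q_tot_target_max, gamma=0.99):
    """Build y_j^tot for a sampled batch of transitions.

    batch            : list of (tau, a, r, tau_next, terminal) tuples
    q_tot_target_max : max_{u'} Q_tot(tau', u', s'; theta^-) per transition,
                       assumed precomputed with the target network
    """
    targets = []
    for (_, _, r, _, terminal), q_next in zip(batch, q_tot_target_max):
        # Terminal states bootstrap nothing: y = r
        targets.append(r if terminal else r + gamma * q_next)
    return targets
```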

Algorithm 3 Attention-based hybrid policy for deep MARL (QSOD)
1:  Require: agents, reward R, replay memory M, Q_tot, α, γ,
2:           target reward R_max, episodes, number of steps per episode T
3:  Ensure: R_th = αR_max/2 and ∂Q_tot/∂Q_a ≥ 0
4:  while episode = 1 to n do
5:    Initialize s_1 = {x_1} and preprocessed sequence τ_1 = τ(s_1)
6:    if R_{t−1} < R_th or R_{t−2} = R_{t−1} then          ⊳ greedy policy
7:      for t = 1 to T do
8:        for agent = 1 to n do
9:          if probability < ε then
10:           Select a random action a_t
11:         else
12:           Select a_t = max_a Q*(τ(s_t), a; θ)
13:         end if
14:         Compute Q_a according to a_t
15:       end for
16:       Set s_{t+1} = s_t, x_{t+1} = x_t, and preprocess τ_{t+1} = τ(s_{t+1})
17:       Compute Q_tot through the mixer network
18:       Sample a random minibatch of b transitions (τ, a_j, r_j, τ′) from M
19:       Set y_j^tot = r_j if τ′ is terminal; otherwise y_j^tot = r_j + γ max_{u′} Q_tot(τ′, u′, s′; θ⁻)
20:       Perform a gradient descent step on the loss in Equation (9)
21:       Update the target network parameters θ⁻ at fixed intervals
22:     end for
23:   else                                                 ⊳ optimization (GWO) policy
24:     for t = 1 to T do
25:       Compute the fitness of all agents and select the alpha (leader) agent
26:       for agent = 1 to n do
27:         Select a_t from the available actions through GWO
28:         Compute Q_a according to a_t
29:       end for
30:       Perform the steps of Lines 16–21
31:     end for
32:   end if
33: end while

Algorithm 3-A Summary of the proposed QSOD
1: while episode = 1 to n do
2:   if R_{t−1} < R_th or R_{t−2} = R_{t−1} then
3:     For each agent, select an action using the greedy policy
4:   else
5:     For each agent, select an action using the optimization policy
6:   end if
7:   Perform the remaining steps of MARL
8: end while

Through y_j^tot, we aim to minimize the squared TD error, as shown in Equation (9).
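The policy-activation test that decides between the greedy and optimization branches can be sketched in a few lines, following the textual description above (the function name and signature are our assumptions; the threshold form R_th = αR_max/2 is read from Algorithm 3):

```python
def use_greedy_policy(r_prev, r_prev2, r_max, alpha=0.5):
    """Decide which action-selection policy to run this episode.

    Greedy is used when the previous episode's reward falls below the
    threshold R_th = alpha * R_max / 2, or when the last two episodes
    returned identical rewards; otherwise the GWO policy is used.
    """
    r_th = alpha * r_max / 2.0
    return r_prev < r_th or r_prev == r_prev2
```

The second clause guards against stagnation: two identical consecutive rewards suggest the optimization policy has stopped improving, so the scheme falls back to greedy exploration.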
Algorithm 3-A summarizes the proposed QSOD. Because Algorithm 3 contains too many steps to be concise, Algorithm 3-A retains only the most important ones: Algorithm 3 provides a detailed view of QSOD, whereas Algorithm 3-A gives a brief overview of the proposed methodology.

V. PERFORMANCE EVALUATION
A. STARCRAFT II
StarCraft II is the sequel to the original StarCraft; both are registered trademarks of Blizzard, and both are real-time strategy (RTS) games. Over the last six to seven years, RTS games have become popular in the DRL field, with many researchers testing their work on them. StarCraft in particular provides a powerful platform for addressing competitive and collaborative multiagent problems: it offers many complicated micro-action sets, through which complex collaboration among cooperative agents can be learned. The authors of [2], [4], and [16] performed their experiments using StarCraft. In the present study, we used the StarCraft II Learning Environment (SC2LE) [11], as in QMIX and QVMix. SC2LE is based on the second edition of StarCraft; it provides many different scenarios and has better developer support than the original StarCraft. Figure 5 shows the StarCraft II environment.
Like QMIX, we chose the decentralized micromanagement problem in StarCraft II. In the fighting scenarios, two groups of agents are present on the map: allied agents, controlled by the proposed method, and enemy agents, controlled by the game's built-in AI. The initial positions of both groups' agents change with each episode. All other settings are similar to those of QMIX [18].

B. EXPERIMENTAL RESULTS
We used the SC2LE and StarCraft Multi-agent Challenge (SMAC) environments to evaluate the performance of our proposed method, with the game's difficulty level set to medium. We used different scenarios to compare our algorithm with the state-of-the-art algorithms QVMix and QMIX: 3m, 8m, 2s_3z, MMM, and 1c_3s_z. The letters c, m, s, and z are described in Table 4. The global state is a vector of agent features containing health, shield, cool-down, and last-action information. Each agent's action space contained the following actions: Move (in the East, West, South, or North direction), Attack (only performed if an enemy was within range), Stop (performed when the episode ended), and Noop (performed if the next-state reward was unrealistic). After every time step, the agents received a combined reward, calculated from the total damage dealt to the enemies, similar to [18].
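The availability constraints on the action space described above can be sketched as a simple mask (an illustrative encoding; the names are ours, not SMAC's API):

```python
def available_actions(enemy_in_range, episode_ended):
    """Return the actions currently available to an agent.

    Actions: four Move directions, Attack (only with an enemy in range),
    Stop (at episode end), and Noop.
    """
    actions = {
        "move_east": True, "move_west": True,
        "move_south": True, "move_north": True,
        "attack": enemy_in_range,   # only if an enemy is within range
        "stop": episode_ended,      # only when the episode has ended
        "noop": True,
    }
    return [name for name, ok in actions.items() if ok]
```

Both the greedy and GWO policies would select only from this masked list, which is what "selected from the available actions list" means in Algorithm 3.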
To compare the performance of QVMix, QMIX, and the proposed algorithm, we adopted fewer episodes than [13] and QVMix. For each simulation, we suspended training after every 100 episodes and ran 20 independent evaluation episodes, with each agent performing greedy decentralized action selection (for QMIX and QVMix) or hybrid optimization action selection (for the proposed algorithm). We ran a total of 10,000 training episodes and used a replay buffer of size 500. The target network was updated after every 200 episodes. For QMIX and our proposed algorithm, we used a system with ordinary computational power (GPU: GTX 1050; CPU: i7-8750H; RAM: 16 GB); for QVMix, we used a system with higher computational power (GPU: RTX 2080; CPU: i7-8700K; RAM: 32 GB).
To highlight the significance of our algorithm, we calculated two types of results: win rate and average loss. The win rate was the percentage of episodes in which the agents killed all enemies within a given time, and the average loss was the loss averaged over every 100 episodes. All results were averaged over five runs of each algorithm. Figure 6 shows the rolling average win rates of QSOD, QVMix, and QMIX for five different scenarios (8 marines; 3 marines; MMM; 2 stalkers with 3 zealots; and 1 stalker with 2 colossi and 3 zealots). Initially, almost all algorithms performed similarly; over time, however, the proposed QSOD algorithm outperformed QVMix and QMIX, particularly after 4,000 episodes. Moreover, in scenarios with high-power agents, such as MMM, the proposed algorithm showed better performance than QVMix and QMIX. In QVMix and QMIX, action selection is conducted through a greedy policy for each agent, so these algorithms required more training episodes in these challenging scenarios.

VOLUME 9, 2021
FIGURE 6. Win rates during training for the proposed algorithm, QVMix, and QMIX across five different scenarios.

In the proposed algorithm, by contrast, the hybrid optimization policy required far fewer training episodes. Table 5 shows the maximum win rate of all three algorithms across the five scenarios after 10,000 episodes. In all scenarios, our proposed algorithm performed better than the state-of-the-art algorithms QMIX and QVMix. In particular, in the MMM scenario, our algorithm achieved at least a 12% higher win rate than both of the other techniques; similarly, in 8m and 3m, it achieved a 5% higher win rate than QMIX and QVMix.
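The rolling average win rate plotted in Figure 6 can be computed as follows (a minimal sketch; the window length is an assumption):

```python
def rolling_win_rate(wins, window=100):
    """Rolling average win rate (%) over evaluation outcomes.

    wins   : per-evaluation outcomes, 1 for a win and 0 otherwise
    window : rolling-window length in episodes (100 here is an assumption)
    """
    rates = []
    for i in range(len(wins)):
        lo = max(0, i - window + 1)       # window is truncated at the start
        chunk = wins[lo:i + 1]
        rates.append(100.0 * sum(chunk) / len(chunk))
    return rates
```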
Furthermore, in the 1-colossus, 3-stalkers, and 5-zealots scenario, QSOD performed better than both QVMix and QMIX throughout almost all of the training episodes.
The average training loss is shown in Figure 7. Across all scenarios, the proposed QSOD had the lowest training loss compared to QMIX and QVMix, most notably in scenarios with high-level agents. This is because the greedy policy wastes time by having multiple agents search the environment repeatedly, whereas in the proposed optimization policy, once the leader has searched a point in the environment, it shares its experience with the other agents. As a result, fewer training episodes were required and the lowest training loss was achieved.

Figure 8 illustrates the most important result: the convergence graph. In RL, convergence plays a crucial role in validating the significance of any algorithm. Figure 8 shows the results of the 3-marines scenario for QMIX and the proposed algorithm. Convergence occurred after several episodes in both cases; however, the proposed policy began to converge approximately 2,500 episodes earlier than the default QMIX. This was due to the hybrid optimization policy and the conditions for activating either policy. These conditions also played a vital role in controlling abrupt changes in the results, as shown in Figure 7, and the proposed scheme's curve is smoother than that of the default QMIX. In Figure 8, we compare our proposed algorithm with QMIX only, to demonstrate the advantage of our proposed policy over the default greedy policy when the same network is used. Table 6 compares the time efficiency of the proposed scheme with both state-of-the-art algorithms (QMIX and QVMix) for each run. In all cases, the proposed policy required less time than the default QMIX. In particular, in the difficult scenarios, the proposed QSOD saved more than 2,000 s over 10,000 episodes; in the 3m scenario, it saved 1,700 s over 50,000 episodes.
Therefore, as either the number of episodes or the number of agents increased, the proposed algorithm increasingly outperformed the default QMIX policy. After finding the optimal path, the optimization policy required fewer steps to win a game, which saved time in each episode and explains this result.

VI. CONCLUSION
This paper proposed a novel hybrid optimization policy (QSOD) for the Q-value selection of individual agents in MARL. The individual agents' Q-values were selected according to GWO. Through the proposed method, proper attention was paid among the agents, which helped them learn quickly and accurately. Moreover, we used a network architecture similar to that of QMIX, which requires less computational power than other state-of-the-art algorithms such as QVMix and QPLEX. The experimental results on SC2LE demonstrate that our proposed algorithm consistently outperformed QMIX and QVMix in all scenarios. Furthermore, our algorithm required less time per episode than the comparison algorithms.
Our results show the utility of the proposed algorithm. One limitation, however, is that GWO cannot handle a very large number of agents; in particular, when the number of agents reaches one million or more, a single leader agent cannot control the entire group. We will address this issue in future research.
Moreover, we plan to apply our proposed algorithm to mobile platforms such as firefighting robots and spy drones to improve their efficiency. For this purpose, we will reduce the size of the agent network and try to use a centralized system as the mixing network.