Unmanned Aerial Vehicle Swarm Cooperative Decision-Making for SEAD Mission: A Hierarchical Multiagent Reinforcement Learning Approach

Unmanned aerial vehicle (UAV) swarm cooperative decision-making has attracted increasing attention because of its low-cost, reusable, and distributed characteristics. However, existing non-learning-based methods rely on small-scale, known scenarios and cannot solve complex multi-agent cooperation problems in large-scale, uncertain scenarios. This paper proposes a hierarchical multi-agent reinforcement learning (HMARL) method for the heterogeneous UAV swarm cooperative decision-making problem in the typical suppression of enemy air defense (SEAD) mission, which is decoupled into two sub-problems: the higher-level target allocation (TA) sub-problem and the lower-level cooperative attacking (CA) sub-problem. An HMARL agent model is established, consisting of a multi-agent deep Q network (MADQN) based TA agent and multiple independent asynchronous proximal policy optimization (IAPPO) based CA agents. The MADQN-TA agent dynamically adjusts the TA scheme according to the relative positions of the UAVs and targets. To encourage exploration and promote learning efficiency, the Metropolis criterion and an inter-agent information exchange technique are introduced. Each IAPPO-CA agent adopts an independent learning paradigm, which scales easily with the number of agents. Comparative simulation results validate the effectiveness, robustness, and scalability of the proposed method.

(attacking and jamming) UAVs, referred to as fighters and jammers, which cooperate to make dynamic and coupled decisions on target allocation (TA), route planning, jamming, and attack. Thus, UAV swarm cooperative decision-making for the air-to-surface [21] SEAD mission is naturally suited to be solved with MARL. Moreover, hierarchical reinforcement learning (HRL) gives agents hierarchical thinking and decision-making capabilities similar to those of humans and can handle large numbers of agents and sparse-reward problems. Therefore, DRL shows a bright prospect for large-scale UAV swarm cooperative decision-making in uncertain scenarios. This is the primary motivation of the present study.

In this paper, we build a hierarchical multi-agent reinforcement learning (HMARL) framework for UAV swarm cooperative decision-making in the SEAD mission, which involves multiple tasks, e.g., coordinated flight, jamming, and strike tasks. The hierarchical framework includes a higher-level target allocation (TA) agent and lower-level cooperative attacking (CA) agents: the CA agents are trained to attack the selected targets, and the TA agent is optimized together with the CA agents to determine the optimal target allocation. The lower-level CA agents are trained with an independent learning (IL) paradigm, and each agent uses independent asynchronous proximal policy optimization (IAPPO) to control a fighter and a jammer to execute the jamming-attack kill chain, which can be deployed online in a distributed manner and flexibly scaled to large-scale scenarios. The higher-level TA agent employs the learning-based multi-agent deep Q network (MADQN) algorithm, which takes the relative distance between the UAVs and the targets into account. Our method's effectiveness, robustness, and scalability are tested and compared in simulation experiments. Figure 1 shows the offline-training, online-inference research framework of this paper.

The major contributions of this paper are:

(1) A HMARL framework for UAV swarm cooperative decision-making is proposed. The higher-level MADQN-TA agent determines the optimal target allocation, and the lower-level IAPPO-CA agents attack the selected targets.

(2) The Metropolis criterion and an inter-agent information exchange technique are employed to improve sample efficiency and learning speed. The Metropolis criterion increases the action exploration probability and prevents the policy from falling into local minima. The inter-agent information exchange technique reduces the selection of invalid actions in the multi-agent system.

(3) A lightweight UAV swarm cooperative decision-making environment is built for developing DRL-based UAV swarm air-to-surface combat tactics for the SEAD mission.

The remainder of this paper is organized as follows. Section II describes the SEAD problem and scenario. Section III summarizes several standard RL algorithms used in this paper. In Section IV, a cooperative decision-making model based on the HMARL method is established for a UAV swarm in the SEAD mission. Several experiments are carried out in Section V to verify the method's effectiveness, robustness, and scalability. Section VI concludes the paper.

FIGURE 1. Research framework in this paper.

A single aircraft fighting against an integrated air defense system (IADS) is vulnerable in the high-threat SEAD mission. Therefore, an effective tactic is to use a jammer to suppress the enemy's IADS radar and a fighter to attack in the resulting blind areas. This paper focuses on the cooperative tactics of a heterogeneous UAV swarm.

Multiple IADS batteries exist in the real-world scenario, as shown in Figure 2, and thus cooperative decision-making is required. The combat procedure is divided into two phases: 1) target allocation, which determines the attacking sequence and the selected target list; and 2) cooperative attack, which determines the route, jamming, and firing decisions. In particular, the jammer must approach the IADS and reduce its radar range. The fighter can destroy the IADS and survive only when the radar range of the IADS is reduced [22]. Therefore, we decouple the UAV swarm cooperative decision-making problem into these two sub-problems.

RL is a machine learning approach in which an agent explores an environment by trial and error to learn an optimal policy. RL is usually modeled as a finite MDP, defined by the tuple ⟨S, A, R, P, γ⟩, where S is a set of states, A is a set of actions, R is a reward function, P is a transition probability function, and γ is a discount factor. Assuming that at time step t the agent takes action a ∈ A at state s ∈ S according to policy π : S → A, the environment feeds back an instant reward r ∈ R, and the agent transitions to a new state s′ ∈ S until a terminal state is reached. Tabular reinforcement learning evaluates the state-action value through a discrete Q table, whereas deep RL approximates values or policies with a neural network. The goal is to find the policy π_θ with the maximum expected return, where R(τ) = Σ_{t=0}^{∞} γ^t r_t refers to the discounted cumulative reward on the trajectory τ = (s_0, a_0, s_1, a_1, ⋯). The optimal policy is

$$\pi^{*} = \arg\max_{\theta} \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau)\right],$$

whose policy gradient can be written as

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\, R(\tau)\right],$$

where π_θ(a|s) acts as an actor and R(τ) as a critic; R(τ) can also take other forms, such as the state-action value function Q^π(s, a), the advantage function A^π(s, a), or the temporal difference (TD) residual G_t = r_t + V^π(s_{t+1}) − V^π(s_t).

PPO introduces a clipped surrogate loss function to control the policy update step size and uses importance sampling to improve sample efficiency. The algorithm balances sample efficiency, algorithm performance, and engineering implementation complexity. The optimization objective of PPO, i.e., the surrogate loss function, is simplified to

$$L^{CLIP}(\theta) = \mathbb{E}\!\left[\min\!\big(\rho(\theta)\hat{A},\ \mathrm{clip}(\rho(\theta),\, 1-\varepsilon,\, 1+\varepsilon)\,\hat{A}\big)\right],$$

where ρ(θ) = π_θ(a|s)/π_{θ_old}(a|s) is the ratio of the new and old policies, and ε is a hyper-parameter. The truncation operation, i.e., the CLIP operator, limits the policy update amplitude to ensure training stability. Generalized advantage estimation (GAE) [27] is used to calculate the advantage A^{π_θ}(s, a) while keeping the variance and bias of the value-function estimate small, as shown in (5).
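For concreteness, a minimal PyTorch sketch of the clipped surrogate objective is given below; the function name and the way log-probabilities and advantages are supplied are illustrative assumptions rather than the implementation used in this paper.

```python
import torch

def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Clipped surrogate policy loss of PPO (negated for gradient descent).

    new_logp:   log pi_theta(a|s) under the current policy
    old_logp:   log pi_theta_old(a|s) recorded at sampling time
    advantages: advantage estimates A(s, a), e.g., from GAE
    """
    # Importance sampling ratio rho(theta) = pi_theta / pi_theta_old.
    ratio = torch.exp(new_logp - old_logp)
    # Unclipped and clipped surrogate terms.
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the minimum of the two; return its negative as a loss.
    return -torch.min(surr1, surr2).mean()
```

The clipping keeps the update small whenever the ratio between new and old policies drifts outside [1 − ε, 1 + ε], which is the stability mechanism described above.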
Asynchronous proximal policy optimization (APPO) is an asynchronous variant of the PPO algorithm [28]. APPO uses a surrogate policy loss with a clip operation. Compared with synchronous PPO, APPO is usually more efficient because of its asynchronous sampling. Figure 3 shows the architecture of APPO. In each round of updates, the algorithm runs L actors in parallel, each actor runs T steps, giving a total of LT steps of data, and the advantages A_1, ..., A_T are calculated. The policy parameters are updated after sampling is completed, where the cumulative reward in the loss function is J(π_θ). We randomly sample M transitions in each round of updates, where M ≤ LT, and learn K times to improve sample efficiency. An asynchronous training mode is used for efficient training, whereby sampling and learning do not need to wait for each other.
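The asynchronous idea can be sketched as follows, assuming a simple queue between actors and a learner; this is only a conceptual illustration of overlapped sampling and learning, not the APPO implementation of [28].

```python
import queue
import random
import threading
import time

rollout_queue = queue.Queue()

def actor(actor_id, T=32, episodes=20):
    # Each of the L actors repeatedly collects a T-step rollout and pushes it
    # to the learner's queue without waiting for the other actors.
    for _ in range(episodes):
        rollout = [("state", "action", random.random()) for _ in range(T)]
        rollout_queue.put(rollout)
        time.sleep(0.001)  # stand-in for environment interaction

def learner(M=64, K=4, iterations=20):
    # The learner drains rollouts as they arrive; once at least M transitions
    # are buffered it samples a batch and performs K update epochs, so
    # sampling and learning overlap instead of waiting for each other.
    buffer = []
    for _ in range(iterations):
        buffer.extend(rollout_queue.get())
        if len(buffer) < M:
            continue
        batch = random.sample(buffer, M)
        for _ in range(K):
            pass  # a PPO-style clipped policy update on `batch` would go here
        buffer.clear()

actors = [threading.Thread(target=actor, args=(i,)) for i in range(4)]  # L = 4
for t in actors:
    t.start()
learner()
for t in actors:
    t.join()
```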

HRL is another important technique that can be used to solve complex decision-making problems. Its ''divide and conquer'' idea decomposes the original problem into several sub-problems; the simpler sub-problems are solved one by one, and their solutions are then integrated to obtain the solution of the original problem. Temporal abstraction of the state sequence is used to treat the problem as a semi-Markov decision process (SMDP) [29]. Basically, the idea is to define macro-actions, composed of primitive actions, which allow for modeling the agent at a higher level of temporal abstraction. The main advantages of using HRL are scalability and generalization ability. Scalability comes from decomposing large problems into smaller ones, avoiding the curse of dimensionality.

Generalization ability is acquired from the combination of smaller sub-tasks, which allows new skills to be generated and avoids super-specialization [16]. We consider the scalability of large-scale UAV swarm cooperative decision-making and the particularity of the UAV swarm, which allocates targets first and then strikes independently, without experience or information sharing between the two phases. Therefore, a hierarchical IL framework is chosen to solve the multi-agent cooperation problem.
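As a sketch of this hierarchical decomposition, a high-level policy can choose a discrete target while a low-level policy produces continuous controls toward it; the class and function names below are placeholders, not the interfaces defined later in this paper.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    target_id: int    # chosen by the higher-level (TA) policy
    heading: float    # produced by the lower-level (CA) policy
    speed: float

def hierarchical_step(ta_policy, ca_policy, formation_obs, target_coords):
    """One decision step of a two-level hierarchy.

    ta_policy: maps formation/target observations to a discrete target index
    ca_policy: maps (formation observation, chosen target coordinate) to controls
    """
    # Higher level: discrete target allocation (a macro-action).
    target_id = ta_policy(formation_obs, target_coords)
    # Lower level: continuous control toward the allocated target.
    heading, speed = ca_policy(formation_obs, target_coords[target_id])
    return Decision(target_id, heading, speed)
```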

In this section, we build an end-to-end HMARL method for UAV swarm cooperative decision-making in the SEAD mission.

In a practical SEAD mission, a UAV cannot be operated at a high angle of attack or at high angular rates because of weapon and sensor limitations. Therefore, we assume that the altitude of the UAV remains constant, and a four-degree-of-freedom (4-DOF) kinematics model is used, where ẋ_f, ẏ_f, v_f, and ϕ_f represent the differentiation of the X and Y coordinates, the speed, and the heading of the fighter, respectively, and ẋ_j, ẏ_j, v_j, and ϕ_j represent the corresponding quantities of the jammer. In addition, state input and action output constraints are modeled as lower and upper bounds.

Therefore, in this MDP, the state is defined as the coordinate vector and the action as the speed and heading of the fighter and jammer. The state space is defined by the coordinate vectors (x_f, y_f), (x_j, y_j), and (x_s, y_s) of the fighter, jammer, and IADS, which take continuous values. In the actual battlefield environment, the IADS coordinates can be transmitted to the UAV by an airborne warning and control system (AWACS) or space-based satellites.
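The kinematic equations themselves are not reproduced above; the sketch below uses a common constant-altitude planar model (ẋ = v cos ϕ, ẏ = v sin ϕ) that is consistent with the state and action definitions, with the bounds and time step being illustrative assumptions rather than the exact model of the paper.

```python
import math

def step_kinematics(x, y, v_cmd, phi_cmd, dt=1.0,
                    v_bounds=(0.0, 1.0), phi_bounds=(-math.pi, math.pi)):
    """Advance a constant-altitude planar UAV model by one time step.

    v_cmd, phi_cmd: commanded speed and heading (the CA agent's actions)
    v_bounds, phi_bounds: lower/upper bounds modeling the action constraints
    """
    # Clip commands to the modeled lower and upper bounds.
    v = min(max(v_cmd, v_bounds[0]), v_bounds[1])
    phi = min(max(phi_cmd, phi_bounds[0]), phi_bounds[1])
    # Planar kinematics: the UAV moves along its heading at the commanded speed.
    x_next = x + v * math.cos(phi) * dt
    y_next = y + v * math.sin(phi) * dt
    return x_next, y_next
```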
The action space is defined by the fighter's and jammer's headings ϕ_f, ϕ_j and speeds v_f, v_j. By controlling the heading, a UAV changes its direction of movement in the 2-D environment, and by controlling the speed, it can coordinate in time to reach the desired position. The jammer adopts a simple jamming model for simplicity of analysis: a jamming distance condition is set, and when the IADS enters the jamming range, we assume that the jammer automatically turns on jamming and the IADS radar detection range is degraded.

c: REWARD FUNCTION

In this paper, a non-sparse reward is designed to guide the UAVs to accomplish the SEAD mission. Four types of reward are designed. If the fighter or jammer reaches its required range (the fire point or the jamming point, respectively), it is rewarded with a success reward r_1 = 1.

If the jammer or fighter enters the IADS range or flies out of the environment boundary, it will get a reward r_2 = −1.

If the jammer and fighter collide, they will also get a reward r_2 = −1. In all other circumstances, reward shaping is adopted, as listed in Table 1.
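A minimal sketch of how these terminal reward terms might be combined is given below; the shaping term stands in for the entries of Table 1, and all predicates are assumed inputs from the environment rather than the paper's exact code.

```python
def ca_reward(reached_goal, entered_iads_range, out_of_bounds, collided,
              shaping_term=0.0):
    """Per-step reward of a cooperative attacking (CA) agent.

    reached_goal:       fighter/jammer reached its required range
    entered_iads_range: UAV flew inside the (non-suppressed) IADS range
    out_of_bounds:      UAV left the environment boundary
    collided:           fighter and jammer collided
    shaping_term:       dense shaping reward (placeholder for Table 1)
    """
    if reached_goal:
        return 1.0       # success reward r1
    if entered_iads_range or out_of_bounds or collided:
        return -1.0      # failure reward r2
    return shaping_term  # reward shaping in all other circumstances
```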

The final reward R_CA is defined as the sum of the four types of reward.

Conventional target allocation methods typically establish a static optimization model based on intelligent optimization algorithms. In this paper, we adopt a learning-based method to establish a target allocation agent model. Through dynamic interaction with the environment and trial and error, the agent can find an optimal or sub-optimal target allocation scheme and quickly allocate targets online with a low amount of computation.

Target allocation is a continuous-state, discrete-action decision-making problem. As the number of targets increases, the number of target allocation schemes grows exponentially, and the action space dimension of a single agent explodes. MARL is therefore used to avoid this ''curse of dimensionality''. The target allocation problem is modeled as a cooperative MARL model, in which each formation can be seen as a DQN agent, which is off-policy and sample efficient. Because there is communication among the agents, the resulting algorithm is a MADQN algorithm.

The target allocation MDP is therefore modeled as follows. The coordinates of the fighter, the jammer, and the IADS are (x_f, y_f), (x_j, y_j), and (x_s, y_s), respectively. If an IADS is chosen, its coordinates become part of the input vector of the DQN agent. Therefore, the state space is defined as the concatenated coordinate vector of each formation's fighter, jammer, and IADS in the 2-D environment, as listed in Table 1. Combining ε-greedy exploration with the Metropolis criterion, the action is selected as

$$a_t = \begin{cases} a_{\varepsilon\text{-greedy}}, & r_t \ge \max(r) \\ a_{\varepsilon\text{-greedy}}, & r_t < \max(r),\ \eta \ge \eta_0 \\ a_{\text{random}}, & r_t < \max(r),\ \eta < \eta_0 \end{cases} \quad (10)$$

Suppose the reward for a DQN agent's action is less than the historical maximum reward. In that case, a random action is selected with a certain probability according to the Metropolis criterion in (10), which increases exploration and prevents the policy from falling into local optima. The training procedure of the MADQN-TA agent is summarized as follows.

Initialize replay buffer D with capacity K for each formation.
Initialize the action-value function Q with random weights θ.
Initialize the target action-value function Q̂ with weights θ⁻ = θ.
For episode = 1 : M do
    Initialize state s_0 for each agent.
    For t = 1 : T do
        For n = 1 : N do
            Observe state s_t^n and select action a_t^n for agent n using ε-greedy.
            Execute action a_t^n with the converged IAPPO policy and observe the next state s_{t+1}^n, reward r_t^n, and done signal d_t^n indicating whether s_{t+1}^n is terminal.
            Perform action exploration according to (10) and carry out the state transition.
            Store the transition (s_t^n, a_t^n, r_t^n, s_{t+1}^n, d_t^n) in D.
            Round-robin the order of agents in the inter-agent information exchange.
        End for
        Calculate the team reward r_t according to (9).
        Randomly sample a batch of transitions from D.
        Calculate the target Q value y_t = r_t + γ max_{a′} Q̂(s_{t+1}, a′; θ⁻) for each formation.
        Update the Q function by a gradient descent step on θ that minimizes [y_t − Q(s_t, a_t; θ)]² for each formation.
        Every C steps, reset Q̂ = Q for each formation.
    End for
End for
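A minimal sketch of the Metropolis-criterion exploration in (10) is given below; how η is drawn and the threshold η_0 are assumptions for illustration, since their exact definitions are not reproduced here.

```python
import random

def select_action(q_values, best_reward_so_far, last_reward,
                  epsilon=0.1, eta_0=0.5):
    """Action exploration combining epsilon-greedy with the Metropolis criterion.

    If the last reward does not improve on the historical maximum, a random
    action is taken with a probability controlled by eta_0, which increases
    exploration and helps the policy escape local optima.
    """
    def epsilon_greedy():
        if random.random() < epsilon:
            return random.randrange(len(q_values))
        return max(range(len(q_values)), key=lambda a: q_values[a])

    if last_reward >= best_reward_so_far:
        return epsilon_greedy()
    eta = random.random()  # assumption: eta drawn uniformly in [0, 1)
    if eta >= eta_0:
        return epsilon_greedy()
    return random.randrange(len(q_values))  # pure random exploration
```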
The higher level is the target allocation agent, which is implemented with the MADQN algorithm. The MADQN-TA agent considers the situations of both parties and outputs an optimal or suboptimal target list. The lower level consists of N cooperative attacking agents, which are implemented with the IAPPO algorithm. Each IAPPO-CA agent completes the attack process on the assigned IADS ID, and thereby the mission is accomplished. The higher- and lower-level agents are cascaded to obtain the whole HMARL agent. The network structure of the end-to-end HMARL agent is shown in Figure 5.

The input of the DQN consists of the coordinates of each group's fighter, jammer, and IADS; it is normalized by Z-score and passed through a two-layer fully connected neural network, which outputs the Q value of each allocation target ID. The ID with the maximum Q value is the optimal target, and its coordinates are input to the corresponding lower-level IAPPO agent. The lower-level IAPPO agent includes a critic network and an actor/policy network. The IAPPO policy network takes the coordinates of the fighter, the jammer, and the IADS allocated by the higher-level MADQN-TA agent, passes them through a two-layer fully connected neural network, and outputs the heading and speed of the fighter and jammer.

The advantage function is normalized to improve training stability and policy learning efficiency, as in (11). The value function loss is normalized as in (12). The experiment adopts learning rate η annealing, as in (13).

In the early stage of training, a larger learning rate is adopted to accelerate learning, and a lower learning rate is adopted later to prevent the policy from prematurely converging to a bad local optimum.
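Equation (13) is not reproduced above; the sketch below shows one common linear annealing schedule that matches this description, with the initial and final rates being assumed values.

```python
def annealed_learning_rate(step, total_steps, lr_initial=3e-4, lr_final=3e-5):
    """Linearly anneal the learning rate from lr_initial to lr_final.

    A larger rate early in training accelerates learning; a smaller rate
    later helps avoid premature convergence to a bad local optimum.
    """
    frac = min(step / total_steps, 1.0)
    return lr_initial + frac * (lr_final - lr_initial)
```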
The principle of the adaptive clip is the same as that of the adaptive learning rate. A larger clip value is used in the early stage of training to speed up policy learning, and a smaller value in the later stage ensures policy stability, as in (14).

To improve the robustness of the agent policy and adapt to a diversified environment, random perturbations are added to the scenario during training [41], as shown in (15),
where x and y are the coordinates of the fighters, jammers, and IADSs, respectively. We run a different randomly perturbed environment on each random seed to train the agent, enabling it to abstract higher-level, more complex policy features and avoid over-fitting to one specific environment or policy. Consequently, the learned policy is more robust and generalizes better to new environments.
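Equation (15) is likewise not reproduced; as an illustration of this kind of scenario perturbation, the sketch below adds uniform noise to the initial coordinates, where the perturbation magnitude and the uniform distribution are assumptions.

```python
import random

def perturb_scenario(coords, max_offset=2.0, seed=None):
    """Randomly perturb the initial (x, y) coordinates of fighters, jammers, and IADSs.

    coords:     list of (x, y) tuples describing one scenario
    max_offset: maximum absolute perturbation per coordinate (assumed value)
    seed:       per-training-run random seed, so each seed sees a different environment
    """
    rng = random.Random(seed)
    return [(x + rng.uniform(-max_offset, max_offset),
             y + rng.uniform(-max_offset, max_offset)) for x, y in coords]
```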

Greater entropy means more randomness and encourages exploration. Therefore, while maximizing the cumulative reward, the entropy of the policy is also maximized, making the policy as random as possible. The agent can then adequately explore the state space to complete the mission, which enhances robustness and generalization. The entropy is calculated as

$$H\big(\pi(\cdot|s)\big) = -\sum_{a} \pi(a|s) \log \pi(a|s).$$

The fire range of the fighter is 15 km, and the jamming range of the jammer is 25 km. The detection range of the IADS radar is 20 km, its attack distance is 15 km, and its detection range is reduced to 10 km under jamming from a jammer [22], as shown in Figure 6. To complete the mission, the fighter needs to fly to the fire point and the jammer to the jamming point. The above distances are normalized during training, which facilitates neural network training and prevents gradient vanishing.
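Using the ranges above, the engagement logic can be sketched as follows; the predicate structure is an assumption based on the description (jamming degrades the radar range, and the fighter attacks and survives only from outside the current detection range), not the exact environment code.

```python
FIGHTER_FIRE_RANGE_KM = 15.0
JAMMER_RANGE_KM = 25.0
IADS_RADAR_RANGE_KM = 20.0
IADS_RADAR_RANGE_JAMMED_KM = 10.0

def iads_detection_range(jammer_distance_km):
    """IADS radar range, degraded when the jammer is within jamming range."""
    if jammer_distance_km <= JAMMER_RANGE_KM:
        return IADS_RADAR_RANGE_JAMMED_KM
    return IADS_RADAR_RANGE_KM

def fighter_can_fire_safely(fighter_distance_km, jammer_distance_km):
    """The fighter can attack and survive when it is inside its fire range
    but outside the (possibly degraded) IADS detection range."""
    detection = iads_detection_range(jammer_distance_km)
    return detection < fighter_distance_km <= FIGHTER_FIRE_RANGE_KM
```

With the nominal 20 km detection range, the 15 km fire point lies inside the threat zone, so the fighter can only fire safely after the jammer has reduced the detection range to 10 km, which is exactly the cooperative tactic described earlier.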

Simulation experiments are conducted in different scenarios, from static and known to dynamic and uncertain, and from small to large scale. Furthermore, in comparison with classic DRL algorithms, we carry out effectiveness (convergence), robustness, and scalability tests as well as an ablation study to verify the performance of the HMARL method and its intelligent decision-making capability. All experiments are conducted on a computer with a 3.6 GHz Intel i7 CPU, 32 GB of DDR4 RAM, and an RTX 3060 GPU, using PyTorch 1.7.0 and Python 3.6. The hyperparameter settings are listed in Table 3.

In Figure 9(a), the HMARL agent first completes the target allocation; the result is that formations 1-3 attack IADS 1-3, respectively. In Figure 9(b), formation 1 launches a cooperative attack. The jammer of formation 1 has met the jamming conditions and the detection range of IADS 1 has been degraded, whereas formations 2 and 3 have not yet met the jamming conditions. The fighter adopts a specific deceleration tactic of circling around to ensure its safety while waiting for jamming. In Figure 9(c), formation 1 is still attacking, formations 2 and 3 have completed jamming and cooperative attacking, and IADS 2 and 3 have been destroyed. In Figure 9(d), formation 1 has also completed its cooperative attack. The three formations have completed their missions and return to their airports (they are set to return during online inference). Therefore, the HMARL agent can complete the 6V3 cooperative decision-making mission, which reflects strong cooperative formation and intelligent decision-making capability.

The scenario settings are the same as in Subsection 5.A. After training for 500 episodes, the trained model is used for online inference to test the scalability. Similarly, we train the IAPPO agent across three different random seeds and record the cumulative rewards during training.

In Figure 14(a), the formation is attacking cooperatively, and the jammer is seeking a suitable position. In Figure 14(b), after jamming, the IADSs and targets are successfully destroyed by the fighters, and the cooperative SEAD mission is completed. We conclude that the proposed HMARL model can scale to large-scale scenarios.

Finally, the effectiveness of the different training tricks is compared through an ablation study; that is, the performance of the model using all tricks is compared against the contributions of the advantage function normalization, the value function normalization, the layer normalization, the adaptive learning rate, and the adaptive clip annealing. The impact of these tricks on model performance provides an empirical reference for subsequent research. The results are shown in Figure 15. As can be seen from Figure 15, the episode reward of the model using all training tricks is higher, and the advantage func-