Weapon-Target Assignment Strategy in Joint Combat Decision-Making Based on Multi-Head Deep Reinforcement Learning

In response to the modeling difficulty and low search efficiency of traditional weapon-target assignment algorithms, this paper proposes an intelligent weapon-target assignment method based on deep reinforcement learning, which trains an assignment model with strong decision-making capability, RL4WTA. First, a multi-constraint weapon-target assignment optimization model is established that discretizes the dynamic weapon-target assignment problem into a sequence of static weapon-target assignment problems. Next, a planning and solving environment for the weapon-target assignment (WTA) problem is designed, and a Markov decision process (MDP) for WTA tasks is constructed on top of this model, providing a foundation for solving the WTA problem with reinforcement learning algorithms. A reinforcement-learning-based WTA solving model is then proposed: a multi-head Q-value network decouples the complex joint decision space, improving the efficiency of the WTA model, and a masking mechanism infers the valid actions that satisfy the constraint conditions under the current situation, reducing uncertainty during reinforcement learning training. Experimental results show that the proposed model, RL4WTA, can adaptively generate satisfactory solutions in both small-scale and large-scale scenarios. Compared with traditional optimization algorithms, the model is superior in adaptability and computational efficiency, meeting the requirements of optimal decision-making for weapon-target assignment problems.


I. INTRODUCTION
Weapon-target assignment (WTA) refers to the rational allocation of weapon resources and the determination of the optimal target assignment scheme based on tactical objectives, weapon performance, target threats, and other constraints, combined with real-time battlefield situational information, to achieve the best collaborative combat effectiveness. It is a typical combinatorial constrained optimization problem [1]. As one of the core problems of command decision-making in collaborative combat, WTA has attracted a large number of researchers and has important military significance [2].
Research on WTA can be traced back to the 1950s and 1960s, when Manne [3] and Day [4] established models for the WTA problem. The problem can be described as follows: for multiple enemy targets, our side needs to allocate a set of available weapons to maximize the expected benefit of continuous combat, as shown in Figure 1. The engagement between a weapon and a target is modeled as a random event, with each weapon-target pair having a kill probability. WTA problems are typically divided into two forms: static and dynamic [1]. The static weapon-target assignment (SWTA) problem assigns weapons to targets all at once, while the dynamic weapon-target assignment (DWTA) problem can be understood as attacking targets in multiple batches, where the optimal weapon assignment strategy for the next batch depends on the effectiveness of the previous batch. At each stage of the assignment task, the changing battlefield situational information must be considered to provide a basis for subsequent decisions.
WTA is an NP-complete problem with multiple parameters and constraints [5]. As the number of weapons, target types, and quantities increases, the number of possible solutions grows exponentially. Currently, two main categories of algorithms exist for finding the optimal solution: traditional exact algorithms based on mixed-integer linear programming (MILP) and heuristic-based intelligent algorithms, as shown in Figure 2.
However, the above methods are inefficient on large-scale WTA problems. For instance, branch and bound and dynamic programming suffer from the curse of dimensionality, while heuristic algorithms search slowly and are prone to getting stuck in local optima. Decision methods based on reinforcement learning can overcome these issues and have in recent years been widely applied in scenarios such as chess, robot path planning, and autonomous aerial combat decision-making. Reinforcement learning offers online learning and adaptability, making it suitable for highly dynamic environments. In the field of weapon-target assignment, however, applications of reinforcement learning are still relatively rare, and rapid allocation for weapon clusters based on reinforcement learning remains an open problem.
Therefore, this paper takes a new perspective from the machine learning angle and explores modeling and solving the WTA problem with reinforcement learning. A weapon assignment framework called RL4WTA, based on centralized multi-agent Q-learning, is proposed. This framework extends the traditional DQN algorithm to the multi-agent reinforcement learning task of multiple weapons and multiple targets. Unlike traditional computational methods, it does not require optimization tools to simplify the problem. Case studies demonstrate that the method significantly reduces computation time while maintaining accuracy. Furthermore, the experimental results indicate the portability of the model: RL4WTA still provides satisfactory solutions when tested with more targets than its training environment contained.
The remainder of this paper is structured as follows. Section II summarizes recent work on WTA. Section III constructs a WTA model targeted at operational requirements. Section IV models the WTA problem as a Markov decision process (MDP) and designs a reinforcement-learning-based framework for optimizing weapon-target assignment strategies. Section V verifies the speed and optimization performance of the proposed algorithm through simulation experiments, from the perspectives of computational accuracy and time efficiency. Finally, Section VI presents the conclusions of this paper.

II. RELATED WORKS
Traditional exact algorithms can typically solve only small-scale problems and often make simplifying assumptions. The WTA problem is often transformed into a MILP problem [5], [13] (solved, e.g., with the Hungarian algorithm [6], branch and bound [7], [8], or Lagrangian relaxation [9]), and off-the-shelf solvers such as Gurobi are used instead of directly optimizing the original nonlinear objective. Exact methods can guarantee finding the optimal solution, but MILP methods suffer from scalability issues: as the problem size increases, the computation time grows dramatically, making large-scale problems difficult to solve. These limitations restrict the development of exact WTA solutions, making them suitable only for problems with low real-time requirements and simple constraints.
In the process of weapon-target assignment, the commander's decision preferences and domain knowledge about combat styles and weapon-target pairings can be transformed into rules that assist in constructing near-optimal solutions. Rule-based heuristic algorithms exploit these rules during the construction and search of assignment solutions, quickly generating feasible solutions with good performance. Examples include the genetic algorithm (GA) [10], particle swarm optimization (PSO) [11], ant colony optimization (ACO) [12], and the artificial bee colony (ABC) algorithm [13]. The advantage of these algorithms is their timeliness: they can quickly obtain optimized solutions. However, they rely on specific scenarios and domain knowledge [14], have limitations when additional constraints must be considered, and are prone to local optima [2]. Although these methods have improved optimization accuracy to some extent, their convergence speed remains unsatisfactory, especially on high-dimensional problems, leaving significant room for improvement in both accuracy and convergence speed. Additionally, these algorithms require encoding solutions into vectors (e.g., chromosome encoding in genetic algorithms). The length of the encoding depends on the scale of the problem, so once the problem scale changes, a new encoding must be provided, which is inconvenient in practical applications.
In recent years, with the development of artificial intelligence technology, reinforcement learning has broken through the barriers of traditional methods and achieved remarkable breakthroughs in many fields [15]. Combinatorial optimization, which involves making optimal choices of decision variables in a discrete decision space, shares a natural similarity with the ''action selection'' process of reinforcement learning. Moreover, the ''offline training, online decision-making'' characteristic of deep reinforcement learning makes it possible to solve combinatorial optimization problems in real time, so utilizing deep reinforcement learning to solve traditional combinatorial optimization problems is an excellent choice. A series of methods using deep reinforcement learning for combinatorial optimization has emerged, achieving good results on problems such as the TSP [16], VRP [17], and Knapsack [18]. Compared with traditional combinatorial optimization algorithms, DRL-based algorithms offer fast solution speed and strong generalization ability, making them a recent research hotspot. Reference [19] proposes a two-stage optimization algorithm based on Q-learning and genetic algorithms (QL-GA): the solution obtained by Q-learning exploration is used as the initial population of a genetic algorithm (GA), which then finds the optimal solution with a smaller population size. Reference [20] designs a missile-target assignment framework based on DRL, which can adaptively generate satisfactory solutions for both small- and large-scale instances. Reference [21] solves the WTA problem using reinforcement learning with a pointer-network structure. RL-based methods can thus generalize across various settings, surpassing the performance of optimization tools and heuristic-based methods. Specific solution methods and their corresponding WTA problems are summarized in Table 1.

III. THE CONSTRUCTION OF WEAPON TARGET ASSIGNMENT MODELS
During joint firepower strikes, it is neither cost-effective nor conducive to combat to engage every potential target. Therefore, it is necessary to select for engagement the important targets that can cause significant damage to the enemy. The main problem to be solved is how to allocate weapon resources reasonably so as to maximize the damage effect per unit cost.

A. DESCRIPTION OF THE DWTA PROBLEM
DWTA is a multi-stage decision problem; its prominent feature is that, at each stage of the weapon-target assignment process [34], the changing battlefield situation must be considered in real time to adapt to the dynamic solving environment and obtain globally optimal attack strategies. The ''observe-attack-observe'' strategy is commonly used to solve the multi-stage DWTA problem, as shown in Figure 3 [13].
''Observe'' refers to analyzing the battlefield situation to determine the targets to attack and the available weapons. ''Attack'' refers to solving the weapon-target allocation and executing the attack according to the decision. Each ''observe-attack'' cycle is equivalent to an SWTA problem. Therefore, DWTA can be represented as:

DWTA = {SWTA^(1), SWTA^(2), ..., SWTA^(T)}    (1)

Based on the above analysis, assume that at stage t the analysis of the battlefield situation has been completed, determining the number of targets to attack (n) and the number of available weapons (m) according to mission objectives and weapon status. This paper selects the maximization of the weighted damage rate on enemy targets as the optimization objective of the assignment decision. Thus, the optimization model can be represented as:

max J(X) = max{J(X^(1)), J(X^(2)), ..., J(X^(T))}    (2)

From Equations (2) and (3), it is evident that to obtain the global optimal solution of the DWTA problem, the optimal solution must be obtained at each stage. Therefore, solving Equation (3) accurately and efficiently is crucial for handling the DWTA problem. In a sense, DWTA can be considered a repetition of SWTA until the attacking side destroys all targets.
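To make the decomposition in Equations (1) and (2) concrete, the following sketch (illustrative only) stands in for the per-stage solver with a simple greedy rule, where the paper's RL4WTA solver would otherwise be used, and samples engagement outcomes as Bernoulli events with kill probabilities p_ij; all instance data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def solve_swta(p, v, ammo):
    """Stand-in single-stage solver: each weapon with ammunition left fires
    at the surviving target with the highest expected damage v_j * p_ij."""
    x = np.zeros_like(p, dtype=int)
    for i in range(p.shape[0]):
        if ammo[i] > 0 and v.max() > 0:
            j = int(np.argmax(v * p[i]))
            x[i, j] = 1
            ammo[i] -= 1
    return x

# DWTA = (SWTA^(1), ..., SWTA^(T)): repeat the "observe-attack" cycle
p = rng.uniform(0.3, 0.9, size=(4, 3))   # kill probabilities p_ij
v = rng.uniform(1.0, 5.0, size=3)        # target threat values v_j
ammo = np.array([2, 2, 2, 2])            # rounds per weapon platform
for t in range(10):
    if v.max() == 0 or ammo.sum() == 0:  # all targets destroyed or ammo spent
        break
    x = solve_swta(p, v, ammo)                       # "attack" at stage t
    survive = np.prod((1.0 - p) ** x, axis=0)        # per-target survival prob.
    v = np.where(rng.random(3) < survive, v, 0.0)    # "observe" the outcome
```

Each pass through the loop corresponds to one ''observe-attack'' cycle of Figure 3, with destroyed targets dropping out of subsequent stages.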

B. CONSTRUCTION OF MATHEMATICAL MODELS

1) ASSUMPTION DESCRIPTION
In this study, to establish a reasonable mathematical model for WTA, the following assumptions are made.

Assumption 1: The model consists of W weapon platforms, M missiles, and T targets; the numbers on the opposing sides need not be equal (e.g., each weapon platform may carry different types and quantities of missiles).

Assumption 2: Determining the weapon-target damage rate belongs to weapon performance evaluation, which is beyond the scope of this paper; the weapon-target damage rate is treated as known data. The kill probability between a missile (the i-th unit of M) and a target (the j-th unit of T) is denoted p_ij.

Assumption 3: Each weapon platform can use different missiles to attack different targets, and each missile can be used only once.

2) ELEMENTS AND ATTRIBUTES OF WTA
The elements of the WTA problem mainly include two categories: weapons and targets. First, define the attributes of the WTA elements. The set of weapon platforms is denoted as W = {W_1, W_2, ..., W_m}, where m is the number of weapon platforms. The ammunition capacity of the i-th weapon platform is l_i, and the cost of each of its rounds is c_i. Define the ammunition capacity matrix L = [l_1, l_2, ..., l_m] and the ammunition cost matrix C = [c_1, c_2, ..., c_m]. The set of targets is denoted as T = {T_1, T_2, ..., T_n}, where n is the number of targets. The threat level of the j-th target is v_j, and the average single-shot damage probability of a round from the i-th weapon platform against the j-th target is denoted p_ij. Define the threat level matrix V = [v_1, v_2, ..., v_n] and the damage probability matrix P = [p_ij]_{m×n}.

3) CONSTRAINTS AND OBJECTIVE FUNCTION

Let X = [x_ij]_{m×n} be the firepower allocation decision matrix, where the rows and columns of the decision matrix are arranged by the categories and order of the interceptor munitions and targets.
Here x_ij = 1 represents the allocation of the i-th firepower unit to the j-th target, while x_ij = 0 represents no allocation. The allocation matrix is shown in Equation (8):

X = [x_ij]_{m×n}    (8)

When the m weapon platforms jointly attack all targets, the total cost of the employed weapons is denoted C(X), as shown in Equation (9):

C(X) = Σ_{i=1}^{m} Σ_{j=1}^{n} c_i x_ij    (9)

The damage effect per unit cost can then be represented as shown in Equation (10):

E(X) = J(X) / C(X),  with  J(X) = Σ_{j=1}^{n} v_j [1 − Π_{i=1}^{m} (1 − p_ij)^{x_ij}]    (10)

Therefore, in joint firepower strikes, the model that maximizes the damage effect per unit cost can be represented as Equation (11), subject to the constraints of Equation (12):

max E(X)    (11)

s.t.  Σ_{j=1}^{n} x_ij ≤ l_i,  i = 1, 2, ..., m
      Σ_{i=1}^{m} x_ij ≤ S_j,  j = 1, 2, ..., n
      S_j ≤ Σ_{i=1}^{m} l_i,  j = 1, 2, ..., n    (12)

Constraint (1): the weapon W_i can launch at most l_i attacks; within this budget it may engage the same target multiple times or a different target with each attack. Constraint (2): S_j is the maximum number of times the target T_j may be attacked; its value depends on the actual battlefield conditions, and the command system may issue different instructions on the number of attacks against different targets. Constraint (3): the value of S_j will not exceed the sum of all weapon payload capacities.
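A small numerical check makes the model above concrete. The sketch below evaluates E(X) for a toy instance and asserts Constraints (1)-(3); the instance data are arbitrary.

```python
import numpy as np

def damage_per_unit_cost(x, p, v, c, l, s):
    """Evaluate E(X) = J(X) / C(X) from Equation (10) and assert the
    constraints of Equation (12)."""
    assert np.all(x.sum(axis=1) <= l), "Constraint (1): ammo per weapon"
    assert np.all(x.sum(axis=0) <= s), "Constraint (2): attacks per target"
    assert np.all(s <= l.sum()), "Constraint (3): S_j bounded by total ammo"
    damage = (v * (1.0 - np.prod((1.0 - p) ** x, axis=0))).sum()  # J(X)
    cost = float((c[:, None] * x).sum())                          # C(X)
    return damage / cost

# toy instance: m = 3 weapon platforms, n = 2 targets (arbitrary data)
p = np.array([[0.6, 0.4], [0.5, 0.7], [0.8, 0.3]])  # damage probabilities p_ij
v = np.array([3.0, 5.0])                            # threat levels v_j
c = np.array([1.0, 2.0, 1.5])                       # round costs c_i
l = np.array([2, 1, 2])                             # payload capacities l_i
s = np.array([3, 3])                                # attack limits S_j
x = np.array([[1, 1], [0, 1], [1, 0]])              # a feasible allocation X
print(damage_per_unit_cost(x, p, v, c, l, s))
```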

IV. WEAPON-TARGET ASSIGNMENT FRAMEWORK BASED ON DEEP Q-NETWORK (DQN)
As mentioned in Section I, solving the WTA problem with traditional methods is time-consuming and may be unsuitable for real combat environments. To overcome these limitations and obtain an extensible, general WTA allocation strategy, a weapon assignment framework based on centralized multi-agent Q-learning is proposed, named RL4WTA. This framework extends the traditional DQN algorithm to the multi-agent reinforcement learning task of multiple weapons and multiple targets. The framework includes an encoder and multiple decoders, with each decoder outputting Q-values for target selection by one weapon, and can generate satisfactory solutions as the problem scale changes.

A. MARKOV DECISION PROCESS FOR WTA
Since the missile assignment model is discrete and the objective function depends only on future assignment matrices, independent of past information, the problem has strong Markovian characteristics. Therefore, a Markov process G with multiple agents can be constructed, modeled as a Markov decision process tuple G = (S, A, R, P, γ). Here, S is the state space, the collection of all possible states. A is the action space, the set of all available actions. R is the reward function, which maps the state and action spaces to the real numbers and serves as the sole feedback signal for evaluating the quality of actions. P(s_{t+1} | s_t, a_t) = P(s_{t+1} | s_t, a_t, ..., s_1, a_1) is the state transition function, which characterizes the dynamics of the environment. γ ∈ [0, 1] is the discount factor, which indicates the extent to which future rewards are valued and relates the true Q-value to the temporal horizon. In the two extreme cases, γ = 0 signifies that only immediate rewards matter, while γ = 1 gives equal importance to all potential future rewards. As shown in Figure 4, at each time step t the agent receives a state s_t from the environment and generates an action a_t to interact with it. The environment produces a new state s_{t+1} and an immediate reward R_t according to the state transition function P and the reward function R. In RL, the policy π(a|s) = P(A_t = a | S_t = s) outputs the probability of each action given the state. The quality of the policy can be evaluated by the expected cumulative discounted return G_t = Σ_{i=t}^{∞} γ^{i−t} R_i obtained by the agent in the environment. Considering the characteristics of the WTA problem, the Markov decision process model constructed in this paper is specified as follows.

State space S: Owing to the diversity of weapon platforms and their quantities, the state space of the weapon assignment decision model consists of two parts: the attributes of the weapons and the attributes of the targets. The state when assigning the i-th weapon platform includes the weapon's payload capacity, weapon cost, damage probabilities, the target threat coefficients, and the remaining number of allowed attacks on each target.
Action space A: To ensure the effectiveness of target engagement, targets must be prioritized by threat level so that the most threatening targets are engaged first. If multiple firepower units are allocated to the same target, the damage probability for that target is better guaranteed. Therefore, when designing the action space for the WTA task, factors such as the target threat level and the number of interceptors already allocated need to be considered to maximize engagement effectiveness. It should be noted that the damage probability of the same interceptor differs across target types, so target type information must also be considered when allocating firepower units. Thus, assuming that in the i-th decision step the weapon is assigned to the j-th target, the action in the i-th decision step is characterized by u_i (i = 1, 2, ..., l), the number of times the i-th type of weapon platform has been assigned to the j-th target, and v_j, the threat level of that target.

Reward function R: The constructed reward function mainly considers two factors: the ammunition consumption of the weapons and the benefit obtained from engaging the target. Its parameters include the threat level v_j of the j-th target, the cost c_i of the i-th weapon, and a weighting coefficient β, so that the weapon allocation strategy favors actions against higher-threat targets with lower ammunition costs. In this paper, β is set to −0.01.
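The MDP just described can be wired into a minimal gym-style environment as follows. This is a sketch under stated assumptions: the state layout and the reward form r = v_j · p_ij + β · c_i are our illustrative choices, since the text specifies the ingredients (threat v_j, cost c_i, coefficient β = −0.01) but not the exact encoding; action validity is assumed to be enforced by the masking mechanism described below.

```python
import numpy as np

class WTAEnv:
    """Sketch of the WTA MDP (S, A, R, P, gamma). One decision step assigns
    the current weapon platform to one target; weapons are visited in turn.
    State layout and reward form are illustrative assumptions, not the
    paper's exact definitions."""

    def __init__(self, l, c, p, v, s, beta=-0.01, gamma=0.99):
        self.l, self.c, self.p, self.v = l, c, p, v   # capacities, costs, p_ij, threats
        self.s_max, self.beta, self.gamma = s, beta, gamma

    def reset(self):
        self.ammo = self.l.copy()           # rounds left per weapon
        self.hits_left = self.s_max.copy()  # remaining allowed attacks per target
        self.i = 0                          # weapon currently being assigned
        return self._state()

    def _state(self):
        i = self.i                          # weapon attributes + target attributes
        return np.concatenate(([self.ammo[i], self.c[i]], self.p[i],
                                self.v, self.hits_left)).astype(np.float32)

    def step(self, j):
        """Action j: target index for the current weapon. Validity of j is
        assumed to be guaranteed by the masking mechanism (Section IV)."""
        i = self.i
        reward = self.v[j] * self.p[i, j] + self.beta * self.c[i]  # assumed form
        self.ammo[i] -= 1
        self.hits_left[j] -= 1
        self.i = (self.i + 1) % len(self.l)
        done = bool(self.ammo.sum() == 0)
        return self._state(), reward, done
```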
1) MULTI-HEAD Q-VALUE NETWORK

To enhance the computational efficiency of reinforcement learning algorithms on the WTA task, this paper proposes a WTA decision framework based on a multi-head Q-value network. The framework consists of a shared WTA feature-encoding module and multiple Q-value inference modules. The feature-encoding module integrates and interprets the battlefield situational information involved in the WTA task, while the Q-value inference modules infer the value of each weapon's attack options. The key contribution lies in decoupling the complex joint decision space of the WTA task with a multi-head Q-value network, which improves inference and training efficiency and enables rapid, efficient generation of solutions in complex WTA tasks. Specifically, this paper constructs a multi-head Q-value network at the weapon level, Q_1, Q_2, ..., Q_N, where Q_i is responsible for inferring the attack value of weapon i against every available target j. Furthermore, each agent's goal is to maximize the global reward R_tot. All agents therefore weigh current and future rewards through the reward function, so that maximizing the individual action values maximizes the joint action-value function Q_tot, as shown in Equation (17):

argmax_a Q_tot(s, a) = (argmax_{a_1} Q_1(s, a_1), ..., argmax_{a_N} Q_N(s, a_N))    (17)

Each agent's action-value function requires only its local observation, so the entire system executes in a distributed manner, with each agent selecting the action with the highest cumulative expected reward according to its local value function. By ensuring that the joint action-value function has the same monotonicity as each local value function, maximizing the local value functions also maximizes the joint action-value function, satisfying Equation (18):

∂Q_tot / ∂Q_i ≥ 0,  i = 1, 2, ..., N    (18)

2) MASKING FOR CONSTRAINT

WTA is fundamentally a constrained solving task. This paper aims to propose a highly scalable, fast WTA solving model that can be extended to different constraints using reinforcement learning methods. Therefore, based on the constraints and the action space of the WTA task, this paper constructs a masking mechanism that enables the agent to infer the valid actions satisfying the constraint conditions under the current situation. The mechanism prevents the reinforcement learning agent from selecting invalid or unreasonable actions, increasing the sample efficiency of the algorithm and reducing uncertainty during training, and is crucial for implementing the reinforcement-learning-based WTA solving model. Based on Equation (12), this paper designs a constraint-based mask module G(s_t): A ⇒ A, which essentially compresses the joint action space and excludes illegal actions that violate the WTA constraints.
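A minimal version of the mask module G(s_t) can be written as a boolean filter per Q-value head: actions violating the ammunition or per-target attack limits of Equation (12) receive a Q-value of −∞ so the argmax can never select them. Function names here are ours.

```python
import numpy as np

def constraint_mask(ammo_i, hits_left):
    """G(s_t) sketch: boolean mask of valid targets for the current weapon.
    Covers the ammunition limit (Constraint (1)) and the per-target attack
    limit (Constraint (2)) of Equation (12)."""
    if ammo_i <= 0:                          # weapon out of ammunition
        return np.zeros_like(hits_left, dtype=bool)
    return hits_left > 0                     # targets with attack quota left

def masked_argmax(q_i, mask):
    """Best *valid* action for one Q-value head: invalid actions get -inf
    so they can never be selected."""
    return int(np.argmax(np.where(mask, q_i, -np.inf)))

q_i = np.array([0.3, 1.2, 0.7])              # head i's Q-values over 3 targets
mask = constraint_mask(ammo_i=2, hits_left=np.array([1, 0, 2]))
print(masked_argmax(q_i, mask))              # -> 2: target 1 is masked out
```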

3) TRAINING METHOD: Q-VALUE ITERATIVE PROCESS
During interaction with the environment, the RL4WTA framework samples a set of trajectory data and seeks the optimal policy that maximizes the expected cumulative return on the basis of the obtained data. To this end, a neural network model consisting of multi-layer perceptrons is constructed to map the state space to multiple value-function heads, representing the joint optimal value function Q* for multiple weapons. The network structure is shown in Figure 5. It consists of a shared encoder module and a Q-value inference module. The encoder module maps the current state data used for Q-value updates into a hidden space, yielding the feature vector h_t. The Q-value inference module then takes the current hidden state h_t as input and outputs N Q-value vectors, where the i-th vector represents the Q-values over weapon i's possible targets, with Q_ij denoting the Q-value of the action in which weapon i selects target j. The pseudocode of the algorithm can be found in Table 2.

Based on the above network model, the autonomous weapon-target allocation data sampling and policy optimization process can be implemented, as shown in Figure 6. In the data sampling phase, the Q-value model infers the Q-values of all joint actions from the current state. After constraint solving and masking, the Q-values of the executable solutions compatible with the current situation are obtained. An argmax operation over these Q-values yields the current weapon allocation solution, which then interacts with the environment to produce the next state and action. Finally, the data generated by the interaction are stored in an experience replay pool. In the policy optimization phase, a set of trajectory data is randomly sampled from the experience pool, TD targets are calculated with the Q-value model and the target network, and the Q-value network is updated.
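A sketch of the network and one TD update step in PyTorch is given below. Layer sizes are illustrative, and Q_tot is taken as the sum of the per-weapon heads, one simple combination satisfying the monotonicity condition of Equation (18); the paper's exact architecture and hyperparameters are those of Figure 5 and Table 4 and are not reproduced here.

```python
import torch
import torch.nn as nn

class MultiHeadQNet(nn.Module):
    """Shared encoder + one Q-value head per weapon (cf. Figure 5).
    Layer sizes are illustrative."""
    def __init__(self, state_dim, n_weapons, n_targets, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, n_targets) for _ in range(n_weapons)])

    def forward(self, s):
        h = self.encoder(s)                   # shared feature vector h_t
        # one Q-value vector per weapon: shape (batch, N, n_targets)
        return torch.stack([head(h) for head in self.heads], dim=1)

def td_update(q_net, target_net, batch, gamma, optimizer):
    """One TD step, with Q_tot taken as the sum of per-weapon heads
    (masking of invalid actions is omitted here for brevity)."""
    s, a, r, s_next, done = batch             # a: (batch, N) chosen targets
    q_tot = q_net(s).gather(2, a.unsqueeze(-1)).squeeze(-1).sum(dim=1)
    with torch.no_grad():                     # TD target from the target network
        q_next = target_net(s_next).max(dim=2).values.sum(dim=1)
        target = r + gamma * (1.0 - done) * q_next
    loss = nn.functional.mse_loss(q_tot, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```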

V. SIMULATION EXPERIMENT VERIFICATION AND ANALYSIS
This section analyzes the performance of the allocation framework proposed in the previous section. First, two baseline algorithms are briefly introduced. Then, the detailed settings of the simulation environment and the relevant algorithm parameters are provided. Finally, the effectiveness of the proposed framework is verified from different perspectives by comparing solution quality with the other algorithms.

A. DWTA DECISION PROCESS
The DWTA decision process based on the ''attack-observe-attack'' strategy is shown in Figure 7.
From Figure 7, it can be seen that DWTA decisions must dynamically update the desired targets and available weapons according to the current battlefield situation. Therefore, the algorithm requires high optimization speed to meet the demands of weapon-target assignment decisions.

B. EXPERIMENTAL PARAMETER SETTINGS
To verify the optimization performance of RL4WTA on DWTA problems, three different scales of weapon and target quantities are set against the background of joint operations. The number of weapons represents the number of missiles carried by each weapon platform, and the number of targets represents the number of enemy targets. The experiments are divided into three categories: more weapons than targets (m > n), equal numbers of weapons and targets (m = n), and fewer weapons than targets (m < n), giving nine test groups in total. The weapon and target quantities of each group are shown in Table 3. The experiments were conducted on a computer with an i5-6300 processor, 16 GB of RAM, and an RTX 2080 Ti graphics card.
The elements of the payload capacity matrix L are generated randomly, as shown in Equation (16). After generating the payload capacity matrix L and the strike ammunition matrix S, constraint satisfaction is checked; if the constraints are not satisfied, the generation process is repeated until they are met.
For each scenario in Table 3, the proposed RL4WTA method is used to optimize the allocation strategy. Since Q-learning is a deterministic algorithm, ε-greedy exploration is used during training to ensure that the algorithm can find the optimal value: the agent chooses a random action with probability ε and the best-known action with probability 1 − ε. The value of ε is linearly decreased during training, encouraging random exploration at the beginning and gradually favoring the best action as training progresses. The training parameters of the deep Q-value network are shown in Table 4.
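A linear ε schedule and a mask-aware ε-greedy selector might look as follows; the starting value and decay horizon are placeholders, while the 0.05 endpoint matches the final exploration value reported with Figure 8.

```python
import numpy as np

def epsilon_by_episode(ep, eps_start=1.0, eps_end=0.05, decay_eps=100):
    """Linear epsilon schedule; start value and decay horizon are
    placeholders, the 0.05 endpoint follows the training description."""
    frac = min(ep / decay_eps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy(q_values, mask, eps, rng):
    """Random *valid* action with probability eps, else the best valid one.
    Assumes the mask leaves at least one valid action."""
    valid = np.flatnonzero(mask)
    if rng.random() < eps:
        return int(rng.choice(valid))
    return int(np.argmax(np.where(mask, q_values, -np.inf)))
```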

C. SIMULATION RESULTS ANALYSIS

1) BASELINE COMPARISON ALGORITHMS
The proposed DRL-based method is compared with two types of algorithms: the MILP-based exact solution algorithm and a random selection algorithm (RSA).
Mixed-integer linear programming (MILP): the optimization solver Gurobi is used to solve the model of Equations (11)-(12) by adopting an approximate form of the objective function (an illustrative gurobipy sketch of one such baseline is given at the end of this subsection).
Random selection algorithm (RSA): RSA adopts the same allocation framework as our DRL-based method. However, instead of using DQN for target selection, RSA uses a random selector to pick a target from the target pool for each missile.
To reduce the randomness of the simulation experiments, each algorithm is run 10 times, and the average value and standard deviation are recorded for comparison with the strategy obtained by training RL4WTA.
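As referenced in the MILP baseline above, a gurobipy sketch of one possible approximate formulation is given below. Because the objective of Equation (11) is nonlinear, this sketch substitutes a simple additive surrogate Σ_ij v_j p_ij x_ij; the paper states only that an approximate form of the objective is solved, not which approximation Gurobi is actually given.

```python
import gurobipy as gp
from gurobipy import GRB

def solve_wta_milp(p, v, c, l, s):
    """MILP baseline sketch. The true objective (11) is nonlinear, so this
    maximizes an additive surrogate sum_ij v_j * p_ij * x_ij instead."""
    m_w, n_t = len(l), len(s)
    model = gp.Model("wta")
    x = model.addVars(m_w, n_t, vtype=GRB.INTEGER, name="x")   # x_ij >= 0
    model.setObjective(gp.quicksum(v[j] * p[i][j] * x[i, j]
                                   for i in range(m_w) for j in range(n_t)),
                       GRB.MAXIMIZE)
    for i in range(m_w):
        model.addConstr(x.sum(i, "*") <= l[i])   # Constraint (1): ammo limit
    for j in range(n_t):
        model.addConstr(x.sum("*", j) <= s[j])   # Constraint (2): attack limit
    model.optimize()
    return [[int(x[i, j].X) for j in range(n_t)] for i in range(m_w)]
```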
2) TRAINING PROCESS ANALYSIS

Figure 8 illustrates the reward curve of the RL4WTA model trained in the simulated environment. The horizontal axis represents the number of training episodes, and the vertical axis represents the magnitude of the reward during training. The blue curve is the training curve of the multi-head DQN trained with the DRL method, and the yellow curve is the training curve of the RSA model used for comparison. From the figure, it can be observed that the reward values produced by the Q-value network are initially low during the early stages of training, because at this stage the policy output by the deep Q-value network is dominated by the ε-greedy strategy, which focuses on exploring the space of possible strategies. After 100 training episodes, the deep Q-value network has obtained a relatively comprehensive sampling of the sample space. On this basis, the neural network continuously generalizes the Q-values through training while the ε exploration value decreases to 0.05, so performance improves steadily from roughly the 100th training episode until the end of training. It is worth noting that we did not train specifically for the three large-scale experiments (W = 40, T = 56; W = 56, T = 56; W = 56, T = 40). This was done both to reduce training time (each experiment requires over 3 days of training and parameter tuning to achieve the desired results) and to test the generalization ability of the proposed model.

3) SOLUTION QUALITY PERFORMANCE COMPARISON
According to Equation (2), E(X) = max f(X) is used as the evaluation metric to measure whether the global process achieves optimality, where T represents the number of times the full set of targets needs to be allocated. The experimental results (average value ± standard deviation) are shown in Table 5, with the optimal values highlighted in bold. Since MILP is an exact algorithm, its results are identical across runs, hence no standard deviation is provided. It can be observed that MILP achieves the best performance in two instances and also performs well in the other small-scale scenarios. As an exact method, MILP can find good allocation results on problems with relatively small solution spaces. However, as the problem size increases, the dimension of the solution space grows rapidly, making it difficult for MILP to find satisfactory feasible solutions within an acceptable time frame on large-scale problems.
In contrast, RL4WTA shows advantages in most cases, especially in larger scenarios. For example, RL4WTA outperforms MILP by 16.8% in W24-T16 and outperforms the other algorithms by 20.8% in W56-T40. Furthermore, RL4WTA also demonstrates excellent solution quality in small-scale scenarios, with results very close to MILP. Although the deep neural network is trained on relatively small instances to reduce training time, it performs well on all test instances. This indicates that the trained RL4WTA model has a degree of robustness to unexpected changes in scenario parameters and high flexibility and applicability in practical scenarios. In addition to solution quality, the time efficiency of the different algorithms is also studied. Table 6 reports the average runtimes of MILP and RL4WTA in the different scenarios. It can be seen that as the problem size increases, the time consumption of MILP grows rapidly, reaching 1009 s in the W24-T16 scenario. In contrast, RL4WTA completes the weapon-target assignment at every problem scale in a short time (less than 1.2 s), demonstrating greater efficiency, especially in the largest scenario. RL4WTA only needs to iterate over all targets for each weapon, so its computational complexity grows linearly with the problem size. This characteristic ensures that RL4WTA can meet real-time requirements even for large-scale problems.
In conclusion, the ability of our model to find better solutions than the comparison methods is well-founded. First, since the MILP method adopts an approximate form of the objective function, the optimal solution of the substitute problem does not always guarantee optimality for the original problem. Our model, by contrast, does not simplify the problem in any way; through a large amount of training data, the neural network automatically handles the problem and finds suitable weights that produce near-optimal solutions.

VI. CONCLUSION
The weapon-target assignment problem is one of the key challenges in command and control and mission planning, and a fundamental research topic in military operations research. In this paper, a new mathematical model is constructed to meet the requirements of practical combat. To improve the accuracy and speed of solving the DWTA problem, an end-to-end weapon-target assignment framework based on an improved DQN algorithm is proposed. The performance of the proposed model is evaluated through case studies, and the results show that RL4WTA can provide optimal firepower allocation solutions for different test scenarios. It exhibits dynamic adaptability to changes in scenario parameters while demonstrating high optimization accuracy and fast convergence.
With the development of intelligent and unmanned warfare, weapon-target assignment models will become more complex, for instance by considering factors such as weapon resource adequacy and target visibility, maximizing total damage under uncertainty, and incorporating additional complexities such as constraints and interdependencies among weapons or targets in the discrete solution space. Further enhancing the effectiveness of allocation strategies through intelligent algorithms, while taking into account the characteristics of different combat styles, can be regarded as the next research focus.

FIGURE 1. Illustration of the WTA problem. The targets aim to attack each asset along the arrows. Each weapon platform may launch multiple weapons, although this is not depicted in the figure.

FIGURE 2. The procedure of WTA in a defense scenario.


FIGURE 8. The training effect of the multi-head DQN network.

TABLE 1. Summary of variant metaheuristic algorithms and implementations of various WTA.

TABLE 2. Pseudocode for the algorithm.

TABLE 3. Combinations of weapon platforms and target quantities in different scenarios.

TABLE 4. Training parameters of the deep Q-value network.

TABLE 5. Comparison of experimental results.