Multifunctional Radar Cognitive Jamming Decision Based on Dueling Double Deep Q-Network

To address the inefficiency and imprecision of the Deep Q-Network (DQN) algorithm in radar jamming decision-making, this paper proposes a multifunctional radar jamming decision optimization method based on the Dueling Double Deep Q-Network (D3QN). First, we use a value function reflecting changes in the radar state, together with an advantage function relating the radar state S to the jamming action A, to improve the cognitive jamming level for unknown radar modes. Then, using dueling networks for jamming strategy selection and effectiveness evaluation further improves decision accuracy. Finally, we introduce a prioritized experience replay mechanism during network training to shorten the decision-making time. Experimental results show that the proposed method completes decision tasks 2.1 times more efficiently than DQN and improves decision accuracy by approximately 10% over DQN.


I. INTRODUCTION
As multifunctional radars with complex parameter systems are increasingly deployed on the modern battlefield, the struggle between the radar and jamming sides is undergoing an unprecedented change. Radar detection technology has consistently led the development of electronic jamming technology [1], [2], which shortens the jamming side's response time. This situation makes it difficult for current jamming decision technology to counter emerging modern radars, such as multifunctional radars and cognitive radars [3], [4]. It is therefore urgent to study improved methods for radar jamming decision-making.
In recent years, the rapid development of artificial general intelligence (AGI) technology has given rise to many advanced techniques and optimization theories [5]. Reinforcement learning, an important branch of machine learning, is considered one of the essential directions of AGI research [6]. DeepMind proposed the Deep Q-Network (DQN) in 2013, combining neural networks and Q-learning to build an end-to-end control policy model and successfully validating the method's feasibility in Atari games [7]. The Nature DQN algorithm followed in 2015, establishing its leadership in the field of reinforcement learning with excellent empirical results [8]. DQN has since been widely used in game competitions [9], [10], decision optimization [11], scheduling control [12], [13], and many other areas.
Given the power of reinforcement learning methods, many scholars have proposed radar jamming decision methods based on reinforcement learning theory. These offer great potential for promoting autonomy and intelligence in the radar countermeasure process [14], [15]. Li et al. [16] introduced cognitive techniques into the radar countermeasure process for the first time, providing a new idea for radar jamming decisions. Xing et al. [17], [18] further analyzed Q-learning theory and solved the jamming decision problem when the radar operating mode is unknown. Li et al. [19] improved Q-learning with the Simulated Annealing (SA) algorithm to enhance jamming strategy exploration and utilization. Zhang et al. [20] used reinforcement learning to make the jamming decision process more scientific and rational. Gao et al. [21] established an offensive and defensive model for jamming against cognitive radar, in which the dynamic countermeasure process is realized to find a reasonable jamming strategy. Smits et al. [22] presented a cognitive radar network that uses available resources, shares data among the network components, and considers prior knowledge for jamming decisions. Pan et al. [23] applied an improved chaotic genetic algorithm to allocate jamming strategies and evaluated the jamming effect using the radar detection probability as the index. Liu et al. [24] solved the jamming strategy allocation problem by comparing the Q-learning algorithm with the Double Deep Q-Network (DDQN) algorithm. The above radar jamming decision methods have achieved the desired results to a certain extent. However, problems of slow decision speed and low accuracy remain as the number of radar modes increases [18], [19], [20], [24]. This paper proposes a multifunctional radar jamming decision method based on the Dueling Double Deep Q-Network (D3QN) to solve these problems.
We first analyze the shortcomings of the traditional method in solving the jamming decision problem, and then establish a D3QN-based decision model according to the operational characteristics of multifunctional radar. Next, we use the DDQN to solve the problem of Q-value overestimation. Then, we adopt the dueling networks to calculate the Q values for jamming actions more accurately, reducing the error of values in complex countermeasures environments. Finally, we propose a prioritized experience replay mechanism to improve the sample utilization and reduce the decision-making time further. The simulation results show that the D3QN method has apparent advantages in decision efficiency and jamming accuracy.
The paper is organized as follows. Section 2 analyzes the inefficiency and overestimation problems of traditional reinforcement learning methods and introduces our method. Section 3 describes the core technology of the D3QN method. Section 4 presents the simulation results, which demonstrate the scientific validity and feasibility of the proposed method. Finally, Section 5 concludes the paper.

A. REINFORCEMENT LEARNING PRINCIPLES FOR JAMMING DECISION-MAKING
Reinforcement learning uses the ''trial and error'' mechanism from psychology. The agent obtains an evaluative reward signal through continuous interaction with the unknown environment and repeats this process to generate optimal strategies [25]. Even without prior knowledge of the environment, the agent can still learn the best strategy through the decision-making process, making reinforcement learning one of the effective approaches for solving decision-making problems in nonlinear stochastic systems [26].
The specific process of radar jamming decision-making using reinforcement learning is as follows. The jammer detects the target radar and obtains information about the radar state s_t, where s_t ∈ S denotes the radar state at time t and S represents the set of all operating states of the radar. The jammer receives a feedback reward by performing a jamming action a_t ∈ A on the target radar, where A denotes the set of jamming actions the jammer can take. The radar then shifts to a new state s_{t+1} due to the jamming. In the process of continuous countermeasure, the mapping from radar states to jamming actions is defined as a strategy π : S → A. The jammer can calculate the value of a strategy based on the feedback reward and use it as the basis for selecting the optimal strategy, as shown in Fig. 1. By repeating the above process, the value function V^π(s_t) of strategy π, the expected sum of feedback rewards from time t onward, can be obtained as

V^π(s_t) = E[ Σ_{k=0}^{∞} γ^k r_{t+k} ],

where γ ∈ [0, 1] denotes the reward discount rate of the learning process. Every strategy π thus has a corresponding value function, and the optimal strategy π* can be found as

π* = arg max_π V^π(s), for all s ∈ S.

B. DQN METHOD
Q-learning is a derivative reinforcement learning theory that enables decision optimization by establishing a dynamic programming process. When the problem is characterized by a Markov process, the future state is related only to the current state and not to past states. According to the Bellman equation, the state-action value function of traditional Q-learning is updated as

Q(s_t, a_t) ← Q(s_t, a_t) + α[ r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ],

where Q(s_t, a_t) indicates the expected sum of rewards obtained by the jammer after taking jamming action a_t when the target radar is in state s_t, and α ∈ [0, 1] is the learning rate. The optimal decision is output when the expected sum converges, which makes the method suitable for decision problems with simple, low-dimensional spaces.
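The tabular update above can be sketched in a few lines. This is a minimal illustrative example (the paper's experiments use Matlab; the state and action indices, rewards, and hyperparameter values here are invented for demonstration):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy setting: 3 radar states, 2 jamming actions, Q-table initialized to zero.
Q = np.zeros((3, 2))
# Suppose jamming action 1 in state 0 moved the radar to state 2 with reward +100.
Q = q_learning_update(Q, s=0, a=1, r=100, s_next=2)
print(Q[0, 1])  # 0.5 * (100 + 0.9 * 0 - 0) = 50.0
```

Repeating such updates over many interactions drives the Q-table toward the fixed point of the Bellman equation.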
When the number of target radar states increases, the high-dimensional radar states make the Q-table enormous. The complexity of the algorithm grows exponentially due to the ''curse of dimensionality'' [27], significantly decreasing overall decision efficiency, so the method is difficult to apply effectively to multifunctional radar jamming decision problems. DQN differs from traditional Q-learning by fitting the Q-function with deep neural networks. It outputs Q-values directly from high-dimensional raw data using two neural networks with the same structure but different parameters [28], [29]. The value network reflects the real reward value obtained by the jammer interacting with the target radar, denoted as

Y = r + γ max_{a'} Q(s', a'; θ⁻).

The estimation network uses the sample data to estimate the state-action value Q(s, a; θ), introducing a loss function L(θ) that represents the difference between the estimated and real values:

L(θ) = E[(Y − Q(s, a; θ))²].

By training on the samples over several iterations, the parameters θ of the estimation network are periodically copied to the value network. As a result, the estimated value approaches the real value and the loss function is minimized, which stabilizes the network and mitigates the ''curse of dimensionality''. This opens up new research ideas and methods for the jamming decision-making problem of multifunctional radar.
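The target value Y and loss L(θ) above can be sketched as follows. Here simple linear maps stand in for the two Q-networks, and all weights, states, and rewards are invented for illustration; a real implementation would use trained deep networks:

```python
import numpy as np

def dqn_loss(theta_est, theta_tgt, batch, gamma=0.9):
    """Mean squared error between the target value
    Y = r + gamma * max_a' Q(s', a'; theta-) and the estimate Q(s, a; theta).
    Linear maps (matrix @ state-vector) stand in for the two networks."""
    losses = []
    for s, a, r, s_next in batch:
        y = r + gamma * (theta_tgt @ s_next).max()  # value (target) network
        q_est = (theta_est @ s)[a]                  # estimation network
        losses.append((y - q_est) ** 2)
    return float(np.mean(losses))

# One-hot states of dimension 2, two actions; identical toy weights for both nets.
theta = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
batch = [(np.array([1.0, 0.0]), 0, 1.0, np.array([0.0, 1.0]))]
loss = dqn_loss(theta, theta, batch)
print(loss)  # Y = 1 + 0.9*2 = 2.8, Q = 1, so loss = (2.8 - 1)^2 = 3.24
```

Gradient descent on this loss, with θ⁻ held fixed between periodic copies of θ, is what stabilizes DQN training.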

C. D3QN METHOD
At present, the application of DQN to radar jamming decisions has achieved remarkable results [18], [19], [20], [24]. Nevertheless, analyzing its network structure and algorithmic principles, the following aspects still deserve in-depth exploration.
(1) The model uses the same structure to generate the real reward and estimated values. When the network parameters are constantly updated, it is difficult to obtain relatively stable estimated values, which hinders the algorithm's convergence.
(2) There is an estimation bias in the value function during training. Using max_{a'} Q(s', a'; θ⁻) can lead the model to overestimate the reward of an action, misleading the jammer into choosing the wrong action and falling into a locally optimal solution.
To solve the DQN training instability and overestimation problems, this paper applies D3QN to improve the efficiency and accuracy of jamming decision-making. We first introduce DDQN [30] on the basis of DQN. DDQN separates action selection and effectiveness evaluation into an estimation value network Q_M(s, a; θ) and a target value network Q_T(s, a; θ⁻). The estimation value network calculates the Q-value after jamming, and its parameters θ are updated from the samples. The target value network calculates the target value Y through temporal-difference learning, and its parameters θ⁻ are periodically replaced with the latest θ. The target value Y is calculated as

Y = r + γ Q_T(s', arg max_{a'} Q_M(s', a'; θ); θ⁻).

Holding θ⁻ constant for a period keeps the target value Y relatively fixed, which benefits convergence. We use Q_M to select actions and Q_T to calculate the target value. Because selection and evaluation no longer share the same maximization, the model avoids picking overestimated sub-optimal actions, effectively solving the overestimation problem of the DQN method. D3QN further takes advantage of the dueling network architecture by splitting the estimation value network Q_M(s, a; θ) of DDQN into two parts: a state value function V(s; θ, w_V), which characterizes the influence of the radar state, and an action advantage function A(s, a; θ, w_A), which distinguishes the jamming effect in a given radar state [31].
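The decoupled DDQN target can be sketched as below. The two lambda Q-functions are toy stand-ins with invented values, chosen so the networks disagree and the decoupling is visible:

```python
import numpy as np

def ddqn_target(r, s_next, q_m, q_t, gamma=0.9):
    """Double DQN target: Q_M selects the action, Q_T evaluates it:
    Y = r + gamma * Q_T(s', argmax_a Q_M(s', a))."""
    a_star = int(np.argmax(q_m(s_next)))    # action chosen by estimation network
    return r + gamma * q_t(s_next)[a_star]  # value taken from target network

# Toy Q-functions over 2 actions: Q_M prefers action 1, which Q_T values lower.
q_m = lambda s: np.array([1.0, 2.0])
q_t = lambda s: np.array([5.0, 3.0])
y = ddqn_target(r=1.0, s_next=None, q_m=q_m, q_t=q_t)
# y = 1.0 + 0.9 * 3.0 = 3.7, whereas a plain max over Q_T (as in DQN)
# would give the overestimated 1.0 + 0.9 * 5.0 = 5.5.
```

The gap between 3.7 and 5.5 in this toy case is exactly the kind of overestimation the double estimator removes.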
The improved estimation value network Q_M(s, a; θ, w_V, w_A) is defined as

Q_M(s, a; θ, w_V, w_A) = V(s; θ, w_V) + [ A(s, a; θ, w_A) − (1/|A|) Σ_{a'} A(s, a'; θ, w_A) ].

The neural network first carries out an initial judgment of the data and then completes the action reward correction, so that the output action is better aligned with the actual situation. The target value of the D3QN model is given by

Y = r + γ Q_T(s', arg max_{a'} Q_M(s', a'; θ, w_V, w_A); θ⁻),

and the loss function for updating the network parameters is

L(θ) = E[(Y − Q_M(s, a; θ, w_V, w_A))²].

To reasonably simplify the problem and highlight the key points of the radar jamming decision process, this paper does not consider specific equipment types, operator errors, or other influencing factors. The D3QN model includes the following four elements.
(1) State space S. The set of states represents the operating modes of the multifunctional radar. For example, a phased array radar has many modes, such as detection, tracking, guidance, and parameter measurement.
(2) Action space A. The action space is the set of jamming strategies that the jamming party can use in electronic countermeasures. For example, the jammer has deceptive jamming, suppression jamming, and other jamming patterns.
(3) Transfer probability function P(s′|s, a). It denotes the probability that the jammer changes the radar's state to s′ by jamming action a when the radar operating state is s.
(4) Reward function R(s, a). It indicates the immediate return value after taking a particular jamming action. R is defined by the change in radar threat level after jamming; we set
a) R = 100 if the radar switches to a lower threat level;
b) R = 0 if the radar threat level does not change;
c) R = −100 if the radar switches to a higher threat level.
When the multifunctional radar is in state s, the jammer selects a jamming pattern according to the ε-greedy strategy. We analyze the change in radar threat level after jamming and store the obtained sample (s, a, r, s′) in the experience pool. The experience pool is a memory replay unit that stores the experience samples obtained from the jamming countermeasure; during training, the neural network is updated with samples drawn from this unit, and the sampling follows the prioritized experience replay mechanism. We calculate the action reward value using the estimation value network and update the network parameters with the mean squared error as the loss function. The target value network outputs Y as the final reward value. After sufficient training and learning in the adversarial environment, the optimal jamming strategy can be output once the cumulative reward values converge.
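The reward function in element (4) maps directly to code. This sketch encodes threat levels as integers (lower number = lower threat); the function name and encoding are illustrative, not from the paper:

```python
def jamming_reward(threat_before, threat_after):
    """Reward from the change in radar threat level after a jamming action:
    +100 if the threat level dropped, -100 if it rose, 0 if unchanged.
    Threat levels are integers; a smaller value means a less threatening mode."""
    if threat_after < threat_before:
        return 100
    if threat_after > threat_before:
        return -100
    return 0

print(jamming_reward(3, 2))  # 100: jamming pushed the radar to a lower threat mode
print(jamming_reward(2, 2))  # 0:   no change in threat level
print(jamming_reward(2, 3))  # -100: radar escalated to a higher threat mode
```

This sparse ±100/0 shaping is what lets the cumulative reward count successful threat-level transitions, as used in the comparisons of Section IV.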
The flow chart of the D3QN for multifunctional radar jamming decisions is as follows.
The above method realizes an autonomous online closed-loop learning process, effectively improving the countermeasure level of jamming decision models. It meets the requirements of intelligent, dynamic, and real-time cognitive electronic warfare.

B. NETWORK STRUCTURE
Since the states of the multifunctional radar are high-dimensional and continuous, discretizing the state space increases the difficulty of the decision process. Therefore, D3QN uses the nonlinear fitting capability of the dueling network to obtain a more accurate estimation value network function. As a result, the jammer can better reduce the action-value error after jamming for different radar states.
We input the radar state s into the dueling network, which outputs the state value function V(s; θ, w_V) and the action advantage function A(s, a; θ, w_A) after hidden-layer processing.
The state value function represents the value of the radar threat level change after jamming. The action advantage function represents the value of choosing a particular jamming action and outputs a vector of dimension |A|. The state value function and action advantage function are then linearly combined to obtain the real reward value Q_M(s, a; θ, w_V, w_A) of each jamming pattern. This paper uses a forward 3-layer fully connected neural network to approximate the action value. The neural network structure is shown in Figure 3.
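The combination of the two heads can be sketched as below. The scalar value and advantage vector are invented toy outputs of the two heads, not trained values:

```python
import numpy as np

def dueling_q(v, adv):
    """Combine the state value V(s) with action advantages A(s, a):
    Q(s, a) = V(s) + (A(s, a) - mean_a' A(s, a')).
    Subtracting the mean advantage removes the shared offset between the two
    heads, so V and A are separately identifiable."""
    return v + (adv - adv.mean())

adv = np.array([1.0, 3.0, 5.0])  # advantage head output, one entry per jamming pattern
q = dueling_q(v=10.0, adv=adv)
print(q)  # [8. 10. 12.] -- the mean advantage (3.0) has been subtracted
```

Note that shifting every advantage by a constant leaves Q unchanged, which is precisely why the mean-subtraction is needed.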

C. PRIORITY EXPERIENCE REPLAY MECHANISM
Neural network training assumes that the training data are independent and identically distributed. However, the jammer can only obtain the reward value by observing the state changes of the target radar, so the interaction data are correlated and do not meet this condition. Therefore, DQN adopts the ''Experience Replay'' mechanism and randomly selects samples to update the network, solving the distribution problem caused by correlated data [32], [33].
In actual radar jamming decisions, random sampling tends to ignore the differences between experience samples, resulting in sampling inefficiency and increased decision time. Therefore, this paper proposes an improved prioritized experience replay mechanism based on the temporal-difference error (TD-error) [34]. The TD algorithm evaluates the priority of a sample by the difference between the target and estimated Q-values:

δ_i = Y_i − Q_M(s_i, a_i; θ),

where a small positive constant ε is added to |δ_i| in the D3QN network so that every sample retains a nonzero priority. A larger |δ_i| indicates that learning from this sample can yield a greater improvement for the network, so its priority I(i) should be higher. The sampling priority I(i) is given by

I(i) = (|δ_i| + ε) / Σ_k (|δ_k| + ε),

where i is the sample index. However, the jammer will then often visit samples with large absolute TD-errors and rarely or never visit others, leading to local convergence of the strategy, which cannot provide reliable guidance for the actual jamming decision process. Therefore, this paper assigns a higher sampling priority to experience samples with low access frequency. The state distribution of the experience samples evolves from the initial-state probability I(m_0) as the decision process proceeds. If the sampling probability I(m) of sample m_t is large, the jammer often updates the neural network using the same radar state, so it is appropriate to reduce the sampling frequency of that experience sample. More samples then contribute to updating the neural network, maximizing the information value of each radar state, which effectively improves decision efficiency and reduces the impact of local optima on decision accuracy.
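One minimal sketch of this sampling scheme is below. The exact way the paper combines TD-error priority with access frequency is not given in closed form, so the division by `1 + visit_counts` is an assumed illustrative choice, as are all the toy numbers:

```python
import numpy as np

rng = np.random.default_rng(42)

def per_probabilities(td_errors, visit_counts, eps=1e-3):
    """TD-error-based priorities (|delta| + eps), discounted by how often each
    sample has already been drawn -- an assumed stand-in for the paper's
    low-access-frequency preference -- then normalized to sum to 1."""
    priority = (np.abs(td_errors) + eps) / (1.0 + visit_counts)
    return priority / priority.sum()

td = np.array([0.5, 2.0, 0.1])   # TD-errors of three stored samples
visits = np.array([0, 5, 0])     # sample 1 has already been replayed often
p = per_probabilities(td, visits)
# Sample 1 has the largest TD-error but many visits, so the never-visited
# sample 0 now carries the highest sampling probability.
idx = rng.choice(len(td), size=2, replace=False, p=p)
```

A production implementation would typically use a sum-tree to keep sampling O(log N); the linear normalization here is only for clarity.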

IV. SIMULATION VERIFICATION
A. DESCRIPTION OF THE COUNTERMEASURES ENVIRONMENT
Multifunctional radar generally has a variety of operating states. In the actual countermeasure process, jamming gradually reduces the radar threat level. For example, when a multifunctional radar is in the guidance state, it may lose some parameter information after jamming and can no longer lock on the target continuously; the radar then shifts only to the imaging state, which has a lower threat level. Continued jamming degrades the radar's imaging accuracy and precision, so the radar cannot detect the target and transitions to the coarse search state. In this situation, the effect of the jamming process can be considered significant. A radar therefore generally does not switch directly from the highest threat level to the lowest [35]. We completed the experiments in a Matlab environment on a platform with an Intel(R) Core(TM) i7-10750H CPU @ 2.60 GHz, 16 GB RAM, and no graphics acceleration. We assume a multifunctional radar with sixteen operating states S_sample = {s_1, s_2, s_3, ..., s_16} and a jammer that can take nine jamming patterns A = {a_1, a_2, ..., a_9}. We then generate a connected network with random transition relationships using Matlab. In Figure 4, the network nodes represent the radar states and the arrows between nodes indicate the state transition directions. We define state s_1 as having the highest threat level and the target state s_16 as having the lowest. The transfer probabilities P_t between states follow a Gaussian distribution with mean µ and variance σ², and P_t ∈ [0, 1] indicates the probability of transferring a radar state to another radar state, with the probabilities summing to 1 [17], [18], [20].
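A transition structure of this shape could be generated as follows. This is a Python sketch (the paper uses Matlab), and the clip-then-renormalize step is an assumed way of keeping Gaussian-drawn probabilities in [0, 1] with rows summing to 1; the connectivity constraints of Figure 4 are not modeled:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_transition_matrix(n_states, n_actions, mu=0.5, sigma=0.2):
    """Per (state, action) pair, draw Gaussian transfer probabilities,
    clip them to [0, 1], and renormalize so each row sums to 1."""
    p = rng.normal(mu, sigma, size=(n_states, n_actions, n_states))
    p = np.clip(p, 0.0, 1.0)
    return p / p.sum(axis=-1, keepdims=True)

# Sixteen radar states and nine jamming patterns, as in the simulation setup.
P = random_transition_matrix(n_states=16, n_actions=9)
print(P.shape)                    # (16, 9, 16)
print(round(P[0, 0].sum(), 6))    # 1.0 -- each row is a valid distribution
```

Masking entries of `P` to zero before renormalizing would be one way to impose the directed-network structure of Figure 4.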
The neural network built for training has three fully connected layers. The number of nodes in the input layer equals the radar state dimension, and the number of nodes in the output layer equals the jamming pattern dimension. The intermediate layer is connected to the dueling network, and the other parameters are listed in Table 1.

B. SIMULATION PROCESS
We first initialize the network parameters before the jamming decision process starts. We then extract the 10% of samples with the lowest sampling frequencies from the experience pool to calculate the loss function. Finally, we update the estimation value network Q_M according to the calculation results and replace the target network parameters θ⁻ with the current θ every 100 training rounds.
The radar state starts from s_1, and the transition ends at s_16. The jamming strategy that achieves the fastest transition is considered optimal.
In order to use all the jamming patterns, the exploration factor is initially set to 1, so the jamming pattern is selected randomly at the beginning of the decision process. The exploration factor then decreases by 0.1 every 100 training rounds and remains constant once it reaches 0.2. The exploration probability is then only 20%, meaning that the jammer can take full advantage of the acquired experience at the end of training. The decision results are shown in Figure 5.
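The exploration schedule described above can be written as a small function (the function and parameter names are illustrative):

```python
def exploration_factor(round_idx, start=1.0, step=0.1, interval=100, floor=0.2):
    """Epsilon schedule from the experiment: start at 1.0, drop by 0.1 every
    100 training rounds, and hold at 0.2 thereafter."""
    return max(floor, start - step * (round_idx // interval))

print(exploration_factor(0))     # 1.0 -- fully random at the start
print(exploration_factor(350))   # three full intervals elapsed -> 0.7
print(exploration_factor(2000))  # 0.2 -- floor reached, mostly exploiting
```

At each decision step, a random jamming pattern is taken with this probability and the greedy (highest-Q) pattern otherwise.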
In Figure 5, the horizontal coordinate represents the number of training rounds and the vertical coordinate represents the number of decision steps. At the beginning of the countermeasure, with little experience stored in the network, the jammer can only explore through many aimless attempts. As the number of interactions increases, the jammer stores the learned experience in the experience sample pool. The prioritized experience replay mechanism then significantly reduces the number of decision steps, sharply improving learning efficiency. The training reaches a steady state at about 1500 rounds.
Furthermore, the decision curve finally converges at about five steps, coinciding with the minimum number of steps required in the network constructed in Figure 6. This indicates that the jammer has learned the best jamming strategy, completing the decision task with little a priori knowledge.

C. COMPARATIVE ANALYSIS OF METHODS
The D3QN-based multifunctional radar decision-making method introduces the DDQN [24] to improve the decision accuracy. Furthermore, we improve the sample utilization and shorten the decision time through the prioritized experience replay mechanism [20].
Generally, the more cumulative reward the jammer obtains during training, the more often the jammer can make successful transitions during the jamming decision process. Therefore, we demonstrate the improvement of the overall decision by analyzing each component separately, simulating and comparing the cumulative reward value over 2000 rounds with only a single modification at a time. The cumulative reward values are shown in Figure 7, where the horizontal coordinate represents the total decision rounds and the vertical coordinate represents the cumulative reward value in a decision round. D3QN-ER and D3QN-PER denote the methods using the experience replay mechanism of [34] and of this paper, respectively. The corresponding average reward values per 200 rounds are recorded in Table 2. Figure 7 and Table 2 show that although all four methods can maximize their cumulative reward value, the DQN method converges poorly due to the overestimation problem. Its cumulative reward reaches a maximum of only 782.4, in rounds 1800 to 2000, which makes it difficult to provide reliable help to the jammer.
DDQN uses a different network structure and effectively avoids the influence of local optima on the decision. Its maximum value of 1239.3 occurs in rounds 1200 to 1400, an improvement over the DQN method. However, the deviation in the Q-value calculation gradually increases, and the final effect at the end of training is unsatisfactory. The D3QN-ER method samples using the experience replay mechanism based on the TD-error [34]. D3QN-ER makes the Q-value calculation more accurate, with a maximum value of 1352.0 in rounds 1000 to 1200, and clearly enhances the decision-making process. However, repeated sampling also reduces the useful information available in subsequent training, inevitably decreasing the magnitude of the gain.
The D3QN-PER maximizes the information in the samples by using the prioritized experience replay mechanism proposed in this paper. It combines the advantages of the other methods, making the final reward curve relatively smooth. The reward value converges in rounds 600 to 800, and the maximum reward value is 1643.9, approximately 2.1 times that of the DQN method. This shows that the method obtains the best jamming decision scheme with fewer training rounds, avoiding the waste of jamming resources and making the decision process more effective and stable.
When the decision accuracy converges, the method can be considered to have learned the optimal strategy, and the overall decision result will no longer change over time. Therefore, we use the simulation environment designed in Section A to compare the method in this paper with the current main methods [18], [19]. We define the decision success rate as the percentage of the 2000 rounds in which each method makes a successful transition, and record the decision time as an index of efficiency. The results are shown in Figure 8. When the decision process stabilizes, the Q-learning method needs to establish a large-scale state-action table, which requires many calculations and prolongs the decision time to about 37 s. The decision accuracy of this method is only about 60%, making it difficult to complete the radar jamming decision task accurately and in real time.
The DQN method introduces neural networks to calculate the Q-values, effectively avoiding the problem of the growing number of decision dimensions. As a result, the overall decision process is more efficient, taking only approximately 25 s to arrive at the optimal decision. However, using the same network structure to calculate the Q-value leads to frequent overestimation. Even when the decision accuracy stabilizes, the system still has a probability of choosing a suboptimal strategy, resulting in significant decision errors that mislead the jammer into choosing the wrong jamming action and delay the best jamming opportunity.
Although the D3QN method takes slightly longer than DQN for the decision accuracy to converge, it effectively reduces the calculation error and the impact of the overestimation problem on the decision result, and the decision accuracy curve changes more stably overall. Therefore, D3QN can provide a more reliable decision-making process for the jammer and has better practical application value. In summary, the multifunctional radar cognitive jamming decision method based on D3QN achieves better results.

V. CONCLUSION
In this paper, we address the slow convergence and Q-value overestimation problems of existing DQN-based radar jamming decision methods. First, we build the decision model according to the multifunctional radar countermeasure process. Then, action selection and effectiveness evaluation are generated with different functions. Finally, the prioritized experience replay mechanism is used to further improve the training efficiency of the neural network and shorten the time to reach the optimal decision. The simulation results show that the D3QN method is more stable and reliable: it completes decision tasks 2.1 times more efficiently than DQN and improves decision accuracy by approximately 10%. On the whole, the D3QN method can serve as an effective technique for multifunctional radar jamming decision-making, laying a good foundation for the engineering implementation of cognitive electronic warfare systems.
LU-WEI FENG was born in 1998. He received the B.S. degree in electronic information engineering from Nanchang University, Nanchang, China, in 2020. He is currently pursuing the M.S. degree in electronic information engineering with Dalian Naval Academy. His research interests include electronic countermeasure and artificial intelligence.