A Radar Anti-Jamming Strategy Based on Game Theory With Temporal Constraints

The confrontation between radar and jammer is increasingly competitive in electromagnetic spectrum warfare. The current radar anti-jamming methods are constrained to some extent in complex electromagnetic environment. It is necessary to study not only radar specific action benefit but also confrontation strategy. To improve the anti-jamming capacity of radar system, a game theory-based optimization method is proposed to enhance the decision-making of anti-jamming strategy in this paper. First, we analyzed the radar winning conditions and discussed the temporal constraints of the recognition and preparation process of the radar and jammer actions. Second, a temporal sequence interaction based dynamic game model between radar and jammer is constructed. Then, Q-learning is performed to optimize the radar anti-jamming strategies with the temporal sequence interaction gain as the objective function. The simulation experimental results show that the proposed strategy can significantly improve the radar winning probability in the confrontation.

INDEX TERMS Game theory, optimization methods, radar countermeasures.

I. INTRODUCTION
As an important remote sensor, radar is widely used in military and civilian fields, such as target detection and recognition, disaster monitoring, atmospheric measurements, etc.
With rapid development of military electronic technology, radar has played a key role in electromagnetic spectrum warfare, such as the scenario of a missile attacking the hostile naval vessel.The jamming technology, however, is also developing rapidly to confront against radar.Especially in the electromagnetic spectrum warfare, the battlefield electromagnetic environment becomes increasingly complex, which brings serious threats and challenges to the target detection ability of radar [1].As the radar moves rapidly with the missile to the naval vessel and the naval vessel keeps adjusting its location, the radar needs to constantly detect the information including the location reliably in order to hit the naval vessel in the end.However, the naval vessel implements jamming from one type to another to protect itself, so the radar must implement anti-jamming actions accordingly in The associate editor coordinating the review of this manuscript and approving it for publication was Mauro Gaggero .this process.Radar has a set that includes many detection and 31 anti-jamming actions, while the jammer has a set that includes 32 many jamming actions.In the radar or jammer action set there 33 is no action that can defeat all the actions of the opponent.The 34 confrontation between radar and jammer can be regarded as 35 a two-player zero-sum dynamic game.

36
As the competition between jamming and anti-jamming 37 technology is becoming fiercer, more attentions have 38 been paid on radar anti-jamming techniques by the radar 39 researchers and practitioners.The game theory is a useful tool 40 to analyze the strategies of radar and jammer confrontation 41 [2].There are two popular research fields to applying the 42 game theory to the strategy of radar-jammer interaction.43 1) Solve the Nash equilibrium (NE) point under the spe-44 cific strategy set and utility function matrix and take the 45 action corresponding to the equilibrium point as the follow-46 up.Wonderley et al. [3] proposed a radar anti-jamming model 47 based on game theory, which divided the radar anti-jamming 48 game process into three parts: information collection and 49 identification, evaluation of utility functions and results opti-50 mization.However, the radar or jammer would keep adjust-51 ing its actions instead of remaining unchanged even in the 52 dominant position of a game.He and Su [4] analyzed and summarized the application of game theory in radar antijamming strategy.Mutual information was used to optimize the utility function, and the equilibrium scenarios were analyzed game information in symmetrical and asymmetric game information.Mutual information criterion was also applied in [5] to solve the strategy design problem within the framework of Stackelberg game and egalitarian game.Li et al. [6] studied the radar anti-jamming process under the condition of signal game and carried out simulation experiments, without considering the recognition process of radar.In the radarjammer interaction, one player could not fully perceive the action of the other.Sheng [7] studied the anti-jamming techniques of SAR and got the corresponding solutions under the conditions of imperfect and perfect information, still with the focus on the solutions of NE points.Zhou [8] studied the anti-jamming decision-making technology of radar under the scene of imperfect information, and compared different methods to get solutions of NE point.However, the winning condition of radar confrontation was not concerned.Occupying the advantageous position in the last round of a game matters much for the player to win the game.The effects of jamming on the target detection performance of a radar using constant false alarm rate (CFAR) processing were analyzed using a game theoretic approach in [9], which solved the resulting matrix-form games for the optimal strategies of both the jammer and the radar.Deligiannis et al. [10] studied a distributed beamforming and resource allocation technique for a radar system in the presence of multiple targets, which presented a proof of the existence and uniqueness of the NE in both the partially cooperative and noncooperative games.
2) Study the optimization issues of game utility function in specific environment (such as the noise or clutter background), and in specific radar system (such as multiple-input multiple-output (MIMO) radar, polarization radar and radar network).Zhang et al. [11] discussed the information acquisition and utility function acquisition in the game process of polarization radar in detail.The optimization of utility function was realized and the working performance of radar was improved by selecting reasonable polarization mode and waveform.Gao et al. [12] optimized the utility function by adaptively adjusting the energy and waveform of MIMO radar based on mutual information under the conditions of perfect and imperfect information.In the Stackelberg game scenario of cognitive MIMO radar, Wang et al. [13] used particle swarm filter to optimize the utility function with the minimum root mean square as the objective function.
Under the clutter condition of MIMO radar, Lan et al. [14] used mutual information to optimize the utility function by adopting two-step water filling method to adjust the selection of waveform.Mutual information criterion was also applied in [15] to formulate the utility functions of a smart target and a smart MIMO radar.Reference [16] studied a game framework of joint beamforming and power allocation for multistatic radars and multiple jammers, in which the receive beamformer weight vector could be obtained by applying minimum variance distortionless response.Panoui et al. [17] 109 proposed a game theoretic waveform allocation algorithm 110 for a MIMO radar network, which used potential games to 111 optimize the performance for radars in the clusters.For radar 112 network, Deligiannis et al. [18] investigated a game-theoretic 113 power allocation scheme and performed a NE analysis for a 114 multistatic MIMO radar network, and Bogdanović et al. [19] 115 considered a target selection problem for multitarget tracking 116 in a multifunction radar network from a game-theoretic per-117 spective, which proposed distributed algorithm based on the 118 best response dynamics to find the equilibrium points.

119
The main research purposes of the above literatures focus 120 on the confrontation benefits of the player's action.How-121 ever, the winning of radar usually requires superiority in 122 the last round of a game, which relates to the utility of the 123 NE point as well as the action's temporal constraints.The 124 game between radar and jammer is a dynamic process within 125 limited confrontation interval, and it usually consumes time 126 to take effective actions and recognize the actions of the 127 other player rather than real-time, which brings a great impact 128 on the winning condition.The game confrontation interval 129 contains hostile action recognition and the preparation for the 130 effective actions.It is a vital problem about how to optimize 131 the radar actions during the total game interval.However, 132 the current researchers aforementioned do not consider this 133 problem into the radar-jammer game process.

134
In order to deal with the above problem, an optimization 135 method of radar-jammer game strategy with temporal con-136 straints is proposed in this paper.First, a temporal sequence 137 interaction based dynamic model of radar and jammer is 138 constructed in a noncooperative two-player zero-sum game.139 Then, the strategy planning techniques of both sides are 140 analyzed by considering the temporal constraints, including 141 the preparation intervals and recognition intervals of actions.142 Lastly, the Q-learning method is performed to optimize the 143 temporal sequence planning of actions under the correspond-144 ing strategy sets.

145
The remainder of this paper is organized as follows: 146 Section II constructs the game model of radar and jam-147 mer confrontation based on temporal sequence interaction.148 Section III presents the proposed optimization method based 149 on Q-learning.The effectiveness of the proposed method 150 is verified with simulation and experimental results in 151 Section IV.Finally, the conclusions of this study are pre-152 sented in Section V.

GAME MODEL 155
In this section, the game model of radar and jammer con-156 frontation based on sequential interaction is proposed.i) The 157 actual utility matrix of radar is constructed through the recog-158 nition process; ii) The radar winning condition is proposed, 159 then the evaluation index is educed, and a temporal sequence 160 radar-jammer interaction game model is constructed.iii) The 161 trend of the specific parameters in the game model is dis-162 cussed in detail.

A. ACTUAL UTILITY FUNCTIONS
The recognition probability is introduced to construct the actual utility matrix based on the utility functions of radarjammer game.The elements of game theory generally include players, strategy sets and utility functions [20].The players are the main participants in the game, the strategy sets are the sets of actions that can be taken by each player, and the utility function is the benefit that is corresponding to the actions taken by the radar and jammer respectively.The strategy set and utility functions of one player are generally unknown to the other, but in the confrontation between radar and jammer, they can be estimated by the prior knowledge.
The utility matrix is the matrix of benefits that the players can get when they take different actions.Assuming the radar and the jammer both know the utility matrix.By calculating the utility matrix of radar, the mixed strategy corresponding to NE is obtained, which is the weighted combination of several actions.However, only one action can be taken in the actual game, which means the mixed strategy cannot be directly applied.The radar utility is obtained under the assumption that both radar and jammer take actions corresponding to the maximum probability in the mixed strategy, but the benefits obtained in this way may be quite different from the utility corresponding to the mixed strategy.By introducing the recognition process and considering the utility of NE point, the utility matrix is modified to become executable and consistent to the actual process.
The rows represent the radar actions and the columns represent the jammer actions, and e ij is the benefit of the ith action of radar to the jth action of jammer.It represents the radar capability to obtain information of naval vessel, which can be reflected by parameters such as signal-noise-ratio (SNR), the detection probability (Pd) and so on.A big value of e ij means radar can get the information of the target easily and accurately.According to the characteristics of zero-sum game, the utility matrix of jammer is -E.
However, an action recognition process takes place during the confrontation between radar and jammer.One player usually needs to get the action information from the other with a certain recognition probability, which brings a vital impact on the utility matrix.When the actions of jammer are recognized, the radar utility is defined by where e Rij0 represents the radar utility when the jammer 215 action is successfully recognized, p j is the recognition prob-216 ability of radar for the j th jammer action, and e ij is the same 217 variable in (1).

218
When the actions are not identified, the radar utility is 219 calculated according to the utility corresponding to the action 220 which takes the maximum probability in the mixed strategy 221 of NE.It is expressed by where e Rij1 represents the radar utility when it is not identified 224 and e ne is the corresponding benefit when the radar reaches 225 the NE.

226
The actual radar utility is defined by the larger value of the 227 above conditions, which is denoted as where e Rij represents the actual radar utility.

230
Considering the recognition process, the actual utility 231 matrix of radar, denoted as E R , is obtained as The utility matrix without recognition process is called 237 initial utility matrix, and each items of the initial utility 238 matrix of radar and jammer are opposite to each other.The 239 utility matrix obtained after the recognition process is called 240 actual utility matrix, and the actual utility matrix of radar and 241 jammer are obviously different.

B. WINNING CONDITION OF RADAR-JAMMER GAME 243
Attention should be paid to the temporal constraints of actions 244 in the game, rather than only the benefits of actions in the pro-245 cess of radar-jammer interaction.In fact, it consumes time not 246 only for the actions of radar and jammer to take effect, but also 247 for them to be recognized by each other.The interval that the 248 radar and jammer need to take effect is called the preparation 249 interval, and the interval required for their recognition of the 250 other player's action is called the recognition interval.

251
After the action implemented by the radar takes effect, the 252 jammer carries out recognition process and takes its action, 253 and then its action is recognized by radar.These four stages 254 are called a round of confrontation in radar-jammer game.

255
The radar carried by the missile approaches the naval 256 vessel with a high speed from distance.After detected by 257 the naval vessel, the dynamic game starts.The objective 258 conditions for the end of game include entering the nonescape area of radar carrier, the condition where radar antenna can illuminate the target but cannot complete overload maneuver, etc.The total game interval is limited but not constant, and it is constantly updated with the progress of the game.As the total game interval of radar game process is limited, the temporal characteristics of preparation interval and recognition interval lead to great effect on the final results.
The game is composed of multiple rounds whose number is determined by the total game interval, the recognition interval and the preparation interval in each round.This process between radar and jammer is called temporal sequence interaction model.
The outcome of the game between radar and jammer is usually determined by the result of the last round, which is quite different from the general two-player zero-sum game.
There are two situations in the last round of the game: 1) After the radar takes actions and passes the preparation interval, the remaining game interval is less than the sum of the recognition interval of the jammer to radar action and the preparation interval of the corresponding action.Therefore, in the last round the radar implements the action, while the jammer cannot.The radar is dominant at the end of the game.
2) After the jammer takes actions and passes the preparation interval, the remaining game interval is less than the sum of the recognition interval of radar to jammer action and the preparation interval of the corresponding action.Therefore, in the last round the jammer implements the action, while the radar cannot.The jammer is dominant at the end of the game.
The game process is presented in Fig. 1.  jammer to radar's action is recorded as T R (n, a).The prepa-302 ration interval of jammer's action is recorded as T P (n, s), 303 and the recognition interval of radar to jammer's action is 304 recorded as T R (n, s).The temporal sequence interaction gain 305 G 0 is expressed as In the actual confrontation, due to the actions such as radar 308 fingerprint recognition, the recognition interval of the other 309 player's action may become shorter.

310
The recognition conditions may be met after several rounds 311 of confrontation, then step reduce the recognition interval and 312 stabilize again.
where Thr refers to the game round threshold which is set 315 according to the prior knowledge, T 0 is the recognition inter-316 val before reaching the threshold, and T 1 is the recognition 317 interval after exceeding the threshold.

318
The recognition interval may also gradually reduce in the 319 law of where a 0 is the amplitude adjustment factor.

322
The change of recognition interval can be simplified 323 according to the actual scenario.

A. FUNDAMENTAL OF Q-LEARNING 327
Reinforcement learning is a popular optimization algorithm 328 in recent years, which allows agents to acquire knowledge 329 through the interaction with the environment and make inde-330 pendent adjustment and selection [21].The idea of reinforce-331 ment learning is to give ''reward'' for the correct choice and 332 ''punishment'' for the wrong choice.Through reinforcement 333 learning, the G 0 of radar is improved, so as to improve the 334 winning probability.Q-learning is a kind of reinforcement 335 learning [22].Its formula is where the matrix Q(s, a) is the action matrix with the highest 339 reward value found in the current state, s, a representing the 340 current state and behavior and s, a are the next state and 341 behavior; α is the proportion of the old Q value in the new 342 state; γ is the discount factor, which indicates the degree to which future earnings are converted to current ones.
Q-learning aims to find the best action of the agent in its current state.It uses predictions of the environment's expected response to move forward and always choose the best one.In this temporal sequence interactions model, s represents the action that the jammer has taken, and a represents the action that the radar will take.
As the game between radar and jammer is a dynamic process, the impact of the benefit at the next moment on this round should be considered, which often requires complex derivation and judgment.The discount factor γ in Q-learning can solve this problem.

B. DESIGN OF REWARD IN Q-LEARNING
The design of reward is very important in Q-learning.If the benefits of radar's action as a reward is the only factor that is considered, the dominant position of radar in the temporal sequence interaction model cannot be guaranteed.Thus, we need to design the reward of reinforcement learning by considering the actual benefits of radar action and temporal sequence interaction G 0 .The reward of Q-learning is divided into two parts as the following formula where R (s, a, n) is the final reward of Q-learning, G 0 (s, a, n) is temporal sequence interaction gain in (7), and G 1 (s, a, n) is the gain which is related to the radar actual utility matrix in (5), s is the radar state and is actually the action that the jammer has taken in this model, a represents the action that the radar will take and n is the number of game rounds.
The preparation interval of radar action is constant in the game.The change of recognition interval is related to the recognition ability of both sides as mentioned in ( 8) and ( 9).However, in the actual confrontation, there are too many actions for both sides to choose, each of which only emerges for several times.That is not enough to reduce the period of recognition interval.We can consider that the recognition interval does not change with the confrontation round by simply denoting G 0 (n) as G 0 .
In (5), attention is mainly paid to whether the item value in the matrix is positive rather than its amplitude.Considering the actual confrontation process, it is generally expected that G 1 (s, a, n) is large in a long distance in order to ensure that the radar can detect the target accurately and reliably.
where r as is the utility under the current radar action and jammer action which can be checked in (5), and n is the number of game rounds.A 1 is the gain amplitude coefficient, and a 1 is the amplitude adjustment rate coefficient.A 1 and a 1 are determined by the prior knowledge and can be adjusted.
In the theoretical calculation, the model can be simplified by assuming that the utility matrix remains unchanged during the game.As the radar carried by the missile approaches the naval vessel, the utility will usually get bigger.However, all 396 the item value in the matrix get bigger at the same time, 397 which make little influence to the transition matrix as it is 398 determined by the relative value of each other in the matrix.399 Then G 1 is not relevant with n.We can calculate G 1 by 400 G 1 = 1, when r as ≥ 0 −1/G 0 , when r as < 0.
(13) 401 Then the reward of Q-learning is depended by the state and 402 the corresponding action, which becomes 403 R (s, a) = G 0 , when r as ≥ 0 −1, when r as < 0. ( The reward of optimization takes the main factor G 0 into 405 account and weakens the influence of utility function when 406 its value is negative.In this way, the Q-learning can find the 407 action to get higher G 0 in the end of the game.

408
The reward of Q-learning is calculated in the following 409 process as illustrated in Fig. 3.For example, the jammer takes the action of DN, and the 411 radar takes the action of anti-DN, then the r as gets their value 412 from (1).T p (n, a), T R (n, a), T P (n, s), T R (n, s) gets its value 413 from the prior knowledge.Then G 0 is calculated, and as r as is 414 positive in the example, G 1 equals to 1. Therefore, the reward 415 R (s, a, n) can get its value which equals to G 0 .

416
If the radar takes the action of anti-RFT instead, r as is 417 negative this time.G 1 equals to 1/G 0 .R (s, a, n) can get its 418 value which equals to −1.

419
G 0 is usually larger than 1, which makes it more effective 420 in the optimization progress.

422
In the game between radar and jammer, it is considered 423 that the radar can optimize its action by Q-learning, and the 424 jammer determines its next action according to the transi-425 tion matrix without Q-learning optimization.In this way, the 426 action of jammer can be regarded as the environmental factor 427 of radar.

428
The transition matrix of jammer is determined by its initial 429 utility matrix.There are several methods to determine the 430 transition matrix, where the simplest one is to select the jam-431 mer's action that obtain the maximum benefit for the current 432 radar action by the following formula where i stands for the current action taken by the current radar, and e ij the initial utility matrix of the radar in (1).−e ij represents the initial utility matrix of the jammer.J * is the number of the optimal action for the jammer, which follows the greedy deterministic control policy.
For example, the utility is 4 when the jammer takes the action of DN, and the radar takes the action of anti-DN.And the utility turns to 1 when the radar takes the action of anti-VGPO.According to (15), the radar takes the determinative action of anti-DN.
Jammer action can also be selected from all actions favorable to jammers according to SoftMax method with the formula of where j is the number of the action radar taking in current state, and e ij stands for the combination of the negative values of radar actual utility matrix corresponding to the current radar action.
Considering the example mentioned above, the action anti-VGPO also get the probability to be chosen in (16).
It can also be selected by comprehensively considering the action utility of the jammer and the specific distribution of actions, so as to reduce the predictability of actions.

IV. EXPERIMENTS AND DISCUSSION
A numerical experiment is carried out to verify the rationality of the proposed optimization method.In the experiment, the radar and jammer can both take their action from the strategy set with four actions.

A. EXPERIMENTAL SETTINGS AND DATASETS
In the temporal sequence interaction model, the radar action is determined by Q-learning optimization, while in the control group, the jammer takes its action by the transition matrix without considering the temporal constraints.
According to the typical process from the start-up of antiship missile to the end, the total game interval is set as 100 s.
Assuming the radar starts to work when its distance to the naval vessel is 100 kilometers, and its velocity is 1000 m/s, then the total game interval is 100 s.As we would carry out the experiment many times for the same utility function matrix, in order to increase the uncertainty, a random interval is added to the total game interval to make it variant from time to time.The random interval is between -1 and 1.In the actual confrontation, a radar action can beat several jammer actions and be defeated by some other jammer actions.Each row and column of the utility function matrix must have positive and negative items simultaneously.We set an example of the radar initial utility matrix (1) as where the rows represent the radar actions and the columns 483 represent the jammer actions.The item value in the matrix 484 is the benefit of the radar action to the one specific jammer 485 action.It represents the radar capability to obtain information 486 of naval vessel, which can be reflected by parameters such as 487 signal-noise-ratio (SNR), the detection probability (Pd) and 488 so on.

489
As the experiment is implemented to verify the perfor-490 mance of the temporal sequence interaction model by consid-491 ering the temporal constraints, we set a typical utility matrix 492 by which we could get the radar winning probability.The 493 radar winning probability is taken as the control group.Thus, 494 the probability distribution of NE corresponding to radar 495 actions can be obtained as

501
The result shows that radar is in a weak position.

502
Set the recognition probability of jammer for radar actions 503 and radar for jammer as Table 1: Recognition probability settings of the experiment.
After recognition, the radar actual utility matrix becomes 505

508
The item values in the actual utility matrix of the jammer 509 are all negative, because the utility corresponding to the NE is 510 negative, which means the jammer can always ensure that its 511 utility exceeds that of the radar.This conclusion is obtained 512 when the radar also adopts the corresponding action of NE. 513 However, when the radar is at a disadvantage, its action is not 514 necessarily selected in this way.As the strategy corresponding to NE is a mixed strategy, one specific action has to be taken instead.Therefore, the utility corresponding to NE can hardly be got, radar can still have chance to win the game.
We assume that the preparation interval and recognition interval of radar and jammer keep unchanged in the game.
According to the radar-jammer interaction process, the preparation interval and recognition interval of radar and jammer actions are set as the order of 0.01 s, which is typical in actual confrontation.
For example, a radar needs about 128 pulses to recognize the jammer or make its action effective typically.Assuming the pulse repetition interval is 200 µs, then the total interval comes to 25600 µs(0.0256s), which is the same order of 0.01 s.The preparation interval and recognition interval may be different when the action changes, but they still have the same order.Table 2 shows the specific settings of radar.Table 3 shows the specific settings of jammer.These settings should ensure that if the benefit of one radar action to one jammer action is big, the corresponding preparation interval and recognition interval will lead to small temporal sequence interaction gain G 0 .Otherwise the optimization will be of no vail because the reward of Q-learning will be the same with the control group.
The state transition matrix is determined by the actual utility matrix of the jammer.The state transition matrix is set as The rows represent the radar actions and the columns represent the jammer actions.

B. RESULTS
The simulation data should base on the temporal sequence interaction gain G 0 .However, we find its value changes greatly among different game rounds.It is difficult to see the trend directly.Therefore, we take the average value of G 0 each round to analyze the overall trend, which is expressed as where N is the number of rounds the game is played in, and G 0 indicates the average value of temporal sequence interaction gain G 0 after N rounds of the game.These param-555 eters of Q-learning are optimized through simulation experi-556 ments, and 100-time repeated experiments are carried out by 557 each group to explore the performance and stability of the 558 parameters.

559
As presented in Fig. 4, when γ equals to 0.3, and α 560 values are taken as 0.5,0.6,0.7,0.8 successively, we recorded 561 the result G 0 when one parameter combination experiment 562 ends.Then we get a curve of 100 points for one parameter 563 combination experiment.There are four parameter combi-564 nation experiment curves and one control experiment curve 565 in the figure.We can see clearly that the value of G 0 varies 566 from round to round.Therefore, we must consider the result 567 of parameter combination of γ and α to make sure the 568 result of optimization can get high and stable performance 569 simultaneously.

570
We choose γ values from 0.3,0.4,0.5,0.6,0.7,0.8, and α val-571 ues are taken as 0.5,0.6,0.7,0.8, then we can get 24 parameter 572 combinations.As presented in Fig. 5, it is drawn that under 573 different α and γ combinations, the reinforcement learning 574 results of some combinations are not stable enough, and the 575 performance is also diverse.The mean value and standard 576 deviation of the above parameter combinations are counted 577 to measure the characteristics of the combined parameters.

578
We can see intuitively from the Fig. 5 that good results of 579 the mean value and standard deviation can hardly be achieved 580 at the same time.The standard deviation corresponding to 581 the parameter group with large mean value is also large.582 In Table 4, the statistical analysis of specific data is listed.

583
Considering that in practical engineering, the pursuit of 584 stability is often more urgent than the pursuit of high mean 585 VOLUME 10, 2022  We could see that the Q-learning group and the control group ends at different round because they have different temporal constrains of radar and jammer action.The curve of Q-learning is clearly much higher than that of the control   As listed in Table 5, the temporal sequence interaction gain and the radar winning probability before and after the optimization in the two experiments are calculated.
In the Table 5, the radar winning probability P v is calculated as It is observed that the proposed optimization method can improve the temporal sequence interaction gain by more than 23%, and improve the radar winning probability by more than 5%.
The optimization of G 0 in the experiment is executed offline, while the data it used may differ from the actual environment.However, we can gather the information we need through the experiments, then the data we get can be more reliable.Although the data may contain error compared with the real data, or the actual data vary from time to time, we can upgrade the data within the flight before the missile reaches the goal.

V. CONCLUSION
This paper proposes a radar anti-jamming strategy based on noncooperative two-player zero-sum dynamic game.The winning criterion of the game is analyzed, which is strongly related to the dominance of radar in the last round.By considering the temporal constraints and the utility functions, a game model which set the temporal sequence interaction gain as the objective reward is constructed.The experimental results show that the radar winning probability is significantly improved in confrontation after being optimized by Q-learning.
The actions of jammer include Barrage Noise (BN), Responsive Spot Noise (RSN), Doppler Noise (DN), Range False Targets (RFT), and Velocity Gate Pull-Off (VGPO) or other jamming attacks.The actions of radar are simply referred to as Anti-BN, Anti-RSN, etc.The numbers of actions of radar and jammer are M and N, respectively.The utility matrix of radar is expressed by

FIGURE 1 .
FIGURE 1. Temporal Sequence interaction of radar and jammer flow and the winning condition.The interval range of radar's dominance in a round of game is shown in Fig. 2. If the game ends in the radar dominance interval range, the radar wins the game while the jammer wins otherwise.Therefore, a conclusion is drawn that the radar should try to cut down the recognition interval of the other player's action and select its action with shorter preparation interval.At the same time, it should try to prolong the jammer's recognition interval of the radar's action and the preparation interval of the corresponding action.

FIGURE 2 .
FIGURE 2. Interval distribution of radar dominance in one round.

FIGURE 3 .
FIGURE 3. Process of calculating the reward of Q-learning optimization.
value, it is necessary to select the parameter combination 586 with stable results as far as possible.We consider that the 587 combination of γ = 0.5 and α = 0.6 is the optimal one 588 in the experiment.In this condition, the average value of 589 the experimental results is 1.25, and the result of the control 590 group without Q-learning optimization is 0.99, which reveals 591 that the optimization effect of Q-learning is obvious.Under 592 this condition, a typical curve of a group of experimental optimization process with the optimal parameter combination is shown in Fig.6.
group, which reveals that the result of optimization is obvious.The first few points in Fig.6deviate from the line, as we 601 choose the radar action randomly and these points have not 602 been smoothed by the averaging process.
603C.SUMMARY AND DISCUSSION604Considering the experimental result in Table4, we find that 605 the combination of γ = 0.5 and α = 0.6 can maintain 606 good stability and performance by Q-learning optimization, 607 in the case of typical interval characteristic parameters of radar countermeasure.

TABLE 2 .
Preparation interval and recognition interval of radar.

TABLE 3 .
Preparation interval and recognition interval of jammer.
FIGURE 4. Gain of temporal sequence interaction game when γ = 0.3 for repeating 100 times.

TABLE 4 .
Statistics of experimental result.FIGURE 6.Typical result of Experiment with and without Q-learning.

TABLE 5 .
Comparison of experimental results before and after optimization.