Anti-Jamming Game to Combat Intelligent Jamming for Cognitive Radio Networks

Cognitive Radio (CR) provides a promising solution to the spectrum scarcity problem in dense wireless networks, where the sensing ability of cognitive users helps acquire knowledge of the environment. However, because the medium is shared, cognitive users are vulnerable to different types of attacks. In particular, jamming is considered one of the most challenging security threats in CR networks: an attacker disrupts communication by transmitting a high-power noise signal in the vicinity of the targeted node. The jammer may be an intelligent entity capable of exploiting the dynamics of the environment. In this work, we provide a machine-learning-based anti-jamming technique for CR networks to evade a hostile jammer, where both the jamming and anti-jamming processes are formulated in the Markov game framework. In our framework, secondary users avoid the jammer by maximizing their payoff functions using an online, model-free reinforcement learning technique called Q-learning. We consider a realistic mathematical model in which the channel conditions are time-varying and differ from one sub-channel to another, as in practical scenarios. Simulation results show that our proposed approach outperforms existing anti-jamming approaches over a wide range of scenarios.

INDEX TERMS Cognitive radio networks, anti-jamming communication, stochastic game, machine learning, multi-agent reinforcement learning, Q learning.

LIST OF SYMBOLS
α   Learning rate.
β   Regret factor of the jammer.
η   Bandwidth efficiency (bps/Hz).
γ   Discount factor.
a^n_l   Action of the n-th secondary user on the l-th channel.
c^m_j   Action of the m-th jammer on the l-th channel.
C_{l,t}(a_n, c^m_j)   Channel capacity of the n-th secondary user on the l-th channel in time slot t.
N   Number of secondary users in the network.
Q(s^k, a^k_l)   Q-table entry of the secondary user for the l-th channel in the k-th time slot.
Q_j(s^k, c^k_j)   Q-table entry of the jammer for the l-th channel in the k-th time slot.
A(s)   Action space of the secondary users.
S   State space of the game.
U^n_{l,t}(a_n, c^m_j)   Utility of the n-th secondary user on the l-th channel in time slot t.

I. INTRODUCTION
Cognitive Radio (CR) is a promising technology to cope with the scarcity of the electromagnetic spectrum, which is a natural resource. Traditional wireless radio communication works on fixed frequency slots, which results in overcrowding in certain portions of the electromagnetic spectrum while other portions are underutilized. CR is aware of its surrounding Radio Frequency (RF) environment. It learns, reasons, decides and adapts to external conditions with the aim of efficiently utilizing the radio spectrum and carrying out reliable and uninterrupted wireless communication [1], [2]. Furthermore, CR can provide opportunistic access to spectrum holes, mitigating the intermittent use of the radio spectrum with machine learning algorithms [3]-[7]. The adaptability of CR can bring intelligence to spectrum sensing and spectrum decision [5]. On the other hand, an adversary can exploit the same capabilities intelligently to inflict greater harm on the underlying CR Network (CRN) [8]-[10]. Therefore, ensuring security is of paramount importance to the successful deployment of cognitive radio networks. More explicitly, jamming attacks, Denial of Service (DoS) attacks [11], Primary User Emulation Attacks (PUEA) [12]-[14], Spectrum Sensing Data Falsification Attacks (SSDFA) [15], exploitation of common control channel security [16] and collaborative jamming [17] are well-known attacks in cognitive radio networks. However, our major concern is to combat jamming attacks in cognitive radio networks.
The model of a jammer in a CRN is shown in Fig. 1, where the jammer disrupts wireless communication by generating high-power noise, causing narrowband interference on a single sub-channel at a time near the transmitting and receiving nodes [18]. Intensive jamming can result in either total disruption of the wireless communication or a signal to noise ratio (SNR) so low that secondary users (SUs) cannot communicate successfully. In traditional wireless communication systems, Frequency Hopping Spread Spectrum (FHSS) and Direct Sequence Spread Spectrum (DSSS) are widely used to thwart jammers [19], [20]. Due to dynamic spectrum mobility [21], these techniques cannot be directly applied in cognitive radio technology to combat hostile jammers.
Game-theoretic analysis of power-control-based anti-jamming communication has been investigated in [31], [35]-[41]. For instance, in [35], a power control Stackelberg game was presented as a leader-follower game for jamming defense in cognitive radio networks; the problem is divided into sequential sub-problems, a follower sub-game and a leader sub-game. Another Stackelberg game was used in [36] for relay selection for physical-layer security in cognitive radio networks. More specifically, a One Leader One Follower Stackelberg Game (OLOFS) was modeled to achieve the optimal pricing strategy and power allocation in the presence of two eavesdroppers. Furthermore, the Primary User (PU) and the selected relay act simultaneously to achieve a Nash Equilibrium (NE). In [37], the authors presented an adaptive approach to defend against jamming attacks in CRNs by controlling the transmission powers of the nodes, where the network topology is adaptively updated to nullify the effects of the jammer. The trade-off between jamming immunity and network coverage is cast as an optimization problem, which can be solved by scalable decomposition strategies. The authors also presented a continuous version of the game by considering a continuous action space for both players.
Game theory has also been used to investigate frequency hopping anti-jamming communication in wireless networks. For instance, anti-jamming communication in CRNs with unknown channel statistics was studied in [55]. The authors formulated the problem of anti-jamming multi-channel access in CRNs as a non-stochastic multi-armed bandit problem, where the secondary sender and receiver choose their common operating channels by minimizing the probability of being jammed. Another interference-avoidance-based channel-hopping stochastic game was investigated in a multi-agent environment in [32], where a game-theoretic reinforcement learning mechanism is used to avoid jamming.
Recently, the authors in [56] presented a bandwidth-efficient frequency hopping game in wireless sensor networks. The authors in [57] presented a brief overview of anti-jamming communication in the context of dynamic spectrum access. Two typical ways of thwarting a jammer are adapting the transmission rate and Frequency Hopping (FH). These two are jointly adopted in [58] in order to improve the average throughput and to provide better jamming resiliency against a reactive sweep jammer. Specifically, the interaction between the jammer and the legitimate user is modeled in [58] as a Zero Sum Markov Game (ZSMG) and a constrained NE is derived. The authors in [45] utilized a game-theoretic framework to access an optimal channel in the presence of an attacker, thereby maximizing the channel payoff. A Channel Hopping (CH) based rendezvous scheme is adopted for SUs to meet and establish a connection for further communication [44]. This bounded-time rendezvous scheme neither uses pre-shared secrets nor requires role pre-assignment to bring the SUs onto a commonly available channel.
The authors in [46] presented a mobility-based Single Leader and Multiple Follower Stackelberg game (SLMFSG) to avoid jamming and extend network lifetime in Wireless Sensor Networks (WSNs). Anti-jamming games in multi-channel cognitive radio networks were presented in [47], where the SU hops to another channel to avoid jamming. A zero-sum game is played between the attacker and the SU based on a Markov Decision Process (MDP); Maximum Likelihood Estimation (MLE) and Q-learning enable the SU to learn from its environment. In [43], the authors proposed a Hierarchical Learning Algorithm (HLA) for anti-jamming channel selection strategies in the presence of co-channel interference, formulated as a Stackelberg game. They considered the jammer and the users as independent learners (ILs), which choose their strategies independently and selfishly.
In [48], an anti-jamming FH game is constructed as a bi-matrix game between the jammer and the legitimate user. In [49], a game-theoretic stochastic learning approach is used for anti-jamming communication in dense wireless networks. In [33], the authors considered joint multi-agent learners in a stochastic game setting against a sweep jammer. They presented a collaborative multi-agent anti-jamming algorithm based on reinforcement learning in wireless networks, where a Markov game is formulated to model and analyze the anti-jamming problem in a multi-user environment. A time-domain countermeasure against random pulse jamming using an MDP and reinforcement learning was presented in [29]. In [32], MARL is used as independent Q-learning for each agent against a sweep jammer, as is common practice. Another game-theoretic anti-jamming scheme for CRNs is presented in [34], where the SU uses Q-learning to learn the dynamics of the jammer and to reduce the complexity of value-iteration-based learning. We consider this scheme as our benchmark. However, the authors of [34] only considered anti-jamming under ideal channel conditions with no noise present, and they did not consider time variations in the wireless channel. We improve the framework presented in [34] by considering time-varying channels, which is a more realistic approach in CRNs. Moreover, in our case the utilities of both players depend on channel quality: the better the channel, the higher the reward, and vice versa. We differentiate the sub-channels based on the received SNR, which results in varying maximum channel capacities.

A. MOTIVATION AND CONTRIBUTION
From the above discussion, it is evident that substantial research effort has been devoted to anti-jamming for CRNs in the frequency domain. However, most of the literature has assumed a fixed jammer strategy that does not change with time. With the development and technological advancement of cognitive radio networks, it is highly conceivable that a jammer will also adapt its attacking strategies intelligently. Hence, there is a need for an intelligent anti-jamming strategy for CRNs. To fulfil this need, we develop a mathematical model of the system that endows the SU with intelligence to cope with an intelligent jammer. An intelligent jammer is cognitive in nature, with the ability to learn, to reason and to adjust its strategies against the SU for maximum damage to the CRN.
Moreover, the authors in [35] have shown that discrete problems, such as frequency selection in anti-jamming, are difficult to handle using convex optimization. Hence, learning theory is needed in the decision process. The learning algorithm should be capable of coping with uncertain dynamics and incomplete information, whereas game theory can adequately model and analyze the mutual interactions among adversarial users. Therefore, it is promising to incorporate learning algorithms into game theory.
Against this backdrop, we devise a game-theoretic optimal frequency hopping scheme between the SU and an intelligent jammer in a dynamic environment, using the Q-learning approach to pick the optimal sub-channel as shown in Fig. 4. The emphasis of this study is on a game-theoretic frequency hopping technique for avoiding the jammer; we refer the reader to [31], [35]-[41] for power-adaptation-based anti-jamming techniques. We develop a game model with both players as Independent Learners (ILs), which selfishly and independently select their optimal sub-channels in a Multi-Agent Reinforcement Learning (MARL) setting to increase their individual utilities. The proposed game-theoretic model, in conjunction with the learning-based FH algorithm, helps the SU avoid the attacker, reducing the probability of jamming and increasing the bandwidth efficiency of the cognitive system. Our novel contributions can be summarized as follows:
1) We consider a cognitive adversarial jammer, an intelligent attacker that can adapt to the dynamics of the sub-channels and the strategies of the SU.

2) We extend the framework of [34], which assumed that all sub-channels have the same quality of service, by proposing a more realistic and practical channel model. More specifically, our channel conditions may change over time and differ from one sub-channel to another.
3) The proposed framework considers various factors and parameters that capture near-practical channel dynamics, e.g. SNR, variable channel capacity, jamming gain, as well as the transmission cost and jamming cost of each player in the game.
The rest of the paper is organized as follows. Section III explains the system and adversary models. Section IV provides the anti-jamming game formulation against an intelligent jammer. Section V describes the proposed solution mechanism, while Section VI presents the evaluation and results. Finally, Section VII concludes the paper.

III. SYSTEM AND ADVERSARY MODEL

A. SYSTEM MODEL
The interweave paradigm for a time-slotted system is assumed in our CRN, where an SU can access the spectrum only if it is not used by a PU [59]. In cognitive-radio-based communication systems, timing and frequency asynchronization affects spectrum sensing performance and may result in false detection of PUs by the SUs. Assuming imperfect synchronization is more realistic for the analysis of communication systems; however, as a first step, we assume perfect time and frequency synchronization between all SUs for the sake of simplicity, as in [60], [61]. Every user scans the available sub-channels and starts transmission after a white space is found. We assume that there are H PUs, N SUs and M jammers in the network. Furthermore, W is the bandwidth of the channel. Each sub-channel may have a different channel capacity, based on the received SNR. It is assumed that the jamming attack is the only source of channel deterioration in the network; we neglect any other source of interference, including the effects of multipath fading. Each sub-channel can be in one of two states, namely the idle state and the busy state. The relationship between the PU and an SU can be described by one of the two possible states of the sub-channel as follows:
• IDLE STATE: The channel is idle if it is not being used by any PU. The SU and the jammer are allowed to utilize an idle channel. The idle state of the sub-channel is represented by P = 1.
• BUSY STATE: The channel is busy if it is being used by a PU. Neither the SU nor the jammer is allowed to transmit over a busy channel. This state is represented by P = 0.
The channel states (idle or busy) are not known before the sensing action takes place. We consider L sub-channels, where the quality of each sub-channel is different.
Each sub-channel has a maximum capacity limit based on its received SNR, given by:

C_{l,t}(a_n, c^m_j) = (W/L) log_2(1 + SNR^n_{l,t}(a_n, c^m_j)),  (1)

where C_{l,t}(a_n, c^m_j) represents the capacity of the l-th sub-channel for the n-th SU at time slot t, and a_n and c^m_j are the actions of the n-th SU and the m-th jammer, respectively. Moreover, W/L is the bandwidth in Hz of each of the L sub-channels and SNR^n_{l,t}(a_n, c^m_j) is the received SNR of the l-th channel for the n-th SU. Let us first consider the case where no jammer is present in the system; the SNR is then defined as:

SNR^n_{l,t} = P^n_{l,t} / (N_o · W/L),  (2)

where P^n_{l,t} is the average signal power received by the n-th SU on the l-th sub-channel at time slot t and N_o is the power spectral density (PSD) of the Additive White Gaussian Noise (AWGN). A high SNR^n_{l,t}(a_n, c^m_j) gives a high channel capacity C_{l,t}(a_n, c^m_j) and hence a higher channel quality. The channel capacity of the n-th SU on the l-th sub-channel can thus be computed as:

C^n_{l,t}(a_n, c^m_j) = (W/L) log_2(1 + P^n_{l,t} / (N_o · W/L)).  (3)

Moreover, the Signal to Interference plus Noise Ratio (SINR) in the presence of a jammer can be calculated as SINR^n_{l,t}(a_n, c^m_j) = P^n_{l,t} / (N_o · W/L + J^m_l · B_j), where J^m_l is the PSD of the jamming signal and B_j is the bandwidth of the jammed channel. Since all sub-channels have an identical bandwidth of B_j = W/L, the SINR becomes:

SINR^n_{l,t}(a_n, c^m_j) = P^n_{l,t} / ((N_o + J^m_l) · W/L).  (4)

We only get (4) when a_n = c^m_j, i.e. when both the SU and the jammer are on the same channel, which results in a severe degradation of the SNR for the SU. The objective of the SU is to carefully switch to an available high-capacity channel to maximize spectrum utilization, while avoiding potential jamming.
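The capacity and SINR relations in (1)-(4) can be sketched in Python. The numeric values below (bandwidth, received power, noise and jamming PSDs) are hypothetical, chosen only to illustrate the degradation when the jammer lands on the SU's sub-channel:

```python
import math

def subchannel_capacity(W, L, P_rx, N0, jam_psd=0.0):
    """Shannon capacity of one sub-channel of bandwidth W/L (Hz).

    P_rx    : average received signal power at the SU
    N0      : noise power spectral density (AWGN)
    jam_psd : PSD of the jamming signal (0 when the jammer is elsewhere)
    """
    B = W / L                              # per-sub-channel bandwidth, Eq. (1)
    sinr = P_rx / ((N0 + jam_psd) * B)     # Eq. (4); reduces to Eq. (2) when jam_psd == 0
    return B * math.log2(1 + sinr)

# Hypothetical numbers, for illustration only.
W, L = 10e6, 10            # 10 MHz split into 10 sub-channels
c_free = subchannel_capacity(W, L, P_rx=1e-9, N0=1e-17)
c_jam = subchannel_capacity(W, L, P_rx=1e-9, N0=1e-17, jam_psd=5e-16)
assert c_jam < c_free      # jamming degrades the achievable rate
```

The same routine covers both the jammer-free case of (2)-(3) and the collision case of (4), since the only difference is whether the jamming PSD contributes to the denominator.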

B. SECONDARY USER MODEL
The SU senses its environment during its sensing period before initiating any data transmission. A contention-based channel selection algorithm uses a structure called the Preferable Channel List (PCL) to initiate data transmission; it avoids collisions and performs Request To Send / Clear To Send (RTS/CTS) contention for data transmission. Moreover, a sensing-assisted access (SAA) protocol may be used as a fully random access mechanism for CRNs to initiate data transmission; in this mechanism, contention-based access is designed by integrating the backoff process with spectrum sensing [62], [63]. However, this aspect is not investigated in this contribution. During the sensing period, each SU tries to sense the presence of any PU in the available sub-channels. However, the SU cannot detect the presence of a jammer at the beginning of a time slot. Nonetheless, the SU is able to realize the presence of the jammer at the end of each time slot: it then knows whether its transmission was successful or was jammed by a malicious jammer. Interested readers may refer to [64], [65] for more details concerning jamming detection. A successful transmission yields a positive payoff to the SU, while a jammed transmission yields a negative payoff. The utility of the n-th SU on the l-th sub-channel, based on the action of the SU (denoted a_n) and the action of the jammer (denoted c^m_j) at time slot t, is given by:

U^n_{l,t}(a_n, c^m_j) = C^n_{l,t}(a_n, c^m_j) (x_{l,t}(a_n, c^m_j)(T^n_l - E^n_l) - (1 - x_{l,t}(a_n, c^m_j))(J^n_l + E^n_l)),  (5)

where E^n_l is the transmission cost of the n-th SU, T^n_l is the SU gain factor for a successful communication, and J^n_l is the loss factor for the SU when its transmission on the l-th sub-channel is jammed.
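As a minimal sketch, the per-slot SU utility in (5) reduces to a one-line function; the gain, loss and cost values used in the checks below are hypothetical:

```python
def su_utility(capacity, jammed, T, J, E):
    """Per-slot SU utility from Eq. (5).

    capacity : C^n_{l,t}, capacity of the chosen sub-channel
    jammed   : True if the jammer picked the same sub-channel (x_{l,t} = 0)
    T, J, E  : gain factor, jamming loss factor, transmission cost
    """
    x = 0 if jammed else 1
    return capacity * (x * (T - E) - (1 - x) * (J + E))

# Hypothetical factors: the sign of the payoff flips with the jamming outcome.
assert su_utility(2.0, jammed=False, T=5, J=3, E=1) == 8.0    # 2 * (5 - 1)
assert su_utility(2.0, jammed=True, T=5, J=3, E=1) == -8.0    # -2 * (3 + 1)
```

Note how the capacity factor scales both the reward and the penalty, so a good sub-channel is both more valuable to use and more costly to lose.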
The impact of the sub-channel SNR and the decision of each player on the utilities of both players is shown in Fig. 5. The received SNR increases from sub-channel 1 to sub-channel 10. The utilities earned by the two players are opposite to each other. The missing utility at sub-channel 4 in Fig. 5 indicates that the PU is transmitting on this sub-channel, so neither the SU nor the jammer can use it. Furthermore, sub-channel 9 was jammed in the previous time slot; if the jammer stays there, the SU would earn a negative utility for using sub-channel 9.
Similarly for the jammer: if the SU was on sub-channel 9 and stays there, the jammer would earn a positive utility on sub-channel 9. Additionally, the PU could move to a different sub-channel in each time slot, but we assume that both the SU and the jammer are able to detect the sub-channel used by the PU.
The utility of our proposed system in (5) can be compared with the utility function of the benchmark system [34], given by G(s, a) = Σ_{l=1}^{L} G_l(s, a), where the gain of the SU at the l-th sub-channel is computed as:

G_l(s, a) = x_l(s, a) U - y_l(s, a) C,  (6)

where x_l(s, a) and y_l(s, a) are binary switching functions, and U and C denote the utility earned by the SU and the jamming cost of the SU, respectively. The authors in [34] assume that the values of the utility and the cost are identical in every sub-channel, since their sub-channels have the same quality. The left-hand sides of (5) and (6) both denote the utility function of the SU, albeit with different notation. The right-hand sides of (5) and (6) differ as follows:
• We introduce the factor C^n_{l,t}(a_n, c^m_j) in (5) in order to differentiate the sub-channels based on their capacities. Hence, a successful transmission on a good-quality sub-channel yields better utility for the SU. This quality factor is missing in [34] and in (6).
• We also consider the factors E^n_l and E^m_{jl} to account for the transmission costs of the SU and the jammer, respectively, in terms of battery utilization and transmitted power.
• The two binary switching functions x_l(s, a) and y_l(s, a) in (6) are used in [34] such that x_l(s, a) + y_l(s, a) = 1. To simplify the notation, we use only one binary switching function x_{l,t}(a_n, c^m_j), such that x_{l,t}(a_n, c^m_j) + (1 - x_{l,t}(a_n, c^m_j)) = 1.
• Furthermore, we have a more detailed utility function for the jammer compared to that of [34], as will be explained later in (11) and (13).
Combining (5) and (3) gives (7):

U^n_{l,t}(a_n, c^m_j) = (W/L) log_2(1 + P^n_{l,t} / (N_o · W/L)) (x_{l,t}(a_n, c^m_j)(T^n_l - E^n_l) - (1 - x_{l,t}(a_n, c^m_j))(J^n_l + E^n_l)), ∀n ∈ N, ∀m ∈ M,  (7)

where x_{l,t}(a_n, c^m_j) ∈ {0, 1} is a binary switching function used to indicate successful/jammed SU communication:

x_{l,t}(a_n, c^m_j) = { 1, a_n ≠ c^m_j;  0, a_n = c^m_j },  ∀n ∈ N, ∀m ∈ M.  (8)
Note that x_{l,t}(a_n, c^m_j) is 1 for a successful SU transmission and 0 for a jammed SU transmission. Specifically, C^n_{l,t}(a_n, c^m_j) = 0 if the SNR is below a certain threshold value SNR_th, i.e. SNR^n_{l,t}(a_n, c^m_j) ≤ SNR_th, in which case the switching function x_{l,t}(a_n, c^m_j) also becomes 0. Equations (2), (4) and (8) are related in the sense that:

x_{l,t}(a_n, c^m_j) = { 1, SNR^n_{l,t}(a_n, c^m_j) > SNR_th;  0, SNR^n_{l,t}(a_n, c^m_j) ≤ SNR_th },

and

SNR^n_{l,t}(a_n, c^m_j) = { P^n_{l,t} / (N_o · W/L), a_n ≠ c^m_j;  P^n_{l,t} / ((N_o + J^m_l) · W/L), a_n = c^m_j }.

In other words, the SU utility function in (7) incorporates the practical channel conditions in terms of both the channel capacity and the jamming conditions. The goal of the SU is to maximize the expected sum of discounted payoffs by choosing a good-quality channel that is not jammed by the jammer.

C. JAMMER MODEL
Jamming is a hostile attack in the CRN that disrupts wireless communication by generating high-power noise on the targeted sub-channel, as shown in Fig. 6. We consider two types of jammers: a random jammer and an intelligent jammer. A random jammer jams a randomly chosen sub-channel in each time slot. Inspired by [34] and [8], when an intelligent jammer with cognitive capabilities is assumed, it adopts the best strategy in response to its observation of the channel dynamics and the SU strategies. The jammer senses the RF environment for a given sensing duration and then transmits its jamming signals based on the channel conditions and the strategy of the SU. If a PU is detected in a sub-channel, the jammer switches to another available sub-channel, to avoid the heavy penalties imposed by law-enforcement agencies, and starts sensing again [8]. The utility function of the jammer on the l-th sub-channel is based on the actions of the SU and of the jammer, and is represented by:

U^m_{jl,t}(a_n, c^m_j) = C^m_{l,t}(a_n, c^m_j) ((1 - x_{l,t}(a_n, c^m_j))(T^m_{jl} - E^m_{jl}) - x_{l,t}(a_n, c^m_j)(β + E^m_{jl})),  (11)

where C^m_{l,t}(a_n, c^m_j) is the channel capacity of the l-th sub-channel, T^m_{jl} is the jammer gain factor when an SU is successfully jammed, and E^m_{jl} is the cost of transmitting the jamming signals. Furthermore, β is the jammer regret factor when the jamming is unsuccessful, i.e. the negative reward earned when the jammer sends a jamming signal on a sub-channel that was not used by the SU. As mentioned in (8), x_{l,t}(a_n, c^m_j) is a switching function with x_{l,t}(a_n, c^m_j) = 0 when the jammer successfully jams a channel (zero regret), and x_{l,t}(a_n, c^m_j) = 1 when the jammer fails to jam the SU. Hence, an intelligent jammer is more inclined to jam an SU operating on a high-capacity sub-channel than one on a low-capacity sub-channel.
The objective of the jammer is to maximize the probability of successful jamming.

IV. GAME THEORETIC ANTI-JAMMING MECHANISM A. PRELIMINARIES
A Stochastic Game (SG) is the natural extension and generalization of the Markov Decision Process (MDP) to multi-agent systems [7], [66]. SGs provide a framework for multi-agent reinforcement learning (MARL). In this contribution, a stochastic anti-jamming game is developed between two players with conflicting interests.
Definition 1 [67]: A two-player stochastic game is defined as G = ⟨X, S, A_i, U_i⟩, where X = {1, 2} is the index set of the players, S is the discrete state space of the game, A_i is the discrete action space of player i, and U_i : S × A_i is the utility/payoff of player i, ∀i ∈ X.
Definition 2 [68]: A pair of strategies (i*, j*) {i for the row player, j for the column player} yields a non-cooperative Nash equilibrium solution to a bimatrix game (A = {A_ij}, B = {B_ij}), where A and B are the payoff matrices of the two players, if the following two inequalities are satisfied:

A_{i*j*} ≥ A_{ij*},  ∀i ∈ {1, . . . , P},
B_{i*j*} ≥ B_{i*j},  ∀j ∈ {1, . . . , P},

where P is the total number of pure strategies. Each stage of a stochastic game can be viewed as a bimatrix game. The basic assumption in a stochastic game between two interacting players is that the actions of each player affect the utility of the other player; the same assumption holds in our case. The SU obtains its utility based on the action of the jammer in the previous time slot. If the sub-channel to be accessed by the SU receives an SNR lower than a threshold, this implies that the sub-channel was successfully jammed, and the SU obtains a lower utility on that sub-channel at time slot t.
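Definition 2 can be checked mechanically by scanning all pure-strategy pairs for the two inequalities. The payoff matrices below are illustrative: a matching-pennies-style zero-sum stage game has no pure-strategy equilibrium, which is one reason learned (mixed) strategies matter in this setting:

```python
def pure_nash_equilibria(A, B):
    """All pure-strategy NE (i*, j*) of a bimatrix game per Definition 2:
    A[i*][j*] >= A[i][j*] for all i, and B[i*][j*] >= B[i*][j] for all j."""
    rows, cols = len(A), len(A[0])
    eqs = []
    for i in range(rows):
        for j in range(cols):
            row_best = all(A[i][j] >= A[k][j] for k in range(rows))
            col_best = all(B[i][j] >= B[i][k] for k in range(cols))
            if row_best and col_best:
                eqs.append((i, j))
    return eqs

# Matching-pennies-style zero-sum stage game: no pure-strategy NE exists.
A = [[1, -1], [-1, 1]]
B = [[-1, 1], [1, -1]]
assert pure_nash_equilibria(A, B) == []
```

The same scan applied to a game with a dominant-strategy outcome (e.g. a prisoner's-dilemma payoff) returns exactly that cell, confirming the inequalities act as mutual best-response conditions.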
There are two types of learners in the MARL setting, namely the independent learner (IL) and the joint action learner (JAL) [69]. An IL uses Q-learning in the classical setting, ignoring the other agents; more specifically, it assumes that the other agents are part of the environment. A MARL algorithm is an IL algorithm if the learner takes actions individually without considering the actions taken by other agents. The IL approach is an appropriate method of learning when the agent is unaware of the other agents in the system and of their actions [70], [71].
A JAL is an agent that learns the environment in the presence of other agents and then updates its Q values based on the joint actions of all the agents in the system. In fact, even though JALs have much more information at their disposal, they do not perform much differently from ILs in the straightforward application of Q-learning to multi-agent systems [69]. We model both the SU and the jammer as ILs, where each applies the Q-learning algorithm in the classical setting while ignoring the actions of the other agent.
Theorem 1 [72]: An IL agent in a MARL setting, which follows the Q-learning update rule, will converge to the optimal Q-function with unit probability.

B. GAME FORMULATION
Based on its knowledge about the channel, the system and the attacker, the objective of the SU is to carefully choose a sub-channel to maximize its spectrum utilization while avoiding the jamming. On the other hand, the jammer aims to prevent the SU from effective channel utilization through a strategic jamming approach. The objectives of the two players, namely the jammer and the SU, are opposite to each other. Therefore, the dynamic interaction between them is well formulated as a non-cooperative game, where the gain of one player is the loss of the other. Furthermore, the spectrum availability, the quality of the channel and the strategies of both the SU and the jammer can be time-varying. The players of the game are assumed to be intelligent and to exhibit rational behaviour, maximizing their own payoffs according to their individual objectives.
We have formulated a two-player SG between the SU and the jammer as described below:
Players: There are two non-cooperative players in the game, namely the SU and the jammer.
States: Every sub-channel occupation is considered a state S of the game. For example, if we have L sub-channels, then we have L states. The number of states available to the SU and the jammer is L − H, where L is the total number of sub-channels and H is the number of PUs in the network.
Actions: In each state there are L − H hopping possibilities. For each of the L − H available states, the possible action set is A(s) = {a_1, a_2, a_3, . . . , a_i, . . . , a_{L−H}}, where a_i is the action of hopping to the i-th of the L − H available sub-channels. Both players choose actions to hop to any of the available sub-channels not occupied by the PU. Since the available frequency slots are the same for both players, the action set A(s) is common to both. Every action results in a change of state. Both the jammer and the SU sense the channel during the sensing period; hence the channel states and channel quality are assumed to be common knowledge in the game.
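The state/action bookkeeping above amounts to excluding PU-occupied sub-channels from the common action set. A minimal sketch (the indices and PU placement are hypothetical):

```python
def available_actions(L, pu_channels):
    """Common action set A(s): hop targets not occupied by a PU.
    With H occupied sub-channels, both players share L - H actions."""
    return [l for l in range(L) if l not in pu_channels]

# One PU on sub-channel index 3 leaves L - H = 9 common hop targets.
acts = available_actions(L=10, pu_channels={3})
assert len(acts) == 9 and 3 not in acts
```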
Payoff: The immediate payoffs of the two players in the bimatrix game at each stage are given by (7) and (11), respectively. The total utility of the whole secondary network is given by:

U_{T,t}(a, c_j) = Σ_{n=1}^{N} U^n_{l,t}(a_n, c^m_j),  (13)

where U^n_{l,t}(a_n, c^m_j) is given by (7). The long-term objective of the SU is to maximize the expected sum of discounted payoffs, which can be written as [8], [66]:

max E[ Σ_{t=0}^{∞} γ^t U_{T,t}(a, c_j) ],  (14)

where γ^t is a time-decaying discount factor, 0 < γ < 1, that determines the significance of future payoffs, and U_{T,t}(a, c_j) is the utility of the secondary network at time t, as given by (13). The frequency hopping strategy of the SU is to maximize its utility by taking the optimal action:

a* = arg max_a Σ_{t=0}^{∞} γ^t U_{T,t}(a, c_j).  (15)

Similarly, the frequency hopping strategy of an intelligent jammer is to maximize its expected utility Σ_{t=0}^{∞} γ^t U_{jl,T,t}(a, c_j) by taking the optimal action:

c_j* = arg max_{c_j} Σ_{t=0}^{∞} γ^t U_{jl,T,t}(a, c_j).  (16)
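The discounted objective in (14) is a plain geometrically weighted sum; for a finite horizon of observed per-slot utilities it can be sketched as:

```python
def discounted_return(utilities, gamma=0.9):
    """Finite-horizon sum of discounted payoffs, sum_t gamma^t * U_t, as in Eq. (14)."""
    return sum((gamma ** t) * u for t, u in enumerate(utilities))

# With gamma = 0.5, three unit payoffs are worth 1 + 0.5 + 0.25.
assert discounted_return([1.0, 1.0, 1.0], gamma=0.5) == 1.75
```

The discount factor is what makes the players weigh near-term jamming outcomes more heavily than distant ones.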
The pair (a*, c_j*) is said to be an equilibrium pair if (15) and (16) satisfy the following inequalities:

Σ_{t=0}^{∞} γ^t U_{T,t}(a, c_j*) ≤ Σ_{t=0}^{∞} γ^t U_{T,t}(a*, c_j*),  ∀a ∈ A(s),
Σ_{t=0}^{∞} γ^t U_{jl,T,t}(a*, c_j) ≤ Σ_{t=0}^{∞} γ^t U_{jl,T,t}(a*, c_j*),  ∀c_j ∈ A(s).

V. SOLVING OPTIMAL STRATEGIES OF THE GAME
In the previous section, we described the anti-jamming game formulation. In this section, we describe the defending strategy of the SU, i.e. how to prevent the SU from being jammed. Q-learning is a value-based reinforcement learning algorithm that uses a Q table to maximize the utility [73]; it is a mechanism to effectively learn the sub-channel selection strategy. The Q function Q(s^k, a^k) at stage k is the expected discounted payoff when the SU takes action a^k in state s^k. More specifically, the Q value is an estimate of the expected sum of discounted payoffs [8]. Hence, the SU can regard the Q value of the bimatrix game at stage k as the expected sum of discounted payoffs given in (14). Given the Q function Q(s^k, a^k), the SU can find the value of the game from:

V(s^k) = max_{a^k ∈ A(s)} Q(s^k, a^k).  (18)

After an action a^k is taken, the SU receives an immediate payoff R(s^k, a^k), which is then used to update the Q table. Specifically, the Q function can be approximated as:

Q(s^k, a^k) = R(s^k, a^k) + γ Σ_{s^{k+1}} Pr(s^{k+1} | s^k, a^k) V(s^{k+1}),  (19)

where Pr(s^{k+1} | s^k, a^k) is the transition probability from state s^k to s^{k+1}. The value of Q(s^k, a^k) in (19) can be updated recursively without having to estimate the transition probabilities [74], as follows:

Q(s^k, a^k) ← (1 − α^k) Q(s^k, a^k) + α^k [R(s^k, a^k) + γ V(s^{k+1})],  (20)

where α^k is the learning rate satisfying 0 < α^k < 1. For α^k to be time decaying, it must satisfy the conditions Σ_{k=0}^{∞} α^k = ∞ and Σ_{k=0}^{∞} (α^k)^2 < ∞. Note that γ is the discount factor satisfying 0 < γ < 1. Similarly, the intelligent jammer can also adopt the Q-learning algorithm to learn the dynamics of the SU and the sub-channels [75]. After updating its Q table, an intelligent jammer takes a decision based on its own updated Q table. The update rule for the jammer's Q value is given by:

Q_j(s^k, c^k_j) ← (1 − α^k) Q_j(s^k, c^k_j) + α^k [R_j(s^k, c^k_j) + γ V_j(s^{k+1})],  (21)

where Q(s^k, a^k) and Q_j(s^k, c^k_j) are the estimates of the expected sums of discounted payoffs of the SU and the jammer, respectively, which evolve over time.
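The recursive update (20) together with the state value (18) can be sketched in a few lines; the table size, learning rate, discount factor and reward below are illustrative only:

```python
def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning step per Eq. (20):
    Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (R + gamma * max_a' Q(s',a'))."""
    v_next = max(Q[s_next])          # value of the next state, Eq. (18)
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (reward + gamma * v_next)

Q = [[0.0, 0.0], [0.0, 0.0]]         # 2 states x 2 actions, initialised to zero
q_update(Q, s=0, a=1, reward=1.0, s_next=1)
assert Q[0][1] == 0.1                # 0.9*0 + 0.1*(1 + 0.9*0)
```

The jammer's update (21) is identical in form, operating on its own table Q_j and reward R_j; no transition probabilities are ever estimated, which is what makes the method model-free.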
The rewards of the SU and the jammer after choosing their respective actions at state $s^k$ are given by $R(s^k, a^k)$ and $R_j(s^k, c_j^k)$, respectively. These immediate rewards are calculated using (7) and (11) for the SU and the jammer, respectively. The SU stays on the current sub-channel if the reward it receives there is good enough to contribute positively to the Q value update. A negative instant reward on a certain sub-channel indicates that the sub-channel has been jammed and the SU should avoid it by hopping to another available sub-channel in the next round. The SU's optimal frequency hopping strategy to avoid the jammer is the action that maximizes its Q value in state $s$ and is given by
$$a^* = \arg\max_{a \in \mathcal{A}(s)} Q(s, a). \quad (22)$$
The jammer's optimal frequency hopping strategy to jam the SU is a greedy policy that chooses the action with the maximum Q value in state $s$ and is given by
$$c_j^* = \arg\max_{c_j \in \mathcal{A}(s)} Q_j(s, c_j). \quad (23)$$
Algorithm 1 summarizes the game-theoretic frequency hopping algorithm for the SU.

Algorithm 1 Game Theoretic Frequency Hopping Algorithm
1: Initialize $k = 0$, $K = 100$, $\alpha^k \in [0,1]$ and $\gamma \in [0,1]$, $\forall s^k \in S$, $a^k \in \mathcal{A}(s)$ and $c_j^k \in \mathcal{A}(s)$.
2: Let $Q(s^k, a^k) = 0$.
3: while $k < K$ do
4:   Execute action $a^k$, get the immediate reward $R(s, a^k)$ according to (7) and observe $s^{k+1}$.
5:   Choose $a^{k+1}$ from $s^{k+1}$ using the policy derived from $Q(s^k, a^k)$ as given in (22).
6:   Update $Q(s^k, a^k)$ for the SU according to (20).
7:   $s^k \leftarrow s^{k+1}$ and $a^k \leftarrow a^{k+1}$.
8: end while
9: if $a = c_j$ then
10:   The channel is jammed.
11: else
12:   The SU transmission is successful.
13: end if

For the jammer, the procedure is summarized in Algorithm 2.
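As an illustration of Algorithm 1, the following sketch is our own simplified rendering: the jammer is random, and a toy reward of +1 for an unjammed slot and -1 for a jammed one stands in for (7). It runs the greedy policy of (22) together with the update of (20):

```python
import random

L = 10              # number of sub-channels
K = 100             # number of iterations, as in Algorithm 1
alpha, gamma = 0.5, 0.9
Q = [[0.0] * L for _ in range(L)]  # Q[s][a]; state = last chosen sub-channel

s = random.randrange(L)
for k in range(K):
    # Line 5: greedy action from (22), with random tie-breaking.
    best = max(Q[s])
    a = random.choice([i for i, q in enumerate(Q[s]) if q == best])
    c_j = random.randrange(L)       # random jammer's choice
    r = -1.0 if a == c_j else 1.0   # toy stand-in for the reward in (7)
    s_next = a
    # Line 6: Q-learning update (20).
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * max(Q[s_next]))
    s = s_next                      # line 7: advance the state
```

Against a random jammer every sub-channel is equally risky, so in this toy version the learned table mainly reflects the +1/-1 reward statistics; the channel-quality term of (7) is what makes the real algorithm prefer high-capacity sub-channels.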

A. COMPLEXITY ANALYSIS
Inspired by [34], we examine the computational complexity of Algorithm 1 and Algorithm 2 in this subsection. Inside the while loop, line 5 of both algorithms represents the policy derived from Q-learning, and line 6 represents the Q-learning update equation. The computational complexity of each iteration of the policy phase comes from solving the maximization over the action space in (22) and (23), i.e., finding the action with the maximum Q value in the current state.

Algorithm 2 Game Theoretic Algorithm for Jammer
1: Initialize $k = 0$, $K = 100$, $\alpha^k \in [0,1]$ and $\gamma \in [0,1]$, $\forall s^k \in S$, $a^k \in \mathcal{A}(s)$ and $c_j^k \in \mathcal{A}(s)$.
2: Let $Q_j(s^k, c_j^k) = 0$.
3: while $k < K$ do
4:   Execute action $c_j^k$, get the immediate reward $R_j(s, c_j^k)$ according to (11) and observe $s^{k+1}$.
5:   Choose $c_j^{k+1}$ from $s^{k+1}$ using the policy derived from $Q_j(s^k, c_j^k)$ as given in (23).
6:   Update $Q_j(s^k, c_j^k)$ for the jammer according to (21).
7:   $s^k \leftarrow s^{k+1}$ and $c_j^k \leftarrow c_j^{k+1}$.
8: end while
9: if $c_j = a$ then
10:   The jammer is successful.
11: else
12:   The SU transmission is successful.
13: end if
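Algorithms 1 and 2 can be run against each other as independent learners. The sketch below is again our own toy rendering, with zero-sum +1/-1 rewards standing in for (7) and (11); it couples the two Q tables in one loop and reports the empirical jamming rate:

```python
import random

L, K = 10, 200
alpha, gamma = 0.5, 0.9
Q_su = [[0.0] * L for _ in range(L)]   # SU table (Algorithm 1)
Q_j = [[0.0] * L for _ in range(L)]    # jammer table (Algorithm 2)

def greedy(q_row):
    """Greedy action with random tie-breaking, as in policies (22)/(23)."""
    best = max(q_row)
    return random.choice([i for i, q in enumerate(q_row) if q == best])

s = random.randrange(L)
jammed = 0
for k in range(K):
    a = greedy(Q_su[s])                # SU picks a sub-channel
    c = greedy(Q_j[s])                 # jammer picks a sub-channel
    hit = (a == c)                     # lines 9-13 of both algorithms
    jammed += hit
    r_su, r_j = (-1.0, 1.0) if hit else (1.0, -1.0)  # toy zero-sum rewards
    s_next = a
    Q_su[s][a] = (1 - alpha) * Q_su[s][a] + alpha * (r_su + gamma * max(Q_su[s_next]))
    Q_j[s][c] = (1 - alpha) * Q_j[s][c] + alpha * (r_j + gamma * max(Q_j[s_next]))
    s = s_next
print(f"jamming rate: {jammed / K:.2f}")
```

Because the SU only needs to land on any unjammed sub-channel while the jammer must guess the SU's exact choice, the empirical jamming rate in such a run is typically well below $1/2$, in line with the intuition given in Section VI-G.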

VI. RESULTS AND DISCUSSION
In our simulation study, we have considered N = 10 sub-channels, one SU, one PU and up to four jammers. When a sub-channel is occupied by the PU, neither the SU nor the jammers can access that sub-channel. The SU chooses a high-capacity sub-channel that is potentially jamming-free, while the jammers try to predict and choose the sub-channel used by the SU.
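A minimal sketch of this channel-access constraint (illustrative only; the PU occupancy pattern and the variable names are our own assumptions):

```python
import random

N = 10                               # sub-channels, as in the simulation setup
pu_channel = random.randrange(N)     # sub-channel currently occupied by the PU
available = [l for l in range(N) if l != pu_channel]

# Neither the SU nor the jammers may use the PU's sub-channel.
su_choice = random.choice(available)
jammer_choices = [random.choice(available) for _ in range(4)]  # up to four jammers
assert su_choice != pu_channel and pu_channel not in jammer_choices
```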

A. THE EFFECT OF USING DIFFERENT CHANNEL TYPES
The capacity of a sub-channel is a measure of the highest information rate that can be achieved with an arbitrarily small error rate. The channel capacity $C^n_{l,t}(a^n, c^m_j)$ is given by (1) and (10), while the bandwidth efficiency in bits per second per Hertz (bps/Hz) can be computed as
$$\eta = \frac{C^n_{l,t}(a^n, c^m_j)}{B}. \quad (24)$$
In each epoch the simulations are run $A = 2000$ times to obtain the average bandwidth efficiency for each SU:
$$\bar{\eta} = \frac{1}{A} \sum_{a=1}^{A} \eta_a. \quad (25)$$
We consider two cases: case I refers to the situation where all $L$ sub-channels have different SNRs, and hence different channel capacities, as given in Table 1. By contrast, all $L$ sub-channels have the same SNR in case II. More specifically, case II corresponds to the idealistic scenario considered in [34]. The mean SNR in dB is calculated by
$$\overline{\mathrm{SNR}}_{dB} = \frac{1}{L} \sum_{l=1}^{L} \mathrm{SNR}_{dB,l}, \quad (26)$$
where $\mathrm{SNR}_{dB,l} = 10 \log_{10}(\mathrm{SNR}_{l,t})$ is the SNR of the $l$th sub-channel. The SNR of each sub-channel in case II is set to the average SNR of case I ($\overline{\mathrm{SNR}}_{dB} = 16.2$ dB). For a fair comparison, the mean SNRs of both cases are equal, namely $\overline{\mathrm{SNR}}_{dB} = 16.2$ dB, as shown in Table 1.
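The mean-SNR computation and the per-channel bandwidth efficiency can be reproduced as follows (the SNR values here are illustrative placeholders, not the entries of Table 1):

```python
import math

# Hypothetical per-sub-channel SNRs (linear scale) for a case-I-like setup.
snr_linear = [10, 20, 30, 40, 50, 45, 35, 60, 55, 70]

# Per-channel SNR in dB, and the mean SNR in dB used to set up case II.
snr_db = [10 * math.log10(s) for s in snr_linear]
mean_snr_db = sum(snr_db) / len(snr_db)

# Bandwidth efficiency eta = C/B = log2(1 + SNR) in bps/Hz per sub-channel.
eta = [math.log2(1 + s) for s in snr_linear]
print(f"mean SNR: {mean_snr_db:.1f} dB, best eta: {max(eta):.2f} bps/Hz")
```

Note that the mean is taken over the dB values, matching the definition of $\mathrm{SNR}_{dB,l}$ above; averaging the linear SNRs first and then converting to dB would give a different number.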

B. COMPARISON WITH THE BENCHMARK SYSTEM
Here, we compare the proposed scheme and the benchmark scheme in case I, where the sub-channels have varying qualities. The benchmark scheme, however, assumes that all sub-channels have a fixed quality (6). As seen in Fig. 7, our proposed system outperforms the benchmark scheme both in terms of jamming probability and bandwidth efficiency. Note that when the channel qualities of all sub-channels are fixed, the proposed scheme performs similarly to the benchmark scheme. In other words, our proposed scheme is a generalization of the benchmark scheme to the case where the sub-channel qualities vary. Fig. 8 shows the probability of successful jamming by the jammer and the bandwidth efficiency of the SU, for both case I and case II, when one or two random jammers are considered. With increasing epochs, the intelligent SU learns the environment better, so the probability of successful jamming is expected to decrease while the bandwidth efficiency increases. The probability of successful jamming in the two-jammer scenario is slightly higher than in the single-jammer scenario, but both probabilities converge to zero after 60 epochs, as shown in Fig. 8a. Furthermore, in the more challenging case I, where the channel quality varies across the sub-channels, our algorithm still works well, despite requiring a longer training period (more epochs) to reach the convergence point, as seen in Fig. 8b.
Our proposed algorithm allows the SU to intelligently choose sub-channels with higher channel capacities when the channels vary, as in case I. Hence, the average bandwidth efficiency of the system is improved. It can also be seen from Fig. 8b that the average bandwidth efficiency in the single-jammer scenario of case I (solid line) is higher than that of the single-jammer scenario of case II (dotted line) after 40 epochs. A similar pattern can be seen for the two-jammer scenario in Fig. 8b after 45 epochs. In other words, our algorithm works better for case I after a sufficient training period. Hence, the proposed scheme operating in variable-quality channels (case I) outperforms the benchmark scheme of [34] operating in fixed-quality channels (case II). Furthermore, the average bandwidth efficiency of the SU in the two-jammer scenario is almost equal to that of the single-jammer scenario for both case I and case II. This indicates that our algorithm performs equally well when it has to work against two random jammers.

C. THE EFFECT OF HAVING DIFFERENT TYPES OF ATTACKS
In this scenario, we investigate the effect of having an intelligent jammer in the system. Recall that the intelligent jammer also learns from its Q values, as given in (21), based on the parameters given in Table 1. Fig. 9a shows the jamming probabilities of the random and intelligent jammers for both case I and case II. In particular, the jamming probability converges to zero after 30 epochs when a random jammer is invoked. When an intelligent jammer is present, the successful jamming probability in case I (dash-dot line) is around 10% at the 30th epoch. Hence, the successful jamming probability is greater for an intelligent jammer than for a random jammer, as expected. For the random jammer, the two curves (case I and case II) are almost the same: our proposed scheme (case I, solid blue) behaves similarly to the benchmark scheme (case II, dotted yellow) because of the non-intelligent behavior of the random jammer. From a random jammer's perspective, the sub-channel qualities do not matter at all, so the successful jamming probabilities against a random jammer in case I and case II are almost identical. In contrast, the two curves differ for the intelligent jammer, which is the focus of this paper, since the intelligent jammer seeks out sub-channels with good sub-channel attributes.
Our proposed scheme works better against the intelligent jammer in the variable sub-channel case (case I, dash-dot red) than the benchmark scheme does (case II, dashed indigo). Our proposed scheme (case I, dash-dot red) reduces the successful jamming probability to zero after 70 epochs, while the benchmark scheme (case II, dashed indigo) is not capable of doing so.
As seen in Fig. 9b, the corresponding average SU bandwidth efficiency in the presence of an intelligent jammer (solid line) is almost equal to that obtained against a random jammer (dashed line) after 40 epochs. Hence, our intelligent SU is capable of avoiding the intelligent jammer after a certain training period. Furthermore, the SU bandwidth efficiency in case I is higher than that in case II. Hence, in case I, the SU is also capable of intelligently choosing sub-channels with higher capacity to increase the average bandwidth efficiency of the system, while successfully avoiding the intelligent jammer.

D. THE EFFECT OF USING DIFFERENT DEFENSE STRATEGIES
Here, we discuss all four possible intelligent/random SU versus intelligent/random jammer scenarios, based on case I.

VOLUME 9, 2021
FIGURE 10. Probability of successful jamming and bandwidth efficiency of the system with an intelligent/random SU against an intelligent/random jammer in case I.
As seen in Fig. 10, when the SU chooses a random sub-channel strategy in the presence of a random jammer, both the successful jamming probability and the average bandwidth efficiency of the SU remain almost constant (dotted lines). The performance of the SU improves remarkably when it behaves intelligently against the random jammer (dashed lines).
It is also visible that an SU using a random strategy against an intelligent jammer suffers severe jamming, and its bandwidth efficiency degrades drastically (dash-dot lines). Hence, the SU must be intelligent to combat an intelligent jammer, in order to reduce the successful jamming probability and to increase the average SU bandwidth efficiency (solid lines).

E. THE EFFECT OF MULTIPLE INTELLIGENT SUs AGAINST AN INTELLIGENT JAMMER
In Fig. 11 we compare the performance as the number of intelligent SUs in the secondary network increases. Fig. 11a shows that the jamming probability is higher for two intelligent SUs in the presence of a single intelligent jammer than when only one intelligent SU is transmitting. The large gain in bandwidth efficiency is shown in Fig. 11b: the bandwidth efficiency of the secondary network almost doubles when two intelligent SUs operate against an intelligent jammer.

F. TIME SLOTTED VIEW
In Fig. 12, the x-axis shows the time slot index and the height of each bar shows the utility earned from the decision of each player after a training period of 100 epochs. As described in the proposed model, each sub-channel has a different quality based on the received SNR. Without loss of generality, it is assumed that $\mathrm{SNR}_1 \le \mathrm{SNR}_2 \le \cdots \le \mathrm{SNR}_{N-1} \le \mathrm{SNR}_N$, i.e., the SNR increases from sub-channel 1 to sub-channel 10. The impact of the changes in SNR and of each player's decision on the utilities is shown in Fig. 5, where the utilities earned by the two players are opposite to each other.

G. INTELLIGENT SU AND INTELLIGENT JAMMER
When the players are trained adequately, the corresponding Q tables are updated properly, which results in good decisions for both players. The intelligent SU has more choices in terms of choosing an optimal sub-channel. More explicitly, the SU can choose any of the available sub-channels, while there is only one sub-channel that the jammer can choose 'correctly' for a successful jamming. Hence, it is less probable for the intelligent SU to get jammed based on the updated Q table values. Both the intelligent SU and the intelligent jammer opt for high-quality channels, as depicted by the height of each bar in Fig. 12a. As seen in Fig. 12a, the SU chooses sub-channel 10 while the jammer chooses sub-channel 4 at time slot 1; hence the SU earns a positive utility, while the jammer earns a negative utility. Also shown in Fig. 12a, the jammer only manages to jam the SU at time slot 10, over the 15 time slots considered. Hence, the intelligent SU manages to avoid the jammer while choosing high-capacity sub-channels. Fig. 12b shows the decision patterns for the case when the intelligent SU adopts Q-learning for its strategy update, while the jammer uses a random strategy. As seen in Fig. 12b, the intelligent SU manages to avoid the random jammer in all of the 15 time slots considered, while also choosing high-capacity sub-channels. Fig. 12c depicts the decision patterns for the case when the SU uses a random strategy against an intelligent jammer that invokes Q-learning for its strategy update. As depicted in Fig. 12c, the jamming rate is high: 11 successful jamming events out of the 15 time slots considered. Fig. 13 shows the impact of increasing the number of sub-channels and the number of jammers. As the number of sub-channels increases, the intelligent SU has a higher chance of avoiding the jammer.
It is shown that the bandwidth efficiency of the SU increases as the number of sub-channels increases, owing to the larger sub-channel choice space. We find that increasing the number of random jammers does not noticeably affect the bandwidth efficiency of the intelligent SU. Frequency hopping is thus a very good strategy for the SU to avoid the jammers, especially when the number of sub-channels is high.

VII. CONCLUSION
In this paper, we have investigated an anti-jamming stochastic game in conjunction with a multi-agent reinforcement learning algorithm. Both random and intelligent jammers were considered. The anti-jamming game was designed as a Markov game based on the Q-learning algorithm. We devised a game-theoretic optimal frequency hopping scheme for a dynamic environment in the presence of adversarial jammers, using a Q-learning approach to pick high-capacity sub-channels while avoiding the jammer. We developed a game model with both players as independent learners, where the SU and the jammer selfishly and independently select their optimal sub-channels in a multi-agent reinforcement learning setting to increase their individual utilities. The proposed game-theoretic model, in conjunction with the learning-based frequency hopping algorithm, helps the SU avoid the attacker, hence reducing the probability of jamming and increasing the bandwidth efficiency of the SU. Our simulation results showed that the proposed method outperforms the benchmark system in terms of both bandwidth utilization and jamming probability. Furthermore, when the channel exhibits variable channel quality (as in case I), our intelligent SU is capable of intelligently choosing sub-channels with higher capacity while avoiding the intelligent jammer. Moreover, the bandwidth efficiency of the SU does not decrease significantly when the number of random jammers increases.