An Intelligent Anti-Jamming Scheme for Cognitive Radio Based on Deep Reinforcement Learning

A cognitive radio network is an intelligent wireless communication system that can adjust its transmission parameters according to the environment thanks to its learning ability. It is a feasible and promising way to address the spectrum scarcity issue and has become a research focus in the communication community. However, a cognitive radio network is vulnerable to jamming attacks, which seriously degrade spectrum utilization. In this article, we view the anti-jamming task of cognitive radio as a Markov decision process and propose an intelligent anti-jamming scheme based on deep reinforcement learning. We aim to learn a policy that lets users maximize their rate of successful transmission. Specifically, we design a Double Deep Q Network (Double DQN) to model the confrontation between the cognitive radio network and the jammer. The Q network is implemented using a Transformer encoder to effectively estimate action-values from raw spectrum data. The simulation results indicate that our approach can effectively defend against several kinds of jamming attacks.


I. INTRODUCTION
Cognitive radio (CR) is a new form of wireless communication whose transceiver can intelligently detect available communication channels [1]. By optimizing the usage of the available radio-frequency (RF) spectrum, cognitive radio can relieve the conflict between the limited RF spectrum resource and the growing demand for spectrum. In a cognitive radio network, users are able to sense the available portion of the spectrum and use idle channels for communication. Hence, cognitive radio has attracted extensive attention and become a research focus.
However, a cognitive radio network is vulnerable to security attacks owing to its openness and broadcast nature [2], among which the channel jamming attack is the most common and can severely degrade network performance. The jammer can interfere with communication channels by injecting continuous jamming signals or short jamming pulses to deteriorate the signal-to-noise ratio (SNR) in these channels. As a result, the throughput of the ongoing transmission declines, or the transmission is even interrupted. Traditional radios usually use spread spectrum techniques, such as frequency hopping or direct-sequence spread spectrum [3], to mitigate jamming attacks. However, a smart jammer can track and interfere with the hopping frequencies, and these anti-jamming schemes cannot be directly used by cognitive radios. Recently, game theory has been extensively studied to address the anti-jamming task and has achieved impressive results [4]-[6]. However, these approaches need prior knowledge such as the jamming pattern, which is impractical in real usage scenarios.
The associate editor coordinating the review of this manuscript and approving it for publication was Eyuphan Bulut.
Therefore, in this article we aim to mitigate channel jamming attacks in cognitive radio networks and develop an anti-jamming scheme based on deep reinforcement learning techniques. As shown in Fig. 1, the cognitive radio network is composed of a transmitter-receiver pair, and the channels of their communication link are attacked by a jammer. In this anti-jamming communication game, the transmitter-receiver pair tries to set up the communication link in appropriate channels to avoid or mitigate the jamming attack and maximize its throughput, whereas the jammer aims to estimate the channels of the ongoing communication between transmitter and receiver in order to interrupt it. Thus, the anti-jamming communication decisions of the transmitter-receiver pair in such a dynamic game form a typical Markov decision process (MDP), and the optimal communication policy can be learned by reinforcement learning (RL) techniques, even if the transmitter-receiver pair has no prior knowledge of the jamming pattern and communication channel model. RL-based anti-jamming policy learning has been studied recently. For instance, [7] also modeled the anti-jamming game as an MDP and proposed a policy iteration method to solve this task. With the rapid development of deep learning [8], deep reinforcement learning (DRL), which senses the environment and learns policies better by combining deep neural networks with reinforcement learning, has been proposed to tackle video games [9] and robot control [10]. Inspired by the success of DRL, Liu et al. [11] proposed a deep anti-jamming Q-network that estimates the Q-values of communication actions by directly feeding the spectrum waterfalls [12] into a convolutional neural network (CNN) [13]. Compared with these existing works, our anti-jamming scheme adopts deep reinforcement learning, specifically Double DQN [14], in which a Transformer Encoder [15] is used as the Q-network to effectively model the action-value function.
The Transformer Encoder takes the raw spectrum data as input and outputs the action-value of each communication action. This Transformer Encoder style Q network is more flexible and powerful than a CNN style Q network, since it can estimate action-values more accurately from arbitrary spectrum vectors. To evaluate the effectiveness of the proposed anti-jamming scheme, we perform simulations in which our approach copes with three typical kinds of channel jamming attacks: sweep jamming, random jamming, and sensing-based jamming. Our main contributions can be summarized as follows.
• An intelligent anti-jamming communication scheme is proposed based on deep reinforcement learning. This is a model-free approach, which means that the jamming patterns and channel models are not needed as prior knowledge.
• A Transformer Encoder style Q-network is designed to map the state space to the action space. Specifically, the raw spectrum data is defined as the state to describe the features of the jammer and channels without any information loss. Our simulation results demonstrate that the algorithm with this Transformer Encoder style Q network defends against jamming attacks more effectively than that with a traditional CNN style Q-network.
The remainder of this article is organized as follows. First, we briefly review recently published works closely related to our study in Section II. Then we build the system model in Section III. Our anti-jamming scheme is introduced in Section IV, where we give the detailed design of the proposed intelligent anti-jamming scheme based on Double DQN and illustrate the structure of the Q-network implemented with a Transformer Encoder. The simulation settings and results are given in Section V. We conclude our work in Section VI.

II. RELATED WORK
Wireless communication is vulnerable to security attacks owing to its openness and broadcast nature. The signal-to-interference-plus-noise ratio (SINR) at the receiver can be decreased by a jammer that injects noise or recorded signals into the channels to disrupt ongoing communications. Thus, anti-jamming ability is essential for radios. Traditional anti-jamming approaches such as frequency hopping spread spectrum (FHSS) and direct sequence spread spectrum (DSSS) [3] have fixed transmission patterns and are vulnerable to smart jammers powered by machine learning techniques [16], [17].
Jamming and anti-jamming between jammers and cognitive radios can be considered as a game process. With the rapid development of cognitive radio, which is equipped with sensing, learning, and decision abilities, game theory has been studied to mitigate jamming attacks in wireless communication. Wu et al. [18] proposed a power allocation strategy based on the Colonel Blotto anti-jamming game to withstand jamming attacks. Similarly, other game models such as the Stackelberg game were applied to achieve anti-jamming defense in wireless networks in [19], [20]. For frequency channel selection, the stochastic game has been investigated to find the optimal control and data channels that achieve maximum throughput under jamming attacks [21]. In spite of the successful application of game theory to the anti-jamming task, these approaches need prior knowledge such as the jamming pattern, which is impractical in real usage scenarios.
Recently, reinforcement learning techniques have been applied to help the communication agent learn an optimal policy via continuous interaction with the environment and jammers, without prior knowledge of the jammers. A novel channel access strategy based on Q learning has been proposed in [22] to cope with channel jamming. The authors of [23] designed an interference-aware routing protocol and proposed a cooperation framework based on reinforcement learning to defend the network against jamming attacks. Since traditional Q learning is inefficient and hard to converge when the state space or action space is large, deep neural networks have been combined with reinforcement learning to obtain deep reinforcement learning, which can take the spectrum waterfall as input and output channel selection actions [12].
Bi et al. [24] designed a multi-user anti-jamming strategy based on deep Q learning to achieve global optimization for a multi-user system. A sequential deep reinforcement learning algorithm is studied in [25] to confront multiple jammers. A fast DQN-based anti-jamming mobile communication scheme was proposed in [26] to cope with jamming attacks. DQN is also used to obtain the optimal strategy against unmanned aerial vehicle jamming attacks [27]. Benefiting from the powerful learning ability of deep neural networks, the above anti-jamming methods achieve better performance than traditional approaches. Our work is inspired by these previous works, and our proposed anti-jamming scheme is also based on deep reinforcement learning. Unlike existing works, we use a Transformer Encoder style neural network instead of a CNN.

III. SYSTEM MODEL
As shown in Fig. 1, without loss of generality, the cognitive radio network considered in this article consists of one transmitter, one receiver, and one jammer. We divide continuous time into discrete time slots and assume that the transmitter and jammer share the same time slots, which simplifies the analysis. At each time slot $t$, the transmitter selects one channel $f_{U,t}$ from a predefined frequency set $f = \{f_1, f_2, \ldots, f_{N_C}\}$ of the communication band to transmit a data packet to the receiver with power $P_{U,t}$, while the jammer also selects an arbitrary channel $f_{J,t}$ of the same band, trying to disrupt this transmission with power $P_{J,t}$. Following [11], [28], we assume that the bandwidth of the jamming signal ($b_J$) is equal to the bandwidth of the communication signal. Based on the above setting, the SINR (Signal to Interference plus Noise Ratio) at the receiver can be calculated using the following formula:

$$\mathrm{SINR}_t = \frac{h_U P_{U,t}}{n + h_J P_{J,t}\, I(f_{U,t} = f_{J,t})} \tag{1}$$

where $I(x)$ is an indicator function whose value is 1 if $x$ is true and 0 otherwise, $n$ is the noise power, $h_U$ is the channel gain from transmitter to receiver, and $h_J$ is the channel gain from jammer to receiver. Following existing work [11], we assume that the transmitter and jammer make their communication and jamming decisions at the beginning of each time slot. As shown in Fig. 2, the blue block and yellow block are selected by the transmitter and jammer, respectively. If the transmitter transmits data in a channel that is also selected by the jammer (see the red block), the SINR of the received signal deteriorates seriously, and the transmission fails when the SINR falls below the demodulation threshold. Since the transmitter cannot know the jamming channel selected by the jammer in the current slot, it has to select the communication channel based on its previous interactions with the environment. Following [11], we also use the raw spectrum data as the description of the environment state.
To be specific, the transmitter continuously senses the channel frequencies and stores the sensed results. The sensed spectrum vector of time slot $t$ is denoted as $P_t = [p_{t,1}, p_{t,2}, \ldots, p_{t,N_S}]$, where $N_S$ is the number of sample points over the whole bandwidth and $[\cdot, \cdot]$ denotes the concatenation operation. To sufficiently sense and analyze the channel information, we record the historical spectrum data of the most recent $M$ time slots, denoted as:

$$s_t = [P_{t-1}; P_{t-2}; \ldots; P_{t-M}] \in \mathbb{R}^{N_S \times M} \tag{2}$$

This two-dimensional historical spectrum data contains rich spectrum information up to time $t$ and has been shown to describe the channel status better than traditional estimated channel parameters [11]. In the dynamic anti-jamming game, we use the above $s_t$ as the environment state. The transmitter makes the action decision $a_t$ based on $s_t$ and receives an immediate reward $r_t$. In our setting, the action is a combined selection of channel and power level, e.g., $a_t = \{f_{U,t}, P_{U,t}\}$ represents the action at time slot $t$. Since the transmitter-receiver pair is expected to achieve successful transmission with a low cost of channel switching and energy, the reward is designed as follows:

$$r_t = r_{\mathrm{SINR}}(a_t) - c(a_t) - C_p P_{U,t} \tag{3}$$

The reward is composed of three terms, all of which are unitless scalars. The first term $r_{\mathrm{SINR}}(a_t)$ is the reward for successful transmission. The transmission is considered successful if the SINR of the received signal (denoted as $\mathrm{SINR}_t$) is not below the demodulation threshold $\mathrm{SINR}_{threshold}$, in which case the transmitter gets a reward $r_m$; otherwise the transmission fails and the reward is zero. Hence, $r_{\mathrm{SINR}}(a_t)$ is formally defined as:

$$r_{\mathrm{SINR}}(a_t) = \begin{cases} r_m, & \mathrm{SINR}_t \ge \mathrm{SINR}_{threshold} \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

The SINR of the received signal, which is fed back to the transmitter over the control link, is the basis for calculating the reward. We therefore assume that the control link is jamming-resistant, an assumption that is widely used in the literature [12], [28], [29].
VOLUME 8, 2020
The second term $-c(a_t)$ is added because the transmitter-receiver pair is expected to communicate on a fixed channel with stable power, since changing channels incurs a communication cost. Hence we define the cost of switching: if the transmitter takes a different action at time slot $t$ ($a_t \neq a_{t-1}$), it is penalized. The formal definition is as follows:

$$c(a_t) = \begin{cases} C_c, & a_t \neq a_{t-1} \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

where $C_c$ is the cost of switching channels. The third term $-C_p P_{U,t}$ denotes the cost of the transmit power, where $C_p$ is the cost of unit transmit power. Obviously, a higher transmit power results in a higher probability of successful transmission. If there were no constraint on transmit power, the best policy for the transmitter would be to always select the highest transmit power. However, in our system, the transmitter-receiver pair is expected to achieve successful transmission with as little power consumption as possible. Thus, we add this term to the reward. The values of all the above hyper-parameters can be found in Table 1.
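To make the SINR model and the three-term reward of this section concrete, the following is a minimal Python sketch rather than the authors' implementation. Several choices are assumptions made only for illustration: unit channel gains, a noise power of 1 mW, $r_m = 1$ (Table 1 is not reproduced here), and the transmit power entering the power-cost term in dBm. The values $C_c = 0.2$, $C_p = 0.005$ and the 10 dB threshold are those reported in Section V.

```python
import math

def dbm_to_mw(p_dbm):
    """Convert a power level from dBm to milliwatts."""
    return 10.0 ** (p_dbm / 10.0)

def sinr_db(p_u_dbm, p_j_dbm, f_u, f_j, h_u=1.0, h_j=1.0, noise_mw=1.0):
    """SINR at the receiver in dB; jamming counts only when both use one channel.

    h_u, h_j and noise_mw are hypothetical defaults, not values from the paper.
    """
    signal = h_u * dbm_to_mw(p_u_dbm)
    jamming = h_j * dbm_to_mw(p_j_dbm) if f_u == f_j else 0.0
    return 10.0 * math.log10(signal / (noise_mw + jamming))

def reward(sinr, action, prev_action, r_m=1.0, c_c=0.2, c_p=0.005,
           threshold_db=10.0):
    """Success reward minus switching cost minus power cost.

    action is a (channel index, transmit power in dBm) pair; r_m is assumed.
    """
    _, p_u_dbm = action
    r_sinr = r_m if sinr >= threshold_db else 0.0
    switch_cost = c_c if action != prev_action else 0.0
    return r_sinr - switch_cost - c_p * p_u_dbm

clear = sinr_db(30.0, 40.0, f_u=1, f_j=3)   # jammer on another channel
jammed = sinr_db(30.0, 40.0, f_u=1, f_j=1)  # jammer on the same channel
print(round(clear, 2), round(jammed, 2))
print(reward(clear, (1, 30.0), (1, 30.0)), reward(jammed, (1, 30.0), (2, 30.0)))
```

In this toy setting, staying on an unjammed channel earns the success reward minus only the power cost, while a jammed slot that also follows an action change pays both penalties and earns nothing.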

A. DOUBLE DQN
The anti-jamming communication decision of the transmitter-receiver pair in a dynamic environment is a typical Markov Decision Process (MDP) and has been studied using value-based or policy-based reinforcement learning algorithms. However, the traditional Q-learning algorithm is unable to handle the game described in Sec. III, since the state space formed by the raw historical spectrum data is infinite. We therefore propose a novel anti-jamming scheme based on Double DQN [14]. Fig. 3 illustrates the framework of our approach. At each time slot $t$, the transmitter interacts with the environment by selecting action $a_t$ according to the sensed state $s_t$. After executing the action, the transmitter receives an immediate reward $r_t$ and observes the next state $s_{t+1}$. According to Eq. 3, a higher reward means a higher probability of successful transmission. Therefore, the transmitter aims to select anti-jamming actions that maximize the cumulative reward $R_t = \sum_{i=0}^{\infty} \gamma^i r_{t+i+1}$, where $\gamma$ is the discount factor. The Double DQN algorithm used in this work achieves this goal by finding the optimal action-value function $Q^*(s, a)$:

$$Q^*(s, a) = \max_{\pi} E\left[R_t \mid s_t = s, a_t = a, \pi\right] \tag{6}$$

where $E[\cdot]$ denotes expectation and $\pi$ is a policy mapping sequences to actions. Given the environment state, the Q function outputs the action-values of all actions. Actions with higher action-values yield more cumulative reward and are selected with higher probability. Thus, at each time slot, the transmitter can select the action with the highest action-value to effectively mitigate jamming attacks. Accurately modeling the Q function is the key to our approach. We use the interactions to train a deep Q network, detailed in Section IV-B, as a function approximator that estimates the action-values, denoted as $Q(s, a; \theta) \approx Q^*(s, a)$. In general terms, the probability of taking an anti-jamming action should be a function of the environment.
According to the universal approximation theorem for neural networks, a Q network built from a deep neural network can approximate this function to arbitrary accuracy when enough previous interactions are provided as training samples.
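Selecting the action with the highest estimated action-value is combined in Algorithm 1 with an exploring rate that decays exponentially from its maximum to its minimum. The sketch below illustrates that selection rule in plain Python. The numeric defaults (`eps_min`, `eps_max`, `d`) and the list of action-values are hypothetical, since the source does not list them here.

```python
import math
import random

def exploring_rate(t, eps_min=0.05, eps_max=1.0, d=200.0):
    """Exploring rate eps_min + (eps_max - eps_min) * exp(-t / d)."""
    return eps_min + (eps_max - eps_min) * math.exp(-t / d)

def epsilon_greedy(q_values, eps, rng):
    """Return a random action index with probability eps, else the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]  # made-up action-values for three actions
for t in (0, 200, 2000):
    eps = exploring_rate(t)
    print(t, round(eps, 3), epsilon_greedy(q_values, eps, rng))
```

Early in training almost every action is random, and as `t` grows the rule approaches the purely greedy choice described above.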
To stabilize the optimization of the approximation, two deep neural networks are used in Double DQN: the current Q network, $Q(s, a; \theta)$, for action selection, and the target Q network, $Q(s, a; \theta^-)$, for evaluating the target action-value. The idea of using two Q networks is to decouple selection from evaluation, which alleviates the overestimation problem [30]. It is worth noting that these two Q networks have the same structure. The weights of the target Q network are periodically copied from the current Q network. The current Q network is trained by minimizing the following loss function, which is the mean squared error between the current action-value and the target action-value:

$$L(\theta) = \frac{1}{B} \sum_{i=1}^{B} \left( y_i - Q(s_i, a_i; \theta) \right)^2 \tag{7}$$

where $B$ denotes the batch size and $y_i$ is the target action-value estimated by the target Q network using the greedy strategy:

$$y_i = r_i + \gamma\, Q\left( s'_i, \arg\max_{a'} Q(s'_i, a'; \theta); \theta^- \right) \tag{8}$$

where $\gamma$ is the discount factor and $s'_i$ is the next state of the $i$-th sample. The gradient of the loss function with respect to the learnable weights can be calculated as follows:

$$\nabla_{\theta} L(\theta) = -\frac{2}{B} \sum_{i=1}^{B} \left( y_i - Q(s_i, a_i; \theta) \right) \nabla_{\theta} Q(s_i, a_i; \theta) \tag{9}$$

Finally, a gradient descent algorithm can be adopted to update the weights $\theta$ of the current Q network. The details of the intelligent anti-jamming scheme based on deep reinforcement learning are given in Algorithm 1.
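The decoupling of action selection (current network) from action evaluation (target network) can be illustrated without any neural network, using small hand-written tables of next-state action-values. The following sketch is illustrative only: the tables, the discount factor of 0.5, and the function names are assumptions, and in the actual scheme these numbers would come from the two Q networks.

```python
def double_dqn_targets(rewards, q_next_current, q_next_target, gamma):
    """Targets where the current net picks the action and the target net scores it."""
    targets = []
    for r, q_cur, q_tgt in zip(rewards, q_next_current, q_next_target):
        best = max(range(len(q_cur)), key=q_cur.__getitem__)  # selection
        targets.append(r + gamma * q_tgt[best])               # evaluation
    return targets

def mse_loss(targets, q_taken):
    """Mean squared error between targets and current action-values."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_taken)) / len(targets)

rewards = [1.0, 0.0]
q_next_current = [[0.2, 0.8], [0.5, 0.1]]  # Q(s', . ; theta), made-up values
q_next_target = [[0.6, 0.4], [0.3, 0.9]]   # Q(s', . ; theta^-), made-up values
targets = double_dqn_targets(rewards, q_next_current, q_next_target, gamma=0.5)
print(targets, mse_loss(targets, q_taken=[1.0, 0.15]))
```

In this example a plain maximum over the target network's values would give larger targets for both samples, which is the overestimation that the decoupled form is meant to reduce.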

B. Q-NETWORK BASED ON TRANSFORMER ENCODER
As discussed in Sec. III, we use the historical spectrum data of recent time slots as the state, denoted as $s_t \in \mathbb{R}^{N_S \times M}$. The Q network estimates action-values based on this state and thus plays the key role in the Double DQN algorithm.
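One simple way to maintain such a state is a fixed-length sliding window over the most recent spectrum vectors. The sketch below is a possible bookkeeping structure, not the authors' code. The class name `SpectrumHistory`, the zero-filled initial window, and the row-per-vector layout (most recent first) are assumptions made for illustration.

```python
from collections import deque

class SpectrumHistory:
    """Sliding window holding the m most recent spectrum vectors, newest first."""

    def __init__(self, m, n_s):
        self.n_s = n_s
        # Assumption: the window starts zero-filled until real data arrives.
        self.buffer = deque([[0.0] * n_s for _ in range(m)], maxlen=m)

    def push(self, spectrum):
        """Add the newest sensed vector; the oldest one drops out automatically."""
        if len(spectrum) != self.n_s:
            raise ValueError("spectrum vector must have n_s sample points")
        self.buffer.appendleft(list(spectrum))

    def state(self):
        """Return the current state as a list of m vectors, newest first."""
        return [list(row) for row in self.buffer]

history = SpectrumHistory(m=3, n_s=2)
for level in (1.0, 2.0, 3.0, 4.0):
    history.push([level, level])
print(history.state())
```

With the settings of Section V the window would be created with `m=400` and `n_s=200`.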
In this article, we design a modified Transformer Encoder to implement the Q network. Existing works [12], [28], [29] use a Convolutional Neural Network (CNN) as the Q network, which mines the correlation between historical spectrum vectors only implicitly and struggles to capture the relations among all the spectrum vectors because of the limited receptive field of the CNN structure. Our Transformer Encoder style Q network is not subject to this restriction, since its multi-head attention mechanism can exploit the correlation between all the historical spectrum vectors from multiple perspectives. Fig. 4 shows the detailed structure of the proposed Q network, which is composed of three modules: a Transformer Encoder and two classifiers composed of fully connected layers. The Transformer, composed of encoders and decoders, was originally proposed for machine translation [15] and has since become a popular module in many computer vision tasks such as Visual Question Answering [31]. In this article, we adopt the Transformer Encoder to extract features from the historical spectrum vectors. As shown in Fig. 4, the Transformer Encoder contains a multi-head self-attention sub-layer and a feed-forward sub-layer. At time slot $t$, the transmitter senses the environment and stores the historical spectrum vectors of the most recent $M$ time slots, denoted as $s_t = [P_{t-1}; \ldots; P_{t-M}] \in \mathbb{R}^{N_S \times M}$.
Algorithm 1 Intelligent Anti-Jamming Scheme Based on Double DQN
1: Input: number of iterations $T$, action set $A$, discount factor $\gamma$, exploring rate range $[\epsilon_{min}, \epsilon_{max}]$, attenuation cycle of exploring rate $d$, weights of current Q network $\theta$ and target Q network $\theta^-$, batch size $B$, target update cycle $N_T$
2: randomize $\theta$ and copy it to the target Q network, $\theta^- = \theta$; empty the replay memory $D$
3: for $t = 1$ to $T$ do
4:   sense the spectrum and record the historical spectrum data $s_t$ as the environment state
5:   calculate the exploring rate $\epsilon = \epsilon_{min} + (\epsilon_{max} - \epsilon_{min}) e^{-t/d}$
6:   generate a random number $\varepsilon \in [0, 1]$ and select an action using the $\epsilon$-greedy algorithm: $a_t$ is a random action from $A$ if $\varepsilon < \epsilon$, otherwise $a_t = \arg\max_{a} Q(s_t, a; \theta)$
7:   select channel $f_{U,t}$ and transmit power $P_{U,t}$ according to $a_t$ and launch the communication
8:   sense the new spectrum $s_{t+1}$ and calculate the reward $r_t$ according to Eq. 3
9:   store the trajectory $(s_t, a_t, r_t, s_{t+1})$ into replay memory $D$
10:  if $|D| \ge B$ then
11:    randomly select a batch of samples $(s_i, a_i, r_i, s'_i)$, $i = 1, \ldots, B$ from $D$
12:    calculate the loss for the batch according to Eq. 7
13:    update the current Q network's weights $\theta$ using the gradient in Eq. 9
14:  end if
15:  if $t \bmod N_T = 0$ then
16:    update the target Q network's weights: $\theta^- = \theta$
17:  end if
18: end for

Then the state is updated by incorporating the correlations between these spectrum vectors through multi-head self-attention, whose projection matrices are learnable and where $H$ is the number of attention heads. $S = s_t + PE$ is the element-wise sum of the spectrum data and their position embedding $PE = [pe_{t-1}; \ldots; pe_{t-M}]$, where the position embedding $pe_{t-i}$ of the $(t-i)$-th spectrum vector follows [15]. The position embedding enables our model to utilize the order of the sequence. The attention matrix $a \in \mathbb{R}^{M \times M}$ in Eq. 12 depicts the dependence between each pair of spectrum vectors and is calculated by applying a row-wise softmax to the dot-product of the query $Q = W_Q^h S$ and the key $K = W_K^h S$, where $W_Q^h$ and $W_K^h$ are also learnable matrices.
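As a rough illustration of how one attention head relates every spectrum vector in the history to every other, the following plain-Python sketch computes a row-wise softmax over query-key dot-products and uses the resulting weights to mix the value vectors. It is a simplified single head under stated assumptions: the scaling by the square root of the projection size follows the original Transformer formulation and is not confirmed by the text here, position embedding is omitted, and the identity projection matrices and tiny inputs are placeholders.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def project(vectors, weight):
    """Apply a (d_out x d_in) weight matrix to each d_in-dimensional vector."""
    return [[sum(w * x for w, x in zip(w_row, vec)) for w_row in weight]
            for vec in vectors]

def attention_head(vectors, w_q, w_k, w_v):
    """Return (attention weights, updated vectors) for a single head."""
    queries = project(vectors, w_q)
    keys = project(vectors, w_k)
    values = project(vectors, w_v)
    scale = math.sqrt(len(w_q))  # assumed scaling, as in the original Transformer
    weights = [softmax([sum(q * k for q, k in zip(qi, kj)) / scale
                        for kj in keys]) for qi in queries]
    width = len(values[0])
    updated = [[sum(a * v[c] for a, v in zip(row, values)) for c in range(width)]
               for row in weights]
    return weights, updated

identity = [[1.0, 0.0], [0.0, 1.0]]               # placeholder projections
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]    # three toy spectrum vectors
weights, updated = attention_head(history, identity, identity, identity)
for row in weights:
    print([round(a, 3) for a in row])
```

Each row of `weights` sums to one, and the first vector attends equally to itself and to the identical third vector, regardless of how far apart they are in the sequence.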
These updated spectrum vectors are further passed through a Feed-Forward Network (FFN) composed of two fully connected layers, in which $W_1$ and $W_2$ are learnable projection matrices and $b_1$ and $b_2$ are bias terms. Residual connections [32] and Layer Normalization (LN) [33] are also applied to facilitate optimization. After the Transformer Encoder, the updated spectrum feature $S_{ec}$ contains rich information, and we can estimate the action-values from $S_{ec}$ using simple feed-forward networks. In our anti-jamming setting, the action at each time slot $t$ has two dimensions, one selecting the channel $f_{U,t}$ and the other determining the transmit power level $P_{U,t}$. Thus, the action-value can be written as $Q(s_t, a_t; \theta) = p(f_{U,t}) \times p(P_{U,t})$, where the two factors are computed from $S_{ec}$ by $\mathrm{FFN}_1$ and $\mathrm{FFN}_2$, two feed-forward networks with different weights.
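The factorized action-value can be pictured as a table formed from the outputs of the two heads, one score per channel and one per power level. The sketch below builds that table and picks the best joint action. The head outputs are made-up numbers and the function names are hypothetical; how the heads normalize their outputs is not specified here.

```python
def joint_action_values(p_channel, p_power):
    """Table of products: rows index channels, columns index power levels."""
    return [[pf * pp for pp in p_power] for pf in p_channel]

def best_action(q_table):
    """Return the (channel index, power index) with the largest joint value."""
    cells = ((i, j) for i in range(len(q_table)) for j in range(len(q_table[0])))
    return max(cells, key=lambda ij: q_table[ij[0]][ij[1]])

p_channel = [0.1, 0.6, 0.3]  # made-up output of the channel head
p_power = [0.2, 0.7, 0.1]    # made-up output of the power head
q_table = joint_action_values(p_channel, p_power)
print(best_action(q_table))
```

When both heads produce non-negative scores, the best joint action is simply the best channel paired with the best power level, so the full table never has to be enumerated in practice.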

V. NUMERICAL SIMULATION
A. SIMULATION SETTINGS
To verify the effectiveness of the proposed anti-jamming scheme, we conduct extensive simulations in which the transmitter-receiver pair and the jammer compete with each other. Following existing works [11], [25], they compete in a frequency band of 20 MHz. Specifically, we uniformly select 5 ($N_C = 5$) center frequencies ($f = \{53, 57, 61, 65, 68\}$ MHz) as candidate frequency channels. The frequency resolution of spectrum sensing is 100 kHz; the transmitter senses the full band every 1 ms and stores the spectrum vectors of the most recent 400 ms. Thus, $M = 400$ and $N_S = 200$. One time slot is defined as 5 ms. At the beginning of each time slot, the transmitter selects one frequency channel to send a data packet and the jammer injects a jamming signal into one channel. At the end of each time slot, the receiver sends the SINR to the transmitter through the control link. The demodulation threshold is 10 dB ($\mathrm{SINR}_{threshold} = 10$ dB), the same as in [11]. The transmit power level is chosen from 25 dBm to 45 dBm in steps of 0.5 dB. We implement our algorithm using PyTorch on a machine with an Intel i5 CPU, 16 GB RAM, and an NVIDIA GeForce 1070 GPU. More detailed simulation parameters are given in Table 1.
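The dimensions quoted above follow from simple arithmetic on the sensing settings, which the short sketch below reproduces. The size of the joint action set is not stated in the text; it is derived here under the assumptions that both power endpoints are included and that every channel can be paired with every power level.

```python
BAND_KHZ = 20_000          # 20 MHz communication band
RESOLUTION_KHZ = 100       # frequency resolution of spectrum sensing
HISTORY_MS = 400           # length of the stored history
SENSING_PERIOD_MS = 1      # one full-band sweep per millisecond
N_CHANNELS = 5

n_s = BAND_KHZ // RESOLUTION_KHZ       # sample points per spectrum vector
m = HISTORY_MS // SENSING_PERIOD_MS    # spectrum vectors in the state

# Power levels from 25 dBm to 45 dBm in 0.5 dB steps, endpoints included.
power_levels_dbm = [half / 2 for half in range(50, 91)]
n_joint_actions = N_CHANNELS * len(power_levels_dbm)

print(n_s, m, len(power_levels_dbm), n_joint_actions)
```

Under these assumptions each state holds 400 vectors of 200 points, and the channel and power heads would need 5 and 41 outputs respectively.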
Without loss of generality, one malicious jammer is considered in our anti-jamming game. This jammer sends a jamming signal on the selected channel to disrupt the ongoing communication of the transmitter-receiver pair by heavily deteriorating the SINR at the receiver. In this article, we consider the following three popular kinds of jammers, similar to [34] and [11].
• Random jammer, which randomly jams a channel in each time slot. The jamming frequency changes randomly in steps of 0.5 MHz.
• Sweep jammer, which sweeps across the whole communication band at a speed of 0.8 GHz/s.
• Sensing-based jammer, which continuously observes the probability that the communication signal appears at each frequency point and jams the channel with the largest probability.
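The three jammer types can be mimicked at the level of channel indices for quick experiments. The sketch below is a deliberately coarse abstraction, not the simulation used in this article: it ignores the frequency step and sweep speed given above, moves the sweep jammer one channel per slot, and lets the sensing-based jammer count the channels it has observed the transmitter use. All class and method names are hypothetical.

```python
import random
from collections import Counter

class RandomJammer:
    """Jams a uniformly random channel in every slot."""
    def __init__(self, n_channels, seed=0):
        self.n_channels = n_channels
        self.rng = random.Random(seed)

    def choose(self, t, observed_channel=None):
        return self.rng.randrange(self.n_channels)

class SweepJammer:
    """Steps through the channels in order, one per slot (simplified sweep)."""
    def __init__(self, n_channels):
        self.n_channels = n_channels

    def choose(self, t, observed_channel=None):
        return t % self.n_channels

class SensingJammer:
    """Jams the channel on which the transmitter has been seen most often."""
    def __init__(self, n_channels):
        self.n_channels = n_channels
        self.counts = Counter()

    def choose(self, t, observed_channel=None):
        if observed_channel is not None:
            self.counts[observed_channel] += 1
        if not self.counts:
            return 0  # nothing observed yet: arbitrary default
        return self.counts.most_common(1)[0][0]

sweep = SweepJammer(5)
print([sweep.choose(t) for t in range(7)])
sensing = SensingJammer(5)
print([sensing.choose(t, ch) for t, ch in enumerate([2, 2, 4])])
```

A transmitter that keeps returning to one channel is quickly tracked by `SensingJammer`, whereas spreading its choices evenly across channels gives the counter nothing stable to lock onto.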

1) DOUBLE DQN VS OTHER DEEP REINFORCEMENT LEARNING
As illustrated in Sec. IV, the core of our intelligent anti-jamming scheme is Double DQN. To demonstrate the effectiveness of our scheme, we compare our method with the following popular approaches:
• DQN, another popular value-based reinforcement learning algorithm, which is similar to Double DQN except that a single Q network both selects the action and evaluates the target action-value. In DQN, the target action-value $y_i$ is estimated as $y_i = r_i + \gamma \max_{a'} Q(s'_i, a'; \theta^-)$. For a fair comparison, we also use the Transformer Encoder shown in Fig. 4 to implement this Q network.
• Greedy method, which we include to demonstrate the benefit of applying reinforcement learning. Here the transmitter updates the score of each action according to the average reward it has received so far and selects the action with the highest average reward at each time slot.
Fig. 5 shows the experimental results, namely the SINR performance and power consumption of the different methods under different jamming attacks. From these results, we can draw the following conclusions:

a: OUR METHOD CONVERGES TO HIGHER SINR WITH LOWER POWER CONSUMPTION THAN OTHER APPROACHES
For example, the comparison in Fig. 5(a) demonstrates that our intelligent anti-jamming scheme converges to a high SINR (13.5 dB) after about 350 time slots under the sweep jamming attack, while the DQN-based method needs 500 time slots to converge (SINR = 12.6 dB). Fig. 5(b) shows that our method achieves lower power consumption than the other approaches. Consistent results are obtained under the other jamming attacks.

b: DEEP REINFORCEMENT LEARNING BASED ANTI-JAMMING METHODS ARE SUPERIOR TO GREEDY METHOD
By comparing the results of the deep reinforcement learning based methods and the greedy method, we can see that both the Double DQN based and the DQN based anti-jamming schemes achieve higher SINR and lower power consumption than the greedy method under all three kinds of jamming attacks. In addition, the performance of the greedy method fluctuates more noticeably.
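For reference, the greedy baseline in this comparison scores each action by the average reward it has received so far and always picks the highest-scoring one. A minimal sketch of such an agent is given below. Trying every action once before comparing averages is an assumption, since the handling of untried actions is not described, and the class name is hypothetical.

```python
class GreedyAgent:
    """Picks the action with the highest average reward observed so far."""

    def __init__(self, n_actions):
        self.reward_sums = [0.0] * n_actions
        self.counts = [0] * n_actions

    def select(self):
        # Assumption: try each action once before comparing averages.
        for action, count in enumerate(self.counts):
            if count == 0:
                return action
        averages = [s / c for s, c in zip(self.reward_sums, self.counts)]
        return max(range(len(averages)), key=averages.__getitem__)

    def update(self, action, reward):
        """Record the reward received for an action."""
        self.reward_sums[action] += reward
        self.counts[action] += 1

agent = GreedyAgent(3)
for action, r in [(0, 0.2), (1, 0.9), (2, 0.5), (1, 0.7)]:
    agent.update(action, r)
print(agent.select())
```

Because the scores ignore the spectrum state entirely, such an agent can only react after a jammer has already lowered an action's average, which is one plausible reading of the larger fluctuations noted above.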

c: OUR METHOD COPES WITH SENSING-BASED JAMMING ATTACK SUCCESSFULLY
As shown in Fig. 5(e) and (f), our intelligent anti-jamming scheme can successfully dodge the sensing-based jamming attack. This can be explained as follows. Our method selects each channel with almost the same probability, so the sensing-based jammer cannot recognize the communication pattern of our transmitter-receiver pair. To support this explanation, we give the statistical probabilities of selecting each channel in Fig. 6. We can see that the transmitter using our anti-jamming scheme selects each channel with about 20% probability after convergence, which is much more stable than with the DQN based method.
In our simulation test, the floating-point operations per second (FLOPS) of our algorithm is $0.45 \times 10^9$, and the transmitter takes on average 0.8 ms to update the network parameters and make the communication decision on our machine.

2) TRANSFORMER ENCODER STYLE VS CNN STYLE
As discussed in Sec. IV-B, we use a Transformer Encoder to implement the Q network in our algorithm. This Transformer Encoder style Q network plays a key role in the good performance. To support this point, we conduct an experiment in this section in which, following [11], we design a Convolutional Neural Network (CNN) to replace the Transformer Encoder.
As shown in Fig. 7, the CNN is composed of three convolutional layers and two classifiers. We also use the spectrum vectors $s_t$ as input. The first convolution layer has 16 kernels, and the kernel size and stride are set to 7 and 2, respectively. Thus, the output of the first convolution layer is a tensor composed of 16 feature maps of size $97 \times 97$. These feature maps are then passed through the second and third convolution layers, whose hyper-parameters are given in Fig. 7. Finally, the classifiers estimate the action-values, including $p(f_{U,t})$ and $p(P_{U,t})$, based on the output of the last convolution layer.
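The feature-map size quoted for the first convolution layer can be checked with the usual output-size formula for a strided convolution. The helper below assumes no padding and no dilation, neither of which is stated in the text.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Output length of a convolution along one axis (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

# Kernel size 7 and stride 2, as given for the first convolution layer.
print(conv_output_size(200, kernel=7, stride=2))
print(conv_output_size(400, kernel=7, stride=2))
```

Under these assumptions an axis of length 200 maps to 97, which matches the stated $97 \times 97$ maps when both input axes have 200 entries, while an axis of length 400 would map to 197.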
From the experimental results shown in Fig. 8, we can see that the Transformer Encoder style Q network works better than the CNN style Q network. To be specific, the Transformer Encoder style Q network converges to a higher SINR (12.9 dB) than the CNN style Q network (12.3 dB), since it can exploit the relations between all spectrum vectors through the self-attention operation.

3) IMPACTS OF HYPER-PARAMETERS
In this section, we analyze several important hyper-parameters of our approach, including $M$ (the number of spectrum vectors), $C_c$ (the cost of switching channels), and $C_p$ (the cost of unit transmit power).
We first analyze the impact of $M$, which denotes the number of spectrum vectors. As shown in Fig. 9, under all three kinds of jamming attacks, the performance of our anti-jamming scheme based on the Transformer Encoder style Q network increases steadily with the number of spectrum vectors until $M = 400$. The reason is that more historical spectrum vectors provide the Q network with more information for making correct decisions. Hence, the hyper-parameter $M$ is set to 400 in all the other experiments.
$C_c$, the cost of switching channels, is another important hyper-parameter in our approach. Thus, we evaluate its impact on the SINR performance here. We set $C_c$ to 0, 0.1, ... (Fig. 10). We can see that the average SINR remains stable when $C_c$ increases from 0 to 0.2 and then declines rapidly beyond 0.2. Hence, we finally set $C_c = 0.2$ to achieve a tradeoff between performance and the cost of switching channels.
As discussed in Section III, we add a term $-C_p P_{U,t}$ to our reward definition, guiding our algorithm to cope with jamming attacks with low power consumption. Hence, we evaluate the impact of $C_p$, which denotes the cost of unit transmit power. We set $C_p$ to 0.001, 0.003, 0.005, 0.007, and 0.009 and record the average performance ratio, calculated as $\mathrm{SINR} / P_U$. As shown in Fig. 11, a small $C_p$ encourages the transmitter to select a high transmit power to achieve successful transmission, which results in a low performance ratio, while a large $C_p$ forces the transmitter to care excessively about power consumption, which also leads to a low performance ratio. $C_p = 0.005$ gives the best tradeoff between performance and power cost.

VI. CONCLUSION
In this article, we propose an intelligent anti-jamming scheme based on deep reinforcement learning. Specifically, we adopt Double DQN to model the confrontation between the cognitive radio network and the jammer. Different from existing works, which use a CNN style Q network, we design a Transformer Encoder style Q network to effectively estimate action-values from raw spectrum data. To evaluate the proposed anti-jamming scheme, we conduct extensive experiments. Simulation results indicate that our approach can effectively defend against several kinds of jamming attacks, including sweep jamming, random jamming, and sensing-based jamming.