Intelligent Anti-Jamming Communication for Wireless Sensor Networks: A Multi-Agent Reinforcement Learning Approach

In this article, we investigate an intelligent anti-jamming communication method for wireless sensor networks. The stochastic game framework is introduced to model and analyze the multi-user anti-jamming problem, and a joint multi-agent anti-jamming algorithm (JMAA) is proposed to obtain the optimal anti-jamming strategy. In an intelligent multi-channel blocking jamming environment, the proposed JMAA adopts multi-agent reinforcement learning to make online channel selections, which can effectively tackle external malicious jamming and avoid internal mutual interference among sensor nodes. The simulation results show that the proposed JMAA is superior to the frequency-hopping method, the sensing-based method and independent reinforcement learning. Specifically, the proposed JMAA achieves a higher average packet receive ratio than both the frequency-hopping method and the sensing-based method, and it converges faster than independent reinforcement learning when reaching the same average packet receive ratio. In addition, since the JMAA does not need to model the jamming patterns, it can be widely used for combating other malicious jamming such as sweep jamming and probabilistic jamming.


I. INTRODUCTION
As a novel network that realizes comprehensive information interaction between humans and the physical world, the Internet of Things (IoT) is based on information perception, transmission and processing. The wireless sensor network (WSN) is an important underlying network technology for realizing the wide application of the IoT. It is a short-distance wireless communication network composed of a large number of low-cost, low-power, multifunctional sensor nodes [1], [2]. In recent years, the WSN has expanded into many application fields thanks to its potential advantages [3], [4], [5]. When the WSN is applied to pivotal scenarios such as traffic monitoring [6], health monitoring [7] and military target tracking [8], its information transmission must be guaranteed with strict reliability. Unlike supervised learning, reinforcement learning (RL) does not require labeled data sets for training, and its learning process is characterized by autonomous exploration of optimal strategies, which means that online learning can be realized by RL. In the communication anti-jamming problem, the jamming environment may change rapidly. Malicious jamming may be dynamic, of unknown type, or even intelligent, which makes it difficult to prepare a training data set in advance. Facing an unknown jamming environment, RL can learn the jamming pattern in real time and gradually improve the transmission strategy, which is of great benefit for realizing reliable communication in complex dynamic jamming environments. By introducing RL into anti-jamming problems, users can continuously adjust the transmission strategy by trying different actions in the jamming environment, and finally obtain the optimal strategy. RL methods such as the classical Q-learning algorithm have been widely used in solving anti-jamming problems [26], [27], [28]. Nevertheless, the existing RL-based anti-jamming schemes also have some limitations and shortcomings.
For example, in [29], [30], the optimal frequency-hopping strategy under a dynamic jamming environment was obtained by using the standard Q-learning or an improved Q-learning algorithm. However, only the single-user scenario was considered, and hence these methods are not applicable to a WSN with a large number of sensor nodes. In [31], the anti-jamming problem was extended to the multi-user scenario: each user adopted an independent Q-learning algorithm to obtain the optimal channel switching strategy. The authors in [32] then considered coordination among users, and a collaborative multi-agent anti-jamming algorithm based on RL was proposed to obtain the optimal anti-jamming strategy. However, only conventional sweep jamming was considered in [31], [32]. The authors of [33] further studied the anti-jamming communication problem under an intelligent comb jamming environment, introducing deep reinforcement learning (DRL), which combines deep learning (DL) and RL, to obtain the optimal anti-jamming strategy. However, only the single-user scenario was considered, and the rigorous demand of DRL on computing resources limits its application in WSNs.
To solve the problems mentioned above, this article investigates the anti-jamming problem of multiple sensor nodes based on multi-agent reinforcement learning (MARL) against malicious jamming with a certain degree of intelligence, so as to provide a preliminary solution and technical support for realizing "using intelligence to counter intelligence" in WSNs. Note that the limited transmission distance of sensor nodes makes a compact WSN easily covered by high-power jammers. Thus, in order to focus on the problem of a MARL-based approach against intelligent dynamic jamming, we consider the external jamming faced by each node as equivalent, without considering differences in jamming power or jamming channel among nodes. Specifically, the stochastic game framework is introduced to model and analyze the multi-user anti-jamming problem. Then, to effectively counter the external malicious jamming and avoid the co-channel interference among sensor nodes, cooperative learning is considered, and a joint multi-agent anti-jamming algorithm (JMAA) based on multi-agent Q-learning is proposed. The main contributions of this article are as follows:
• In order to avoid external multi-channel intelligent blocking jamming and mutual interference among sensor nodes in a WSN, a joint multi-agent anti-jamming algorithm (JMAA) based on multi-agent Q-learning is proposed. The proposed algorithm has the characteristics of "cooperative learning, distributed computing, and centralized decision-making", and can quickly converge to the optimal anti-jamming strategy.
• The proposed algorithm does not need to estimate the jamming patterns or any parameters of the jammer, and hence can be applied to a variety of anti-jamming scenarios.

II. SYSTEM MODEL AND PROBLEM FORMULATION

A. SYSTEM MODEL
The system model is shown in Fig. 1. To facilitate the research, we make the following assumptions: 1) The WSN is composed of N sensor nodes and one sink node. The sensor nodes can communicate directly with each other, and the sink node is responsible for coordinating the transmission channel of each sensor node. The set of sensor nodes is denoted as N = {1, . . . , N}. There are M channels in the area that can be used for transmission between sensor nodes. A sensor node has no a priori knowledge about the channels occupied by other nodes or by the jammer, but can sense whether there is external jamming in any of the M channels. In addition, the sink node and the sensor nodes can achieve reliable signaling interaction through a protocol-reinforced low-capacity control link. 2) The communication/jamming time is divided into communication/jamming timeslots of equal length, which is the minimum time unit for channel switching of a node/jammer. Each sensor node divides the communication timeslot into a sensing sub-slot, a transmission sub-slot and a local learning sub-slot. Each transmission sub-slot can carry one data packet, and an ACK message is received if the transmission is successful. The sensing sub-slot and local learning sub-slot are used for jamming sensing and local learning, respectively. Besides, the sink node divides the communication timeslot into a decision-making sub-slot and a learning sub-slot: the decision-making sub-slot is used to decide and coordinate the transmission channels of the sensor nodes, and the learning sub-slot is used to execute the learning algorithm.
3) The high-power jamming signal emitted by an external intelligent jammer can completely cover all sensor nodes. Since the sensor nodes are close to each other, they can be considered to face the same external malicious jamming. Besides, the intelligent jammer can continuously sense all available channels, and the K (K < M) channels that are occupied for the longest time in the current jamming timeslot become the blocking targets of the next jamming timeslot. This malicious jamming has the characteristics of frequency tracking and selective jamming, and thus has a certain degree of intelligence; in this article, we call it intelligent multi-channel blocking jamming. 4) When multiple sensor nodes occupy the same channel for transmission, mutual interference occurs. Both mutual interference and the intelligent multi-channel blocking jamming can cause transmission failure, while the effect of channel noise on transmission is ignored. Since all nodes are synchronized to the timeslots, when a sensor node is sensing, the other nodes are performing the same operation, which makes it impossible to sense mutual interference directly. However, if a node neither receives an ACK message nor senses malicious jamming, it can conclude that the transmission failure was caused by mutual interference.
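The jammer's selection rule in assumption 3 above can be sketched as follows. This is a minimal illustration; the function name and the occupancy bookkeeping are our own assumptions, not notation from the article.

```python
def jammer_select(channel_usage, K):
    """Pick the K channels occupied for the longest time in the current
    jamming timeslot; they become the blocking targets of the next slot."""
    # channel_usage: dict mapping channel index -> occupancy time in this slot
    ranked = sorted(channel_usage, key=channel_usage.get, reverse=True)
    return sorted(ranked[:K])   # report in ascending channel order

usage = {1: 5, 2: 9, 3: 1, 4: 9, 5: 0}
print(jammer_select(usage, K=2))  # channels 2 and 4 are blocked next slot
```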

B. PROBLEM FORMULATION
In traditional single-agent reinforcement learning, a Markov decision process (MDP) that includes a single agent and multiple environment states is generally used for problem formulation. However, in the multi-agent scenario considered in this article, the actions taken by any agent affect the state of the environment, as well as the rewards that can be obtained by other agents. This is a game involving multiple agents and multiple states. Therefore, the extension of the MDP to multi-agent scenarios is a stochastic game, also known as a Markov game [34], which can be used to model multi-agent reinforcement learning (MARL) problems. Mathematically, the anti-jamming problem can be expressed as a tuple <N, S, A_1, . . . , A_N, P, R_1, . . . , R_N>, where the specific meanings of the elements are as follows:
• N represents the number of sensor nodes;
• S represents the environment state space; s ∈ S is an element of the state space, representing the environment state of the WSN;
• A_n, n = 1, 2, . . . , N, represents the action space of sensor node n; a_n ∈ A_n is an optional action of sensor node n;
• P: S × A_1 × · · · × A_N × S → [0, 1] is the state transition probability function, which represents the probability that the environment state is transferred to s' after the nodes take actions a_n ∈ A_n in state s;
• R_n, n = 1, 2, . . . , N, represents the reward obtained after node n executes action a_n ∈ A_n in state s.
The environment state of the WSN is closely related to the jamming signal, and hence the environment state space is defined as follows:
S = {s | s = (j_1, . . . , j_K)}, (1)
where j_k ∈ {1, . . . , M}, k = 1, . . . , K, represents the serial numbers of the K blocked channels sensed by the sensor node through broadband spectrum sensing. We represent the environment state s by arranging the K distinct j_k in ascending order. There are C_M^K states in the environment state space. The action of each sensor node is to select its own transmission channel. Therefore, the independent action spaces of all sensor nodes are identical, i.e., A_1 = A_2 = · · · = A_N.
Then, the independent action space of any node n can be defined as:
A_n = {a_n | a_n ∈ {1, . . . , M}}, (2)
where the independent action a_n represents the number of the transmission channel selected by node n. A joint action a = {a_1, . . . , a_N} is a combination of independent actions of different nodes, and hence the joint action space can be defined as follows:
A = A_1 ⊗ A_2 ⊗ · · · ⊗ A_N, (3)
where ⊗ represents the Cartesian product operation. There are C_{M+N−1}^N joint actions in the joint action space. The transition of the environment state depends on the change of the jamming channels. As mentioned above, the change of the jamming channels depends on the intelligent jammer's statistics and selection of the transmission channels. Obviously, the transitions of the environment state are difficult to predict and model when the sensor nodes are not aware of the jamming strategy.
The local reward for node n taking independent action a_n in state s depends on whether there are other nodes or jamming signals in the selected transmission channel, and can be defined as follows:
r_n(s, a_n) = 1, if a_n ≠ j_k (∀k) and a_n ≠ a_m (∀m ∈ N\{n});
r_n(s, a_n) = 0, otherwise. (4)
The above formula means that when the data packet sent by node n is successfully received (confirmed by an ACK message), the reward is 1; otherwise it is 0. Different nodes get the same reward for taking the joint action a = {a_1, . . . , a_N}, which is the sum of the local rewards of all nodes. It can be expressed as follows:
R(s, a) = Σ_{n=1}^{N} r_n(s, a_n). (5)
In a stochastic game, agents may have cooperative, competitive or mixed relationships, and stochastic games can be divided into different categories according to their reward functions. When the reward functions of all agents are exactly the same, there is a cooperative relationship between the agents, which is called a fully cooperative game. If the sum of the reward functions of two agents is zero, there is a competitive relationship between them, which is called a zero-sum game. When there are multiple types of reward functions among agents, there is a mixed relationship between the agents, which is called a general-sum stochastic game.
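The reward structure just described, a local reward per node and a shared global sum, can be sketched as follows. The argument layout is our own illustrative choice.

```python
def local_reward(n, actions, jammed):
    """1 iff node n's channel is neither jammed nor shared with another node."""
    ch = actions[n]
    if ch in jammed:
        return 0
    if any(actions[m] == ch for m in range(len(actions)) if m != n):
        return 0
    return 1

def global_reward(actions, jammed):
    """All nodes share the same reward: the sum of the local rewards."""
    return sum(local_reward(n, actions, jammed) for n in range(len(actions)))

actions = [3, 5, 4]        # transmission channels chosen by three nodes
jammed = {5}               # node 1's channel is blocked by the jammer
print(global_reward(actions, jammed))  # 2: nodes 0 and 2 succeed
```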
In the above stochastic game, different nodes clearly have a fully cooperative relationship, and their common goal is to obtain the optimal joint strategy π*. Each sensor node obtains the largest cumulative discounted reward by executing the optimal joint strategy π* in the long term. The state-action value, also known as the Q-value, reflects the cumulative discounted reward that a certain strategy can obtain [35], and can be defined as:
Q^π(s, a) = E_π[ Σ_{τ=0}^{∞} γ^τ R_{t+τ} | s_t = s, a_t = a ], (6)
where s_t and a_t are the state and joint action at step t, respectively, R_{t+τ} is the global immediate reward under strategy π at step t + τ, E_π[·] is the mathematical expectation operator, and 0 ≤ γ < 1 is the discount factor, which represents the importance of long-term reward [36].
If the optimal state-action values corresponding to all state-action pairs can be obtained, the optimal joint strategy can be derived from the optimal state-action value function as follows:
π*(s) = arg max_{a ∈ A} Q*(s, a). (7)

III. JOINT MULTI-AGENT ANTI-JAMMING ALGORITHM

A. DETAILED DESCRIPTION OF THE ALGORITHM
According to the analysis in the previous section, to obtain the optimal joint strategy, we need to obtain the optimal Q-values corresponding to all state-action combinations. Besides, since the state transition probability function is difficult to model, a model-free reinforcement learning algorithm should be adopted to calculate the optimal Q-values. The Q-learning algorithm is a classical model-free reinforcement learning algorithm, which approaches the optimal Q-values gradually through simple iteration [37]. Specifically, the Q-learning algorithm creates a Q-table to store the Q-values of all state-action pairs. In any given state, the algorithm selects an action according to the current Q-table. After performing the selected action, the algorithm observes the immediate reward and the next state, and then updates the Q-value accordingly. In the above-mentioned fully cooperative stochastic game, all nodes have the same reward function, and hence the state-action value functions of all nodes executing any joint strategy are also equal. In other words, all the nodes only need to update the same Q-table. The Q-values can be updated as follows:
Q(s, a) = Q(s_t, a_t) + α_t [ R_t + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ], if s = s_t, a = a_t;
Q(s, a) = Q(s, a), otherwise, (8)
where α_t is the learning rate at step t.
Besides, substituting Eq. (5) into Eq. (6), the joint Q-value can be decomposed as follows:
Q(s_t, a_t) = Σ_{n=1}^{N} Q_n(s_t, a_{n,t}), (9)
where Q_n(s, a_n) represents the Q-value corresponding to the independent action (i.e., the independent strategy) of node n, which may be called the independent Q-value. Then, Eq. (8) can be rewritten as:
Q(s, a) = Σ_{n=1}^{N} { Q_n(s_t, a_{n,t}) + α_t [ r_{n,t} + γ max_{a_{n,t+1}} Q_n(s_{t+1}, a_{n,t+1}) − Q_n(s_t, a_{n,t}) ] }, if s = s_t, a = a_t;
Q(s, a) = Q(s, a), otherwise. (10)
Therefore, the update of the Q-value Q(s_t, a_t) in the joint Q-table can be converted into updating the independent Q-value Q_n(s_t, a_{n,t}) of each sensor node separately as:
Q_n(s_t, a_{n,t}) ← Q_n(s_t, a_{n,t}) + α_t [ r_{n,t} + γ max_{a_{n,t+1}} Q_n(s_{t+1}, a_{n,t+1}) − Q_n(s_t, a_{n,t}) ], (11)
and then summing the results, thereby achieving a distributed update of the Q-value Q(s_t, a_t). When all Q-values in the joint Q-table converge to their optimal values, the nodes can obtain the optimal joint strategy according to Eq. (7).
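This distributed update can be sketched as follows, assuming tabular Q-values stored as NumPy arrays; the sizes and hyper-parameters are illustrative, not the article's settings.

```python
import numpy as np

S, M, N = 10, 5, 3          # illustrative: |S| states, M channels, N nodes
alpha, gamma = 0.5, 0.9
Q_ind = [np.zeros((S, M)) for _ in range(N)]   # one independent Q-table per node

def update_independent(n, s, a_n, r_n, s_next):
    """Eq. (11): standard Q-learning step on node n's independent Q-table."""
    td = r_n + gamma * Q_ind[n][s_next].max() - Q_ind[n][s, a_n]
    Q_ind[n][s, a_n] += alpha * td

def joint_q(s, a):
    """Eq. (9): the sink recovers the joint Q-value by summing."""
    return sum(Q_ind[n][s, a[n]] for n in range(N))

for n in range(N):
    update_independent(n, s=0, a_n=2, r_n=1, s_next=1)
print(joint_q(0, [2, 2, 2]))   # 3 * (0.5 * 1) = 1.5
```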
According to the analysis above, we propose a joint multi-agent anti-jamming algorithm (JMAA) based on Q-learning. As illustrated in Fig. 2, each sensor node maintains an independent Q-table and the sink node maintains a joint Q-table. The rows and columns of the independent Q-table correspond to the environment states and independent actions, respectively; therefore, the independent Q-table has C_M^K rows and M columns. Similarly, the rows and columns of the joint Q-table correspond to the environment states and joint actions, respectively; therefore, the joint Q-table has C_M^K rows and C_{M+N−1}^N columns. The core idea of the proposed algorithm is that each node updates its independent Q-values according to its local sensing results and transmission rewards, while the sink node collects all independent Q-values to update the joint Q-values and decides the next transmission action of all nodes. In brief, the proposed JMAA has the characteristics of "distributed learning, centralized decision-making, and independent execution".
As shown in Fig. 3, the sink node divides the communication timeslot into a decision-making sub-slot and a learning sub-slot, which are used to coordinate the transmission channels and run the learning algorithm, respectively. The sensor node divides the communication timeslot into a sensing sub-slot, a transmission sub-slot and a local learning sub-slot, which are used for jamming sensing, data transmission and local learning, respectively. Each timeslot corresponds to one iteration of the JMAA.

Algorithm 1: Joint Multi-Agent Anti-Jamming Algorithm (JMAA)
1: Initialize: α, γ, ξ_0 > 0, υ > 0, Q(s_t, a_t), Q_n(s_t, a_{n,t});
2: for t = 1, . . . , T do
3:   Sensor nodes obtain state s_t = (j_1, . . . , j_K);
4:   Sensor nodes transmit s_t and Q_n(s_{t−2}, a_{n,t−2}) to the sink node;
5:   The sink node selects a joint action a_t by the Softmax algorithm;
6:   The sink node sends instructions to the sensor nodes according to a_t;
7:   Sensor nodes perform independent actions a_{n,t} according to the instructions;
8:   Sensor nodes calculate rewards r_n(s_t, a_{n,t});
9:   Sensor nodes update the independent Q-values Q_n(s_{t−1}, a_{n,t−1}) by Eq. (11);
10:  The sink node updates the joint Q-value Q(s_{t−1}, a_{t−1}) by Eq. (9);
11: end for

The details of the JMAA are provided in Algorithm 1, and its specific flow is as follows.
1) Firstly, in the sensing sub-slot, each sensor node obtains the current environment state s_t by jamming sensing (line 3), and then transmits s_t together with the locally updated independent Q-value of the previous timeslot to the sink node (line 4).
2) Secondly, in the decision-making sub-slot, the sink node selects a joint action by the Softmax algorithm based on the current joint Q-table (line 5), and then sends instructions to all the sensor nodes to coordinate their transmission channels (line 6).
3) Thirdly, in the transmission sub-slot, each sensor node executes its own independent action according to the instructions from the sink node, i.e., data transmission is carried out in the assigned channel (line 7).
4) Fourthly, in the local learning sub-slot, each sensor node calculates its own reward based on the sensing results and the ACK message (line 8), and then updates its independent Q-value by Eq. (11) (line 9).
5) Lastly, while the sensor nodes perform the previous two steps, the sink node, in its learning sub-slot, updates the joint Q-value by Eq. (9) based on the independent Q-values updated in the previous timeslot (line 10).
In step 2 above, the Softmax algorithm is introduced to select the joint action; it is one of the common methods to address the "exploration-exploitation dilemma" [38] faced by reinforcement learning. Specifically, the strategy by which the sink node selects the joint action can be expressed as:
π(a | s) = exp(Q(s, a)/ξ) / Σ_{a′∈A} exp(Q(s, a′)/ξ), (12)
where ξ > 0 is called the "temperature". The smaller ξ is, the greater the probability that a joint action with a higher Q-value will be selected. As ξ approaches 0, the Softmax algorithm tends to "exploit only"; conversely, as ξ tends to infinity, it tends to "explore only". To achieve a smooth transition from "exploration" to "exploitation", the temperature is updated according to the following rule:
ξ_t = ξ_0 · e^{−υt}, (13)
where the initial temperature ξ_0 is positively correlated with the "exploration" ability of the algorithm at the initial stage. When υ > 0, ξ approaches 0 gradually as the algorithm iterates, and its value determines the length of the transition time.
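The Softmax selection with a decaying temperature can be sketched as follows. The exponential decay form of the temperature is our assumption, chosen only to match the described behavior (ξ shrinking from ξ_0 toward 0 at a rate set by υ); the function names are ours.

```python
import math
import random

def softmax_select(q_row, xi):
    """Boltzmann (Softmax) action selection: lower temperature xi makes
    high-Q actions exponentially more likely."""
    m = max(q_row)                                   # for numerical stability
    weights = [math.exp((q - m) / xi) for q in q_row]
    total = sum(weights)
    r, acc = random.random() * total, 0.0
    for idx, w in enumerate(weights):
        acc += w
        if r < acc:
            return idx
    return len(weights) - 1

def temperature(t, xi0=10.0, upsilon=1e-3):
    """Assumed decay rule: xi starts at xi0 and approaches 0 as t grows."""
    return xi0 * math.exp(-upsilon * t)

# At a very low temperature, selection is effectively greedy:
print(softmax_select([0.1, 0.9, 0.5], xi=1e-6))  # 1 (the argmax)
```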
The operations of the nodes are presented in the form of a flowchart in Fig. 4. Different from an offline algorithm, which must complete training before outputting a strategy, the proposed JMAA is online, and its iterative learning process is also a process of constantly improving the transmission strategy. This means that the sensor nodes and the sink node continue to execute the proposed algorithm until the transmission is terminated. As the transmission progresses, the joint Q-table is continuously updated, i.e., the transmission strategy is continuously improved. After a finite number of iterations, when all Q-values in the joint Q-table no longer change significantly, the Q-values have converged to the optimum, and the strategy based on the joint Q-table converges to the optimal strategy.

B. COMPLEXITY ANALYSIS
The main computational complexity of the proposed Algorithm 1 lies in steps 3 to 10. These steps are performed only once in each iteration, and their computational complexity is independent of the size of the Q-table.
Hence, the computational complexity of steps 5, 6 and 10 at the sink node can be expressed as O(3T). The computational complexity of each sensor node can be expressed as O(5T), so the computational complexity of N sensor nodes is N · O(5T). The total computational complexity of Algorithm 1 can thus be expressed as C = (N + 1) · O(T), which means that the proposed algorithm can reach the optimal solution in polynomial time.
As previously mentioned, the sizes of the independent Q-table and the joint Q-table are C_M^K × M and C_M^K × C_{M+N−1}^N, respectively. Therefore, the space complexities of a sensor node and the sink node can be expressed as O(C_M^K · M) and O(C_M^K · C_{M+N−1}^N), respectively. The total space complexity of Algorithm 1 can be expressed as O(N · C_M^K · M + C_M^K · C_{M+N−1}^N), which means that the space complexity of Algorithm 1 increases sharply with the number of channels and sensor nodes.
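To see how sharply the joint table grows with the network size, the two table sizes stated above can be computed directly; the values of M, K and N below are illustrative.

```python
from math import comb

M, K, N = 8, 2, 3                    # channels, jammed channels, sensor nodes
states = comb(M, K)                  # C_M^K environment states
independent_entries = states * M     # per-node independent Q-table
joint_entries = states * comb(M + N - 1, N)   # sink node's joint Q-table
print(independent_entries)           # 28 * 8   = 224
print(joint_entries)                 # 28 * 120 = 3360
```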

C. CONVERGENCE ANALYSIS
The authors in [36] have proved that when the learning rate α_t in Eqs. (10) and (11) satisfies the following conditions:
Σ_{t=0}^{∞} α_t = ∞,  Σ_{t=0}^{∞} α_t^2 < ∞, (14)
the Q-learning algorithm traverses all states as the number of iterations increases, and finally converges to the optimal Q-values of all state-action pairs after a finite number of iterations. The proposed JMAA obtains the joint actions according to the joint Q-table, and hence it converges to the optimal strategy.
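The standard convergence conditions require the partial sums of the learning rate to diverge while the partial sums of its square stay bounded; a common schedule satisfying both is α_t = 1/t, whose squared sum approaches π²/6. A quick numerical check:

```python
T = 100_000
partial = sum(1.0 / t for t in range(1, T + 1))        # diverges (harmonic series)
partial_sq = sum(1.0 / t**2 for t in range(1, T + 1))  # converges to pi^2/6
print(round(partial, 2))      # ~12.09 here, and still growing with T
print(round(partial_sq, 4))   # ~1.6449
```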

D. SIGNALING OVERHEAD ANALYSIS
Since the proposed JMAA relies on the information interaction between the sink node and the sensor nodes, the signaling overhead should be considered. As previously mentioned, in each iteration, a sensor node sends its sensing result and independent Q-value to the sink node, and receives a channel assignment instruction from the sink node. Let I_s, I_q and I_a denote the amount of information contained in the sensing result, in the independent Q-value and in the channel assignment instruction, respectively, and let T_s denote the duration of one timeslot. The signaling overhead of each sensor node can then be expressed as (I_s + I_q)/T_s, while the signaling overhead of the sink node can be expressed as I_a/T_s. Since N sensor nodes have to send information to the sink node in each iteration, the signaling overhead of the N sensor nodes can be expressed as N(I_s + I_q)/T_s. Therefore, the total signaling overhead of Algorithm 1 can be expressed as [N(I_s + I_q) + I_a]/T_s, which means that the total signaling overhead is proportional to the number of sensor nodes.
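Plugging illustrative numbers into this overhead expression makes the scaling concrete; all message sizes and the timeslot length below are our own assumptions, not values from the article.

```python
N = 10                      # sensor nodes
I_s, I_q, I_a = 16, 32, 8   # bits per sensing report, Q-value, instruction
T_s = 0.01                  # timeslot length in seconds
per_node = (I_s + I_q) / T_s                # bits/s uploaded by one node
total = (N * (I_s + I_q) + I_a) / T_s       # network-wide signaling rate
print(per_node)   # 4800.0
print(total)      # 48800.0
```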

IV. SIMULATION RESULTS AND ANALYSIS

A. SIMULATION SETTING
The simulation parameter settings are shown in Table 1.
To evaluate the performance of the proposed JMAA, we compare it with the following methods:
• Frequency-hopping based method (FH): The sensor nodes switch transmission channels according to randomly generated fixed frequency-hopping patterns, and the frequency-hopping patterns of different sensor nodes are orthogonal to each other so that the same channel is occupied by only one sensor node at a time.
• Sensing based method: Each sensor node can sense all the jammed channels. If the channel in use is blocked in the current timeslot, the sensor node randomly switches to an idle channel in the next timeslot; otherwise the channel is left unchanged. There is no exchange of information among nodes.
• Independent Q-learning method (IQL): Each sensor node performs a Q-learning algorithm individually. The decisions of each sensor node are based solely on local learning results, and the ACK mechanism is not adopted.
• Independent Q-learning method with ACK mechanism (IQL-ACK): This method introduces the ACK mechanism on the basis of IQL, and can determine whether there is mutual interference by combining the ACK with the result of jamming sensing. The difference between IQL-ACK and JMAA is that there is no information exchange among nodes in IQL-ACK, and each sensor node's decision is based on its local independent Q-table rather than the joint Q-table.
• Distributed Q-learning method (DQL): Each node adopts a multi-agent reinforcement learning algorithm called distributed Q-learning [39]. Similar to IQL, this method does not require information exchange between sensors. Each node maintains its local Q-value Q_n(s_t, a_{n,t}) through its own actions and rewards, and the Q-value is only ever updated in the direction of increase:
Q_n(s_t, a_{n,t}) = max{ Q_n(s_t, a_{n,t}), r_{n,t} + γ max_{a_{n,t+1}} Q_n(s_{t+1}, a_{n,t+1}) }. (15)
We introduce the average packet receive ratio to compare the anti-jamming performance of the different methods.
The average packet receive ratio can be defined as
ρ_avg(t) = (1/N) Σ_{n=1}^{N} D_n(t)/W,
where W is the number of independent runs of the algorithm and D_n(t) is the number of data packets successfully transmitted by sensor node n in timeslot t over the W independent runs. The following simulation results on the average packet receive ratio are averaged over 5000 independent runs.

Fig. 5 compares the average packet receive ratio of JMAA for different parameters of the Softmax algorithm. The smaller υ is, the longer the exploration process of JMAA, and hence the slower the convergence. However, sufficient exploration makes the convergence value of the average packet receive ratio higher. When at least 6000 iterations are taken to complete the transition from exploration to exploitation, the average packet receive ratio of JMAA converges to the optimal value of about 0.93.

Fig. 6 shows a comparison of the average packet receive ratio of the different methods. Considering that the performance of a reinforcement-learning-based anti-jamming algorithm is affected by the parameter setting of the Softmax algorithm, we choose the optimal performance curves of DQL and IQL (i.e., the cases with the fewest iterations to reach optimal performance) as comparison schemes. Besides, due to the excessive number of iterations required for IQL-ACK to converge to the optimal value, Fig. 6 shows its performance curve when convergence is completed within 10,000 iterations. Firstly, as seen from Fig. 5, the average packet receive ratio of JMAA converges to the optimal value of 0.93 within 6000 iterations, while IQL-ACK converges to 0.9 within 6000 iterations and to 0.915 within 9000 iterations. This means that IQL-ACK needs more iterations to achieve performance similar to JMAA. The reason is that cooperative learning among nodes is not introduced in IQL-ACK, so it takes the nodes more time to independently explore and find the optimal strategy. Secondly, although the optimal performance curves of DQL and IQL converge quickly, their converged values are significantly worse than that of JMAA. The reason is that updating the Q-value of DQL according to Eq. (15) always proceeds in the direction of increase, which has the advantage of accelerating convergence but the disadvantage of falling into local optima. IQL ignores the mutual interference among nodes, resulting in fast convergence but a poor anti-jamming effect. Finally, due to their fixed anti-jamming strategies, the average packet receive ratios of the FH-based and sensing-based methods commonly used in practice are far lower than that of JMAA, and even lower than those of the above reinforcement-learning-based comparison algorithms.
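The average packet receive ratio defined above can be computed directly from per-node delivery counts; the numbers below are synthetic, not the article's data.

```python
import numpy as np

W = 5000                     # independent runs
# D[n, t]: packets delivered by node n in timeslot t, summed over the W runs
D = np.array([[4100, 4650, 4820],
              [3900, 4600, 4790],
              [4000, 4700, 4830]])
rho_avg = D.mean(axis=0) / W   # average over the N nodes, normalized by W
print(rho_avg.round(3))        # per-timeslot curve approaching 1 as learning proceeds
```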

B. SIMULATION ANALYSIS
Since the proposed JMAA does not need to model the jamming patterns and has the ability to explore and learn the unknown jamming environment, it should be able to solve the problem of reliable communication in various jamming environments. Hence, the following simulations verify the performance of the proposed JMAA when the external malicious jamming is sweep jamming or probabilistic jamming [40]. Sweep jamming is a conventional dynamic jamming that periodically jams the target frequency range or the target channels in turn, while probabilistic jamming determines the target channels of different timeslots according to a specific jamming probability matrix. Specifically, if the jammer determines the jamming channels according to the probability matrix shown in Fig. 7(a), then Fig. 7(b) shows the generated jamming pattern over two jamming cycles. More details about probabilistic jamming can be found in [39]. Fig. 8 and Fig. 9 show the average packet receive ratio of JMAA in the probabilistic jamming environment and the sweep jamming environment, respectively. In both environments, as the parameter υ decreases, the convergence speed decreases, but the convergence value gets closer to 1. Obviously, the average packet receive ratio converges to 1 when appropriate parameters are set, which means that the proposed JMAA can completely avoid the malicious jamming and mutual interference. In addition, the average packet receive ratio of JMAA requires at least 4000 iterations of exploration before it converges to 1 in the probabilistic jamming environment, while in the sweep jamming environment it needs only 1000 iterations.

V. CONCLUSION
In this article, we investigate the problem of anti-jamming communication in a wireless sensor network. To combat the internal mutual interference caused by competition among sensor nodes and the external intelligent multi-channel blocking jamming, we model the anti-jamming problem as a stochastic game, and a joint multi-agent anti-jamming algorithm (JMAA) is proposed to achieve real-time anti-jamming channel selection. Through cooperative learning, the proposed JMAA can eliminate mutual interference and effectively avoid the tracking of the intelligent multi-channel blocking jamming. The simulation results show that the proposed JMAA is superior to the frequency-hopping based method, the sensing-based method and the independent Q-learning method (with or without the ACK mechanism). In addition, we verify the effectiveness of the proposed JMAA in sweep jamming and probabilistic jamming environments, which indicates that the proposed JMAA can be widely used in a variety of jamming environments.
In future work, the transfer learning approach may be a good candidate to obtain a faster convergence speed in multi-user sensor networks with limited computing resources. In addition, it would be more meaningful to consider that different nodes face different external jamming.