Introduction
To meet emerging Net-Centric Warfare (NCW) needs, the next generation of airborne tactical networks (ATN) must evolve toward multi-unmanned aerial vehicle (UAV) systems that provide swarm combat capability. The aeronautic swarm network (ASN), consisting of multiple UAVs, is a new kind of airborne tactical network inspired by biological swarm behaviors, and the intensive use of UAV combat systems is expected to become standard practice in the next decade. In military operations, aeronautic swarm networks may vary from slowly dynamic to highly dynamic and exhibit intermittent links and a fluid topology, which brings new challenges to the design of mission-centric ATNs. It is widely believed that an ad hoc mesh network is most suitable for an aeronautic swarm system, offering the promise of improved capacity and reliable communications for multiple UAVs [1].
An aeronautic swarm network is used to exchange a constantly growing amount of battlefield situation information, which also causes considerable interference, so coexistence among swarm nodes becomes a necessity. Moreover, as the aerial battlefield electromagnetic environment grows increasingly complex and intentional jamming intensifies, swarm nodes are non-permanent, wireless channels may be impaired, and link connectivity between peer nodes is intermittent. This necessitates resilient anti-jamming technologies closely integrated into the ASN to provide robust connectivity and to gain a competitive advantage in future electromagnetic spectrum warfare (EMSW).
Traditional anti-jamming solutions do not work well in complicated electromagnetic scenarios, because current airborne tactical radios are statically configured prior to deployment to operate within a pre-allocated spectrum channel in the temporal, frequency, and geographical domains. This paradigm of static spectrum allocation results in a situation where some frequency bands are utilized effectively whereas other portions of the spectrum remain under-utilized. Aeronautic tactical radios therefore need to share spectrum with other in-network and out-of-network radios to improve spectrum utilization. Recently, the concept of cognitive radio (CR) based anti-jamming communication was put forward to improve the spectral efficiency of tactical networks in congested electromagnetic environments [2]–[3]. A cognitive anti-jamming (CAJ) radio can sense the jamming signal and opportunistically avoid the jammed spectrum for secure data transmission in the presence of intentional and accidental interference. Through the dynamic spectrum access (DSA) feature of CR, it emerges as an intelligent aeronautical wireless communication system with some autonomy in making decisions about spectrum usage.
Cognitive anti-jamming technology has attracted widespread attention and considerable research. To address the interactive competition between legitimate users and jammers, game theory and Markov decision processes (MDP) were first applied to cognitive anti-jamming networks. A stochastic zero-sum game framework is proposed in [2], where a Minimax-Q learning algorithm is used to explore an optimal channel-access strategy in a dynamic anti-jamming game. For the same stochastic game model, Singh et al. [3] present state-action-reward-state-action (SARSA) learning and QV learning, the on-policy and non-greedy variants of Q-learning, which outperform the Minimax-Q algorithm. Further, under a system model that allows multiple tactical radios to operate simultaneously over the same spectrum band, where each radio attempts to evade the transmissions of other radios as well as the jamming signal, a multi-agent reinforcement learning (MARL) algorithm based on Q-learning is proposed in [4] to find optimal anti-jamming and interference-avoidance policies. Moreover, a new decision policy for the sub-band spectrum state is developed in the multi-agent environment to reduce the computational complexity of learning.
All the above-mentioned cognitive anti-jamming works are mainly based on game-theoretic models solved with Q-learning. The stateful Q-learning approach requires explicit modeling of network states and actions from an underlying MDP. Unfortunately, for an aeronautic swarm network it is difficult to handle this model directly because of the large state space of the environment. Wang et al. [5] model the DSA problem with a partially observable Markov decision process (POMDP) framework that considers channel quality when deciding which channel to sense; however, it has comparatively high complexity. In contrast, the cognitive anti-jamming problem can be modeled under the multi-armed bandit (MAB) framework, which turns out to be simple and inexpensive to implement. Therefore, we investigate a stateless MAB model that addresses the exploration-exploitation dilemma for power allocation and channel selection based on sequential reward sampling.
The classical MAB models a sequential interaction scheme between a learner and an environment. At each round, the learner selects one out of $K$ arms, observes only the reward of the chosen arm, and aims to maximize its cumulative reward over time.
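As a concrete illustration of this interaction loop, the following minimal Python sketch plays a $K$-armed Bernoulli bandit with a naive greedy policy. The arm means, the horizon, and the greedy rule are illustrative placeholders only; they are not the model or algorithm used later in the paper.

```python
import random

def play_bandit(means, horizon, rng=random.Random(0)):
    """Minimal learner-environment loop: pick one of K arms per round,
    observe only that arm's Bernoulli reward, track cumulative regret."""
    K = len(means)
    counts, sums = [0] * K, [0.0] * K
    total_reward = 0.0
    for t in range(horizon):
        if t < K:                       # try every arm once first
            a = t
        else:                           # toy policy: greedy on empirical means
            a = max(range(K), key=lambda i: sums[i] / counts[i])
        r = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1
        sums[a] += r
        total_reward += r
    regret = horizon * max(means) - total_reward
    return total_reward, regret
```

The regret returned here is exactly the cumulative gap to the best arm's expected reward, the quantity the later algorithms are designed to keep sublinear.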
Several bandit-based anti-jamming studies have been reported recently. From a multi-domain perspective, an anti-jamming defense scheme covering both the power domain and the spectrum domain is proposed in [6]. To be more specific, a Stackelberg power game is formulated to fight jamming attacks in the power domain, and a UCB1 bandit-based channel selection scheme with a channel-switching cost is designed to achieve anti-jamming in the spectrum domain. In [7], an adversarial multiplayer MAB game is employed to model joint channel and power allocation in underwater acoustic communication networks, and a game-based distributed hierarchical exponential learning algorithm is presented that effectively improves user learning ability and decreases learning time. Based on a multi-player bandit model, Sawant et al. [8] study distributed algorithms that are robust against malicious jamming attacks and give constant regret with high confidence.
The bandit algorithms used in the above anti-jamming approaches rely on knowledge of the horizon $T$, i.e., the total number of decision rounds, which is rarely available in practice.
On the other hand, most contemporary work on bandit-based anti-jamming assumes perfect spectrum sensing at the physical layer (PHY). The key to anti-jamming operation is the radio's ability to sense its surrounding electromagnetic environment, a functionality known as jamming sensing. However, imperfect sensing limits the achievable anti-jamming capability. There have been attempts [10] to use the energy detector (ED) output as a reward for general reinforcement learning algorithms, but they lack significant theoretical guarantees and a clear relation to achievable throughput. In contrast, we jointly investigate the DT kl-UCB++ bandit algorithm and jamming sensing, with the objective of maximizing the throughput of each airborne radio, and design an optimal configuration of transmit power and spectrum channel to enable ASN anti-jamming communication.
The remainder of this paper is organized as follows. In Section II, we describe the aeronautic swarm network model, which consists of in-band full-duplex (IBFD) enabled CRs offering increased throughput and real-time sensing ability. In Section III, the detection and false-alarm probabilities of jamming sensing based on improved energy detection (IED) are analyzed theoretically. Then, using the accurate reward calculation from the jamming sensing output, the distributed anti-jamming scheme based on the DT kl-UCB++ algorithm is proposed in Section IV. In Section V, the performance of the presented bandit anti-jamming scheme is evaluated with simulations. Finally, conclusions are drawn in Section VI.
System Model
We consider the aeronautic swarm network illustrated in Figure 1. The UAV nodes of the ASN hover over a geographical area and are equipped with tactical cognitive radios. On the battlefield, ally and enemy tactical radios face each other in a competition to dominate an open spectrum resource and achieve higher throughput. Using accurate local spectrum situation sensing information, each airborne radio applies a strategy to transmit or stay silent. We further assume the smart jammers have cognitive features such as spectrum sensing, learning, and reconfiguration ability, and consequently cause more damage than conventional jammers. During operation, the radio nodes periodically exchange control information to select the best radios as local controllers, i.e., cluster heads (CH), in the swarm ad hoc network. If the operational conditions of a specific CH degrade, its role can be taken over by another radio node of the network. Radio nodes are also selected to act as gateways (GW) between clusters when required.
In the following, we present a mathematical formulation for the ASN with multiple coexisting tactical radios and jammers.
For the presented aeronautic swarm network configuration, a suitable mathematical formulation needs to be created. Since there are multiple radios in the tactical network, the problem is classified as a multi-player MAB. In this bandit model, the radios are the players (agents) in the swarm network, and they play (i.e., transmit on) the channels; an arm (action) corresponds to a frequency channel and transmit power level that the anti-jamming radio may choose under competition. In the case of decentralized decision making, each radio computes its own action, selecting at each round a channel-power pair from its available set.
The aeronautic swarm network, in which multiple tactical radios and jammers have to coexist, generally operates in highly congested and contested electromagnetic environments, where spectrum resources are scarce. Therefore, same-frequency simultaneous transmit and receive (SF-STAR) technology is employed for the cognitive radios in this paper. It is worth noting that an SF-STAR CAJ (SCAJ) radio transmits and receives information signals at the same time and at the same center frequency, promising to double the throughput of a wireless link compared with traditional half-duplex operation. Military IBFD radios will progressively acquire the SF-STAR capability, allowing them to conduct electronic warfare while simultaneously using the same frequency band for communication. By utilizing the STAR capability, an SCAJ radio gains a major technical advantage over an opponent that does not possess similar technology [11]–[12], and the use of artificial noise generated by FD receiver technology has been shown to enhance physical-layer security [13].
We design an IBFD transmitter-based transmit-sense-receive (T-S-R) mode for the SCAJ radio, as depicted in Fig. 2(a)-(b). First, to check channel availability, the radio initially senses in a half-duplex fashion for a short sensing duration.
Jamming Sensing
The design of a cognitive anti-jamming strategy starts from the premise that frequency-band usage information is available. Such information provides an advantage during the operational mission because it not only helps ensure information transmission but can also be used for electromagnetic warfare. Spectrum situation awareness of the jamming signal is part of a cognitive anti-jamming communication system and is utilized to learn and adapt to the environment. Sensing accuracy is reflected in the detection probability when the jammer is present. Several channel sensing methods exist, among which energy detection, matched-filter detection, and cyclostationarity-based detection are the popular sensing and estimation techniques used in CR implementations. However, these approaches cannot achieve a good trade-off between performance and complexity. To perform well in jamming sensing, we make use of an improved energy detector (IED) [14], i.e., a detector that replaces the conventional squaring operation with an arbitrary positive power $p$ of the received signal amplitude.
For the swarm network with $M$ jamming signals, the signal received by the SCAJ radio under the jammer-absent ($H_{0}$) and jammer-present ($H_{1}$) hypotheses is \begin{equation*} \mathrm {y}\left ({n }\right)=\begin{cases} h_{SI}u\left ({n }\right)+w\left ({n }\right)\!, &{H}_{0} \\ \displaystyle \sum \nolimits _{i=1}^{M} {h_{i}(n)s_{i}(n)} +h_{SI}u\left ({n }\right)+w\left ({n }\right)\!, &{H}_{1} \end{cases}\tag{1}\end{equation*} where $h_{SI}u(n)$ denotes the residual self-interference of the IBFD radio and $w(n)$ the noise.
Based on the signal model described above, and defining the IED decision statistic over $N_{s}$ sensed samples as \begin{equation*} \Omega =\mathop \sum \nolimits _{n=0}^{N_{s}} {\mathrm {\eta (n)}}\tag{2}\end{equation*}
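The behavior of this statistic can be checked with a short Monte-Carlo sketch. We read the per-sample IED output as $\eta(n)=|y(n)|^{p}$, which is our interpretation, consistent with the pdf shapes given below in (3)-(4); the sample counts, $p$, and the INR/SNR values are illustrative placeholders.

```python
import random

def ied_statistic(Ns, p, var, rng):
    """Omega = sum_n |y(n)|^p for circular Gaussian y(n): |y(n)|^2 is
    exponential with mean `var`, so |y(n)|^p = (|y(n)|^2)^(p/2)."""
    return sum(rng.expovariate(1.0 / var) ** (p / 2.0) for _ in range(Ns))

def empirical_pd_pf(Ns=64, p=3.0, inr=1.0, snr=4.0, lam=None, trials=2000):
    """Estimate false-alarm/detection probabilities of a threshold test on
    Omega under H0 (noise + self-interference) and H1 (jammers present)."""
    rng = random.Random(1)
    h0 = [ied_statistic(Ns, p, 1.0 + inr, rng) for _ in range(trials)]
    h1 = [ied_statistic(Ns, p, 1.0 + inr + snr, rng) for _ in range(trials)]
    if lam is None:                      # midpoint threshold, illustrative only
        lam = 0.5 * (sum(h0) + sum(h1)) / trials
    pf = sum(x >= lam for x in h0) / trials
    pd = sum(x >= lam for x in h1) / trials
    return pf, pd
```

Because the jammers raise the mean of $\Omega$, the detection probability exceeds the false-alarm probability for any threshold between the two conditional means.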
Since the statistics of $\Omega$ differ under the two hypotheses, the detection performance can be characterized through the distribution of $\eta$.
The probability density function of $\eta$ under each hypothesis is given by \begin{align*} {f}_{\eta \left |{ H_{0} }\right.}\!\left ({\mathrm {x} }\right)=&\frac {2}{p}{\left ({\frac {1}{1+\gamma _{inr}} }\right)}x^{\frac {2}{p}-1}exp\left [{ -\left ({\frac {1}{1+\gamma _{inr}} }\right)x^{\frac {2}{p}} }\right] \\ \tag{3}\\ {f}_{\eta \left |{ H_{1} }\right.}\!\left ({\mathrm {x} }\right)=&\frac {2}{p}{\left ({\frac {1}{1+\gamma _{inr}+\sum \nolimits _{i=1}^{M} {\gamma _{snr}(i)}} }\right)}x^{\frac {2}{p}-1} \\&\times exp\left [{ -\left ({\frac {1}{1+\gamma _{inr}+\sum \nolimits _{i=1}^{M} {\gamma _{snr}(i)}} }\right)x^{\frac {2}{p}} }\right]\tag{4}\end{align*}
Invoking the central limit theorem for large $N_{s}$, the decision statistic $\Omega$ is approximately Gaussian under each hypothesis, with means and variances \begin{equation*} {H}_{0}:\begin{cases} \mu _{0}=N_{s}\left ({1+\gamma _{inr} }\right)^{\frac {p}{2}} \Gamma \left ({1+\dfrac {p}{2} }\right) \\ \sigma _{0}^{2}=N_{s}\left ({1+\gamma _{inr} }\right)^{p}\left [{ \Gamma \left ({1+p }\right)-\Gamma ^{2}\left ({1+\dfrac {p}{2} }\right) }\right] \end{cases}\tag{5}\end{equation*}
\begin{align*} H_{1}:\begin{cases} \mu _{1}=N_{s}\left [{ 1+\gamma _{inr}+\displaystyle \sum \nolimits _{i=1}^{M} {\gamma _{snr}(i)} }\right]^{\frac {p}{2}} \Gamma \left ({1+\dfrac {p}{2} }\right) \\[3pt] \sigma _{1}^{2}=N_{s}\left [{ 1+\gamma _{inr}+\displaystyle \sum \nolimits _{i=1}^{M} {\gamma _{snr}(i)} }\right]^{p}\\[7pt] \quad \times \left [{ \Gamma \left ({1+p }\right)-\Gamma ^{2}\left ({1+\dfrac {p}{2} }\right) }\right] \end{cases}\tag{6}\end{align*}
After some algebraic manipulations, the probability of miss detection at each SCAJ radio can be obtained as \begin{equation*} P_{md}=Pr\left \{{\Omega < \lambda \left |{ H_{1} }\right. }\right \}=1-Q\left ({\frac {\mu _{1}}{\sigma _{1}} }\right)-Q\left ({\frac {\lambda -\mu _{1}}{\sigma _{1}} }\right)\tag{7}\end{equation*}
while the probability of false alarm is \begin{equation*} P_{f}=Pr\left \{{\Omega \ge \lambda \left |{ H_{0} }\right. }\right \}=Q\left ({\frac {\lambda -\mu _{0}}{\sigma _{0}} }\right)\tag{8}\end{equation*}
Hence the total error probability of SCAJ radio jamming sensing can be calculated as \begin{align*} P_{e}=&P_{f}+P_{md} \\=&1+Q\left ({\frac {\lambda -\mu _{0}}{\sigma _{0}} }\right)-Q\left ({\frac {\mu _{1}}{\sigma _{1}} }\right)-Q\left ({\frac {\lambda -\mu _{1}}{\sigma _{1}} }\right)\tag{9}\end{align*}
Differentiating (9) with respect to the threshold gives \begin{align*} \frac {dP_{e}(\lambda)}{d\lambda }=\frac {1}{\sqrt {2\pi } \sigma _{1}}exp\left [{ -\frac {\left ({\lambda -\mu _{1} }\right)^{2}}{2\sigma _{1}^{2}} }\right]-\frac {1}{\sqrt {2\pi } \sigma _{0}}exp\left [{ -\frac {\left ({\lambda -\mu _{0} }\right)^{2}}{2\sigma _{0}^{2}} }\right]\tag{10}\end{align*}
Setting $dP_{e}(\lambda)/d\lambda =0$ and solving for $\lambda$ yields the optimal threshold \begin{align*} \lambda _{opt}=&\frac {\frac {\mu _{0}}{\sigma _{0}^{2}}-\frac {\mu _{1}}{\sigma _{1}^{2}}}{\frac {1}{\sigma _{1}^{2}}-\frac {1}{\sigma _{0}^{2}}} \\&\,\,-\sqrt {\frac {\left ({\frac {\mu _{0}}{\sigma _{0}^{2}}\!-\!\frac {\mu _{1}}{\sigma _{1}^{2}} }\right)^{2}\!-\!\left ({\frac {1}{\sigma _{0}^{2}}\!-\!\frac {1}{\sigma _{1}^{2}} }\right)\left ({\frac {\mu _{0}}{\sigma _{0}^{2}}-\frac {\mu _{1}}{\sigma _{1}^{2}}+2ln\frac {\sigma _{0}}{\sigma _{1}} }\right)}{\left ({\frac {1}{\sigma _{0}^{2}}-\frac {1}{\sigma _{1}^{2}} }\right)^{2}}} \\ {}\tag{11}\end{align*}
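The error probabilities (7)-(9) and the threshold optimization can be checked numerically. The sketch below takes the Gaussian moments $(\mu_0,\sigma_0,\mu_1,\sigma_1)$ as inputs and, rather than evaluating the closed form (11), simply minimizes the total error over a grid between the two means; all parameter values are illustrative.

```python
import math

def Q(x):
    """Gaussian tail probability Q(x) = P(N(0,1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def total_error(lam, mu0, s0, mu1, s1):
    """Total sensing error P_e = P_f + P_md for threshold lam."""
    pf = Q((lam - mu0) / s0)                       # false alarm, eq. (8)
    pmd = 1.0 - Q(mu1 / s1) - Q((lam - mu1) / s1)  # miss detection, eq. (7)
    return pf + pmd                                # eq. (9)

def lambda_opt_numeric(mu0, s0, mu1, s1, steps=2000):
    """Numerical alternative to the closed-form threshold (11):
    grid-minimize P_e(lambda) between the two conditional means."""
    grid = [mu0 + (mu1 - mu0) * k / steps for k in range(steps + 1)]
    return min(grid, key=lambda lam: total_error(lam, mu0, s0, mu1, s1))
```

Such a numerical check is a convenient sanity test for the closed-form expression, since the grid minimizer and (11) should agree to within the grid resolution.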
In Fig. 3, receiver operating characteristic (ROC) curves for the improved energy detector are illustrated for different numbers of jammers.
Next, we consider an SCAJ radio operating in a five-jammer environment.
Fig. 5 plots the total error probability of SCAJ radio jamming sensing versus the detection threshold.
Fig. 6 plots the total error probability of jamming sensing under the optimized threshold setting.
Cognitive Anti-Jamming Strategy
In Section II, we presented the multiuser bandit model for anti-jamming communication in the swarm network. This problem can be considered an approximation of a contextual MAB: the conventional contextual bandit assumes the existence of a context that influences the action-selection process, so the available strategies vary with the context and the probability distribution of the reward. However, due to the dynamic changes of the aeronautical swarm network topology, such context information is difficult to obtain in practice. Therefore, we focus on the case where no context can be inferred, and model the anti-jamming communication problem as an adversarial bandit in which no stochastic assumption is made and several tactical radios compete against each other. Notably, recent research shows that bandit algorithms tailored to the stochastic model remain useful in non-stochastic adversarial bandit problems [16]. This is a very encouraging result, and we therefore explore cognitive anti-jamming strategies based on stochastic bandit learning algorithms. In the following, we present a selfish doubling trick KL-UCB++ algorithm to cope with this kind of bandit problem.
A. Reward Definition
In the swarm network, a radio shapes its anti-jamming strategy according to the obtained rewards, and the reward function steers the radio's actions toward a given performance metric. When choosing an action in the bandit-learning-based anti-jamming scheme, the SCAJ radio has access to its history of rewards and actions. The radio's objective is to choose a strategy that maximizes the expected reward over a finite time horizon $T$.
Let $C_{i}^{\ast }$ denote the theoretical maximum throughput of radio $i$ over a channel of bandwidth $B_{i}$ with signal-to-noise ratio ${SNR}_{i}$: \begin{equation*} C_{i}^{\ast }=B_{i}log\left ({1+{SNR}_{i} }\right)\tag{12}\end{equation*}
Accounting for imperfect sensing, the achievable throughput of radio $i$ combines the no-false-alarm and miss-detection transmission cases, with respective rates $r_{i}^{(1)}$ and $r_{i}^{(2)}$: \begin{equation*} C_{i}=\left ({1-P_{f} }\right)r_{i}^{\left ({1 }\right)}+P_{md}r_{i}^{\left ({2 }\right)}\tag{13}\end{equation*}
The normalized reward of radio $i$ is then \begin{align*} r_{i}=&\frac {C_{i}}{C_{i}^{\ast }} \\=&\frac {\left ({1-P_{f} }\right)r_{i}^{\left ({1 }\right)}+P_{md}r_{i}^{\left ({2 }\right)}}{B_{i}log\left ({1+{SNR}_{i} }\right)}\tag{14}\end{align*}
This reward definition characterizes a selfish behavior that purely reflects the decentralized and adversarial nature of the problem. Through selfish learning, each tactical radio tries to learn the best configuration for its own gain, regardless of the performance experienced by other radios in the swarm network. Under these circumstances, each radio ignores the existence of other learners. In particular, the accumulated regret of radio $i$ over horizon $T$ is \begin{equation*} R_{i,T}=\sum \nolimits _{t=1}^{T} \left ({r_{i,t}^{\ast }-r_{i,t} }\right)\tag{15}\end{equation*}
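The reward pipeline (12)-(15) is straightforward to compute. The sketch below evaluates the normalized reward and the accumulated regret for one radio; the bandwidth, SNR, rates, and sensing error probabilities are illustrative inputs, not values from the paper, and the capacity logarithm is taken base 2 in the usual Shannon convention.

```python
import math

def normalized_reward(B, snr, pf, pmd, r1, r2):
    """Normalized reward of one radio, eqs. (12)-(14)."""
    c_star = B * math.log2(1.0 + snr)        # theoretical throughput, eq. (12)
    c = (1.0 - pf) * r1 + pmd * r2           # achievable throughput, eq. (13)
    return c / c_star                        # normalized reward, eq. (14)

def cumulative_regret(best_rewards, got_rewards):
    """Accumulated regret, eq. (15): sum of per-round gaps to the
    optimal reward sequence."""
    return sum(b - g for b, g in zip(best_rewards, got_rewards))
```

Since the sensing error probabilities $P_f$ and $P_{md}$ enter (13) directly, any improvement in the IED operating point translates immediately into a larger reward.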
Since an agent in the multiuser MAB model of the aeronautical swarm network cannot obtain a priori information about the state transition probabilities, KL-UCB++-based model-free reinforcement learning is well suited to solving this game through trial-and-error interactions. Accordingly, we introduce the KL-UCB++ algorithm to build a decentralized bandit anti-jamming strategy.
B. KL-UCB++ Algorithm
The KL-UCB++ algorithm is a slight modification of the KL-UCB+ algorithm. We first recall the definition of a bandit problem with $K$ arms. Denoting by $\mu _{a}$ the mean reward of arm $a$, by $\mu ^{\ast }$ the best mean, and by $N_{a}(T)$ the number of times arm $a$ is pulled up to time $T$, the expected regret is \begin{align*} R_{T}=&T\mu ^{\ast }-\mathbb {E}\left [{ \sum \nolimits _{t=1}^{T} r_{t} }\right]=\mathbb {E}\left [{ \sum \nolimits _{t=1}^{T} {\left ({\mu ^{\ast }-\mu _{A_{t}} }\right)} }\right] \\=&\sum \nolimits _{a=1}^{K} \left ({\mu ^{\ast }-\mu _{a} }\right)\mathbb {E}\left [{ N_{a}\left ({T }\right) }\right]\tag{16}\end{align*}
Let $\bar {\mu }_{a}(t)$ denote the empirical mean reward of arm $a$ after $t$ rounds: \begin{equation*} \bar {\mu }_{a}\left ({t }\right)=\frac {1}{N_{a}\left ({t }\right)}\sum \nolimits _{s=1}^{t} Y_{s} {\mathsf 1}_{\left \{{A_{s}=a }\right \}}\tag{17}\end{equation*}
The KL-UCB++ algorithm is described in Algorithm 1, where the exploration function is \begin{equation*} \mathrm {g}\left ({n }\right)={log}_{+}\left ({\frac {T}{Kn}\left ({{log}_{+}^{2}\left ({\frac {T}{Kn} }\right)+1 }\right) }\right)\tag{18}\end{equation*}
Algorithm 1 The KL-UCB++ Algorithm
Input: the horizon $T$ and the number of arms $K$
Pull each arm of $\{1,\ldots,K\}$ once
For $t=K$ to $T-1$:
Compute for each arm $a$ the index $U_{a}(t)=\sup \left \{{q: N_{a}(t)\,kl\left ({\bar {\mu }_{a}(t),q }\right)\le g\left ({N_{a}(t) }\right) }\right \}$
Play $A_{t+1}=\arg \max _{a} U_{a}(t)$ and observe the reward
end
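The index computation in Algorithm 1 can be sketched in Python for Bernoulli rewards: the upper confidence bound is the largest $q$ satisfying $N_a\,kl(\bar\mu_a,q)\le g(N_a)$, found by bisection. The numerical tolerances and iteration count are implementation choices, not part of the algorithm's definition.

```python
import math

def log_plus(x):
    """log_+(x) = max(log(x), 0)."""
    return max(math.log(x), 0.0)

def g(n, T, K):
    """Exploration function of KL-UCB++, eq. (18)."""
    x = T / (K * n)
    return log_plus(x * (log_plus(x) ** 2 + 1.0))

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def klucbpp_index(mu_hat, n, T, K, iters=50):
    """Largest q >= mu_hat with n * kl(mu_hat, q) <= g(n), by bisection."""
    target = g(n, T, K) / n
    lo, hi = mu_hat, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if bernoulli_kl(mu_hat, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo
```

As the pull count $n$ grows, $g(n)/n$ shrinks and the index tightens toward the empirical mean, which is what drives the exploration-exploitation balance.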
The following results state that the kl-UCB++ algorithm is simultaneously minimax- and asymptotically-optimal.
Lemma 1 (Minimax Optimality[9]):
For any family of reward distributions whose variance is bounded by $V$, the regret of kl-UCB++ satisfies \begin{equation*} R_{T}\le 76\sqrt {VKT} +\left ({\mu ^{+}-\mu ^{-} }\right)K\tag{19}\end{equation*} where $\mu ^{+}$ and $\mu ^{-}$ are the upper and lower bounds on the arm means.
Lemma 2 (Asymptotic Optimality[9]):
For any bandit model and any $\delta >0$, the expected number of draws of a suboptimal arm $a$ satisfies \begin{equation*} \mathbb {E}\left [{ N_{a}\left ({T }\right) }\right]\le \frac {log\left ({T }\right)}{kl\left ({\mu _{a}+\delta,\mu ^{\ast }-\delta }\right)}+O\left ({\frac {loglog\left ({T }\right)}{\delta ^{2}} }\right)\tag{20}\end{equation*}
The kl-UCB++ algorithm is simultaneously minimax optimal and asymptotically optimal, but it is not anytime, because the total number of decisions for anti-jamming communication in the ASN is unknown in advance. Hence, it is crucial to devise an anytime variant of kl-UCB++ that does not rely on knowledge of the horizon $T$.
The doubling trick is a well-known idea in online learning, and the key to preserving the regret guarantee is to choose the doubling sequence correctly. Empirically, the term doubling trick usually refers to the geometric sequence $T_{i}=T_{0}\cdot 2^{i}$.
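The restart schedule induced by such a geometric sequence can be sketched directly; the starting horizon $T_0$ below is an illustrative parameter, and the actual base algorithm restarted on each segment (here, kl-UCB++) is left abstract.

```python
def doubling_trick_schedule(T0, total_rounds):
    """Return the (start, horizon) segments of the doubling trick:
    a fresh horizon-dependent algorithm is run on each segment, with
    horizons growing geometrically as T_i = T0 * 2**i, so the combined
    procedure never needs the true total number of rounds in advance."""
    segments, start, horizon = [], 0, T0
    while start < total_rounds:
        segments.append((start, horizon))
        start += horizon
        horizon *= 2          # geometric doubling
    return segments
```

Since segment lengths grow geometrically, only $O(\log T)$ restarts occur before any time $T$, which is why the per-segment regret bounds can be summed without destroying the overall rate.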
C. DT KL-UCB++ Anti-Jamming Strategy
For our decentralized anti-jamming problem, we combine the doubling trick with the kl-UCB++ algorithm to obtain an anytime strategy, referred to as DT kl-UCB++.
In our decentralized anti-jamming framework, some important implications of applying bandit learning to the ASN in practice must be considered. Since each radio learns on its own in a highly dynamic environment, action selection proceeds in an uncoordinated way, and competition exists among the SCAJ radios in addition to that unleashed by the adversarial jammers. Although a radio can sense a channel to detect the presence of jammers before transmitting data on it, distinct radios may still transmit on the same frequency band, leading to intensive collisions that reduce throughput and cannot always guarantee sublinear regret. Therefore, decision-making strategies that guarantee collision-free transmissions in the ASN are desired.
To reduce collisions and speed up convergence, we propose a decentralized anti-jamming strategy with finite coordination, shown schematically in Figure 7. The basic principle is as follows. First, the available frequency channels are obtained through IED jamming sensing and form a channel list, which is assumed to be stored at the cluster heads of the ASN. When a radio has accessed a channel, it feeds the channel occupation information back to the cluster heads. This channel index is then removed from the current list to avoid collisions, and the cluster heads broadcast the updated channel list to the other radios. Finally, this partial coordination reduces the chance of channel collisions. The detailed anti-jamming strategy is illustrated in Algorithm 2.
Algorithm 2 DT KL-UCB++ anti-jamming strategy
Input: KL-UCB++ algorithm, exponential doubling sequence $(T_{i})_{i\in \mathbb {N}}$, channel list
Let $i=0$
for $t=1,2,\ldots$:
if $t>T_{i}$:
Initialize a fresh KL-UCB++ algorithm with horizon $T_{i+1}-T_{i}$
Update $i\leftarrow i+1$
end
Perform IED jamming sensing and update the channel list
Compute an index for each available arm
Choose the action with the highest index
Compute the theoretical throughput using (12)
Compute the reward using (14)
Update the empirical statistics of the chosen arm
end
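The finite-coordination step of Algorithm 2 can be sketched from the cluster head's point of view: the head keeps the list of jammer-free channels produced by IED sensing and withdraws a channel as soon as a radio reports occupying it, so that subsequently broadcast lists steer other radios away from it. Channel identifiers and the class interface below are illustrative placeholders.

```python
class ClusterHead:
    """Minimal sketch of the cluster-head channel-list coordination."""

    def __init__(self, sensed_free_channels):
        # initial list of jammer-free channels from IED jamming sensing
        self.free = list(sensed_free_channels)

    def broadcast_list(self):
        """Channel list broadcast to the other radios in the cluster."""
        return list(self.free)

    def report_occupied(self, channel):
        """A radio feeds back that it has accessed `channel`; cancel it
        from the current list so later radios cannot collide on it."""
        if channel in self.free:
            self.free.remove(channel)
```

For example, if the head starts with channels [1, 2, 3] and a radio reports occupying channel 2, subsequent broadcasts contain only [1, 3], which is exactly the collision-reduction mechanism described above.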
Clearly, the doubling trick kl-UCB++ strategy depends on a non-decreasing, diverging doubling sequence $(T_{i})_{i\in \mathbb {N}}$.
Lemma 3 (Regret Upper Bounds[18]):
For any bandit model and any algorithm $\mathscr {A}$, the regret of its doubling-trick version satisfies \begin{equation*} R_{T}(\mathscr {DT}({\mathscr {A}}(T_{i}))_{i\in \mathbb {N}})\le \mathop \sum \nolimits _{i=0}^{L_{T}} {{R_{T_{i}-T_{i-1}}(\mathscr {A}}_{T_{i}-T_{i-1}})}\tag{21}\end{equation*} where $L_{T}$ denotes the index of the last segment started before time $T$.
Further, it can be observed that Algorithm 2 is a heuristic doubling trick algorithm, in which a fresh instance of the kl-UCB++ algorithm is restarted at the beginning of each doubling segment.
Simulation Results
In this section, we evaluate the performance of the selfish bandit-based cognitive anti-jamming strategy for the aeronautic swarm network through numerical simulations.
In bandit-learning-based strategies, a quantity termed the expected cumulative regret is often used to characterize learning performance; it represents the cumulative difference between the reward of the chosen actions and the maximum expected reward. Accordingly, the objective of the anti-jamming strategy is equivalent to minimizing the expected cumulative regret. We compare the DT-kl-UCB++-based anti-jamming strategy with the UCB and kl-UCB++ strategies. Figure 8 presents the growth of cumulative regret over time for all these anti-jamming strategies, together with the Lai & Robbins lower bound for reference. As expected, the cumulative regret of the DT-kl-UCB++ anti-jamming strategy clearly outperforms that of the UCB and kl-UCB++ strategies.
Figure 9 compares the aggregate average throughput achieved by the UCB, kl-UCB++, and DT-kl-UCB++ strategies. In the figure, we find that the UCB and kl-UCB++ strategies perform slightly worse than the anytime DT-kl-UCB++ strategy.
The probability of selecting the optimal action is shown in Figure 10 for the different strategies. Similarly, it can be observed that the proposed DT kl-UCB++ strategy selects the optimal action more often than the other, non-anytime strategies.
Conclusion
This paper has investigated the potential and feasibility of applying a decentralized selfish bandit anti-jamming strategy to the aeronautic swarm network. We analyzed the main characteristics of the ASN in an electromagnetic spectrum warfare scenario and established an adversarial multiuser multi-armed bandit model. We then proposed a doubling trick kl-UCB++ bandit-based multidomain anti-jamming strategy to cope with this model. We highlighted practical issues, such as accurate reward generation from jamming sensing results and anytime kl-UCB++ algorithm design, that arise when applying bandit learning methods to the ASN. Our studies show that the proposed multidomain anti-jamming strategy achieves higher average throughput and lower cumulative regret than state-of-the-art bandit learning strategies, even though each radio performs anti-jamming selfishly and has no knowledge of the number of players, which is appealing for engineering implementation in dynamic ASN scenarios. The presented DT-kl-UCB++ bandit strategy is, however, only a heuristic and lacks a systematic theoretical proof, which is left for future work.