Anti-Jamming RIS Communications Using DQN-Based Algorithm

Wireless sensor networks (WSNs) are widely applied in battlefield surveillance, where the data collected (e.g., target tracking or contention zone observation) by local sensor nodes needs to be sent to military bases for tactical decisions. Since data transmissions are susceptible to malicious attacks, adversarial jammers can block their victims' communications by transmitting interfering signals over legitimate transmissions. Nowadays, owing to its ability to reconfigure the wireless propagation medium, the reconfigurable intelligent surface (RIS) is regarded as an effective tool to enhance transmission performance, especially in the jamming context. This paper considers an anti-jamming tactical communication scenario for a solar-powered RIS network, in which the RIS is used to improve the uplink transmission performance between a wireless device (WD) and a base station (BS). We investigate the long-term anti-jamming communications of the WD powered by a solar energy harvester. Our objective is to jointly assign the optimal amount of transmission energy and the RIS phase shifts to maximize the data rate of the system in the long run. To this end, we formulate the anti-jamming communication optimization problem as a Markov decision process (MDP) and then design a deep Q network (DQN)-based algorithm to generate an optimal policy. As a result, the optimal resource allocation is achieved through trial-and-error interactions with the environment by observing the predefined rewards and the network state transitions. Python simulation results, obtained by averaging over 10^4 time slots, show that the proposed algorithm is not only able to learn from the environment, but also yields better performance than baseline schemes under network changes. Moreover, the performance of RIS communication schemes is verified to be superior to that of without-RIS communication schemes in the jamming context.


I. INTRODUCTION
With the tremendous growth in data services for wireless communications (e.g., music, video, and games), demands for wireless resources have also grown explosively in recent years. However, due to constrained wireless resources and increasing power consumption, conventional networks might not satisfy these services. Therefore, both the academic and industrial communities are paying more attention to developing efficient resource management schemes that improve spectrum utilization and energy conservation for the emerging wireless devices in the mobile Internet [1]. To this end, energy harvesting technology plays a crucial role: it allows a device to harvest energy from ambient sources and is considered a potential solution to attain sustainable operation in dense cellular networks [2].
Nowadays, WSNs are regarded as a crucial underlying framework for realizing many internet of things (IoT) applications, such as health monitoring [3], traffic monitoring [4], military target tracking [5], and underwater communications [6]. However, WSNs are highly vulnerable to artificial malicious jamming owing to the open nature of the wireless transmission medium and their simple network architecture. For instance, jammers inject interfering electromagnetic signals into wireless channels to suppress the legitimate communications of sensor nodes [7]. Moreover, they are capable of tampering with data or masquerading as normal sensor nodes to transmit interfering false data [8]. Given these issues, investigating countermeasures against the disruptive threats posed by jammers is imperative and challenging in WSNs.

A. CONSIDERED JAMMING CONTEXT
Along with the proliferation of WSNs in civil applications, military sensor networks have received growing attention in battlefield surveillance applications, such as attack reconnaissance, battle damage assessment, and so on [5], [9]. In this paper, we mainly focus on the anti-jamming military battlefield context, where reliable data transmission is paramount for defending against interfering attacks. In the military, sensor nodes are deployed heterogeneously on the battlefield to gather data for combat purposes. The sensor nodes collect data from combat zones and frequently send it to the main fusion center for tactical decisions. In this article, we consider the uplink transmission between a sensor node and a base station (i.e., the main fusion center), where many jamming attacks are conducted by the adversary. The considered jamming attacks use high-power noise signals to disrupt legitimate communications. Apart from jamming attacks, sensor nodes generally suffer from short lifetimes due to limited-capacity batteries. Consequently, they cannot guarantee Quality of Service (QoS) for their critical transmission missions. Energy harvesting has become a promising solution in military sensor networks for prolonging the battery lifespan of sensor nodes [10]. Therefore, it is imperative to design optimal countermeasures against jamming attacks in energy harvesting-powered military sensor networks.

B. RELATED WORKS
To deal with jamming issues, several spread spectrum technologies have been studied, e.g., direct sequence spread spectrum (DSSS) [11] and frequency-hopping spread spectrum (FHSS) [12]. On the one hand, however, they have limitations such as relying heavily on a local pseudo-random code or a predetermined frequency-hopping pattern. On the other hand, new trends such as the diversity and dynamics of intelligent jammers impose higher requirements on anti-jamming communication strategies. Recently, many studies have been conducted on cooperative networks using trusted relays to improve physical layer security [13]-[15]. There have also been efforts to design optimal strategies for anti-jamming communications [16]-[18]. The authors in [16] studied a dynamic channel allocation mechanism to resolve spectrum conflicts and reduce the impact of various jamming attacks. However, the wireless channels are modeled simplistically without considering the channel fading of wireless devices, lowering the feasibility of the proposed approaches. A spectrum assignment and routing scheme was developed to address jamming issues in a multi-hop scenario in [17], while the authors in [18] modeled the jamming attack with a game theory-based method to define the optimal strategies, assuming complete information about the adversary. Nevertheless, these works, [16]-[18], did not consider the energy harvesting context.
To cope with the uncertainty of jamming behaviors and power levels, some reinforcement learning (RL) algorithms were proposed to acquire the optimal jamming-resistance strategy in wireless networks [19], [20]. Although the effectiveness of the foregoing schemes was validated [13]-[15], [19], [20], deploying many active relays may increase the hardware cost and complexity of the networks. Nowadays, thanks to its ability to reconfigure the wireless transmission medium and thereby enhance system capacity, the reconfigurable intelligent surface (RIS) has received increasing attention [21]. An RIS is a planar array consisting of passive reflecting elements that adaptively control electromagnetic waves; in particular, it can strengthen or weaken the signals received by different wireless users. Consequently, the RIS has been broadly investigated for wireless security performance optimization [22]-[28]. However, in [22]-[25], the authors only pay attention to security enhancement for RIS-assisted communication systems in the presence of eavesdroppers, jointly determining the optimal beamforming and RIS phase shifts.
To the best of the authors' knowledge, only a few works focus on anti-jamming strategies for RIS-assisted communications [26]-[28]. An RIS carried by an unmanned aerial vehicle (UAV) is used to simultaneously mitigate jamming signals and enhance legitimate signals in [26]. However, employing an assisting UAV may impose a high cost, and the trade-off between the achievable throughput and the additional overhead was not considered. The authors in [27] designed an RL-based anti-jamming scheme by jointly deriving the optimal power allocation and reflecting beamforming, while an RIS-assisted approach against both jamming and eavesdropping attacks was proposed to maximize the achievable rate in [28]. Nevertheless, the studies in [26]-[28] assumed the jamming channel states are known, which is unrealistic because the jammer is an external user whose channel information is hard to obtain. Furthermore, none of these existing works considers the sustainable operation of wireless devices using energy harvesting or the impact of jamming sensing errors on network performance. Thus, this paper studies solar-powered RIS communications under jamming attacks, where a wireless device capable of harvesting solar energy aims to achieve efficient data transmissions against external jamming. On the one hand, we consider the impact of jamming sensing errors when the legitimate user senses the presence of jamming signals in its communication region. On the other hand, we employ the deep reinforcement learning (DRL) framework to address the large state and action spaces of the considered optimization problem, which are challenging for traditional RL approaches. We summarize the main contributions of this work as follows: • We first investigate the RIS-assisted military wireless communication system in the presence of jamming attacks.
In contrast to [26]-[28], where the illegitimate channel state information is assumed to be known or predicted, we adopt spectrum sensing with energy detection to detect the active/inactive state of the jammer. Although this sensing method does not require illegitimate channel knowledge of the third party, the system may experience capacity degradation due to false alarms and misdetections. Therefore, we propose a scheme that obtains intelligent resource allocation under dynamic jamming and imperfect sensing.
• We further apply solar energy harvesting to the wireless device to attain sustainable operation in the battlefield deployment. The constraints of limited harvested energy, battery capacity, and environment dynamics are taken into account to develop a secure RIS transmission strategy against jamming attacks.
• Next, the joint transmission energy allocation and RIS phase-shift configuration problem for long-term data rate maximization against jamming attacks is formulated as an MDP. Subsequently, a DRL-based resource allocation scheme is proposed to obtain an optimal solution to the MDP problem. Thereby, the legitimate user can interact directly with the environment to learn its dynamics and the jammer's behavior, thus obtaining the optimal transmission energy at the wireless device and the phase shifts of the RIS via trial-and-error training.
• Lastly, the proposed scheme is verified through simulation to work efficiently under dynamic changes of the environment and imperfect sensing. The effectiveness of our algorithm is assessed in comparison with conventional existing approaches, in which both long-term and short-term optimization are considered. Moreover, algorithms without RIS-assisted communication are also evaluated to validate the benefits of employing the RIS for improving the performance of wireless communications under jamming attacks.
The remainder of the work is organized as follows. Section II elaborates the system description. Next, Section III presents the DRL-based resource allocation scheme. We discuss the numerical simulation results in Section IV and then conclude the study in Section V.

II. SYSTEM DESCRIPTION
We consider an RIS-assisted battlefield uplink transmission network composed of a wireless device (WD), a base station (BS), a reconfigurable intelligent surface, and a malicious jammer (JM). The WD attempts to transmit data collected in the observed battlefield zone to the BS, while the JM transmits artificial noise signals to block the transmissions between the WD and the BS, as shown in Fig. 1. This jamming scenario usually occurs when one side attempts to combat its opposing side in warfare. In the network, the WD has a single antenna, while the BS is equipped with M antennas. The RIS has K reflecting elements and can thus adjust its phase shifts to reconfigure the wireless channel conditions between the WD and the BS. The system operation time is divided into T time slots of duration τ, and each slot is indexed by t ∈ T = {1, 2, . . . , T}.

A. TRANSMISSION MODEL
By leveraging the RIS's aid, the WD transmits its signal to the BS through two wireless links: a direct link and a reflected RIS link. The direct path between the WD and the BS is assumed to be a non-LOS (NLOS) channel, given by h_DB = \sqrt{g_DB} [h_{D,1}, . . . , h_{D,M}]^T ∈ C^{M×1}, where g_DB = d_DB^{−α_L} is the path loss between the WD and the BS at distance d_DB with path loss exponent α_L, and h_{D,m} denotes the small-scale fading between the WD and antenna m of the BS. The reflected RIS path consists of two links: an NLOS link between the WD and the RIS and an LOS link between the RIS and the BS. The NLOS channel between the WD and the RIS is denoted by h_DR = \sqrt{g_DR} [h_{R,1}, . . . , h_{R,K}]^T ∈ C^{K×1}, where g_DR is the path loss between the WD and the RIS and h_{R,k} is a complex Gaussian random variable. Let G_RB = \sqrt{g_RB} [h_{mk}], m ∈ [1, M], k ∈ [1, K], G_RB ∈ C^{M×K}, represent the channel between the RIS and the BS, where g_RB = 1/(4π d_RB^2) represents the free-space path loss attenuation between the RIS and the BS at distance d_RB, and h_{mk} is the LOS channel state between antenna m of the BS and element k of the RIS. We assume that h_DB, h_DR, and G_RB are quasi-static and perfectly known at the WD via pilot signal transmissions and feedback. Hence, the WD can obtain this information at the beginning of each time slot to make the transmission decision. When the RIS receives the signal transmitted by the WD, it reflects the signal to the BS with the phase-shift matrix Φ = diag(e^{jφ_1}, e^{jφ_2}, . . . , e^{jφ_K}), where φ_k ∈ F_ris is the phase shift of element k, F_ris is the set of 2^b candidate phase levels, and b represents the resolution of the phase shifters of the RIS elements [29]. The signal received at the BS is expressed as y = \sqrt{p_t} (h_DB + G_RB Φ h_DR) x + n, where p_t = e_tr/(τ − t_s) is the transmission power of the WD, e_tr represents the transmission energy, and x denotes the information symbol of the WD; n = [n_1, n_2, . . . , n_M]^T is the noise vector whose elements are zero-mean complex white Gaussian noise with variance σ², i.e., n_m ∼ CN(0, σ²).
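As a concrete sketch, the received-signal model above can be simulated in a few lines of NumPy. The path-loss magnitudes, noise power, and transmit power below are illustrative placeholders rather than the paper's values; M, K, and the discrete phase codebook follow the description in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 8, 4                      # BS antennas, RIS elements (Sec. IV values)
b = 2                            # assumed phase-shifter resolution in bits
sigma2 = 1e-9                    # noise power (illustrative)

# Path-loss terms (illustrative magnitudes) and small-scale fading
g_DB, g_DR, g_RB = 1e-6, 1e-5, 1e-4
h_DB = np.sqrt(g_DB) * (rng.standard_normal(M) + 1j * rng.standard_normal(M)) / np.sqrt(2)
h_DR = np.sqrt(g_DR) * (rng.standard_normal(K) + 1j * rng.standard_normal(K)) / np.sqrt(2)
G_RB = np.sqrt(g_RB) * (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2)

# Discrete phase-shift codebook with 2^b levels (assumed uniform spacing)
F_ris = 2 * np.pi * np.arange(2 ** b) / (2 ** b)
phi = rng.choice(F_ris, size=K)          # one candidate configuration
Phi = np.diag(np.exp(1j * phi))

p_t = 1e-3                               # transmission power e_tr / (tau - t_s)
# Effective channel and received SNR at the BS
h_eff = h_DB + G_RB @ Phi @ h_DR
snr = p_t * np.linalg.norm(h_eff) ** 2 / sigma2
rate = np.log2(1 + snr)                  # per-Hz rate when the jammer is inactive
```

In a full search, the WD would evaluate `rate` over all (2^b)^K phase configurations; the DQN approach below sidesteps that enumeration.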
The signal-to-noise ratio (SNR) received at the BS is given by γ = p_t ||h_DB + G_RB Φ h_DR||² / σ². In this paper, we model the jamming behavior of the JM on the channel as a two-state discrete-time Markov chain. Particularly, the state of the JM in a time slot is denoted as ''A'' (active) or ''I'' (inactive), and the state transition probability between two adjacent time slots of the JM is represented as P_ij, i, j ∈ {A, I}. We further assume that the JM has enough energy for its jamming operation, such that data received at the BS cannot be successfully decoded due to the high-power interference injected by the JM once it jams the channel (i.e., the JM is active). Thus, the achievable data rate of the system in (b/s/Hz) is calculated as r = κ ((τ − t_s)/τ) log₂(1 + γ) (3), where κ = 1 if the JM is inactive and κ = 0 otherwise is the jamming indicator, and t_s represents the sensing duration. In this article, the WD is assumed to have a battery with limited capacity E_bat, and it is equipped with an energy harvester to collect solar energy from the environment for long-term operation. We assume that the harvested energy of the WD during time slot t, e_h(t), is limited, with 0 < e_h(t) < E_bat, and that it follows a Poisson distribution with a given mean harvested energy [30]. The operation of the system within a time slot can be described as follows. At the start of a time slot, the WD performs spectrum sensing to detect the presence of the jammer on the channel. If the sensing result indicates that the JM is ''inactive'', the WD transmits data to the BS; otherwise, it stays silent to avoid the interference of the JM. In this paper, we take the sensing errors of the WD into account. The sensing result, θ(t) ∈ {A, I}, indicates the state (i.e., active or inactive) of the JM in time slot t. Nevertheless, sensing errors are inevitable owing to the imperfection of the sensing engine.
Two crucial metrics are generally considered: the false alarm probability, P_f = Pr(θ(t) = A | I), and the misdetection probability, P_m = Pr(θ(t) = I | A). The former is the probability that the state of the jammer is sensed as ''active'' when it is actually ''inactive''; the latter is the probability that the state of the JM is sensed as ''inactive'' when it is actually ''active''. Transmission performance can be significantly degraded by these sensing errors. Hence, designing a jamming-resistant communication strategy for the RIS-assisted network under imperfect spectrum sensing is essential.
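The two-state jammer model and the imperfect sensing engine can be sketched as follows, using the transition and error probabilities later given in Section IV (P_II = 0.8, P_AI = 0.2, P_f = 0.1, P_m = 0.9). With these values the chain's stationary activity is (1 − P_II)/((1 − P_II) + P_AI) = 0.5.

```python
import numpy as np

rng = np.random.default_rng(1)
P_II, P_AI = 0.8, 0.2     # Pr(I -> I), Pr(A -> I), as in Sec. IV
P_f, P_m = 0.1, 0.9       # false-alarm and misdetection probabilities

def step_jammer(state):
    """Advance the two-state Markov chain one time slot ('I' or 'A')."""
    if state == 'I':
        return 'I' if rng.random() < P_II else 'A'
    return 'I' if rng.random() < P_AI else 'A'

def sense(state):
    """Imperfect sensing: flips the true state with prob. P_f or P_m."""
    if state == 'I':
        return 'A' if rng.random() < P_f else 'I'   # false alarm
    return 'I' if rng.random() < P_m else 'A'       # misdetection

state, active = 'I', 0
T = 100_000
for _ in range(T):
    state = step_jammer(state)
    active += (state == 'A')
# Empirical activity should be near the stationary value of 0.5
```

Note that with P_m = 0.9, most active slots are sensed as ''inactive'', which is exactly why the proposed scheme must track the jamming probability in its state.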

B. PROBLEM FORMULATION
Although maximizing the transmission power at the WD can optimize the immediate reward, it may lead to energy shortages in upcoming time slots due to the limited-capacity battery and stochastic harvested energy. Furthermore, the accumulated data rate in the long run can also be affected significantly by channel fading, misdetections, and false alarms. We study the problem of jointly determining the transmission energy at the WD and the phase shifts of the RIS for uplink transmissions, with the objective of maximizing the long-term system rate:

max_{e_tr(t), Φ(t)}  E[ Σ_{t=1}^{T} r(t) ]
s.t. (a) 0 ≤ e_tr(t) ≤ e_max,
     (b) r(t) ≥ r_min,
     (c) φ_k(t) ∈ F_ris, ∀k ∈ {1, . . . , K},     (4)

where e_tr(t) denotes the transmission energy at the WD, e_max is the maximum transmission energy at the WD, and Φ(t) is the vector of phase shifts of the RIS. Constraint (a) ensures that the transmission energy assigned to the WD does not exceed the maximum level; constraint (b) requires the received instantaneous rate to meet the minimum data rate requirement, r_min; and constraint (c) imposes that the phase shift of each RIS element be chosen from the set of phase-shift candidates, F_ris.
It is challenging to obtain the solution of the non-trivial optimization problem (4) directly because of the network constraints of limited battery capacity, time-varying energy resources, and the jammer's dynamic behavior. Exhaustive search could in principle acquire the optimal solution using classical mathematical tools and sophisticated formulations; however, it is infeasible for such a high-dimensional optimization problem. A partially observable Markov decision process, considered one feasible approach, can be adopted to achieve the optimal solution using value iteration-based programming. However, the input state values are generally discretized and the environment knowledge is assumed to be known in advance, making implementation impractical. On the other hand, conventional RL algorithms, such as Q-learning and actor-critic, can provide sub-optimal solutions that cope with the unknown prior knowledge of the network by optimizing the Q-value function through trial-and-error training. Unfortunately, these RL approaches cannot effectively handle such an enormous problem dimension.
Unlike most of the above-mentioned schemes, which generally face difficulties with prior knowledge, continuous state spaces, and high-dimensional optimization problems, DRL is considered one of the most promising tools.
The key idea is to use a deep neural network to approximate the value function by continuously adjusting its parameters through training. Therefore, in this paper, rather than directly solving the challenging optimization problem mathematically, we reformulate the long-term data rate optimization problem (4) as an MDP and then obtain the optimal feasible e_tr and Φ using the DRL-based method. By leveraging the RL principle, the agent (i.e., the WD) can continuously interact with the environment to learn the channel conditions, energy harvesting pattern, and jammer behavior, arriving at the optimal strategy over time without prior knowledge of the environment. The formulation of the MDP framework and the proposed DRL scheme are elaborated in the following section.

III. DEEP Q NETWORK FRAMEWORK
In this section, we reformulate problem (4) as an MDP. Subsequently, we develop a DQN-based resource allocation scheme to obtain the solution to the MDP problem when the energy harvesting distribution is unknown in advance. By leveraging the advancements of reinforcement learning, the proposed scheme can directly learn the optimal policy by interacting with the environment. Furthermore, it can also cope with the large dimensional space of the optimization problem, which is hard to tackle with traditional RL approaches.

A. FRAMEWORK OF MARKOV DECISION PROCESS
We first define the state space, action space, and reward of the MDP problem as follows. Let S be the state space, in which each state comprises the jamming probability of the jammer on the channel (i.e., the probability that the JM is active), ρ(t), the energy level in the battery of the WD, e_re(t), and the channel information, h(t):

s(t) = (ρ(t), e_re(t), h(t)) ∈ S.

The action with respect to state s(t) is denoted by a(t), and the action space of the network is A = {a_1, a_2, . . .}, where a_i(t) = (e_tr(t), Φ(t)) is an action comprising the amount of assigned transmission energy of the WD and the phase shifts of the RIS. The system receives an immediate reward after executing action a(t) in state s(t), which is calculated by equation (3), i.e., R(t) = r(t). The operation of the WD can be described as follows. At the beginning of a time slot, the WD observes the current state of the network, s(t). It then selects and executes an action, a(t), by which the WD transmits data to the BS with the assigned amount of transmission energy and phase shifts. Subsequently, the BS receives the transmission, and at the end of the time slot it feeds back an ACK/NACK message indicating whether the data was successfully decoded, which is used to update the state s(t + 1). It is worth noting that data received by the BS is unsuccessfully decoded whenever the JM is active, owing to the assumption of high-power noise injected by the JM into the legitimate transmission. In the following, we elaborate the process of updating the network state based on the possible observations. Observation 1 (O_1): The sensing result indicates state ''A'' of the JM; the WD trusts the result and stays silent. Thus, there is no reward in this case, i.e., R(s(t), a(t) | O_1) = 0. Since the true state remains unconfirmed, the jamming probability for the next time slot is updated by propagating the belief through the Markov chain as

ρ(t + 1) = ρ(t) (1 − P_AI) + (1 − ρ(t)) (1 − P_II),

where P_II and P_AI respectively represent the transition probabilities from state I to state I, and from state A to state I, of the JM between two adjacent time slots.
The energy level of the WD can be calculated by

e_re(t + 1) = e_re(t) − e_s(t) + e_h(t),

where e_s(t) denotes the sensing energy of the WD. Observation 2 (O_2): The WD obtains state ''I'' from the sensing engine, transmits data to the BS, and finally receives an ACK message from the BS at the end of the time slot. In this case, the reward is R(s(t), a(t) | O_2) = r(t). Since the ACK confirms that the JM was inactive, the jamming probability in the next time slot is ρ(t + 1) = 1 − P_II. We update the energy level of the WD as e_re(t + 1) = e_re(t) − e_s(t) − e_tr(t) + e_h(t). (11)
Observation 3 (O_3): The sensing result shows state ''I'' of the JM, the WD transmits data to the BS, and it finally receives a NACK message from the BS at the end of the time slot. A misdetection is thus confirmed, and no reward is obtained, i.e., R(s(t), a(t) | O_3) = 0. Since the NACK confirms that the JM was active, the jamming probability is updated as ρ(t + 1) = 1 − P_AI. The energy level of the WD is calculated in the same way as Eq. (11).
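The belief and battery updates for the three observations can be collected into two small helper functions. The unconfirmed-state update below is a reconstruction consistent with the Markov-chain model (the paper's exact expression is not shown in the extracted text), and capping the battery at E_bat is an assumption, since the quasi-static update in Eq. (11) omits the cap.

```python
def rho_update_unconfirmed(rho, P_II=0.8, P_AI=0.2):
    """O1: the WD stays silent, so no ACK/NACK confirms the JM's state.
    Propagate the activity belief one step through the two-state chain
    (reconstructed form): rho(t+1) = rho*(1-P_AI) + (1-rho)*(1-P_II)."""
    return rho * (1 - P_AI) + (1 - rho) * (1 - P_II)

def battery_update(e_re, e_s, e_tr, e_h, E_bat=30.0):
    """Residual battery energy in uJ; the cap at E_bat is an assumption."""
    return min(e_re - e_s - e_tr + e_h, E_bat)

# O2 (ACK): the JM was confirmed inactive, so rho(t+1) = 1 - P_II = 0.2
# O3 (NACK): the JM was confirmed active, so rho(t+1) = 1 - P_AI = 0.8
print(rho_update_unconfirmed(0.5))   # 0.5*0.8 + 0.5*0.2 = 0.5
```

With the symmetric Section IV values (P_II = 0.8, P_AI = 0.2), an unconfirmed belief of 0.5 is a fixed point, which matches the chain's stationary activity.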
In this work, our target is to maximize the long-term data rate of the solar-powered network under jamming attacks from the current state; that is, the impact of the current action on future rewards is taken into consideration owing to the uncertainty of the environment. We define the state-action value function (or Q-value function) as the expected accumulated system reward for state s and action a:

Q_ψ(s, a) = E[ Σ_{k=0}^{∞} α^k R(t + k) | s(t) = s, a(t) = a ],

where E[·] is the expectation operator and α is the discount rate. The Q-value function is a metric to assess the effect of the action choice on the expected future reward obtained through the training process under policy ψ. The Q-value function satisfies the following Bellman equation:

Q_ψ(s, a) = R(s, a) + α Σ_{s'} P^a_{ss'} Q_ψ(s', a'),

where s' and a' are the next state and next action, respectively; R(s, a) is the instant reward received by performing action a in state s; and P^a_{ss'} represents the state transition probability of the network from state s to state s' when action a is performed. Based on the Q-learning principle, the optimal Q-value function associated with the optimal policy ψ* is expressed as

Q*(s, a) = R(s, a) + α Σ_{s'} P^a_{ss'} max_{a'} Q*(s', a'). (15)

The optimal Q-value function in Eq. (15) can be obtained by a recursive rule without exact knowledge of the environment dynamics and state transition models, which are highly dependent on each action. To this end, the state-action value function is iteratively updated with learning rate λ ∈ (0, 1) as follows:

Q(s, a) ← Q(s, a) + λ [ R + α max_{a'} Q(s', a') − Q(s, a) ]. (16)

Q(s, a) will converge to the optimal value through the training process if it is continuously updated at every time slot under a proper configuration. Nevertheless, the original Q-learning algorithm might yield wide variances in function approximation, resulting in a locally optimal policy. The problem becomes especially challenging as the size of the problem increases [31]. To address the large dimensional issues, we adopt a deep reinforcement learning mechanism in which a neural network is used to approximate the Q-value function.
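The tabular update rule of Eq. (16) is a one-liner; a minimal sketch with illustrative states and actions:

```python
from collections import defaultdict

alpha_disc = 0.9   # discount rate (alpha in the paper)
lam = 0.1          # learning rate (lambda), illustrative value
Q = defaultdict(float)

def q_update(s, a, reward, s_next, actions):
    """One tabular Q-learning step, Eq. (16):
    Q(s,a) <- Q(s,a) + lam * (R + alpha * max_a' Q(s',a') - Q(s,a))."""
    target = reward + alpha_disc * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += lam * (target - Q[(s, a)])

actions = [0, 1]
q_update('s0', 0, 1.0, 's1', actions)
print(round(Q[('s0', 0)], 3))   # 0.1 * (1.0 + 0.9*0 - 0) = 0.1
```

The table `Q` grows with the number of distinct (state, action) pairs, which is precisely what becomes intractable here: the continuous belief ρ(t), battery level, and channel state make the DQN approximation of the next subsection necessary.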
More particularly, we propose a DRL scheme using a neural network with weights w to approximate the Q-value function, denoted Q(s, a, w), so that Q(s, a) ≈ Q(s, a, w). Consequently, the proposed scheme can work efficiently in large-dimension networks. The optimal policy is explored using an ε-greedy policy for the action choice in each step, where an action is selected randomly with probability ε or extracted from the current policy with probability 1 − ε. The proposed DQN-based scheme is presented in more detail in the following subsection.

B. THE PROPOSED DQN SOLUTION
In this subsection, we present the proposed DQN scheme to obtain the solution of the MDP problem, in which a feed-forward neural network (FNN), named the Q-network, is adopted to approximate the Q-value function for each state. The FNN is composed of an input layer, multiple hidden layers, and an output layer, as illustrated in Fig. 2, where the input is the system state s and the output is the set of Q-values of the state-action pairs. The input layer contains M + 2 neurons, corresponding to the number of elements in each state. Each hidden layer is fully connected and uses the rectified linear unit (ReLU) as a nonlinear activation function; the output of hidden layer l can be expressed as

z_{l+1} = ReLU(w_l z_l + u_l),

where w_l and u_l denote the weights and biases, respectively. The output layer connects to the last hidden layer to estimate the Q-value of each state-action pair; it has size |A| and uses a linear activation function. We define the loss function as the mean square error between the current Q-value and the target Q-value:

L(w) = E[ (R + α max_{a'} Q(s', a', w) − Q(s, a, w))² ],

where R + α max_{a'} Q(s', a', w) refers to the target value.
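A minimal NumPy sketch of the Q-network forward pass described above: two ReLU hidden layers of 24 units each and a linear output of |A| Q-values. The layer sizes follow Section IV (M = 8, so M + 2 = 10 inputs); the weight initialization scale and the action-space size of 16 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
state_dim, n_actions, hidden = 10, 16, 24   # M + 2 = 10 inputs for M = 8

# Randomly initialised weights and biases for the two-hidden-layer sketch
W1, u1 = rng.standard_normal((hidden, state_dim)) * 0.1, np.zeros(hidden)
W2, u2 = rng.standard_normal((hidden, hidden)) * 0.1, np.zeros(hidden)
W3, u3 = rng.standard_normal((n_actions, hidden)) * 0.1, np.zeros(n_actions)

def q_forward(s):
    """Forward pass: ReLU hidden layers, linear output of |A| Q-values."""
    z1 = np.maximum(0.0, W1 @ s + u1)
    z2 = np.maximum(0.0, W2 @ z1 + u2)
    return W3 @ z2 + u3          # one Q-value per action

q = q_forward(rng.standard_normal(state_dim))
a_star = int(np.argmax(q))       # greedy action index
```

Producing all |A| Q-values in one forward pass is what makes the max over a' in the loss cheap to evaluate, compared with scoring each action separately.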
By training the FNN, the network parameters are adjusted to minimize the loss function. However, a crucial issue in Q-value function approximation with a neural network is that the states are strongly correlated in time, since they are all extracted from the same episode; this reduces the randomness of the training data, resulting in high oscillation and a low convergence rate during training. To improve the network performance, we adopt two well-known methods: experience replay [32] and a fixed target network [31]. With experience replay, each sample (s, a, R, s') is stored in a data structure called the replay buffer, D, from which a random mini-batch is drawn for training the Q-network in each time slot. By doing this, rather than updating the parameters from only the last transition, the DNN is updated from a batch of transitions randomly sampled from the replay buffer.
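A replay buffer of the kind described, with the capacity (300) and mini-batch size (30) used later in Section IV, might look like:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size experience replay storing (s, a, R, s') transitions."""
    def __init__(self, capacity=300):
        # deque with maxlen silently evicts the oldest sample when full
        self.buf = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, batch_size=30):
        # Uniform mini-batch; breaks temporal correlation between samples
        return random.sample(self.buf, min(batch_size, len(self.buf)))

D = ReplayBuffer(capacity=300)
for t in range(500):
    D.store(t, t % 4, 0.0, t + 1)   # illustrative dummy transitions
batch = D.sample(30)
```

The `min(batch_size, len(self.buf))` guard lets training begin before the buffer fills, one common design choice; another is to wait until at least one full mini-batch has been collected.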
On the other hand, with the fixed target network method, the target value is computed by another neural network with weights ŵ, which are kept unchanged for a finite number of time slots. Consequently, using the fixed target network, the target value becomes

y = R + α max_{a'} Q(s', a', ŵ).

The target network has the same structure as the Q-network, and its parameters are periodically replaced by those of the Q-network. We update the weight vector w to minimize the loss function by stochastic gradient descent as

w ← w + λ δ ∇_w Q(s, a, w),

where λ ∈ (0, 1) is the learning rate and the temporal difference (TD) error, δ, between the target value and the current Q-value is computed by

δ = R + α max_{a'} Q(s', a', ŵ) − Q(s, a, w).

Fig. 2 illustrates the flowchart of the proposed DQN-based scheme. With this method, the agent selects an action at the beginning of each time slot based on the ε-greedy policy, where ε denotes the exploration rate. After the WD transmits data to the BS, the system receives a reward and moves to the next state. The data sample, (s, a, R, s'), is stored in the replay buffer. Subsequently, a mini-batch is selected from the replay buffer to update the parameters of the deep Q-network and, periodically, the target network. In each step, the exploration rate decays with decay rate η, and the training process is repeated until convergence. Algorithm 1 outlines the training procedure of the proposed DQN-based scheme: initialize the replay memory D, the Q-network weights w, and the target network weights ŵ ← w; select an initial state s ∈ S; then, in each time slot, determine ε = max(εη, ε_min), select an action via the ε-greedy policy, execute it, observe the reward and the next state s(t + 1), store the sample of the transition in memory D, randomly select a mini-batch from memory D, compute the estimated value Q(s_i, a_i, w) and the target value for each sample i in the mini-batch, perform a gradient step on the loss, and periodically update the target network parameters, ŵ ← w.
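The ε-greedy selection and the per-step decay ε ← max(εη, ε_min) used in Algorithm 1 can be sketched as follows, with η and ε_min taken from Section IV:

```python
import random

eps, eta, eps_min = 1.0, 0.9999, 0.01   # initial rate, decay, floor (Sec. IV)

def select_action(q_values, eps):
    """epsilon-greedy: explore with prob. eps, otherwise act greedily."""
    if random.random() < eps:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

for _ in range(1000):
    eps = max(eps * eta, eps_min)       # per-step decay from Algorithm 1
print(round(eps, 3))                    # ~0.905 after 1000 steps
```

With η = 0.9999, ε halves only after roughly 6,900 steps, so exploration remains substantial across the 10^3-slot episodes, which is consistent with the slow early learning observed in Fig. 4.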

IV. NUMERICAL RESULTS AND DISCUSSIONS
In this section, we present numerical simulation results to assess the effectiveness of the proposed scheme compared to other baseline schemes. The results are obtained using Python 3.8 with the Anaconda 2021 distribution. The comparisons were made against three baseline schemes: the classical DQN scheme [33], the Myopic scheme [34], and the Random scheme. In the classical DQN scheme, the agent adopts the deep Q-learning method without considering the sensing error metrics of false alarm and misdetection. The term ''myopic'' in the Myopic scheme implies that the system cares more about the immediate reward than the future reward; in particular, the agent in the Myopic scheme selects actions by optimizing the immediate reward. The policy in the Random scheme is made in a random fashion.
We set the distances between the WD and the BS, the WD and the RIS, and the RIS and the BS to 28, 21, and 15 meters, respectively. The number of antennas at the BS is M = 8 and the number of elements in the RIS is K = 4. The battery of the WD has a capacity of E_bat = 30 µJ. The mean value of the harvested energy is 7 µJ. The maximum transmission energy is set to 14 µJ. The state transition probabilities of the JM between two adjacent time slots are P_II = 0.8 and P_AI = 0.2. The false alarm and misdetection probabilities are P_f = 0.1 and P_m = 0.9, respectively. We set the discount rate to α = 0.9 and the learning rate to λ = 0.001. Four layers are designed in the neural network: an input layer, two hidden layers with 24 nodes each, and an output layer. We use the ReLU function and the linear function as the activation functions of the hidden layers and the output layer of the DQN, respectively. Furthermore, we adopt the adaptive Adam optimization algorithm to periodically update the weights of the Q-network. The sizes of the replay memory and the mini-batch were set to 300 and 30, respectively. The exploration rate is initialized to 1, updated with decay rate η = 0.9999, and floored at the minimum exploration rate ε_min = 0.01. Other simulation parameters are given in Table 1. We train the Q-network over 200 episodes, each of which contains 10^3 time slots. The simulation results were obtained by averaging over 10^4 time slots. The selection of network parameters determines the learning convergence rate and system efficiency. To demonstrate the importance of network parameter selection, we examine the convergence rate of the proposed algorithm over an increasing number of training episodes with different learning rates, i.e., λ = {0.0001, 0.001, 0.01, 0.1}, in Fig. 3.
In the simulation, we regularly compute the average of the cumulative rewards received over the time slots of each episode, so as to visualize a curve with less oscillation during training. The Q-network keeps training until it meets the convergence condition (a change of at most 0.001) or reaches the predefined number of training episodes. We can see that the learning rate has a significant impact on the reward performance of the proposed DQN algorithm. Although the figure shows that the proposed algorithm is able to learn the environment by observing the immediate rewards and improves its policy step by step, the reward may converge to a locally optimal value. If the learning rate is set too small, e.g., λ = 0.0001, the system requires more time to converge, resulting in a low reward. On the other hand, a too-large learning rate of λ = 0.1 leads to oscillatory behavior and poor convergence. With a learning rate of λ = 0.001, the system obtains the best reward among the considered values, although it converges slightly more slowly than with λ = 0.01. Thus, the learning rate should be neither too large nor too small, and we fix it at 0.001 for the rest of the simulations.

Fig. 4 compares the performance of the proposed algorithm with the other conventional methods. The increase in the average rewards of the proposed scheme and the classical DQN scheme shows that their policies improve over time. During the first 20 episodes, these policies yield similar rewards because the system needs time to explore the environment dynamics by frequently executing random actions. As depicted in the figure, the classical DQN converges much faster than the proposed scheme because it does not include the jamming probability in the state; this shrinks the state space, but the optimal policy cannot be obtained due to the sensing errors.
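A stopping rule of this kind might be implemented as below. This is a sketch under the assumption that the ≤ 0.001 condition is applied to successive episodic average rewards; the paper does not spell out the monitored quantity.

```python
def has_converged(avg_rewards, tol=1e-3, max_episodes=200):
    """Stop training when the episodic average reward changes by at most
    `tol` between consecutive episodes, or when the episode budget is spent."""
    if len(avg_rewards) >= max_episodes:
        return True
    if len(avg_rewards) < 2:
        return False
    return abs(avg_rewards[-1] - avg_rewards[-2]) <= tol
```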
In contrast to the classical DQN scheme, the proposed scheme can estimate the jamming probability of the JM to select a proper action for the WD against the JM's attack in each time slot. As a result, the reward of the proposed algorithm rapidly increases within about the first 75 episodes and then gradually converges to the optimal value over the training period of 200 episodes. On the other hand, the rewards of the non-learning methods, i.e., the Myopic and Random schemes, remain unchanged as the number of training episodes increases. Consequently, the figure verifies that the proposed scheme outperforms the considered conventional learning and non-learning schemes.
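One simple way to track such a jamming probability, assuming the two-state (idle/active) Markov jammer with the transition probabilities from the simulation setup (P_II = 0.8, P_AI = 0.2), is a one-step belief prediction. This is an illustrative sketch, not necessarily the estimator used in the paper.

```python
def predict_jamming_prob(p_active, p_ii=0.8, p_ai=0.2):
    """One-step prediction of the probability that the jammer is active in
    the next slot, given the current belief `p_active`.
    p_ii = P(idle -> idle), p_ai = P(active -> idle)."""
    p_idle = 1.0 - p_active
    # Jammer is active next slot if it was idle and switches on, or was
    # active and stays on.
    return p_idle * (1.0 - p_ii) + p_active * (1.0 - p_ai)
```

Iterating this prediction from any initial belief converges to the chain's stationary activity probability (0.5 for the symmetric values above).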
To obtain a better understanding of our proposed scheme, we further inspect the impact of the maximum transmission energy e_max on the performance in Fig. 5, where two settings, e_max = 11 µJ and e_max = 14 µJ, are considered in terms of the episodic average reward and the episodic instant reward as functions of the number of episodes. In the following simulations, the episodic average reward at episode j is calculated by R̄_j = (1/j) Σ_{z=1}^{j} R^ins_z, where R^ins_z = (1/T) Σ_{t=1}^{T} R(t) represents the episodic instant reward at episode z. We can see that the instant reward converges faster at the low SNR (i.e., e_max = 11 µJ) than at the high SNR (i.e., e_max = 14 µJ). This is because the dynamic range of the transmission energy is smaller, leading to good convergence. In contrast, at the higher SNR the transmission energy range is large, so the system exhibits more fluctuations and worse convergence during the training process.
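The two reward metrics can be computed directly from the per-slot rewards, e.g.:

```python
def episodic_instant_reward(slot_rewards):
    """R^ins_z: average reward over the T time slots of one episode."""
    return sum(slot_rewards) / len(slot_rewards)

def episodic_average_reward(instant_rewards):
    """Running mean of the episodic instant rewards up to episode j,
    used to plot the smoothed learning curves."""
    return sum(instant_rewards) / len(instant_rewards)
```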
To assess the effectiveness of the proposed scheme, Fig. 6 plots the data rate of the proposed scheme and the baseline schemes versus increasing mean values of harvested energy, for both the with-RIS and without-RIS scenarios. It is observed that all schemes obtain higher throughput as the harvested energy increases. The reason is that the WD has more energy to transmit data to the BS as more energy is harvested from the environment. Furthermore, the proposed scheme achieves the best performance compared to the other schemes in both the with-RIS and without-RIS scenarios.

In this article, the energy efficiency of the system is defined as the number of achievable bits over the amount of energy consumed at the WD. As illustrated in Fig. 7, the energy efficiency of the proposed scheme is superior to that of the classical DQN, Myopic, and Random schemes.

Fig. 8 shows the impact of the battery capacity of the WD on the performance of the schemes. We see that the data rate improves as the battery capacity increases. This can be explained as follows: with a larger battery, the WD stores more harvested energy for its long-term operation, which enhances performance by allowing higher transmission power to be used more frequently for data transmissions. It is worth noting that the data rate of our algorithm dominates the other approaches. The reason is that the classical DQN does not consider the impact of sensing errors, even though it adopts long-term data rate optimization, while the WD in the short-term (Myopic) scheme only maximizes the immediate reward without considering the impact of the current action on future utility. As a consequence, these conventional schemes cannot achieve the optimal policy in the dynamic environment of the wireless network.
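Using the definition above, the two plotted metrics can be computed from a simulation trace as follows; the slot duration is a placeholder parameter, not a value taken from the paper.

```python
def average_rate_and_efficiency(bits_per_slot, energy_per_slot_j, slot_s=1.0):
    """Average data rate (bits/s) over a run, and energy efficiency (bits/J)
    defined as achievable bits over the energy consumed at the WD."""
    total_bits = sum(bits_per_slot)
    total_energy = sum(energy_per_slot_j)
    avg_rate = total_bits / (len(bits_per_slot) * slot_s)
    efficiency = total_bits / total_energy
    return avg_rate, efficiency
```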
In addition, we can see from the figure that the schemes using the RIS achieve up to a 3% gain in data rate compared to those without the RIS.
To evaluate the effect of the number of RIS elements on the system rate, Fig. 9 compares the performance of the different algorithms. The curves show that increasing the number of RIS elements helps the system obtain a higher data rate, because the gain of the reflecting channel increases with the number of RIS elements. In Fig. 10, we further investigate the system performance under various false alarm probabilities of the sensing engine, i.e., P_f ∈ {0.1, 0.2}, and state transition probabilities of the JM, i.e., P_II ∈ {0.6, 0.8}. It can be seen that the system performance is significantly degraded as the false alarm probability becomes large. This can be explained as follows: when the sensing engine of the WD raises more false alarms, the WD misses more opportunities to transmit data to the BS, resulting in lower throughput. On the other hand, the system achieves higher rewards when P_II increases, reflecting the fact that the WD can perform data transmissions more frequently. As a result, the proposed scheme yields an optimal strategy for the WD because it estimates the jamming probability of the JM when selecting the action in each step, leading to energy-efficient anti-jamming communications under network changes.
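The throughput loss caused by false alarms can be illustrated with a simplified model in which the WD transmits only when the jammer is idle and no false alarm is raised; this is an illustrative assumption, not the paper's exact channel-access model.

```python
def transmission_opportunity_prob(p_idle, p_f):
    """Probability that the WD actually uses a slot for data transmission:
    the jammer is idle AND the sensing engine raises no false alarm."""
    return p_idle * (1.0 - p_f)
```

For example, under this model, raising P_f from 0.1 to 0.2 shrinks the fraction of usable slots proportionally, which matches the degradation trend observed in Fig. 10.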
Finally, in Fig. 11, we plot the data rate of the schemes versus the number of BS antennas and varying misdetection probabilities of the sensing engine. The curves show that the data rate grows with the number of BS antennas. In contrast, the performance decreases with increasing values of P_m, because the system experiences jamming interference more frequently at higher misdetection probabilities. The results show that the proposed scheme is robust to the environment dynamics: it not only accounts for energy conservation at the energy-constrained WD but also predicts the JM activity to acquire the optimal strategy for the long-term anti-jamming optimization problem.

V. CONCLUSION
In this article, we investigated the joint transmission energy allocation and RIS phase shift optimization problem for solar-powered RIS-assisted communication against jamming attacks. A DQN-based algorithm was proposed, in which the agent adapts its policy to the dynamic variations of the environment to attain the optimal strategy under sensing errors, limited battery capacity, and energy harvesting. Specifically, by predicting the jammer activity and learning the energy evolution of the wireless device, the proposed scheme gradually learns the optimal strategy that maximizes the long-term data rate. Numerical simulation results validated the superiority of the proposed scheme over existing benchmarks under changes in the network situation. Although the effectiveness of the proposed scheme was verified, the system only addresses the considered network problem with discrete actions. Therefore, our future work aims at developing intelligent resource allocation schemes that solve secure transmission problems with continuous action spaces by using other artificial intelligence methods, such as the deep deterministic policy gradient algorithm.