Energy Harvesting Reconfigurable Intelligent Surface for UAV Based on Robust Deep Reinforcement Learning

Integrating unmanned aerial vehicles with RIS (UAV–RIS) can offer ubiquitous deployment services in communication-disabled areas, but is limited by the on-board energy of the UAVs. In this paper, a novel energy harvesting (EH) scheme built on top of the UAV–RIS system, called the EH-RIS scheme, is developed for next-generation high-performance wireless systems. The proposed EH-RIS scheme extends the simultaneous wireless information and power transfer (SWIPT) system by splitting the passive reflecting arrays in geometric space so that information transfer and energy harvesting occur simultaneously. However, pedestrian mobility and rapid channel changes pose challenges to efficient resource allocation in wireless systems. Thus, a robust deep reinforcement learning (DRL)-based algorithm is developed to improve the proposed EH-RIS scheme and guarantee the quality of service (QoS) under dynamic wireless environments. The simulation results demonstrate the effectiveness and efficiency of the proposed robust DRL-based EH-RIS system, which not only outperforms the existing state-of-the-art solutions but also approaches the performance of the exhaustive search method.


Haoran Peng, Member, IEEE, and Li-Chun Wang, Fellow, IEEE

I. INTRODUCTION
RECONFIGURABLE intelligent surfaces (RISs), artificial meta-surfaces of electromagnetic material with large passive reflecting arrays, have recently received widespread attention as a promising solution for enhancing wireless communications [1]. The passive reflective antenna elements in an RIS can be intelligently configured in amplitude, polarization, and phase shift in a programmable manner to create a desirable multipath effect, thereby enhancing the strength of the overall received signal or suppressing interference [2], [3].

The authors are with the Department of Electrical and Computer Engineering, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan (e-mail: peng.ee07@nycu.edu.tw; wang@nycu.edu.tw). Digital Object Identifier 10.1109/TWC.2023.3245820.
The utilization of RISs for sustainable and green wireless communications has been explored and demonstrated [4].
Nevertheless, despite the numerous recent advances in RIS technology, most systems are for static deployment (e.g., installed on buildings), limiting their effectiveness in dynamic scenarios.
Combining unmanned aerial vehicles (UAVs) with RISs can provide on-demand deployment services in dynamic situations [5]. Because of their controllability and flexibility, UAVs have numerous applications in the blind areas of fixed communication infrastructures, such as serving as temporary base stations (BSs), assisting Internet of Things (IoT) and vehicle-to-vehicle networks, and enhancing hotspot coverage [6]. However, the finite on-board battery capacity of UAVs limits the performance and endurance of UAV-assisted RIS communications.
Energy harvesting (EH) can extend the operating time of UAV-assisted RIS communications: a simultaneous wireless information and power transfer (SWIPT) system collects energy from impinging radio-frequency (RF) signals and therefore mitigates the on-board energy issue of UAV-RIS systems [7]. One of the most efficient SWIPT modes, the harvest-transmit-store (HTS) model, divides each time block into two time slots for EH and information transmission [8]. However, resource allocation for the HTS model in the UAV-RIS system involves the joint optimization of transmit power, the reflective elements' phase shifts, transmission time scheduling, and RIS scheduling under UAV trajectory design and communication quality requirements, and it is difficult to efficiently reach a near-global optimum by splitting the time domain alone. Additionally, when there is a small number of user terminals (UTs) in the service coverage, using all the reflect-arrays for signal transmission may waste resources. A space-splitting EH model, in which some reflection units collect energy from the received RF signals while the others reflect signals, extends the dimension of resource allocation and improves the energy efficiency of the RIS [9]. Therefore, the endurance of UAV-RIS systems can be further enhanced by jointly optimizing resource allocation in the time and space domains (dual domains) simultaneously. However, maximizing the harvested energy while guaranteeing the communication quality in both domains results in a nonconvex problem.
Various studies have been conducted on balancing the EH and communication qualities of UAV-aided RIS wireless communications [1], [9], [10], [11]. However, joint optimization problems are in general nonconvex and intractable [12]. Various approaches, including alternating optimization, decomposing the nonconvex problem into multiple subproblems, and penalty-based iteration, have been proposed to obtain low-complexity, suboptimal solutions in practice [13], [14]. However, these solutions are problem-specific and hard to extend to general cases. Recently, deep reinforcement learning (DRL) has been used to resolve nonconvex optimization problems in wireless communication systems [15], including coupled objective optimization and instant decision making in communication networks [16]. This motivated applying DRL to the EH and resource allocation of UAV-aided RIS communication networks. Nevertheless, the widely used DRL algorithms, namely the deep deterministic policy gradient (DDPG) algorithm and the twin-delayed DDPG (TD3), suffer from overestimation and underestimation issues, respectively [17], [18], [19], which reduce the performance of EH in complex wireless communication environments. To address this issue, we use a softmax operator and a clipped action space approximation to develop a robust DRL-based EH scheme, as in [19].
Existing SWIPT techniques for UAV-RIS systems aim to maximize energy efficiency by splitting either time or space, whereas this study exploits the time-splitting and space-splitting EH models simultaneously. Motivated by the successful application of DRL [3], [7], [20], [21], [22], this technique is used to handle the complicated control problems arising in resource allocation. To the best of our knowledge, this is the first method to enhance the endurance of UAV-aided RIS communication systems by harvesting energy in the dual domains while meeting the required communication quality of service (QoS) constraints. The contributions of the present work are as follows:
• The energy-efficient optimization and endurance enhancement of UAV-assisted RIS communication systems is investigated, and a novel scheme combining SWIPT and resource allocation is proposed. A resource allocation-based HTS (RAHTS) model and an access point (AP)-RIS-UT channel model are adopted to formulate the proposed optimization problem while satisfying the required communication QoS constraints.
The remainder of the paper is organized as follows. The related work is detailed in Section II, the system model is described in Section III, and the formulation of the nonconvex optimization problem is given in Section IV. Section V presents the design of the UAV trajectory in the dynamic scenario. Section VI then discusses the proposed robust DRL-based SWIPT method for UAV-RIS communications, and the effectiveness of the proposed robust DRL-based SWIPT/RIS resource allocation system is verified in Section VII. Finally, concluding remarks and recommendations for future work are provided in Section VIII.

II. RELATED WORK
As an emerging technique, RIS technology has received a great deal of attention because of its potential to improve the performance of wireless communication networks [1], [23]. However, the optimization of RIS-assisted communication systems always involves multiple objectives, such as resource allocation, phase shifts, and energy efficiency. The joint optimization is nonconvex and cannot be resolved directly using standard convex optimization algorithms. According to the existing works [3], [7], [20], [22], [24], DRL can efficiently resolve nonconvex optimization problems for RIS-assisted communication systems.

A. RIS-Assisted Signal Transmission
The RIS-assisted multiuser wireless communication system in [10] minimizes the total transmit power by optimizing the passive beamforming of the RIS and the transmit power of the BSs. Subsequently, it was demonstrated in [11] that an RIS system can overcome the non-line-of-sight (NLoS) radio propagation problem between the UAV and the ground terminals. Meanwhile, in [20], a UAV was integrated with RISs to enhance the propagation environment between the BS and the intended IoT devices (IoTDs). The UAV-RIS system described in [20] effectively overcame the blockage between the IoTDs and the BS; however, the battery-powered UAV presented the challenge of limited service time. In the decode-and-forward-based RIS-assisted UAV communication system described in [25], the fixed RIS was able to significantly improve the coverage and average capacity of the UAV communication system, whereas the frame-based RIS-assisted transmission protocol outlined in [26] enhanced the coverage and communication quality of the UAV-user link. Furthermore, the resource management problem of the UAV-RIS system was studied in [27] to minimize the energy consumption of the system by jointly optimizing UAV deployment, phase shifts, and the UAV-RIS-user association. In contrast, this study focuses on the performance of the dual-domain EH model of UAV-RIS systems; the UAV-RIS-user association problem will be studied in the future. In [12], an RIS system was deployed to enhance the received power and mitigate the mutual interference in device-to-device communications, with an alternating optimization algorithm used to maximize the system's total rate subject to the respective QoS, power, and practical discrete phase shift constraints. Furthermore, a holographic multiple-input multiple-output (MIMO) surface technique, supported by RIS and intelligent resource allocation algorithms, was explored to achieve low cost and low power consumption for massive MIMO [28], [29].

B. RIS-Assisted Energy Harvesting
In [7], the RIS was equipped with an energy storage system for EH, improving the overall energy efficiency of the RIS-assisted cellular network by harvesting energy from the received RF signals. Focusing on RIS-assisted EH, the authors in [13] recently demonstrated that the RIS-based SWIPT system can minimize the transmit power of the AP by designing the passive phase shifts of all the RISs and optimizing the transmitter precoders of the AP. An iterative algorithm was proposed to maximize the secure energy efficiency of UAV-RIS systems by jointly optimizing the reflective elements' phase shifts, the transmit power, and the UAV trajectory [30]. The distributed RIS architecture was investigated to maximize energy efficiency through the joint optimization of transmit power and RIS scheduling [31]. In [32], the RIS-aided multiuser multiple-input single-output SWIPT system was found to enhance the propagation of both the energy signal and the information signal. The successive convex approximation-based resource allocation algorithm in [33] minimizes the BS transmit power of large RIS-assisted SWIPT systems, subject to the QoS requirements of both information decoding receivers and energy harvesting receivers. The authors of [5] proposed a dual-domain EH scheme based on DDPG to enhance the endurance of UAV-RIS systems, whereas other EH schemes focused on time-domain EH. However, the DDPG-based EH approach was only validated in the single-UT case and suffered from the underestimation problem in reinforcement learning, resulting in limited EH efficiency.

C. Deep Reinforcement Learning for RIS Systems
The DRL-based framework outlined in [3] efficiently optimizes the RIS phase shifts and tackles the nonconvex unit modulus constraints, whereas the DRL-based secure beamforming algorithm described in [34] optimizes the passive and active beamforming at the RIS and BS, respectively. In [35], the DRL-based framework was found to efficiently improve the downlink throughput and reduce the intercell interference of dynamic ultradense small cells. Elsewhere, in [36], a DRL-based passive phase shift optimization scheme was developed for RIS-assisted nonorthogonal multiple access networks, whereas the DRL-based framework outlined in [37] predicts the RIS interaction matrices with minimal beam training overhead. A DRL-based algorithm was explored to maximize the sum rate of massive MIMO systems by jointly optimizing the active and passive beamforming of the BS and RIS, respectively [38]. Finally, the DDPG-based power management and passive phase shift scheme described in [16] enhances the energy efficiency of RIS-assisted UAV networks.

D. Limitation of Related Works
Table I shows a comparison of the related works on RIS-assisted communication networks. As the table shows, the above-related works mainly focus on maximizing the system's total rate and minimizing energy consumption. Although the works in [10], [11], [25], [26], and [12] guaranteed the communication QoS requirements of UTs, active energy efficiency solutions for RIS-assisted communication systems have not yet been considered. The successful RIS-aided SWIPT frameworks outlined in [7], [13], [32], and [33] can harvest energy in the time domain. Despite their many benefits, the energy efficiency of RIS-assisted communication systems is limited by the resource utilization of the meta-surface elements. In [20], UAVs were integrated with RISs to flexibly deploy the latter in dynamic scenarios, whereas other approaches install RISs on static buildings. However, the energy consumption of the battery-powered UAV challenges the endurance of UAV-aided RIS communications.

III. SYSTEM MODEL
As shown in Fig. 1, a UAV-RIS system is deployed to assist signal transmission from the AP to K single-antenna UTs, denoted by K = {1, 2, . . . , K}, because obstacles block the line-of-sight (LoS) paths. The location of the antenna of UT k at time slot t is indicated as C_k(t) = (x_k(t), y_k(t), H_k(t)), where H_k(t) is the altitude of the antenna of UT k in the Cartesian coordinate system whose origin is at the AP, and (x_k(t), y_k(t)) is the horizontal position of UT k. In this work, the AP with Z antennas transmits signals to a UAV-RIS system consisting of L (= M × N) meta-surfaces, under the assumption that the UTs can only receive signals reflected by the UAV-RIS system. The meta-surface element at the i-th row and the j-th column is denoted by R_{i,j}. The location of the meta-surface element R_{i,j} at each time slot t is indicated as C^r_{i,j}(t) = (x^r_{i,j}(t), y^r_{i,j}(t), H^r_{i,j}(t)), where H^r_{i,j}(t) and (x^r_{i,j}(t), y^r_{i,j}(t)) are the altitude and horizontal position of the meta-surface element R_{i,j}, respectively. Furthermore, the position of the meta-surface elements is associated with the trajectory of the UAV. Without loss of generality, the meta-surface element array of the UAV-RIS is denoted as R = {R_{i,j}}_{i,j=1}^{M,N}. Additionally, the RIS can exchange channel state information with the AP via the attached smart controller. To enhance the UAV's endurance while transmitting signals, the system model consists of three key components: an HTS-based model, a reflecting-unit RAHTS model, and an AP-RIS-UT channel model.

A. Harvest-Transmit-Store Model
An HTS-based model was proposed to enhance the UAV's endurance via harvesting energy in the time domain. The UAV-RIS system was equipped with a rechargeable battery that stores the harvested energy and converts it into electrical power [7]. It was assumed that linear transmit precoding is used at the AP for simplicity of implementation. For EH and signal reflection, the whole time period was divided into T equal time slots, denoted as T = {1, 2, . . . , t, . . . , T}, with each slot containing two phases: the EH phase and the information transmission phase. As in [39] and [40], a normalized unit time slot is considered in the sequel. At the t-th time slot, the length of the EH phase is denoted by τ(t); the length of the information transmission phase is then (1 − τ(t)). During the EH phase, all reflecting units only harvest energy. Following the EH phase, the information transmission phase begins immediately, with all the meta-surfaces used to reflect signals. Following [13], the AP's transmit signal can be presented as

    X = Σ_{k=1}^{K} V_k S_k,   (1)

where V_k ∈ C^{D×1} and S_k are the precoding vector and the signal for the k-th UT, respectively, and S_k is a circularly symmetric complex Gaussian random variable with zero mean and unit variance, that is, S_k ∼ CN(0, 1) [32]. Therefore, the total transmit power at the AP satisfies

    Σ_{k=1}^{K} ∥V_k∥² ≤ p_max,   (2)

where ∥·∥ represents the vector's Euclidean norm, p_max is the upper limit of the AP's transmit power, and p_k = ∥V_k∥² is the transmit power for UT k. Hence, the UAV-RIS harvested energy at the t-th time slot can be expressed as

    E(t) = η τ(t) Σ_{i=1}^{M} Σ_{j=1}^{N} E(|g_{i,j}^H X|²),   (3)

where g_{i,j} ∈ C^{Z×1} is the channel vector between the Z-antenna AP and the meta-surface element R_{i,j}, which follows the path loss of the air-to-ground (ATG) propagation model [6], [11], [41]. Furthermore, the small-scale channel fading in the channel matrix G ∈ C^{Z×L} is assumed to follow the Rayleigh fading distribution. η ∈ (0, 1) is the EH efficiency, and p = E(X^H X) is the transmit power of the AP. The path loss, PL_{i,j}, of the channel vector, g_{i,j}, from the AP to each reflective element, R_{i,j}, can be expressed as [11], [41]

    PL_{i,j} = [P_{i,j}(LoS) + φ (1 − P_{i,j}(LoS))] d_{i,j}^{−α},   (4)

where α is the path loss exponent from R_{i,j} to the AP, d_{i,j} is the distance between the AP and R_{i,j}, φ is the additional attenuation factor caused by the NLoS connection, and P_{i,j}(LoS) is the LoS probability between the AP and the meta-surface element R_{i,j}. Following [8], the LoS probability P_{i,j}(LoS) can be calculated according to Eq. (5):

    P_{i,j}(LoS) = 1 / (1 + A exp(−B (θ_{i,j} − A))),   (5)

where A and B are constants depending on the environment [42], and the elevation angle between the AP and the meta-surface element R_{i,j} is given by

    θ_{i,j} = (180/π) arcsin(H^r_{i,j}(t) / d_{i,j}).   (6)
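As a minimal sketch of the ATG link model above, the following code evaluates the LoS probability of the form in Eq. (5) and the resulting LoS/NLoS-weighted path loss. The function names and the default constants (A = 9.61, B = 0.16, α = 2.3, φ = 0.2) are illustrative assumptions, not values taken from the paper.

```python
import math

def los_probability(elevation_deg: float, a: float = 9.61, b: float = 0.16) -> float:
    """LoS probability of the ATG link for an elevation angle in degrees.

    a, b are environment-dependent constants; the defaults are
    commonly used suburban values, assumed here for illustration.
    """
    return 1.0 / (1.0 + a * math.exp(-b * (elevation_deg - a)))

def avg_path_loss(distance_m: float, elevation_deg: float,
                  alpha: float = 2.3, phi: float = 0.2) -> float:
    """Average ATG path loss: LoS/NLoS mixture weighted by P(LoS)."""
    p_los = los_probability(elevation_deg)
    return (p_los + phi * (1.0 - p_los)) * distance_m ** (-alpha)
```

As expected from the model, a higher elevation angle raises the LoS probability, and a longer link raises the path loss.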

B. Resources Allocation Based Harvest-Transmit-Store Model
To further enhance the UAV's endurance, a RAHTS model was designed for harvesting energy in the dual domains. As shown in Fig. 2, the UAV-RIS often operates in a communication outage area. Unlike the HTS model, only some of the meta-surfaces in the UAV-RIS are used to reflect signals during the information transmission phase, whereas the remainder harvest energy. At each time slot t, the UAV-RIS harvested energy can be redefined as

    E(t) = η [ τ(t) Σ_{i=1}^{M} Σ_{j=1}^{N} E(|g_{i,j}^H X|²) + (1 − τ(t)) Σ_{i=1}^{M} Σ_{j=1}^{N} (1 − Σ_{k=1}^{K} ω^k_{i,j}) E(|g_{i,j}^H X|²) ],   (7)

where ω^k_{i,j} = 1 denotes that the element R_{i,j} is adopted to reflect signals to the k-th UT, and ω^k_{i,j} = 0 otherwise. Therefore, the energy harvesting efficiency of the UAV-RIS system in each time slot t can be defined as

    Ē(t) = E(t) / E_r(t),   (8)

where E_r(t) = Σ_{i=1}^{M} Σ_{j=1}^{N} E(|g_{i,j}^H X|²) is the total received energy from the impinging RF signals in time slot t.
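The dual-domain split described above can be sketched numerically as follows: during the EH phase every element harvests, and during the transmission phase only the non-reflecting elements harvest. The function name, the array-based interface, and the default η = 0.8 are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def dual_domain_energy(tau, g, x_power, omega, eta=0.8):
    """Harvested RF energy and EH efficiency for one slot (RAHTS split).

    tau   : EH-phase fraction of the slot, in [0, 1]
    g     : (M, N) per-element channel power gains (given as inputs)
    omega : (M, N) 0/1 mask, 1 = element reflects during transmission
    eta   : EH conversion efficiency (illustrative value)
    """
    received = g * x_power                  # per-element received RF power
    eh_phase = tau * received.sum()         # all elements harvest
    tx_phase = (1 - tau) * (received * (1 - omega)).sum()  # idle elements harvest
    harvested = eta * (eh_phase + tx_phase)
    total = received.sum()                  # total impinging energy per unit slot
    return harvested, harvested / total
```

Two sanity checks follow directly from the model: if no element reflects, the efficiency equals η; if every element reflects, only the EH phase contributes and the efficiency equals η·τ.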

C. Access Point-RIS-User Terminal Channel Model
In this work, passive reflective beamforming at the UAV-RIS system is considered. During the information transmission phase in time slot t, h_{r,k} ∈ C^{1×L} (∀k ∈ K) and G ∈ C^{Z×L} represent the baseband equivalent channels from the UAV-RIS to the k-th UT and from the AP to the UAV-RIS, respectively. Moreover, the UAV-RIS passively reflects the received information signals by controlling the reflecting phase shifts. Following [13], a diagonal matrix Φ was defined as the reflection coefficient matrix of the UAV-RIS:

    Φ = diag(ϱ_1 e^{jθ^r_1}, ϱ_2 e^{jθ^r_2}, . . . , ϱ_L e^{jθ^r_L}),   (9)

where j = √−1 is the imaginary unit, θ^r_l ∈ (0, 2π) represents the phase shift of the l-th reflection unit, and ϱ_l ∈ [0, 1] represents the amplitude reflection coefficient. Furthermore, ϱ_l is ideally set to one for simplicity, since each meta-surface element's antenna can be independently controlled to maximize the signal reflection efficiency [13]. Based on Eq. (1), the received RF signal at the k-th UT via the AP-RIS-UT channel can be expressed as

    y_k = ĥ_{r,k} Φ G^H X + ν_k,   (10)

where ν_k ∼ CN(0, σ²_k) represents the additive white Gaussian noise at the k-th UT with noise power σ²_k, and ĥ_{r,k} is the channel vector from the UAV-RIS to UT k with RIS scheduling, which can
be expressed as

    ĥ_{r,k} = h_{r,k} diag(ω_k),   (11)

where ω_k ∈ {0, 1}^{L×1} is the vectorized RIS scheduling indicator for UT k. The study considers path loss and small-scale fading for h_{r,k}. The path loss between the UAV-RIS and the UTs is given by κ (d^k_{i,j}(t)/d′)^{−ᾱ}, where ᾱ represents the path loss exponent for the RIS-UT links, d^k_{i,j}(t) = ∥C_k(t) − C^r_{i,j}(t)∥₂ is the distance between the reflective element R_{i,j} and UT k, and ∥·∥₂ is the Euclidean norm. κ corresponds to the path loss at the reference distance d′ = 1 m. The small-scale channel fading in h_{r,k} is assumed to follow the Rician distribution with Rician factor K_rician = 10 and is represented as

    h_{r,k} = √(K_rician/(1 + K_rician)) h^LoS_{r,k} + √(1/(1 + K_rician)) h^NLoS_{r,k},   (12)

where h^LoS_{r,k} and h^NLoS_{r,k} represent the deterministic LoS and the NLoS (Rayleigh fading) components, respectively. As in [32], it was assumed that each UT can perfectly cancel the interference from the other RIS-UT links before decoding its desired signal S_k. Hence, the received signal-to-noise ratio (SNR) at the k-th UT is given by

    γ_k(t) = |ĥ_{r,k} Φ G^H V_k|² / σ²_k.   (13)

According to Shannon's capacity formula, the average throughput of the k-th UT during time slot t is given by

    R_k(t) = (1 − τ(t)) B log₂(1 + γ_k(t)),   (14)

where B is the bandwidth. To maintain the service quality, the average throughput of each UT within the finite time horizon must be greater than or equal to a given threshold Γ_min, i.e.,

    (1/T) Σ_{t=1}^{T} R_k(t) ≥ Γ_min, ∀k ∈ K.   (15)
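The cascaded AP-RIS-UT link above can be sketched as a few lines of linear algebra: mask the RIS-UT channel with the scheduling indicator, multiply through the reflection matrix and the AP-RIS channel, and apply the Shannon formula over the (1 − τ) transmission fraction. The function name and argument layout are assumptions for illustration.

```python
import numpy as np

def ut_throughput(h_rk, omega_k, Phi, G, v_k, tau, noise_power, bandwidth):
    """Received SNR and slot throughput for one UT.

    h_rk    : (L,) RIS -> UT channel row vector
    omega_k : (L,) 0/1 RIS scheduling mask for this UT
    Phi     : (L, L) diagonal reflection matrix diag(exp(1j*theta))
    G       : (Z, L) AP -> RIS channel
    v_k     : (Z,) precoding vector for this UT
    """
    h_hat = h_rk * omega_k                        # RIS scheduling (Eq.-(11)-style)
    signal = h_hat @ Phi @ G.conj().T @ v_k       # effective cascaded channel
    snr = np.abs(signal) ** 2 / noise_power
    rate = (1 - tau) * bandwidth * np.log2(1 + snr)  # only (1 - tau) carries data
    return snr, rate
```

With two in-phase unit-gain elements the cascaded amplitude doubles, so the SNR grows fourfold relative to a single element, which is the usual RIS coherent-combining gain.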

IV. PROBLEM FORMULATION
This work aims to maximize the total energy harvesting efficiency of the UAV-RIS within a finite time horizon T while satisfying the required minimal throughput constraints. Without loss of generality, the total transmit power at the AP must also satisfy a constraint. The optimization problem is formulated as follows:

    (P1): max_{τ(t), P, Θ, ω}  Σ_{t=1}^{T} Ē(t)
    s.t. C1: (1/T) Σ_{t=1}^{T} R_k(t) ≥ Γ_min, ∀k ∈ K,
         C2: 0 ≤ τ(t) ≤ 1, ∀t ∈ T,
         C3: Σ_{k=1}^{K} p_k ≤ p_max,
         C4: 0 ≤ p_k ≤ p′_max, ∀k ∈ K,
         C5: ω^k_{i,j} ∈ {0, 1}, ∀i, j, ∀k ∈ K,
         C6: Σ_{k=1}^{K} ω^k_{i,j} ≤ 1, ∀i, j,
         C7: 0 ≤ θ^r_l ≤ 2π, ∀l,
         C8: ϱ_l = 1, ∀l,

where P = [p_1, · · · , p_K] is the transmit power vector for the K UTs, and p′_max is the upper limit of the transmit power for each UT. Θ = [θ^r_1, · · · , θ^r_L] is the phase shift vector for all reflective elements on the RIS, and ω = {ω^k_{i,j}} is the RIS scheduling matrix. C1 represents the minimum throughput constraint for each UT to guarantee the QoS of the wireless network, and C2 is the time constraint. C3 and C4 are the maximum power control constraints of the AP and of each UT k, respectively. C5 and C6 are the constraints on the binary scheduling variable ω^k_{i,j} of the reflective units. C7 and C8 indicate that each reflective element l of the RIS can only provide a phase shift θ^r_l ∈ [0, 2π] without amplifying the input signal. The optimization problem (P1) is nonconvex because of the nonconvex constraints and the coupling of multiple variables, meaning it is difficult to resolve effectively using standard convex optimization methods [10]. Thus, a DRL-based framework was developed to deal with this issue, as described in the following section.

V. UAV TRAJECTORY DESIGN
This study considers human mobility in the dynamic scenario. Therefore, the UAV-RIS must be re-deployed to provide seamless services for mobile UTs. Following [43], the UAV-RIS is assumed to be fixed at a given altitude and to move horizontally in the Cartesian coordinate system. Furthermore, the deployment of the UAV-RIS is expected to reduce the total path loss of the system, which is positively correlated with the total Euclidean distance between the UAV-RIS and all UTs. Therefore, this study discusses two state-of-the-art UAV trajectory designs, the density-aware deployment method and the Fermat point-based approach, to evaluate the proposed dual-domain EH model [43], [44].

A. Density-Aware Deployment Method
For the density-aware deployment method, the UAV-RIS is deployed at the point that minimizes the sum of squared Euclidean distances, i.e.,

    Ĉ_r(t) = arg min_{Ĉ_r(t)} Σ_{k=1}^{K} ∥Ĉ_r(t) − Ĉ_k(t)∥²₂,   (16)

where Ĉ_r(t) and Ĉ_k(t) are the horizontal positions of the UAV-RIS and UT k, respectively. The value of Ĉ_r(t) can be obtained using the standard K-means algorithm and is not elaborated on further in this study [45].
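For a single cluster of UTs, the point minimizing the sum of squared distances is simply the centroid, which is what K-means converges to with one cluster. The sketch below assumes that single-cluster case; the function name and interface are illustrative.

```python
import numpy as np

def density_aware_point(ut_positions):
    """Deployment point minimizing the sum of squared distances to all UTs.

    For one cluster this is the centroid (the K-means solution with K = 1);
    ut_positions is a (num_uts, 2) array of horizontal UT coordinates.
    """
    pts = np.asarray(ut_positions, dtype=float)
    return pts.mean(axis=0)
```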

B. Fermat Point-Based Approach
Following [44], the trajectory of the UAV-RIS can be obtained by finding the horizontal Fermat point of all UTs in the Cartesian coordinate system. Unlike the K-means algorithm, which minimizes the squared Euclidean distances, the Fermat point minimizes the sum of the Euclidean distances from the point to each vertex. Therefore, the deployment point obtained by the Fermat point-based approach can be expressed as

    Ĉ_r(t) = arg min_{Ĉ_r(t)} Σ_{k=1}^{K} ∥Ĉ_r(t) − Ĉ_k(t)∥₂.   (17)
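The Fermat point (geometric median) has no closed form in general, but it is classically computed with Weiszfeld's iteration, sketched below. This is a standard algorithm, not necessarily the procedure used in [44]; the iteration count and tolerance are illustrative choices.

```python
import numpy as np

def fermat_point(ut_positions, iters=200, eps=1e-9):
    """Geometric median (Fermat point) via Weiszfeld's iteration.

    Minimizes the sum of Euclidean distances to all UT positions.
    """
    pts = np.asarray(ut_positions, dtype=float)
    y = pts.mean(axis=0)                     # start from the centroid
    for _ in range(iters):
        d = np.linalg.norm(pts - y, axis=1)
        d = np.maximum(d, eps)               # avoid division by zero at a UT
        w = 1.0 / d
        y_new = (pts * w[:, None]).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < eps:
            return y_new
        y = y_new
    return y
```

For a symmetric UT layout the Fermat point coincides with the centroid; for asymmetric layouts the two deployment rules generally differ, which is the distinction the text draws.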

VI. DEEP REINFORCEMENT LEARNING ALGORITHM-BASED FRAMEWORK
Recent research results motivated the use of a DRL-based resource allocation method to maximize the harvested energy while guaranteeing the required communication QoS [7], [46]. However, conventional DRL algorithms often suffer from overestimation and underestimation issues, which reduce performance in complex wireless communication environments [19]. Inspired by the success of the SD3 algorithm, a robust DRL-based approach that uses a softmax operator and a clipped action space is proposed to address this issue. First, the essential principle of generalized DRL is briefly reviewed before the proposed architecture is outlined in detail.

A. Generalized Deep Reinforcement Learning
Reinforcement learning derives from the Markov decision process (MDP) interaction between intelligent agents and the external environment [47]. The formulated MDP can be expressed as the tuple

    M = (S, A, P, R, γ),   (18)

where S and A represent finite sets of states and actions, respectively. R : S × A × S → ℝ denotes the state reward function, which specifies the rewards for particular transitions between states. The state transition probability, P : S × A × S → [0, 1], maps the current environment state combined with the taken action into a probability distribution over the next environment state. The discount factor, γ ∈ [0, 1], determines the importance of future rewards relative to the current state. At each coherence time step t, the intelligent agent takes an action, a_t = π*(s_t), based on the current environment state, s_t ∈ S, according to its policy, π*. Following this, the agent receives an instantaneous reward r_t = R(s_t, a_t) and the evolved state s_{t+1} ∈ S. Typically, the reward function R and the transition function P comprise the model of the MDP, whose policy π* : S → A maximizes the long-term reward

    J(π) = E[ Σ_{t=0}^{∞} γ^t r_t ].   (19)

Similarly, the action-value (Q-)function can be defined as

    Q^π(s, a) = E[ Σ_{t=0}^{∞} γ^t r_t | s_0 = s, a_0 = a ].   (20)

Prior research has demonstrated that exploring a continuous action space in Q-learning can be time consuming [48], [49]. The DDPG uses a deterministic policy, π(s | δ^π), whose function approximator is parameterized by δ^π, to maximize the Q-function in a continuous action space [17]. The critic net, Q(s, a | δ^Q), parameterized by δ^Q, is learned using the Bellman equation to criticize the performance of the actor net. Copies of the actor and critic nets, π′(s | δ^{π′}) and Q′(s, a | δ^{Q′}), are created as the target nets for fast convergence. At each step, the DDPG creates an exploration policy for learning in continuous action spaces by adding noise sampled from a stochastic noise process N, where N can be chosen to suit the environment. Taken together, the actor net updates its policy using the following approximation:

    ∇_{δ^π} J ≈ (1/N_b) Σ_i ∇_a Q(s, a | δ^Q)|_{s=s_i, a=π(s_i)} ∇_{δ^π} π(s | δ^π)|_{s=s_i},   (21)

where N_b is the number of transitions in the random mini-batch sampled from the replay buffer D. The critic net updates its parameters to minimize the loss

    L = (1/N_b) Σ_i (y_i − Q(s_i, a_i | δ^Q))²,   (22)

where y_i is expressed as

    y_i = r_i + γ Q′(s_{i+1}, π′(s_{i+1} | δ^{π′}) | δ^{Q′}).   (23)

Then, the DDPG updates the weights of the target nets as follows:

    δ^{Q′} ← ψ δ^Q + (1 − ψ) δ^{Q′},   δ^{π′} ← ψ δ^π + (1 − ψ) δ^{π′},   (24)

where ψ ≪ 1 is the learning rate for the soft update of the actor and critic target networks.
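The soft (Polyak) target update at the end of this procedure is simple enough to show directly. The sketch below operates on plain lists of weights; the function name and the value ψ = 0.005 are illustrative assumptions.

```python
def soft_update(target_params, online_params, psi=0.005):
    """Polyak (soft) target update used by DDPG-style learners.

    Each target weight moves a small step psi toward the online weight,
    mirroring delta' <- psi * delta + (1 - psi) * delta'.
    """
    return [psi * w + (1.0 - psi) * t
            for t, w in zip(target_params, online_params)]
```

Because ψ ≪ 1, the target network trails the online network slowly, which stabilizes the bootstrapped Q-targets.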

B. The Robust Deep Reinforcement Learning-Based Scheme
One critical concern with DDPG is overestimation [50]. Focusing on this problem, the authors in [18] demonstrated that the TD3 algorithm notably enhances both the convergence speed and the performance of DDPG by leveraging clipped double estimators, Q_1 and Q_2, for the critics. Similar to the double Q-learning formulation, the pair of critics (Q_1, Q_2) is parameterized by (δ^{Q_1}, δ^{Q_2}) [51]. TD3 takes the minimum estimation value between the two critics via the clipped double Q-learning method as follows:

    y = r + γ min_{i=1,2} Q_i(s′, π(s′ | δ^{π−}) | δ^{Q_i −}),   (28)

where δ^{Q_1 −} and δ^{Q_2 −} are the parameters of the target critic nets. Consequently, any additional overestimation of the value targets can be reduced using the clipped double Q-learning approach. The proof of the TD3 approach was clearly described in [18] and is not repeated herein. However, TD3 still suffers from an underestimation bias that significantly degrades its performance [19].
To resolve this problem, SD3 uses the softmax operator in TD3 to reduce both overestimation and underestimation bias in continuous control. The softmax operator can be defined as follows:

    softmax_β(Q(s, ·)) = ( ∫_{a∈A} e^{β Q(s,a)} Q(s,a) da ) / ( ∫_{a∈A} e^{β Q(s,a)} da ),   (29)

where β is the parameter of the softmax operator. By using the softmax operator to express the expected value of the Q-function, SD3 obtains an unbiased estimate as follows:

    softmax_β(Q(s′, ·)) = E_{a′∼p} [ e^{β Q̂(s′,a′)} Q̂(s′,a′) / p(a′) ] / E_{a′∼p} [ e^{β Q̂(s′,a′)} / p(a′) ],   (30)

where p(a′) represents the sampling probability, which follows a Gaussian distribution. Furthermore, Q̂_i(s′, ·) takes the minimum estimation value among all critic nets and is given by

    Q̂_i(s′, a′) = min( Q_i(s′, a′ | δ^{Q_i −}), Q_j(s′, a′ | δ^{Q_j −}) ),   (31)

where Q_j represents the critic nets other than critic net Q_i. The estimation value of the target critic Q_i is defined by

    y_i = r + γ T_{SD3}(s′),   (32)

where T_{SD3}(s′) denotes the softmax operator for SD3 over the action space and is expressed as

    T_{SD3}(s′) = softmax_β(Q̂_i(s′, ·)).   (33)

Additionally, the sampled actions are obtained by adding a noise N to the target action π(s′ | δ^{π−}). Since each sampled noise is clipped to [−c, c], the sampled action can be expressed as follows:

    a′ = π(s′ | δ^{π−}) + clip(N, −c, c).   (34)

One practical advantage of SD3 is that the limited range of the action space guarantees that the taken action remains close to the original one. Consequently, SD3 can obtain accurate and robust estimates of the softmax Q-function.
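The importance-sampled softmax value at the heart of SD3 can be sketched as follows: given clipped double-Q estimates for a batch of sampled actions and their Gaussian sampling densities, the target value is an exp(βQ)/p-weighted average. The function name and the default β are illustrative assumptions; the max-subtraction is a standard numerical-stability trick, not part of the formula.

```python
import numpy as np

def softmax_q_estimate(q_values, probs, beta=0.05):
    """Importance-sampled softmax Q-value, in the style of SD3 targets.

    q_values : clipped double-Q estimates for the sampled actions
    probs    : Gaussian sampling density p(a') of each sampled action
    beta     : softmax temperature

    The result lies between min(Q) and max(Q); as beta grows it
    approaches max(Q), and at beta = 0 it is a 1/p-weighted mean.
    """
    q = np.asarray(q_values, dtype=float)
    p = np.asarray(probs, dtype=float)
    w = np.exp(beta * (q - q.max())) / p     # subtract max for stability
    return (w * q).sum() / w.sum()
```

This interpolation between mean and max is what lets SD3 sit between the underestimating min-of-critics target and the overestimating max target.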
The implementation details of the SD3-based learning algorithm are provided in Algorithm 1. Here, the communication environment state is formulated as the input of the proposed algorithm, and a pair of actor networks (π_1, π_2) and a pair of critic networks (Q_1, Q_2) were initialized with the random parameter pairs (δ^{π_1}, δ^{π_2}) and (δ^{Q_1}, δ^{Q_2}), respectively. Then, the target networks for all the actor and critic networks were initialized with the same parameters as their corresponding networks. An empty replay buffer D with size N_D was initialized for the learning process. At each time step, the actor produces an action, a_t, according to the current policy pair (π_1, π_2) and the clipped exploration noise N. The algorithm then obtains the instantaneous reward r_t after executing the corresponding action. The reward in terms of the harvested RF energy is described in Section VI-C. Following this, the tuple (s, a_t, r_t, s′, d) is stored in D, where d is the done flag. A mini-batch of N_b transitions is then sampled from the replay memory D to calculate the target Q-value following the softmax operation according to Eq. (30). Then, the critic net Q_i and the actor net π_i are updated according to the Bellman loss and the policy gradient, respectively. Lastly, the target nets are soft updated as follows:

    δ^{Q_i −} ← ψ δ^{Q_i} + (1 − ψ) δ^{Q_i −},   δ^{π_i −} ← ψ δ^{π_i} + (1 − ψ) δ^{π_i −}.   (35)

The outputs of the algorithm are the optimal action a = {τ(t), P, ω, Θ} and the total energy harvesting efficiency Ē of the UAV-RIS system.

C. Observation, Action and Reward Design
In this work, the DRL environment is built on the wireless network model described above, with the RIS acting as the agent. The state and observation space, the action space, and the reward design are described below.
• State Space: At each time step t, the observation is the current environment state s_t, which consists of the baseband equivalent channels from the AP to the UAV-RIS, G, and from the UAV-RIS to the k-th UT, h_{r,k} ∈ C^{1×L}, for all k ∈ K; the distance between each meta-surface element R_{i,j} and the k-th UT, d^k_{i,j}, for all k ∈ K; the location of each meta-surface element, C^r_{i,j}; and the position of the antenna of each UT, C_k. Hence, the observation of the proposed SD3-based learning algorithm can be expressed as

s_t = { G, h_{r,k}, d^k_{i,j}, C^r_{i,j}, C_k : ∀k ∈ K, ∀i, j }.

• Action Space: At the t-th time step, the action a_t of the proposed DRL-based framework for the time-domain EH scheme consists of three main components: the length of the EH phase τ(t) ∈ [0, 1], the transmit power level p_k ∈ [0, p′_max] for each UT k, and the phase shift θ^r_l ∈ [0, 2π] for each reflective element l. For the dual-domain EH scheme, the reflective element scheduling variable ω^k_{i,j} ∈ {0, 1}, ∀i ∈ [0, M], j ∈ [0, N], k ∈ K, is added to the action space. Furthermore, τ(t), p_k, and θ^r_l are defined in a continuous feasible region, whereas ω^k_{i,j} is treated as a discrete variable.

Algorithm 1: The Proposed SD3-Based Scheme
1: Input: G; h_{r,k}, ∀k ∈ K; d^k_{i,j}, ∀k, i, j; C_k(t), ∀k ∈ K; C^r_{i,j}(t), ∀i, j; the experience replay size N_D; the mini-batch size N_b
2: Initialize actor π_1(s | δ_π1) and critic Q_1(s, a | δ_Q1) networks with random parameters δ_π1 and δ_Q1, respectively
3: Initialize actor π_2(s | δ_π2) and critic Q_2(s, a | δ_Q2) networks with random parameters δ_π2 and δ_Q2, respectively
4: Initialize target networks δ_π1⁻ ← δ_π1, δ_Q1⁻ ← δ_Q1, δ_π2⁻ ← δ_π2, δ_Q2⁻ ← δ_Q2
5: Initialize experience replay memory D with capacity N_D
6: Output: optimal action a = {τ(t), P, ω, Θ}, and the total energy harvesting efficiency Ē of the UAV-RIS system
⋮
22: Update the critic δ_Qi using the Bellman loss
23: Update the actor δ_πi according to the policy gradient

• Reward Design: The positive reward encodes the objective of the proposed framework, that is, to maximize the overall energy harvesting efficiency of the UAV-RIS system. At each time step t, the instantaneous reward has a positive correlation with the energy harvesting efficiency E(t), which is defined in Eq. (8). The proposed framework must also account for the users' minimum capacity requirement defined in constraint C1. Hence, the reward r_t is an increasing function of E(t) and of ρ, where ρ = Σ_{k∈K} ρ_k(t) is the number of UTs that satisfy the required Γ_min, and ρ_k(t) ∈ {0, 1} indicates whether the k-th UT meets Γ_min at time step t. The cumulative reward is given by max J = Σ_t γ^t r_t.
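Since the paper's exact reward expression did not survive extraction, the following is only one plausible shaping consistent with the description above: the EH efficiency E(t) scaled by the fraction of UTs meeting the minimum capacity Γ_min. The function name and signature are hypothetical:

```python
def step_reward(eh_efficiency, capacities, gamma_min):
    """Hypothetical reward shaping (an assumption, not the paper's Eq.):
    scale the instantaneous EH efficiency E(t) by the fraction of UTs
    whose achieved capacity meets the QoS threshold Gamma_min."""
    rho_k = [1 if c >= gamma_min else 0 for c in capacities]  # per-UT QoS flag
    rho = sum(rho_k)                                          # number of satisfied UTs
    return eh_efficiency * rho / len(capacities)
```

Any shaping with this structure rewards high harvested energy only when the QoS constraint C1 is (at least partially) met, which matches the stated design intent.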

VII. SIMULATION RESULTS
In this section, the performance of the proposed SD3-based SWIPT system with the dual-domain EH scheme developed in this work is evaluated for both the single-UT and multiple-UT cases. The number of users was set to K = 1 and K = 3 for the single-UT and multiple-UT cases, respectively. Table II lists a subset of the simulation parameters. Here, the UTs are located in an area of 20 m × 20 m, and the number of passive reflective elements in the system was set to 16.

A. Single-User-Terminal Case

As Fig. 4 shows, the results of the SD3-based SWIPT system were extremely close to those of the exhaustive search method, which produces the optimal resource allocation but is computationally expensive. Moreover, the SD3-based SWIPT system outperformed the TD3-based system in terms of collected energy in all the steps, while the TD3-based method reached around 58% energy harvesting efficiency per step. However, the DDPG-based SWIPT system demonstrated better performance than the TD3-based system in terms of the time-domain EH, as shown in Fig. 4(a), because the TD3-based system suffers from the underestimation problem.
Based on the simulation results for the single-UT case, the proposed SD3-based approach harvested, on average, 22.5% and 64.2% of the energy from the received RF signal in the time-domain and dual-domain EH schemes, respectively. Meanwhile, the TD3-based method achieved 14.9% and 58.5% in the time-domain and dual-domain schemes, respectively, whereas DDPG harvested 21.5% and 30.4% of the energy in the corresponding schemes. The upper limit of EH obtained by searching all the probabilistic actions was 26.4% and 67.6% for the time-domain and dual-domain schemes, respectively. Clearly, the proposed dual-domain EH outperformed the time-domain scheme across the different learning algorithms and the exhaustive search method. Furthermore, the SD3-based SWIPT system achieved the best performance among all the learning algorithms in the dual-domain EH scheme. However, the nondeterministic polynomial-time complexity of the exhaustive search algorithm makes it impractical for real-world applications. To summarize, the simulation results demonstrated the superiority of the proposed SD3-based method in the single-UT case in terms of the trade-off between effectiveness and practicality.
Meanwhile, Fig. 5 illustrates the convergence behavior of the proposed SD3-based SWIPT system for the single-UT case. Here, the rewards have a positive correlation with the EH objective. As Fig. 5 shows, the cumulative rewards of the dual-domain EH scheme increased significantly from 0 to around 0.52 per episode between 100 and 700 episodes, whereas from 1,000 to 2,000 episodes, the cumulative rewards per training episode increased gradually as training continued. The learning process converged at approximately 2,000 episodes after some fluctuations caused by exploration, after which the rewards remained stable at around 0.65 and 0.23 for the dual-domain and time-domain EH schemes, respectively.

Fig. 6. EH percentage per testing step for the multiple-UT case. The EH percentage is the ratio of collected energy to the received energy of the impinging RF signal.

B. Multiple-User-Terminal Case
The percentages of harvested energy relative to received energy per step in the multiple-UT case are shown in Fig. 6(a) and Fig. 6(b) for the time-domain and dual-domain schemes, respectively. Here, in each EH scheme, the time consumed by the exhaustive search-based method was consistently higher than that of the other learning-based algorithms, because the exhaustive search explores the optimal solution in a time-consuming way. As Fig. 6(a) shows, the values for the proposed SD3-based SWIPT system and the TD3-based system were close to those of the exhaustive search. Moreover, the gap between the DDPG-based method and the exhaustive search method was wider than that of the other learning algorithms in the majority of steps. Meanwhile, as Fig. 6(b) shows, the curve of the proposed SD3-based method was close to that of the exhaustive search, whereas the EH performance of the DDPG-based SWIPT system was extremely close to that of SD3, despite deviations in a few of the steps.
Based on the simulation results, 67.2% and 25.5% of the energy of the impinging RF signals was collected by the exhaustive search algorithm in the dual-domain and time-domain schemes, respectively. In the time-domain scheme, the percentage collected by the DDPG-based SWIPT scheme (23.6%) was slightly higher than that collected by the SD3-based scheme (23.2%), whereas the TD3-based method achieved the lowest value at 18.7%. In the dual-domain scheme, the proposed SD3-based SWIPT harvested 55% of the received energy, surpassing the TD3-based approach (52.9%), whereas the DDPG-based SWIPT scheme performed worst at 29.6%. Clearly, the SD3-based SWIPT scheme outperformed the other learning algorithms in the dual-domain scheme, whereas the DDPG scheme achieved comparable performance in the time-domain scheme. Moreover, as in the single-UT case, the dual-domain EH scheme surpassed the time-domain scheme.
The training behavior of the proposed SD3-based dual-domain SWIPT system in the multiple-UT case is shown in Fig. 7. Here, the cumulative rewards per episode increased sharply from zero to approximately 0.58 between around 100 and 600 episodes, after which the training rewards increased slightly to 0.62 over the next 1,000 episodes. The cumulative rewards then converged to around 0.63 and remained stable until the end of the training phase.
Figure 8 shows the EH performance of the proposed SD3-based method under the density-aware design and the Fermat point-based design for the UAV-RIS trajectory. The proposed SD3 model was trained using the density-aware UAV trajectory and tested with both the density-aware and the Fermat point-based UAV trajectories. As Fig. 8 shows, the EH performance under the K-means-based (density-aware) UAV trajectory was extremely close to that under the Fermat point-based UAV trajectory in both the time-domain and dual-domain EH schemes. The simulation results thus demonstrate that the proposed SD3-based EH method is robust to different UAV trajectory design schemes. Overall, the dual-domain EH scheme outperformed the time-domain scheme for all the learning-based and exhaustive search-based methods. The proposed SD3-based robust SWIPT system achieved the best performance among all state-of-the-art systems in terms of dual-domain EH, since it achieved a good balance between effectiveness and time consumption.
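The Fermat point-based trajectory design mentioned above places the UAV-RIS at the geometric median of the user positions; a standard way to compute this point is Weiszfeld's iteration. The sketch below is illustrative only and is not the paper's implementation:

```python
import numpy as np

def fermat_point(points, iters=200, eps=1e-9):
    """Weiszfeld iteration for the geometric median (Fermat point) of a
    set of 2-D user positions -- one plausible UAV-RIS placement rule."""
    pts = np.asarray(points, dtype=float)
    x = pts.mean(axis=0)                  # centroid as the initial guess
    for _ in range(iters):
        d = np.linalg.norm(pts - x, axis=1)
        d = np.maximum(d, eps)            # guard against division by zero
        w = 1.0 / d                       # inverse-distance weights
        x = (pts * w[:, None]).sum(axis=0) / w.sum()
    return x
```

For three UTs at the vertices of an equilateral triangle, the Fermat point coincides with the centroid, which gives a quick sanity check on the iteration.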

VIII. CONCLUSION AND FUTURE WORK
In this work, the limited on-board battery power of UAV-assisted RIS communications, which limits its service capabilities, was investigated. A long-lasting scheme based on SWIPT was proposed for the UAV-RIS system by splitting the passive reflective arrays in the geometric space to transport information and harvest energy simultaneously. For rapid and robust learning, an SD3-based SWIPT algorithm was developed for the proposed dual-domain EH, and the effectiveness and efficiency of the proposed dual-domain EH scheme were demonstrated through rigorous simulations. The simulation results showed the superiority of our SD3-based SWIPT scheme in terms of the trade-off between efficiency and practicality. Furthermore, the proposed dual-domain EH was demonstrated to reach a near-globally-optimal solution for the joint optimization of transmit power, the reflective elements' phase shifts, transmission time scheduling, and RIS scheduling under dynamic communication environments, whereas the performance of the traditional time-domain EH was limited by the resource allocation dimension. In future work, the association problem between UAV-RISs and users in the multiple UAV-RIS scenario should be investigated.

Manuscript received 9 March 2022; revised 7 November 2022; accepted 10 February 2023. Date of publication 23 February 2023; date of current version 11 October 2023. This work was partially funded by the National Science and Technology Council, Taiwan, under Grants MOST 110-2221-E-A49-039-MY3, MOST 111-2221-E-A49-071-MY3, NSTC 111-2634-F-A49-010, and NSTC 111-3114-E-A49-001. This work was also financially supported by the Center for Open Intelligent Connectivity from the Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan, and by the Higher Education Sprout Project of National Yang Ming Chiao Tung University and the Ministry of Education (MOE), Taiwan. The associate editor coordinating the review of this article and approving it for publication was H. Yang. (Corresponding author: Li-Chun Wang.)

Fig. 2. Resource allocation combined with an HTS model for the UAV-assisted RIS communication system.

Algorithm 1 (continued, steps 7-17):
7: for episode N_e = 1 to N_epoch do
8:   Receive the current G
9:   Initialize a stochastic noise process N
10:  Collect h_{r,k}, ∀k ∈ K, for the N_e-th episode
11:  for t = 1 to T do
12:    Select action a_t with exploration noise N based on policies π_1 and π_2
13:    Execute action a_t to observe its corresponding reward r_t, the next state s′, and the done flag d
14:    Store the transition tuple (s, a_t, r_t, s′, d) into D
15:    for i = 1, 2 do
16:      Randomly sample a mini-batch of N_b transitions {(s, a, r, s′, d)} from D
17:      Sample K noises ε ∼ N(0, σ)
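The store-and-sample steps above correspond to a standard experience replay buffer. A minimal sketch follows (class and method names are assumptions, not the authors' code):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay memory D with capacity N_D, matching
    Algorithm 1's store (step 14) and mini-batch sample (step 16)."""
    def __init__(self, capacity):
        # deque with maxlen evicts the oldest transition once full
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # uniform sampling without replacement from stored transitions
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```

Uniform sampling from a bounded buffer breaks the temporal correlation between consecutive transitions, which is what makes the mini-batch updates in steps 15-17 stable.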

Fig. 4. EH percentage per testing step for the single-UT case. The EH percentage is the ratio of collected energy to the received energy of the impinging RF signal.

Fig. 5. Cumulative rewards per training episode with increasing iterations for the single-UT case.

Fig. 7. Cumulative rewards per training episode with increasing iterations for the multiple-UT case.

Fig. 8. SD3-based EH performance on different UAV trajectory design schemes for the multiple-UT case.

TABLE I. Comparison of Related Works and This Work.