A Meta-DDPG Algorithm for Energy and Spectral Efficiency Optimization in STAR-RIS-Aided SWIPT

This letter studies a simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS)-assisted wireless system where a multi-antenna base station (BS) transmits both wireless information and energy-carrying signals to single-antenna users. To explore the trade-off between spectral efficiency (SE) and energy efficiency (EE) in this system, a multi-objective optimization problem (MOOP) is formulated to maximize SE and EE. The beamforming vector at the BS, the power splitting ratio at each user, and the phase shifts and amplitude coefficients of the STAR-RIS are jointly optimized, subject to constraints on the maximum transmit power of the BS and the minimum harvested energy of the users. To tackle this MOOP, we propose a Meta-DDPG algorithm that combines deep deterministic policy gradient (DDPG) with meta-learning. Simulation results demonstrate that the Meta-DDPG algorithm outperforms the classic DDPG and genetic algorithms in terms of EE. The results also show that Meta-DDPG performs close to the exhaustive search and optimization-based solutions.

Optimizing RIS phase shifts, known as passive or active beamforming, is a challenge in designing RIS-assisted systems. Existing studies focus on phase shift optimization for STAR-RIS or RIS-aided systems [2], [3], [4], [5], [6]. For instance, [2] aimed at maximizing the total data rate in a STAR-RIS-assisted non-orthogonal multiple access system by jointly optimizing the decoding order, transmit power, active beamforming, and transmission and reflection beamforming. Also, the authors of [3] proposed an iterative algorithm to minimize power consumption, where the active beamforming at the base station (BS) and the passive transmission and reflection beamforming at the STAR-RIS were jointly optimized. Moreover, in [4], a RIS-assisted multiple-input single-output (MISO) system was considered, where a joint transmit beamforming and phase shift optimization algorithm was proposed to minimize the total transmit power of the BS. Furthermore, in [5], a multi-RIS system was studied in which multiple users transmit information to a BS with the assistance of two RISs. Specifically, a cooperative passive beamforming algorithm was proposed to maximize the minimum signal-to-interference-plus-noise ratio (SINR) of all users. Besides, in [6], a RIS-aided simultaneous wireless information and power transfer (SWIPT) system was considered, where a RIS aids a multi-antenna BS in transmitting information and energy-carrying signals to users.
The related works proposed optimization-based algorithms to optimize RIS phase shifts. However, due to the non-convex nature of resource management problems in RIS-aided systems, these algorithms tend to have high complexity.
Besides, the iterative nature of optimization-based algorithms may result in solutions that are far from optimal in non-convex problems for RIS-assisted systems. As a result, these algorithms do not perform well in such scenarios. Recently, deep reinforcement learning (DRL) methods, like deep deterministic policy gradient (DDPG), have been proposed to tackle resource management challenges in wireless communication [7]. DDPG is particularly suitable for problems with continuous decision variables, as its actor network outputs continuous-valued actions directly for each state [7]. However, DDPG's learned model may not adapt to new environments, making it unsuitable for resource management in a dynamic network. To address this, meta-learning methods can be combined with classic DRL methods. Meta-learning, which teaches models how to learn, is employed in conjunction with conventional DRL methods to boost adaptability in different environments and enhance overall performance [8].
In this letter, we formulate the spectral efficiency (SE) and energy efficiency (EE) maximization problem for a MISO STAR-RIS-assisted SWIPT system as a multi-objective optimization problem (MOOP). The beamforming vector at the BS, the power splitting (PS) ratio at each user, and the phase shifts and amplitude coefficients at the STAR-RIS are considered as decision variables. The MOOP is non-convex, so no polynomial-time algorithm is known to solve it. To tackle this difficulty, it is first converted to a single-objective optimization problem (SOOP) by using the weighted Tchebycheff approach [9]. Then, to address the SOOP, we propose the Meta-DDPG algorithm, which combines meta-learning with classic DDPG [8]. To the best of our knowledge, this is the first work proposing a meta-learning-based approach for SE and EE maximization in a MISO STAR-RIS-assisted SWIPT system. Specifically, in contrast to [2], [3], [4], [5], we consider a SWIPT system, where a STAR-RIS assists the BS in transmitting information and power signals to users. Moreover, in comparison with [2], [3], [4], [5], [6], which proposed optimization-based algorithms, we propose a Meta-DDPG algorithm for a STAR-RIS-assisted system that is well suited for next-generation wireless networks due to its ability to adapt to new environments. Simulation results illustrate the superiority of Meta-DDPG over classic DDPG and genetic algorithms in terms of EE. Furthermore, the simulation results show that Meta-DDPG performs close to the optimization-based and optimal solutions with lower computational complexity.

II. SYSTEM MODEL AND ASSUMPTIONS
Consider a STAR-RIS-assisted MISO system, where an $N_t$-antenna BS serves a set $\mathcal{K} = \{1, \ldots, K_r, K_r + 1, \ldots, K_r + K_t\}$ of single-antenna users. $K_r$ and $K_t$, respectively, denote the numbers of users in the reflection and transmission zones, as shown in Fig. 1. $K = K_r + K_t$ denotes the total number of users. In our system, a STAR-RIS-assisted link is used to serve users when the direct link quality between the BS and users is poor. The STAR-RIS consists of $M$ phase shifters. We assume that perfect channel state information for a flat-fading channel model is available at the STAR-RIS and the BS.
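We denote the channel matrix from the BS to the STAR-RIS by $\Upsilon \in \mathbb{C}^{M \times N_t}$. The channel vector from the STAR-RIS to the $k$-th user is represented by $h^{H}_{\mathrm{RIS},k} \in \mathbb{C}^{M \times 1}$, and a direct channel vector between the BS and the $k$-th user is also considered. Let $\Psi_{\lambda} = \mathrm{diag}(\psi^{\lambda}_{1}, \psi^{\lambda}_{2}, \ldots, \psi^{\lambda}_{M})$ denote the diagonal phase shift matrix for the STAR-RIS, in which each entry $\psi^{\lambda}_{m}$ combines an amplitude coefficient with the phase term $e^{j\phi^{\lambda}_{m}}$, where $\phi^{\lambda}_{m} \in [0, 2\pi]$ is the phase shift of the $m$-th element of the STAR-RIS and $j$ represents the imaginary unit [10]. Besides, $\lambda \in \Lambda = \{t_{\lambda}, r_{\lambda}\}$, where $t_{\lambda}$ and $r_{\lambda}$, respectively, stand for the transmission and reflection zones. The amplitude coefficient of the $m$-th component at the STAR-RIS is defined separately for reflection and transmission. Moreover, $\Psi = \Psi_{r_{\lambda}}$ for all users $k \in \{1, \ldots, K_r\}$, and $\Psi = \Psi_{t_{\lambda}}$ for all users $k \in \{K_r + 1, \ldots, K_r + K_t\}$.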
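To make the cascaded link concrete, the following minimal Python sketch builds diagonal STAR-RIS coefficients and the resulting effective BS-to-user channel. The function names `star_ris_coeffs` and `effective_channel`, the form "amplitude times $e^{j\phi}$" with amplitudes in $[0, 1]$, the additive direct link, and the toy channel values are assumptions made for illustration only; they are not taken from the letter.

```python
import cmath
import math


def star_ris_coeffs(amplitudes, phases):
    """Diagonal entries psi_m = amplitude_m * exp(j * phase_m) (assumed form)."""
    if len(amplitudes) != len(phases):
        raise ValueError("need one amplitude and one phase per element")
    for a, p in zip(amplitudes, phases):
        if not 0.0 <= a <= 1.0:
            raise ValueError("amplitude must lie in [0, 1] (assumption)")
        if not 0.0 <= p <= 2.0 * math.pi:
            raise ValueError("phase must lie in [0, 2*pi]")
    return [a * cmath.exp(1j * p) for a, p in zip(amplitudes, phases)]


def effective_channel(h_ris, psi, upsilon, h_direct):
    """Row vector h_ris^H diag(psi) Upsilon + h_direct^H with N_t entries.

    h_ris:    M complex entries (STAR-RIS to user)
    psi:      M diagonal STAR-RIS coefficients
    upsilon:  M x N_t nested list (BS to STAR-RIS)
    h_direct: N_t complex entries (BS to user, hypothetical direct link)
    """
    out = []
    for n in range(len(h_direct)):
        cascaded = sum(h_ris[m].conjugate() * psi[m] * upsilon[m][n]
                       for m in range(len(psi)))
        out.append(cascaded + h_direct[n].conjugate())
    return out


# Toy example with M = 2 elements and N_t = 2 antennas.
psi_demo = star_ris_coeffs([1.0, 0.5], [0.0, math.pi])
h_demo = effective_channel([1 + 0j, 1 + 0j], psi_demo,
                           [[1, 0], [0, 1]], [0j, 0j])
```

Users in the reflection zone would use the reflection coefficients and users in the transmission zone the transmission coefficients, so in this sketch the function would simply be called with a different `psi` per zone.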
Let $x = \sum_{k=1}^{K} w_k s_k$ denote the transmitted signal at the BS, in which $s_k$ is the unit-power data symbol transmitted to the $k$-th user, i.e., $\mathbb{E}\{|s_k|^2\} = 1$ and $\mathbb{E}\{s_k s_j^{*}\} = 0$, $\forall j \neq k$. In the received signal at the $k$-th user, $w_k$ is the corresponding transmit beamforming vector and $z_k \sim \mathcal{CN}(0, \sigma_k^2)$ is the additive complex Gaussian noise. Using the power splitting technique, the received signal at each user is split into two parts: energy harvesting (EH) and information decoding (ID) signals. Accordingly, the received SINR at user $k$ also accounts for the additional noise that results from signal processing at the ID receiver [11].
Furthermore, we consider that the total harvested energy is linearly proportional to the received EH signal at user $k$, with an energy transformation efficiency that takes values in $(0, 1]$ [6].
We study the joint maximization of SE and EE as a MOOP in a STAR-RIS-assisted SWIPT system, where a BS simultaneously transmits information and energy signals to users with the aid of a STAR-RIS. The SE is calculated from the total data rate $\sum_{k=1}^{K} R_k$, and the EE is the ratio of the total data rate to $P_{C,\mathrm{total}}(w_k)$, the total consumed power in the system. In $P_{C,\mathrm{total}}(w_k)$, $P_{\mathrm{circuit}} = P^{\mathrm{BS}}_{\mathrm{circuit}} + \sum_{k=1}^{K} p^{k}_{\mathrm{circuit}}$ is a fixed circuit power cost due to the power consumption of all electrical parts in both the receiver (i.e., $p^{k}_{\mathrm{circuit}}$) and the transmitter (i.e., $P^{\mathrm{BS}}_{\mathrm{circuit}}$). Also, $P_{\mathrm{RIS}} = P_S + N_t P_D$, in which $P_D$ and $P_S$ are the dynamic power per reflecting component and the static power required to retain the basic circuit activities of the STAR-RIS, respectively.
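The quantities in this section can be tied together in a short numerical sketch. The Python code below computes per-user SINR, harvested energy, SE, and EE for a power-splitting receiver. It relies on several assumptions that are not confirmed by the text: `eta[k]` is taken as the fraction of received power sent to the ID branch, noise is ignored in the harvested energy, the rate is $\log_2(1 + \mathrm{SINR})$, and the total power is the transmit power plus `p_circuit` and `p_ris`. All function and parameter names are hypothetical.

```python
import math


def inner(h_row, w):
    """Effective-channel row times beamformer: sum_n h[n] * w[n]."""
    return sum(hn * wn for hn, wn in zip(h_row, w))


def ps_swipt_metrics(h_eff, beams, eta, noise_var, id_noise_var,
                     eh_efficiency, p_circuit, p_ris):
    """Return (SE, EE, harvested) for a power-splitting SWIPT downlink.

    h_eff[k]: effective channel row of user k (N_t complex entries)
    beams[k]: beamforming vector w_k (N_t complex entries)
    eta[k]:   assumed share of received power sent to the ID branch
    """
    num_users = len(h_eff)
    rates, harvested = [], []
    for k in range(num_users):
        gains = [abs(inner(h_eff[k], beams[j])) ** 2
                 for j in range(num_users)]
        interference = sum(gains) - gains[k]
        # Antenna noise is scaled by eta; ID processing noise is added after.
        sinr = eta[k] * gains[k] / (
            eta[k] * (interference + noise_var) + id_noise_var)
        rates.append(math.log2(1.0 + sinr))
        # Linear EH model: the remaining (1 - eta) share is harvested.
        harvested.append(eh_efficiency * (1.0 - eta[k]) * sum(gains))
    se = sum(rates)
    tx_power = sum(abs(x) ** 2 for w in beams for x in w)
    ee = se / (tx_power + p_circuit + p_ris)
    return se, ee, harvested


# Toy example: two users with orthogonal effective channels.
se_demo, ee_demo, eh_demo = ps_swipt_metrics(
    h_eff=[[1, 0], [0, 1]], beams=[[1, 0], [0, 1]], eta=[0.5, 0.5],
    noise_var=0.5, id_noise_var=0.25, eh_efficiency=0.8,
    p_circuit=1.0, p_ris=1.0)
```

Because the same beamformers appear in both the numerator (through the rates) and the denominator (through the transmit power), raising the transmit power does not necessarily raise EE, which is the trade-off the MOOP is meant to capture.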

III. PROBLEM FORMULATION
We formulate the MOOP, which aims at maximizing SE and EE by jointly optimizing the transmit beamforming $W = [w_1, \ldots, w_K] \in \mathbb{C}^{N_t \times K}$ at the BS, the phase shifts $\phi^{\lambda}_{m}$ and the amplitude coefficients of each component $m$ at the STAR-RIS, and the power splitting ratio $\eta_k$ for user $k$. The MOOP is formally stated as problem P1 in (1), subject to constraints (1a)-(1f).
Constraint (1a) implies that the total transmit power of the BS should not exceed the BS's maximum power budget. Constraint (1b) ensures a minimum harvested energy at each user $k$. Constraint (1c) restricts the power splitting ratio to a value between 0 and 1. Constraint (1d) requires the phase shift of the $m$-th component of the STAR-RIS to lie between 0 and $2\pi$. Finally, (1e) and (1f) are the constraints related to the amplitude coefficient of the $m$-th component at the STAR-RIS for reflection and transmission. Since the EE is defined as the ratio of the total data rate to the total power consumption $P_{C,\mathrm{total}}(w_k)$, the objective of problem P1 can be written as maximizing the total data rate (P2a) while minimizing the total power consumption (P2b). Accordingly, problem P1 in (1) is transformed into problem P2. One of the points on the Pareto-optimal region of problem P2 corresponds to the optimal solution of problem P1. Since problem P2 is a MOOP, we convert it to a SOOP using the weighted Tchebycheff method [9]. The resulting SOOP, problem P3, minimizes an auxiliary variable $\Phi$ subject to (3a), (3b), and (1a)-(1f). Furthermore, $R_{\max}$ and $P_{\min}$ are the utopia points of the total transmission rate and the total power consumption, respectively, which are obtained by solving the corresponding single-objective problems. Specifically, to obtain the value of $R_{\max}$, we consider a single-objective problem with the objective of maximizing the total transmission rate (P2a) subject to constraints (1a)-(1f). Likewise, to obtain $P_{\min}$, a single-objective problem consisting of the objective function (P2b) and constraints (1a)-(1f) is solved. Additionally, $0 \leq \nu \leq 1$ denotes a weighting coefficient indicating the importance of the different objectives. The weighted Tchebycheff method produces a Pareto-optimal solution for each value of the weight $\nu$, so sweeping $\nu$ yields a set of Pareto-optimal solutions [9]. Moreover, by minimizing the value of $\Phi$ subject to constraints (3a), (3b), and (1a)-(1f), problem P3 maximizes the total data rate ($\sum_{k=1}^{K} R_k$) and minimizes the total power consumption ($P_{C,\mathrm{total}}$). Problem P3 is a non-convex optimization problem that is difficult to solve efficiently with optimization-based algorithms. To solve problem P3, we propose a Meta-DDPG method, which is explained in the next section.
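The role of the weight $\nu$ in the weighted Tchebycheff conversion can be illustrated with a small Python sketch. It assumes one common normalized form, in which $\Phi$ is the larger of the weighted relative rate shortfall from $R_{\max}$ and the weighted relative power excess over $P_{\min}$; the exact form of (3a) and (3b) may differ. The candidate list and the names `tchebycheff_value` and `best_candidate` are hypothetical.

```python
def tchebycheff_value(rate, power, nu, r_max, p_min):
    """Weighted Tchebycheff objective Phi for one candidate (assumed form)."""
    rate_gap = (r_max - rate) / r_max      # shortfall from the rate utopia point
    power_gap = (power - p_min) / p_min    # excess over the power utopia point
    return max(nu * rate_gap, (1.0 - nu) * power_gap)


def best_candidate(candidates, nu):
    """Pick the (rate, power) pair with the smallest Phi for a given nu."""
    r_max = max(rate for rate, _ in candidates)
    p_min = min(power for _, power in candidates)
    return min(candidates,
               key=lambda c: tchebycheff_value(c[0], c[1], nu, r_max, p_min))


# Toy (total rate, total power) operating points, not simulation results.
points = [(2.0, 1.0), (4.0, 2.0), (5.0, 4.0)]
chosen = {nu: best_candidate(points, nu) for nu in (0.0, 0.8, 1.0)}
```

In this toy setting, $\nu = 0$ selects the lowest-power point, $\nu = 1$ selects the highest-rate point, and an intermediate weight selects the compromise point, which mirrors how sweeping $\nu$ traces different trade-offs.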

IV. THE PROPOSED META-DDPG METHOD
In this section, we describe the Meta-DDPG algorithm to solve problem P3. First, we formulate problem P3 as a Markov decision process (MDP). Then, we elaborate on the Meta-DDPG method.
The MDP problem is defined by state and action spaces, and the reward function.Specifically, state space S, action space A, and the reward function R are defined as follows [7], [8].
1) State Space S: $\mathcal{S}$ is the set of channel information, interference, and total data rate, as expressed in (4).
2) Action Space A: For problem P3, the action space contains the decision variables, namely the power splitting ratios, beamforming vectors, phase shifts, amplitude coefficients, and the Tchebycheff variable $\Phi$, as defined in (5).
3) Reward Function R: The objective of problem P3 is to minimize the Tchebycheff variable $\Phi$ in such a way that constraints (3a)-(3b) and (1a)-(1f) are satisfied. Since the learned model should satisfy the constraints, we assign a penalty value to the reward whenever the constraints are not satisfied. The reward function is defined accordingly in (6).
Since the action space $\mathcal{A}$ in (5) is continuous, DDPG is appropriate to solve problem P3, although the model learned by DDPG may not adapt to a new environment. To improve the generalization ability of DDPG, we combine DDPG with meta-learning and propose Meta-DDPG to solve problem P3.
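A penalized reward of this kind can be sketched in a few lines of Python. The sketch assumes that the reward equals $-\Phi$ when every constraint holds and a fixed negative penalty otherwise; the exact form of (6), the penalty value, and the names `constraint_slacks`, `reward`, `p_max`, and `e_min` are hypothetical.

```python
def constraint_slacks(tx_power, p_max, harvested, e_min, eta):
    """Slack of each constraint, written so that slack >= 0 means satisfied."""
    slacks = [p_max - tx_power]                  # power budget, as in (1a)
    slacks += [e - e_min for e in harvested]     # minimum harvested energy, (1b)
    slacks += [min(x, 1.0 - x) for x in eta]     # 0 <= eta_k <= 1, (1c)
    return slacks


def reward(phi, slacks, penalty=10.0):
    """Return -phi for a feasible action and -penalty for an infeasible one."""
    feasible = all(s >= 0.0 for s in slacks)
    return -phi if feasible else -penalty


# A feasible action is rewarded by its (negated) Tchebycheff value.
ok = reward(0.3, constraint_slacks(1.5, 2.0, [0.4, 0.5], 0.3, [0.5, 0.6]))
# Exceeding the power budget triggers the penalty instead.
bad = reward(0.1, constraint_slacks(2.5, 2.0, [0.4, 0.5], 0.3, [0.5, 0.6]))
```

Choosing the penalty larger than any achievable $\Phi$ keeps infeasible actions strictly worse than feasible ones, which is the design intent behind penalizing constraint violations.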
DDPG uses an actor network $\vartheta(s \mid \Delta_{\vartheta})$ and a critic network $Q(s, a \mid \Delta_{Q})$ with parameters $\Delta_{\vartheta}$ and $\Delta_{Q}$. Furthermore, to stabilize training, DDPG employs a target actor network $\vartheta'$ and a target critic network $Q'$ with parameters $\Delta_{\vartheta'}$ and $\Delta_{Q'}$. In DDPG, at each state, the action is determined by a policy that maps states to a probability distribution over actions. By considering a deterministic target policy $\vartheta$, the action-value function satisfies a Bellman recursion in which $\tau \in [0, 1)$ is the discount factor and $r(s_t, a_t)$ is the immediate reward.
Moreover, the critic network is trained by minimizing its loss function [7]. Likewise, the actor network minimizes the loss function $J(\Delta_{\vartheta})$ in (7). Using DDPG, at each state $s_t \in \mathcal{S}$ of the environment, the agent selects an action $a_t \in \mathcal{A}$. Executing the chosen action $a_t$ on the environment, the agent moves to the new state $s_{t+1} \in \mathcal{S}$ and receives a reward $r_t$. These steps lead to a new experienced transition $(s_t, a_t, s_{t+1}, r_t)$ that is stored in a replay buffer $\mathcal{B}$.
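For readers less familiar with DDPG, the following Python sketch shows the standard losses on a batch of transitions $(s_t, a_t, s_{t+1}, r_t)$. It uses the textbook DDPG forms (a bootstrapped target from the target networks, a mean squared error for the critic, and the negative mean critic value for the actor), which are assumed here and may differ in detail from the expressions in the letter. Networks are stood in for by plain Python callables, and all names are hypothetical.

```python
def td_targets(batch, target_actor, target_critic, tau):
    """Targets y = r + tau * Q'(s_next, actor'(s_next)); tau is the discount."""
    return [r + tau * target_critic(s_next, target_actor(s_next))
            for (_, _, s_next, r) in batch]


def critic_loss(batch, critic, targets):
    """Mean squared error between Q(s, a) and the bootstrapped targets."""
    errors = [(critic(s, a) - y) ** 2
              for (s, a, _, _), y in zip(batch, targets)]
    return sum(errors) / len(errors)


def actor_loss(batch, actor, critic):
    """Negative mean critic value of the actions the actor would take."""
    return -sum(critic(s, actor(s)) for (s, _, _, _) in batch) / len(batch)


# Toy scalar "networks" standing in for the neural networks.
toy_actor = lambda s: 0.5 * s
toy_critic = lambda s, a: s + a
toy_batch = [(1.0, 0.0, 2.0, 1.0)]   # one transition (s, a, s_next, r)

y = td_targets(toy_batch, toy_actor, toy_critic, tau=0.9)
l_critic = critic_loss(toy_batch, toy_critic, y)
l_actor = actor_loss(toy_batch, toy_actor, toy_critic)
```

In a real implementation the callables would be neural networks and the losses would be minimized with an automatic-differentiation library; the sketch only fixes the quantities being minimized.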
According to [8], meta-learning can be expressed as the bilevel optimization problem in (8), where the upper level involves meta-critic learning and the lower level involves classic critic learning. The main goal of meta-learning in (8) is to enhance the performance of DDPG by introducing an extra loss function, the meta-critic $\zeta$, for updating the actor network's parameters. In particular, in contrast to DDPG, in which the parameters of the actor network are updated only by minimizing $J(\Delta_{\vartheta})$ in (7), in meta-learning (as in (8)) the actor network is trained by both $J(\Delta_{\vartheta})$ and the meta-critic $\zeta$ via stochastic gradient descent. In addition, $d_{\mathrm{trn}}$ and $d_{\mathrm{val}}$ refer to distinct sets of transition batches randomly sampled from the replay buffer. In (8), the meta-critic parameter $\zeta$ is optimized through meta-learning to speed up the learning progress of the actor network.
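The bilevel structure can be illustrated with a deliberately small scalar analogue. In the Python sketch below, a single actor parameter `theta` is updated on a training loss plus an auxiliary term `zeta * theta`, and `zeta` is then adjusted so that the updated actor does better on a validation loss. The quadratic losses, the linear auxiliary term, the step sizes, and all names are assumptions chosen so that the gradients can be written by hand; this is not the meta-critic network or the loss used in the letter.

```python
def main_loss(theta, target):
    """Toy stand-in for the main actor loss: J(theta) = (theta - target)^2."""
    return (theta - target) ** 2


def main_loss_grad(theta, target):
    """Gradient of the toy main loss with respect to theta."""
    return 2.0 * (theta - target)


def inner_actor_step(theta, zeta, trn_target, lr):
    """Lower level: one gradient step on J_trn(theta) + zeta * theta."""
    return theta - lr * (main_loss_grad(theta, trn_target) + zeta)


def outer_meta_step(theta, zeta, trn_target, val_target, lr, meta_lr):
    """Upper level: move zeta so the updated actor does better on d_val."""
    theta_new = inner_actor_step(theta, zeta, trn_target, lr)
    # Chain rule: dJ_val/dzeta = dJ_val/dtheta_new * dtheta_new/dzeta,
    # and dtheta_new/dzeta = -lr for the inner step above.
    meta_grad = main_loss_grad(theta_new, val_target) * (-lr)
    return zeta - meta_lr * meta_grad


theta0, zeta0 = 0.0, 0.0
plain = inner_actor_step(theta0, zeta0, trn_target=1.0, lr=0.1)
zeta1 = outer_meta_step(theta0, zeta0, 1.0, 1.2, lr=0.1, meta_lr=1.0)
guided = inner_actor_step(theta0, zeta1, trn_target=1.0, lr=0.1)
```

With these toy numbers, the actor step guided by the updated `zeta` ends closer to the validation target than the plain step, which is the sense in which the extra loss is meant to speed up actor learning.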

A. Time-Complexity Analysis of Meta-DDPG
According to [12], to analyze the time complexity of Meta-DDPG, we can count the floating-point operations (FLOPs) for each hidden layer of the actor, critic, and meta-critic networks. For instance, at each hidden layer $\ell$ of the actor network, there is a vector of size $\mu_{\mathrm{actor},\ell}$ and a matrix of size $\mu_{\mathrm{actor},\ell} \times \mu_{\mathrm{actor},\ell+1}$ on which dot product operations are performed. The FLOP count in this case is $(2\mu_{\mathrm{actor},\ell} - 1) \times \mu_{\mathrm{actor},\ell+1}$, since each dot product involves $\mu_{\mathrm{actor},\ell}$ multiplications and $\mu_{\mathrm{actor},\ell} - 1$ additions. Additionally, the time complexity of the activation layer should be taken into account, which involves operations such as addition, subtraction, multiplication, division, exponentiation, and square root. Thus, the time complexity of the actor network equals $2\sum_{\ell=0}^{U-1} \big((2\mu_{\mathrm{actor},\ell} - 1)\mu_{\mathrm{actor},\ell+1} + \kappa\mu_{\mathrm{actor},\ell+1}\big)$, where $\kappa$ denotes the parameter associated with the activation layer. Using the same approach, we can calculate the corresponding time complexity of the critic and meta-critic networks. Accordingly, the overall time complexity of Meta-DDPG follows from combining the complexities of the actor, critic, and meta-critic networks.

V. SIMULATION RESULTS
In this section, we evaluate the performance of the Meta-DDPG algorithm compared to the classic DDPG algorithm. The simulation parameters and hyper-parameters of Meta-DDPG are listed in Table I unless stated otherwise. To implement the Meta-DDPG and DDPG methods, we use PyTorch 1.4.0; to obtain optimization benchmarks, we use Mathematica 13.3.0; and to generate the results of the genetic and exhaustive search methods, we use MATLAB 2021a, all on a MacBook Air (2020) with a specific CPU and GPU configuration.
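As a brief aside on the operation count of Section IV-A, the per-network expression is easy to evaluate numerically. The Python sketch below does so for a list of layer widths; reading the widths as $[\mu_0, \mu_1, \ldots, \mu_U]$ and treating $\kappa$ as a per-output activation cost are assumptions, and the name `network_flops` is hypothetical.

```python
def network_flops(layer_sizes, kappa):
    """Evaluate 2 * sum_l ((2*mu_l - 1) * mu_{l+1} + kappa * mu_{l+1}).

    layer_sizes: widths of consecutive layers, assumed to be [mu_0, ..., mu_U]
    kappa:       assumed number of operations per activation output
    """
    total = 0
    for mu_in, mu_out in zip(layer_sizes, layer_sizes[1:]):
        # Each of the mu_out dot products needs mu_in multiplications
        # and mu_in - 1 additions.
        dot_products = (2 * mu_in - 1) * mu_out
        activations = kappa * mu_out
        total += dot_products + activations
    return 2 * total   # leading factor 2 as in the stated expression


# Toy network with widths 4 -> 8 -> 2 and one operation per activation.
flops_demo = network_flops([4, 8, 2], kappa=1)
```

Applying the same function to the layer widths of the critic and meta-critic networks and adding the three results gives one way to compare model sizes under this counting rule.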
Fig. 2 shows the convergence and performance of Meta-DDPG versus the number of episodes. Specifically, Fig. 2(a) depicts the SE achieved by Meta-DDPG for different numbers of users and different values of $\nu$. It can be seen that SE improves as the value of $\nu$ increases. Besides, increasing the number of users reduces SE: since the number of BS antennas and the maximum transmit power of the BS are limited, the BS must share these limited resources among more users. Furthermore, the STAR-RIS enhances SE compared to the conventional reflective RIS.
Fig. 2(b) demonstrates EE for different numbers of users and different values of $\nu$. We can observe that EE decreases as the number of users increases. When the number of users is increased, the BS must transmit more power to serve them, which results in more inter-user interference and, accordingly, a lower data rate. The reduced total data rate in turn leads to lower EE. Besides, the STAR-RIS achieves higher EE than the conventional reflective RIS.
The generalization ability of Meta-DDPG in comparison with classic DDPG is depicted in Fig. 2(c). To generate Fig. 2(c), we train the learning models of Meta-DDPG and DDPG assuming Rayleigh fading for all channels. Then the trained model is used to analyze a different scenario during the testing phase. For testing, the Rician fading channel is used for the direct channel between the BS and users. As can be observed from Fig. 2(c), Meta-DDPG has a higher generalization ability than DDPG thanks to the new loss function used for updating the parameters of the actor network. Owing to meta-learning, Meta-DDPG also achieves a higher EE than DDPG.
Moreover, Fig. 3(a) shows the impact of the maximum transmit power budget of the BS on EE. As can be seen, increasing $P_{\max}$ improves EE, since the BS can transmit with more power to users and the total data rate is enhanced.
Also, the trade-off between EE and SE is shown in Fig. 3(b), where EE and SE are obtained considering different values of $\nu \in [0, 1]$. From Fig. 3(b), we can observe that EE initially improves with increasing SE, but beyond a certain point a further increase in SE reduces EE. The reason is that to reach a higher SE, the BS should transmit with a higher power, which results in reduced EE.
Moreover, in Fig. 3(c), we compare Meta-DDPG with the genetic algorithm, optimization-based, and exhaustive search methods. Meta-DDPG performs close to the exhaustive search and optimization-based solutions. Besides, Meta-DDPG improves EE compared to the genetic method.

VI. CONCLUSION
We studied the trade-off between SE and EE in a STAR-RIS-assisted SWIPT-enabled wireless MISO system. For this, we formulated a MOOP for jointly optimizing the beamforming vector at the BS, the PS ratio at each user, and the phase shifts and amplitude coefficients at the STAR-RIS. To solve the MOOP, we first converted it to a SOOP using the weighted Tchebycheff approach. Then, we proposed a Meta-DDPG algorithm combining the classic DDPG with meta-learning. Simulation results demonstrated the superiority of Meta-DDPG over classic DDPG and genetic algorithms in terms of EE. Also, Meta-DDPG performs close to the exhaustive search and optimization-based solutions.

Fig. 3. Energy efficiency of Meta-DDPG vs. (a) different values of $\nu$ and $P_{\max}$, (b) spectral efficiency considering different values of $P_{\max}$, and (c) different numbers of STAR-RIS elements $M$.


Algorithm 1 The Proposed Meta-DDPG Algorithm
Input: Maximum number of episodes E, maximum number of time steps T, and N.
Initialize replay buffer B
Initialize the critic and actor networks with parameters $\Delta_{Q}$ and $\Delta_{\vartheta}$
Initialize the target critic and target actor networks with parameters $\Delta_{Q'} \leftarrow \Delta_{Q}$ and $\Delta_{\vartheta'} \leftarrow \Delta_{\vartheta}$
for each episode e = 1, ..., E
    Reset the environment to get the initial state $s_0$
    for each time step t = 1, ..., T
        Select action $a_t$
        Execute action $a_t$ on the environment and receive reward $r_t$
        The network transits from state $s_t$ to $s_{t+1}$
        Transition $(s_t, a_t, s_{t+1}, r_t)$ is stored in B
        for each gradient descent step to solve problem (8)
            A random batch is sampled from B
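The control flow of Algorithm 1 (episodes, time steps, storing transitions, and sampling batches) can be sketched as a plain Python skeleton. The environment, action selection, and update rule are passed in as callables and are left as stubs; the names `ReplayBuffer`, `run_training`, `batch_size`, and `capacity`, as well as performing one update per time step, are assumptions for illustration rather than details given in the algorithm.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (s_t, a_t, s_next, r_t) transitions."""

    def __init__(self, capacity):
        self._data = deque(maxlen=capacity)

    def store(self, transition):
        self._data.append(transition)

    def sample(self, batch_size, rng):
        # Sample without replacement; shrink the batch while the buffer fills.
        k = min(batch_size, len(self._data))
        return rng.sample(list(self._data), k)

    def __len__(self):
        return len(self._data)


def run_training(env_reset, env_step, select_action, update,
                 episodes, steps, batch_size, capacity=10_000, seed=0):
    """Run the episode and time-step loops and return the filled buffer."""
    rng = random.Random(seed)
    buffer = ReplayBuffer(capacity)
    for _ in range(episodes):
        state = env_reset()
        for _ in range(steps):
            action = select_action(state)
            next_state, reward = env_step(state, action)
            buffer.store((state, action, next_state, reward))
            update(buffer.sample(batch_size, rng))   # gradient step(s) go here
            state = next_state
    return buffer


# Stub environment and agent, only to exercise the loop.
batch_sizes = []
demo_buffer = run_training(
    env_reset=lambda: 0.0,
    env_step=lambda s, a: (s + a, -abs(s + a)),
    select_action=lambda s: 1.0,
    update=lambda batch: batch_sizes.append(len(batch)),
    episodes=2, steps=3, batch_size=2)
```

In a full implementation, `update` would perform the critic, actor, and meta-critic gradient steps on the sampled batch, and `select_action` would query the actor network with exploration noise.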