Terahertz Multiple Access: A Deep Reinforcement Learning Controlled Multihop IRS Topology

We explore THz communication uplink multi-access with multi-hop Intelligent reflecting surfaces (IRSs) under correlated channels. Our aims are twofold: 1) enhancing the data rate of a desired user while dealing with interference from another user and 2) maximizing the combined data rate. Both tasks involve non-convex optimization challenges. For the first aim, we devise a sub-optimal analytical approach that focuses on maximizing the desired user’s received power, leading to an over-determined system. We also attempt to use approximate solutions utilizing pseudo-inverse <inline-formula> <tex-math notation="LaTeX">$(P_{inv})$ </tex-math></inline-formula> and block solution <inline-formula> <tex-math notation="LaTeX">$(BLS)$ </tex-math></inline-formula> based methods. For the second aim, we establish a loose upper bound and employ an exhaustive search <inline-formula> <tex-math notation="LaTeX">$(ES)$ </tex-math></inline-formula>. We employ deep reinforcement learning (DRL) to address both aims, demonstrating its effectiveness in complex scenarios. DRL outperforms mathematical approaches for the first aim, with the performance improvement of DDPG over the block solution ranging from 8% to 57.12%, and over the pseudo-inverse ranging from 41% to 190% for a correlation-factor equal to 1. Moreover, DRL closely approximates the <inline-formula> <tex-math notation="LaTeX">$ES$ </tex-math></inline-formula> for the second aim. Furthermore, our findings show that as channel correlation increases, DRL’s performance improves, capitalizing on the correlation for enhanced statistical learning.


I. INTRODUCTION
T HE sixth generation (6G) wireless communications need to provide radically modern services and bandwidth-intensive applications compared to the fifth-generation (5G) such as immersive remote presence, connected robotics; autonomous systems (CRAS), immersive extended reality (XR), and digital twin.These applications demand a 1000× increase in capacity compared to 5G mobile systems [1].To achieve these requirements and overcome the conflict between service demands and spectrum scarcity [2], there is a need to boost the current wireless spectrum bands, and migrate towards higher terahertz M. Shehab, M. Elsayed, T. Khattab, N. Zorba, and M. Hasna are with the Electrical Engineering, Qatar University, Doha, Qatar. A. Badawy is with the Computer Science and Engineering, Qatar University, Doha, Qatar. A. Almohamad is with the Electrical and Computer Engineering, Texas A&M University Qatar, Doha, Qatar.D. Trinchero is with Dipartimento di Elettronica, Politecnico di Torino, Torino, Italy, e-mail: MuhammadShehab@ieee.org, abdullateef@ieee.org,hamid@qu.edu.qa,badawy@qu.edu.qa,tkhattab@ieee.org,nizarz@qu.edu.qa,hasna@qu.edu.qa, and daniele.trinchero@polito.it.This research work was made possible by grant number AICC03-0530-200033 from the Qatar National Research Fund (QNRF).Statements made herein are the sole responsibility of the authors.
(THz) frequency bands which range from 0.1 THz to 10 THz.These frequencies are considered a key element in 6G wireless communications because they possibly support considerable capacities and data rates.However, high-frequency values lead to severe path attenuation, high propagation losses, and sporadic wireless links.Further, these values produce very small wavelength (λ) values, which in turn results in very short communication distances and increases the susceptibility to molecular absorption and blockage [1].To improve the received signal power, and achievable data rate, this paper investigates the intelligent reflecting surface (IRS) as an emerging technology and promising solution [3].IRS manipulates the incident electromagnetic waves and adjusts the phase shifts of the semi-passive reflecting elements in a programmable behavior to yield a smart radio environment and enhance the data rate in an energy-efficient and cost-effective manner [4].
Many recent studies examined IRS deployment in THz communications to inspect the power of IRS to improve the coverage and achievable data rate [5] - [11].For instance, the authors in [5] studied the scenario of IRS-assisted multi-hop multi-pair unicast network, where multiple sources are communicating with multiple destinations.They proposed distributed multiple IRSs controls for a multi-hop interference channel with the purpose of maximizing the achievable rate.A multi-IRS assisted massive multiple-input multiple-output system was studied [6] to increase the minimum received signal power, where a base station (BS) equipped with multi-antenna transmits independent signals to a group of remote users equipped with single-antenna, and a cascaded line-of-sight (LOS) communication links are established among the BS and users by using the cooperative signal reflections of various IRSs groups.In [7] the authors examined the performance of the rate achieved for a decode-and-forward (DF) relaying assisted multi-IRS system, where a single source is communicating with a single destination (user) with the objective of obtaining the optimal IRSs configuration, number of IRSs, and number of IRS reflecting elements that maximize the ergodic rate.Further, in [8] the authors considered a multi-hop IRS-assisted multi-user downlink communication scenario, where the BS is communicating to K users with the scope of maximizing the sum rate by jointly optimizing the beamforming at the BS, and the multiple IRS phase shift reflection matrices.Moreover, in [9], the authors investigated the scenario of an uplink multi-hop IRS communication system where multi-users (sources) are communicating with a single destination.Their scope was to extend the link range in THz communications and maximize the power at the receiver.They proposed a cascaded passive IRS THz system to overcome the high propagation losses caused by the absorption in the air molecules.
To this end, the aforementioned research papers [5] - [9] adopted mathematical techniques to solve their optimization problem.
Unlike these studies, the research papers in [10] and [11] utilized the DRL algorithm to solve the non-convex optimization problem.The authors proposed a hybrid beamforming scheme for the cascaded IRS-aided networks to enhance the coverage of the THz communication links.They investigated the joint design of the analog beamforming at the IRSs and the digital beamforming at the BS to overcome the propagation loss in THz downlink broadcast system, which is a single source to a multi-destination (multi-user) scenario.

A. Contributions
To extend the range of the THz links and compensate for the losses at such a high operating frequency range, we adopt multi-hop IRSs (also referred to as cascaded IRSs) as the core component in our system model.Since, this is typical environment for users using THz links where the coverage is small, then we only consider small areas where several users are not expected.
Thus, we formulate an optimization problem under the assumption of a two-user system, where the objective is to find the optimum phase shifts of the multi-hop IRS's element in order to maximize the received rate for any specific user, and the sum rate for both users.The major challenge in our scheme lies in the non-convexity of the objective function due to the constant modulus constraints of the reflecting IRS elements, non-linear constraints, and computationally intractable multi-hop links.In general, the optimal solution to this NP-hard problem is unknown, and it is difficult to derive an analytical solution using traditional mathematical methods, and the exhaustive search is not practical for large-scale communication systems [10], [11].In addition, solving the problem of the received rate maximization leads to an over-determined system.To tackle this, we exploit a Deep Deterministic Policy Gradient (DDPG)-based algorithm, which is a DRL method, to obtain feasible solutions.
To the best of our knowledge, none of the existing research papers in the literature leveraged the DRL method to solve the over-determined system of equations for the uplink cascaded IRS multiple access scenario.In this paper, we address the above-mentioned gap in the literature by employing DRL-in particular, DDPG, to jointly optimize the phase shifts of each IRS in the cascaded IRS system taking into consideration the case of the spatially correlated channel [12] between IRS 1 and IRS 2 , to achieve two main objectives: a) maximizing the rate for any specific user, and b) maximizing the sum rate for both users.
We detail our contributions below against our two main objectives under the consideration of multi-hop IRSs and multiple access systems operating in the THz range: • Our first objective is to maximize the rate for any specific user under the assumption that the second user is considered as interference.For this objective: -We formulate the cascaded IRS phase shift optimization problem that includes IRS 1 and IRS 2 phases as a closed form, and we prove that it is non-convex and mathematically intractable.
-Further, to overcome this, we propose two sub-optimal solutions, where our objective is to find the optimal phases of the IRS 1 and IRS 2 elements that maximize the received power of the desired user.We show that when solving for the optimized phase shifts that maximize the received power of the desired user, the system is over-determined and therefore, we propose two solutions for this problem through the use of pseudo-inverse and block solutions.
-Moreover, we design a DDPG algorithm to find the optimum phases that maximize the rate of the desired user.
• Our second objective is to maximize the sum rate of two users, For this objective: -We provide analytical analysis for the problem in the case of cascaded IRSs.
-We design a DDPG algorithm to obtain the optimum phase shifts of the cascaded IRSs that maximize the received sum rate.
• We simulate our proposed solutions and compare the obtained results to those obtained through the exhaustive search algorithm and through randomly generated phase shifts.

B. Paper organization
The remainder of this research paper is structured as follows; section II describes the system and channel model for the multi-hop IRS scenario.Section III presents the problem formulation for maximizing the rate of the desired user operating Notation: For more convenience, frequent symbols and parameters along with their description are illustrated in Table 1.

II. SYSTEM AND CHANNEL MODEL
In our system model, we assume that we have two single-antenna users operating in an uplink multi-hop IRS communication system as shown in Fig. 1, where the operating frequency is within the THz range.The two users are equipped with highly directional parabolic antennas and transmit the signal focused to the center of IRS 1 .The antenna diameter for each transmitter is D t and for the receiver is D r .The distances between the two users and IRS 1 , IRS 1 and IRS 2 , and between IRS 2 and R x , are denoted as r tk , r 2 , and r 3 respectively.The horizontal distances between the transmitters and the center of IRS 1 , the center of IRS 1 and the center of IRS 2 , IRS 2 , and the receiver R X are denoted as r k,1,h , r 2,h , and r 3,h respectively.The incident and reflected angles with respect to the center of the illuminated areas at IRS 1 and IRS 2 are represented as θ i1,1 , θ i2,1 , θ r,1 , θ i,2 , and θ r,2 respectively.The heights of the two transmitters, IRS 1 , IRS 2 , and the receiver R x are denoted as T x,1 , T x,2 , s1 , s2 , and Rx respectively.The IRSs act as beamformers that focus the incident signal at a particular reflection direction by modifying the phases of the reflecting units (RUs).The number of RUs in IRS 1 is M = M x × M y and the number of RUs in Total absorption loss for Gain of IRS 1 RU from the incident and reflection angles P r,m The power reflected from the m th RU of IRS 1 P r,mn The power reflected from the n th RU of IRS2 because of being illuminated by the signal reflected by the m th RU of IRS 1 P rx,mn The received captured power at the R x P Rx k The total received power for user k at the receiver (R x ) α m The reflection coefficient of the m th RU of IRS 1 α n The reflection coefficient of the n th RU of IRS 2 ϕ t km The phase for the transmitter channel for user k and m th RU ϕ mn The phase for the h m,n channel for m th and n th RU ϕ rn The phase for the receiver channel for n th RU γ k The received SINR for the The data rate for user k R sum and in practical and in practical Ξ Phase search steps The transmitted signal for each user k, where k ∈ 1, 2, is represented by the following equation: where z k represents the signal for user k with unit power (i.e., E[|z k | 2 ] = 1, E[.] denotes the expectation), and P t represents the transmit power for each user.
Thus, the received signal for each user k is denoted by: is the channel between IRS 2 and the receiver, Φ M = diag(e −jη1 , e −jη2 , ..., e −jη M ) and Φ N = diag(e −jψ1 , e −jψ2 , ..., e −jψ N ) are the phase shift reflection matrices for IRS 1 and IRS 2 respectively that satisfy the constant modulus constraint .., N }, and diag(.)denotes the diagonal matrix.
Moreover, the phase shifts of the m th and n th reflecting elements are represented by η m and ψ n , where η m and ψ n values are between 0 and 2π, and the noise n 0 ∼ CN (0, σ 2 ) represents the AWGN for each user in linear scale.
Moreover, the deterministic phase shifts corresponding to the traveled distances of the signals from each user k over the first hop, and the IRS2-R X link over the third hop are represented as follows: where λ is the wavelength, and it is equal to c/f where c is the speed of light equal to 3 × 10 8 m/s and f is the frequency measured in Hertz (Hz).
The transmitter and receiver channels h t,k , and h r follow the Rician fading model [11], [13]: where It is controlled using the parameter ρ ∈ [0, 1] which represents the correlation-coefficient among the adjacent RUs, and it is expressed as: where θ i,2 is the angle of arrival between IRS 1 and IRS 2 .High values of ρ, result in high correlation among H mn elements, and in cases where ρ is less than 1 (i.e.not equal to 1), the significant correlations are between adjacent RUs only, with considerably low correlation at large distances.Further, we assume that the channels h t,k , and h r are perfectly known for all the transmitters and the receiver.Even-though the channel estimation, and finding out the channel state information (CSI) [14] is a challenging task for IRS-based communication networks, some studies proposed significant methods for obtaining CSIs.The authors, in [15] proposed an efficient channel estimation algorithm for a double IRS-based assisted multi-user MIMO communication system to obtain the cascaded CSI.In [16], the authors conducted a comprehensive survey on channel estimation for IRS-assisted wireless communications focused on solutions that tackle practical design problems.The study in [17] proposed a framework for IRS channel estimation where a small number of IRS elements are capable of processing the received signal to facilitate channel estimation.Thus, the IRS estimates the channel between itself and the BS, and between itself and the users based on the pilot signals received by the IRS semi-passive elements utilizing compressed sensing methods.
Further, the gains for the users' and receiver antennas G t (o) and G r (o) are expressed as: where J 1 (.) is the first-order Bessel function of the first kind, D κ is the diameter of the antenna, and κ ∈ {t, r} represents Tx or Rx antennas, respectively.The angle measured from the broadside of the antenna is represented as o [18].
Thus, the maximum gain is for o = 0 and it is denoted as: where e t and e r represents the aperture efficiencies for the T x and R x respectively.Further, the gain of every RU is expressed where θ i1,k is the angle of incidence from user k to IRS 1 .[18].
The total losses and gains on the path between each T x k and the R x are denoted by L τ,k .This includes the antenna gains, free space path loss (FSPL), and THz absorption loss (AS).
where L abs,τ,k represents the total THz absorption losses for each T x k .These THz losses are obtained under standard atmospheric conditions utilizing the simplified model suggested in [19] , and L F SP L,τ,k represents the total FSPL for each T x k , and it is denoted as: L F SP L,k for the signal reflected from IRS 1 towards IRS 2 is represented as and L F SP L,r between IRS 1 and the receiver is expressed as Thus, the total FSPL for each T x k is expressed as

III. MAXIMIZING THE RATE OF A DESIRED USER UNDER INTERFERENCE
In this section, we provide the analytical derivations of the first objective, which is maximizing the rate of a desired user, while the other user is considered an interferer.Our objective is to find the optimum phases of the cascaded IRSs that maximize the received rate of the desired user.We will show that the rate maximization problem is non-convex and finding a closed-form expression of the IRS phases is mathematically intractable.Then we propose sub-optimal solution to the problem through maximizing the received power of the desired user.In addition, we propose a DDPG algorithm that maximizes rate of a the desired user.

A. Rate of the Desired User Under Interference
The analytical form of the rate of the desired user under the interference scenario can be written as where γ k is the signal to interference plus noise ratio (SINR) of user k.The SINR, γ k , can be written as where P k Rx is the received power of user k and σ 2 is the noise variance.In the following, we provide the anlytical derivations of P k Rx .
1) User's Received Power (P k Rx ): In our scenario, both users are transmitting at IRS 1 , covering all IRS 1 elements from various angles and distances.The power reflected from the m th RU of IRS 1 can be written as in [18] without including the absorption losses calculation.L abs,τ,k is included in L τ,k in the expression for the total received power.
where α m = αe −jηm designates the reflection coefficient of the m th RU of IRS 1 ; G t,k represents the T x antenna gain of user k; G(θ i1,k ) and G(θ r,1 ) represent the gain of RU from the incident and reflection angles, respectively; and r t,k is the distance between T x k and RU m.Similarly, the reflected power from the n th RU of IRS 2 because of being illuminated by the signal reflected by the m th RU of IRS 1 is where α n = αe −jψn designates the reflection coefficient of the n th RU of IRS 2 , H mn represents the (m, n) element of the IRS 1 -IRS 2 channel matrix H. Finally, the received captured power at the R x through channel H mn can be written as follows and the total received power for user k at the receiver (R x ) is expressed as [18] where ϕ t km is the phase for the transmitter channel for user k and m th RU , ϕ mn is the phase for the h m,n channel for m th and n th RU, ϕ rn is the phase for the receiver channel for n th RU , |α n | and |α m | are assumed to be equal to 1, thus equation ( 21) can be re-written as: 2) Desired User's Rate Maximization Problem: To this end, substituting ( 22) in ( 16) leads to Therefore to maximize the rate of the desired user under interference, we have max s.t.
This optimization problem in ( 24) is NP-hard problem, and the solution is non-trivial, because of the non-convexity due to the constant modulus constraints of IRS 1 and IRS 2 reflecting elements [20], and it is mathematically intractable to obtain an analytical closed form expression of the optimum phases shifts of the two IRSs.The optimal solution will be a balance point between increasing the received SNR of the desired user while decreasing the interference from the other user which are not guaranteed to be aligned sub-objectives.Hence, we find a sub-optimal solution of (24) through maximizing received power of the desired user only.
3) Sub-optimal Solutions: Maximizing the Received Power of the Desired User: The total received power of the desired user (e.g.user 1) can be maximized by solving the following system of equations: where ν is any constant value, which means that P k Rx will be maximum when ν is constant ∀m, n.In our case, we will choose ν = 0. Thus, equation ( 26) describes an over-determined system of equations with M + N unknowns, and M × N equations.We illustrate below equation (26) in more detail: This is almost always inconsistent and has no solution.However, we solve this problem using sub-optimal mathematical techniques such as pseudo inverse and block solution to obtain the unknowns η m and ψ n , calculate Φ M , Φ N , and the received power for user 1, and then we compare the results generated from these techniques to those obtained from deep reinforcement learning solution.
a) Pseudo-Inverse Solution: The pseudo-inverse solution for the over-determined system is expressed as follows: where Θ with dimensions (M + N ) × 1 is the matrix that represents IRS 1 , and IRS 2 phase shifts from η 1 to ψ M +N , A with dimensions (M × N ) × (M + N ) represents the binary matrix, and C with dimensions (M × N ) × 1 represents the matrix containing constant values such as the phase shifts of the transmitter channel h t,k , the phase shifts of the channel between IRS 1 and IRS 2 H mn , and the phase shifts of the receiver channel h (t) r .
where A + is pseudo-inverse of a matrix A. Thus, the pseudo-inverse solution Θ is expressed as b) Block Solution: Moreover, based on the spatially correlated channel assumption, a low-complexity solution built on the exponential correlation model can be developed and is controlled using the correlation-coefficient ρ between the adjacent RUs.High ρ values lead to high correlation across H mn elements.This results in two-dimensional blocks of high correlations across the diagonal elements of the channel matrix.Building on this, the SINR can be maximized by selecting one element from each block and replicating its phase response to all other elements in the same block.This solution is denoted as a block solution.
In the case where the channel correlation is very high, the channel H mn has a block structure, where the elements in the channel matrix form groups, each having the same phase with no correlation between the contiguous blocks.Thus, the number of phases that the IRSs need to compensate for is minimized to the number of blocks in the channel matrix H mn .Thus, the over-determined system becomes solvable if the number of blocks in the channel matrix N blk ≤ M + N , where N blk denotes the number of blocks in the channel.
Therefore, the total received signal power when using the block solution can be rewritten as: where V = M Nblk , and W = N Nblk represent the number of blocks, and v, and w represent the index of the blocks in the correlated channel.calculate the total received power using (32).Solve (26) using the pseudo-inverse solution, and calculate the total received power using (22).endif Proposition 1.The total received signal power for each T x k at the R x in eq. ( 22) can be expressed as: The proof is in the Appendix A.

IV. MAXIMIZING THE SUM RATE FOR BOTH USERS
In this section, we provide the mathematical derivations of the optimization problem of our second objective, which is maximizing the sum rate of both users.In particular, in our second objective, we find the optimum phase shifts of IRS 1 and IRS 2 elements that maximize the combined sum rate for all users at the receiver.Moreover, we derive a a loose upper bound that is used to benchmark the results of our proposed solutions.The sum rate can be written as Accordingly, the formulated problem at IRS 1 and IRS 2 is to find out the phase shift matrices Φ N , and Φ M that maximizes R sum , and it is expressed as: Similar to the optimization problem in (24) this optimization problem is NP-hard problem, and the solution is non-trivial, because of the non-convexity due to the constant modulus constraints of IRS 1 and IRS 2 reflecting elements [20], and it is

A. Upper bound on Performance
By assuming the case of null interference in (23), a loose upperbound 38 on the sum rates can be determined as follows: The SINR on user k becomes: Hence, the upper bound on sum rate can be written as:

V. PROPOSED DDPG FOR CASCADED IRS PHASE CONTROL
As stated earlier, the optimization problems in (24) and (35) are non-convex and finding the optimum phases that maximize the rate through exhaustive search is computationally infeasible.Hence, we design DDPG solutions to find the optimum phases of the cascaded IRS.
In this section, we introduce the scheme of the proposed DDPG algorithm in (see Fig. 2) to solve the optimization problem in ( 24) and (35) for the cascaded IRS system.Deep Q-Networks are not suitable since they deal with discrete time spaces only.Moreover, the convergence of the policy gradient (PG) algorithm is not sufficient in the context of wireless Because action space is continuous, the exploration of the action space is handled with the noise that is generated by the OU process.The OU process samples the noise from a correlated normal distribution.
3) Reward function: The reward function for the first objective is defined based on the maximum received power for the desired user: where Rx k is the maximum received power for user 1.
For the second objective, it is defined as the maximum sum rate for users : where sum is the actual sum-rate for the users.

4) DDPG Algorithm Framework:
The objective of the DDPG algorithm is to train the agents IRS 1 and IRS 2 to take actions that maximize the long-term average reward (i.e.user 1' received power and user's sum rate) throughout the changes of the environments.The agents IRS 1 and IRS 2 adjust the randomized policy and the phases shift matrices in a way that deals with the random environmental statistical behavior to maintain a long-term average reward, rather than an instantaneous response to the channel random changes.
For each iteration, the agents IRS 1 and IRS 2 observe the state which includes the received SINR of the previous state for , at the R x , the received SINR of the previous state for T x2 , γ , at the R x , and the reward of the previous state.After that, it calculates the action Φ M and Φ N that maximizes the long-term reward.This is accomplished by the actor network, whereas the critic network accepts the state and the action as inputs and outputs the expected reward (i.e.user 1' received power and user's sum rate).After the reward calculation, a new state is observed, and the agents IRS 1 and IRS 2 will modify the phases accordingly till the system learns how to reach the optimal reward.To increase stability, the target actor and critic networks are updated periodically based on the newest actor and critic parameter values.
As revealed in algorithm 2, in step 1 we initialize the replay buffer D of the agent with capacity C. Then we initialize the weights of the actor network and critic network in step 2. In step 3, the actor target network and the critic target network are initialized.The phase shifts for all elements of IRS 1 and IRS 2 are chosen randomly from 0 to 2π at the start of each episode.The following steps from step 4 to step 13 are repeated for each timestep (i.e.iteration).For each timestep, we obtain the following channels h , and R sum , observe the state s (T ) for the agents IRS 1 and IRS 2 , and the actor network will determine an action a (T ) (i.e.Phase shift matrices) with exploration noise (OU) in step 5.The action a (T ) is reformed into a phase shift matrices for IRS 1 and IRS 2 Φ M = diag(e −jη1 , e −jη2 , ..., e −jη M ) and Φ N = diag(e −jψ1 , e −jψ2 , ..., e −jψ N ).After the agents determine and execute the action, a reward r (T ) is calculated, and a new state s (T +1) is observed.The state s (T ) , action a (T ) , reward r (T ) , and the new state s (T +1) are stored as one transition in the replay memory D in steps 6 and 7.Then, the critic network samples a random mini-batch of transitions from the main memory in step 8 to calculate the target Q-value Q(s (i) , a (i) |θ Q ) in step 9 using Bellman equation.In step 10, the weights of the target actor network are updated by minimizing the loss using the obtained target Q-value.In step 11, the weights of the target critic network are updated according to the sampled policy gradient.Finally, the target actor and critic networks are updated using the soft updates τ to increase the learning stability as shown in step 12. Observe state s (T ) and select an action with exploration OU noise a (T ) = µ(s (T ) |θ µ ) + nT 6:

8:
Randomly sample mini-batch transitions from D: Compute the targets: Update the θ Q in the critic network by minimizing the loss: Update the θ µ in actor network according to the sampled policy gradient: Update the target networks:

B. Complexity Analysis
To reveal the complexity of the DRL algorithm, we demonstrate a quantitative analysis of the proposed DRL-based algorithm C DRL versus the complexity of pseudo-inverse, block solution, and exhaustive search methods.The DRL algorithm is a neural network-based algorithm and its architecture possesses the multi-layer perceptron (MLP) structure, which is a fully connected class of feed-forward artificial neural network (ANN).Thus, the complexity of the forward pass in MLP is a vector or matrix multiplication.For our system, the evaluation of the DRL complexity depends mainly on the calculations throughout the exploitation phase.Hence, we are interested in the complexity of the trained network (steady-state) which relies heavily on the actor network (i.e.forward network architecture).
Deep neural networks are composed of multiple layers, an input layer, an output layer, and hidden layers.We assume that S is the number of states, which is the size of the actor network's input, U i is the number of neurons in each layer's input, n is the number of hidden layers, U j is the number of neurons in each layer's output, A is the number of actions, which is the size of the actor network's output.Thus, the complexity of the input layer is O (S × U i), the complexity of the hidden layers is O (n × U i × U j), and the complexity of the output layer is O (U j × A).Hence, the overall complexity of the DRL Further, the DRL algorithm will always select the action A that yields the highest reward, and performs a linear search on the output.Therefore, the overall computational complexity of one forward pass in the neural network is expressed as [23]: On the other hand, the pseudo-inverse solution of a matrix A with size M N × (M + N ) can be calculated from its singular value decomposition (SVD) as O((M N ) 2 × (M + N )), where M N > (M + N ), and M and N are the number of elements in IRS 1 and IRS 2 respectively [24], [25].However, for the block solution, the complexity of a matrix A with size , where N blk < (M + N ).Moreover, the complexity of the exhaustive scheme C ES assuming K users, M elements for IRS 1 , and N elements for IRS 2 , and phase search steps Ξ = 2π ∆Φ , can be expressed as Thus, the complexity of the DRL algorithm is much lower than that of the pseudo-inverse, block solution, and exhaustive search methods as the number of IRS elements increases as shown in the following figures Fig. 3, and Fig. 4.

VI. SIMULATION RESULTS
In this section, we evaluate the performance of the proposed DDPG-based cascaded IRS-assisted wireless THz communications.
To reveal the performance of our DDGP system, we need to compare it with a benchmark scheme considering the scenario of maximizing the rate for the desired user, and the scenario of maximizing the sum rate for both users.Therefore, when maximizing the rate for the desired user (i.e.user 1), we provide two benchmarking schemes as reference models for our system with reflecting elements M = 18 for IRS 1 , and N = 18 for IRS 2 ; the first scheme is based on pseudo-inverse, and the second is based on block solution.Further, when maximizing the sum rate for both users, we compare the sum rates obtained from using the DDPG algorithm to a discretized exhaustive search approximation to prove that our DDPG scheme performance is close to the exhaustive search.We employ this procedure to find out the optimum phase shift matrices that result in an approximation to the maximum sum rate.To avoid the high complexity of the exhaustive search algorithm, we consider a limited number of reflecting elements M = 4 for IRS 1 , and N = 4 for IRS 2 .For each IRS 1 and IRS 2 element, we consider the phase shifts between 0 and 2π with a search step of 2π/72.This will give us (72 + 1) M +N combinations of phase shift reflection matrices.After obtaining the optimum phase shift matrices, we calculate the sum rates accordingly for users.
Moreover, we compare the DDPG sum rates to those calculated based on random phase generation (i.e.without optimization) as another way to benchmark our system.
The default parameters used in the simulation for the DDPG-based cascaded IRS algorithm are shown in Table 2.The number of reflecting elements used for IRS 1 and IRS 2 is 18.The number of users is K = 2, the number of antennas per user is N t = 1 the number of antennas at the receiver is N r = 1, and the wavelength is λ = 10 −3 .The channel between user 1 and IRS 1 , user 2 and IRS 1 , IRS 2 , and the receiver follows the rician fading model with rician factor K1 = K2 = 10.
The path loss exponent for the channel between the transmitters and IRS 1 is 2, and the path loss exponent for the channel between IRS 2 and the receiver is 2. The carrier frequency is f = 300 × 10 9 , the bandwidth is BW = 2 × 10 9 , the noise spectral density N P SD = −174 db/Hz, and the noise figure at the receiver F dB = 10 db.Moreover, the coordinates for IRS 1 is (x r1 = 5,y r1 = 10,h r1 = 12), the coordinates for IRS 2 is (x r2 = 10,y r2 = 10,h r2 = 12), the distance between user 1 and IRS 1 is variable with range between r t1 = 3 m and 15 m, the distance between user 2 and IRS 1 is 15 m, and the reflection coefficients of IRS 1 and IRS 2 is α = 1.The coordinates of the receiver R x is (x rx = 20,y rx = 0,h r = 5).The antenna diameter is D t = 0.12 m.The height of user 1 and user 2 is h t = 5.We define the distance ratio as the distance r t1 between user 1 and IRS 1 , divided by the distance r t2 between user 2 and IRS 1 .Numerical results for the sum rates are calculated using 10 3 Monte-Carlo simulations.
The proposed DDPG scheme is composed of actor and critic networks.Both networks are dense neural networks (DNN) with 4 layers.Each layer is nn.Linear and accepts two parameters, the first is the input size and the second is the output size.
For the actor network, the states are the input with a size equal to 3 neurons, and the output is the action with size of 36 neurons.In addition to the input and output layers, there are two hidden layers each that accepts 128 neurons as input and outputs 128 neurons, followed by the ReLU activation function.To provide enough gradient, the output layer of the actor network uses the tanh(•).For the critic network, the input consists of the number of states and the number of actions, both are concatenated to represent the input of the critic network.This is followed by two hidden layers between the input layer and the output layer, each that accepts 128 neurons as input, and outputs 128 neurons, followed by the ReLU activation function.
The output layer of the critic network represents the Q-value with 36 neurons.To update the parameters, both actor and critic networks use Adam optimizer.Furthermore, the results are generated by considering the average rate for over 1000 iterations.
The learning rate of the actor network is set to 3 × 10 −4 , and the learning rate of the critic network is set to 1 × 10 −4 .The discount factor of the future reward Γ, the batch size is equal to 128, the replay buffer C is equal to 10 5 , and the number of episodes is 10000.
The generated results shown in Fig. 5 reveal the convergence of the DDPG algorithm.The figure shows the rewards versus the episodes, and that the rewards are increasing with time.This reveals that the learning process is conducted successfully.

A. Simulation Results for Maximizing The Received Power At The Receiver R x
In the following simulations, we demonstrate the results for maximizing the rate for the desired user (i.e.user 1) at the receiver by plotting the rate of user 1 versus the distance ratio between user 1 and user 2 utilizing various schemes such as DDPG, block solution, and pseudo-inverse solution methods.
Fig. 6 and Fig. 7 shows user 1's rate for the DDPG scheme versus the distance ratio between 0.2 and 1.As the distance ratio increases, the data rate for user 1 decreases, because user 2 will be closer to user 1, and thus the interference from user 2 to user 1 increases.Moreover, Fig. 6a shows that user 1's rate for the DDPG scheme exceeds that of pseudo-inverse and block-solution methods with a correlation-coefficient ρ equal to 1.0.This reveals the effectiveness of the DDPG algorithm and that it is superior to pseudo-inverse and block-solution.
Furthermore, the following figures Fig.  of correlation channels in our scenario and their benefit in increasing the data rate.On top of that, the DDPG scheme still achieves higher rates than block-solution and the pseudo-inverse solution even when the correlation-coefficient ρ decreases.
On the other hand, it is important to note that, for low values of correlation-coefficient ρ the difference between the DDPG's data rates and other schemes retracts.
Fig. 8 demonstrate the rates for the DDPG scheme versus the distance-ratio for various correlation-coefficients.It is clear that when the correlation-coefficient ρ increases the data rates for the DDPG scheme increase, since increasing the value of correlation, will increase the learning efficiency of the DDPG algorithm.Thus, the DDPG scheme achieves higher data rates than other methods especially when the correlation-coefficient ρ is high.

B. Simulation Results for Maximizing the Sum Rate for Both Users
In the following simulations, we show the results for maximizing the sum rate for both users at the receiver.We plot the sum rate versus the distance ratio between user 1 and user 2 utilizing the DDPG method for various correlation coefficients as shown in Fig. 9.It is obvious from the results that the sum rates obtained from the DDPG solution increase with the increase of the correlation-coefficient ρ.
Further, in all our simulations, constant learning rates were used for our proposed DDPG scheme, which is 10 −4 for actor network, and 3 × 10 −4 for the critic network.The influence of the learning rate on the DDPG data rates is shown in Fig. 10, which reveals a comparison between different learning rates, i.e., 10 −3 , 10 −4 , 10 −5 for actor network, and 3 × 10 −3 , 3 × 10 −4 , 3 × 10 −5 for critic network.Thus, the highest DDPG rate is achieved when the actor networks' learning rate equals 10 −4 and the critic networks' learning rate equals 3 × 10 −4 .Therefore, the average rewards are determined by the learning rates (e.g. 10 −4 ).Too small learning rates 10 −5 or too large learning rates 10 −3 produce lower average rewards, where the learning rate 10 −4 achieves better rewards.
To verify the performance of our DDPG scheme, we compare the sum rates produced by the DDPG algorithm to the discretized exhaustive search algorithm that is used to calculate the maximum sum rate by obtaining the optimum phase shift matrix.The complexity of the exhaustive search is too much high so the number of reflecting elements used for IRS 1 is M = 4, and for IRS 2 is N = 4 rather than M = N = 18.For each IRS element, we take into consideration the phases between 0 and 2π with a search step size equal to 2π 72 .Thus the number of combinations for the phase shift matrices is equal to (72 + 1) 4 .The sum rates are calculated for two users and 100 Monte-Carlo simulations.The results are revealed in Fig. 11 where the

VII. CONCLUSION
In this paper, we considered the uplink multiple access scenario of the cascaded IRS system to combat the short-range communications in THz networks, intending to achieve two objectives.The first objective is to maximize the rate of the desired user.We showed that the problem is non-convex and finding a closed form expression is mathematically intractable.Therefore, we proposed two sub-optimal solution for maximizing the received power of the desired user.The second objective is to maximize the sum rate for both users which is also non-convex problem and more complicated.We employed the DDPG algorithms which are capable of coping with non-convex optimization problems to solve the maximization problem for the cascaded IRS system.DDPG algorithm obtains the optimum IRS phases that maximize the received rate for the desired user, and the sum rate for both users.Simulation results for the first objective reveal that DDPG can achieve higher data rates than sub-optimal methods, pseudo-inverse and block-solution.For the second objective, it is clear that the DDPG sum rates are    close to the discretized exhaustive search with a search step equal to 2 * pi/72.Further, DDPG reveals the significance of the correlation in the channels to enhance the learning process and achieve higher data rates.

A. Proof of Proposition 1 (See page 12)
The total received signal power for each T x k at the R x in eq. ( 22) can be expressed as: • From the obtained result we can deduce that the total received signal power for each T x k at the R x can be written as follows:
and ht,k ∈ C 1×M are the LOS component, and non-LOS (NLOS) component respectively.Similarly, K 2 is the rician factor of h r , hr ∈ C N ×1 and hr ∈ C N ×1 are the LOS component and NLOS component respectively.The channel between IRS 1 and IRS 2 , H m,n ∼ CN (0, R), follows the spatially correlated Rayleigh fading channel model, where R is the covariance matrix obtained based on the exponential spatial correlation model.

Algorithm 2 DDPG-based Framework 1 :
Network Architecture: The architecture of the DDPG scheme is composed of 4 neural networks which include the actor and critic networks as well as the target actor and target critic networks which are used to improve the stability of the learning process.The target networks are used in the Q-target formula to estimate the value of the future states that are used to train the current networks.Initialization: Set T = 0 and initialize reply buffer of DDPG agent D with capacity M.2: Randomly initializes the weights of actor networks θ µ and critic networks θ Q .3: Initialize target networks: θ µ ← θ µ and θ Q ← θ Q .4: for T = 1 to ∞ do 5:
Fig. 6c, Fig. 7a, and Fig. 7b demonstrate that when the correlation-coefficient ρ decreases, user's 1 rate decreases for DDPG algorithm, block-solution, and pseudo-inverse solution.This shows the importance

Table 1 :
List of frequently used parameters and symbols.Phase shift reflection matrix for IRS 1 and IRS 2 , respectively φ m , φ n Phase shift of IRS 1 reflecting element m and IRS 2 reflecting element n, respectively η m Phase shift of m th IRS 1 reflecting element ψ n Phase shift of n th IRS 2 reflecting element D t , D r Antenna diameters for each T x k and R x , respectively r tk , r 2 , r 3 Distance between: each user k and IRS 1 , IRS 1 and IRS 2 , and IRS 2 and R x , respectively r k,1,h , r 2,h , r 3,h Horizontal distance between: user k and center of IRS 1 , centers of IRS 1 and IRS 2 , and center of IRS 2 and R x , respectively θ ik,1 Incident angle from user k w.r.t. the center of the illuminated area at IRS 1 θ r,1 Reflected angle w.r.t. the center of the illuminated area at IRS 1 θ i,2 Incident angle from IRS 1 w.r.t. the center of the illuminated area at IRS 2 θ r,2 Reflected angle w.r.t. the center of the illuminated area at IRS 2 T x,k , Rx , s1 , s2The height of the T x k , R x , IRS 1 and IRS 2