DRL-Assisted Dynamic Subconnected Hybrid Precoding for Multi-Layer THz mMIMO-NOMA System

Massive multiple-input multiple-output (mMIMO) techniques can be combined with the non-orthogonal multiple access (NOMA) scheme in terahertz (THz) communication to achieve multiplexing gains and satisfy the ultra-high capacity and massive connectivity requirements. However, the development of a near-optimal solution for energy and spectral efficiency problems in a dynamic wireless cellular environment remains challenging. In this paper, a cooperative THz mMIMO-NOMA enabled base station is established to optimize the power consumption and maximize the spectral efficiency. A multi-layer mMIMO antenna architecture is used to perform dynamic sub-connected hybrid precoding in each layer. The fuzzy c-means clustering algorithm is used to group densely located users into clusters to efficiently use the power coefficients. To optimize the power distribution constraints and coordination of the hybrid precoding structure, a multi-agent deep reinforcement learning algorithm is developed, which operates in a distributive manner. Each base station layer involves an agent that trains a deep Q-network, and optimal actions are executed by sharing exchangeable network parameters among layers. The simulation results indicate that the proposed scheme is able to learn the trade-off between maximization of the energy efficiency and overall system capacity.

Index Terms-Deep reinforcement learning (DRL), hybrid precoding, massive multiple-input multiple-output (mMIMO), non-orthogonal multiple access (NOMA), terahertz (THz).

I. INTRODUCTION
Ultra-massive interconnectivity, together with the rapid growth of wireless data rate requirements, is one of the major challenges in future wireless networks. As a result, new spectra with promising features such as ultra-broad bandwidth are necessary. In addition to the millimeter-wave (24-300 GHz) band, attention must be focused on the terahertz (THz) band, which offers a much higher bandwidth. In accordance with the recommendation of the International Telecommunication Union, the frequencies between 275 GHz and 3 THz are reserved for sixth-generation THz wireless systems. The THz band can support a transmission rate of tens of Gbps while enabling ultra-low-latency communications with improved directionality, confidentiality, and strong anti-interference ability [1]. Therefore, high data rate short-range broadband THz wireless communication is feasible. In a THz cellular network, a data rate of 18.3 Gbps was achieved with 99.999% reliability for virtual reality users modeled by a Matern hardcore point process [2]. However, the THz band, which typically has a multi-GHz bandwidth, incurs an extremely high propagation loss that decreases the communication range [3]. The multiple-input multiple-output (MIMO) technology can provide a high beamforming gain to compensate for the path loss. A THz base station (BS) must be implemented with large antenna arrays (i.e., more than 500 antenna elements), referred to as massive MIMO (mMIMO). A large number of antenna elements can be integrated in a physically limited space because of the sub-millimeter wavelength. Thus, a THz mMIMO BS can provide multiplexing gain by supporting multiple data streams, thereby enhancing the spectral efficiency. The system performance can be enhanced using non-orthogonal multiple access (NOMA) technologies, which can support an increased number of users and channel capacity by simultaneously using the same time and frequency domains [4]. The successive interference cancellation (SIC) technique inherent to NOMA can help eliminate the interference from strong users, thereby increasing the system throughput [5], [6]. In an mMIMO antenna system, the number of transmitting beams is restricted by the number of transmitting antennas. When mMIMO and NOMA are used cooperatively, intra-beam superposition coding can be applied to support multiple users per beam while enabling SIC [6]. Therefore, the cooperative use of NOMA and mMIMO in THz communications has been recommended to enhance the spectral efficiency, energy efficiency (EE), and connectivity at a large scale.
In particular, the authors of [7] presented analytical and numerical solutions for the power allocation problem in a cooperative THz MIMO-NOMA system to maximize the minimum achievable rate. Based on cooperative simultaneous wireless information and power transfer, a THz MIMO-NOMA system was proposed in [8] to enhance the wireless connectivity, resource management, scalability, reliability, and user fairness. To enhance the channel conditions, a spatial tuning technique was developed in [9] for a THz ultra-massive MIMO-NOMA configuration. Moreover, to limit the hardware complexity of the THz MIMO BS, the hybrid precoding (HP) technique has recently been investigated and can achieve satisfactory spectral efficiency compared to traditional digital precoding. In HP frameworks, the signal processing is divided into low-dimensional digital baseband precoding and high-dimensional analog radio frequency (RF) precoding. The networking performance and power consumption depend on the connection dimensions of the RF chains and mMIMO transmitting antennas. HP architectures can typically be divided into 1) fully connected HP (FCHP) and 2) sub-connected HP (SCHP). The critical difference is that in the FCHP, each RF chain is connected to every antenna, whereas in the SCHP, an array of antennas is connected to only one RF chain. Therefore, the FCHP achieves a higher spectral efficiency but consumes more power than the SCHP. An appropriate precoder can be used to provide highly directional beams to mitigate multiuser interference [10]. However, none of these techniques can adaptively and dynamically control the circuit connections, which is necessary under dynamically varying network channel conditions. Furthermore, to perform resource optimization in THz mMIMO-NOMA networks, developing a suitable user clustering technique is a challenging research area. Only a few studies on MIMO-NOMA systems have considered user pairing when clustering a limited number of active users [11], [12], [13]. A joint user clustering and power allocation algorithm was proposed in [14], where the main target was to increase the sum rate by selecting the two best users in a cluster. However, this system introduced polynomial computational complexity in user clustering. An interference-aware graph-based clustering approach was proposed in [15] for two types of users: cellular and device-to-device. A channel graph was constructed by the BS for cellular users by measuring the channel correlation between two users. Although user clustering for power optimization and resource management in low-frequency band networks has been extensively studied, the corresponding strategies for THz mMIMO-NOMA networks are limited.
The maximization of the energy or spectral efficiency of a cellular network is typically a nondeterministic polynomial (NP)-hard and nonconvex problem, and thus, an optimal solution is difficult to obtain. Although several sub-optimal solutions are available to address these problems, these solutions are based on a centralized approach that involves a non-negligible delay and are thus not suitable for a dynamic wireless environment, which is commonly encountered in practice. Recently, deep reinforcement learning (DRL) has been used to develop promising solutions to address various NP-hard problems in the communications and networking domains. DRL represents a combination of deep learning and reinforcement learning techniques, aimed at performing practical decision-making tasks with a large state-action space. The authors of [16] proposed a reconfigurable THz MIMO-NOMA framework to intelligently coordinate beamforming between access points and reconfigurable intelligent surfaces (RISs) by using a multi-agent DRL algorithm. Moreover, in [17], a dynamic downlink-beamforming coordination system consisting of multiple BSs with a single user was proposed using a distributed DRL algorithm to maximize the achievable rate. A hybrid beamforming scheme [18] was developed for multihop RIS-assisted THz communication networks. Specifically, a DRL-based algorithm was developed for the multihop environment to enhance the coverage range by solving an NP-hard beamforming problem. Furthermore, a dynamic power allocation scheme was proposed in [19] based on a distributively executed DRL technique for practical wireless scenarios.
This paper proposes a multi-agent DRL-based THz mMIMO-NOMA system, in which the mMIMO antennas are grouped into multiple layers, as shown in Fig. 1. For simplicity, the number of antennas per layer is kept equal. The distinct antenna layers support users categorized by their channel gain requirements. Therefore, the antenna layers are incorporated with a dynamic SCHP (DSCHP) scheme to obtain an adaptive antenna configuration (see Fig. 2) that can satisfy the dynamic wireless channel requirements. The conventional FCHP scheme in mMIMO provides full array gain compared to the SCHP while consuming more energy. Hence, the proposed DSCHP structure is advantageous in reducing power consumption while supporting users with moderate channel gain. To achieve this goal, a multi-agent DRL framework is developed in which each layer corresponds to an agent. The agents determine the beamforming parameters by playing the best action with a fair reward and by exchanging information between the layers. The fuzzy c-means (FCM) clustering algorithm is adopted for partitioning the users supported by NOMA-enabled beamforming. The layers can hold multiple clusters by subdividing the antennas into subarrays. The simulation results described in Section V show that the proposed system achieves better performance in terms of energy efficiency while meeting the required channel gain. The key contributions of this manuscript are as follows:
• A multi-layer mMIMO antenna system for THz communications is proposed, which incorporates the DSCHP scheme in each antenna layer. The antennas in each layer are divided into multiple subarrays, with each subarray responsible for one user cluster.
• The users of distinct channel responses are grouped into multiple clusters using FCM clustering. The results show that the clustering based on the proposed system has a higher fuzzy partitioning coefficient (FPC).
• A multi-agent DRL framework is presented in which a deep Q-network (DQN), known as the training DQN, is centrally trained using the shared experiences stored in the replay memory of all the agents (i.e., layers) to decrease the computational overhead and memory usage. Each agent also involves another DQN, named the target DQN, whose parameters are updated periodically and executed distributively in each BS layer.
• The proposed cellular environment was simulated in Python. Based on the DRL framework, a near-optimal solution is achieved for the system, as the spectral efficiency converges to a higher value compared to other baseline schemes. Moreover, the EE performance is better than that of a scheme with full array gain.
The remainder of this paper is organized as follows. Section II describes the system model with the proposed DSCHP structure and channel properties. Section III describes the FCM-based clustering technique for the DSCHP system. Section IV describes the proposed DRL framework for the DSCHP-based THz mMIMO-NOMA system. Section V presents the simulation results and their discussion. Finally, the conclusion is drawn in Section VI.

II. SYSTEM MODEL
We consider a multi-layer architecture of a single-cell downlink mMIMO-NOMA enabled BS for 6G THz communications. As depicted in Fig. 1, the BS has $N$ antennas divided into $L$ layers, with each layer containing $N_l = N/L$ antennas. Multiple layers are introduced to exploit the different combinations of HP schemes. Therefore, each layer consists of subarrays of antennas, with each subarray containing $N_{sa}$ antennas. The HP scheme is applied to the mMIMO transmitting subarray antennas in each layer, where $N_{rf,l}$ is the number of RF chains connected in the $l$th layer. NOMA is incorporated to satisfy the massive connectivity requirements. NOMA is implemented with each transmitted beam to support multiple users by using the same time and frequency resources. In this mMIMO-NOMA system, all the single-antenna users are served in the form of clusters supported by the NOMA-enabled transmitted beams. The users are divided into $C$ clusters, with each beam dedicated to one cluster. A group of clusters, $C_l = N_l / N_{sa}$, is supported by the $l$th layer, and $K_{l,c}$ is the number of users in the $c$th cluster of the $C_l$th group, with $c = 1, \ldots, C$ and $l = 1, \ldots, L$. Without loss of generality, the number of clusters supported by the $l$th layer is set equal to the number of RF chains in that layer to ensure the multiplexing gain.
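The dimensioning rules of this section can be checked with a short Python sketch. The function name and the example sizes are hypothetical and chosen only for illustration; the sketch assumes the equal split $N_l = N/L$ and the relation $C_l = N_l / N_{sa}$ stated above.

```python
def layer_dimensions(n_antennas, n_layers, n_subarray):
    """Return (antennas per layer N_l, clusters per layer C_l) for an equal split."""
    if n_antennas % n_layers != 0:
        raise ValueError("N must be divisible by L for an equal split")
    n_layer = n_antennas // n_layers      # N_l = N / L
    if n_layer % n_subarray != 0:
        raise ValueError("N_l must be divisible by N_sa")
    n_clusters = n_layer // n_subarray    # C_l = N_l / N_sa
    return n_layer, n_clusters


# Hypothetical sizes, not taken from the paper.
n_l, c_l = layer_dimensions(n_antennas=512, n_layers=2, n_subarray=128)
print(n_l, c_l)
```

Because the number of clusters per layer is set equal to the number of RF chains, the second returned value can also be read as $N_{rf,l}$ under the same assumptions.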

A. DSCHP Modelling
In the FCHP system, each RF chain is connected to every subarray antenna through finite-resolution phase shifters so that all the RF chains achieve a full array gain. In contrast, in the SCHP system, a finite set of transmitting antennas is connected to each of the $N_{rf,l}$ RF chains through fewer phase shifters. Therefore, the SCHP system requires a low-complexity circuit arrangement and consumes less power than the FCHP system. However, the SCHP system exhibits inferior network performance, such as a lower spectral efficiency, compared to the FCHP. Considering these limitations, we establish a multi-layer architecture in which each layer incorporates the DSCHP scheme according to the received signal strength of the user. Fig. 2 shows the DSCHP structure with a junction network introduced between the RF chains and subarray antennas in the $l$th layer. The objective is to decrease the power consumption by using fewer connected RF chains and phase shifters while satisfying the user requirements.
The signal received at time $t$ by the $i$th user in the $c$th cluster of the $l$th layer is given by (1), in which $X_{l,c}(t)$ represents the superposed signal for all $K_{l,c}$ users in the $c$th cluster of the $l$th layer, defined as

$$X_{l,c}(t) = \sum_{i=1}^{K_{l,c}} \sqrt{\rho_{l,c}^{(i)} P_{l,c}}\, x_{l,c}^{(i)}(t),$$

where $x_{l,c}^{(i)}(t)$ represents the signal of the $i$th user in the $c$th cluster of the $l$th layer at time $t$, $P_{l,c}$ represents the transmit power at each subarray of the $l$th layer, and $\rho_{l,c}^{(i)}$ denotes the power coefficient of the $i$th user in the $c$th cluster of the $l$th layer, subject to the following conditions:

$$\sum_{i=1}^{K_{l,c}} \rho_{l,c}^{(i)} = 1 \tag{3}$$

and

$$\sum_{c=1}^{C} P_{l,c} \le P_l, \tag{4}$$

where $P_l$ is the transmit power of the $l$th layer.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
In (1), $D_{l,c}$ represents the digital precoding vector of size $N_{rf,l} \times 1$ for the $l$th layer, $A_l$ represents the analog precoding matrix of size $N_l \times N_{rf,l}$ such that $\|A_l D_{l,c}\|_2 = 1$, and $\eta_{l,c}$ is a complex Gaussian thermal noise vector whose entries are independent and identically distributed as $\mathcal{CN}(0, \sigma_b^2)$. For the FCHP structure, every RF chain contributes to every antenna, so that all the elements of $A_l$ are phase-shifter entries with different phases (phase shifting is performed after the digital precoding). For the SCHP structure, each RF chain is connected to only one subarray of $N_{sa}$ antennas. Generally, $N_{sa}$ is an integer calculated as $N_{sa} = N_l / N_{rf,l}$. Therefore, in $A_l$ for the sub-connected structure, only the elements that link an RF chain to its own subarray are nonzero. A junction network $J_l \in \mathbb{R}^{N_{rf,l} \times N_{rf,l} \times N_{rf,l}}$ is designed to achieve the dynamic switching of each RF chain to the phase shifters in the $l$th layer. It is composed of column vectors $j_{m,n} \in \mathbb{R}^{N_{rf,l} \times 1}$, where $m = 1, 2, \ldots, N_{rf,l}$. Here, $j_{m,n}$ is an all-one vector if it connects the $n$th RF chain to the $m$th subarray antennas and a zero vector if it disconnects them.
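To make the role of the junction network concrete, the following Python sketch models only its on/off pattern. It is a simplification under stated assumptions: each vector $j_{m,n}$ is collapsed to a single bit (1 for the all-one vector, 0 for the zero vector), and the helper names are hypothetical.

```python
def sub_connected_junction(n_rf):
    """Fixed SCHP pattern: RF chain n feeds only subarray n."""
    return [[1 if m == n else 0 for n in range(n_rf)] for m in range(n_rf)]


def fully_connected_junction(n_rf):
    """FCHP pattern: every RF chain feeds every subarray."""
    return [[1] * n_rf for _ in range(n_rf)]


def closed_switches(junction):
    """Number of closed switches, i.e. N_sw,l, for a given on/off pattern."""
    return sum(sum(row) for row in junction)


# A dynamic configuration lies between the two extremes, for example
# subarray 0 fed by both RF chains and subarray 1 fed by chain 1 only.
dynamic = [[1, 1], [0, 1]]
print(closed_switches(sub_connected_junction(2)),
      closed_switches(dynamic),
      closed_switches(fully_connected_junction(2)))
```

Counting closed switches in this way is what links the junction state to the switch and phase-shifter power terms of the consumption model.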

B. Channel Properties
After being precoded by the dedicated baseband hybrid precoder, the superposed signal is transmitted by the antennas through the high-frequency THz wireless channel to the users located in the $c$th cluster. The overall channel matrix has dimensions $L \times C \times N_l \times K_{l,c}$ and collects the $N_l \times 1$ channel vectors $h_{l,c}^{(i)}$ of the $i$th user in the $c$th cluster of the $l$th layer. Each channel vector is formulated in terms of the following quantities: $\Omega_{l,c}^{(i)}$ denotes the line-of-sight (LoS) path loss, which depends on the THz frequency $f$ and the distance $d$ between the BS and the user; $G_r$ and $G_t$ are the receive and transmit antenna gains, respectively; and $\alpha$ is the $N_l \times 1$ steering vector of the layer. $\varphi_{l,c}^{(i)}$ and $\theta_{l,c}^{(i)}$ are the azimuth and elevation angles of departure for the $i$th user in the $c$th cluster of the $l$th layer, respectively. The complex gain of the LoS path loss term, $\Omega_{l,c}^{(i)}$, is defined in terms of $f$, $d$, the frequency-dependent molecular absorption coefficient $\kappa(f)$, and the speed of light $V$. $\kappa(f)$ is computed as the sum of the absorption contributions from the isotopes of the gases in the medium.
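As an illustration of how the LoS term depends on $f$, $d$, and $\kappa(f)$, the sketch below evaluates the path gain model that is widely used in the THz literature, namely free-space spreading loss multiplied by molecular absorption loss, $\frac{V}{4\pi f d} e^{-\kappa(f) d / 2}$. Treating this as the form of $\Omega$ is an assumption, and the absorption coefficient passed in is a placeholder value rather than a measured one.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # V, in m/s


def los_path_gain(freq_hz, dist_m, kappa_per_m):
    """Magnitude of the LoS path gain: spreading loss times absorption loss.

    Assumed model (not confirmed by the text): V / (4*pi*f*d) * exp(-kappa*d/2).
    """
    spreading = SPEED_OF_LIGHT / (4.0 * math.pi * freq_hz * dist_m)
    absorption = math.exp(-0.5 * kappa_per_m * dist_m)
    return spreading * absorption


# Placeholder numbers: 300 GHz carrier, 10 m and 20 m links, kappa = 0.01 1/m.
print(los_path_gain(300e9, 10.0, 0.01), los_path_gain(300e9, 20.0, 0.01))
```

The gain falls both with distance and with a larger absorption coefficient, which is why a high beamforming gain is needed at these frequencies.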
Because the mMIMO-NOMA BS is subdivided into multiple layers, the interlayer interference must be considered in addition to the intercluster interference. Therefore, the received signal in (1) can be modeled by considering the desired and interfering signals as in (12), where $I_{IC}$ and $I_{ILC}$ represent the intra-cluster and inter-layer interference, respectively. According to (12), the signal-to-interference-plus-noise ratio (SINR) for the $i$th user in the $c$th cluster of the $l$th layer is the ratio of the desired signal power to the sum of the interference and noise powers. The achievable rate for the $i$th user in the $c$th cluster of the $l$th layer can be represented using the THz channel capacity model [1], in which $\xi$ is the bandwidth employed in the $l$th layer of the THz BS. The achievable sum rate at the $l$th layer, given in (18), is the sum of the achievable rates of the users served by that layer.

C. Problem Definition
The proposed system aims to reduce the power consumption by dynamically switching the connections of the RF chains. This section describes the power consumption profile and the formulation of the utility function for the system. The total power consumption, $P_{con}$, at the mMIMO-NOMA BS is the sum of the power consumed by the circuitry and by the transmitted signals. In the circuit term, $P_{rf}$, $P_{sw}$, $P_{ph}$, $P_{amp}$, and $P_{com}$ are the power consumptions of each RF chain, switch, phase shifter, power amplifier, and combiner, respectively; $P_{bb}$ is the baseband power consumption; and $N_{sw,l}$ is the number of closed switches in the $l$th layer.
The spectral efficiency is defined as the achievable sum rate, as indicated in (18). The EE is formulated as the ratio of the spectral efficiency to the total power consumption [1]. Therefore, the system utility function in (20) maximizes the EE subject to four constraints, C1 to C4. In C3, the achievable sum rate of the users in the $c$th cluster is compared with $\Gamma_{th,l}$, the minimum required sum rate for the $l$th layer. Constraint C1 ensures that the maximum total transmit power is $P_T$; C2 ensures that the sum of the power coefficients of all users in a cluster is 1; C3 is the sum rate constraint that ensures that the achievable sum rate is greater than $\Gamma_{th,l}$; and C4 is the inherent constraint on $P_{l,c}$ and $\rho_{l,c}$, which ensures that the power allocated to each cluster is positive and that the power coefficient is a positive fraction ranging between 0 and 1.
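The EE definition and the four constraints can be checked numerically with the following Python sketch. It is a minimal illustration: the function names, the tolerance, and the example numbers are assumptions, and the rates and total power are taken as given inputs rather than computed from the channel model.

```python
def energy_efficiency(sum_rate, p_con):
    """EE as the ratio of the achievable sum rate to the total power consumption."""
    return sum_rate / p_con


def is_feasible(cluster_powers, power_coeffs, cluster_rates, p_total, rate_min):
    """Check constraints C1 to C4 for one candidate allocation.

    cluster_powers: P_{l,c} for every cluster
    power_coeffs:   one list of rho values per cluster
    cluster_rates:  achievable sum rate per cluster
    """
    c1 = sum(cluster_powers) <= p_total                                # C1
    c2 = all(abs(sum(rho) - 1.0) < 1e-9 for rho in power_coeffs)       # C2
    c3 = all(rate >= rate_min for rate in cluster_rates)               # C3
    c4 = (all(p > 0.0 for p in cluster_powers) and
          all(0.0 < x <= 1.0 for rho in power_coeffs for x in rho))    # C4
    return c1 and c2 and c3 and c4


# Hypothetical allocation with two clusters of two users each.
ok = is_feasible([1.0, 2.0], [[0.7, 0.3], [0.6, 0.4]], [5.0, 6.0],
                 p_total=4.0, rate_min=4.0)
print(ok, energy_efficiency(sum_rate=11.0, p_con=4.0))
```

An agent would typically use such a check to discard or penalize infeasible actions before computing the EE-based reward.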

III. FCM CLUSTERING FOR THE MULTI-LAYER MMIMO SYSTEM
The FCM is an unsupervised clustering algorithm for feature analysis, aimed at classifying the users into several clusters. First, the algorithm initializes the number of clusters and the fuzzy exponent. Next, a membership function is assigned to each user to determine a fractional relation with each cluster. The cluster head is computed repeatedly by updating the membership coefficient of each data point to minimize the objective function down to a certain threshold [20]. Details of the FCM clustering algorithm are presented as Algorithm 2.
The initial operation of the FCM algorithm is to design the fuzzy partitioning $(S, Q)$, where $S$ is generally known as the fuzzy matrix $[\mu_{11}, \ldots, \mu_{K_{l,c},C}]$. Here, $\mu_{ic}$ is termed the membership function of the $c$th cluster, satisfying $0 \le \mu_{ic} \le 1$, $i = 1, \ldots, K_{l,c}$, and $c = 1, \ldots, C$, and $Q$ is the data point matrix of elements $q_i$, where $i = 1, \ldots, K_{l,c}$. The objective of the FCM algorithm is to identify the fuzzy matrix and the set of centroids, i.e., the weighted means of the points in each cluster. We can define the objective function as

$$\mathcal{J}_{\mathrm{FCM}}(S, Q) = \sum_{i=1}^{K_{l,c}} \sum_{c=1}^{C} \mu_{ic}^{\,r}\, \| q_i - v_c \|^2,$$

where $r > 1$ denotes the fuzzy exponent, and $v_c$ represents the centroid of the $c$th cluster. To converge the algorithm down to a prescribed minimum error, the membership function and the centroids are updated in every iteration for each of the clusters. These updates are given as

$$\mu_{ic} = \frac{1}{\sum_{k=1}^{C} \left( d_{ic} / d_{ik} \right)^{2/(r-1)}}$$

and

$$v_c = \frac{\sum_{i=1}^{K_{l,c}} \mu_{ic}^{\,r}\, q_i}{\sum_{i=1}^{K_{l,c}} \mu_{ic}^{\,r}},$$

where $d_{ic} = \| q_i - v_c \|$ and $d_{ik} = \| q_i - v_k \|$ are the distances from the $i$th user to the centroids of the $c$th cluster and $k$th cluster, respectively.
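The iteration described above can be sketched in plain Python as follows. This is a minimal illustration of the standard FCM updates, not the paper's Algorithm 2: the random initialization, the stopping rule on the largest membership change, the default parameters, and the toy user positions are all assumptions. The FPC is computed with its usual definition, the mean of the squared memberships.

```python
import math
import random


def fcm(points, n_clusters, r=2.0, tol=1e-5, max_iter=100, seed=0):
    """Standard fuzzy c-means; returns (centroids, memberships, FPC)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, each row normalized to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([x / total for x in row])
    centroids = []
    for _ in range(max_iter):
        # Centroid update: weighted mean with weights mu^r.
        centroids = []
        for c in range(n_clusters):
            w = [u[i][c] ** r for i in range(n)]
            w_sum = sum(w)
            centroids.append([sum(w[i] * points[i][k] for i in range(n)) / w_sum
                              for k in range(dim)])
        # Membership update from the distances to every centroid.
        new_u = []
        for p in points:
            d = [max(math.dist(p, v), 1e-12) for v in centroids]
            new_u.append([1.0 / sum((d[c] / d[k]) ** (2.0 / (r - 1.0))
                                    for k in range(n_clusters))
                          for c in range(n_clusters)])
        change = max(abs(new_u[i][c] - u[i][c])
                     for i in range(n) for c in range(n_clusters))
        u = new_u
        if change < tol:
            break
    fpc = sum(x * x for row in u for x in row) / n
    return centroids, u, fpc


# Toy user positions (hypothetical): two well-separated groups.
users = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, memberships, fpc = fcm(users, n_clusters=2)
print(centroids, fpc)
```

An FPC close to 1 indicates a crisp partition, whereas a value near $1/C$ indicates that the users are spread almost evenly over the clusters.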

IV. MULTI-AGENT DRL-BASED MMIMO-NOMA SYSTEM
A. Overview of DRL
Among value-based reinforcement learning techniques, Q-learning is a widely used and efficient algorithm to address Markov decision process (MDP) problems [21]. An MDP is the mathematical formulation of a decision-making problem to be solved by a decision maker or an agent, defined as the quintuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{T}, \gamma)$, where $\mathcal{S}$ is the set of different states, $s \in \mathcal{S}$; $\mathcal{A}$ is the action space, $a \in \mathcal{A}$; $\mathcal{R}$ is the set of real-valued rewards, $r \in \mathcal{R}$; $\mathcal{T}$ is the state transition probability, $\mathcal{T}(s_t, a_t, s_{t+1})$; and $\gamma$ represents a discount factor. An agent follows a policy $\pi(a_t | s_t)$ to execute an action $a_t$ while remaining in state $s_t$ at time $t$. The agent immediately receives a reward $r_t$ and transits to the next state $s_{t+1}$ according to the transition probability $\mathcal{T}$. The probability of transition to the next state depends only on the current state and the action played and not on any previous states or actions, i.e., $\Pr(s_{t+1} | s_t, a_t)$.
In Q-learning, the agent seeks to achieve the final goal by considering the future cumulative reward instead of only the immediate reward. A policy $\pi$ over state-action pairs is associated with an action value function named the Q-value function. Therefore, the agent can achieve the optimal Q-value function by following an optimal policy $\pi^*(a | s_t)$ involving the best action $a_t = a$. The optimal Q-value function accumulates the future rewards discounted by the factor $\gamma \in (0, 1]$. The Q-value function can be represented by the Bellman equation [22], and its convergence can be achieved by iteratively applying the Q-learning algorithm. The algorithm constructs a table of Q values based on the Q-value function. The values are randomly initialized and later updated considering the best action that can be taken in the future state, reflected by the temporal difference (TD). $TD_t(s_t, a_t)$ can be considered an intrinsic reward obtained as the difference between $r_t + \gamma \max_a Q(s_{t+1}, a)$ and $Q(s_t, a_t)$:

$$TD_t(s_t, a_t) = r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t),$$

where $r_t$ is the reward obtained by taking action $a_t$ in state $s_t$. The learning is reinforced with higher values of $TD_t(s_t, a_t)$. Therefore, the Q value is updated as

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\, TD_t(s_t, a_t),$$

where $\alpha \in (0, 1]$ is the learning rate.
If the action and state spaces are extremely large, as in the case of the proposed system, classical Q-learning fails because the Q-table cannot be feasibly stored. This problem can be addressed by applying a deep neural network, that is, a DQN, which estimates the Q-function. In the proposed DRL framework, two DQNs (actor and critic networks) are used to estimate the action-state value function. The actor DQN constructs a policy according to the observed states and produces an action. The critic DQN evaluates the current policy based on the rewards. The policy can be represented as $\pi(\theta | s, a)$, where $\theta$ denotes the weight parameter, which is a real-valued vector. The Q-value function is estimated by the critic DQN, and its policy parameters are updated by the actor DQN [21]. The weight parameter is updated by a gradient rule in which the increment $\Delta\theta$ follows the gradient of the loss function $L(\theta)$. The loss function is computed from the difference between the target and training Q-value functions, evaluated over $D$, the mini-batch of $M_b$ experiences.
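For a small state-action space, the tabular Q-learning step described above can be written in a few lines of Python. The dictionary-based Q-table, the function names, and the numbers in the example are illustrative assumptions; the temporal difference follows the definition given in the text, i.e., the difference between $r_t + \gamma \max_a Q(s_{t+1}, a)$ and $Q(s_t, a_t)$.

```python
def td_error(q_table, state, action, reward, next_state, gamma):
    """Temporal difference: r_t + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a_t)."""
    best_next = max(q_table[next_state].values())
    return reward + gamma * best_next - q_table[state][action]


def q_learning_step(q_table, state, action, reward, next_state, alpha, gamma):
    """Move Q(s_t, a_t) toward the TD target with learning rate alpha."""
    q_table[state][action] += alpha * td_error(
        q_table, state, action, reward, next_state, gamma)
    return q_table[state][action]


# Two states and two actions with hypothetical initial values.
q = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 2.0}}
new_value = q_learning_step(q, state=0, action=1, reward=1.0, next_state=1,
                            alpha=0.5, gamma=0.9)
print(new_value)
```

A DQN replaces the table lookup by a neural network evaluated at the state, but the target that drives the update has the same structure.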

B. Proposed Multi-Agent DRL Scheme
In the proposed multi-layer DSCHP system, each layer determines its own SCHP structure and downlink parameters. Therefore, this problem can be formulated as a multi-agent DRL problem in which each layer of the BS acts as an independent agent. We adopt a distributed scheme for the proposed multi-agent DRL. In this approach, the DQNs are executed distributively at the antenna groups in each layer. Each agent $l$ holds a copy of the target DQN parameters at time $t$. The proposed model has a training DQN that is trained centrally using the shared experiences from all the agents, which reduces the memory and computational overhead of the layers. The distributed multi-agent DRL for the proposed DSCHP system is illustrated in Fig. 3. Agent $l$ executes an action $a_t^{(l)}$ at time $t$ based on its current state $s_t^{(l)}$, which is obtained by following its updated current policy $\pi_l$. The DQN helps to determine the policy $\pi_l$ from the past experience that maximizes the expected future reward. An experience replay memory of fixed size is formed to store new sets of experiences $(s_t^{(l)}, a_t^{(l)}, r_{t+1}^{(l)}, s_{t+1}^{(l)})$ from each agent $l$. During the training stage, the agent selects a mini-batch $D^{(l)}$ of $M_b$ experiences from the replay memory. Once the agent takes $D^{(l)}$, the training DQN updates its parameters to minimize the loss in (27) using an optimizer such as the stochastic gradient descent optimizer. The trained DQN then broadcasts its latest updated parameters $\theta_{train}$ after every $N_{step}$ steps to update $\theta_{target}$. The final steps of the training procedure are as follows:
11: Store new experience $(s_t^{(l)}, a_t^{(l)}, r_{t+1}^{(l)}, s_{t+1}^{(l)})$ in its experience pool for $l = 1, \ldots, L$;
12: counter ← counter + 1;
13: until counter == $M$
14: A mini-batch $D$ of size $M_b$ is sampled from the experience pool, where $M_b \le M$;
15: Update the parameters $\theta_{train}^{(l)}$ using the gradient descent optimizer in (27) for $l = 1, \ldots, L$;
16: step ← step + 1;
17: until step == $N_{step}$
18: Update $\theta_{target}$ after each $N_{step}$;
19: until convergence
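The interplay between the shared replay memory, mini-batch sampling, and the periodic target update can be sketched in Python as follows. The class and function names are hypothetical, the network parameters are represented by a plain dictionary instead of a neural network, and the gradient step itself is omitted; only the data flow of the scheme is illustrated.

```python
import random
from collections import deque


class SharedReplayMemory:
    """Fixed-size experience pool shared by all agents (layers)."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped

    def store(self, experience):
        # experience = (state, action, reward, next_state) from any agent
        self.buffer.append(experience)

    def sample(self, batch_size):
        # Uniform mini-batch of M_b experiences without replacement.
        return random.sample(list(self.buffer), batch_size)


def sync_targets(train_params, n_agents):
    """Broadcast a copy of the training parameters to every agent's target DQN."""
    return [dict(train_params) for _ in range(n_agents)]


memory = SharedReplayMemory(capacity=3)
for step in range(5):
    memory.store((step, 0, 0.0, step + 1))   # placeholder experiences
batch = memory.sample(batch_size=2)
targets = sync_targets({"w": 1.0}, n_agents=2)
print(len(memory.buffer), len(batch), targets)
```

Keeping one central training network and broadcasting copies, rather than training one network per layer, is what reduces the per-layer memory and computation in this scheme.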
1) Beamformer Approximation: A simplified method was proposed in [17] to address the complex-valued beamformer problem, in which the beamformer was decomposed into the transmit power and a normalized beamformer. Inspired by this technique, the beamformer of each subarray is represented by the transmit power coefficient and the direction of the beam. Assume that $\mathcal{P}$ is the set of available power levels, which is obtained by distributing the total transmit power over the subarrays and the $K_{l,c}$ users by applying the conditions from (3) and (4). Therefore, $\mathcal{P} = \{\rho_{l,c}^{(1)} P_{l,c}, \ldots, \rho_{l,c}^{(K_{l,c})} P_{l,c}\}$. A beamforming codebook matrix $\mathcal{C}$ with dimensions of $N_{sa} \times Q$ is designed, in which each column represents a beam directional code, with $Q \ge N_{sa}$. The set of beam directional vectors is represented as $\mathcal{D}$, which consists of $Q$ directional codes $d_q \in \mathbb{R}^{N_{sa} \times 1}$ that cover the directions in $[0, 2\pi)$. Therefore, at time $t$, agent $l$ in the $l$th layer of the BS can determine the beamformer for the $c$th subarray by selecting the required power levels and codes from the defined sets.
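One possible construction of such a codebook is sketched below in Python. The uniform phase progression $2\pi q / Q$ (a DFT-style design) and the unit-norm scaling are assumptions made for illustration, and the sketch uses complex-valued entries, which is also an assumption; the source only states that the $Q$ codes cover the directions in $[0, 2\pi)$ with $Q \ge N_{sa}$.

```python
import cmath
import math


def beam_codebook(n_sa, n_codes):
    """Return n_codes unit-norm directional codes of length n_sa.

    Assumed design: code q applies the phase 2*pi*q/n_codes between
    adjacent antennas, so the codes sweep the interval [0, 2*pi).
    """
    if n_codes < n_sa:
        raise ValueError("the codebook needs Q >= N_sa codes")
    scale = 1.0 / math.sqrt(n_sa)
    codebook = []
    for q in range(n_codes):
        phase = 2.0 * math.pi * q / n_codes
        codebook.append([scale * cmath.exp(1j * n * phase) for n in range(n_sa)])
    return codebook


# Hypothetical sizes: 4 antennas per subarray and 8 directional codes.
codes = beam_codebook(n_sa=4, n_codes=8)
print(len(codes), len(codes[0]))
```

An action then amounts to picking one index into this list together with one power level, which keeps the DQN output discrete.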
2) State Space: An agent in BS layer $l$ has states $s_t^{(l)}$ that are the representative features of the connected channels with the associated clustered users in a given time slot $t$. The $l$th layer of the BS obtains the received signal strength and the total interference-plus-noise power from each $i$th user in the $c$th cluster at time $t$. Next, it evaluates the equivalent channel gain, the channel SINR, and the achievable rate for the $l$th layer. These parameters are the five constituents of the state space observations $s_t^{(l)}$. Subsequently, based on the action taken at time $t - 1$, there exist two member elements: the transmit power $\rho_{l,c}^{(i)}(t - 1) P_{l,c}$ and the normalized beamformer selected from set $\mathcal{D}$. Another $N_{sw,l}$ member elements of $s_t^{(l)}$ are identified by evaluating the status of the junction configuration matrix $J_l(t - 1)$ at $t - 1$. Therefore, $7U_l + N_{sw,l}$ input ports of the DQN are occupied by the local parameters, where $U_l = C_l K_{l,c}$ is the number of users in the $l$th layer.
In addition, we consider several other input features based on the status of the interferers and the interfered neighbors. These are the pieces of information shared between layers (i.e., agents), which are evaluated to perform an action and to assess its resultant effect on the common objective, which is to maximize the EE. Four input ports are occupied by the interference information from each interferer (interfering layer $l' \neq l$): i) the interferer $l'$, ii) the total interference power pertaining to layer $l'$, iii) the normalized beamformer used by $l'$, and iv) the utility function for $l'$, $\Gamma_{l',c}^{(i)}(\mathcal{D}(t - 1))$. Therefore, $4(L - 1)$ members exist in the state space $s_t^{(l)}$ corresponding to the interferer information. Similarly, agent $l$ receives feedback regarding the effect of its action on the interfered layers, specifically, i) the amount of interference power induced to the neighboring layers, ii) the neighbors' received channel gain, and iii) the influence on the utility function of the interfered neighbors, which adds $3(L - 1)$ members. The sum yields a total of $7U_l + N_{sw,l} + 7(L - 1)$ members in state space $s_t^{(l)}$.
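The size of the DQN input layer follows directly from this count. The Python sketch below reproduces the bookkeeping: seven local features per user, $N_{sw,l}$ switch states, four features per interfering layer, and three per interfered layer. The function name and the example values are hypothetical.

```python
def state_dimension(n_users, n_switches, n_layers):
    """Number of DQN input ports: 7*U_l + N_sw,l + 7*(L - 1)."""
    local = 7 * n_users + n_switches     # per-user features plus junction status
    interferers = 4 * (n_layers - 1)     # information received from interfering layers
    interfered = 3 * (n_layers - 1)      # feedback from interfered layers
    return local + interferers + interfered


# Hypothetical layer: 10 users, 4 switches, and a two-layer BS.
print(state_dimension(n_users=10, n_switches=4, n_layers=2))
```

The count grows linearly in the number of users and layers, which keeps the input layer manageable for the bi-layer configuration used in the simulations.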
3) Action Space: The action space of the agent determines the number of output ports of the DQN. The power coefficient $\rho_{l,c}$ for each user in the cluster is used to discretize $P_{l,c}$ into $K_{l,c}$ levels. Following the conditions applied in (3) and (4), the total number of power levels for the $l$th layer is $\sum_{c=1}^{C} P_{l,c}$. Because the BS takes a discrete power value and a specified beamformer code, the action space $a_t^{(l)}$ contains $D \sum_{c=1}^{C} P_{l,c}$ member elements. Additionally, actions are executed to control the junction switches at time $t$. Because each switch has two states (ON and OFF), $2N_{sw,l}$ additional member elements exist for $a_t^{(l)}$. The total number of output ports in action space $a_t^{(l)}$ is therefore $D \sum_{c=1}^{C} P_{l,c} + 2N_{sw,l}$.

4) Reward Function: As described in Section IV-A, the agent observes the future cumulative rewards when transitioning to the next state by executing the best action. Therefore, we define the reward function $r_{t+1}^{(l)}$, which indicates the effect of the executed action $a_t^{(l)}$ on the optimization of the network objective specified in (20). Assume that, to maximize the network objective, agent $l$ selects the best beamformer, which results in a higher transmit power in an advantageous direction. In addition, a full array connection to the switches in $J_l(t-1)$ is required to achieve the full array gain at the transmitter. However, a full array connection causes additional power consumption through $P_{sw}$ and $P_{ph}$, which in turn affects the network objective. Moreover, while agent $l$ performs such advantageous actions to increase the system utility, it also induces higher interference power in its neighboring layers. Therefore, the reward function at time $t+1$ is defined as $r_{t+1}^{(l)} = \mathrm{System}_{\mathrm{utility}}^{(l)} - \mathrm{System}_{\mathrm{penalty}}^{(l)}$ (33), where the terms are specified in (34). The $\mathrm{System}_{\mathrm{utility}}^{(l)}$ term in (33) refers to the maximization of the spectral efficiency with the highest possible minimization of the connections between the RF chain and consecutive subarrays, as represented in (20). The second term, $\mathrm{System}_{\mathrm{penalty}}^{(l)}$, is the penalty for agent $l$, which refers to the cumulative loss of the achievable rate of the interfered neighbors $l'$.
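The structure of the action space and the reward can be illustrated with a minimal sketch. The ordering of the output ports (beam and power elements first, then switch elements) and all function names are assumptions made for illustration; the paper specifies only the total count and the utility-minus-penalty form of the reward.

```python
def action_space_size(num_beams, power_levels_per_cluster, num_switches):
    """Number of DQN output ports: D * sum_c P_{l,c} + 2 * N_{sw,l}."""
    return num_beams * sum(power_levels_per_cluster) + 2 * num_switches


def decode_action(index, num_beams, power_levels_per_cluster, num_switches):
    """Map an output-port index to an action (hypothetical port ordering)."""
    n_power = sum(power_levels_per_cluster)
    n_beam_power = num_beams * n_power
    if not 0 <= index < n_beam_power + 2 * num_switches:
        raise IndexError("action index out of range")
    if index < n_beam_power:
        beam, level = divmod(index, n_power)
        # Walk through the clusters to find which one owns this power level.
        for cluster, count in enumerate(power_levels_per_cluster):
            if level < count:
                return ("beam_power", beam, cluster, level)
            level -= count
    switch, state = divmod(index - n_beam_power, 2)
    return ("switch", switch, "ON" if state else "OFF")


def reward(system_utility, system_penalty):
    """Reward of agent l: its own utility minus the penalty from neighbors."""
    return system_utility - system_penalty
```

With $D = 2$, two clusters with two power levels each, and four switches, `action_space_size(2, [2, 2], 4)` evaluates to 16 output ports.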

V. SIMULATION RESULTS
The performance of the proposed multi-agent DRL-based DSCHP scheme is evaluated in a MIMO-NOMA system. Specifically, we analyze the FCM clustering performance for varying numbers of users and cluster heads. Additionally, the performance of the multi-agent DRL technique is evaluated through a comparative analysis. First, the DQN performance is examined by evaluating the training loss of the agents in both BS layers. Second, the average achievable rate is assessed and compared with those of the random and greedy approaches. Third, the EE of the proposed DSCHP is compared with that of a scheme involving a full array gain.
In the simulation, we consider an mMIMO BS whose antennas are equally divided into two layers, $L_1$ and $L_2$. In each layer, a dynamic HP scheme is established according to the configuration shown in Fig. 2, and the antenna group is divided into two subarrays. Hence, two RF chains are connected to the subarrays, and the junction matrix comprises 4 switches. The radius of the cell is set to 200 m, and the users are randomly located within this radius. First, the users $u$ are classified by region and assigned to the two layers, where $U_{l_1}$ and $U_{l_2}$ are the numbers of users for $L_1$ and $L_2$, respectively, and $U_{dist}$ is the Euclidean distance of a user from the BS. The antenna groups in $L_2$ support the distant users, which have a lower channel gain than those of $L_1$. Therefore, the SCHP structure differs between the two layers. For instance, $L_1$ tends to minimize the antenna array gain for its nearer users and requires comparatively fewer RF chains and phase shifters to be connected to the antenna subarrays. Fig. 4 shows an example of the bi-layer BS network configuration in which 100 users are randomly distributed within a radius of 200 m. Initially, we consider 100 users to simulate FCM clustering on nearly overlapping users. The users in $L_1$ and $L_2$ are distinguished based on (32) and are shown in Fig. 4(b) and (c), respectively. FCM clustering is applied to the users in both BS layers to allocate them to multiple clusters, and each cluster is supported by one antenna subarray. The FPC measures how meaningfully the data points can be clustered and takes a value between 0 and 1. Fig. 4(b) and (c) report the FPC values for the clustered users in both layers, which are maximized for the case of two cluster heads.
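The user drop and the layer assignment described above can be sketched as follows. The 100 m threshold and the function names are hypothetical and serve only to illustrate a distance-based split; the actual rule is the one given by the paper's classification equation.

```python
import numpy as np


def drop_users(num_users, radius_m, rng):
    """Place users uniformly at random inside a disc centered on the BS."""
    # The square root makes the density uniform over the disc area.
    r = radius_m * np.sqrt(rng.random(num_users))
    theta = 2.0 * np.pi * rng.random(num_users)
    return np.column_stack((r * np.cos(theta), r * np.sin(theta)))


def split_by_distance(xy, threshold_m):
    """Assign near users to layer L1 and distant users to layer L2."""
    dist = np.linalg.norm(xy, axis=1)  # Euclidean distance from the BS
    near = dist <= threshold_m
    return xy[near], xy[~near]


rng = np.random.default_rng(0)
users = drop_users(100, 200.0, rng)
# Hypothetical threshold: half of the cell radius.
users_l1, users_l2 = split_by_distance(users, threshold_m=100.0)
```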
We observed the FPC performance by varying the number of cluster heads from 2 to 10 while considering 100 users. The results were obtained for both BS layers $L_1$ and $L_2$ and compared with the conventional non-layering case, as shown in Fig. 5. The comparison shows that a better representation of users is obtained when the users are classified through antenna layering. A higher FPC is achieved for the case with fewer cluster heads because of the nature of the user distribution. A similar analysis of the FCM clustering performance is shown in Fig. 6, which records the maximum FPC for different clustering cases in which the number of users is varied from 10 to 100. The cases involving BS layering yield higher FPC values, and the clustering is more meaningful in the presence of fewer users.
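For reference, assuming that FPC denotes the standard fuzzy partition coefficient, it can be written in terms of the membership values $\mu_{i,c}$ of $N$ data points in $C$ clusters as follows. The symbol $N$ is introduced here for illustration.

```latex
\mathrm{FPC} = \frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \mu_{i,c}^{2},
\qquad \frac{1}{C} \le \mathrm{FPC} \le 1 .
```

Under this definition, a value of 1 corresponds to a crisp partition, and a value of $1/C$ corresponds to memberships spread evenly over all clusters.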
For the preliminary evaluation of the multi-agent DRL algorithm for the proposed system, we considered two users per cluster and set the hybrid beamformer parameters as $P = 4$ and $D = 2$. For the THz channel, the carrier frequency is chosen as 0.34 THz with directional propagation to provide a large channel capacity and avoid the path loss peak [23]. The AWGN power spectral density $\sigma^2$ is set to −174 dBm/Hz. We compute the path loss term between the BS and a user as given in (11), where the parameters used for the frequency-dependent molecular absorption coefficient $\kappa(f)$ can be found in [24]. The maximum transmit power for the BS layers is set to 37 dBm. Typical values are used for the other parameters: the power consumption of an RF chain $P_{rf}$, a closed switch $P_{sw}$, a 4-bit phase shifter $P_{ph}$, the combiner $P_{com}$, the power amplifier $P_{amp}$, and the baseband $P_{bb}$ is 160 mW [1], 24 mW [25], 42 mW [26], 6.6 mW [27], 60 mW [28], and 200 mW [1], respectively.
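The channel-related quantities above can be illustrated with a short sketch. It assumes the widely used THz line-of-sight model in which the path loss is the product of a free-space spreading loss and a molecular absorption loss $e^{\kappa(f) d}$; this may differ in detail from (11). The absorption coefficient and bandwidth used in the example calls are placeholders, not values from the paper.

```python
import math

C_LIGHT = 299_792_458.0  # speed of light in m/s


def thz_path_loss_db(freq_hz, dist_m, kappa_per_m):
    """Spreading loss plus molecular absorption loss, in dB (assumed model)."""
    spreading_db = 20.0 * math.log10(4.0 * math.pi * freq_hz * dist_m / C_LIGHT)
    # exp(kappa * d) expressed in dB.
    absorption_db = 10.0 * kappa_per_m * dist_m * math.log10(math.e)
    return spreading_db + absorption_db


def noise_power_dbm(psd_dbm_per_hz, bandwidth_hz):
    """Total AWGN power over the given bandwidth."""
    return psd_dbm_per_hz + 10.0 * math.log10(bandwidth_hz)


def dbm_to_watt(power_dbm):
    """Convert dBm to watts."""
    return 10.0 ** ((power_dbm - 30.0) / 10.0)


# Placeholder inputs: kappa = 0.01 1/m and a 1 GHz bandwidth are hypothetical.
loss_db = thz_path_loss_db(0.34e12, 50.0, kappa_per_m=0.01)
noise_dbm = noise_power_dbm(-174.0, 1e9)
max_tx_w = dbm_to_watt(37.0)  # maximum transmit power of a BS layer
```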
The hyperparameters used in the multi-agent DQN algorithm are listed in Table I. In our algorithm, each agent trains a DQN that has an input layer, two fully connected hidden layers, and an output layer. The hidden layers have 64 and 32 neurons, respectively. The total number of input ports is determined from the states described in Section IV-B. The value of $U_l$ is set to 4. Therefore, the input layer has a total of $7U_l + N_{sw,l} + 7(L-1) = 39$ input ports. In the output layer, the number of output ports equals the length of the action space described in Section IV-B. We set the size of the beam direction vector set to $D = 2$ and the number of available power levels for the $c$th cluster to $P = 2$. As we consider two subarrays per layer, there are four power levels for the $l$th layer. Therefore, the number of member elements in the action space (i.e., output ports) is $D \sum_{c=1}^{C} P_{l,c} + 2N_{sw,l} = 16$. The memory size of the experience pool is set to 500, and the mini-batch size is fixed at 32. The target DQN updates its parameters every 100 time slots, i.e., $N_{step} = 100$. The rectified linear unit (ReLU) function is used as the activation of each hidden layer. Additionally, the RMSprop optimizer is used to update the parameters, with the initial learning rate set to 0.0005 and $\lambda = 0.5$.
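The network dimensions above can be checked with a small NumPy sketch of the forward pass. The weight initialization and function names are assumptions for illustration, and training (experience replay, target network updates, RMSprop) is omitted.

```python
import numpy as np


def input_ports(users_per_layer, num_switches, num_layers):
    """Number of DQN input ports: 7*U_l + N_{sw,l} + 7*(L - 1)."""
    return 7 * users_per_layer + num_switches + 7 * (num_layers - 1)


def init_dqn(layer_sizes, rng):
    """Create (weight, bias) pairs; He-style initialization is an assumption."""
    return [
        (rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
        for m, n in zip(layer_sizes[:-1], layer_sizes[1:])
    ]


def q_values(params, state):
    """Forward pass with ReLU on the hidden layers and a linear output."""
    h = state
    for weight, bias in params[:-1]:
        h = np.maximum(h @ weight + bias, 0.0)
    weight, bias = params[-1]
    return h @ weight + bias


rng = np.random.default_rng(0)
n_in = input_ports(users_per_layer=4, num_switches=4, num_layers=2)  # 39
n_out = 16  # action-space length stated in the text
params = init_dqn([n_in, 64, 32, n_out], rng)
q = q_values(params, rng.random(n_in))
best_action = int(np.argmax(q))  # greedy action for this state
```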
We evaluate the training loss of the proposed multi-agent DRL framework under a fixed learning rate of 0.0005. Because the agents in BS layers 1 and 2 train their networks simultaneously, both loss curves are recorded to observe the number of epochs required for convergence. As shown in Fig. 7, the loss function value decreases as training progresses and stabilizes after 150 epochs. The Q values predicted in the later slots are more accurate and stable than those in the initial slots. The average rewards obtained over each iteration are presented in Fig. 8. The reward is evaluated from (33), which subtracts the received penalty from the system utility for a particular action. Accordingly, as training proceeds, the average reward an agent receives at every epoch gradually increases in pace with the average spectral efficiency, which indicates that the agent gradually learns the environment.
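The reward and spectral efficiency curves are reported as a moving average over the last 1000 steps. One way to compute such a trailing average is sketched below; the function name and the handling of the first samples (averaging over the steps available so far) are assumptions.

```python
import numpy as np


def trailing_mean(values, window=1000):
    """Mean of the last `window` samples at every step (fewer at the start)."""
    values = np.asarray(values, dtype=float)
    csum = np.cumsum(np.insert(values, 0, 0.0))
    end = np.arange(1, len(values) + 1)
    start = np.maximum(end - window, 0)
    return (csum[end] - csum[start]) / (end - start)


smoothed = trailing_mean([1.0, 2.0, 3.0, 4.0], window=2)
```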
The average spectral efficiency of the proposed multi-agent DRL-based DSCHP scheme is evaluated and compared with those of the random policy and the greedy approach. In the random policy, each agent chooses actions randomly. In the greedy approach, each layer (i.e., agent) uses the best beamformer to achieve a higher channel gain without considering the interference power to the neighboring layers, which eventually degrades the average spectral efficiency. The proposed system performs better because the agent learns the suitable beamformer by receiving a reward based on the action played, and the decision-making policy is enhanced by regularly updating the weights of the trained DQN. As shown in Fig. 9, the proposed scheme starts to outperform the greedy policy after approximately 4,000 time slots and eventually reaches a relatively stable state after approximately 45,000 epochs. The results show that the multi-agent DRL-based DSCHP scheme can learn the trade-off between maximizing the EE and minimizing the interference power to the neighboring layers.

Further, we evaluate the EE of the proposed scheme and compare it with that of the traditional method. In the traditional approach, to maximize the performance, the full array gain is exploited by connecting all the switches linked to the RF chain. In this framework, the additional power consumed by the switches and phase shifters decreases the overall EE. Fig. 10 shows the EE of the proposed DSCHP system and the scheme with a full array gain. As the systems converge, the proposed scheme, in its preliminary configuration, outperforms the traditional scheme by achieving a 3.6% higher EE. The EE can be further enhanced by increasing the number of RF chains in the case of ultra-massive MIMO antenna arrays. The EE variance of the proposed approach and the FCHP scheme under different $N_{rf}$ values is shown in Fig. 11. The number of connecting switches increases with increasing $N_{rf}$. Consequently, in the case of the DSCHP scheme, the number of switches maintaining the OFF state increases with increasing $N_{rf}$. In addition, the phase shifters connected to these switches become idle. Therefore, the power consumption defined in (19) decreases considerably, thereby enhancing the EE variance as $N_{rf}$ increases. The results for three values of $P_T$ indicate that a higher $P_T$ corresponds to an enhanced EE variance.

The objective function formulated in (20) maximizes the EE while maintaining the key constraints. The EE is defined as the ratio of the spectral efficiency to the total power consumption. For the system, the optimal solution would achieve the maximum spectral efficiency with the minimum power consumption, which is not feasible in practice. Therefore, we find a near-optimal solution that improves the EE while maintaining a satisfactory achievable sum rate with the least possible power consumption.
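The effect of opening switches on the EE can be illustrated with a simplified sketch. The power model below is a hypothetical stand-in for (19): it charges each closed switch for itself and for a fixed number of phase shifters behind it, and it ignores the combiner and amplifier terms. The default component powers are the values quoted in the simulation setup; the spectral efficiency, transmit power, and phase shifter count in the example are placeholders.

```python
def total_power_w(p_tx_w, n_rf, closed_switches, shifters_per_switch,
                  p_rf=0.160, p_sw=0.024, p_ph=0.042, p_bb=0.200):
    """Simplified total power in watts (hypothetical stand-in for (19))."""
    # Open switches and the phase shifters behind them are assumed idle.
    switching = closed_switches * (p_sw + shifters_per_switch * p_ph)
    return p_tx_w + n_rf * p_rf + switching + p_bb


def energy_efficiency(spectral_efficiency, power_w):
    """EE as the ratio of spectral efficiency to total power consumption."""
    return spectral_efficiency / power_w


# Placeholder operating point: same spectral efficiency for both cases.
p_full = total_power_w(5.0, n_rf=2, closed_switches=4, shifters_per_switch=8)
p_partial = total_power_w(5.0, n_rf=2, closed_switches=2, shifters_per_switch=8)
ee_full = energy_efficiency(10.0, p_full)
ee_partial = energy_efficiency(10.0, p_partial)
```

In this toy setting, the partially connected configuration has the higher EE only because the spectral efficiency is held fixed; in the actual system, the agent must weigh the array gain lost by opening switches against the power saved.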

VI. CONCLUSION
This paper proposes a multi-layer DSCHP architecture for THz mMIMO-NOMA systems to enhance their spectral efficiency and EE. The DSCHP scheme allows the number of RF chains connected to the subarray antenna elements to be increased or decreased by sensing the channel conditions and requirements. A multi-agent DRL-based approach is applied to solve the problem of maximizing the utility function. In the proposed framework, one agent is responsible for each BS layer. The agents train a DQN centrally that periodically shares the updated parameters, and the agents execute actions until a stable solution is obtained. The FCM clustering algorithm is used to group users under each subarray and efficiently distribute the power coefficients. The simulation results show that the proposed approach achieves a higher average spectral efficiency than the random and greedy policies and, in its preliminary configuration, a 3.6% higher EE than the scheme with a full array gain. Moreover, the EE can be enhanced when mMIMO antenna arrays with a larger number of RF chains are used.

Algorithm 1: FCM clustering algorithm in the $l$th layer.
Input: Data point matrix $Q$
Output: $S$ and $v_c$
1: Initialize $r \ge 1$ and $C \ge 2$;
2: Partitioning of $S$ with $Q$;
3: Compute $v_c$ and $\mu_{i,c}$;
4: if $J_r(S, Q) \le \epsilon$ then
5:     take final output;
6: else
7:     update $v_c$ and $\mu_{i,c}$;
8: end if
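A compact NumPy sketch of fuzzy c-means in the spirit of Algorithm 1 is given below. It uses the standard FCM update equations and, as a simplification, stops when the change in the objective falls below a tolerance rather than testing the objective itself. The function names, the random initialization, and the default values are assumptions.

```python
import numpy as np


def fcm(points, num_clusters, r=2.0, tol=1e-6, max_iter=300, seed=0):
    """Fuzzy c-means: returns cluster centers v_c and memberships mu_{i,c}."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    mu = rng.random((len(points), num_clusters))
    mu /= mu.sum(axis=1, keepdims=True)  # each row sums to one
    prev_obj = np.inf
    for _ in range(max_iter):
        w = mu ** r
        centers = (w.T @ points) / w.sum(axis=0)[:, None]
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # guard against division by zero
        obj = float((w * d2).sum())  # objective J_r for current v_c, mu
        inv = d2 ** (-1.0 / (r - 1.0))
        mu = inv / inv.sum(axis=1, keepdims=True)  # membership update
        if abs(prev_obj - obj) <= tol:
            break
        prev_obj = obj
    return centers, mu


def fuzzy_partition_coefficient(mu):
    """Mean of squared memberships, assuming the standard FPC definition."""
    return float((mu ** 2).sum() / len(mu))


# Two well-separated groups of users (toy coordinates in meters).
toy = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [11, 11]]
centers, mu = fcm(toy, num_clusters=2)
fpc = fuzzy_partition_coefficient(mu)
```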

Fig. 4. Example of the BS network configurations, (a) BS with 100 random users, (b) classified users in BS layer 1, and (c) classified users in BS layer 2.

Fig. 7. Training loss function for the two agents in both the BS layers.

Fig. 8. Average rewards with a moving average value over the last 1000 steps.

Fig. 9. Average spectral efficiency with a moving average value over the last 1000 steps.

Fig. 10. Energy efficiency compared with the full array gain scheme when $N_{rf} = 2$. Representations are a moving average value over the last 1000 steps.

Fig. 11. Deviation of the energy efficiency value of the proposed scheme with respect to the full array gain scheme.