Deep Reinforcement Learning Based Beam Selection for Hybrid Beamforming and User Grouping in Massive MIMO-NOMA System

This paper presents a deep reinforcement learning-based beam-user selection and hybrid beamforming design for multiuser massive multiple-input multiple-output (MIMO) non-orthogonal multiple access (NOMA) downlink systems. Conventional hybrid beamforming in massive MIMO provides multiple directional beams, but each beam serves only one user. The integration of NOMA with massive MIMO enables power-domain multiplexing within a beam, hence increasing the system capacity. In this paper, we first design a channel gain and correlation-based user grouping algorithm per beam, and then, using deep reinforcement learning-based beam selection, a beamspace orthogonal analog precoder is obtained. The deep Q-network consists of a main network and a target network with the Adam optimizer. Finally, optimal power is allocated to the users in each beam. Simulation results show that at a transmit SNR of 10 dB, the proposed scheme provides a 42% increase in sum-rate and energy efficiency performance as compared to the state-of-the-art $K$-means user grouping and stable matching-based beam selection NOMA scheme.


I. INTRODUCTION
The current and future demand for wireless and mobile data can only be met by ultra-high-speed beyond-fifth-generation (B5G) wireless networks. The 5G/B5G technology uses the high-frequency millimeter wave (mmWave) band, ranging from 30 GHz to 300 GHz, with massive multiple-input multiple-output (MIMO) systems [1]. Hybrid beamforming overcomes the high power consumption of the radio frequency (RF) chains, and non-orthogonal multiple access (NOMA) enables multiplexing of users within each beam. However, mmWave frequencies suffer from high path loss and low penetration power. These shortfalls are compensated by the massive MIMO technique. Usually, massive MIMO is deployed at the base-station to obtain the benefits of massive MIMO precoding in the downlink and combining in the uplink. Conventional MIMO techniques use one RF chain (analog-to-digital converter (ADC), digital-to-analog converter (DAC), mixer, data converters) per antenna, which is infeasible in the massive MIMO scenario because of the large capital cost (CAPEX) and operational cost (OPEX). In order to reduce the number of RF chains and, hence, the power consumption, researchers take advantage of the mmWave channel's sparsity and split the beamforming into two stages: analog beamforming (AB) and digital beamforming (DB). This method is called hybrid beamforming (HBF). In the downlink, the data streams are first processed by the digital beamformer, and then the pre-processed signal is fed to the phase shifter network in the analog (or RF) domain. Finally, the symbol vector is transmitted from a large antenna array at mmWave frequencies.

The associate editor coordinating the review of this manuscript and approving it for publication was Olutayo O. Oyerinde.

VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In this way, the mmWave massive MIMO system becomes a key enabler of next-generation B5G communications. The mmWave massive MIMO system can produce a large number of narrow directional beams, for example, discrete Fourier transform (DFT) based orthogonal beams, and can serve many users in the beam domain without interference. Most of the current research on beam selection focuses on the one-beam one-user model, or at least one beam per user [2], [3], [4]. However, there is always a probability that a particular beam is the strongest beam for more than one user, even if there are a large number of beams. For $B$ beams and $K$ users, the probability that at least two users share the same strongest beam is $P = 1 - \prod_{k=1}^{K-1}\left(1 - k/B\right)$ [5]. For $B = 128$ and $K = 16$, $P = 62.4\%$. In orthogonal multiple access (OMA) MIMO networks, the multiplexing gain and the system capacity reduce as the channel correlation increases. However, when the users' channels are highly correlated, the use of NOMA-MIMO can provide the optimal dirty paper coding (DPC) performance [6]. Therefore, NOMA has an inherent optimal match with the highly directive and correlated mmWave MIMO channel characteristics.
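The quoted sharing probability can be reproduced with a short birthday-problem style computation. A minimal sketch, assuming each user's strongest beam is uniformly distributed over the $B$ beams (the function name is ours):

```python
def beam_sharing_probability(B: int, K: int) -> float:
    """Probability that at least two of K users have the same strongest
    beam, when each user's strongest beam is uniform over B beams."""
    p_all_distinct = 1.0
    for k in range(1, K):
        p_all_distinct *= 1.0 - k / B
    return 1.0 - p_all_distinct

# B = 128 beams, K = 16 users -> about 0.624, matching the 62.4% in the text
p = beam_sharing_probability(128, 16)
```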
There is a lot of research work on the design of hybrid beamforming and NOMA in the mmWave massive MIMO system. In [7], a beamspace channel-based massive MIMO-NOMA scheme is presented. It designs a zero-forcing (ZF) precoder by taking the stronger user's channel per beam. Then, a dynamic power allocation scheme is devised to maximize the sum-rate subject to a total power constraint and per-user minimum rate constraints. However, this paper does not discuss the beam-user selection. The paper [8] introduces multi-beam beamspace NOMA-MIMO, where NOMA is applied across multiple beams using a power splitter connected to one RF chain. In the beam-user pairing, each user selects the beam which has the maximum gain for that user. The inter-group interference is controlled by singular value decomposition (SVD) based equivalent-channel ZF precoding [7]. The NOMA power allocation is based on the minimization of the total power consumption. This scheme has inferior energy efficiency (EE) and spectral efficiency (SE) at higher transmit power as compared with the single-beam NOMA. In [9], beam selection is performed using intelligent techniques, such as particle swarm optimization (PSO) and a correlation-based scheme on the beamspace channel. Then, target rate-based and PSO-based power allocation are carried out for the NOMA users. The proposed beam selection methods only utilize the beam strength and beam correlation, without considering the users' channel gains, under the assumption that the number of users ($K$) is greater than the number of RF chains ($N_{RF}$). A two-user per beam NOMA for joint power allocation and hybrid beamforming design is proposed in [10]. It decomposes the problem into two sub-problems: a power allocation problem and a hybrid beamforming design problem with a constant modulus constraint due to the phase shifter implementation of the analog beamformer.
The authors compare the proposed solution with time division multiple access (TDMA) and show its efficacy; however, they do not show the performance gap between the proposed suboptimal solution and the optimal solution. Another two-user per beam NOMA is proposed in [11]. Each beam serves two users who have the maximum channel gain difference and channel covariance. It uses a finite-resolution analog beamformer which minimizes the difference with the angle of the channel vector. Finally, the digital beamformer is formed by using ZF precoding based on the stronger user's equivalent channel. But this paper does not consider the massive MIMO system and utilizes only 8 antennas. In the extended work [12], the authors compare the EE and SE performance of fully-connected and sub-connected architectures. Though optimal power allocation has been derived for multiple users [13], [14], it is for a single channel. Multi-channel NOMA optimal power allocation for two users per channel is given in [15].
In most of the previous work, NOMA techniques within the same beam are realized by different combinations of K-means clustering, intra-beam inter-user correlation, and channel gains [9], [16], [17]. In contrast, we propose to use an upper bound on the correlation for the selection of user groups, e.g., among the larger channel gain users, a user with a correlation less than the threshold is selected as the primary user or group-head. In most of the previous work, analog beamformers have been realized by discrete Fourier transform (DFT), conjugate transpose channel vector, eigenvalue decomposition (EVD), and singular value decomposition (SVD) based methods, followed by ZF or minimum mean squared error (MMSE) based baseband precoders.
In this paper, we propose a novel user channel gain and channel correlation-based user grouping and deep reinforcement learning-based beam selection and analog precoder design to efficiently reduce the inter-user interference in the analog domain in the mmWave massive MIMO channel. It provides a precoder with orthogonal column vectors and reduces the effective channel dimension, which helps design a low-dimensional regularized zero-forcing (RZF) digital precoder. Finally, we consider the general case of multiple beams with multiple users per beam for power allocation and obtain an optimal solution.
The rest of the paper is organized as follows. The system, signal, and channel models are described in Section II. The problem is formulated in Section III. Section IV presents the proposed user grouping and various beam-user selection methods for hybrid precoder design in a multiuser massive MIMO system. The deep reinforcement learning-based beam selection for the grouped users is given in Section V. The power allocation solution is presented in Section VI. Simulation results are given in Section VII, followed by the conclusions in Section VIII.
Notations: Bold upper-case and lower-case letters denote matrices and vectors, respectively. The notations $\mathbf{X}^{\dagger}$, $\mathbf{X}^{T}$, $\mathbf{X}^{*}$, and $\mathbf{X}^{H}$ denote the pseudo-inverse, transpose, conjugate, and conjugate transpose of a matrix $\mathbf{X}$. The element in the $i$th row and $j$th column of $\mathbf{X}$ is denoted by $[\mathbf{X}]_{i,j}$. For a set $\mathcal{A}$, $\mathrm{Card}(\mathcal{A})$ is the cardinality of the set $\mathcal{A}$. For quick reference, the list of the main symbols used in this paper is given in Table 1.

II. SYSTEM MODEL
We consider the downlink of a single-cell multiuser massive MIMO-NOMA system. The base-station (BS) is located at the cell center and is equipped with $N$ antennas and $N_{RF}$ RF chains to serve $K$ single-antenna users, as shown in Fig. 1, such that $N_{RF} \leq K < N$. At the BS, each antenna is connected to all the RF chains through independent phase shifters to form a fully-connected structure [1]. The user set is defined as $\mathcal{K} = \{1, 2, \ldots, K\}$. Assuming the number of information symbol streams $N_s$ at the input of the baseband precoder equals the number of users, $N_s = K$, the symbol vector is $\mathbf{s} = [s_1, \ldots, s_K]^T$. Upon receiving the data block, the BS first uses a low-dimensional digital precoder $\mathbf{F}_{DB} \in \mathbb{C}^{N_{RF} \times K}$, and then an analog precoder $\mathbf{F}_{AB} \in \mathbb{C}^{N \times N_{RF}}$ is used to create the $N \times 1$ transmit symbols, as shown in Fig. 1. The hybrid precoder can thus be written as $\mathbf{F} = \mathbf{F}_{AB}\mathbf{F}_{DB}$. After passing through the digital and analog precoders, the transmitted symbol vector $\mathbf{x} \in \mathbb{C}^{N \times 1}$ can be written as
$$\mathbf{x} = \mathbf{F}_{AB}\mathbf{F}_{DB}\mathbf{P}\mathbf{s},$$
where $\mathbf{s} \sim \mathcal{CN}(\mathbf{0}, \mathbf{I}_K)$ is the normalized transmit symbol vector with $\mathbb{E}\{\mathbf{s}\mathbf{s}^H\} = \mathbf{I}_K$, i.e., $\mathbb{E}\{|s_k|^2\} = 1$, and $\mathbf{P} = \mathrm{diag}(\sqrt{p_1}, \ldots, \sqrt{p_K})$ is the diagonal matrix representing the power allocation to the users. The received signal $y_k$ at user $k$ is given by
$$y_k = \mathbf{h}_k^H \mathbf{x} + z_k,$$
where $z_k \sim \mathcal{CN}(0, 1)$ is the AWGN noise and $\mathbf{h}_k \in \mathbb{C}^{N \times 1}$ is the channel vector between the BS and user $k$. The received signals of all users can be stacked to form the composite received signal $\mathbf{y} = [y_1, \ldots, y_K]^T$. We assume $N_{RF} < K$ for the MIMO-NOMA implementation.
In order to describe the mmWave channel, we use the extended Saleh-Valenzuela model [18]. Due to the directional nature of mmWave propagation, we assume that the channel contributes a limited number of propagation paths $L_p$ between the transmitter and the receiver [19]. It is a geometrical channel model which describes the physical propagation between the transmit and receive antenna arrays. At mmWave frequencies, the electromagnetic wave propagation is close to optical line-of-sight (LOS) propagation. The $N \times 1$ channel vector of user $k$ can be written as
$$\mathbf{h}_k = \sqrt{\frac{N}{L_p + 1}} \left( \alpha_k^0\, \mathbf{a}(\phi_k^0) + \sum_{l=1}^{L_p} \alpha_k^l\, \mathbf{a}(\phi_k^l) \right),$$
where $j = \sqrt{-1}$, and $\alpha_k^0$ and $\alpha_k^l$, $l = 1, \ldots, L_p$, represent the complex gains of the LOS ($l = 0$) path and the non-line-of-sight (NLOS) ($l = 1, \ldots, L_p$) paths, respectively, which are i.i.d. $\mathcal{CN}(0, 1)$. Moreover, $\mathbf{a}$ is the array steering vector of the uniform linear array (ULA). The variable $\phi_k^l$ is the $l$th path's azimuth angle (boresight angle in the array) of arrival [2] for user $k$, and it is uniformly distributed as $\phi_k^l \sim \mathcal{U}(-\pi/2, \pi/2)$. The transmit steering vector is given by
$$\mathbf{a}(\phi_k^l) = \frac{1}{\sqrt{N}} \left[ e^{-j 2\pi \phi_k^l i} \right]_{i \in \mathcal{I}(N)},$$
where $\mathcal{I}(N) = \{i - (N-1)/2,\; i = 0, 1, \ldots, N-1\}$ is a symmetric index set centered at zero [20], $\phi_k^l = \frac{d}{\lambda}\sin\theta_k^l$ is the array response angle, $\theta$ is the physical angle with $-\frac{\pi}{2} \leq \theta \leq \frac{\pi}{2}$, $\lambda$ is the wavelength, and $d$ is the antenna spacing at the base-station. Finally, the overall channel matrix $\mathbf{H}$ of the multiuser MIMO system is given as $\mathbf{H} = [\mathbf{h}_1, \ldots, \mathbf{h}_K]$.
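The channel model above can be sketched numerically. A minimal NumPy illustration, assuming half-wavelength antenna spacing ($d/\lambda = 0.5$) and a simple $\sqrt{N/(L_p+1)}$ power normalization (both are our assumptions, not stated parameter values from the paper):

```python
import numpy as np

def ula_steering(N: int, theta: float, d_over_lambda: float = 0.5) -> np.ndarray:
    """ULA steering vector with the symmetric index set I(N) centered at zero."""
    phi = d_over_lambda * np.sin(theta)          # array response angle
    idx = np.arange(N) - (N - 1) / 2             # I(N) = {i - (N-1)/2}
    return np.exp(-1j * 2 * np.pi * phi * idx) / np.sqrt(N)

def mmwave_channel(N: int, L_p: int = 2, rng=None) -> np.ndarray:
    """Saleh-Valenzuela style channel: one LOS path plus L_p NLOS paths,
    complex gains ~ CN(0, 1), physical angles ~ U(-pi/2, pi/2)."""
    rng = np.random.default_rng(rng)
    h = np.zeros(N, dtype=complex)
    for _ in range(L_p + 1):                     # l = 0 (LOS) and l = 1..L_p
        alpha = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)
        theta = rng.uniform(-np.pi / 2, np.pi / 2)
        h += alpha * ula_steering(N, theta)
    return h * np.sqrt(N / (L_p + 1))            # assumed power normalization

h = mmwave_channel(N=64, L_p=2, rng=0)           # one user's channel vector
```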

III. PROBLEM FORMULATION
In this section, we formulate the joint beam selection and hybrid beamforming design problem for the MIMO-NOMA downlink system to maximize the sum-rate performance. Let $\mathcal{K}_n$ be the set of users in beam $n$, where $n \in \mathcal{B}$ and $\mathcal{B}$ is the set of selected beams. No two beams serve the same user, i.e., $\sum_{n \in \mathcal{B}} \mathrm{Card}(\mathcal{K}_n) = K$. After the NOMA user grouping, we obtain the channel matrix $\tilde{\mathbf{H}} \in \mathbb{C}^{N \times \mathrm{Card}(\mathcal{B})}$, with $\mathrm{Card}(\mathcal{B})$ user groups. The mmWave channel has a sparse structure, with a limited number of paths carrying significant power. This mmWave massive MIMO spatial channel can be transformed into the beam-domain channel $\mathbf{U}^H\tilde{\mathbf{H}} \in \mathbb{C}^{N \times \mathrm{Card}(\mathcal{B})}$, where $\mathbf{U}$ is a unitary discrete Fourier transform (DFT) matrix of size $N$. Based on the beam-domain channel $\mathbf{U}^H\tilde{\mathbf{H}}$, we select $\mathrm{Card}(\mathcal{B})$ beams using the beam selection algorithms such that $\mathbf{U}^H\tilde{\mathbf{H}}(b,:)|_{b \in \mathcal{B}} \in \mathbb{C}^{\mathrm{Card}(\mathcal{B}) \times \mathrm{Card}(\mathcal{B})}$. This also gives us the DFT-based analog precoder implementation as $\mathbf{F}_{AB} = \mathbf{U}(b,:)|_{b \in \mathcal{B}}$ of size $\mathrm{Card}(\mathcal{B}) \times N$. Finally, the equivalent channel matrix is obtained by stacking back the users' channel columns within each group. The equivalent channel matrix for the $n$th beam is given by
$$\tilde{\mathbf{H}}_{eq,n} = \left[\tilde{\mathbf{h}}_{eq,1,n}, \ldots, \tilde{\mathbf{h}}_{eq,\mathrm{Card}(\mathcal{K}_n),n}\right],$$
where $\tilde{\mathbf{h}}_{eq,k,n} = \mathbf{F}_{AB}\tilde{\mathbf{h}}_{k,n}$ is the equivalent reduced-dimension channel vector for user $k \in \mathcal{K}_n$ in beam $n$. When using the user with the highest channel gain per beam and stacking all selected beams together, we get the overall equivalent reduced-dimension channel matrix of size $\mathrm{Card}(\mathcal{B}) \times \mathrm{Card}(\mathcal{B})$,
$$\tilde{\mathbf{H}}_{eq} = \left[\tilde{\mathbf{h}}_{eq,1}, \ldots, \tilde{\mathbf{h}}_{eq,\mathrm{Card}(\mathcal{B})}\right].$$
The ZF digital precoder is obtained by the pseudo-inverse of $\tilde{\mathbf{H}}_{eq}$ as $\mathbf{F}_{DB,eq} = \tilde{\mathbf{H}}_{eq}(\tilde{\mathbf{H}}_{eq}^H\tilde{\mathbf{H}}_{eq})^{-1}$. This can be represented as a stack of $\mathrm{Card}(\mathcal{B})$ digital precoders for the users, $\mathbf{F}_{DB,eq} = [\mathbf{f}_{DB,eq,1}, \ldots, \mathbf{f}_{DB,eq,\mathrm{Card}(\mathcal{B})}]$. To meet the total power constraint, the digital precoder is normalized by the Frobenius norm of the overall hybrid precoder. This digital precoding, however, cannot completely eliminate the inter-beam interference because it uses the equivalent channel of only the stronger user in each beam.
But in the sparse mmWave channel, if a LOS path exists (which is often the case in the highly directive mmWave channel), the sparse channel vectors of different users in the same beam are highly correlated [7]. Without loss of generality, we assume that $|\tilde{\mathbf{h}}_{eq,1,n}^H\mathbf{f}_{DB,eq,n}| \geq |\tilde{\mathbf{h}}_{eq,2,n}^H\mathbf{f}_{DB,eq,n}| \geq \cdots \geq |\tilde{\mathbf{h}}_{eq,\mathrm{Card}(\mathcal{K}_n),n}^H\mathbf{f}_{DB,eq,n}|$ for $n = 1, \ldots, \mathrm{Card}(\mathcal{B})$. With successive interference cancellation (SIC), a user with a higher channel gain decodes and removes the signals of the weaker users. With this hybrid precoding, the received signal at user $k$ in beam $n$ can be decomposed into the desired signal, the intra-beam interference, the inter-beam interference, and the AWGN noise as follows:
$$y_{k,n} = \underbrace{\tilde{\mathbf{h}}_{eq,k,n}^H \mathbf{f}_{DB,eq,n} \sqrt{p_{k,n}}\, s_{k,n}}_{\text{desired signal}} + \underbrace{\tilde{\mathbf{h}}_{eq,k,n}^H \mathbf{f}_{DB,eq,n} \sum_{i=1}^{k-1} \sqrt{p_{i,n}}\, s_{i,n}}_{\text{intra-beam interference}} + \underbrace{\sum_{m \in \mathcal{B}, m \neq n} \tilde{\mathbf{h}}_{eq,k,n}^H \mathbf{f}_{DB,eq,m} \sum_{i \in \mathcal{K}_m} \sqrt{p_{i,m}}\, s_{i,m}}_{\text{inter-beam interference}} + \underbrace{z_{k,n}}_{\text{AWGN}}.$$
After the SIC, the signal-to-interference-plus-noise ratio (SINR) of user $k$ in beam $n$ can be expressed as
$$\mathrm{SINR}_{k,n} = \frac{p_{k,n}\,|\tilde{\mathbf{h}}_{eq,k,n}^H\mathbf{f}_{DB,eq,n}|^2}{|\tilde{\mathbf{h}}_{eq,k,n}^H\mathbf{f}_{DB,eq,n}|^2 \sum_{i=1}^{k-1} p_{i,n} + \sum_{m \neq n} |\tilde{\mathbf{h}}_{eq,k,n}^H\mathbf{f}_{DB,eq,m}|^2 \sum_{i \in \mathcal{K}_m} p_{i,m} + 1}. \quad (15)$$
We assume that perfect channel state information (CSI) is available at the BS; then, the spectral efficiency (SE) for user $k$ in beam $n$ is given as
$$R_{k,n} = \log_2\left(1 + \frac{\mathrm{SINR}_{k,n}}{\Gamma}\right),$$
where $\Gamma$ is the SNR gap that relates the Shannon capacity and the received SNR under the employed modulation and coding scheme in a practical wireless channel. In the case of M-ary quadrature amplitude modulation (M-QAM) and target bit error rate $P_e$, $\Gamma = -(2/3)\ln(5P_e)$ [21]. Our objective is to design a hybrid precoder that maximizes the spectral efficiency of the multiuser massive MIMO-NOMA system. The sum-rate is given by
$$R_{sum} = \sum_{n \in \mathcal{B}} \sum_{k \in \mathcal{K}_n} R_{k,n}.$$
The optimization problem is given as
$$\mathcal{P}: \max_{\mathbf{F}_{AB},\, \mathbf{F}_{DB},\, \{p_{k,n}\},\, \{\chi_{k,n}\}} R_{sum},$$
where $\chi_{k,n}$ is the beam-user selection binary variable: $\chi_{k,n} = 1$ if user $k$ is selected in beam $n$, otherwise $\chi_{k,n} = 0$. The first constraint (18a) is a unit-modulus constraint for the phase shifter implementation of the analog beamformer. The constraint (18b) ensures that the beam-user selection variables do not exceed the number of users in a particular beam. Constraint (18c) is used to allocate at most one beam to a user, and (18d) restricts the total number of users across all beams to $K$.
Constraints (18e) and (18f) regulate the power allocation. The per-user quality of service (QoS) is ensured by constraint (18g). The beam-user selection variables can take only binary values (18h), and constraint (18i) gives the upper bound on the beam-user selection. The problem $\mathcal{P}$ is a mixed-integer programming (MIP) problem, which is NP-hard [22]. Even the optimal solution to the beam-user selection problem with $K \geq \mathrm{Card}(\mathcal{B})$ can only be obtained by the exhaustive search method, whose computational complexity is on the order of $\binom{N}{\mathrm{Card}(\mathcal{B})}$ beam-combination searches. In addition, the power allocation is coupled: the power allocated to a user depends on the power allocated to the other users in the beam. The intra-beam interference is minimized by the SIC technique in each beam, and the inter-beam interference is minimized by ZF precoding based on the stronger users' equivalent channels.
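The beam-domain transform, equivalent channel, and ZF digital precoder of this section can be sketched numerically. In the sketch below, the beams are picked greedily by beam-domain energy as a simple stand-in for the RL-based selection of Section V; the function names and the Frobenius-norm power normalization are our own illustrative choices:

```python
import numpy as np

def dft_codebook(N: int) -> np.ndarray:
    """Unitary N-point DFT matrix U; its columns are the orthogonal beams."""
    n = np.arange(N)
    return np.exp(-2j * np.pi * np.outer(n, n) / N) / np.sqrt(N)

def beam_select_and_zf(H_gh: np.ndarray):
    """H_gh: N x G matrix whose columns are the group-head channels.
    Greedy magnitude-based beam selection (stand-in for the RL agent),
    then the ZF digital precoder on the equivalent channel."""
    N, G = H_gh.shape
    U = dft_codebook(N)
    power = np.abs(U.conj().T @ H_gh) ** 2             # beam-domain energy
    beams = []
    for g in range(G):                                 # one distinct beam per group
        for b in np.argsort(-power[:, g]):
            if b not in beams:
                beams.append(int(b))
                break
    F_AB = U[:, beams]                                 # N x G analog precoder
    H_eq = F_AB.conj().T @ H_gh                        # G x G equivalent channel
    F_DB = H_eq @ np.linalg.inv(H_eq.conj().T @ H_eq)  # ZF: H_eq^H F_DB = I
    F_DB = F_DB / np.linalg.norm(F_AB @ F_DB, 'fro')   # total power normalization
    return beams, F_AB, F_DB
```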

IV. USER GROUPING AND BEAM-GROUP SELECTION
In a massive MIMO system with $N$ antennas at the BS and $K$ single-antenna users, an exhaustive beam search gives the optimal beam-user pairing. In order to select $K$ optimal beams out of $N$ beams, the required number of searches is $\binom{N}{K}$. For example, if $N = 128$ and $K = 16$, then it requires $9.33 \times 10^{19}$ searches. When $K = N_{RF}$, the beam-user selection problem becomes an unbalanced assignment problem, in which one set of the bipartite graph is larger than the other, as shown in Fig. 2. One way to make the problem tractable is to transform the unbalanced assignment problem into a balanced assignment problem by adding $N - N_{RF}$ new vertices to the smaller set, with edges of cost zero. In this section, we first form the user groups and then apply various beam-group selection algorithms.
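The exhaustive search count quoted above is just the binomial coefficient and can be verified directly:

```python
from math import comb

# Number of ways to pick K beams out of N for an exhaustive search:
N, K = 128, 16
searches = comb(N, K)      # about 9.33e19, as quoted in the text
```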

A. USERS GROUPING
We form $\mathrm{Card}(\mathcal{B})$ groups of the $K$ users such that $\sum_{n \in \mathcal{B}} \mathrm{Card}(\mathcal{K}_n) = K$. The user grouping and beam-group based hybrid beamforming procedure (Algorithm 1) is summarized below: Step 1: In the first step, we find the channel-to-noise ratio (CNR) in line 2, with a noise power of unity. The correlation matrix is computed in line 4. The next line masks the self-correlation entries $C_{i,j} = 1$, $\forall i = j$, with zeros.
Step 2: Since there is one primary user in each beam, there should be minimum correlation between the primary users in order to minimize the inter-beam interference. Let $\rho_{low}$ be the upper threshold required to limit the correlation between primary users. The while loop forms a set of $N_{RF}$ primary users: starting from the user with the highest CNR, if the maximum correlation of a user with the already-selected primary users is less than $\rho_{low}$, that user is selected as a primary user (PU); otherwise, the next user in CNR order is tested against $\rho_{low}$, and so on, until $\mathrm{Card}(\mathcal{B})$ users are selected. Ideally, $\rho_{low} = 0$ for no correlation.
Step 3: The for loop in line 13 finds the secondary users for the primary user in each beam. Each primary user is paired with the user in the set $\mathcal{K} \setminus \mathcal{K}_{PU}$ that has the maximum correlation with it.
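The three steps above can be sketched as follows. This is our reading of Algorithm 1, not its listing: the function name, the default threshold value, and the tie-breaking by CNR order are illustrative assumptions:

```python
import numpy as np

def group_users(H: np.ndarray, num_groups: int, rho_low: float = 0.5):
    """Sketch of the gain/correlation-based grouping (Steps 1-3).
    H: N x K channel matrix, columns are user channels.
    Returns the primary users (group heads) and a group index per user."""
    K = H.shape[1]
    cnr = np.linalg.norm(H, axis=0) ** 2          # Step 1: CNR, unit noise power
    Hn = H / np.linalg.norm(H, axis=0)
    C = np.abs(Hn.conj().T @ Hn)                  # correlation matrix
    np.fill_diagonal(C, 0.0)                      # mask self-correlation

    # Step 2: pick primary users in decreasing CNR order, keeping the
    # correlation with already-selected primaries below rho_low.
    primaries = []
    for k in np.argsort(-cnr):
        if not primaries or C[k, primaries].max() < rho_low:
            primaries.append(int(k))
        if len(primaries) == num_groups:
            break

    # Step 3: attach every remaining user to its most correlated primary.
    group = {p: i for i, p in enumerate(primaries)}
    for k in range(K):
        if k not in group:
            group[k] = int(np.argmax(C[k, primaries]))
    return primaries, [group[k] for k in range(K)]
```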
An RL-based analog precoder $\mathbf{F}_{AB}$ is designed with the help of the primary (strong) users' channel matrix $\tilde{\mathbf{H}}$. Since the number of users is greater than the number of beams, the $N_{RF}$ RF chains can only produce $N_{RF}$ beams. Therefore, we find the equivalent channel of size $N \times \mathrm{Card}(\mathcal{B})$. We use the strongest-user-based equivalent channel [7] because, in each beam, the strongest user has to perform SIC to decode the signals of all the other users. The $N \times \mathrm{Card}(\mathcal{B})$ channel is given by
$$\tilde{\mathbf{H}} = \left[\mathbf{h}_{st,1}, \ldots, \mathbf{h}_{st,\mathrm{Card}(\mathcal{B})}\right],$$
where $\mathbf{h}_{st,n}$ is the channel vector of the strongest user in the $n$th group.
Though the use of the strongest-user-based equivalent channel reduces the intra-group interference, there is still significant inter-group interference. We propose RL-based beam selection from the DFT-based orthogonal analog precoder to minimize the inter-group interference. The detailed design of the RL-based beam selection is presented in the next section. In addition to RL, we use Hungarian, Gale-Shapley, greedy search, and mutual information-based beam selections.

B. HUNGARIAN-BASED BEAM-GROUP SELECTION
The unbalanced assignment problem of $\mathrm{Card}(\mathcal{B})$ user groups and $N$ beams is first transformed into a balanced assignment problem by adding $N - \mathrm{Card}(\mathcal{B})$ vertices with edges of cost zero. The balanced assignment problem can then be solved by the well-known Hungarian method with polynomial complexity $O(N^3)$ [23].
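The assignment step can be reproduced with an off-the-shelf Hungarian solver. A sketch using SciPy's `linear_sum_assignment` on a hypothetical beam-gain table (the gain values are random placeholders; SciPy handles the rectangular, unbalanced case directly, so explicit zero-cost padding is not needed):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical beam-domain gain table: rows are the N beams, columns the
# user groups; entries could be, e.g., |[U^H H]_{n,g}|^2.
rng = np.random.default_rng(0)
gain = rng.random((8, 3))                 # N = 8 beams, Card(B) = 3 groups

# Hungarian method: assign one distinct beam to each group, maximizing
# the total gain (equivalently, minimizing the negated cost).
beam_idx, group_idx = linear_sum_assignment(gain, maximize=True)
assignment = {int(g): int(b) for b, g in zip(beam_idx, group_idx)}
```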

C. GALE-SHAPLEY STABLE MATCHING FOR BEAM-GROUP SELECTION
The Gale-Shapley stable matching (or deferred acceptance) algorithm [2] finds a stable match between two sets with an equal number of elements, where each element of one set has a preference order over the elements of the other set. It has a complexity of $O(N^2)$.

D. GREEDY SEARCH-BASED BEAM-GROUP SELECTION
In this method, the user selection within each beam combination is done iteratively on the basis of the maximum sum-rate [2]. In this way, each beam combination has an associated user combination set. After that, the beam combination with the maximum sum-rate is chosen, along with its user set. The computational complexity of this method is $O\big(\binom{N}{K} N^2\big)$.

E. MUTUAL INFORMATION BASED BEAM-GROUP SELECTION
This method selects $K$ out of $N$ beams by the maximum-relevance minimum-redundancy principle [9]. Specifically, it calculates the correlation among beams. Between the two beams with the highest correlation (maximum redundancy), the beam with the lower energy (lower relevance) is eliminated, iteratively, until $K$ beams remain. The computational complexity of the mutual information-based method is $O(N(N-1)K)$.
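The elimination loop just described can be sketched as follows; the function name and the beam-domain input layout are our own illustrative choices:

```python
import numpy as np

def mrmr_beam_select(beam_channel: np.ndarray, K: int):
    """Max-relevance min-redundancy elimination: among the two most
    correlated remaining beams, drop the lower-energy one until K remain.
    beam_channel: N x K_users beam-domain channel (rows are beams)."""
    N = beam_channel.shape[0]
    energy = np.linalg.norm(beam_channel, axis=1) ** 2      # beam relevance
    rows = beam_channel / (np.linalg.norm(beam_channel, axis=1,
                                          keepdims=True) + 1e-12)
    corr = np.abs(rows @ rows.conj().T)                     # beam redundancy
    np.fill_diagonal(corr, -np.inf)                         # ignore self-pairs
    alive = list(range(N))
    while len(alive) > K:
        sub = corr[np.ix_(alive, alive)]
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        # drop the lower-energy member of the most correlated pair
        drop = alive[i] if energy[alive[i]] < energy[alive[j]] else alive[j]
        alive.remove(drop)
    return alive
```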

F. DIGITAL BEAMFORMING
The reduced-dimension digital precoder requires $\mathrm{Card}(\mathcal{B})$ RF chains, where $\mathrm{Card}(\mathcal{B}) \leq N_{RF} < N$. The digital precoder is a ZF precoder that eliminates the inter-group interference by taking the strongest user's channel as the equivalent channel of the group. Since the grouped users have a high correlation, the inter-group interference for the weak channel gain users can also be easily minimized [12]. Finally, the optimal power allocation is done for multiple users using the technique in [13].

V. DEEP REINFORCEMENT LEARNING BASED BEAM SELECTION AND HYBRID BEAMFORMING DESIGN
In this section, we briefly describe reinforcement learning, then the deep reinforcement learning, deep Q-network, and hyperparameters for our beam selection problem are discussed.

A. REINFORCEMENT LEARNING
In its simplest form, reinforcement learning (or Q-learning) is the learning of an entity called an agent by interaction with the environment, as shown in Fig. 3. For a given state $s_t$, the agent performs an action $a_t$ on the environment. In return, the environment produces a reward $r_{t+1}$ and a new state $s_{t+1}$. Thus we get the sequence of states, actions, and rewards $s_0, a_0, r_1, s_1, a_1, r_2, \ldots$. Reinforcement or Q-learning is a value iteration method based on empirical data to discover the best policy. It works as a series of activities performed in order to increase the expected cumulative reward over a long-term period. This long-term expected reward is known as the Q-function, which is the sum of the discounted rewards received when an action is performed at the initial state. The objective of reinforcement learning is to find the sequence of actions (a policy) that maximizes the total reward. In general, the policy is a mapping between states and actions, often expressed as $a = \pi(s)$, where $s \in \mathcal{S}$ is a state from the state space $\mathcal{S}$ and $a \in \mathcal{A}$ is an action from the action space $\mathcal{A}$.
Conventionally, this mapping is maintained in a Q-table and updated with the Bellman equation
$$Q_{new}(s_t, a_t) = (1 - \alpha)\, Q_{old}(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) \right],$$
where $\alpha$ is the learning rate ($0 < \alpha \leq 1$) and $\gamma \in [0, 1]$ is the discount factor. When $s_{t+1}$ reaches the last or terminal state, an episode of the training algorithm ends. The learning rate $\alpha = 0$ corresponds to learning nothing new, i.e., only exploiting the old knowledge, whereas $\alpha = 1$ means pure exploration. Generally, at the start of training, we set $\alpha = 1$ and then gradually decrease its value toward 0. The discount factor $\gamma$ ensures less weight for future rewards. The solution of control problems using the Bellman equation is called dynamic programming; in the case of the discrete domain, the Bellman equation solution framework is that of Markov decision processes. The computational complexity grows exponentially with the dimensions of the state or action, which is known as the ''curse of dimensionality''. In order to solve the ''curse of dimensionality'' problem, deep reinforcement learning plays its role. Deep neural networks (DNNs) are used as a function approximator to replace the Q-table-based state-action mapping. The DNN takes the state as input and gives the action values as output. But the training of this DNN requires samples and target data, as in any other DNN. As we know, in a neural network (NN), the loss or error function $L = \sum_i (y_i - \hat{y}_i)^2$ is minimized over the weights of the network (i.e., differentiated w.r.t. the weights and set to zero). Here, $y_i$ is the target value of the $i$th sample and $\hat{y}_i$ is the predicted value. In the Bellman equation, the term $r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$ is the target value and $Q_{old}(s_t, a_t)$ is the predicted value. In a neural network, the loss function is used to adjust the weights of the NN. If we use the same NN for getting the predicted value at one time step and the target value at the next time step, we cannot stably minimize the difference between the two values. Hence, we use two separate NNs, a Q-network and a target network, for the predicted Q-value and the target Q-value, respectively.
The target network is a clone of the Q-network. The weights of the target network are updated from the weights of Q-network after certain time steps. This target network update frequency is also a hyperparameter which needs to be set according to the requirements.
A Q-value function is an estimation of how good it is to perform a given action in a given state [24, A3.5].
The Q-function is given by
$$Q^{\pi}(s, a) = \sum_{s'} P_{ss'}^{a} \left[ r + \gamma\, Q^{\pi}(s', a') \right],$$
where $a$ represents the action, $s$ is the state, $\gamma \in [0, 1]$ is the discount factor which determines the weight of the future rewards, and $P_{ss'}^{a}$ is the probability of the state transition from state $s$ to state $s'$ when action $a$ is taken. After the agent executes the action at state $s$, the new state-action pair becomes $(s', a')$. Moreover, the aim of the agent is to maximize the long-term cumulative reward $Q^{\pi}(s, a)$ by finding the optimal policy $\pi^*(s)$. To do that, we can write the previous equation as
$$Q^{*}(s, a) = \sum_{s'} P_{ss'}^{a} \left[ r + \gamma \max_{a'} Q^{*}(s', a') \right].$$
In addition, the optimal policy is given as
$$\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a).$$
However, these equations cannot be calculated directly because the agent is not aware of the state transition probabilities. Hence, the Q-learning algorithm is used to deal with this problem. It constructs a Q-table with the Q-values $Q(s, a)$ as elements, and, in order to select an action for each state, the agent adopts an $\epsilon$-greedy policy and updates each element in the Q-table using
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right],$$
where $\alpha$ represents the learning rate. However, the performance of the Q-learning algorithm relies on the size of the state-action space. It is easier for the agent to find the optimal action policy when the state-action space is small; when the state-action space becomes larger, the performance of the Q-learning algorithm becomes limited because the agent may not explore the whole state-action space.
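The tabular update rule above can be exercised on a toy problem. A minimal sketch with an $\epsilon$-greedy policy on a two-state chain (the chain environment and all parameter values are our own illustration, not the paper's beam-selection MDP):

```python
import random

def q_learning(transitions, n_states, n_actions, episodes=500,
               alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy policy.
    transitions[(s, a)] = (next_state, reward, done)."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if rng.random() < eps:                    # explore
                a = rng.randrange(n_actions)
            else:                                     # exploit
                a = max(range(n_actions), key=lambda x: Q[s][x])
            s2, r, done = transitions[(s, a)]
            target = r + (0.0 if done else gamma * max(Q[s2]))
            Q[s][a] += alpha * (target - Q[s][a])     # Bellman update
            s = s2
    return Q

# Toy 2-state chain: action 1 moves toward the terminal reward of +1.
T = {(0, 0): (0, 0.0, False), (0, 1): (1, 0.0, False),
     (1, 0): (0, 0.0, False), (1, 1): (1, 1.0, True)}
Q = q_learning(T, n_states=2, n_actions=2)
```

With $\gamma = 0.9$, the learned values approach the fixed point $Q(1,1) = 1$ and $Q(0,1) = 0.9$, so the greedy policy chooses action 1 in both states.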

B. DEEP REINFORCEMENT LEARNING
As mentioned above, the Q-table can be replaced by a deep Q-network in order to deal with the drawback of the Q-learning algorithm. Deep Q-learning uses a compact DQN instead of the large Q-table and stores only the weights of the DQN in local memory. In other words, the deep Q-network works in an input-output fashion: the input is the state, and the output is the Q-value of each action. In consequence, improving the Q-function $Q(s, a)$ of the Q-learning algorithm is equivalent to improving the set of weights $\theta$ of the deep neural network (DNN) $Q(s, a; \theta)$.
To stabilize the learning, deep reinforcement learning adopts two specialized DQNs, a Q-network $Q(s, a; \theta_Q)$ and a target network $Q(s, a; \theta_Q^-)$, along with an experience replay memory, as shown in Fig. 6. Instead of training the DQN with one experience at a time, the agent samples a random mini-batch from the replay memory for batch training of the Q-network. The weights $\theta_Q^-$ of the target network are updated periodically with the weights $\theta_Q$ of the Q-network according to a predefined hyperparameter, the 'update frequency'. Mathematically, the loss function is given by
$$L(\theta_Q) = \mathbb{E}\left[\left(y - Q(s, a; \theta_Q)\right)^2\right], \quad (25)$$
where $y$ represents the output (target value) of the target DQN, which is given by
$$y = r + \gamma \max_{a'} Q(s', a'; \theta_Q^-).$$
The gradient descent method is used to minimize the loss and get the corresponding weights of the Q-network,
$$\theta_Q \leftarrow \theta_Q - \eta\, \nabla_{\theta_Q} L(\theta_Q),$$
where $\nabla$ is the gradient operator and $\eta$ is the gradient step size.
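The two-network training loop can be sketched with a deliberately tiny linear Q-function approximator instead of a full DNN; the class name, the linear features, and all parameter values are our own illustrative assumptions:

```python
import numpy as np

class TinyDQN:
    """Linear Q-function approximator illustrating the two-network setup:
    squared TD loss on the online weights, periodic hard copy to target."""
    def __init__(self, n_features, n_actions, lr=0.01, gamma=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, (n_features, n_actions))  # online net
        self.W_tgt = self.W.copy()                              # target net
        self.lr, self.gamma = lr, gamma

    def q(self, s, target=False):
        """Q-values of all actions for state features s."""
        return s @ (self.W_tgt if target else self.W)

    def train_step(self, s, a, r, s2, done):
        # target y = r + gamma * max_a' Q_target(s', a')
        y = r + (0.0 if done else self.gamma * self.q(s2, target=True).max())
        td = self.q(s)[a] - y                 # prediction error
        self.W[:, a] -= self.lr * td * s      # gradient step on L = td^2
        return td ** 2                        # one loss sample

    def sync_target(self):
        self.W_tgt = self.W.copy()            # periodic hard update
```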

1) ACTOR-CRITIC
In deep reinforcement learning, a DNN that implements the policy $\pi$ is called an actor. An actor $\pi(s; \theta_\pi)$ selects an action deterministically or stochastically based on the input state $s$, without consulting a value function. During the training, the actor tunes its weights to maximize the Q-value. The critic $Q(s, a; \theta_Q)$ is a DNN which estimates the long-term expected reward (Q-value) for a given state $s$ and a given discrete action $a$. For a low-dimensional action space, the critic can be used stand-alone, but for a larger action space, the actor-critic architecture is more efficient because, in this case, the critic gets only one input action at a time. The actor and critic approximators can be DNNs, basis functions, or lookup tables. In all cases, the critic is used to learn the policy weights, as shown in Fig. 5. The critic network's Q-value is compared with the reward to calculate the loss function, which is then used to update the weights of the critic as well as the actor. We use a model-free and off-policy deep Q-network (DQN) based agent. It consists of a critic only and provides a value-based output. During the training, the DQN explores the action space with a given exploration probability epsilon ($\epsilon$): at each time step, it either selects a random action with probability $\epsilon$ or follows the value function to determine the action with probability $1 - \epsilon$. The weights of the DQN are updated after every mini-batch of samples, taken randomly from the experience buffer. The DQN consists of a Q-network and a target network, as shown in Fig. 5 and Fig. 6.
A complete picture of actor-critic with Q-network and target network is shown in Fig. 6.

2) STATE AND OBSERVATION
The state space is derived from the channel state information $\mathbf{H} \in \mathbb{C}^{N \times K}$. The state space $\mathcal{S} \in \mathbb{R}^{2 \times N \times K}$ consists of i) the $N \times K$ absolute squared values of the channel matrix and ii) the $N \times K$ binary selection matrix $\chi$, where the binary element $\chi_{n,k} = 1$ if user $k$ is selected in beam $n$. At any time $t$, the state is represented by an $N \times 2$ tensor consisting of the $N \times 1$ absolute squared user channel vector and the $N \times 1$ binary vector indicating the beam selected in the previous time step.

3) ACTION
The action space $\mathcal{A}$ is discrete, and it consists of the $N \times 1$ beam indices, $\mathcal{A} = \{1, 2, \ldots, N\}$. At any time $t$, only one beam is selected by the agent. There are a total of $\mathrm{Card}(\mathcal{B})$ time steps in an episode, so a total of $\mathrm{Card}(\mathcal{B})$ beams are selected out of the $N$ beams. Once the policy $\pi^*(s)$ is trained, the action can be found as
$$a_t = \pi^*(s_t) = \arg\max_{a \in \mathcal{A}} Q(s_t, a; \theta_Q). \quad (28)$$

4) REWARD
The reward function consists of two components: i) a channel element-based information rate $I(t)$ as the reward, and ii) a penalty $\vartheta(t)$ to avoid selecting the same beam twice within an episode. The total reward is thus
$$r(t) = I(t) - w\,\vartheta(t),$$
where $w$ is the weight parameter for the penalty. Fig. 7 shows the reward function implementation.
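A minimal sketch of such a reward: a log-rate term for the chosen beam minus a weighted repeat-selection penalty. The exact form of $I(t)$ and $\vartheta(t)$ is not fully specified in the text, so this concrete shape (log of one plus the squared channel magnitude, and a flat penalty $w$) is our own assumption:

```python
import numpy as np

def step_reward(h_abs_sq, chosen, already_chosen, w=1.0):
    """Reward = rate-style term for the chosen beam minus a weighted
    penalty when that beam was already selected in this episode.
    h_abs_sq: length-N vector of |h_n|^2 values for the current user."""
    rate = np.log2(1.0 + h_abs_sq[chosen])     # channel-based rate term I(t)
    penalty = w if chosen in already_chosen else 0.0
    return rate - penalty
```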

C. DRL AGENT HYPERPARAMETERS FOR BEAM-GROUP SELECTION 1) EXPERIENCE REPLAY BUFFER
During the training of the DQN, we calculate the target value as in (20) and the loss function as in (25). These expressions require the information tuple (s_t, a_t, s_{t+1}, r_t), which is stored in the experience buffer. The agent draws a random mini-batch from the experience buffer for each episode of training.
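A minimal replay-buffer sketch along these lines (the class name and its API are ours, not from the paper):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s_t, a_t, s_{t+1}, r_t) experience tuples."""

    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)  # oldest entries are evicted first

    def push(self, s, a, s_next, r):
        self.buf.append((s, a, s_next, r))

    def sample(self, batch_size):
        """Draw a random mini-batch for one training step."""
        return random.sample(list(self.buf), batch_size)

    def __len__(self):
        return len(self.buf)
```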

3) EPSILON GREEDY EXPLORATION (0 ≤ ε ≤ 1)
In order to incorporate a suitable trade-off between exploration and exploitation of the action space, we use the epsilon-greedy hyperparameter ε (0 ≤ ε ≤ 1). The agent explores (i.e., randomly selects an action) with probability ε and exploits (i.e., determines the action by (28)) with probability 1 − ε.

3) DISCOUNT FACTOR (0 ≤ γ ≤ 1)
The agent's goal is to maximize the expected cumulative reward within an episode. The expected return at time t is the reward at time t plus all the future rewards up to the terminal time step. Since a future reward carries less weight than the present reward, the agent tries to select actions so that the sum of discounted rewards over the future, G_t = Σ_{m=0}^{N_m} γ^m r_{t+m}, is maximized [24], where γ is the discount factor and N_m is the number of time steps from the present time to the terminal time step. If γ = 0, the agent is concerned only with maximizing the immediate reward, irrespective of the future rewards.
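The discounted return above can be computed with a simple backward recursion:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_{m=0}^{N_m} gamma^m * r_{t+m}, computed backwards so that
    each step only needs one multiply and one add."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With γ = 0 the return reduces to the immediate reward, matching the limiting case described above.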

4) LEARNING RATE (0 < α < 1)
The learning rate (α) controls the step size during the learning of the neural network weights. Too small a value of α increases the training time, while too large a value results in a suboptimally trained network.

VI. POWER ALLOCATION
Power is allocated at two levels: to the beams, and to the users within each beam.
Since each beam contains multiple users, and not necessarily an equal number, the total power is divided among the beams according to P_n = P_tot · Card(K_n)/K. Within a beam, a user performs SIC decoding of all lower channel gain users to subtract those users' signals, and treats the signals of higher channel gain users as intra-beam interference.
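The per-beam power split P_n = P_tot · Card(K_n)/K can be sketched as follows (the helper name is ours):

```python
def beam_powers(p_tot, users_per_beam):
    """P_n = P_tot * Card(K_n) / K: beams with more users get more power.
    users_per_beam[n] is Card(K_n); their sum is the total user count K."""
    k = sum(users_per_beam)
    return [p_tot * k_n / k for k_n in users_per_beam]
```

Note that the per-beam powers always sum to P_tot, so the split conserves the total transmit power.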
Optimal power allocation has been derived for multiple users [13], [14], but only for a single channel; multichannel NOMA optimal power allocation for two users per channel is given in [15]. We consider the general case of Card(B) beams and Card(K_n) users in beam n, where n = 1, . . . , Card(B). The power allocation problem P1 maximizes the sum-rate subject to the power ordering p_{1,n} < p_{2,n} < . . . < p_{Card(K_n),n} and the QoS constraints R_{k,n} ≥ R^min_{k,n}, ∀ n ∈ B, k ∈ K_n. (33c) The optimization problem P1 is non-convex due to inter-user interference. We first decompose the optimization problem into N sub-problems and then maximize each sub-problem independently; this is equivalent to maximizing the original problem P1. We use a transformation of the optimization variables for beam n, q_{k,n} = Σ_{j=1}^{k} p_{j,n}, where k = 1, . . . , Card(K_n) and n = 1, . . . , Card(B). This results in p_{k,n} = q_{1,n} for k = 1, and p_{k,n} = q_{k,n} − q_{k−1,n} for k = 2, . . . , Card(K_n). (34)
Therefore, the sum-rate of user k in beam n can be written as in (35), where Γ_{k,n} is the channel-to-noise power ratio. After a simple rearrangement, (35) can be written as in (36), shown at the bottom of the page.
The transformed optimization sub-problem P2 for beam n is subject to
q_{Card(K_n),n} < P_n, (37a)
(2^{R^min_{1,n}} − 1)/Γ_{1,n} ≤ q_{1,n} ≤ q_{2,n} − q_{1,n} ≤ . . . ≤ q_{Card(K_n),n} − q_{Card(K_n)−1,n}, (37b)
q_{k−1,n} ≤ (Γ_{k,n} q_{k,n} − 2^{R^min_{k,n}} + 1)/(2^{R^min_{k,n}} Γ_{k,n}). (37c)
The transformed problem P2 has a concave objective function; hence, it can be solved by standard convex optimization techniques [13]. Under the condition R^min_{k,n} ≥ 1 for k = 2, . . . , Card(K_n), the optimal q*_{k,n} is obtained in closed form. Therefore, the optimal solution to the parent problem P1 is p*_{k,n} = (2^{R^min_{k,n}} Γ_{k,n} q*_{k,n} − Γ_{k,n} q*_{k,n} + 2^{R^min_{k,n}} − 1)/(2^{R^min_{k,n}} Γ_{k,n}), k = 2, . . . , Card(K_n). (39)
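The variable transformation in (34) and its inverse can be sketched as follows (the helper names are ours):

```python
def q_from_p(p):
    """Forward transform: q_k = sum of the first k per-user powers p_j."""
    q, running = [], 0.0
    for p_k in p:
        running += p_k
        q.append(running)
    return q

def p_from_q(q):
    """Inverse, as in (34): p_1 = q_1, p_k = q_k - q_{k-1} for k >= 2."""
    return [q[0]] + [q[k] - q[k - 1] for k in range(1, len(q))]
```

The transform turns the nested interference terms into cumulative powers, which is what makes the sub-problem objective concave; recovering the per-user powers afterwards is just the difference of consecutive q values.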

VII. SIMULATION RESULTS
This section evaluates the performance of the proposed user grouping and RL-based beam selection for MIMO-NOMA using MATLAB. The proposed scheme is compared with recent beam selection and power allocation schemes: MI-based DFT-NOMA [9], Gale-Shapley-based Stable Matching and greedy-based beam selection [2], and Hungarian-based beam selection [23]. To implement NOMA with the beam selection schemes of [2] and [23], we use K-means-based NOMA user grouping.

A. SIMULATION SCENARIO
The simulation scenario consists of a single cell with K = 16 users. The base-station is equipped with N = 64 antennas and N_RF = 8 RF chains. In the proposed beam selection scheme, we use a maximum of two NOMA users per beam [25]. The ULA has λ/2 inter-element spacing, where λ is the transmission wavelength. We use a limited-multipath mmWave channel model with L_p = 5 [4], [26]. The transmit SNR is defined as ρ = P/σ², and the value used is shown in each plot. A low-correlation threshold of 0.5 is used to select the group-heads. The QoS requirement is set as a target minimum rate R^min_{n,k} = 2 bits/s/Hz for all users. The list of hyperparameters used for the DQN is given in Table 2. The results are obtained by averaging over 50 channel realizations.

B. LEARNING PERFORMANCE OF THE DEEP Q-NETWORK
The DQN agent has two inputs (state, reward) and one output (action). The inputs are provided by the environment, and the action is executed on the environment. We first investigate the learning performance of the agent. Fig. 8 shows the episode reward graph. At the start, all DQN weights are randomly initialized, and the agent starts learning with ε = 1, i.e., it takes actions randomly, and then gradually decreases the exploration according to the epsilon decay policy shown in Fig. 9. As the training progresses, the agent learns and updates its weights. After 500 episodes, the average reward becomes 6.9529, with a last-episode reward of 6.3936. The learning curve is approximately stable after 270 episodes. Fig. 10 presents a graphical view of the agent's inputs and output over an entire episode. Since we choose a sampling time of 0.025 sec, the x-axis spans 0 to 0.2 sec for an episode length of 8 (corresponding to the number of user groups). The top-left graph is the action output of the agent: beam 3 is selected for user group 1, beam 14 for user group 2, and so on. The bottom-left graph shows the reward input at each time step within an episode. The right-side graph shows the input of 64 × 8 absolute squared values of the channel matrix; the legend on top shows the beam selected for each user group. For user group 1 (i.e., time step 1, between 0 and 0.025 sec), beam 3 is selected, and so on. One can notice that beam 3 has less power than beams 14 and 56, but the RL agent selects the beam based on the cumulative expected reward.
The effect of various hyperparameters on the learning performance is shown in Table 3. We train the agent for 500 episodes and record the average reward over all episodes. As the learning rate decreases, the average reward increases, but at the cost of more training time. Also, the average reward depends more on the present and near-future values, as indicated by the higher reward obtained with lower discount factors. An experience buffer length between 10,000 and 50,000 does not affect the reward significantly. When we vary the epsilon decay factor, 0.3 gives the largest reward. Fig. 11 depicts the sum-rate performance for various values of transmit SNR. To ensure the QoS, the minimum rate is set as R^min = 2 b/s/Hz. The proposed user grouping and RL-based beam-group selection scheme outperforms all the other schemes. Specifically, at 10 dB SNR, the proposed scheme performs 42% better than the Stable Matching-based MIMO-NOMA. It can be seen that RL-based DFT-NOMA becomes more beneficial at low transmit power per beam due to less interference within a beam. We also simulate RL beam selection with K-means-based user grouping, whose performance is similar to the Stable Matching-based NOMA. The K-means algorithm takes the number of groups as input, but the number of users within a group varies depending on the Euclidean distance; therefore, in K-means grouping, groups can contain different numbers of users. In our proposed user grouping algorithm, the maximum number of users within a group is two, which limits the intra-beam interference. The Stable Matching-based beam selection is an exhaustive-search-based optimal scheme with very high computational complexity; a low-complexity but suboptimal version is used in greedy-based DFT-NOMA. The MI-based DFT-NOMA scheme first selects the beams using the maximum-relevance minimum-redundancy principle. Then each user determines its strongest beam.
If a beam is the strongest for two or more users, it is assigned to those users. Selecting the beams first without considering the users' channels badly affects the performance of this scheme, as shown in the graph.

C. PERFORMANCE ANALYSIS OF USER GROUPING AND BEAM SELECTION DESIGN
The energy efficiency comparison is shown in Fig. 12. The reduced interference is due to the maximum of two users per group. The proposed RL-based DFT-NOMA exhibits superior performance at low SNR values. At 10 dB SNR, we get 42% better EE compared to the Stable Matching-based scheme. The RL-Kmeans-based DFT-NOMA also performs well at low SNR values, and the performance gap between the two RL-based schemes is around 0.0255 bits/s/Joule. Increasing the transmission power increases the interference and decreases the SE, hence decreasing the EE as well.
In Fig. 13, the sum-rate performance is investigated with an increasing number of users. With K ≤ 8, MIMO-NOMA is equivalent to OMA because N_RF = 8 and each beam serves at most one user. Again, three schemes, RL-based DFT-NOMA, RL-Kmeans-based DFT-NOMA, and Stable Matching-based DFT-NOMA, are competitive in this figure. It can be seen that as the number of users increases, the performance gap between RL-based DFT-NOMA and its competitors widens. Due to the K-means-based user grouping, the other schemes perform better with fewer users. The mean and standard deviation across the users are (8.739, 0.595), (7.833, 1.402), and (7.298, 1.047) for RL-based DFT-NOMA, RL-Kmeans-based DFT-NOMA, and Stable Matching-based DFT-NOMA, respectively. The lowest performer is again MI-based DFT-NOMA. The very low performance of the MI-based DFT-ABF scheme has a clear cause: it does not account for the minimum correlation between the selected beams, which induces severe inter-beam interference. Fig. 14 depicts the sum-rate versus users at an SNR of 20 dB. It can be noticed that RL-based DFT-NOMA performs much better than the other schemes when the number of users approaches 2 × Card(B) and the SNR is between 10 and 20 dB. In the proposed user grouping scheme, the two users in a group have a large channel gain difference, and the low channel gain user requires more transmission power, which causes more interference in the high transmit SNR regime. In the case of OMA (i.e., K = 8) and NOMA in only a few beams (K = 9, 10), there is marginal or no performance gain with RL-based DFT-NOMA. The mean and standard deviation across the users are (12.08, 0.97), (11.72, 1.5), and (11.57, 1.227) for RL-based DFT-NOMA, RL-Kmeans-based DFT-NOMA, and Stable Matching-based DFT-NOMA, respectively. This shows that the SE of RL-based DFT-NOMA deviates less even as the number of users changes.
Finally, the sum-rate versus minimum target data rate is shown in Fig. 15 and Fig. 16 for transmit SNRs of 10 and 20 dB, respectively. Due to the suboptimal performance of the Hungarian-, greedy-, and MI-based schemes, and to ensure an information rate well above R^min for each user, their sum-rate remains at least 4 b/s/Hz lower than that of the other near-optimal schemes. RL-based DFT-NOMA exhibits the highest sum-rate but decreases more sharply with an increasing QoS requirement in the form of R^min. This is because ensuring a higher R^min for the low channel gain users requires more power, which in turn deprives the group-head of its power share and results in a substantial decrease in the sum-rate.

VIII. CONCLUSION
In this paper, we propose a novel user grouping and reinforcement learning-based beam-user selection for a massive MIMO-NOMA downlink system. We use channel correlation and channel gain information for intra-beam and inter-beam user selection in the user grouping algorithm. After the user grouping, a deep Q-network selects the optimal beams as actions on the basis of CSI-based states and an information rate-based reward function. Finally, optimal multi-beam, multi-user power allocation is performed. It has been shown that the proposed RL-based DFT-NOMA outperforms the state-of-the-art Gale-Shapley-based Stable Matching, Hungarian-, and MI-based MIMO-NOMA schemes. Specifically, we obtain a 42% increase in SE and EE performance at 10 dB transmit SNR.
As an extension of this work, the performance of the DRL-based design can be examined using proximal policy optimization (PPO) or trust region policy optimization (TRPO) agents; these are policy-based actor-critic agents.

He is an author of several journal and conference papers in the field of communications and information technology. He has worked on LTE MiFi clouds, hotspots, Wingles, USB dongles, and drive testing for CDMA/EVDO networks, checking QoS parameters using NEMO Analyzer and Genex Probe. His current research interests include wireless communication, 5G communications, optical wireless communications, fiber optic systems and networks, optical transmission, optical fiber access networks, technology management, operational management, project management, and industrial organization.
TARIG FAISAL received the master's degree in mechatronics engineering from IIUM University, in 2006, and the Ph.D. degree in signal processing from the University of Malaya, Malaysia, in 2011. He has been the Dean of Academic Operations at the Higher Colleges of Technology, since 2018. He has more than 20 years of academic and industry experience, during which he has worked as an Engineer, an Assistant Professor, the Programs Chair, the Head of Department, the Division Chair, and the Campus Director. His research interests include biomedical signal processing, intelligent systems, robotics, control, embedded system design, the IoT, machine learning, and outcome-based education. He has been a reviewer for multiple journals, including IEEE, Elsevier, Taylor & Francis, and Springer Nature. He is also a Chartered Engineer as well as a Senior Fellow of the Higher Education Academy.