A DRL-Based Automated Algorithm Selection Framework for Cross-Layer QoS-Aware Scheduling and Antenna Allocation in Massive MIMO Systems

Massive multiple-input-multiple-output (MIMO) systems support advanced applications with high data rates, low latency, and high reliability in next-generation mobile networks. However, using machine learning in massive MIMO resource allocation is challenging because of diverse quality of service (QoS) requirements and network complexity across layers. This work presents a novel framework for adapting the scheduling and antenna allocation in massive MIMO systems using deep reinforcement learning (DRL). Rather than directly assigning execution parameters, the proposed framework uses DRL to select combinations of algorithms based on the current traffic conditions. The DRL model is trained using a specialized Markov decision process (MDP) model with a componentized action structure and is evaluated in realistic traffic scenarios. The results show that the proposed framework increases the number of satisfied users by up to 7.2% and 12.5% compared to static algorithm combinations and other cross-layer adaptation methods, respectively. This demonstrates the effectiveness of the framework in managing and optimizing resource allocation in a flexible and adaptable manner.


I. INTRODUCTION
The rapid development of mobile networks has multiplied the demand for applications requiring high data rates, low latency, and high reliability [1]. While traditional mobile networks face spectrum scarcity, multiple-input-multiple-output (MIMO) technology, which has brought crucial gains in system capacity and reliability, is regarded as a necessary feature of fifth-generation (5G) and beyond-5G (B5G) wireless network systems [2]. Multi-user MIMO (MU-MIMO) has been widely exploited in current wireless systems and offers a significant improvement over conventional MIMO [3]. However, with a roughly equal number of service antennas and terminals under frequency-division duplex operation, MU-MIMO lacks scalability in various scenarios [4]. The massive MIMO system [5], [6] has achieved breakthroughs in practice by serving active terminals with a large number of antennas under time-division duplex (TDD) operation [7]. It is characterized by a base station (BS) equipped with more than 100 antennas that simultaneously serves multiple users with shared time-frequency resources. The extra antennas are used to steer energy into small regions to improve system throughput and energy efficiency. BSs in massive MIMO systems require more advanced resource allocation because of the number of antennas and resource options involved [8].
In massive MIMO, zero-forcing (ZF) precoding is a primary linear scheme that attains near-optimal capacity by exploiting the asymptotic orthogonality of user channels under different reflecting and scattering paths in a rich-scattering environment [9]. It is generally realized through baseband processing, which requires radio frequency (RF) chains to perform RF-baseband frequency conversion and analog-to-digital conversion. This immense hardware demand limits the scalability expected from massive antenna arrays. Hybrid precoding is widely adopted in recent research to mitigate hardware constraints while realizing the potential of massive MIMO systems [10], [11]. Hence, hybrid precoding is used in this work.
User selection and precoding strategies in massive MIMO across the media access control (MAC) and physical (PHY) layers have been actively investigated. In [12], the authors proposed user and antenna selection algorithms to maximize the sum-rate of a massive MIMO system with various precoding schemes. Lagen et al. [13] presented a procedure for joint user scheduling, precoder design, and transmit direction selection in TDD MIMO small cell networks. In [14], a utility-based antenna allocation algorithm is proposed to efficiently allocate antennas to user equipments (UEs) in a massive MIMO system; the work considered scalable video streaming. Choi et al. [15] proposed a joint user selection, power allocation, and precoder design algorithm for massive MIMO downlink systems that provides gains in spectral efficiency. However, these rule-based joint precoding methods focus less on application quality of service (QoS) and incur higher complexity in 5G scenarios. With advances in artificial intelligence, deep reinforcement learning (DRL) approaches have been adopted to deal with wireless network scheduling problems. Wei et al. [16] proposed an actor-critic-based model for user scheduling and resource allocation to use radio resources efficiently in heterogeneous networks (HetNets). Fiandrino et al. [17] also laid out a framework for machine learning (ML) based optimization for future networks. Despite their potential to perform joint resource allocation in next-generation mobile networks, DRL-based approaches suffer from practical learning issues [18]. Coordinating the interaction of algorithms across scheduling and precoding functions is still an open problem.
Reinforcement learning (RL) based algorithm selection [19] can be a robust framework for handling diverse QoS requirements across layers while taking advantage of established algorithms for high-performance joint adaptation. Specifically, in B5G scenarios, when a large portion of UEs is under strict latency constraints, the system benefits primarily from giving higher priority to UEs whose data are about to expire. When most UEs are traffic-demanding, proportional fairness can be the preferred criterion. Although cross-layer adaptation algorithms can be developed to schedule and allocate network resources, the complexity of the resulting rules increases rapidly under 5G features and diverse QoS requirements. Therefore, joint approaches that adapt among feasible fundamental algorithms are worth investigating. Following the ideas of hybrid algorithm design [20], [21], the algorithm selection problem can be modeled as a Markov decision process (MDP) and solved by RL. Studies have shown that deep learning-based algorithm selection models that interact with their environments in a timely manner have advantages on nonlinear, high-complexity dynamic tasks [22]. The concept was applied to 5G new radio (NR) resource allocation tasks to improve the training process, but that work focused solely on user scheduling [23], [24].
In this work, we investigate a joint user scheduling, antenna allocation, and precoding problem in a massive MIMO system running 5G applications. The problem evolves from conventional precoding and scheduling problems to handle strict QoS requirements from users. Instead of directly assigning resources, such as the number of antennas, the process is transformed into selecting algorithm combinations for scheduling, allocation, and precoding. An MDP model is specifically designed to solve the dynamic algorithm selection task. The main contributions can be summarized as follows:
• We formulate a novel QoS-aware radio resource allocation problem for joint scheduling, antenna allocation, and massive MIMO precoding. The utility function integrates user requirements and constraints toward a long-term system-wide objective that matches the MDP return.
• A componentized MDP action structure is proposed, with the resource allocation functions and fundamental algorithms identified. The dynamic algorithm selection policy can thus be trained effectively.
• A deep deterministic policy gradient (DDPG) [25] based training process incorporating action embedding [26] is designed to convert the action into a continuous space and take full advantage of DDPG.
The simulations are performed under realistic traffic scenarios based on the traffic types in the 5QI table [27]. Static algorithm combinations and baselines from the literature are implemented for comparison. Simulation results suggest that, under demanding scenarios, the proposed dynamic algorithm selection satisfies 7.2% and 12.5% more users than static algorithm selection and related joint allocation schemes, respectively.
In the rest of the paper, we first present related works on resource management and machine learning in Section II. Section III introduces the massive MIMO system model and the problem formulation of joint scheduling and precoding. Section IV describes the proposed MDP model with componentized actions. The simulation results are presented in Section V. Finally, Section VI concludes the article.

II. RELATED WORKS

A. Joint User Scheduling and Precoding
User scheduling has been one of the primary resource allocation topics across generations of mobile communication technologies. With massive MIMO, precoding further controls the availability of the underlying physical resources and can be considered jointly for enhanced performance. The authors of [28] presented an adaptive algorithm for joint user scheduling, precoding design, and beamforming in dynamic MIMO small cell networks. The transmit direction is optimized every frame using conventional allocation strategies across scheduling, precoding, and power control. Based on a two-stage precoding framework for large-scale MIMO systems with frequency-division duplexing, the authors of [29] proposed an improved user scheduling approach with low-rank channels and precoding design that achieves both throughput gain and fairness. In [30], [31], joint scheduling and precoding for matching MIMO cellular networks were investigated and analyzed. In [32], the authors proposed an antenna and user selection algorithm for downlink massive MIMO transmission with ZF precoding. Singh et al. [33] developed an optimal resource fraction algorithm (ORFA) combining proportional fair UE ranking and water-filling resource allocation for MIMO networks with a minimum mean square error (MMSE) precoder. In [14], a utility-based layer and antenna allocation (UBLAA) algorithm is proposed to maximize the transmission efficiency for layered video streaming. The marginal utility is evaluated to determine the number of antennas assigned to UEs in a massive MIMO system.
While joint user scheduling and precoding can be executed to some extent with conventional methods, the challenging application QoS requirements and rising complexity of 5G and beyond systems lead to performance degradation. To comprehensively integrate cross-layer functions for future networks, machine learning-based approaches are worth investigating [17].

B. Resource Management with Machine Learning
Machine learning-based techniques are being actively developed for next-generation network resource management. The authors of [17], [34] addressed the benefits of artificial intelligence-aided wireless systems and categorized the primary machine learning algorithms in the context of next-generation networks. Applicable wireless communication technologies include massive MIMO, cognitive radios, heterogeneous networks, small cells, and device-to-device networks. The work in [35] built a resource management model with DRL for network slicing and demonstrated that DRL could incorporate the relation between demand and supply, enhancing network slicing agility. The work in [36] proposed a DRL-enabled coverage and capacity optimization algorithm for massive MIMO systems, in which DRL is used to coordinate the inter-cell interference and support user scheduling dynamically. The work in [37] also applied a DRL model to resource allocation agents in vehicle-to-vehicle (V2V) communications. The agents determine the sub-band and power levels for transmission from local observations. Zhang et al. [38] proposed DRL-based control for resource management in spectrum sharing. With dynamic power control, both primary and secondary users efficiently meet their QoS requirements.
Overall, DRL has been applied to various resource management tasks in wireless networks. However, cross-layer coordination is more complex, less studied, and requires a specifically designed ML structure to be solved effectively.

C. Deep Reinforcement Learning and DDPG
In general, RL is a machine learning technique for solving decision-making problems that are typically modeled as an MDP, a mathematical framework describing the target environment [39]. In RL, an agent learns by interacting with the environment and iteratively improves its ability to achieve a pre-defined goal. An MDP problem consists of states $s_t \in \mathcal{S}$, actions $a_t \in \mathcal{A}$, transition probabilities $\Pr(s_{t+1} \mid s_t, a_t)$, and rewards $r_t = r(s_t, a_t) \in \mathbb{R}$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces. In each time step $t$, the agent observes $s_t$ from the environment and chooses a suitable $a_t$. After the action is applied, the next state $s_{t+1}$ and reward $r_t$ are observed from the environment. In this model, the goal is to learn the stochastic policy $\pi(a_t \mid s_t)$ that maximizes the long-term return

$$R_t = \sum_{i=t}^{T} \gamma^{\,i-t}\, r(s_i, a_i), \tag{1}$$

where $T$ and $\gamma \in (0, 1)$ are the termination time and the discount factor. The action-value represents the expected return when executing action $a_t$ in state $s_t$ and following $\pi$ thereafter:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[ R_t \mid s_t, a_t \right]. \tag{2}$$

When modeled as an MDP, massive MIMO resource allocation is a high-complexity problem with large state and action spaces needed to represent the possible situations. Therefore, a deep learning-based approach, specifically DDPG [40], is proposed for integration with the resource allocation process. DDPG is a landmark scheme in the policy gradient family and is more suitable for applications with complex actions than the deep Q-network (DQN) [41]. The deterministic policy gradient (DPG) [42] concept, experience replay, slow-learning target networks from DQN, and the actor-critic structure are all integrated into DDPG.
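As a small illustration of the long-term return described above, the following sketch evaluates the standard discounted sum for a finite reward sequence. The function name `discounted_return` and the list-based interface are illustrative assumptions, not part of the proposed framework.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of a reward tail: rewards[0] + gamma*rewards[1] + ...

    `rewards` holds r_t, r_{t+1}, ..., r_T for the step t of interest.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    total = 0.0
    # Accumulate backwards so each step needs one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

With negative per-step rewards, as used later in this work, a return closer to zero indicates that transmissions stayed on pace over the horizon.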
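The DDPG algorithm utilizes the recursive Bellman equation to evaluate action-value functions. With the deterministic policy $\mu : \mathcal{S} \to \mathcal{A}$ providing the action $a_t = \mu(s_t)$, the action-value function becomes

$$Q^{\mu}(s_t, a_t) = \mathbb{E}\left[ r(s_t, a_t) + \gamma\, Q^{\mu}\big(s_{t+1}, \mu(s_{t+1})\big) \right]. \tag{3}$$

Furthermore, considering function approximators parameterized by $\theta^{Q}$ and $\theta^{\mu}$, we can optimize the action-value function by minimizing the loss function. The actor network updates the policy with the aid of the critic network, where the policy gradient is [42]

$$\nabla_{\theta^{\mu}} J \approx \mathbb{E}\left[ \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_t,\, a=\mu(s_t)}\; \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_t} \right]. \tag{4}$$

Accordingly, the training procedure using samples from the experience replay, $E$, can be realized.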
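The Bellman-based critic update can be sketched without any deep learning library. The example below computes one-step targets and the mean squared error that a critic would minimize, assuming the target critic values for the next states are already available as plain numbers. The names `td_targets` and `critic_loss`, and the optional `terminal` flags that disable bootstrapping at the end of an episode, are assumptions made for this illustration.

```python
def td_targets(rewards, next_q_values, gamma, terminal=None):
    """One-step targets y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})).

    next_q_values[i] is the target critic's value for the next state under
    the target actor's action. Terminal steps do not bootstrap (assumption).
    """
    if terminal is None:
        terminal = [False] * len(rewards)
    return [r if done else r + gamma * q
            for r, q, done in zip(rewards, next_q_values, terminal)]


def critic_loss(q_values, targets):
    """Mean squared error between critic outputs and their targets."""
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)
```

In a full implementation the loss would be differentiated with respect to the critic parameters; here it is only evaluated to show how the target is formed.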

III. SYSTEM MODEL AND PROBLEM FORMULATION
This section describes the massive MIMO network structure and the problem formulation. A joint user scheduling, antenna allocation, and precoding problem is proposed that can be modeled as an MDP.

A. System Model
We consider a single-cell massive MIMO system consisting of an $M$-antenna BS and $K$ single-antenna user equipments (UEs). Thus, we have $m \in \mathcal{M} = \{1, 2, \ldots, M\}$ and $k \in \mathcal{K} = \{1, 2, \ldots, K\}$. During transmission, the BS allocates $N_{k,t}$ antennas to UE $k$ at the $t$-th transmission time interval (TTI), where a TTI is $T_I$ seconds long. Each UE is associated with a type of traffic, referring to the 5QI table [43]. The UE properties considered include the channel quality indicator, the requested data, and the traffic type. The channel quality indicator, CQI, can be obtained from the table defined in [44] given the signal-to-interference-plus-noise ratio (SINR). The requested data, $D_k$, is the set of data packets generated for transmission. The properties attached to a traffic type, TYPE, are packet size, mean packet arrival time, latency constraint, guaranteed bit rate (GBR), and error rate constraint.
The joint user resource allocation and precoding model is illustrated in Figure 1. The procedure consists of three function components: user prioritization, antenna allocation, and precoding. The outcome of user prioritization is defined as $\hat{O} = \{O_t \mid 1 \le t \le T\}$, which ranks UEs every TTI. $O_t$ is an ordered subset of UEs containing those that have requested data to be transmitted at $t$. The antenna allocation results, $N = \{N_t \mid 1 \le t \le T\}$, record the number of antennas assigned to the prioritized UEs, where $N_t = \{N_{k,t} \mid k \in O_t\}$. The precoding matrix set, $P = \{P_t \mid 1 \le t \le T\}$, is the set of precoding matrices given the antenna allocation, i.e., $P_t \equiv P(N_t)$.
Given the user data requests, $D_k$, the policies of the function components are determined according to the channel quality feedback $H$ and CQI. Finally, a UE decodes the received signal $y_k$ to recover the data.

B. Hybrid Precoding
The precoding influences the spectral efficiency by evaluating the precoding matrix and providing raw capacity for resource allocation [45]. Assuming there are $K_r$ UEs simultaneously receiving data at a TTI, the received signal vector is $y = [y_1, y_2, \ldots, y_{K_r}]^{T} \in \mathbb{C}^{K_r \times 1}$, where $H \in \mathbb{C}^{K_r \times M}$ denotes the channel matrix of all $K_r$ users, and $I$ and $\rho$ refer to the identity matrix and the total transmission power. The received signal of user $k$ follows from the system model defined in Figure 1. After applying the resource allocation functions, we have $K_r = |O_t|$ with the time index $t$. The signal-to-interference-plus-noise ratio (SINR) of UE $k$ at $t$ is given in [46], and $W$ in (8) is the system bandwidth. We formulate the precoding problem (10) as a maximization subject to constraints, where $p^{m}_{k,t}$ is an element of $p_{k,t}$ and (11) is the constraint on the precoding matrix gain.

C. User Scheduling and Antenna Allocation
In general, user resource allocation aims to maximize the total system utility by actively distributing resources. A packet-level transmission utility is first defined to describe the QoS status in terms of requested data. Let $t(d)$ denote the set of TTIs assigned to transmit data packet $d$ within its latency constraint. The indicator $u^{d}_{k,t}$ gives the receiving status of data packet $d \in D_k$: a packet is successfully received, i.e., $u^{d}_{k,t} = 1$, if sufficient resources are allocated to it in time, where $\varepsilon_d$ is the packet size. Consequently, the number of successfully received packets by UE $k$ up to time $t$ can be counted. Based on this receiving status, the user resource allocation problem is defined to maximize the total utility every TTI. The utility gain of UE $k$ up to the $t$-th TTI, $U_{k,t}$, is a function of the transmission data rates allocated to it over time. At the same time, the application requirements, including guaranteed bit rate (GBR), packet loss rate, and latency, must be satisfied. The allocation decision prioritizes UEs for $O_t$ and determines the corresponding numbers of antennas $N_t$. We formulate the problem as (14) subject to (15a)-(15c), where $E_k$ is the packet error rate requirement of UE $k$'s traffic type. (15a) shows the derivation and constraint of providing the GBR for $k$. (15b) is the packet error rate constraint. (15c) limits the total number of antennas allocated. The utility function is left open to be further defined.

D. QoS-Aware Joint Resource Allocation
It is challenging to adapt options from precoding to scheduling effectively. Problems (10) and (14) are interdependent and have different objectives. Given the antenna allocation, the precoding matrix determines the resulting throughput, while the antenna resources are allocated through user scheduling based on a throughput-dependent utility. We model this complex interaction with a utility function that integrates requirements and dependencies toward a long-term system objective. Under the componentized structure and adaptive algorithm selection, the problems are processed jointly.
The QoS-aware joint resource allocation objective is to maximize the number of satisfied users in the system given their application requirements. Therefore, we propose to redefine the general utility objective with the requirements integrated up to the termination time $T$. The received utility of UE $k$, $U_k$, is set to 1 when the GBR, loss, and latency requirements are all satisfied given the allocated antenna resources; this utility function is expressed in (16). The joint problem (17) is formulated from the QoS-requirement-embedded utility and the resource constraints. The objective is to maximize the total number of satisfied UEs by determining the optimal $\hat{O}$, $N$, $P$ over time. (18a) limits the total number of allocated antennas. (18b) is the constraint on the precoding matrix gain. The problem features a utility function that depends on complex criteria and long-term returns. Therefore, an MDP-based solution, which models complex agent-environment interaction and optimizes the future return during the process, fits the problem well.
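The satisfied-user objective can be made concrete with a short sketch. It assumes that each finished session is summarized by a few counters, that a late packet counts as a lost one, and that an unsatisfied UE contributes zero utility; the `SessionStats` fields and function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SessionStats:
    achieved_rate: float    # average delivered data rate (bit/s)
    gbr: float              # guaranteed bit rate (bit/s); 0 if none applies
    packets_sent: int       # packets requested during the session
    packets_on_time: int    # packets fully delivered within the latency bound
    max_error_rate: float   # packet error rate allowed by the traffic type


def user_utility(s):
    """Return 1 when the GBR, loss, and latency requirements all hold, else 0."""
    gbr_ok = s.achieved_rate >= s.gbr
    # Late and lost packets are both treated as failures (assumption).
    failures = s.packets_sent - s.packets_on_time
    loss_ok = failures <= s.max_error_rate * s.packets_sent
    return int(gbr_ok and loss_ok)


def normalized_system_utility(sessions):
    """Fraction of sessions whose requirements were all satisfied."""
    return sum(user_utility(s) for s in sessions) / len(sessions)
```

The normalized value corresponds to the percentage of satisfied UEs that the evaluation reports as system utility.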

IV. DEEP REINFORCEMENT LEARNING FOR JOINT MASSIVE MIMO RESOURCE ALLOCATION
In this section, we formulate the MDP problem with states, actions, and rewards. The resource allocation function components and the DDPG training procedures are also detailed.

A. Markov Decision Process Formulation
TYPE. Thus, the state at the t-th TTI is defined as
Based on the problem formulated in Section III-D, the resource allocation action is formed as a combination of components: user prioritization, antenna allocation, and precoder selection. Fundamental schemes proven to be helpful in specific scenarios are included in each component. The componentized architecture is shown in Figure 3. The action dynamically selects a scheme in each component according to the observed state and is expressed as $a_t = [c_{1,t}, c_{2,t}, c_{3,t}]$ with $c_{i,t} \in C_i$. The details of the included components are described in Section IV-B. The reward is designed to keep the data transmission on pace considering the traffic type-specific GBR and latency requirements. Because the advantage of proactive transmission is harder to quantify [47], we adopt negative rewards to discourage situations in which the transmission progress falls behind. The per-UE reward function $r_k$ is formulated in (21). Its first term reflects the incompletion ratio of the requested data up to time $t$. As a penalty, the value is negative if the requested data from UE $k$ are not fully transmitted. If all requested data are transmitted, the first term, and thus the reward $r_k$, becomes zero.
The second term is the adjustment that keeps the transmission data rate $\Phi$ on pace with $\mathrm{GBR}_k$. $\alpha$ is the penalty weight, and $\alpha = 0$ when the traffic type has no GBR assigned. The resulting reward $r_t$ is recorded in the training process to maximize the future return, $R_t$. With this reward function design and the MDP-based future return optimization, the learning process is fully aligned with the utility optimization problem (17).
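The two reward terms described above can be illustrated as follows. This is only a sketch under stated assumptions: the exact expression of the reward is not reproduced here, so the additive form, the relative GBR shortfall used for the second term, and the averaging over UEs in `step_reward` are all hypothetical choices that merely respect the stated properties (non-positive, zero once all requested data are transmitted, and no GBR penalty when the traffic type has no GBR).

```python
def ue_reward(requested_bits, transmitted_bits, rate, gbr, alpha=0.5):
    """Non-positive reward for one UE (illustrative form, not the paper's).

    First term: negative incompletion ratio of the requested data.
    Second term: alpha-weighted relative shortfall of `rate` against `gbr`.
    """
    if requested_bits <= 0 or transmitted_bits >= requested_bits:
        return 0.0  # nothing outstanding, so no penalty
    incompletion = 1.0 - transmitted_bits / requested_bits
    if gbr > 0:
        shortfall = max(0.0, 1.0 - rate / gbr)
        return -incompletion - alpha * shortfall
    return -incompletion  # alpha is effectively 0 without a GBR


def step_reward(ue_rewards):
    """Aggregate per-UE rewards into one scalar (averaging is an assumption)."""
    return sum(ue_rewards) / len(ue_rewards) if ue_rewards else 0.0
```

For instance, a UE that has received half of its requested data at half of its GBR is penalized by both terms, whereas a UE with an empty queue contributes zero.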

B. Componentized Actions
The componentized action is the concept introduced to facilitate dynamic resource allocation via algorithm selection and to improve DDPG training. As shown in Figure 3, we decompose the scheduling and precoding process into three function components, each of which contains several fundamental methods as algorithm options. The included algorithms are diverse in their design concepts so that the selection is meaningful. Component 1 ($C_1$) prioritizes UEs according to specific criteria. Component 2 ($C_2$) decides the number of antennas allocated to each UE per TTI. The hybrid precoding method is determined in Component 3 ($C_3$). The adopted fundamental methods are introduced as follows.
The UE prioritization component ranks UEs in the system. The four implemented sorting methods are:
• Channel quality first (CQI): sorts UEs according to channel conditions. A UE with higher channel quality is ranked higher.
• Expiring time first (Delay): ranks UEs by how close their oldest requested data are to expiring. It depends on the traffic-specific time constraints and on how long the transmission has been delayed.
• Remaining data first (Remain): sorts UEs according to the size of the requested data remaining in the queue, i.e., $D_{k,t} - \nu_{k,t-1}$. A UE with more untransmitted data receives higher priority.
• First-in-first-out (FIFO): prioritizes UEs by the arrival time of their earliest-arrived packet.
Thus, component $c_{1,t} \in C_1 = \{\text{CQI}, \text{Delay}, \text{Remain}, \text{FIFO}\}$. An ordered UE set $O_t$ is generated every TTI.
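The four sorting methods listed above can be sketched as interchangeable sort keys. The `UE` record and its field names are assumptions for illustration; in particular, `time_to_expiry` stands for how close the oldest queued data is to its deadline, and UEs with an empty queue are left out of the ordered set.

```python
from dataclasses import dataclass


@dataclass
class UE:
    uid: int
    cqi: int               # channel quality indicator; higher is better
    time_to_expiry: float  # TTIs until the oldest queued packet expires
    remaining_bits: int    # requested data still waiting in the queue
    oldest_arrival: int    # arrival TTI of the earliest queued packet


# Each option of component C1 maps to a sort key; smaller keys rank first.
PRIORITY_KEYS = {
    "CQI": lambda ue: -ue.cqi,
    "Delay": lambda ue: ue.time_to_expiry,
    "Remain": lambda ue: -ue.remaining_bits,
    "FIFO": lambda ue: ue.oldest_arrival,
}


def prioritize(ues, method):
    """Return the ordered list of UE ids that have data to transmit."""
    active = [ue for ue in ues if ue.remaining_bits > 0]
    return [ue.uid for ue in sorted(active, key=PRIORITY_KEYS[method])]
```

Because every method reduces to a key function, switching the prioritization rule from one TTI to the next only changes which entry of the table is used.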
The second component allocates the system resources, i.e., the number of antennas $N_{k,t}$, to UEs based on the ordered set $O_t$. As a result, $c_2$ also controls the final number of UEs that can be granted a transmission opportunity. In addition to the fundamental methods, a percentage parameter $\iota$ is integrated to extend the options. The fundamental allocation methods implemented are:
• Fully satisfy in order (FSO): allocates enough antennas to fully transmit the remaining requested data of each UE, $D_k - \nu_{k,t-1}$, in the order of $O_t$ until the system resource is exhausted. The number of antennas needed to fully satisfy a UE is denoted $N^{fs}_{k,t}$.
• Minimum guarantee (MinG) [48]: evenly distributes a portion of the antennas to a subset of UEs, $O^{G}_{t} \subseteq O_t$, and applies FSO to the remaining resources. Several UEs can therefore receive a minimum share of antennas, and the portion of resources reserved for even distribution, $\iota_G$, is a key parameter. The number of UEs receiving guaranteed resources is determined according to the smallest $N^{fs}_{k,t}$, with $\iota_G \in \{25\%, 50\%, 75\%, 100\%\}$. Thus, there are four MinG-based options in $C_2$. For example, the option with $\iota_G = 50\%$ is denoted MinG50.
• Proportional fair (PF) [49]: considers a subset of UEs and allocates antenna resources in proportion to the ratio of the currently available data rate, $\Phi_{k,t}$, to the historical transmission rate. In practice, the historical transmission rate can be updated through moving averages. The parameter $\iota_{pf} \in \{25\%, 50\%, 75\%, 100\%\}$ determines the percentage of UEs in $O_t$ to be included.
With all the fundamental schemes and parameters, the complete option set $C_2$ has nine elements.
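The following sketch illustrates FSO and a simplified minimum-guarantee variant. It assumes the per-UE demand $N^{fs}_{k,t}$ is already known as a positive integer in the dictionary `need`. Two choices are assumptions of this example and not taken from the text: the last UE served by `fso` may be only partly satisfied, and `ming` guarantees as many UEs as the reserved pool could cover at the smallest full demand.

```python
def fso(order, need, total):
    """Fully satisfy in order: serve UEs by rank until antennas run out."""
    alloc, left = {}, total
    for k in order:
        if left == 0:
            break
        alloc[k] = min(need[k], left)  # last served UE may be partly satisfied
        left -= alloc[k]
    return alloc


def ming(order, need, total, share):
    """Simplified minimum guarantee: even split of a reserved pool, then FSO."""
    if not order:
        return {}
    reserved = int(total * share)
    smallest = min(need[k] for k in order)
    # Assumed rule for the size of the guaranteed subset of UEs.
    n_guar = min(len(order), max(1, reserved // smallest)) if reserved else 0
    alloc = {k: min(need[k], reserved // n_guar) for k in order[:n_guar]}
    left = total - sum(alloc.values())
    residual = {k: need[k] - alloc.get(k, 0) for k in order}
    pending = [k for k in order if residual[k] > 0]
    for k, extra in fso(pending, residual, left).items():
        alloc[k] = alloc.get(k, 0) + extra
    return alloc
```

With 128 antennas and demands of 100, 20, and 30, `fso` leaves the last UE with 8 antennas, while `ming` with a 50% share gives every UE at least 20 before the remainder goes to the top-ranked UE.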
For high spectrum efficiency in massive MIMO transmission, the third component selects a hybrid precoding algorithm to evaluate the precoding matrix. The fundamental hybrid precoders are:
• Antenna selection (AS) [50]: greedily chooses antennas to achieve high single-antenna efficiency.
• Cross entropy (CE) [51]: is a probabilistic model-based algorithm that iteratively solves the combining problem. The algorithm computes the achievable sum-rate of each candidate and selects the best candidates as "elites." Based on the selected elites, the probability distribution is updated by minimizing the cross entropy. CE precoding performs well with sufficient resources and a less saturated system.
• Adaptive cross entropy (ACE) [52]: is a variation of the CE algorithm. The ACE algorithm weights the "elites" adaptively based on their achievable sum-rates. This precoding method can obtain a better SINR than CE in saturated situations.
The component $c_{3,t} \in C_3 = \{\text{AS}, \text{CE}, \text{ACE}\}$. Overall, the action $a_t$ can be one of 108 component combinations with all options considered.
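The size of the action space follows directly from the option counts of the three components, 4 × 9 × 3 = 108. The sketch below enumerates the combinations; the label format joined by hyphens mirrors names such as CQI-MinG75-AS used in the evaluation, while labels of the form `PF50` for the proportional fair options are inferred from those names.

```python
from itertools import product

C1 = ["CQI", "Delay", "Remain", "FIFO"]
SHARES = [25, 50, 75, 100]
# One FSO option plus four MinG and four PF options gives nine elements.
C2 = ["FSO"] + [f"MinG{p}" for p in SHARES] + [f"PF{p}" for p in SHARES]
C3 = ["AS", "CE", "ACE"]

# Every action picks exactly one option from each component.
ACTIONS = ["-".join(combo) for combo in product(C1, C2, C3)]
```

Enumerating the combinations this way also makes it easy to check that a static benchmark is a member of the action set the agent chooses from.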

C. Action Embedding and Training Procedures
As introduced in Section II-C, DDPG takes advantage of DQN, DPG, and the actor-critic structure [25]; it is used to make resource allocation decisions for our target MDP problem with continuous or high-dimensional states and actions. We extend the actions in this work to a continuous space through action embedding [26], where the original discrete actions are embedded in a continuous space upon which the actor can generalize. The function $\upsilon : \mathbb{R}^{\dim(\mathcal{A})} \to \mathcal{A}$ is defined to convert the continuous action $\check{a}_t$ used for training into the discrete action $a_t$ applied to the environment, with $\dim(\mathcal{A})$ denoting the dimension of the action space $\mathcal{A}$. The converting function (25) is therefore expressed as $a_t = \upsilon(\check{a}_t)$, where $\check{a}_t = [\check{c}_{1,t}, \check{c}_{2,t}, \check{c}_{3,t}]$ is the action formed by continuous component values. Also, the deterministic policy generating the continuous action, $\mu : \mathcal{S} \to \mathbb{R}^{\dim(\mathcal{A})}$, is applied in the model as the actor network $\mu(s_t \mid \theta^{\mu})$.
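One plausible realization of the converting function is a nearest-grid-point mapping, sketched below. It assumes each component's options are placed on evenly spaced points in [-1, 1] and that the continuous value is clipped and rounded to the closest point; the helper names `to_index` and `upsilon` and the option ordering inside each list are assumptions of this example.

```python
import math

C1 = ["CQI", "Delay", "Remain", "FIFO"]
C2 = ["FSO", "MinG25", "MinG50", "MinG75", "MinG100",
      "PF25", "PF50", "PF75", "PF100"]
C3 = ["AS", "CE", "ACE"]


def to_index(x, n):
    """Map x in [-1, 1] to the nearest of n evenly spaced grid points."""
    x = max(-1.0, min(1.0, x))  # clip noisy actor outputs into range
    return min(n - 1, int(math.floor((x + 1.0) / 2.0 * (n - 1) + 0.5)))


def upsilon(cont_action, option_sets=(C1, C2, C3)):
    """Convert a continuous action into one discrete option per component."""
    return tuple(opts[to_index(x, len(opts))]
                 for x, opts in zip(cont_action, option_sets))
```

Because nearby continuous values map to the same or adjacent options, the ordering of options within a component influences how well the actor can generalize across them.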
The training process is described in Algorithm 1. First, the networks are initialized. Every TTI, the agent generates the continuous action $\check{a}_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}_t$ from the actor with random noise $\mathcal{N}_t$ for exploration. The discrete action $a_t$ is obtained from (25) and applied to the environment, which returns the reward $r_t$ and the next state $s_{t+1}$ as feedback. To reuse execution experiences, DDPG stores the transition $(s_t, a_t, s_{t+1}, r_t)$ in the replay buffer. After that, DDPG samples $B$ transitions from the replay buffer to form a mini-batch $\mathcal{B}$. With the mini-batch inputs, the target actor network $\mu'(s_{t+1} \mid \theta^{\mu'})$ outputs the action to the target critic network $Q'$, where the resulting action-value can be evaluated based on (3). The critic network is then updated by minimizing the loss function (26). The actor network is updated with the gradient (27), which follows the deterministic policy gradient theorem modified from (4) as in [26]. We note that the replay buffer stores the discrete action generated by (25), but the policy gradient is taken at $\mu$. This allows the learning algorithm to leverage the action executed in the environment for critic network training while taking the policy gradient at the actual output of the actor network. Finally, DDPG uses the soft update to improve both the critic and actor target networks with the constant $\tau$ as

$$\theta^{Q'} \leftarrow \tau\,\theta^{Q} + (1-\tau)\,\theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau\,\theta^{\mu} + (1-\tau)\,\theta^{\mu'}.$$

The parameters in the target networks change slowly, which considerably improves the learning stability.

Algorithm 1
…
Convert the action from continuous to discrete, $a_t = \upsilon(\check{a}_t)$, embedding on the three components $[c_{1,t}, c_{2,t}, c_{3,t}]$
…
12: Sample a random mini-batch from the replay buffer
13: Update the critic by minimizing the loss (26)
14: Update the actor using the gradient (27)
15: Update the actor and critic target networks with the equations (27) (28)
16: end for
17: end for
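Two building blocks of the training loop, the replay buffer and the soft target update, can be sketched with the standard library alone. Network parameters are represented as flat lists of floats for simplicity; the class and function names are illustrative assumptions and do not reflect the TensorFlow implementation used in the experiments.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (s_t, a_t, s_{t+1}, r_t) transitions."""

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)  # oldest entries are dropped first

    def store(self, s, a, s_next, r):
        self.data.append((s, a, s_next, r))

    def sample(self, batch_size, rng=random):
        """Draw a mini-batch uniformly at random without replacement."""
        return rng.sample(list(self.data), min(batch_size, len(self.data)))


def soft_update(target, online, tau):
    """Move target parameters a fraction tau toward the online parameters."""
    return [tau * w + (1.0 - tau) * w_t for w_t, w in zip(target, online)]
```

A small `tau` keeps the target networks close to their previous values, which is what makes the bootstrapped critic targets change slowly.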

V. NUMERICAL RESULTS
This section introduces simulation settings for traffic scenarios, the massive MIMO environment, and DDPG training.Numerical results compare the proposed learning-based method with baselines, including static combinations of fundamental methods and related works.

A. Simulation Setup
The simulation scenarios are built as mixes of applications in a massive MIMO system. Table II shows six selected traffic types based on the 5QI specifications [43], including voice over IP (VoIP), video streaming, gaming, and virtual reality (VR) / augmented reality (AR). The properties attached to a traffic type include latency requirement, GBR, packet size, mean packet arrival time, and error rate requirement. A UE is admitted to the system as a traffic session with a predetermined type and properties to generate requested data. For the scenarios, traffic sessions of all types are mixed in the various UE ratios listed in Table III, each with a specific focus. For the communication system, COST2100 [53] is used to model the MIMO channel, and UEs are distributed following a Poisson point process (PPP). The simulation datasets are formed from 60000-TTI-long data blocks containing the CQIs and requested data of UEs every TTI. We generate four data blocks for each of the six traffic types for training, resulting in 24 distinct traffic data blocks. In an epoch, the training goes through the 24 data blocks in random order. The model converges after 72 to 74 epochs of training. The resulting model is therefore expected to handle traffic scenarios with an arbitrary mix of data types. The testing is performed on ten separately generated data blocks for each scenario. Also, the penalty weight α in (21) is set to 0.5 when a GBR is available. The continuous component values in (25) are set in [−1, 1] and evenly distributed for discrete actions with dim(A) = 3. Table IV shows the complete parameter list. The training and decision-making models are implemented using TensorFlow [54] version 1.14 on a desktop machine with an Intel i7-3770 CPU and an Nvidia RTX 2080Ti GPU under the parameters listed in Table V.
Several fundamental method combinations and algorithms in the literature are compared with the proposed one to evaluate dynamic method selection effectiveness with DDPG.The benchmark actions are the most frequently selected ones from the learning results and are kept static during simulation runs.For example, CQI-MinG75-AS applies channel quality first, minimum guarantee with 75% resources reserved, and antenna selection precoder.We expect the learning scheme to choose a suitable resource allocation combination under various traffic and channel status of the environment.The static schemes always apply the same algorithms.

B. Dynamic vs. Static Algorithm Combinations
We compare the proposed learning-based method against static combinations in this section to demonstrate the advantages of dynamic algorithm combinations. Two performance metrics are illustrated: the total system utility (17), which is the proposed main objective, and the throughput.
Figure 4a illustrates the normalized system utility, defined as the percentage of satisfied UEs through the termination time T. The proposed learning-based approach gains 2.2% to 7.2% more system utility than the best static scheme across all scenarios owing to its adaptive nature. The most significant advantage appears in Scenario 2, with doubled VR/AR traffic, showing that the learning method is more capable of achieving high bandwidth and low latency simultaneously. The performance of the static schemes is inconsistent across application scenarios. For example, the delay-emphasizing scheme, Delay-MinG75-ACE, ranks second in Scenarios 2 and 5, where more latency-demanding VR/AR or gaming traffic exists. The scheme Remain-MinG50-ACE is comparable with the best ones in the data-rate-demanding Scenarios 3, 4, and 6 but achieves significantly less in the others because UEs with more remaining data are ranked higher. Furthermore, CQI-MinG75-AS is a more versatile static combination because CQI provides high system throughput while MinG75 forces an even distribution of most antenna resources. The greedy nature of the AS precoder also fits well with CQI and MinG75.
From the system throughputs presented in Figure 4b, we observe that greater throughput not necessarily reflects greater utility.Schemes that apply the CQI method for c 1 , CQI-MinG75-AS and CQI-PF50-ACE, result in the highest throughputs because UEs are ranked according to channel quality.The proposed learning-based method ranked only behind CQI methods in throughput and outperforms them in system utility.When the overall traffic demand and throughput are lower in Scenario 4, all schemes achieve system utility greater than 0.9.
Algorithm selection details in Figure 5 can further reveal

C. Comparison with Joint Resource Allocation Algorithms
Figures 6a and 6b compare the normalized system utility and system throughput of the proposed method with those of joint resource allocation algorithms in the literature. Though it does not provide the highest throughput, the proposed learning approach outperforms the best of the ORFA, UBLLA, and LWDF-PF algorithms in normalized system utility. The largest utility gap, 12.5%, is observed in the scenario with heavy traffic of the high data rate and low latency types. In comparison, the smallest gap, 4.4%, appears in the less loaded Scenario 4. ORFA consistently achieves a utility greater than 0.8 due to the optimality of the waterfilling algorithm. However, its general-purpose proportional fair scheduling suffers from degraded performance under diverse application requirements. UBLLA fulfills data rate requirements and results in high throughput in all scenarios. Since latency is not effectively represented via marginal utility, its system utility is not satisfactory in Scenarios 2, 5, and 6, which contain latency-demanding VR/AR applications. LWDF-PF performs worse than the learning and ORFA methods but, by adopting the relatively straightforward greedy weighted-delay and proportional-fairness allocation, does not suffer from significant drops.
Figure 7 shows detailed results for two representative scenarios: Scenario 1, with a balanced traffic mixture, and Scenario 2, emphasizing VR/AR applications. In the proposed problem (17), the overall utility of UE $k$, $U_k$, can be determined only after its last requested data is processed at the session-ending TTI of $k$. To analyze the system condition over time, the average utility of the UEs whose sessions end in the same 100-TTI window is evaluated as the short-term average utility. Figures 7a and 7d illustrate the cumulative distribution function (CDF) of the short-term average utilities with 128 antennas and 500 UEs. The samples of the learning method lie mainly at 0.9 and above, while ORFA has samples extending down to 0.85. UBLLA and LWDF-PF perform differently in Scenarios 1 and 2. UBLLA keeps all samples greater than 0.83 in the balanced condition, while some samples fall below 0.75 when there is more latency-sensitive traffic in the system, as in Scenario 2.
Figures 7b and 7e present the system utility trend using 64 to 224 BS antennas with 500 total UEs (42 coexisting on average). In general, systems gain more utility with more antennas. When the resources are limited to 64 antennas, the learning method gains 6.2% to 40% more utility than the others in the balanced case and 7.1% to 22% more in the VR/AR-emphasized case. Figures 7c and 7f show the system utility trend with 100 to 600 total UEs (8.3 to 50 coexisting on average) at 128 antennas. The advantage of learning-based algorithm selection grows with the saturation level resulting from a greater number of UEs. Also, the overall utilities drop faster in the VR/AR-emphasized Scenario 2.
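The short-term average utility and its CDF can be illustrated with a minimal sketch. The function names are hypothetical, the example numbers are made up for illustration only, and the windows are assumed to be aligned to multiples of the window length starting at TTI 0, which the text does not specify.

```python
from collections import defaultdict


def short_term_average_utility(end_ttis, utilities, window=100):
    """Average the utilities of UEs whose sessions end in the same window.

    end_ttis[i] is the session-ending TTI of UE i and utilities[i] its
    overall utility. Windows are assumed to start at TTI 0.
    """
    buckets = defaultdict(list)
    for end_tti, utility in zip(end_ttis, utilities):
        buckets[end_tti // window].append(utility)
    # One sample per non-empty window, ordered in time.
    return [sum(buckets[w]) / len(buckets[w]) for w in sorted(buckets)]


def empirical_cdf(samples):
    """Return (value, cumulative probability) pairs of the empirical CDF."""
    ordered = sorted(samples)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]


if __name__ == "__main__":
    # Illustrative values only, not simulation results.
    ends = [5, 50, 120, 199, 250]
    utils = [1.0, 0.0, 1.0, 1.0, 0.5]
    samples = short_term_average_utility(ends, utils)
    for value, prob in empirical_cdf(samples):
        print(f"{value:.2f} {prob:.2f}")
```

Windows in which no session ends contribute no sample, so the CDF is taken over non-empty windows only.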
To summarize, the compared joint methods generally address the user scheduling and resource allocation objective (14), where each decision is made to maximize the instantaneous utility $U_{k,t}$. In contrast, the proposed MDP-based method maximizes the long-term utility (16) and thus the joint objective (17), because maximizing the long-term return (1) is inherent to the MDP formulation. The cross-layer integration of scheduling and precoding also proves effective.
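The difference between maximizing the instantaneous reward and maximizing the long-term return can be illustrated with a small sketch. It assumes the standard discounted return used in reinforcement learning, which may differ in detail from the form of (1); the discount factor and the reward sequences are made-up illustrative values.

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted sum r_0 + gamma * r_1 + gamma^2 * r_2 + ... of a reward sequence."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # Hypothetical reward sequences: the myopic choice wins the first step,
    # but the far-sighted choice collects more over the whole horizon.
    myopic = [1.0, 0.2, 0.2]
    farsighted = [0.8, 0.8, 0.8]
    print(discounted_return(myopic), discounted_return(farsighted))
```

A decision rule that compares only the first reward would prefer the myopic sequence, whereas one that compares returns would prefer the far-sighted sequence.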

VI. CONCLUSION
This work investigates a DRL-based radio resource allocation approach for joint scheduling and precoding in a massive MIMO system. We propose an architecture that decomposes the cross-layer adaptation decision into a combination of algorithms and learns a dynamic algorithm selection policy in challenging 5G traffic scenarios. Comprehensive simulations are carried out to verify the effectiveness of the proposed method. Overall, the componentized structure can serve as the core of an extensible smart agent for complex decision-making problems in future mobile networks.

Figure 2
Figure 2 illustrates the massive MIMO resource allocation problem in the DDPG structure. During the RL process, the control agent collects state information to determine resource allocation actions. The information includes the sets of UE channel quality ĈQI, UE data requests D, and traffic types TYPE. Thus, the state at the t-th TTI is defined as

Algorithm 1 The DDPG Training with Action Embedding
1: Randomly initialize critic network $Q$ and actor network $\mu$ in the DDPG agent
2: Initialize target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^{Q}$, $\theta^{\mu'} \leftarrow \theta^{\mu}$
3: Initialize replay buffer
4: for episode = 1 to end do
5:   Initialize a random process $\mathcal{N}$ for action exploration
6:   Receive initial observation state $s_1$
7:   for t = 1, T do
8:     Generate continuous action $\check{a}_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}_t$ from actor in DDPG
9:

2 (More UEs with high data rate)

TABLE I
SUMMARY OF NOTATIONS
Notation                       Description
$t$                            Index of the $t$-th TTI
$\mathcal{M}$ / $M$            BS antenna set / total number of antennas
$\mathcal{K}$ / $K$ / $K_r$    UE set / total number of UEs / simultaneously receiving UEs
$\hat{O}$ / $O_t$              Ordered UE set / the ordered UE set at $t$
$N$ / $N_t$ / $N_{k,t}$        Antenna allocation set / the antenna allocation set at $t$ / number of antennas allocated to UE $k$ at $t$
$P$ / $P_t$ / $p_{k,t}$        Precoding matrix set / the precoding matrix at $t$ / $k$-th column vector of $P_t$
$H$ / $h_{k,t}$                Channel matrix / channel vector of UE $k$ at $t$
$\Phi_{k,t}$                   Data transmission rate of UE $k$ at $t$
$U_k$ / $U_{k,t}$              Overall utility of UE $k$ / utility of UE $k$ up to $t$
$\nu_{k,t}$                    Number of successfully received packets by UE $k$ at $t$
$E_k$                          Packet error rate requirement of UE $k$
$D_k$                          Set of requested data packets of UE $k$
$t(d)$                         TTIs assigned to transmit data packet $d$ in time

the additive white Gaussian noise (AWGN) with variance $\sigma^2$, $n = [n_1, n_2, \ldots, n_{K_r}]^T$. Processed by a power amplifier, the transmitted signal vector $x \in \mathbb{C}^{M \times 1}$ is transmitted through the antennas and is given as
$P = [p_1, p_2, \ldots, p_{K_r}] \in \mathbb{C}^{M \times K_r}$, with column vectors $p_k \in \mathbb{C}^{M \times 1}$, is the set of hybrid precoding matrices. $\chi = [\chi_1, \chi_2, \ldots, \chi_{K_r}]^T \in \mathbb{C}^{K_r \times 1}$ is the modulated user signals with $E[\chi\chi^H] = (\rho / K_r) I_{K_r}$

3,t ]
10: Execute action $a_t$ and observe reward $r_t$ and new state $s_{t+1}$
11: Store transition $(s_t, \check{a}_t, r_t, s_{t+1})$ in replay buffer
12:
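The transition storage in step 11 of Algorithm 1 can be sketched with a fixed-capacity replay buffer. This is a generic illustration under stated assumptions rather than the implementation used in this work: the class and field names are hypothetical, the capacity is arbitrary, and the uniform minibatch sampling follows common DDPG practice, since the remaining steps of the algorithm are not shown here.

```python
import random
from collections import deque, namedtuple

# One stored transition; the continuous action from the actor is stored,
# matching the tuple (s_t, a-check_t, r_t, s_{t+1}) in step 11.
Transition = namedtuple(
    "Transition", ["state", "proto_action", "reward", "next_state"]
)


class ReplayBuffer:
    """Fixed-capacity buffer that discards the oldest transition when full."""

    def __init__(self, capacity, seed=None):
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def store(self, state, proto_action, reward, next_state):
        self._buffer.append(Transition(state, proto_action, reward, next_state))

    def sample(self, batch_size):
        """Uniformly sample up to batch_size transitions without replacement."""
        k = min(batch_size, len(self._buffer))
        return self._rng.sample(list(self._buffer), k)

    def __len__(self):
        return len(self._buffer)


if __name__ == "__main__":
    buffer = ReplayBuffer(capacity=2, seed=0)
    for t in range(3):
        buffer.store(state=t, proto_action=[0.1 * t], reward=1.0, next_state=t + 1)
    print(len(buffer), buffer.sample(5))
```

Storing the continuous action rather than the discrete algorithm combination keeps the stored data in the space in which the actor and critic operate.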