Learn to Schedule (LEASCH): A Deep Reinforcement Learning Approach for Radio Resource Scheduling in the 5G MAC Layer

Network management tools are usually inherited from one generation to another. This has been successful because these tools have been kept in check and updated regularly to fit new networking goals and service requirements. Unfortunately, new networking services will render this approach obsolete, and handcrafting new tools or upgrading the current ones may lead to complicated systems that are difficult to maintain and improve. Fortunately, recent advances in AI provide promising new tools that can help solve many network management problems. Following this trend, the current article presents LEASCH, a deep reinforcement learning model able to solve the radio resource scheduling problem in the MAC layer of 5G networks. LEASCH is developed and trained in a sand-box and then deployed in a 5G network. It has been evaluated under different numerology settings. The experimental results show that it is both numerology-agnostic and efficient compared to conventional baseline methods on many key performance indicators.


I. INTRODUCTION
The rapid evolution of networking applications will continue to bring new challenges to communication technologies. In the fourth generation (4G), also known as long term evolution (LTE), throughput and delay were the main foci. In 5G and beyond, services have reached completely new levels. This new era of communication is characterized by new killer applications that will benefit from emergent technologies like the Internet of things (IoT) and next-generation media such as virtual reality (VR) and augmented reality (AR), to name a few.
Unlike LTE, 5G is a use-case-driven technology. In addition, 5G is not only machine-centric but also user-centric, where the user notion has evolved to cover a wider range of entities beyond the traditional human-on-handset notion. Small devices that use the 5G infrastructure are, essentially, users [21].
The main use cases supported by 5G, for now, are enhanced mobile broadband (eMBB), ultra-reliable and low-latency communications (URLLC), and massive machine-type communications (mMTC). eMBB supports high-capacity and high-mobility (up to 500 km/h) radio access with 4 ms user-plane latency. URLLC provides urgent and reliable data exchange with sub-1 ms user-plane latency. The new radio (NR) of 5G will also support a massive number of small-packet transmissions for mMTC with sub-10 ms latency.
The main key enablers to handle the requirements of this new era include flexible numerology, bandwidth parts (BWPs), mini-slotting, an optimized frame structure, massive MIMO, inter-networking between high and low bands, and ultra-lean transmission [1], [21]. In addition, emergent technologies like software-defined networking (SDN), network function virtualization (NFV), and network slicing will also be key in paving the way for enhancements in 5G and beyond. LTE and 5G both rely on the same multi-carrier modulation approach, OFDM. Nevertheless, the NR supports multi-numerology structures having different sub-carrier spacings (SCS) and symbol durations. The NR frame structure is more flexible, which, on one hand, makes it possible to deliver data for all three main use cases but, on the other hand, makes it difficult to manage resources efficiently. In addition, it is expected that more use cases will emerge, and more flexibility is foreseen to be added to the NR frame in the future, making the resource management task even more complicated. For instance, current specifications of the physical layer support only four BWPs for each user, with only one BWP active at a time. However, UEs in the future will be able to use multiple BWPs simultaneously [17].
The service requirements [5], the significant diversity of traffic characteristics [12], and the stringent user requirements make 5G a complex system that cannot be completely managed by tools inherited from ancestor networks [5]. Therefore, industry and academia are looking for novel solutions that can adapt to this rapid growth. One of the main paths is to rely on new AI advancements to solve network management problems in 5G.
The current article focuses on a fundamental problem in 5G: the radio resource management (RRM) problem. In general, RRM can be seen as a large problem with many tasks. This article specifically studies the radio resource scheduling (RRS) task in the media access control (MAC) layer. This work shares the view of many scholars about the necessity of developing AI-based solutions for network management tasks [11]. An important AI tool gaining a lot of attention is deep reinforcement learning (DRL). This trend has recently become known as the learn-rather-than-design (LRTD) approach.
The main contributions of this work are:
• A numerology-agnostic DRL model;
• A clear pipeline for developing/training DRL agents and for their deployment in network simulators;
• A comparative analysis in several network settings between the proposed model and the baseline algorithms;
• A reward analysis of the model.
This article is organized as follows: the RRS problem is described in the remainder of this section. Section II presents a brief description of DRL theory. The related work is presented in Section III. Sections IV and V present the proposed approach and the results, respectively. The article is concluded in Section VI.

A. Radio resource scheduling problem
The continuous update of physical layers to handle new communication use cases is the main force behind the development of flexible MAC layers or components thereof.
As new use cases emerge, handcrafted MAC layers become more complicated and error-prone. This is, in fact, the main problem in modern network and resource management [11], [18]. Human-centered approaches lack flexibility and usually require continuous repairs and updates, which degrade the level of service and compromise performance [26].
Improving the ability of communication systems to effectively share the available scarce spectrum among multiple users has always been a main research target of academia and the telecom industry. As more user requirements are added to the system, the need to find better resource-sharing approaches becomes inevitable. Therefore, RRS is an essential task in communications. The main objective of the RRS task is to dynamically allocate spectrum resources to UEs while achieving an optimal trade-off between key performance indicators (KPIs), like spectrum efficiency (SE), fairness, delay, and so on [7]. Achieving such a trade-off is known to be a combinatorial problem.
As described in Figure 1, RRS simply works as follows: the scheduler runs at the gNB at every (or kth) slot to grant the available resource block groups (RBGs) to the active UEs.
Therefore, the problem boils down to filling the resource grid by deciding which UE wins the current RBG in the current slot. However, not all users can be scheduled at the current RBGs. Only those that are eligible (active) are considered for scheduling and, therefore, allowed to compete for the RBGs under consideration. A UE is eligible if it has data in the buffer and is not retransmitting in the current slot, i.e., if it is not associated with a HARQ process in progress.
In many cases, obtaining an optimal solution to the RRS problem is computationally prohibitive due to the size of the state space and the partial observability of the states [25]. Moreover, driven by new requirements, the RRS task will continue to expand both horizontally and vertically: horizontally, regarding the number and diversity of users and traffic patterns it should support, and vertically, by having to consider new (and perhaps contradictory) KPIs.
Therefore, RRS can easily become intractable even for small-scale scenarios.
Current RRS solutions are driven by conventional off-the-shelf designated tools, including variants of proportional fairness (PF), round robin (RR), BestCQI, and so on. This scheme has been successful, but it will become obsolete in the future. In fact, this is the case for many current network management tasks [28]. A new RRS approach is inevitable because:
• the network size is increasing rapidly;
• the space of control decisions is broadening;
• business-makers and end-users have a new perception of networking services and applications;
• modern networks are delayed-return environments;
• conventional tools lack sufficient understanding of the underlying network, i.e., they are myopic.
In this context, we share the vision of [11] that research communities and industry have been focusing on developing services, protocols, and applications more than on developing efficient management techniques. Fortunately, recent advances in artificial intelligence (AI) offer promising solutions, and there is a consensus among many scholars on the need for AI-powered network management [35].

II. DEEP REINFORCEMENT LEARNING
RL is a learning scheme for sequential decision problems that maximizes a cumulative future reward. The environment of such a scheme can be modeled as a Markov decision process (MDP) represented by the tuple (S, A, P, r, γ), where S is a compact space of environment states, A is a finite set of possible actions (the action space), and P is a predefined transition probability matrix such that each element p_{ss'} determines the probability of a transition from state s to s'. The reward function r : S × A × S → R tells how much reward the agent gets when moving from state s to state s' after taking action a. The discount factor γ trades off the importance of immediate and long-term rewards.
In RL, an agent learns a policy π by interacting with the environment. In each time step t, the agent observes a state s_t, takes an action (decision) a_t, observes a new state s_{t+1}, and receives a reward signal r(s_t, a_t). The learning scheme can be episodic or non-episodic, and some states are terminal.
A policy π is a behavioral description of the agent. The policy for state s, π(s), can be defined as a probability distribution over the action space A, such that the policy for the pair (s, a), π(a|s), defines the probability assigned to a in state s. Therefore, a policy simply tells us which action to take at state s.
The objective of training an agent is to find an optimal policy that tells the agent which action to take in each state. Therefore, the objective of an agent boils down to maximizing the expected return in the long run. Starting from state s_t, the outcome (return) can be expressed as:

G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ··· = Σ_{k=0}^{∞} γ^k r_{t+k+1}.    (1)

For a non-episodic learning scheme, we can see that γ < 1 is important not only for obtaining a trade-off between immediate and long-term rewards but also for mathematical convenience, since the sum then converges.
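As a minimal illustration, the discounted return above can be computed by accumulating rewards backwards over a finite trace; the reward sequence below is hypothetical:

```python
# Discounted return G_t = sum_k gamma^k * r_{t+k+1} for a finite reward trace.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):  # accumulate from the last reward backwards
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```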
When an agent arrives at a state s it needs to know how good it is to be at s and follow the optimal policy afterwards. The function that measures this is called the value, aka state-value, function V(s):

V^π(s) = E_π[ G_t | s_t = s ].    (2)

Similarly, to measure how good it is to be at state s and take action a, a quality function Q, aka action-value function, can also be derived as:

Q^π(s, a) = E_π[ G_t | s_t = s, a_t = a ].    (3)

Once we know Q and π we can calculate V using:

V^π(s) = Σ_{a ∈ A} π(a|s) Q^π(s, a).    (4)

Therefore, V and Q can be related as:

Q^π(s, a) = E_{s'}[ r(s, a, s') + γ V^π(s') ].    (5)

In addition, these two functions can also be related via an advantage function A [32]:

A^π(s, a) = Q^π(s, a) − V^π(s),    (6)

where A subtracts the value function from the quality function to obtain the relative importance of each action, telling the agent whether choosing action a is better than the average performance of the policy.
In fact, we are interested in finding Q, since we can easily derive the optimal policy π* from the optimal Q*. Q(s, a) maps each (s, a) pair to a value, i.e., it measures how good it is to take action a in state s and then follow the optimal policy. Using the Bellman expectation equation, we can rewrite Q(s, a) as:

Q^π(s, a) = E_{s'}[ r(s, a, s') + γ E_{a' ∼ π}[ Q^π(s', a') ] ].    (7)

Therefore, following the Bellman optimality equation for Q* we have:

Q*(s, a) = E_{s'}[ r(s, a, s') + γ max_{a'} Q*(s', a') ].    (8)

The optimal policy can then be derived from the optimal values Q*(s, a) by choosing the action with the maximum value in each state. This scheme is known as value-based (as opposed to policy-based) learning, since the policy is driven by the value function:

π*(s) = argmax_a Q*(s, a).    (9)

However, finding π* is not easy since, in many real-world applications, the transition probability is not known. One algorithm to solve this Bellman optimality equation is the Q-learning algorithm [33]. This algorithm is off-policy and critic-only (as opposed to on-policy and actor-critic algorithms). In this algorithm, Q is represented as a lookup table, which can be initialized with random guesses and updated in each iteration using the Bellman equation:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ].    (10)

For a terminal state this update comes down to:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} − Q(s_t, a_t) ].    (11)

In order to balance exploration and exploitation, the agent in Q-learning adopts an ε-greedy algorithm. In ε-greedy, the agent selects an action using

a = argmax_{a'} Q(s, a')    (12)

with probability 1 − ε, and otherwise selects a random action with probability ε. This randomness in decision making helps the agent avoid local minima. As the agent progresses in learning, it reduces ε via a decaying threshold δ_ε. With this annealing property of ε-greedy, in practice, an agent is expected to behave almost randomly in the beginning and mature with time.
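To make the tabular update rule concrete, the following is a small self-contained sketch of Q-learning with ε-greedy exploration on a toy chain MDP. The environment (a 4-state chain where moving right eventually reaches a rewarding terminal state) is a hypothetical example, not part of the paper:

```python
import random

# Tabular Q-learning with epsilon-greedy exploration on a toy 4-state chain
# MDP (hypothetical example): action 1 moves right, action 0 moves left;
# reaching state 3 gives reward 1 and ends the episode.
def step(s, a):
    s2 = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0), s2 == 3

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    random.seed(seed)
    Q = [[0.0, 0.0] for _ in range(4)]            # lookup table, random/zero init
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy action selection
            a = random.randrange(2) if random.random() < eps \
                else max((0, 1), key=lambda x: Q[s][x])
            s2, r, done = step(s, a)
            target = r if done else r + gamma * max(Q[s2])  # Bellman target
            Q[s][a] += alpha * (target - Q[s][a])           # TD update
            s = s2
    return Q

Q = q_learning()
print([max((0, 1), key=lambda a: Q[s][a]) for s in range(3)])  # greedy policy per state
```

With these settings the learned greedy policy converges to always moving right, i.e., the shortest path to the rewarding terminal state.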
One drawback of the original Q-learning algorithm is scalability. Keeping a lookup table for such an iterative update is feasible only for small problems. For larger problems, it is infeasible to keep track of every (s, a) pair. Therefore, in practice, it is more feasible to approximate Q.
A common way to approximate Q is to use a deep neural network (DNN). This cross-breeding between deep learning and Q-learning has yielded deep Q networks (DQN), more generally known as deep reinforcement learning (DRL). DRL is the main breakthrough behind recent advancements in RL that delivered human-level performance in Atari games [24] and even more strategic games [30], where the agent learns directly from a sequence of image frames via convolutional neural networks (CNNs).
In DQN, the Q function is approximated by minimizing the squared error between the Bellman target and the neural network estimate, aka the mean-squared Bellman error (MSBE):

L(θ) = E[ ( Q_target − Q(s_t, a_t; θ) )² ],    (13)

where Q_target is the target Q value, known as the target critic, and θ is the set of DNN parameters. Q_target is calculated as:

Q_target = r_{t+1} + γ max_a Q(s_{t+1}, a; θ).    (14)

The set of DNN weights θ is updated in a stochastic gradient descent (SGD) fashion. For a predefined learning rate α and a mini-batch size M, θ_t is updated using:

θ_{t+1} = θ_t + α (1/M) Σ_{i=1}^{M} ( Q_target^{(i)} − Q(s_i, a_i; θ_t) ) ∇_θ Q(s_i, a_i; θ_t),    (15)

where Q_target − Q(s_t, a_t; θ) is known as the temporal difference (TD) error.
In one DQN formulation, the state and the action are represented by two separate networks and combined via an Add layer; the output is a single Q value, in a way similar to classical regression. However, a more efficient architecture takes the state as input and sets the network output size equal to the length of the action space. This way, each output represents the value of one action given the state. As in classical Q-learning, the action with the maximum value is selected.
In order to stabilize the results, and to break any dependency between sequential states, DQN uses two tricks. First, two identical neural networks are used: one for on-line learning and another to calculate the target Q_target. The target network is updated from the on-line network periodically, every T steps. Therefore, the target is calculated from a more mature network, Q_target = r_{t+1} + γ max_a Q(s_{t+1}, a; θ⁻), where θ⁻ is a delayed version of θ, thus increasing learning stability. Instead of copying the weights from the on-line to the target network every T steps, it turns out that a smoothing (i.e., progressive) update approach can noticeably increase the learning stability:

θ⁻ ← β θ + (1 − β) θ⁻,    (16)

where β is a small real-valued smoothing parameter.
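The two target-update styles can be sketched in a few lines; weights are represented as plain lists purely for illustration:

```python
# Soft (smoothing) update of target-network weights:
# theta_target <- beta * theta_online + (1 - beta) * theta_target.
def soft_update(theta_online, theta_target, beta=0.005):
    return [beta * w + (1.0 - beta) * wt
            for w, wt in zip(theta_online, theta_target)]

# Hard (periodic copy) update every T steps, for comparison:
def hard_update(theta_online):
    return list(theta_online)

print(soft_update([1.0, 2.0], [0.0, 0.0], beta=0.5))  # [0.5, 1.0]
```

With a small β the target network drifts slowly toward the on-line network, which is the stabilizing effect described above.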
The second trick is to use an experience replay memory R, usually implemented as a cyclic queue. This memory is updated in every learning step by appending the tuple (s_t, a_t, s_{t+1}, r_{t+1}) to the end of the queue. When training Q, random mini-batches are sampled from R and fed to the on-line Q network. R reduces the dependency between consecutive inputs and improves data efficiency by reutilizing experience samples.
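A minimal sketch of such a cyclic replay memory, together with the computation of DQN targets for a sampled batch, could look as follows; the Q function is mocked as a dictionary purely for illustration:

```python
import random
from collections import deque

# Cyclic experience replay plus DQN-style targets for a sampled mini-batch.
class ReplayMemory:
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)  # old samples are evicted automatically

    def push(self, s, a, s_next, r, done):
        self.buf.append((s, a, s_next, r, done))

    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)

def dqn_targets(batch, q_target, actions, gamma=0.9):
    # y = r                            for terminal transitions
    # y = r + gamma * max_a' Q(s', a') otherwise
    return [r if done else r + gamma * max(q_target[(s2, a)] for a in actions)
            for (s, a, s2, r, done) in batch]

random.seed(0)
mem = ReplayMemory(capacity=100)
mem.push("s0", 0, "s1", 0.0, False)
mem.push("s1", 1, "s2", 1.0, True)
# Mocked target-critic values (a DQN would use the target network here).
q = {("s1", 0): 0.2, ("s1", 1): 0.8, ("s2", 0): 0.0, ("s2", 1): 0.0}
print(dqn_targets(mem.sample(2), q, actions=(0, 1)))
```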
Q-learning and its variant DQN tend to be over-optimistic due to the noise in the Q estimates and the use of the max operator both to select the action and to calculate the value of that action. A solution is the Double DQN (DDQN) [14], [29] model, which learns from two different estimates: one is used to select the action and the other to calculate the action value. Therefore, in DDQN, the critic target Q_target is calculated as:

Q_target = r_{t+1} + γ Q(s_{t+1}, argmax_a Q(s_{t+1}, a; θ); θ⁻).    (17)

In expression (17), the selection of the action is made by the on-line network, i.e., argmax_a Q(s_{t+1}, a; θ), while the evaluation and update are made by the target critic network.

A. Why is DRL suitable for the RRM problem?
The notion of self-driving networks [11] is gaining more and more attention nowadays.The core vision of self-driving network engineering is to learn rather than to design the network management solutions [19].Such vision has radically changed some fields like computer vision via deep learning [20], by learning features rather than hand-crafting them.However, we did not witness such major progress in network management.
The reason is that supervised learning is not suitable for some control and decision problems, since collecting and labeling networking data is non-trivial and expensive, and network states are non-stationary [6], [9]. DRL can be quite suitable for such problems for the following reasons:
• All information about RRM can be centralized in the gNB, thus creating a network-wide (although not complete) view of the network. In addition, new paradigms like knowledge-defined networking (KDN) can be used [23];
• DRL agents can continue learning and improving while the network operates. They can interact with other conventional components in the system and learn from them if necessary [31];
• Network dynamics are difficult to anticipate, and exact mathematical models are not scalable. For 5G network management, it is extremely difficult to model the network state and traffic due to the diversity of the applications and traffic it supports [12]. Therefore, model-free DRL can be the choice in this case;
• After its recent breakthroughs, DRL has become an extremely hot research topic [16]. In networking, the popularity of DRL is increasing, and some popular network simulators have recently been extended to support general DRL environments like Gym [13].

III. RELATED WORK
DRL solutions for RRS are scarce in the literature, but they can be divided, according to the nature of the action space, into two main categories: coarse (high-level) and fine-grained (low-level) decisions. In the former, the DRL agent acts as a method/algorithm selector [10], [28] or protocol designer [26], [18]. For instance, for a given network state, the DRL agent selects which conventional algorithm should perform scheduling. In the latter, DRL decisions are hardwired in the networking fabric. The DRL agent makes fine-grained decisions like filling the resource grid [31], sharing air-time between users and deciding which user has the right to access the channel [25], [8], or selecting which coding scheme is suitable [36]. In addition, the fine-grained methods can be classified into distributed [25], [34] and centralized [8], [31]. In the distributed approaches, each UE acts as a DRL agent. This way the network is composed of multiple agents, in a way similar to those in game theory. Such an approach is scalable, but sharing the network state among multiple entities makes it difficult to guarantee convergence. On the other hand, centralized approaches can benefit from greater computational power and a better understanding of the network state.
Both coarse and fine-grained approaches have pros and cons. The coarse-level scheme is more scalable, since the agent acts in an almost constant action space. On the other hand, such an approach falls short of obtaining deep control of the network; conventional algorithms are still the main workhorses. In fine-grained approaches, the DRL agent deals with the finest decisions and can therefore obtain deep control of the network. However, these approaches require more sophisticated designs to adapt to networking dynamics.
Our work belongs to the fine-grained centralized approaches.

A. Coarse approaches
An algorithm-selector approach can be found in [10]. At each slot, an actor-critic agent chooses a scheduling algorithm, among a set of available PF-variant algorithms, to maximize some QoS objectives. The state comprises the number of active users, the arrival rate, the CQIs, and the performance indicators with respect to the user requirements. The reward function measures the impact of choosing a rule on the QoS satisfaction of the users. A similar approach can be found in [28] for 5G networks, but using a variant of actor-critic DRL known as the deep deterministic policy gradient (DDPG) algorithm and a larger action space that controls more parameters. However, this approach is not numerology-agnostic: in [28], for instance, a distinct DRL design is required for each network setting.
In [26], AlphaMac is proposed: a MAC designer framework that selects the basic building blocks to create a MAC protocol using a constructive design scheme. A building block is included in the protocol if its corresponding element in the state is 1, and excluded otherwise. As an action, the agent chooses the next state that will increase the reward (the average throughput of the channel). Each selection by the agent is then simulated in an event-driven simulator that mimics the MAC protocol but is flexible enough to allow adding and removing individual blocks of the protocol.
A physical-layer self-driving radio is proposed in [19]. The user specifies the control knobs and other requirements, and the system learns an algorithm that fits a predefined objective (reward) function. The action space consists of the control knobs and their possible settings. The system then holds a set of DNNs and applies the appropriate one to the input scenario. In fact, this work can be regarded as hybrid, since it combines both coarse and fine-grained approaches in a hierarchical design.

B. Fine-grained approaches
A general resource management problem is handled in [22] by a policy-gradient DRL agent. The objective is to schedule a set of jobs on a resource cluster at a given time step. This work demonstrated the suitability of DRL agents on the one hand but, on the other hand, it cannot be applied directly to 5G RRS problems.
An RRS agent for LTE networks can be found in [31]. A single RBG is considered, and the authors have shown that a DRL agent, trained by the DDPG algorithm, can achieve near-PF results when it uses the PF algorithm as an expert (guide) to learn from. This approach ensures great stability, since the agent learns from a well-established algorithm, but it diminishes the ability of agents to discover their own policies.
In [25], a lightweight multi-user deep RL approach is used to address the spectrum access problem, where a recurrent Q network (RQN) [15] with a dueling architecture [32] is employed. In [27], the duty cycle multiple access mechanism is used to divide the time frame between LTE and Wi-Fi users.
A DRL approach is then used to find the splitting point based on feedback averaged from the channel status over several previous frames. Information like idle slots, the number of successful transmissions, the action, and the reward is used to represent the state of the agent. The action is a splitting point in the time frame (i.e., an integer), and the reward is the transmission time given to the LTE users while not violating the Wi-Fi users' minimum data rate limit.
In [34], a DQN model is developed to learn how to grant access between a DRL agent and different infrastructures.
The agent learns by interacting with users that use other protocols, like TDMA and ALOHA, and learns to send its data in the slots where the other users are idle.

IV. THE PROPOSED DRL SCHEDULER (LEASCH)
A general sketch of the proposed scheduler is shown in Figure 1. 1) State: a) Eligibility: A UE is eligible for scheduling at a given RBG if the UE has data in the buffer and is not associated with a HARQ process. However, instead of feeding the buffer and HARQ status of each UE to LEASCH and asking the agent to learn "eligibility", we simplify the task for the agent by calculating a binary vector g to act as an eligibility indicator:

g_u = 1 if UE u has data in its buffer and no HARQ process in progress, and g_u = 0 otherwise.    (18)

As we will see, g helps us design a tangible reward function that allows the agent to effectively learn how to avoid scheduling inactive UEs.
b) Data rate: One way to represent this piece of information in the agent state is to use the data/bit rate directly.
However, we use the valid entries of the modulation and coding scheme (MCS) Table 5.1.3.1-2 in the 5G physical layer specification TS 38.214 [4] to model this information. We denote this information vector by d.
c) Fairness: We keep track of each time a UE is admitted to an RBG. To that end, a vector with all-zero elements f = 0 is created at the beginning of each episode, and each time an RBG is scheduled, f is updated by incrementing the entry of the winning UE:

f_u ← f_u + 1, for the UE u granted the RBG.    (19)

Therefore, f represents the allocation log of the resources.
In the best-case scenario, all entries of f are the same, meaning that all UEs are admitted to the resources with the same probability. In addition, f also represents delay, because if a UE has not accessed the resources for a long time, its corresponding value in f will be large.
Combining these three vectors g, d and f yields the state.
The size of the state can be further reduced by joining g and d via the Hadamard product:

x = g ⊙ d,    (20)

making the final state vector:

s = [x, f].    (21)

This way our state represents all pieces of information in a compact but descriptive manner. For better learning stability we normalize d and f to the range [0, 1].
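A sketch of this state construction, with illustrative (hypothetical) input values, could look as follows:

```python
import numpy as np

# Sketch of LEASCH's state construction for N UEs (illustrative values):
# g - eligibility (1 if the UE has buffered data and no ongoing HARQ process),
# d - MCS-based rate indicator, f - allocation log used as a fairness signal.
def build_state(has_data, in_harq, mcs, alloc_log):
    g = (np.asarray(has_data) & ~np.asarray(in_harq)).astype(float)
    d = np.asarray(mcs, dtype=float)
    f = np.asarray(alloc_log, dtype=float)
    # Normalize d and f to [0, 1] for learning stability (guard all-zero vectors).
    d = d / d.max() if d.max() > 0 else d
    f = f / f.max() if f.max() > 0 else f
    x = g * d                      # Hadamard product joins eligibility and rate
    return np.concatenate([x, f])  # compact state vector of length 2N

s = build_state(has_data=[1, 1, 0], in_harq=[0, 1, 0],
                mcs=[10, 20, 5], alloc_log=[4, 1, 0])
print(s)  # UE1 is retransmitting and UE2 has no data, so their x-entries are 0
```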

2) Action:
The action is to select one of the UEs in the system. It is encoded using one-hot encoding.
3) Reward: Reward engineering is a key problem in RL. The reward is generally treated like an objective function to be maximized. However, we believe it should be engineered as a signal such that each state-action pair yields a meaningful reward.
From our state design, the goal is to encourage the agent to transmit at the RBGs with the highest MCS, i.e., the highest bits per symbol, to increase the throughput of the system. At the same time, we would like the agent not to compromise the resource sharing between the users. Therefore, the adopted reward is given by:

r = d_u / f_u if the scheduled UE u is eligible (g_u = 1), and r = −K otherwise,    (22)

where K is a threshold representing the negative penalization signal for scheduling an inactive UE, and f is updated using (19). We can easily see that our reward is a variant of a discounted bestCQI function, where the data rate is discounted by the resource-sharing fairness.
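A hedged sketch of such a reward signal is given below; the exact functional form (including the division guard) is an assumption reconstructed from the description above, not the paper's verbatim formula:

```python
# Sketch of a discounted-bestCQI reward: the normalized rate d[u] of the
# scheduled UE is discounted by its allocation log f[u], and a fixed
# penalty -K is returned when an ineligible UE is scheduled.
# NOTE: the exact expression is an assumption reconstructed from the text.
def reward(u, g, d, f, K=1.0):
    if g[u] == 0:                 # inactive UE scheduled -> negative penalization
        return -K
    return d[u] / max(f[u], 1.0)  # rate discounted by how often u was served
                                  # (flooring f at 1 avoids division by zero)

g = [1, 1, 0]          # eligibility indicators
d = [0.5, 1.0, 0.25]   # normalized rate indicators
f = [4.0, 0.0, 0.0]    # allocation log
print(reward(1, g, d, f))  # 1.0: high rate, never served before
print(reward(2, g, d, f))  # -1.0: ineligible UE penalized
```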

B. LEASCH training and deployment
LEASCH is trained for a sequence of episodes. The training procedure of one episode is described in Algorithm 1. According to LEASCH's decision, the simulator allocates the resources and records statistics.

V. RESULTS
In order to evaluate the proposed scheduler, a comparison with two baseline algorithms, proportional fairness (PF) and round robin (RR), is performed. These are widely used algorithms in the literature and in practice. The main objective here is to assess LEASCH under different settings in order to: i) show its ability to solve the RRS problem; ii) understand which policy it was able to learn; and iii) analyze the quality of its design.
Algorithm 1 (excerpt):
8: calculate the new state s' using the state-construction equations
9: add the tuple (s, u, r, s') to the experience replay R
10: sample M mini-batches from R and train the on-line Q neural network (with weights θ) using (14) and (17)
11: update the target critic Q neural network (with weights θ⁻) every T steps via the smoothing update (16)
12: s ← s'
13: end for

Algorithm 2 (excerpt): calculate the action u from the on-line network; if u ∈ Û, assign the current RBG to u.

The collected results were analyzed from different perspectives in order to accomplish these goals.

A. Experimental setup
The parameters adopted for LEASCH and the 5G simulator are depicted in Tables I and II. All methods and algorithms presented/discussed here are implemented in Matlab 2019b on a PC running Linux with an i7 2.6 GHz CPU, 32 GB RAM, and an Nvidia RTX 2080Ti GPU with 11 GB of memory.
In the training phase, LEASCH is trained in a pool of parallel threads on the GPU. As shown in Figure 2, LEASCH was able to converge in fewer than 300 episodes. The theoretical (long-term) reward, the green line in the graph, also shows a steady increase, indicating stable learning by LEASCH across episodes. In addition, the average reward (averaged every 5 episodes) reveals a stable experience for the agent. These results clearly show that LEASCH achieves competitive and consistent performance compared to the baselines.
LEASCH shows improvement in all measurements, which is not an easy task given that LEASCH has a simple design and was trained off-simulator. In addition, when choosing a setting with a higher theoretical throughput (e.g., 10 MHz with 30 kHz SCS instead of 5 MHz with 15 kHz SCS), LEASCH was able to scale well and improve the performance even further. One nice property of LEASCH is that it is able to push all the KPIs without compromising any of them. More specifically, LEASCH was able to improve the throughput without compromising the goodput. This is why the goodput is enhanced even more than the throughput in all tests compared to the baselines.

C. Which policy did LEASCH learn?
This section tries to determine which policy LEASCH learned. This task is not trivial, not only for LEASCH but for almost any DRL agent. Here it is more difficult, not only because of the stochastic nature of LEASCH but also due to the complexity of the RRS problem. Therefore, a visual inspection of LEASCH's behavior is followed.
To that end, a testing run is sampled for a set of settings, and the throughput and goodput curves are quantized into 10 time units (see Figure 6). These curves are then visually inspected against those of PF and RR. By comparing these curves, both for each UE and for the cell, it is possible to form an idea of which policy LEASCH has learned. In this figure, the second set of settings, with 10 MHz BW and 30 kHz SCS, is chosen. Since 30 kHz SCS is used, the simulation time is only 1250 ms; for 15 kHz it would be 2500 ms. This is due to the reduction in symbol duration as the numerology index increases.
According to the simulation settings in Table II, the channel changes (and consequently new CQI feedbacks are signaled) every 125 ms, i.e., every time unit in Figure 6. In this figure, it is first interesting to see that the trends of the cell curves are similar for all approaches. However, at each time unit, each method makes different scheduling decisions.
Second, LEASCH outperforms PF and RR, since it reaches higher throughput and goodput, especially in the period from 5 to 8 time units, where major changes occurred in the channel. A further question concerns the quality of LEASCH's design. Therefore, a reward analysis is performed by separating the outcomes of the two reward goals.
To that end, the learning curve in Figure 2 is decomposed into two curves, as shown in Figure 7. In addition, instead of calculating the average total reward of the episode (as in Figure 2), the average reward of each episode is used. This allows us to study how LEASCH learns both parts of the reward expression.

arXiv:2003.11003v1 [cs.NI] 24 Mar 2020
Our approach is novel compared to existing AI-based approaches, which are still scarce. First, this work proposes an off-simulator training scheme, which maximizes the flexibility of training the agents and minimizes the training time. Second, our model is tested in an environment different from the one it was trained in. From a generalization point of view, we think this should be the case for DRL agents; that is, the training and deployment tasks should be separated. Third, the designed model is new, with a novel state and reward. Finally, our work is tested on a 5G system-level simulator that uses all recent components and configurations of a 5G network. To the best of our knowledge, all these components have not been jointly addressed in any previous work.
In a related work, high volume flexible time (HVFT) traffic driven by IoT is scheduled on the radio network via a variant of the DDPG algorithm, where the scheduler determines the fraction of IoT traffic on top of conventional traffic. To empower the agent with a notion of time, a temporal feature extractor is used, and these features are then fed to the agent. The reward function is a linear combination of several KPIs, such as the IoT traffic served, the traffic loss due to the introduction of IoT traffic, and the amount of served bytes below a system-wide desired limit. In [8], a policy-gradient DRL is proposed to manage resource access between LTE-LAA small base stations (SBSs) and Wi-Fi access points (APs). The goal is to determine the channel selection, carrier aggregation, and fractional spectrum access for SBSs while considering airtime fairness between SBSs and Wi-Fi APs. The state includes the states of all network nodes, and the reward is the total throughput over the selected channels. The scheduling problem is modeled as a non-cooperative Homo Equalis game, in which the achievement of a player is calculated by its performance while maintaining a certain fair equilibrium with respect to other players. To solve this model and establish a mixed strategy, a deep learning approach is developed, where LSTM and MLP networks are used to encode the input data (from the IBM Watson Wi-Fi data set) and the objective function of the model is solved via a REINFORCE-like algorithm. The work has shown improvement in throughput compared to reactive RL when increasing the time-horizon parameter. In addition, when compared to classical scheduling approaches such as PF, the work shows an enhancement in served network traffic, but at the same time the average airtime allocation for Wi-Fi APs degrades as the time-horizon parameter increases. One disadvantage of this work is its heavy-weight architecture.

Figure 1. Our work has two phases: training and testing. In the training phase, the scheduling task is transformed into an episodic DRL learning problem and LEASCH is trained until it converges. In the testing phase, a 5G system-level simulator is used to deploy LEASCH. These two phases are described in detail in the following subsections. Each component of LEASCH is described from a DRL perspective first, and then the training and deployment algorithms are presented. At the beginning of each episode, a random state is created. Then the agent is trained for a set of episode steps. In each step the agent trains its online Q neural network, and transfers the learned parameters to the target critic neural network every T steps. After an episode has finished, the experience replay memory R and the learned weights are transferred to the next episode, and so on. The state is reset at the beginning of each episode. Once the training phase has finished, LEASCH is deployed in a 5G simulator for testing. The deployment algorithm is shown in Algorithm 2. In this algorithm, the agent is plugged in like any other conventional scheduling algorithm. Each time an RBG is ready for scheduling, it is admitted to LEASCH, which first calculates the set of eligible UEs, Û, and creates a state s. Next, it decides which UE wins the RBG by performing a forward step on its neural network with weights θ and choosing the action with the highest probability. If the selected UE, u, belongs to Û, then LEASCH assigns the current RBG to u.
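The training and deployment loops described above can be sketched in miniature. Everything here (the toy table-based Q-network, the tiny state space, the reward shape) is an illustrative assumption; only the structure mirrors the described algorithms: a replay memory R, an online/target network pair synced every T steps, and deployment by argmax restricted to the eligible set Û:

```python
# Minimal sketch of the training/deployment scheme described above.
# The Q-network, environment, and state space are toy stand-ins: the
# point is the structure (replay memory R, online/target networks
# synced every T steps, argmax deployment over eligible UEs).
import random
from collections import deque

class ToyQNet:
    """Stand-in for the DQN: a lookup table over (state, action) pairs."""
    def __init__(self, n_actions):
        self.q, self.n_actions = {}, n_actions
    def values(self, s):
        return [self.q.get((s, a), 0.0) for a in range(self.n_actions)]
    def update(self, s, a, target, lr=0.1):
        old = self.q.get((s, a), 0.0)
        self.q[(s, a)] = old + lr * (target - old)
    def copy_from(self, other):
        self.q = dict(other.q)

def train(env_step, n_users, episodes=50, steps=100, T=20, gamma=0.9, eps=0.1):
    online, target = ToyQNet(n_users), ToyQNet(n_users)
    R = deque(maxlen=1000)                 # experience replay memory
    step_count = 0
    for _ in range(episodes):
        s = random.randrange(4)            # state reset at each episode start
        for _ in range(steps):
            # epsilon-greedy action: which UE wins the current RBG
            a = (random.randrange(n_users) if random.random() < eps
                 else max(range(n_users), key=lambda i: online.values(s)[i]))
            s2, r = env_step(s, a)
            R.append((s, a, r, s2))
            # train the online net on a sampled transition (double-DQN
            # target: online net picks the action, target net evaluates it)
            ss, aa, rr, ss2 = random.choice(R)
            best = max(range(n_users), key=lambda i: online.values(ss2)[i])
            online.update(ss, aa, rr + gamma * target.values(ss2)[best])
            step_count += 1
            if step_count % T == 0:        # sync target net every T steps
                target.copy_from(online)
            s = s2
    return online

def deploy(online, s, eligible):
    """Deployment: pick the UE with the highest Q-value; assign the RBG
    only if that UE belongs to the eligible set U-hat."""
    u = max(range(online.n_actions), key=lambda i: online.values(s)[i])
    return u if u in eligible else None
```

In the real system the table is replaced by the DNN described later, and the environment by the sand-box reward; the replay memory and target-sync mechanics carry over unchanged.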

3: initialize s randomly according to the ranges of d and f
4: for i = 1 : episode do
5:
, respectively. As for LEASCH's architecture, its Q neural networks are DNNs with two fully connected hidden layers of 128 neurons each and ReLU activation functions. The input layer size is 2 × |U|, while the output layer has size |U|.
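The stated shapes can be illustrated with a plain forward pass; the weights below are untrained random stand-ins, and stacking the d and f features into the 2 × |U| input vector is our assumption about the layout:

```python
# Forward pass matching the stated architecture: input of size 2*|U|
# (e.g., the d and f features per UE), two hidden layers of 128 ReLU
# units, and a linear output of size |U| (one Q-value per UE).
# Weights are random stand-ins, not trained parameters.
import numpy as np

def build_qnet(n_users, hidden=128, seed=0):
    """Random (weight, bias) pairs with the stated layer sizes."""
    rng = np.random.default_rng(seed)
    sizes = [2 * n_users, hidden, hidden, n_users]
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """ReLU on the hidden layers, linear output layer."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)   # ReLU
    return x

U = 4                                # |U| users
state = np.ones(2 * U)               # d and f features stacked
q_values = forward(build_qnet(U), state)
print(q_values.shape)                # (4,)
```

One Q-value per UE means the argmax over the output directly names the UE that wins the RBG, which is what keeps the deployment step trivial.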

Figure 6: A random testing run for the 10 MHz BW and 30 kHz SCS setting. Left column: throughput; right column: goodput.

From this figure, the red curve represents the probability of scheduling active-only UEs, while the blue curve is the throughput-fairness reward term. These curves show that LEASCH was able to jointly learn these two objectives and, around episode 300, to converge on both, which clearly indicates the effectiveness of LEASCH's design. In addition, this also shows the suitability of DRL for handling the scheduling problem, which is usually a multi-objective problem.

VI. CONCLUSIONS

This article has presented LEASCH, a deep reinforcement learning agent able to solve the radio resource scheduling problem in 5G. LEASCH is a breed of DDQN critic-only agents that learns discrete actions from a sequence of states. It does so by adapting the weights of its neural networks, known as DQNs, according to the reward signal it receives from the environment. What makes LEASCH different from conventional schedulers is that it is able to learn the scheduling task from scratch, with zero prior knowledge about RRS. LEASCH differs from the extremely scarce and recent AI-schedulers in several ways. First, LEASCH is trained off-simulator to break any dependency between the learning and deployment phases, making LEASCH a generic tool in any networking AI-ecosystem. Second, LEASCH has a novel design not addressed in earlier approaches. Finally, LEASCH was designed to be numerology-agnostic, which makes it suitable for 5G deployments. Concerning performance, LEASCH has been compared to the well-established PF and RR approaches. Despite its simple design, LEASCH has shown clear improvement and stability in the throughput, goodput, and fairness KPIs. Further analysis has also shown that LEASCH is able to learn not only to enhance the classical throughput-fairness tradeoff, but also not to schedule inactive users. It was able to learn both objectives at the same time, as the learning curves depict. Another interesting property of LEASCH is that it avoids penalizing users with bad CQIs and tries
to keep all KPIs high at the same time. Such a property can be improved in the future. In addition, more interesting properties, which cannot easily be obtained by conventional approaches, can be learned by LEASCH. As future work, a more advanced version of LEASCH, LEASCH version 2, will be developed to serve a larger set of users. LEASCH, like any other DQN agent, is currently not suitable for large-scale networks, since DQN agents are known to compromise performance as the action space increases. Therefore, in future work, LEASCH version 2 will be developed and deployed in a larger 5G network with a mixture of numerologies and different types of services.

VII. ACKNOWLEDGMENT

This work is supported by the European Regional Development Fund (FEDER), through the Competitiveness and Internationalization Operational Programme (COMPETE 2020), Regional Operational Program of the Algarve (2020), Fundação para a Ciência e a Tecnologia; i-Five: Extensão do acesso de espectro dinâmico para rádio 5G, POCI-01-0145-FEDER-030500. This work is also supported by Fundação para a Ciência e a Tecnologia, Portugal, within CEOT (Center for Electronic, Optoelectronic and Telecommunications) and the UID/MULTI/00631/2020 project.
1) Key performance indicators: Throughput, goodput, and fairness are the main key performance indicators (KPIs) used to evaluate the current work. For throughput, the sum of the achievable data rates in the cell is reported. For goodput, the