Model-Based Reinforcement Learning With Kernels for Resource Allocation in RAN Slices

Network slicing is a key feature of 5G and beyond networks, allowing the deployment of separate logical networks (network slices), sharing a common underlying physical infrastructure, and characterized by distinct descriptors and behaviors. The dynamic allocation of physical network resources among coexisting slices should address a challenging trade-off: to use resources efficiently while assigning each slice sufficient resources to meet its service level agreement (SLA). We consider the allocation of time-frequency resources from a new perspective: to design a control algorithm capable of learning over the operating network, while keeping the SLA violation rate under an acceptable level during the learning process. For this purpose, traditional model-free reinforcement learning (RL) methods present several drawbacks: low sample efficiency, extensive exploration of the policy space, and inability to discriminate between conflicting objectives, causing inefficient use of the resources and/or frequent SLA violations during the learning process. To overcome these limitations, we propose a model-based RL approach built upon a novel modeling strategy that comprises a kernel-based classifier and a self-assessment mechanism. In numerical experiments, our proposal, referred to as kernel-based RL, clearly outperforms state-of-the-art RL algorithms in terms of SLA fulfillment, resource efficiency, and computational overhead.


In this paradigm, the infrastructure provider has to assign the necessary spectrum, backhaul, and computational resources to each network slice to fulfill the service level agreement (SLA) established with the tenant of that slice. The SLA specifies a set of requirements on performance indicators, such as throughput or latency, that depend on the tenant's preferences and service type. The allocation of resources should guarantee that network slices are properly isolated from each other, but this allocation should also be resource-efficient and elastic under varying radio and network traffic conditions [1]. In the radio access network (RAN), the radio frequency (RF) resources of each base station should be distributed among the network slices of the user equipments (UEs) connected to that base station, based on the states of the slices and their SLAs. The state of a network slice is determined by its traffic descriptors, its current distribution of resources, and the radio channel conditions of its UEs, resulting in state observations that can potentially involve multiple variables. For example, in an eMBB slice, the observation could comprise the incoming data rate, the delivered data rate, the buffered data, and the resources occupied by each type of traffic, plus the channel quality indicators per user flow. In addition, the SLAs involve a diverse combination of requirements, which can be defined in terms of aggregated metrics. For instance, the SLA of an eMBB slice could set an average delay objective for guaranteed bit rate (GBR) flows whenever the resources occupied by this type of traffic are below a predefined level. See [2] for other examples of SLA configurations.
Determining the minimum share of RAN resources fulfilling a specific SLA in each observed state is a challenging task, and is further complicated by the interactions with the mechanisms operating within each slice on a per-flow basis, such as scheduling algorithms or adaptive modulation and coding schemes.
Our objective is to develop a control algorithm capable of learning how to allocate RAN resources in an efficient way, maximizing the amount of free resources available to the infrastructure provider, while guaranteeing the SLAs of the hosted tenants. A crucial feature of our proposal is its plug-and-play capability, enabling it to learn on an operating network (online learning), without any previous information about the system's response.
Reinforcement Learning (RL) has become the most widely used control technique for radio resource management [3] in general, and resource orchestration in network slices [4], [5] in particular. The main limitation of previous works is their use of a model-free RL (MFRL) approach, which can be very effective if the agents are trained offline, either with a simulator or with samples obtained from the real system, but is not especially suitable when the agents learn on a network in operation (online learning). MFRL algorithms generally require a large number of samples, which involves an extensive exploration of policies, including inefficient ones. In our scenario, this may lead to long training periods containing multiple episodes of SLA violations and/or excessive resource over-provisioning, which are detrimental to both the tenants and the infrastructure provider.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
We propose a novel approach to the problem of dynamic allocation of RAN resources among network slices, and make the following contributions:
• We present a new perspective that focuses on the importance of learning online (on the real system in operation), efficiently (with few samples), and safely (degrading as little as possible the QoS provided by the network slices).
• To this aim, we use a model-based RL (MBRL) approach, which overcomes many of the limitations of the usual MFRL approaches. MBRL allows for greater sample efficiency and gives us more control over the two objectives pursued: maximizing the efficiency in the use of resources and guaranteeing the SLA of the slices.
• We develop a novel modeling scheme comprising a nonlinear classifier and an estimator of the classifier's accuracy. The classifier is based on a kernel-based online learning scheme known as Projectron [6], which we enhance with a sample augmentation strategy that exploits the structure of our problem. As a result, we propose a novel mechanism referred to as kernel-based RL (KBRL) aimed at learning efficient resource allocation policies under SLA constraints.
• We compare our proposal with state-of-the-art RL algorithms by means of extensive simulations over diverse environments, providing one of the most complete empirical evaluations of RL in this application so far. Our results show that KBRL systematically outperforms the baselines in terms of resource usage, SLA fulfillment guarantees, and computational overhead.
The rest of this paper is organized as follows. Section II discusses related work. The system under study is described in Section III and the addressed problem is formulated in Section IV. We provide the necessary preliminaries about online learning with kernels in Section V, and then we present our proposal in Section VI. The numerical results are provided in Section VII and finally we summarize our conclusions in Section VIII.

II. RELATED WORK
Network slicing has received considerable attention from the research community and the industry during the last five years. Survey papers like [1], [7], [8] provide an extensive coverage of recent contributions on different aspects of this network functionality. These papers also identify open problems and research challenges in network slicing, including the one addressed in our paper: to dynamically scale up/down the RAN resources assigned to network slices so that SLA requirements are met while the RF spectrum is efficiently utilized [1], [7]. In line with our proposal, [8] highlights the importance of the computational overhead of these algorithms, arguing that small computing times allow a more frequent update of the resource allocation, thus improving the elasticity of the system.
Network slices make use of resources in all sections of the underlying network, including RF spectrum, fronthaul, backhaul, and computational resources for virtualized network functions. Resource management in network slicing is therefore a broad topic, where we find related works focused on different network elements such as computational resources [9]-[11], radio resources [12], [13], or a combination of radio and computational resources [14], [15]. Two main resource management mechanisms have been studied: admission control for on-demand slice deployment [9], [15]-[17], and dynamic resource scaling for active slices [11], [13], [18], [19]. Our proposal falls in the latter category.
Radio resource allocation comprises two conflicting objectives: 1) to maximize efficiency in the use of resources (spectrum efficiency, SE), and 2) to maximize the SLA satisfaction rate (SSR). However, conventional RL formulations are limited to a single objective function, and consequently these previous works need to aggregate both objectives into a single one by means of a weighted sum of both metrics (SE and SSR), or by multiplying them [31]. The problem with this approach is that it cannot establish a performance target on some objective, e.g., to guarantee that the SSR remains above a desired level. Moreover, the relative performances of SE and SSR vary from one scenario to another (as shown in Section VII), and thus a fine tuning of weights should be done by trial and error, limiting the feasibility of this approach for online learning, which is the challenge addressed by our proposal.
Algorithms like Q-learning, used in [19], or its deep learning version, DQN, used in [18], are conceived for discrete action spaces. Thus, as noted in [13], they become infeasible as the number of network slices increases, since the number of actions grows almost exponentially with the number of slices. Note that [19] and [18] considered scenarios with relatively small action spaces (2 and 3 network slices, respectively). To overcome this limitation, [13] proposes the use of normalized advantage functions (NAF), a technique allowing the use of DQN strategies for continuous action spaces. In fact, transforming the discrete action space into a continuous one is a standard strategy for applying RL in this problem, and is the one that we adopted to evaluate the RL baselines in our experiments.
But the most distinctive feature of all these previous works is the use of model-free RL methods, which require relatively long training periods. Using a simulator for training MFRL agents can be extremely costly, and does not offer sufficient performance guarantees when agents are deployed in production, since it is practically impossible to replicate all the relevant aspects of a real network in a simulator. If the agents are trained on a real operating network (online learning), the system will experience poor performance during the learning periods, because MFRL agents need to explore multiple policies before they converge to an efficient one. For example, the distributional RL approach of [32] required between 5000 and 15000 steps to converge, and [31] trained its DQN-based proposal for $2 \times 10^6$ steps before performing the evaluation experiments. In contrast, we follow a model-based approach, aimed at increasing the sample efficiency of the learning process so that the control algorithm is suitable for online learning. There are no precedents of this approach for dynamic resource allocation among network slices. Previous online learning proposals were focused on different network functions (e.g., interference coordination and energy saving [33]-[35]) and used specific ad hoc mechanisms based on multi-armed bandits [33], sequential likelihood ratio tests [34], or Bayesian models [35].
MBRL is known to be more sample efficient than MFRL [36], [37] but also more demanding in terms of computation. However, we propose a novel modeling strategy involving an online learning classifier that reduces the computational overhead dramatically, and helps our proposal to outperform MFRL algorithms in this metric. In Section V we review the principles of kernel-based online learning, and provide references to related works on this topic.
The use of kernel-based online learning for the definition of the model in an MBRL algorithm is a novel approach. Nevertheless, it should be emphasized that our proposal can be complementary to previous ones. For example, our method prescribes the amount of physical radio resources to be assigned to each slice, but does not arrange them within the time-frequency frame structure of the RF interface. For this task, the heuristic scheme proposed in [22] can be used. Our proposal can operate concurrently with an admission control scheme for on-demand incoming slices, such as those developed in [15], [16], and with a mechanism for the allocation of computational resources among slices [9], [26]. The promising results shown in Section VII suggest that our proposal could be extended to the control of additional resources (computation, storage, backhaul, fronthaul) requiring elasticity and efficiency.

III. SYSTEM DESCRIPTION
We consider a typical cellular network system, similar to the scenarios described in [13], [19], [31], [32], consisting of a base station (BS) providing access to multiple UEs belonging to K network slices. We consider downlink transmission on a hexagonal cell. Figure 1 shows a schematic overview of the system for K = 3 slices. The radio interface between the BS and the UEs is structured into frames, and each frame is divided into time and frequency partitions: in the time dimension, frames are divided into transmission time intervals (TTIs), also referred to as subframes, and in the frequency dimension, the bandwidth is divided into subcarriers. The physical layer of 5G RANs provides high flexibility in the use of waveforms and time-frequency frame structures, allowing diverse configurations of the TTI duration and sub-carrier spacing, known as numerologies. The selected numerology depends on the deployed frequency band and on the desired transfer service capabilities for the slice [2]. In our case, we will assume a TTI duration of 1 ms and a sub-carrier spacing of 15 kHz, but network slices using different numerologies can coexist in the radio frame.
Fig. 1. Diagram of controlled system. One base station transmitting downlink in a hexagonal sector cell covering UEs belonging to three different RAN slices. Each RAN slice is granted the exclusive use of a predefined subset of RBs in each radio frame. During an observation period, spanning several frames, the control system monitors the performance, and updates the RB allocation for the next period according to the observed variables of the slices and the SLA fulfillment indicators.
Each network slice is assigned a subset of the time-frequency resources of the radio interface. The smallest time-frequency allocation unit is referred to as resource block (RB), and consists of 1 TTI and 12 sub-carriers (1 ms × 180 kHz, in our setting). We consider, as in [13], [18], [19], [31], that the RBs assigned to a slice are used exclusively by that slice, thus ensuring slice isolation. Consequently, each slice runs its own scheduler for allocating its RBs among its users on a per-TTI basis, in accordance with the characteristics of the delivered service type (eMBB or mMTC). As shown in Figure 1, the time-frequency resources allocated to each slice consist of a group of RBs within each radio frame, sometimes referred to as a tile, where a specific numerology can be adopted.
The SLA defined for each slice depends on the service requirements and the preferences of its tenant, and comprises a set of configuration descriptors and key performance indicators (KPIs) that can be very diverse. We will consider descriptors of the authorized capacity for each slice (see [2]). For eMBB slices, these descriptors can set specific limits to the average number of RBs consumed by each type of traffic (non-GBR and GBR) within the slice. In mMTC it is usual to define a limit on the maximum number of simultaneous active devices (UE contexts). Key performance indicators can specify QoS objectives such as maximum average delay or maximum buffer length. Section VII contains the SLA specifications for eMBB and mMTC slices used in our numerical simulations.
The number of RBs allocated to each slice can be scaled up or down in periodic time instants referred to as decision stages or steps, which are typically spaced by several frames. At each stage, the control agent selects a control action that specifies the number of RBs that will be available exclusively to each slice until the next decision stage. Between consecutive stages, the agent collects per-slice measurements regarding user data traffic, channel quality conditions, and SLA compliance parameters. These observations are used by the control agent to make the next decision and to learn about the response of the system. Table I shows a set of variables that can be measured in an eMBB or an mMTC slice. In the case of eMBB, the observation comprises a differentiated subset of variables for each type of traffic (GBR and non-GBR), since each type is associated with a specific QoS requirement in the SLA. These variables provide aggregated information at the system level: incoming traffic rate, delivered traffic rate, average resource occupation, average queue length, and average signal-to-interference-plus-noise ratio (SINR). In mMTC, each device is associated with a constant number of packet repetitions related to its estimated pathloss [38]. Therefore, the observed variables include the number of simultaneous UE contexts, the average delay per UE, and the average number of remaining packet repetitions per UE by the end of the previous stage.
Let us summarize the control sequence: at each decision stage n, the control agent receives the observation vector of each slice (containing the variables of Table I), and the KPIs of each slice (allowing the agent to assess whether or not the slice's SLA has been fulfilled during the last decision period). Based on these observations, the agent selects a resource allocation with which the system will operate until the next decision stage. The control objective is to allocate the RBs as efficiently as possible while ensuring that the SLAs are fulfilled with high probability.

A. Observations and Actions
Let $\mathcal{K}$ denote the set of $K$ active network slices coexisting in the RF interface, and let $C$ denote the total number of RBs to be allocated among the slices. At each decision stage $n = 1, 2, \ldots$, the control agent receives the per-slice observations gathered between stages $n-1$ and $n$, denoted by $s^{(i)}_{n-1}$, for each network slice $i \in \mathcal{K}$. The combination of the $K$ slice observations at stage $n$ is denoted by $s_n = (s^{(1)}_n, \ldots, s^{(K)}_n)$. Each slice observation $s^{(i)}_n$ is a vector containing samples of the variables defined in Table I, which are random variables since they are obtained by aggregating and/or averaging the realizations of multiple stochastic processes during the TTIs elapsed between stages $n-1$ and $n$. These processes are, for example, the arrival and departure of UEs to the cell, the GBR or non-GBR traffic generated by each UE, the data buffered at each UE, or the SINR measurements on each channel. Therefore, we will use the uppercase notation $S^{(i)}_n$ to refer to the slice observation as a random (multi-dimensional) variable, and the lowercase notation $s^{(i)}_n$ to denote a particular sample of the random variable $S^{(i)}_n$. Note also that $s_n$ is not the state of the system, which is extremely complex to define since it involves multiple UEs, protocol layers, traffic flows, propagation conditions, and so on. Instead, it is a partial observation comprising system-level variables that can be sufficient to make decisions on the allocation of bandwidth resources to network slices.
At decision stage $n$, the control agent selects the number of RBs $a^{(i)}_n$ to be allocated to each slice $i \in \mathcal{K}$. The combination of all the assignments, $a_n = (a^{(1)}_n, \ldots, a^{(K)}_n)$, is the control action at stage $n$. For each slice $i$, we define an indicator function $I^{(i)}$ that informs the controller about the SLA fulfillment on a per-stage basis: $I^{(i)}(s^{(i)}_{n-1}, a^{(i)}_n) = 1$ if the SLA of slice $i$ has been violated between decision stages $n$ and $n+1$, and $I^{(i)}(s^{(i)}_{n-1}, a^{(i)}_n) = 0$ otherwise. Note that the function $I^{(i)}$ condenses all the KPIs that have been defined in the SLA. For instance, in an eMBB slice, if the system has not been able to meet either the GBR QoS level or the non-GBR QoS level, the indicator function for that slice will return 1. It will return 0 only if all the specified QoS levels have been satisfied during the observation period.
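As an illustration of how the indicator condenses several KPIs into one binary value, the following minimal sketch assumes a hypothetical eMBB SLA with one delay target per traffic type. The class `EmbbKpis`, its field names, and the default thresholds are assumptions made for this example only; they are not the SLA specifications used in Section VII.

```python
from dataclasses import dataclass


@dataclass
class EmbbKpis:
    """Hypothetical per-stage KPIs of an eMBB slice (names are assumptions)."""
    gbr_delay_ms: float
    non_gbr_delay_ms: float


def sla_violation_indicator(kpis: EmbbKpis,
                            max_gbr_delay_ms: float = 20.0,
                            max_non_gbr_delay_ms: float = 100.0) -> int:
    """Return 1 if any QoS target was missed during the stage, else 0."""
    gbr_ok = kpis.gbr_delay_ms <= max_gbr_delay_ms
    non_gbr_ok = kpis.non_gbr_delay_ms <= max_non_gbr_delay_ms
    # The indicator is 0 only when every specified QoS level is satisfied.
    return 0 if (gbr_ok and non_gbr_ok) else 1
```

With these assumed thresholds, a stage with delays of 10 ms (GBR) and 50 ms (non-GBR) yields 0, whereas missing either target yields 1.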

B. Control Policy
The decisions of the control agent are determined by its policy $\pi$, defined as a function that receives the system observation $s_{n-1}$ and provides the control action $a_n$ with the resource allocation per slice. Given the random system dynamics and an initial distribution of the slice observations $S^{(i)}_0$, the policy $\pi$ determines a random sequence of observation-action pairs $(S^{(i)}_{n-1}, A^{(i)}_n)$, $n = 1, 2, \ldots$, for each slice $i \in \mathcal{K}$. These $K$ sequences constitute a trajectory of the system. In our setting, a policy is admissible if it prescribes only actions that comply with $\sum_{i \in \mathcal{K}} a^{(i)}_n \le C$ and $a^{(i)}_n \ge 0$ for $i \in \mathcal{K}$ (admissible actions). The set of admissible policies is denoted by $\Pi$.
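The admissibility condition on actions can be checked directly. The sketch below is only illustrative: the function name `is_admissible` is an assumption, and actions are assumed to be integer numbers of RBs.

```python
from typing import Sequence


def is_admissible(action: Sequence[int], capacity: int) -> bool:
    """Check that every a_i >= 0 and that sum_i a_i <= C."""
    return all(a >= 0 for a in action) and sum(action) <= capacity
```

For example, with an assumed capacity of 50 RBs, the allocation (10, 20, 5) is admissible, while (30, 30) exceeds the capacity.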

C. Constrained Markov Decision Process Formulation
Using the above definitions, the problem of finding an efficient policy for the control agent can be formulated as a constrained Markov decision process (CMDP) [39]. The objective of our CMDP is to find a policy $\pi \in \Pi$ that minimizes the average amount of allocated resources, while the average number of SLA violations per slice is kept under a desired bound $\delta$:
$$\min_{\pi \in \Pi} \ \lim_{N \to \infty} \frac{1}{N} E_\pi\!\left[\sum_{n=1}^{N} \sum_{i \in \mathcal{K}} A^{(i)}_n\right] \quad \text{s.t.} \quad \lim_{N \to \infty} \frac{1}{N} E_\pi\!\left[\sum_{n=1}^{N} I^{(i)}\big(S^{(i)}_{n-1}, A^{(i)}_n\big)\right] \le \delta, \quad i \in \mathcal{K}, \tag{1}$$
where $E_\pi$ denotes the expected value with respect to the distribution of the trajectories under policy $\pi$. Note that the capacity constraint is implicitly included by considering only admissible policies $\pi \in \Pi$, and all the QoS objectives of each slice are accounted for in its indicator function $I^{(i)}$. This CMDP cannot be directly addressed because the system dynamics are unknown and the state of the system is not directly observable. Even without these limitations, the large dimension of the state and control spaces would render any numerical approach infeasible.

D. Markov Decision Process Formulation
In order to apply RL methods, it is necessary to reformulate (1) as an MDP by removing the SLA constraints and including them into the objective. The usual approach [13], [18], [29], [32] is to add an SLA violation counter to the objective function as a penalty term, multiplied by a weight factor $\lambda$. The resulting MDP is:
$$\min_{\pi \in \Pi} \ \lim_{N \to \infty} \frac{1}{N} E_\pi\!\left[\sum_{n=1}^{N} \sum_{i \in \mathcal{K}} \Big(A^{(i)}_n + \lambda I^{(i)}\big(S^{(i)}_{n-1}, A^{(i)}_n\big)\Big)\right], \tag{2}$$
where the term $\sum_{i \in \mathcal{K}} \big(a^{(i)}_n + \lambda I^{(i)}(s^{(i)}_{n-1}, a^{(i)}_n)\big)$ is interpreted as the cost incurred by the system at stage $n$. This cost term can also be denoted by $-r_n(s_n, a_n)$, expressing the total (negative) reward of the observation-action pair $s_n, a_n$. The problem (2) is an average cost MDP because it aims at minimizing the average cost per stage (or maximizing the average reward per stage). However, most RL methods address discounted MDPs. We can reformulate (2) as a discounted MDP as follows:
$$\max_{\pi \in \Pi} \ E_\pi\!\left[\sum_{n=1}^{\infty} \gamma^{n-1} r_n(S_n, A_n)\right], \tag{3}$$
where $\gamma$ is a discount factor, and $S_n, A_n$ denote the (random) observation-action pair visited by the system trajectory at stage $n$. The discounted MDP (3) aims at maximizing the expected return, defined as the sum of the discounted rewards along the system trajectory.
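To make the penalized cost and the discounted return concrete, the following sketch computes both for a short, made-up trajectory. The function names, the value of the weight factor, and the sample numbers are assumptions for illustration; the discount exponent is assumed to be zero at the first stage.

```python
from typing import Sequence


def stage_cost(action: Sequence[float],
               violations: Sequence[int],
               lam: float) -> float:
    """Per-stage cost: sum over slices of (allocated RBs + lam * indicator)."""
    return sum(a + lam * v for a, v in zip(action, violations))


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma^k * r_k over a finite trajectory, with k starting at 0."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total  # Horner-style backward accumulation
    return total


# Rewards are the negated stage costs of an assumed two-slice trajectory.
rewards = [-stage_cost([10, 20], [0, 1], lam=5.0),
           -stage_cost([12, 22], [0, 0], lam=5.0)]
ret = discounted_return(rewards, gamma=0.9)
```

The example shows why the choice of the weight factor matters: a single violation changes the stage cost by exactly that weight, which must be traded off by hand against the number of allocated RBs.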

E. Solution Strategies
1) Model-Free RL: Model-free RL methods assume that the transition dynamics of the system are unknown. Their main idea is to estimate the expected return in (3) by taking samples of the trajectory. In order to do this, these methods build a parametric estimator of the expected return using, for example, a deep neural network. Policy gradient algorithms include a parametric policy, and estimate the gradient of the expected return with respect to the policy parameters. This makes it possible to gradually adjust these parameters by means of gradient updates (see [40] for a coverage of RL techniques). The last 5 years have been particularly productive in terms of novel algorithms of this type. In Section VII we briefly review the state-of-the-art MFRL algorithms that we use as baselines in our performance evaluation experiments.
2) Model-Based RL: Model-based RL typically relies on learning the transition dynamics of the system instead of the optimal state values and/or policies [36], [37]. The main task of the learning process is to fit an approximation of the true transition function, given the states and the actions observed from the real system. Once a model is learned, the agent can use it to predict the expected return of each action in each observed state. Consequently, at each decision stage, the agent can evaluate multiple candidate action sequences, and select the optimal one to use.
Our approach, instead of learning the transition dynamics, learns the effects of these dynamics on the SLA violation indicator functions $I^{(i)}$ for $i \in \mathcal{K}$. In particular, we build, for each slice $i$, a model $h^{(i)}_n$ that predicts whether a given assignment $a^{(i)}_n$ will fulfill the SLA, given the observation $s^{(i)}_{n-1}$ received at the end of the previous stage. Note that this strategy does not generate multi-step trajectories, i.e., we do not predict $S_n, S_{n+1}, \ldots$, and therefore the agent will only be able to plan over a one-stage horizon. Although this strategy generally leads to suboptimal solutions, it allows us to address the original problem (1) instead of the modified one (2), resulting in better empirical results compared to model-free RL approaches (which are farsighted), as shown in Section VII.
Given $h^{(i)}_n$, we could approximately address the CMDP problem (1) as a one-step lookahead control problem, obtaining a model predictive controller (MPC) [40] in which the observation-action pairs must satisfy the SLA of each slice according to $h^{(i)}_n$ for $i \in \mathcal{K}$. However, this approach does not take the violation rate bound $\delta$ into account. An insufficiently accurate predictor could cause excessive SLA violations.
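We therefore define the error function $e^{(i)}$ as the probability that the prediction given by $h^{(i)}_n$ on a given pair $(s^{(i)}_{n-1}, a^{(i)}_n)$ is a false negative, i.e., that the model predicts SLA fulfillment while the SLA is actually violated. Note that, by convention, we are associating the null hypothesis with the absence of any SLA violation in stage $n$, i.e., $I^{(i)}(s^{(i)}_{n-1}, a^{(i)}_n) = 0$. Therefore, $e^{(i)}$ denotes the probability of a type II error.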
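We have transformed the problem into a one-step lookahead control problem, in which each control action $a_n$ should be admissible, and each element $a^{(i)}_n$ in $a_n$ should be SLA compliant according to $h^{(i)}_n$ with an error probability bounded by $\delta$:
$$\min_{a_n} \ \sum_{i \in \mathcal{K}} a^{(i)}_n \quad \text{s.t.} \quad \sum_{i \in \mathcal{K}} a^{(i)}_n \le C, \quad a^{(i)}_n \ge 0, \quad h^{(i)}_n\big(s^{(i)}_{n-1}, a^{(i)}_n\big) = 0, \quad e^{(i)}\big(s^{(i)}_{n-1}, a^{(i)}_n\big) \le \delta, \quad i \in \mathcal{K}, \tag{5}$$
where $h^{(i)}_n(\cdot) = 0$ indicates that the SLA is predicted to be fulfilled. It is straightforward to decompose the above problem into $K$ sub-problems, facilitating its online operation. As we will describe in the following section, the error functions $e^{(i)}$ will also be learned online, and thus the controller will use the learned functions $\hat{e}^{(i)}$.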
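Each per-slice sub-problem can be pictured as a search for the smallest allocation that is both predicted to be SLA compliant and sufficiently reliable. The sketch below is a simplified illustration under stated assumptions: the callables `predicts_fulfilled` and `error_prob` are hypothetical stand-ins for the learned classifier and error estimator, actions are integers between 0 and an assumed maximum `a_max`, and a linear scan is used for clarity rather than efficiency.

```python
from typing import Callable, Optional, Sequence


def smallest_compliant_action(
    obs: Sequence[float],
    predicts_fulfilled: Callable[[Sequence[float], int], bool],
    error_prob: Callable[[Sequence[float], int], float],
    delta: float,
    a_max: int,
) -> Optional[int]:
    """Smallest a in 0..a_max predicted compliant with error bound delta."""
    for a in range(a_max + 1):
        if predicts_fulfilled(obs, a) and error_prob(obs, a) <= delta:
            return a
    return None  # no candidate satisfies both conditions
```

Because the capacity constraint couples the slices, the per-slice solutions obtained this way may jointly exceed the available RBs, in which case a further projection onto the admissible actions is required.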

A. Online Learning
An online learning algorithm aims at learning a mapping $h : \mathcal{X} \to \mathbb{R}$ from a sequence of examples $(x_n, y_n)$, $n = 1, \ldots, N$, where $x_n \in \mathcal{X}$ is called an instance and $y_n \in \mathbb{R}$ is called a label. In a linear binary classification task, the goal is to learn a linear classifier $h : \mathcal{X} \to \{-1, +1\}$ such that $h(x_n, \theta) = \mathrm{sgn}(\langle \theta, x_n \rangle)$, where $\mathcal{X}$ is typically a $d$-dimensional vector space, $\theta \in \mathbb{R}^d$ is a weight vector to be learned, $\langle \cdot, \cdot \rangle$ denotes the dot product, and $\mathrm{sgn}(z)$ is an indicator function that outputs $+1$ when $z > 0$ and $-1$ otherwise. The function $h$ is called the hypothesis (function) or the (prediction) model, and is denoted by $h_n$ at stage $n$.
The main feature of online learning is that learning takes place in rounds or stages. At each stage $n = 1, 2, \ldots$, an instance $x_n$ is presented to the algorithm, which predicts a label $\hat{y}_n \in \{-1, +1\}$ using the current hypothesis function: $\hat{y}_n = h_n(x_n)$. Then, the correct label $y_n$ is revealed, and the learner can measure the suffered loss, which in online binary classification can be given by the hinge loss $\ell((x_n, y_n); h_n) = \max(0, 1 - y_n h_n(x_n))$. Whenever the loss is nonzero, the learner updates the prediction model $h_n$ according to an algorithm-specific strategy. The classic goal of online learning is to minimize the regret of the learner's predictions against the best fixed model in hindsight. The regret is defined as follows:
$$R_N = \sum_{n=1}^{N} \ell((x_n, y_n); h_n) - \min_{h \in \mathcal{H}} \sum_{n=1}^{N} \ell((x_n, y_n); h), \tag{6}$$
where $\mathcal{H}$ denotes the model space. For example, in linear classification, $\mathcal{H}$ is the set of functions of the form $h(x, \theta) = \mathrm{sgn}(\langle \theta, x \rangle)$ for $\theta \in \mathbb{R}^d$. Note that the second term in (6) is the loss suffered by the optimal model $h^* \in \mathcal{H}$ that can only be known in hindsight after seeing all the examples. Regret minimization formalizes the concept of sample efficiency.
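The online protocol can be sketched for the linear case as follows. This is a minimal illustration, not a specific algorithm from the literature cited above: the update rule (adding $y_n x_n$ to the weights whenever the hinge loss is nonzero) is an assumed perceptron-style choice, the hinge loss is evaluated on the real-valued score $\langle \theta, x_n \rangle$, and the regret is computed against a user-supplied comparator rather than the true minimizer.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def hinge_loss(theta, x, y):
    """Hinge loss of the linear score <theta, x> for label y in {-1, +1}."""
    return max(0.0, 1.0 - y * dot(theta, x))


def online_linear_classifier(examples):
    """Run the online protocol; return final weights and cumulative loss."""
    theta = [0.0] * len(examples[0][0])
    cumulative = 0.0
    for x, y in examples:
        loss = hinge_loss(theta, x, y)  # loss suffered by the current model
        cumulative += loss
        if loss > 0.0:                  # update only when the loss is nonzero
            theta = [t + y * xi for t, xi in zip(theta, x)]
    return theta, cumulative


def regret_against(examples, cumulative, theta_ref):
    """Cumulative loss minus the loss of one fixed comparator theta_ref."""
    return cumulative - sum(hinge_loss(theta_ref, x, y) for x, y in examples)
```

Since the comparator is not necessarily the best fixed model, `regret_against` gives a lower bound on the regret defined in (6).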

B. Online Learning With Kernels
If the online algorithm needs to learn a nonlinear model $h$, one way to introduce nonlinearity is by the use of kernels [41], [42]. In this case $\mathcal{H}$ is known as a Reproducing Kernel Hilbert Space (RKHS) and is defined by a kernel function $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. The kernel $\kappa(x, x')$ expresses the similarity between $x$ and $x'$, among other required properties [42], and allows us to write the hypothesis function $h_n$ as a kernel expansion as follows:
$$h_n(x) = \sum_{x_j \in X_n} \alpha_j \kappa(x_j, x), \tag{7}$$
where $\alpha_j$ are coefficients (typically $\alpha_j = y_j$), and $X_n$ is defined as the set of instances for which a prediction error occurred (and thus the hypothesis was updated), that is, $X_n = \{x_j : \hat{y}_j \ne y_j, \ j \le n\}$. The set $X_n$ is called the support set.
The usual update step (see, e.g., the Kernelized Perceptron in [41]) involves adding the new instance $x_{n+1}$ to the support set, $X_{n+1} = X_n \cup \{x_{n+1}\}$, and updating the hypothesis function as $h_{n+1} = h_n + y_{n+1} \kappa(x_{n+1}, \cdot)$. This functional update expresses the addition of a new term $y_{n+1} \kappa(x_{n+1}, x)$ to the summation in (7). One critical issue of this strategy is the unbounded growth of the support set $X_n$, which increases the computational and space complexity over time.
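The mistake-driven update described above can be sketched in a few lines. The Gaussian kernel, its bandwidth, and the class and method names are assumptions chosen for this illustration; any valid kernel could be substituted.

```python
import math


def gaussian_kernel(x, z, bandwidth=1.0):
    """Gaussian (RBF) kernel; the bandwidth value is an assumed default."""
    sq = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq / (2.0 * bandwidth ** 2))


class KernelPerceptron:
    """Kernel expansion h(x) = sum_j alpha_j * kappa(x_j, x)."""

    def __init__(self, kernel=gaussian_kernel):
        self.kernel = kernel
        self.support = []  # support set: instances that caused a mistake
        self.alpha = []    # one coefficient per support instance

    def score(self, x):
        return sum(a * self.kernel(s, x)
                   for a, s in zip(self.alpha, self.support))

    def predict(self, x):
        return 1 if self.score(x) > 0 else -1

    def update(self, x, y):
        """Predict first; on a mistake, add x to the support set."""
        y_hat = self.predict(x)
        if y_hat != y:
            self.support.append(x)
            self.alpha.append(y)
        return y_hat
```

Every mistake adds one instance, so both the memory footprint and the cost of `score` grow with the number of errors made so far, which is the drawback noted in the text.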
To address this drawback, one possible strategy is to set an upper bound (budget) on the cardinality $|X_n|$ of the support set, and include a budget maintenance strategy to select which instance to remove when $|X_n|$ reaches the budget [43]. An alternative and more effective strategy is the use of a projected hypothesis.
The hypothesis projection technique was introduced with the Projectron algorithm [6] and works as follows. Before adding an instance $x_{n+1}$ to the support set, a temporary hypothesis is constructed as $h'_{n+1} = h_n + y_{n+1} \kappa(x_{n+1}, \cdot)$. Additionally, a projected hypothesis $h''_{n+1}$ is obtained by computing the values for the coefficients in expression (7) that best approximate $h'_{n+1}$ using only the existing instances in the support set $X_n$. We say that $h''_{n+1}$ is the projection of $h'_{n+1}$ onto the space spanned by $X_n$. If the distance between $h'_{n+1}$ and $h''_{n+1}$ is below some threshold $\eta$, the next hypothesis will be $h''_{n+1}$, and the support set will remain unchanged. Otherwise, the next hypothesis will be $h'_{n+1}$ and the support set will incorporate $x_{n+1}$. Algorithm 1 summarizes the Projectron algorithm.
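The projection step can be illustrated with a compact sketch. It relies on a standard RKHS identity: the projection coefficients $d$ of $\kappa(x, \cdot)$ onto the span of the support set solve $G d = k$, where $G$ is the Gram matrix of the support set and $k$ is the vector of kernel values between the support instances and $x$, and the squared projection distance is $\kappa(x, x) - k^\top d$. The Gaussian kernel, the helper `solve`, the class name, and the choice of always storing the first misclassified instance are assumptions of this example, which is not the reference implementation of [6].

```python
import math


def gaussian_kernel(x, z, bandwidth=1.0):
    sq = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq / (2.0 * bandwidth ** 2))


def solve(matrix, rhs):
    """Solve matrix * d = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    d = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * d[c] for c in range(r + 1, n))
        d[r] = (aug[r][n] - tail) / aug[r][r]
    return d


class Projectron:
    def __init__(self, eta, kernel=gaussian_kernel):
        self.eta = eta      # projection distance threshold
        self.kernel = kernel
        self.support = []   # support set
        self.alpha = []     # expansion coefficients

    def score(self, x):
        return sum(a * self.kernel(s, x)
                   for a, s in zip(self.alpha, self.support))

    def update(self, x, y):
        y_hat = 1 if self.score(x) > 0 else -1
        if y_hat == y:
            return y_hat    # no mistake: hypothesis unchanged
        if not self.support:
            # Assumed choice: always store the first misclassified instance.
            self.support.append(x)
            self.alpha.append(y)
            return y_hat
        k = [self.kernel(s, x) for s in self.support]
        gram = [[self.kernel(s, t) for t in self.support]
                for s in self.support]
        d = solve(gram, k)  # projection coefficients of kappa(x, .)
        dist_sq = self.kernel(x, x) - sum(ki * di for ki, di in zip(k, d))
        if math.sqrt(max(dist_sq, 0.0)) <= self.eta:
            # Projected hypothesis: spread the update over existing terms.
            self.alpha = [a + y * di for a, di in zip(self.alpha, d)]
        else:
            # Temporary hypothesis: the support set grows by one instance.
            self.support.append(x)
            self.alpha.append(y)
        return y_hat
```

Recomputing the Gram matrix and solving the system from scratch at every mistake keeps the sketch short; an incremental update of the inverse Gram matrix would be the natural refinement for a practical implementation.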

A. Elements of the Proposal
Our proposal is based on the following operating principles:
• One-step lookahead planning: the agent selects, at each decision stage, a control action addressing the problem formulated in (5), which is a one-step version of the CMDP (1).
• Model-based RL: in order to address (5), the agent learns a model of the system, instead of a policy or a value function. The model is intended to predict whether the SLAs will be fulfilled or violated in each slice $i$, for any given state-action pair. Our modeling strategy includes a self-assessment procedure that computes the error function estimators $\hat{e}$ [...]

Algorithm 1 (Projectron), excerpt:
4: Predict $\hat{y}_n \leftarrow \mathrm{sign}(h_{n-1}(x_n))$
5: Receive label $y_n$
6: if $y_n \neq \hat{y}_n$ then
7: $h'_n \leftarrow h_{n-1} + y_n\,\kappa(x_n, \cdot)$
8: $h''_n \leftarrow$ projection of $h'_n$ onto the space $X_n$
9: if $\gamma_n \le \eta$ then
[...]

As shown in Figure 2 and Algorithm 2, our proposal, kernel-based RL (KBRL), is structured into three modules:
1) A controller that generates, at each stage $n$, a control action vector $a_n = (a_n^{(1)}, \ldots, a_n^{(K)})$ [...]
2) The H-learner, which [...] at every stage, based on $s_{n-1}$, $a_n$, and the array of observed labels $y_n$.
3) The E-learner, which updates the $K$ estimators in $E_n$, taking $\hat{y}_n$, $y_n$, $m_n$ as inputs, and providing $E_{n+1}$ as output. This module monitors the accuracy of the classifiers learned by the H-learner module.
Note that at each decision stage $n$, Algorithm 2 performs one step of the for loop, where each module intervenes once.

Algorithm 2 (KBRL), excerpt:
5: $(a_n, m_n, \hat{y}_n) \leftarrow \mathrm{Controller}(H_n, E_n, s_{n-1})$
6: Apply $a_n$ and observe $y_n$ and $s_n$
7: $H_{n+1} \leftarrow \mathrm{H\text{-}learner}(H_n, y_n, s_{n-1}, a_n)$
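The interplay of the three modules within one decision stage can be sketched in Python as follows. This is only an outline of the loop structure: every callable passed to `kbrl_loop` is a hypothetical stand-in whose signature mirrors the inputs and outputs named in the text, and `env_step` is an assumed wrapper around the operating network (or a simulator of it).

```python
def kbrl_loop(env_step, controller, h_learner, e_learner, H, E, s0, num_stages):
    """Run the three KBRL modules once per decision stage.

    All callables are hypothetical stand-ins:
      controller(H, E, s_prev)   -> (a, m, y_hat)
      env_step(a)                -> (y, s)  apply the allocation, observe labels and state
      h_learner(H, y, s_prev, a) -> H_next
      e_learner(E, y_hat, y, m)  -> E_next
    """
    s_prev = s0
    history = []
    for _ in range(num_stages):
        a, m, y_hat = controller(H, E, s_prev)  # choose allocation, margins, predictions
        y, s = env_step(a)                      # observe SLA labels and the next state
        H = h_learner(H, y, s_prev, a)          # update the classifiers
        E = e_learner(E, y_hat, y, m)           # update the error estimators
        history.append((a, y))
        s_prev = s
    return H, E, history
```

Keeping the modules as separate callables reflects the fact that each one intervenes exactly once per stage and communicates only through the quantities listed above.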

B. H-Learner: Online Learning for Classification
To increase the learning efficiency of $h_n^{(i)}$, we introduce a sample augmentation strategy, for which we use the following assumption.
Assumption 1: Given an initial set of conditions for a slice $i \in \mathcal{K}$, summarized in $s_{n-1}^{(i)}$, if the SLA of slice $i$ is fulfilled in stage $n$ with $a$ resources assigned to it, then the SLA is also fulfilled with $a' > a$ resources. Conversely, if the SLA is not fulfilled with $a$ resources, then it is also not fulfilled with $a' < a$ resources.
This assumption simply reflects a desirable feature of the underlying schedulers that allocate the resources among the UEs within each slice: more available resources can only improve the quality of service provided to the UEs.
As shown in Algorithm 3, the H-learner receives, for each slice $i \in \mathcal{K}$, the [...]

Algorithm 3 (H-learner), excerpt:
for $a = 0, \ldots, a_n^{(i)}$ do
12: $x_n \leftarrow$ [...]
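The sample augmentation implied by Assumption 1 can be sketched as follows. The function name and the tuple layout of the returned samples are assumptions made for illustration, and `a_max` stands for the largest allocation considered for the slice.

```python
def augment_samples(state, action, sla_fulfilled, a_max):
    """Expand one observation into labelled (state, action) pairs using Assumption 1.

    Returns a list of ((state, a), fulfilled) tuples.
    """
    if sla_fulfilled:
        # Fulfilled with `action` resources -> fulfilled with any a >= action.
        actions = range(action, a_max + 1)
    else:
        # Violated with `action` resources -> violated with any a <= action.
        actions = range(0, action + 1)
    return [((state, a), sla_fulfilled) for a in actions]
```

For instance, with `a_max = 5`, an SLA violation observed with 3 resources yields four labelled samples (for allocations 0 to 3), whereas a fulfillment observed with 3 resources yields three (for allocations 3 to 5).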

C. E-Learner: Estimating the Classification Error Probability
In order to estimate the prediction accuracy for each action, we use the following structural result that is a direct consequence of Assumption 1.
Proposition 1: Given an initial set of conditions for a slice $i \in \mathcal{K}$, summarized in $s_{n-1}^{(i)}$, [...]
Given the conditions of Proposition 1, if $a$ is the smallest action predicted to fulfill the SLA of slice $i$, we say that $a'$ has a security margin of $a' - a$. The security margin $m_n^{(i)}$ for a given action $a_n^{(i)}$ is defined as follows: [...]
By Proposition 1, the larger the security margin, the smaller the classification error probability $e^{(i)}$. Therefore, to obtain an estimator of $e^{(i)}$, we compute the average prediction error associated with all (positive) security margins $m = 0, 1, \ldots, a_{\max}$, for each slice. For a given $m$, the error probability estimator, denoted by $\hat{e}$ [...], where $a'$ has a security margin of $m$.
As shown in Algorithm 4, the E-learner receives, for each slice $i$, the prediction $\hat{y}$ [...] where $\beta$ is the learning rate. The E-learner uses a sample augmentation strategy similar to the one discussed in the previous subsection. If $y_n^{(i)} = 0$ (meaning that the SLA is fulfilled) for a given $m$ [...]

Algorithm 5, excerpt:
13: end if
14: end while
15: end for
16: if $\sum_{i \in \mathcal{K}} a_n^{(i)} > C$ then
17: for $i = 1, \ldots, K$ do
[...]

When the solution to the $K$ sub-problems does not fit the global resource constraint, $\sum_{i \in \mathcal{K}} a_n^{(i)} > C$, we project the solution onto the space of admissible actions. This is done by computing the set of actions $\bar{a}^{(1)}, \ldots, \bar{a}^{(K)}$ such that $\sum_{i \in \mathcal{K}} \bar{a}^{(i)}$ [...]
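The security-margin bookkeeping used by the E-learner can be sketched as follows. The function and class names, the initial estimate of 0.5, and in particular the exponential-moving-average form of the update are assumptions: the text only states that the update is weighted by a learning rate $\beta$. The sample augmentation of the E-learner is not reproduced here.

```python
def security_margin(action, predicted_fulfilled):
    """Margin of `action` over the smallest allocation predicted to fulfill the SLA.

    `predicted_fulfilled[a]` is the classifier's prediction for allocation a.
    Returns None when no allocation up to `action` is predicted to fulfill it.
    """
    for a in range(action + 1):
        if predicted_fulfilled[a]:
            return action - a
    return None


class ErrorEstimator:
    """Per-margin running estimate of the misclassification probability."""

    def __init__(self, a_max, beta=0.05, initial=0.5):
        self.beta = beta                      # learning rate
        self.e_hat = [initial] * (a_max + 1)  # one estimate per margin m

    def update(self, margin, y_hat, y):
        error = 1.0 if y_hat != y else 0.0
        # Exponential moving average: an assumed form of the beta-weighted update.
        self.e_hat[margin] += self.beta * (error - self.e_hat[margin])
        return self.e_hat[margin]
```

Indexing the estimates by margin rather than by state-action pair follows the structural result of this subsection: larger margins are expected to show smaller error rates.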

E. Complexity
The most time-consuming part of the proposed method is the Projectron update (Algorithm 1), whose time complexity, analyzed in [6], is $O(|X_n|^2)$ (recall that $|X_n|$ denotes the cardinality of the support set). $|X_n|$ tends to increase over time but, in our experiments, it remained at relatively low values (accumulating fewer than 60 elements after 40000 learning steps).

VII. NUMERICAL EVALUATION
We have conducted extensive simulation experiments to evaluate our proposal and compare it with state-of-the-art RL baselines. The simulation environment was developed in Python to emulate the allocation of RBs among several RAN slices. Each slice is devoted to one type of communication, either eMBB or mMTC, characterized by its traffic model and its SLA. The traffic generator of an eMBB slice simulates the arrival and departure of GBR and non-GBR UEs, characterized by constant bit rate and variable bit rate traffic flows, respectively, according to the parameters shown in Table II. The traffic source of an mMTC slice simulates 1000 devices, each characterized by a transmission period and a number of packet repetitions. For each device, these two parameters are randomly selected from the sets shown in Table II.
The nominal received power at the UE follows the macro cell propagation model for urban areas described in Section 4.5 of TR 36.942 [44]. Frequency-selective fading is generated by drawing samples from datasets containing fading traces. Our simulator uses the datasets of the ns-3 simulator [45] corresponding to common fading/mobility models (Pedestrian A, Typical Urban, and Vehicular A, as defined in Annex B.2 of TS 36.104 [46]). For each new arrival, the simulator randomly selects one of the datasets and generates a random integer as the first index from which to draw samples from the selected dataset. Within each slice, a proportional fair scheduler allocates the RBs among the UEs of that slice, according to their buffer state reports and their SNR estimations. Given the SNR estimated by the UE, the transmitter selects a modulation and coding scheme (MCS) targeting a block error rate (BLER) below 0.1. The spectral efficiency of the selected MCS, together with the number of allocated RBs, determines the number of bits transmitted (transport block size) from the UE queue.
It should be highlighted that we conducted the experiments using a different strategy for the generation of SNR samples based on the dataset obtained in [47], and the results were very similar, which suggests that the channel modeling details are not of crucial importance in the evaluation and comparison of the algorithms. Note that the slice resource allocation operates on a larger time scale and at a higher system level than the scheduling algorithms, and consequently its performance is relatively decoupled from the channel models.
In the simulated scenarios, the allocation of radio resources among the RAN slices is updated every 50 radio subframes, and the duration of each subframe is 1 ms. Therefore, the elapsed time between consecutive decision stages is 50 ms. Three main scenarios have been considered for the experimental evaluation: [...] Table III. The number of observed variables, which is determined by Table I in Section III, is 50, 36, 22 and 13 for scenarios 1, 2, 3 and 4, respectively.

A. RL Baselines
We compare our proposed KBRL controller against the following algorithms, considered state-of-the-art baselines in RL:
• Deep Q-Networks (DQN) [48] is the deep learning version of Q-learning, a classical model-free off-policy RL algorithm. As discussed in Section II, Q-learning and DQN have been used for the allocation of RBs among network slices by [19] and [18], respectively. Nevertheless, the application of DQN is limited to scenarios with a small action space, i.e., with only 2 or 3 slices. Consequently, the comparison of KBRL with DQN was conducted only in scenario 4.
• Trust region policy optimization (TRPO) [49] is a model-free deep policy gradient algorithm that updates policies while satisfying a constraint on how different the new and old policies are allowed to be. This difference is expressed in terms of the Kullback-Leibler divergence.
• Proximal policy optimization (PPO) [50] is a variant of the TRPO idea that uses a simpler technique to estimate the difference between policies. Two implementations have been used: PPO1 and PPO2, described in [51].
• Twin delayed DDPG (TD3) [52] is an off-policy deep actor-critic algorithm. It is an improvement over deep deterministic policy gradient (DDPG) [53].
• Soft actor critic (SAC) [54] is an off-policy deep actor-critic algorithm that incorporates the idea of entropy regularization, and generally achieves better empirical performance than DDPG.
• Synchronous advantage actor critic (A2C) [55] is an on-policy deep actor-critic algorithm. It can execute multiple instances of the algorithm in parallel, although this feature is not applicable in online learning, where only one instance of the environment is available.
• Normalized advantage function (NAF) is a technique for extending deep Q-networks (DQN) to continuous action spaces, proposed in [56], and used in [13] for RB allocation among network slices.
In our experiments, we have used the implementations of the RL algorithms provided by Stable Baselines [51], which is an improved version of the OpenAI Baselines [57]. For NAF, which is not included in these libraries, we have used the Keras-RL implementation [58].

B. Online Performance
To evaluate and compare our proposal with all the RL baselines, the simulation experiments were designed as follows. In each scenario we executed 30 simulation runs for each algorithm. Each run comprises two consecutive phases: the learning phase and the inference phase.
During the learning phase, which lasts 40000 decision stages (equivalent to 33.3 minutes of simulated time), the algorithms learn from the interaction with the system, starting without any prior knowledge of the system's response. This situation is equivalent to the establishment of new network slices, or an update of the SLAs of existing slices. The evaluation of this phase is critical, since the objective of our proposal is to learn while the network is operating, minimizing the negative effects of this learning on the service offered.
During the inference phase, lasting 10000 steps (8.3 minutes of simulated time), the RL algorithms are no longer learning, and they simply use the policies obtained in the learning phase to make RB allocation decisions at each step. This phase models a situation where the network slices remain stable beyond the 40000 steps of the learning phase, and its objective is to compare our proposal to the baselines once they have been trained. This evaluation has been done for the sake of completeness, even [...]
The proposed algorithm has been evaluated for two values of the reliability factor, $\delta = 0.97$ and $\delta = 0.99$, to assess its impact on performance. For the baselines, we considered two values of the penalty factor, $\lambda = 100$ and $\lambda = 1000$, and found the second one to be more effective in avoiding SLA violations during the learning phase, which is crucial in an online learning setting. In addition, we evaluated the baselines under normalized rewards (between −1 and +1), which resulted in better performance for TD3 and NAF; the results reported for these two algorithms use this configuration.
Figure 3 shows the performance of the algorithms in scenario 1 during the first 20000 steps of the learning phase. For each metric, we plot the average value and the confidence interval for a 90% confidence level. The results for scenarios 2 and 3 are presented in Figures 4 and 5, respectively. It is evident that KBRL largely outperforms even the best baselines in terms of SLA fulfillment, and also shows greater efficiency in the use of resources. These results show the effectiveness of our model-based approach for learning efficient policies with far fewer samples than model-free RL algorithms. Moreover, the model used by KBRL is aimed at minimizing exploration, which prevents the selection of over-provisioning actions as well as under-provisioning actions from the early stages of the learning process. Another consequence of this design is that the performance of KBRL remains consistent across the different scenarios, while the performance of the RL baselines shows notable variations from one scenario to another.
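As an illustration of how a per-metric average and its 90% confidence interval could be obtained from the 30 runs of one algorithm, the following sketch uses only the Python standard library. The normal approximation is an assumption of this sketch; the text does not state how the intervals were computed, and a Student-t quantile would give a slightly wider interval for 30 runs.

```python
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(samples, level=0.90):
    """Return (mean, half_width) of a two-sided interval for the mean.

    Uses a normal approximation of the sampling distribution of the mean.
    """
    n = len(samples)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * stdev(samples) / n ** 0.5
    return mean(samples), half_width
```

Here `samples` would hold one value per simulation run, for example the SLA violation rate measured at a given step of the learning phase.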
Let us now see how the algorithms compare during the inference phase. For this phase, we evaluated the two conflicting objectives of the problem: the average number of SLA violations per stage, and the average resource occupation (average number of used RBs divided by the total number of RBs). To facilitate the visualization of the performance, we plotted both metrics on a pair of Cartesian axes for each scenario. Figure 6 shows the results, where the average performance of each algorithm corresponds to a point, and the confidence interval of each metric is centered on that point along its respective dimension.
The findings of the learning phase are also observed in the inference phase. The relative performance of the baselines varies notably from one scenario to another. For example, we see that PPO1 is the best baseline in scenario 1 in terms of SLA fulfillment but is outperformed by TD3 with regard to resource efficiency. In scenario 3, PPO1 is the baseline using the fewest resources, but four of the baselines (TD3, TRPO, PPO2 and SAC) attain a lower SLA violation rate. In sharp contrast, KBRL performs consistently across the three scenarios: its SLA violation rate is almost negligible, clearly outperforming all of the baselines in every scenario, while using fewer resources than all or most of the baselines. Not surprisingly, KBRL with $\delta = 0.99$ uses more resources than KBRL with $\delta = 0.97$, because a higher reliability factor results in a larger security margin ($m_n^{(i)}$, defined in Section VI), and thus more resources are assigned to the network slice.
To better understand the operation of KBRL, it is illustrative to evaluate two additional metrics specific to this algorithm. The first one is the rate at which KBRL generates solutions exceeding the available resources $C$, which must then be adjusted as detailed in Algorithm 5. We use the term adjustments to refer to these events. Figure 7 shows the average number of adjustments per decision stage in each scenario, including the confidence intervals for a 90% confidence level. As we can see, the adjustment rate is generally below 0.1. For a given scenario, a smaller $C$ would yield a higher adjustment rate, but we consider that operating at a relatively low adjustment rate is indicative of a proper dimensioning of the system, i.e., the existing resources are enough to accommodate all the slices, fulfilling the required SLAs, and leaving room to absorb occasional traffic peaks. Note that, thanks to the security margins, the SLA violation rates (shown in Figure 6) are clearly smaller than the adjustment rates.
The second metric of interest for KBRL evaluation is the accuracy of the online classifiers. Figure 8 depicts the [...]
To conclude our evaluation of KBRL, we show how it compares to an optimal performance, which was estimated using an oracle mechanism. The oracle operates as follows: at each decision stage $n$, it conducts an exhaustive search over all the possible allocations. For each allocation $a_n$, the system is simulated up to decision stage $n + 1$, to assess the SLA fulfillment for all the slices. For all the allocations evaluated, the simulation starts from the same system state. Once the best allocation $a_n^*$ is found (i.e., the one fulfilling the SLAs using the fewest resources), the system advances up to the next stage $n + 1$, and the search for $a_{n+1}^*$ starts. Note that the oracle control is infeasible for implementation in a real system, since it requires prescient knowledge of all the stochastic processes involved and is computationally cumbersome. In fact, it is not scalable even for simulation, due to the exponential growth of the action space with the number of slices. Consequently, KBRL and the oracle were compared only in scenario 4, which comprises only 2 network slices, resulting in a relatively small action space (556 actions). This feature makes this scenario useful for the evaluation of DQN, which was used by previous works [18], [19] for RB allocation among network slices. As a reference, we also include the results of NAF, which was used in [13] to overcome the limitations of DQN in this problem. For the RL baselines, we consider a learning phase lasting 20000 steps, and an inference phase of 4000 steps. Figure 9 compares the performance of the RL algorithms (the baselines and KBRL) during the learning phase. We see that DQN outperforms NAF in this scenario. Note that NAF needs to approximate the discrete action space as a continuous one, which may impact its performance.
As in previous scenarios, KBRL is capable of learning with a negligible number of SLA violations, outperforming both baselines in this metric. Figure 10 shows the performance of the oracle policy in comparison to KBRL, NAF and DQN. In the experiments summarized in this figure, NAF and DQN are already in the inference phase (i.e., they have been trained previously for 20000 steps), while KBRL is in the learning phase. We see that the oracle policy perfectly fulfills the SLAs using roughly 50% of the resources used by the RL agents. This illustrates the inherent difficulty of the problem due to its stochastic nature. We also observe that, although the baselines improve their performance in the inference phase, KBRL still outperforms them in SLA fulfillment even without previous training.
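The exhaustive search performed by the oracle at each decision stage can be sketched as follows. The function name and the `slas_fulfilled` callback are hypothetical: the callback stands for simulating the system from the current state up to the next stage under a candidate allocation and reporting whether every SLA holds. Returning `None` when no admissible allocation fulfills all the SLAs is a choice of this sketch, since the text does not describe that case.

```python
from itertools import product


def oracle_allocation(capacity, num_slices, slas_fulfilled):
    """Exhaustively search for the cheapest allocation that meets every SLA.

    `slas_fulfilled(allocation)` is a hypothetical callback that simulates one
    decision stage from the current state and returns True when all SLAs hold.
    Returns None if no admissible allocation fulfills them.
    """
    best = None
    for allocation in product(range(capacity + 1), repeat=num_slices):
        total = sum(allocation)
        if total > capacity:
            continue  # violates the global resource constraint
        if best is not None and total >= sum(best):
            continue  # cannot improve on the incumbent
        if slas_fulfilled(allocation):
            best = allocation
    return best
```

The loop visits up to `(capacity + 1) ** num_slices` candidates, which makes concrete why this kind of search only remains tractable for a small number of slices.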

C. Computational Overhead
A usual concern regarding the deployment of reinforcement learning algorithms is the computational overhead that these algorithms may introduce in the system. This is especially relevant in our online learning setting, in which RL agents need to learn from scratch on the operating network, updating their policies/models between decision stages. In this subsection we show an empirical evaluation of the per-stage computation time of our proposal and the RL baselines, both in the learning phase and in the inference phase. For each scenario and each phase, we measured the execution times of the algorithms in 30 learning episodes. The experiments were conducted on an Intel Xeon E5-2650V3 CPU.
Figure 11 shows the average execution time per stage consumed by each algorithm in each scenario during the learning phase. As expected, when scenarios are associated with observations of larger dimension, the execution time is longer. This effect is particularly noticeable in the baseline algorithms, where the execution time doubles from scenario 3 to 2, and from 2 to 1. It is also evident that our proposal introduces much less computational overhead than the baselines during the learning phase. It should be noted that these algorithms, including our proposal, are implemented for experimentation purposes, and are not optimized for production in terms of execution time. What these results show is that computational overhead is not a major obstacle for the deployment of our algorithm, and even less so if a production-optimized implementation is used.
Figure 12 shows the execution time per decision stage during the inference phase. As expected, the baselines notably reduce their computation time in this phase, since the actions are simply obtained by forward propagation on the policy's neural network. The time required by KBRL is of the same order of magnitude as that of the baselines.
The outlier results of NAF are probably due to implementation differences (recall that NAF uses the Keras-RL implementation instead of Stable Baselines).
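A per-stage execution time such as the one reported in this subsection could be measured along the following lines. This is only a sketch of one possible measurement procedure, not the one used in the experiments; `step` is a hypothetical zero-argument callable that wraps one decision stage of an agent (action selection plus, in the learning phase, the model or policy update).

```python
import time


def mean_time_per_stage(step, num_stages):
    """Average wall-clock seconds spent in `step()` per decision stage.

    `step` is a hypothetical zero-argument callable wrapping one stage of an agent.
    """
    elapsed = 0.0
    for _ in range(num_stages):
        start = time.perf_counter()
        step()
        elapsed += time.perf_counter() - start
    return elapsed / num_stages
```

Timing only the call to `step` keeps the environment simulation out of the measurement, so the result reflects the overhead of the agent alone.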

VIII. CONCLUSION
This work has shown that a model-based RL approach can efficiently manage the allocation of RAN resources among network slices, and is especially well suited for online operation. Our proposal, KBRL, combines a one-step lookahead model predictive control with a model that comprises two elements, a classifier and an accuracy estimator for the classifier, both of which are learned from scratch while the network is in operation. This structure for an MBRL agent is novel and presents several advantages: i) it benefits from the high sample efficiency of existing online learning algorithms (in our case, we use a kernel-based algorithm known as Projectron); ii) it enables a sample-augmentation strategy that further enhances the sample efficiency of the learning process; and iii) it manages all the system objectives in parallel (resource efficiency and SLA fulfillment for each slice). These advantages largely outweigh the potential sub-optimality associated with the use of a one-step horizon by the control agent, as shown by our numerical results. In our experiments, we compared KBRL with state-of-the-art RL algorithms, all of which are farsighted, in different scenarios. KBRL outperformed all of the baselines in terms of resource efficiency, SLA fulfillment, and computational overhead, during online learning episodes. We believe that the simplicity and efficiency of our proposal make it suitable for the joint management of various RAN slice resources (e.g., RBs, backhaul capacity, computational resources). This future research line raises interesting challenges such as designing a model capable of handling the interactions between the different parts of the infrastructure.