LACO: A Latency-Driven Network Slicing Orchestration in Beyond-5G Networks

Network slicing is expected to become a game changer in upcoming 5G networks and beyond, enlarging the telecom business ecosystem through still-unexplored vertical industry profits. This implies that heterogeneous service level agreements (SLAs) must be guaranteed per slice given the multitude of predefined requirements. In this paper, we pioneer a novel radio slicing orchestration solution that simultaneously provides latency and throughput guarantees in a multi-tenancy environment. Leveraging a solid mathematical framework, we exploit the exploration-vs-exploitation paradigm by means of a multi-armed-bandit-based (MAB) orchestrator, LACO, that makes adaptive resource slicing decisions with no prior knowledge of the traffic demand or channel quality statistics. As opposed to traditional MAB methods that are blind to the underlying system, LACO relies on system structure information to expedite decisions. After a preliminary simulation campaign that empirically validates our solution, we provide a robust implementation of LACO using off-the-shelf equipment to fully emulate realistic network conditions: near-optimal results within affordable computational time are measured when LACO is in place.


I. INTRODUCTION
The quest for new sources of revenue to revitalize the mobile industry has spawned unprecedented hype around the fifth generation of mobile networks (5G) and, in particular, the network slicing concept. Enabled by software-defined networking (SDN) and network function virtualization (NFV), network slicing allows telco operators to offer virtualized slices of infrastructure resources on demand to heterogeneous 3rd-party services [1]. A high-level view of the system considered in this paper is depicted in Fig. 1. The figure represents a series of sliceable base stations as a pool of radio resources (coloured cubes in the figure). The resource allocation process is hierarchical: while bundles of radio resources (namely, radio slices) are assigned to different tenants, each tenant autonomously schedules its bundle of radio resources to each individual user following classic radio scheduling policies. The difference between such operations is subtle but of paramount importance: a slice controller operates at a larger timescale and thus over a coarser granularity [2], [3]. While most prior work on network slicing focuses on average bit-rate guarantees [3], [4], latency considerations have received little attention. Latency aspects, however, are gaining more and more traction in the quest to meet the requirements of advanced use cases, e.g., autonomous driving and platooning [5] in Vehicle-to-everything (V2X) enabled scenarios. In this context, accurate resource allocation schemes and inter-slice isolation are fundamental features to support the provisioning of latency-constrained services.
Given the plethora of works on classic radio scheduling [6], [7], we keep this aspect out of the scope of this paper and focus instead on the former pressing need: the proper design of an orchestration solution that autonomously assigns chunks of radio spectrum (slices) over relatively larger timescales, with the goal of simultaneously guaranteeing latency and throughput constraints. To the best of our knowledge, there is a non-negligible lack of works focusing on both aspects simultaneously in sliced-network environments.
To fill this gap, we design a LAtency-Controlled Orchestrator (LACO), a network slice controller that maps virtual radio resource allocations to physical resources while still guaranteeing latency requirements (footnote 1). Specifically, LACO augments such prior work by accommodating resources to (granted) slices such that latency agreements are satisfied. This unlocks a new business opportunity for telco operators, who may apply customized pricing models according to the elasticity of the offered slice latency constraints.
Footnote 1: Note that LACO does not compete with state-of-the-art throughput-only slice controllers; in fact, we purposely assume the presence of an admission controller that ensures that the aggregate load incurred by granted slices is within the system capacity region.
Technical challenges. While designing LACO, two sources of uncertainty need to be kept under control: i) the behavioral dynamics of the (aggregated) demand across the involved tenants and ii) the inherent randomness of the wireless channel. These system dynamics have traditionally been modeled either via complex solutions that are hard to solve in realistic settings or via simplistic assumptions at the expense of poor performance. In our work, we explore a novel approach by designing a scheme that learns the implications that allocation decisions have on per-slice latency without explicitly making assumptions on the underlying dynamics. To this aim, we first model our decision-making problem as a Markov Decision Process (MDP) (footnote 2), which allows us to neglect low-level details of the tenant demands and channel dynamics while retaining some knowledge of the consequences that a given action may have on the most immediate next system state.
An MDP model helps us to fully explore the problem features. However, learning the state transition probability matrix of each of the embedded Markov chains incurs prohibitive overhead, as a reinforcement learning agent has to explore the whole space of state-action trajectories: the so-called curse of dimensionality. To address this, we resort to a Multi-Armed Bandit (MAB) model where the attained reward depends only on the action taken from a bounded set of possible actions. Importantly, in contrast to traditional MAB methods, LACO is model-aware (though not model-dependent), i.e., it exploits (abstracted) information regarding the underlying system to expedite the selection of highly rewarding actions, which is particularly attractive when dealing with dynamic non-stationary scenarios.
Footnote 2: With a slight abuse of nomenclature, we refer to a Markov Decision Process (MDP) rather than a Semi-Markov Decision Process (SMDP) despite considering continuous time scales.
The main contributions of our paper can be summarized as follows:
• We introduce a Discrete-Time Markov Chain (DTMC) model to capture the dynamics of the (instantaneous) aggregate slice traffic demand and the wireless channel variations.
• We present a latent variable regression model to accurately anticipate the transition probability matrix of the proposed DTMCs.
• We formulate the dynamic slice resource provisioning as a Markov Decision Process (MDP).
• We design a model-aware Multi-Armed Bandit (MAB) method to guide the decision-making process, which relies on the above DTMC models and anticipated transition probabilities to speed up convergence.
• We present an exhaustive simulations campaign to assess the performance of our approach.
• We implement and field-test our solution using off-the-shelf equipment that emulates real network conditions: LACO shows a performance gain over the considered legacy techniques.
The remainder of the paper is structured as follows. Section II formulates our problem and presents the main building blocks of LACO. Section III introduces a DTMC model that helps us expedite the action-space exploration phase and Section IV analyzes it in depth. In Section V, we introduce our decision process as a Markov Decision Process (MDP) and present a model-aware Multi-Armed Bandit decision-making engine integrated in LACO. Section VI presents our preliminary simulation campaign to validate the design principles of LACO, whereas Section VII details the implementation of our novel solution on off-the-shelf equipment with realistic network performance. Finally, Section VIII summarizes related literature and Section IX concludes the paper with some final remarks.

II. LACO: THE FRAMEWORK OVERVIEW
Our solution relies on the concept of slicing-enabled networks wherein multiple network tenants are willing to obtain a network slice with predefined service level agreements (SLAs).
Such SLAs may be expressed in terms of maximum slice throughput and average access latency.
Within the context of our paper, we define the average access latency as the time that traffic belonging to a certain slice needs to wait before being served due to scheduling procedures. In particular, we focus on the radio access network (RAN) domain and design LACO, a RAN controller that dynamically provisions spectrum resources to admitted network slices while providing latency guarantees. In the following, we overview the main system building blocks with detailed notation and assumptions.

A. Business scenario
We consider different entities in our system: i) an infrastructure provider owning the physical infrastructure who offers isolated RAN slices as a service, ii) tenants who acquire and manage slices with given SLAs to deliver services to end-users, and iii) end-users, who demand radio resources from such tenants/slices.
Let us define $I$ as the set of running network slices and $U_i$ as the set of end-users associated with the $i$-th slice. The total amount of wireless resources (radio spectrum) is split into multiple non-overlapping network slices, each one belonging to one single tenant $i \in I$ (footnote 3). Based on fixed SLAs, each network slice is characterized by a maximum throughput and an expected latency denoted by $\Lambda_i$ and $\Delta_i$, respectively. We assume that an admission control process (footnote 4) is concurrently running on a higher tier so that the average aggregate load can be accommodated within the overall system capacity.

B. Notation
We use conventional notation. We let $\mathbb{R}$ and $\mathbb{Z}$ denote the sets of real and integer numbers. We use $\mathbb{R}_+$, $\mathbb{R}^n$, and $\mathbb{R}^{n \times m}$ to represent the sets of non-negative real numbers, $n$-dimensional real vectors, and $n \times m$ real matrices, respectively. Vectors are denoted as column vectors and written in bold font. Subscripts represent an element in a vector and superscripts elements in a sequence. For instance, $\mathbf{x}^{(t)} = [x_1^{(t)}, \dots, x_n^{(t)}]^T$ is a vector from $\mathbb{R}^n$ and $x_i^{(t)}$ is its $i$-th component. Finally, $\mathbf{1}$ and $\mathbf{0}$ indicate an all-ones and an all-zeroes vector, respectively, and $\lceil \cdot \rceil$ is the ceiling operation.
Footnote 3: We assume a one-to-one mapping between slices and tenants. Therefore, we use $i \in I$ interchangeably throughout the paper as a tenant identifier or its associated slice. Note that this assumption can be easily relaxed in the model.
Footnote 4: Given the plethora of solutions in the literature, the admission control design is out of the scope of this work. We refer the reader, for example, to [2], [4] for more details.

C. Problem Definition
Assuming that an instance of LACO is executed per base station (BS) as shown in Fig. 1, we focus our problem design and performance evaluation on a single BS characterized by a capacity $C$, which is the sum of a discrete set of available physical resource blocks (PRBs) of fixed bandwidth. This resource availability must be divided into subsets of PRBs (i.e., slices), and our job is to dynamically assign such subsets to each network slice $i \in I$. We refer to such an assignment as the configuration of slice $i$, denoted by the variable $y_i$. Obviously, we shall guarantee $\sum_{i \in I} y_i \le C$. For the sake of clarity, we summarize all mathematical variables used throughout the paper in Table I.
The decision epoch duration may be decided according to the infrastructure provider policies, ranging from a few seconds up to several minutes. While the admission controller (pre-)selects a subset of slices that can co-exist without exceeding the capacity of the system on average, the dynamic nature of the slices' load and of the wireless channel may cause instantaneous load surges or channel quality fading effects and hence induce a non-zero mean delay.
We denote the experienced instantaneous signal-to-noise ratio (SNR) of slice $i$ (averaged across all users of the slice) and the instantaneous aggregate traffic demand within time-slot $n$ as $\gamma_i^{(n)}$ and $\lambda_i^{(n)}$, respectively. As each tenant $i$ may show different behavior in terms of wireless channel evolution (according to $\theta_i$) and traffic demands (according to $\rho_i$), we also assume $\gamma_i^{(n)} \sim f(x, \theta_i)$ and $\lambda_i^{(n)} \sim f(x, \rho_i)$. Formally, the above-described problem (Problem LATENCY-CONTROL) becomes: [...] where $\zeta(\cdot)^{(n)}$ is a mapping function that returns the number of bits that can be served using the allocated number of PRBs ($y_i^{(n)}$). The objective is to find slice configurations such that the expected total non-served traffic demand is minimized. Hereafter, whenever it is evident from context, we drop the superscript $(n)$ to reduce clutter.
To address the problem, we rely on a two-layer scheduling approach commonly adopted in the network slicing context [3], [9]. On the one side, an inter-slice scheduler is in charge of defining the PRB allocation strategy to meet the networking requirements while ensuring resource isolation among slices. On the other side, a lower-layer intra-slice scheduler enforces the assignment of the pre-allocated subset of PRBs to the connected end-users. Our work mainly focuses on the higher-level inter-slice scheduler, leaving the implementation of intra-slice scheduling strategies open to address tenant-specific requirements.
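As a rough illustration of the objective described above (minimizing the non-served traffic demand subject to the PRB capacity), the following sketch evaluates two candidate configurations for a single epoch. The mapping `zeta` used here is a hypothetical Shannon-like stand-in for the MCS-based mapping, and every numeric value is an assumption made for illustration only.

```python
import math

def zeta(snr_db, n_prbs, bits_per_prb=100.0):
    """Hypothetical mapping: bits servable with n_prbs at a given SNR (dB).

    Uses log2(1 + snr) as a spectral-efficiency proxy. This is an
    assumption for illustration, not the mapping used in the paper.
    """
    snr_linear = 10.0 ** (snr_db / 10.0)
    return n_prbs * bits_per_prb * math.log2(1.0 + snr_linear)

def unserved_demand(demands, snrs_db, config, capacity):
    """Total non-served bits in one epoch under slicing configuration `config`."""
    if sum(config) > capacity:
        raise ValueError("configuration exceeds the PRB capacity C")
    return sum(max(0.0, lam - zeta(g, y))
               for lam, g, y in zip(demands, snrs_db, config))

demands = [12000.0, 3000.0]   # lambda_i, bits per epoch (illustrative)
snrs_db = [10.0, 0.0]         # gamma_i, average SNR per slice (illustrative)
print(unserved_demand(demands, snrs_db, config=[30, 20], capacity=50))
print(unserved_demand(demands, snrs_db, config=[40, 10], capacity=50))
```

Under these made-up numbers, shifting PRBs towards the slice with the larger demand lowers the total non-served traffic, which is the kind of trade-off the inter-slice scheduler has to resolve at every decision epoch.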

D. Working flow
For a given slot $n$, problem LATENCY-CONTROL can be easily linearized (footnote 5) and solved with standard optimization tools. However, this approach may exhibit sub-optimal behavior in future epochs if the statistical distributions $f(x, \theta_i)$ and $f(x, \rho_i)$ are not stationary. Hence, we propose [...]
[...] user $\bar{u}_i$ with an aggregate traffic demand resulting from the set of users $u_i \in U_i$ belonging to slice $i$ (footnote 6). We also assume a finite number of channel quality levels $G$, which may bound each instantaneous user channel quality $\gamma_i$, as depicted in Fig. 3. This is a system design choice that allows operators to trade off accuracy for convergence speed, ranging from a fine-grained scale (large $G$), e.g., by letting each channel quality level correspond to a modulation and coding scheme (MCS) as defined in the 3GPP standard document [8], to a coarse-grained scale that may capture the channel variation behavior with limited accuracy, as detailed in Section IV.
Footnote 5: Function $\zeta(\cdot)$ can be easily approximated with a linear function by applying piece-wise linearization.
Footnote 6: This assumption can be readily relaxed by considering the convolution of the single cumulative distribution functions of every user channel and demand variation [10].

Markov Chain
Let us consider a discrete-time stochastic process $X_t$ (footnote 7) that takes values from a finite and discrete state space, which is denoted by $S = \{S_{g,d} : g \in G, d \in \{0, 1\}\}$ (footnote 8). In particular, a realization of $X_t$ when visiting state $S_{g,d}$ represents the virtual user $\bar{u}_i$ experiencing channel level $g \in G$ with an associated delay exceeding the one specified by the slice SLA ($d = 1$) or otherwise ($d = 0$). When the wireless channel is modeled as Rayleigh distributed, it is common practice to model its variations as visits to consecutive states, since the channel does not vary faster than the Markov chain time-slot [11]. Hence, we define the probability of the user channel condition improving from level $g$ to level $g + 1$ as $p_{g,g+1}$, and the probability of it degrading from level $g$ to level $g - 1$ as $q_{g,g-1}$. As shown in recent works such as [12], [13], accurate scheduling strategies might mitigate the interference effects coming from multiple base stations serving the same sets of slices, thus improving the overall channel conditions. However, such schemes introduce additional complexity and synchronization overhead, which hardly fit our view of a lightweight, base-station-oriented solution. Last, given the physical resource blocks $y_i$ assigned to a particular slice, the channel quality level $g$, and the overall traffic demand within the time-slot, we model the probability of incurring a delay constraint violation as $m_g$ and the probability of keeping the access delay within the agreed bound as $l_g$.
Footnote 7: The time scale $t$ of the DTMC state switch is much shorter than the decision epoch $n$ used in the MDP described in Section V.
Footnote 8: Each DTMC is defined within a state space $S_i$. We remove the index $i$ to limit the clutter, as the analysis can be easily extended to any other slice $i$.
This process can be formulated as a two-dimensional DTMC $M := (S, P)$, where $P$ denotes the following transition probability matrix: [...] Note that we assume $p_{G,G+1} = q_{1,0} = 0$ and that each square block $K_x$, $x \in \{m, l\}$, $M$ and $L$ has size $[G \times G]$, so that the square matrix $P$ has dimension $[2G \times 2G]$. Without loss of generality, we assume that such transition probabilities do not depend on the particular time-slot we are evaluating.
Thus, we define our DTMC as a time-homogeneous Markov chain where the process $X_t$ evolves according to $\Pi(t) = \Pi(0) P^t$, where the row vectors $\Pi(t)$ and $\Pi(0)$ represent the first-order state probability distributions at time $t$ and $0$, respectively. In order to evaluate the long-term behavior of our system, we need to calculate the steady-state probability $\Pi^* = \{\pi_s^*\}$ of being in each of the defined states. It yields that [...] (2)
The above-described Markov chain is irreducible, as every state can reach any other state through the available paths. Therefore, by stochastic theory, if a Markov chain is irreducible and aperiodic, the steady-state probability distribution $\Pi^*$ always exists, is unique, and is independent of the initial conditions.
Recalling the total probability theorem and using Eq. (1), we calculate the steady-state probability distribution as the solution of the following equations
$$\Pi^* (P - 1_{diag}) = 0, \qquad \sum_{s} \pi_s^* = 1, \tag{3}$$
where $1_{diag}$ is the identity matrix.
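The steady-state distribution can also be approximated numerically by iterating the evolution $\Pi(t+1) = \Pi(t) P$ until it stops changing, and then checking that the result satisfies the balance condition $\Pi^* P = \Pi^*$. The sketch below does this in plain Python for a toy three-state chain; the matrix values are illustrative assumptions and do not correspond to the $[2G \times 2G]$ matrix of the paper.

```python
def step(pi, P):
    """One DTMC step: returns the row vector pi * P."""
    n = len(P)
    return [sum(pi[r] * P[r][c] for r in range(n)) for c in range(n)]

def steady_state(P, tol=1e-12, max_iter=100_000):
    """Iterate Pi(t+1) = Pi(t) P from a uniform start until convergence."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = step(pi, P)
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("no convergence; the chain may be periodic or reducible")

# Toy chain where only neighbouring states communicate (illustrative values).
P = [[0.7, 0.3, 0.0],
     [0.2, 0.5, 0.3],
     [0.0, 0.4, 0.6]]
pi_star = steady_state(P)
print(pi_star)
# Balance check: applying P once more should leave pi_star unchanged.
print(max(abs(a - b) for a, b in zip(step(pi_star, P), pi_star)))
```

Power iteration is chosen here only because it needs no linear-algebra library; for large state spaces a direct linear solve of the balance equations is usually preferable.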

IV. DTMC MONITORING AND PREDICTION
The asymptotic behavior of a Markov chain depends on the transition probability matrix $P$, which in turn depends on the stochastic processes of the slice traffic demands and wireless channel variations. While several models have already been defined in the literature to derive such probabilities [14], the latency control objective and the need for an accurate estimation exacerbate the problem and render model-fitting approaches impractical. This brings additional complexity and delays the convergence process to the optimal solution.
We apply the concept of unsupervised learning to estimate the transition probabilities based on previous observations. In particular, we rely on the well-known theory of probabilistic latent variables [15]. Let us consider $w \in W$ as the stochastic latent variable denoting the current channel quality level. Formally, we redefine the transition probability of the above-described DTMC as [...] that is, the probability of moving from state $S_{g,a}$ to $S_{g,b}$ when the channel level is exactly $g = w$. To easily understand this, note that $\rho^g_{0,1} = m_g$ and $\rho^g_{1,0} = l_g$, whereas $\rho^g_{0,0}$ and $\rho^g_{1,1}$ are the probabilities of staying within the same state $S_{g,0}$ and $S_{g,1}$, respectively. We use an expectation-maximization technique to estimate such probabilities. To this aim, we count the transitions between $a$ and $b$ at level $g$ in $h^g_{a,b}$, based on the number of times $X_t$ switches to another state (or stays within the same state) between $t$ and $t + 1$. We then derive the a posteriori probability as follows [...] and the likelihood probability as the following [...] and [...] The above system of equations can be solved using an iterative method that yields $\rho^g_{a,b}$. Finally, we calculate the weight of each latent variable based on a given set of previous observations as per the following equation [...] (8) where $\hat{S}_i$ denotes the history of transitions (or lack thereof) across $X_t$ among different states belonging to level 0 or 1 in the DTMC depicted in Fig. 3. We can generalize the probability of moving from a state wherein the latency is under control, $S_{g,0}$, to a state incurring unexpected latency, $S_{g,1}$, i.e., exceeding the threshold defined in the slice SLA, using the following expression [...] (9)
In the next section, we design a control-theory process by means of a Markov Decision Process [...]
[...] of PRBs, where $\sum_{i \in I} y_i = C$, i.e., the overall capacity is exactly split between running slices.
We assume that each slicing configuration is issued at every decision epoch $n$. The transition function characterizes the dynamics of the system from state $\sigma$ to state $\sigma'$ through action $\phi$. Analytically, $P(\sigma' \mid \sigma, \phi)$ is the probability of visiting state $\sigma'$ given the previously visited state $\sigma$ and the action $\phi$. Finally, the function $R(\sigma, \phi)$ measures the reward associated with performing action $\phi$ from the current state $\sigma$. We consider an MDP with an infinite time horizon. Future rewards are discounted by a factor $0 < \chi < 1$ to ensure that the total reward obtained is finite.
When dealing with MDPs, it is common practice to define a "policy" for the decision agent, namely a function $P^{(n)} : \Sigma^{(n)} \to \Phi^{(n)}$ that specifies which action $\phi$ to perform at time $n$ when in state $\sigma$. Once the Markov decision process is combined with a given policy, the action taken in each state is fixed, so that the resulting combination behaves as a Markov chain. The final aim of the decision agent is to find the policy that maximizes the expected total reward or, equivalently, to discover the policy $P^*$ that maximizes the value function.

A. Reward Definition
Each state (or slicing configuration) is associated with a reward value that influences the agent during the decision process. The rationale is that we need to bind the action reward to the probability of exceeding the latency constraints defined in the slice SLA. In the following, we introduce the reward function used in our experiments with a detailed overview of its behavior.
Given a slicing configuration $c_\sigma = \{y_i \mid i \in I\}$, we can analytically build a Discrete-Time Markov Chain, as described in Section III. If the associated transition probability matrix $P$ is perfectly known, we can also derive the steady-state probabilities $\Pi^* = \{\pi_s^*\}$ of being within any single state using Eq. (3). Thus, we can compute the probability of having the access latency of our system under control. This can be used to formulate the instantaneous reward value [...] (10) where $s$ is the index of all states $S_{g,0}, \forall g \in G$, such that the slice latency is under control, whereas $\eta \in [0, 1]$ is an adjustable value decided by the infrastructure provider to provide action fairness in the reward function when $\eta$ tends to 0, or maximum likelihood of keeping latency under control when $\eta$ tends to 1. Our objective is then to maximize the expected aggregate reward, obtained as $\lim_{N \to \infty} \sum_{n=1}^{N} E[\chi^n R(\sigma^{(n)}, \phi^{(n)})]$. However, given the fully-connected structure of our Markov Decision Process, i.e., all states are reachable from any MDP state, our objective is equivalent to maximizing the instantaneous reward given by Eq. (10) at each decision epoch $n$.
Nonetheless, the assumption of perfect knowledge of the transition probability matrix $P$ might not be realistic. Therefore, we need to rely on the transition probabilities $\rho_{a,b}$ inferred from previous observations, as explained in Section IV, Eq. (9). The larger the set of observations, the higher the accuracy of our probability estimation and the higher the reward attained by the instantaneous best action taken by the MDP.
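To make the dependence on past observations concrete, the sketch below estimates the per-level transition probabilities from a recorded sequence of visited states $(g, d)$. It deliberately replaces the expectation-maximization procedure of Section IV with a plain count-and-normalize estimate over the counters $h^g_{a,b}$; this simplification, the function name `estimate_rho`, and the sample history are all assumptions for illustration.

```python
from collections import defaultdict

def estimate_rho(history):
    """Estimate rho[g][(a, b)] from a sequence of observed (g, d) states.

    h[g][(a, b)] counts transitions from delay flag a to delay flag b seen
    while the channel stays at level g. Transitions that change the channel
    level are skipped (simplifying assumption).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for (g0, a), (g1, b) in zip(history, history[1:]):
        if g0 == g1:
            counts[g0][(a, b)] += 1
    rho = {}
    for g, h in counts.items():
        rho[g] = {}
        for a in (0, 1):
            total = h[(a, 0)] + h[(a, 1)]
            for b in (0, 1):
                # Fall back to an uninformative 0.5 when (g, a) was never observed.
                rho[g][(a, b)] = h[(a, b)] / total if total else 0.5
    return rho

history = [(1, 0), (1, 0), (1, 1), (1, 0), (2, 0), (2, 0), (2, 1), (2, 1)]
rho = estimate_rho(history)
print(rho[1][(0, 1)])  # estimate of m_g for g = 1
print(rho[1][(1, 0)])  # estimate of l_g for g = 1
```

With only a handful of observations per level the estimates are coarse, which is consistent with the remark above that accuracy grows with the size of the observation set.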

B. Complexity analysis
Once we have fully characterized our proposed MDP, we can solve it by using dynamic programming solutions such as Value Iteration [16]. These approaches require exploring the entire state space of the MDP (several times) and the associated rewards. Let us consider a scenario with $I$ online slices running in our system. Assume that each slice configuration $y_i$ can take values from integer multiples of a minimum PRB chunk size $\Theta$ and that the slicing configuration must be consistent, i.e., $\sum_{i \in I} y_i = C$. Then, the overall number of states is equal to [...] This poor state scalability, also known as the curse of dimensionality, compromises the feasibility of MDP models under practical conditions. However, MDPs provide insights regarding the structure of the problem itself and are very helpful to design auxiliary solutions, such as Multi-Armed Bandit (MAB) models, which are better suited for functional deployments.
Therefore, in the next section we rely on a novel MAB design that exploits information from the underlying MDP to expedite the learning process while attaining near-optimal results.
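The growth of the state space discussed in this subsection can be illustrated with a short counting sketch. It assumes that a slice may be assigned zero PRBs and that $C$ is a multiple of $\Theta$, in which case the number of consistent configurations follows the classic stars-and-bars count; whether this matches the exact expression of the paper cannot be confirmed from the text, so the function names and the formula should be read as assumptions.

```python
from itertools import product
from math import comb

def num_configurations(capacity, chunk, num_slices):
    """Count allocations y_i in {0, chunk, 2*chunk, ...} with sum(y_i) = capacity.

    Stars-and-bars count, assuming zero-PRB allocations are allowed and
    that `capacity` is a multiple of `chunk`.
    """
    if capacity % chunk:
        raise ValueError("capacity must be a multiple of the chunk size")
    k = capacity // chunk
    return comb(k + num_slices - 1, num_slices - 1)

def enumerate_configurations(capacity, chunk, num_slices):
    """Brute-force enumeration, only feasible for small instances."""
    levels = range(0, capacity + 1, chunk)
    return [c for c in product(levels, repeat=num_slices) if sum(c) == capacity]

# 100 PRBs in chunks of 5 PRBs: the count grows quickly with the number of slices.
for slices in (2, 3, 4, 5):
    print(slices, num_configurations(100, 5, slices))
```

Even under this simple model, adding a few slices multiplies the number of candidate configurations, which is why exhaustive dynamic programming quickly becomes impractical.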

C. Multi-armed Bandit problem
The online decision-making problem has been addressed in the past with several mathematical tools [17]. The limited information about real-time channel quality and effective traffic demand forces the operator to choose, like a gambler facing diverse options to play, the number of radio resources to assign to each running slice. This automatically falls into the fundamental exploration-vs-exploitation dilemma: the gambler needs to carefully balance the exploitation of known slicing configurations that provided the best revenues in the past against the exploration of new slicing configurations that might eventually produce higher revenues.
This class of decision processes can be formulated as a Multi-Armed Bandit (MAB) problem, which emulates the action of selecting the (single) bandit (or slot machine) that may return the best payoff. Each slot machine returns unpredictable revenues drawn from a fixed statistical distribution, not known a priori, that is iteratively inferred from previous observations. This matches well the randomness of the channel quality and the traffic demand we aim to capture, whereas each bandit can be mapped onto a state of the MDP, i.e., a specific slicing configuration. The final objective of such a problem is to maximize the overall gain after a finite number of rounds.
This class of problems is usually assessed by a defined metric called regret Ω, which is defined as the difference between the reward that can be gained by an optimal oracle, i.e., using an optimal policy that knows the reward distributions a priori, and the expected reward of the myopic online policy.
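As a small worked example of this definition, the sketch below computes the expected cumulative regret of a sequence of arm choices when the mean reward of every arm is known, which is the oracle's viewpoint. The mean rewards and the pull sequence are made-up values used only for illustration.

```python
def cumulative_regret(mean_rewards, pulls):
    """Expected regret: sum over rounds of (best mean - mean of the pulled arm)."""
    best = max(mean_rewards)
    return sum(best - mean_rewards[arm] for arm in pulls)

mean_rewards = [0.9, 0.6, 0.3]   # arm 0 is what the oracle would always play
pulls = [2, 1, 0, 0, 0]          # an online policy that explores twice first
print(cumulative_regret(mean_rewards, pulls))
```

Only the two exploratory pulls contribute to the regret in this example; a good online policy keeps the number of such sub-optimal pulls growing slowly with the number of rounds.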
Reusing notation from our MDP model, let us define each arm $\sigma \in \Sigma$ as a different slicing configuration $c_\sigma = \{y_i \mid i \in I\}$. Once selected, each arm provides an instantaneous reward $R(\sigma)$ defined as the following [...] (11) where the slicing configuration is $y_i \in c_\sigma$, $\zeta(\cdot)$ computes the number of bits that can be served using the $y_i$ configuration given the current channel quality $\gamma_i$, and $\lambda_i$ is the slice traffic demand, as described in Section II-C.
While using such a reward function requires low overhead, as it only needs to calculate the incurred latency after selecting a slicing configuration, it only converges to a near-optimal solution after exploring several configurations, which results in overly long training periods (as shown in Section VI). This is an inherent issue with classic MAB methods, which are blind to the underlying system structure. Conversely, in this paper we resort to a novel model-assisted approach that exploits the system model of Section V-A to guide the exploration/exploitation process with (abstract) system information. In this way, as opposed to using the traditional reward model of Eq. (11), we define our bandit's reward as the expectation of the access latency exceeding the slice SLA, as defined in Eq. (10). This has a two-fold advantage: i) during the initial training period, the DTMC associated with each state of the MDP is updated (and enhanced) with more accurate values of the transition probabilities: this helps to find steady-state probabilities (and in turn an updated reward per slicing configuration) that reflect the real behavior of our system as time goes on; and ii) the slicing configuration selection accounts directly for the stochastic behaviors of both channel quality and traffic demand, while reducing the state space to those configurations that may benefit the entire system.
Many algorithms have been proposed to optimally solve the MAB problem while learning from previous observations [18]. One of the main issues is that collecting rewards on a short-time basis may negatively affect the choice of the best bandit. Thus, we rely on a modified version of the so-called Upper Confidence Bound (UCB) algorithm devised in [19], which overcomes this issue by accounting not only for the rewards collected up to the current time interval, but also for the confidence in the reward distribution estimates, by keeping track of how many times each bandit has been selected, $z_{\sigma,n}$. The pseudo-code is listed in Algorithm 1.
Initially, we explore all bandits, i.e., slicing configurations $\sigma \in \Sigma$, to get a consistent reward (lines 2-6). Then we select the configuration that maximizes the empirical distribution $\hat{\rho}_\sigma$ while accounting for a confidence value. This confidence value depends on the number of times we have explored that particular configuration as well as on the accuracy of the transition probabilities we calculate for the associated DTMC. Note that this differs from traditional UCB algorithms.
Specifically, we define a Markov accuracy value $\psi(\sigma) = [...]$, where $|W|$ represents the cardinality of the set $W$. Note that $\psi(\sigma)$ depends on the weights $\omega(\cdot)$ obtained through the performed observations $\hat{S}_i$, as reported in Eq. (8). Interestingly, $\psi(\sigma) \in (0, 1]$, i.e., when the DTMC has no relevant observations to build its transition probabilities this function returns [...]. We denote $\sigma^*$ as the arm providing the maximum average reward such that $\hat{\rho}_{\sigma^*} > \hat{\rho}_\sigma, \forall \sigma \ne \sigma^*$. If the arm selection is performed using LACO, the regret is obtained as [...] where $P_{LACO} = \{\sigma_n\}$ is the policy, as defined in Section V, consisting of the set of moves that LACO plays at time $n$, whereas $z_{\sigma,n}$ is the overall number of decision epochs in which arm $\sigma$ has been pulled up to time instant $n$. Now consider LACO as a uniformly good policy, i.e., any suboptimal arm $\sigma \ne \sigma^*$ is chosen by our policy up to round $n$ so that $E[z_{\sigma,n}] = o(n^\alpha), \forall \alpha > 0$.

It holds that [...] Hence, we can express the regret lower bound as the following [...] where $Div(\hat{\rho}_\sigma, \hat{\rho}_{\sigma^*})$ is the Kullback-Leibler divergence, which measures how one probability distribution diverges from the other.
Now consider Hoeffding's inequality for multiple i.i.d. variables $x_n$ with mean $\mu$. It yields [...] Our algorithm LACO applies an upper confidence interval $\delta = \sqrt{2 \log(\sum_k z_k) / z_\sigma}$. Therefore, it yields that [...] and also that [...] We can then derive the expected number of times a sub-optimal arm $\sigma \ne \sigma^*$ is pulled as follows [...] and the regret upper bound as the following [...]
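For reference, the classic UCB1 rule on which this analysis builds can be sketched as follows. This is the standard algorithm of [19] applied to Bernoulli arms, not the modified LACO variant: the Markov accuracy value $\psi(\sigma)$ and the DTMC-based reward are not reproduced, and the arm means, the seed, and the function name are assumptions chosen for illustration.

```python
import math
import random

def ucb1(arm_means, rounds, seed=0):
    """Standard UCB1 on Bernoulli arms; returns the pull count of each arm."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k

    def pull(arm):
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward

    # Initial exploration: play every arm once.
    for arm in range(k):
        pull(arm)
    for t in range(k, rounds):
        # t equals the total number of pulls so far; the score is the
        # empirical mean plus the confidence radius sqrt(2 log t / z_arm).
        scores = [sums[a] / counts[a] + math.sqrt(2.0 * math.log(t) / counts[a])
                  for a in range(k)]
        pull(max(range(k), key=scores.__getitem__))
    return counts

counts = ucb1([0.2, 0.5, 0.8], rounds=5000)
print(counts)
```

The confidence radius shrinks as an arm is pulled more often, so rarely explored arms are revisited occasionally while most rounds go to the arm with the highest empirical mean.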

VI. PERFORMANCE EVALUATION
In this section, we evaluate our solution through an exhaustive simulation campaign that takes into account complexity, revenue and SLA violation metrics.

A. Simulations setup
To assess heterogeneous slices, we simulate the network load demand of slice $i$ at each time-slot (i.e., each transmission time interval (TTI) in Long Term Evolution (LTE) systems) by drawing a random value from a Normal distribution $N_i(\mu_i, \nu_i^2)$, where $\mu_i$ and $\nu_i$ represent the mean value and standard deviation, and let $L_i$ describe its latency constraint. Moreover, we model the SNR channel variation as another random variable drawn from a Rayleigh distribution and derive the probability distribution encompassing the whole SNR range. For every channel instantiation, we extract the corresponding Modulation and Coding Scheme (MCS) as defined by the 3GPP standard (footnote 9). The MCS index $m \in M$ combines one possible modulation scheme and a predefined coding rate, providing a compact way to represent a simple concept: the better the radio conditions, the more bits can be transmitted per time unit, and vice versa. For a fixed channel bandwidth, the expected average throughput achievable by one slice during one epoch depends on both the modulation and coding schemes used and, most importantly, on the number of PRBs reserved for the slice. In a wider timescale (footnote 10), the average capacity can be approximated as $C_i = \sum_{m \in M} \Gamma_m \pi_{m,i} T_i y_i$, where $\Gamma_m$ represents the average number of bits per LTE subframe that can be transmitted using the $m$-th MCS index, $\pi_{m,i}$ is the steady-state probability distribution output by the first-stage Markov chain model, $T_i$ defines the decision interval size, and $y_i$ accounts for the number of PRBs allocated to the $i$-th slice. We refer the reader to Table I. In the LTE radio interface, the maximum number of PRBs is fixed to 100 when operating at the conventional bandwidth of 20 MHz. In order to support massive machine-type communication and Ultra-Reliable Low-Latency Communication (URLLC) use cases, 5G New Radio (NR) introduces significant enhancements in the radio frame composition.
Not only 5G NR will support wider channel bandwidth (up to 100 MHz), but also introduce the support for multiple different types of subcarrier spacing. For back-compatibility reasons, even in 5G NR the time duration of radio frames and subframes are fixed to 10 ms and 1 ms, 9 We refer the reader to [8] for an exhaustive explanation of the mapping between SNR and MCS. 10 Note that we assume a timescale larger than our epochs used in the decision-making process.
respectively [20]. The number of slots within each subframe however would change according to the subcarrier configuration, which eventually translates in shorter PRB time duration and thus a different PRB availability depending on the selected configuration. It must be noticed that all the subcarrier spacing are defined as ∆f = 2 j · 15 KHz, j = {0, . . . , 4}, thus leading at the definition of time-frequency grids containing an amount of PRBs which is multiple of those contained in the traditional LTE grids. In this context, we assume a simple mapping function, as the one described in [21], implemented at intra-slice scheduler to homogenize the resources of potentially heterogeneous radio access technologies.
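As a rough illustration of the capacity approximation $C_i$ introduced above, the following Python sketch evaluates the sum for a single slice. The function name `slice_capacity`, the per-MCS payloads in `gamma_bits`, and the probabilities in `pi` are hypothetical placeholders, not values from Table I or from 3GPP tables. We also assume here that $\Gamma_m$ is expressed in bits per subframe per PRB and that $T_i$ is counted in 1 ms subframes.

```python
def slice_capacity(gamma_bits, pi, t_subframes, n_prbs):
    """Approximate C_i = sum_m Gamma_m * pi_{m,i} * T_i * y_i.

    gamma_bits  -- assumed bits per subframe per PRB for each MCS index m
    pi          -- steady-state probability of each MCS index for this slice
    t_subframes -- decision interval size T_i, in subframes (assumption)
    n_prbs      -- number of PRBs y_i allocated to the slice
    """
    if len(gamma_bits) != len(pi):
        raise ValueError("gamma_bits and pi must have the same length")
    if abs(sum(pi) - 1.0) > 1e-9:
        raise ValueError("pi must sum to 1")
    # Expected payload of one PRB in one subframe, averaged over the MCS states.
    expected_bits = sum(g * p for g, p in zip(gamma_bits, pi))
    return expected_bits * t_subframes * n_prbs


# Hypothetical three-MCS example: low, medium and high spectral efficiency.
gamma_bits = [16.0, 72.0, 160.0]
pi = [0.2, 0.5, 0.3]
capacity = slice_capacity(gamma_bits, pi, t_subframes=1000, n_prbs=10)
print(f"approximate capacity: {capacity:.0f} bits per decision interval")
```

Since the approximation is linear in $y_i$, doubling the number of PRBs doubles the estimated capacity, which is what makes the PRB allocation the main control knob of the slice controller.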
Traffic demands are compared with the current channel availability to derive the probabilities of passing from one state to another. Note that the accuracy of the resulting steady-state distribution strictly depends on the precision of this comparison. For this reason, we constantly monitor and update the transition probabilities of the Markov chain based on the resource allocation adopted in the current decision interval. During arm selection, if the chosen configuration does not provide enough resources to meet the latency requirements, the steady-state probability mass will be mostly concentrated in the lower part of the Markov chain, leading to a lower reward that, in turn, guides the MAB agent to take a different action (i.e., to select a different arm) in the following decision round.
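The link between resource allocation and steady-state distribution can be illustrated with a toy example. The sketch below is not the Markov chain used by LACO: `backlog_chain` is a hypothetical birth-death chain in which the backlog grows with probability `p_up` (standing in for an under-provisioned slice) and shrinks otherwise, and `steady_state` is a plain power iteration. It only shows how the probability mass moves towards backlogged states when the allocation is insufficient.

```python
def steady_state(P, tol=1e-12, max_iter=100_000):
    """Stationary distribution of a row-stochastic matrix P via power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


def backlog_chain(p_up, n_states=4):
    """Toy chain: state s is a backlog level; it grows w.p. p_up, else shrinks."""
    P = [[0.0] * n_states for _ in range(n_states)]
    for s in range(n_states):
        up = min(s + 1, n_states - 1)   # saturate at the highest backlog level
        down = max(s - 1, 0)            # saturate at the empty-queue level
        P[s][up] += p_up
        P[s][down] += 1.0 - p_up
    return P


well_provisioned = steady_state(backlog_chain(p_up=0.2))
under_provisioned = steady_state(backlog_chain(p_up=0.8))
print("mass on empty queue (well provisioned):  %.3f" % well_provisioned[0])
print("mass on full backlog (under provisioned): %.3f" % under_provisioned[-1])
```

In this toy setting, a reward computed from the mass on the low-backlog states would drop sharply for the under-provisioned allocation, which is the kind of signal the MAB agent uses to move away from a poor arm.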
For benchmarking purposes, we implement two widely used MAB algorithms, namely "legacy" UCB and Thompson Sampling (TS) (see footnote 11). On the one hand, UCB adopts a deterministic approach to deal with the exploration-vs-exploitation dilemma, but its performance generally degrades as the number of arms increases. On the other hand, Thompson Sampling adopts a probabilistic approach that scales better with the number of arms, but it may provide sub-optimal results when the reward distribution changes over time (i.e., in non-stationary scenarios). LACO, in contrast, combines the advantages of both by adopting a probabilistic model (MDP) that guides an exploration phase derived from UCB.
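For readers unfamiliar with the two baselines, the sketch below shows their textbook forms on Bernoulli rewards: the UCB1 index (empirical mean plus a confidence bonus) and Beta-Bernoulli Thompson Sampling. This is a generic illustration, not the benchmark code used in our simulations and not LACO itself; the helper names `ucb1_select`, `thompson_select` and `run`, as well as the reward probabilities, are hypothetical.

```python
import math
import random


def ucb1_select(counts, sums, t):
    """Return the arm with the largest UCB1 index at round t (1-based)."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # play every arm once before using the index
    return max(
        range(len(counts)),
        key=lambda a: sums[a] / counts[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
    )


def thompson_select(successes, failures, rng):
    """Sample each arm's Beta posterior and return the arm with the best sample."""
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)


def run(policy, probs, horizon, seed=0):
    """Play `horizon` rounds on Bernoulli arms; return how often each arm was played."""
    if policy not in ("ucb", "ts"):
        raise ValueError("policy must be 'ucb' or 'ts'")
    rng = random.Random(seed)
    counts = [0] * len(probs)
    wins = [0] * len(probs)
    for t in range(1, horizon + 1):
        if policy == "ucb":
            arm = ucb1_select(counts, wins, t)
        else:
            losses = [c - w for c, w in zip(counts, wins)]
            arm = thompson_select(wins, losses, rng)
        reward = 1 if rng.random() < probs[arm] else 0
        counts[arm] += 1
        wins[arm] += reward
    return counts


for name in ("ucb", "ts"):
    print(name, run(name, probs=[0.2, 0.5, 0.9], horizon=2000, seed=1))
```

Note that UCB1 must play every arm at least once before its index is defined, which is one reason why its start-up cost grows with the size of the action space.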

B. Multi-armed bandit problem behavior
We first explore the trade-off between the action space (and its granularity) and the associated reward loss. To this aim, we set up a simple experiment with 2 slices with equal SLA requirements in a deterministic and static environment. We then apply LACO using 3 different action sets: [...] over 50 intervals for "2 PRBs", whereas it takes around 10 intervals for "10 PRBs". Interestingly, the reward loss incurred by the latter configuration is only 2%. Therefore, given the faster convergence at the expense of a minimal reward loss, we empirically select Θ = 10 PRBs for our purposes.
11 Due to space limits, we refer the reader to the literature introducing such algorithms, e.g., [22].

C. Slice SLA violation analysis
We thus grant spectrum-time resources at a granularity of chunks of 1 second × 10 PRBs.
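To give a feeling for how the chunk size shapes the action space, the following sketch enumerates the candidate slice configurations for a given granularity. It rests on assumptions that are not stated in the text: every arm allocates all available PRBs, a slice may receive zero PRBs, and the total is 100 PRBs as in a 20 MHz LTE carrier. The function name `action_set` is hypothetical.

```python
from itertools import product


def action_set(total_prbs, chunk, n_slices):
    """All splits of total_prbs among n_slices in multiples of chunk (all PRBs used)."""
    if total_prbs % chunk != 0:
        raise ValueError("total_prbs must be a multiple of chunk")
    n_chunks = total_prbs // chunk
    arms = []
    # Choose the chunks of the first n_slices - 1 slices; the last slice gets the rest.
    for combo in product(range(n_chunks + 1), repeat=n_slices - 1):
        rest = n_chunks - sum(combo)
        if rest >= 0:
            arms.append(tuple(c * chunk for c in combo) + (rest * chunk,))
    return arms


for chunk in (2, 10):
    print(f"chunk = {chunk:2d} PRBs -> {len(action_set(100, chunk, 2))} arms for 2 slices")
print(f"chunk = 10 PRBs -> {len(action_set(100, 10, 3))} arms for 3 slices")
```

Under these assumptions, two slices yield 51 arms with 2-PRB chunks and 11 arms with 10-PRB chunks, while a third slice already raises the 10-PRB action set to 66 arms, so a coarser granularity directly reduces the number of arms the agent has to explore.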
In the first scenario, we investigate the capacity of LACO to adapt the resource allocation to variable traffic loads. For this reason, we consider only two slices with equal requirements, i.e., [...] Obviously, heterogeneous throughput/latency requirements impact the system differently. Fig. 6a shows the effect of such variations on the system, extending the previous scenario to increasing throughput and latency requirements of $10 \cdot \alpha$ Mb/s and $10 \cdot \beta$ ms, respectively. As expected, looser delay requirements (horizontal direction in the figure) allow more traffic to be served within the latency bounds defined by the SLA, although the impact becomes negligible after a few incremental steps. This is because the decision intervals are long compared to the timescale of fast channel variations. A proper resource configuration matches the offered traffic requirements with the expected channel capacity, allowing the incoming traffic to be served within a few milliseconds. As the offered traffic approaches the channel capacity boundary (vertical direction in the figure), the same task becomes more challenging, and the admission control process should consider this aspect when granting or rejecting access to new network slices. LACO's ability to adapt to demand variations not only reduces the amount of traffic violating delay requirements but also improves the overall distribution of the data delivery delay.
As shown in Fig. 6b, the empirical CDF of the delay of each slice in the scenario presented above improves remarkably, with mean delays of 2.6, 3.9 and 4.9 ms for LACO, TS and UCB, respectively.
Finally, we implement an optimal offline policy with full knowledge of the system, i.e., an oracle policy that knows the future with the corresponding latency violations. We compare both LACO and TS to this optimal policy for a variable number of slices. The aggregated demand is adapted to ensure we operate within the system capacity. In Fig. 6c, we depict the temporal evolution of the cumulative reward loss (regret) for both approaches. The figure illustrates how the regret increases much more rapidly for TS, a difference that grows with the number of slices.
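The regret metric used in this comparison can be computed as sketched below: at each decision interval, the reward obtained by the evaluated policy is subtracted from the reward of the oracle, and the differences are accumulated over time. The function name `cumulative_regret` and the reward sequences are hypothetical and serve only to illustrate the computation; they are not simulation results.

```python
from itertools import accumulate


def cumulative_regret(oracle_rewards, policy_rewards):
    """Running sum of the per-interval reward loss with respect to the oracle."""
    if len(oracle_rewards) != len(policy_rewards):
        raise ValueError("both reward sequences must cover the same intervals")
    return list(accumulate(o - p for o, p in zip(oracle_rewards, policy_rewards)))


# Hypothetical rewards: the policy explores at first, then matches the oracle.
oracle = [1.0, 1.0, 1.0, 1.0, 1.0]
policy = [0.2, 0.5, 0.9, 1.0, 1.0]
regret = cumulative_regret(oracle, policy)
print([round(r, 2) for r in regret])
```

A policy that has converged to the oracle's choice produces a flat regret curve from that point on, so the slope of the curve is a direct indicator of how much the policy is still losing per decision interval.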

D. Convergence time
The next generation of mobile networks (5G) promises to support the provisioning of high-throughput and low-latency services even in highly dense scenarios [2]. These capabilities are tightly bound to the possibility of exploiting higher communication frequencies together with wider spectrum bandwidth. In the 5G context, bandwidth is expected to increase up to 100 MHz, leading to additional complexity in the management of radio resources. In order to assess LACO's performance in such scenarios, we investigate the convergence time of our solution to the optimal slice configuration under different bandwidth settings. To enable a more efficient use of the spectrum resources and reduce the power consumption at the UE side, 5G New Radio (NR) introduces the concept of bandwidth parts (BWP) [20], where each BWP can be configured with a different numerology defining specific signal characteristics, e.g., in terms of subcarrier spacing.
Without loss of generality, we assume that all the end-users belonging to the same slice operate under similar numerology settings. Moreover, we keep the subcarrier spacing fixed to $\Delta f = 15$ kHz as in legacy LTE systems. Such a coarse resource allocation scheme is mandatory to support LTE devices, but it can be easily mapped to the finer resource block structures defined within the 5G domain by lower-layer intra-slice schedulers [21]. [...] picture it is evident how the curse of dimensionality affects the overall convergence time. This is more evident for the legacy UCB approach (depicted in red), which hardly copes with the increasing size of the action space and in some runs did not converge to a solution within the time boundary of our experiment. LACO (depicted in black) needs fewer decision intervals than Thompson Sampling (in blue) to converge to the optimal resource allocation, scaling almost linearly with the number of slices (and PRB availability) after the initial exploration phase.
Convergence to the optimal slice configuration also depends on the radio channel statistics. To measure the sensitivity of the decision process to SNR fluctuations, Fig. 7b considers a fixed number of slices (i.e., 3) deployed in a system characterized by average channel statistics with an increasing variance. In every scenario, the average (per slice) channel realization is drawn from a Rayleigh distribution characterized by a scale parameter $\tau \in \{0.1, 0.2, 0.3, 0.4\}$, respectively. This introduces an increasing level of variability in the SNR distribution according to the formula $\mathrm{Var} = \frac{4-\pi}{2}\tau^2$, as depicted in the plots of the central column. On the left-hand side of the same picture, it can be noticed how higher SNR variability has a very limited impact on the decision steps. This feature is inherited from the Markov chain model described in Section III. In particular, provided that the slice requirements fit within the admissibility region of the system,

A. Implementation
The architecture of our software implementation and LACO's interfaces with srseNB are depicted in Fig. 8b [...] system where serving rate and packet arrival rate are balanced, the size of the virtual queues gets smaller and the DTMC model is mostly characterized by non-delay states.

B. Experimental results
We consider a scenario accounting for two slices characterized by the following requirements. Latency and SNR information are depicted in the third and fourth plots of each figure. In this case, we use maximum and average as aggregation functions, respectively.
As described in Section V-C, during the starting procedure the MAB algorithm explores all available arms with the aim of collecting initial feedback on the system dynamics. Fig. 9a depicts the effects of these sequential choices on the latency experienced by the ongoing traffic flows. The initial steps drive the allocation of radio resources towards the eMBB slice, thereby providing significant advantages in terms of experienced delay with respect to the URLLC one.
In this phase, traffic coming from the URLLC slice might be dropped due to the violation of the delay bound $\Delta_{\mathrm{URLLC}}$.
The scenario changes after the 6th decision interval, when the agent selects the configuration [...]. Given the current channel quality, that arm does satisfy the URLLC radio requirements but does not reserve enough radio resources for the eMBB slice, thus increasing the latency experienced by its users. Subsequent arm selections within decision intervals 7 and 8 further reduce the radio resources assigned to the eMBB slice, thus leading the traffic to violate $\Delta_{\mathrm{eMBB}}$.
The MAB agent collects this information and quickly converges to a satisfactory configuration.
In Fig. 9b, we focus on the system dynamics once convergence is achieved and clearly notice how both latency requirements are satisfied. Interestingly, despite similar traffic requirements, the algorithm selects the configuration [...], which assigns more resources to the first slice. [...] Fig. 10, where both plots depict the empirical CDF of the latency, the RLC buffer density and the dropping rate incurred by each slice for the two allocation schemes.
The performance of the system when LACO is in place is depicted in the left-hand side picture, whereas the right-hand side shows the results of the RR-based slice scheduling scheme.
In both plots, the URLLC slice is shown in blue and the eMBB one in orange. Based on these results, we can observe that LACO successfully meets both slices' latency requirements. This is achieved by providing the required resources to the URLLC and eMBB slices (Fig. 8c

VIII. RELATED WORK
The RAN design problem has always been at the forefront of mobile operators' concerns, and a vast amount of research has been devoted to novel RAN architectures [25], [26] and efficient radio resource schedulers [21], [27]. Recently, network slicing has been proposed as a new means for mobile operators to deploy isolated network services owned by different customers over a common physical infrastructure. However, as highlighted in [28], the RAN needs additional functionalities to fully exploit SDN and NFV principles, especially in the partition and isolation of radio resources. The authors of [3] focus on efficient sharing of the RAN resources and propose a RAN slicing solution that performs adaptive provisioning and isolation of radio slices. Their work is based on dynamic virtualization of base station resources, which gives tenants the ability to independently manipulate each slice. Although the proposed architecture may guarantee isolation through different control planes, no mechanism is in place to ensure the satisfaction of delay requirements. The work in [29] provides an empirical study of resource management efficiency in slicing-enabled networks through real data collected from an operational mobile network, considering different kinds of resources and including the radio access, transport and core of the network. Similarly, the authors of [30] formulate an optimization framework to deal with the resource partitioning problem, where inter-slice isolation is assured through a virtualized layer that decouples the reservation choice from the physical resource availability, and propose different abstraction types of radio resource sharing. In [31] the authors present an Earliest Deadline First (EDF) scheduling approach in the context of network slicing.
Differently from us, their approach works on a single MAC scheduler and assumes for every TTI a complex fine-tuning of the quota of resources to be assigned to each slice, thus limiting the implementation of dedicated intra-slice scheduling solutions.
The exploration-vs-exploitation trade-off, typical of Multi-Armed Bandit (MAB) problems, is particularly suited to problems that require sequential decision-making. For this reason, a wide set of variations of the classical MAB model has been proposed in the literature [17], [32], together with novel algorithms to address them [33]. In this regard, the work of [34] investigates MAB problems in the case of Markovian reward distributions, where arms change their state in a two-state Markovian fashion. The authors address the problem assuming that the Markov chain evolves only when the arm is played, showing that the proposed sample-mean-based index policy achieves regret performance comparable to that of the legacy UCB algorithm.
The authors of [22] perform a complete regret analysis of the TS algorithm, generalizing the original formulation to distributions other than the Beta distribution. The MAB framework is also applied in [35] to deal with the rate adaptation problem in 802.11-like wireless systems.
The authors demonstrate that exploiting additional observations significantly improves the system performance. Similarly, [36] deals with scheduling transmissions in the presence of unknown channel statistics. The proposed algorithm learns the channels' transmission rates while simultaneously exploiting previous observations to obtain higher throughput. This led to the design of a queue-length-based scheduling policy that uses the channel learning algorithm as a component in time-varying environments. The authors of [37] present an algorithm for multivariate optimization on large decision spaces based on an innovative approach combining hill climbing optimization and Thompson Sampling. While the scalability of their algorithm has been proven through exhaustive simulations, the framework lacks a complete analysis of regret bounds aimed at demonstrating the impact of hill climbing in combinatorial decision making. Finally, similar to us, [38] deals with a MAB formulation where the reward distributions are characterized by temporal uncertainties.
Interestingly, they were able to mathematically capture, in terms of reward, the added complexity embedded in the non-stationarity feature when compared to the legacy framework.
The key novelty of LACO lies in the exploitation of (abstract) information about the underlying system structure to expedite solutions. Prior works, in contrast, are blind to this type of information and need to spend substantial time exploring very poor decisions before reaching a solution.

IX. CONCLUSIONS
Major efforts in the design of next-generation mobile systems pivot around network slicing and (mobile edge) low-latency services. This paper aims to bridge the gap between the two by designing LACO, a RAN-specific network slice orchestrator that considers network slice requests with strict latency requirements. Despite the efforts devoted by 5G researchers and engineers to network slicing, to the best of our knowledge, this is the first radio slicing mechanism that provides formal delay guarantees. To make network slicing decisions in environments with varying wireless channel quality and user demands, LACO builds on a learning Multi-Armed Bandit (MAB) method that is model-aware, as opposed to classic MAB approaches that are blind to information regarding the underlying system. In addition, we exploit information from the system model to expedite the exploration-vs-exploitation process. Our results, derived from an implementation with off-the-shelf hardware, show that LACO is able to guarantee strict slice latency requirements at affordable computational costs.

ACKNOWLEDGMENT
The research leading to these results has been partially supported by the H2020 MonB5G Project under grant agreement number 871780.