PRORL: Proactive Resource Orchestrator for Open RANs Using Deep Reinforcement Learning

Open Radio Access Network (O-RAN) is an emerging paradigm proposed for enhancing the 5G network infrastructure. O-RAN promotes open vendor-neutral interfaces and virtualized network functions that enable the decoupling of network components and their optimization through intelligent controllers. The decomposition of base station functions enables better resource usage, but also opens new technical challenges concerning their efficient orchestration and allocation. In this paper, we propose Proactive Resource Orchestrator based on Reinforcement Learning (PRORL), a novel solution for the efficient and dynamic allocation of resources in O-RAN infrastructures. We frame the problem as a Markov Decision Process and solve it using Deep Reinforcement Learning; one relevant feature of PRORL is that it learns demand patterns from experience for proactive resource allocation. We extensively evaluate our proposal by using both synthetic and real-world data, showing that we can significantly outperform the existing algorithms, which are typically based on the analysis of static demands. More specifically, we achieve an improvement of 90% over greedy baselines and deal with complex trade-offs in terms of competing objectives such as demand satisfaction, resource utilization, and the inherent cost associated with allocating resources.


I. INTRODUCTION
One of the aims of the fifth-generation (5G) cellular network infrastructure is to provide very high data rates with extremely low latency and Quality of Service (QoS) improvements for the final users. In response to these challenges, providers have implemented new technologies such as massive Multiple Input, Multiple Output (MIMO) [1], millimeter wave and sub-terahertz communications [2], network-based sensing [3], virtualization through Network Functions Virtualization (NFV) and Software-Defined Networking (SDN) [4], and Machine Learning (ML)-based digital signal processing [5], among others. These solutions enhance network capabilities but also come at the expense of increased management complexity and cost for the operators. To mitigate this complexity and avoid vendor lock-in, the Open-RAN (O-RAN) Alliance [6] has defined and standardized open interfaces that decouple hardware from software for greater flexibility and foster the support and usage of AI solutions for network operations and management.
O-RAN extends the 3GPP NR 7.2 split for base stations [7], which disaggregates their functions into a Central Unit (CU), a Distributed Unit (DU), and a Radio Unit (RU) (called O-CU, O-DU, and O-RU, respectively, in the O-RAN specifications). CUs and DUs are virtualized on edge cloud servers, while RUs are deployed at cell sites. Moreover, O-RAN connects base station functionalities to intelligent controllers, also known as RAN Intelligent Controllers (RICs), which have visibility of network performance indicators and are utilized to aggregate Key Performance Measurements (KPMs) for supporting closed-loop control applications for the overall infrastructure. In the O-RAN specifications, these controllers are categorized as near-real-time RICs and non-real-time RICs [8].
The advantages of the O-RAN architecture are many: it allows for better usage and management of resources thanks to virtualization, centralization, and dynamic reallocation; it can reduce management costs and generate significant energy savings; and it enables the potential augmentation of network capacity through the addition of virtual resources to its logically centralized pool.
In this paper, we present Proactive Resource Orchestrator based on Reinforcement Learning (PRORL), a novel solution for dynamic orchestration and management of resources in O-RAN. PRORL is an adaptive learning-based orchestrator that learns patterns in the demand from experience and uses them to proactively allocate resources by considering the expected effect over future time windows. More specifically, we target a three-tier network infrastructure with O-RAN components (see Figure 1): at the highest level (Regional Cloud), we place the resource pool, which is directly connected to the core network in a regional cloud data center. In the middle (Edge Cloud), the infrastructure hosts edge data centers, also called Points of Presence (PoPs), which communicate through the O1 interface [9] with the Regional Cloud: PoPs are responsible for distributing the resources received from the upper tier to the CUs, DUs, and the lower tier, thanks to the CU's Radio Resource Control (RRC) [10] and Service Data Adaptation Protocol (SDAP) [11] layers, which manage the connection lifecycle and the Quality of Service of the traffic flows (also known as bearers). The last tier (Cell Site) consists of RUs that receive capacity from the DUs of their PoP. PRORL is strongly original in its proposal of a learning-based orchestrator installed on the Regional Cloud as an rApp that controls the pool of currently available resources and is responsible for suitably allocating them to the PoPs. It is worth mentioning that the PRORL rApp is a plug-and-play component that implements custom logic in the O-RAN ecosystem. It directly communicates with non-RT RICs to collect network-related KPMs and to inject resource management policies.
In comparison with currently deployed O-RAN-related solutions, we strongly believe that PRORL advances the state-of-the-art in terms of effective and efficient use of resources. In fact, current base stations are usually provisioned with sufficient resources to accommodate peak hours, and then remain over-provisioned for the rest of the day [12]. On the contrary, our orchestrator can react to rapid changes in load and, simultaneously, proactively move resources while maintaining an adequate QoS. In this way, it can also reduce the total capacity required in the system and the Operating Expenditure (OPEX) for reallocating resources between PoPs.
Design Challenges: In order to design PRORL, we addressed a series of challenges related to resource allocation:
1) Allocation Strategy: an effective and efficient strategy is required for orchestrating resources among PoPs. The strategy has to select the PoPs that will receive additional resources, taken either from the pool or from other nodes that have excess capacity with respect to their current demand. It also has to optimize PoP satisfaction (by allocating sufficient capacity) and, simultaneously, avoid wasting resources through over-provisioning.
2) Adaptability to Demand Dynamics: demand varies over space and time due to the variety of users, applications, and services that use the network. The solution has to address demand dynamics in a proactive manner, in order to move resources before they are needed (proactivity of the strategy). Furthermore, the enforced strategy must be able to react quickly to unexpected changes (reactivity of the strategy).
3) Movement Cost: moving resources carries an inherent cost due to the de-activation and re-allocation of resources on different edges of the network deployment environment. Thus, the allocation strategy needs to take into account the number of resource movements.

Our Contributions: The main contributions of this paper can be summarized as follows:
1) We model the capacity allocation problem as a decision-making process in which PoPs have to serve connectivity to the underlying O-RAN components. An agent has to move resources from the centralized pool to the PoPs according to their aggregated demand; the agent's decision is guided by a numerical reward signal based on an objective function to be maximized. We design an original objective function that balances multiple competing objectives by weighting their effects; namely, we optimize system satisfaction, resource utilization, and the cost associated with resource movements (the OPEX). We formulate the problem in the Markov Decision Process (MDP) framework and solve it using Deep Reinforcement Learning. Our novel model architecture features a decomposition of the action space, which allows for substantially faster convergence of the agent.
2) We extensively evaluate our solution using a publicly available dataset composed of Call Detail Records (CDRs) collected in the city of Milan over two months in 2013, which is considered a solid use case in the related literature [13]. We also analyze the effect of different trade-offs among the three objectives that PRORL can optimize, in order to show the flexibility of the proposed approach. Finally, we conduct a sensitivity analysis using realistic synthetic data to experimentally study the robustness of the model in challenging scenarios, such as immediate demand peaks or system overloading.

Main Results: Our experimental evaluation over both real and synthetic data shows that PRORL achieves significantly better performance than baseline greedy solutions. Moreover, our approach can effectively deal with complex trade-offs in terms of competing objectives. PRORL is flexible enough to outperform the baselines over multiple trade-off configurations, with an average improvement of 90% over them. We also demonstrate the robustness of our approach by evaluating it on 4 different challenging scenarios using synthetic data. The results confirm the superiority of PRORL in all considered scenarios, with improvements in single objectives of up to 10× the performance of the baselines.

II. RELATED WORK
The problem of resource optimization in RAN can be divided into low-level and high-level resource allocation problems. The former addresses allocation at levels close to the physical layer, such as spectral efficiency, radio resources, and power allocation, which typically involves control of RUs and sometimes DUs. The latter addresses orchestration problems further up the stack, such as deployment, cell selection, and CU/DU control by activation and deactivation. Our proactive orchestrator belongs to the high-level resource allocation class. To the best of our knowledge, PRORL is the first attempt to address the problem of orchestrating resources among different pools while optimizing multiple conflicting objectives. Furthermore, it considers the effect of allocations over multiple timesteps, during which the demand naturally varies, and it is evaluated on both synthetic and real-world data.
Low-Level O-RAN Resource Allocation: In the context of Heterogeneous Cloud Radio Access Networks (H-CRANs), Peng et al. propose two optimization problems [19], [20]. In the first work, the authors mitigate inter-tier interference while optimizing energy efficiency by assigning resource blocks and transmission power. The optimization problem is formulated as a non-convex fractional program, which is solved by means of the Lagrange dual decomposition. The second work focuses on the impact of the cost of different types of fronthaul in C-RAN. The authors propose a joint optimization of energy efficiency and the wired/wireless fronthaul cost. The problem is formulated as a non-convex beamforming problem that constrains fronthaul capacity and transmission power, and it is solved with an outer-inner loop algorithm: in the outer loop, the primal problem is transformed into an equivalent sub-problem using the bisection search method, while, in the inner loop, the sub-problem is solved with the weighted minimum mean squared error approach. In the O-RAN context, Wang et al. [21] study the RU-DU resource assignment problem. They model it as a 2D bin packing problem and present a deep reinforcement learning agent with Monte Carlo tree search to solve it. They evaluate their solution on both synthetic and real-world data, discovering improved policies for resource assignment in diverse network conditions. In [22], a reinforcement learning solution dynamically adapts the per-flow resource allocation, modulation, and coding scheme in order to satisfy traffic flow requirements. In [23], reinforcement learning methods are proposed to manage sessions for ultra-reliable and low-latency communication (URLLC).
High-Level O-RAN Resource Allocation: In [14] and [24], Chen et al. propose a two-step system for mapping virtualized resources in H-CRAN environments. A traffic forecasting model is used for predicting the demand in the next time window, and the authors define a constrained optimization problem to find the best mapping given the predicted demand. They evaluate both approaches using real-world data. However, even if the mapping of resources employed by the authors is conceptually similar to that presented in our work, these solutions do not consider the effect of the demand variation over subsequent time intervals, relying instead on the accuracy of the next prediction. Similarly, in [15], Pamuklu et al. propose a reinforcement learning solution for mapping the split of functionalities between CUs and DUs in Green O-RAN, where the objective is to reduce energy consumption while utilizing renewable energy sources. In [16], the authors propose an algorithm to improve User Equipment (UE) placement taking into account radio quality, bandwidth, and user distribution. Finally, several works address the problem of Network Slicing using deep reinforcement learning [17], [18], [25], [26], [27], [28], which involves the allocation of resources to create network partitions where different policies and QoS requirements can be met. In contrast, our approach aims to support a finer level of granularity by orchestrating the pool of resource units.

Fig. 2. PRORL at a glance. The environment is composed of the resource pool and the PoPs. A state is derived from the environment representing the demand and capacity of the nodes and the pool. The state is given as input to the agent, which outputs the action to be taken. The actions consist of movements of resource units, which result in a new configuration.

III. BACKGROUND
In this section, we briefly introduce the theoretical framework used in this work. Firstly, we discuss the classic single-objective reinforcement learning problem. Then, we present the challenges of optimizing multiple objectives, as is the case in the problem setting of PRORL. Finally, we provide an in-depth discussion of the implementation of the proposed algorithm.
Single-Objective MDP: Markov Decision Processes are used to model a learning process based on interactions [29]. The learner and decision-maker is usually called an agent. The agent interacts with the environment (in our case the O-RAN system) at discrete time steps t = 0, 1, 2, ..., T. At each time step t, the agent receives a representation of the environment (defined as a state) S_t ∈ S and selects an action A_t ∈ A(S_t), where S is the set of possible states and A(S_t) is the set of possible actions in state S_t. As a consequence of the action, the agent receives a reward R_{t+1} and enters a new state S_{t+1}. The goal of the agent is to define a policy π(A_t | S_t), a probability distribution over actions for a given state, which maximizes the discounted return

G_t = Σ_{k≥0} γ^k R_{t+k+1},

where the sum runs over the remaining steps of the episode, R_{t+k+1} is the reward at time t + k + 1, and γ is a discount rate with 0 ≤ γ ≤ 1. We then define the state-value function

v_π(s) = E_π[G_t | S_t = s],

which is the expected discounted return when following the policy π from the state s onwards. We also define the action-value function given policy π as

q_π(s, a) = E_π[G_t | S_t = s, A_t = a],

i.e., the expected discounted return when taking action a in state s and then following the policy π. In deep reinforcement learning, the policy π is typically parametrized using a deep neural network with parameters θ.
Multi-Objective MDP: The Multi-Objective Markov Decision Process (MOMDP) captures problems with more than one objective to be optimized. The only difference compared to the standard MDP is that, instead of a scalar, the reward function R : S × A → R^d returns a vector of rewards, one for each objective, with d ≥ 2 indicating the number of objectives [30]. Therefore, the value of a state given a policy π is also a vector (please note that we use bold fonts for indicating vectors). In contrast to single-objective MDPs, in MOMDPs we cannot define a natural order between different policies without additional information about how to prioritize the objectives, because the value of a policy can be ambiguous. A typical method for finding solutions to a MOMDP is the utility-based approach: we define a utility function (also known as a scalarization function) U that projects the multi-objective value V^π to a scalar value

V^π_U = U(V^π, w),

where w is the weight vector parametrizing U. The scalarization can be either linear or non-linear and, depending on this choice, we can find different solutions to the problem. For simplicity, in this work, we assume that a linear combination of the objectives is sufficient to capture the trade-offs between the different desiderata. The full formulation of the linear utility function is given in Section IV, Equation (9).
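To make the scalarization concrete, the following minimal sketch applies a linear utility with a weight vector w to a vector-valued reward; the specific weights, reward values, and function name are illustrative placeholders rather than PRORL's actual configuration.

```python
import numpy as np

def linear_utility(reward_vector: np.ndarray, weights: np.ndarray) -> float:
    """Project a d-dimensional reward (or value) vector onto a scalar
    using a linear scalarization U(v) = w . v."""
    assert reward_vector.shape == weights.shape
    return float(np.dot(weights, reward_vector))

# Hypothetical example: three objectives with weights summing to 1.
w = np.array([0.3, 0.1, 0.6])   # e.g., relative priorities of three objectives
r = np.array([0.8, 1.0, 0.5])   # normalized per-objective rewards in [0, 1]
print(linear_utility(r, w))     # -> 0.64
```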

Decision-Making Algorithm: The choice of a linear utility function gives the added benefit that any single-objective RL algorithm is applicable. Therefore, we use DQN [31], [32], an off-policy value-based method that is more data-efficient than on-policy methods. This advantage is critical in a real and complex environment such as that of O-RANs, where data acquisition is expensive.
DQN uses a function approximator (typically a deep neural network) to parametrize, with weights θ, the action-value function Q(s, a, θ). Compared to tabular Q-learning, DQN leverages experience replay and a target network Q^−. The former [33] consists of storing tuples of the agent's experience (S_t, A_t, R_t, S_{t+1}) in a replay buffer D and sampling a mini-batch at each learning step; it allows for better data efficiency and reduces the variance of the updates. The latter makes DQN more stable by copying the weights θ of the online network into those of the target network θ^− every C steps.
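For illustration, the sketch below shows these two mechanisms in a simplified form: a replay buffer that stores transitions and a bootstrap target computed with a periodically synchronized target network. Buffer capacity, discount factor, and function names are assumptions made for this example, not the exact implementation used by PRORL.

```python
import random
from collections import deque

import torch

buffer = deque(maxlen=10_000)  # replay buffer D (illustrative capacity)

def store(transition):
    """transition = (state, action, reward, next_state, done), each a torch tensor."""
    buffer.append(transition)

def dqn_targets(batch, q_online, q_target, gamma=0.99):
    """Standard DQN bootstrap target: r + gamma * max_a' Q_target(s', a')."""
    states, actions, rewards, next_states, dones = map(torch.stack, zip(*batch))
    with torch.no_grad():
        next_q = q_target(next_states).max(dim=1).values
    return rewards + gamma * (1.0 - dones.float()) * next_q

def sync_target(q_online, q_target):
    """Copy the online weights theta into the target network theta^- every C steps."""
    q_target.load_state_dict(q_online.state_dict())
```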
In the past years, the research community has proposed several improvements to the original DQN algorithm [34], some of which we use in our implementation. In particular, we adopt Double DQN [35] updates, which deal with the overestimation of Q-values that affects the original approach. We also adopt prioritized experience replay [36], which, instead of sampling uniformly from the replay buffer, samples entries proportionally to a weight that is a function of the Temporal Difference (TD) estimation error, hence prioritizing "important" transitions. We opt for the variant of PER that determines weights from the last encountered absolute TD error, which has been shown to be highly performant yet computationally efficient. Finally, we use the dueling network approach proposed by Wang et al. [37], which modifies the Q-network architecture by splitting the last layer into two branches that output the advantage and the value respectively. The two branches are subsequently factorized in order to obtain the Q-values in output. This technique allows for better generalization across actions.
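The following sketch illustrates two of these ingredients under simplifying assumptions: a dueling head that factorizes state value and action advantages into Q-values, and the Double DQN target that selects the next action with the online network but evaluates it with the target network. Layer sizes and function names are placeholders.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling architecture: shared trunk, then separate value and advantage streams."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)               # V(s)
        self.advantage = nn.Linear(hidden, n_actions)   # A(s, a)

    def forward(self, s):
        h = self.trunk(s)
        v, a = self.value(h), self.advantage(h)
        # Factorization: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        return v + a - a.mean(dim=1, keepdim=True)

def double_dqn_target(reward, next_state, q_online, q_target, gamma=0.99):
    """Select the greedy next action with the online net, evaluate it with the target net."""
    with torch.no_grad():
        best_a = q_online(next_state).argmax(dim=1, keepdim=True)
        return reward + gamma * q_target(next_state).gather(1, best_a).squeeze(1)
```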

IV. RESOURCE ORCHESTRATION WITH PRORL
In this section, we present the design and implementation of PRORL, our resource orchestration solution, illustrated in Figure 2. We first describe the associated MDP, starting from the concepts introduced in the previous section. In practice, the agent uses the system status as input and outputs the action (a reconfiguration of resources) that is applied to the network deployment environment, which is composed of the resource pool and the PoPs. The environment then provides the agent with the new status and the reward associated with the allocation.

A. Problem Formulation
Formally, we are given a resource pool B = {b_1, ..., b_K} initialized with K units, each of size σ, and a set of PoPs N = {n_1, ..., n_M} (also referred to as nodes) of cardinality M, where K ≫ M, as shown in Figure 3(a). Each node n_i is described by the tuple ⟨d_{t,n_i}, c_{t,n_i}⟩. The demand d_{t,n_i} ∈ R represents the load on node n_i at time t, obtained by aggregating all the demand in the controlled area, while the capacity c_{t,n_i} ∈ N represents the number of resource units allocated to node n_i. We note that the system is able to control the allocated capacities, while the demands are external. The system allows performing two resource unit movements (i.e., sub-actions) per time step t. The former, f_add(n_add), moves one resource unit from the pool B to a PoP n_add, as illustrated in Figure 3(b). The latter, f_remove(n_remove), releases one resource unit from a PoP n_remove, with n_add ≠ n_remove, as illustrated in Figure 3(c). Moving a resource unit from the pool means activating it and allocating it to its new location. Conversely, releasing a resource unit means deactivating it and moving it back to the pool. Every time a resource unit is allocated or released, a cost κ is incurred. In addition, for each movement, the system allows waiting (performing no movement), which incurs no cost (κ = 0).
We define Δ_{t,n_i} = c_{t,n_i} − d_{t,n_i} as the difference between the capacity and demand of node n_i at time t. Given a target τ, our objective is:

minimize Σ_{t=1}^{T} [ Σ_{i=1}^{M} c_{t,n_i} + κ · 1(A^1_t ≠ wait) + κ · 1(A^2_t ≠ wait) ]
subject to Δ_{t,n_i} ≥ τ, ∀ n_i ∈ N, ∀ t,

where 1 is the indicator function, which is equal to 1 if the condition expressed as its argument is true and 0 otherwise, while A^1_t and A^2_t are the two sub-actions, respectively. Intuitively, over a time horizon of T steps, we need to minimize the allocated capacities (first term of the sum) and the cost of movements (second and third terms) while satisfying the demands requested by the nodes (the constraint).
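To make the terms of the objective concrete, the hypothetical snippet below evaluates, for a single time step, the allocated capacity, the movement cost, and whether the satisfaction constraint holds; the variable names mirror the notation above, while the numerical values are purely illustrative.

```python
WAIT = None   # sentinel for the "no movement" sub-action
KAPPA = 1.0   # per-movement cost (illustrative value)

def step_objective(capacity, demand, action_add, action_remove, tau=0.0):
    """Return (allocated_units, movement_cost, all_satisfied) for one time step.

    capacity[i] is the number of units on node i, demand[i] the external load,
    so delta[i] = capacity[i] - demand[i] must stay >= tau for satisfaction.
    """
    delta = [c - d for c, d in zip(capacity, demand)]
    allocated = sum(capacity)
    cost = KAPPA * ((action_add is not WAIT) + (action_remove is not WAIT))
    satisfied = all(dlt >= tau for dlt in delta)
    return allocated, cost, satisfied

# Hypothetical 3-node snapshot: add one unit to node 0, remove one from node 2.
print(step_objective(capacity=[4, 2, 3], demand=[3.5, 1.2, 1.0],
                     action_add=0, action_remove=2))   # -> (9, 2.0, True)
```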
The presented problem shares significant similarities with the field of Inventory Management [38], where it is necessary to determine the quantity of stock (capacity) allocated to each warehouse (node). In this context, the Dynamic Capacitated Lot-Sizing problem is recognized as NP-hard [39]. In addition, the Unit Commitment problem in electrical power production shares similarities and is also known to be NP-hard [40], [41]. Hence, deriving the optimal solution to large instances of the problem is intractable, particularly because future demands are not known in advance and the allocation must be continuously adjusted to their variability. Due to these challenges, we posit that an approach like PRORL, providing approximate solutions, is needed.
In order to simplify the presentation, and without loss of generality, we set τ = 0 and the time step size to 1 hour for the remainder of the paper. The time step size determines the resolution at which the demand is collected and evaluated, and the choice of this interval is of key importance. If it is too short, the observed changes might be temporary and, in general, very noisy. If it is too long, the algorithm will not be able to adapt to non-negligible demand variations that might happen between two sampling points. Figure 5 shows that, for the considered real-world dataset, one hour is an appropriate level of granularity. It is worth noting that, given a scenario with different demand variability, our approach can support a significantly smaller granularity that approaches real time. In general, using a learned model in such optimization problems may be appropriate when decisions need to be made rapidly and running a solver is unfeasible [42].

B. Definition of the MDP
We now present the formulation of the orchestration problem underlying PRORL as an MDP, discussing the action and state spaces and the chosen reward structure.
Actions and Decomposition of the Action Space: The agent is a logically centralized component that receives the current environment state as input, selects actions, and improves its policy in order to maximize the reward received. In this work, we split the agent into two components in order to reduce the dimension of the action space. The actions consist of either waiting or adding (removing) resources to (from) one of the M nodes. Therefore, the combined action space has dimension (M + 1)^2. On the other hand, if we decompose the action into two independent sub-actions, the overall action space dimension becomes (M + 1) + (M + 1) = 2(M + 1). A smaller action space allows for a quicker learning process, as experimentally demonstrated in the evaluation section.
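The size reduction can be verified directly; the short sketch below compares the joint and decomposed action space dimensions for a few hypothetical numbers of PoPs.

```python
def joint_action_space_size(m: int) -> int:
    """Both sub-actions chosen jointly: (add target or wait) x (remove target or wait)."""
    return (m + 1) ** 2

def decomposed_action_space_size(m: int) -> int:
    """Two independent sub-agents, each choosing among M nodes or wait."""
    return 2 * (m + 1)

for m in (4, 12, 50):  # hypothetical numbers of PoPs
    print(m, joint_action_space_size(m), decomposed_action_space_size(m))
# e.g., for M = 12: 169 joint actions vs. 26 decomposed actions
```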
We therefore implement two independent agents that act sequentially, as depicted in Figure 4. We refer to them as the add and remove sub-agents, respectively. Furthermore, before each step, the environment shrinks the two action spaces if certain criteria are met, in order to prevent invalid actions. For the add sub-action, the space is reduced to contain only the wait action if the current capacity of the pool B is zero. For the remove sub-action, we disable the index corresponding to the node selected as n_add (if the add sub-action is not wait), and we disable nodes with zero allocated capacity.
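A minimal sketch of this action-shrinking step is shown below, assuming action index 0 encodes wait and indices 1..M address the nodes; this encoding is an assumption made for illustration, not the exact representation used in our implementation.

```python
WAIT = 0  # index 0 is the wait action; indices 1..M address the nodes

def valid_add_actions(pool_capacity: int, num_nodes: int) -> list[int]:
    """If the pool is empty, only waiting is allowed; otherwise any node can receive a unit."""
    if pool_capacity == 0:
        return [WAIT]
    return [WAIT] + list(range(1, num_nodes + 1))

def valid_remove_actions(node_capacity: list[int], add_action: int) -> list[int]:
    """Disable the node just selected by the add sub-agent and nodes with no allocated units."""
    actions = [WAIT]
    for i, units in enumerate(node_capacity, start=1):
        if units > 0 and i != add_action:
            actions.append(i)
    return actions

# Hypothetical 4-node example: node 2 is empty, a unit was just added to node 1.
print(valid_remove_actions([3, 0, 1, 2], add_action=1))  # -> [0, 3, 4]
```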
States: The state should encapsulate all the relevant information about the environment that is useful to the agent for selecting actions. More specifically, for our resource orchestration problem, it must capture the capacity of the nodes and the pool, as well as the demand in the different locations. The state of each sub-agent contains the following information:
• Pool Capacity: the current dimension of the pool B. This feature tells the agent the number of currently unallocated resource units that can be moved to the nodes.
• Node Capacity: a vector of dimension M with one entry representing the current capacity c_{t,n_i} of each node n_i.
• Node Demand: a vector of dimension M with the current demand d_{t,n_i} of each node n_i.
• Capacity Surplus: a vector of dimension M, whose elements are the current differences between capacity and demand Δ_{t,n_i} for each node n_i. This feature, although in principle redundant (since it is a linear combination of other features), is an inductive bias that has proven beneficial in preliminary experiments.
• Current Time: a one-hot encoding of the hour and the day of the week of the current time step. This feature lets the agent associate patterns in demand with a time interval, in effect capturing seasonality in the data.
• Remaining Attempts: a scalar value that tells the agent the number of attempts remaining in the current episode (please refer to the Trajectory Pruning paragraph for additional details).
The state of the remove sub-agent contains an additional piece of information, namely, the one-hot encoding of the sub-action selected by the add sub-agent, i.e., the index of the selected node or a vector of zeros if wait is selected. This feature is used by the remove sub-agent for coordination. Furthermore, all the state features are normalized in order to have values between 0 and 1. Moreover, for the demand-related features, we also divide the values by the resource unit capacity and apply the ceiling function ⌈x⌉. By doing so, we transform the demand into an integer value and reduce the number of possible states, which leads to a more efficient learning process.
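As an illustration of how such a state vector can be assembled, the sketch below converts the demand into resource units, computes the surplus feature, one-hot encodes the current time, and scales the features to comparable magnitudes; all normalization constants and the exact feature layout are assumptions for this example.

```python
import math

def build_state(pool, capacity, demand, hour, weekday, attempts,
                unit_size, max_units, max_attempts):
    """Assemble an (add sub-agent) state vector with features scaled to comparable magnitudes.

    Demand is first converted to resource units via ceil(d / sigma), as described above.
    The scaling constants max_units and max_attempts are hypothetical for this sketch.
    """
    demand_units = [math.ceil(d / unit_size) for d in demand]
    surplus = [c - du for c, du in zip(capacity, demand_units)]
    hour_onehot = [1.0 if h == hour else 0.0 for h in range(24)]
    day_onehot = [1.0 if d == weekday else 0.0 for d in range(7)]
    return (
        [pool / max_units]
        + [c / max_units for c in capacity]
        + [du / max_units for du in demand_units]
        + [s / max_units for s in surplus]
        + hour_onehot + day_onehot
        + [attempts / max_attempts]
    )

state = build_state(pool=8, capacity=[3, 2], demand=[2200.0, 600.0],
                    hour=14, weekday=2, attempts=120,
                    unit_size=890, max_units=20, max_attempts=150)
print(len(state))  # 1 + 2 + 2 + 2 + 24 + 7 + 1 = 39 features for M = 2 nodes
```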
Rewards: The reward signal is used to discriminate between beneficial and non-beneficial actions. The quality of a decision does not directly impact the placement of a single service, since our solution governs the allocation of resource units at the underlying Points of Presence (PoPs), which are then responsible for ensuring the Quality of Service (QoS) of individual users. We define three different reward signals, one for each component of the objective. Therefore, at each time step t, the vector of rewards is R_t = (R_surplus,t, R_cost,t, R_gap,t).

The first component, corresponding to term (3), measures the unused resources, which are essentially wasted. We define the surplus as the number of unused units over all the nodes, so as to minimize the units allocated on all the nodes:

R_surplus,t = − Σ_{i=1}^{M} Δ_{t,n_i} · 1(Δ_{t,n_i} > 0),

where 1 is the indicator function, which is equal to 1 if the condition expressed as its argument is true and 0 otherwise.

The second and third components of term (4) are the costs associated with resource movements. We use the following reward:

R_cost,t = − κ · [1(A^1_t ≠ wait) + 1(A^2_t ≠ wait)],

where A^1_t and A^2_t are the two sub-actions, respectively.

The final component (corresponding to the constraint (5)) measures the level of satisfaction of the nodes. The remaining gap is defined as the number of units that would be necessary in order to satisfy the constraint for every node. The corresponding reward signal at time step t is:

R_gap,t = Σ_{i=1}^{M} Δ_{t,n_i} · 1(Δ_{t,n_i} < τ).

Utility Function: The utility function is used for composing the rewards that drive the learning process. As discussed in Section III, we adopt the following linear utility function:

U(R_t) = w_surplus · φ(R_surplus,t) + w_cost · φ(R_cost,t) + w_gap · φ(R_gap,t),    (9)

where w_surplus + w_cost + w_gap = 1. The values of the weights are set according to the requirements of the designer of the system. Different sets of values lead to different trade-offs, an aspect that is discussed in Section VI. Additionally, φ(R) is a normalization function applied to the rewards before combining them. We normalize the rewards in order to weigh values of the same magnitude. More specifically, the goal is to obtain values in the range [0, 1], where 0 corresponds to the worst value and 1 to the optimal one.
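For concreteness, the following sketch spells out one possible reading of the three reward components and their linear combination; it mirrors their intent, but the exact expressions and names are illustrative and may differ in detail from our implementation.

```python
def rewards(capacity, demand_units, add_action, remove_action, kappa=1.0, tau=0):
    """One possible reading of the three reward components described above.

    capacity and demand_units are per-node unit counts; WAIT is the no-op sub-action.
    This is a hedged sketch of the intent, not the paper's exact formulas.
    """
    WAIT = None
    delta = [c - d for c, d in zip(capacity, demand_units)]
    r_surplus = -sum(dlt for dlt in delta if dlt > 0)        # penalize unused units
    r_cost = -kappa * ((add_action is not WAIT) + (remove_action is not WAIT))
    r_gap = -sum(tau - dlt for dlt in delta if dlt < tau)    # units still missing
    return r_surplus, r_cost, r_gap

def utility(reward_vec, weights, normalize):
    """Linear utility of Equation (9): weighted sum of normalized rewards."""
    return sum(w * normalize(r) for w, r in zip(weights, reward_vec))

r = rewards(capacity=[4, 2, 1], demand_units=[3, 2, 2], add_action=2, remove_action=None)
print(r)  # -> (-1, -1.0, -1): one spare unit, one movement, one missing unit
```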
Trajectory Pruning: We also implement a mechanism for pruning sub-optimal trajectories during the training of our model. At initialization time, the environment starts with D attempts and, after each time step, we compute the number of unsatisfied nodes, i.e., the number of nodes with Δ_{t,n_i} < τ. If the number of nodes for which the demand is not satisfied is greater than 0, an attempt is lost. When the environment reaches zero attempts, the episode terminates. This has the effect of discarding clearly sub-optimal trajectories, encouraging the agent to maintain longer episodes during which higher returns can keep being received. We note that this technique is also called "early termination" in some RL works.
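A minimal sketch of this early-termination bookkeeping is given below; the attempt budget and termination rule follow the description above, while the class name and default values are illustrative choices.

```python
class AttemptTracker:
    """Early-termination bookkeeping: lose one attempt whenever any node is unsatisfied."""
    def __init__(self, max_attempts: int = 150, tau: float = 0.0):
        self.remaining = max_attempts
        self.tau = tau

    def update(self, deltas) -> bool:
        """Return True if the episode should terminate after this step."""
        unsatisfied = sum(1 for d in deltas if d < self.tau)
        if unsatisfied > 0:
            self.remaining -= 1
        return self.remaining <= 0

tracker = AttemptTracker(max_attempts=2)
print(tracker.update([1.0, -0.5]))  # one unsatisfied node -> attempt lost, prints False
print(tracker.update([-2.0, 0.3]))  # second loss -> prints True, episode terminates
```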
Computational Complexity: For performing inference, our solution involves interacting with two sub-agents at every time step, as depicted in Figure 4. The interaction requires building the representations of the two states, applying the action shrinking in order to disable unavailable actions, and performing the forward passes of the policies of the sub-agents. Assuming that the sizes of the neural networks remain constant in terms of the number of layers and the number of units in the hidden layers, each of these steps has complexity O(M). Therefore, the overall computational complexity of our approach is O(M), i.e., linear in the number of nodes. The training cost cannot be expressed analytically, given its dependence on the number of steps to convergence, which is challenging to determine in reinforcement learning with neural networks as function approximators. Therefore, we assess convergence empirically by studying the validation performance, as shown in Figure 6. In practice, the models take approximately 5 hours to train for the real-world data case on a single Nvidia RTX A5000 GPU. Above all, it is worth noting that training represents a one-off cost, and retraining the network might become necessary only if there is a substantial change in demand patterns.

V. EXPERIMENTAL SETUP
In this section, we first present the experimental setup used for evaluating our solution by means of synthetic and real-world data. We then describe the scenarios adopted for our sensitivity analysis. Finally, we discuss the different baselines that are used to evaluate the performance of PRORL.

A. Real-World Data
Dataset: We evaluate PRORL using a publicly available dataset composed of network traces gathered in the city of Milan over two months in 2013 [13]. Herein, the position of the base stations is hidden, while data is collected over square cells of size 235 × 235 m². Inside each cell, traffic volume is aggregated and anonymized every ten minutes. A new trace is generated every time a user receives or sends an SMS, makes or receives a call, or an Internet connection starts or ends. We then combine the position of the grid cells with the estimated position of real base stations obtained from a public dataset [43].
In our experiments, we consider 12 base stations that correspond to the PoP set N. Since the distribution of demands is highly skewed (most base stations have very low demands, while a few handle the majority of traffic), we discard the ones with the lowest peaks in the dataset, since they would represent unrealistic PoPs, where a single resource unit is always underutilized due to the low resource requirement. To consider a challenging scenario for resource allocation, we choose 6 unique pairs of base stations characterized by the highest distance in terms of demand profile over time. By doing so, we select nodes whose demand peaks at different times of the day or the week. Figure 5 presents the demand for the selected nodes over the 8 weeks of data.

Fig. 6. PRORL validation total utility during training for the four setups (higher is better). Note that, while the number of episodes within an iteration may differ due to the trajectory pruning mechanism, the rightmost point in all four plots corresponds to the same 336000 steps.
Training and Evaluation Procedure: We train our solution by cycling through the data of the training set for a total of 336000 steps. The number of training steps is chosen empirically on the basis of validation performance, which improves rapidly in the early stages of training and tends to plateau as training progresses, as shown in Figure 6. An episode terminates every time the entire training set has been seen by the agent or the D attempts are exhausted. Every five episodes, a validation run is performed over an unseen set of data (the validation set), and the agent's model is saved if a new best score is obtained. At the end of training, the model that obtains the best validation score is evaluated over an additional set of unseen data, the evaluation set. We obtain the three datasets from the 8 weeks by temporally splitting the data: 6 weeks for the training set and one week each for the validation and evaluation sets. The performance we report refers to the score obtained on the evaluation set. To ensure statistical validity of the results, we use 10 runs, each with a different random initialization of the neural network parameters.
Agent Setup: We use a fully-connected network with ReLU activations and hidden layer sizes ∈ {{64, 128, 64}, {64, 128, 256, 128}}, trained using the Adam optimizer [44]. We consider learning rates lr ∈ {0.001, 0.005, 0.0001, 0.0005} and batch sizes batch_size ∈ {32, 64, 128, 256}. The exploration is based on an ε-greedy policy, with ε linearly decreased from 1 to 0.05 during the first half of the training and fixed to 0.05 for the remaining steps. The replay buffer has a capacity of 10000, and we start filling it with 9000 steps of bootstrapping, during which no learning is performed and a random policy is used. The target network update frequency is 1000 steps, and we use a discount factor γ = 0.99. From a grid search, we find that the best performance is achieved with the following set of values: lr = 0.005, batch_size = 256, and net = {64, 128, 64}.

Environment Setup: Overall, we set up the environment with 180 available resource units of size σ = 890, where the total number of units is obtained from the base station density we observe in the dataset, whereas σ is set in order to have a total demand at most equal to 80% of the total capacity. We initialize the environment with 10 units for each PoP, while the remaining units are placed in the pool B. We use D = 150 attempts, having explored values of D ∈ {150, 300, 600, 1000}.
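As a concrete illustration of the exploration schedule described in the agent setup, the sketch below computes ε as a function of the training step: linear decay over the first half of training, then a constant 0.05; the helper name is an assumption made for this example.

```python
def epsilon_at(step: int, total_steps: int,
               eps_start: float = 1.0, eps_end: float = 0.05) -> float:
    """Epsilon-greedy schedule: linear decay over the first half of training,
    then held constant at eps_end for the remaining steps."""
    half = total_steps // 2
    if step >= half:
        return eps_end
    return eps_start + (eps_end - eps_start) * (step / half)

total = 336_000
for s in (0, 84_000, 168_000, 300_000):
    print(s, round(epsilon_at(s, total), 3))
# -> 0 1.0 | 84000 0.525 | 168000 0.05 | 300000 0.05
```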
As discussed in Section IV-A, we set τ = 0 and the time step size to 1 hour, which Figure 5 shows to be an appropriate level of granularity for the considered real-world dataset. Given a scenario with different demand variability, our solution can support a significantly smaller granularity that approaches real time, as the average time required to execute one decision is less than 2 milliseconds.
Fig. 8. Cumulative utility at evaluation time over the real-world dataset for the different utility setups (the higher the better).

Finally, we train and evaluate PRORL using four different configurations of weights for our utility function (Equation (9)):
• Efficiency: we prioritize optimization of the R_surplus component by using the weights {w_surplus = 0.6, w_cost = 0.05, w_gap = 0.35};
• Balanced: we balance the trade-off between the objective components by using the weights {w_surplus = 0.3, w_cost = 0.1, w_gap = 0.6};
• Quality of Service: we prioritize optimization of the demand satisfaction component R_gap by using the weights {w_surplus = 0.2, w_cost = 0.05, w_gap = 0.75};
• Strict Quality of Service: we optimize mainly R_gap while disregarding the cost, by using the weights {w_surplus = 0.1, w_cost = 0, w_gap = 0.9}.
We note that the weight associated with the cost, w_cost, is always smaller than the others since, if it is given too high an importance, the behavior of the agent degenerates into always waiting (since waiting does not incur a cost).
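For reference, the four weight setups can be collected in a small configuration table; the dictionary keys below are hypothetical names introduced only for this sketch, while the numerical weights are those listed above.

```python
# Weight configurations for Equation (9); each set sums to 1.
UTILITY_SETUPS = {
    "efficiency": {"w_surplus": 0.6, "w_cost": 0.05, "w_gap": 0.35},
    "balanced":   {"w_surplus": 0.3, "w_cost": 0.10, "w_gap": 0.60},
    "qos":        {"w_surplus": 0.2, "w_cost": 0.05, "w_gap": 0.75},
    "strict_qos": {"w_surplus": 0.1, "w_cost": 0.00, "w_gap": 0.90},
}

for name, w in UTILITY_SETUPS.items():
    assert abs(sum(w.values()) - 1.0) < 1e-9, name  # sanity check on the weights
```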

B. Sensitivity Analysis
We train and evaluate PRORL considering four challenging cases in order to demonstrate the robustness of our solution. We use a synthetic data generator where the demand is drawn from a set of Gaussian distributions. The generator allows setting different values of means and standard deviations and creating peaks of demand, either periodic or sporadic, over a given time interval. In particular, we create four different scenarios, as depicted in Figure 7:
• Constant demand: we keep the demand constant on average for the entire week, as shown in the top left part of Figure 7. It is a simple case, however, for which proactivity in the allocation does not provide any gains.
• Local overdemand: we keep the demand constant for most of the week, but twice a week peaks of demand occur for some hours at a given node, as shown in Figure 7 (top right). This scenario is a relevant test for a proactive agent, which should learn when the peak will occur and prevent cases in which the demand is not satisfied or resources are wasted.
• Weekly patterns: we simulate a typical week in an urban area, in which there is a clear distinction between residential and business districts. During weekdays, people commute from home to work in the morning and back in the evening, while on the weekend demand peaks occur in different areas (see Figure 7, bottom left).
• Global overdemand: we analyze a case in which the total demand surpasses the total capacity of the system (see Figure 7, bottom right). This is an extreme case, in which complete demand satisfaction is not feasible. PRORL should be able to limit the movement actions and reduce the associated management cost.
All the cases share the same environment configuration: we use a PoP set composed of 4 nodes and a pool of 20 resource units. We set the resource unit size σ = 10 and the utility weights to the values corresponding to the Balanced setup. The standard deviation of the synthetic generator is equal to 1. In this case, in addition to the agent's initial state, we also initialize the synthetic generator using 10 different random seeds to ensure statistical validity of the results. We explored the hyperparameters of our agent and found that the best configuration for the real-world dataset is also the best for the synthetic data. Due to the simpler nature of the problem, we set the experience replay capacity to 6000 and the bootstrapping phase to 3000 steps, while the total number of training steps was set to 86400 for all the cases apart from the Weekly patterns case, which was trained for 168000 steps.
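A minimal sketch of such a Gaussian demand generator is shown below; the means, peak boost, peak hours, and seed are illustrative values, not the exact parameters of our generator.

```python
import random

def synthetic_demand(hours: int, mean: float, std: float,
                     peak_hours=(), peak_boost: float = 3.0, seed: int = 0):
    """Gaussian demand trace with optional peak hours, loosely mimicking the generator
    described above; all numerical choices here are illustrative."""
    rng = random.Random(seed)
    trace = []
    for h in range(hours):
        mu = mean * peak_boost if (h % 168) in peak_hours else mean
        trace.append(max(0.0, rng.gauss(mu, std)))
    return trace

# One week (168 hourly steps) with sporadic peaks, in the spirit of the local-overdemand case.
week = synthetic_demand(hours=168, mean=5.0, std=1.0, peak_hours={40, 41, 42, 112, 113})
print(min(week), max(week))
```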

C. Baselines
We evaluate PRORL by comparing its performance against the following baselines:
• Random policy: picks the two sub-actions by sampling uniformly from the action spaces A(S_t);
• Heuristic policy: always selects the PoPs with the highest remaining gap and the highest surplus (if any) for the n_add and n_remove sub-actions respectively, and otherwise waits;
• Greedy policy: emulates the movement for each of the actions that are one time step away in the MDP, and greedily chooses the action resulting in the highest utility;
• PRORL-no-split: an RL agent identical to the proposed solution, with the sole exception that the action space is not split into two sub-actions, resulting in a combinatorial set of possible actions.
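For clarity, the snippet below sketches a one-step policy in the spirit of the Heuristic baseline: it adds a unit to the node with the largest remaining gap (if the pool is not empty) and removes a unit from the node with the largest surplus; the encoding and tie-breaking are assumptions made for this illustration.

```python
def heuristic_policy(capacity, demand_units, pool):
    """Add to the node with the largest remaining gap, remove from the node with
    the largest surplus (if any); otherwise wait. A sketch of the baseline above."""
    WAIT = None
    delta = [c - d for c, d in zip(capacity, demand_units)]

    gaps = [(i, -d) for i, d in enumerate(delta) if d < 0]        # nodes missing units
    add = max(gaps, key=lambda x: x[1])[0] if gaps and pool > 0 else WAIT

    surpluses = [(i, d) for i, d in enumerate(delta)
                 if d > 0 and capacity[i] > 0 and i != add]       # nodes with spare units
    remove = max(surpluses, key=lambda x: x[1])[0] if surpluses else WAIT
    return add, remove

print(heuristic_policy(capacity=[5, 1, 2], demand_units=[2, 3, 2], pool=4))  # -> (1, 0)
```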

VI. EXPERIMENTAL RESULTS

A. Evaluation Using Real-World Data
Figure 8 presents the cumulative utility for the 4 different configurations of the utility function, while Figure 9 illustrates the cumulative surplus (first row), movement cost (second row), and remaining gap (last row). Overall, the results confirm the superiority of PRORL in simultaneously optimizing the 3 objectives. Only in the Efficiency setup does the Greedy baseline obtain a similar total utility, while in the remaining setups our solution greatly outperforms the four baselines. In terms of single objectives, the diverse behaviors reflect the fact that each utility function setup weighs the 3 objective components differently. The different prioritization of the remaining gap component is evident between the Efficiency and the Strict Quality of Service setups. The opposite behavior occurs for the surplus: in the Efficiency setup, the surplus is one of the lowest among the baselines, while the remaining gap is one of the highest. In contrast, a static policy such as the Heuristic cannot prioritize different components of the objective differently, and maintains identical values across the different setups. In terms of cost, the Balanced setup shows that PRORL can satisfy the demand while reducing the movement cost, whereas the Greedy baseline always waits, since waiting leads to the highest reward in the short term.

Fig. 9. Cumulative surplus, cost (for both, the lower the better), and remaining gap (the higher the better) at evaluation time on the real-world dataset for the different utility setups. These results illustrate the ability of PRORL to differently prioritize components of the utility function, depending on their weights.
Finally, analyzing the performance of the two PRORL agents, the advantage of splitting the action space is immediately apparent. In fact, the performance of PRORL-no-split is considerably worse than that of PRORL in all four setups. In particular, in the Efficiency and Balanced setups, PRORL-no-split can only outperform the Random policy, while in the others it can compete with the other baselines. We believe that the reason for such poor performance is the greatly increased action space size, which requires a much longer training phase to converge. On the other hand, PRORL is more efficient, requiring fewer steps to derive an effective policy and thus resulting in superior resource allocation decisions.

B. Sensitivity Analysis With Synthetic Data
Table II presents the overall results obtained using synthetic data for the 4 sensitivity analysis cases. The columns show the average total utility as well as the breakdown into total surplus, total cost, and total remaining gap. Overall, the results show the superiority of PRORL when the data exhibits well-defined patterns: in the Weekly patterns scenario, our agent outperforms the baselines on all three components of the objective. In the Constant demand scenario, our agent trades off a surplus of resources in order to achieve a more than 3× better remaining gap and a more than 10× better cost. Instead, the baselines perform continuous movements due to the high variation of the demand generated in this scenario, resulting in a larger surplus but a much worse remaining gap and cost. In the Local overdemand scenario, our agent provides the cheapest solution that satisfies the nodes' demands the most (i.e., it has the highest total remaining gap). Also in this setting, it trades off a surplus of resources for a higher total utility. Considering a more challenging scenario, such as the Global overdemand, PRORL is still able to outperform the baselines in terms of total utility and remaining gap by trading off performance in the other components of the objective.
Finally, comparing the results for our two agents, the importance of reducing the action space size is once again evident, even when the number of possible actions is small (as we only have 4 nodes). In fact, the performance of PRORL-no-split is always lower than that of its optimized counterpart, PRORL, as well as that of the Heuristic and Greedy baselines. The only exception is the Constant demand scenario, in which the performance results are very similar. This is motivated by the fact that, in this scenario, a good solution requires waiting most of the time since the load is constant; therefore, in small setups, having a single agent to train and no coordination between the two sub-agents might allow the discovery of similar resource allocation policies.

VII. CONCLUSION
In this work, we have demonstrated the advantages of Deep Reinforcement Learning as an approach for orchestrating resources in O-RAN deployments. Our solution is capable of learning the dynamics of the complex environment in which it operates and of making decisions that outperform greedy solutions, both on real and synthetic data, while optimizing three competing components of the objective, namely demand satisfaction, resource utilization, and the cost associated with resource movements. Moreover, we have highlighted the flexibility of PRORL, showing that it can be used to tune the optimization of competing objectives. Finally, through an extensive sensitivity analysis, we have also demonstrated the robustness of our method in challenging scenarios characterized by high variability in the demand profiles. Our future research agenda includes the deployment of PRORL in real O-RAN environments and its extension to support the orchestration of the lower levels of the O-RAN stack.

Fig. 3. (a) Graphical representation of the pool B and the PoPs set N. (b) Example of movement following the execution of f_add. (c) Example of movement following the execution of f_remove.

Fig. 5. Demand variation over 8 weeks for the nodes selected for evaluating PRORL. The data are from the network traces [13] gathered in the city of Milan in 2013.

Fig. 7. Example of synthetic data generated for the four scenarios used for sensitivity analysis.

TABLE I. SUMMARY OF THE DIFFERENCES BETWEEN OUR WORK AND HIGH-LEVEL O-RAN RESOURCE ALLOCATION STATE-OF-THE-ART

TABLE II. AVERAGE TOTAL REWARD AND CONFIDENCE INTERVAL FOR THE FOUR SCENARIOS USED FOR SENSITIVITY ANALYSIS, USING THE UTILITY CONFIGURATION REFERRED TO AS "BALANCED"