Orchestrating Energy-Efficient vRANs: Bayesian Learning and Experimental Results

Virtualized base stations (vBSs) can be implemented on diverse commodity platforms and are expected to bring unprecedented operational flexibility and cost efficiency to the next generation of cellular networks. However, their widespread adoption is hampered by their complex configuration options, which affect both their performance and their power consumption in non-traditional ways. Following an in-depth experimental analysis in a bespoke testbed, we characterize the vBS power consumption profile and reveal previously unknown couplings between their various control knobs. Motivated by these findings, we develop a Bayesian learning framework for the orchestration of vBSs and design two novel algorithms: (i) BP-vRAN, which employs online learning to balance the vBS performance and energy consumption, and (ii) SBP-vRAN, which augments our optimization approach with safe controls that maximize performance while respecting hard power constraints. We show that our approaches are data-efficient, i.e., they converge an order of magnitude faster than state-of-the-art Deep Reinforcement Learning methods, and achieve optimal performance. We demonstrate the efficacy of these solutions in an experimental prototype using real traffic traces.


INTRODUCTION
Virtualization is considered one of the key approaches for bringing cellular networks up to speed with the demanding services they aspire to offer to users [1]. The latest frontier in this endeavor is the development of virtualized Radio Access Networks (vRAN) where legacy base stations (BSs) are replaced by softwarized stacks such as those developed by srsRAN [2] and OpenAirInterface (OAI) [3]. These novel BSs are fully-configurable and can be deployed in different platforms ranging from commodity servers and small embedded devices to moving nodes such as drones [4]. This RAN transformation constitutes a paradigm shift for cellular networks and is expected to offer the much-needed performance flexibility, facilitate the necessary network densification, and reduce significantly their capital and operating expenses [5]. Hence, it is not surprising that we see today numerous industry efforts aiming to build such BS software stacks [2], design fully-open RAN architectures [6], and even conduct extensive field trials [7].

The problem
Nevertheless, the advent of vRANs raises novel technical challenges, since virtualized base stations (vBSs) differ significantly from their hardware-based legacy counterparts. On the one hand, Open RAN solutions (led by the O-RAN Alliance) enable vBSs to change in real time a variety of operation parameters, such as transmission power and modulation schemes, in order to adapt to volatile network conditions and dynamic user needs. On the other hand, though this certainly provides network operators an unprecedented level of flexibility, it comes at the cost of less predictable performance due to the complex couplings between the high-dimensional space of tunable control knobs and the resulting performance, as we reveal in Sec. 3. The latter is crucial for economic reasons, especially in light of the increasing network densification, but also because vBSs are often expected to operate under tight energy budgets [8]; consider, for instance, vBSs that are supported by batteries or Power-over-Ethernet (PoE) lines. Therefore, existing resource control policies run the risk of under-utilizing this new type of BS, or of rendering vRANs economically unsustainable. It becomes clear, therefore, that in order to unleash the full potential of vRANs we need to answer two key questions: (i) What are the performance and power consumption characteristics of virtualized BSs? (ii) How can we optimize their operation using an adaptive and platform-oblivious approach?
In this paper we tackle these questions following a detailed experimental and analytical methodology.

• J. A. Ayala-Romero is with Trinity College Dublin.
• A. Garcia-Saavedra is with NEC Labs Europe.
• X. Costa-Perez is with NEC Labs Europe, i2CAT and ICREA.
• G. Iosifidis is with Delft University of Technology.

Our solution
We start by studying the vBS operation using different hosting platforms and usage scenarios in a customized wireless testbed. Our results shed light on the relationship between performance (throughput), power consumption, and vBS controls such as the modulation and coding scheme (MCS) and spectrum allocation. For instance, we find that the baseband unit (BBU) consumes power comparable to that of the wireless transmissions, and we observe that the vBS power consumption and effective throughput are affected by the configurations in a non-linear and non-monotonic fashion. These results depend heavily on the hosting platform and underline the difficulties in optimizing the vBS operation. Moreover, we observe that the uplink (UL)-related computations of the vBS stack consume more power and are more sensitive to MCS and SNR variations than the respective downlink (DL) computations; a finding attributed to the heavier UL decoding. Besides, we measure the vBS power consumption for concurrent UL and DL processing and find it significantly smaller than the total consumption of these operations when executed separately (UL-only or DL-only). These findings are particularly important since uplink transmissions are needed to support the ever-growing user traffic. Our analysis is centered on energy since it is the bottleneck vBS resource that affects both their computations and transmissions, and which, if not properly controlled, will induce prohibitive costs and environmental consequences as cellular networks become ever more pervasive [9].
The take-away message from these extensive measurements (presented in Sec. 3) is that, unlike legacy BSs, virtualized BSs have a complex, poly-parametric, and platform-dependent performance and power consumption profile; and this renders traditional control policies inefficient for their management. To overcome this obstacle, we propose and evaluate a novel machine learning framework that learns on-the-fly the vBS operational profiles and selects their optimal configuration based on the network needs and power availability or constraints. In particular, we formulate two energy-aware vBS control problems and design learning algorithms that solve them in a robust fashion: (i) BP-vRAN (Bayesian optimization for Power consumption in vRANs), which finds a tunable trade-off between performance and power consumption; and (ii) SBP-vRAN (Safe Bayesian optimization for Power consumption in vRANs), which maximizes the vBS performance subject to hard constraints on power consumption. The former allows operators to balance performance and power expenses, while the latter is crucial for vBSs running on power-constrained platforms, e.g., Power-over-Ethernet cells.
Our algorithms are founded on Bayesian optimization theory [10] and Gaussian Processes (GPs) [11]. These tools are appropriate for our problems because, as we show in this paper, they are remarkably data-efficient, which is an important requirement in our case given the high-dimensional nature of our context-action space. The GPs model the behavior of the vBS in terms of performance and power consumption, using measurements that are collected at runtime. Accordingly, we use a contextual bandit framework to explore the space of vBS configurations and exploit the best ones for each context. As context, we use the average UL/DL traffic load and SNR values, which we measure over certain time windows as these are determined by the pertinent O-RAN specification [6]. The outcome is a non-parametric algorithmic framework that makes minimal assumptions about the system, adapts to user needs and network conditions, and provably maximizes the throughput of the system. Furthermore, drawing ideas from safe Bayesian optimization [12], [13], the SBP-vRAN algorithm ensures that the vBS power constraints are not violated during exploration, hence enabling vBS deployment on energy-constrained platforms. By design, this framework outperforms other approaches that require knowledge of the vBS functions [14] or offline data to approximate them [15], as well as adaptive techniques that offer no performance guarantees or rely on strict system modeling assumptions [16], [17] (see Sec. 2).
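To illustrate the core mechanism, the following is a minimal sketch of a GP-based contextual bandit in the spirit of (C)GP-UCB using scikit-learn; the discretized action grid, kernel choice, and exploration weight are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Hypothetical discretized action grid (e.g., normalized MCS/airtime pairs).
actions = np.array([[m, a] for m in np.linspace(0, 1, 5)
                           for a in np.linspace(0, 1, 5)])

# One GP models the reward over the joint (context, action) space.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-2,
                              normalize_y=True)
X_hist, y_hist = [], []  # observed (context, action) samples and rewards

def select_action(context, beta=2.0):
    """Pick the action maximizing the GP upper confidence bound."""
    if not X_hist:  # no data yet: explore at random
        return actions[np.random.randint(len(actions))]
    cand = np.hstack([np.tile(context, (len(actions), 1)), actions])
    mu, sigma = gp.predict(cand, return_std=True)
    return actions[np.argmax(mu + beta * sigma)]

def update(context, action, reward):
    """Refit the GP with the new noisy sample observed at runtime."""
    X_hist.append(np.concatenate([context, action]))
    y_hist.append(reward)
    gp.fit(np.array(X_hist), np.array(y_hist))
```

Each orchestration period then amounts to calling select_action(context), applying the chosen configuration, and feeding the measured reward back via update().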
Finally, we perform an extensive evaluation in a customized testbed based on srsRAN [2], using several tools to measure the vBS power consumption in real time. This is an important step in our study as it allows us to assess the practical efficacy of the proposed learning algorithms. Indeed, we verify that both solutions converge to the optimal vBS configuration in a variety of scenarios. To that end, we also propose and evaluate several practical enhancements that expedite the algorithms' convergence. Using real traffic traces, we show, step by step, how our framework explores the configurations, and how it refrains from violating the power constraints when necessary. We also benchmark our solution against a state-of-the-art Reinforcement Learning (RL) solution. Namely, we implement a Deep Deterministic Policy Gradient (DDPG) algorithm using an actor-critic neural network (NN) architecture [18], adapted to our contextual bandit problem. We find that our framework is more data-efficient than such state-of-the-art RL approaches, which require orders of magnitude more measurements (hence, also more time) to train the NNs. We believe such experimental comparisons contribute to the ongoing discussion about which AI/ML techniques can in practice solve resource orchestration problems in cellular networks.

Contributions and paper organization
Motivated by the increasing importance and fast-paced deployment of virtualized base stations [2], [6], [7], we revisit the problem of energy-aware resource orchestration in cellular networks. Using a hybrid experimental and theoretical approach, we make the following contributions:
• We built a bespoke wireless testbed and performed an exhaustive experimental study of the power consumption and performance of vBSs, using different hosting platforms, configurations and use cases. Our experiments reveal hitherto-unknown features of this new class of base stations, which depart significantly from the energy consumption profile of legacy base stations.

• We developed a non-parametric learning framework to optimize the vBS operation at runtime, and we propose two algorithms for tackling two key problems: (i) BP-vRAN, which balances performance and costs; and (ii) SBP-vRAN, which maximizes performance subject to hard power consumption constraints. Our framework is based on Bayesian learning techniques, which remain relatively unexplored in communication networks (cf. Sec. 2), and which we extend to account for the network context and amend with practical rules in order to be suitable for vRANs.
• Finally, we assess the performance of our algorithms using realistic contexts (network loads and channel dynamics), and compare their performance and data requirements with a state-of-the-art RL solution. The findings verify that they constitute strong candidates for next-generation zero-touch vBS control. The source code of BP-vRAN and SBP-vRAN and the produced experimental datasets are publicly available, aspiring to facilitate the evaluation of other AI/ML solutions for vRAN orchestration.

This paper extends our preliminary conference version [19] with the following contributions:
• We design and implement a customized version of a state-of-the-art deep reinforcement learning algorithm (DDPG) as a benchmark solution. We configure it to efficiently solve both of the problems investigated in this paper.
• We expand our evaluation section to thoroughly compare our solutions, BP-vRAN and SBP-vRAN, against the DDPG algorithm. We evaluate the convergence rate for both problems and, for the second one, assess the impact of a sudden change in the power budget. We discuss the pros and cons of Bayesian learning against reinforcement-learning NN-based solutions.
Paper Organization. Section 2 discusses the related work and positions our contributions accordingly, and Section 3 presents experimental measurements that bring to the fore the vBS control challenges. In Section 4 we introduce the system model and formulate the two optimization problems. Section 5 follows with the Bayesian-based learning algorithms for solving the problems at hand, and Section 6 presents a series of experiments that validate our approach and compare it with deep-learning algorithms. We conclude in Section 7.

Network Optimization & Automated Configuration
The works that optimize resource management in softwarized cellular networks can be classified into: (i) those requiring models that relate control variables to performance metrics; (ii) model-free approaches that rely on offline training data; and (iii) online learning techniques. Interesting examples in (i) include [20], which performs rate control to maximize throughput subject to computing capacity; [14], which also selects the MCS and airtime; and [21], which additionally adapts to traffic. Nonetheless, such models are in practice platform/context dependent and unknown. On the other hand, model-free approaches employ machine learning, e.g., Neural Networks, to approximate performance functions [22]. Such approaches are used in network slicing [23], throughput forecasting [15], edge computing [24], etc. Their efficacy is remarkable as long as there are enough representative training data. Otherwise, we need to employ online learning, which has recently been used, for instance, to configure video analytics systems [25] and to minimize the power consumption and interference among BSs [26]. Similarly, online convex optimization is used for cloud and IoT resource orchestration [27], [28], but requires convex functions; a condition not satisfied here. Another approach is reinforcement learning (RL), used in spectrum management [16], network diagnostics [29], interference coordination [30], and SDN control [31], among others. In this line, [32], [33] optimize the energy efficiency of the network as a function of parameters such as the resource block allocation, the transmission power, or the amount of network offloading. Compared to [32], not only do we consider more configuration parameters, but we also consider more relevant aspects and dimensions of the problem.
Specifically, [32] relies on a simplified setup comprising communicating blocks implemented in GNU Radio rather than a full system, and on an over-simplistic power consumption model given by a linear equation where the circuit power is considered constant. In marked contrast, we make no modeling assumptions. We rely on real measurements from a full-fledged 3GPP-compliant system, which moreover show that the power consumed by our target object (a virtualized BBU) is highly variable, exhibits non-linear behavior, and depends on many factors. In [33], the authors address the problem of offloading and autoscaling in mobile edge computing considering renewable energy. However, they do not consider the radio access network (RAN), which is the focus of our work, and hence their approach cannot be applied to our problem.
Similarly to RL, contextual bandits have been employed to adjust video streaming rates [34]; configure BS parameters (e.g., handover thresholds) [35], [36]; assign CPU time to virtualized BSs [17]; and control mmWave networks [37], [38]. Here, instead, we combine Gaussian Processes [11] and contextual bandit algorithms [39] to build a data-efficient Bayesian optimization framework [10] with convergence guarantees. Our approach captures the non-trivial multimodal correlations among configurations (revealed by our experiments) through GPs, and uses these perpetually-updated functions to sample the decision space. Our work draws from the seminal CGP-UCB algorithm [39], which we extend to include vRAN-specific context, to optimize throughput and power costs, and to satisfy hard power constraints. This is crucial for vBSs, which cannot exceed their power threshold at any time, e.g., when they are powered over Ethernet.
Despite being very successful in many problems, ranging from the design of experiments to automated machine learning [10], Bayesian learning algorithms have to date seldom been used in communication networks, with very few exceptions such as [40], which explores the optimal server configuration for big data computing. Our approach aspires to fill this gap by studying experimentally their efficacy on the vRAN orchestration problem. To that end, we also compare them with a state-of-the-art Deep RL solution: the Deep Deterministic Policy Gradient (DDPG) algorithm adapted to our contextual bandit setting. Such sophisticated neural-network-based solutions have only recently been used in wireless networks (e.g., for traffic scheduling) [17], [41], [42], and, to the best of our knowledge, have not been compared against Bayesian optimization approaches.

Experimental Profiling of vBS Computing & Power Consumption
Clearly, it is imperative to explore experimentally the operation of these new BSs. The early work of [43] studied the cost savings when pooling the processing operations of multiple BSs, and [44] proposed a similar vRAN architecture and measured 30% processing load reduction. Other studies considered the effect of MCS, bandwidth, and SNR on BBU computing load [45], [46]. In [47] an OAI simulator was used to model the processing time for different configurations, and [17] presented measurements with srsLTE for the impact of traffic. Our experimental analysis builds on these important works and further measures the impact of new context parameters and radio schedulers on throughput, the coupling of uplink and downlink operations, and the vBS power consumption in different scenarios.
Existing power consumption studies for legacy BSs focus on the effect of power amplifier, RF output, and baseband processing. The work [48] introduced the EARTH model which relates the RF output power with the supplied power; and [49] considered also the effect of bandwidth. The works [50], [51] proposed similar models for macro and micro BSs, and [52] studied how the packet length affects the CPU power consumption. A detailed model accounting for the different BS components is presented in [53], [54].
To illustrate the power behavior of legacy BSs, we rely on the seminal model proposed in [48], where the consumed power P_in is given by

P_in = N_TRX (P_0 + Δ_p P_out),  for 0 < P_out ≤ P_max,
P_in = N_TRX P_sleep,            for P_out = 0,        (1)

where N_TRX is the number of transceivers, P_out is the RF output power, P_max is the maximum RF output, P_0 represents the power consumption at zero RF output power, P_sleep is the power consumption of the transceiver components in sleep mode, and Δ_p is the slope of the load-dependent power consumption. Note that the model in eq. (1) is basically focused on the downlink, which is the predominant factor in legacy BSs. Conversely, for the new generation of small form-factor vBSs, the uplink and the configuration parameters are equally important (see footnote 1). Moreover, although the downlink transmission power and airtime can be captured by P_out, other factors such as the MCS and channel quality are not considered in eq. (1), and we have found they are relevant to the consumed power of vBSs. We observe that the model in eq. (1) is linear, which is a good approximation of the measurements in [48]. Its slope, Δ_p, characterizes the relation between the consumed power and the total RF output power P_out radiated at the antenna elements. Among previous works that focus on vBSs, [56] proposed a theoretical model of CPU power consumption as a function of the active CPU cores, clock speed, and load. It also assumes a linear relation of traffic with computational load, and hence with the consumed power. This assumption is not universal, however, and our findings agree with previous studies reporting non-linear effects [45].
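To make the legacy baseline concrete, below is a minimal sketch of the EARTH-style linear model of [48]; the default parameter values are illustrative macro-BS figures, not measurements from this paper:

```python
def earth_power_model(p_out, n_trx=1, p0=130.0, delta_p=4.7,
                      p_max=20.0, p_sleep=75.0):
    """Linear EARTH-style model: supplied power P_in vs. RF output P_out.

    p0      -- power draw at zero RF output (W), illustrative value
    delta_p -- slope of the load-dependent consumption
    p_sleep -- per-transceiver sleep-mode consumption (W)
    """
    if p_out == 0:
        return n_trx * p_sleep              # all transceivers asleep
    assert 0 < p_out <= p_max, "RF output outside the model's range"
    return n_trx * (p0 + delta_p * p_out)   # linear in the RF output power
```

Note how the whole load dependence is a single slope on P_out; as argued above, this cannot capture MCS, channel quality, or uplink processing in a vBS.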
More importantly, the impact of hardware, software platform, and context on these metrics is unknown and cannot be captured by predefined models. Our GP-based approach overcomes this obstacle since it essentially builds the models on-the-fly from the sampled data.

1. In femtocells, the BBU consumes 40% of the power [55].

Fig. 1b: Consumed power over the baseline for different radio bandwidths and hardware platforms. SF PC 1: Intel NUC i7-8559U@2.70GHz; SF PC 2: Intel NUC i7-8650U@1.90GHz; Server 1: Dell XPS 8900 i7-6700@3.40GHz; Server 2: Dell Aurora R5 i7-9700@3.00GHz.

PRELIMINARY EXPERIMENTAL ANALYSIS
We performed experiments using a customized srsLTE-based testbed [2], described in Section 6.1. We present here results that motivate the problem and our solution approach.
• BBU/CPU Power Cost & Impact of Platform. Our first finding is that the power consumption associated with the BBU processing is comparable to the RF chain's transmission power. This result is consistent with previous studies; for example, [55] estimated that 40% of a femtocell's power consumption is due to its BBU. In detail, Fig. 1a dissects the power consumption of a vBS deployed on a small form-factor (SF) PC, and presents the different power components stemming from the BBU's CPUs (see footnote 2); the BBU's cloud platform except the CPUs; and the actual radio unit (RU), which is deployed over a USRP software-defined radio. In order to have a complete picture, we measure the power consumption in four different scenarios: (i) the vBS is not deployed (baseline), (ii) the vBS is deployed with an idle user attached (vBS idle), (iii) the vBS is transmitting 20Mbps of downlink (DL) traffic, and (iv) the user is transmitting 20Mbps of uplink (UL) traffic to the vBS.
Excluding the baseline scenario, the CPU power consumption is, on average, 29% larger than the RU power consumption; while the overall BBU power exceeds it by 175% (208% with full UL load). Interestingly, these numbers depend on the platform which hosts the BBU. Namely, Fig. 1b shows the BBU consumption over the baseline for various platforms. 3 We compare the power consumed by the BBU in idle state and when operating at full UL/DL buffer, and subtract the baseline power. Indeed, the power consumption changes significantly, and it is also affected by the vBS bandwidth -yet another configurable parameter of softwarized base stations.
• Impact of SNR & MCS. The second finding is that the signal-to-noise ratio (SNR) of the wireless channel and the UL modulation and coding scheme (MCS) affect the BBU computing load - and hence its power consumption - in a non-linear fashion. This is because the decoder needs increasingly more iterations as the received signal becomes noisier. Thus, the decoding time per subframe increases, e.g., by 52% between 20 and 15 dB for MCS 23, see Fig. 2a; and this induces a commensurate increase in power consumption, see Fig. 2b. Besides, Fig. 2b shows that, even for a fixed decoding time, higher MCS values induce higher power consumption, which is attributed to their more intricate demodulation (denser constellation map). Importantly, excessive decoding delays can induce throughput loss since they lead to violations of vBS processing deadlines [2]. Hence, maximizing throughput not only has an unpredictable effect on power, but is indeed highly non-trivial to achieve in a resource-efficient way.

2. We use Intel's Running Average Power Limit function, integrated into the Linux kernel, to measure the CPU power consumption.
3. The small PCs consume less power than the servers, but the latter can host more vBSs and thus consume less power per user.
• Configuration Options & Impact of Scheduler. The vBS orchestration difficulties are exacerbated by the plenitude of configuration options these base stations offer. Fig. 3a, for instance, presents combinations of MCS and airtime values (percentage of used subframes) achieving the same UL throughput. Configurations with higher MCSs (and therefore lower airtime) reduce power by 38%. However, this relation is non-monotonic, as we have also measured higher power when the MCS increases and the SNR is relatively low. This latter effect is due to the fast increase of the computing load (see Fig. 2b). On the other hand, Fig. 4 shows the BBU power consumption when DL and UL traffic is processed separately and concurrently (UL+DL), for high SNR and various MCS values. We observe that the joint power is not the sum of the separate components. For instance, for MCS 15, concurrent DL and UL processing consumes just 7.5% more than UL-only processing (and 26% more than DL-only). This is because there are common power consumption factors in both streams. This, in turn, makes it difficult to predict the overall vBS power consumption, given that the DL and UL can be configured separately. Also, note that UL power costs are higher and more volatile than DL costs, since decoding is more computationally demanding.
Conclusions: characterizing the vBS performance and power consumption is intricate, as they depend on exogenous conditions such as the network traffic and SNR, and on the BS configuration, e.g., the selected MCS and airtime parameters. There are many DL and UL configurations, and some of them exhibit non-linear and non-monotonic relations with power and throughput. Moreover, the power consumption depends on the BBU platform and the radio scheduler, which is almost fully customizable in vBSs. This hinders the derivation of generally applicable power consumption models. Hence, we propose the use of online learning to profile each vBS's power cost and performance, and to devise accordingly goal-driven configuration policies.

SYSTEM MODEL AND PROBLEM FORMULATION
Our modeling approach follows carefully the latest O-RAN architecture proposals [6] which have provisions for (in fact, envision) learning-based orchestration of the BS operation, and as such is fully aligned with the ideas presented in this work. We start by presenting the O-RAN elements that are pertinent to our model and subsequently we formulate the two optimization problems.

O-RAN Background and Model
We consider a virtualized Base Station (vBS) comprising a Baseband Unit (BBU) that may correspond to a 4G eNB or a 5G gNB 5 hosted in a cloud platform and attached to a Radio Unit (RU), which are fed by a (possibly) constrained energy source. This type of BSs is relevant for low-cost small cells, Power-over-Ethernet (PoE) cells, and other similar platforms that are increasingly common in 5G-and-Beyond networks. Our goal is to use O-RAN's control architecture to implement configuration policies that are adaptive to system dynamics while satisfying different energy-aware performance criteria.
O-RAN Architecture. Fig. 5 shows the high-level architecture of our system, which is O-RAN compliant [6]. The Learning Agent (LA) implements online learning algorithms within the Non-Real-Time (Non-RT) RAN Intelligent Controller (RIC) in the system's orchestrator, and selects efficient radio policies every orchestration period t = 1, . . . , T (usually in the order of seconds). The optimal decision (i.e., a radio policy) in each t depends on the context information. This is provided at the beginning of each period by the vBS (via the O1 interface) from measurements collected at subsecond granularity within the near-RT RIC (using the E2 interface). The computed radio policies are then configured on the vBS via its A1-P interface as shown in Fig. 5. At the end of each orchestration period, the Data Monitor module in the Near-RT RIC computes a reward by aggregating the adopted performance metrics, which are collected from the vBS via the E2 interface; and eventually provides the results to the LA (O1 interface). Our system model and solution algorithms are fully compatible with this architecture.
Context Information. We define the DL context at each period t as ω_t^dl, which captures the average DL channel quality (CQI) reported by the users and the new DL bit arrivals; the UL context ω_t^ul is defined analogously. The UL CQI is measured by the vBS at the MAC layer, and the new UL bit arrivals are estimated from the periodic Buffer Status Reports (BSRs) of the users (UEs). All these measurements are collected by the Near-RT RIC's Data Monitor (Fig. 5) from the vBS using the E2 interface at subsecond granularity, and are aggregated at the start of each orchestration period t. We denote the global context vector ω_t = (ω_t^dl, ω_t^ul) ∈ Ω, where Ω is the context space. Note that the contexts are related to the traffic load and channel quality and are exogenous parameters, i.e., the configuration decisions cannot affect them. This allows us to formulate the problem as a Contextual Multi-armed Bandit, or Contextual Bandit (CB). With this formulation we can configure the system based on the observed contexts and learn from the zeroth-order feedback of our system (i.e., we observe only the outcome of the employed configuration).

5. 5G decouples the BBU into 2 logical functions, i.e., a central unit (CU) and a distributed unit (DU). Our scheme controls the DU, or both when they are co-located.
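For illustration, the aggregation of sub-second monitoring samples into the period-t context vector could look as follows; the field layout (mean CQI plus mean load per direction) is an assumption for this sketch, not the paper's exact context definition:

```python
import numpy as np

def aggregate_context(dl_cqi, dl_arrivals, ul_cqi, ul_bsr):
    """Aggregate sub-second Data Monitor samples into one context vector.

    dl_cqi / ul_cqi       -- per-sample channel quality indicators
    dl_arrivals / ul_bsr  -- per-sample new-bit arrivals (UL estimated
                             from Buffer Status Reports)
    Returns [avg DL CQI, avg DL load, avg UL CQI, avg UL load].
    """
    return np.array([np.mean(dl_cqi), np.mean(dl_arrivals),
                     np.mean(ul_cqi), np.mean(ul_bsr)])
```

The learning agent would call this once per orchestration period, on the samples collected over that period via the E2 interface.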
vBS Controls. We define the DL control x_t^dl and, analogously, the UL control x_t^ul, and denote the joint configuration x_t = (x_t^dl, x_t^ul) ∈ X, where X is the set of eligible vBS configurations. The DL and UL performance are captured by the functions R^dl(ω^dl, x^dl) and R^ul(ω^ul, x^ul), which are aggregated into the vBS reward r(ω, x). It is important to stress that in practice we can only hope to observe noisy values of these functions, even when their arguments are fixed, because the system operation is naturally stochastic and the power measurements are noisy - as we have indeed seen in our experiments. Fortunately, our optimization framework can handle such impairments. Henceforth, we denote by R_t^dl(ω_t^dl, x_t^dl), R_t^ul(ω_t^ul, x_t^ul) and r_t(ω_t, x_t) the noisy samples of these functions at period t, which are considered to be stationary and return the mean (unperturbed) respective values when averaged (i.e., in expectation).

Case 1: Balancing performance and cost
We start with the case where the power supply is scarce or, equivalently, the operator wishes to reduce the power consumption costs. This can be achieved with a scalarized objective function:

u(ω_t, x_t) = r(ω_t, x_t) − δ B(P(ω_t, x_t)),

where P(ω_t, x_t) is the vBS power consumption associated with the context-control pair (ω_t, x_t), B(·) is a smooth function that models the cost associated with power consumption, and the parameter δ determines the relative importance of the power cost versus the achieved throughput, and can be selected based on the operator's preferences. We will also use u_t(ω_t, x_t) to denote the realization of the objective function based on the t-period samples P_t(ω_t, x_t) and r_t(ω_t, x_t).
The selection of the cost function is crucial here. In the simplest case, it can be a linear function that maps the actual consumed power to a monetary value (negative reward). But it can also model situations where policies that exceed a power threshold should be prevented due to regulation, battery constraints, and so on. To capture all these cases, we propose a parameterized sigmoid function with sharpness parameter a and tipping point b:

B(x) := (1 + e^{ab}) / (e^{ab} (1 + e^{a(b − x)})).

When a → 0, the function B(·) approximates a linear function, and as a grows [58] it approximates the step function, without, however, inducing unbounded gradients - a condition that would deteriorate the learning process. Following the standard approach in Bayesian bandit optimization [13], [39], we use the cumulative contextual regret to assess the performance of our algorithm. Namely, we define the T-period contextual regret:

R_T = Σ_{t=1}^{T} ( max_{x′∈X} u(ω_t, x′) − u(ω_t, x_t) ),

where max_{x′∈X} u(ω_t, x′) yields the best decision for the current period, which we cannot calculate in practice since the objective function is unknown. Our goal, therefore, is to find a sequence of decisions {x_t}_{t=1}^{T} from the set X which ensures an asymptotically sublinear average pseudo-regret, i.e., lim_{T→∞} E[R_T]/T = 0, where the expectation is taken with respect to the noisy samples and the context arrival process.
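To build intuition for the sharpness and tipping parameters, the snippet below evaluates a normalized sigmoid power-cost of this kind; the exact normalization used here is an illustrative choice, not necessarily the paper's:

```python
import numpy as np

def power_cost(x, a, b):
    """Sigmoid cost with sharpness a and tipping point b.

    Small a: nearly linear in x (gentle pricing of power).
    Large a: approaches a 0/1 step at x = b, yet with bounded gradients.
    """
    return (1 + np.exp(a * b)) / (np.exp(a * b) * (1 + np.exp(a * (b - x))))
```

With a large sharpness (e.g., a = 20, b = 5), configurations well below the tipping point are essentially free while those above it pay the full penalty, which approximates a hard threshold while keeping the objective smooth enough to learn.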

Case 2: Hard power budget
A different problem arises when the vBS operates under a hard power budget $P_{\max}$, e.g., when powered over Ethernet. In these cases, the LA has to find the maximum-throughput configuration that respects the available power budget. Importantly, the LA needs to achieve this goal by employing a safe exploration of the configuration space $\mathcal{X}$, satisfying the $P_{\max}$ threshold at every period, i.e., not only at the final optimal-operation stage. We define the respective regret:
$$R_T = \sum_{t=1}^{T} \Big( \max_{x' \in \mathcal{S}(\omega_t)} r(\omega_t, x') - r(\omega_t, x_t) \Big), \qquad (6)$$
where in this case the decisions are selected from the set of safe controls
$$\mathcal{S}(\omega_t) := \big\{ x \in \mathcal{X} : P(\omega_t, x) \leq P_{\max} \big\}.$$
Note that the regret is defined directly on the throughput reward, since the power is now treated as a hard constraint. Our goal is to find a sequence $\{x_t\}_{t=1}^{T}$ with $x_t \in \mathcal{S}(\omega_t)$ at every period, which ensures sublinear regret. It is important to stress that the sets $\mathcal{S}(\omega_t), \forall \omega_t$, are initially unknown, since $P(\omega, x)$ is also unknown, and therefore we need to learn them using the real-time measurements $P_t(\omega_t, x_t)$. Similarly, we only have access to $r_t$ and $u_t$, i.e., the $t$-period noisy measurements, instead of the actual functions $r$ and $u$.
To solve the above problems, we propose a nonparametric learning approach using Gaussian Processes, Contextual Bandits, and Bayesian learning. Our approach has the additional practical advantage that one can change $P_{\max}$ at runtime, at any time and without having to restart the learning process, which in fact is possible in the PoE standard (IEEE 802.3bt). Other parametric methods, such as Reinforcement Learning relying on neural networks, need to be re-trained if the constraint changes, which substantially increases the required training data.

BAYESIAN ONLINE LEARNING SOLUTIONS
Next, we propose two online algorithms for solving the problems stated in Sections 4.2 and 4.3. Our proposals leverage state-of-the-art Bayesian learning techniques which are properly configured and extended to account for the network context information, and amended with practical rules (of independent interest) that improve their performance, as we verify experimentally.

BP-vRAN: Balancing performance and cost
Many algorithms for solving contextual bandit problems assume there is a feature vector associated with each action, and the objective function is linear in that vector [59], [60]. This assumption does not hold here for the following reasons. Firstly, the objective function is not linear, see eqs.
(2)-(4). Secondly, the function values associated with different actions (i.e., vBS control policies) are correlated. Intuitively, we can think that a small change in some parameter (e.g., airtime) will induce a small change in the vBS consumed power. This is actually evaluated experimentally in Fig. 3b. This means that we can obtain information about unobserved context-control pairs by observing nearby actions, thus reducing the exploration time.
Based on these observations, we propose a Bayesian optimization method where we model the objective function as a sample from a Gaussian Process (GP) over the joint context-control space. This non-parametric estimator captures the aforementioned non-linearities and correlations, and provides predictive uncertainty on the function estimation. Hence, it enables us to address effectively the exploration-exploitation trade-off.
Function estimator. We use a GP as a function estimator, which is a collection of random variables following joint Gaussian distributions [11]. Let $z \in Z = \Omega \times \mathcal{X}$ denote a context-control pair. We model the unknown objective function (3) as a sample from a $GP(\mu(z), k(z, z'))$, where $\mu(z)$ is its mean function and $k(z, z')$ is its covariance function, or kernel. Without loss of generality, we assume $\mu = 0$ and bounded variance $k(z, z) < 1$; we refer to this as the prior distribution, since it is not conditioned on data.
Given this prior and a set of observations, the mean and covariance of the posterior distribution can be computed in closed form. Let $y_T = [u_1, \ldots, u_T]^{\top}$ be a vector of noisy samples (assuming i.i.d. Gaussian noise $\sim \mathcal{N}(0, \zeta^2)$) at points $Z_T = [z_1, \ldots, z_T]$. Then, the posterior distribution of the objective function follows a GP distribution with mean $\mu_T(z)$ and covariance $k_T(z, z')$:
$$\mu_T(z) = k_T(z)^{\top} \big( K_T + \zeta^2 I_T \big)^{-1} y_T, \qquad (7)$$
$$k_T(z, z') = k(z, z') - k_T(z)^{\top} \big( K_T + \zeta^2 I_T \big)^{-1} k_T(z'), \qquad (8)$$
where $k_T(z) = [k(z_1, z), \ldots, k(z_T, z)]^{\top}$, $K_T = [k(z, z')]_{z, z' \in Z_T}$ is the kernel matrix, and $I_T$ is the $T$-dimensional identity matrix. These equations allow us to estimate the distribution of unobserved values of $z$ based on the prior distribution, the vector $Z_T$, and the function observations $y_T$.

Kernel function. The kernel selection is crucial, as it shapes the prior and posterior GP distributions by encoding the correlation between the values of the objective function at every pair of points. Namely, $k(z, z')$ indicates the similarity between $u_t(z)$ and $u_t(z')$. In other words, the kernel characterizes the smoothness of the function [61]. The properties of the kernel function should be carefully selected according to the specific application and the underlying function to be learned. We therefore use the experimental data analyzed in Sec. 3 to conclude that our kernel should satisfy two properties: stationarity and anisotropy. On the one hand, the kernel $k(z, z')$ is stationary since it depends only on the distance of $z$ from $z'$, which means it is invariant to translations in $Z$. On the other hand, the kernel is anisotropic since the encoded smoothness differs among the dimensions of $Z$; that is, the kernel is not invariant to rotations in $Z$. The smoothness along the different dimensions of the function $u$ is encoded into a length-scale vector $L = [l_1, \ldots, l_N]$, where $N$ indicates the number of dimensions of $Z$. Thus, the distance between two points based on the length-scale vector can be written as:
$$d(z, z') = \sqrt{(z - z')^{\top} \mathbf{L}^{-2} (z - z')}, \qquad (9)$$
where $\mathbf{L} = \mathrm{diag}(L)$ is a diagonal matrix of the length-scale values.
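The closed-form posterior update of eqs. (7)-(8) can be sketched with NumPy as follows. This is a naive O(T³) implementation for illustration; `kernel` is a placeholder for any covariance function:

```python
import numpy as np

def gp_posterior(kernel, Z, y, z_star, noise_var):
    """Posterior mean and variance at a query point z_star, per eqs. (7)-(8),
    for a zero-mean GP with i.i.d. Gaussian observation noise of variance
    noise_var (the paper's zeta^2). `kernel(a, b)` returns the scalar k(a, b)."""
    T = len(Z)
    K = np.array([[kernel(zi, zj) for zj in Z] for zi in Z])  # Gram matrix K_T
    k_vec = np.array([kernel(zi, z_star) for zi in Z])        # k_T(z*)
    A = K + noise_var * np.eye(T)                             # K_T + zeta^2 I_T
    mu = k_vec @ np.linalg.solve(A, y)                        # eq. (7)
    var = kernel(z_star, z_star) - k_vec @ np.linalg.solve(A, k_vec)  # eq. (8)
    return mu, max(var, 0.0)
```

With a nearly noiseless observation at a point, the posterior mean approaches the observed value there and the variance collapses, while far from all observations the prior is recovered.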
There are several kernel functions satisfying these properties, such as the squared exponential kernel, one of the most commonly used. However, this kernel assumes the underlying function to be very smooth, i.e., infinitely differentiable. This assumption does not hold in our framework, since the function $B(\cdot)$ defined in eq. (4) is not infinitely differentiable. Besides, recall that $B(\cdot)$ maps the consumed power to a monetary cost and can be defined according to the operator's needs. For that reason, we relax this assumption and select the anisotropic version of the Matérn kernel, which also satisfies the properties discussed above [11]. Furthermore, we configure it with parameter $\nu = \frac{3}{2}$, which implies that the objective function is at least once differentiable. Note that this is a mild assumption, which yields a loose regret bound (see Lemma 1). In fact, our experimental evaluation in Section 6 shows that our approach performs much better than our theoretical bounds in the scenarios we tested. However, if we had more information about the structure of the function to learn, we could easily tighten this bound by selecting higher values of $\nu$ or by using a squared exponential kernel, which may improve the rate of increase of the information gain. In this paper, we opt for the most conservative choice to cover scenarios beyond the ones shown in our experimental evaluation. The expression of the selected kernel is given by:
$$k(z, z') = \big( 1 + \sqrt{3}\, d(z, z') \big) \exp\big( -\sqrt{3}\, d(z, z') \big). \qquad (10)$$
To improve performance, we can optimize the hyperparameters $L$ and the noise variance $\zeta^2$ in eqs. (7)-(8) before running the algorithm, by maximizing the likelihood over prior data, and keep these values constant over time. With a different approach, namely when the hyperparameters are optimized using the data acquired at runtime, it is not guaranteed that the GP's confidence interval will cover the true function, and hence the optimization process might get stuck in poor local optima [62]. We have also observed this in our experiments.
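A minimal sketch of the anisotropic Matérn kernel with $\nu = 3/2$, assuming the common textbook form with one length-scale per dimension:

```python
import numpy as np

def matern32_aniso(z1, z2, lengthscales):
    """Anisotropic Matern kernel with nu = 3/2.

    Each dimension of Z gets its own length-scale, so the encoded
    smoothness differs across the context/control dimensions
    (anisotropy), while the value depends only on the scaled distance
    between the points (stationarity)."""
    d = np.linalg.norm((np.asarray(z1, float) - np.asarray(z2, float))
                       / np.asarray(lengthscales, float))
    return (1.0 + np.sqrt(3.0) * d) * np.exp(-np.sqrt(3.0) * d)
```

A unit step along a dimension with a small length-scale lowers the similarity much more than the same step along a dimension with a large length-scale, which is exactly the anisotropy the kernel is meant to encode.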
Acquisition function. The acquisition function selects one control $x_t$ at each period $t$ based on the posterior distribution of the objective function over the context-control pairs. To this aim, we use the Upper Confidence Bound (UCB) method, which follows the principle of optimism in the face of uncertainty and allows us to derive theoretical guarantees for the algorithm. Formally:
$$x_t = \arg\max_{x \in \mathcal{X}} \; \mu_{t-1}(\omega_t, x) + \beta_t^{1/2} \sigma_{t-1}(\omega_t, x), \qquad (11)$$
where $\omega_t$ is the observed context at time $t$, $\beta_t$ is a weighting parameter, and $\sigma_t^2(z) = k_t(z, z)$. We formalize our approach, which we refer to as BP-vRAN (Bayesian optimization for Power consumption in vRANs), in Algorithm 1. At the beginning of each decision period $t$, a context $\omega_t$ is observed. Based on the observed context $\omega_t$ and the vectors $Z_{t-1}$ and $y_{t-1}$, the posterior distribution is computed using eqs. (7)-(8). Then, the control $x_t$ is selected via eq. (11), the sample $u_t(\omega_t, x_t)$ is measured at the end of the period, and the sets are updated as $Z_t \leftarrow Z_{t-1} \cup (\omega_t, x_t)$ and $y_t \leftarrow y_{t-1} \cup u_t(\omega_t, x_t)$.
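The UCB rule of eq. (11) can be sketched over a finite candidate set as follows; `posterior` is a hypothetical callable returning the GP posterior mean and variance of a control under the current context:

```python
import numpy as np

def ucb_select(candidates, posterior, beta_sqrt):
    """UCB acquisition: pick the control maximizing
    mu_{t-1}(w_t, x) + beta_t^{1/2} * sigma_{t-1}(w_t, x)
    over the candidate controls. `posterior(x)` returns (mu, var)."""
    scores = []
    for x in candidates:
        mu, var = posterior(x)
        scores.append(mu + beta_sqrt * np.sqrt(max(var, 0.0)))
    return candidates[int(np.argmax(scores))]
```

A large `beta_sqrt` favors uncertain controls (exploration), while a small one favors controls already estimated to perform well (exploitation).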
Note that an alternative formulation of BP-vRAN with two GPs (to approximate the reward and the consumed power separately) instead of one is amenable to better optimization of the kernels' hyperparameters. Nevertheless, the posterior variance of the objective function can be arbitrarily hard to obtain since the monetary cost of the power (B(·)) is selected by the operator according to its needs. In addition, this approach doubles the computational and memory requirements.
Theoretical results. The choice of a value for $\beta_t$ in eq. (11) is very important, since it controls the trade-off between exploration and exploitation. Larger values of $\beta_t$ lead the acquisition function to select controls with higher uncertainty while, conversely, controls already known to be high-performing (though not necessarily highest-performing) are selected when $\beta_t$ takes smaller values. Following [39], we select
$$\beta_t^{1/2} = B + \zeta \sqrt{2 \big( \gamma_{t-1} + 1 + \ln(1/\epsilon) \big)}, \qquad (12)$$
where $\epsilon \in (0, 1)$, $B \geq \|u\|_k$ is an upper bound on the Reproducing Kernel Hilbert Space (RKHS) norm of $u$, and $\gamma_t$ is the maximum mutual information gain obtained about $u$ after $t$ observations have been collected.

Lemma 1. With probability at least $1 - \epsilon$, the contextual regret $R_T$ of BP-vRAN satisfies
$$R_T \leq \sqrt{C_1 T \beta_T \gamma_T} \qquad (13)$$
at stage $T$, where $C_1 = 8 / \log(1 + \zeta^{-2})$ and $\gamma_T = O(T^{44/45} \log T)$.

The proof of Lemma 1 is given in the Appendix. For the derivation of the bound on the information gain $\gamma_t$, we consider a Matérn kernel with $\nu = \frac{3}{2}$ and $N = 11$ dimensions in $Z$, which correspond to a 6-dimensional context space and a 5-dimensional control space, respectively, as described in Sec. 4. For this setting, we particularize the expression provided in Theorem 5 of [63] to obtain the bound $\gamma_t = O(t^{44/45} \log(t))$. Note that the regret bound obtained in this analysis considers a worst-case scenario, while the performance of the algorithm in practice is commonly far from these bounds, as shown in Sec. 6. It is worth mentioning, however, that the bound provided in Lemma 1 indicates that BP-vRAN is a no-regret algorithm, i.e., $\lim_{T \to \infty} \mathbb{E}[R_T]/T = 0$.

SBP-vRAN: Safe Bayesian Optimization
Imposing hard constraints, as proposed in Sec. 4.3, compounds the problem. Prior works, e.g., in robotics and other areas [12], [13], [64], [65], have proposed Bayesian optimization algorithms with safety constraints. Their main idea is the following: at every period $t$, a subset of safe controls $S_t \subseteq \mathcal{X}$ that satisfy the constraints with high probability is defined. Then, an exploration process must be interleaved so as to expand the safe set, while seeking a safe action with high performance. Unfortunately, these works do not consider contextual information, which clearly affects the safe set, i.e., $S_t(\omega_t) \subseteq \mathcal{X}$. To the best of our knowledge, only SafeOpt [65] proposes a contextual safe learning algorithm. However, although that algorithm provides theoretical guarantees, its acquisition function selects the control with the highest uncertainty among all candidates that can expand the safe set and the potential maximizers. We found in our experiments that this approach has overly slow convergence. This practical issue has been reported in other works as well, e.g., [66]. Hence, we improve this methodology by employing the acquisition function of CGP-UCB [39], constrained to the safe set.
We denote by $y_T^f = [r_1, \ldots, r_T]$ the vector of reward samples at period $T$, and by $y_T^c = [P_1, \ldots, P_T]$ the power consumption samples. We use one GP for the reward and one for the power constraint. Both GPs have the same prior distribution and kernel but different hyperparameters. The posterior distributions can be computed using eqs. (7)-(8), replacing $y_T$ by $y_T^f$ or $y_T^c$, respectively. We denote the posterior mean and covariance of the reward at $T$ as $\mu_T^f(z)$ and $k_T^f(z, z')$, and those of the power as $\mu_T^c(z)$ and $k_T^c(z, z')$. The initial safe set $S_0 \subseteq \mathcal{X}$ is common for all contexts and includes low power consumption configurations (vBS close to idle). This is a worst-case choice; $S_0$ can be expanded using prior data.
At each period, $S_t$ is computed based on the posterior distribution of the power consumption provided by the GP. We assume the true value of the power consumption at time $t$ lies within the interval $[\mu_t^c(z) \pm \beta_t^{1/2} \sigma_t^c(z)]$, where $\sigma_t^c(z) = \sqrt{k_t^c(z, z)}$. Using the posterior distribution, we define the safe set at time $t$ and for a given context $\omega_t$ as:
$$S_t(\omega_t) := S_0 \cup \big\{ x \in \mathcal{X} : \mu_{t-1}^c(\omega_t, x) + \beta_t^{1/2} \sigma_{t-1}^c(\omega_t, x) \leq P_{\max} \big\}. \qquad (14)$$
The controls are selected at each period $t$ using the CGP-UCB policy restricted to the safe set:
$$x_t = \arg\max_{x \in S_t(\omega_t)} \; \mu_{t-1}^f(\omega_t, x) + \beta_t^{1/2} \sigma_{t-1}^f(\omega_t, x), \qquad (15)$$
where $\sigma_t^f(z)^2 = k_t^f(z, z)$. We summarize our approach, named SBP-vRAN (Safe Bayesian optimization for Power consumption in vRANs), in Algorithm 2.

Algorithm 2 SBP-vRAN: Safe online optimization
1: Inputs: Control space $\mathcal{X}$, initial safe set $S_0$, kernel $k$, $\beta$, $P_{\max}$
2: Initialize: $y_0^f = \emptyset$, $y_0^c = \emptyset$, $Z_0 = \emptyset$
3: for $t = 1, 2, \ldots$ do
4:   Observe the context $\omega_t$
5:   Compute $\mu_{t-1}^f$, $\sigma_{t-1}^f$, $\mu_{t-1}^c$ and $\sigma_{t-1}^c$ using eqs. (7)-(8)
6:   Compute the safe set $S_t(\omega_t)$ using eq. (14)
7:   Select the control $x_t$ using eq. (15)
8:   Measure the DL/UL throughputs and $P_t(\omega_t, x_t)$ at the end of the decision period $t$
9:   Compute $r_t(\omega_t, x_t)$ using (2)
10:  Update $Z_t \leftarrow Z_{t-1} \cup (\omega_t, x_t)$
11:  Update $y_t^f \leftarrow y_{t-1}^f \cup r_t(\omega_t, x_t)$
12:  Update $y_t^c \leftarrow y_{t-1}^c \cup P_t(\omega_t, x_t)$
13: end for

It is worth mentioning that in many practical scenarios it is desirable to have a soft constraint instead of a hard one. For instance, we may be interested in violating the soft constraint (increasing the power consumption) to avoid poor user performance. We provide two alternatives to handle this scenario. First, we can use BP-vRAN and design $B(\cdot)$ such that a power consumption exceeding the constraint incurs a high monetary cost. This approach provides soft guarantees, where the power constraint will be met on average but not at every interval. Alternatively, we can modify the definition of the safe set in eq. (14) by adding an exception: if the expected performance of all actions in the safe set is below a performance threshold $r_{\min}$, at least one action whose expected performance is higher than $r_{\min}$ is included. Using this mechanism, we can set a minimum performance requirement for the vBS operation.
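One SBP-vRAN decision period can be sketched as follows, assuming hypothetical `reward_post` and `power_post` callables that expose the posteriors of the two GPs for the current context:

```python
import numpy as np

def sbp_select(candidates, reward_post, power_post, beta_sqrt, p_max, safe_seed):
    """One SBP-vRAN decision (sketch): build the safe set from the power
    GP's upper confidence bound, then apply the CGP-UCB rule restricted
    to that set. `reward_post(x)` and `power_post(x)` return (mu, var)."""
    safe = list(safe_seed)  # S_0: always-safe, near-idle configurations
    for x in candidates:
        mu_c, var_c = power_post(x)
        # x is deemed safe if even the optimistic power estimate fits the budget
        if mu_c + beta_sqrt * np.sqrt(max(var_c, 0.0)) <= p_max and x not in safe:
            safe.append(x)
    best, best_score = None, -np.inf
    for x in safe:
        mu_f, var_f = reward_post(x)
        score = mu_f + beta_sqrt * np.sqrt(max(var_f, 0.0))  # UCB on the reward
        if score > best_score:
            best, best_score = x, score
    return best, safe
```

Note how a control whose power upper confidence bound exceeds the budget never enters the safe set, so it cannot be selected regardless of its expected reward.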
Convergence of SBP-vRAN. Note that SBP-vRAN does not explicitly expand the safe set, unlike other works such as [13], [65]. In general, an explicit expansion of the safe set is needed (e.g., by exploring the controls on the boundary) to converge to the true safe set and therefore to reach the optimal safe control. However, we found that our acquisition function can both maximize the performance and expand the safe set at the same time under some conditions. Let us assume that the objective function and the constrained function are smooth and positively correlated. In this case, maximizing the objective function also implies expanding the safe set; in fact, the optimal configuration is located at the boundary of the constraint space. This is a reasonable assumption in practice, as we can assess empirically. On the one hand, Fig. 6a shows the uplink throughput of our vBS as a function of the MCS and the airtime (two of our control actions); from this figure, we can see that the higher the MCS and the airtime, the higher the throughput. On the other hand, Fig. 6b shows the consumed power as a function of the same variables. Note that both figures show the same trend: the higher the throughput, the higher the consumed power.
We should remark that we have only considered two vBS controls (MCS and airtime) in this example. However, although the power behavior becomes non-linear when including all the dimensions of the problem, these conclusions also hold in the complete problem. Higher airtime clearly provides higher throughput. Higher MCSs also provide higher throughput under feasible conditions (appropriate SNR), as they allow packing more data symbols per unit of time. Similarly, higher MCSs incur higher power consumption, because the number of computations required by the decoding algorithms scales linearly with the number of bits to decode. Moreover, higher transmission power enables higher MCSs and therefore higher throughput. In sum, higher throughput is generally associated with higher power consumption.
The annotations in Figs. 6a-6b exemplify how SBP-vRAN expands the safe set. The initial safe set ($S_0$) is a set of configurations with the lowest power consumption, i.e., low MCS and airtime. This conservative initial safe set avoids violating the constraint from the beginning, but also increases the convergence time. The aim of SBP-vRAN is to maximize the reward function $r$, which is directly related to the throughput. Moreover, our acquisition function in eq. (15) selects controls with high performance but also with high uncertainty. These conditions are met by the controls on the boundary of the safe set. By exploring these controls, we reduce the uncertainty in their neighborhood and thereby expand the safe set. After a few iterations ($t = n_1$), the safe set $S_{n_1}$ has been expanded and the algorithm can select configurations with higher throughput. At that point, the algorithm continues exploring the boundary of the constraint, since it contains the configurations with the highest throughput and also high uncertainty. After a few more iterations, the safe set reaches the boundary of the constraint, finalizing its expansion: the optimal configurations fall on the boundary of the constraint space. This is demonstrated in the following experimental evaluation.

EXPERIMENTAL EVALUATION
We have built a customized testbed to perform a thorough evaluation of the proposed ML resource orchestration techniques under realistic conditions. Our experiments employ the software-based eNB srsRAN, cf. [2], which we have properly modified (e.g., implementing scheduling policies, enabling airtime selection, etc.) so as to capture the entire range of our controls. The testbed configuration and created datasets are available online (https://github.com/jaayala/power_dlul_dataset) for reproducibility reasons and, importantly, so as to facilitate further research in the area of AI/ML-assisted RAN orchestration.

Experimental setup
The testbed, shown in Fig. 7, comprises a vBS, the user equipment (UE) 8, and a digital power meter. Both the vBS and UE consist of an Ettus Research USRP B210 as RU, srseNB/srsUE (from the srsRAN suite [2]) as BBU for the eNB and UE, respectively, and two small form-factor general-purpose PCs (Intel NUCs with CPU i7-8559U@2.70GHz), each deploying the respective BBU and the near-RT RIC of Fig. 5. The vBS and UE are connected using SMA cables with 20dB attenuators, and we adjust the gain of the RU's RF chains to attain different SNR values. Without loss of generality, we select a 10-MHz band that renders a maximum capacity of roughly 32 and 23 Mbps in DL and UL, respectively. We use the power meter GW-Instek GPM-8213 to measure the power consumption of BBU and RU by plugging their power supply cable into a GW-Instek GPM-001 measuring adapter. Finally, we have integrated the E2 interface and the ability to enforce control policies on-the-fly (see Section 4) in srseNB.
We use three auxiliary PCs (not shown in the figure) hosting the non-RT RIC and the network traffic end hosts, which use mgen 9. Furthermore, we have implemented the O1 interface (Fig. 5) using the power meter's USB-based SCPI (Standard Commands for Programmable Instruments) interface for the power consumption measurements, and a REST interface for the remainder. A final remark is that our RU (USRP B210) does not integrate a variable power amplifier. Instead, it uses a fixed power amplifier consuming 3W and a variable attenuator for power calibration (see Fig. 1a). To compensate for this, we post-process the power measurements to include a variable RU consumption according to a linear model based on previous works [48], [50], with a 3W cap.
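The post-processing step can be sketched as follows; the linear-model parameters here are purely illustrative placeholders, not the calibrated values from [48], [50]:

```python
def ru_power(tx_gain_db, p_idle=1.0, slope=0.05, p_cap=3.0):
    """Hypothetical variable RU power model with a 3 W cap (sketch).

    Since the USRP B210 uses a fixed 3 W amplifier plus a variable
    attenuator, measurements are post-processed with a linear model in
    the transmit gain, capped at the amplifier's 3 W, in the spirit of
    [48], [50]. p_idle and slope are illustrative, not calibrated."""
    return min(p_idle + slope * tx_gain_db, p_cap)
```

The model is monotone in the transmit gain and saturates at the cap, mimicking a variable amplifier's behavior within the fixed-amplifier budget.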
For the elaboration of the dataset used in Sec. 3, we configure the vBS and UE in order to fix the uplink and downlink conditions in terms of traffic load, channel quality, MCS, and airtime. Then, we fix each configuration for approximately one minute while the system takes measurements that are later processed to obtain its statistics. We assess the power behavior of the vBS by measuring the power consumption of its CPU and of the whole BBU; the achieved performance in terms of throughput and goodput; details about the decoder at the vBS, such as the subframe decoding time and the number of turbo decoder iterations per subframe; and some MAC and PHY indicators, such as the Buffer Status Report (BSR), the Block Error Rate (BLER), and the used MCS and airtime. Moreover, we detect and identify infeasible configurations in the dataset. This mainly occurs when an MCS value is forced but the channel quality is not good enough to decode its data. Finally, we release our dataset online, allowing the community to realistically emulate the behavior of a vBS in terms of power consumption and performance as a function of its configuration and conditions (user traffic load and channel qualities) for future research.

8. We use one UE emulating the load of multiple users (see Sec. 6.3).
9. https://www.nrl.navy.mil/itd/ncs/products/mgen
For the evaluation we consider $|\mathcal{P}^{dl}| = 20$, $|\mathcal{M}^{dl}| = 28$, $|\mathcal{M}^{ul}| = 24$, and $|\mathcal{A}^{dl}| = |\mathcal{A}^{ul}| = 11$, and therefore the size of the control set is $|\mathcal{X}| \approx 1.6 \cdot 10^6$. Note that, for a decision period of 10 seconds, we would need up to 185 days to explore every control policy in $\mathcal{X}$ just once, which highlights the need for a data-efficient learning strategy. Although Lemma 1 guarantees convergence and sublinear regret in general, faster convergence can be achieved with problem-specific information. Hence, and in line with previous works [65], [66], we select $\beta_t^{1/2} = 2.5$, which shows good performance in our setup. In the case of BP-vRAN, we configure $\delta = 20$ and set the parameters $a$ and $b$ in the penalty function, eq. (4), to severely penalize power consumption values close to $b$ or higher; namely, we set $a = 2.5$ and evaluate different values of $b$. Finally, we present the results of at least 10 experiments, where we plot the mean values and the 10th and 90th percentiles (shadowed areas). The source code of the algorithms BP-vRAN 10 and SBP-vRAN 11 used for this evaluation can be found online.
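The scale of the exploration problem can be checked with a quick computation (the dictionary keys are just labels for the control dimensions listed above):

```python
# Size of the control set X and the cost of naively exploring it,
# following the back-of-the-envelope reasoning in the text.
sizes = {"P_dl": 20, "M_dl": 28, "M_ul": 24, "A_dl": 11, "A_ul": 11}
n_controls = 1
for s in sizes.values():
    n_controls *= s                      # |X| = 20 * 28 * 24 * 11 * 11
days = n_controls * 10 / (3600 * 24)     # one 10-second trial per policy
# n_controls is about 1.6e6 and days is on the order of six months,
# hence the need for a data-efficient learning strategy.
```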

Convergence Evaluation
We start off by evaluating the convergence of BP-vRAN and SBP-vRAN. To this end, we consider the special case of a single context and observe their performance over time, with no prior training, until they converge to optimal policies. We select a context with high SNR = 35 dB (CQI = 15) in DL and UL, and high traffic demands (relative to our testbed's capacity) equal to 25 and 20 Mbps for DL and UL, respectively. Figs. 8-9 show the temporal evolution of different metrics for both algorithms during 150 orchestration periods.

10. https://github.com/jaayala/contextual_bayesian_optimization
11. https://github.com/jaayala/constrained_bayes_opt
Let us discuss first the results of BP-vRAN in Fig. 8. We observe that the power consumption and, consequently, the throughput are reduced for lower values of b: e.g., there is a 12.5% power drop and a 33.75% throughput drop between b = 25 and b = 16. This is intuitive, because lowering b induces more stringent power requirements. Note that b = 16 only penalizes DL throughput. This is because it imposes a mild power requirement, and hence BP-vRAN only sacrifices transmission power, which reduces the DL SNR and thus the DL throughput. Lower values of b force BP-vRAN to sacrifice UL throughput too.
Concerning SBP-vRAN, we evaluate different values of P_max up to P_max = 20, which is an upper bound on the power consumption irrespective of the policy and the context. The results, in Fig. 9, depict how SBP-vRAN learns to use configurations within the power budget with high probability, sacrificing throughput when so required. Note that, in all cases, SBP-vRAN selects policies very close to P_max. This is because the optimal policy, i.e., the one that maximizes throughput, usually requires consuming the entire P_max budget. To this end, SBP-vRAN gradually expands its safe set close to P_max, and therefore an explicit strategy to expand the safe set is not needed. Specifically, Fig. 10 shows that all the controls are safe for P_max = 20, with 15.4% and 53.2% fewer safe policies for P_max = 14 and P_max = 12, respectively. As expected, lower values of P_max yield a smaller set of safe policies.
We conclude this evaluation with the observation that, despite using a large set of policies X , both algorithms converge within 30 orchestration periods. This highlights the data-efficiency of our solutions, which discern optimal policies by observing only a small subset of X .

Performance in real network contexts
Next, we evaluate the performance of BP-vRAN and SBP-vRAN using a realistic one-day traffic pattern from [67] (Fig. 11, top). Concerning channel quality, we consider a worst-case pattern emulating UEs with high mobility (Fig. 11, bottom). Due to the granularity of our traffic dataset, we set the orchestration period length to 5 minutes in these experiments (without loss of generality). We run our algorithms for two days and present the results of the second day, to focus on the attained system performance. Their convergence, evaluated in the previous subsection, takes just a few periods. This is possible because the selected policies for correlated contexts are also correlated, i.e., knowledge acquired for one context is transferred to other, similar contexts. Hence, after a few iterations, the algorithms select efficient policies even for unseen contexts.
To remove the clutter introduced by the high SNR variability, each point in Figs. 12 and 13 corresponds to the average across all the points of an SNR cycle (see Fig. 11, bottom). Fig. 12 shows the total power consumption (a) and the evolution of the throughput along the day (b) using BP-vRAN and different configurations of the objective function. We observe that the power consumption evolves with the traffic demand and with the selected value of b. For instance, when b = 16, the achieved throughput is penalized in favor of lower power consumption during daylight, but no performance degradation is required during the night (between 2am and 7am). Similarly, Fig. 13 shows the performance of SBP-vRAN under the same scenarios. Specifically, SBP-vRAN manages to satisfy the power budget constraint with probabilities 0.99 and 0.93 when P_max equals 14 and 12, respectively, while maximizing throughput (the optimum was calculated through exhaustive search).

Comparison with other approaches
We complete our evaluation by comparing our solutions with a state-of-the-art deep reinforcement learning algorithm: the Deep Deterministic Policy Gradient (DDPG) [68]. This algorithm needs to be customized, since it is designed to solve the full-RL problem while in this work we face a contextual bandit problem. There are two main differences between these two problems. First, full-RL considers that selected actions (control policies) have an impact on future states (contexts). This assumption does not hold in our setting, since the configuration of the vBS does not affect future contexts (the traffic load and channel quality of the users). Second, in the full-RL problem the reward can be delayed over time, while in our setting the performance is available at the end of each decision period.
The DDPG is implemented using an actor-critic deep neural network (NN) architecture and, in order to adapt it to the contextual bandit problem, we configure the critic NN to approximate the reward function instead of the Q-value function (see [17] for more details). We consider the same NN architecture as in [17], but we use a sigmoid as the activation function of the output layer of the actor NN. Since the action space of the DDPG is continuous (the output of the actor is a continuous vector with the same dimensions as X), the selected actions are cast to the closest control policies that can be configured by the vBS. Moreover, we optimize the hyperparameters to minimize the convergence time. Our experiments show that the DDPG converges to the same solutions as the proposed Bayesian algorithms, but falls short in convergence speed and versatility. We illustrate these issues using both problems presented in Secs. 4.2 and 4.3 and one context, as in Sec. 6.2.
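The casting of the actor's continuous output to discrete vBS policies can be sketched as follows; the uniform rescaling from the sigmoid's [0, 1] range to each grid is an assumption of this sketch, not a detail taken from [17]:

```python
import numpy as np

def snap_to_grid(action, grids):
    """Map a continuous actor output (one value per control dimension,
    here assumed in [0, 1] from the sigmoid output layer) to the nearest
    configurable vBS policy on each discrete grid (sketch)."""
    snapped = []
    for a, grid in zip(action, grids):
        grid = np.asarray(grid, dtype=float)
        # Assumed: rescale [0, 1] linearly onto the grid's range.
        target = grid.min() + a * (grid.max() - grid.min())
        snapped.append(float(grid[np.argmin(np.abs(grid - target))]))
    return snapped
```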
For the first problem (Sec. 4.2), we configure the reward function of the DDPG to be the objective function in eq. (3). Fig. 14 shows the time evolution of the objective function for BP-vRAN and DDPG, for different values of b. Notably, DDPG converges to the same optimal policy learned by BP-vRAN, but has to invest an order of magnitude more time. The main reason for this difference is that our approach infers correlations in the objective function over the context-action space more efficiently, and hence finds optimal policies even for unseen context-action pairs. This highlights the data-efficiency of the GP-based solution. It is also worth recalling that, unlike our benchmark, BP-vRAN has mathematical performance guarantees (see Section 5.1).
In order to implement the constrained problem of Sec. 4.3, we consider a customized reward function for the DDPG. The reward is encoded using a step function that takes the value of eq. (2) when the observed power is below P_max, and the minimum reward value otherwise. Fig. 15 shows the evolution over time of the power consumption and the associated throughput performance of the vBS for SBP-vRAN and DDPG. We begin the experiment by setting the power constraint to 15W, and change it to 13W at decision period t = 2000.
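The customized step reward can be sketched as follows, with an assumed floor value `r_min` standing in for "the minimum reward value":

```python
def ddpg_reward(throughput_reward, power, p_max, r_min=0.0):
    """Step-shaped reward used to emulate the hard power constraint for
    DDPG (sketch): the throughput reward of eq. (2) when the measured
    power respects P_max, and an assumed floor value r_min otherwise."""
    return throughput_reward if power <= p_max else r_min
```

The discontinuity at `p_max` is exactly what forces DDPG to restart learning when the budget changes, since the reward landscape it has approximated shifts abruptly.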
Our results yield four observations: (i) SBP-vRAN attains considerable convergence improvements over its benchmark (roughly, an order of magnitude). (ii) SBP-vRAN is unaffected by a sudden change in the power constraint; note that it only requires changing P_max in the safe-set computation of Algorithm 2. Conversely, DDPG needs to change the configuration of the step function, which forces it to restart its learning process from scratch, violating the hard constraint until approximately decision period 3500. (iii) DDPG cannot perform safe exploration: it must use policies that violate the power constraint in order to learn them. Our approach, in contrast, computes the uncertainty of each estimation, which allows us to implement safe exploration and satisfy the constraint with high probability. (iv) Although the DDPG can potentially find better solutions due to its continuous action space, our results show that both approaches converge to the same solution, thanks to the fine-grained discretization of the action space of BP-vRAN and SBP-vRAN. Finally, it is important to remark that the inherent drawback of GP-based approaches is their O(N³) computational complexity (for the Cholesky decomposition) in each orchestration period, where N is the number of data points. We observed in our experiments, however, that the unprecedented convergence speed of these methods pays off in a very short time. Moreover, we found that these computations do not induce a delay since, according to O-RAN specifications, there is a wide-enough time window to update the policy.
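The O(N³) step referred to here is, in essence, one Cholesky factorization per orchestration period, whose triangular factor is then reused for cheap O(N²) solves; a minimal sketch:

```python
import numpy as np

def chol_posterior_weights(K, noise_var, y):
    """The O(N^3) per-period step: Cholesky-factor (K + noise_var * I)
    once, then reuse the triangular factor to obtain the GP weights
    alpha = (K + noise_var * I)^{-1} y via two triangular solves (sketch).
    The posterior mean at any query point is then k_T(z)^T alpha."""
    L = np.linalg.cholesky(K + noise_var * np.eye(len(K)))
    return np.linalg.solve(L.T, np.linalg.solve(L, y))
```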

CONCLUSIONS
The goal of this paper was threefold. First, to conduct an in-depth experimental study of the power consumption of virtualized base stations (vBSs); secondly, to propose two Bayesian learning algorithms that optimize the vBS performance subject to power constraints; and thirdly to evaluate these algorithms in realistic conditions using a fully-fledged wireless testbed, and compare them with state-of-the-art solutions that use deep neural networks.
Our findings revealed an intricate relationship between performance, power consumption, and key vBS control knobs, which renders traditional resource control policies impractical and motivates machine-learning solutions. Moreover, we saw that Bayesian learning algorithms can indeed enable efficient vBS operation; yet they require extensions and amendments in order to account for the network context and other practical, problem-specific issues. Finally, we found that these approaches are more data-efficient than state-of-the-art deep reinforcement learning solutions, but are also more computationally demanding. This latter property does not pose a problem for O-RAN systems, according to their operation requirements, but might become a limitation for other resource control problems running at finer time granularity; still, there are remedies that can reduce the computing load, e.g., re-initializing the GP approximation.
The considered problems are motivated by the latest industry developments in next-generation virtualized RANs, and are centered around power consumption, which is probably their most prevalent design constraint. Similarly, our solutions are in line with the requirements for automated, data-driven, platform-oblivious vRAN configuration. As such, we believe this work opens a new research direction, and to that end we have released the source code of BP-vRAN and SBP-vRAN, together with our testbed implementations, the collected measurements, and the dataset used in this work, to foster future research in this area.