Scaling UPF Instances in 5G/6G Core with Deep Reinforcement Learning

In the 5G core and the upcoming 6G core, the User Plane Function (UPF) is responsible for transporting data to and from subscribers in Protocol Data Unit (PDU) sessions. The UPF is generally implemented in software and packed into either a virtual machine or a container that can be launched as a UPF instance with a specific resource requirement in a cluster. To reduce the resources consumed by UPF instances, the number of initiated UPF instances should depend on the number of PDU sessions required by customers; this is often controlled by a scaling algorithm. In this paper, we investigate the application of Deep Reinforcement Learning (DRL) for scaling UPF instances that are packed in the containers of the Kubernetes container-orchestration framework. We propose an approach with the formulation of a threshold-based reward function and adapt the proximal policy optimization (PPO) algorithm. Also, we apply a support vector machine (SVM) classifier to cope with the problem that the agent occasionally suggests an unwanted action due to the stochastic policy. Extensive numerical results show that our approach outperforms Kubernetes's built-in Horizontal Pod Autoscaler (HPA). DRL could save 2.7–3.8% of the average number of Pods, while SVM could achieve 0.7–4.5% savings compared to HPA.


I. INTRODUCTION
The fifth-generation (5G) networks and networks beyond 5G (e.g., the sixth generation, 6G) will provide services for customers in various vertical industries (vehicular communication, IoT, remote surgery, enhanced Mobile Broadband, Ultra-Reliable and Low Latency Communication, Massive Machine Type Communication, etc.) [1]-[8]. The transformation of network elements and network functions from dedicated and specialized hardware to software-based containers has started [8]. The Third Generation Partnership Project (3GPP) has specified the framework of the 5G core with many network function components based on the Service Based Architecture (SBA). These 5G core architectural elements are implemented in software and executed inside either virtual machines (VMs) or containers in cloud environments [9]. In the future, 6G networks will likely maintain the user plane functions of the 5G core as well [8]. It is worth emphasizing that virtual machines and containers hosting network functions (termed instances) are executed within cloud orchestration frameworks that manage and assign the cloud resources for instances. The Cloud iNfrastructure Telco Taskforce (CNTT), a Network Functions Virtualization Infrastructure (NFVI) task force, has defined a reference model, a reference architecture [10], and reference implementations based on OpenStack [11] and Kubernetes [12].
In operator environments, the instances of network functions should be orchestrated (launched and terminated) in response to the fluctuation of traffic demands from customers. For example, customers initiate requests for PDU sessions before data communications; the traffic volume of such requests may depend on the periods of a day. During peak traffic periods, more instances should be launched and utilized than in regular periods. That is, operators should apply appropriate algorithms to control the resource usage of network functions.

In this paper, we investigate the application of Deep Reinforcement Learning (DRL) to the resource management (scaling) of User Plane Function (UPF) instances in the 5G/6G core. To the best of our knowledge, this is the first work on scaling UPF instances based on DRL. We assume that UPF instances are controlled by the Kubernetes container-orchestration framework, so we compare the DRL approach to Kubernetes's built-in Horizontal Pod Autoscaler (HPA) and find that DRL can perform better than the HPA. We find that the policy generated by the DRL method could occasionally make unwanted decisions. To remove the randomness from the policy, we apply a support vector machine (SVM) to classify actions based on a pre-trained DRL agent. As a result, we obtain a deterministic SVM-based policy with a slight performance degradation that could still perform just as well as or even better than the HPA. Our contributions are as follows:
• We formulate the problem of scaling UPF instances in a Kubernetes-driven cloud. We apply model-free DRL to this problem and compare its performance to the Kubernetes HPA.
Through simulations we show that DRL can outperform the HPA.
• After this, we show that in some cases, more particularly in the case of sudden increases in traffic, the DRL agent may pick an unwanted action. This is due to the stochastic nature of the learned policy. We remedy this by generating a dataset of states with actions as labels and training an SVM to classify states and actions. We show that the performance of the SVM-based agent is slightly worse than that of the DRL agent when the traffic change is slow. However, under sudden traffic changes, the deterministic policy of the SVM agent does not take unwanted actions.

In Section II we review the related literature on autoscaling in clouds, reinforcement learning (RL) methods for scaling, and resource management in the 5G core. In Section III we describe the problem of scaling UPF instances in Kubernetes. In Section IV we formulate a Markov decision problem for scaling UPF instances and present the Deep Reinforcement Learning approach. In Section V extensive numerical results are delivered. Finally, in Section VI we draw our conclusions.

II. RELATED WORKS
Regarding autoscaling in cloud environments, Lorido-Botran et al. [13] conducted a survey that classifies existing solution methods into five categories: threshold-based rules, control theory, reinforcement learning, queuing theory, and time series analysis. Zhang et al. [14] designed a threshold-based procedure to scale containers in a cloud platform and then measured the elasticity of their algorithm. In a hybrid approach, Gervásio et al. [15] combined an ensemble of prediction models with a dynamic threshold algorithm to scale virtual machines in an AWS cloud. Ullah et al. [16] used a genetic algorithm and artificial neural networks to predict CPU usage in a cloud and then applied a threshold-based rule.
Several works studied the performance of Kubernetes. Nguyen et al. [17] compared metrics server solutions and highlighted the effects of various configuration parameters under both resource metrics. Casalicchio [18] investigated the autoscaler in Kubernetes and showed that scaling based on absolute measures might be more effective than using relative measures.
Reinforcement learning has been used to tackle scaling and scheduling problems in clouds as well. Horovitz and Arian [19] used tabular Q-learning for autoscaling in cloud environments. They focused on the reduction of the state and the action space, as they did not use function approximation for the Q-values. Their experiments scaled web applications on Kubernetes. They also proposed the Q-threshold algorithm, where Q-learning was used to control the parameters of a threshold rule. We found that Q-threshold has difficulty finding the optimal policy with our reward formulation. This is mainly because this algorithm cannot control the Kubernetes Pods directly, which means actions do not have a direct effect on rewards either. Shaw et al. [20] compared the Q-learning and SARSA algorithms on virtual machine consolidation tasks. In these tasks the objective of the RL agent was to use live migration and place virtual machines on the appropriate nodes to minimize resource usage. Garí et al. [21] conducted a survey of previous RL solutions for scaling and scheduling problems in the cloud. Rossi et al. [22] compared Q-learning, Dyna-Q, and a full-backup model-based Q-learning to autoscale Docker Swarm containers horizontally and vertically. They measured the transition rate between different CPU utilization values to estimate the model; however, this method is difficult to scale as it needs to store the number of transitions between every pair of states. Cardellini et al. [23] created a hierarchical control for data stream processing (DSP) systems. On the top level they used a token-bucket policy; on the bottom level they considered the threshold policy, Q-learning, and full-backup model-based RL. Schuler et al. [24] used Q-learning to set concurrency limits for autoscaling in Knative, a serverless Kubernetes-based platform.
Some works specifically focus on the 5G core. Járó et al. [25] discussed the evolution and virtualization of network services. They considered the availability, the dimensioning, and the operation of a Telecommunication Application Server. Tang et al. [26] proposed a linear regression-based traffic forecasting method for virtual network functions (VNFs). They also designed algorithms to deploy service function chains of VNFs according to the predicted traffic. Alawe et al. [27] used deep learning techniques to predict the 5G network traffic and scale AMF instances. Subramanya et al. [28] used multilayer perceptrons to predict the necessary number of UPFs in the system. Kumar et al. [29] devised a method for scaling up and scaling out UPF instances and deployed a 5G core network on the AWS Cloud. Rotter and Do [9] presented a queueing analysis for the scaling of 5G UPF instances based on threshold algorithms. Their queueing model provides a quick evaluation of scaling algorithms based on two thresholds.
From the literature review, it can be observed that most scaling algorithms reported so far control the number of VM, VNF, or container instances with the use of some thresholds [14], [15], [18], [30]-[33]. However, to the best of our knowledge, there is no existing work on the application of artificial intelligence methods to the resource management of UPF instances.

III. THE OPERATION AND SCALING ISSUE OF 5G UPF INSTANCES

A. 5G UPF
The connection between the User Equipment (UE) and the Data Network (DN) in 5G requires the establishment of a PDU session. In this connection the UE first directly connects to a gNB in the Radio Access Network (RAN), and through the transport network reaches the 5G Core, which provides the end point to the DN (see Figure 1) [1]- [4], [34].
The transport network may be a wireless, wired, or optical connection [35], and the 5G core consists of a collection of various network functions implementing a Service Based Architecture (SBA). Such network functions are the Access and Mobility Management Function (AMF), which performs the authentication of UEs and controls the access of UEs to the infrastructure; the Session Management Function (SMF), which helps the establishment and closing of PDU sessions and keeps track of each PDU session's state; and the User Plane Function (UPF) [1], [3].

FIGURE 1: The role of the 5G UPF [9]. Each line represents a connection to a PDU session. A UPF instance may handle multiple PDU sessions, and the core network may contain multiple UPF instances.

Whereas the AMF and the SMF are part of the control plane, the UPF is responsible for the user plane functionality. The UPF serves as a PDU session anchor (PSA) and provides a connection point for the access network to the 5G core. Additionally, a UPF also handles the inspection, routing, and forwarding of packets, and it can also handle QoS, apply specific traffic rules, etc. [36]. The control and user plane separation (CUPS) guarantees that the individual components can scale independently and also allows the data processing to be placed closer to the edge of the network.

B. 5G UPF INSTANCES WITHIN THE K8S FRAMEWORK
Kubernetes is an open-source container orchestration platform that manages containerized applications on a cloud-based infrastructure [37]. A Pod is the smallest deployable computing unit for a specific application in Kubernetes and may contain one or more containers. Machines on the cloud, either physical or virtual, are referred to as nodes with a specific set of resources (CPU, memory, disk, etc.). To deploy applications, the Kubernetes controller can configure Pods with a given resource requirement on the nodes and run one or more containers inside these Pods.
If the 5G core elements are packed into containers that are organized into Pods, the resources of each Pod and the execution of Pods are managed by Kubernetes. To establish a UPF, it is a natural choice to map a UPF instance to a Pod. That is, a Pod runs a single container hosting a UPF software image. Figure 2 illustrates this mapping. A practical approach is to limit the number of PDU session types: for each PDU session type, a UPF instance type is created with an identical resource requirement.

C. THE PROBLEM OF SCALING UPF PODS
The purpose of scaling UPF Pods is to reduce the resource consumption of the system. A scaling function changes the number of UPF Pods (starts new Pods or terminates existing ones) depending on the number of PDU sessions required by UEs. On the one hand, if the number of UPF Pods is too low, the QoS degrades since there are not enough UPF Pods to handle new incoming PDU sessions. On the other hand, if the number of UPF instances is too high and the load is low, the large amount of reserved resources increases the operation cost. Therefore, a trade-off between the QoS and the operation cost is to be achieved.
For each type of PDU session, we assume that at least D_min Pods are initiated, at most D_max Pods can be started, each Pod can simultaneously handle a maximum of L_sess sessions, each Pod takes t_pend time to boot, and termination is instantaneous. Let d_on(t) denote the number of running Pods in the system at time t. Therefore, D_min ≤ d_on(t) ≤ D_max holds, and the limit on the number of sessions in the system is D_max · L_sess. Let l_sess(t) denote the number of sessions in the system at time t. Then we have 0 ≤ l_sess(t) ≤ D_max · L_sess. Additionally, let us define a free slot as an available capacity for a session and denote the number of free slots by l_free(t) at time t. Obviously, l_free(t) + l_sess(t) = D_max · L_sess.
A PDU session can only be created if there is free capacity in the cluster, that is, l_free(t) > 0. In this case new PDU sessions are assigned to the appropriate UPF Pods by a load balancer. If l_free(t) = 0 and there is no capacity left, the session and the UE's request are blocked. We denote the blocking rate, the probability of blocking a request, by p_b.
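The admission rule above can be sketched in a few lines of Python (a minimal sketch; the class name and the counters are illustrative, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class ClusterState:
    D_max: int           # maximum number of UPF Pods
    L_sess: int          # maximum sessions per Pod
    l_sess: int = 0      # current number of PDU sessions
    arrived: int = 0     # requests seen so far (illustrative counter)
    blocked: int = 0     # requests blocked so far (illustrative counter)

    @property
    def l_free(self) -> int:
        # free slots, per the definition l_free(t) + l_sess(t) = D_max * L_sess
        return self.D_max * self.L_sess - self.l_sess

    def new_session(self) -> bool:
        self.arrived += 1
        if self.l_free > 0:
            self.l_sess += 1
            return True      # admitted and assigned by the load balancer
        self.blocked += 1
        return False         # blocked: no capacity left

    @property
    def p_b(self) -> float:
        # measured blocking rate over the observed requests
        return self.blocked / self.arrived if self.arrived else 0.0
```

For example, with D_max = 2 and L_sess = 2 the fifth request is blocked and the measured blocking rate becomes 1/5.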
The list of basic notations is summarized in Table 1.

D. K8S HPA
The Kubernetes autoscaler is responsible for the scaling functionality. Figure 3 shows the interactions between the autoscaler and other components. A metrics server monitors the resource usage of Pods and provides the autoscaling entity with statistics through the Metrics API. The autoscaler computes the necessary number of Pod replicas and may decide on a scaling action. The adjustment of the replica count can be done through the control interface. The Horizontal Pod Autoscaler (HPA) is Kubernetes's default scaling algorithm. It uses the average CPU utilization, denoted by ρ̄, as an observation to compute the necessary number of Pods, denoted by d_desired(t) at time t. It has two configurable parameters: the target CPU utilization ρ_target and the tolerance ν. The equation used by the HPA is

d_desired(t) = ⌈ d_on(t) · ρ̄ / ρ_target ⌉,   (1)

where d_on(t) is the number of Pods at time t. The HPA then checks whether

| ρ̄ / ρ_target − 1 | ≤ ν.

If this condition is not satisfied, the HPA issues a scaling action to bring the replica count closer to the desired value. The above described procedure is executed periodically with a ∆T interval. This time interval can be set through Kubernetes configurations.
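The HPA rule can be sketched as follows (a minimal sketch assuming the standard Kubernetes formula desiredReplicas = ⌈currentReplicas · currentMetric / targetMetric⌉ with the tolerance band; the function name is ours):

```python
import math

def hpa_desired(d_on: int, rho_avg: float, rho_target: float, nu: float) -> int:
    """Desired replica count for the Horizontal Pod Autoscaler."""
    ratio = rho_avg / rho_target
    if abs(ratio - 1.0) <= nu:
        return d_on                 # within tolerance: no scaling action
    return math.ceil(d_on * ratio)  # scale towards the desired count
```

For instance, with 4 Pods, ρ̄ = 0.8, ρ_target = 0.4, and ν = 0.1, the HPA would scale out to 8 Pods, while ρ̄ = 0.42 falls inside the tolerance band and triggers no action.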

IV. SCALING UPF PODS WITH DRL
The application of the built-in Kubernetes HPA needs appropriate values of ρ_target and ν. A system operator may go through an arduous process of trial and error to find the configuration that could minimize the Pod count while maintaining QoS levels. Instead, in this paper, we propose the application of Deep Reinforcement Learning (DRL) to set the Pod count dynamically depending on the traffic, without the assistance of an operator. The DRL agent observes the system and determines the correct action output through the continuous improvement of its policy. In what follows, we present our approach regarding the design of the DRL agent.

A. FORMULATION OF THE MARKOV DECISION PROBLEM
Before applying a reinforcement learning algorithm we need to formulate the problem as a Markov decision problem (MDP). This means we need to define the state space S, the action space A, and the reward function r : S × A × S → R. A complete definition of the MDP would also require the state transition probability p : S × A × S → [0, 1] and the discounting factor γ ∈ [0, 1]. Here p is the probability that the system enters a next state when an action is taken in the current state, and such a transition results in a real-valued reward r. To avoid the specification of the p transition function as in a model-based formulation (like in [22]), we decided to use a model-free RL method. Also, γ is implicitly contained in other hyperparameters, as we will see later.
In the MDP framework an agent interacts with the environment described by the MDP. At the decision time t it observes the state s(t) ∈ S and, following its policy π : S → A, it takes an action a(t) ∈ A. As a result the agent receives a reward r(t), and at the next decision time it can observe the next state.
Let us denote the i-th decision time by T_i (i = 0, 1, . . .). In our case the time between two decisions is ∆T, that is, T_{i+1} − T_i = ∆T. Furthermore, we will also denote the state, the action, and the reward at time T_i with a lower index i (e.g., s(T_i) = s_i).
The state s_i at time T_i should contain all the information necessary for an optimal scaling decision. In our case

s_i = ( d_on(T_i), d_boot(T_i), l_sess(T_i), l_free(T_i), λ̄_i ),

where λ̄_i is the arrival rate measured since the previous decision at time T_{i−1}.
The action space consists of three actions: start a new Pod; terminate an existing Pod; no action. The agent may only start new Pods if there is capacity for them in the cluster, that is, d_on(t) + d_boot(t) < D_max, where d_boot(t) is the number of Pods still booting at time t. These booting Pods exist because when we start a new Pod, it enters a pending phase while it starts up its necessary containers. We assume this phase lasts t_pend time. Also, the agent may only terminate Pods if d_on(t) > D_min. We assume this termination is graceful, which means that the Pod waits for all of its PDU sessions to close before shutting down. Obviously, in this case the Pod is scheduled for termination and does not accept new PDU sessions.
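The validity constraints on the actions can be sketched as follows (illustrative Python; the action names and the function are ours, not from the paper):

```python
# Action identifiers (illustrative)
NO_OP, SCALE_OUT, SCALE_IN = 0, 1, 2

def valid_actions(d_on: int, d_boot: int, d_min: int, d_max: int) -> set:
    """Return the set of actions the agent is allowed to take."""
    acts = {NO_OP}
    if d_on + d_boot < d_max:
        acts.add(SCALE_OUT)   # capacity left for a new (pending) Pod
    if d_on > d_min:
        acts.add(SCALE_IN)    # graceful termination of one running Pod
    return acts
```

Note that booting Pods count against D_max, so a cluster with 5 running and 5 booting Pods (D_max = 10) may not scale out further.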
The reward function is shown in (2):

r_i = −κ · p̄_b,i,   if p̄_b,i > p_b,th,
r_i = −d_on(T_i),   otherwise.   (2)
Here p̄_b,i is the measured blocking rate since the previous decision, in the time interval [T_{i−1}, T_i), and p_b,th is the blocking rate threshold set by the QoS level that we should not exceed in the long term. The coefficient κ is a scalar that scales the blocking rate to put it numerically in range with the d_on value. The intuition behind this reward function is that if the measured blocking rate exceeds the threshold, we need to minimize the blocking rate; and if it does not exceed the threshold, we want to minimize the number of Pods.
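Under this reconstruction of (2), the reward computation is a short conditional (a sketch; names are illustrative):

```python
def reward(p_b: float, p_b_th: float, d_on: int, kappa: float) -> float:
    """Threshold-based reward: penalize blocking above the QoS threshold,
    otherwise penalize the Pod count."""
    if p_b > p_b_th:
        return -kappa * p_b   # QoS violated: drive the blocking rate down
    return -float(d_on)       # QoS met: drive the Pod count down
```

With κ = 13 a blocking rate of 0.2 yields a penalty of 2.6, comparable in magnitude to a handful of Pods, which is exactly the role of the multiplier described above.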
For the list of notations used to describe the MDP see Table 1.

B. REINFORCEMENT LEARNING
Reinforcement learning (RL) is a method in which an agent interacts with an environment. The agent observes the system states and the rewards resulting from subsequent actions. To apply an RL-based agent in the control loop illustrated in Figure 3, we propose an approach where a specific state contains the number of active and booting UPF Pods, the number of PDU sessions in the system, and an approximation of the arrival rate. This information about the states can be obtained either from monitoring, or directly from the SMF and AMF functions of the 5G core.
The RL agent uses the observations gathered between two scaling actions to update and improve its policy. This means that learning happens online during the operation of the cluster. Also, the neural network in the RL agent can be pre-trained with the use of captured data and simulation as well.

The goal of RL is to find the policy π that maximizes the value function V^π(s), the long-term expected cumulated reward

V^π(s) = E_π[ Σ_{i=0}^{∞} γ^i r_i | s_0 = s ],   (3)

starting from the state s. Note that the optimal policy does not depend on the starting state.
In this paper we used proximal policy optimization (PPO) [38] as the RL algorithm with slight modifications, similar to our previous work [39]. The method is presented in Algorithms 1 and 2.
The PPO is an actor-critic algorithm [40]. It uses a parameterized policy π(s, θ) as an actor to select actions, where θ is the parameter vector. The algorithm also approximates the value function with V(s, ω), parameterized with the ω vector. This value function is used to calculate the advantage Â_GAE using the generalized advantage estimator (GAE) [41]. If we consider a batch of advantages in a vector Â_GAE of size N_batch + 1, the j-th element of the vector can be computed by

Â_GAE,j = Σ_{k=j}^{N_batch} (λ_GAE)^{k−j} δ_k.   (4)

Here λ_GAE is a weight hyperparameter which implicitly contains the discounting factor γ, and δ_j is the j-th element of the temporal difference error batch δ, calculated by

δ_j = TD_target,j − V(s_j, ω),   (5)

where the j-th temporal difference target is

TD_target,j = r_j − r̄ + V(s′_j, ω).   (6)
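The estimator can be evaluated with the usual backward recursion Â_j = δ_j + λ_GAE · Â_{j+1} (an illustrative, list-based Python sketch; function and argument names are ours):

```python
def gae(rewards, values, next_values, r_bar, lam):
    """Generalized advantage estimation with the average-reward TD error
    delta_j = r_j - r_bar + V(s'_j) - V(s_j); gamma is folded into lam."""
    n = len(rewards)
    deltas = [rewards[j] - r_bar + next_values[j] - values[j] for j in range(n)]
    adv = [0.0] * n
    acc = 0.0
    for j in reversed(range(n)):     # backward pass over the batch
        acc = deltas[j] + lam * acc  # A_j = delta_j + lam * A_{j+1}
        adv[j] = acc
    return adv
```

The backward pass is equivalent to the lambda-weighted sum of future TD errors in (4) but needs only a single sweep over the batch.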
Note that we use bold to signify vector values and the lower index j to signify vector elements; that is, TD_target,j, s_j, r_j, and s′_j are the j-th elements of the batches TD_target, s, r, and s′. In contrast to the original PPO algorithm, here in (6) we used the average reward scheme, where r̄ is the average reward that we keep track of through the soft update

r̄ ← (1 − α_R) r̄ + α_R · (1/N_batch) Σ_j r_j,   (7)

where α_R is the update rate hyperparameter. The advantage and the TD target are used to evaluate the policy. During its operation the algorithm tries to improve the policy by updating θ and ω repeatedly using

ω ← ω − α_ω ∇_ω ‖ TD_target − V(s, ω) ‖²,
θ ← θ + α_θ ∇_θ 1ᵀ min( r_t(θ) ⊙ Â_GAE , clip(r_t(θ), 1 − ε, 1 + ε) ⊙ Â_GAE ).   (8)

In (8), α_ω and α_θ are the learning rates of the gradient descent steps and ε is the clipping ratio of the PPO. The vector r_t(θ) is the probability ratio and its j-th element can be computed by

r_t(θ)_j = π(a_j | s_j, θ) / π_old,j,   (9)

where π(a_j|s_j, θ) and π_old,j are the probabilities of action a_j in state s_j. Note that the difference between the two probabilities is that the former depends on θ, which can change throughout epochs during an update (as seen in Algorithm 1), whereas π_old, which is stored in the batch, represents the probability of action a_j when it was executed by the agent. This means that at the start of the update π(a_j|s_j, θ) = π_old,j, but after the first epoch θ is changed by (8) and the equality does not hold anymore. The operation ⊙ is the elementwise product. We also added the entropy regularization term π(a|s, θ) log π(a|s, θ) with a weight of ξ.
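The clipped part of the objective in (8) can be sketched for a single batch element as (illustrative Python; names are ours):

```python
def ppo_term(pi_new: float, pi_old: float, advantage: float, eps: float) -> float:
    """Clipped PPO surrogate for one sample: min of the unclipped and
    clipped ratio-weighted advantages."""
    ratio = pi_new / pi_old                            # r_t(theta)_j
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))    # clip to [1-eps, 1+eps]
    return min(ratio * advantage, clipped * advantage)
```

The clipping caps the gain from increasing an already-favored action's probability (positive advantage), while the min keeps the pessimistic, unclipped value when the advantage is negative.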

11: Compute r_t(θ) using (9).
12: Update ω and θ using (8).
13: end for
14: Clear batch storage.
15: end procedure

In Algorithm 1 we can see the Store and Update procedures of the algorithm used. The purpose of the Store procedure is to save the (s_i, a_i, π(·|s_i, θ), r_i, s_{i+1}) trajectory samples into a batch for batch updates. Here π(·|s_i, θ) denotes the vector of probabilities of all actions in state s_i.
The Update procedure shows the method used for improving the policy. It is only executed if the number of samples has reached N_batch. It approximates the mean reward r̄ and then makes gradient descent steps k times. In each step it updates the policy π by updating θ and ω.
The procedure used to train the RL agent can be seen in Algorithm 2. First, if the system is not initialized yet, we need to start it up. For the RL agent we need to set the hyperparameters, initialize the θ and ω vectors with random values, and set the value of r̄ to 0.
After the initialization we run N_train steps, and in each step we execute a scaling action a_i received from the agent based on the observed state s_i. We observe a new state, store the observations using the Store procedure, then improve the agent's policy with the Update procedure.

Algorithm 2 RL training loop
1: Initialize system, and get initial state s_0.
2: Initialize learning parameters of AGENT.
3: i ← 0
4: for N_train steps do
5: Get action from agent: a_i ← Sample π(s_i, θ).
6: Execute action a_i to scale the cluster.
7: Observe the new state s_{i+1} and performance measures after ∆T time.
8: Compute reward r_i from the measurements using (2).
9: AGENT.STORE(s_i, a_i, π(·|s_i, θ), r_i, s_{i+1})
10: AGENT.UPDATE()
11: i ← i + 1
12: end for

The 5-tuple that represents the state s_i of the system creates a 5-dimensional state space. Even though in practice the values in the state are directly or indirectly bounded by the maximum number of Pods D_max, and the arrival rate is also bounded by the maximum arrival rate λ̄_max, the state space can grow so large that it would be impossible to fit the policy or the value function in a computer's memory. Therefore we used neural networks with one hidden layer of 50 hidden nodes to approximate the policy π and the value function V. This means that θ and ω represent the parameter sets of these neural networks. Figure 4 shows a neural network that accepts the state as the input and outputs the probabilities (π_NoOp, π_ScaleOut, π_ScaleIn) of the possible actions. In the hidden layer, the rectified linear unit (ReLU) function is applied. For the policy π, we used the softmax function in the output layer. The parameters θ and ω were initialized with the Xavier initialization. For the update steps, we used the stochastic gradient descent method.
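The actor network described above could look as follows in PyTorch (a sketch under the stated architecture: 5 inputs, one hidden layer of 50 ReLU units, and a softmax over the three actions; the critic would be analogous with a single linear output, and the class name is ours):

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Actor: maps a 5-dimensional state to action probabilities."""
    def __init__(self, state_dim: int = 5, hidden: int = 50, n_actions: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),   # hidden layer of 50 nodes
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
            nn.Softmax(dim=-1),             # (pi_NoOp, pi_ScaleOut, pi_ScaleIn)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)
```

The softmax output layer guarantees a valid probability distribution over the three actions for any input state.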
For numerical stability we normalized most of the input values into the range [0, 1]. This means that we divided the d_on and d_boot values by D_max and divided the l_sess and l_free values by L_sess. As for λ̄, we do not have a maximum value for the arrival rate. Luckily, for this normalization process we do not need to know this exact number; we only need the order of magnitude of the normalized λ̄ to be close to that of the other input values. We chose to divide λ̄ by 500, assuming the maximum arrival rate is close to this value.
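The normalization step can be sketched as (illustrative Python; the divisor 500 for λ̄ is the assumption stated above, and the function name is ours):

```python
LAMBDA_SCALE = 500.0  # assumed order of magnitude of the maximum arrival rate

def normalize_state(d_on, d_boot, l_sess, l_free, lam, D_max, L_sess):
    """Scale the raw 5-tuple state before feeding it to the networks."""
    return (d_on / D_max,          # running Pods, in [0, 1]
            d_boot / D_max,        # booting Pods, in [0, 1]
            l_sess / L_sess,       # sessions, scaled by per-Pod capacity
            l_free / L_sess,       # free slots, scaled by per-Pod capacity
            lam / LAMBDA_SCALE)    # arrival rate, approximately in [0, 1]
```

For example, with D_max = 10 and L_sess = 100, the state (5, 1, 20, 80, 250) becomes (0.5, 0.1, 0.2, 0.8, 0.5).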
To find the best DRL agent we conducted a hyperparameter search during training. We identified the reward multiplier κ and the entropy regularization factor ξ as the hyperparameters the DRL agent was most sensitive to. We used grid search over κ ∈ {3, 5, 10, 13, 15, 20} and ξ ∈ {0.01, 0.05} to find the adequate hyperparameter values. We found the other hyperparameters to have less influence on the overall performance of the DRL agent; in these cases we used values that are common in the literature, such as [38]. Table 2 shows the hyperparameter values we used for the DRL agent. For the entropy parameter we chose ξ = 0.01 and for the reward multiplier we chose κ = 13. Note that since we use GAE to estimate advantages, the discount factor γ is implicitly incorporated into the λ_GAE hyperparameter. We used PyTorch 1.5.1 [42] to implement the DRL model and used an NVIDIA GeForce RTX 2070 (8GB) GPU for training.

D. DRL WITH CLASSIFICATION
It is possible to use RL with non-stochastic policies that enforce actions with a probability equal to one for a specific observation. For example, the application of deep Q-learning, also known as deep Q-networks (DQN) [40], may result in a greedy policy, as demonstrated in Section V. Moreover, we find that DQN leads to the over-provisioning of resources in our numerical study.
The PPO method learns and finds a stochastic policy where the action space has a probability distribution for a given state. The DRL agent takes actions based on the learned distribution. In general, it is expected that the agent recommends the launch of new Pods when d_on is low and λ̄ is high, and suggests the termination of Pods when d_on is high and λ̄ is low. However, the agent may advise an unexpected action with a low but non-zero probability due to the nature of a stochastic policy. For example, when a sudden increase in traffic is detected, as illustrated in Figure 5, the agent begins starting up Pods to lower the blocking rate. We would expect the agent to start new Pods until the blocking rate is below the threshold; however, from time to time the agent terminates a Pod incorrectly. If the algorithm could decrease the probability of the bad actions to 0 and increase the probability of the good action to 1 in every state, we would get a deterministic policy. However, this cannot happen due to the entropy regularization, which prevents the PPO algorithm from reducing the probability of an action to zero. This is a necessary measure to guarantee that all actions remain possible in all states, so that the agent has a possibility to explore the whole state space during training. The noisy behavior of the DRL agent can also be seen in Figure 6, which plots the action versus λ̄ and d_on.
It is worth emphasizing that there may be outlier points in the dataset, e.g., where the arrival rate is very low and the Pod count is very high. If these points are labeled correctly, they do not influence the separating line. However, in case of mislabeling, these points can shift the decision boundary in an unwanted direction. Therefore we need to clean the dataset by removing these outlier points. We did this by considering every point an outlier for which condition (11) holds, where λ̄_max is the maximum of the measured arrival rate during the experiment. With this we removed every point that is not on the main diagonal strip of the scatter plot. We apply the DRL agent to generate labels by taking a set of states and mapping an action to each state. The resulting dataset of size N_data was then used to create a linear support vector machine (SVM) classifier that maps states to actions.
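Since inequality (11) is not reproduced here, the cleaning step below uses an illustrative diagonal band as a stand-in for the actual condition (a sketch, not the paper's exact rule; the band width delta is an assumption):

```python
def remove_outliers(points, D_max, lam_max, delta=0.5):
    """Keep only points near the main diagonal of the (lam, d_on) plane.
    points: list of (lam, d_on, action) tuples labeled by the DRL agent.
    NOTE: the band |d_on/D_max - lam/lam_max| <= delta is illustrative,
    standing in for condition (11) of the paper."""
    return [(lam, d_on, a) for (lam, d_on, a) in points
            if abs(d_on / D_max - lam / lam_max) <= delta]
```

A point with a very low arrival rate and a very high Pod count falls far off the diagonal and is dropped before the SVM is trained.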
The linear SVM is a machine learning model that can find the separating hyperplane between two classes in a dataset [43]. In our case, the set of actions A contains three types of actions, which would require a multiclass classifier. We can circumvent this with the one-versus-rest strategy and build separate models for each action type. In this case we label the corresponding action a with +1 if it belongs to the given action type; otherwise we label it with −1. We denote the modified action label with ã.
Using the set of states S as the feature set, for a state s_i ∈ S we are looking for the separating hyperplane f(s_i) = wᵀ s_i + w_0 = 0, where w and w_0 are the parameters of the SVM. This gives us the classification rule

ã_i = sign( f(s_i) ) = sign( wᵀ s_i + w_0 ).   (10)

Note that multiple hyperplanes may be found, so we pick the one with the largest margin M, which is the distance between the hyperplane and the data point closest to the hyperplane.

FIGURE 5: The system is initialized with 50 Pods and the DRL agent immediately starts removing unused ones. At step 480 the traffic suddenly increases and the agent reacts by increasing the Pod count. Due to the stochastic nature of the policy, a Pod may be terminated even when the blocking rate is above the threshold level.

Two cases are distinguished.
• If the data is separable, that is, a hyperplane exists that can separate the actions labeled with +1 from the actions labeled with −1, the optimization problem is

min_{w, w_0} ½ ‖w‖²  subject to  ã_i (wᵀ s_i + w_0) ≥ 1 for all i.

• If the dataset contains overlaps and is not separable, we need to find the separating hyperplane that allows the least amount of points in the training set to be classified incorrectly. This can be achieved by introducing the slack variables ζ_i and modifying the optimization problem into

min_{w, w_0, ζ} ½ ‖w‖² + C Σ_i ζ_i  subject to  ã_i (wᵀ s_i + w_0) ≥ 1 − ζ_i, ζ_i ≥ 0 for all i,

where C, called the cost parameter, is a tunable hyperparameter of the SVM. The smaller C is, the more points are allowed to be misclassified, resulting in a larger margin.

Algorithm 3 presents the training procedure of the SVM. The algorithm requires the parameters and the hyperparameters of the simulation environment and the DRL method. It returns the SVM model parameters w and w_0 and also the accuracy of the model on the test set, which is a performance measure of the SVM.
After the initialization of the agent and the environment, the algorithm trains the agent for N_train steps. This training loop is almost identical to the one in Algorithm 2; the difference is that during the last N_data steps the agent stores the visited states in a list L_states, which is later used to build the dataset.
When the training of the DRL agent is finished, its policy is used to evaluate the states in L_states. The resulting actions are saved in the list L_acts. These two lists together form the dataset we use to train the SVM. The dataset is cleaned by removing the outlier data points and then split into training (L_train_states, L_train_acts) and test (L_test_states, L_test_acts) sets. The training set is used to train the SVM model, whereas the test set is used to determine the accuracy of the trained model. At the end of the procedure we obtain the SVM model parameters and the accuracy on the test set.

Algorithm 3 Training an SVM classifier
Input: Environment and DRL parameters (see Table 1)
 1: Initialize the environment and the DRL agent.
 2: L_states ← [], L_acts ← []
 3: i ← 0
 4: for N_train steps do
 5:     a_i ← Get action from agent in state s_i.
 6:     r_i, s_{i+1} ← Execute action a_i and get reward and the next state.
 7:     Store history and update agent using the AGENT.STORE and AGENT.UPDATE procedures in Algorithm 1.
 8:     if i > N_train − N_data then
 9:         Append state s_i to L_states.
10:     end if
11:     i ← i + 1
12: end for
13: i ← 0
14: for N_data steps do
15:     Evaluate DRL agent on state L_states[i] to get action.
16:     Append action to L_acts.
17:     i ← i + 1
18: end for
19: Remove outlier points according to (11).
20: Separate lists into train and test sets: L_states → L_train_states, L_test_states; L_acts → L_train_acts, L_test_acts.
21: w, w_0 ← Train SVM using L_train_states as features and L_train_acts as labels, running a grid search on hyperparameter C.
22: Get accuracy of the model using L_test_states and L_test_acts.
23: return w, w_0, accuracy

We run this algorithm multiple times to perform a grid search on the C hyperparameter of the SVM, which means each run uses a different C value. Finally, we pick the model with the highest accuracy on the test set.
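The split-train-evaluate steps of Algorithm 3 can be sketched with scikit-learn as follows (function and variable names are illustrative, not from the paper's code; the demo data stands in for the collected state/action lists):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def train_svm_grid(states, actions, C_grid=(0.1, 1, 10, 40, 100)):
    """Train one linear SVM per C value; keep the most accurate on the test set."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        states, actions, test_size=0.2, random_state=0)  # 80/20 split
    best_acc, best = -1.0, None
    for C in C_grid:
        clf = LinearSVC(C=C, max_iter=10000).fit(X_tr, y_tr)
        acc = clf.score(X_te, y_te)      # accuracy on the held-out 20%
        if acc > best_acc:
            best_acc, best = acc, clf
    return best.coef_[0], best.intercept_[0], best_acc  # w, w_0, accuracy

# Demo on separable synthetic data standing in for (state, action) pairs:
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (100, 2)), rng.normal(2, 0.5, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)
w, w0, acc = train_svm_grid(X, y)
```

The outlier-removal step (11) would be applied to `states`/`actions` before calling this function.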
In order to assess the SVM classifier, we also experimented with another classification method, logistic regression, which describes the log-odds of each class with a linear function; for more on this classifier, see [43]. In this case, Algorithm 3 can be modified by replacing the SVM model with a logistic regression model.
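With scikit-learn, this swap amounts to replacing the estimator, keeping the default hyperparameters as the paper does (the synthetic data below is for illustration only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Two synthetic clusters labeled -1 / +1.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# Logistic regression models the log-odds with a linear function of the state.
clf = LogisticRegression().fit(X, y)
```

The rest of Algorithm 3 (data collection, outlier removal, train/test split) is unchanged.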
We used the scikit-learn 0.24.2 [44] library to implement the SVM and the logistic regression models. For the logistic regression, we used the default hyperparameters. For the SVM hyperparameter values see Section V-B. For the list of notations used by the algorithms see Table 3.

E. SYSTEM MODELING
We built a simulator program that emulates a multi-node cloud environment and implemented the DRL agent in Python with the help of PyTorch. The simulator contains a procedure that generates UE arrivals as a Poisson process with arrival rate λ(t) at time t. Upon arrival, a PDU session is initiated if there is available capacity among the Pods; otherwise the UE's request is blocked. The UPF handling the PDU session and its traffic is chosen at random. We assume the length of a session is random and exponentially distributed with rate µ.
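One step of such a session model might look like the sketch below (the step granularity dt and the binomial treatment of exponential departures within a step are our assumptions, not the paper's implementation):

```python
import numpy as np

def simulate_step(rng, active, capacity, lam, mu, dt=1.0):
    """Poisson arrivals at rate lam, exponential session lengths with rate mu;
    requests beyond the Pods' total capacity are blocked."""
    # Each active session, being memoryless, ends within dt w.p. 1 - exp(-mu*dt).
    active -= rng.binomial(active, 1.0 - np.exp(-mu * dt))
    arrivals = rng.poisson(lam * dt)
    accepted = min(arrivals, capacity - active)
    blocked = arrivals - accepted      # counted toward the blocking rate
    return active + accepted, blocked

rng = np.random.default_rng(42)
active, total_blocked = 0, 0
for _ in range(100):
    active, blocked = simulate_step(rng, active, capacity=400, lam=50.0, mu=0.2)
    total_blocked += blocked
```

The blocking rate observed by the agent would be the ratio of blocked to arriving requests over a measurement window.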
Note that in practice we do not know the exact arrival rate function in advance. To show how the DRL algorithm can cope with this, we divided the DRL experiments into two phases, a training phase and an evaluation phase, each using a different arrival rate function, λ_train and λ_eval, respectively. The training phase can be thought of as a pre-training stage where we initialize the DRL agent and train it with a predefined arrival rate function, whereas in the evaluation phase we apply the pre-trained agent to an environment with a new traffic model. Learning thus also happens in the evaluation phase, but the agent does not need to go through a cold start.
We trained the DRL agent for N_train simulation steps on a sinusoidally varying arrival rate

λ_train(t) = 250 + 250 sin(·), (15)

with which the agent can explore a wide range of traffic intensities. For evaluation we used an equation from [45] that was determined for mobile user traffic and scaled it to our use case:

λ_eval(t) = 330.07620 + 171.10476 sin(π/12 t + 3.08) + 100.19048 sin(π/6 t + 2.08) + 31.77143 sin(π/4 t + 1.14), (16)

and ran N_eval simulation steps with it. This scaling makes the peak traffic 500 PDU requests/s. λ_train and λ_eval in (15) and (16) are visualized in the corresponding figure. We set the blocking threshold p_b,th = 0.01 and ran the DRL algorithm under various t_pend values. For each t_pend value we ran 8 simulations and took the averages of d_on and p̄_b over the evaluation phase.
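For reference, (16) can be coded directly. Treating t as hours is our reading of the formula, since the π/12 component then has a 24-hour period:

```python
import math

def lam_eval(t):
    """Evaluation arrival rate from (16), scaled so peak traffic is ~500 PDU requests/s."""
    return (330.07620
            + 171.10476 * math.sin(math.pi / 12 * t + 3.08)
            + 100.19048 * math.sin(math.pi / 6 * t + 2.08)
            + 31.77143 * math.sin(math.pi / 4 * t + 1.14))

# Sample one 24-hour period at 0.1 h resolution to locate the peak.
peak = max(lam_eval(t / 10) for t in range(240))
```

Summing the three sine components at their phases, the sampled peak lands close to the stated 500 PDU requests/s.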

A. SCENARIOS
For the numerical evaluations, we assumed that
• UPF instances run on physical servers [46] with an Intel Xeon 6238R 28-core 2.2 GHz processor and 4x64 GB RAM;
• each UPF session conveys video streaming data;
• eight cores on each server are allocated for the OS and the container management system;
• each UPF instance occupies one core and 2 GB RAM and serves a maximum of 8 simultaneous video streams;
• booting time is not negligible and is fixed and identical for each UPF Pod.
Parameter values for the cluster used during the simulations can be found in Table 4.

Besides the PPO algorithm, we also experimented with deep Q-networks (DQNs), which use a deterministic, greedy policy in the evaluation phase. We found that DQN had difficulties learning the optimal policy. Figure 8 compares the evaluation of a trained DQN and a trained PPO agent. While PPO could adapt well to the varying arrival rate in (16) and found a balance between the Pod count and the blocking rate, DQN could not keep the Pod count as low and overprovisioned the Pods. We also note that DQN was much more unstable: in many cases it failed to even learn a policy that would adapt to traffic changes, whereas the PPO algorithm found a good policy in every run. For this reason we ruled out DQN and focused on PPO instead.
Results for the PPO are included in Table 5. The values displayed are averaged over 8 runs. Each run took approximately 120-150 minutes. We can see that the DRL agent could keep the blocking rate p̄_b below the threshold 0.01 in each case. For comparison, we ran Kubernetes's built-in HPA, which scales the cluster based on the number of PDU sessions each Pod handles. We set the tolerance ν = 0.025 and searched for the parameter ρ_target that could still maintain the blocking rate below the threshold. Each run consisted of N_eval evaluation steps. The range of the search was (6.0, 6.25, …, 9.75) and the result was ρ_target = 0.875. We included the results in Table 5.
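For context, the core of Kubernetes's HPA scaling rule is the ratio test sketched below (simplified; the real controller adds stabilization windows and per-Pod metric averaging, and the per-Pod session target of 7 in the example is an invented number):

```python
import math

def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         tolerance=0.025):
    """desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric),
    skipped when the ratio stays within the tolerance band around 1.0."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas          # within tolerance: no scaling
    return math.ceil(current_replicas * ratio)

# E.g. 10 Pods, a per-Pod session target of 7, and 8 sessions per Pod observed:
hpa_desired_replicas(10, 8.0, 7.0)       # scales out
```

The tolerance argument plays the role of ν = 0.025 in our HPA configuration.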
Comparing the results in Table 5, we can see that at lower t_pend values the DRL agent could maintain fewer UPF Pods and d_on was reduced by 3.8%. At higher t_pend values this improvement decreased, but the DRL agent still used the UPF Pods more efficiently. The reason for the decrease is that when t_pend is high, it takes much longer for a Pod to start, which means that in order to keep p̄_b below the threshold, the DRL agent cannot terminate as many idle Pods.

B. IMPROVING PERFORMANCE WITH ACTION CLASSIFICATION
In a given state there is a small probability that the DRL agent makes a bad decision. For example, Figures 5 and 6 show a scenario where the system is initialized with 50 Pods and the DRL agent immediately starts removing unused ones. At step 480 the traffic suddenly increases and the agent reacts by increasing the Pod count. Due to the stochastic nature of the policy, a Pod may be terminated even when the blocking rate is above the threshold level. Note that such actions can be beneficial during training, when the agent should explore the state space. Therefore, we provide an approach to minimize such unwanted actions. We sampled the states and used the actions as labels to create a dataset. Using this dataset we trained an SVM classifier with a linear kernel. We also considered other kernel types, such as polynomial and radial basis function kernels, but found their training times significantly longer and their accuracy lower than the linear kernel's. Algorithm 3 presents the procedure we used for training. We set the size of the dataset N_data = 450000 and used 80% of the data for training and 20% for testing. For the hyperparameter search we considered C ∈ {0.1, 1, 10, 40, 100}.
Besides the full state description, we also considered a reduced state description s(t) = (d_on(t), λ(t)) to alleviate the curse of dimensionality during training of the SVM. With the full state space, training took approximately 12 minutes, whereas with the reduced state space it took about 9 minutes. Figure 9 shows the decision boundary learned by the SVM classifier, and Figure 10 plots the behavior of the agent using the SVM classifier under the sudden increase of traffic. We can see that the agent behaves more consistently than PPO and starts new Pods while the number of Pods is not yet high enough to meet the blocking rate criterion.
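Inference with the three one-versus-rest models reduces to picking the action whose hyperplane gives the largest signed value f(s) = w^T s + w_0. A sketch (the action names and toy weights below are illustrative, not trained parameters):

```python
import numpy as np

def ovr_predict(models, s):
    """models maps action -> (w, w0); return the action with max w.s + w0."""
    scores = {a: float(np.dot(w, s) + w0) for a, (w, w0) in models.items()}
    return max(scores, key=scores.get)

# Toy models over the reduced state s = (d_on, lambda):
models = {
    "remove":  (np.array([ 0.5, -1.0]), 0.0),
    "nothing": (np.array([ 0.0,  0.0]), 0.1),
    "add":     (np.array([-0.5,  1.0]), 0.0),
}
action = ovr_predict(models, np.array([10.0, 50.0]))
```

Because this rule is deterministic, it cannot emit the low-probability unwanted actions a stochastic policy occasionally samples.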
We ran Algorithm 3 for various t_pend values. Each experiment was carried out 8 times and we took the averages of the mean number of Pods and the mean blocking rate over the runs. The results show that by using the SVM classifier we could still keep the blocking rate below the p_b,th = 0.01 threshold; however, we get slightly higher mean Pod counts than when using the DRL agent alone. To investigate the choice of a classification model, we also conducted experiments with a logistic regression classifier. The results in Table 7 show that the logistic regression classifier could also keep p̄_b below the threshold p_b,th = 0.01. Incorporating the classifiers could reduce the resource usage (the mean number of Pods) compared to the HPA algorithm. Furthermore, the SVM model outperforms the logistic regression one when both models are trained on the same generated dataset; the decrease in d_on was lower with logistic regression at higher t_pend.

VI. DISCUSSIONS AND CONCLUSIONS
We have investigated the autoscaling of UPF Pods in a 5G core running inside the Kubernetes container-orchestration environment. An extensive numerical study shows that deep Q-networks (DQNs), which yield a deterministic, greedy policy, had difficulties learning a good scaling policy. Therefore, we proposed a DRL approach based on the PPO method to find a stochastic policy. We have shown that DRL can outperform the built-in HPA algorithm.
Note that the DRL agent may recommend an unexpected action with a tiny probability due to the nature of a stochastic policy. Such low-probability unexpected actions stem from outlier points in the dataset collected during training, which drive the agent in an unwanted direction. Therefore, we need a classifier trained on a cleaned dataset with these outlier points removed. Our study shows that incorporating the classifier could reduce resource usage (the mean number of Pods) compared to the HPA algorithm. We have also investigated two classification models (logistic regression and SVM) and found that the SVM model outperforms the logistic regression one when both models are trained on the same generated dataset. It is worth emphasizing that training a classifier yields a deterministic policy that reacts better to sudden changes in traffic: the agent does not terminate Pods when they are needed. In exchange, the performance degraded compared to the DRL agent when we evaluated it in an environment with slower traffic change. The degradation was slight, though, and the performance was still better than with the HPA most of the time. One major drawback, however, is that Algorithm 3 cannot be run online. Therefore, it should be applied once the DRL policy is stable enough and datasets are available; otherwise, the DRL with PPO is suggested.