Admission Control and Virtual Network Embedding in 5G Networks: a Deep Reinforcement-Learning approach

Fifth-generation (5G) networks are already available in major urban areas and are expected to bring a major transformation to citizens’ lives. 5G services, such as enhanced mobile broadband (eMBB), ultra-reliable low latency communications (URLLC), and massive machine-type communications (mMTC), require a network infrastructure capable of supporting stringent requirements in terms of latency and bandwidth demands; as such, it must be highly dynamic and flexible. Network slicing is a key enabler technology that can provide dynamic and flexible characteristics to 5G network architecture. A network slice (NS) can be defined as a partition of network and IT resources, that is, network links and nodes capacity dedicated to a specific set of service demands. As a result, different NSs can coexist over the same physical infrastructure network and can be used to dynamically and flexibly deploy the aforementioned 5G services. However, to efficiently implement NSs with different requirements, communication service providers (CSPs) that own the physical infrastructure network must adopt sophisticated techniques for admission control and resource allocation of NSs. In this paper, we present a novel framework for admission control and resource allocation of 5G NSs in metro-core networks. Specifically, our framework is based on a deep reinforcement learning (DRL) algorithm called Advantage Actor Critic (A2C), which performs admission control, i.e. it is capable of learning which slice to admit based on the availability of the physical network resources. Then, given the diversity of requirements for each 5G service, we propose different resource allocation algorithms based on integer linear programming (ILP) and heuristics to treat each service accordingly. The results show that our proposed framework can increase the number of admitted NSs with respect to the case in which the admission control is disabled by improving the resource allocation performance.


I. INTRODUCTION
Emerging 5G technology is now widely available in major urban areas, and coverage is expected to reach less populated areas in the coming years [1].
5G services, such as enhanced mobile broadband (eMBB), ultra-reliable low latency communications (URLLC), and massive machine-type communications (mMTC), are defined by the International Telecommunication Union (ITU) [2] and will soon be available to the majority of citizens.
In addition, 5G has already reached relevant industrial scenarios, thanks to the introduction of new use cases enabled by 5G connectivity that have improved the productivity and performance of the production chain, for example, Industrial IoT (IIoT).
5G leverages the benefits of network function virtualization (NFV) to accommodate flexibility in providing carrier-grade differentiated services. The notion of NFV focuses on the concept of a software-based representation of both hardware and software resources by considering data and/or control-plane functions. In other words, NFV is the paradigm of moving network functions, such as routing, firewall, and NAT, from dedicated hardware appliances to software-based applications running on commercial off-the-shelf equipment [3]. These virtual network functions (VNFs) provide many benefits to communication service providers (CSPs) that own the physical infrastructure network. They enable openness of platforms, scalability and flexibility, shorter development cycles, and reduced capital expenditure (CapEx) and operating expenditure (OpEx) [4].
NFV is the main foundation of network slicing, which ensures isolation and multi-tenancy support on a common physical network infrastructure by enabling logical and physical separation of network resources. Specifically, a network slice (NS) is defined as a partition of network and IT resources, that is, link and node capacity dedicated to a specific set of service demands. As such, different NSs can coexist over the same physical substrate network (SN) and can be used to dynamically and flexibly interconnect VNFs by providing different types of services, such as real-time video streaming and enterprise services. To this extent, NFV opens up the implementation and management of NSs not only to CSPs, which are also infrastructure network providers (InPs), but also to third-party service providers, such as Network Slice Providers (NSPs), which rely on one or more InPs to sell 5G services to end users.
Although the adoption of NFV brings revolutionary benefits in terms of scale and agility, it also brings a new level of complexity. Virtualization breaks traditional networking into dynamic components and layers that have to work in unison and can change at any given time [5]. For instance, a virtualized firewall can be subject to continuous updates by NSPs. To efficiently implement NSs over SNs, NSPs must deploy a software-based embedding system (ES) comprising a set of sophisticated techniques for admission control and resource allocation of NSs of different types, such as eMBB, URLLC, and mMTC.
The problem of how to allocate physical resources to virtual resources is called the virtual network embedding problem (VNEP). In most real-world scenarios, the VNEP needs to be addressed as an online problem (online VNEP). That is, we do not know in advance how many and which types of Network Slice Requests (NSRs) will reach the ES: they arrive dynamically and remain in the SN for an arbitrary period of time [6]. To be realistic, the ES must handle the NSRs as they arrive through an admission control (AC) algorithm, rather than processing a set of NSRs all at once (offline VNEP).
The NSP may decide to admit the NSRs deemed to have the best chance of meeting the predefined requirements. For example, URLLC and eMBB are two dominant types of service in the emerging 5G network. Latency and reliability are major concerns for URLLC NSRs (0.25-0.30 ms/packet [7]), while eMBB NSRs request maximum data rates (Gbps). The trade-off between the latency and reliability of URLLC services and the data rate of eMBB services leads to a challenging scheduling dilemma [8] [9].
Slice admission is also dictated by the available resources in the network resource pool, and the AC algorithm must consider the available resources in the SN and manage them in order to accommodate as many NSRs as possible.
Several research works have addressed this problem by employing various techniques, such as Markov chains [10], big data analytics [11], and queuing theory [12] (see Section II).
However, these studies do not differentiate the slices embedding according to the 5G services they carry, i.e. eMBB, URLLC, and mMTC. This paper proposes a novel ES framework to solve the online VNEP of 5G services in metro-core networks. The AC algorithm is based on deep reinforcement learning (DRL) while the VNE is based on integer linear programming (ILP) and heuristic algorithms.
DRL is a subfield of machine learning that aims to capture the most important features of a dynamic environment by deploying a learning agent (powered by deep learning algorithms) that interacts with it to achieve a goal [13]. For instance, the authors in [14] developed AlphaZero, a DRL framework that performs exceptionally well in games such as chess, shogi, and Go. In short, this framework includes a set of software agents capable of learning how to win in these games. They generate a series of actions (such as moves of the pieces on the chessboard) based on the results produced in previous games. Each time they play, they produce increasingly better results. An interesting aspect of DRL is that it implements software agents capable of learning how to optimize an objective function by interacting with an environment that can assume hundreds of thousands of different states.
For this reason, given the complexity of the AC problem, our goal is to implement a DRL algorithm to optimize the admission of NSRs in a 5G metro-core network environment. In particular, we implement an algorithm called Advantage Actor Critic (A2C) [15], which combines two types of reinforcement learning algorithms: policy-based and value-based. Policy-based agents directly learn a policy (a probability distribution of actions) by mapping input states to output actions. Value-based algorithms learn to select actions based on the predicted value of the input state or action.
Moreover, given the different requirements of 5G services, we developed different ILP and heuristic-based algorithms to address the VNE of eMBB, URLLC, and mMTC slices. This work considers the metro-core network architecture proposed by the Metro-Haul European project [16] as an SN. This network defines an NFV infrastructure that comprises metro nodes with IT and TLC equipment, following the multi-access edge computing (MEC) model defined by ETSI to support the instantiation of VNFs.
Both ILPs and heuristics aim to optimize the SN resources for each NSR.
The remainder of this paper is organized as follows. Section II presents the related work. Section III presents the proposed ES framework that solves the online VNEP of 5G slices. Section IV presents the performance evaluation of the proposed AC and VNE algorithms. Finally, Section V concludes the study.

II. RELATED WORK AND PAPER CONTRIBUTION
This section reviews the research works that have investigated AC and VNE algorithms.
The authors in [10] presented an analytical model based on a semi-Markov decision process enhanced by an artificial neural network (ANN) to perform AC of NSs on a wireless access network. The objective of the model is to maximize the overall profit of the infrastructure network provider while guaranteeing the service level agreement (SLA) committed to all slices.
In ref. [11], the authors introduced a slice admission strategy based on big data analytics (BDA) predictions. They considered a network architecture comprising three different domains: 1) wireless access network, 2) metro-optical network, and 3) datacenter (cloud) network. The goal is to accept a slice request issued by a customer from the wireless network domain only when it is estimated that no service degradation will occur for both the incoming slice request and the slices already deployed. The BDA prediction algorithm consists of a regression-based framework that makes predictions based on past data.
Han et al. [12] tackled the AC problem by proposing a system based on queuing theory. The authors cast this problem into a typical wireless access network scenario, where the mobile operator decides to lease infrastructure resources to customers (or tenants). The proposed system consists of a stochastic model that leverages a multi-queuing system (e.g., one queue for each type of slice) to design an AC for on-demand network slices.
Challa et al. [17] mapped the AC problem into a knapsack problem (MKP) with randomized arrivals and slice durations. The goal is to maximize resource monetization, defined as the revenue of the network provider, while minimizing the rejection rate to avoid SLA violations.
The authors in [18] proposed an AC model based on RL with the goal of maximizing the profit of the network provider. Specifically, they considered a 5G flexible RAN, where slices of different mobile service providers are virtualized over the same RAN infrastructure. The proposed RL-based algorithm employs an ANN-based stochastic policy network to model the AC agent, which accepts or denies slice requests to maximize revenue.
In [19], the authors proposed a framework for network slice management in the context of a 5G RAN. It comprises three modules: prediction, AC, and scheduling. The prediction module is responsible for predicting the traffic of a specific slice. The second module performs the AC as a geometric knapsack problem, showing that this problem is NP-hard. Finally, the scheduling module is in charge of meeting the agreed SLAs, and reports back deviations to the prediction module.
In [20], the authors proposed an admission control based on a recurrent neural network (RNN) to improve the overall system performance for the online VNE problem. The admission control serves as a filter for the incoming network slices by preventing the VNE algorithms from spending time on slices that are either infeasible or that cannot be embedded within an acceptable time. Their approach was based on supervised learning, which means that their RNN algorithm is trained offline.
The different techniques proposed in the aforementioned papers are based on accepting/rejecting individual network slices as they arrive. These studies do not distinguish the slices according to the 5G services [21], such as eMBB, URLLC, and mMTC, and therefore neglect the diversity of their QoS requirements. Furthermore, most of the proposed techniques focus on allocating radio resources while neglecting the allocation of 5G metro-core network nodes. In this study, we focus on a generic 5G metro-core network and consider standardized 5G services to perform AC based on a novel DRL approach.

B. VIRTUAL NETWORK EMBEDDING TECHNIQUES
In this section, we present an overview of the research work on VNE techniques. Several research works have proposed different methodologies to solve this problem. ILP formulations, that is, mathematical optimization models, provide optimal solutions. Heuristic algorithms cope with the complexity problems that generally affect ILPs, such as the rapidly increasing resolution time due to the large number of variables and constraints. Machine learning algorithms provide flexible and autonomous solutions to cope with the dynamic nature of VNE problems.
In [22], the authors proposed an ILP formulation to solve the VNE. The authors implemented a multi-commodity flow formulation to optimize the allocation of a generic slice onto a physical network. Their model strives to minimize physical resource consumption and to balance the load, which is accomplished by means of three different objective functions: load balancing plus shortest path (LB+SP), shortest distance path (SDP), and weighted shortest distance path (WSDP).
In [23], the authors proposed a heuristic algorithm for the VNE. The authors evaluated the acceptance ratio and run time of rank algorithms by comparing them with an ILP-based solution. They claim that all ranking techniques achieve high acceptance ratios within short run-times, whereby the best algorithm depends mainly on the slice and the SN.
In [24], the authors provided an ILP formulation by proposing a single objective function to compute the optimal VNE with respect to the revenue-to-cost ratio. They considered computation power, memory, throughput, and latency as the relevant resources for the SN by providing a nearly optimal ILP formalization and implementation. However, revenue and cost are not the only objectives that an embedding algorithm needs to meet, because different types of slices have different requirements. Moreover, as the authors note in their future work section, heuristics need to be developed and evaluated to solve large problem instances within a short run-time.
The authors in [25] proposed a graph-based model to map network slices to the SN. The mapping process is based on two steps: 1) node mapping, that is, the selection of the substrate nodes as the host for the virtual nodes of the slices; 2) then, link mapping, that is, the procedure of connecting selected host nodes in the physical SN.
In [20], the authors formulated an ILP and a heuristic algorithm for the VNE of network slices. They considered network slice requests with a delay requirement and a set of service function chains (SFCs), where each SFC is denoted as a set of VNFs with a capacity requirement. However, this solution does not define different bandwidth or CPU capacity objectives for the different types of slices; their work aims to minimize the embedding costs while guaranteeing delay constraints.
In [26], the authors proposed a framework based on deep reinforcement learning to allocate virtual radio resources. They considered bandwidth- and delay-constrained slices, whose radio resources are mapped to base stations to maximize the transmission rate and minimize the queuing delays according to the slice requirements. Nevertheless, this work considers neither slice admission control nor metro-core network slicing.
In [27], the authors developed an algorithm for VNE based on deep reinforcement learning. They considered virtual network requests that demand node CPU processing and link bandwidth resources from the network, with the goal of improving the acceptance ratio and the revenue. Although it enables dynamic slice embedding, this work does not consider the heterogeneous requirements of the services that may be carried by the slice.
Most of these studies mapped the arriving generic slices onto the physical SN.
In contrast, in this work, we propose both ILP and heuristic-based algorithms that consider four types of 5G network slices, namely, generic, eMBB, mMTC, and URLLC, according to the International Telecommunication Union (ITU) [21]. We consider their requirements to tailor both the proposed ILP and heuristic-based VNE algorithms. In short, the objective of this work is to jointly perform AC and VNE of NSs in a 5G metro-core network by optimizing resource utilization and maximizing the revenue of the infrastructure network provider.

C. PAPER CONTRIBUTION
In a nutshell, we developed a novel ES framework able to solve the online VNEP for 5G services in metro-core networks. The contribution of this work can be summarized as follows:
• We implemented a novel algorithm for the AC of NSs based on DRL. In particular, we tailored the Advantage Actor Critic (A2C) [15] algorithm with the aim of optimizing the slice admission control of different types of 5G services, such as generic, eMBB, mMTC, and URLLC. Furthermore, we took into account the revenue-to-cost ratio after the VNE procedure to design the reward function.
• We included two types of VNE algorithms in our ES that are inspired by the research work in [25]. The first is based on an ILP mathematical formulation [...]. Compared to [25], we provided the following novelties:
1) We extended the ILP formulation by modifying the objective functions for eMBB and mMTC NSRs. In particular, our objective functions ensure that the residual physical resources on the SN are maximized. As a result, our ILP formulations provide a fair distribution of available resources throughout the SN in order to prevent some nodes, or links, from being used more than others.
2) We developed the heuristic algorithms by taking into account specific requirements of the 5G NSRs. As such, we have considered the maximum CPU capacity of the physical nodes as the main feature to embed the eMBB NSRs; then, we have used the minimum bandwidth available on the substrate nodes and the number of hops as the main features for the mMTC and URLLC NSRs, respectively.
In the following sections, we discuss the proposed ES that solves the online VNEP for 5G services on a metro-core network. Fig. 1 shows the proposed ES framework. It includes two modules: an AC module and a VNE module. The first is based on a DRL algorithm and performs admission control of NSRs. The second is based on different ILP and heuristic-based algorithms and performs resource allocation onto the SN. The proposed ES framework operates as follows.
1) At time t, an NSR arrives at the AC module. It decides to accept or reject the NSR based on the output of the VNE module and the status of the SN at the previous time step.
2) If the AC module accepts the NSR, it informs the VNE module such that it can be embedded onto the SN. Otherwise, it is rejected.
3) Once the VNE module receives acceptance from the AC module, it performs the embedding procedure. If the NSR embeds successfully, the VNE module sends positive feedback to the AC module; otherwise, it sends negative feedback. The latter is important for the AC module because it can learn to reject in advance the NSRs that cannot be embedded.
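The three steps above can be sketched as a simple control loop. This is only a minimal illustration under stated assumptions: `run_embedding_system`, `ac_decide`, `vne_embed`, and the toy CPU-only substrate state are hypothetical names introduced here, not part of the proposed framework.

```python
from typing import Callable, Iterable, List, Tuple

def run_embedding_system(
    requests: Iterable[dict],
    ac_decide: Callable[[dict, dict], bool],
    vne_embed: Callable[[dict, dict], bool],
    sn_state: dict,
) -> List[Tuple[str, str]]:
    """Process NSRs one at a time and log the outcome of each."""
    log = []
    for nsr in requests:
        # Step 1: the AC module accepts or rejects based on the SN status.
        if not ac_decide(nsr, sn_state):
            log.append((nsr["id"], "rejected"))
            continue
        # Steps 2-3: the VNE module tries to embed the accepted NSR; the
        # logged outcome stands in for the feedback sent to the AC module.
        embedded = vne_embed(nsr, sn_state)
        log.append((nsr["id"], "embedded" if embedded else "failed"))
    return log

# Toy stand-ins (assumptions): admit when enough CPU is free,
# and let a successful embedding consume that CPU.
def toy_ac(nsr: dict, sn: dict) -> bool:
    return nsr["cpu"] <= sn["free_cpu"]

def toy_vne(nsr: dict, sn: dict) -> bool:
    if nsr["cpu"] > sn["free_cpu"]:
        return False
    sn["free_cpu"] -= nsr["cpu"]
    return True

outcomes = run_embedding_system(
    [{"id": "a", "cpu": 4}, {"id": "b", "cpu": 5}, {"id": "c", "cpu": 2}],
    toy_ac, toy_vne, {"free_cpu": 8},
)
```

In a learning setting, the "embedded" or "failed" outcome would be turned into a reward for the AC agent rather than only being logged.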

III. EMBEDDING SYSTEM FRAMEWORK FOR 5G NETWORK SLICES
The following sections will discuss the details of the SN, the type of NSRs, the DRL-based AC module, and the VNE module.

A. SUBSTRATE NETWORK (SN)
The SN considered in this work follows the metro-core network architecture proposed by the Metro-Haul European project [16]. The Metro-Haul (MH) network is designed to support network slicing according to the 5G-MEC model; as such, it can support different 5G service requirements. For instance, eMBB requires a wide range of VNFs distributed across the core and edge metro nodes. At the same time, mMTC needs VNFs deployed at the edge to support a high connection density of online devices such as sensors and other wireless devices. The MH network defines an NFV infrastructure that comprises metro core edge nodes (MCENs) interconnected by a high-capacity optical network. The MCENs are considered as mini datacenters hosting both IT and Telecommunication (TLC) equipment, following the multi-access edge computing (MEC) model defined by ETSI, so that they can support the instantiation of VNFs. Following the MH guidelines, we assumed that these metro nodes are connected via bidirectional fiber links with transmission distances ranging from 5 km to 50 km. Each link contains two fibers with 20 wavelengths each at 100 Gbps per wavelength; in addition, each MCEN is also equipped with a set of 100 Gbps transponders. In terms of the nodes' computational capacity required to host the VNFs related to each slice, we consider the Intel® Xeon® Gold 6134 server with 8 cores, a processing capacity of 537.6 GFLOPS, and a maximum operating frequency of 3.7 GHz. In this work, we modeled the 5G metro-core network as an undirected graph $G_S := (N_S, L_S)$, where $N_S$ represents the set of all physical nodes (MCENs) and $L_S \subseteq N_S \times N_S$ represents the set of all physical links in the SN. $A_N$ and $A_L$ are the node attributes (CPU processing capability) and link attributes (bandwidth).
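To make the graph model above concrete, the following sketch builds the node and link attribute maps for a small substrate network. It is an illustration under assumptions: the three-node topology, the names `build_substrate` and `link_capacity_gbps`, and the choice of one server per node are hypothetical. The link capacity is the aggregate over both fibers; if each fiber serves one direction, the per-direction capacity is half of this figure.

```python
WAVELENGTHS_PER_FIBER = 20
FIBERS_PER_LINK = 2
GBPS_PER_WAVELENGTH = 100
NODE_GFLOPS = 537.6  # assumption: one Xeon Gold 6134 server per MCEN

def link_capacity_gbps() -> int:
    """Aggregate capacity of one physical link over both fibers."""
    return FIBERS_PER_LINK * WAVELENGTHS_PER_FIBER * GBPS_PER_WAVELENGTH

def build_substrate(edges):
    """Return (A_N, A_L): node CPU and link bandwidth attributes."""
    nodes = sorted({n for edge in edges for n in edge})
    a_n = {n: NODE_GFLOPS for n in nodes}
    # Undirected links: store each one under an order-independent key.
    a_l = {frozenset(e): link_capacity_gbps() for e in edges}
    return a_n, a_l

# Hypothetical three-node ring of MCENs.
a_n, a_l = build_substrate([("m1", "m2"), ("m2", "m3"), ("m3", "m1")])
```

Using `frozenset` keys means a link can be looked up from either endpoint order, which matches the undirected graph assumed in the text.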

B. NETWORK SLICE REQUEST (NSR)
The NSR comprises a virtual topology composed of VNFs and virtual links with CPU and bandwidth requirements. In this work, we follow the separation between the 5G control plane and data plane VNFs [28], which includes the following VNFs: access/mobility management function (AMF), session management function (SMF), and user plane function (UPF).
We consider four types of NSRs based on 5G services: eMBB, mMTC, URLLC [21], and a generic slice that can be customized according to the user needs. Next, we present the details and requirements of the considered 5G NSs.
• eMBB: It requires high throughput (up to 20 Gbps) and computing resources, while the latency constraint is approximately 4 ms [25]. As a result, the deployment of eMBB requests aims to minimize the remaining resources of the physical nodes. Video streaming services and augmented reality applications belong to this class [23].
• mMTC: It includes a large number of connected online devices (1 million devices/km² [29]), such as IoT sensors. The deployment of mMTC requests aims to minimize bandwidth usage on physical links. These types of requests have plenty of connections; consequently, they demand substantial computing resources and a low congestion rate.
• URLLC: This type of request has strong latency constraints (e.g. 1 ms) and high availability requirements (e.g. 99.9%) [30]. The deployment of URLLC requests aims to minimize the delay. Autonomous driving services and eHealth applications are examples of this type of slice.
As for the SN, we denote each NSR as an undirected graph $G_V := (N_V, L_V)$, where $N_V$ represents the set of all virtual nodes and $L_V \subseteq N_V \times N_V$ represents the set of all virtual links in the NSR. $R_N$ and $R_L$ are the node requests (CPU processing capability) and link requests (link bandwidth). Table 1 shows the mathematical notation of the SN and NSRs.

Notation    Description
$L_V$       Full set of virtual links in NSR
$R_N$       Set of node requirements in NSR
$R_L$       Set of link requirements in NSR

C. ADMISSION CONTROL MODULE
This section describes the proposed AC module based on DRL. As mentioned in the Introduction, DRL is a subfield of machine learning that combines RL and deep learning. In particular, DRL agents can handle very large sets of input data and decide what actions to perform to optimize an objective. In this study, we implement the Advantage Actor-Critic (A2C) algorithm [15], which takes as input the current NSR and different information from the SN (i.e. the state space), such as the number of metro-core nodes and links, the current CPU loads, and the bandwidth consumption. Then, the agent learns to make decisions, such as accepting the NSR (i.e. the action space), by optimizing the profit of the NSP. For this, we shaped a reward function that returns the quality of the action taken by the agent. In the following subsections, we discuss the details of the A2C algorithm implementation.
Actor-critic algorithms have proven to be very efficient for problems with a large number of environmental states [31]. For instance, Munir et al. [32] designed a multi-agent A2C algorithm with the goal of providing an efficient energy scheduling scheme for a microgrid-powered MEC network. In particular, the objective is to reduce the gap between energy generation and demand estimation so as to maximize the usage of renewable energy. Alsenwi et al. [33] proposed a novel approach combining optimization-theory based methods with the A2C algorithm to improve the performance of resource allocation of eMBB and URLLC traffic in wireless networks. These two research works show successful implementations of the A2C algorithm in different network scenarios. The objective of our work is to exploit the capability of the A2C algorithm to perform the AC task of NSs in the metro-core network scenario. A2C is a specific type of actor-critic algorithm that belongs to the family of action-value functions. It combines two types of RL algorithms: policy-based and value-based.
Policy-based algorithms comprise agents that learn a policy, that is, a probability distribution of actions, by mapping the input state space to output actions. A policy can be represented as a rule used by the agent to select the correct action. It can be deterministic or stochastic, in which case it is usually denoted by $\pi$. Policy-based algorithms represent a policy as $\pi(a|s)$, where $a$ is the action and $s$ is the state.
On the other hand, value-based agents learn how to select actions based on the predicted value of the input state or action. Unlike policy-based algorithms, value-based agents aim to find a numerical representation of a state. In other words, the value is the expected reward $\mathbb{E}$ for state $s$ under a policy $\pi$. The value function is denoted by $V^{\pi}(s)$ and is represented as follows: $V^{\pi}(s) = \mathbb{E}_{\pi}[r(t) \mid s]$.
As a result, action-value functions learn a value for the action instead of a state. The goal of the agent is to find an optimal policy $\pi^{*} : S \to A$ to maximize the expected reward. As such, we first define the value function $V^{\pi} : S \to \mathbb{R}$, which represents the expected value returned by following the policy $\pi$ for each state $s \in S$. The value function $V$ for policy $\pi$ is defined in Eq. 1. Because the goal of the agent is to find the optimal policy $\pi^{*}$, an optimal action at each state can be expressed by the optimal value function. If we denote by $Q^{*}(s, a)$ the optimal Q-function for all state-action pairs, then the optimal value function can be written as follows: $V^{*}(s) = \max_{a} \{Q^{*}(s, a)\}$. Therefore, the final goal is to find the optimal values of the Q-function, that is, $Q^{*}(s, a)$, for all state-action pairs, which can be done through iterative processes. In particular, the Q-function is updated according to the rule in Eq. 2, in which the learning rate $\alpha_t$ defines the impact of new information on the existing Q-value. The algorithm then yields the optimal policy indicating an action to be taken at each state such that $Q^{*}(s, a)$ is maximized for all states in the state space, i.e., $\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)$.
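For reference, the quantities discussed above are commonly written in the following textbook forms. These are assumed standard expressions, not necessarily the exact Eq. 1 and Eq. 2 of this work; in particular, the discount factor $\gamma \in [0, 1)$ and the next state $s'$ are symbols introduced here for illustration.

```latex
% Assumed standard forms; \gamma is a discount factor not defined in the text.
\begin{align}
V^{\pi}(s) &= \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r_t(s_t, a_t) \,\middle|\, s_0 = s \right] \\
V^{*}(s) &= \max_{a} Q^{*}(s, a) \\
Q_{t+1}(s, a) &= Q_t(s, a) + \alpha_t \left[ r_t(s, a) + \gamma \max_{a'} Q_t(s', a') - Q_t(s, a) \right]
\end{align}
```

With $\alpha_t = 0$ the Q-value is never updated, and with $\alpha_t = 1$ it is fully replaced by the new target, which is the sense in which the learning rate weighs new information against the existing estimate.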
Traditional Q-learning is based on the concept that the agent knows the expected reward for each action at every step. However, it does not scale to problems with large state and action spaces. Therefore, DRL-based algorithms use deep learning to scale to decision-making problems that were previously intractable, that is, settings with high-dimensional state and action spaces. An actor-critic algorithm consists of two artificial neural networks: actor and critic. The actor network selects an action at each time step, and the critic network outputs the Q-value of a given input state. In other words, while the critic network learns which states are better or worse, the actor uses this information to explore good states and try to avoid bad states. The A2C algorithm is a specific version of the traditional actor-critic that exploits an advantage function that predicts the error of the agent [15]. The learning procedures of the actor and critic networks are performed separately, and gradient ascent is used to update both sets of weights in the corresponding networks. As time passes, the actor learns to produce increasingly better actions (it starts to learn the policy), and the critic becomes increasingly better at evaluating those actions.
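The interplay between actor and critic described above can be illustrated with a single update step. The sketch below is a simplified, assumed variant: it uses a state-value critic and the one-step temporal-difference error as the advantage estimate, with a softmax policy over per-action preferences instead of neural networks. The function names, learning rates, and discount factor are hypothetical.

```python
import math

def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def a2c_step(prefs, v_s, v_next, action, reward,
             gamma=0.99, actor_lr=0.1, critic_lr=0.1):
    """One advantage actor-critic update for a single state.

    prefs  : action preferences (actor parameters) for the state
    v_s    : critic's value estimate for the state
    v_next : critic's value estimate for the next state
    Returns the updated preferences, the updated value, and the advantage.
    """
    # Advantage estimated as the one-step TD error.
    advantage = reward + gamma * v_next - v_s
    probs = softmax(prefs)
    # Gradient of log pi(action) w.r.t. each preference: 1[a == action] - pi(a).
    new_prefs = [
        p + actor_lr * advantage * ((1.0 if a == action else 0.0) - probs[a])
        for a, p in enumerate(prefs)
    ]
    # Critic moves its estimate of V(s) toward the bootstrapped target.
    new_v = v_s + critic_lr * advantage
    return new_prefs, new_v, advantage

# Action 1 ("accept") was taken and yielded a positive reward.
prefs, v, adv = a2c_step([0.0, 0.0], v_s=0.0, v_next=0.0, action=1, reward=1.0)
```

A positive advantage raises the probability of the action that was taken, while a negative one lowers it; this is how the critic's evaluation steers the actor.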
In the following subsections, we provide a definition of the state space, action space, and reward function.

2) State space
A state is represented by the available resources in the substrate 5G metro-core network. To achieve efficient admission control, the agent needs sufficient information about the current state of the environment and the NSR characteristics. To this end, we define two feature vectors, $\phi_S(G_S, t)$ and $\phi_V(G_V, t)$, which are representative of the SN and the NSR at a time t. The $\phi_S(G_S, t)$ vector is composed of the following features: [...] denotes the degree of virtual node $v_i$ at a time t.
• Average path length between virtual nodes ($APL_v$): it is the averaged total path length at a time t between virtual node v and every other physical node that is reachable from v.
• CPU and bandwidth requests for NSRs ($CPU_{NSR}$ and $B_{NSR}$): they are defined as the CPU and bandwidth units requested by the specific type of slice.
• NSR type: it defines the type of NSR, such as generic, eMBB, URLLC, and mMTC.
As an example, Fig. 2 shows the embedding of a generic NSR made of 3 virtual nodes, with requirements on CPU and bandwidth, on an SN composed of 9 physical nodes, while Table 2 shows the values of the $\phi_S(G_S, t)$ and $\phi_V(G_V, t)$ state space vectors that describe the network at a time t = 1.
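The NSR-side features listed above can be assembled into a flat vector as in the following sketch. Several choices here are assumptions made for illustration: path lengths are measured in hops inside the NSR graph itself, the NSR type is encoded as an integer index, and the names `avg_path_length` and `nsr_features` are hypothetical.

```python
from collections import deque

def avg_path_length(adj, src):
    """Mean hop count from src to every node reachable from it (BFS)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    others = [d for n, d in dist.items() if n != src]
    return sum(others) / len(others) if others else 0.0

def nsr_features(adj, cpu_req, bw_req, nsr_type):
    """Concatenate degrees, average path lengths, requests, and type index."""
    types = ["generic", "eMBB", "URLLC", "mMTC"]  # integer encoding (assumption)
    nodes = sorted(adj)
    degrees = [len(adj[v]) for v in nodes]
    apl = [avg_path_length(adj, v) for v in nodes]
    return degrees + apl + [cpu_req, bw_req, types.index(nsr_type)]

# Three virtual nodes in a chain: v0 - v1 - v2.
chain = {"v0": ["v1"], "v1": ["v0", "v2"], "v2": ["v1"]}
phi_v = nsr_features(chain, cpu_req=10, bw_req=5, nsr_type="eMBB")
```

A fixed-length vector of this kind is what a neural network input layer expects, so NSRs with different numbers of virtual nodes would need padding or aggregation in practice.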

3) Action space
At each time step, the A2C-agent receives an NSR and decides whether to accept or deny the incoming slice. The set of actions is denoted by a(t) and can assume a value equal to 0 (deny) or 1 (accept). The agent selects the action that maximizes the reward function.

4) Reward function
The reward function considers the resource efficiency of the SN in terms of the revenue-to-cost ratio (see Section IV-A). We can improve the overall acceptance ratio if the agent predicts which NSRs can be embedded in such a way that the resulting resource efficiency is maximized. If the agent rejects an NSR that cannot be embedded onto the SN, the VNE algorithm does not lose time trying to embed requests that are infeasible. Therefore, the reward function depends on the success of the VNE algorithm. If the NSR can be embedded, the reward function is described by Eq. 3.
On the contrary, if the VNE cannot embed the incoming NSR, the reward function is described by Eq. 4,
where α is a negative value between −1 and 0, and β is a positive value between 0 and 1. If α is closer to −1, the agent will be more cautious and will reject more incoming NSRs; in contrast, if α is closer to 0, the agent will accept more requests. Algorithm 1 shows the complete AC procedure based on A2C.
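The role of α and β can be illustrated with a toy reward function. The exact forms of Eq. 3 and Eq. 4 are not reproduced here, so the shape below is purely an assumption: β scales the revenue-to-cost ratio of a successful embedding, α is the penalty for accepting an NSR that cannot be embedded, and a rejection yields zero reward. The function name and default values are hypothetical.

```python
def reward(accepted: bool, embedded: bool, revenue: float, cost: float,
           alpha: float = -0.5, beta: float = 0.5) -> float:
    """Toy reward shape (assumed, not the paper's Eq. 3 and Eq. 4)."""
    if not -1.0 <= alpha <= 0.0:
        raise ValueError("alpha must lie between -1 and 0")
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie between 0 and 1")
    if not accepted:
        return 0.0                    # assumption: rejecting is reward-neutral
    if embedded:
        return beta * revenue / cost  # reward grows with resource efficiency
    return alpha                      # accepted but the embedding failed

r_ok = reward(True, True, revenue=10.0, cost=20.0)
r_fail = reward(True, False, revenue=10.0, cost=20.0)
r_reject = reward(False, False, revenue=10.0, cost=20.0)
```

Under this shape, moving α toward −1 makes a failed acceptance more costly relative to a rejection, which is consistent with the more cautious behavior described above.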

D. VIRTUAL NETWORK EMBEDDING MODULE
The VNE module maps the resource demands of each type of NSR onto the SN. In this study, we extend two types of VNE algorithms introduced in [25]. The first is based on integer linear programming (ILP), a mathematical formulation for solving optimization problems such as the VNE. We start by defining different objective functions (one for each type of NSR) with the goal of minimizing or maximizing one or more specific metrics, such as bandwidth and latency. Then, we define the constraints, which consist of equations or inequalities that set upper and lower bounds on the variables of the model, for example, the CPU capacity of each physical node. Compared to the work in [25], we modified the objective functions for eMBB

However, the well-known VNE problem has been shown to be NP-hard [34]; hence, exact solution methods suffer from scalability problems, namely prohibitive computational complexity and running times. To address these issues, we developed a second set of algorithms based on heuristic methods. The goal is to obtain VNE solutions as close as possible to the optimal solutions in the shortest possible time. In particular, the run time of the VNE module is important to facilitate the rapid exploration/exploitation of the AC module and to speed up the embedding of the NSRs onto the SN. Compared to the work in [25], we developed the heuristic algorithms by taking into account the specific requirements of the 5G NSRs. Therefore, we consider the maximum CPU capacity of the SN nodes as the main feature when embedding the eMBB NSRs; then, we use the minimum bandwidth available on the substrate links and the number of hops as the main features for the mMTC and URLLC NSRs, respectively.

1) ILP-based algorithms
As stated above, we propose a mathematical formulation to accommodate different types of NSRs in the SN. It is based on a node-link formulation [22], which enables the synchronous embedding of virtual nodes and links by optimizing the allocation of physical network resources. Specifically, we define: 1) four objective functions (one for each NSR type); 2) a set of constraints shared by each type of NSR. Table 3 shows the variables and parameters used to develop the ILP.

a: Objective functions
The definition of the objective function is one of the major challenges in formulating an ILP. We define four different objective functions for the following NSR types:
• Generic NSR: This type of slice is used for service requests that have no specific network constraints, as opposed to those with specific network requirements, such as eMBB, mMTC and URLLC. The idea is to admit slices of generic applications that do not need fixed allocations of resources in the SN. As such, the goal for a generic NSR is to use the SN resources efficiently [25]. Therefore, the objective function minimizes the overall CPU and bandwidth usage.
• eMBB: As stated above, for this type of request, we have to minimize the CPU usage on physical nodes; hence, in this objective function, we maximize the minimum CPU capacity on physical nodes. See Eq. (6). The constraint under Eq. (6) ensures that the CPU capacity assigned to the eMBB slice does not exceed the maximum capacity of the physical nodes.
• mMTC: For this type of request, we have to maximize the remaining resources of physical links; hence, in this objective function, we maximize the remaining bandwidth on physical links. See Eq. (7).

Algorithm 1 (excerpt): get the feature vectors ϕ_S(G_S) and ϕ_V(G_V); get the policy function π_θ(s_t, a_t) using the actor network; calculate the value V(s_t) using the critic network; calculate the value V(s_{t+1}) using the critic network; calculate the advantage value A(s_t, a_t).

To ensure a feasible embedding between the virtual nodes/links of the NSR and the physical SN resources, we define a set of constraints shared by each NSR type.
1) Assignment of virtual nodes to physical nodes: Eq. (9) ensures that each virtual node is assigned to one physical node.
2) Assignment of physical nodes to virtual nodes: Eq. (10) ensures that each physical node can host at most one virtual node per NSR.
3) CPU capability conservation: Eq. (11) ensures that the available CPU capacity of each physical node at time step t is not exceeded.
4) Bandwidth capability conservation: Eq. (12) ensures that the available bandwidth capacity of each physical link at time step t is not exceeded.
5) Multi-commodity flow conservation with node-link formulation: Eq. (13) optimizes the mapping of virtual links and virtual nodes. The multi-commodity flow constraint is applied within a node-link formulation [35].
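The shared constraints can be read as a checklist that any candidate embedding must pass. The sketch below is not the ILP itself: it only verifies a given mapping against constraints (9) to (12), and replaces the flow-conservation constraint (13) with the simpler requirement that each virtual link is mapped onto a path joining the hosts of its endpoints. The function name `is_feasible` and all data structures (dictionaries keyed by node names, `frozenset` keys for undirected physical links) are assumptions made for illustration.

```python
def is_feasible(node_map, link_map, cpu_req, bw_req, cpu_avail, bw_avail):
    """Check one NSR embedding against the shared constraints (illustrative)."""
    # Eq. (9): every virtual node is assigned to a physical node.
    if set(node_map) != set(cpu_req):
        return False
    # Eq. (10): a physical node hosts at most one virtual node of this NSR.
    hosts = list(node_map.values())
    if len(hosts) != len(set(hosts)):
        return False
    # Eq. (11): the available CPU of each chosen physical node is not exceeded.
    if any(cpu_req[v] > cpu_avail.get(p, 0) for v, p in node_map.items()):
        return False
    # Every virtual link must be mapped onto some physical path.
    if set(link_map) != set(bw_req):
        return False
    load = {}
    for (src, dst), path in link_map.items():
        # Simplified stand-in for Eq. (13): the path joins the two hosts.
        if len(path) < 2 or path[0] != node_map.get(src) or path[-1] != node_map.get(dst):
            return False
        for u, w in zip(path, path[1:]):
            edge = frozenset((u, w))
            if edge not in bw_avail:
                return False
            load[edge] = load.get(edge, 0) + bw_req[(src, dst)]
    # Eq. (12): the available bandwidth of each physical link is not exceeded.
    return all(load[e] <= bw_avail[e] for e in load)
```

Such a checker is useful for validating the output of either the ILP solver or the heuristics before the SN state is updated.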
2) Heuristic-based algorithms
The VNE problem can be divided into two sub-problems: virtual node mapping (VNoM) and virtual link mapping (VLiM). Both sub-problems can be solved in a coordinated or uncoordinated fashion [35].

a: Heuristic algorithm for generic NSRs
Algorithm 3 presents the VNE procedure for a generic NSR. It starts by solving VNoM as follows:
1) First, it computes the node resource (NR) and node importance (NI) metrics for both virtual and physical nodes. NR represents the available resources of each node and is defined by Eq. (15), where CPU(i) is the CPU resource capacity of node i, s(i) represents the set of links that are directly connected to node i, and BW(l) represents the bandwidth capacity of link l. NI scores the nodes based on their resources and centrality in the network and is defined by Eq. (16), where NR(i) is the node resource metric defined by Eq. (15), d_i is the normalized degree of node i, and b_i is the normalized betweenness centrality of node i. The latter is defined as the number of shortest paths between two nodes that pass through node i.
2) Once the algorithm has computed the NR and NI metrics, it sorts the virtual nodes using Algorithm 2. Each node v ∈ NSR_i is sorted in non-increasing order using the breadth-first search (BFS) algorithm [36] by exploiting the NI values of each node. The NI metric allows the nodes to be classified according to the number of shortest paths that can pass through each node. The BFS algorithm gives priority to nodes with higher NI; as such, virtual nodes v ∈ NSR_i are mapped onto the most important physical nodes.
3) If the VNoM procedure is successful, the algorithm proceeds with the VLiM procedure. The link mapping procedure is based on the shortest path between the embedded physical nodes; in particular, we use the Floyd-Warshall algorithm [37]. For each virtual link in NSR_i, it removes the physical links that do not meet the bandwidth requirement. Then, it determines the physical nodes onto which the source and target virtual nodes of the virtual link are mapped. Finally, it maps the virtual link onto the physical links of the shortest path calculated using the Floyd-Warshall algorithm.

Algorithm 3 (excerpt):
    if v is the root node then
        Map v onto the physical node with the highest NI;
    else
        Find parent virtual node P of virtual node v in T;
        Find physical node PP in which P is mapped;
        Set C to the adjacent physical nodes of PP;
        Map v onto the physical node with the highest NI in C;
    end
end
Sort the virtual links (i, j) ∈ L_Vi of NSR_i in non-increasing order according to their bandwidth requirement;
for (i, j) ∈ L_Vi do
    Remove the physical links that cannot meet the bandwidth requirement;
    Find physical node a where virtual node i is mapped;
    Find physical node b where virtual node j is mapped;
    Find the physical shortest path between a and b using the Floyd-Warshall algorithm;
    Map virtual link (i, j) onto the physical shortest path.
end

b: Heuristic algorithm for eMBB NSRs
This heuristic aims to minimize the CPU usage on physical nodes. Hence, we first run the VNoM procedure and then the VLiM procedure. The virtual nodes are sorted using Algorithm 2. The physical node in the SN with the highest CPU capacity is selected to map each virtual node of the incoming NSR. If the VNoM succeeds, we run the Floyd-Warshall algorithm for the VLiM procedure with the goal of embedding each virtual link onto the physical links of the shortest path in the SN. See Algorithm 4.

Algorithm 4 (excerpt):
    if v is the root node then
        Map v onto the physical node with the highest CPU capacity resources;
    else
        Find parent virtual node P of virtual node v in T;
        Find physical node PP in which P is mapped;
        Set C to the adjacent physical nodes of PP;
        Map v onto the physical node in C with the highest CPU capacity resources;
    end
end
Sort the virtual links (i, j) ∈ L_Vi of NSR_i in non-increasing order according to their bandwidth requirement;
for (i, j) ∈ L_Vi do
    Remove the physical links that cannot meet the bandwidth requirement;
    Find physical node a where virtual node i is mapped;
    Find physical node b where virtual node j is mapped;
    Map virtual link (i, j) onto the physical shortest path between a and b;
end

c: Heuristic algorithm for mMTC NSRs
This heuristic maximizes the remaining bandwidth on physical links; it first runs the VLiM and then the VNoM procedure. The algorithm sorts the virtual links of the NSR in non-increasing order according to their bandwidth requirements. The goal is to first embed the virtual links with higher requirements. Then, it calculates all possible paths that can allocate each virtual link and selects the best-fitting path in the SN. The best-fitting path is selected according to the following procedure: 1) it selects the physical link with the minimum BW resources for each possible path, and 2) it selects the path whose minimum link BW is the maximum among all possible paths. Then, the physical links on the best-fitting path are selected to map the virtual link. If the VLiM is successful, we select the source and target physical nodes of the best-fitting path of each virtual link to map the virtual nodes of the incoming NSR. See Algorithm 5.

d: Heuristic algorithm for URLLC NSRs
Finally, a URLLC request must be deployed with a specific delay requirement. To do so, the heuristic algorithm must minimize the end-to-end delay between sources and destinations; in other words, it uses fewer physical links.
As such, it first runs the VLiM and then the VNoM procedure. Similar to the previous algorithm, we first sort the virtual links of the incoming NSR in non-increasing order according to their bandwidth requirements. Then, the algorithm calculates all possible paths that can allocate each virtual link and selects the best-fitting path in the SN. The best-fitting path is the one with the fewest hops in the SN. If the VLiM is successful, we select the source and target physical nodes of the best-fitting path of each virtual link to map the virtual nodes of the incoming NSR. See Algorithm 6.
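The two link-first heuristics differ only in how the best-fitting path is chosen among the candidate paths: the largest bottleneck bandwidth for mMTC, and the fewest hops for URLLC. The sketch below illustrates both rules on a small undirected graph. It is a simplification under stated assumptions: candidate endpoints `src` and `dst` are given as inputs, paths are enumerated exhaustively (practical only for small topologies), and the names `simple_paths`, `bottleneck` and `best_fitting_path` are hypothetical.

```python
def simple_paths(adj, src, dst, path=None):
    """Yield every loop-free path from src to dst (exhaustive enumeration)."""
    path = (path or []) + [src]
    if src == dst:
        yield path
        return
    for nxt in adj.get(src, ()):
        if nxt not in path:
            yield from simple_paths(adj, nxt, dst, path)


def bottleneck(path, bw):
    """Minimum available bandwidth over the physical links of a path."""
    return min((bw[frozenset(e)] for e in zip(path, path[1:])), default=float("inf"))


def best_fitting_path(adj, bw, src, dst, demand, mode):
    """Select the best-fitting path for one virtual link (illustrative)."""
    # Keep only the paths whose every link can carry the requested bandwidth.
    feasible = [p for p in simple_paths(adj, src, dst) if bottleneck(p, bw) >= demand]
    if not feasible:
        return None
    if mode == "mMTC":
        # Path whose minimum link bandwidth is the maximum among all paths.
        return max(feasible, key=lambda p: bottleneck(p, bw))
    if mode == "URLLC":
        # Path with the fewest hops.
        return min(feasible, key=len)
    raise ValueError("mode must be 'mMTC' or 'URLLC'")
```

Because the enumeration grows quickly with the size of the SN, a production version would typically bound the path length or use a widest-path variant of a shortest-path algorithm instead.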

IV. PERFORMANCE EVALUATION
This section introduces the performance evaluation of the proposed ES framework. First, we present the metrics used to assess the performance. Afterwards, we show the performance of the VNE module by comparing the ILP-based and heuristic-based algorithms separately; then, we present the evaluation of the entire ES framework with the AC module based on DRL. Finally, we discuss the complexity of the algorithms and the results of this study.

VOLUME 4, 2016

Algorithm 5 (excerpt):
    Find the physical link l ∈ p with the minimum bandwidth resources.
    Map virtual node j onto the physical target node of the bp path;
end

A. PERFORMANCE METRICS
The VNE problem is defined as a mapping problem in which VNE algorithms map an NSR (G_V) onto an SN (G_S). The mapping of an NSR onto the SN is valid only if the full set of virtual nodes N_V is mapped onto a subset of the substrate nodes N_S, and all virtual links L_V connecting virtual nodes are mapped onto a subset of the substrate links L_S. All virtual node and link requests in G_V must be satisfied. The following sections introduce the acceptance ratio, resource efficiency and run time metrics.

1) Acceptance Ratio (AR)
The acceptance ratio is the ratio between the number of requests that have been successfully mapped and the total number of requests that arrived at the AC module [25]. It is defined in Eq. (17).

2) Resource Efficiency (RE)
Resource efficiency is defined as the revenue-to-cost ratio [25]. When an NSR is accepted, the mapping revenue can be defined as the sum of its node capacity and link bandwidth requirements. The cost can be defined as the sum of the node and link bandwidth resources of the SN allocated to the NSR. It is defined in Eq. (18), where R_N(n) represents the CPU capacity requirement of virtual node n, R_L(l) represents the bandwidth requirement of virtual link l, and hop(l) represents the length of the physical path onto which virtual link l is mapped in the SN.
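To make the two metrics concrete, the sketch below computes them for a single NSR. It assumes the common revenue-to-cost form in which the bandwidth of each virtual link is counted once in the revenue and once per physical hop in the cost; this matches the symbols R_N(n), R_L(l) and hop(l) described above, but the exact expressions of Eqs. (17) and (18) are not reproduced here, so the formulas and function names are assumptions.

```python
def acceptance_ratio(num_mapped, num_arrived):
    """Share of arrived requests that were successfully mapped."""
    return num_mapped / num_arrived if num_arrived else 0.0


def resource_efficiency(node_req, link_req, hops):
    """Revenue-to-cost ratio of one embedded NSR (assumed form).

    node_req: CPU requirement R_N(n) per virtual node
    link_req: bandwidth requirement R_L(l) per virtual link
    hops:     number of physical links hop(l) used by each virtual link
    """
    revenue = sum(node_req.values()) + sum(link_req.values())
    cost = sum(node_req.values()) + sum(link_req[l] * hops[l] for l in link_req)
    return revenue / cost
```

For example, a request with two virtual nodes of 10 CPU units each and one virtual link of 5 bandwidth units has an efficiency of 1 when the link is mapped onto a single physical link, and 25/35 when it spans three physical links.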

3) Run Time (RT)
VNE algorithms experience a trade-off between performance in terms of resource efficiency and run time. ILP-based algorithms can guarantee optimal VNE solutions but suffer from scalability issues and long run times. In contrast, heuristic-based algorithms provide suboptimal but much faster VNE solutions. This trade-off must be considered in the case of real-time network slice deployment. In this work, we evaluate the run time of the proposed ES framework as a metric for real-time network services.

B. EXPERIMENTS SETUP
The proposed ES framework was developed using Python 3.6.7 on Ubuntu 18.04 LTS with an Intel Core i7 CPU and 8 GB of RAM. We used the Barabási-Albert algorithm [38] to generate the topologies of the SN and NSRs. In this paper, evaluations are given for a 50-MCEN SN topology and 3-virtual-node topologies for generic, eMBB, URLLC and mMTC NSRs [28] [39]. Both the processing-unit capacities and the bandwidth units of the physical network are uniformly distributed between 20 and 50, U(20, 50). NSR processing units and bandwidth units are distributed according to the NSR type as follows:

The NSRs of each type arrive following an exponential distribution with arrival rate λ = 1/10, 3/10 or 5/10, and stay for an exponentially distributed lifetime with a mean value of 100 s.

The A2C algorithm comprises an actor network with an input layer, a recurrent neural network (RNN) layer, two dense layers and, finally, a softmax layer. The critic network comprises an input layer, an RNN layer, two dense layers and a single-unit layer. The actor network returns the policy π_θ(s_t, a_t), a probability distribution over the two possible actions (accept or reject) for the incoming NSR. The critic network returns the value V_θ(s_t, θ_v). Both the actor and critic networks have the same input and hidden layers, which are described as follows:
1) The input layer takes the concatenation of the feature vectors ϕ_S(G_S, t) and ϕ_V(G_V, t).
2) The output of the input layer is forwarded to the RNN layer, which is implemented using a long short-term memory (LSTM) layer [40] with 200 nodes and a hyperbolic tangent activation function.
3) Two dense layers with 128 and 64 nodes, respectively, are implemented. Both dense layers use a rectified linear unit (ReLU) activation function.
Then, we set the A2C algorithm parameters as follows: learning rate α = 0.0003, discount rate γ = 0.99 and maximum exploration factor ε_max = 1. Training starts after 400 steps, the mini-batch size is 20 samples, and the actor-critic networks are updated every 150 steps. The hyper-parameters of the A2C algorithm were chosen by adopting greedy layer-wise training [41]. Results are reported with a 95% confidence level over 50 repetitions of the experiments.
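The workload described in this setup can be reproduced with a few lines of standard-library Python. The sketch below assumes that the arrival rate λ governs exponentially distributed inter-arrival times, which is the usual reading of such a description but is an interpretation; the function name `generate_arrivals`, the `horizon` parameter and the fixed seed are illustrative and not taken from the framework.

```python
import random


def generate_arrivals(rate, mean_lifetime, horizon, seed=None):
    """Return (arrival_time, departure_time) pairs for one NSR type.

    rate:          arrival rate lambda (e.g. 1/10, 3/10 or 5/10)
    mean_lifetime: mean of the exponential lifetime (e.g. 100 s)
    horizon:       length of the simulated time window
    """
    rng = random.Random(seed)
    events, t = [], 0.0
    while True:
        # Exponential inter-arrival time with mean 1 / rate.
        t += rng.expovariate(rate)
        if t > horizon:
            break
        # Exponential lifetime with the requested mean.
        lifetime = rng.expovariate(1.0 / mean_lifetime)
        events.append((t, t + lifetime))
    return events
```

Calling `generate_arrivals(3 / 10, 100.0, 1000.0, seed=42)` yields one reproducible trace; the physical capacities could be drawn in the same spirit with `rng.uniform(20, 50)`.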

C. RESULTS
This section presents the results of the proposed ES framework. We first show the performance of the VNE module in stand-alone mode by evaluating the RE and RT of the ILP-based and heuristic algorithms introduced in Sections III-D1 and III-D2. Subsequently, we evaluate the performance of the entire ES framework, composed of the AC and VNE modules, the latter equipped with the heuristic-based algorithms. In this evaluation, we compare two AC algorithms: the proposed A2C implementation and a state-of-the-art DRL algorithm called Deep Q-learning (DQN) [42]. We set the DQN algorithm as follows: 1) the policy network is an artificial neural network (ANN) with 2 hidden layers comprising 128 artificial neurons and a softmax activation function at the output layer; 2) the discount factor is 0.99; 3) the learning rate is 0.0005; 4) the exploration fraction is 0.1.

1) VNE module: ILP-based and heuristic-based algorithms comparison
Figure 3 compares the resource efficiency results for the generic, eMBB, mMTC and URLLC NSR types. The number of generated NSRs of each type ranges from 15 to 50. As shown in the figure, the RE of the heuristic algorithms differs only slightly from that of the ILP-based algorithms: −12% for generic, −5.8% for eMBB, −11.3% for mMTC and −8% for URLLC. These results confirm that there is a small difference between the solutions derived from the ILP-based algorithms and those based on heuristics. Moreover, we observe that the algorithms that perform better are those whose objective function focuses on saving link resources (generic, URLLC and mMTC) rather than node resources (eMBB).

Figures 4 and 5 compare the run time (in log scale) of the different types of requests for both the ILP and heuristic algorithms. Using the ILP method, the eMBB request is embedded significantly faster than the other request types, with a mean RT of 15 seconds, compared with an average RT of 493.33 seconds for the other NSR types. This is because the eMBB formulation searches only over a set of candidate nodes (|N_S| = 50) for each virtual node, whereas the other types of requests also take the set of candidate paths into account. This leads to less time consumption when embedding an eMBB request.

On the other hand, using the heuristic algorithms, the embedding of a generic NSR takes more time than the others (approximately 20 s more) because the node mapping and link mapping procedures of Algorithm 3 are the longest in terms of time complexity. Furthermore, the heuristic algorithms can embed all NSR types with a mean time of 7 s, while the ILP-based algorithms take almost 372.48 s. Thus, given the high performance of the heuristic algorithms in terms of both RE and RT, we use them as the VNE algorithms within the VNE module of the proposed ES framework.

2) Proposed ES framework: AC module + VNE module
After assessing the performance of the ILP-based and heuristic-based algorithms, this section shows the performance of the entire ES framework shown in Fig. 1, composed of the AC module, equipped with one of two DRL algorithms (A2C or DQN), and the VNE module, equipped with the heuristic algorithms. Figure 6 shows the acceptance ratio of incoming NSRs with and without the proposed AC module, according to the selected DRL algorithm. Without admission control, every NSR that arrives is passed directly to the VNE module for embedding. Considering Fig. 6, the four labels represent:

The results show that, for increasing values of λ, the number of accepted NSRs decreases. When NSRs are generated with high arrival rates, the physical resources are occupied much faster than when they are generated with lower arrival rates, leading to a higher number of rejected NSRs.
The number of rejected requests indicates the number of requests that the VNE module tried to embed in the SN but failed because of the scarce resources left. This amount is greatly reduced when we enable the AC module based on A2C (37%). The DRL-based algorithm has learned to reject in advance the NSRs that cannot be embedded in the SN. The NSRs rejected by the AC module are indicated by the unfeasible and feasible labels. The unfeasible label identifies NSRs that cannot be embedded when they arrive, and the AC module (A2C) has learned to reject them. In contrast, the NSRs labeled feasible (in blue) could have been embedded in the SN because they are eligible when they arrive; however, the AC module (A2C) "preferred" to reject them to optimize the total RE of the SN. Thanks to the AC module (A2C), our ES framework accepts 15% more NSRs than in the case in which the AC module is disabled. On the other hand, the AC module based on DQN did not show good results. In particular, the agent's policy network did not converge, yielding almost the same results as the case in which the AC module is not enabled.

Table 4 shows a detailed view of the acceptance ratio for each type of NSR with an AC module based on A2C. In this case, the requests labeled unfeasible and feasible in Figure 6 are counted as rejected in Table 4. We can see that embedding eMBB slices is difficult in a dynamic environment where the types of incoming NSRs are unknown. This is because eMBB slices require high computational resources, in terms of CPU units, compared to the other types of slices.

Figure 7 presents the overall resource efficiency once all NSRs are embedded. It can be observed that the resource efficiency decreases as λ increases.
This is because, as λ increases, node resources become scarce, so virtual links are embedded onto paths with a higher number of physical links as non-adjacent physical nodes are selected, resulting in higher costs and lower resource efficiency. Moreover, for all λ values, the overall resource efficiency increases when the AC module is enabled, in both the A2C and DQN cases. However, given the poor performance of the DQN algorithm in accepting slices, its efficiency is lower than that of A2C. Finally, Figure 8 shows the run time with and without the AC module. We measure the time from the arrival of each NSR until its final embedding by the VNE module.

D. COMPLEXITY ANALYSIS
Table 5 summarizes the complexity of each algorithm proposed in this work. Algorithm 1 aims at training a DRL-based agent to maximize the number of slices admitted into the SN. RL is generally known to be highly data intensive, and the size of the state-action space determines the computational complexity and processing time. For instance, for an environment with an N × M state matrix, the computational complexity may be O(NM). If the state matrix becomes too large, the complexity may grow exponentially [43] and the computational cost becomes too high. Besides, DRL-based agents perform action-value approximations through deep learning techniques, such as artificial neural networks. In this case, the computational complexity can be expressed as the product of the number of epochs used to train the agent and the number of NSRs to process: O(Num_epochs × Num_NSRs). Algorithms 3 and 4 have the same computational complexity since both first perform a virtual node mapping and then a virtual link mapping; hence, the complexity is O(N_v + L_v).
Algorithm 5 is more computationally intensive since it calculates all possible paths that can allocate each virtual link and selects the best-fitting path in the SN. The complexity is equal to O(L_v × L_s × N_s).
Algorithm 6 minimizes the end-to-end delay between sources and destinations by reducing the number of physical links used. As such, the complexity is related to the virtual link mapping: O(L_v).

V. CONCLUSION
This work introduced a novel framework to jointly perform AC and VNE of 5G NSs. The AC module is based on the DRL A2C algorithm, which is able to select the most profitable NSRs to accommodate onto the SN. The VNE module is based on a set of heuristic algorithms (validated by ILP mathematical formulations) with the aim of embedding each type of 5G NSR according to its requirements. Our proposed ES framework achieved up to 15% more accepted NSRs with respect to the case in which the AC module is disabled. Moreover, while accepting more requests, our ES framework can increase the resource efficiency and decrease the run time of the admission control and embedding procedure. In future work, we will enrich the VNE module with algorithms based on DRL. With this aim, we plan to build a comprehensive ES framework composed of DRL-based agents, both for AC and VNE, to further improve the acceptance ratio and resource efficiency in the 5G metro-core network.