Deep Reinforcement Learning-based Network Slicing for Beyond 5G

With the advent of the 5G era, network slicing has received a great deal of attention as a means to support a variety of wireless services in a flexible manner. Network slicing is a technique that divides a single physical resource network into multiple slices supporting independent services. In beyond 5G (B5G) systems, the main goal of network slicing is to assign the physical resource blocks (RBs) such that the quality of service (QoS) requirements of eMBB, URLLC, and mMTC services are satisfied. Since the goals of the service categories are clearly distinct and the computational burden caused by the increased number of time slots is huge, it is in general very difficult to assign each RB to a service properly. In this paper, we propose a deep reinforcement learning (DRL)-based network slicing technique that finds the resource allocation policy maximizing the long-term throughput while satisfying the QoS requirements in B5G systems. A key ingredient of the proposed technique is action elimination, which removes undesirable actions that cannot satisfy the QoS requirements. Numerical results demonstrate that the proposed technique is effective in maximizing the long-term throughput and handling the coexistence of use cases in B5G environments.


I. INTRODUCTION
The 5G network is envisioned to be a multi-service network supporting a wide array of services such as enhanced mobile broadband (eMBB), ultra-reliable and low-latency communications (URLLC), and massive machine-type communications (mMTC) [1], [2]. As a means to accommodate a variety of services in a flexible manner, network slicing, a concept to divide a single physical network into multiple (logically) isolated networks, has received a great deal of attention in recent years [3]. Since a wireless resource block (RB) is divided into multiple slices supporting independent services, a malfunction of a certain slice will not affect other services, thereby ensuring the stability of the entire system. Network slicing can also reduce the maintenance and operation costs significantly since diverse services can be supported by the common physical infrastructure [4].
In beyond 5G (B5G) and 6G systems, the main goal of network slicing is to assign the physical resource blocks such that the quality of service (QoS) requirements of eMBB, URLLC, and mMTC services are satisfied simultaneously.
To do so, the coexistence of eMBB, URLLC, and mMTC needs to be handled properly by the base station (BS) and the 5G core network (i.e., access and mobility management function (AMF), session management function (SMF), and user plane function (UPF)). However, it is in general very difficult to satisfy these diverse service requirements at the same time since the goals of the service categories are clearly distinct. For example, to meet the latency requirement of the URLLC service, the BS should transmit the URLLC packet immediately even in the middle of an eMBB or mMTC transmission. In such a case, the reception quality of eMBB and mMTC services will be degraded severely due to the abrupt increase in interference.
In order to maximize the system throughput while satisfying the QoS requirements, various network slicing techniques have been suggested in recent years. In [5], a technique that allocates an equal number of RBs to each network slice has been proposed. In [4], a regression tree-based technique that assigns a different bandwidth to each network slice based on the service requirement has been suggested. While these approaches are useful for handling network slicing in a given time slot, their efficacy would be severely degraded in dynamically changing wireless environments. This is mainly because determining the allocation of an RB to a certain service is basically a binary integer program [6], so that one has to deal with an exponential increase in computational complexity. For example, when one considers 20 network slices, one needs to check $2^{20} \approx 10^{6}$ possible resource allocation decisions at every time slot. For this reason, a network slicing technique that can effectively control the allocation of slices over a long-term period is of great importance for the success of network slicing in B5G and 6G cellular systems.
The aim of this paper is to propose a novel network slicing technique, referred to as DRL-based network slicing (DRL-NS), to improve the system throughput while supporting various services (e.g., eMBB, URLLC, mMTC). In our study, we employ deep reinforcement learning (DRL), an efficient learning-based technique specialized for solving sequential decision-making problems, as a baseline. In DRL, an agent finds a series of actions maximizing a long-term cumulative reward among large-scale state-action pairs [7]. In our work, we use DRL to let the gNB (agent) observe the channel state, data rate constraints, and delay requirements (states), and then assign each resource block (RB) to a certain service (action) to maximize the overall system throughput (reward).
In the network slicing problem, the size of the action space (i.e., the number of possible combinations of actions) is huge since it grows with the number of sequential network slicing decisions. To be specific, when deciding whether the RB is allocated or not for a certain service, the number of possible choices increases exponentially with the number of time slots. Due to this immense action space, the gNB is likely to explore undesirable actions (e.g., resource allocation decisions that cannot satisfy the data rate constraints or delay requirements) during the training phase of DRL, thereby slowing down the convergence and also preventing the DRL from maximizing the reward. To address the problem, we employ action elimination [8], an approach that excludes undesirable actions from the set of all possible actions to improve the training speed and the quality of the trained policy. Due to the elimination of meaningless actions, after a reasonable number of training episodes, the trained gNB generates network slicing strategies (i.e., resource allocation decisions) that improve the system throughput and satisfy the QoS constraints.
The main contributions of this paper are as follows:
• We propose a DRL-based network slicing (DRL-NS) technique that finds the resource allocation policy maximizing the long-term throughput while satisfying the QoS requirements of eMBB, URLLC, and mMTC services simultaneously in dynamically varying 5G environments.
• We integrate action elimination into DRL to eliminate undesirable actions that cannot satisfy the QoS requirements. In doing so, the exploration of the DRL agent is directed toward the desirable actions, improving the chance of making the optimal resource allocation decision and the convergence speed of the DRL training.
• We provide empirical simulation results from which we demonstrate the superiority of DRL-NS over the conventional approaches. For example, DRL-NS achieves about 25% and 15% improvements in the throughput performance over the equal allocation and regression tree-based network slicing techniques, respectively. Even when compared to the vanilla DQN-based network slicing technique, DRL-NS achieves around 10% improvement in the throughput performance.
The rest of this paper is organized as follows. In Section II, we present the related works on network slicing. In Section III, we discuss the system model and explain the network slicing problem. In Section IV, we provide a detailed description of the proposed DRL-NS technique. In Section V, we present the simulation results to verify the performance gain of the proposed technique, and we conclude the paper in Section VI. Since terminologies and major notations might be unfamiliar to the reader, we summarize the technical terms in Table I.

II. RELATED WORKS
In this section, we provide a brief review of the state-of-the-art network slicing techniques. Over the years, various efforts have been made to provide network slicing satisfying the B5G service requirements. For example, in [12], a proportional fair-based network slicing technique pursuing a balance between the throughput and the QoS of the user has been proposed.
One popular approach to optimally serve multiple network slices over the common physical network is deep learning (DL), a data-driven learning approach. Recently, due to its ability to provide fast and accurate prediction and decision making, DL has shown great promise in many practical applications [6], [9]-[11]. In fact, since DL is effective in extracting a policy from environments, it can be readily used for decision-making problems such as resource management and scheduling. Recently, DL has been applied to many network slicing problems to come up with a well-informed slicing decision using available physical resources. For instance, in [13], a DL-based technique that predicts the network load on each network slice and then allocates slices based on incoming traffic has been proposed. In [14] and [15], RNN and LSTM-based network slicing techniques that analyze the overall traffic pattern from the sequential traffic data and then allocate slices based on the traffic prediction have been proposed.
Recently, various DRL-based network slicing techniques have been proposed to perform the robust network slicing in the dynamically changing 5G environments [16], [17]. For example, in [17], a DRL-based network slicing technique that controls the large-scale resource allocation has been proposed. To avoid the exploration of undesirable actions that fail to satisfy the QoS requirements, authors in [17] suggested dueling DQN, a DRL technique that identifies the desirable action by explicitly dividing the Q-value function into the state-value and state-dependent action advantage functions.
Our approach is distinct from previous studies in the sense that we integrate the action elimination to DRL to reduce the size of action space. While an agent in the conventional DRL-based network slicing techniques is likely to explore meaningless actions (e.g., actions violating the QoS requirements) due to the immense action space, we avoid exploration of such actions by choosing only the meaningful actions. To be specific, using the action elimination, we identify undesirable actions (set of resource allocation decisions violating rate and delay requirements) and then exclude them from the action space. In doing so, exploration of DRL agent is directed toward desirable actions, improving the chance of making the optimal resource allocation decision maximizing the system throughput and ensuring the QoS requirement satisfaction.

III. SYSTEM MODEL AND NETWORK SLICING PROBLEM
In this section, we explain a network slice model in downlink transmission scenario and specify three types of slices: eMBB, URLLC, mMTC. Also, we formulate the network slicing as a constrained optimization problem.

A. SYSTEM MODEL
In our work, we consider a downlink transmission scenario where $M$ BSs serve $K$ UEs randomly located in the coverage area of the BSs. We denote the set of UEs as $\mathcal{K} = \{1, \cdots, K\}$. The BSs (a.k.a. radio units or gNBs) are connected to a digital unit (DU) to share the channel state information (CSI) between the BSs and UEs (see Fig. 1). The physical network is divided into $N$ slices where each RB can be assigned to an eMBB, URLLC, or mMTC network slice. The sets of eMBB, URLLC, and mMTC slices are denoted as $\mathcal{I}$, $\mathcal{J}$, and $\mathcal{L}$, and the total numbers of eMBB, URLLC, and mMTC network slices are $I$, $J$, and $L$, respectively ($I + J + L = N$). In order to indicate the allocation of the eMBB, URLLC, and mMTC slices in the resource block grid, we use three binary vectors $\alpha_e \in \mathbb{R}^{I}$, $\alpha_u \in \mathbb{R}^{J}$, and $\alpha_m \in \mathbb{R}^{L}$, where an entry equals 1 if the corresponding slice is allocated and 0 otherwise.
As a channel model, we consider the Rayleigh fading channel model, one of the most widely used channel models in wireless communication. Note that since the proposed DRL-based network slicing technique is a data-driven learning algorithm exploiting the explicit channel coefficient information, its performance might not be affected much by the channel model variation. Specifically, the downlink channel coefficient $h_{m,k} \in \mathbb{C}$ between BS $m$ and UE $k$ is determined by the large-scale fading coefficient $\beta_{m,k}$ and the small-scale fading coefficient $g_{m,k} \sim \mathcal{CN}(0, 1)$. In this setup, the data rate $R_k$ of UE $k$ is determined by the channel coefficients $h_{m,k}$, the downlink precoding coefficients $w_{m,k}$ from BS $m$ to UE $k$, and the noise power $\sigma_k^2$. In our work, we consider OFDM systems to avoid the interference caused by the adjacent sub-channels. We also assume that the BSs employ different frequency bands to minimize the inter-cell interference.

B. NETWORK SLICING PROBLEM
The main goal of network slicing is to maximize the overall system throughput while fulfilling the QoS requirements of various network slices. To achieve the goal, we need to consider the three major components in the system throughput: 1) throughput of eMBB slices ($T_{\mathrm{eMBB}}$), 2) throughput of URLLC slices ($T_{\mathrm{URLLC}}$), and 3) throughput of mMTC slices ($T_{\mathrm{mMTC}}$).

1) Throughput of eMBB slice
When a UE sends a request for the eMBB slice to the mobile network operator, the corresponding throughput $T_{\mathrm{eMBB},k}$ of UE $k$ depends on $f_{i,k}$, the resource bandwidth allocated to UE $k$ in the $i$-th eMBB slice. Note that $T_{\mathrm{eMBB},k}$ should be larger than the rate requirement of the UE. The corresponding sum throughput of the eMBB network slice is $T_{\mathrm{eMBB}} = \sum_{k=1}^{K} T_{\mathrm{eMBB},k}$.

2) Throughput of URLLC slice
The throughput $T_{\mathrm{URLLC},k}$ of the URLLC slice of UE $k$ depends on $f_{j,k}$, the resource bandwidth allocated to UE $k$ in the $j$-th URLLC slice. In our work, we assume that one data packet should be completely transmitted within one frame in URLLC. That is, the frame duration should be less than or equal to the maximum packet delay [18]. Here, $F_{j,k}$ is the packet length to UE $k$ in the $j$-th URLLC network slice and $D_{j,k,\max}$ is the maximum packet delay of UE $k$ in the $j$-th URLLC network slice. The corresponding sum throughput of the URLLC slice is $T_{\mathrm{URLLC}} = \sum_{k=1}^{K} T_{\mathrm{URLLC},k}$.

3) Throughput of mMTC slice
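Similar to the eMBB and URLLC slices, the throughput $T_{\mathrm{mMTC},k}$ of the mMTC slice of UE $k$ depends on $f_{l,k}$, the resource bandwidth allocated to UE $k$ in the $l$-th mMTC slice. Note that, in contrast to eMBB and URLLC slices, the mMTC slice has no rate/latency requirement. The corresponding sum throughput of the mMTC slice is $T_{\mathrm{mMTC}} = \sum_{k=1}^{K} T_{\mathrm{mMTC},k}$. In summary, the total network throughput at time slot $t$ is given by $T^{(t)}_{\mathrm{total}} = T^{(t)}_{\mathrm{eMBB}} + T^{(t)}_{\mathrm{URLLC}} + T^{(t)}_{\mathrm{mMTC}}$. The problem $\mathcal{P}$ of maximizing the overall system throughput over $T$ time slots is formulated over the allocation vectors $\alpha_e^{(t)}$, $\alpha_u^{(t)}$, and $\alpha_m^{(t)}$ at each time slot $t \in \mathcal{T}$, subject to the rate and delay requirements above. Here, $\mathcal{T} = \{1, \cdots, T\}$ and $0 \le \tau \le 1$ is the ratio of utilized network slices. Note that the remaining $N - \tau N$ slices are reserved for backup.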
By plugging (3), (5), and (7) into $\mathcal{P}$, we obtain the re-expressed problem $\mathcal{P}'$, which is a mixed-integer program known to be nonconvex and NP-hard [6]. When we try to solve $\mathcal{P}'$ using a conventional analytic approach (e.g., a combinatorial search algorithm), the computational burden would be unacceptably high. For instance, if the numbers of resource blocks and time slots are both 10, then one should search over $2^{10 \times 10} \approx 10^{30}$ decision choices to find the optimal resource allocation decision. To make things worse, this kind of analytic approach has a causality issue since the future-oriented resource allocation decisions require the channel information of future time slots.
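To make the size of this search space concrete, the following minimal sketch counts the binary allocation patterns for the two cases quoted in the text. The function name `num_allocation_decisions` is an illustrative assumption and not part of the original formulation.

```python
def num_allocation_decisions(num_blocks: int, num_slots: int = 1) -> int:
    """Count binary allocate/do-not-allocate patterns over blocks and time slots."""
    return 2 ** (num_blocks * num_slots)


# 20 network slices, a single time slot (the example from the introduction)
per_slot = num_allocation_decisions(20)
# 10 resource blocks over 10 time slots (the example above)
joint = num_allocation_decisions(10, 10)

print(per_slot)        # 1048576, roughly 10^6
print(f"{joint:.3e}")  # 1.268e+30, roughly 10^30
```

Adding a single time slot multiplies the count by $2^{10}$ in this setting, which is why exhaustive search does not scale.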

IV. DRL-NS
The primary goal of DRL-NS is to learn the proper network slicing strategy maximizing the system throughput. To achieve this goal, the proposed scheme exploits DRL framework in the resource allocation decision. DRL is a DL technique that finds out the optimal policy for the sequential decision making through the interaction with the environment. Specifically, based on the input information (e.g., CSI, the required UE data rates), DNN in the DRL agent (i.e., deep Q-network (DQN)) learns the complicated relationship between the resource allocation decision and the long-term system throughput.
Since the DRL agent learns the policy through trial and error, the performance of DRL depends on the exploration process of the action space. In our case, due to the immense action space (e.g., $2^{10} \approx 1000$ resource allocation decisions when we consider 10 network slices), the DRL agent needs to explore too many undesirable actions (e.g., resource allocation decisions that cannot satisfy the UEs' requirements of eMBB and URLLC slices). This can severely diminish the sample efficiency due to the lack of useful training data, and thus the trained policy might not be optimal. In our test experiments, we observe that more than 80% of allocation decisions made by the trained DRL could not satisfy the requirements of eMBB and URLLC slices.
To overcome this problem, we exploit the action elimination method that identifies the resource allocation decisions violating the QoS requirements (i.e., data rate and delay requirements) and then eliminates them from the action space. In doing so, we can reduce the action space and direct the exploration of the DRL agent toward the desirable actions, improving the chance of obtaining the optimal resource allocation strategy. Two key ingredients in the proposed action elimination method are 1) the URLLC test, which checks whether each resource allocation decision can meet the delay requirements of URLLC, and 2) the eMBB test, which checks whether each resource allocation decision that passes the URLLC test can satisfy the data rate requirements of eMBB. By choosing allocation decisions that pass both tests, we can dramatically reduce the action space and also improve the training speed.
In the following subsections, we briefly review the basics of RL and then discuss the state space, action space, and reward function in DRL-NS as well as eMBB and URLLC tests reducing the action space. Finally, we illustrate the training process of DRL-NS and analyze its computational complexity.

A. BASICS OF DEEP REINFORCEMENT LEARNING
In this subsection, we briefly introduce the basics of DRL. Reinforcement learning (RL) is a goal-oriented algorithm that learns how to solve a task through trial and error. The key ingredients of RL are the agent, environment, state, action, and reward [6]. The mission of an agent is to learn the optimal policy through interactions with the environment. In the learning process, an agent observes the current state $s_t$, takes an action $a_t$, and then the environment returns the next state $s_{t+1}$ and the immediate reward $r_t$ to the agent as feedback. The optimal policy $\pi^*$ maximizing the expectation of the cumulative reward is [19]
$$\pi^* = \arg\max_{\pi} \mathbb{E}\left[\sum_{t} \gamma^{t} r_t\right],$$
where $\gamma$ is a discount factor ($0 < \gamma < 1$) that gives less weight to future rewards.
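As a small illustration of how the discount factor weights future rewards, the sketch below accumulates a finite reward sequence backwards. The helper name `discounted_return` and the sample rewards are assumptions made for this example only.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma^t * r_t for a finite reward sequence r_0, r_1, ..."""
    total = 0.0
    # Work backwards so each step applies one more factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
```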
In order to find $\pi^*$, the action-value function $Q^{\pi}(s, a)$, which represents the expected cumulative reward obtained when carrying out the policy $\pi$, is exploited:
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t} \gamma^{t} r_t \,\middle|\, s_0 = s, a_0 = a\right].$$
Since $Q^{\pi}(s, a)$ indicates the expected cumulative reward for taking action $a$ in state $s$, the optimal policy can be obtained by selecting the action maximizing $Q^{\pi}(s, a)$. To do so, the action-value function should be available for all possible state-action pairs. To find the optimal Q-function $Q^*(s, a)$, the Bellman equation for $Q^*(s, a)$ is used [19]:
$$Q^*(s, a) = r(s, a) + \gamma \sum_{s'} P^{a}_{ss'} \max_{a'} Q^*(s', a'),$$
where $r(s, a)$ is the reward corresponding to the state-action pair $(s, a)$ and $P^{a}_{ss'}$ is the transition probability. To reduce the burden of computing and comparing the Q-value for every state and action, the deep Q-network (DQN), a DNN-based function approximator that estimates the Q-function (i.e., $Q^*(s, a) \approx Q(s, a, w)$), has been popularly used [6]. Basically, the weight $w$ of the DQN is updated to minimize the loss function given by $L(w) = (Y^{\mathrm{dqn}}_t - Q(s, a, w))^2$ where $Y^{\mathrm{dqn}}_t = r(s, a) + \gamma \max_{a' \in \mathcal{A}} Q(s', a', w)$.
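The DQN target and squared loss can be sketched for a single transition as follows. This is a minimal illustration in which the Q-values are plain numbers standing in for the outputs of the network $Q(\cdot, \cdot, w)$; the function names and sample values are assumptions.

```python
from typing import Sequence


def dqn_target(reward: float, next_q_values: Sequence[float], gamma: float) -> float:
    """Y = r(s, a) + gamma * max_a' Q(s', a', w)."""
    return reward + gamma * max(next_q_values)


def dqn_loss(q_sa: float, target: float) -> float:
    """Squared temporal-difference error (Y - Q(s, a, w))^2."""
    return (target - q_sa) ** 2


# Hypothetical numbers: reward 1.0, two candidate next actions, gamma = 0.9
y = dqn_target(1.0, [0.5, 2.0], 0.9)  # 1.0 + 0.9 * 2.0
loss = dqn_loss(2.0, y)               # (2.8 - 2.0)^2
print(round(y, 6), round(loss, 6))    # 2.8 0.64
```

In an actual DQN the loss is averaged over a mini-batch and differentiated with respect to $w$; the scalar version above only shows how the target is assembled.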

B. THE DRL-BASED RESOURCE ALLOCATION MODEL
In this subsection, we discuss the state space, action space, and reward function of the proposed scheme. In the proposed scheme, a whole network consisting of M BSs and K UEs is considered as an environment and the gNB is serving as an agent (see Fig. 2). In this setting, we can define the state space, action space, and reward function.

1) State space
The state contains essential information in the environment used for the policy learning. In the proposed DRL framework, the state of the environment observed by the agent consists of several parts: the channel matrices $H^{(t)}$ and $H^{(t-1)}$, the minimum rate constraints of the UEs $R^{(t)}_{\min}$, the maximum delay constraints of the UEs at time slot $t$, expressed as a $J \times K$ matrix $D^{(t)}_{\max}$, and the previous slice allocation $\alpha^{(t-1)}$. In summary, the state can be expressed as $s_t = [\mathrm{vec}(H^{(t)}), \mathrm{vec}(H^{(t-1)}), R^{(t)}_{\min}, \mathrm{vec}(D^{(t)}_{\max}), \alpha^{(t-1)}]^T$. Note that both $H^{(t)}$ and $H^{(t-1)}$ are included in $s_t$ so that the DNN in the DRL agent can extract the temporally correlated features of the channel information. By exploiting the extracted features among $H^{(t)}$, $R^{(t)}$, $D^{(t)}$, and $\alpha^{(t-1)}$, the DRL agent learns the resource allocation policy maximizing the system throughput while satisfying the QoS requirements.
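The assembly of the state vector can be sketched as a simple concatenation whose length matches the dimension $2MK + K + JK + N$ used in the complexity analysis. The sketch assumes real-valued channel entries (for example, channel gains), one per BS-UE pair, and the helper name `build_state` is hypothetical.

```python
from typing import List, Sequence


def build_state(
    h_curr: Sequence[Sequence[float]],  # M x K channel entries at slot t (assumed real)
    h_prev: Sequence[Sequence[float]],  # M x K channel entries at slot t-1
    r_min: Sequence[float],             # K minimum rate constraints
    d_max: Sequence[Sequence[float]],   # J x K maximum delay constraints
    alpha_prev: Sequence[int],          # N previous allocation bits
) -> List[float]:
    """Flatten and concatenate the state components in the order used in the text."""

    def vec(matrix: Sequence[Sequence[float]]) -> List[float]:
        return [float(x) for row in matrix for x in row]

    return (vec(h_curr) + vec(h_prev) + [float(r) for r in r_min]
            + vec(d_max) + [float(a) for a in alpha_prev])


# Toy sizes: M = 2 BSs, K = 3 UEs, J = 2 URLLC slices, N = 4 slices
M, K, J, N = 2, 3, 2, 4
state = build_state(
    [[0.1] * K for _ in range(M)],
    [[0.2] * K for _ in range(M)],
    [15.0] * K,
    [[1e-3] * K for _ in range(J)],
    [1, 0, 1, 0],
)
print(len(state))  # 25 = 2*M*K + K + J*K + N
```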

2) Action space
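An action $a_t$ is defined as the slice allocation $\alpha^{(t)} = [\alpha_e^{(t)}, \alpha_u^{(t)}, \alpha_m^{(t)}]$, where $\alpha_e^{(t)}$, $\alpha_u^{(t)}$, and $\alpha_m^{(t)}$ indicate the allocation of the eMBB, URLLC, and mMTC slices in the resource block grid, respectively. If we denote the set of possible actions as $\mathcal{A}$, then the size of $\mathcal{A}$ (i.e., the number of possible actions) is $2^N$. For example, if there are 25 network slices, the size of $\mathcal{A}$ is $2^{25}$, which is clearly too large to explore. To deal with this problem, we reduce the action space using the URLLC and eMBB tests (we will say more in the next subsection).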
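For small $N$, the full action set can be enumerated explicitly, which makes the $2^N$ growth easy to see. The following sketch uses the standard library only; the function name `enumerate_actions` is an illustrative assumption.

```python
from itertools import product
from typing import List, Tuple


def enumerate_actions(num_slices: int) -> List[Tuple[int, ...]]:
    """List every binary allocation vector of length num_slices."""
    return list(product((0, 1), repeat=num_slices))


actions = enumerate_actions(4)
print(len(actions))  # 16 = 2^4
print(actions[:3])   # [(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 0)]
print(2 ** 25)       # 33554432 actions for 25 slices, too many to list in practice
```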

3) Reward
When the action of a time slot is decided, the gNB measures the system throughput. Since our goal is to learn the resource allocation policy maximizing the system throughput, we set the reward as the sum of the throughputs of all network slices. That is, $r_t = T^{(t)}_{\mathrm{total}} = T^{(t)}_{\mathrm{eMBB}} + T^{(t)}_{\mathrm{URLLC}} + T^{(t)}_{\mathrm{mMTC}}$, where $T_{\mathrm{eMBB}}$, $T_{\mathrm{URLLC}}$, and $T_{\mathrm{mMTC}}$ are the throughputs of eMBB, URLLC, and mMTC slices, respectively. Since all components of $T_{\mathrm{total}}$ are functions of the slice allocation decisions, the reward maximization problem is equivalent to the problem of determining the allocation of slices maximizing $T_{\mathrm{total}}$.

C. ACTION SPACE REDUCTION VIA ACTION ELIMINATION
The main purpose of the action elimination method is to reduce the action space by eliminating the undesirable allocation decisions. As a result, we can improve the chance of making the optimal resource allocation decision maximizing the system throughput and also ensure that the learned resource allocation policy satisfies the QoS requirements of URLLC and eMBB services. One drawback of the proposed method is that the computational burden of obtaining the reduced action space during training is considerable. For example, if we consider 10 network slices, then the action elimination method needs to check $2^{10} \approx 1000$ allocation decisions to identify which decisions violate the delay and data rate requirements. To speed up the action elimination process, we can consider parallel processing. Since the individual allocation decisions can be checked against the QoS requirements simultaneously, parallel processing reduces the computational burden and also speeds up the DRL training.
The proposed action elimination method consists of two tests: 1) URLLC test to remove the resource allocation decisions that cannot satisfy the delay requirements of URLLC slices and 2) eMBB test to remove the infeasible allocation decisions that cannot satisfy the rate requirement of eMBB slices (see Fig. 3).
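The two-stage filtering can be sketched as follows. This is only a toy illustration: the throughput model (a fixed rate contributed by each allocated slice) and all numeric parameters are hypothetical stand-ins, not the rate expressions or settings of the paper. What the sketch does follow is the order of the tests, with the eMBB test applied only to decisions that survive the URLLC test.

```python
from itertools import product
from typing import List, Tuple

Action = Tuple[int, int, int, int]  # (u1, u2, e1, e2): two URLLC bits, two eMBB bits

# Hypothetical toy parameters (assumptions for illustration only)
URLLC_RATE_PER_SLICE = 50_000.0  # bits/s contributed by one allocated URLLC slice
EMBB_RATE_PER_SLICE = 10.0       # rate units contributed by one allocated eMBB slice
PACKET_BITS = 100.0              # packet length F
D_MAX = 1.5e-3                   # maximum packet delay in seconds
R_MIN = 10.0                     # minimum eMBB rate requirement


def urllc_test(action: Action) -> bool:
    """Pass if the packet is delivered within the maximum delay."""
    throughput = URLLC_RATE_PER_SLICE * (action[0] + action[1])
    return throughput > 0 and PACKET_BITS / throughput <= D_MAX


def embb_test(action: Action) -> bool:
    """Pass if the eMBB throughput meets the minimum rate requirement."""
    return EMBB_RATE_PER_SLICE * (action[2] + action[3]) >= R_MIN


def eliminate_actions(actions: List[Action]) -> Tuple[List[Action], List[Action]]:
    """Return (A_D, A_F): survivors of the URLLC test, then of the eMBB test."""
    a_d = [a for a in actions if urllc_test(a)]
    a_f = [a for a in a_d if embb_test(a)]
    return a_d, a_f


all_actions = list(product((0, 1), repeat=4))
a_d, a_f = eliminate_actions(all_actions)
print(len(all_actions), len(a_d), len(a_f))  # 16 4 3
```

Because each decision is checked independently, the two list comprehensions could be distributed over parallel workers, in line with the parallel processing remark above.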

1) URLLC test
In this test, we first exclude the infeasible decisions that do not meet the delay requirement of users in the URLLC slice from the action space. Let $\mathcal{A} = \{\alpha_1, \cdots, \alpha_{2^N}\}$ be the set of all possible resource allocation decisions. To determine whether each allocation decision $\alpha \in \mathcal{A}$ is infeasible, we measure the throughput of the URLLC slice $T^{(t)}_{\mathrm{URLLC}}(s, \alpha)$ for each $\alpha$, where $\alpha = \{\alpha_u(j) \in \{0, 1\} \mid j = 1, \cdots, J\}$ is a single slice allocation decision. When we obtain $T^{(t)}_{\mathrm{URLLC}}(s, \alpha)$ for each $\alpha$, we eliminate the allocation decisions that do not satisfy the delay requirement of the URLLC slice from $\mathcal{A}$ to obtain the desirable action space $\mathcal{A}_D$. The set $\mathcal{A}_D$ thus contains the decisions for which the packet of length $F_{j,k}$ to UE $k$ in slice $j$ is delivered within the maximum packet delay $D_{j,k,\max}$.

Algorithm 1: Training of DRL-NS
while the maximum training iteration number is not reached do
    for t = 1, ..., T do
        Exclude resource allocation decisions through the URLLC test.
        Exclude resource allocation decisions through the eMBB test.
        Obtain $\alpha^{(t)}$ by DQN based on the reduced action space $\mathcal{A}_F$.
        Compute reward $r_t$ and observe the next state $s_{t+1}$.
        Store the transition $(s_t, a_t, r_t, s_{t+1})$ into replay memory $\mathcal{R}$.
    end for
    Randomly sample a mini-batch of transitions $(s_i, a_i, r_i, s_{i+1})$ with a size of $N_R$.
    Update the DQN weights $w$ to minimize the loss $L(w)$ on the mini-batch.
    t = t + 1
end while

2) eMBB test
In this test, we check whether each resource allocation decision $\alpha \in \mathcal{A}_D$ can satisfy the rate requirement for the eMBB slice. Specifically, to determine whether $\alpha$ is appropriate to accommodate the eMBB slice, we first measure the throughput of the eMBB slice $T^{(t)}_{\mathrm{eMBB}}(s, \alpha)$. Next, similar to the URLLC test discussed in the previous subsection, we obtain $T^{(t)}_{\mathrm{eMBB}}(s, \alpha)$ for each $\alpha \in \mathcal{A}_D$ and then eliminate the allocation decisions that do not satisfy the rate requirement of users in the eMBB slice. The decisions that remain form the final desirable action space $\mathcal{A}_F \subseteq \mathcal{A}_D$.

D. TRAINING OF DRL-NS
An integral part of the proposed DRL-NS is the training process optimizing the set of network parameters $w$. In the training phase, the network parameters are updated to minimize the DQN loss function $L(w) = (r(s, a) + \gamma \max_{a' \in \mathcal{A}} Q(s', a', w) - Q(s, a, w))^2$. When the loss function is differentiable, which is true in our case, one can use the stochastic gradient descent (SGD) method to update the parameters. The update operation of SGD is expressed as $w \leftarrow w - \eta \nabla_w L(w)$, where $\eta$ is the learning rate of the DQN. After obtaining the final desirable action space $\mathcal{A}_F$, the DQN agent calculates the Q-values of all actions in $\mathcal{A}_F$ and chooses the action with the maximum Q-value as the output action $a_t$. Then, the agent computes the immediate reward $r_t$ as in (23) and the next state $s_{t+1}$ through the updated channel matrix $H$, the rate and delay constraints (i.e., $R$ and $D$), and the chosen action $a_t$. In each time slot, the transition tuple $(s_t, a_t, r_t, s_{t+1})$ observed by the agent is stored in the replay memory. In each iteration of the training phase, a mini-batch of data is randomly sampled from the replay memory and the weights of the DQN are updated in the direction that minimizes the loss $L(w)$. The overall training procedure is summarized in Algorithm 1.
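Two pieces of this procedure are easy to isolate in code: choosing the action with the maximum Q-value among the feasible set only, and storing and sampling transitions from the replay memory. The sketch below is a simplified illustration in which a dictionary of Q-values stands in for the DQN output; the names `greedy_feasible_action` and `ReplayMemory`, the capacity, and the seed are assumptions.

```python
import random
from collections import deque
from typing import Dict, Hashable, List, Sequence, Tuple


def greedy_feasible_action(q_values: Dict[Hashable, float],
                           feasible: Sequence[Hashable]) -> Hashable:
    """Pick the action with the largest Q-value among the feasible set only."""
    return max(feasible, key=lambda action: q_values[action])


class ReplayMemory:
    """Fixed-capacity store of (s, a, r, s_next) transitions."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self.buffer: deque = deque(maxlen=capacity)  # oldest entries are dropped first
        self.rng = random.Random(seed)

    def store(self, transition: Tuple) -> None:
        self.buffer.append(transition)

    def sample(self, batch_size: int) -> List[Tuple]:
        """Draw a random mini-batch without replacement."""
        return self.rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))


# Hypothetical Q-values: (1, 0) scores highest overall but is not feasible.
q = {(0, 1): 1.0, (1, 0): 5.0, (1, 1): 3.0}
chosen = greedy_feasible_action(q, feasible=[(0, 1), (1, 1)])
print(chosen)  # (1, 1)

memory = ReplayMemory(capacity=4)
for step in range(6):
    memory.store((f"s{step}", chosen, float(step), f"s{step + 1}"))
batch = memory.sample(3)
print(len(memory.buffer), len(batch))  # 4 3
```

Restricting the maximization to the feasible set means an action that violates a QoS requirement is never selected, regardless of its estimated Q-value.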

E. COMPUTATIONAL COMPLEXITY ANALYSIS
In this subsection, we analyze the computational complexity of the DRL-NS technique in terms of the number of floating point operations (flops). Initially, the state $s_t = [\mathrm{vec}(H^{(t)}), \mathrm{vec}(H^{(t-1)}), R^{(t)}_{\min}, \mathrm{vec}(D^{(t)}_{\max}), \alpha^{(t-1)}]^T \in \mathbb{R}^{2MK+K+JK+N}$ is fed into the first hidden layer of the DQN, where it is multiplied by the weight $W_{i1} \in \mathbb{R}^{\alpha \times (2MK+K+JK+N)}$ and the bias $b_{i1} \in \mathbb{R}^{\alpha}$ is added. Then, for each of the $\alpha$ output elements, we check whether the value is larger than 0 using the rectified linear unit (ReLU) function. Next, since both the input and output dimensions of the remaining $h - 1$ hidden layers are $\alpha$, the computational complexity of the remaining hidden layers grows with $(h - 1)\alpha^2$. After passing through the $h$ hidden layers, the weight multiplication and bias addition are performed in the output FC layer of the DQN, with $W_{iz} \in \mathbb{R}^{(2MK+K+JK+N) \times \alpha}$ and $b_{iz} \in \mathbb{R}^{2MK+K+JK+N}$, which determines the complexity $C_{\mathrm{out}}$ of the output layer. Note that the ReLU function is not applied in the output layer.
Lastly, we need to consider the complexity of the action elimination method since it occupies the largest part of the overall complexity in the training process. Basically, in the proposed action elimination method, we need to check $2^N$ resource allocation decisions for each UE in eMBB slices to identify which decisions violate the data rate requirements. Also, we need to check $2^N$ allocation decisions for each URLLC slice and each UE in URLLC slices to find out which decisions violate the delay requirements. Thus, the complexity $C_{\mathrm{AE}}$ of the proposed action elimination method grows in proportion to $2^N$. Let $\epsilon$ be the maximum training iteration number of the DQN. The overall complexity $C_{\mathrm{DRL\text{-}NS}}$ of DRL-NS during the training phase then combines the DQN and action elimination complexities over the $\epsilon$ iterations and $T$ time slots, where $T$ is the total number of time slots. Once DRL-NS is trained, the network parameters (weights and biases) of the DQN are no longer updated so that $\epsilon$ and $T$ can be eliminated from (28). Also, the action elimination method is no longer required so that $C_{\mathrm{AE}}$ can be eliminated. Therefore, the overall complexity of DRL-NS in the test phase consists only of the forward pass through the DQN. Since $\alpha < (2MK + K + JK + N)$ and $h$ is a small constant, the computational complexity of DRL-NS in the test phase can be expressed as $O((MK + K + JK + N)\alpha)$.

V. SIMULATION RESULTS

A. SIMULATION SETUP
In this section, we present simulation results that demonstrate the benefits of DRL-NS. In our simulation, we consider a downlink transmission scenario where $M$ gNBs simultaneously serve $K$ UEs. The small cells are uniformly distributed in the hexagonal area with an inter-site distance (ISD) of 200 m and the UEs move freely at a constant speed $\upsilon \sim \mathrm{Unif}[\upsilon_{\min}, \upsilon_{\max}]$, where $\upsilon_{\max}$ and $\upsilon_{\min}$ are the maximum and minimum speeds of the UE. A UE changes its velocity when it reaches the edge of the service area. In our work, we set the mobile height $h_U$ to the average adult height (165 cm). Note that $h_U$ mainly affects the path loss between the BS and the mobile, and the variation of the channel due to the height change (say, from 160 cm to 190 cm) is negligible.
For the fading channel model, we apply the small-scale fading coefficient $g_{m,k}$ generated from the complex Gaussian distribution (i.e., $g_{m,k} \sim \mathcal{CN}(0, 1)$) and the large-scale fading coefficient $\beta_{m,k}$ generated based on the Hata-COST231 model [20], which is expressed as $\beta_{m,k} = PL_{m,k} \cdot 10^{\frac{\sigma_{sh} z_{m,k}}{10}}$, where $PL_{m,k}$ is the path loss and $10^{\frac{\sigma_{sh} z_{m,k}}{10}}$ is the shadow fading ($z_{m,k} \sim \mathcal{N}(0, 1)$). Specifically, $PL_{m,k}$ is computed from the distance $d_{m,k}$ between gNB $m$ and UE $k$, the carrier frequency $f$, and the heights $h_B$ and $h_U$ of the BS and UE, respectively. Note that the duration of a time slot is set to 1 second. The DQN of DRL-NS consists of 5 fully connected layers and the width of a hidden layer is set to 256. For the network parameter training, we use the Adam optimizer, a well-known optimization tool to guarantee the robustness of the training process [21]. The simulation parameters are listed in Table 1. Our approach is implemented based on TensorFlow [22]. The DQN is trained on a single NVIDIA GeForce Titan Xp.
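The fading coefficients described above can be sampled with the standard library alone. In this minimal sketch the path loss is passed in as a linear-scale number, since the path loss expression itself is not reproduced here; the function names, the seed, and the sample values are assumptions.

```python
import math
import random


def small_scale_fading(rng: random.Random) -> complex:
    """Draw g ~ CN(0, 1): real and imaginary parts are each N(0, 1/2)."""
    std = math.sqrt(0.5)
    return complex(rng.gauss(0.0, std), rng.gauss(0.0, std))


def large_scale_fading(path_loss: float, sigma_sh_db: float, rng: random.Random) -> float:
    """beta = PL * 10^(sigma_sh * z / 10) with z ~ N(0, 1), PL on a linear scale."""
    z = rng.gauss(0.0, 1.0)
    return path_loss * 10.0 ** (sigma_sh_db * z / 10.0)


rng = random.Random(0)

# The average power of g should be close to 1.
samples = [small_scale_fading(rng) for _ in range(20000)]
mean_power = sum(abs(g) ** 2 for g in samples) / len(samples)
print(round(mean_power, 2))

# With sigma_sh = 0 dB the shadowing term is 1 and beta equals the path loss.
print(large_scale_fading(1e-9, 0.0, rng))  # 1e-09
```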
We compare the proposed DRL-NS scheme with four baseline network slicing techniques: 1) equal allocation method where the equal number of RBs are allocated to each network slice [5], 2) regression tree-based allocation method where the number of RBs allocated to each slice is sequentially decided in a way maximizing the system throughput [4], 3) proportional fair-based allocation method where RBs are allocated in a way to pursue a balance between the total throughput and QoS requirement satisfaction [12], and 4) vanilla DQN-based allocation method where the allocation of RBs is determined by using basic DQN without special treatment.

B. SIMULATION RESULT
In order to examine the convergence behavior of the proposed DRL-NS, we plot the loss function value as a function of the number of training episodes (see Fig. 4). In this test, we set $R_{\min} = 15.0$ bps/Hz. From Fig. 4, we clearly see that the loss function decreases with the number of episodes, which indicates that the training process of DRL-NS is carried out properly in a way to increase the system throughput. Also, since the proposed technique enhances the sample efficiency significantly by exploring the resource allocation decisions in the desirable action space, the loss of DRL-NS is smaller than that of the vanilla DQN scheme.
In Fig. 5, we plot the cumulative reward as a function of the number of training episodes in the training phase. We observe that as the number of training episodes increases, the cumulative reward of DRL-NS increases gradually. Since DRL-NS receives fewer penalties by eliminating the infeasible resource allocation decisions, the cumulative reward of DRL-NS is much higher than that of the vanilla DQN scheme.
In Fig. 6, we plot the average throughput of DRL-NS as a function of the UE's data rate requirement. We observe that DRL-NS outperforms the conventional network slicing methods across the board. For example, when $R_{\min} = 13.75$ bps/Hz, DRL-NS achieves about 25% improvement in the average system throughput over the equal allocation network slicing method. This is because while the equal allocation network slicing method allocates an equal number of RBs to each network slice, DRL-NS makes sequential resource allocation decisions in a way to maximize the long-term system throughput. Also, due to the elimination of the infeasible resource allocation decisions incurring a sub-optimal allocation policy, DRL-NS achieves 15% improvement in the system throughput over the regression tree-based allocation network slicing method.
We also plot the average throughput of DRL-NS as a function of the UE's delay requirement (see Fig. 7). As shown in Fig. 7, the proposed DRL-NS achieves a significant gain over the conventional methods. For example, DRL-NS achieves around 27% and 19% improvements in the throughput performance over the equal allocation and regression tree-based network slicing methods at $D_{\max} = 1$ msec, respectively. Since the resource allocation decisions that cannot satisfy the delay requirements of UEs are removed, DRL-NS has a better chance to find the optimal resource allocation policy maximizing the system throughput.
In Table 3, we summarize the average throughput of DRL-NS for various resource utilization ratios. We observe that as the ratio of utilized network slices increases, the throughput performance increases as well. This is because the throughputs of a larger number of eMBB, URLLC, and mMTC network slices add up to the total throughput of the overall system. Finally, we summarize the training time of DRL-NS for various SNR levels (see Table 4). In this simulation, we declare that DRL-NS converges when the absolute fractional difference of the loss is smaller than the threshold $\varepsilon = 0.001$, and then measure the time to convergence (see footnote 2). As shown in Table 4, when SNR = 20 dB, it takes about $1.73 \times 10^{3}$ seconds for the algorithm to converge.
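The convergence criterion stated above (the relative change of the loss between consecutive iterations falling below $\varepsilon = 0.001$) can be checked with a few lines. The function name `has_converged` and the sample loss values are assumptions, and the sketch presumes a nonzero previous loss.

```python
def has_converged(prev_loss: float, curr_loss: float, eps: float = 1e-3) -> bool:
    """True when |L(w_{t+1}) - L(w_t)| / |L(w_t)| < eps (prev_loss must be nonzero)."""
    return abs((curr_loss - prev_loss) / prev_loss) < eps


print(has_converged(1.0, 0.9))     # False, a 10% change
print(has_converged(1.0, 0.9995))  # True, a 0.05% change
```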

VI. CONCLUSION
In this paper, we proposed a DRL-based network slicing framework called DRL-NS to improve the system throughput while satisfying all the QoS requirements. In the proposed DRL-NS, undesirable resource allocation decisions violating various QoS requirements are eliminated by a specially designed action elimination so that we could significantly reduce the action space, and therefore improve the chance of making the optimal resource allocation decision. Through simulations on a realistic 5G environment, we observed that DRL-NS outperforms conventional schemes by a large margin. In this paper, we restricted our attention to network slicing, but we expect that the proposed scheme can be extended to many different tasks such as radio link scheduling, beam management, and cognitive radio scheduling. In consideration of the long road ahead to exploit the deep reinforcement learning paradigm in wireless communication systems, we believe that DRL-NS can be a useful tool for future wireless applications.

2 That is, $\left| \frac{L(w^{(t+1)}) - L(w^{(t)})}{L(w^{(t)})} \right| < \varepsilon$ ($\varepsilon = 0.001$).