Reinforcement Learning for Adaptive Resource Allocation in Fog RAN for IoT With Heterogeneous Latency Requirements

In light of the quick proliferation of Internet of things (IoT) devices and applications, fog radio access network (Fog-RAN) has been recently proposed for fifth generation (5G) wireless communications to assure the requirements of ultra-reliable low-latency communication (URLLC) for the IoT applications which cannot accommodate large delays. To this end, fog nodes (FNs) are equipped with computing, signal processing and storage capabilities to extend the inherent operations and services of the cloud to the edge. We consider the problem of sequentially allocating the FN’s limited resources to IoT applications of heterogeneous latency requirements. For each access request from an IoT user, the FN needs to decide whether to serve it locally at the edge utilizing its own resources or to refer it to the cloud to conserve its valuable resources for future users of potentially higher utility to the system (i.e., lower latency requirement). We formulate the Fog-RAN resource allocation problem in the form of a Markov decision process (MDP), and employ several reinforcement learning (RL) methods, namely Q-learning, SARSA, Expected SARSA, and Monte Carlo, for solving the MDP problem by learning the optimum decision-making policies. We verify the performance and adaptivity of the RL methods and compare it with the performance of the network slicing approach with various slicing thresholds. Extensive simulation results considering 19 IoT environments of heterogeneous latency requirements corroborate that RL methods always achieve the best possible performance regardless of the IoT environment.


I. INTRODUCTION
There is an ever-growing demand for wireless communication technologies for several reasons, such as the increasing popularity of Internet of Things (IoT) devices, the widespread use of social networking platforms, the proliferation of mobile applications, and a lifestyle that has become highly dependent on technology in all aspects. It is expected that the number of connected devices worldwide will reach three times the global population in 2022, with 3.6 devices per capita. However, in some regions, such as North America, the number of connected devices is projected to reach about 13.4 devices per capita by 2022, which makes massive IoT a very common concept. This trend of massive IoT will generate an annual global IP traffic of 4.8 zettabytes by 2022, which corresponds to 4 times the traffic in 2016 and 184 times the traffic in 2005, and wireless and mobile devices will account for 71% of this forecast [1]. This unprecedented demand for mobile data services makes it infeasible for service providers to keep pace with it using the current third generation (3G) and fourth generation (4G) networks [2]. The design criteria for fifth generation (5G) wireless communication systems will include providing ultra-low latency, wider coverage, reduced energy usage, increased spectral efficiency, more connected devices, improved availability, and very high data rates of multiple gigabits per second (Gbps) everywhere in the network, including cell edges [3]. Several radio frequency (RF) coverage and capacity solutions are proposed to fulfill the goals of 5G, including beamforming, carrier aggregation, higher order modulation, and dense deployment of small cells [4]. The millimeter-wave (mm-wave) frequency range is likely to be utilized in 5G because of the spacious bandwidths available at these frequencies for cellular services [5]. Massive multiple-input multiple-output (MIMO) is potentially involved for excellent spectral efficiency and superior energy efficiency [6].
The associate editor coordinating the review of this manuscript and approving it for publication was Shadi Alawneh.
To cope with the growing number of IoT devices and the increasing amount of traffic for better user satisfaction, the cloud radio access network (C-RAN) architecture is suggested for 5G, in which a powerful cloud controller (CC) with a pool of baseband units (BBUs) and a storage pool supports a large number of distributed remote radio units (RRUs) through high-capacity fronthaul links [7], [8]. The C-RAN is characterized as clean because it reduces energy consumption and improves spectral efficiency through centralized processing and collaborative radio [9]. However, in light of massive IoT applications and the corresponding traffic, the C-RAN structure places a huge burden on the centralized CC and its fronthaul, which causes more delay due to limited fronthaul capacity and busy cloud servers, in addition to the large transmission delays [10], [11].

A. F-RAN AND HETEROGENEOUS IoT
The latency issue in C-RAN becomes critical for IoT applications that cannot tolerate such delays. For this reason, the fog radio access network (F-RAN) has been introduced for 5G, where fog nodes (FNs) are not limited to performing RF functionalities but are also empowered with caching, signal processing, and computing resources [12], [13]. This makes FNs capable of independently delivering network functionalities to end users at the edge, without referring them to the cloud, to meet the low-latency needs.
IoT applications have various latency requirements. Some applications are more delay-sensitive than others, while some can tolerate larger delays [14]-[16]. Hence, especially in a heterogeneous IoT environment with various latency needs, an FN must allocate its limited and valuable resources in a smart way. In this work, we present a novel framework for resource allocation in F-RAN for 5G that employs reinforcement learning methods to guarantee the efficient utilization of limited FN resources while satisfying the low-latency requirements of IoT applications.

B. LITERATURE REVIEW
For the last several years, 5G and IoT related topics have been of great interest to many researchers in the wireless communications field. Recently, a good number of works in the literature have focused on achieving low latency for IoT applications in 5G F-RAN. For instance, resource allocation based on cooperative edge computing has been studied in [17]-[21] for achieving ultra-low latency in F-RAN. The work in [17] proposed a mesh paradigm for edge computing, where the decision-making tasks are distributed among edge devices instead of utilizing the cloud server. The authors in [18], [21] considered heterogeneous F-RAN structures, including small cells and macro base stations, and provided an algorithm for selecting the F-RAN nodes to serve with proper heterogeneous resource allocation. The number of F-RAN nodes and their locations have been investigated in [22]. Content fetching is used in [7], [19] to maximize the delivery rate when the requested content is available in the cache of fog access points. In [23], the cloud predicts users' mobility patterns and determines the required resources for the contents requested by users, which are stored at the cloud and small cells. The work in [20] addressed the issue of load balancing in fog computing and used fog clustering to improve the user's quality of experience. The congestion problem that arises when resource allocation is based on the best signal quality received by the end user is highlighted in [24], [25]. The work in [24] provided a solution to balance the resource allocation among remote radio heads by achieving an optimal downlink sum-rate, while [25] offered an optimal solution based on reinforcement learning to balance the load among evolved nodes for the arrival of machine-type communication devices. To reduce latency, a soft resource reservation mechanism is proposed in [26] for uplink scheduling. The authors of [27] presented an algorithm that works with the smooth handover scheme and suggested scheduling policies to ease the user mobility challenge and reduce the application response time. Radio resource allocation strategies to optimize spectral efficiency and energy efficiency while maintaining low latency in F-RAN are proposed in [28].
With regard to learning for IoT, [29] provided a comprehensive study of the advantages, limitations, applications, and key results relating to machine learning, sequential learning, and reinforcement learning. Multi-agent reinforcement learning was exploited in [30] to maximize network resource utilization in heterogeneous networks by selecting the radio access technology and allocating resources for individual users. A model-free reinforcement learning approach is used in [31] to learn the optimal policy for user scheduling in heterogeneous networks to maximize the network energy efficiency. Resource allocation in a non-orthogonal-multiple-access based F-RAN architecture with selective interference cancellation is investigated in [32] to maximize the spectral efficiency while considering the co-channel interference. With the help of a task scheduler, a resource selector, and a history analyzer, [33] introduced an FN resource selection algorithm in which the selection and allocation of the best FN to execute an IoT task depends on the predicted run-time, which is estimated realistically from stored execution logs of the historical performance of FNs.
A comprehensive study of network slicing in 5G systems is presented in [34], [35]. Issues and challenges of network slicing in Fog RAN are investigated in [34], where the authors presented key techniques and solutions with regard to radio and cache resource management as well as social-aware slicing. The survey in [35] covers the key principles, enabling technologies, challenges, standardization, and solutions of network slicing, including slicing solutions for 5G systems, and illustrates the requirements and diverse use cases of network slicing considering RAN sharing and end-to-end orchestration and management involving the radio access, transport, and core networks. Radio resource allocation for different network slices is exploited in [36]-[38] to support various quality-of-service (QoS) requirements and minimize the queuing delay for low-latency requests, in which the network is logically partitioned into a high-transmission-rate slice for mobile broadband (MBB) applications and a low-latency slice that supports ultra-reliable low-latency communication (URLLC) applications. The authors in [38] proposed a hierarchical radio resource allocation architecture for network slicing in which a global radio resource manager (GRRM) allocates subchannels to local radio resource managers (LRRMs) in slices, which then assign resources to their end users. However, the resource allocation problem considered in the network slicing literature focuses on the dynamics of resource allocation among various network slices and layers, i.e., it decides on the allocation of resources between FNs for URLLC applications and RRUs for MBB applications. It is infeasible for mobile operators and service providers to keep changing the distribution of resources in the network because of the huge accompanying operational expenditure (OPEX): the cost of all required hardware, software, and license swaps, the implementation of any necessary frequency reuse plans and fronthaul link capacity upgrades, and the impact of the outages and the prolonged testing, optimization, and fine-tuning of network performance that follow every single change. Hence, deciding on resource allocation among network slices should be thoughtful and deliberate, and should come only after assuring that each slice utilizes its allocated resources efficiently. In this work, we zoom in on the limited resources allocated to FNs to optimize and guarantee their efficient utilization. We compare the performance and adaptivity of the RL methods to the performance of the utility filtering-based network slicing approach with various slicing thresholds.

C. CONTRIBUTIONS
With the motivation of satisfying the low-latency requirements of heterogeneous IoT applications through F-RAN, we provide a novel framework for allocating limited resources to users that guarantees efficient utilization of the FN's limited resources. Specifically, we propose a Markov decision process (MDP) formulation for the considered F-RAN resource allocation problem, and investigate the use of various reinforcement learning (RL) methods, namely Q-learning (QL), SARSA, Expected SARSA (E-SARSA), and Monte Carlo (MC), for learning the optimal fine-grained decision-making policies of the MDP problem, adaptive to the IoT environment, to improve efficiency. We also provide extensive simulation results in various IoT environments of heterogeneous latency requirements to evaluate the performance and adaptivity of the four RL methods and compare them to the performance of the network slicing approach with various slicing thresholds.
The remainder of the paper is organized as follows. Section II introduces the system model. The proposed MDP formulation for the resource allocation problem is given in Section III. Optimal policies and the related RL algorithms are discussed in Section IV. Simulation results are presented in Section V. Finally, we conclude the paper in Section VI. A list of notation and abbreviations used throughout the paper is provided in Table 1.

II. SYSTEM MODEL
We consider the F-RAN structure shown in Fig. 1, in which FNs are connected through the fronthaul to the cloud controller (CC), where massive computing capability, centralized baseband units (BBUs), and cloud storage pooling are available. To ease the burden on the fronthaul and the cloud, and to overcome the challenge of the increasing number of IoT devices and low-latency applications, FNs are empowered with the capability to deliver network functionalities at the edge. Hence, they are equipped with caching capacity and with computing and signal processing capabilities. However, these resources are limited, and therefore need to be utilized efficiently. An end user attempts to access the network by sending a request to the nearest FN. The FN decides whether to serve the user locally at the edge using its own computing and processing resources or to refer it to the cloud. We consider the FN's computing and processing capacity to be limited to $N$ resource blocks (RBs). User requests arrive sequentially and decisions are taken quickly, so no queuing occurs. The QoS requirements of a wireless user are typically given by the latency requirement and the throughput requirement. IoT applications have various levels of latency requirement, hence it is sensible for the FN to give higher priority to serving the low-latency applications. To differentiate between similar latency requirements, we also consider the risk of failing to satisfy the throughput requirement. This risk is related to the ratio of the achievable throughput to the throughput requirement. The achievable throughput is characterized by the signal-to-noise ratio (SNR) through the Shannon channel capacity. Shannon's fundamental limit on the capacity of a communications channel gives an upper bound on the achievable throughput as a function of the available bandwidth $B$ in Hz and the SNR (as a linear power ratio, not in dB), $C = B \log_2(1 + \mathrm{SNR})$. Hence, we define the utility of an IoT user request to be a function of the latency requirement $l$ (in milliseconds), the throughput requirement $\omega$ (in bits per second), and the channel capacity $C$ (in bits per second), i.e., $u = f(l, \omega, C)$. Since the utility should be inversely proportional to the latency requirement and directly proportional to the achievable throughput ratio $\mu = C/\omega$, we define the utility as in (1), whose three mapping parameters (one of which is $\zeta$) are positive. This provides a flexible model for utility. By selecting the mapping parameters, a desired range of $u$ and desired importance levels for the latency and throughput requirements can be obtained. Since F-RAN is intended for satisfying low-latency requirements, typically more weight should be given to latency by choosing a larger value for the corresponding parameter.
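The capacity bound and the throughput ratio described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the 1 MHz bandwidth, and the 500 kbit/s requirement are hypothetical, and the SNR is assumed to be supplied in dB and converted to a linear ratio before it enters the Shannon formula.

```python
import math


def shannon_capacity(bandwidth_hz: float, snr_db: float) -> float:
    """Upper bound on the achievable throughput, in bits per second."""
    snr_linear = 10.0 ** (snr_db / 10.0)  # dB -> linear power ratio
    return bandwidth_hz * math.log2(1.0 + snr_linear)


def throughput_ratio(capacity_bps: float, required_bps: float) -> float:
    """mu = C / omega; a value below 1 means the requirement cannot be met."""
    return capacity_bps / required_bps


# Hypothetical request: 1 MHz channel at 0 dB SNR, 500 kbit/s requirement.
capacity = shannon_capacity(1e6, 0.0)
mu = throughput_ratio(capacity, 5e5)
print(capacity, mu)
```

The utility mapping itself is not reproduced here, because its exact form is not recoverable from the surrounding text.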
FNs should be smart enough to learn how to decide (serve or refer to the cloud) for each IoT request, i.e., how to allocate their limited resources, so as to achieve the conflicting objectives of maximizing the average total utility of served users over time and minimizing the idle (no-service) time. The system objective can be stated as a constrained optimization problem,
$$\max_{a_0, \dots, a_{T-1}} \sum_{t=0}^{T} \mathbb{1}_{\{a_t = \text{serve}\}} \, u_t \quad \text{and} \quad \min_{a_0, \dots, a_{T-1}} T \quad \text{subject to} \quad \sum_{t=0}^{T} \mathbb{1}_{\{a_t = \text{serve}\}} = N, \tag{2}$$
where $a_t$ denotes the action taken at time $t$ (either serve the request locally or reject it and refer it to the cloud), $T$ denotes the termination time when all RBs are filled, $N$ denotes the number of RBs, and $\mathbb{1}_{\{\cdot\}}$ is the indicator function taking the value 1 if its argument is true and 0 if false. The goal is to find the optimum decision policy $\{a_0, a_1, \dots, a_{T-1}\}$ for an IoT environment that randomly generates $\{u_t\}$. Note that the final decision is always $a_T = \text{serve}$ by definition, and is hence omitted in the policy representation.
One straightforward approach to this resource allocation problem is to apply network slicing [34], [35] based on the user utility, in which the network is logically partitioned into two slices [36]-[38]: a fog slice that handles high-utility IoT requests with low-latency demand, and a cloud slice that handles low-utility users. Hence, a filtering standard is required for the FN to direct users' requests to their corresponding network slices. For instance, we can define a threshold rule such as "serve locally if $u > 5$" if we classify all applications in an IoT environment into ten different utilities $u \in \{1, 2, \dots, 10\}$, 10 being the highest utility. However, such a policy is sub-optimum since the FN will be waiting for a user to satisfy the threshold, which will increase the FN's idle time and make the CC busier. The main drawback of this policy is that it cannot adapt to the dynamic IoT environment to achieve the objective. For instance, when the user utilities are almost uniformly distributed, a very selective policy with a high threshold will stay idle most of the time, whereas an impatient policy with a low threshold will in general obtain a low average served utility. A mild policy with threshold 5 may in general perform better than the extreme policies, yet it will not be able to adapt to different IoT environments. A better solution for the F-RAN resource allocation problem is to use RL techniques, which can continuously learn the environment and adapt the decision rule accordingly.
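The trade-off between a selective and an impatient threshold can be seen on a toy request stream. The sketch below is only an illustration of the fixed-threshold rule described above; the function name, the request list, and the choice of 3 RBs are hypothetical.

```python
def threshold_episode(utilities, n_rbs, threshold):
    """Serve a request only if its utility exceeds the threshold.

    Returns (served_utilities, steps), where steps counts the requests seen
    until all n_rbs resource blocks are occupied or the stream ends.
    """
    served = []
    steps = 0
    for u in utilities:
        steps += 1
        if u > threshold:
            served.append(u)
            if len(served) == n_rbs:
                break
    return served, steps


requests = [2, 7, 3, 9, 1, 6, 10, 4, 8, 5]  # hypothetical request stream
picky, picky_steps = threshold_episode(requests, n_rbs=3, threshold=7)
eager, eager_steps = threshold_episode(requests, n_rbs=3, threshold=1)
print(picky, picky_steps)
print(eager, eager_steps)
```

On this stream the selective rule collects higher utilities but needs three times as many requests to fill its RBs, while the impatient rule fills up immediately with low-utility users. Neither threshold reacts if the mix of requests changes.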

III. MDP PROBLEM FORMULATION
RL can be thought of as the third paradigm of machine learning, in addition to supervised learning and unsupervised learning. The key point in the proposed RL approach is that the FN learns about the IoT environment by interaction and then adapts to it. The FN gains rewards from the environment for every action it takes, and once the optimum policy of actions is learned, the FN will be able to maximize its expected cumulative rewards, adapt to the IoT environment, and achieve the objective.
For an access request from a user with utility $u_t$ at time $t$, if the FN decides to take the action $a_t = \text{serve}$, which means to serve the user at the edge, then it will gain an immediate reward $r_t$ and one of the RBs will be occupied. Otherwise, for the action $a_t = \text{reject}$, which means to reject serving the user at the edge and refer it to the cloud, the FN will maintain its available RBs and get a reward $r_t$. The value of $r_t$ depends on $a_t$ and $u_t$. For tractability, we consider quantized utility values, $u_t \in \{1, 2, \dots, U\}$.
We define the state $s_t$ of the FN at any time $t$ as
$$s_t = U b_t + u_t, \tag{3}$$
where $b_t \in \{0, 1, 2, \dots, N\}$ is the number of occupied RBs at time $t$. Note that the successor state $s_{t+1}$ depends only on the current state $s_t$, the utility $u_{t+1}$ of the next service request, and the action taken (serve or reject), satisfying the Markov property $P(s_{t+1} \mid s_0, \dots, s_{t-2}, s_{t-1}, s_t, a_t) = P(s_{t+1} \mid s_t, a_t)$, i.e., $s_t$ is a Markov state. Hence, we formulate the Fog-RAN resource allocation problem in the form of a Markov decision process (MDP), which is defined by the tuple $(\mathcal{S}, \mathcal{A}, P^a_{ss'}, R^a_{ss'})$, where $\mathcal{S}$ is the set of all possible states, i.e., $s_t \in \mathcal{S}$; $\mathcal{A}$ is the set of actions, i.e., $a_t \in \mathcal{A} = \{\text{serve}, \text{reject}\}$; $P^a_{ss'}$ is the transition probability from state $s$ to $s'$ when the action $a$ is taken, i.e., $P^a_{ss'} = P(s' \mid s, a)$, where $s'$ is a shorthand notation for the successor state; and $R^a_{ss'}$ is the immediate reward received when the action $a$ is taken at state $s$ and ends up in state $s'$, e.g., $r_t = R^{a_t}_{s_t s_{t+1}} \in \mathcal{R}$.
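A small sketch of the state bookkeeping may help here. It assumes the state index is $s = U b + u$, which is consistent with the example states used later in the paper (for instance the terminal state 54 for $N = 5$, $U = 10$) and with the state count $U(N+1)$; the helper names are hypothetical.

```python
U = 10  # number of utility levels
N = 5   # number of resource blocks


def encode_state(occupied, utility, levels=U):
    """Assumed encoding: s = U * b + u, with u in 1..U and b in 0..N."""
    return levels * occupied + utility


def decode_state(state, levels=U):
    """Inverse of encode_state; returns (occupied RBs, current utility)."""
    occupied, utility = divmod(state - 1, levels)
    return occupied, utility + 1


def step(state, action, next_utility, levels=U):
    """Successor state: serving occupies one RB, rejecting keeps b unchanged."""
    occupied, _ = decode_state(state, levels)
    if action == "serve":
        occupied += 1
    return encode_state(occupied, next_utility, levels)


def is_terminal(state, n_rbs=N, levels=U):
    return decode_state(state, levels)[0] == n_rbs


all_states = {encode_state(b, u) for b in range(N + 1) for u in range(1, U + 1)}
print(len(all_states), step(9, "serve", 3), is_terminal(54))
```

The successor depends only on the current state, the action, and the next request's utility, which is the Markov property stated above.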
The return $G_t$ is defined as the cumulative discounted rewards received from time $t$ onward and is given by (4). The reward mechanism $R^a_{ss'}$ is typically chosen by the system designer according to the objective. We propose a reward mechanism based on the received utility and the action taken for it. Specifically, at time $t$, based on $u_t$ and $a_t$, the FN receives an immediate reward $r_t \in \mathcal{R} = \{r_{sh}, r_{sl}, r_{rh}, r_{rl}\}$ and moves to the successor state $s_{t+1}$, where $r_{sh}$ is the reward for serving a high-utility request, $r_{sl}$ is the reward for serving a low-utility request, $r_{rh}$ is the reward for rejecting a high-utility request, and $r_{rl}$ is the reward for rejecting a low-utility request. A request is determined as high-utility or low-utility relative to the environment based on a threshold $u_h$, which is a design parameter dependent on the utility distribution in the IoT environment. For instance, $u_h$ can be selected as a certain percentile, such as the 50th percentile, i.e., the median, of the utilities in the environment. Hence, the proposed reward function is given by (5).
Remark 1: Note that the threshold $u_h$ does not have a definitive meaning with respect to the system requirements, i.e., there is no requirement saying that requests with utility lower/greater than $u_h$ must be rejected/served. The goal here is to introduce an internal reward mechanism for the RL approach to facilitate learning the expected future gains, as will be clear later in this section and the following section. For an effective learning performance, the reward mechanism should be simple enough to guide the RL algorithm towards the system objective (see (2)) [39]. That is, its role is not to imitate the system objective closely to make the algorithm achieve it at once, but to resemble it in a simple manner to let the algorithm iteratively achieve a high performance.
Fig. 2 (caption, partially recovered): State transitions of Table 2 for an FN with $N = 5$, $U = 10$, $u_h = 6$. Non-terminal states and the terminal state are represented by circles and squares, respectively, and labeled by the state names. Filled circles represent actions, and arrows show the transitions with the corresponding rewards.
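The four-valued reward mechanism described above can be sketched as a lookup on the action and on whether the request counts as high utility. This is an illustrative sketch only: the numeric reward values below are hypothetical placeholders (the paper's values are in its Table 4), and treating a request with $u_t = u_h$ as high utility is an assumption, since the boundary case is not recoverable from the text.

```python
# Hypothetical reward values, chosen only to illustrate the sign pattern.
REWARDS = {"r_sh": 2.0, "r_sl": -1.0, "r_rh": -2.0, "r_rl": 1.0}


def reward(action, utility, u_h, rewards=REWARDS):
    """Immediate reward r_t as a function of the action a_t and utility u_t."""
    high = utility >= u_h  # assumption: ties count as high utility
    if action == "serve":
        return rewards["r_sh"] if high else rewards["r_sl"]
    if action == "reject":
        return rewards["r_rh"] if high else rewards["r_rl"]
    raise ValueError(f"unknown action: {action}")


print(reward("serve", 9, u_h=6), reward("reject", 3, u_h=6))
```

Note that the function only labels the immediate outcome; whether a given request is actually served is decided by the learned value functions, not by comparing the utility with `u_h`.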
Remark 2: Although a threshold $u_h$ is utilized in the proposed reward mechanism, its use is fundamentally different from the utility filtering-based policy in the network slicing approach, which always accepts/rejects requests with utility greater/lower than a threshold. While the utility filtering-based policy considers only the immediate gain from the current utility, the algorithms tackling the MDP problem, such as the RL algorithms, consider the expected return $E[G_0]$, which includes the immediate reward and the expected future rewards. Hence, the threshold $u_h$ does not necessarily cause the algorithm to accept/reject requests with utility greater/lower than $u_h$; it only plays an internal role in learning the expected future rewards.
State transitions for an FN with 5 RBs ($N = 5$), 10 utility levels ($U = 10$), and $u_h = 6$, for a sample of IoT requests with utilities $u_t$ and random actions $a_t$, are shown in Table 2. At time $t$, being at state $s_t$ and taking the action $a_t$ results in getting an immediate reward $r_t$ and moving to the successor state $s_{t+1}$. The state transitions in Table 2 represent an episode of the MDP; it starts at $t = 0$ and terminates at $T = 10$ with the states $5 \to 9 \to 13 \to 13 \to 28 \to 36 \to 31 \to 40 \to 47 \to 49 \to 54$. The dynamics of this episode are shown through a state transition graph in Fig. 2, in which non-terminal states and the terminal state are represented by circles and squares, respectively, and labeled by the state names, filled circles represent actions, and arrows show the transitions with the corresponding rewards.
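The episode above can be replayed to recover what happened at each step. The sketch assumes the state index is $s = U b + u$ with $U = 10$, which matches the listed states; under that assumption, a transition is a "serve" exactly when the number of occupied RBs grows by one. The variable names are hypothetical.

```python
U = 10
episode_states = [5, 9, 13, 13, 28, 36, 31, 40, 47, 49, 54]


def decode(state, levels=U):
    """Assumed encoding s = U * b + u; returns (occupied RBs, utility)."""
    occupied, utility = divmod(state - 1, levels)
    return occupied, utility + 1


actions = []
for current, successor in zip(episode_states, episode_states[1:]):
    b_now, _ = decode(current)
    b_next, _ = decode(successor)
    actions.append("serve" if b_next == b_now + 1 else "reject")

served_utilities = [
    decode(s)[1] for s, a in zip(episode_states, actions) if a == "serve"
]
print(actions)
print(served_utilities)
```

Under this reading, the ten decisions contain five serves and five rejects, and the terminal state 54 corresponds to all five RBs being occupied.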

IV. OPTIMAL POLICIES
The state-value function $V(s)$, shown in (6), represents the long-term value of being in state $s$ in terms of the expected return that can be collected starting from this state onward until termination. Hence, the terminal state has zero value since no reward can be collected from that state, and the value of the initial state is equal to the objective function $E[G_0]$. The state value can also be viewed in two parts: the immediate reward from the action taken and the discounted value of the successor state. Similarly, the action-value function $Q(s, a)$ is the expected return that can be achieved after taking the action $a$ at state $s$, as shown in (7). The action-value function tells how good it is to take a particular action at a given state. The expressions in (6) and (7) are known as the Bellman expectation equations for the state value and the action value, respectively [39], where $a'$ denotes the successor action at the successor state $s'$. Since (6) shows the relationship between the value of a state and the values of its successor states, and (7) does the same for the value of an action, it is useful to show the dynamics of the MDP in a backup diagram, as shown in Fig. 3 for a 2-RB FN ($N = 2$).
The backup diagram provides an overview of the possible episodes of the considered MDP, where the minimum termination time required to reach a terminal state, at which all RBs are occupied ($b = N$), is $T = 2$ through the episode $u_0 \to 10 + u_1 \to 20 + u_2$, i.e., serve all the time. Note that early termination does not necessarily maximize the return. The objective of the FN in the presented MDP is to utilize the $N$ resource blocks for high-utility IoT applications in a timely manner. This can be done by maximizing the value of the initial state, which is equal to the MDP objective $E[G_0]$. To this end, an optimal decision policy is required, which is discussed next.
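The two-part view of a value (immediate reward plus discounted value of what follows) is the same recursion that is used to compute a sampled return. The sketch below assumes the standard discounted return $G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$ with a discount factor written here as `gamma`; the symbol and the reward sequence are assumptions for illustration, not values from the paper.

```python
def discounted_return(rewards, gamma):
    """Return from the first step: r_0 + gamma * r_1 + gamma**2 * r_2 + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # immediate reward plus discounted future return
    return g


def returns_per_step(rewards, gamma):
    """Return G_t for every time step t, computed backward from termination."""
    out = [0.0] * len(rewards)
    g = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        g = rewards[t] + gamma * g
        out[t] = g
    return out


sample_rewards = [1.0, 2.0, 4.0]  # hypothetical rewards of a short episode
print(discounted_return(sample_rewards, 0.5))
print(returns_per_step(sample_rewards, 0.5))
```

Averaging such sampled returns over many episodes that pass through a state is what a Monte Carlo method uses as its estimate of that state's value.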
A policy $\pi$ is a way of selecting actions. It can be defined as the set of probabilities of taking a particular action given the state, i.e., $\pi = \{P(a \mid s)\}$ for all possible state-action pairs. The policy $\pi$ is said to be optimal if it maximizes the value of all states, i.e., $\pi^* = \arg\max_{\pi} V_{\pi}(s), \forall s$. Hence, to solve the considered MDP problem, the FN needs to find the optimal policy through finding the optimal state-value function $V^*(s) = \max_{\pi} V_{\pi}(s)$, which is similar to finding the optimal action-value function $Q^*(s, a) = \max_{\pi} Q_{\pi}(s, a)$ for all state-action pairs. From (6) and (7), we can write the Bellman optimality equations for $V^*(s)$ and $Q^*(s, a)$ as (8) and (9), respectively. The notion of the optimal state-value function $V^*(s)$ greatly simplifies the search for the optimal policy. Since the goal of maximizing the expected future rewards is already taken care of by the optimal value of the successor state, $V^*(s')$ can be taken out of the expectation in (8). Hence, the optimal policy is given by the best local actions at each state. Dealing with $Q^*(s, a)$ to choose optimal actions is even easier, because with $Q^*(s, a)$ there is no need for the FN to do the one-step-ahead search; instead, it picks the action that maximizes $Q^*(s, a)$ at each state. Optimal actions are defined in (10). After discretizing the utility into $U$ levels, the state space becomes tractable with cardinality $|\mathcal{S}| = U(N + 1)$, hence in this case the optimal policy can be learned by estimating the optimal value functions (either (8) or (9)) using tabular methods such as model-free RL methods (e.g., Monte Carlo, SARSA, Expected SARSA, and Q-learning), which are also called approximate dynamic programming methods [39]. Since the expectations involved in the value functions are not tractable to find in closed form, we resort to model-free RL methods in this work instead of exact dynamic programming. Continuous utility values (see (1)) would yield an infinite-dimensional state space, and thus require function approximation methods, such as deep Q-learning, known as DQN [40], for predicting the value function at different states, which we leave to future work.
In our MDP problem, the FN first receives a request from an IoT application of utility $u$ and then decides to serve or reject it, meaning that the reward for serving, $r_s \in \{r_{sh}, r_{sl}\}$, and the reward for rejecting, $r_r \in \{r_{rh}, r_{rl}\}$, are known at the time of decision making. Thus, from (6) and (10), the optimal action at state $s$ is given by (11), where $s'_{\text{serve}}$ is the successor state when $a = \text{serve}$, $s'_{\text{reject}}$ is the successor state when $a = \text{reject}$, and $E_u$ is the expectation with respect to the utilities $u$ in the IoT environment.
Algorithm 1 (Monte Carlo; listing partially recovered): generate an episode by taking actions using (11) until termination; set $G(s)$ to the sum of discounted rewards from $s$ until the terminal state, for all states appearing in the episode.
Algorithm 2 (QL, E-SARSA, and SARSA; listing partially recovered): take action $a_t$ according to $\pi$ (e.g., $\epsilon$-greedy), and store $r_t$ and $s_{t+1}$; update Q with $Q(s_\tau, a_\tau)$; use $Q^*(s, a)$ estimated in Q for $\pi^*$ using (12).
Algorithm 2 shows how the FN learns the optimal policy for the MDP by estimating $Q^*(s, a)$ using the QL, E-SARSA, and SARSA methods. The step size parameter $\alpha$ represents the weight we give to the change in our experience, i.e., the learning rate; $\epsilon$ is the probability of taking a random action for exploration; and the batch size $n$ represents the number of time steps after which we update the $Q(s, a)$ values. The Q array at line 3 is a matrix that stores the updated action values of all states and actions in each iteration. In each iteration, we take an action, then observe and store the collected reward and the successor state. Actions are taken according to a policy $\pi$, such as the $\epsilon$-greedy policy in line 6, in which a random action is taken with probability $\epsilon$ to explore new rewards and an optimal action (see (12)) is taken with probability $(1 - \epsilon)$ to maximize the rewards; with $\epsilon = 0$, the policy becomes greedy. The condition at line 7 represents the time, in terms of the batch size, at which we start updating the Q values of the actions taken in the previously visited states. The way the target $G$ is computed for QL, E-SARSA, and SARSA is shown at lines 9-11. $G$ represents the return collected starting from time $(t + 1 - n)$ to $n$ time steps ahead, and it contains two parts: the discounted collected rewards and a function of the action value for future rewards. The latter part differs for QL, E-SARSA, and SARSA. For QL, the maximum action value is used, considering all possible actions that can be taken from the state at $t + 1$, whereas E-SARSA uses the expected value of $Q(s_{t+1}, a)$ over the possible actions at state $s_{t+1}$, and SARSA uses $Q(s_{t+1}, a_{t+1})$, considering the action that will be taken at time $t + 1$ according to the current policy. The action-value update is shown at line 12, where $\tau$ is the time whose Q estimate is being updated. At line 13, the matrix Q is updated with the new Q value and used to make future decisions. The algorithm stops when all Q values converge. The converged values represent the optimal action values $Q^*$, which are then used to determine optimal actions as in (12).
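The difference between the three targets can be made concrete with a short sketch. It is a simplified one-step version (batch size $n = 1$), not a reproduction of Algorithm 2, and the discount factor `gamma`, the function names, and the numeric values are assumptions introduced for illustration.

```python
import random

ACTIONS = ("serve", "reject")


def epsilon_greedy(q_row, epsilon, rng=random):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_row[a])


def td_target(method, reward, q_next, next_action, gamma, epsilon):
    """One-step target G; q_next maps each action to Q(s_{t+1}, a).

    For a terminal successor state, pass zeros in q_next.
    """
    if method == "QL":  # maximum action value at the successor state
        bootstrap = max(q_next.values())
    elif method == "SARSA":  # value of the action actually taken next
        bootstrap = q_next[next_action]
    elif method == "E-SARSA":  # expectation under the epsilon-greedy policy
        greedy = max(ACTIONS, key=lambda a: q_next[a])
        probs = {a: epsilon / len(ACTIONS) for a in ACTIONS}
        probs[greedy] += 1.0 - epsilon
        bootstrap = sum(probs[a] * q_next[a] for a in ACTIONS)
    else:
        raise ValueError(f"unknown method: {method}")
    return reward + gamma * bootstrap


def update(q, state, action, target, alpha):
    """Move Q(s, a) toward the target by a step of size alpha."""
    q[state][action] += alpha * (target - q[state][action])


q_next = {"serve": 4.0, "reject": 2.0}  # hypothetical successor values
for name in ("QL", "SARSA", "E-SARSA"):
    print(name, td_target(name, 1.0, q_next, "reject", gamma=0.5, epsilon=0.5))
```

With $\epsilon = 0$ the expected value under the policy collapses to the maximum, so the E-SARSA and QL targets coincide for a greedy policy.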
Recall that the FN objective is to maximize the expected total served utility and minimize the expected termination time, as shown in (2). Hence, to compare the performance of the FN when using QL, SARSA, E-SARSA, and MC, provided in Algorithms 1 and 2, with the performance of a fixed-threshold algorithm in the utility filtering-based network slicing, which does not learn from the interactions with the environment, we define an objective performance metric $R$ as
$$R = \sum_{m=1}^{M} u_m - \theta (T - M),$$
where a served utility is denoted by $u_m$, the number of served IoT requests in an episode is denoted by $M$, $(T - M)$ represents the total idle time for RBs, and $\theta$ is a penalty for being idle.
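As a sketch of how such a metric could be evaluated for one episode, the function below assumes the form $R = \sum_{m=1}^{M} u_m - \theta (T - M)$, i.e., total served utility minus a penalty $\theta$ per idle step. This form is inferred from the symbol definitions above rather than taken from a recovered equation, and the value of `theta` is hypothetical.

```python
def performance_metric(served_utilities, termination_time, theta):
    """Total served utility minus theta times the idle time (T - M)."""
    m = len(served_utilities)
    idle_time = termination_time - m
    if idle_time < 0:
        raise ValueError("termination time cannot be smaller than M")
    return sum(served_utilities) - theta * idle_time


# Episode with five served requests that terminates at T = 10.
score = performance_metric([9, 3, 8, 10, 9], termination_time=10, theta=2.0)
print(score)
```

A policy that waits longer for high-utility requests raises the first term but also the idle penalty, which is the trade-off the metric is meant to capture.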

V. SIMULATIONS
We next provide simulation results to evaluate the performance of the FN when implementing the RL methods, Q-learning, SARSA, Expected SARSA, and Monte Carlo, given in Algorithms 1 and 2. We also compare the RL-based FN performance with the FN performance when utility filtering-based network slicing is employed with a fixed thresholding algorithm. We evaluate the performances in various IoT environments with different compositions of IoT latency requirements. For brevity, we do not consider the effect of the ratio of the achievable throughput to the throughput requirement in assessing the utility of a service request. Specifically, we consider 10 utility classes with different latency requirements to exemplify the variety of IoT applications in an F-RAN setting. That is, we set $\zeta = 0$ and the other two mapping parameters to 1 in (1), and discretize the latency-based utility into 10 classes ($U = 10$). The utility values $1, 2, \dots, 10$ may represent the following IoT applications, respectively: smart farming, smart retail, smart home, wearables, entertainment, smart grid, smart city, industrial Internet, autonomous vehicles, and connected health. By changing the composition of utility classes, we generate 19 scenarios of IoT environments, 6 of which are summarized in Table 3. A higher density of high-utility users makes the IoT environment richer in terms of low-latency IoT applications.
Denoting an IoT environment of a particular utility distribution with $E$, we show in Table 3 the statistics of $E_1$, $E_4$, $E_7$, $E_{10}$, $E_{15}$, and $E_{19}$. The first 10 rows in the table provide detailed information about the proportion of each utility class in an IoT environment corresponding to a latency requirement. The last two rows illustrate the quality or richness of IoT environments, where $\rho$ is the probability of a utility being greater than 5, and $\bar{u}$ is the mean value of utilities in the environment. In the considered 19 scenarios, $\rho$ increases in steps of 0.05 from 5% to 95% for $E_1, E_2, \ldots, E_{19}$, respectively. The remaining 13 scenarios have statistics proportional to their $\rho$ values. We started with a general scenario given by $E_7$, and changed $\rho$ to obtain the other scenarios.
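The two richness statistics can be computed directly from a utility distribution. The following sketch assumes the distribution is given as a list of class probabilities for utilities $1, \ldots, U$. The function name and the uniform example are hypothetical; the uniform distribution is not claimed to be one of the 19 environments.

```python
def environment_statistics(class_probs, split=5):
    """Return (rho, mean utility) for a utility distribution.

    class_probs[i] is the probability of utility i + 1, so the list covers
    utilities 1..U. rho is the probability of a utility greater than `split`.
    """
    if abs(sum(class_probs) - 1.0) > 1e-9:
        raise ValueError("class probabilities must sum to 1")
    rho = sum(p for u, p in enumerate(class_probs, start=1) if u > split)
    mean_utility = sum(u * p for u, p in enumerate(class_probs, start=1))
    return rho, mean_utility


# Hypothetical environment with 10 equally likely utility classes.
rho, u_bar = environment_statistics([0.1] * 10)
print(rho, u_bar)
```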
The simulation parameters shown in Table 4 are used for the results presented in this section. The rewards $R = \{r_{sh}, r_{sl}, r_{rh}, r_{rl}\}$ are chosen to facilitate learning the optimal policy. We consider that the FN is equipped with computing, signal processing and storage resources of 15 resource blocks (RBs), i.e., $N = 15$. In a particular environment $E$, the threshold that defines "high utility" is set to the mean of all utilities, i.e., $u_h = \bar{u}$. We applied the greedy policy in our simulations, hence $\epsilon = 0$.
We first consider the MDP formulation for the IoT environment given by scenario $E_7$ shown in Table 3. By interacting with the environment, the FN updates the state-value estimates, which converge to the values of the optimum policy. Fig. 4 shows how the FN learns the optimal policy using the Monte Carlo (MC) method given in Algorithm 1 to estimate the optimal state values. With 15 RBs, there are 160 states, the last 10 of which are terminal states with $b = 15$ for which $V(s) = 0$. The state-value functions of 16 states are given in Fig. 4. The remaining states have values within a standard deviation of 0.5 of the selected 16 states. It is seen that for most of the states the state values converge to the optimal value $V^*(s)$ after about 5000 iterations. This number can be easily exceeded by the number of requests received by the FN during a busy hour from a variety of IoT applications [1].
We next apply SARSA, Expected SARSA and QL in the IoT environment $E_7$ for learning the optimal policy in (12) using the estimated $Q^*(s, a)$ in Algorithm 2. The convergence of $Q(s, \text{serve})$ and $Q(s, \text{reject})$ when using QL is shown in Figs. 5 and 6, respectively. In our MDP problem, QL converges slightly faster than E-SARSA, SARSA and MC since it implements a greedy approach by selecting the maximum $Q(s', a')$ when updating the return $G_t$, as shown in Algorithm 2. However, this is not a general rule as it depends on the nature of each problem. There are many factors affecting the convergence rate, e.g., large values of the learning rate $\alpha$ make the Q-values bounce around a mean value, whereas small values cause them to converge slowly.
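To make the difference between the three temporal-difference methods concrete, the sketch below writes out the standard textbook one-step targets for Q-learning, SARSA and Expected SARSA under an $\epsilon$-greedy policy. It is not a transcription of Algorithm 2. All function names and the numerical values are hypothetical.

```python
def epsilon_greedy_probs(q_next, epsilon):
    """Action probabilities of an epsilon-greedy policy over q_next (action -> value)."""
    best_action = max(q_next, key=q_next.get)
    probs = {action: epsilon / len(q_next) for action in q_next}
    probs[best_action] += 1.0 - epsilon
    return probs


def td_target(method, reward, q_next, gamma, epsilon=0.0, next_action=None):
    """One-step target for a non-terminal next state."""
    if method == "q_learning":
        # Greedy bootstrap: maximum over the next state's action values.
        return reward + gamma * max(q_next.values())
    if method == "sarsa":
        # Bootstrap from the action actually selected in the next state.
        return reward + gamma * q_next[next_action]
    if method == "expected_sarsa":
        # Bootstrap from the policy-weighted average of the next action values.
        probs = epsilon_greedy_probs(q_next, epsilon)
        return reward + gamma * sum(probs[a] * q_next[a] for a in q_next)
    raise ValueError(f"unknown method: {method}")


def td_update(q_value, target, alpha):
    """Move the current estimate toward the target with learning rate alpha."""
    return q_value + alpha * (target - q_value)


# Hypothetical next-state action values and reward.
q_next = {"serve": 2.0, "reject": 4.0}
for method in ("q_learning", "sarsa", "expected_sarsa"):
    target = td_target(method, 1.0, q_next, gamma=0.9, epsilon=0.2, next_action="serve")
    print(method, target, td_update(0.0, target, alpha=0.1))
```

When the next action in SARSA is chosen greedily ($\epsilon = 0$), its target coincides with the Q-learning target in this sketch.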

FIGURE 7. The performance in terms of $R$ for the FN with $N = 15$ in various IoT environments when applying the RL methods (QL, SARSA, E-SARSA and MC) given in Algorithms 1 and 2, and the utility filtering algorithm in network slicing with different slicing thresholds. The performances of the RL methods are indistinguishable here, and better than the conventional filtering-based network slicing in all environments thanks to their learning and adaptation capability.
The $\epsilon$ value in the $\epsilon$-greedy policy also controls the convergence, since unnecessary exploration delays it. The step size $n$ after which we update the state values or Q-values also affects the convergence, depending on the problem. For instance, MC updates the state values at the end of an episode regardless of how long it is, which makes it slower to exploit the updated state values in making better actions, whereas QL, SARSA and E-SARSA, using $n = 1$, update the Q-value at every time step. Unlike in MC, with these methods the FN needs to keep updating two Q-values for each state instead of one state value. Hence, we have 300 Q-values to estimate in order to learn the optimal policy.
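The end-of-episode nature of the MC update can be sketched as follows. This is a generic every-visit, constant step-size Monte Carlo update using the standard discounted return, and it is only assumed to resemble Algorithm 1. The function names, state labels and numbers are hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Return G_t for every step t, where rewards[t] follows the action at step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Sweep backward so that G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def mc_update_state_values(values, visited_states, rewards, gamma, alpha):
    """Update V(s) for all visited states, only after the episode has ended."""
    returns = discounted_returns(rewards, gamma)
    for state, g in zip(visited_states, returns):
        old = values.get(state, 0.0)
        values[state] = old + alpha * (g - old)
    return values


# Hypothetical 3-step episode: visited states and the rewards that followed.
values = mc_update_state_values({}, [3, 14, 25], [1.0, 0.0, 2.0], gamma=0.5, alpha=0.1)
print(values)
```

No state value changes until the whole reward sequence is known, in contrast to a one-step method that can update after every transition.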
We compare the performance of the RL methods, in terms of $R$, as shown in (13), with that of utility filtering-based network slicing with various slicing thresholds in the 19 IoT environments. The utility filtering algorithm uses the same threshold for network slicing regardless of the environment. For the RL methods, we consider the simulation setup shown in Table 4, and for the utility filtering-based network slicing we consider all possible slicing thresholds $1, 2, \ldots, 10$. As shown in Figs. 7 and 8, the RL methods exhibit the best performance as they learn how to balance early termination with higher total served utilities. They never terminate too early or too late ($T \approx 27$ for all environments, as seen in Fig. 8), as opposed to the utility filtering-based network slicing, which is not adaptive to the environment. As seen in Fig. 7, the performance of the utility filtering algorithm with slicing thresholds 1, 2, 3, 8, and 9 is consistently below that of the RL algorithms. The average termination time for slicing thresholds 1, 2, and 3 is about 15, which is the minimum termination time, yet they do not achieve good performance. Slicing threshold 4 has a comparable performance to RL for the environments $E_2$ to $E_5$, after which its performance starts to decline. Although slicing thresholds 5, 6, and 7 perform close to RL for environments with medium to high $\rho$, they perform far from RL for IoT environments with small $\rho$. The performance of slicing threshold 10 is much worse than that of threshold 9 for all environments due to its long termination time, which exceeds 280 time-steps; thus it does not appear in Figs. 7 and 8.
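The fixed-threshold baseline can be sketched as a short simulation. The sketch makes several assumptions that the text does not confirm: a request is served if and only if its utility is at least the slicing threshold, each served request occupies one RB, RBs are not released within an episode, and the metric has the form $R = \sum_m u_m - \theta (T - M)$. The function name, the value of $\theta$ and the uniform utility distribution are hypothetical.

```python
import random


def run_threshold_episode(class_probs, threshold, n_rbs, theta, rng):
    """Simulate one episode of fixed-threshold filtering; return (R, T)."""
    utilities = list(range(1, len(class_probs) + 1))
    servable = sum(p for u, p in zip(utilities, class_probs) if u >= threshold)
    if servable <= 0:
        raise ValueError("no utility class can pass this threshold")
    served = []
    time_steps = 0
    # The episode ends when all RBs are occupied.
    while len(served) < n_rbs:
        time_steps += 1
        utility = rng.choices(utilities, weights=class_probs, k=1)[0]
        if utility >= threshold:  # assumed serving rule
            served.append(utility)
    idle_time = time_steps - len(served)
    return sum(served) - theta * idle_time, time_steps


rng = random.Random(0)
uniform = [0.1] * 10  # hypothetical environment
for threshold in (1, 5, 10):
    r, t = run_threshold_episode(uniform, threshold, n_rbs=15, theta=0.5, rng=rng)
    print(threshold, r, t)
```

Under these assumptions, threshold 1 always terminates after exactly $N$ time steps, while a high threshold serves only high utilities at the cost of a long idle time.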
The performances of the RL methods are very close to each other, hence it is not easy to distinguish them in Figs. 7 and 8. For a clearer view, Fig. 9 compares the four RL methods in terms of the performance ratio with respect to the performance of slicing threshold 4, i.e., $R_{\mathrm{RL}} / R_{\mathrm{Thld4}}$. QL has the best performance with an average performance ratio of 104% in all IoT environments and a peak of 106% in $E_9$, followed by E-SARSA and MC. SARSA has the same performance as QL because the greedy policy, i.e., $\epsilon = 0$, was used.

VI. CONCLUSION
We proposed a Markov decision process (MDP) formulation for the resource allocation problem in Fog RAN for IoT services with heterogeneous latency requirements. Several reinforcement learning (RL) methods, namely Q-learning, SARSA, Expected SARSA, and Monte Carlo, were discussed for learning the optimum decision-making policy adaptive to the IoT environment. Their superior performance over utility filtering-based network slicing methods and their adaptivity to the IoT environment were verified through extensive simulations. The RL methods strike the right balance between two conflicting objectives, maximizing the average total served utility and minimizing the fog node's idle time, which helps utilize the fog node's limited resource blocks efficiently. As future work, we consider expanding the presented resource allocation framework to more challenging scenarios such as dynamic resource allocation with heterogeneous service times and numbers of resource blocks needed, and collaborative resource allocation with multiple fog nodes.

FIGURE 1. Fog-RAN system model. The FN serves heterogeneous latency needs in the IoT environment, and is connected to the cloud through the fronthaul links represented by solid lines. Solid red arrows represent local service by the FN to satisfy low-latency requirements, and dashed arrows represent referral to the cloud to save the FN's limited resources.

where $\gamma \in [0, 1]$ is the discount factor. $\gamma$ represents the weight of future rewards with respect to the immediate reward: $\gamma = 0$ ignores future rewards, whereas $\gamma = 1$ means that future rewards are of the same importance as the immediate rewards. The objective of the MDP problem is to maximize the expected initial return $E[G_0]$. In the presented MDP, for an FN that has a computing and processing capacity of $N$ RBs, there are $U(N + 1)$ states, $s_t \in S = \{1, 2, 3, \ldots, U(N + 1)\}$, where $U$ is the greatest discrete utility level. At the initiation time $t = 0$, all RBs are available, i.e., $b = 0$, hence from (3), there are $U$ possible initial states $s_0 \in \{1, 2, \ldots, U\}$ dependent on $u_0$. The MDP terminates at time $T$ when all RBs are occupied, i.e., $b_T = N$, hence similarly there are $U$ terminal states $s_T \in \{UN + 1, UN + 2, \ldots, U(N + 1)\}$. Note that a policy treating the MDP problem can continue operating after $T$, as in-use RBs become available in time, by taking actions similarly to its operation before $T$.
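The state layout described above can be checked with a few lines of code. The sketch assumes the mapping $s = U b + u$ between the number of occupied RBs $b$, the utility $u$ of the current request, and the state index. This mapping is consistent with the stated ranges of initial and terminal states, but it is an assumption here rather than a quotation of (3). All function names are hypothetical.

```python
def state_index(occupied_rbs, utility, num_utilities):
    """Assumed mapping s = U * b + u, with u in 1..U and b in 0..N."""
    return num_utilities * occupied_rbs + utility


def decode_state(state, num_utilities):
    """Inverse of state_index: return (occupied RBs b, utility u)."""
    occupied_rbs, remainder = divmod(state - 1, num_utilities)
    return occupied_rbs, remainder + 1


def is_terminal(state, n_rbs, num_utilities):
    """Terminal states are U*N + 1, ..., U*(N + 1), i.e., those with b = N."""
    return state > num_utilities * n_rbs


U, N = 10, 15
all_states = range(1, U * (N + 1) + 1)
terminal = [s for s in all_states if is_terminal(s, N, U)]
non_terminal = [s for s in all_states if not is_terminal(s, N, U)]
# Two actions (serve, reject) per non-terminal state.
print(len(all_states), len(terminal), len(non_terminal), 2 * len(non_terminal))
```

For $U = 10$ and $N = 15$ this reproduces the 160 states, 10 terminal states and 300 Q-values mentioned in the simulations.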

FIGURE 3. The first 3 levels of the backup diagram for the MDP with a 2-RB FN ($N = 2$). Non-terminal states and terminal states are represented by open circles and squares, respectively, and labeled by the states according to (3). $r_s \in \{r_{sh}, r_{sl}\}$ and $r_r \in \{r_{rh}, r_{rl}\}$ are the rewards for serving and rejecting, respectively, and depend on $u_h$.

FIGURE 4. Learning the optimum policy of the MDP by applying the Monte Carlo method given by Algorithm 1 to obtain the optimal state values required in (11). The IoT environment $E_7$ is considered, and the FN is equipped with 15 RBs. The 16 state values shown in the figure are a sample of the 150 non-terminal state values.

FIGURE 5. Learning the optimal action-value function $Q^*(s, \text{serve})$ required in (12) using the Q-learning method given by Algorithm 2. Q-values converge to the optimal values after around 4000 episodes. The IoT environment $E_7$ is considered, and the FN is equipped with 15 RBs.

FIGURE 6. Learning the optimal action-value function $Q^*(s, \text{reject})$ required in (12) using the Q-learning method given by Algorithm 2. Q-values converge to the optimal values after around 5000 episodes. The IoT environment $E_7$ is considered, and the FN is equipped with 15 RBs.

FIGURE 8. The average termination time $T$ in time-steps for the FN with $N = 15$ in various IoT environments when applying the RL methods (QL, SARSA, E-SARSA and MC) given by Algorithms 1 and 2, and the utility filtering algorithm in network slicing with different slicing thresholds. RL methods manage to have a steady termination time in all environments.

FIGURE 9. Comparison between the performance of RL methods in terms of relative performance with respect to the utility filtering algorithm in network slicing with slicing threshold 4. QL and SARSA coincide because the greedy policy is used in the simulations.

TABLE 1. Summary of notations and abbreviations.

TABLE 2. State transitions of a 5-RB FN for a sample of IoT requests and random actions with $U = 10$, $u_h = 6$.

FIGURE 2. State transition graph for the MDP episode given in Table 2.

TABLE 3. Utility distributions for various IoT environments with heterogeneous latency requirements.

TABLE 4. Summary of simulation parameters and their values.