Cloud and Edge Computation Offloading for Latency Limited Services

Multi-access Edge Computing (MEC) is recognised as a solution in future networks to offload computation and data storage from mobile and IoT devices to servers at the edge of mobile networks. It reduces network traffic and service latency compared to passing all data to cloud data centers, while offering greater processing power than handling tasks locally at terminals. Since MEC servers are scattered throughout the radio access network, their computation capacities are modest in comparison to large cloud data centers. Therefore, the offloading decision between MEC and cloud servers should minimize resource usage while maximizing the number of accepted delay-critical requests. In this work, we formulate the joint optimization of communication and computation resource allocation for computation offloading (CO) requests with strict latency constraints. We show that the global optimization problem is NP-hard and propose an efficient heuristic solution based on the single-user optimal solution. Simulation results compare the proposed algorithm, under different system parameters, with the optimal solution and with a baseline solution in which tasks are allocated in the order of arrival. They show that our algorithm performs close to the optimal in terms of resource utilization and outperforms the baseline algorithm in terms of acceptance rate.


I. INTRODUCTION
Emerging applications, such as augmented and virtual reality, face recognition and language processing, are becoming more computationally demanding. At the same time, upcoming wearables such as AR (augmented reality) glasses, and various IoT devices such as medical instrumentation, implants and sensors, are limited in terms of available battery energy and computational capacity. While CO is recognised as a potential solution to this problem, the use of existing distant cloud services is becoming impractical, especially for the upcoming delay-critical applications [1], [2]. The Multi-access Edge Computing (MEC) concept aims at bringing external computational resources closer to the users. European Telecommunications ... delay-critical applications that cannot be offloaded to a distant cloud require sophisticated decision-making algorithms. In this paper, we propose joint allocation of wireless (communication) and computational resources for MEC and Cloud server offloading with strict delay requirements for multiple computational tasks. We refer to a computational infrastructure where tasks are dynamically assigned to a MEC or Cloud server as a two-tier infrastructure [5].
Delay-critical applications have different requirements in terms of data throughput, latency and reliability, and the need for further classification of low-latency traffic is recognised. Furthermore, it should be noted that even the definition of communication latency itself is not unique, but instead depends on the use case [6]. Latency requirements can be expressed by stochastic measures, such as that the expected value and the variance of the latency must remain under a predefined threshold [7], [8], or that the latency should be under a certain threshold with a certain reliability. In the literature, the former case is also referred to as probabilistic latency, while the latter is referred to as deterministic latency [9]. In this paper, we consider the deterministic latency definition and focus on computational tasks that have a strict requirement to be completed within a given time period. This definition is appropriate for use cases such as VR (virtual reality), AR, and real-time control, where, for example, images have to be processed before the human eye can detect the lag. A number of emerging IoT applications are expected to fall into this category of computational offloading. Delay-critical and high-reliability applications are significant for the development of telemedicine services. For example, remote monitoring of patients through a number of wearable sensors and implants requires reliable and efficient external data processing.
In summary, our paper presents the following major contributions:
• A globally optimal resource allocation algorithm for the two-tier computational offloading architecture (2TCOA) that jointly minimizes computation and communication resource usage, while maximizing the number of accepted tasks and meeting the strict latency requirements.
• An efficient heuristic solution for dynamic CO resource allocation decisions that depend on the instantaneous network conditions and computational demands, including the trade-off between selecting the low-latency MEC server and the high-capacity cloud server. The heuristic resource allocation (HRA) algorithm, based on the single-task optimal allocation, is designed to emulate the optimal allocation.
• Extensive simulation results that demonstrate the performance of our proposed scheme.
The rest of the paper is organized as follows: Section II presents related work on CO to multi-access edge and cloud servers. Section III provides a general system model and formulates the problem of optimal resource allocation in 2TCOA for latency-limited traffic. In order to design an efficient heuristic solution, we analyse the performance of the optimal algorithm in Section IV. The heuristic solution is introduced in Section V. Section VI provides simulation results illustrating the performance of our schemes. Finally, Section VII concludes the paper.

II. RELATED WORK
Resource allocation for cloud and edge CO is extensively studied in the literature. Decision criteria on whether to offload a computational task or perform it locally at the end-user device are considered in [10]-[13]. The work in [10] proposes joint allocation of physical resource blocks and MEC computation resources with consideration of interference, in order to minimize the overall consumption of the entire system in terms of time and energy. Since this problem is non-convex, the authors propose a solution based on graph coloring. Similarly, in [11], a multi-user CO problem for MEC in a multi-channel wireless interference environment is studied using a game-theoretic approach. Here, a potential game is used to determine the bandwidth assignment, while cloud computing capability is determined according to the subscribed contract. MEC servers are assumed to have unlimited capacity, an assumption that is more appropriate for cloud servers. Another game-theoretic approach assumes that users are not rational but have subjective perceptions under an uncertain wireless environment [13]. The authors formulate the users' decision on whether to offload as a prospect-theory-based non-cooperative game. The optimal offloading probability and transmit power that minimize delay and energy consumption are calculated in [12], using the Interior Point Method to solve a non-linear optimization problem. The energy-delay tradeoff for MEC offloading in [14] is formulated as a mixed integer programming problem and solved by convex approximation.
Resource allocation for cloudlet offloading, a similar problem to MEC offloading, is studied in [15]-[17]. Cloudlet technology has been introduced to deploy mobile cloud services at the network edge. A cloudlet is a server that has direct wireless access, such as WiFi. In a computation-intensive environment, cloudlets can efficiently process computationally intensive tasks. The authors in [15] propose a heuristic solution for computational latency minimization for cloudlets without considering communication delay. An Integer Programming formulation that minimizes the total cost of providing services, while taking into account the probability of resource availability, is presented in [16]. The minimization of delay in a multi-cloudlet system is studied in [17]. The authors solve two mixed Integer Linear Programming (ILP) problems, first to select the appropriate cloudlet and then to allocate the resources. Nowadays, cloudlets are considered inadequate due to their limited wireless coverage [18]. On the other hand, MEC has superior offloading techniques provided by the mobile network, with low latency and high bandwidth [18], wide coverage and better security.
The majority of studies on CO consider latency minimization without strict latency constraints. The work in [19] proposes the joint optimization of the radio resources and the computational resources in a MIMO multicell system in order to minimize the overall users' energy consumption, while meeting strict latency constraints. While that work focuses on the one-tier problem, i.e. offloading to MEC, our work studies a two-tier offloading architecture, namely offloading to MEC and cloud servers.
VOLUME 9, 2021
Several works study MEC co-operation [20]-[22]. The co-operation of mobile cloud providers, in terms of resource sharing, is studied in [20]. In [21], the authors propose collaborative task offloading. Price-based dynamic resource management for co-operation between cloudlets is presented in [22].
A few papers, such as [23]-[25] and [26], consider a two-tier offloading architecture. In these works, learning algorithms are utilized to allocate resources. A semi-Markov decision process criterion is used in [23], where the optimization problem is solved using linear programming. The authors oversimplify the two-tier problem by assuming that offloading to the distant cloud only occurs when there is no cloudlet coverage. In addition, no explicit delay constraints are considered. Two virtual machine allocation methods based on a semi-Markov decision process are proposed in [24] to balance the tradeoff between the high cost of providing services by the remote cloud and the limited computing capacity of the local fog. A model-based planning method and a model-free reinforcement learning (RL) method are used to allocate virtual machines. The authors distinguish between high- and low-priority services. Only high-priority services can access the cloud, while low-priority services can be accepted if there is free space in the fog. Neither delay constraints nor bandwidth allocation are considered. A reinforcement learning solution is proposed in [25] for selecting appropriate collaborative edge servers and allocating the corresponding portion of the computing task to individual edge servers, as well as the radio bandwidth resource. A simple scheme is adopted that offloads a task to the cloud only when the edge servers are occupied. The authors also minimize the average service latency, although the maximum tolerated latency is introduced as a constraint. This problem is different from ours, where over-provisioning of resources is not possible, since the strict latency requirement is met exactly for each task in order to maximize the number of accepted tasks. The problem of two-tier offloading is addressed in these works by setting predefined policies.
In our work, the decision whether to execute a task at a MEC server or a cloud data-center is dynamic, and depends on the instantaneous network conditions and computational demands.
Finally, a few recent works consider resource allocation for dynamic two-tier computational offloading with explicit latency constraints. In [26], a heuristic algorithm that jointly allocates wireless bandwidth and computational resources to mobile devices is proposed to minimize the energy consumption of the system. In contrast, our work aims to minimize the total use of overall resources, which is equivalent to maximizing the number of performed tasks. Furthermore, joint optimization of two-tier computation offloading decisions and computation resource allocations for vehicular networks is considered in [27]. In contrast to our work, the authors do not consider the transmission rate allocation.

III. SYSTEM MODEL

A. COMPUTATIONAL OFFLOADING ARCHITECTURE
We consider a two-tier computation offloading model presented in Fig. 1. A number of users (UE), connected to base stations, have access to external computation servers for offloading computational tasks. Each Base Station (BS) is connected to a MEC server, which consists of a number of interconnected physical machines. The placement of MEC servers is flexible, i.e., a server can directly connect to a BS via high speed fibers, or to an edge switch so that multiple BSs can share the computing resources of the same MEC server. MEC servers are placed in the access network, and while the communication latency between BS and MEC is low, the computational capacity of MEC is typically limited when compared to a cloud server. Computational requests can also be forwarded from BS through a core network to a distant cloud server. The communication latency between the UE and cloud servers is typically high, while the capacity of a cloud server is typically also high and easily scalable. Therefore, for the purpose of our analysis, its capacity can be considered unlimited.
We consider a MEC server $M$ directly connected to a base station $B$, as shown in Fig. 2. A set of UEs connected to the BS $B$ generates a set $\mathcal{N} = \{1, \ldots, n, \ldots, N\}$ of limited-latency tasks. The available bandwidth at the moment of the allocation, i.e. the transmission rate in packets per second, is $R$. The minimum possible bandwidth that can be assigned to one UE is $R_{\min}$. Each UE $n$ has a computational request with the profile $Q^n = \{D^n, J^n, p^n_1, p^n_2\}$, where $D^n$ is the maximum allowed delay, which accounts for the total communication and computational delay of task execution. In order to complete a computational task $n$, $J^n$ instructions, i.e. CPU cycles, have to be executed. We denote by $p^n_1$ and $p^n_2$ the number of packets that have to be transmitted in order to forward the task to the server and the computation result back to the UE, respectively. Wireless bandwidth is typically scarce compared to the capacities of the optical links, so here we focus on the allocation of wireless (communication) resources. We assume that the transmission rates per session on the optical links are constant and known at the time of task allocation. The total computational capacity of a MEC server available at the moment of allocation, in terms of CPU cycles per second, is $F$. The computational rate assigned to a task $n$ at a MEC server and at a cloud server is $f^n_M$ and $f^n_C$ CPU cycles per second, respectively. A list of notations used throughout the paper is given in Table 1.

B. LIMITED LATENCY COMPUTATION OFFLOADING
We assume that the bandwidth allocated to the UE with task $n$ at the BS $B$ provides a rate of $r^n$ packets per second. The time necessary to offload a task to the BS is $p^n_1/r^n$, while relaying the computational result back to the UE takes $p^n_2/r^n$. We assume that the transmission rate of the limited-latency slice relaying the task between the BS and the MEC server is $r_s$ packets per second. The delay of sending a total of $p^n_1$ packets over $L_M$ links between the BS and MEC is $d^{B,M}_{n,1} = L_M/r_s + (p^n_1 - 1)/r_s$ [28]. Similarly, relaying the computational result back to the UE needs $d^{B,M}_{n,2} = L_M/r_s + (p^n_2 - 1)/r_s$, and thus the total delay between the BS and MEC for relaying task $n$ is $d^{B,M}_n = (2L_M + p^n_1 + p^n_2 - 2)/r_s$. The execution time of task $n$ is $J^n/f^n_M$, and the total delay of offloading task $n$ to MEC must not exceed the delay threshold:
$$D^n_M = \frac{p^n_1 + p^n_2}{r^n} + d^{B,M}_n + \frac{J^n}{f^n_M} \le D^n. \qquad (1)$$
The delay of a task offloaded to the cloud server can be calculated in a similar way, taking into account the additional delay between the MEC and Cloud server, $d^{M,C}_n = (2L_C + p^n_1 + p^n_2 - 2)/r_s$, where $L_C$ is the number of links/hops on the route between the MEC and Cloud server. The condition for a delay guarantee in the Cloud can be expressed as:
$$D^n_C = \frac{p^n_1 + p^n_2}{r^n} + d^{B,M}_n + d^{M,C}_n + \frac{J^n}{f^n_C} \le D^n. \qquad (2)$$
The goal of the computation offloading (CO) resource allocation algorithm is to minimize the overall usage of network resources, while maximizing the number of computational requests served within their latency limit. Since we assume that the computational capacities of cloud servers are unlimited in comparison to the load of a single BS, the available bandwidth is the factor that limits the maximum number of tasks that can be handled in one scheduling period. We assume that the number of tasks forwarded to the allocation algorithm is $N < N^B_{\max}$, where $N^B_{\max} = R/R_{\min}$ is the maximum number of tasks that can be transmitted at the same time with the available bandwidth. In the extreme case, when the number of received requests is greater than $N^B_{\max}$, $N^B_{\max}$ tasks should be pre-selected.
One option is to select the shortest-deadline tasks first, i.e. those with smaller requested delays. Requests with a large delay tolerance can be left for the next allocation period. Another option is the introduction of service priorities, if the use cases permit it. In the next allocation period, the delay requirements $D^n$ of the deferred tasks have to be reduced by the amount of time each task has been waiting to be allocated.
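The delay model and the shortest-deadline pre-selection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Task` class, the function names and the numeric values in the usage note are hypothetical, and all rates and delays are assumed to be expressed in mutually consistent units.

```python
from dataclasses import dataclass


@dataclass
class Task:
    D: float  # maximum allowed delay
    J: float  # number of CPU cycles to execute
    p1: int   # packets sent from the UE towards the server
    p2: int   # packets carrying the result back to the UE


def wired_delay(task, hops, r_s):
    """Round-trip relaying delay over `hops` links: (2L + p1 + p2 - 2) / r_s."""
    return (2 * hops + task.p1 + task.p2 - 2) / r_s


def mec_delay(task, r, f_mec, L_M, r_s):
    """Total delay when the task runs at the MEC server."""
    wireless = (task.p1 + task.p2) / r
    return wireless + wired_delay(task, L_M, r_s) + task.J / f_mec


def cloud_delay(task, r, f_cloud, L_M, L_C, r_s):
    """Total delay when the task runs at the cloud server."""
    wireless = (task.p1 + task.p2) / r
    wired = wired_delay(task, L_M, r_s) + wired_delay(task, L_C, r_s)
    return wireless + wired + task.J / f_cloud


def preselect(tasks, R, R_min):
    """Keep at most R / R_min tasks, shortest deadline first (one of the options above)."""
    n_max = int(R // R_min)
    return sorted(tasks, key=lambda t: t.D)[:n_max]
```

A task can be admitted at a given server when the corresponding delay does not exceed `task.D`; for example, `mec_delay(task, r, f, L_M, r_s) <= task.D` checks the MEC condition for a candidate pair `(r, f)`.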
The overall usage of the network resources is minimized if the following two criteria are met: 1) there is no over-provisioning of communication and computational resources for any task $n$, i.e. $D^n_M = D^n$ or $D^n_C = D^n$, and 2) the execution of a task at MEC is favoured over forwarding it to the cloud. The latter reduces the load in the backbone network and reduces the demand for processing power, since the time frame for execution is not shortened by the additional network delay. Therefore, the objective of the global resource allocation optimization problem is to minimize the amount of allocated communication bandwidth and computational rate at the MEC and cloud servers. We define an indicator $a^n = 0$ if request $n$ is offloaded to MEC, and $a^n = 1$ if it is offloaded to the cloud server. The vector of these indicators is denoted by $\mathbf{a}$. The vectors of the computational rates assigned at the MEC and cloud servers are denoted by $\mathbf{f}_M$ and $\mathbf{f}_C$, respectively, while $\mathbf{r}$ is the vector of assigned transmission rates. The price of using $r$ units of the rate is $P_R(r)$, while the prices of computational resources at MEC and Cloud are $P_M(f)$ and $P_C(f)$, respectively. The global optimization problem for CO decision and resource allocation, referred to as problem (3), minimizes the total price of the allocated resources over $\mathbf{a}$, $\mathbf{f}_M$, $\mathbf{f}_C$ and $\mathbf{r}$. The execution of a task at the Cloud always requires more resources than the execution at the MEC server, i.e. $f^n_M + r^n_M < f^n_C + r^n_C$, where $r^n = r^n_M$ in order to satisfy (1) and $r^n = r^n_C$ in order to satisfy (2). Therefore, the objective function aims at maximizing the number of tasks assigned to the MEC server. The first constraint guarantees that the maximum allowed delay of each request is not exceeded. The second constraint ensures that the total allocated computational rate does not exceed the available capacity at the MEC server. Next, the third and fourth constraints limit the total allocated bandwidth at the BS to the available amount and require the bandwidth of a single UE to be at least the minimum permitted.
Each request can be allocated to only one server, MEC or cloud, so the indicator $a^n$ takes the value 0 or 1. Finally, the allocated computational rates have to be positive numbers.
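For reference, the objective and constraints described in this subsection can be collected into a single formulation. The following is a sketch reconstructed from the textual description only, assuming that $D^n_M$ and $D^n_C$ denote the MEC and cloud delays on the left-hand sides of (1) and (2); the exact form of (3), for example how the prices enter the objective, may differ.

```latex
\begin{align}
\min_{\mathbf{a},\,\mathbf{f}_M,\,\mathbf{f}_C,\,\mathbf{r}} \quad
  & \sum_{n=1}^{N} \Big[ P_R(r^n) + (1-a^n)\,P_M(f^n_M) + a^n\,P_C(f^n_C) \Big] \\
\text{s.t.} \quad
  & (1-a^n)\,D^n_M + a^n\,D^n_C \le D^n, && \forall n, \\
  & \sum_{n=1}^{N} (1-a^n)\, f^n_M \le F, \\
  & \sum_{n=1}^{N} r^n \le R, \\
  & r^n \ge R_{\min}, && \forall n, \\
  & a^n \in \{0,1\}, && \forall n, \\
  & f^n_M > 0,\; f^n_C > 0, && \forall n.
\end{align}
```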
The minimization in (3) is a mixed integer programming problem, since the indicators $\mathbf{a}$ are binary variables, and therefore problem (3) is NP-hard. The relaxation of the integer variables to $a^n \in [0, 1]$ is not suitable for this problem, as the execution of a task cannot simply be divided between the MEC and Cloud servers. The optimal solution can be obtained iteratively by testing different combinations of these binary variables. At each step, a value for the server association vector $\mathbf{a}$ is picked and a non-integer subproblem is solved. The problem does not have a solution for every value of the vector $\mathbf{a}$. In order to analyse the subproblem, we assume that the indicator vector $\mathbf{a}$ is given. We also aim at minimizing the total amount of resources used in the network, so we set the price of all resource units to one. Next, we substitute equations (1) and (2) into the first constraint of optimization problem (3), which yields problem (4). This optimization problem is quadratic, since the first and second constraints are not linear functions of the variables $r^n$, $f^n_M$ and $f^n_C$. In order to linearise them, we have to introduce auxiliary variables for $r^n$, $f^n_M$ and $f^n_C$, together with additional conditions, which gives the constraints (5). The hyperbolic constraints (5c) can be rewritten as second-order cone (SOC) constraints. We use the fact that, for $x, y \ge 0$, any relation of the form $z^2 \le xy$ can be transformed into $\left\| (2z,\; x - y)^T \right\|_2 \le x + y$, where $\|\cdot\|_2$ is the $\ell_2$ norm. The optimization problem in (4) is then equivalent to problem (7). The transformed sub-problem (7) is a second-order cone program (SOCP), and it has an efficient solution [29]. The objective to minimize $f^n_M + f^n_C + r^n$, together with the first and second constraints of problem (4), ensures that there is no resource over-provisioning. For the optimal solution we need $D^n_M = D^n$ or $D^n_C = D^n$, depending on the selected server, i.e. on $a^n$. In summary, the optimum allocation can be found by Algorithm 1.
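The rewriting of a hyperbolic constraint as a second-order cone constraint rests on the identity $(x+y)^2 - (x-y)^2 = 4xy$, valid for non-negative $x$ and $y$. The short check below is only a numerical illustration of that equivalence on random samples; the function names are hypothetical and the code is not part of the optimization itself. A constraint of the form $1 \le xy$ corresponds to the special case $z = 1$.

```python
import math
import random


def hyperbolic_ok(x, y, z):
    """Hyperbolic form: z^2 <= x * y (x and y assumed non-negative)."""
    return z * z <= x * y


def soc_ok(x, y, z):
    """Second-order cone form: ||(2z, x - y)||_2 <= x + y."""
    return math.hypot(2 * z, x - y) <= x + y


def count_disagreements(samples=10000, seed=0):
    """Count random points on which the two forms disagree."""
    rng = random.Random(seed)
    disagreements = 0
    for _ in range(samples):
        x = rng.uniform(0.0, 10.0)
        y = rng.uniform(0.0, 10.0)
        z = rng.uniform(-10.0, 10.0)
        # Skip points that sit numerically on the boundary z^2 = x * y,
        # where floating-point rounding could flip either test.
        if abs(z * z - x * y) < 1e-9:
            continue
        if hyperbolic_ok(x, y, z) != soc_ok(x, y, z):
            disagreements += 1
    return disagreements
```

Calling `count_disagreements()` should return zero, since both tests describe the same set of points away from the boundary.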
The algorithm first checks if all requests can be allocated to the MEC servers, as this is the most desirable outcome. If this is not possible, it investigates possibilities for offloading 1, 2, 3, ... requests to the Cloud server. The algorithm does not have to go through all the variations to find the minimum that solves (3). The biggest $A_M$ that has a solution for (7) is optimal for (3). This means that if there is an optimal solution for some $A^*_M$, all the solutions for $A^*_M + x$, where $x = 1, 2, 3, \ldots$, will require more resources because more tasks are offloaded to the Cloud server, and execution at the Cloud server requires more resources. For a given $A^*_M$ there can be more than one possible allocation, depending on the selected combination for the vector $\mathbf{a}$. Algorithm 1 finds which combination of $\mathbf{a}$ for $A^*_M$ gives the minimal objective function value $Solution_{\min}$. This solution is the optimal solution of the original problem (3). Although Algorithm 1 is more efficient than checking the solutions of (7) iteratively for all combinations of $\mathbf{a}$, it is still not efficient for a large number of requests $N$. Nevertheless, we use this solution as a benchmark for comparing the heuristic algorithm's performance. In order to construct an efficient sub-optimal algorithm, we analyse the characteristics of the optimal allocation in the rest of the section.
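The enumeration described above can be sketched as follows. This is an illustrative outline rather than Algorithm 1 itself: `solve_subproblem` is a hypothetical callback standing in for a solver of the continuous sub-problem for a fixed indicator vector, assumed to return the objective value, or `None` when the sub-problem is infeasible. The search stops at the smallest number of cloud-assigned tasks that admits a feasible allocation, which corresponds to the largest number of MEC-assigned tasks.

```python
from itertools import combinations


def optimal_allocation(num_tasks, solve_subproblem):
    """Return (cost, a) for the best indicator vector a, or None if none is feasible.

    a[n] = 0 places task n at MEC, a[n] = 1 places it at the cloud.
    """
    for num_cloud in range(num_tasks + 1):
        best = None
        # Try every way of sending exactly `num_cloud` tasks to the cloud.
        for cloud_set in combinations(range(num_tasks), num_cloud):
            a = tuple(1 if n in cloud_set else 0 for n in range(num_tasks))
            cost = solve_subproblem(a)
            if cost is not None and (best is None or cost < best[0]):
                best = (cost, a)
        # The first feasible level uses the most MEC placements, so stop here.
        if best is not None:
            return best
    return None
```

The number of sub-problems solved at a given level grows combinatorially with the number of tasks, which is consistent with the remark that the exact search does not scale to a large $N$.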

D. CHARACTERISTICS OF THE OPTIMAL SOLUTION
In this sub-section we present and discuss the performance of the optimal resource allocation solutions for several simple examples. We choose examples that show certain characteristics of the optimal solution and typical tendencies in the algorithm's output. We are particularly interested in the effect that each parameter of the request profile has on the allocation decision.

1) SINGLE TASK OPTIMAL ALLOCATION
First we investigate the optimal resource allocation, in terms of minimizing the resource usage, when the bandwidth and computational resources are not scarce. We look at a single-task example where $F$ and $B$ (the MEC capacity and the bandwidth) are unlimited, to find the optimal split between the two degrees of freedom: computation delay and communication delay. The computation rate that needs to be allocated to task $n$ in order to execute the task within the delay constraint $D^n$ can be expressed as a function of the allocated bandwidth. The bandwidth allocation $r^n$ of request $n$ can take one of two values: $r^n_M$ if assigned to MEC and $r^n_C$ if assigned to the Cloud server. For the MEC case we get from (1), with equality:
$$f^n_M = \frac{J^n}{D^n - d^{B,M}_n - (p^n_1 + p^n_2)/r^n_M}.$$
The computation rate has to be a positive number, so we can obtain the minimum bandwidth requirement. For the MEC allocation the following condition must hold:
$$r^n_M > \frac{p^n_1 + p^n_2}{D^n - d^{B,M}_n}.$$
For the Cloud server allocation, the corresponding condition is:
$$r^n_C > \frac{p^n_1 + p^n_2}{D^n - d^{B,M}_n - d^{M,C}_n}.$$
In order to find the analytical expression for the single-task optimal MEC allocation, we need to find the minimum value of $f^n_M + r^n_M$. Similarly, the optimum resource allocation for the Cloud server minimizes $f^n_C + r^n_C$. We refer to these solutions as the single-task optimal solutions for MEC and Cloud allocation.
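The single-task minimization can be worked out in closed form from the delay condition taken with equality. Writing $p = p_1 + p_2$ and $T$ for the delay budget left after the fixed wired delay, the required computation rate is $f(r) = J/(T - p/r)$, and setting the derivative of $r + f(r)$ to zero gives $r^* = (p + \sqrt{pJ})/T$ and $f^* = (J + \sqrt{pJ})/T$. The sketch below implements this derivation; it is derived here from the stated delay model and is not necessarily the form in which the paper prints its expressions, and the function names and the numbers in the usage note are hypothetical.

```python
import math


def required_rate(D, J, p, fixed_delay, r):
    """Computation rate needed to meet deadline D exactly at bandwidth r.

    Returns None when r is at or below the minimum bandwidth p / (D - fixed_delay).
    """
    slack = D - fixed_delay - p / r
    if slack <= 0:
        return None
    return J / slack


def single_task_optimum(D, J, p, fixed_delay):
    """Pair (r, f) that minimizes r + f subject to meeting the deadline exactly.

    fixed_delay is the BS-MEC delay for a MEC placement, or the sum of the
    BS-MEC and MEC-cloud delays for a cloud placement.
    """
    T = D - fixed_delay
    if T <= 0:
        raise ValueError("deadline is shorter than the fixed network delay")
    root = math.sqrt(p * J)
    return (p + root) / T, (J + root) / T
```

For instance, with `D=5.5`, `J=100`, `p=25` and `fixed_delay=0.5` (arbitrary consistent units), the budget is `T = 5` and the optimum is `r = 15`, `f = 30`; moving `r` away from 15 in either direction raises the total `r + f` above 45.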

2) MULTIPLE TASK OPTIMAL SOLUTION
In the first example we consider $N = 5$ requests with the delay limited to $D^n = 5$ ms $\forall n$. We set the other task parameters to be the same as in the previous example. The available bandwidth capacity is $R = 100$ packets/second and the vacant MEC computational capacity is $F = 100$ MIPS. Each request has $J^n = 100$ MI to execute and $p^n_1 + p^n_2 = 20$ packets to send. The round-trip delay between the BS and the MEC server is $d^{B,M}_n = 0.5$ ms, and between the MEC server and the Cloud center it is $d^{M,C}_n = 2$ ms. The optimal server resource allocations $f^n_M$, $f^n_C$ are shown in Fig. 4, while the optimal bandwidth allocation is presented in Fig. 5 for each task. For the optimal solution, the hyperbolic constraints in (7) hold with equality. Therefore, the solution of Algorithm 1 is the optimal solution of (3).
Results show that three requests are allocated to the MEC server and two requests are allocated to the Cloud. Since the requests have equal demands, equal amounts of resources are allocated at each server. When the delay requirement is the same, the services executed on the Cloud server require more computational power and bandwidth than requests executed at the MEC server. This is due to the additional delay through the network, $d^{M,C}_n$. We can see that in an optimal solution a task is allocated to the MEC server whenever possible. The allocated bandwidth and computational resources $(r^n, f^n_M)$ of tasks $n = 1, 3, 4$ are equal to the single-task optimal MEC allocation, while $(r^n, f^n_C)$ of tasks $n = 2, 5$ are equal to the single-task optimal Cloud allocation, as presented in Fig. 3. Note that the request profiles $Q^n$ of the tasks in this example are the same as in the single-task optimal example. In this example, since bandwidth is not the limiting factor, the optimal resource allocation is a collection of single-task optimal allocations. Next, we consider an example with the same parameters, except that request $n = 2$ has a higher number of packets to transmit, $p^2_1 + p^2_2 = 40$ packets. The optimal computational capacity and bandwidth allocations are presented in Fig. 6 and Fig. 7 for each task. Similarly to the previous example, three tasks are allocated to MEC. However, in contrast to the previous example, request 2 is now allocated to the MEC server. The request with the higher number of packets needs more resources to be executed at both the MEC and Cloud servers, compared to the other requests. This time request 2 is placed at the MEC server because it is more efficient from the resource point of view to send one of the less demanding requests to the Cloud. We can also see that both the bandwidth and the computational capacity allocations of $n = 2$ are higher than those of the other tasks ($n = 3, 4$) allocated to MEC.
This is because the excess delay, caused by the higher number of packets, can be compensated by a lower computation delay and faster transmission. The allocations $(r^n, f^n_C)$ are close to the single-task optimal. However, in this example the bandwidth is the limiting factor, $\sum_n r^n = R$. Therefore, the tasks allocated to the cloud have slightly less bandwidth than in the single-task optimal case, while the additional delay is compensated by a higher computational rate allocated at the Cloud server.
In the third example, we consider the same parameters as in the first example, except that request $n = 2$ has a higher number of instructions, $J^2 = 200$ MI. The optimal solution is presented in Fig. 8 and Fig. 9. The request with the higher number of instructions, $n = 2$, is assigned to the Cloud server, since the high number of instructions requires a higher computational capacity to complete the task in a limited time. If allocated to a MEC server, it would occupy more capacity than the other requests. Therefore, a smaller number of requests would be allocated to the MEC server, which is typically less efficient, especially when both bandwidth and MEC capacity are scarce. Again, the bandwidth is the limiting factor and the allocations of all tasks differ slightly from the single-task optimal.
Finally, in the fourth example, we investigate the effect of the delay constraint on the optimal allocation. In addition to the parameters set in the first example, we set $D^5$ to 4.5 ms. Requests with a stricter delay limit require more resources to be completed in time than the others. The aim of the algorithm is to place shorter-delay requests at MEC, because the delay between the MEC and the Cloud server further decreases the delay budget, in turn leading to even higher resource consumption. Results in Fig. 10 and Fig. 11 show that only two short-delay requests are assigned to MEC. Due to the lack of MEC resources, one short-delay request is assigned to the Cloud server. As expected, compared to the other request assigned to the Cloud server, it requires more bandwidth and computational resources.

IV. HEURISTIC LIMITED LATENCY CO ALGORITHM
Because finding the optimal resource allocation is computationally inefficient, in this section we propose a heuristic algorithm for the limited-latency CO problem. Based on the properties and trends of the optimal solution, discussed in the previous section, we construct a computationally simple algorithm that achieves close to optimal performance. Equations (12)-(15) show that a limited-latency computational task has a unique resource allocation pair $(r^{n*}_M, f^{n*}_M)$, which minimizes the total resource consumption in the network. The analysis of the optimal solutions showed that when resources are not limited, all the requests have the single-task optimal allocation. If both MEC capacity and bandwidth are sufficient for such an allocation, all requests are allocated to the MEC server for execution. If the MEC server capacity becomes the limiting factor, while the bandwidth is sufficient, some tasks need to be allocated to the Cloud server. The optimal solution shows that both MEC-allocated and Cloud-allocated requests then have the single-task optimal allocation.
If the bandwidth is scarce, the allocated rate $r^n$ is lower than the single-task optimal $r^{n*}$, and the computational rate $f^n_M$ or $f^n_C$ is higher than the single-task optimal $f^{n*}_M$ or $f^{n*}_C$, in order to compensate for the excess communication delay. In this case the optimal allocation cannot be predicted in advance. However, based on these examples, we can see that the bandwidth is fully utilised. If the bandwidth is scarce, it is not possible to tell whether the MEC capacity is sufficient based on the knowledge of the single-task optimal allocation.
In our Heuristic Resource Allocation (HRA) algorithm, we assume that the requests are allocated one by one. So far we have identified three regimes in which the optimal solution has different characteristics. In the first case, both MEC and bandwidth resources are sufficient and all requests are allocated with the single-task optimal allocation. This is easy to replicate with the proposed heuristic algorithm, where each request in turn is allocated $(r^{n*}_M, f^{n*}_M)$ if $\sum_n f^{n*}_M \le F$ and $\sum_n r^{n*}_M \le R$. The second case is when the bandwidth is sufficient and the MEC resources are scarce, i.e. $\sum_n f^{n*}_M > F$. The sufficiency of the bandwidth cannot be checked in advance, i.e. before it is determined which requests are sent to the Cloud server. First, the heuristic algorithm allocates the tasks one by one with $(r^{n*}_M, f^{n*}_M)$ to MEC, and when MEC runs out of capacity, it allocates them to the Cloud. However, allocating the requests in the order of arrival is suboptimal. Instead, the requests with shorter delays should be processed first, because the goal is to allocate them to the MEC server. Example 4 also shows that the shorter-delay request is allocated to MEC in the optimal solution. It is preferred for requests with a high number of instructions to be executed at the Cloud server, as shown in Example 3 in Fig. 8 and Fig. 9. This is because they require a lot of computational resources and can block the MEC resources from many other tasks. Finally, requests with a higher number of packets should be processed before those with fewer packets, in order to increase their chance of being admitted to MEC, because it is not efficient to send a large amount of data out of the local network. The heuristic algorithm should first sort the requests in ascending order of delay. If two requests have the same delay requirement, the one with the lower number of instructions $J$ is allocated before the one with the higher number.
If also the number of instructions is the same for both, the requests with the higher number of packets p1 + p2 should be processed first. Sorting the resources does not guarantee that the optimal solution would be achieved, but it increases efficiency compared to the algorithm working on a FIFO principle.
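The three-level sorting rule described above can be sketched as a single composite sort key. This is a minimal illustration, not the authors' implementation: the `Request` class and its field names are hypothetical, chosen only to mirror the delay threshold, the instruction count J and the packet count p1 + p2.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Request:
    """Hypothetical container for one offloading request."""
    name: str
    delay: float         # delay threshold D
    instructions: float  # number of instructions J
    packets: int         # total number of packets p1 + p2


def sort_requests(requests: List[Request]) -> List[Request]:
    """Order requests for allocation.

    Primary key:   delay, ascending (tightest deadline first).
    Secondary key: instructions, ascending (lighter tasks first).
    Tertiary key:  packets, descending (negated to reverse the order).
    """
    return sorted(
        requests,
        key=lambda q: (q.delay, q.instructions, -q.packets),
    )


if __name__ == "__main__":
    demo = [
        Request("a", 5.0, 100, 20),
        Request("b", 3.0, 200, 10),
        Request("c", 5.0, 100, 30),
        Request("d", 5.0, 50, 10),
    ]
    print([q.name for q in sort_requests(demo)])
```

Because Python compares tuples element by element, the later keys only break ties left by the earlier ones, which matches the tie-breaking order in the text.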
The allocation in these two cases, i.e. when the bandwidth is sufficient, can be handled by a single algorithm, as shown in Fig. 12. We refer to this part of the HRA as Procedure 1. It sorts the requests, approximately by the amount of resources they need, and applies the single-task optimal allocation. It first tries to assign each task to the MEC server and, if the vacant MEC resources are not sufficient, assigns it to the Cloud server.
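A minimal sketch of Procedure 1 is given below, assuming the single-task optimal rates of each request are already known and passed in as inputs. The `Demand` class and the function name are hypothetical. The indicator convention (0 for MEC, 1 for Cloud), the unlimited Cloud capacity and the early termination when the bandwidth runs out follow the description of Algorithm 2 in this section.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Demand:
    """Hypothetical record of single-task optimal rates for one request."""
    r_mec: float    # transmission rate if served at the MEC
    f_mec: float    # computational rate if served at the MEC
    r_cloud: float  # transmission rate if served at the Cloud


def procedure_1(demands: List[Demand], total_bw: float,
                total_mec: float) -> Optional[List[int]]:
    """Tentatively assign already sorted requests, one by one.

    Returns a list of indicators (0 = MEC, 1 = Cloud), or None as soon
    as the vacant bandwidth cannot serve a request, which signals that
    the allocation has to be redone with Procedure 2.
    """
    vacant_bw, vacant_mec = total_bw, total_mec
    assignment: List[int] = []
    for d in demands:
        if d.r_mec <= vacant_bw and d.f_mec <= vacant_mec:
            assignment.append(0)
            vacant_bw -= d.r_mec
            vacant_mec -= d.f_mec
        elif d.r_cloud <= vacant_bw:
            # Cloud capacity is treated as unlimited, so only the
            # transmission rate has to fit.
            assignment.append(1)
            vacant_bw -= d.r_cloud
        else:
            return None
    return assignment


if __name__ == "__main__":
    same = [Demand(10, 40, 15)] * 3
    print(procedure_1(same, total_bw=40, total_mec=100))
    print(procedure_1(same, total_bw=30, total_mec=100))
```

The function only records assignments and commits nothing, which reflects the statement that resources are allocated only once every request has been assigned successfully.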
In the third case, the bandwidth resources are scarce, $\sum_n r_n^{*M} > R$, while the MEC resources are either sufficient, $\sum_n f_n^{*M} \le F$, or scarce, $\sum_n f_n^{*M} > F$. This means that $r_n < r_n^{*M}$ for some $n$. However, it is not possible to know how much smaller the bandwidth allocation $r_n$ is than the optimal one. On the other hand, since the bandwidth allocation decreases, the computational rate allocation increases compared to $f_n^{*M}$, so there is no guarantee that all tasks can be executed at the MEC server, i.e. that the MEC resources are sufficient. This means that the cases where the MEC capacity is sufficient and where it is scarce cannot be distinguished, and both need to use the same heuristic procedure. Again, the first step is to sort the requests. The whole available bandwidth is divided between the tasks. Then, the requests are processed one by one. The MEC rate $f_n^{M}$ is calculated based on (13). If the vacant MEC resources are sufficient, the request is assigned to MEC; otherwise $f_n^{C}$ is calculated and the request is assigned to the Cloud. We refer to this part of HRA as Procedure 2, Fig. 12.
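The bandwidth division step of Procedure 2 can be sketched as follows. It uses the rule given in the description of Algorithm 2: each task first receives its minimum rate and the leftover bandwidth is split equally. The function name is hypothetical, and returning `None` when the minimum rates alone exceed the bandwidth is an assumption of this sketch, since the text does not specify that case.

```python
from typing import List, Optional


def split_bandwidth(min_rates: List[float],
                    total_bw: float) -> Optional[List[float]]:
    """Divide the whole bandwidth between the tasks.

    Each task gets its minimum rate plus an equal share of what is left.
    Returns None if the minimum rates do not fit (assumed behaviour).
    """
    if not min_rates:
        return []
    needed = sum(min_rates)
    if needed > total_bw:
        return None
    share = (total_bw - needed) / len(min_rates)
    return [r + share for r in min_rates]


if __name__ == "__main__":
    print(split_bandwidth([10.0, 20.0, 30.0], total_bw=90.0))
```

With these rates fixed, the MEC computational rate of each request would then be obtained from (13), which is not reproduced here.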
If the bandwidth is sufficient, the resource allocation algorithm selects Procedure 1. If the bandwidth is not sufficient, it selects Procedure 2. However, as mentioned earlier, when the MEC capacity is scarce, it is not possible to tell a priori whether the bandwidth is sufficient, since we do not know which tasks will be allocated to the Cloud. The optimal Cloud allocation requires more bandwidth than the optimal MEC allocation. Insufficiency can therefore only be detected during the Procedure 1 allocation, if at any point the requested rate is greater than the leftover vacant bandwidth, $r_n > B$. In this case, Procedure 1 is terminated and the allocation is performed from the start by Procedure 2. During the execution of Procedure 1, tasks are only assigned to the servers. The allocation itself is performed only at the end of the algorithm.
The steps of the HRA algorithm are summarized in pseudocode in Algorithm 2. Similar to the Optimization algorithm, the first step is to determine the number of requests $N$ processed in that allocation period. In the next step, the requests are sorted in the manner described above. At the start of Procedure 1, the vacant capacities $R_B$ and $F_M$ are set to the total capacities available at the time of allocation, $R_B = R$ and $F_M = F$. For each request, it is checked whether it can be assigned to the MEC with $(r_n^{*M}, f_n^{*M})$. If the vacant MEC capacity is insufficient, it is checked whether the request can be assigned to the Cloud server with $(r_n^{*C}, f_n^{*C})$. It should be noted that only the rate condition needs to be checked, because the Cloud capacity is considered unlimited. The indicator $a_i$, where $i$ is the index in the sorted list of requests, is set to 0 if the request is assigned to MEC or to 1 if the request is assigned to the Cloud. After each assignment, the vacant capacities are updated. Finally, if the last request is successfully assigned, the resources are allocated to each task based on the calculated assignment. If the bandwidth is not sufficient for any request $i$, Procedure 1 is terminated and the algorithm continues with Procedure 2. The vacant capacities are reset to the initial values $R_B = R$, $F_M = F$, and the available bandwidth is split between the tasks. Each task is first assigned the minimum rate $r_i^{\min}$ and the leftover bandwidth is split equally. For each request $i$ and assigned bandwidth $r_i$, the algorithm calculates the $f_i^{M}$ that satisfies the delay constraint. Next, it is checked whether request $i$ can be assigned to the MEC. If yes, $a_i$ is set to 0 and

V. NUMERICAL RESULTS
In this section we analyse the performance of the Heuristic Resource Allocation (HRA) algorithm. We compare its performance to the optimal solution as well as to a simple heuristic algorithm. Due to the lack of an appropriate two-tier architecture resource allocation algorithm with strict latency limitations in the literature, we use the following benchmark algorithm: requests are allocated in the order of arrival with the single-task optimal allocation. This ensures that the latency condition is met.
In Fig. 13, the performance of our algorithm is compared to the performance of optimal resource allocation as defined in (3). Since both the optimal allocation and HRA have the same acceptance rate, we compare the total amount of resources the algorithms utilize to serve all computational requests. This is the sum of the transmission rates at the wireless link and computational resources at the MEC and Cloud servers. Due to the large number of input parameters in Fig. 13, we present total resources used in 4 multi-task example scenarios defined in Section V C.
Depending on the case, the results show that HRA performs either equally to or slightly worse than the optimal allocation, while its implementation is significantly simpler. In the cases where the bandwidth is sufficient for the single-task optimal allocation, such as in Example 1, our proposed algorithm performs optimally. If the single-task optimal allocation is not possible, our algorithm has sub-optimal performance. In Example 2, our HRA algorithm needs around 8.3% more resources than the optimal allocation to accommodate the same tasks. The sub-optimal solutions are therefore close to the optimal while requiring significantly less implementation effort.
In Fig. 14, we compare the total resource consumption of our HRA and the optimal algorithm as the number of requests increases from $N = 2$ to $N = 7$. Every request has a delay threshold of $D = 5$ ms for the execution of $J = 100$ MI and the transmission of $p_1 + p_2 = 20$ packets. The round trip delay is $d_n^{B,M} = 0.5$ ms between the BS and the MEC server, and $d_n^{M,C} = 2$ ms between the MEC server and the Cloud center. The available bandwidth capacity is $W_b = 100$ packets/second, and the vacant MEC computational capacity is $F = 100$ MIPS. The capacity parameters stay fixed as the number of requests increases. Our results show that the HRA algorithm performs optimally up to $N = 5$ requests. For a higher number of requests, the HRA algorithm performs very close to optimal.
Next, we compare performance of our solution to the baseline algorithm. The baseline algorithm does not have the same acceptance rate as our algorithm, so we cannot conduct a meaningful comparison with respect to the total resources used. Instead in Fig. 15, the number of accepted requests is presented for the example scenarios, defined in Section V C. HRA always finds a feasible allocation to accommodate all the delay-constrained requests. The baseline algorithm underperforms when the bandwidth is not sufficient for the single-task optimal allocation.
In Fig. 16, we compare the percentage of accepted requests using the HRA and the Baseline algorithm, as the number of requests increases from N = 2 to N = 7. The other parameters are the same as in the setup for Fig. 14. Each request has equal demands and the capacities of the system are fixed. The baseline algorithm performs well for a small number of requests, i.e. when the system capacity is sufficient for the total demand. However, as the number of requests increases, the performance of the baseline algorithm rapidly degrades. For N = 7, the request acceptance rate is only around 55%.
In the previous example, the HRA algorithm has an advantage over the Baseline algorithm when the system capacity is insufficient for the single-task optimal allocation. This is because the HRA algorithm utilizes Procedure 2 in Algorithm 2. However, the HRA algorithm has an advantage over the Baseline algorithm even when using Procedure 1, due to the sorting of the requests. In the previous example, this is not utilised since all requests are equal. We use the following example to illustrate the benefits of sorting requests in the HRA algorithm. The available bandwidth capacity In Fig. 17, the percentage of accepted requests for the HRA and Baseline algorithms is presented for different orders of arrival. The performance of the Baseline algorithm depends on the order of arrival: for certain orders it performs equally to the HRA, but for others it performs worse. This is a consequence of the single-task allocation using more bandwidth resources when the request is sent to the Cloud service, and therefore it is important to intelligently sort the requests into those to be sent to the Cloud and those to be sent to the MEC.

VI. DISCUSSION AND CONCLUSION
In this paper we study the problem of two-tier computational offloading of latency-limited tasks. We formulate the joint optimization problem to obtain the offloading decision, i.e. server selection, and the allocation of communication and computation resources necessary to execute the tasks. The objective is to minimize the usage of the resources while maximizing the number of accepted latency-limited task requests. We show that the global optimization problem is NP-hard and we propose a computationally efficient heuristic solution. Simulation results show the efficiency of our proposed algorithm, compared to the optimal solution and a benchmark heuristic algorithm. We show for different system parameters that our proposed solution outperforms the benchmark algorithm in terms of the acceptance rate and gives allocations that are equal or close to optimal. Minimizing the usage of resources reduces the cost of computational offloading for infrastructure and service providers. At the same time it improves the scalability of the system, which is especially important in networks with a large number of IoT devices. Our algorithm enables the execution of computational tasks with strict delay requirements, which is necessary for delay-critical services. Maximizing the number of accepted requests further improves the quality of experience for the end-users and the overall system reliability.