Distributed Edge Computing Offloading Algorithm Based on Deep Reinforcement Learning

As a mode of processing task request, edge computing paradigm can reduce task delay and effectively alleviate network congestion caused by the proliferation of Internet of things(IoT) devices compared with cloud computing. However, in the actual construction of the network, there are various edge autonomous subnets in the adjacent areas, which leads to the possibility of unbalance of server load among autonomous subnets during the peak period of task request. In this paper, a deep reinforcement learning algorithm is proposed to solve the complex computation ofﬂoading problem for the heterogeneous Edge Computing Server(ECS) collaborative computing. The problem is solved based on the real-time state of the network and the attributes of the task, which adopts Actor Critic and Policy Gradient’s Deep Deterministic Policy Gradient(DDPG) to make optimized decisions of computation ofﬂoading. Considering multi-task, the heterogeneity of edge subnet and mobility of edge devices, the proposed algorithm can learn the network environment and generate the computation ofﬂoading decision to minimize the task delay.The simulation results show that the proposed DDPG-based algorithm is competitive compared with the Deep Q Network(DQN) algorithm and Asynchronous Advantage Actor-Critic(A3C) algorithm. Moreover, the optimal solutions are leveraged to analyze the inﬂuence of edge network parameters on task delay


I. INTRODUCTION
IN order to meet the access network requirements of different devices in different scenarios [1], network operators use heterogeneous access methods (wired or wireless access point) when they build networks. Limited by the computing capability of the IoT device itself, the IoT device will offload some tasks to the cloud server to reduce the device's own burden. However, with the advent of the 5G era, the number of connected devices will be exploded, and a large number of tasks from connected IoT devices will squeeze bandwidth and cause network congestion. In addition, the long distance between the cloud server and the IoT device has brought about severe time delay in the cloud computing mode.
The associate editor coordinating the review of this manuscript and approving it for publication was Chuan Huang .
Aiming at the above problems of the cloud computing model and meeting the requirements of network devices for low latency, low energy consumption, and better quality for network services, academia and industry have begun to conduct in-depth research on new computing modes, and proposed paradigms of edge computing types such as mobile edge computing(MEC), fog computing [2], and cloudlet [3]. Edge computing is a shared application mode with communication, computing, and storage functions. Compared with cloud computing, edge computing mode is closer to connected devices in geography, which can effectively solve the problems of time extension and network congestion in cloud computing mode [4], [5]. However, compared to cloud servers, the resources and service scope of a single edge server are limited. When excessive edge devices offload tasks to an edge server, the entire edge network resources will be underutilized.
In order to alleviate the burden on a single edge server and effectively utilize the computing resources of the whole edge network, a distributed MEC model was proposed [6]- [8] to coordinate global edge servers for distributed computing. Without loss of generality, we name the edge server connected by Edge Device (ED) as access Edge Computing Server (aECS) and the adjacent edge server as collaborate Edge Computing Server (cECS). In distributed MEC mode, aECS assigns part of the received task to the nearby cECS through its associated coordinator. In this way, distributed MEC makes each edge server responsible for part of the task in order to effectively use edge network resources.
However, the existed modes become complicated due to the collaboration between edge servers as well as mobility of services. In heterogeneous network scenarios, offload decisions in the distributed MEC mode becomes very complicated. That is because the offload decisions include task division between ED and aECS, aECS and cECS, as well as device mobility in wireless scenarios. It is difficult to handle such a complex decision problem by traditional optimization algorithms. Fortunately, the revolution in artificial intelligence technology triggered by the rise of deep learning has brought new ideas to solve complex model problems. In solving high-dimensional state and action space decision problems, the deep reinforcement learning (DRL) algorithm [9]- [11] has been proven to be more effective than the traditional methods. Therefore, deep reinforcement learning is adopted to solve the above computation offloading and services migration issues in collaborative edge computing.
In this study, edge servers from an overall perspective of heterogeneous edge subnets are selected to solve the multiuser edge computing offloading problem by using distributed computing manner. Based on DDPG, a distributed computing algorithm is proposed, by which different types of tasks can be offloaded to appropriate locations in the edge subnet for execution. Simulation studies will be carried out to compare with traditional computation offloading methods in heterogeneous scenarios [12]. The contributions of this article are as follows: 1) The computation offloading model is proposed in heterogeneous network scenarios. Different from the existing edge computing offloading model in a single scenario, the proposed model can effectively utilize the computing resources among heterogeneous subnets, and maximize resource utilization in heterogeneous scenarios.
2) The resource allocation algorithm based on reinforcement learning with Task completion rate(TCR) and total task time delay is proposed. Compared with pure consideration of delay, TCR and delay can better offload different types of tasks to different site of network, and more effectively use broadband and computing resources.
3) The performance of the proposed solution is compared with the DQN algorithm. In addition, DQN and other else offloading strategies are used to analyze the impact of computing resource allocation on offloading decision calculation.

II. RELATED WORK
Edge computing is proposed for edge access network with distributed cloud computing capabilities. Compared with cloud computing mode, it has lower latency and higher computing capabilities [13]. However, the computing capability of a single MEC server has not been able to meet the challenge of the growing number of connected devices. Thus, computing offloading and resource allocation have become hot issues in edge computing [14]- [16]. Traditional resource allocation and calculation offloading are solved by clustering or convex optimization methods [17]- [21]. [17] designed a microcloud computing (CLOUDLET) in an Orthogonal Frequency Division Multiplexing Access (OFDMA) system with multiple mobile devices, which can simultaneously allocate radio resources and computing resources. The work in [18] used time division multiple access (TDMA) at the communication end to download the results in a predetermined time slot. The literature [19] Obtained the optimal data allocation scheme based on particle swarm optimization of simulated annealing. Unlike the above three studies, which are optimized from the perspective of download links, Authors in [20] optimized the mission uplinks and achieved satisfactory results. The above method is optimized from the perspective of traditional communication. The above researchers have used traditional optimization methods to optimize the allocation of channel resources and computing resources, and have achieved certain results. In recent years, researchers have tried to optimize the offloading and resource allocation with different optimization objectives. The method was proposed in [21] generated considering both cost and performance factors, an effective resource allocation scheme can be generated. A resource allocation and provisioning algorithm was proposed in [22] by using resource ranking and provision of resources in a hybrid and hierarchical fashion. Reference [23] proposed a Support and Confidence based (SCB) technique which optimises the resource usage in the resource monitoring service. However, as the network scale becomes increasingly complex, there are solutions which are better than the traditional methods.
In order to achieve better edge network resource utilization, the tasks are divided into subtasks in a distributed computing manner and handed over to multiple edge servers for processing into a computing offload mode [24]- [26]. The literature [24] utilize specific repetitive structures of computation allocation at the user to provide coding opportunities that reduce the shuffling load by a factor that increases linearly with the number of users. The work in [25] developed the ''Stackelberg'' game to model the interaction between the edge cloud and users, where the edge cloud sets prices based on limited computing capability to maximize its revenue. For a given price, each user makes offload decisions locally to minimize their own costs. Authors in [26] a decentralized edge cloud infrastructure that explores the use of voluntary resources for computing and data storage.
At present, most of the research on computation offloading is in a single scenario. With the rollout of 5G network construction, multiple access methods and edge subnets are possible in an area. Several studies on multi-access edge computing have been considered [27]- [29]. Authors in [27] considered a MIMO multi-cell system in which multiple mobile users (MUs) required the transfer of computing tasks to a public cloud server. The literature [28] decomposed the original problem into a resource allocation (RA) problem with fixed task offloading decisions and a task offloading (TO) problem that optimizes the optimal value function corresponding to the RA problem, solve the RA problem by using convex optimization manner, and a novel heuristic algorithm is proposed for the TO problem. This algorithm achieves a suboptimal solution in polynomial time. Software defined network (SDN) are used in heterogeneous edge network. Reference [30] proposed a heterogeneous vehicular networks by using the SDN. Reference [31] propose a software-defined adaptive transmission control protocol (SATCP) for selecting various transmission control policies to adapt to the timevarying vehicular environment. Reference [32] propose a SINET customized solution enabling crowd collaborations for software defined vehicular networks. Reference [33] propose a joint resource allocation and task scheduling approach to efficiently allocate the computing resources to virtual machines and schedule the offloaded tasks. The work in [29] proposed an optimization framework for computation offloading and resource allocation for multi-server mobile edge computing system. The framework aims to minimize system-wide computing overhead by jointly optimizing each computing decisions, transmission capabilities, and computing resources on the server.
In recent years, some studies related to resource allocation and computation offloading use DRL as a solution [34]- [38]. The literature [34] started with finite blocklength code, combined with Deep q-learning algorithm for resource allocation, reduced delay violation rate, and achieved better results than random and equal scheduling benchmark. Reference [35] used the deep reinforcement learning method to make data migration decisions for multi-access edge computing in a dynamic network environment. Reference [36] used DDQN on virtual edge computing and obtained good performance in offloading computing. The literature [37] applied the Monte Carlo tree search algorithm to the resource allocation of MEC, and the performance of the scheme was significantly better than that of DQN. Reference [38] combined the computing offload of edge computing with block chain, comprehensively considered the time delay energy consumption and the cost of the block chain, and achieved a good effect with the DRL-based algorithm.
It can be seen from the simulation results of the above article that the DRL-based algorithm performs better than traditional optimization algorithms. In the field of DRL, we find that the algorithm based on Q-learning cannot reach the optimal due to its discrete actions. So the Actor-Criticbased DDPG algorithm is used to optimize the strategy of this paper. Different from previous research, this paper comprehensively considers distributed computing offloading method and DDPG algorithm to optimize edge computing in heterogeneous edge network scenarios. Fig. 1 show the 3 level architecture of the heterogeneous edge network. From bottom to top are IoT device layer, heterogeneous edge computing layer and cloud computing layer. Among of them, the heterogeneous physical layer is composed of different types of edge devices, such as driverless cars, smart homes, smart phones, etc. In this architecture, the networking scenarios (such as wired and wireless network scenes) where edge devices located are different. The characteristics of the tasks (such as traffic characteristics and time delay characteristics) to be performed on the device are also different. Users access the network in different ways through various IoT devices. These devices generate tasks and send requests to edge subnets which they connect. Agents of autonomous edge subnets will get the network state of other subnets through distributed file system. After receiving the task request from the Internet of things device, the agent makes the decision of computing and offloading according to the characteristics of the request. After training, computation intensive tasks tend to be executed on cloud servers, while time sensitive task tend to be performed on local and edge servers.

B. HETEROGENEOUS NETWORK MODEL
In the proposed Structure, the subnets of the edge layer are not only autonomous but also heterogeneous. These edge subnets may be composed in different ways. As described in Fig. 1. In order to meet the requirement for different types of tasks, an edge device will be covered by multiple heterogeneous edge subnets at the same time with its own access mode. Edge subnets with different structures have different characteristics. For example, a wireless subnet composed of interconnected wireless access points (AP) can handle mobile edge devices well and has minimal requirements for the infrastructure of the subnet, while wired edge subnets have the most abundant communication resources and are suitable for tasks with heavy traffic characteristics. Wireless edge subnet has the best mobility, fiber wireless hybrid edge subnet has both mobility and high bandwidth and low latency, wired edge subnet has the highest bandwidth and the lowest latency, but it cannot be suitable for moving edge equipment. Edge device selects different subnet computing offloading according to the task properties.
In each subnet, an edge node that is fully connected to other ECS is designated as the coordinator, which is responsible for collecting task offload requests received by each ECS, making offloading decisions, and coordinating the computing resources of this subnet and other subnets to maximize Edge computing resources for each subnet.

C. WORKFLOW OF THE COORDINATOR'S WORK
As shown in Fig. 2, the decision workflow of computing offloading based on reinforcement learning can be divided into two parts: information input flow and decision output flow. Information input flow is the information flow from edge network to decision optimization engine based on reinforcement learning. The decision output flow is the decision control flow from the decision optimization engine to the edge network entity. The edge device accesses the edge network through different access methods and requests to offload the task upward. At this time, the service request coordinator obtains the task request from the edge device, then extracts the feature information of the task, abstracts it into a feature vector suitable for the task feature of the decision engine, and inputs the decision optimization engine. When Ed generates tasks and sends service requests to the network, the requests which consist of the task's eigenvectors will be captured by agents in the subnet. In the network architecture proposed in this paper, the coordination between edge servers is realized by distributed file system. Through the distributed file system, agents in each edge subnet can obtain the real-time network state. Because of this, the agents can make the right decision for computing offload. Agent in the edge network offload tasks to the appropriate location in the network for processing, so as to achieve the overall performance optimization.
After obtaining the network state matrix and task eigenvector, the decision optimization engine makes the decision of computation offloading and deploys the physical instance through the edge network agent. In section V, there are the further explanation of decision algorithm with Fig. 2.

D. HETEROGENEOUS EDGE NETWORK COMPUTING OFFLOADING MODEL
The edge network consists of N D edge devices, each of which is connected to a different ECS according to its communication characteristics. Limited by the weak computing capability of the edge device itself, the task execution time will be very long for computation-intensive tasks if they VOLUME 8, 2020 are all performed locally. Therefore, it is a good choice to offload tasks to the edge server in a reasonably way. In this process, each ECS will carry a large number of tasks from edge devices. We record the task waiting queue of edge server e i as q e i , which indicates that there are tasks with q e i time to be processed. If the q of ECS is very long, it means that the ECS is overloaded, and the task will be offloaded to the cloud server for processing.
Different types of edge devices tend to different task characteristics, such as different network access methods, different traffic characteristics, and different time delay requirements. In order to meet different types of tasks, heterogeneous edge network is needed to meet different task requirements. In this paper, access networks are chose to access wireless edge network by identifying the edge of the device type, such as smart car, smart phones, which have the strongest mobility. Sweeper robots with certain mobility and large data flow access conditions have priority access to wireless fiber hybrid edge subnet, while devices with large task data flow and optical fiber access conditions, such as smart TV, have priority access to wired edge subnet.
In this framework, tasks are distributed to local, edge and cloud servers respectively according to the strategy. According to the network state and task characteristics, aECS assigns tasks to the adjacent ECSs for collaborative processing. If the adjacent edge servers also have no spare computing capability, the tasks is sent to the cloud computing center for execution. When the adjacent cECS finishes the task, the task will be returned to access aECS, which will be integrated and returned to IoTD.

IV. PROBLEM FORMULATION
In this section, the optimization problem will be formulated with two objectives. The goal of this paper is to find a computing offloading strategy with expected minimum task delay under the condition of ensuring TCR. The task delay is determined by the maximum value of the local execution and the offloading computation.

A. NETWORK AND TASK STATUS AND CHARACTERISTICS
The state of each episode can be expressed as a state set S(S E , T d , S d , L E→E ), as explained below.
• S E represents the state set of the edge server, S E = {S j = (q e j , f e j )|j = 1, 2, . . . , m}, q e j is the task queue duration of the ECS e j , and f e j is the frequency of the ECS e j CPU.
Where q e j is given as follows: • T d represents the feature set of the task, T d = {c i , td i , s u i , s d i |i = 1, 2, . . . , N }, c i represents the computing cycle required to process task i, td i is the Deadline of task i, s t i is the size of task. • S d indicates the state and characteristics of the edge device, the task queue duration of edge device d j , and f d j is the CPU frequency of edge device d j . Where q d j is given by • L E→E represents the communication link bandwidth between the nodes.

B. OFFLOAD ACTION
The task offloading action consists of three parts: from ED to aECS, from aECS to cECS, and from aECS to cloud server. The offloaded task includes the task execution code and the task data. It has been found that the offloading ratio of the computing cycle is roughly equal to the total amount of task offload, so we merge the two sub-actions of the ED offload to aECS. Then each episode's set of computation actions can be expressed as A(A d→ê , Aˆe →e , Aˆe → e ), as specified below: • A d→ê indicates the proportion of task calculation/data amount offloaded from ED to aECS.
• Aˆe →e indicates the proportion of task calculation/data amount offloaded from aECS to cECS.
• Aˆe → e indicates the proportion of task calculation/data amount offloaded from aECS to cloud server.

C. COMPUTING MODEL 1) LOCAL EXECUTION
The local computing time of the task consists of two parts, namely the local execution time and the local task queue waiting time. The total completion time of the local computing task is calculated by the following formula:

2) SERVER EXECUTION
If the task is all executed locally, the ED computing capability is not sufficient to complete it within the task's deadline, so the ED needs to offload a portion of the task to the edge server. The time spent in the computation offloading consists of task queuing time, task execution time, and task transfer time. The edge calculation time + queuing time + transmission delay is calculated by the following formula.
The calculation formula of the transmission delay of the task is as follows.

3) OPTIMIZATION GOAL
Our optimization objective takes into account the expectation of task delay and TCR. Based on the above formula, the task completion time is We assume that there are P tasks in total, so the expected delay of all tasks is The purpose of this paper is to maximize the TCR while making the task latency expectation as small as possible. Therefore, when the task is not completed in the deadline, a negative reward is given, and when the task is completed in the deadline, a high positive reward is given. We set the reward as: In order to explain the role of reward more clearly, an example will be proposed as follow. Assuming that the deadline of a task t i is 8ms, and the execution time of the task is 9ms in a one training. So that the reward generated in this state is -9.
In the next iteration training, the execution time of task ti is 7ms, so the reward of the state is 14.68. After this iteration, DRL agent finds that the total rewards of the second iteration is greater than the previous, and actions in this iteration will be remembered by neural network. In the another iteration training, the execution time of task ti is 5ms, and the reward of this state is 22.55, which is greater than the previous two iterations. After continuous iterative training, the decisionmaking of DRL agent's computing offload will perform more outstanding in general. The meanings of the symbols defined in this article are shown in Table 1.

V. RESOURCE ALLOCATION ALGORITHM BASED ON DRL
In this section, the complex computation offloading problem is abstracted into the MDPs. In the next step, we propose a collaborative computing resource allocation algorithm based on deep reinforcement learning, which is used to solve the optimization problem of offloading action. It consists of four neural networks and an experience pool.

A. MARKOV DECISION PROCESS
The optimization goal t i is an single time slot object, it depends only on the current state. However, the state of network environment is dynamically, so the past network states are also important reference factors for computation offloading actions. If only the current state is taken to make a decision, the decision action made by agent will lack further vision. Reinforcement learning is a method to optimize problems in a dynamic environment. We describe the problem as a simple one-slot problem, and then add the past state on this basis. So that, we first formulated the problem as a MDPs, which can be represented by a tuple {S, A, P, R}, S is the state space of the model, and A is the action space of the model. P is the state transition matrix of the model, and R is the reward function of the model.The description of each element is as follows: • The state space in this article is defined as the state of each server and task in the edge network. The server set is S ESC = {s 1 , s 2 , . . . s n }, where n is the number of servers in the edge network. The server status is task waiting queue length, and the task set is A, where v is the total number of tasks. Task  • At each moment, after considering the task waiting queue length of each node and the deadline of the task itself, the agent must make an action to assign the task to each server for processing. We define the action space as and the constraint as The meaning of each element is given in Section 3.
• Every time an agent makes an action, the environment automatically gives a reward: here we define the reward value as (8), so the total reward is VOLUME 8, 2020 a t , s t+1 ). Our ultimate goal is maximin the total reward. In addition to the above four elements, there is a hyperparameter γ . γ is the future reward weight, the value range is [0, 1]. The value function is focused on the currently obtained reward when γ tends to 0. While. The value function will consider more rewards from the followed steps if γ tends to 1. In other words, γ makes decisions biased towards shortterm rewards or long-term rewards.

B. DYNAMIC RESOURCE OPTIMIZATION ALGORITHM BASED ON DRL
In the resource allocation decision, we need to directly interact with the environment to get the sample, and through the resulting sample estimate value function, the ultimate goal is to find out the optimal strategyπ * . The state space and action space of the network model in this paper are characterized by high dimensionality, dynamics, and non-discrete action, which need to be optimized in the process of sequence generation. Based on the above factors, we choose DDPGbased methods to optimize our decision-making algorithms. DDPG is derived from the improved version of Actor-Critic and Policy gradient algorithms, and also draws on the double network structure of DDQN [39].
As shown in the Fig. 2, the deployment of DRL consists of two parts: the network environment and the intelligent agent. The network environment consists of network nodes (edge nodes and cloud nodes), network monitors, and users. The network node receives the task offload request from the user, and the network monitor collects the information in the network in real time and interacts with the intelligent agent information to respond to the state change in the network.
DDPG uses a policy function π θ (s) to make a decision. It definitely maps a state to a specific action. Compared to a random strategy, its action choice is only one, which greatly improves the convergence of training. In Actor-Critic schema, the agent uses the Policy Gradient method to enhance Gradient, and directly selects the action with the highest probability in the current state through the policy function π θ (s). Correspondingly, the Critic network evaluates the current decision based on the TDerror between the value function and the current reward, and evaluates the behavior of the Actor. The gradient calculation formula for the deterministic strategy gradient based on Q value is: ∇ θ J (π θ ) = E s∼ρ π [∇ θ π θ (s)∇ a Q π (s, a)|a = π θ (s)] (9) As shown in the Fig. 2, the distributed dynamic resource optimization algorithm based on DRL is composed of four parts: edge network environment, experience pool, Actor double network and Critic double network, in which the two networks of Actor and the two networks of Critic have the same structure respectively.
The network control node interacts with the edge network environment to obtain the current network state and stores the current state vector φ(S), action vector A, reward R, and next state vector φ(S ) to the experience pool. In order to explore the action space more broadly, we add some noise to the action selected by the Actor. Then, after accumulating a certain amount of data in the experience pool, the data block with the size of mini-batch will be taken out and input into the estimated neural network to obtain the action value function. The loss function 1 m m j=1 (y j − Q(φ(S j ), a j , ω)) 2 is calculated together with the action value function of the target value network, and all parameters ω of the current network were updated through the gradient reverse transmission of the neural network.
With the loss function, Eval-Net uses gradient descent to update the parameters in the network. After a certain episode, Eval-Net copies the latest parameters to Target-Net. After several rounds of episode parameter update, the loss function of Eval-Net tends to converge, and after obtaining the trained neural network parametersω, the optimal strategy π * can be obtained.
First, we define the current target Q value, which is used to calculate the expected reward value of the action in the current state, which is defined as follows: This represents the weighted expectation of the current state of reward and possible future rewards, and is used to evaluate the value of the current state.
Next, we define how the Actor-Critic neural network parameters are updated: Unlike DQN, which directly copies the parameters of the target-network to the eval-network, this algorithm uses a gradual update method, and each parameter is updated only by a small amount. At the same time, in order to increase the randomness of the learning process and better explore the entire solution space, we add some noise to the learning process. The action selection expression is defined as follows: where η is noise. Next, we define the loss function of the Critic network and the Actor network. Critic network's loss function is defined as follows: For the loss function of the Actor network, refer to (15). The loss gradient defined in the article [39] is as follows: The computing offloading algorithm based on DDPG is as described in Algorithm 1. It consists of two parts: the initialization of the network environment and the DRL algorithm in the intelligent agent.

Algorithm 1 Computation Offloading in Heterogeneity Edge Environment
Input: environment parameters Initialize: the actor eval-network and tar-network with parameter θ, θ ; the critic eval-network and tar-network with parameter ω, ω ; 1: for episode in range(Max_Iteration) do 2: Initialize edge network state and task queue. 3: for each stept = 0, 1, 2 . . . do 4: Get the current state s t from edge network environment and transform it to vector φ(s t ).

5:
Offload the request according to policy π θ (φ(s))+ℵ in actor network; 6: Receive reward r t by (8); 7: Store the tuple {φ(s t ), a t , r t , φ(s t+1 )} to the reply memory D; 8: Get the next state s t+1 ; 9: Calculate the Q value y j by(10); 10: Update critic by minimizing (14); 11: Update the actor policy using the (15); 12: Update the Actor tar-network by (11); 13: Update the Critic tar-network by (12); 14: t=t+1; 15: end for 16: end for Output:Optimal Actor current network parameter θ, Critic current network parameter ω At the beginning, the decision made by the intelligent agent is close to the random algorithm. With the learning process of the DDPG-based computing offload algorithm, the obtained offload strategy is getting closer to the optimal algorithm. After the learning iteration ends, the learned DDPG neural network parameters are obtained.

VI. EXPERIMENTAL PERFORMANCE ANALYSIS AND COMPARISON A. HETEROGENEOUS NETWORK ENVIRONMENT
Because the real edge computing environment is not realized for us to the algorithm, we use the network package Networkx based on Python to build the network environment, and use tensorflow to implement the proposed algorithm. In this section, we perform simulation experiments on DDPG-based computing offloading algorithms in heterogeneous scenarios. In the standard configuration, three edge devices are connected in each scenario. In Fig. 3, the wired scenarios, edge devices are connected to the edge network through optical fibers. In this scenario, we only focus on computing offload, not considering service migration.
In the wireless access scenario, as shown in Fig. 3, we divide the scenario into two categories, one is pedestrian walking, and the other is vehicle movement. Pedestrians move in a low-speed environment, and vehicles move at a constant speed (0-50m/s) in a fixed direction. In the wireless access scenario, the device has mobility, so this scenario is a mobile edge computing scenario. In this scenario, we use the computing offloading scheme under MEC.

B. SIMULATION SETTINGS
The environment of this simulation is a three-layer edge network, including a terminal layer, a heterogeneous edge layer, and a cloud layer. There are 2-8 wireless edge servers in the edge layer and 1-4 wired edge servers. The ED with fiber access mode has no mobility. The service range of each wireless AP is 50×50m, and the speed of each wireless mobile ED is randomly 0-40m/s. Initially, there are edge devices in each AP service range, and the probability that each edge generates an offload request in each time slot meets the Poisson distribution.
The related parameters in the experiment are listed in Table 2, which are applied in simulation examples unless otherwise stated. DDPG's Actor network and Critic network are three-layer structure, and the second layer of the fully connected layer is composed of 200 neurons. In the first layer, the input state vector is normalized. The hyperparameters of other neural networks are shown in Table 3.

D. SIMULATION RESULTS
In this section, we first observe the convergence of the algorithm in the training process, and then compare it with the performance of other algorithms in the heterogeneous network scenario. As shown in Fig. 4 and Fig. 5, during the first 3300 rounds of training, the average task delay decreases rapidly. When the number of training reaches 3,500, the task delay tends to be stable. In the first 3,500 rounds of training, the task completion rate increased rapidly, and the training effect converges to 98%-99% after the 3500th.
As depicted in the Fig. 6(a) and Fig. 6(b), the proposed algorithm is compared with other algorithms in different settings of edge network environment. CPU Capacity and  Number of ECS are changed in Fig. 6(a) and Fig. 6(b). After 4000 iterations training, it is 0.8%, 10.9%, 30.5% lower in DDPG-based algorithm when compared with the A3C, DQN, Edge server computing and local computing. We can clearly see that the proposed algorithm and A3C-based algorithm performed significant better then the others. As both DDPG and A3C algorithms belong to actor-critic mode, their performances in each setting are very close. We set the capacity of each edge server as 2GHZ, 3GHZ, 4GHZ, 5GHZ, 6GHZ. When the number of edge servers increases from 3 to 12, the average task delay of other computing methods except local computing is gradually decreasing. After the number of servers is increased to 6, increasing the number of edge servers does not significantly improve the effect. It can be seen that with the improvement of the CPU capacity of the edge server, the average delay of tasks is decreasing. However, this kind of improvement is not obvious compared with the improvement of CPU capacity. We find that task delay is limited by the network bandwidth and the capacity of the terminal's own CPU, and the it has a theoretical limit.
In the MEC scenario, the mobile ED moves from the service area of one AP to the service area of another AP. Due to the migration, the moving speed of the mobile ED also affects the performance in this scenario. According to Fig. 7(a), in addition to the local execution of the ED, as the ED moves faster, the task latency expectation is increasing.   Under low speed conditions, the performance based on DDPG algorithm is significantly better than the rest of the offloading mode. As the speed increases to 40m/s, the performance based on DDPG algorithm is only slightly stronger than DQN and A3C.
The impact of the number of mobile terminals on task delay expectations is shown in Fig. 7(b). It can be seen that the algorithm based on DDPG performs optimally, and as the number of mobile terminals increases, the task delay expectation increases. In the mobile edge computing scenario, the increase of number of the mobile terminal brings about an increase in the number of task requests in the time slot, which increases the burden on the edge server.

VII. CONCLUSION
To satisfy three heterogeneous edge networks using one offloading algorithm, this paper combines three heterogeneous edge networks with remote cloud networks, and builds a three-layer edge cloud network with cloud-side collaboration. The heterogeneous edge network is used to implement the multi-access capability on the edge side, and the task completion and task delay are used as reward values. Thus, the trained allocation strategy can reasonably offload different types of tasks to different network locations to achieve the highest task completion rate. At the same time, task delay is minimized. It is verified by simulation studies that the DDPG-based algorithm designed in this study can effectively improve the system performance, reduce the task delay by up to 50% while that most tasks are completed within the deadline. However, the ECS divides tasks in terms of task delay and deadline in this study, we will consider the cost consumption in the future work. YUNZHAO  FENG QI is currently a Professor with the Beijing University of Posts and Telecommunications, engaged in scientific research, teaching, and standardization research in information and communication. His research interests include communications software, network management, and business intelligence. He has won two National Science and Technology Progress Awards. He has also written more than ten ITU-T international standards and Industry Standards. He served as the Vice Chairman for ITU-T Study Group 4 and Study Group 12.
ZHILI WANG is currently an Associate Professor with the Beijing University of Posts and Telecommunications, engaged in scientific, technology, and standardization research work in communication networks and computer science. His main research directions are network management, communications software, and interface testing. He has won one National Science and Technology Progress Awards and wrote more than eight ITU-T international standards, and successively served as the Working Party Chair for ITU-T Study Group 2 and Working Party 2.
XIUMING YU received the master's degree in business administration from the University of International Business and Economics (UIBE), in 2013. She is currently an Engineer with the China Electronics Standardization Institute (CESI). She currently researches in the area of cyber-physical systems, smart manufacturing, and the industrial Internet.
SUJIE SHAO received the Ph.D. degree from the Beijing University of Posts and Telecommunication, Beijing, China, in 2015. He is currently a Lecturer with the Beijing University of Posts and Telecommunication. His research interests include edge computing, the Internet of Things, smart grids, and communication network management. VOLUME 8, 2020