Multi-Agent Deep Reinforcement Learning for Cooperative Computing Offloading and Route Optimization in Multi Cloud-Edge Networks

Edge computing is a new paradigm that provides computing capability at edge servers close to end devices. A significant research challenge in edge computing is finding efficient task offloading to edge and cloud servers considering various task characteristics and limited network and server resources. Several reinforcement learning (RL)-based task-offloading methods have been developed because RL can immediately output an efficient offloading decision after pre-training. However, these methods either do not take clouds into account or focus only on a single cloud. They also do not take into account the bandwidth and topology of the backbone network. Such shortcomings strongly limit the range of applicable networks and degrade task-offloading performance. Therefore, we formulate a task-offloading problem for multi-cloud and multi-edge networks considering network topology and bandwidth constraints. We also propose a task-offloading method that is based on cooperative multi-agent deep RL (Coop-MADRL). This method introduces a cooperative multi-agent technique through centralized training and decentralized execution, improving task-offloading efficiency. Simulations revealed that the proposed method can minimize network utilization and task latency while minimizing constraint violations in less than one millisecond in various network topologies. They also show that cooperative learning improves the efficiency of task offloading. Finally, we demonstrated that pre-training with many resource-consuming tasks gives the proposed method generalization performance for various task types.


I. INTRODUCTION
With the development of communication technologies, diverse applications have emerged in various domains, e.g., healthcare, smart cities, and manufacturing. These applications are generally offloaded and processed in cloud servers because of the limited computation resources of end devices (EDs), e.g., personal computers, smartphones, Internet of Things devices, and cars. This process is called cloud computing (CC). Such offloaded applications' tasks consist of demands for computing and communication with various characteristics, e.g., traffic heavy, computing heavy, or latency sensitive. Since cloud servers are generally located far from EDs, offloading tasks to the cloud inevitably generates additional transmission latency. Therefore, CC degrades the performance of latency-sensitive tasks.
To address this issue, edge computing (EC) has been proposed to deploy computing resources at the edge servers close to EDs. Combining CC and EC provides multiple offloading choices, improving the efficiency of offloading. For example, offloading computing-heavy tasks to the cloud is effective because the cloud usually has sufficient computing resources. Similarly, offloading traffic-heavy and latency-sensitive tasks to the edge effectively shortens the transmission latency. Thus, CC and EC must be combined to improve the efficiency of various offloading tasks.
Several studies have addressed task-offloading problems for CC and EC [1], [2], [3], [4], [5], [6], [7], [8]. Methods based on mathematical optimization still incur high computational cost. For example, Wang et al. [1] accelerated the computation time by more than 100 times using their parallel optimization framework. However, due to the computational cost, they could only evaluate it in a small network with one cloud and a few base stations (BSs) (i.e., a few edge servers). To address this problem, reinforcement learning (RL) [9] has been gaining attention as a solution [4], [5], [6], [7], [8]. RL can immediately output the preferred task offloading by learning the relationship between input network patterns and output task offloading in advance.
Although RL can solve the problem of computation time, two issues remain. One is that these methods [4], [5], [6], [7] do not take CC into account or target only networks with a single cloud. As mentioned above, CC and EC need to be combined to improve task-offloading efficiency. Moreover, a typical network has multiple clouds. The other issue is that these methods [4], [5], [6], [7], [8] do not take into account the bandwidth capacity and backbone-network topology. Many studies attempted to minimize task latency by shortening the path that offloaded tasks take. When a control system does not take bandwidth capacity into account, some links may become congested due to the concentration of tasks. Therefore, the system should obtain the task offloading that minimizes latency while satisfying all link-bandwidth constraints. Some of the above studies also assumed a topology that is not practical as a backbone network, e.g., a full-mesh network.
With RL-based task-offloading methods, multi-cloud settings and network topology are challenging to consider because route optimization requires many variables. Even a minimal network with five nodes requires hundreds of variables to optimize all routes between nodes. The performance of RL drastically worsens as the number of variables increases [10]. Therefore, computing offloading and route optimization are difficult to solve jointly by using RL-based methods. Since the joint problem is a non-deterministic polynomial-time (NP)-hard mixed-integer linear program, it is also difficult to solve by mathematical optimization.
The key to solving this problem is that mathematical optimization can quickly solve the route-optimization problem on its own. Therefore, we develop a method for combining RL and mathematical optimization. This method calculates the optimal computing offloading by RL and the optimal routing by mathematical optimization. However, not all constraints may be satisfied when the two methods are calculated individually. Therefore, we propose a combining method that uses the solution of mathematical optimization for the RL reward, enabling all objectives and constraints of the two methods to be considered. This method is inspired by the extendable integrated control architecture [11], which can coordinate multiple control algorithms prepared for each control metric.
Improving RL is another critical issue. Multi-agent RL (MARL) is effective in solving more complex problems. It is a system of multiple agents interacting within a common environment. Each agent works with the other agent(s) to achieve a team reward. The learning cost of each agent can be reduced by assigning one agent to each task. However, when training decentralized and independent agents to optimize the team reward, each agent faces a non-stationary learning problem [12]. An example of this problem is that when all agents act independently and simultaneously, all tasks may be allocated to the lightly loaded server, resulting in server overload. To avoid the non-stationary problem, Zhan et al. [4] developed an algorithm combining MARL and game theory. However, the actions of decentralized agents based on game theory fall into a sub-optimal solution of the Nash equilibrium.
We formulate an optimal task-offloading problem for multi-cloud and multi-edge networks considering network topology and bandwidth constraints. We handle edge-to-edge and edge-to-cloud task offloading. Each task arriving at the nearest edge is either processed at that edge or offloaded to a neighboring edge or cloud. We define optimal offloading as a solution that maximizes server- and link-resource efficiency and minimizes task latency while satisfying the constraints of server and link capacity and task latency. The decision variables are the computing-resource allocation of tasks and the routing between the ED and the allocated server. We also propose a task-offloading method that is based on cooperative multi-agent deep RL (Coop-MADRL). We assign an agent to each edge that has learned the optimal task offloading. We introduce a cooperative multi-agent technique in which several agents jointly optimize a single reward in a centralized training and decentralized execution manner. This can prevent the non-stationary problem and improve task-offloading efficiency. The proposed method combines Coop-MADRL and mathematical optimization to take into account network topology and bandwidth constraints in the task-offloading problem. We use the solution of mathematical optimization in the learning process of Coop-MADRL. The proposed method calculates the optimal computing offloading by RL and the optimal routing by mathematical optimization. Therefore, our method can handle numerous routing variables without approximating or reducing them.
The main contributions can be summarized as follows:
• We formulate an optimal task-offloading problem for multi-cloud and multi-edge networks considering network topology and bandwidth constraints to find the optimal solution that maximizes server-resource, link-resource, and task-latency efficiency while satisfying the constraints of server capacity, link capacity, and task latency.
• We evaluated the effectiveness of the proposed method through simulations in terms of performance, network-topology dependency, computation time, and scalability regarding the number of tasks. The simulation results indicate that our method can find a solution that minimizes network utilization and task latency while minimizing constraint violations in less than one millisecond in various network topologies.
• We also evaluated the generalization performance of our method for unknown task patterns. The simulation results indicate that our method has generalization performance for various task types by pre-training with many resource-consuming tasks.
This paper is an extension of the conference version [13]. The extension can be summarized as follows:
• The primary extension is a comprehensive evaluation of the proposed method. We evaluated its performance, network-topology dependency, computation time, scalability, and generalization performance for unknown task patterns.
• We prepared three new comparison methods for evaluation, including exhaustive search and heuristic-based methods. We also prepared more diverse tasks for evaluation. In the conference version, we selected the task-computation request size from several discrete values. In this paper, each task parameter randomly takes a continuous value within a specified range.
The rest of the paper is organized as follows. Section II describes related work. Section III describes the formulation of the task-offloading problem. Section IV briefly reviews RL. Section V describes the proposed Coop-MADRL-based task-offloading method. Section VI describes the evaluation of its performance, and Section VII concludes the paper.

II. RELATED WORK
Several studies have addressed task-offloading problems for CC and EC [1], [2], [3], [4], [5], [6], [7], [8]. In general, CC has sufficient computing resources but inevitably increases network latency. On the other hand, EC can reduce network latency at the expense of abundant computing resources. These studies aim to optimize task offloading while considering the characteristics of heterogeneous computing resources. They sometimes refer to fog computing instead of CC or EC, or assume multiple layers in the edge network. When a fog network or a two-tier edge network considers the same heterogeneous resource characteristics, we treat such studies as equivalent for our purposes.
Table I summarizes the characteristics of these studies and ours. "Cooperation" in the method properties indicates whether multiple control algorithms are coordinated when a method contains multiple algorithms. It does not apply when a single algorithm determines all task offloads or when multiple algorithms determine them independently. "Comprehensiveness" in the evaluation properties indicates whether a method was evaluated under practical conditions, e.g., a sufficient number of edges and clouds, and under various metrics, e.g., computation time and scalability.
Wang et al. [1] considered a cooperative three-tier computing network by leveraging vertical cooperation among devices, edge nodes, and cloud servers and horizontal cooperation between edge nodes. They also presented a parallel optimization framework using the alternating direction method of multipliers (ADMM). However, they did not impose network-bandwidth constraints and did not assume multi-cloud networks. They evaluated minimal conditions with four edges and 40 devices. Yuan and Zhou [2] designed a profit-maximizing collaborative computation offloading and resource-allocation algorithm to maximize system profit and guarantee task-response-time constraints. They also developed a migratory-bird optimization method that is based on simulated annealing (SA) to obtain a close-to-optimal solution. However, they did not assume multi-cloud networks and backbone-network topology. They evaluated their method under very light conditions, such as one task every 20 seconds. Kai et al. [3] developed a collaborative computing framework to process mobile devices' tasks at terminals, edge nodes, and cloud centers. They presented a pipeline-based offloading scheme, in which both mobile devices and edge nodes can offload computation-intensive tasks to an edge node and cloud center in accordance with their computation and communication capacities, respectively. However, they did not assume multi-cloud networks and backbone-network topology and evaluated minimal conditions with 40 devices.
MADRL has gained attention as a solution to the problem of computation time [4], [5], [6], [7], [8]. Zhan et al. [4] designed a decentralized algorithm for computation offloading so that users can independently choose their offloading decisions. They developed their algorithm by combining MADRL and game theory. However, the actions of decentralized agents based on game theory fall into a sub-optimal solution of the Nash equilibrium. They only considered EC and did not consider CC. Nguyen et al. [5] presented a new collaborative offloading framework that is based on MADRL in heterogeneous edge networks where each ED acts as an intelligent agent to make offloading decisions collaboratively and achieve optimal system utility. However, they only considered EC and did not consider CC. Hou et al. [6] introduced a hierarchical task-offloading and resource-allocation method that is based on MADRL for the Cybertwin-based network. It can promote the flexible collaboration of EDs, EC servers, or CC servers to improve system processing efficiency and security. Although they described the formulation in detail, their paper remains only a concept, and they evaluated minimal conditions with a single cloud, three edges, and 100-300 devices. Ding and Lin [7] considered cooperative task offloading, where all edge servers cooperate to achieve good performance for the entire edge-CC system, such as low latency and energy costs. They addressed MADRL-based task offloading that takes into account cooperation among agents. However, they evaluated minimal conditions with a single cloud and five edges. They evaluated only one snapshot and did not evaluate statistical performance. They also did not conduct any practical evaluation, e.g., computation time or scalability. Zhang et al. [8] considered a three-layer distributed multi-access edge computing network where there are multiple clouds, EC servers, and EDs. They developed a distributed scheme that is based on MADRL. Each cloud jointly determines the offloading task and resource-allocation strategy on the basis of its inference of other clouds' decisions. However, they evaluated minimal conditions with 1-3 clouds, 1-3 edges, and 3-9 devices.
In summary, previous studies did not consider CC or targeted only networks with a single cloud. Zhang et al. [8] assumed multi-cloud networks, but their evaluation was under the limited condition of three edge nodes, and the effectiveness of their method under practical conditions remains unclear. Previous studies also did not consider the link capacity and topology of the backbone network; they treated the backbone network between clouds and edges as if it were a single link. To the best of our knowledge, our study is the first to focus on optimal task offloading for multi-cloud and multi-edge networks considering network topology and bandwidth constraints. Several studies have addressed MADRL-based task offloading by considering cooperation among agents similar to ours, but they do not meet the above network requirements. Furthermore, we introduce a generalized task model representing various task types. Previous studies evaluated the performance of their methods under conditions where they generated only one type of task uniformly. We evaluated the effectiveness of the proposed method in terms of performance, network-topology dependency, computation time, scalability regarding the number of tasks, and generalization performance for unknown task patterns. Such a comprehensive evaluation had not been conducted.

III. TASK-OFFLOADING PROBLEM

A. Overview
We describe an overview of the task-offloading problem. We assume a network consisting of EDs, edges, and clouds. EDs generate various tasks with diverse applications. Each task consists of the required computing demand, traffic demand, and maximum permissible latency to accomplish the task. Each ED can compute its tasks locally or offload tasks to the neighboring edge or cloud. We collectively refer to edges and clouds as nodes. All nodes have computing resources, called edge or cloud servers, that execute tasks on behalf of EDs. All nodes also have a traffic-forwarding function and act as routers toward other nodes. Only edges have a function that determines the optimal node to which each task is offloaded; clouds cannot offload tasks to other nodes and can only execute them. We assume that the ED determines whether to offload its tasks; optimizing the ED's decision is outside the scope of this paper. We aim to optimize the offloaded server and the route between the ED and the server for each accepted task.
We describe the procedure of our task-offloading method. We consider a discrete time step $t$, assume that each ED has one or more tasks, and consider $K$ tasks during $t \in [0, T]$. At the beginning of each $t$, each task arrives at the nearest edge of its ED. Each edge observes the information about tasks accepted at each edge and the overall network usage of all nodes and links. On the basis of the observation, our method deployed in each edge calculates the optimal node to offload the task (see Section V for details). The offloading node is calculated only once, and the neighboring edges that receive the tasks cannot offload them again to another edge or cloud. When multiple tasks arrive simultaneously at an edge at $t$, our method repeatedly determines the offloading node in a first-in-first-out (FIFO) manner. The method then aggregates the traffic-demand information between nodes and calculates and updates the optimal route between nodes. Next, each edge forwards tasks to the optimal nodes through the optimal route, and the node executes the task and returns the result to the ED. The upload and download routes of each task may differ since our method handles the upload and download traffic as individual traffic demands. After a certain amount of time, it proceeds to the next $t + 1$. Executing tasks continue to consume the resources of the offloaded node and the link(s) they traverse until the result is returned to the ED. The tasks accepted at $t$ do not need to be completed before $t + 1$. Note that it is also adequate to calculate and update the route once every several steps. In this case, our method calculates the optimal route on the basis of the average traffic volumes between nodes for several steps.

B. Network Model
Table II summarizes the notation definitions of the physical-network model. We consider the physical-network graph $G(N, L)$ consisting of a physical node set $N$ and physical link set $L$. We assume that each physical node has a role as an edge or cloud. We also assume that EDs connect to the nearest edge through the access network, which is not included in $G(N, L)$ in this paper. We also assume that the routes in the access network take the shortest path, and route optimization in the access network is outside the scope of this paper. We denote the edge as $e \in E \subset N$ and the cloud as $c \in C \subset N$. We also denote the numbers of nodes, edges, and clouds as $|N|$, $|E|$, and $|C|$, respectively. We denote the node-processing capability of the $i$-th node as $v^N_i \in \mathbb{R}^+$, which indicates the limit of computing resources, e.g., the central processing unit (CPU) capability per second in the $i$-th node [G cycles/s]. Here, $\mathbb{R}^+$ indicates the set of positive real numbers. We also denote the node capacity of the $i$-th node as $w^N_i \in \mathbb{N}$, which indicates the upper limit of the number of allocated tasks. We assign one CPU core to each task, i.e., $w^N_i$ equals the number of CPU cores in the $i$-th node. We denote the bandwidth capacity of link $(i, j)$ as $w^L_{ij} \in \mathbb{R}^+$, which indicates the limit of bandwidth resources [Mbps]. All links also have propagation latency depending on the distance between each node. We define the propagation latency of link $(i, j)$ as $\alpha^L_{ij} \in \mathbb{R}^+$ [ms].
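To make the notation concrete, the following is a minimal sketch of how the physical-network model could be represented in code; the class and field names are our own illustration, not part of the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A physical node with the role of edge or cloud."""
    node_id: int
    is_cloud: bool
    v_cap: float  # node-processing capability v^N_i [G cycles/s]
    w_cap: int    # node capacity w^N_i (number of CPU cores, i.e., concurrent tasks)

@dataclass
class Link:
    """A physical link (i, j)."""
    src: int
    dst: int
    bandwidth: float     # bandwidth capacity w^L_ij [Mbps]
    prop_latency: float  # propagation latency alpha^L_ij [ms]

@dataclass
class PhysicalNetwork:
    """Physical-network graph G(N, L); access networks are out of scope."""
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    links: dict = field(default_factory=dict)  # (src, dst) -> Link

    def edges(self):
        return [n for n in self.nodes.values() if not n.is_cloud]

    def clouds(self):
        return [n for n in self.nodes.values() if n.is_cloud]
```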

C. Task Model
Table III summarizes the notation definitions of the task model. We describe a task model for uniformly representing various types of tasks of EDs. We define the task set as $D = \{D_k\}$ and the $k$-th task as $D_k := (C_k, b^{\mathrm{up}}_k, b^{\mathrm{down}}_k, \tau^{\max}_k)$, where $C_k$ is the computing demand, $b^{\mathrm{up}}_k$ and $b^{\mathrm{down}}_k$ are the upload and download traffic demands, and $\tau^{\max}_k$ is the maximum permissible latency. By allocating sufficient computing and bandwidth resources to $D_k$, we satisfy the requirements of task $k$. The computing demand refers to the total amount of CPU cycles needed to accomplish the task. The traffic demand refers to the amount of traffic generated by the task. The reason for distinguishing between upload and download traffic is to accommodate tasks whose traffic volume changes before and after the computation process on the assigned server. When $D_k$ is accepted, it consumes the computing and bandwidth resources on $G(N, L)$ depending on the amount of computing and traffic demand of $D_k$. If the task is assigned to the nearest edge of the ED, the bandwidth resources consumed on $G(N, L)$ are regarded as 0. Note that if the task parameters are unknown, other systems need to be used to estimate the parameters, which is out of the scope of this paper.
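Under the same caveat, a task $D_k$ could be sketched as follows; the field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Task D_k of the task model (illustrative field names)."""
    k: int
    c_demand: float  # computing demand C_k [G cycles]
    b_up: float      # upload traffic demand [Mbps]
    b_down: float    # download traffic demand [Mbps]
    tau_max: float   # maximum permissible latency [ms]
    t_accept: int    # acceptance time t_k
```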

D. Optimization Problem
We formulate the optimal task-offloading problem, which aims to minimize Eq. (1) while satisfying the constraints in Eqs. (2)-(14). Table IV summarizes the notation definitions of the task-offloading problem. We aim to find optimal task-allocation variables $Y$ and routing variables $X_t$. Here, $Y := \{y_{kn}\}$ shows the task allocation, in which $y_{kn}$ is 1 if the computing demand of the $k$-th task is assigned to the $n$-th node; otherwise, 0. The $X_t := \{x^{pq}_{ij,t}\}$ shows the proportion of traffic demand $m^{pq}_t$ from the origin node $p$ to destination node $q$ passing through link $(i, j)$ at $t$. Here, $M_t := \{m^{pq}_t\}$ shows the traffic-demand matrix between nodes $p$ and $q$ at $t$. The $m^{pq}_t$ represents the total traffic demand between nodes $p$ and $q$ at $t$. The nodes $p$ and $q$ in $m^{pq}_t$ are determined by device placements and task-allocated servers. A detailed formulation is shown in Eqs. (12)-(13). Since we assume multi-path routing, $x^{pq}_{ij,t}$ can take a continuous value between 0 and 1. We also define the ED placement as $Z := \{z_{ke}\}$, in which $z_{ke}$ is 1 if the nearest edge of the ED requesting the $k$-th task is the $e$-th edge; otherwise, 0. We assume that $Z$ is constant during $t \in [0, T]$ to ignore the effects of ED movement, e.g., handover. In other words, we assume that the ED requesting to process the task at edge $e$ stays near edge $e$ for the processing time, e.g., at least a few seconds.
We introduce the objective function in Eq. (1), which minimizes the weighted sum of the maximum resource utilization and the task latency. Here, $U^N_t$ and $U^L_t$ show the maximum node and link utilization at $t$, defined as the maximum of the node utilizations $u^N_{i,t}$ over all nodes $i$ and the maximum of the link utilizations $u^L_{ij,t}$ over all links $(i, j)$, respectively. The terms $\tau^N_k$ and $\tau^{\mathrm{RTT}}_k$ show the node and round-trip time (RTT) latency of the $k$-th task. RTT is the time it takes to get from the source to the destination and back from the destination to the source. Their definitions and formulations are described in the subsection on the latency constraints. The term $\lambda$ indicates the weighting parameter determining the importance ratio of resource efficiency and task latency.
We impose three types of constraints: node capacity, link capacity, and task latency. We first define the binary variable $I_{k,t}$ as follows:
$$I_{k,t} = \begin{cases} 1 & \text{if } t_k \le t \le t_k + \tau^N_k + \tau^{\mathrm{RTT}}_k \\ 0 & \text{otherwise,} \end{cases} \quad (2)$$
where $I_{k,t}$ is 1 if the $k$-th task is executing at $t$; otherwise, 0. Here, $t_k$ is the acceptance time of the $k$-th task and $t_k + \tau^N_k + \tau^{\mathrm{RTT}}_k$ is its completion time. The total latency $\tau^N_k + \tau^{\mathrm{RTT}}_k$ of the $k$-th task can be modeled by Eqs. (15) and (16) (details below). Latency in a real-world environment does not necessarily occur as modeled. However, in a real-world environment, we can measure latency and use the measured values instead of the modeled equations.
1) Node-Capacity Constraints: A task-allocation variable $y_{kn}$ is formulated to minimize the maximum node utilization $U^N_t$ while satisfying the node-capacity constraints as follows:
$$\sum_{n \in N} y_{kn} = 1, \quad \forall k \quad (3)$$
$$\sum_{k} I_{k,t}\, y_{kn} \le w^N_n U^N_t, \quad \forall n, t \quad (4)$$
$$y_{kn} \in \{0, 1\}, \quad \forall k, n \quad (5)$$
$$0 \le U^N_t, \quad \forall t \quad (6)$$
Equation (3) shows that the computing demand of each task must be allocated to some node. Equation (4) shows the constraint of node capacity. The term $I_{k,t} y_{kn}$ in Eq. (4), the product of two binary variables, is 1 if the $k$-th task is allocated to node $n$ and is executing at $t$, because $I_{k,t} y_{kn} = 1$ if $I_{k,t} = 1$ and $y_{kn} = 1$. By calculating the sum over $k$, the left side of Eq. (4) shows the number of tasks executing at node $n$ at $t$. The right side of Eq. (4) shows the $n$-th node capacity at $t$ when the maximum node utilization is $U^N_t$; in other words, it shows the maximum number of tasks that the $n$-th node can perform at $t$. We impose the upper bound $w^N_n U^N_t$, which is stronger than $w^N_n$. Equations (5)-(6) show the range of variables.
2) Link-Capacity Constraints: We formulate the link-capacity constraints as a multi-commodity flow problem, which is a network-flow problem with multiple commodities (i.e., traffic-flow demands) between source and destination nodes. These equations give the capacity and flow-conservation constraints that the routing variables must satisfy. If we only needed routing variables that follow the shortest path, we could use a shortest-path algorithm, such as Dijkstra's or Bellman-Ford. However, these algorithms do not take capacity constraints into account. If all traffic demands pass through the shortest path, traffic may concentrate on a single link, causing congestion.
A routing variable $x^{pq}_{ij,t}$ is formulated to minimize the maximum link utilization $U^L_t$ while satisfying the link-capacity constraints as follows:
$$\sum_{j:(i,j) \in L} x^{pq}_{ij,t} - \sum_{j:(j,i) \in L} x^{pq}_{ji,t} = 0, \quad \forall i \in N \setminus \{p, q\} \quad (7)$$
$$\sum_{j:(p,j) \in L} x^{pq}_{pj,t} - \sum_{j:(j,p) \in L} x^{pq}_{jp,t} = 1 \quad (8)$$
$$\sum_{p,q} m^{pq}_t\, x^{pq}_{ij,t} \le w^L_{ij} U^L_t, \quad \forall (i, j) \in L \quad (9)$$
$$0 \le U^L_t \quad (10)$$
$$0 \le x^{pq}_{ij,t} \le 1 \quad (11)$$
Equations (7) and (8) show the traffic-flow conservation law. Equation (7) shows that the traffic flowing into a node equals the traffic flowing out of the node except at the source node $p$ and destination node $q$. Equation (8) shows that the flow out of the source node $p$ is 1. The traffic-flow conservation law at the destination node $q$ is guaranteed when Eqs. (7) and (8) are satisfied, which is proved in [14]. Equation (9) shows the link-capacity constraints. We impose the upper bound $w^L_{ij} U^L_t$, which is stronger than $w^L_{ij}$. Equations (10) and (11) show the range of variables. The term $m^{pq}_t$ in Eq. (9) can be formulated as follows:
$$m^{pq}_t = \sum_{k} I_{k,t}\, z_{kp}\, y_{kq}\, b^{\mathrm{up}}_k \quad (12)$$
$$m^{qp}_t = \sum_{k} I_{k,t}\, z_{kp}\, y_{kq}\, b^{\mathrm{down}}_k \quad (13)$$
Equation (12) shows the upload traffic demands from origin node $p$ to destination node $q$. Here, $z_{kp}$ and $y_{kq}$ determine the node $p$ and node $q$, and $I_{k,t}$ extracts the executing tasks. Equation (13) shows the download traffic demands from node $q$ to node $p$, which is the opposite of the upload.
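As an illustration of this LP, the following is a minimal sketch of the route optimization of Eqs. (7)-(11). The paper solves this LP with GLPK (Section VI); here we assume the PuLP modeling library with its bundled CBC solver, the function and argument names are our own, and demands are assumed to have distinct origin and destination.

```python
import pulp

def optimize_routes(nodes, links, demands):
    """Multi-commodity-flow LP sketch: minimize U^L_t subject to Eqs. (7)-(11).
    nodes: iterable of node ids; links: {(i, j): w^L_ij}; demands: {(p, q): m^pq_t}."""
    prob = pulp.LpProblem("route_opt", pulp.LpMinimize)
    u_link = pulp.LpVariable("U_L", lowBound=0)  # maximum link utilization U^L_t
    x = {(p, q, i, j): pulp.LpVariable(f"x_{p}_{q}_{i}_{j}", 0, 1)
         for (p, q) in demands for (i, j) in links}
    prob += u_link  # objective: minimize U^L_t
    for (p, q) in demands:
        for n in nodes:
            outflow = pulp.lpSum(x[p, q, n, j] for (i, j) in links if i == n)
            inflow = pulp.lpSum(x[p, q, j, n] for (j, i) in links if i == n)
            if n == p:
                prob += outflow - inflow == 1  # Eq. (8): source emits the full flow
            elif n != q:
                prob += outflow - inflow == 0  # Eq. (7): flow conservation
    for (i, j), cap in links.items():          # Eq. (9): link-capacity constraints
        prob += pulp.lpSum(m * x[p, q, i, j]
                           for (p, q), m in demands.items()) <= cap * u_link
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return {key: var.value() for key, var in x.items()}, u_link.value()
```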
3) Latency Constraints: Task latency is the sum of all possible delays a task experiences during offloading, including processing, transmission, propagation, and queuing latency. Processing latency is the delay it takes to process the task at the edge or cloud server. We refer to this latency as node latency $\tau^N_k$. Transmission latency is the time it takes to push all the traffic onto the link, calculated by dividing the number of bits by the transfer rate. Propagation latency is the time it takes for traffic to travel from the origin to the destination, determined by the distance between nodes and the speed of signal propagation. For a wired network, propagation latency can be assumed to depend only on distance because the speed of light is constant in fiber. Queuing latency is the time that traffic waits in the queue before it can be processed. We assume a sufficient buffer and hence that the queuing latency is zero. We refer to the sum of the transmission latency and the propagation latency as RTT latency $\tau^{\mathrm{RTT}}_k$. The latency constraint of the $k$-th task is formulated as follows:
$$\tau^N_k + \tau^{\mathrm{RTT}}_k \le \tau^{\max}_k, \quad \forall k \quad (14)$$
The definitions of the node latency $\tau^N_k$ and RTT latency $\tau^{\mathrm{RTT}}_k$ of the $k$-th task are as follows:
$$\tau^N_k = \sum_{n \in N} y_{kn}\, C_k / v^N_n \quad (15)$$
$$\tau^{\mathrm{RTT}}_k = \max_{p \in P^{\mathrm{up}}_k} \tau_p(b^{\mathrm{up}}_k) + \max_{p \in P^{\mathrm{down}}_k} \tau_p(b^{\mathrm{down}}_k) \quad (16)$$
Equation (15) shows that $\tau^N_k$ is determined by the computing-demand size of the task $C_k$ and the node-processing capability per second $v^N_n$. The term $C_k / v^N_n$ is the time it takes to process $C_k$ at node $n$. The term $y_{kn}$ determines the node $n$ to which the task $k$ is allocated. Equation (16) shows that $\tau^{\mathrm{RTT}}_k$ is determined by the bottleneck path with the maximum latency when the task goes through multiple paths.
Here, $P^{\mathrm{up}}_k$ and $P^{\mathrm{down}}_k$ are the sets of upload and download paths of the $k$-th task, which are calculated on the basis of $X_t$, $Y$, and $Z$. The term $\tau_p(b)$ is the function that shows the latency when the traffic demand $b$ goes through the path $p$. The $\tau_p(b)$ is formulated as follows:
$$\tau_p(b) = \frac{r_p\, b}{\min_{(i,j) \in L_p} w^L_{ij}} + \sum_{(i,j) \in L_p} \alpha^L_{ij} \quad (17)$$
The first term in Eq. (17) shows the transmission latency when the traffic demand $b$ goes through the path $p$. The term $r_p$ is the traffic-splitting ratio of path $p$ calculated from $X_t$, which satisfies $\sum_p r_p = 1$. The term $L_p$ is the link set along path $p$. The term $\min_{(i,j) \in L_p} (w^L_{ij})$ indicates the bottleneck bandwidth on path $p$. The second term in Eq. (17) shows the propagation latency of path $p$, which is the sum of the propagation latency $\alpha^L_{ij}$ of the links $(i, j) \in L_p$.
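A sketch of Eq. (17) in code form follows, assuming the link attributes from the network-model sketch above.

```python
def path_latency(b, r_p, path_links, bandwidth, prop_latency):
    """tau_p(b) of Eq. (17): transmission plus propagation latency of path p.
    path_links: list of links (i, j) along p; r_p: traffic-splitting ratio of p;
    bandwidth / prop_latency: dicts keyed by link (i, j)."""
    bottleneck = min(bandwidth[l] for l in path_links)      # min_{(i,j) in L_p} w^L_ij
    transmission = r_p * b / bottleneck                     # first term of Eq. (17)
    propagation = sum(prop_latency[l] for l in path_links)  # sum of alpha^L_ij
    return transmission + propagation
```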

IV. REINFORCEMENT LEARNING
We briefly review single-agent RL and MARL. Recent reviews of MARL are described in detail in [15], [16], [17].

A. Single-Agent Reinforcement Learning
Single-agent RL solves the decision problem in which an agent interacts with an environment. The agent observes state $s \in S$, takes action $a \in A$, receives reward $r$, and transfers to the new state $s' \in S$. The goal of the agent is to maximize the action value $Q(s, a)$, which is defined as the expectation of the sum of rewards obtained in the future when action $a$ is selected in state $s$. Q-learning [9] is a widely used RL algorithm and updates the Q-function by:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \quad (18)$$
where $\alpha$ is a learning rate and $\gamma$ is a discount factor.
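For reference, one tabular Q-learning step of Eq. (18) can be written as follows.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning step, Eq. (18); Q is a |S| x |A| array of action values."""
    td_target = r + gamma * np.max(Q[s_next])  # r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (td_target - Q[s, a])   # move Q(s, a) toward the TD target
```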
Deep RL (DRL) [10], [18] dramatically improves the generalization and scalability of traditional RL algorithms and can handle continuous and high-dimensional state spaces by approximating $Q(s, a)$ with a deep neural network (DNN). Among many studies on DRL, Deep Q-network (DQN) [18] is a well-known method that achieved human-level control through DRL and generic performance in many classic games. Our method uses DQN as the basis for the DRL algorithm because DQN has achieved stable and high performance in various use cases and is suitable for handling discrete actions. The optimal task-offloading problem must handle discrete actions because it requires selecting the best server from a set of edge or cloud servers.
DQN uses a replay memory to store the transition tuple $\langle s, a, r, s' \rangle$. DNN parameters $\theta$ are learned by sampling batches $b$ of transitions from the replay memory and minimizing the squared error:
$$\mathcal{L}(\theta) = \sum_{b} \left( y^{\mathrm{DQN}} - Q(s, a; \theta) \right)^2 \quad (19)$$
$$y^{\mathrm{DQN}} = r + \gamma \max_{a'} Q(s', a'; \theta^-) \quad (20)$$
where $\theta^-$ are the parameters of a target network that are periodically copied from $\theta$ and kept constant for several iterations. It was reported that the performance of RL algorithms drastically worsens as the number of candidate actions increases due to the decrease in the sampling efficiency of $\langle s, a, r, s' \rangle$ and the increase in the error of the DNN [10]. Therefore, stable DRL becomes difficult as the number of actions increases. DRL performs better when historical information is used for learning. DQN defines the last four frames as the current state in the classic game environment to consider historical information. This approach was successful enough for games where reflexes are critical, but it became clear that DQN needed more than four frames for some games. To address these shortcomings, Deep Recurrent Q-Network (DRQN) [19] adds recurrency to DQN by replacing the first post-convolutional fully connected layer with a recurrent long short-term memory (LSTM). DRQN successfully integrates historical information and improves the performance of DQN in some games, even though it only sees a single frame at each time step. In addition, DRQN with the recurrent network can better adapt during an evaluation when the quality of the observations changes. Therefore, our method introduces DRQN for the DNN layer.
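As an illustration, the following is a minimal DRQN-style Q-network in PyTorch. We use a GRU cell rather than an LSTM, matching the implementation described in Section VI; the layer sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class DRQNAgent(nn.Module):
    """Recurrent Q-network sketch: FC -> GRU -> FC (three layers, hidden size 64)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, hidden)
        self.rnn = nn.GRUCell(hidden, hidden)  # carries the observation history
        self.fc2 = nn.Linear(hidden, n_actions)

    def forward(self, obs, h):
        x = torch.relu(self.fc1(obs))
        h_next = self.rnn(x, h)  # update the recurrent hidden state
        q = self.fc2(h_next)     # one Q-value per candidate action
        return q, h_next
```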

B. Multi-Agent Reinforcement Learning
MARL is a system of multiple agents interacting within a shared environment. Each agent works with the other agent(s) to achieve a given goal. It is used for learning a complex environment by dividing a single work into multiple sub-works. We can reduce the learning cost of each agent by assigning one agent to each sub-work. Due to the complexity of the environments and the combinatorial nature of the problem, most MARL problems are categorized as NP-hard [20].
A cooperative multi-agent environment can be described as a decentralized partially observable Markov decision process (Dec-POMDP) [21] consisting of $\langle N, S, A, R, P, O, \gamma \rangle$, in which $N$ is the number of agents, $S$ is the state space, $A = \{A_1, \ldots, A_N\}$ is the set of actions for all agents, $P$ is the transition probability among the states, $R$ is the reward function, and $O = \{O_1, \ldots, O_N\}$ is the set of observations for all agents. If the agents can fully observe the global state, the $i$-th agent at $t$ observes the global state $s_t$, takes action $a^i_t$ ($a_t = \{a^i_t\}$), and receives reward $r_t$. If the agents cannot observe the global state, each agent only accesses its own local observation $o^i_t$. In this case, the $i$-th agent at $t$ observes its local observation $o^i_t$, takes action $a^i_t$ ($a_t = \{a^i_t\}$), and receives reward $r_t$. Each agent has an observation-action history $h^i \in h$, where the history indicates a series of past observations and $h$ is the observation-action history of all agents. The joint policy $\pi$ has a joint action-value function $Q^{\pi}_{tot}(h, a)$.
1) Independent Q-Learning: The most naive approach to solving the MARL problem is to treat each agent independently. This idea is formalized in the independent Q-learning (IQL) algorithm [22], [23], which decomposes a multi-agent problem into a collection of simultaneous single-agent problems that share the same environment. Each agent runs Q-learning [9] or DQN [18]. IQL is scalable from the viewpoint of implementation while increasing the number of agents, and each agent only needs its local history of observations during training. In this study, IQL was trained to minimize the loss:
$$\mathcal{L}(\theta) = \sum_{k=1}^{N} \sum_{b} \left( y_k - Q_k(h^k, a^k; \theta_k) \right)^2 \quad (21)$$
where $b$ is the batch size of transitions sampled from the replay memory, $Q_k$ is the $k$-th agent's Q-value function, and $y_k$ is the $k$-th agent's $y^{\mathrm{DQN}}$, as in Eq. (20).
2) Value Decomposition Networks: Value decomposition networks (VDNs) [24] are used to learn a joint action-value function $Q_{tot}(h, a)$. It is assumed that the joint action-value function $Q_{tot}$ can be additively decomposed into $N$ Q-value functions for $N$ agents, in which each Q-value function $Q_i$ relies only on the local state-action history:
$$Q_{tot}(h, a) = \sum_{i=1}^{N} Q_i(h^i, a^i) \quad (22)$$
Each agent observes its local state, obtains the Q-values for its actions, and selects an action. The sum of Q-values for the selected actions of all agents then provides the total Q-value of the problem. Because each agent's DNN is updated on the basis of the total Q-value, each agent learns the optimal behavior for all agents, i.e., they learn cooperative behavior. The loss function for VDN is as follows:
$$\mathcal{L}(\theta) = \sum_{b} \left( y^{tot} - Q_{tot}(h, a; \theta) \right)^2 \quad (23)$$
where $b$ is the batch size, $y^{tot} = r + \gamma \max_{a'} Q_{tot}(h', a'; \theta^-)$, and $\theta^-$ are the parameters of a target network, as in DQN.
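The additive decomposition of Eqs. (22)-(23) is simple to implement; a sketch of the VDN loss follows (terminal-state masking is omitted for brevity).

```python
import torch

def vdn_loss(agent_qs, actions, rewards, target_qs_next, gamma=0.99):
    """VDN loss sketch. agent_qs / target_qs_next: [batch, n_agents, n_actions];
    actions: [batch, n_agents] (long); rewards: [batch]."""
    chosen = agent_qs.gather(2, actions.unsqueeze(-1)).squeeze(-1)  # Q_i(h^i, a^i)
    q_tot = chosen.sum(dim=1)                                       # Eq. (22): sum of Q_i
    q_tot_next = target_qs_next.max(dim=2).values.sum(dim=1)        # max_a' Q_tot(h', a')
    y_tot = rewards + gamma * q_tot_next                            # target in Eq. (23)
    return ((y_tot.detach() - q_tot) ** 2).mean()                   # squared TD error
```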

V. PROPOSED TASK-OFFLOADING METHOD

A. Overview
We give an overview of the proposed task-offloading method, which is based on Coop-MADRL. We aim to find the optimal task offloading that minimizes Eq. (1) while satisfying the constraints in Eqs. (2)-(14). The decision variables are the task-allocation variables $Y$ and routing variables $X_t$. Our method consists of two parts: Coop-MADRL and mathematical optimization. Coop-MADRL is responsible for finding the optimal $Y$, and mathematical optimization is responsible for finding the optimal $X_t$. Another possible way to find the solution of the two variables is to solve them in sequence, such as finding $Y$ and then finding $X_t$. However, in this case, $Y$ is determined on the basis of a non-optimal $X_t$, which may degrade the performance of the solution compared with that of our method. Therefore, our method jointly optimizes $Y$ and $X_t$. Concretely, we calculate the reward of Coop-MADRL to find the optimal $Y$ on the basis of the $X_t$ calculated by mathematical optimization. The concept is based on existing methods [11], [25], [26]. Our method also consists of centralized training and decentralized execution. The decentralized agents continually execute the task offloading after centralized training.
Table V summarizes the notation definitions of our method. We introduce $|E|$ agents, equal to the number of edges, and assign each agent to each edge's offloading control. The $e$-th agent $g_e \in G$ learns how to optimize task offloading for the $e$-th edge. In centralized training, each task arrives at the nearest edge at the beginning of each $t$. Each agent $g_e$ observes the information $o^e_t$. On the basis of the observation, $g_e$ determines the node to offload the task to as an action $a^e_t$. Each agent randomly determines offloading nodes in the early stages of training but can select the best node as training progresses. Our method then aggregates the traffic demands between nodes, and the mathematical-optimization solver calculates the optimal route between nodes. The solver calculates a routing variable $X_t$ to minimize the link utilization $U^L_t$ while satisfying the constraints in Eqs. (7)-(11). Since the routing variable $X_t$ is a continuous value within 0-1, as shown in Eq. (11), this problem is classified as a linear programming problem. Next, each edge forwards tasks to the offloading nodes through the optimal route, and our method calculates the reward $r_t$. By repeating these steps, agents collect learning samples that are combinations of $\langle o_t, a_t, r_t \rangle$. On the basis of these samples, the agents are trained using Coop-MADRL. In decentralized execution, the trained agents repeat the above steps except for the learning steps. Since each agent has already learned the cooperative action for all agents in centralized training, each agent can act independently in decentralized execution.

B. Modeling
We first define the variables that represent subsets of tasks. The $K_t$ indicates the index subset of tasks executed at time step $t$. The $K_e$ indicates the index subset of tasks accepted at edge $e$. The $D_t$ indicates the subset of tasks accepted at $t$. The $D_{e,t}$ indicates the subset of tasks accepted at edge $e$ at $t$, i.e., $D_{e,t} \subset D_t \subset D$.
A state $s_t$ is defined as the combination of the information of the tasks accepted at each edge and the resource usage of all nodes and links at $t$. An observation $o^e_t$ for agent $g_e$ is defined as the combination of the local information of the tasks accepted at edge $e$ and the shared network-usage information. The candidate action set $A_e$ is defined as the set of nodes to which tasks can be offloaded. The node set consists of the nearest edge, neighboring edges, and neighboring clouds. The neighboring edges and clouds are one or more nodes determined by the physical-network graph $G(N, L)$. We exclude from $A_e$ nodes that have no remaining resources and edges that are more than two hops away from edge $e$. When edge $e$ does not receive any tasks at $t$, agent $g_e$ chooses the action "do nothing." We design the reward function to return a negative value if the constraints are not satisfied; otherwise, a positive value that depends on the objective-function value (see Section V-E for details).
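As an illustration only, an observation $o^e_t$ could be encoded as a flat feature vector as sketched below; the exact features are the paper's design, so the fields used here (taken from the task sketch in Section III-C) are assumptions.

```python
import numpy as np

def build_observation(local_tasks, node_util, link_util):
    """Sketch of o^e_t: local task demands plus shared network-utilization info."""
    task_feat = np.array(
        [[d.c_demand, d.b_up, d.b_down, d.tau_max] for d in local_tasks],
        dtype=np.float32).flatten()
    return np.concatenate([task_feat,
                           np.asarray(node_util, dtype=np.float32),
                           np.asarray(link_util, dtype=np.float32)])
```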

C. Formulation
Algorithm 1 shows the centralized training using Coop-MADRL. Line 1 shows the initialization of agent parameters.
A series of procedures (lines 2-18) is repeatedly executed until learning is complete. Lines 3-4 show the generation of tasks and initialization of environment parameters. A series of actions is called an episode, and each episode (lines 5-16) is repeatedly executed. In each episode, agents collect learning samples that are combinations of $\langle o_t, a_t, r_t \rangle$. We denote a time step for the network simulator as $t_{\mathrm{sim}} \in T_{\mathrm{sim}}$, which is reset at the beginning of each episode. In line 7, when edge $e$ accepts multiple tasks at $t_{\mathrm{sim}}$, agent $g_e$ selects one task in an FIFO manner. The two task subsets can be written with the following relationship: $D_{e,t} \subseteq D_{e,t_{\mathrm{sim}}} (\subset D)$. In line 9, a random action is selected with probability $\varepsilon$; otherwise, an action that maximizes $Q_e(o^e_t, a')$ is selected with probability $1 - \varepsilon$. This is to avoid convergence to a local optimum. Each agent executes lines 7-9 in parallel. In line 10, task offloading is updated in accordance with $a_t$ by Algorithm 3. Line 11 shows the reward calculation. Lines 12-13 give the termination condition of agent learning. In this algorithm, $r_t \le -1$ is the termination condition, i.e., the state that does not satisfy at least one constraint. In lines 14-15, if all tasks accepted at $t_{\mathrm{sim}}$ are allocated, it proceeds to the next $t_{\mathrm{sim}} + 1$. Line 17 stores the episodic transitions in replay memory $M$. The reason for storing the samples once in replay memory is to eliminate the time dependence of collecting training samples [18]. Lines 3-17 can be parallelized because the samples of the episodic transitions stored in replay memory are independent of the storing order. In line 18, all agents $G$ are trained by the history of episodic transitions, which is randomly taken from $M$. We used VDN [24] as the learning algorithm for all agents (see Section IV-B2 for details). To achieve global optimization, all agents learn to maximize a shared objective function, namely the joint action-value function $Q_{tot}(o_t, a_t)$, which helps prevent competition among the agents. In our architecture, each edge has a replay memory $M$. Since the observation $o^e_t$ of edge $e$ is composed of local task information and shared network information, episodic transitions can be stored at each edge without aggregating data from other agents.
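The $\varepsilon$-greedy selection in line 9 can be sketched as follows, reusing the DRQN agent sketch from Section IV; the helper name is our own.

```python
import random

def select_action(agent, obs, h, epsilon, candidate_actions):
    """Line 9 of Algorithm 1 (sketch): random action with probability epsilon,
    otherwise the candidate action maximizing Q_e(o^e_t, a')."""
    q, h_next = agent(obs, h)  # q: [1, n_actions]
    if random.random() < epsilon:
        a = random.choice(candidate_actions)
    else:
        a = max(candidate_actions, key=lambda i: q[0, i].item())
    return a, h_next
```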
Algorithm 2 shows the Coop-MADRL algorithm of the proposed task-offloading method. Line 1 shows the pre-training of $G$ by using Algorithm 1. Next, this algorithm continually repeats lines 2-9 as long as our method accepts new tasks. In line 6, each agent selects an action $a^e_t$ that maximizes $Q_e(o^e_t, a')$.

D. Update Environment
Algorithm 3 shows the procedure of the environment update. The task-allocation variables $Y$ and routing variables $X_t$ are updated in Algorithm 3. Line 1 shows the calculation of $Y$, line 2 the calculation of $U^N_t$, and line 3 the calculation of $M_t$ using Eqs. (12) and (13). In line 4, we solve the route-optimization problem, which calculates $X_t$ and $U^L_t$ using Eqs. (7)-(11). Line 5 shows the calculation of latency. Finally, Algorithm 3 returns the variables for the reward calculation.

E. Reward Calculation
We design the reward function on the basis of the objective function in Eq. (1). Algorithm 4 shows the procedure of the reward calculation for $G$. The term $\mathrm{Eff}(x)$ shows the efficiency function, which we design so that the efficiency decreases as $x$ increases: it returns a positive value depending on $x$ if $x < 1$; otherwise, it returns a negative value, and the rate of decrease in efficiency doubles when $x > 1$. Note that this design is essentially independent of the effectiveness of the proposed method. The term $\tau^{\mathrm{ave}}_t$ indicates the average task-latency efficiency at $t$, which shows the average satisfaction of latency, i.e., the average, over the tasks executing at $t$, of the ratio of each task's total latency $\tau^N_k + \tau^{\mathrm{RTT}}_k$ to its maximum permissible latency $\tau^{\max}_k$. The terms $\tau^N_k$ and $\tau^{\mathrm{RTT}}_k$ are calculated using Eqs. (15)-(16). If the constraints are violated, i.e., $U^N_{t+1} > 1$ or $U^L_{t+1} > 1$, the function returns −3. Finally, it returns a clipped reward within $-3 \le r_t \le 3$. The reason for clipping the reward is to increase the stability of agent learning [18].
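A sketch of the reward calculation follows. The piecewise form of $\mathrm{Eff}(x)$ and the way the three efficiency terms are combined are our reconstruction from the description above and Eq. (1), not the paper's exact Algorithm 4.

```python
def efficiency(x):
    """Eff(x) as reconstructed: positive below 1, negative above,
    with the slope of the decrease doubled once x exceeds 1 (an assumption)."""
    return 1.0 - x if x <= 1.0 else 2.0 * (1.0 - x)

def reward(u_node, u_link, tau_ave, lam=1.0):
    """Reward sketch following Algorithm 4: -3 on constraint violation,
    otherwise a clipped combination of resource and latency efficiency."""
    if u_node > 1.0 or u_link > 1.0:
        return -3.0  # at least one capacity constraint is violated
    r = efficiency(u_node) + efficiency(u_link) + lam * efficiency(tau_ave)
    return max(-3.0, min(3.0, r))  # clip to [-3, 3] for learning stability
```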

VI. EVALUATION
We evaluated the effectiveness of the proposed method through simulations in terms of performance, network-topology dependency, computation time, and scalability regarding the number of tasks. We also evaluated its generalization performance for unknown task patterns. We prepared five comparison methods and four network topologies for the evaluation. For each agent's DNN layer, we introduced DRQN [19].
We used a three-layer DNN consisting of two fully connected layers and a gated recurrent unit (GRU) layer [27]. We used Adam [28] to optimize all DNNs and set the number of neurons in the hidden layers to 64 for all DNNs. We used Double-DQN [10] as the DRL algorithm. We implemented the DRL algorithm on the basis of PyTorch [29], PyMARL [30], and PyMARL2 [31]. We solved the route optimization using the GNU Linear Programming Kit (GLPK) [32]. For the hyperparameters of DRL, the learning rate was set to $\alpha = 0.001$, and the discount factor was set to $\gamma = 0.99$. The parameter $\varepsilon$, which determines the probability of random actions, was linearly decreased from 1.0 to 0.05 over the first $10^5$ steps and then fixed at 0.05. We accelerated the centralized training with a parallel count of 8.

A. Evaluation Conditions
We set the number of tasks to $K = 1000$, total time steps to $T = 5.0 \times 10^5$, total time steps of each episode to $T_{\mathrm{sim}} = 50$, and weighting parameter to $\lambda = 1$. We assume that each ED has one task during $t_{\mathrm{sim}} \in [1, T_{\mathrm{sim}}]$, the acceptance time $t_k$ is randomly generated within 1-50, and the ED placement $Z$ is randomly generated. We set the time-step interval from $t$ to $t + 1$ on the network simulator to 100 ms. Since this interval is merely a parameter of the network simulator, the actual training time per step need not be less than 100 ms. However, the execution time of real-world task offloading should be less than 100 ms. Our proposed method satisfies these time constraints (see Section VI-F for details).
For task generation, we prepare four types of tasks assuming various use cases. We use a generalized task model to reveal that the proposed method can effectively allocate tasks even when many types of tasks are mixed. We determine the maximum permissible latency on the basis of the upload and download traffic and computation demand. We model Type 1 tasks as basic tasks and model the other types of tasks by modifying the parameters of the basic tasks. Each task parameter randomly takes a continuous value within a specified range. The task parameters are reset at the beginning of each episode.
For the physical network, we prepare four network topologies. We first prepare the 12-node topology on the basis of Internet2 [33], which consists of nine edges and three clouds, as shown in Fig. 1. We connect each cloud to one randomly selected edge and fix the cloud placements for all evaluations. The values in this figure show the pair of node-processing capability and node capacity $(v^N_i, w^N_i)$ and the pair of link capacity and propagation latency $(w^L_{ij}, \alpha^L_{ij})$. All evaluations except for the network-topology dependency evaluation were based on this topology. We then prepare the other topologies with reference to SNDLib [34], a library of test instances for survivable fixed telecommunication network design. Table VII summarizes the network-topology parameters. For each network topology, we define neighboring edges and clouds for each edge. We define the edges with a hop count of 1 as the neighboring edges for each edge. We also define the cloud(s) other than the farthest cloud from each edge as neighboring clouds for each edge.
We perform training and evaluation steps on the prepared tasks and physical networks. We run Algorithm 1 for training and Algorithm 2 for evaluation. Tasks in training and evaluation are generated under the same conditions. Note that the tasks generated during training and evaluation are different because some parameters are randomly generated. In the evaluation, the MADRL-based methods select the best action (i.e., the offloading server) on the basis of the trained agents in line 6 of Algorithm 2. The non-DRL-based methods select the offloading server on the basis of their respective algorithms, rather than selecting actions by agents.

B. Comparison Methods
Table VIII summarizes our method and the comparison methods. VDN indicates the proposed Coop-MADRL-based method. IQL indicates the MADRL-based method without cooperation, i.e., each agent learns on the basis of its own reward. As mentioned in Section II, there is no existing method that considers multi-cloud networks and network topology, but IQL corresponds to such a method [4] if it were extended to consider them. Note that we did not evaluate single-agent DRL because learning would clearly be unsuccessful due to the huge number of training iterations required until the Q-values for all actions are sufficiently close to the optimal.
We also compared our method with three algorithms that may be able to offload tasks in a computation time close to that of RL-based methods. The random algorithm (RA) has a policy that allocates each task to a node randomly chosen from all candidate nodes at each $t$. The greedy algorithm (GA) has a policy that allocates tasks to the nearest edge until the node utilization of the nearest edge exceeds a threshold and allocates tasks to the neighboring cloud after exceeding that threshold. We selected the threshold with the best performance after evaluating all patterns in increments of 0.1 through a preliminary experiment. The GA threshold was set to 0.2, 0.2, 0.2, and 0.1 for Internet2, Abilene, Atlanta, and Geant, respectively. The heuristic algorithm (HA) has a policy that allocates tasks to suitable nodes depending on the characteristics of the task type. For this evaluation, HA allocated tasks to random nodes when the task type was 1 or 2, to the neighboring clouds when the task type was 3, and to the nearest edge when the task type was 4. This is based on the empirical rule that allocating computing-heavy tasks to the clouds and latency-sensitive tasks to the nearest edge is suitable. Note that these three algorithms also optimize routes calculated with the route-optimization algorithm at each $t$.
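For concreteness, the GA and HA allocation rules described above can be sketched as follows; the function names and return labels are our own illustration.

```python
def ga_policy(nearest_edge_util, threshold=0.2):
    """GA sketch: keep tasks at the nearest edge until its utilization
    exceeds the threshold, then offload to the neighboring cloud."""
    return "nearest_edge" if nearest_edge_util <= threshold else "neighboring_cloud"

def ha_policy(task_type):
    """HA sketch: allocation keyed on task type (Section VI-A)."""
    if task_type in (1, 2):
        return "random_node"        # Types 1-2: random candidate node
    if task_type == 3:
        return "neighboring_cloud"  # Type 3: computing heavy -> cloud
    return "nearest_edge"           # Type 4: latency sensitive -> nearest edge
```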
We also evaluated exhaustive search (ES), which has a policy that finds the best offloading node by exhaustively calculating the reward for all candidate nodes at $t$, equivalent to solving an online optimization problem at every $t$. Note that the solution of ES is different from the exact optimal solution. Whereas the optimal solution is defined as the solution that maximizes the sum of the rewards over time, as shown in Eq. (1), ES aims to maximize the immediate reward obtained in the present. ES also independently calculates the optimal offload node for each edge at $t$ on the basis of the information obtained at $t - 1$, which may concentrate the tasks on certain nodes.

C. Training Curve
Figure 2 shows the training curves tracking the agent's total return for VDN and IQL under the Internet2 topology. The total return is defined as the sum of rewards at each time until the end of the episode. For this evaluation, the total time step was set to $T = 1.0 \times 10^6$ to ensure that the agents' learning had converged sufficiently. We conducted five trials for every $1.0 \times 10^4$ steps with random initial conditions. The width of each bar indicates the standard deviation ($\pm\sigma$).
The results indicate that the average total return of the two MADRL methods increased as the training progressed and converged around $T = 3.0 \times 10^5$. The results also indicate that the final return of VDN was higher than that of IQL, which means that IQL cannot learn suitable task offloading that maximizes the objective function while satisfying the constraints. We discuss the performance details in Section VI-D.

D. Performance
1) Results: Figure 3 shows the average performance of each method. We carried out 20 calculations with random initial conditions and set the same random seeds for all trials. The width of each bar indicates the standard deviation ($\pm\sigma$). We can roughly compare the performance of each method with the average reward. We also investigated the performance details in terms of the second to fifth metrics in Fig. 3. These metrics are the average maximum node utilization $U^N_t$, average maximum link utilization $U^L_t$, average task-latency efficiency $\tau^{\mathrm{ave}}_t$, and average constraint violation. A higher reward means better performance, whereas lower values for the other indicators mean better performance. When all constraints are satisfied, the node utilization, link utilization, and task latency take a value within 0-1. The constraint violation denotes the total number of times each method violates either the node-utilization, link-utilization, or task-latency constraints.
Figure 3 shows that the proposed method (VDN) performed comparably to ES and outperformed the other methods. These results indicate that VDN and ES can maximize network-utilization efficiency and minimize task latency while reducing constraint violations. Note that constraint violations occurred for all methods because we evaluated performance under severe conditions that put heavy loads on the network. Although ES performed the best, it is problematic in terms of computation time (see Section VI-F for details).
Next, we discuss the performance comparison between VDN and the other methods. We first compared VDN and IQL. Figure 3 shows that VDN performed better than IQL for all metrics. This indicates that the coordination among agents with VDN improves the performance of task offloading and can prevent constraint violations. Since each agent of IQL acts independently, task offloading is concentrated on lightly loaded resources in some cases, causing constraint violations as described in the Introduction.
We next compared VDN and the three non-RL-based computationally lightweight methods: RA, GA, and HA. The GA threshold was set to 0.2 because the best performance was obtained in the preliminary experiment. Figure 3 shows that VDN outperformed the three methods in terms of average reward. However, VDN had a higher maximum node utilization than GA and HA and a higher maximum link utilization than RA. The reason is that agents choose the action that maximizes the average reward even if the node and link utilization are increased. As a characteristic of each method, GA performed comparably to or better than VDN in terms of maximum node utilization, maximum link utilization, and task-latency efficiency by setting the best threshold in the preliminary experiment. However, GA had more constraint violations because it could not dynamically select allocations on the basis of the situation, resulting in a lower average reward than VDN. In addition, the maximum node utilization and task-latency efficiency of HA were competitive with those of the other methods because HA selects the appropriate offloading node depending on the characteristics of each task. However, the other indicators of HA showed the worst performance because HA does not take into account the link constraints.
2) Analysis: We discuss the reasons for the performance shown in Fig. 3 by analyzing the latency and allocation ratio of each task type. Figures 4 and 5 show the latency and allocation ratio of each task type in the evaluation in Fig. 3. Here, the allocation ratio indicates the proportion of tasks allocated to the nearest edge, neighboring edges, or neighboring clouds. The allocation ratios for VDN, IQL, and ES are determined by the task and network conditions. In contrast, the allocation ratios for RA and HA are fixed, regardless of the task and network conditions, and are determined by the algorithm design. The allocation ratio for GA is determined by the GA threshold value, which was set to 0.2 in this evaluation.
Figure 4 shows that VDN, IQL, GA, and ES primarily allocated tasks to the neighboring cloud. The result shows that the neighboring cloud is the preferred node for allocating tasks. This is because of the sufficient cloud-processing capacity and the short transmission delay between the edge and cloud, making assigning tasks to the cloud preferable under this evaluation condition. However, this result is only an example of an evaluation and does not deny the effectiveness of EC. The allocation to the edge may become dominant depending on the evaluation conditions.
Figure 4 shows that VDN changed the allocation ratio in accordance with the task type. VDN mainly allocated all types of tasks to the neighboring cloud. Looking at the share of tasks sent to the edges, VDN allocated most download-heavy tasks with high transmission costs (Task 2) to the edges. However, it allocated more computing-heavy tasks (Task 3) to neighboring clouds, even though the average traffic volumes of Tasks 2 and 3 were equal. We conclude that VDN learns a superior policy through cooperative learning. However, the latency-sensitive tasks (Task 4) were primarily allocated to neighboring clouds, even though they should be allocated to the nearest edge to minimize latency. We assume the reason is that VDN attempts to maximize overall network efficiency by allocating lightweight tasks to the cloud because resources at the nearest edge are limited. Figure 4 also shows that IQL changed the allocation ratio in accordance with the task type, although this did not translate into satisfactory performance, as shown in Fig. 3. RA and GA did not change the allocation ratio in accordance with the task type, which we assume is one reason for their low performance. Figures 4 and 5 show that HA achieved suitable latency for Tasks 3 and 4 because it changes the allocation node depending on the task type. However, as mentioned in Section VI-D1, HA performed poorly overall because it does not take the link constraints into account.
Figure 4 shows that ES used the nearest edges more than VDN did, which we assume is one reason that ES performed better than VDN. Even though ES evaluates every candidate node and selects the one that maximizes the reward, the performance difference between VDN and ES is slight. This is because ES maximizes only the immediate reward: the node that maximizes the immediate reward in the present may not maximize the expected sum of rewards obtained in the future.
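In symbols, with a discount factor $\gamma \in [0,1)$ (a standard RL formulation, not the paper's exact notation), the two selection rules differ as follows:
\[
a_{\mathrm{ES}} = \arg\max_{a} \, r(s, a), \qquad
a_{\mathrm{RL}} = \arg\max_{a} \, Q(s, a), \quad
Q(s,a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\Big|\, s_{0}=s,\; a_{0}=a\Big].
\]
The slight gap in Fig. 3 thus reflects cases where the myopically best node is not the best node in expectation over future tasks.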

E. Network-Topology Dependency
Table IX shows the average reward of each method for various network topologies. We carried out 20 calculations with random initial conditions and calculated the mean and standard deviation of the rewards. We did not evaluate IQL because training independent agents evidently becomes more difficult in larger topologies. The results indicate that VDN outperformed all other methods except ES in all network topologies. Thus, the proposed method can be applied to larger network topologies.
We first discuss the average performance for each topology. Table IX shows that the average performance of all methods in Abilene decreased compared with that in the other topologies. Comparing the conditions between Internet2 and the other topologies, as shown in Table VII, only the network size differs, and the other parameters are the same. We therefore consider that the graph structure of the topology determines the performance. We next discuss the performance of each method. Table IX shows that VDN performed comparably to ES in all topologies, whereas the performance difference between VDN and ES increased in Abilene. Since all methods performed worse in Abilene, the performance difference between VDN and ES could become significant in environments where agent learning is difficult. Table IX also shows that the performance trends of GA and HA remained unchanged. On the other hand, the performance of RA improved as the topology size increased. This is because a larger topology provides more control options, which enables load balancing and improves performance regardless of the control method.
Figures 6-8 show the allocation ratio of each task type for the larger topologies. We omit the results of RA, GA, and HA because these methods do not change the allocation ratio depending on the evaluation conditions. These figures show that VDN mainly allocated all types of tasks to the neighboring cloud, as in Internet2. On the other hand, the allocation ratio of ES depended on the network topology.
Figure 6 shows that the allocation ratio of VDN in Abilene was about the same as that in Internet2, although VDN allocated more tasks to the nearest edge. This means that VDN adapted the allocation ratio to Abilene and learned a superior policy for that environment. Figures 7 and 8 show that the allocation ratios in the larger topologies were similar to that in Internet2. However, the difference in the allocation ratio between task types became smaller as the topology size increased. This means that learning per task type becomes more difficult as the topology size increases.
In summary, VDN outperformed the other methods except ES in all network topologies. In addition, VDN can allocate tasks in accordance with the task type when the topology is small, but this becomes more difficult as the topology size increases. Although VDN cannot learn per task type in larger topologies, it achieves performance competitive with that of ES by learning the appropriate allocation to edges and clouds depending on the situation.

F. Computation Time
Table X shows the average execution time of each method for various network topologies. The execution time denotes the computation time required to determine the allocated node for a single task, which corresponds to line 4 in Algorithm 2 for VDN. We used an Apple M1 Ultra for the evaluation. The execution time of all methods except ES was less than 1 ms for all topologies. The execution time tended to increase with topology size, but the increase is almost negligible relative to task latency. This means that VDN performs sufficiently well in terms of computation time. In contrast, the execution time of ES was about 1 s even in the fastest case. In real-world scenarios, the execution time of a task-offloading method is added to the task latency: when the execution takes 1 s, the task latency cannot be shorter than 1 s, which is prohibitive for the many tasks that must keep latency below a few tens of milliseconds. The execution time of ES drastically increased with topology size because the computational complexity of the optimization calculation also drastically increased. We conclude that VDN is the best task-offloading method, considering the allocation performance shown in Table IX and the execution times shown in Table X.

Table XI shows the training time of VDN for various network topologies. The training time denotes the computation time required for the agents to complete training for T = 5 × 10^5 timesteps. The training time increases with topology size. Most of this increase was due to the increased computational complexity of the network simulation, which corresponds to Algorithm 3. Although larger topologies require longer training times, this is not problematic since training only needs to be done once.
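For context, a per-task execution time of this kind is typically measured as the average wall-clock time of a single greedy forward pass of the policy network. The following is a minimal measurement sketch under that assumption (it is not the paper's benchmarking code; policy_net and obs are placeholders for a trained network and one observation).

    import time
    import torch

    def measure_decision_time_ms(policy_net: torch.nn.Module,
                                 obs: torch.Tensor,
                                 n_trials: int = 1000) -> float:
        """Average wall-clock time (ms) to choose an offloading node for one task."""
        policy_net.eval()
        with torch.no_grad():
            policy_net(obs)                      # warm-up pass (excludes one-time costs)
            start = time.perf_counter()
            for _ in range(n_trials):
                policy_net(obs).argmax(dim=-1)   # greedy action selection
            elapsed = time.perf_counter() - start
        return 1000.0 * elapsed / n_trials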

G. Latency-Weighting-Parameter Dependency
Table XII shows the dependency on the weighting parameter of the objective function in Eq. (1). The weighting parameter λ determines the relative importance of resource efficiency and task latency. We carried out 20 calculations with random initial conditions and set the same random seeds for all calculations. The results indicate that, as λ increased, the task-latency efficiency of VDN improved while the sum of node and link utilization worsened. This shows that the proposed method can adjust the relative importance of the objective-function terms via λ. However, these indicators changed only gradually as λ increased. This is because reducing task latency also tends to improve node and link utilization. For example, to reduce task latency, the task-offloading method must balance server loads between clouds and edges in accordance with the task characteristics, which improves node utilization. Likewise, to reduce task latency, it must minimize the route length of each task, which decreases link loads and improves link utilization.
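Schematically, the trade-off described here can be written as follows (a schematic form consistent with the description above, not the paper's exact Eq. (1)):
\[
R \;=\; -(1-\lambda)\,\big(U_{\mathrm{node}} + U_{\mathrm{link}}\big) \;-\; \lambda\,\bar{\tau},
\]
where $U_{\mathrm{node}}$ and $U_{\mathrm{link}}$ denote aggregate node and link utilization, $\bar{\tau}$ denotes aggregate task latency, and a larger $\lambda$ shifts priority from resource efficiency toward latency.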

H. Scalability Regarding Number of Tasks
We evaluated the scalability of VDN with respect to the number of tasks K. At the same time, we evaluated the generalization performance of VDN with respect to the number of tasks; generalization performance measures how accurately a method performs on previously unseen data. In other words, we evaluated the performance of VDN with a number of tasks different from that used during training. We increased link capacity and node capacity in proportion to K, as shown in Table XIII, so that the relative load on the system is kept constant and we can isolate the scalability of VDN. The other parameters were those in Table VII. To make the conditions strict, each capacity was increased to only 95% of the proportionally scaled value.
Table XIV shows the scalability of VDN with respect to the number of tasks K in Internet2. We carried out 20 calculations with random initial conditions and calculated the average and standard deviation of the rewards. We used agents trained with K = 1000 under all conditions. VDN maintained the average reward even as the number of tasks increased. The VDN agents learn the policy from the accepted task information and the utilization of each node and link. Since this physical-network information is normalized and independent of the number of tasks, VDN can determine the preferred task allocation without depending on the number of tasks. We conclude that the proposed method has scalability and generalization performance with respect to the number of tasks. However, the variance of the rewards tends to increase with the number of tasks, so retraining the agents is advisable when the number of tasks differs significantly from the training condition.
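The normalization argument can be made concrete: if every observation entry is either a utilization ratio in [0, 1] or a task demand divided by a fixed scale constant, the observation's dimensionality and value range do not depend on K. A minimal sketch of such an observation vector follows; the specific feature choices are illustrative assumptions, not the paper's exact state design.

    import numpy as np

    def build_observation(node_used, node_cap, link_used, link_cap,
                          task_demand, demand_scale):
        """Build an agent observation whose size and value range are independent
        of the task count K: utilizations are ratios in [0, 1], and the incoming
        task's demands are divided by a fixed scale constant."""
        node_util = np.asarray(node_used, dtype=np.float64) / np.asarray(node_cap)
        link_util = np.asarray(link_used, dtype=np.float64) / np.asarray(link_cap)
        task_feat = np.asarray(task_demand, dtype=np.float64) / demand_scale
        return np.concatenate([node_util, link_util, task_feat]).astype(np.float32)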

I. Generalization Performance for Task Types
We evaluated the generalization performance of VDN for various task types. We evaluated the average reward when each task-type ratio differed from the training condition. Table XV shows various cases with different ratios of task types. We set the conditions in Table VI as the default. We prepared cases that increase the ratio of each task type while decreasing the ratio of basic tasks, as well as cases that change the ratio of the three task types by 5% each and the ratio of basic tasks by 15%.
Table XVI shows the average reward and other indicators for the various cases in which the ratio of task types was changed. We carried out 20 calculations with random initial conditions and set the same random seeds for all calculations. The average reward decreased as the ratio of Tasks 2-4 increased from the training condition. For Task 3, VDN retained generalization performance until the task ratio increased by 2%; the average reward notably decreased for an increase of 5%. This is because tasks with high average computing demand occupy computing resources over a long period and reduce node utilization. For Tasks 2 and 4, VDN retained generalization performance even when the task ratio increased by 20%, because these tasks occupy computing resources for only a short time. Similarly, when we simultaneously increased the ratio of the three task types by 5%, VDN showed no generalization performance. In other words, the proposed method generalizes when the prediction error in the proportion of computing-heavy tasks is within 2%. Conversely, when we simultaneously decreased the ratio of the three task types, the performance of VDN improved. Therefore, we conclude that the proposed method can achieve generalization performance for various task types by pre-training with the predicted maximum percentage of resource-consuming tasks.

J. Discussion
We now discuss future work for the proposed method. Although we have demonstrated its effectiveness, several challenges remain before our method can be applied in a commercial environment or a future network.
We first discuss the applicability of the proposed method. We assume that routes in the access network take the shortest path. Current wired access networks with optical fiber automatically select the shortest path from a few routing options, and current cellular wireless access networks automatically select the best BS for each ED on the basis of received signal power. Therefore, the proposed method is applicable in real-world scenarios of current networks. However, some challenges remain in applying the proposed method in a commercial environment or a future network.
The first challenge is scalability regarding edges. The proposed method may only be applicable to networks with dozens of EC servers. We target commercial networks that cover a large area, such as networks spanning multiple regions or parts of a country; such an access network includes more than 10 000 edges. Similar to other methods [1], [2], [3], [4], [5], [6], [7], [8], which can handle at most a few dozen BSs, as described in Section II, our method also cannot control more than 10 000 edges. We deploy an agent at each edge (i.e., at each BS), and the agents learn cooperative control; in the environment mentioned above, the agents would therefore need to learn cooperative control among a very large number of agents. This is difficult to achieve with current technology and is left for future work. As a possible solution, combining Mean-Field Game (MFG) theory and DRL may handle a large-scale task-offloading system with many BSs. MFG theory [35] studies strategic decision-making by small interacting agents in large populations, inspired by mean-field theory in physics. Each agent minimizes or maximizes the problem objective while taking into account the decisions of the other agents. Several studies have addressed network-control methods that combine MFG theory and DRL, such as unmanned aerial vehicle (UAV) control [36], [37] and computation offloading in EC [38], [39], [40].
TABLE XVI: GENERALIZATION-PERFORMANCE EVALUATION FOR VARIOUS TASK TYPES

The next challenge is scalability regarding tasks. The processing time of the proposed method for determining the offload server is about 0.2 ms per task. Because the proposed method processes tasks sequentially, one at a time, it cannot handle the thousands or millions of tasks per second required in real-world environments. Achieving sufficient scalability for tasks is therefore future work. One possible solution is parallelization: deploying many trained agents at each edge makes it possible to process many tasks simultaneously, as the sketch below illustrates, although many trained agents working independently may cause new problems.
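A minimal sketch of this parallelization idea follows, assuming independently deployed copies of a trained agent. The thread pool and the agent_copies[i].decide(task) inference call are illustrative assumptions; coordinating the independent copies is exactly the open problem noted above.

    from concurrent.futures import ThreadPoolExecutor

    def offload_in_parallel(tasks, agent_copies):
        """Dispatch tasks across several independently deployed copies of a
        trained agent. Each copy decides independently, so choices may conflict
        -- the new problem mentioned in the text."""
        with ThreadPoolExecutor(max_workers=len(agent_copies)) as pool:
            futures = [pool.submit(agent_copies[i % len(agent_copies)].decide, task)
                       for i, task in enumerate(tasks)]
            return [f.result() for f in futures]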
The final challenge is optimizing task offloading across an entire end-to-end network. Since task latency can be expressed as the sum of latencies in the backbone and access networks, we can find near-optimal task offloading and routes between EDs and servers by optimizing the backbone and access networks independently. However, future networks will require more efficient task offloading that considers the entire end-to-end, multi-domain path between EDs and servers, including the backbone network, access network, edge servers, and cloud servers. For this purpose, the proposed method should consider the latency and link quality of the wireless network between EDs and BSs. The difficulty in the wireless network is that the observed information is incomplete or uncertain. Several studies have addressed task offloading in wireless networks under imperfect channel state information (CSI) [41], [42], [43], [44], [45], [46]. In particular, security is a major concern for mobile EC with incomplete CSI, and several studies have addressed security-aware task allocation under incomplete CSI [47], [48], [49]. In addition, several studies have investigated end-to-end or multi-domain network slicing to integrate multiple controls [50], [51], [52]. Despite this research, optimizing task offloading in an end-to-end network remains unsolved and is one of the major challenges; combining our method with the above studies should address it in the future.

VII. CONCLUSION
We formulated an optimal task-offloading problem for multi-cloud and multi-edge networks considering network topology and bandwidth constraints. We proposed a task-offloading method based on cooperative multi-agent deep reinforcement learning (Coop-MADRL). This method can quickly achieve efficient task offloading by learning the relationship between network-demand patterns and optimal task offloading in advance using deep reinforcement learning. It also introduces a cooperative multi-agent technique, improving the efficiency of task offloading. Evaluations revealed that the proposed method can minimize network utilization and task latency while minimizing constraint violations, with an execution time of less than 1 ms, in various network topologies. They also revealed that cooperative learning improves the efficiency of task offloading. We demonstrated that the proposed method can achieve generalization performance for various task types by pre-training with many resource-consuming tasks. We plan to evaluate the performance of the proposed method and conduct more detailed analyses in more complicated use cases and real-world applications. We also plan to improve the scalability and interpretability of the proposed method.

Fig. 6. Allocation ratio of each task type in Abilene topology.

Fig. 7. Allocation ratio of each task type in Atlanta topology.

Fig. 8. Allocation ratio of each task type in Geant topology.

TABLE IV: NOTATION DEFINITIONS FOR TASK-OFFLOADING PROBLEM

TABLE V: NOTATION DEFINITIONS FOR PROPOSED METHOD

Algorithm 2: Decentralized Execution of Coop-MADRL.

Table VI summarizes the parameters for task generation. The task model is parameterized by the upload and download traffic demands, $B^{\mathrm{up}}_k$ and $B^{\mathrm{down}}_k$.

TABLE VIII: SUMMARY OF OUR METHOD AND COMPARISON METHODS

TABLE XV: VARIOUS CASES FOR GENERALIZATION-PERFORMANCE EVALUATION