CoEdge: Exploiting the Edge-Cloud Collaboration for Faster Deep Learning

Recently, a great number of ubiquitous Internet-of-Things (IoT) devices have been connecting to the Internet. With the massive amount of IoT data, cloud-based intelligent applications have sprung up to support accurate monitoring and decision-making. In practice, however, the intrinsic transmission bottleneck of the Internet severely handicaps the real-time performance of cloud-based intelligence that depends on IoT data. In the past few years, researchers have turned to the computing paradigm of edge-cloud collaboration: they offload computing tasks from the cloud to the edge environment so as to avoid transmitting large volumes of data through the Internet to the cloud. To date, it remains an open issue to effectively allocate the deep learning task (i.e., deep neural network computation) over the edge-cloud system so as to shorten the application response time. In this paper, we propose the latency-minimum allocation (LMA) problem, which aims to allocate the deep neural network (DNN) layers over the edge-cloud environment such that the total latency of processing the DNN is minimized. First, we formalize the LMA problem in its general form, prove its NP-hardness, and present an insightful characteristic of feasible DNN layer allocations. Second, we design an approximate algorithm, called CoEdge, which can handle the LMA problem in polynomial time. By exploiting the communication and computation resources of the edge, CoEdge greedily selects beneficial edge nodes and allocates the DNN layers to the selected nodes with a recursion-based policy. Finally, we conduct extensive simulation experiments with realistic setups, and the experimental results show the efficacy of CoEdge in reducing the deep learning latency compared to two state-of-the-art schemes.


I. INTRODUCTION
In the past few years, the popularity of emerging Internet-of-Things (IoT) applications has been generating a huge amount of real-time data. Based on the IoT data, IoT end-users can perform accurate monitoring and make effective decisions with their intelligent utilities deployed in the cloud. For such cloud-based intelligent IoT applications, however, the data generated by the ubiquitous IoT devices has to be delivered, through the Internet, to the cloud for further processing. In industry, cloud computing has played an indispensable role in executing large-scale, computation-intensive tasks such as deep learning, and thus the intelligence of IoT applications usually resides in the cloud [1], [2]. Under such a cloud-centric paradigm, the data delivered from IoT devices to the cloud will inevitably be impacted by the inherent transmission bottleneck of the Internet [3]-[6]. Accordingly, the response time or delay of the application will be significantly extended, especially in intelligent IoT applications where massive real-time IoT data struggles to traverse the Internet, swarming into the cloud. As a matter of fact, the non-negligible transmission delay of the Internet has become the essential obstacle to expediting deep learning-based IoT applications [7]-[9].
One promising way of addressing the above issue is edge computing. It is a new computing paradigm complementary to cloud computing, in which the IoT data processing, or at least part of it, is moved from the cloud to the edge of the Internet, so that the computing task can be executed in the proximity of the data sources rather than at the remote, hard-to-reach cloud [7], [10], [11]. Many recent studies show the ability of edge computing in reducing the system delay. In [12], for instance, the researchers offload the computation from the cloud to a wearable cognitive assistance system. They reduce the application response time by 80 ms to 200 ms; moreover, their approach consumes 30%-40% less energy than the cloud-based strategy. Edge computing has been envisioned as a unifying platform that can engender a new breed of emerging services and support a variety of new computation-intensive real-time applications.
Recently, researchers have attempted to process the deep learning task at the edge, mainly in order to reduce the system response delay. In general, the execution of a deep learning task is a layer-by-layer process over the deep neural network (DNN) model, which typically consists of a set of consecutive perceptron layers [13]. The raw IoT data is fed into the first DNN layer, and the inference or classification result is finally produced by the last DNN layer. During the processing of the DNN, the intermediate results are transferred through the DNN layers and quickly scale down in size. For edge-based intelligent IoT systems, the whole DNN model or a part of its layers can be deployed on an edge node, which is far closer to the IoT devices than the cloud server is. Sometimes the IoT devices are resource-rich and can then by themselves serve as a kind of edge device. The edge node only needs to upload to the cloud some size-reduced intermediate data or even the final result, instead of the large raw IoT data. The edge participation can therefore shorten the response time of a DNN-based IoT system by reducing the Internet traffic. In order to further reduce the system response time, however, it is still an open issue to design an effective paradigm that takes full advantage of the edge-cloud resources.
In this paper we design and implement a new allocation scheme, called CoEdge, which attempts to allocate the DNN layers over the edge and the cloud such that the deep learning delay can be further reduced in comparison with the schemes that only involve a single edge node. With a greedy criterion, CoEdge iteratively finds out the best edge node and forms a set of edge nodes, over which the DNN is then allocated with a recursion-based policy. Essentially, CoEdge unlocks the potential of edge in deep learning, by exploiting the high-speed connection and the computing capacity embraced by the edge environment.
The remainder of this paper is organized as follows. Section II briefly introduces major works related to ours. Section III and Section IV give the models and the detailed design of CoEdge. Section V evaluates our design and compares it with two state-of-the-art schemes via extensive experiments. Finally, Section VI concludes this paper.

II. RELATED WORK
In this section, we first introduce task offloading in edge computing and then the approaches to allocating DNN learning or inference tasks to the edge-cloud system.

A. OFFLOADING COMPUTE TASK ONTO EDGE
In cloud computing, transferring data to the cloud server often takes a considerable amount of time, which surely weakens the quality-of-service of time-sensitive applications. To address this issue, a promising approach is to offload a part of the computing tasks onto the resource-rich edge, alleviating network congestion, latency, and energy consumption [7]. Existing works on edge-cloud offloading focus on how to determine an effective and efficient task offloading policy [14], [15], [16]-[18]. To improve resource efficiency, reference [19] designs a resource-efficient edge computing framework, which enables intelligent IoT users to flexibly offload tasks across the edge device, the nearby assistant devices, and the adjacent edge cloud. In [20], the authors exploit the possibility that massive mobile devices collaboratively execute the on-edge task, aiming to optimize the energy efficiency of those devices. In [21], the authors consider a scenario where flying unmanned aerial vehicles serve as edge nodes; they propose a resource scheduling approach that can offload tasks in dynamic environments by leveraging a learning algorithm. How to offload the computing task onto the edge has recently attracted more and more attention in industry and academia. We observe that sustainable edge computing can bring latency reduction and that there is a great chance to deploy deep learning at the edge.

B. ALLOCATING DNN TO EDGE-CLOUD
High-accuracy deep learning usually depends on a lot of training data, and as a result, the demand for bandwidth rises dramatically in cloud-based intelligent applications. To accelerate DNN model training or inference, a popular approach nowadays is to transfer a part of or even all the DNN layers to the edge environment, i.e., to move them closer to the data sources.
Many hardware platforms have emerged, including GPUs and customized accelerators such as FPGAs and ASICs. In [22], DjiNN is designed as an open infrastructure for DNN service with large-scale GPU servers to achieve high throughput and low network occupancy. There have been several approaches to accelerating machine learning [23]-[25]. FPGA-based accelerators have more flexibility than ASICs in accelerating large-scale CNN models. A deeply pipelined multi-FPGA architecture is designed in [26], which achieves lower latency by using a dynamic programming method to map a DNN onto several pipelined FPGAs. For this approach based on a fixed pipeline, however, the FPGA devices are assumed to be homogeneous in computing capacity and each of them is required to undertake at least one DNN layer.
The demand for memory, computing, and energy capacity has gradually grown into a critical bottleneck for allocating DNNs to edge devices. How to deploy a DNN into the edge environment has been widely studied in industry and academia. A software accelerator for DNN execution in the edge network is presented in [27], in which the resource demands are reduced by decomposing DNN layers into various unit blocks that can be effectively processed by heterogeneous processors. In [28], a method called AAIoT is proposed to allocate a DNN to a set of devices that form a multi-level IoT system. AAIoT balances the computation and transmission time to minimize the overall response time. Reference [29] schedules DNN layers in an edge computing environment: it allocates as many deep learning tasks to the edge devices as possible while satisfying a given constraint on response time. For a given deep learning task, however, the authors employ only a single edge device to share some DNN layers. In particular, they do not consider the potential of collaboration across the edge devices, and thus their allocation policy cannot well suit deep learning tasks with strict latency requirements.
Different from the above works, the CoEdge proposed in this study is designed for a general edge-cloud environment where the edge devices can be heterogeneous in computing and communication resources. Additionally, in order to further shorten the deep learning latency, CoEdge carefully exploits the high-speed links and strong computing capacity contained in the edge network.

III. MODELS AND PROBLEM
In this section, we first introduce the system model in use and the corresponding notations. Second, we describe the proposed LMA problem, formulate it as a constrained delay-minimization problem, and analyze its intractability; in particular, we theoretically reveal an important feature of optimal LMA solutions, which helps reduce the search space of our algorithm (to be given in Section IV) and thus improves its efficiency.

A. MODELS AND PRELIMINARIES
Typically, an edge-cloud system consists of a cloud server and an edge network. We use c to represent the cloud server, and E = {e_1, e_2, . . . , e_n} to represent the set of edge nodes. The edge nodes are interconnected via D2D links, 5G networks, or other kinds of high-speed networks. We assume that transmission within the edge is much faster than that between any edge node e ∈ E and the cloud node c. Even if the deep learning task is completely processed at the edge, the final result is still reported to the cloud node.
A deep neural network (DNN) can be represented with a hierarchical structure L = {L_i | 1 ≤ i ≤ m}, where each layer L_i is a set of neurons. Because of the intrinsic directionality of processing a DNN, L is a partially ordered set (or simply a sequence) of perceptron layers. In this paper, we define a partial-order relation (denoted by ''≺'' and ''⪯'') on a set U: if u_i ≺ u_j, then u_i precedes u_j in U; and if u_i ⪯ u_j, we have u_i ≺ u_j or i = j. If there exists no u_k ∈ U such that u_i ≺ u_k and u_k ≺ u_j, we say that u_i and u_j are adjacent or consecutive in U. For two layers L_i and L_j of L, ''L_i ≺ L_j'' means that L_i should be executed earlier in the deep learning task than L_j. Fig. 1 shows a typical DNN which involves five layers. The calculation of layer L_i (1 ≤ i < m) yields the intermediate data, which is immediately passed on to layer L_{i+1} for further processing. We denote by θ_i^in the size of the data input to L_i, and by θ_i^out the size of the data output from L_i. Often, layer L_1 is called the input layer, which receives the input data to be classified; L_m is called the output layer, which returns the final classification result; and the other layers are called hidden layers because they are not connected with the external world. Fig. 1 shows an example where a pixelated image of a cat is fed to a 5-layer CNN and, after two stages (i.e., feature extraction and feature classification), a cat is recognized. Besides the number of layers, the DNN also specifies a neuron set for each layer and the connection pattern between two adjacent layers. Both factors collectively determine the computational cost at each layer and the volume of intermediate results to be transferred between two consecutive layers.
Given the raw input data (often an image), the latency of DNN-based learning or inference has two parts: the total computation time of all the layers, and the total time of transferring all the intermediate results between consecutive layers plus transferring the final result to the cloud server. In this paper, we call a set L_i a segment of DNN L if L_i is empty, or is a subset of L that involves a single layer or multiple consecutive layers. A segment of L is an element-consecutive subsequence derived from L. In Fig. 1, for instance, ⟨L_1⟩ and ⟨L_2, L_3⟩ are two segments of that 5-layer CNN, but neither ⟨L_1, L_3⟩ nor ⟨L_2, L_1⟩ is a segment. We say that a partially-ordered set P(L) is a partition of L if P(L) is a set of segments of L and the following three properties hold: 1) the segments of P(L) exactly cover all the layers of L, 2) L_i ∩ L_j = ∅ for any two distinct segments L_i and L_j of P(L), and 3) L_i ≺ L_j if the last layer of L_i and the first layer of L_j satisfy the relation ≺ in L. By the above definition, we can also simply consider a partition as a sequence of segments. In particular, ⟨∅, L⟩ and ⟨L, ∅⟩ are both partitions of L, distinct from each other.
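To make the segment and partition notions concrete, the following sketch (ours, not part of the paper; it encodes a segment as a (first, last) pair of layer indices, with first > last standing for the empty segment) enumerates all segments of an m-layer DNN and all ordered partitions into k possibly-empty segments:

```python
from itertools import combinations_with_replacement

def segments(m):
    # All nonempty segments of an m-layer DNN, as (first, last) layer
    # indices; a segment is a run of consecutive layers.
    return [(i, j) for i in range(1, m + 1) for j in range(i, m + 1)]

def partitions(m, k):
    # All ordered partitions of layers 1..m into k segments. Cut points
    # may repeat, so empty segments (first > last) are allowed, matching
    # the definition above where a segment may be the empty set.
    result = []
    for cuts in combinations_with_replacement(range(m + 1), k - 1):
        bounds = (0,) + cuts + (m,)
        result.append([(bounds[t] + 1, bounds[t + 1]) for t in range(k)])
    return result
```

For the 5-layer CNN of Fig. 1, `segments(5)` yields 15 nonempty segments, and `partitions(5, 2)` contains both ⟨∅, L⟩ and ⟨L, ∅⟩.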

B. PROBLEM DESCRIPTION AND ANALYSIS
In the edge-cloud system we consider, for a given deep learning task, some edge node of E (denoted by e_r) receives the raw input data, and the cloud node c receives the final result of learning or inference. So both e_r and c always participate in the deep learning task, although neither of them has to undertake any DNN layers. In this paper, the LMA problem is described as follows: for a given deep learning task defined by L, determine a subset E′ ⊆ E\{e_r} as well as a partition P(L) of L, and allocate each segment of P(L) to exactly one node of E′ ∪ {e_r, c}, while minimizing the total latency of processing L. For simplicity, we hereafter use S_rc to represent E′ ∪ {e_r, c} for any E′ ⊆ E\{e_r}. Next, we formally present the LMA problem to be addressed and then prove its NP-hardness.
The allocation of L over E ∪ {c} can be expressed with a tuple A = (P(L), S_rc, ϕ), where the function ϕ : P(L) → S_rc is the allocation policy. If ϕ(L_j) = e_i, node e_i will perform all the layers in segment L_j; we denote by d_i^j the time cost for e_i to perform L_j. In practice, d_i^j can be figured out in advance, according to the available computing resources of e_i and the total computing load incurred by L_j. Notably, we can reasonably neglect the time consumed to transfer the intermediate data between any two adjacent layers of segment L_j, because such on-node data transfers are carried out by e_i within its local memory space. Consider two adjacent segments L_j and L_{j+1} of partition P(L), and assume that ϕ(L_j) = e_i and ϕ(L_{j+1}) = e_k. If e_i and e_k are two different nodes of S_rc, a cross-edge data transfer is needed: node e_i needs to transfer to node e_k the intermediate data output from the last layer of L_j. We use d_{i→k}^j to represent the delay of such a cross-edge data transfer from node e_i to node e_k.
We denote by δ(A) the latency that the allocation A can achieve on processing a deep learning task. Thus, the LMA problem can be formally written as

min_A δ(A) = Σ_{L_j ∈ P(L)} d_{ϕ(L_j)}^j + Σ_{L_j ∈ P(L)} d_{ϕ(L_j)→ϕ(L_{j+1})}^j, (1)

where ϕ(L_{j+1}) for the last segment is taken to be the cloud node c (the final result is always reported to the cloud), and d_{i→k}^j equals zero if e_i is the cloud node under the current allocation policy ϕ. Given partition P(L) and S_rc, Fig. 2 shows three feasible allocation policies. In an LMA solution, some nodes of S_rc might not process any segments but only relay the intermediate data between nodes. In Fig. 2, for example, e_3 and e_r are merely relaying nodes under allocation policies ϕ_2 and ϕ_3, respectively.
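As an illustration of how δ(A) decomposes into computation and cross-edge transfer terms, a minimal Python sketch could evaluate the latency of a given allocation as follows (the dictionary layout is our assumption, not the paper's implementation):

```python
def total_latency(partition, policy, comp, trans):
    """Evaluate delta(A) for an allocation A = (P(L), S_rc, phi).

    partition : ordered list of (nonempty) segment ids
    policy    : dict mapping each segment to the node that computes it
    comp[e][s]    : time d_e^s for node e to compute segment s
    trans[u][v][s]: delay d_{u->v}^s of moving s's output from u to v
    """
    delay = 0.0
    for idx, seg in enumerate(partition):
        node = policy[seg]
        delay += comp[node][seg]              # computation time of the segment
        if idx + 1 < len(partition):
            nxt = policy[partition[idx + 1]]
            if nxt != node:                   # cross-edge transfer needed
                delay += trans[node][nxt][seg]
    return delay
```

With two segments computed on distinct nodes, the result is simply the two computation times plus one transfer delay, matching (the two sums of) the objective above.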
Theorem 1: Generally, the LMA problem is NP-hard.
Proof: To show the NP-hardness of the LMA problem, we consider a special instance of LMA in which (1) the nodes that can process DNN L are given in advance as S_rc = {e_1, e_2, . . . , e_k} (k ≥ 2), (2) the time cost d_{p→q}^j is extremely low and can be neglected for any pair of nodes e_p, e_q ∈ S_rc, and (3) each node e_i consumes τ_i^j time in performing layer L_j of L. For this special case, we can treat S_rc as a set of parallel machines and L as a set of jobs with precedence constraints. This special problem instance is thus transformed into determining a partition of L and a policy that allocates each resulting segment to a machine of S_rc while minimizing the total processing time. Clearly, this problem can be modeled as the Workload Partition Problem (WPP) with the precedence constraint of jobs, which has been proven NP-hard [30]-[32]. To solve the problem proposed in this paper, we need to determine not only the partition of L and the allocation policy ϕ but also the subset S_rc of E ∪ {c}; thus, the LMA problem is NP-hard in general.
Theorem 2: Suppose that there is an allocation of partition P(L) over S_rc (|S_rc| ≥ 2) and the corresponding allocation policy is ϕ. Consider three consecutive nonempty segments L_{i−1}, L_i, and L_{i+1} of P(L) such that ϕ(L_{i−1}) = ϕ(L_{i+1}) = e_p and ϕ(L_i) = e_q, where e_p and e_q are two distinct nodes of S_rc. We declare that the allocation policy ϕ cannot lead to an optimal allocation of P(L) over S_rc.
Proof: We prove this theorem by construction. The left part of Fig. 3 shows the allocation policy ϕ on the given three consecutive segments. We next reshape ϕ under two mutually exclusive cases:
• Case I: e_p is at least as powerful as e_q in terms of computing capacity, and
• Case II: e_q is more powerful than e_p in terms of computing capacity.
If Case I holds, we can derive a new allocation policy ϕ′ from ϕ by only re-allocating segment L_i to e_p, as shown in the middle part of Fig. 3. We first evaluate by (2) the time cost of ϕ on processing these three consecutive segments.
Under allocation policy ϕ, nodes e_p and e_q process segments L_{i−1} and L_i, respectively, and then e_p takes over segment L_{i+1}. Hence ϕ incurs two cross-edge intermediate data transfers, which consume times d_{p→q}^{i−1} and d_{q→p}^i, respectively:

δ(ϕ) = d_p^{i−1} + d_{p→q}^{i−1} + d_q^i + d_{q→p}^i + d_p^{i+1} + min{ d_{p→x}^{i+1}, d_{p→q}^{i+1} + d_{q→y}^{i+1} }. (2)

In (2), the last term of the right-hand side indicates that there are two alternatives for ϕ to transfer the intermediate data output from the last layer of L_{i+1}: one is to transfer the data directly from e_p to some node e_x processing L_{i+2}, and the other, through e_q to some node e_y processing L_{i+2}. Similarly, the time cost of ϕ′ on processing the three segments can be expressed as

δ(ϕ′) = d_p^{i−1} + d_p^i + d_p^{i+1} + min{ d_{p→x}^{i+1}, d_{p→q}^{i+1} + d_{q→y}^{i+1} }. (3)

In Case I, e_p is identical to or faster than e_q in terms of processing speed, i.e., d_p^i ≤ d_q^i. Comparing (3) with (2), we always have δ(ϕ′) < δ(ϕ), regardless of how the allocation policy relays the intermediate data output from the last layer of L_{i+1}.
If Case II holds, i.e., e_q can process segment L_i faster than e_p, we can create a new allocation policy ϕ′ that re-allocates segments L_{i−1} and L_{i+1} to e_q, the sole difference from ϕ. Also, we can easily prove δ(ϕ′) < δ(ϕ). In conclusion, we can always reshape the ϕ given in this theorem into an allocation policy with shorter latency.
Comparing the three allocation policies shown in Fig. 3, we can see that both L_{i−1} and L_{i+1} are allocated by ϕ to node e_p but are ''cut in'' by L_i, which is allocated to another node e_q. Theorem 2 implies that if an allocation policy allocates nonconsecutive segments to some node, such a cut-in policy is not optimal. Not limited to the case of three consecutive segments, Theorem 2 can easily be extended to the case of any three nonempty segments L_i ≺ L_j ≺ L_k where L_i and L_k are allocated to one node but L_j is allocated to another. The heuristic offered by Theorem 2 allows algorithm designers to safely bypass those cut-in allocations, which narrows down the solution space (i.e., the feasible region) and thus helps speed up their algorithms.
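The pruning rule of Theorem 2 is cheap to check. As a hypothetical helper (ours, for illustration only), one can flag a cut-in policy directly from the sequence of nodes assigned to the nonempty segments:

```python
def has_cut_in(policy_seq):
    """Return True if the node sequence revisits a node after leaving it,
    i.e., the 'cut-in' pattern that Theorem 2 rules out for optimal
    allocations. policy_seq lists phi(L_1), phi(L_2), ... in segment order."""
    seen_and_left = set()
    for i, node in enumerate(policy_seq):
        if node in seen_and_left:
            return True                       # nonconsecutive segments on one node
        if i + 1 < len(policy_seq) and policy_seq[i + 1] != node:
            seen_and_left.add(node)           # this node is now 'left behind'
    return False
```

For instance, the pattern (e_p, e_q, e_p) from the theorem is flagged, while (e_p, e_p, e_q) is not.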

IV. DESIGNS
Recall that for the deep learning task considered in this paper, we let c and e_r represent the cloud node and the edge node that receives the input data, respectively. If we only allocate L to e_r and c, as [29] does, we can easily figure out an optimal solution for our problem in polynomial time. Besides these two nodes, however, any other node of E could be included in the optimal solution in the general case. To address the LMA problem, we design an approximate algorithm called CoEdge. With a greedy policy, CoEdge attempts to iteratively insert a new edge node into S_rc, which is initialized with only {e_r, c}, until the iteratively updated S_rc can no longer assure a shorter time to perform the deep learning task. Algorithm 1 shows how the proposed CoEdge works. Before diving into the algorithm description, we first introduce the information to be input to CoEdge and how to initialize these inputs.

Algorithm 1 CoEdge
Input: E ∪ {c}, L, W, and C
Result: the allocation of L over a subset S_rc of E ∪ {c}
 1: S_rc ← ⟨e_r, c⟩
 2: Determine an optimal layer allocation A over S_rc and obtain the minimum total latency δ_min
 3: while E ∪ {c} − S_rc ≠ ∅ do
 4:   foreach e ∈ E ∪ {c} − S_rc do
 5:     Determine a partition P(L) as well as an allocation policy ϕ for S_rc ⊕ e to minimize the total latency
 6:   end
 7:   Select e* from the nodes examined in the above for-loop such that the corresponding partition P*(L) and allocation policy ϕ* on S_rc ⊕ e* achieve the minimum latency (denoted by δ*)
 8:   if δ* < δ_min then
 9:     δ_min ← δ*
10:     S_rc ← S_rc ⊕ e*
11:     Update A with S_rc, P*(L), and ϕ*

The input information needed by CoEdge consists of four data sets: E ∪ {c}, L, W, and C. The latter two profile the bandwidth resources and computing capacities of the edge-cloud environment. More specifically, W is a matrix with |E ∪ {c}| rows and |E ∪ {c}| columns; each element ω_ij measures the best available bandwidth between two distinct nodes e_i and e_j. Since we neglect the time consumed in on-node data transfers, we let ω_ii = ∞. Input C is also a matrix; its element c_ij stores the computational time that node e_i of E ∪ {c} needs to pay if it is assigned to perform a possible segment L_j of L. Next, we introduce how to determine matrices W and C before algorithm CoEdge can go ahead.
We assume that the edge network is connected and each edge node connects with the cloud node through the Internet. So there exists at least one communication path between any two edge nodes, or between any edge node and the cloud node. We employ the Floyd algorithm to calculate the best available bandwidth between each pair of nodes, i.e., the elements of matrix W. Since each segment can possibly be included in an optimal allocation, it is also necessary for CoEdge to know how fast each node processes any possible segment, which is what matrix C records.
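The paper does not spell out the adaptation, but computing the "best available bandwidth" with the Floyd algorithm is commonly the widest-path (maximum-bottleneck) variant of Floyd-Warshall: the bandwidth of a path is the minimum link bandwidth along it, and we maximize over paths. A sketch under that reading:

```python
def best_bandwidth(w):
    """Widest-path Floyd-Warshall. w is an n x n matrix of direct link
    bandwidths, with 0 where no direct link exists and float('inf') on the
    diagonal (on-node transfers are taken as free, i.e., omega_ii = inf).
    Returns the matrix of best available bandwidths (our reading of W)."""
    n = len(w)
    omega = [row[:] for row in w]             # do not mutate the input
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # bandwidth of the i -> k -> j route is its bottleneck link
                via_k = min(omega[i][k], omega[k][j])
                if via_k > omega[i][j]:
                    omega[i][j] = via_k
    return omega
```

For a three-node chain with links of 100 Mbps and 50 Mbps, the best available bandwidth between the chain's endpoints is the 50 Mbps bottleneck.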

B. DESCRIPTION OF CoEdge
After the above four input data sets are prepared, algorithm CoEdge initializes S_rc with the partially-ordered set ⟨e_r, c⟩, in which e_r is the edge node receiving the input data of DNN L and c is the cloud server. At the very beginning, therefore, only e_r and c are prepared to share all the DNN layers. By brute-force enumeration, CoEdge can achieve in linear time an optimal solution A for allocating L to ⟨e_r, c⟩, which yields the shortest latency of processing L, denoted by δ_min. CoEdge then enters a greedy iterative procedure (line 3 of Algorithm 1), which tries to invite more edge nodes to collaboratively process DNN L, aiming to update δ_min with a shorter latency. In essence, the CoEdge algorithm continually grows the set S_rc to pursue the acceleration of deep learning. In CoEdge, we regulate S_rc such that it is a partially-ordered set: for two distinct nodes e_i, e_j ∈ S_rc that process segments L_i and L_j, respectively, we have e_i ≺ e_j if L_i ≺ L_j holds. Such a partial-order regulation on S_rc lets the CoEdge algorithm bypass checking the cut-in allocation policies, which cannot be optimal solutions according to Theorem 2. Keeping S_rc partially-ordered thus helps CoEdge narrow down the search space in its iterations. When a new node e is examined in each iteration in hope of reducing the latency, CoEdge always puts it right ahead of the last node (i.e., the cloud server c) of S_rc. The addition of e into S_rc is expressed as ''S_rc ⊕ e'' in this paper.
After entering the iteration in line 3 of Algorithm 1, CoEdge first checks all the nodes of E that have not yet been inserted into S_rc, in order to decide which of these nodes can lead to a latency-minimum allocation (lines 4 to 7). If the addition of e* into S_rc and the corresponding partition P*(L) achieve the minimum latency so far, we then update δ_min and the allocation while doing S_rc ⊕ e*. More specifically, CoEdge in line 5 determines the latency-minimum partition over S_rc ⊕ e in a recursive way. We next resort to Fig. 5 to explain how to recursively obtain the best partition over a given S_rc. As shown in Fig. 5, we suppose that there are k nodes in the sequence S_rc (including the cloud server c) and an m-layer DNN L is to be allocated to S_rc while minimizing the total latency. Recall that e_r receives the raw input data and thus must be included in the sequence S_rc. To obtain a latency-minimum partition over S_rc, we only need to consider two cases: (1) allocating all the DNN layers of L to e_r, and (2) allocating the layers from L_1 to L_p to e_r while the remaining layers are allocated to the subsequence S_rc\e_r. After evaluating the latencies of these two cases, we can pick out the best allocation. When we are to determine the allocation of layers L_{p+1} to L_m over the subsequence S_rc\e_r, we can also calculate the corresponding shortest latency by recursion. We next denote by L_{i∼j} the set of consecutive layers from L_i to L_j, and by δ(L_{i∼j}, S) the latency of some partition of L_{i∼j} over S ⊆ S_rc, where 1 ≤ i, j ≤ m and ''i > j'' makes L_{i∼j} an empty set. We use π(S) to represent the first node of S (i.e., the node that precedes all the other nodes of S); then π(S_rc) = e_r. For the allocation of L_{i∼j} over the subsequence S, its shortest latency, denoted by δ*(L_{i∼j}, S), can be recursively expressed as

δ*(L_{i∼j}, S) = min_{i−1 ≤ p ≤ j} { δ(L_{i∼p}, π(S)) + d̄_p + δ*(L_{p+1∼j}, S\π(S)) }, (4)

where d̄_p is the time consumed in transferring the intermediate data from layer L_p to layer L_{p+1}.
In (4), when p = i − 1, we have δ(L_{i∼p}, π(S)) = 0 because no layer is assigned to node π(S). In this case, although π(S) does not process any layers, it still needs to pay d̄_p time to relay the intermediate data from layer L_{i−1} to layer L_i. Given a DNN, the size of the intermediate data output from layer L_p (i.e., θ_p^out) is known in advance. If we allocate L_p and L_{p+1} to nodes e_i and e_j, respectively, we can then evaluate the time cost d̄_p by θ_p^out/ω_ij, where ω_ij is the best available bandwidth from e_i to e_j and is stored in matrix W. For a given sequence S_rc and an m-layer DNN, we can obtain a latency-minimum allocation by recursively solving δ*(L_{1∼m}, S_rc) according to (4). At last, CoEdge returns a partition P(L), a nonempty S_rc, and an allocation policy ϕ; and for any L_i ≺ L_j of this partition, we always have ϕ(L_i) ≺ ϕ(L_j).
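Recursion (4) admits a straightforward memoized rendering. The sketch below is ours (not the authors' code): comp plays the role of matrix C, out_size of θ^out (with out_size[0] being the raw input size), and bw of W; the last node of the sequence is the cloud, which takes all remaining layers.

```python
from functools import lru_cache

def min_latency(m, nodes, comp, out_size, bw):
    """Memoized form of recursion (4), an illustrative sketch.

    comp[e][(i, p)] : time for node e to compute layers i..p
    out_size[p]     : size of the output of layer p (out_size[0] = raw input)
    bw[u][v]        : best available bandwidth between nodes u and v
    nodes           : the ordered sequence S_rc, ending with the cloud
    """
    @lru_cache(maxsize=None)
    def delta(i, s):                      # allocate layers i..m over nodes[s:]
        head = nodes[s]
        if s == len(nodes) - 1:           # last node computes all that remains
            return comp[head][(i, m)] if i <= m else 0.0
        best = float("inf")
        for p in range(i - 1, m + 1):     # p = i-1 means head only relays
            on_head = comp[head][(i, p)] if p >= i else 0.0
            hop = out_size[p] / bw[head][nodes[s + 1]]   # d-bar_p
            best = min(best, on_head + hop + delta(p + 1, s + 1))
        return best

    return delta(1, 0)
```

In the two-node toy instance below, splitting the two layers between e_r and the cloud beats both one-node extremes.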
In each iteration, for a given e ∈ E ∪ {c} − S_rc, CoEdge employs recursion to complete allocating all the layers of L across S_rc ⊕ e. According to (4), CoEdge needs to evaluate δ(L_{i∼j}, π(S_rc)) for any 1 ≤ i ≤ j ≤ m on the first node of the sequence S_rc. As analyzed above, there are m(m+1)/2 different segments for an m-layer DNN. In addition, we have built the matrix C before CoEdge enters the greedy iterations, and c_ij stores the total computational time for node e_i to perform all the consecutive layers of segment L_j. We thus know that in the iteration with S_rc and a given e, we can obtain δ(L_{1∼m}, S_rc ⊕ e) with a time complexity of O(m^2(|S_rc| + 1)). Furthermore, in each greedy iteration, CoEdge needs O(m^2(|S_rc| + 1) · |E − S_rc|) time to find e* and the corresponding δ*. It follows that for an m-layer DNN and an edge-cloud system of size (n + 1), the total time complexity of CoEdge is upper bounded by O(m^2 n^3).
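Putting the pieces together, the greedy outer loop of Algorithm 1 can be sketched as follows (our skeleton, not the authors' code; best_allocation stands for any routine, such as recursion (4), that returns the minimum latency over a given ordered node sequence):

```python
def coedge_greedy(candidates, e_r, cloud, best_allocation):
    """Greedy skeleton of Algorithm 1. New nodes are inserted just before
    the cloud, keeping S_rc ordered; iteration stops once no remaining
    node improves the latency."""
    s_rc = [e_r, cloud]
    d_min = best_allocation(s_rc)
    remaining = [e for e in candidates if e not in s_rc]
    while remaining:
        # try each unused node in the position right before the cloud
        trials = [(best_allocation(s_rc[:-1] + [e, cloud]), e) for e in remaining]
        d_star, e_star = min(trials)
        if d_star >= d_min:           # no further improvement: stop
            break
        d_min = d_star
        s_rc = s_rc[:-1] + [e_star, cloud]
        remaining.remove(e_star)
    return d_min, s_rc
```

With a contrived latency function that rewards extra nodes except one "bad" node, the loop recruits only the helpful node and then terminates.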

V. EXPERIMENTS
In this section, we conduct simulation experiments with realistic setups to evaluate our design and compare it with two baseline algorithms [26], [29], which are here termed fixedEdge and singleEdge. We use AlexNet [33] and VGGNet-19 [34] in our experiments for image classification. AlexNet is an eight-layer DNN, including five convolution layers and three fully-connected layers; the first, second, and fifth layers of AlexNet also involve max pooling. VGGNet-19 is a 19-layer DNN, which is divided into five convolutional segments. Each convolutional segment of VGGNet-19 is followed by a max pooling layer that reduces the size of the image data. Since our objective is to reduce the deep learning latency by turning to the collaborative edge, we evaluate our algorithm and the two baselines in terms of latency under a variety of experimental cases.

A. EXPERIMENTAL SETUP
In simulation, we set the parameters with realistic setups. The computing capacity of the cloud node is set to 3200 Gflops. We set the edge with four different setups in bandwidth resource and computing capacity; they are given as follows.
1) high-speed edge: the bandwidth of an in-edge link ranges from 500 Mbps to 1000 Mbps; 2) low-speed edge: the bandwidth of an in-edge link ranges from 10 Mbps to 200 Mbps; 3) high-capacity edge: the computing capacity of an edge node ranges from 80 Gflops to 640 Gflops; 4) low-capacity edge: the computing capacity of an edge node ranges from 4 Gflops to 32 Gflops. All the above setups are guided by [29], [35], [36] on the basis of empirical measurements. We evaluate the proposed CoEdge and the baselines under four different cases. For each experimental case, the computing capacity and the bandwidth are randomly chosen from the corresponding ranges. In all the experiments, we keep the edge network connected, although not all pairs of edge nodes are directly connected. Each edge node can communicate with the cloud through the Internet; for a given edge node, its bandwidth to the cloud is set to a random value between 1 Mbps and 10 Mbps. The input images for AlexNet and VGGNet-19 are 227 × 227 pixels (about 1.1794 Mb) and 224 × 224 pixels (about 1.1484 Mb) in size, respectively. The computing load and the reduction ratio of intermediate results are set to the default values of these two DNN models. Each experimental case is repeated 40 times, each time with a randomly-chosen edge node as the data source (i.e., the receiver of the input images), and the average for that case is reported.
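For reproducibility, the parameter ranges above can be captured in a small configuration sketch; the code below is illustrative (variable and function names are ours, not the paper's):

```python
import random

# Simulation parameter ranges, taken from the setup described above.
SETUPS = {
    "high_speed_bw_mbps":   (500, 1000),
    "low_speed_bw_mbps":    (10, 200),
    "high_capacity_gflops": (80, 640),
    "low_capacity_gflops":  (4, 32),
}
CLOUD_GFLOPS = 3200
EDGE_TO_CLOUD_BW_MBPS = (1, 10)

def draw_edge_node(speed, capacity, rng=random):
    """Sample one edge node's in-edge bandwidth and computing capacity for
    an experimental case, e.g., draw_edge_node('high', 'low') for the
    high-speed, low-capacity edge."""
    lo_b, hi_b = SETUPS[f"{speed}_speed_bw_mbps"]
    lo_c, hi_c = SETUPS[f"{capacity}_capacity_gflops"]
    return {"bw_mbps": rng.uniform(lo_b, hi_b),
            "gflops": rng.uniform(lo_c, hi_c)}
```

Each of the four experimental cases then corresponds to one (speed, capacity) combination, sampled independently per node.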

B. RESULTS AND ANALYSIS
Figures 6-9 show how the three schemes perform in different cases. In Fig. 6, we first examine the inference latency of CoEdge when the edge network is formed with edge devices whose processing capability exceeds 80 Gflops and which support high-speed communications with bandwidth ranging from 500 Mbps to 1000 Mbps. For both AlexNet and VGGNet-19, CoEdge always achieves the fastest inference, regardless of the edge network size. Additionally, compared with the two baselines, especially fixedEdge, CoEdge remains much more stable under each DNN model: its inference time fluctuates only slightly as the network size increases. It is worth noticing in Fig. 6 that although VGGNet-19 has only slightly more than twice as many layers as AlexNet, inference for VGGNet-19 takes far longer than that for AlexNet. For example, the inference latency of CoEdge on AlexNet is only 1.77 ms on average, whereas its latency on VGGNet-19 is higher than 90 ms; the inference time needed by singleEdge sharply grows to the order of hundreds of milliseconds. This observation reflects that for a complex DNN with a huge amount of computation, it is both necessary and feasible to ''dissolve'' the DNN into the edge-cloud system to further improve the inference performance.
Fig. 7 compares the three schemes in the cases where each edge node works with a constrained computing capacity, ranging from 4 Gflops to 32 Gflops. Comparing Fig. 7 with Fig. 6, we find that the computing capacity considerably impacts the deep inference delay. For both AlexNet and VGGNet-19, CoEdge outperforms the two baselines. The latency of CoEdge on VGGNet-19 increases to up to eight times its latency on AlexNet; a similar increase in latency also happens to the singleEdge scheme.
For the fixedEdge scheme, however, the constrained computing capacity increases its latency considerably when it works with VGGNet-19, which is deeper than AlexNet. The comparison shown in Fig. 7 leads to a two-fold indication. First, in the high-speed and low-capacity edge, the singleEdge scheme recruits only one edge node to share the DNN layers and thus avoids the latency bottleneck potentially caused by low-capacity edge nodes. That is why singleEdge outperforms fixedEdge on the VGGNet-19 model: fixedEdge always pre-selects a set of edge nodes without considering the actual edge network resources. Second, in comparison with the two baselines, CoEdge dynamically cherry-picks the ''best-fitting'' edge nodes and lets them collaboratively perform the DNN task, thereby achieving far lower latency.
Fig. 8 compares the three schemes in latency under the low-speed and high-capacity edge. The overall performance of CoEdge for both AlexNet and VGGNet-19 is very close to its counterpart shown in Fig. 6. By contrast, fixedEdge performs slightly worse in the low-speed edge than it does in the high-speed edge, which reflects the impact of the reduced edge bandwidth on the deep learning latency. Comparing the results in Fig. 8 and Fig. 9, we find that when the edge resources are highly constrained in terms of both bandwidth and computing capacity, the proposed CoEdge scheme is more capable of reducing the deep learning latency, because it well leverages the edge-cloud collaboration. In summary, the above experimental results show that CoEdge is more resource-aware than the two baselines. The singleEdge baseline uses only a single edge device to share the DNN layers with the cloud, and the fixedEdge baseline processes a DNN on an already-configured set of pipelined edge devices.
Both baselines allocate DNN layers without trying to exploit all the available edge resources, which is the essential reason they perform worse than CoEdge.
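The intuition behind these comparisons can be captured by a simple latency model (our own simplification for illustration, not the paper's exact formulation): a layer's latency is its compute time on the hosting node plus the time to ship its output over the link to the next node, so a low-capacity node or a slow link anywhere in the pipeline inflates the total.

```python
def layer_latency(flops, capacity_gflops, output_mbits, link_mbps):
    """Latency of one DNN layer: compute time plus transfer time."""
    compute_s = flops / (capacity_gflops * 1e9)   # flops / (flops per second)
    transfer_s = output_mbits / link_mbps          # Mbit / Mbps = seconds
    return compute_s + transfer_s

def pipeline_latency(layers, nodes):
    """Total latency for an allocation where layers[i] runs on nodes[i].

    layers: list of (flops, output_mbits) per layer
    nodes:  list of (capacity_gflops, link_mbps) per hosting node
    """
    return sum(
        layer_latency(flops, cap, out_mbits, link)
        for (flops, out_mbits), (cap, link) in zip(layers, nodes)
    )
```

Under this model, assigning a compute-heavy layer to a 4-Gflops node instead of a 640-Gflops one raises that layer's compute term by two orders of magnitude, which is consistent with the gap between Fig. 6 and Fig. 7.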

VI. CONCLUSION
In this paper, we have designed and implemented CoEdge, which exploits the communication and computing resources of the edge-cloud computing environment to deploy the deep learning task, in order to minimize the deep learning latency. CoEdge involves effective approaches to determining beneficial nodes and can allocate the deep learning layers to these selected nodes in polynomial time, laying a foundation for collaborative processing of the deep learning task. Our extensive simulation experiments with realistic setups also demonstrate that CoEdge outperforms two state-of-the-art schemes in terms of latency. In the future, we will take a further step towards achieving latency-aware allocation of DNN layers in the edge-cloud system with dynamic edge resources in terms of communication and computation.
LIANGYAN HU received the B.E. degree in computer science and technology from the School of Information Science and Technology, Beijing Forestry University, Beijing, China, in 2018. She is currently pursuing the master's degree in software engineering with Beijing Forestry University. Her current research interests include edge computing, deep learning, and mobile computing.
GUODONG SUN (Member, IEEE) was a Postdoctoral Researcher with Tsinghua University, China, before joining the Faculty of Beijing Forestry University. He was a Visiting Professor of computer science with North Carolina University at Charlotte, USA. He is currently an Associate Professor of computer science with the School of Information Science and Technology, Beijing Forestry University, Beijing, China. His research interests include mobile computing, wireless ad-hoc and sensor networks, combinatorial optimization, the Internet of Things, and machine learning. He is a member of the IEEE Computer Society.
YANLONG REN received the B.E. and M.E. degrees from the School of Information Science and Technology, Beijing Forestry University, Beijing, China, in 2013 and 2015, respectively. He is currently a Senior Engineer with the Network Information Management and Service Center, Beijing University of Civil Engineering and Architecture, Beijing. His major efforts are put to maintaining the large-scale network infrastructure and the cloud-based data center of his university.
VOLUME 8, 2020