Joint Online Coflow Optimization Across Geo-Distributed Datacenters

Growing volumes of data are generated, stored, and processed across geographically distributed datacenters. Processing such geo-distributed datasets must account for task and job placement, heterogeneous available link bandwidth, and routing decisions. Optimizing the transmission of inter-datacenter flows, especially coflows that capture application-level semantics, is therefore important. However, prior coflow-scheduling solutions either ignore routing or fix the endpoint placement. This article focuses on jointly optimizing endpoint placement, coflow scheduling, and coflow routing to minimize the average coflow completion time (CCT) across geo-distributed datacenters. It first presents the system model and problem formulation, then derives a randomized approximation algorithm for a single coflow, and further proposes an online algorithm for the more complex case of multiple coflows. Theoretical analysis proves that the proposed algorithms have a non-trivial competitive ratio. Results from extensive simulations demonstrate that the proposed algorithms significantly reduce the average CCT while keeping a comparable algorithm run time relative to state-of-the-art solutions.


I. INTRODUCTION
Today, large volumes of data, e.g., from games, online video, data mining, and scientific computing, are generated at geographically distributed datacenters. A key feature of these applications is that a collection of flows, termed a coflow [1], is generated to transfer intermediate data between successive computation stages. A coflow does not finish until all of its flows have completed [2].
Based on this trend, it is more economical to leave data in place and execute jobs over the geo-distributed network, rather than aggregating all the data required by the applications into a single datacenter [3], [4].
Optimizing coflow transmission is critical because a coflow's completion time (CCT) can account for more than 50 percent of the job completion time [5]. As the performance gap between computation and storage devices continues to widen [6], flow transmission is increasingly likely to become the performance bottleneck of a job.
However, in geo-distributed networks, the flows within a coflow must traverse heterogeneous inter-datacenter links, where several critical factors affect the transmission. First, the endpoints of flows in a coflow are closely tied to the tasks of the corresponding job, while each task can be placed in any datacenter that has available computational resources [7]. Second, the bandwidth on inter-datacenter links is limited and can vary significantly across links [8]. Third, the complex network environment among datacenters makes routing more challenging. Fourth, coflows in the network may compete concurrently for transmission resources such as link capacity [9].
Thus, minimizing the average CCT is urgent for improving the performance of jobs running across geo-distributed datacenters [10]. Many prior works have focused solely on one of flow scheduling, routing-path selection, or endpoint placement to reduce the CCT, but they never consider task placement, coflow scheduling, and coflow routing together. As a result, the whole picture of coflow transmission is lost [11]. A small example illustrates the problem. In Fig. 1, three datacenters (DC1, DC2, DC3) are connected by green links (a wider green link means a larger bandwidth capacity). The current resource occupation is as follows:

FIGURE 1. An illustrating example, where two coflows A and B have 3 and 2 tasks, respectively. Each task has to find a spare computing slot (circle) and receive all the data with the same color (cylinder).
• DC1 has one spare computing slot (circle) and three places for storing data (cylinder), where two of them are occupied by blue data and black data (larger cylinder means bigger data volume) respectively.
• DC2 has three spare computing slots (circle) and two places for storing data (cylinder), where one of them stores black data.
• DC3 has two spare computing slots (circle) and three places for data storage (cylinder), where two of them are occupied by blue data and black data.

Assume two coflows A and B arrive with 3 and 2 tasks, respectively. Since the tasks in coflow A are colored blue, every such task needs all the blue data (cylinders) among DC1-DC3 to finish its work (one block in DC1 and one in DC3 in this example), and it must also occupy a spare computing slot (circle) in one of the three datacenters. Any task in coflow B has the same kind of demand except that it is colored black, so it needs all the black data (one block each in DC1, DC2, and DC3) and a spare computing slot. Note that links among datacenters have various bandwidth capacities and data blocks in different datacenters may have diverse volumes, so the problem is how to assign the spare computing slots and link bandwidth to the tasks of the coflows so as to minimize the average CCT.
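The intuition behind this example can be made concrete with a small sketch. All numbers below are illustrative, not taken from the figure: the CCT of one task is the time until its slowest input flow finishes, and the best placement can be found by enumerating the datacenters with a spare slot.

```python
# Hypothetical setup (illustrative only): bandwidth of link (src, dst) in Gbps,
# and the volume (GB) of each blue data block a coflow-A task must fetch.
bandwidth = {(1, 2): 1.0, (2, 1): 1.0, (1, 3): 2.0,
             (3, 1): 2.0, (2, 3): 0.5, (3, 2): 0.5}
blue_data = {1: 4.0, 3: 2.0}        # blue inputs stored at DC1 and DC3
spare_slots = {1: 1, 2: 3, 3: 2}    # spare computing slots per datacenter

def cct_for_placement(task_dc, data):
    """CCT of a single task = time until its slowest input flow finishes."""
    times = []
    for src, vol in data.items():
        if src == task_dc:
            continue                                       # local data: no transfer
        times.append(vol * 8 / bandwidth[(src, task_dc)])  # GB -> Gb, / Gbps = s
    return max(times, default=0.0)

# Enumerate candidate datacenters that have a spare slot; pick the fastest one.
best = min((dc for dc in spare_slots if spare_slots[dc] > 0),
           key=lambda dc: cct_for_placement(dc, blue_data))
print(best, cct_for_placement(best, blue_data))
```

With these hypothetical numbers, DC1 wins because only the remote block at DC3 must be transferred over a fast link; a full solution must additionally share slots and bandwidth across both coflows.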
In brief, this paper focuses on the problem of jointly considering endpoint placement, coflow scheduling, and routing to minimize the average CCT of coflows across geo-distributed datacenters. The paper models the resource (bandwidth, computing slots, paths) assignment problem and formulates it as a mixed integer program. The model takes into account heterogeneous link bandwidth capacities, different coflow arrival times, and the available computation resources of datacenters. Since in most practical scenarios information about future coflows is unavailable, an online solution is more suitable. Therefore, this paper presents an online coflow-aware optimization framework. It first proposes an approximation algorithm that provides a solution, with bounds, for a single coflow. It then proposes an efficient online algorithm with theoretical performance guarantees to minimize the average CCT in the multi-coflow case.
The main contributions are as follows:
• This paper studies the problem of jointly considering endpoint placement, coflow scheduling, and routing to minimize the average CCT of coflows across geo-distributed datacenters. Specifically, it develops the mathematical model and formulates a mixed integer program.
• This paper presents a coflow-aware optimization framework to solve the problem. In particular, it develops online algorithms for single and multiple coflows that dynamically arrive at the network.
• This paper conducts theoretical analysis to demonstrate that the proposed algorithms can achieve a good competitive ratio.
• Extensive trace-driven simulations evaluate the performance of the proposed algorithms, in terms of average CCT, algorithm run time, and impact of coflow characteristics.
The rest of this paper is organized as follows. Section II discusses the related works. Section III develops the system model and presents the problem formulation. Section IV shows the details of the algorithms. The implementation and experimental results are presented in Section V. Finally, Section VI concludes this paper.

II. RELATED WORKS
The coflow abstraction was proposed to express the collective behavior of flows in intra- and inter-datacenter networks [1]. Researchers have studied coflows in many settings [3]: scheduling, routing, task placement, fairness, and cost, with information-agnostic or information-aware, centralized or distributed methods [10], [12]. Specifically, this work involves three factors, endpoint placement, coflow scheduling, and routing, which are discussed in the following.
Placing tasks close to their intermediate data to optimize job completion times is called endpoint placement [7]. Iridium [13] and Flutter [14] reduce the job completion time by placing reduce tasks on or close to datacenters that hold a large amount of intermediate data and have relatively high link bandwidth. To transform their formulations into LPs, they use simple objective functions such as linear combinations, which limits the scope of their algorithms. ShuffleWatcher [15] places both map and reduce tasks on the same set of racks in a single datacenter to improve shuffle locality. Sinbad [2] exploits the flexibility in placing the output data of jobs, but it only considers one stage at a time for each job. CLARINET [4] handles network flow scheduling and task placement one after the other. Corral [11] jointly optimizes input-data and task placement to reduce network contention.
Finishing tasks faster while improving network utilization is mainly the domain of scheduling and routing [16]. Varys [12] proposes the coflow abstraction and heuristics to schedule coflows so as to minimize the average CCT and meet coflow deadlines, and [17] guarantees coflow completion within deadlines. Moreover, BlindFlow [18] assumes flows arrive without known demand volumes and proposes an online algorithm with an approximation ratio. Qiu [19] proposes the first deterministic algorithm with a constant approximation ratio for offline scheduling of multiple coflows. RAPIER [20] studies coflow routing and scheduling simultaneously, but it relies on heuristics and hence lacks theoretical performance guarantees on minimizing the average CCT. Aalo [10] applies multi-level feedback queues (MLFQ) to schedule coflows online, and DeepAalo [21] uses deep reinforcement learning to adapt the queue thresholds. Further, Fai [22] improves on Aalo by identifying the bottleneck flow. CODA [23] employs an algorithm to automatically group network flows into coflows before scheduling, so that coflows can be extracted without application modification. Dogar [24] and Luo [25] investigate the decentralized coflow-aware scheduling problem.
Besides the above studies, some recent works [26]-[29] study coflow scheduling in optical circuit switches (OCS). Chen [30] and Wu [31] propose to achieve lexicographic max-min fairness among multiple jobs, modeling the datacenter network as a big switch, and mixCoflow [32] uses similar methods to handle deadline and non-deadline coflows together. [33] proposes a metric called coflow age to measure streaming applications. Li [16] combines cost and performance by using an online framework called Lever to balance the trade-off. Weaver [34] formulates the network architecture as heterogeneous parallel networks and divides the problem into microscopic and macroscopic levels.
The works most related to this paper are OMCoflow [9], [35], which combines scheduling and routing to minimize the average CCT, and SmartCoflow [7], which combines scheduling and endpoint placement to minimize the average CCT. For example, Li [35] proposes OneCoflow and OMCoflow, two online algorithms that route and schedule coflows simultaneously; they are proved to provide a theoretical guarantee on the average CCT, similar to this work. However, this work takes all three factors, endpoint placement, coflow scheduling, and coflow routing, into consideration.

III. MODELING
This section describes the mathematical system model and presents the problem formulation. Important notations used in this paper are listed in Table 1.

A. SYSTEM MODEL
This paper considers a network of multiple geo-distributed datacenters denoted as N = {1, . . . , N}. The network contains a set of inter-datacenter links, denoted as E; for each link l ∈ E, C_l denotes its bandwidth capacity. Suppose there is a set of coflows in the network, denoted as K = {1, . . . , K}. Each coflow k ∈ K arrives at the inter-datacenter network at time a_k. The information associated with each coflow k is assumed to be known as soon as the coflow arrives, including the number of tasks to be launched; U_i denotes the number of available computing slots in datacenter i ∈ N. Note that when a coflow k arrives and one of its tasks r is placed on some datacenter i, it is assumed that the number of flows to be created, the size of each flow, and the set of available paths toward datacenter i are known [7], [9].
To indicate the task placement and coflow routing, define the binary variable I^{k,i}_{r,f,p} as whether the r-th task of coflow k ∈ K is placed on datacenter i ∈ N and its flow f ∈ F^{k,i}_r is routed on path p ∈ P_f. Since each task can be processed by only one datacenter, this paper has the following constraint:

∑_{i∈N} ∑_{p∈P_f} I^{k,i}_{r,f,p} = 1, ∀k, r, f.

The summation over p means that, for any flow, there is exactly one path that it finally uses.
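The one-datacenter, one-path requirement can be sketched as a feasibility check. In this illustrative snippet (the dictionary layout and function name are assumptions, not the paper's interface), `I` maps each `(k, i, r, f, p)` tuple to its 0/1 indicator value:

```python
from collections import defaultdict

def placement_is_valid(I, flows):
    """Check the placement/routing constraint: for every flow (k, r, f),
    exactly one (datacenter i, path p) pair must be selected, i.e.
    sum over i, p of I[(k, i, r, f, p)] == 1.
    I: dict mapping (k, i, r, f, p) -> 0 or 1; flows: iterable of (k, r, f)."""
    totals = defaultdict(int)
    for (k, i, r, f, p), v in I.items():
        totals[(k, r, f)] += v            # accumulate over all (i, p) choices
    return all(totals[key] == 1 for key in flows)
```

A solver's integral solution passes this check by construction; it is mainly useful for validating rounded solutions.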
To indicate the coflow scheduling, define b^{k,i}_{r,f}(x) as the amount of bandwidth allocated to coflow k for transmitting data to datacenter i through flow f on its chosen path at time x (x ≥ 0). Note that b^{k,i}_{r,f}(x) can be zero for some x, implying that the flow is waiting for transmission or that no such flow exists for coflow k.
Define T_k as the CCT of coflow k. Since every flow of coflow k must finish transmitting its data between the arrival time a_k and the completion time a_k + T_k,

∫_{a_k}^{a_k + T_k} b^{k,i}_{r,f}(x) dx = v^{k,i}_{r,f}, ∀k, i, r, f,

where v^{k,i}_{r,f} is the flow volume, assumed to be known in advance. At any time, the sum of the bandwidth allocations on any link cannot exceed its capacity:

∑_{k} ∑_{i} ∑_{r} ∑_{f} 1(l ∈ p) · b^{k,i}_{r,f}(x) ≤ C_l, ∀l ∈ E, ∀x,

where 1(l ∈ p) equals 1 if link l is on the path p chosen for flow f, and 0 otherwise.
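The volume and capacity constraints above can be checked numerically in a discretized-time model. This sketch is an assumption-laden simplification (time in uniform slots of length `dt`; `schedule`, `paths`, and the function name are illustrative, not from the paper):

```python
def schedule_is_feasible(schedule, paths, volumes, capacity, dt=1.0):
    """schedule[(flow, t)]: bandwidth allocated to a flow in time slot t;
    paths[flow]: set of links on the flow's chosen path;
    volumes[flow]: demand to be transferred; capacity[link]: C_l."""
    # 1) Volume constraint: allocated bandwidth must integrate to the demand.
    for flow, vol in volumes.items():
        sent = sum(b * dt for (f, t), b in schedule.items() if f == flow)
        if abs(sent - vol) > 1e-9:
            return False
    # 2) Capacity constraint: at every slot, flows sharing link l fit in C_l.
    slots = {t for (_, t) in schedule}
    for t in slots:
        for link, C in capacity.items():
            load = sum(b for (f, tt), b in schedule.items()
                       if tt == t and link in paths[f])
            if load > C + 1e-9:
                return False
    return True
```

In the paper's continuous-time formulation the sum over slots becomes the integral from a_k to a_k + T_k; the discretization here only serves to make the constraints executable.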

B. PROBLEM FORMULATION
Now this paper formulates the problem of jointly considering task placement, coflow scheduling, and coflow routing to minimize the average CCT of coflows across geo-distributed datacenters, as shown in the following problem P1. The objective (5a) is to minimize the average CCT across all coflows, where I^{k,i}_{r,f,p} is the endpoint placement variable and b^{k,i}_{r,f}(x) is the scheduling decision variable. Constraint (5d) means that the total number of tasks assigned to datacenter i should not exceed U_i, the number of available computing slots. The remaining four constraints represent the endpoint placement, coflow scheduling, and coflow routing requirements mentioned previously.
It can be observed that problem P1 is a mixed integer program, and it is NP-hard by Theorem 1.
Theorem 1: Problem P1 is NP-hard.
Proof: When there is only one datacenter, i.e., |N| = 1, and the routing paths are fixed, all tasks can only be placed in this datacenter. In this case, the non-preemptive Single-Machine Scheduling Problem (SMSP) reduces to problem P1. This shows that P1 is no easier than SMSP, which has been proved NP-hard [36].
Alternatively, when the task placements and routing paths are fixed, the NP-hard coflow scheduling problem of minimizing the average CCT becomes a special case of problem P1 [12].
To solve this problem, one could devise an offline algorithm, which requires knowledge of all coflow information from the beginning to the end of the run, including the source/destination nodes as well as the volumes. In practice, however, such information remains unknown until a coflow arrives and cannot be reliably predicted. Hence, an online algorithm is more desirable. In the following, this paper presents several algorithms to address problem P1.

IV. ALGORITHMS
This section first deals with a single coflow, using a randomized approximation algorithm to minimize its CCT, and then extends to the online multi-coflow case.

A. MINIMIZING SINGLE COFLOW COMPLETION TIME
If there is only one coflow in the network, then |K| = 1, and the problem can be formulated as the following problem P2, the single-coflow version of P1. Since P2 is still an integer program, this paper relaxes it into a linear program P3. However, the solution of P3 may not be feasible for the ILP P2, so a randomized rounding technique is used to obtain a feasible solution. The procedure for minimizing the single-coflow CCT is summarized in Algorithm 1.

Algorithm 1 Minimizing the Single-Coflow CCT
1: Solve the relaxed LP P3 to obtain a fractional solution.
2: for each combination of r, f, p do
3: Sample a datacenter i with probability I^i_{r,f,p}.
4: If U_i = 0, repeat step 3. Otherwise, set I^i_{r,f,p} = 1 and update U_i = U_i − 1.
5: end for
6: Find the smallest real number α ≥ 1 that scales the bandwidths to make the rounded solution feasible.

At step 1, the algorithm obtains the optimal solution of P3. In steps 2-5, it samples a datacenter i with probability I^i_{r,f,p} for each r, f, p combination. In step 6, it scales the bandwidth to make the solution feasible.

The dominant overhead of Algorithm 1 is the LP computation, and hence its time complexity is O(LP(O(NRFP), O(NRFP))), where N = |N|, R = max_k |R_k|, F = max_{k,i,r} |F^{k,i}_r|, P = max_f |P_f|, and LP(x, y) is the time complexity of solving an LP with x variables and y constraints.
Next, this paper provides a lower bound and an upper bound on the CCT achieved by Algorithm 1 (Theorem 2). A solution S of P3 is constructed as in [9], and it only remains to verify the capacity constraint (6f) for each l ∈ E: by the definition of the completion time, and since S is a valid solution for every link l, the desired bounds follow.

Theorem 3: Algorithm 1 guarantees that its competitive ratio α for problem P2 satisfies Pr(α > 4 ln 2n) ≤ 1/4.

Proof: For each l ∈ E, define the per-link scaling factor α_l, and let α = max_l α_l, which is the smallest bandwidth scaling factor that ensures a feasible solution. By the union bound, it is adequate to fix l and prove that Pr[α_l > 4 ln 2n] ≤ 1/(4n^2). Define the random variables X^i_{r,f} := 1(l ∈ p) · b^i_{r,f} and X := ∑_{i∈N} ∑_{r∈R} ∑_{f∈F^i_r} X^i_{r,f}. Let t > 0 be a parameter to be fixed later, as in [7]. Applying Markov's inequality to exp(tX) [9], and using the fact that 1 + x ≤ exp(x) for all real x [16], an upper bound on the tail probability of X is obtained. The parameter t is then selected so that this bound is at most 1/(4n^2), using E[X] ≤ Y. Note that the constraint exp(t g_i) ≤ 1 + (1/2) t β g_i must hold for all i [16]; this is verified via the fact that a^x ≤ 1 + ax for a ≥ 1 and 0 ≤ x ≤ 1. Substituting the chosen t terminates the proof.
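For reference, the tail bound in Theorem 3 follows the standard Chernoff-type recipe: introduce a free parameter t > 0, exponentiate, apply Markov's inequality, and use 1 + x ≤ e^x. In generic form, for independent non-negative variables X_i with X = ∑_i X_i:

```latex
\Pr[X > a]
  = \Pr\left[e^{tX} > e^{ta}\right]
  \le e^{-ta}\,\mathbb{E}\left[e^{tX}\right]              % Markov's inequality
  = e^{-ta}\prod_i \mathbb{E}\left[e^{tX_i}\right]        % independence of the X_i
  \le \exp\Big(-ta + \sum_i \big(\mathbb{E}\left[e^{tX_i}\right]-1\big)\Big). % 1+x \le e^x
```

The parameter t is then chosen to drive the right-hand side below the target probability, here 1/(4n^2) per link before the union bound.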

B. HANDLING MULTIPLE COFLOWS
Building on Algorithm 1 for the single-coflow case, this paper now designs an online algorithm to minimize the average CCT of multiple coflows. The basic idea is that Algorithm 1 is invoked to compute the task placement, bandwidth allocation, and routing for each newly arrived coflow. Then, the bandwidths of all existing flows are scaled up or down, so as to obtain a feasible solution for each coflow while fully exploiting the link capacity.
The details are shown in Algorithm 2, which is proved later to be competitive with a non-trivial ratio for problem P1. Whenever a coflow arrives or completes, the algorithm adds the new coflow to J or removes the completed coflow from J, where J stores the set of coflows that are not completed at the current time. For each coflow k ∈ J, it defines a weight λ_k, where T^k_{P3} is the optimal objective of the LP P3 for coflow k. It then finds the largest factor to rescale the bandwidths of all flows in J so as to achieve the work-conserving property, and repeats upon the next event.

Theorem 4: Algorithm 2 is Kγ-competitive for the original problem P1, where γ is the competitive ratio of Algorithm 1.
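The event-driven skeleton of the online procedure can be sketched as follows. The function and parameter names are illustrative: `solve_single` stands in for Algorithm 1, and `rescale` for the step that scales all active allocations by the largest feasible work-conserving factor.

```python
def on_coflow_event(active, event, solve_single, rescale):
    """Handle one arrival/completion event of the online algorithm (sketch).
    active: dict coflow_id -> current allocation for coflows not yet done."""
    kind, coflow = event
    if kind == "arrival":
        # Compute placement, routing, and bandwidth for the new coflow in
        # isolation; this happens only once, upon arrival.
        active[coflow] = solve_single(coflow)
    elif kind == "completion":
        active.pop(coflow, None)
    # Rescale every active coflow's bandwidth so that the joint allocation
    # stays feasible and work-conserving on each link.
    return rescale(active)
```

Note that, as in the paper, the expensive per-coflow computation runs once per arrival; only the cheap rescaling runs on every event.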
Proof: Let T_P1, T^k_{P2}, and T^k_{P3} denote the optimal value of problem P1 and the optimal CCTs of coflow k for problems P2 and P3, respectively, as in [16].
Each coflow k contributes to the average CCT no less than its minimum completion time T^k_{P2}, obtained when it occupies the network exclusively. Given the result of Theorem 2, T_P1 is lower-bounded accordingly [16]; therefore, it only remains to compare the performance of the proposed algorithm with (1/K) ∑_{k=1}^{K} T^k_{P3}. Specifically, let T_a denote the average CCT achieved by Algorithm 2; the goal is to prove T_a ≤ γ ∑_{k=1}^{K} T^k_{P3} ≤ Kγ T_P1. Consider an auxiliary optimization problem over non-negative values x_k [9]. By the Cauchy-Schwarz inequality, with the weights λ_k chosen as in Algorithm 2 for all k ∈ J, this auxiliary problem is optimally solved. Hence, in each iteration of Algorithm 2, the weights λ_k are chosen optimally and have the least impact on the average CCT when rescaling the bandwidth of each coflow. Let T^k_a denote the CCT of coflow k achieved by Algorithm 1 [9]. Since J is a subset of {1, . . . , K}, a per-coflow inequality holds for every k; combining it with the guarantee of Algorithm 1 bounds the CCT of each coflow, and applying the Cauchy-Schwarz inequality again completes the proof.
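For reference, the Cauchy-Schwarz step applied twice in the proof is the standard inequality for non-negative reals x_1, . . . , x_K, obtained by pairing the vector (x_1, . . . , x_K) with the all-ones vector:

```latex
\left(\sum_{k=1}^{K} x_k\right)^{2} \le K \sum_{k=1}^{K} x_k^{2}
\qquad\Longleftrightarrow\qquad
\sum_{k=1}^{K} x_k \le \sqrt{K}\left(\sum_{k=1}^{K} x_k^{2}\right)^{1/2}.
```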
Note that Algorithm 2 computes the endpoint placement, scheduling, and routing for each coflow only once, upon its arrival.

V. PERFORMANCE EVALUATION
This paper develops a trace-driven simulator to compare this work with the following two state-of-the-art schemes:
• OMCoflow [9]: combines scheduling and routing to minimize the average CCT.
• SmartCoflow [7]: combines scheduling and endpoint placement to minimize the average CCT.

Specifically, this section examines the average CCT, the algorithm run time, and the impact of coflow characteristics for OMCoflow, SmartCoflow, and this work.
The evaluation platform is macOS Big Sur 11.0.1, with a 2.3 GHz 8-core Intel i9 and 32 GB of 2400 MHz DDR4 memory.

A. SIMULATION SETUP
The evaluation uses a simulation setup similar to [7], [9], so as to test the performance at the same scale.
Practical data is crucial for verifying the proposed algorithms, as in [37]. The workload is therefore based on a Hive/MapReduce trace [38] collected on a 3000-machine, 150-rack cluster with a 10:1 oversubscription ratio, which was also used in [10], [12]. The paper rescales the coflows in this trace to match a 20-datacenter inter-datacenter network; during the scaling, the original coflow communication pattern is maintained as in [16]. The capacity of each inter-datacenter link is selected randomly between 100 Mbps and 2 Gbps, in order to simulate practical bandwidth environments [16]. In practice, heterogeneous link capacities can be achieved via Linux Traffic Control [39].
Similar to [12], [19], non-zero coflows are divided into 4 categories, as shown in Table 2. A coflow is N (narrow) if it involves fewer than 50 flows, and otherwise it is W (wide); a coflow is S (short) if its length is less than 5 MB, and otherwise it is L (long).

Fig. 2 illustrates the average CCT of coflows achieved by the different schemes, where the unit of the y-axis is ms × 10^4. Across all types of coflows, this work speeds up the average CCT by 4%-12% compared with OMCoflow and SmartCoflow. The reason is that this work jointly optimizes endpoint placement, scheduling, and routing, while the other two schemes only combine two of the three factors.
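The two-way split used in Table 2 can be written as a small helper (the function name is illustrative):

```python
def classify_coflow(num_flows, length_mb):
    """Classify a coflow into the four categories of Table 2:
    narrow (N) if it has fewer than 50 flows, wide (W) otherwise;
    short (S) if its length is below 5 MB, long (L) otherwise."""
    width = "N" if num_flows < 50 else "W"
    size = "S" if length_mb < 5 else "L"
    return width + size   # one of "NS", "NL", "WS", "WL"
```

For example, a 100-flow, 500 MB coflow falls into the wide-long (WL) category.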

1) THE PERFORMANCE ON CCT
To understand the CCT of coflows at a microscopic level, this paper further plots the CDF (cumulative distribution function) of the completion times across all coflows for the three schemes in Fig. 3. The curve of this work clearly lies above the other two. The percentages of coflows completed within 5000 ms are 31.2%, 25.3%, and 33.0% for this work, OMCoflow, and SmartCoflow, respectively, and all coflows complete within 241020 ms, 260320 ms, and 28144 ms, respectively.
Fig. 4 shows the algorithm run time of the three schemes. The proposed algorithm runs in O(|K|) · O(LP(O(NRFP), O(NRFP))), while OMCoflow [9] runs in O(|K|) · O(LP(O(RP), O(RP + N^2))) and SmartCoflow [7] in O(|K|) · O(LP(O(RN), O(N^2))). Note that the unit of the y-axis is now ms × 10. The run time of this work is noticeably higher (about 3 times) than that of the other two schemes. However, compared with the average CCT (ms × 10^4) in Fig. 2, the influence of the algorithm run time (ms × 10) is relatively small.

2) IMPACT OF COFLOW PARAMETERS
Next, the paper studies the impact of key coflow parameters, namely the total coflow number, the coflow width, the coflow size, and the inter-coflow arrival interval, similar to [9]. In the following figures, the comparison baseline is scheduling optimization with random routing and endpoint placement. Overall, Fig. 5 shows that this work outperforms OMCoflow and SmartCoflow under the different scenarios.
Coflow Number: The coflow width, the coflow size, and the mean inter-coflow arrival interval are set to 100, 500 MB, and 100 ms, respectively. Fig. 5a shows that, for all three schemes, the improvement in the average CCT grows with the number of coflows. The reason is that the scheduling and routing strategies yield more benefit when more coflows create severer competition for network resources. This work achieves at least 25% more improvement than OMCoflow and SmartCoflow.
Coflow Width: The coflow number, the coflow size, and the mean inter-coflow arrival interval are set to 100, 500 MB, and 100 ms, respectively. Fig. 5b shows that all three schemes gain more improvement in the average CCT as the width grows, for a similar reason as before. This work outperforms the other two schemes by at least 12.5%.
Coflow Size: The coflow number, the coflow width, and the mean inter-coflow arrival interval are set to 100, 100, and 100 ms, respectively. Fig. 5c shows that larger coflow sizes also lead to more intense resource competition, and with the help of the three schemes, the average CCT is improved. This work achieves 10% more improvement than the other two schemes.
Inter-Coflow Arrival Interval: The other parameters are fixed as in the previous parts. Fig. 5d shows that as the interval increases, the improvement in the average CCT grows at first and then flattens. When the intervals are small, this work performs worst; the reason could be that the algorithm run time can no longer be ignored in that regime. However, as the arrival interval grows, this work gains more improvement in the average CCT than the other two schemes.

VI. CONCLUSION
This paper studies the problem of jointly considering endpoint placement, coflow scheduling, and coflow routing to minimize the average CCT across geo-distributed datacenters. It first constructs a mathematical model, then proposes an approximation algorithm for minimizing the CCT of a single coflow, and further develops an online algorithm to minimize the average CCT of multiple coflows. It is proved that the proposed algorithms have a non-trivial competitive ratio without prior knowledge of future coflows. Extensive trace-driven simulations demonstrate that the algorithms can speed up the completion of coflows without significantly increasing the algorithm run time. Moreover, considering transmission cost in coflow optimization is a promising future direction, since the trade-off between bandwidth and cost is a critical metric in inter-datacenter and intra-datacenter networks.
ZHAOXI WU (Member, IEEE) received the B.E. degree from Shanghai Jiaotong University, China, in 2014. He is currently pursuing the Ph.D. degree with the School of Information Science and Technology, ShanghaiTech University. His research interests include data-center networks and network optimization.