An Optimized Straggler Mitigation Framework for Large-Scale Distributed Computing Systems

Nowadays, Big Data has become a research focus in industry, banking, social networks, and other fields. In addition, the explosive growth of data and information requires efficient processing solutions. Spark is therefore considered a promising candidate among Large-Scale Distributed Computing Systems for big data processing. One primary challenge is the straggler problem, which occurs in the presence of heterogeneity when a machine takes an exceptionally long time to finish executing a task, decreasing the system throughput. To mitigate straggler tasks, Spark adopts a speculative execution mechanism, in which the scheduler launches additional backup tasks to avoid slow task processing and achieve acceleration. In this paper, a new Optimized Straggler Mitigation Framework is proposed. The proposed framework uses a dynamic criterion to determine the most likely straggler tasks. This criterion is based on multiple coefficients to achieve a reliable straggler decision. It also integrates historical data analysis and online adaptation for intelligent straggler judgment, which guarantees the effectiveness of speculative tasks and improves cluster performance. Experimental results on various benchmarks and applications show that the proposed framework achieves execution time reductions of 23.5% to 30.7% and cluster throughput increases of 25.4% to 46.3% compared with the Spark engine.


I. INTRODUCTION
In the last decade, the huge amount of digital data has become a key issue to be stored, managed, and analyzed. As a consequence, Large-Scale Distributed Computing Systems are a promising solution in many fields [1], [2], [3], [4], [5], [6], [7]. Many companies believe that these computing systems are the most effective and fault-tolerant method to store and handle enormous volumes of data [8]. Hadoop [9] and Spark [10] are two popular distributed computing systems that are widely used in industry and academia. Hadoop is an open-source software framework for handling massive volumes of data, providing comprehensive processing and analytical capabilities. The Hadoop core is composed of a distributed file system for storage and a MapReduce processing engine [11]. This data processing computing system consists of three stages: the Map phase, the Shuffle phase, and the Reduce phase.
In this system, huge files are decomposed into several small pieces of similar size and distributed across the cluster for storage. Spark is an alternative distributed computing technology that is open-source and free to use. It is implemented on top of Hadoop, and its goal is to provide a general-purpose programming model that is faster and more fault-tolerant than MapReduce. The Resilient Distributed Dataset (RDD) [12] is a technology introduced by Spark that provides application programming interfaces (APIs) enabling transformations and parallelization of data, which users can adapt on the basis of their applications. As a result, the performance of batch, interactive, streaming, and iterative computations can be increased by persisting RDDs in memory. Furthermore, Spark offers a variety of sophisticated modules built on top of the Spark core, including Spark Streaming [13], Spark SQL [14], GraphX [15], and MLlib [16]. The Spark Streaming module allows Spark to build streaming applications, while the Spark SQL module is used for structured data processing. Also, GraphX is a graph API for graph-parallel computation, and MLlib provides scalable machine learning algorithms.

In this paper, an Optimized Straggler Mitigation Framework is proposed that selects the most suitable candidates for speculative tasks for more efficient straggler mitigation. We can summarize the major contributions of this paper as follows: (1) We propose an Optimized Straggler Mitigation Framework that presents a dynamic criterion to predict the tasks that suffer from stragglers in an efficient manner. (2) The proposed framework criterion is based on multiple coefficients, which leads to finding the optimum straggler decision.

This paper is organized as follows: Section II describes the background and motivation. Section III demonstrates the system model and problem formulation. Section IV describes the implementation details of the proposed framework. Section V examines the complexity analysis. Section VI shows an illustrative example. Section VII explores the performance evaluation and detailed results. In Section VIII, the paper is concluded with the main findings.

II. BACKGROUND AND MOTIVATION
This section presents a concise overview of the Spark computing system. After that, the straggler problem, as well as its solution in Spark, is explored.

A. SPARK OVERVIEW
Apache Spark [10] is a promising candidate among large-scale distributed computing systems. Spark is intended to improve application execution as well as to fulfill scalability and fault tolerance by using the resilient distributed dataset (RDD) [12]. An RDD is a read-only collection of objects partitioned among a number of machines that can be reconstructed when one of the partitions is lost. Each Spark application launches a single master process known as the driver, which is in charge of task scheduling. It employs a hierarchical scheduling procedure that includes jobs, stages, and tasks, where the term ''stages'' refers to smaller groups of tasks that are separated from interdependent jobs. As shown in Fig. 1, a Spark cluster is made up of a single master node as well as many slave nodes known as workers. Every worker runs on an execution node, which may incorporate one or many executors. Each executor has the ability to use many cores and execute tasks at the same time. When a Spark application is submitted, the master calls the resource manager to obtain computing resources based on the application's needs. Once the resources are ready, tasks are assigned to all executors in parallel through the Spark scheduler. Then, the master node tracks the status of the executors and gathers the results from the worker nodes throughout this process. In this paper, Spark is used as the target framework to determine and predict, in an efficient manner, the tasks that will suffer from stragglers.

B. THE STRAGGLER PROBLEM AND SPECULATIVE EXECUTION
In Spark, a job is broken down into one or more stages. Afterwards, stages are divided into separate tasks. A task is considered a unit of execution that runs on a Spark worker in the cluster. When a task in execution becomes slower than other tasks in the same job, it is called a ''straggler task''; such a task prolongs the entire job and affects the cluster throughput. There are numerous causes that can make a task a straggler, such as the heterogeneity of the underlying machines. The common solution to this problem is speculative execution [31]. Although the speculative execution mechanism seems to be a simple matter, since it merely restarts straggler tasks on another machine, in actuality it is a complicated issue, because speculative tasks consume resources, which may affect other running tasks. As a consequence, if a straggler task is not detected correctly, or a backup task does not finish earlier than the original task, resources will be consumed with no benefit.
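For context, Spark's native speculation mechanism referenced above is controlled by a handful of configuration properties. The following minimal PySpark sketch enables it; the values shown are Spark's usual defaults and are purely illustrative, not the settings proposed by this paper:

    from pyspark.sql import SparkSession

    # Enable Spark's built-in speculative execution (illustrative values).
    spark = (SparkSession.builder
             .appName("speculation-demo")
             .config("spark.speculation", "true")            # launch backups for slow tasks
             .config("spark.speculation.interval", "100ms")  # how often tasks are checked
             .config("spark.speculation.quantile", "0.75")   # fraction that must finish first
             .config("spark.speculation.multiplier", "1.5")  # slower-than-median factor
             .getOrCreate())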

Also, it leads to an increase in the job execution time and a decrease in the cluster throughput. The proposed framework therefore aims to enhance the speculative execution mechanism. To achieve this goal, the proposed framework uses a dynamic criterion to evaluate the most suitable tasks for speculation. This criterion is based on multiple coefficients to achieve a reliable straggler decision. Four coefficients are used by the proposed framework, namely the job quality of service limitation, the stage proceeding behavior, the processing bandwidth, and the cluster utilization level.

IV. THE IMPLEMENTATION DETAILS OF THE PROPOSED FRAMEWORK
The Proposed Mitigation Framework Architecture: The proposed framework is designed to work in conjunction with the Spark parallel data processing platform. The main modules of the proposed framework are a Straggler Decision Engine Module and a Straggler Alleviation Module. The Straggler Decision Engine Module is constructed from two components: the initial historical calculator component, which collects task execution logs to perform initial calculations, and the Dynamic Weighted Straggler Decision component, which determines the best threshold for identifying stragglers. After that, the straggler decision is made based on four weighted coefficients. The latter module, the Straggler Alleviation Module, foresees the machine performance to provide further effective straggler mitigation. Additionally, it guarantees the effectiveness of speculative tasks. Figure 2 shows the architecture of the proposed straggler mitigation framework.

A. STRAGGLER DECISION ENGINE MODULE
1) INITIAL HISTORICAL CALCULATOR COMPONENT
The straggler decision engine module includes the initial historical calculator component, which is used to gather the information of task execution logs and perform preliminary calculations. In Spark, the lifetime of an executed task is comprised of three time periods (TP1, TP2, and TP3): the deserialization of task period, the running task period, and the serialization of task results period, respectively. The deserialization of task period (TP1) is the elapsed time spent deserializing the task object and data. The running task period (TP2) is the elapsed time spent running the task, which includes the time for fetching the shuffle data. Finally, TP3, the serialization of task result period, is the elapsed time spent serializing the task result. It should be noted that the task execution information is collected for each node n, so the mean execution time of each period on a node n is recorded as TP1_n, TP2_n, and TP3_n. Therefore, the total time of a task over all periods can be computed as follows in (1):

Task_time_n = Σ_{j=1..3} TPj_n        (1)
where j is the period id and n is the node id.

Also, the total time over all completed and successful tasks on node n is defined as MT_n. In this context, the average total time of all successful tasks is denoted MT.
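As a concrete illustration, the per-node statistics of this component can be accumulated in a single pass over the task logs. The sketch below assumes each completed task is available as a (node, TP1, TP2, TP3) record; all names are illustrative:

    from collections import defaultdict

    def summarize_task_logs(task_logs):
        # task_logs: iterable of (node, tp1, tp2, tp3) records of completed tasks.
        sums = defaultdict(lambda: [0.0, 0.0, 0.0])   # per-node sums of the three periods
        counts = defaultdict(int)                     # per-node completed-task counts
        grand_total, n_tasks = 0.0, 0
        for node, tp1, tp2, tp3 in task_logs:
            s = sums[node]
            s[0] += tp1; s[1] += tp2; s[2] += tp3
            counts[node] += 1
            grand_total += tp1 + tp2 + tp3
            n_tasks += 1
        # Per-node mean period times TP1_n, TP2_n, TP3_n
        mean_tp = {n: [v / counts[n] for v in s] for n, s in sums.items()}
        task_time = {n: sum(tp) for n, tp in mean_tp.items()}   # Eq. (1)
        mt_n = {n: sum(s) for n, s in sums.items()}             # MT_n: total time on node n
        mt = grand_total / n_tasks                              # MT: average total time per task
        return mean_tp, task_time, mt_n, mt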

2) DYNAMIC WEIGHTED STRAGGLER DECISION CRITERION
According to the proposed framework, the judgment of whether a task is a straggler is identified dynamically. To increase the speculation efficiency, the straggler decision criterion combines the four weighted coefficients, and the straggler decision (D_S) is obtained as their weighted sum, as in (2):

D_S = w_Q * C_Q + w_P * C_P + w_BW * C_BW + w_U * C_U        (2)

The straggler decision also depends on the residual time to complete a task, which is computed as in (3).

The job quality of service limitation coefficient (C_Q): Q_T is the required time for a task and is considered the quality of service coefficient. When the Q_T value is more than the maximum Task_time, this indicates a long time limitation for the quality of service. Therefore, there is no need to create a large number of clones, because there is no risk of intensive performance implications. In this case, the coefficient is computed as the division of the quality of service time limitation (Q_T) by the mean time (MT). When the Q_T value is less than the maximum Task_time, the coefficient is set to the minimal Task_time divided by the mean time (MT), as in Eq. (4):

C_Q = Q_T / MT,              if Q_T > max(Task_time)
C_Q = min(Task_time) / MT,   otherwise        (4)

The pseudo-code for calculating the job quality of service limitation coefficient (C_Q) is illustrated in Algorithm 2.
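A minimal sketch of this rule, following the shape of Algorithm 2 as described above; function and parameter names are illustrative:

    def qos_limitation_coefficient(q_t, task_times, mt):
        # q_t: quality-of-service time limit for the job.
        # task_times: per-node Task_time values from Eq. (1); mt: mean total time MT.
        if q_t > max(task_times):
            # long QoS limit: no risk of intensive performance implications
            return q_t / mt
        # tight QoS limit: fall back to the minimal observed task time
        return min(task_times) / mt    # Eq. (4)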

The stage proceeding behavior coefficient (C_P): Ideally, a speculation should be detected early in the job lifecycle. This saves cluster resources, which is reflected in the job completion time. As a consequence, it is vital to examine the stage proceeding behavior to obtain effective straggler detection. The proceeding behavior of a stage can be computed as the ratio between the number of completed tasks in the stage (n) and the total number of tasks in that stage (m), as in (5):

Proceeding = n / m        (5)

Then, the average proceeding (Proceeding_avg) over all stages can be computed, which indicates the present proceeding in the entire job lifecycle. After that, the stage proceeding behavior coefficient (C_P) at time t is computed as the difference between the average proceeding and the proceeding threshold (P_th), as given in (6):

C_P = Proceeding_avg − P_th        (6)

To avoid ineffective speculation, it is reasonable to raise the threshold value in response to late progress. Likewise, it is acceptable to reduce the threshold value early in the task lifecycle to motivate replica generation, because an early replica has a greater chance of surpassing the original task. In such situations, it is preferable to run these replicas instead of conserving resources on the cluster. The pseudo-code for calculating the stage proceeding behavior coefficient (C_P) is illustrated in Algorithm 3. (Algorithm 1, the overall straggler decision procedure, maintains for each node n the mappings TP1_n, TP2_n, and TP3_n between the node and its mean deserialization of task, running task, and serialization of task result periods, and adds an uncompleted task τ to the speculation queue according to the period it is currently in.)
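A minimal sketch of Eqs. (5) and (6) as described above; the names and the default threshold are illustrative:

    def proceeding_coefficient(completed_per_stage, total_per_stage, p_th=0.5):
        # Eq. (5): per-stage proceeding as completed tasks / total tasks.
        ratios = [c / t for c, t in zip(completed_per_stage, total_per_stage)]
        proceeding_avg = sum(ratios) / len(ratios)  # average over all stages
        return proceeding_avg - p_th                # Eq. (6)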

The processing bandwidth coefficient (C_BW): This coefficient measures a job's processing speed in order to identify slow tasks more quickly. The amount of data processed in a given time period is used to calculate the processing speed. The processing bandwidth (Processing_BW) of a stage can be computed as the ratio between the processed data size in the stage and the corresponding processing time. The coefficient C_BW is then obtained by comparing the current processing bandwidth with the rate threshold (Rate_th), as in (9). Similarly, the cluster utilization level coefficient (C_U) is derived from the CPU, memory, disk, and network utilization of the cluster against the thresholds cpu_th, mem_th, disk_th, and net_th, as in (14).

One of the major challenges of speculative execution is backing up tasks at appropriate nodes. Since every node's capability may vary, it is essential to have an appropriate metric to measure the performance of heterogeneous nodes. Therefore, the capability of a node can be obtained from the number of completed tasks and the total number of tasks processed, as in (15):

Capability_n = Completed_n / Processed_n        (15)

The recoverable steps of the Speculative Execution Efficiency Algorithm are as follows: for each candidate task, the backup time is obtained with Get_Bt() as in Eq. (16); tasks for which backing up offers no gain are ignored; otherwise, the profit of backup and the profit of not backup are computed, the task yielding the maximum profit is selected and assigned to the node with the highest capability as in Eq. (15), and the original task is deleted once the backup completes.

The conserved slot is considered a benefit parameter. The benefit of backing up a task is measured taking into account that another slot is assigned to the backup task, since both the original and the backup must continue to run until the task is completed, while conserving one slot is equal to the difference between the residual time and the backup time. The residual time to complete a task can be computed as referred to in (3), and the mean execution time of the three periods of all tasks is adjusted accordingly.

Otherwise, the cost of not backing up will be one slot of residual time, which is consumed with no benefit. Eqs. (17) and (18) define benefit_backup and benefit_not_backup, where α and β are the benefit and cost weights, respectively.
Similarly, Rt and Bt are the residual and backup times, respectively. When benefit_backup is greater than benefit_not_backup, the task is considered a slow task, as in Eq. (19).
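A sketch of the recovered selection logic follows. Since the exact forms of Eqs. (16), (17), and (18) are given by the paper's equations, the backup-time and profit computations are passed in as callables here; all names are illustrative:

    def node_capability(completed, processed):
        # Eq. (15): ratio of completed tasks to total tasks processed on a node.
        return completed / processed if processed else 0.0

    def select_speculative_task(candidates, capabilities, profit_backup, profit_not_backup):
        # candidates: list of (task, rt, bt) with residual time rt (Eq. 3)
        # and backup time bt (Eq. 16). capabilities: node -> capability (Eq. 15).
        best_task, best_gain = None, 0.0
        for task, rt, bt in candidates:
            if bt >= rt:
                continue  # a backup cannot finish earlier: ignore this task
            gain = profit_backup(rt, bt) - profit_not_backup(rt, bt)
            if gain > best_gain:
                best_task, best_gain = task, gain
        if best_task is None:
            return None
        # assign the selected task to the node with the highest capability
        target = max(capabilities, key=capabilities.get)
        return best_task, target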
Replacing β/α with γ yields a simplified form of this decision condition.

VI. ILLUSTRATIVE EXAMPLE
In this case study, we use equal weights set to 0.25. Also, the threshold parameters P_th, Rate_th, cpu_th, mem_th, disk_th, and net_th are set to 0.5. These settings characterize a common configuration for the two scenarios and can be customized for different purposes. Then, the coefficients of the stage proceeding behavior (C_P), the processing bandwidth (C_BW), and the cluster utilization level (C_U) can be computed as in Eqs. (6), (9), and (14). Therefore, the straggler decision (D_S) can be obtained as in Eq. (2). In this context, we can denote each task as either a straggler task (S) or a non-straggler task (N). Table 2 shows the straggler decision (D_S) for the low and high states with Q_T = 8 and 14, respectively. For the low state, the proposed criterion encourages replica creation in the early stages of the life cycle by generating smaller coefficients while still ensuring the quality of service. In the high state, the proposed criterion creates fewer replicas to avoid overloading the system, while also guaranteeing that the quality of service requirement is met.
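A worked sketch of the straggler decision for this case study, assuming Eq. (2) is the weighted sum of the four coefficients with equal weights of 0.25; the decision threshold and the example coefficient values below are illustrative, not prescribed by the paper:

    # Equal weights for the four coefficients, as in this case study.
    WEIGHTS = {"C_Q": 0.25, "C_P": 0.25, "C_BW": 0.25, "C_U": 0.25}

    def straggler_decision(coeffs, threshold=0.0):
        # coeffs: dict with keys C_Q, C_P, C_BW, C_U. Returns 'S' or 'N'.
        d_s = sum(WEIGHTS[name] * value for name, value in coeffs.items())  # Eq. (2)
        return "S" if d_s > threshold else "N"

    # Example: a task in the low state (Q_T = 8) with hypothetical coefficients.
    print(straggler_decision({"C_Q": 0.4, "C_P": -0.2, "C_BW": 0.1, "C_U": 0.3}))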

VII. PERFORMANCE EVALUATION

A. PLATFORM
The experimental big data cluster used in this work consists of four virtual machines, made up of one master and three workers. The cluster classification and configurations are demonstrated in Table 3. In total, the cluster provides 28 cores and 40 GB of memory.

The software configurations are illustrated in Table 4. Also, the block size used for the Hadoop Distributed File System (HDFS) is 128 MB.

Many Spark applications consume substantial resources. Based on numerous studies conducted on Spark, representative benchmark workloads are selected for the evaluation, as described below.

WordCount is a popular benchmark that counts the number of times each word appears in the input data file. It has two stages: stage 0 and stage 1. Stage 0 reads data from the HDFS and performs map and reduce operations. Stage 1 reads the output data of stage 0 through the shuffle and performs reduce operations. In our experiment, we use a real dataset, the enwiki dump progress [37], with sizes of 10, 20, and 30 GB.

The K-means benchmark partitions a set of data points into K clusters. It is commonly used on large data sets to automatically classify the input data points into K clusters, which makes it an appropriate candidate for parallelization. To generate the datasets, …
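For concreteness, a minimal PySpark WordCount matching the two-stage structure described above is sketched below; the input and output paths are illustrative, not the experimental setup:

    from pyspark import SparkContext

    sc = SparkContext(appName="WordCount")
    counts = (sc.textFile("hdfs:///data/enwiki")          # stage 0: read from HDFS and map
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))         # stage 1: reduce after the shuffle
    counts.saveAsTextFile("hdfs:///out/wordcount")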

The execution time is influenced by many parameters, such as the benchmark category, the input data size, and the capability of the assigned nodes. The throughput is likewise calculated as the number of jobs completed over the execution time. For the TeraSort benchmark, Figure 3 shows the performance comparisons with Spark-Default, Spark-Speculation, and Spark-ETWR at workloads of 3 GB, 6 GB, and 12 GB.
On average, the proposed framework achieves 26.8% less execution time than Spark-Default. Also, it gives a 24.3% execution time reduction against Spark-Speculation and a 14% reduction compared with Spark-ETWR. Furthermore, the results show that the cluster throughput increased by 46.3%, 42.8%, and 15% with respect to the competing methods, respectively.

Similarly, for the WordCount benchmark, as in Fig. 5, the proposed framework achieves a 28.7% reduction in the average execution time compared with Spark-Default. Additionally, it provides a 21.5% execution time reduction compared with Spark-Speculation and a 13.6% reduction compared with Spark-ETWR. Also, Figure 6 shows that the cluster throughput is improved by 30.7%, 24.2%, and 15.7% compared with the other competing methods.

For the K-means clustering algorithm, as shown in Figs. 7 and 8, similar reductions in execution time and improvements in throughput are obtained over Spark-Speculation and Spark-ETWR, respectively.