Dynamic and Static Energy Efficient Scheduling of Task Graphs on Multiprocessors: A Heuristic

For energy efficient scheduling of task graphs on multiprocessors, dynamic voltage and frequency scaling (DVFS) and duplication are two widely used techniques. DVFS is generally used to exploit execution slack by lowering the voltage and frequency of a task to decrease the dynamic energy consumption, whereas duplication decreases the schedule length and the communication energy consumption by replicating certain dependent tasks to avoid communication delays. However, when decisions on DVFS and duplication are made for a task, the static energy consumption is mostly overlooked. With chip technologies shrinking to a few nanometers, static energy consumption due to leakage current has become important. This article proposes a novel polynomial time heuristic that uses both DVFS and duplication to optimize static energy consumption along with dynamic and communication energy when scheduling task graphs on heterogeneous multiprocessors. The proposed list scheduling algorithm also balances schedule length with energy consumption by using the proposed normalized difference parameters while making scheduling decisions for a particular task. The results demonstrate that the proposed algorithm decreases the overall energy consumption with an improved or comparable schedule length as compared with other algorithms in various scenarios.


I. INTRODUCTION
The static scheduling of task graphs, or Directed Acyclic Graphs (DAGs), on heterogeneous multiprocessors is an NP-hard problem [1], [2]. This problem gained more attention in the research community with the increase in energy consumption of multiprocessors [3], [4]. However, most of the published literature deals with optimizing performance (schedule length or makespan) along with only the dynamic power consumption [4], [5]. Dynamic voltage and frequency scaling (DVFS) is a widely employed technique that reduces the dynamic power consumption by running certain tasks on low voltage-frequency pairs. Since the dynamic power consumption is proportional to the frequency and to the square of the voltage, this decreases the dynamic energy consumption; however, it increases the execution cost of the tasks and hence the schedule length. Many scheduling algorithms use DVFS only for slack reclamation, i.e., in an already known schedule, tasks with idle slots (also called slack) are run on a low voltage/frequency pair to fill the idle slot. By doing this, the energy consumption can be decreased without increasing the schedule length.
The associate editor coordinating the review of this manuscript and approving it for publication was Massimo Cafaro.
Along with the dynamic power consumption, the static power consumption has significantly increased as a consequence of Moore's law [6], [7]. Static power consumption is independent of the tasks or processes being run on the hardware. The major source of static power is the leakage current, which is caused by the reverse bias current that flows in a transistor even in the off state. The increasing chip density, with chip technology dropping below 65 nm, is one of the major reasons for the increasing leakage current. Leakage current also increases with the temperature of the device. Hence, if a processor does more computation, then, along with the dynamic power consumption, the temperature also rises, which increases the leakage current and eventually the static power consumption.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Current processor manufacturers employ dynamic power management (DPM) to keep the processors in a low-power state whenever possible to reduce the static power consumption. The Advanced Configuration and Power Interface (ACPI) standard [8], [9] defines various low power states (S1 to S4) in the standby mode. Each of these low power states saves more power than its predecessor; however, it requires more time (and energy) to bring the system back to the active state. Hence, a processor needs to be idle for a certain minimum amount of time to save static energy by going into a low power state. For every low power state, this minimum idle time is called the break-even time.
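The break-even idea can be made concrete with a small sketch. It assumes a simple linear model that is not taken from the paper: staying active-idle for a duration t costs `p_idle * t`, while sleeping costs a fixed transition energy plus `p_sleep * t`. All parameter names are hypothetical.

```python
def break_even_time(p_idle, p_sleep, e_transition, t_transition):
    """Smallest idle duration for which entering a sleep state does not cost
    more energy than staying active-idle (assumed linear model)."""
    if p_sleep >= p_idle:
        raise ValueError("the sleep state must draw less power than active-idle")
    # Energies are equal when p_idle * t == e_transition + p_sleep * t.
    energy_break_even = e_transition / (p_idle - p_sleep)
    # The slot must also be long enough to complete the transition itself.
    return max(energy_break_even, t_transition)


# A state that saves 1.5 power units but costs 3.0 energy units to enter and
# leave pays off only for idle slots of at least 2.0 time units.
print(break_even_time(p_idle=2.0, p_sleep=0.5, e_transition=3.0, t_transition=1.0))
```

Deeper states typically have a lower `p_sleep` but a larger `e_transition`, which is why their break-even times are longer.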
DVFS uses the idle slots to decrease the dynamic energy consumption whereas, in DPM, these idle slots can be optimized to decrease the static energy consumption. Hence, there is an interesting trade-off between using idle slots to run tasks on low voltages/frequencies to save dynamic energy and keeping the idle slots at least as long as the break-even time to save more static energy. Another interesting trade-off, between DVFS and duplication for optimizing computation and communication energy along with performance, has been recently explored [10], [11]. The duplication of tasks in idle slots is done to reduce the inter-task communication delay, which shortens the schedule length [12], and also to reduce the communication energy consumption. Hence, idle slots can be used in three different ways:
1) putting the processor into a sleep state to reduce the static energy consumption and also the temperature of the processor;
2) duplicating tasks to reduce the communication delay and energy consumption and also to shorten the schedule length;
3) reclaiming slack using DVFS to reduce the computation energy consumption.
This article proposes a polynomial time scheduling heuristic named DSEAS to optimize dynamic, static and communication energy consumption along with the schedule length when task graphs are scheduled on heterogeneous multiprocessors using DVFS and duplication. While deciding on the voltage/frequency pair for a task or its duplicate, DSEAS considers the static energy consumption to make a well-informed decision and hence balance the schedule length with the energy consumption. DSEAS also uses an integer linear program (ILP) named OptILP-idle to optimally use an idle slot in the partial schedule to decrease the dynamic and static energy consumption. The results show that DSEAS is able to decrease the energy consumption with a better or comparable schedule length than the state-of-the-art algorithms.
The rest of the article is organized as follows: Section II describes the related work. The task and system model is presented in section III. The DSEAS algorithm is presented in detail in section IV with OptILP-idle in section IV-A. The experimental results are discussed in section V followed by conclusions and future directions in section VI.

II. RELATED WORK
Static scheduling of DAGs on multiprocessors with only the makespan objective is an NP-hard problem [1], [2]. To deal with this hardness, researchers have proposed optimal ILPs [13]-[15], meta-heuristics [16]-[19] and low-complexity heuristics [20]-[22]. Heterogeneous Earliest Finish Time (HEFT) [20] is one of the first and most popular polynomial time list scheduling algorithms; it sorts the tasks of a DAG into a list and then greedily schedules tasks from the list one at a time to reduce the finish time. The HEFT algorithm was later extended with duplication [21] to further shorten the makespan. Over the last decade, the focus in scheduling task graphs has shifted from performance (reducing makespan) to optimizing energy consumption and controlling the thermal properties of the system. To reduce the dynamic energy consumption, Dynamic Voltage/Frequency Scaling (DVFS) has been widely utilized [3], [23]-[26]. A detailed survey on energy aware scheduling is presented in [3]. Kappiah et al. [23] effectively reduced the processor energy consumption by applying slack reclamation to MPI programs. They experimentally evaluated the approach without modifying the programs. Son et al. [25] simultaneously reduced the voltage on the processors and the network links to utilize the available slack. Rizvandi et al. [24] proposed the concept of scheduling certain tasks with integrated max-min voltages to effectively reclaim the slack. Their results were found to be close to the optimal solution.
There are a few efforts to simultaneously reduce the makespan as well as the energy consumption. Lee and Zomaya [26] proposed a polynomial time Energy Conscious Scheduling (ECS) algorithm using DVFS, which uses a parameter that simultaneously considers the energy and the schedule length to make task scheduling decisions. However, ECS has been observed to sacrifice the makespan to reduce the power consumption [11]. Duplication has also been used to decrease the energy consumption [27], [28]. Zong et al. have proposed an algorithm that duplicates a task only if it keeps the increase in energy and schedule length below a threshold level. An extension of HLD, named Energy Aware Minimizing Duplication (HLD-EAMD), was proposed in [28]. HLD-EAMD works in two steps. First, it runs the HLD algorithm to decrease the makespan with duplication and then it removes the redundant duplications to decrease the energy consumption. However, neither of these algorithms uses DVFS, and both perform poorly in the case of low communication costs. Scheduling with duplication has also been implemented using a Mixed Integer Linear Program (MILP). Bender [13] described an MILP to reduce the makespan with duplication. Tosun and Suleyman [29] used DVFS and duplication for scheduling independent tasks using an MIP. In [10], the authors proposed an MILP formulation using DVFS and duplication to optimize both of these objectives together. The formulation has proved to be very effective, but only for smaller instances of the scheduling problem. A normalization based heuristic with duplication and DVFS is proposed in [11], but it does not optimize the static energy consumption.
None of the above works optimizes the static energy consumption. Niu and Quan [30] proposed a hard real-time algorithm to reduce dynamic and static energy consumption. The main idea of their paper is to shift scheduled tasks in such a way that small idle slots are merged into bigger ones, with the constraint that no task misses its deadline. They used DVFS to finish tasks earlier so that slightly bigger idle slots can be obtained. However, their solution is only for independent tasks. A similar approach to combine execution slacks is proposed in [31]. The authors used multiple sleep states and also proposed an offline method to calculate the break-even time for each sleep state. Chen and Thiele [32] also looked at a similar problem in real-time systems with independent tasks. Their approach is to find the optimal frequency at which to run a task such that the processors can be pushed into a dormant state. In [33], the authors proposed an algorithm that shuts down the under-utilized processors and limits the number of processors, along with DVFS, to reduce the static energy. Ma et al. considered dependent tasks and used clustering based scheduling along with duplication and dynamic power management techniques. They assume that DVFS is not available in the cluster system. An optimal ILP based solution for scheduling hard real-time and mixed-criticality tasks on multiprocessors to optimize static energy consumption is proposed in [34]. That work uses multiple sleep states, as does ours, but it does not use DVFS and the algorithm is designed for independent tasks only. More recently, Kaur et al. [35] proposed a duplication based approach combined with dynamic power management methods to reduce dynamic as well as static power consumption for dependent tasks. However, their work also does not utilize DVFS.
In this paper, we propose a DVFS and duplication based polynomial time heuristic that focuses on static power consumption along with dynamic power consumption and balances power consumption with performance, i.e., with decreasing the makespan.

III. TASK AND SYSTEM MODEL
This section defines the task and system model used in this study. The system model is similar to the one in [11], [26], [34]. Table 1 describes all the notation used in this work.

A. TASK, PROCESSOR AND NETWORK MODEL
In the static version of the scheduling problem, the information about the task graph, which is represented as a directed acyclic graph (DAG), is available in advance. The application DAG (F) is required to be scheduled on a set P of heterogeneous processing elements (PEs). There are N nodes (represented as a set F) in the DAG, which represent N non-preemptive tasks. A set $G \subseteq F \times F$ of directed edges between the nodes in the DAG defines the dependencies among tasks. An edge between two tasks i and j means that task j cannot begin its execution until task i has finished and communicated the required data to task j. Since we employ duplication, there can be multiple copies of a task on different processors. In case copies of tasks i and j are scheduled on different processors, the communication cost of the edge gives the time required to transfer data from the i-th task to the j-th task. If both tasks are scheduled on the same processor, they are assumed to exchange data through shared memory, which does not incur any cost. $T[i, m] \in \mathbb{R}^{F \times P}$ represents the execution cost of the i-th task on the m-th processor at the highest voltage ($V^m_{max}$) and frequency pair, where $V^m_{max}$ represents the maximum voltage of the m-th processor.
Each processor in this model can run on multiple voltage and frequency pairs, where $V^m_k$ represents the k-th voltage/frequency pair on the m-th processor. As the energy consumed while transitioning between voltage pairs is small, we do not consider it in our study. Each of the processors, when not in use, can be in one of the idle states s, where $s \in S[m]$. The set S[m] defines the idle states on a processor m. We consider four different idle states, as used in [35]. The state S[m, 0] represents the active-idle state on a processor m. In the active-idle state, a processor is idle, i.e., it is not executing any task, but all of its components are in the active (power-on) state. Generally, a processor is in the active-idle state when it is idle for a short duration of time while waiting for some communication to finish. If a processor is idle for a relatively long duration of time, then some parts of the processor can be powered off to bring the processor to a state in which the static energy consumption is reduced. These states are called passive-idle states. We use three passive-idle states, viz. standby, dormant, and shutdown, as extensively used in the literature [35]. We describe the amount of static energy consumed during these states in section III-B.
For inter-process communication, we assume a contention-free, fully connected topology. Each of the processors has a network interface card (NIC), which is used to connect the processor to the network. Here, we assume homogeneous network cards throughout the system, where ENB and ENI are the network interface busy and idle energy consumption, respectively. It has been noticed that a switch consumes almost the same energy in busy and idle modes [36]. Hence, a unified parameter ES is used to describe the energy consumption of a switch in both modes.

B. ENERGY MODEL
The total energy consumption for scheduling task graphs on heterogeneous multiprocessors is defined in equation E1 in table 2, where $en^P_{dyn}$ and $en^P_{sta}$ stand for the dynamic and static energy consumption of the processors, respectively. We explain how to evaluate each of these energy components in the following sections.

1) DYNAMIC ENERGY
The dynamic energy consumption ($en^P_{dyn}$) can be evaluated by calculating the amount of time the processors (or cores) are busy executing tasks. The part * in equation E1 gives the energy consumption of a task i on processor m.
In our heuristic, a task is allowed to be executed using integrated voltage-frequency pairs, i.e., some fraction on one voltage and the remainder on another. It is reported in the literature [24] that the slack time can be optimally utilized when a task is allowed to be executed on integrated voltages. In part * of equation E1, $d_k[i, m]$ represents the fraction of time task i executes on voltage-frequency pair k and $E_b[m, k]$ is the unit busy energy consumption of processor m on $V^m_k$.
The $\beta_m$ is evaluated from the fact that the capacitive (dynamic) power consumption of a processor is defined as $E_b = A C V^2 f$, where A is the number of switches per clock cycle, C is the total capacitance load, and V and f are the voltage and frequency.
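To illustrate execution on integrated voltage-frequency pairs, the sketch below splits a task between a faster and a slower pair so that it exactly fills a given slot, and evaluates the dynamic energy with the $A C V^2 f$ model. It is only an illustration under stated assumptions: execution time is assumed to scale inversely with frequency, the product A·C is lumped into the hypothetical parameter `ac`, and all names are invented for this example.

```python
def split_between_pairs(cost_at_fmax, slot, f_max, hi, lo, ac=1.0):
    """Split one task between two (voltage, frequency) pairs so that it
    finishes exactly at the end of the slot.

    Returns (fraction of the work done on the faster pair, dynamic energy).
    Assumes execution time scales as f_max / f and power is ac * V^2 * f.
    """
    v_hi, f_hi = hi
    v_lo, f_lo = lo
    t_hi = cost_at_fmax * f_max / f_hi  # duration if run entirely on `hi`
    t_lo = cost_at_fmax * f_max / f_lo  # duration if run entirely on `lo`
    if not (t_hi <= slot <= t_lo):
        raise ValueError("slot must lie between the two pure execution times")
    # Solve x * t_hi + (1 - x) * t_lo == slot for the work fraction x.
    x = 1.0 if t_lo == t_hi else (t_lo - slot) / (t_lo - t_hi)
    energy = ac * (x * t_hi * v_hi**2 * f_hi + (1 - x) * t_lo * v_lo**2 * f_lo)
    return x, energy


# A task of cost 10 at full speed, placed in a slot of length 15.
fraction, energy = split_between_pairs(10.0, 15.0, 1.0, hi=(1.0, 1.0), lo=(0.5, 0.5))
print(fraction, energy)
```

In this example, half of the work runs on each pair and the dynamic energy drops from 10.0 (everything on the faster pair) to 6.25 without extending the slot.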

2) STATIC ENERGY
A processor consumes static energy while it is busy executing tasks as well as when it is idle. However, when busy (or active), the static energy consumption depends on the thermal properties of the system. Since this work does not focus on thermal modeling, we assume a fixed unit active-static energy consumption $E^a_s[m]$ for a processor m. The total active-static energy consumption of all the processors is given by equation E2, which has the same form as equation E1 except that the execution cost of the tasks is multiplied by the unit active-static energy consumption ($E^a_s[m]$). The crucial opportunity for saving static energy arises when a processor is idle. Depending on the amount of time a processor is idle, i.e., the size of the idle slot, the processor can be put into one of the idle states discussed in section III. The unit idle-static energy consumption of a processor is highest in the active-idle state ($E^i_s[m, \text{active-idle}]$) and decreases as the processor is sent to a deeper idle state, so deeper idle states reduce the static energy consumption.
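The ordering implied by this paragraph can be written compactly. The following is a reading of the text rather than a formula taken from the paper, and it assumes that the four idle states named in section III-A are ordered from shallowest to deepest as active-idle, standby, dormant and shutdown:

```latex
E^{i}_{s}[m,\ \text{active-idle}] \;>\; E^{i}_{s}[m,\ \text{standby}]
\;>\; E^{i}_{s}[m,\ \text{dormant}] \;>\; E^{i}_{s}[m,\ \text{shutdown}]
```

A deeper state is only worthwhile when the idle slot is at least as long as that state's break-even time.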

3) COMMUNICATION ENERGY
The communication energy is mostly consumed by the network interface cards ($en^C_N$) and the switches ($en^C_S$). Hence, the total communication energy consumption ($en^C$) is calculated as $en^C = en^C_N + en^C_S$, as used in [36]. To simplify the calculation of $en^C_N$, we first assume that all of the network cards consume idle energy for the entire makespan ($|P| \cdot ENI \cdot f_{max}$), where ENI is the unit idle energy consumption of the network cards. Then, after calculating the total communication time in part * of equation E4, we add twice the busy network energy consumption for this amount of time, to account for the sending and receiving processors, and subtract the same amount of idle energy consumption that was added earlier. Equation E5 is used to calculate the energy consumption of the switches.
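The bookkeeping for the network interface cards can be sketched as follows. The function follows the verbal description above; it assumes that the term multiplied by |P| and ENI is the makespan, and the function and parameter names are hypothetical.

```python
def nic_energy(num_processors, makespan, total_comm_time, enb, eni):
    """Energy of all network interface cards for one schedule.

    enb / eni are the unit busy / idle energy consumption of a card.
    """
    # Step 1: charge every card idle energy for the whole makespan.
    baseline = num_processors * eni * makespan
    # Step 2: each transfer keeps two cards busy (sender and receiver), so
    # add busy energy twice and remove the idle energy already charged.
    correction = 2 * total_comm_time * (enb - eni)
    return baseline + correction


# 4 processors, makespan 100, 10 time units of communication in total.
print(nic_energy(num_processors=4, makespan=100, total_comm_time=10, enb=3, eni=1))
```

The total communication energy then adds the switch energy from equation E5, which is not reproduced here.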
In equation E5, ES is the unit energy consumption of the switches. Depending on the number of processors, the number of switches ($N_{switch}$) is calculated as follows:

$$N_{switch} = \begin{cases} 1, & \text{if } |P| \le N_{port} \\ \dfrac{|P|}{N_{port}} + 1, & \text{otherwise} \end{cases}$$

LISTING 1. Algorithm for DSEAS (input: F; output: schedule).

IV. ALGORITHM: DSEAS
The proposed algorithm uses a greedy approach, as used in [11], [26]. We name our algorithm the Dynamic and Static Energy Aware Scheduler (DSEAS). DSEAS differs significantly from other state-of-the-art algorithms since it explores the interplay of DVFS, duplication and DPM to optimize performance and dynamic, static and communication energy consumption at the same time. The main objective is to generate schedules which balance energy consumption with makespan. Algorithm 1 describes the pseudo-code of our heuristic. The algorithm begins by topologically sorting the nodes of the task graph by decreasing values of the b-level. The b-level (or bottom-level) of each node i is calculated from $\mu_e$, the average execution cost of task i on $V^m_k$ over all $m \in P$, and from succ(i), the set of all successors of task i. The b-level measures the longest path from a task to a leaf task and is widely used in the literature for sorting the tasks of a graph into a list for list scheduling heuristics. The tasks are selected for allocation and scheduling in order of decreasing b-level. Initially, for every task, we select a random processor $m_0$ and the minimum voltage level on $m_0$ (i.e., $V^{m_0}_{min}$). Later, we iterate through every processor (step-5) and all of the voltage/frequency pairs (step-7) to see which processor and voltage level is more appropriate for a particular task. To decide this, we make use of a parameter named the Improvement Factor (IF). Equation P1 in table 3 is used to calculate IF. The parameter IF is defined as the improvement in makespan and energy consumption when a new processor and voltage-frequency pair is selected for a task as compared to the current mapping. If a task i is currently allocated to a processor m with voltage $V^m_k$ and we want to see whether processor $m'$ and voltage level $V^{m'}_{k'}$ should be preferred, then IF should be positive (steps 7-9).
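The b-level ordering can be sketched in a few lines. The paper's exact b-level equations are not reproduced in this text, so the sketch uses the common definition (average execution cost plus the most expensive path through a successor, including the communication cost of the connecting edge); the inclusion of communication costs and all names are assumptions.

```python
from functools import lru_cache


def b_levels(avg_cost, comm, succ):
    """avg_cost: task -> mean execution cost; comm: (i, j) -> edge cost;
    succ: task -> list of successors. Returns task -> b-level."""

    @lru_cache(maxsize=None)
    def bl(i):
        children = succ.get(i, [])
        if not children:
            return avg_cost[i]  # a leaf task contributes only its own cost
        return avg_cost[i] + max(comm[(i, j)] + bl(j) for j in children)

    return {i: bl(i) for i in avg_cost}


def priority_list(avg_cost, comm, succ):
    """Tasks sorted by decreasing b-level (a valid topological order when
    all execution costs are positive)."""
    levels = b_levels(avg_cost, comm, succ)
    return sorted(levels, key=lambda i: -levels[i])


# A diamond-shaped graph: a -> {b, c} -> d.
cost = {"a": 2, "b": 3, "c": 1, "d": 2}
comm = {("a", "b"): 1, ("a", "c"): 4, ("b", "d"): 1, ("c", "d"): 1}
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
print(priority_list(cost, comm, succ))
```

The recursive formulation is adequate for small graphs; a very deep graph would call for an iterative traversal in reverse topological order.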
The distinct feature of IF is that it makes use of the affective energy consumption ($e_a[i, m, V^m_k]$), which considers dynamic as well as static energy consumption while scheduling a task with or without duplication. To the best of the authors' knowledge, this kind of parameter is not used by any known greedy algorithm.
The IF in makespan is evaluated as the normalized difference in the earliest finish time (EFT) of task i on $(m', V^{m'}_{k'})$ and $(m, V^m_k)$. Similarly, the IF in the energy consumption is evaluated by taking into account the affective energy consumption ($e_a[i, m, V^m_k]$) of task i on $(m', V^{m'}_{k'})$ and $(m, V^m_k)$. Importantly, we take into account the affective energy consumption rather than only the busy energy consumption used in [11], [26]. The idea behind using the affective energy consumption is that if a task does not execute in a particular slot (i.e., there is no busy energy consumption), there is still some idle static energy consumption. Hence, we trade off dynamic and static energy consumption while scheduling tasks. The affective energy consumption (equation P4) is calculated by subtracting the idle energy consumption of task i. The $EFT[i, m, V^m_k]$ is the earliest time at which task i can finish while executing on processor m at voltage $V^m_k$. This also depends on the data arrival time from the parents of task i. In case IF is negative (steps 12-23), we try to duplicate the most important immediate parent (MIIP) of task i, i.e., the parent which delays its execution the most on processor m. This can reduce the EFT but can also increase the energy consumption; hence, we also consider the duplicated parent while calculating the affective energy consumption, as in equation P5. In case duplication leads to IF > 0, we schedule this task with duplication (step 26), otherwise without duplication (step 28). Once all of the tasks are scheduled, we remove any redundant duplication in step 27 to create more idle slots in the schedule. Finally, for each idle slot, we optimize the dynamic and static energy consumption by running an optimal ILP as described in section IV-A. The complexity of DSEAS is $O(|F| \log |F| + (|F| + |E|) \cdot |P| \cdot V_{max}^2)$, where $V_{max}$ is the maximum number of voltage levels on a processor.
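The quantities in this paragraph can be illustrated with a short sketch. Equations P1, P4 and P5 are not reproduced in this text, so the normalization used in `improvement_factor` below (dividing each difference by the larger of the two values) is purely an assumption for illustration and not the paper's definition; the data-arrival and MIIP logic follows the verbal description. All names are hypothetical.

```python
def arrival_times(task, proc, parents, finish, placed_on, comm):
    """Time at which each parent's data is available on `proc`.
    Transfers between tasks on the same processor cost nothing."""
    return {
        p: finish[p] + (0.0 if placed_on[p] == proc else comm[(p, task)])
        for p in parents.get(task, [])
    }


def miip(task, proc, parents, finish, placed_on, comm):
    """Most important immediate parent: the one whose data arrives last."""
    arrivals = arrival_times(task, proc, parents, finish, placed_on, comm)
    return max(arrivals, key=arrivals.get) if arrivals else None


def earliest_finish_time(task, proc, exec_time, proc_free,
                         parents, finish, placed_on, comm):
    """EFT of `task` on `proc` for a given execution time (which depends on
    the chosen voltage-frequency pair)."""
    arrivals = arrival_times(task, proc, parents, finish, placed_on, comm)
    start = max([proc_free, *arrivals.values()])
    return start + exec_time


def normalized_difference(current, candidate):
    """Assumed normalization: positive when the candidate is smaller."""
    denom = max(current, candidate)
    return 0.0 if denom == 0 else (current - candidate) / denom


def improvement_factor(eft_cur, eft_new, energy_cur, energy_new):
    """Assumed form of IF: sum of the two normalized differences."""
    return (normalized_difference(eft_cur, eft_new)
            + normalized_difference(energy_cur, energy_new))


parents = {"t": ["a", "b"]}
finish = {"a": 5.0, "b": 4.0}
placed_on = {"a": 0, "b": 1}
comm = {("a", "t"): 3.0, ("b", "t"): 2.0}
print(miip("t", 0, parents, finish, placed_on, comm))
print(earliest_finish_time("t", 0, 4.0, 5.0, parents, finish, placed_on, comm))
```

On processor 0 the data from parent b arrives last, so b is the duplication candidate; duplicating it on processor 0 would remove the transfer delay at the cost of extra execution energy, which is the trade-off that IF is meant to capture.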
The time complexity of DSEAS is a factor of $V_{max}$ higher than that of [26] and [11] because we also decide on a voltage level for the duplicated task. This helps us achieve better results without increasing the time complexity much. For large task graphs, a restricted number of voltage-frequency pairs can be used [15] to reduce the time complexity.

A. OPTILP-IDLE
As highlighted earlier, an idle slot can be used either to reduce the dynamic energy consumption (by running tasks on low voltage/frequency pairs) or to decrease the static energy consumption (by putting the processor into a deep idle state). We propose an integer linear program (ILP) named OptILP-idle for the optimized use of the idle slots. Figure 1 gives the basic idea of the problem which we solve using the ILP. Every idle slot other than one at the beginning of a processor's schedule is preceded by a scheduled task (e.g., task i). We redefine the size of the idle slot here: it now begins at the start of this task i and finishes at the end of the actual idle slot, as shown in figure 1. The problem Task-in-Idle-Slot is described as follows:
Problem 1: Task-in-Idle-Slot: Given an idle slot on processor m with size $len_{idle\_slot}$ and a task i to be scheduled from the beginning of the slot, minimize the total dynamic and static energy consumption such that the final schedule remains feasible.
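For a single slot, the trade-off in Problem 1 can be illustrated by plain enumeration. The sketch below is not the OptILP-idle formulation: it tries every voltage-frequency pair and idle state, keeps the combinations in which the task fits in the slot and the remaining idle time reaches the state's break-even time, and returns the cheapest one. It ignores integrated voltage pairs and transition energy, assumes execution time scales as `f_max / f` and dynamic power as V²f, and uses hypothetical names throughout.

```python
def best_slot_plan(cost_at_fmax, slot, f_max, pairs, idle_states, active_static):
    """pairs: list of (voltage, frequency).
    idle_states: list of (name, unit idle-static energy, break-even time).
    active_static: unit active-static energy of the processor.
    Returns (total energy, chosen pair, chosen idle state) or None."""
    best = None
    for v, f in pairs:
        run = cost_at_fmax * f_max / f
        if run > slot:
            continue  # infeasible: the task would overrun the slot
        remaining = slot - run
        dynamic = run * v**2 * f
        static_active = run * active_static
        for name, unit_idle, break_even in idle_states:
            if remaining < break_even:
                continue  # slot remainder too short for this idle state
            total = dynamic + static_active + remaining * unit_idle
            if best is None or total < best[0]:
                best = (total, (v, f), name)
    return best


plan = best_slot_plan(
    cost_at_fmax=10.0, slot=30.0, f_max=1.0,
    pairs=[(1.0, 1.0), (0.5, 0.5)],
    idle_states=[("active-idle", 0.4, 0.0), ("standby", 0.1, 15.0)],
    active_static=0.2,
)
print(plan)
```

With these made-up numbers, running at full speed and then entering standby costs about 14 units, whereas slowing the task down and staying active-idle costs about 10.5, so neither technique is always the better use of a slot.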
The objective of OptILP-idle is to minimize the sum of $en^i_{dyn}$, $en^i_{sta-a}$ and $en^i_{sta-i}$, where $en^i_{dyn}$ and $en^i_{sta-a}$ are the dynamic and active static energy consumption of task i on processor m, and $en^i_{sta-i}$ represents the idle static energy consumption of the idle slot that follows task i on processor m. Table 4

V. EXPERIMENTAL RESULTS
In this section, we compare the DSEAS algorithm with three other state-of-the-art algorithms, viz. NormEAS [11], C-SEED [35] and ECS+idle [26]. Table 5 briefly compares these algorithms and their properties. Table 6 provides details of the parameters used to generate the widely used task graphs: LU Decomposition (LUD) [16], Fast Fourier Transform (FFT) and Random task graphs [4]. Approximately ten thousand graphs have been generated by varying the graph parameters listed in table 6 using the Task Graph Generator tool [37]. The parameters are set the same as in [11], [26], [35]. Other than these graphs, we also use three application task graphs, Robot Control (RC), Sparse Matrix Solver (SMS) and SPEC fpppp (SPECF), from the Standard Task Graph Set (STG) [38]. The STG set defines graphs only with execution costs; the communication costs are generated to maintain the various communication-to-computation ratio (CCR) values shown in table 6.
All the algorithms have been implemented in C++. We have used the CPLEX ILP Solver [39] to solve OptILP-idle, which is solved to optimality in only a few seconds. Figure 2(a) [35] shows the two processor types we use in this study, with the voltage pairs for both processors in 2(b) [11], [26]. It is important to mention that the proposed algorithm does not depend on a particular processor architecture; these two processor types are selected only because the technical specifications (specifically the idle-state parameters) of the processors are available from research papers. We take an equal number of both types of processors while varying the number of processors as described in table 6.
For comparison, we have used two performance metrics, SLR (Schedule Length Ratio) and ECR (Energy Consumption Ratio) [26], as used in other relevant studies. The SLR is defined in terms of CP, the set of tasks on the critical path of the application task graph. The ECR definition additionally uses $en^c_{cp}$, the communication energy for the critical path (CP). The lower the values of SLR and ECR, the better the algorithm has performed. Next, we compare the algorithms by varying the CCR, the number of tasks and the number of processors, and then summarize the results for the various graph types.
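As a point of reference, the SLR is commonly computed in the list scheduling literature by dividing the makespan by a lower bound built from the critical path: the sum, over the critical-path tasks, of each task's minimum execution cost across processors. The paper's exact formulas are not reproduced in this text, so the sketch below follows that common definition as an assumption, and the names are hypothetical. The ECR is not sketched because its denominator also depends on $en^c_{cp}$ and other terms that are not given here.

```python
def schedule_length_ratio(makespan, critical_path, exec_cost):
    """exec_cost: task -> list of execution costs, one per processor."""
    lower_bound = sum(min(exec_cost[i]) for i in critical_path)
    return makespan / lower_bound


# Two critical-path tasks with minimum costs 4 and 6: lower bound 10.
print(schedule_length_ratio(30.0, ["a", "b"], {"a": [4, 6], "b": [8, 6]}))
```

An SLR close to 1 means the schedule is close to the critical-path lower bound.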
A. IMPACT OF CCR
Figure 3 shows the impact of varying CCR on SLR (a) and ECR (b). For low values of CCR (<1), i.e., when the computation cost is higher than the communication cost, C-SEED, which is based only on duplication, suffers slightly in energy consumption because it does not use DVFS; with limited duplication, however, it is able to reduce the SLR. Similarly, ECS+idle sacrifices SLR by running tasks on low voltage/frequency pairs to reduce the dynamic energy consumption and hence achieves a lower ECR than C-SEED. NormEAS does better than the two by using both duplication and DVFS and finding a good balance between SLR and ECR. However, NormEAS does not do anything to reduce the static energy consumption. Especially when the number of tasks is high, there are still idle slots even for CCR<1, and DSEAS uses them to reduce the static energy consumption along with the dynamic energy consumption by using DVFS. The DSEAS algorithm also employs selective duplication to achieve a lower SLR than the other algorithms and hence reduces the communication energy consumption as well. Overall, DSEAS performs the best among the compared algorithms.
With increasing values of CCR (≥1), communication begins to dominate, and hence the SLR and ECR increase for ECS+idle, which uses neither duplication nor any method to reduce the static and communication energy consumption. Since computation is lower than communication, the impact of using DVFS to reduce only the dynamic energy consumption is negligible. In fact, using only duplication and dynamic power management (DPM) methods to reduce static energy consumption works in favour of C-SEED. With duplication, C-SEED reduces the communication and hence reduces the SLR and the communication energy consumption. Duplication increases the dynamic energy consumption; however, the savings in communication energy outweigh this increase. Another interesting result is that the ECR of C-SEED is lower than that of NormEAS for CCR ≥ 2. This is attributed to the DPM techniques used by C-SEED to put the processors into deep idle states when the idle slots are big enough. With increasing CCR, the idle slots grow, and the static energy saved by C-SEED outweighs the dynamic and communication energy that NormEAS saves by employing DVFS and duplication.
Interestingly, DSEAS is able to optimize the use of DVFS, duplication and deep idle states to achieve a significantly lower ECR than C-SEED. The DSEAS algorithm sacrifices a little SLR (Figure 3(a)) for higher values of CCR but reduces the dynamic energy consumption far more than C-SEED. Also, the decision of DSEAS on whether or not to duplicate a task depends on both dynamic and static energy factors (refer to Table 5, equation P6), which helps it to perform selective and optimized duplication and hence avoid blindly adding redundant duplications. This is also a distinctive feature of DSEAS as compared to NormEAS. In contrast, NormEAS first allows redundant duplications and, once all the tasks are scheduled, looks for redundant duplications to remove. We argue that this is not a good technique because redundant duplications fill up idle slots which could be used either to schedule other tasks or to reduce the static energy consumption, and DSEAS uses the idle slots in exactly this way to achieve a lower ECR than NormEAS.

B. IMPACT OF NUMBER OF TASKS
Another interesting way to look at the results is to compare the algorithms on the basis of an increasing number of tasks in an application DAG. Figure 4(a,b) shows the impact of the number of tasks (or nodes) on SLR and ECR, respectively. All algorithms except ECS+idle, i.e., all duplication based algorithms, achieve a similar SLR. This effect is attributed to duplicating tasks to reduce the schedule length, especially when either the communication cost or the dependency among tasks increases. Only for a high number of tasks is C-SEED able to improve the SLR by a small fraction over NormEAS and DSEAS. This is because NormEAS and DSEAS use selective duplication and DVFS techniques to balance SLR with ECR, unlike C-SEED. In the case of NormEAS, this can be seen for medium sized graphs (<200), where NormEAS is better than C-SEED. However, as the number of nodes in the graph increases (≥200), C-SEED achieves a lower ECR than NormEAS. This is because, as the number of tasks increases, the dependency among tasks also increases, and the delay in executing these tasks enlarges the idle slots. The increasing size of the idle slots gives both C-SEED and DSEAS a chance to reduce the static energy consumption by putting processors into deep idle states.
The DSEAS algorithm is, however, better than both C-SEED and NormEAS for all graph sizes when comparing ECR. Compared to NormEAS, DSEAS makes duplication decisions by considering the static energy consumption along with the dynamic energy consumption of the duplicated task. This step helps DSEAS to prevent redundant duplications and make better duplication choices. Secondly, DSEAS optimizes the static energy consumption: every idle slot is optimally used to decrease dynamic as well as static energy consumption, as explained in section IV-A. The downside of C-SEED is that it does not use DVFS, which DSEAS uses efficiently to decrease ECR.

C. IMPACT OF NUMBER OF PROCESSORS
Figure 5 shows the impact of an increasing number of processors on SLR and ECR. With an increase in the number of processors, the SLR decreases for all of the algorithms; however, increasing the number of processors from 16 to 64 does not improve the SLR further, because 16 processors seem sufficient for relatively small or medium sized graphs. The C-SEED algorithm achieves a slightly lower SLR than DSEAS because of increased duplications, which is also reflected in C-SEED having a higher ECR than DSEAS. Also, to save more dynamic energy, DSEAS uses DVFS, which slightly affects its SLR as compared to C-SEED. However, with an increasing number of processors, the possibility of having (large) idle slots increases, which gives DSEAS and C-SEED an opportunity to save static energy. As can be seen in the ECR plots, from 16 to 64 processors C-SEED achieves a lower ECR than NormEAS. However, as seen in the other plots, OptILP-idle helps DSEAS to decrease the static energy further, so that the ECR of DSEAS is significantly better than that of NormEAS. ECS+idle, which does not use duplication, does relatively well compared to C-SEED for smaller numbers of processors by sacrificing SLR, but it is clearly not a good choice for achieving a balance of SLR and ECR.
In contrast, DSEAS, with an optimized mix of DVFS, duplication and dynamic power management techniques, is able to achieve a good balance of SLR and ECR.

D. SUMMARY OF RESULTS
Tables 7 and 8 summarize the percentage improvement of DSEAS in ECR and SLR over the other algorithms for various graph types. The impact of optimizing both dynamic and static energy consumption at the same time can be easily seen in the 14.88% and 13.53% improvement achieved by DSEAS over NormEAS and C-SEED (which uses DPM for static energy consumption), respectively. NormEAS reduces the dynamic energy consumption for computation dominated task graphs and small graph sizes with fewer dependencies as compared to C-SEED, whereas C-SEED reduces the static energy consumption for high CCR and large graph sizes. Overall, however, these two algorithms are comparable in ECR. DSEAS focuses on both dynamic and static energy consumption and performs significantly better than these algorithms in reducing the energy consumption. For SLR, C-SEED does more duplication and achieves a makespan that is on average 1.98% shorter than that of DSEAS. On the other hand, DSEAS performs significantly better than NormEAS and ECS+idle, which also try to balance energy consumption with schedule length. The results clearly show that it is important to focus on both dynamic and static energy consumption, which is exactly what DSEAS does.

VI. CONCLUSION
This paper presents a polynomial time heuristic, DSEAS, for scheduling precedence-constrained tasks on heterogeneous multiprocessors with a focus on dynamic, static and communication energy consumption along with the makespan. The DSEAS algorithm optimizes the use of DVFS (for reducing dynamic energy consumption), duplication (for reducing schedule length and communication energy) and DPM (for reducing static energy consumption). The results show that it is important to focus on all three kinds of energy consumption along with the makespan. The proposed algorithm is able to generate balanced schedules whose lengths are comparable or better and whose total energy consumption is lower than that of the state-of-the-art algorithms in our experiments. In future, DSEAS can be made temperature aware as well, to account for the impact of energy consumption on the temperature of the system and vice versa.