Energy-Aware Task Scheduling on Heterogeneous Computing Systems With Time Constraint

As a technique to help achieve high performance in parallel and distributed heterogeneous computing systems, task scheduling has attracted considerable interest. In this paper, we propose an effective Cuckoo Search algorithm based on Gaussian random walk and Adaptive discovery probability, combined with a Cost-to-time ratio Modification strategy (GACSM), to address task scheduling on heterogeneous multiprocessor systems using Dynamic Voltage and Frequency Scaling (DVFS). First, to overcome the poor exploitation performance of the cuckoo search algorithm, we use chaos variables to initialize the population and maintain its diversity, a Gaussian random walk strategy to balance the exploration and exploitation capabilities of the algorithm, and an adaptive discovery probability strategy to further improve population diversity. Then, we apply the improved Cuckoo Search (CS) algorithm to assign tasks to resources, and a widely used downward rank heuristic to find the corresponding scheduling sequence. Finally, we apply a cost-to-time ratio improvement strategy to further improve the performance of the improved CS algorithm. Extensive experiments are conducted to evaluate the effectiveness and efficiency of our method. The results validate our approach and show its superiority in comparison with state-of-the-art methods.


I. INTRODUCTION
Modern High Performance Computing (HPC) systems, such as Tianhe-2 [1] and Sunway TaihuLight [2], typically consist of heterogeneous computing components interconnected by a high-speed network. Such systems are expected to be used for fast processing of computationally intensive applications with different computing needs. These applications often have certain time constraints. Because high energy consumption is a bottleneck for the deployment of HPC systems, a major research challenge for heterogeneous HPC systems is how to provide services to applications in a way that minimizes energy consumption while satisfying the applications' time constraints.
Due to the importance of energy consumption, various techniques have been developed, such as DVFS, consolidation, virtualization and duplication [3], [4]. Among them, DVFS has been shown to be a very promising
technique, and has been widely used in energy-aware scheduling to make processors energy-efficient [3], [5]-[8]. DVFS reduces energy consumption by scaling down the supply voltage/frequency of processors [9]. When a real-time application executes on a heterogeneous multiprocessor system with the DVFS technique, scheduling involves three phases, namely the task prioritizing, processor selection and power supplying phases [3], [10]. The task scheduling problem on heterogeneous multiprocessor systems has been proven to be NP-hard, as its time complexity grows exponentially with the number of voltage settings [3], [11].
It is difficult to find an effective way to solve the above problem, because each processor has several voltage settings, and the same task has different processing times and energy consumption levels when executing on different processors. Traditional scheduling studies focus on heuristic-based algorithms, which often rely on greedy, locally optimal selections guided by heuristic strategies [3], [12], [13]. However, due to their greedy nature, heuristic-based methods cannot always produce consistent results across different problem instances [3], [12]. Because of their high adaptability, many well-known meta-heuristic algorithms have been adopted, including Genetic Algorithms (GA) [12], [14]-[20], Simulated Annealing (SA) [21], [22], Quantum-inspired Hyper-heuristic Algorithms (QHA) [3], [16], and Ant Colony Optimization (ACO) [23]-[26]. However, the search process of a meta-heuristic algorithm varies from problem to problem, and such algorithms suffer from large randomness, low global search efficiency, and premature convergence in late iterations.
Although there have been many studies on task scheduling, energy-aware task scheduling using the DVFS technique still faces many challenges. First of all, due to their greedy nature, existing heuristic algorithms are unable to obtain consistently good scheduling schemes in complex situations [27]. Secondly, many existing random search algorithms have high time complexity and low search efficiency, and their search performance needs to be improved [3]. Since each scheduling technique has its pros and cons, and different techniques may complement each other, hybrid algorithms have emerged as an effective way to improve algorithm performance. The Cuckoo Search (CS) algorithm, proposed by Yang and Deb in 2009, solves optimization problems by simulating brood-parasitism behavior and Lévy flights [28], [29]. It has a simple structure, fast search speed, and few parameters, and some studies have shown that the CS algorithm is more efficient than some swarm intelligence algorithms such as GA, the Artificial Bee Colony (ABC) algorithm and the Particle Swarm Optimization (PSO) algorithm [28]-[37]. CS has been widely used for solving optimization problems in engineering applications, so using it to search for task graph schedules is expected to improve scheduling quality and shorten the search time. Therefore, in this paper, we propose an improved cuckoo search algorithm combined with a heuristic modification strategy. By combining these algorithms, we retain their complementary advantages and achieve better universality.
The standard CS algorithm can easily fall into local optima when solving complex problems, and suffers from low solution accuracy [31]-[34]. To overcome this shortcoming, we propose a Cuckoo Search algorithm based on Gaussian random walk and Adaptive discovery probability (GACS). We use chaos variables to initialize the population and maintain its diversity, a Gaussian random walk strategy to balance the exploration and exploitation capabilities of the algorithm, and an adaptive discovery probability strategy to further improve population diversity. In this paper, we apply the GACS algorithm to assign tasks to processors and their voltage states, then use a widely used downward rank heuristic to find the corresponding scheduling sequence, and finally a cost-to-time ratio heuristic strategy to further improve the performance of GACS.
The four main contributions of this paper are listed below.
(1) We propose an improved cuckoo search algorithm deploying Gaussian random walk and adaptive discovery probability, which can effectively balance exploration and exploitation capabilities of the CS algorithm.
(2) We use an Adaptive Fitness Transformation (AFT) method to solve the performance-constrained energy optimization problem. To the best of our knowledge, this is the first time the AFT method has been applied to the task scheduling problem.
(3) We propose an improvement strategy based on the cost-to-time ratio to improve the performance of the GACS algorithm, which can further reduce energy consumption under the performance constraint.
(4) The simulation results reveal that our algorithm has better performance compared with the state-of-the-art algorithms.
In this work, we propose the GACSM algorithm for the energy-aware task scheduling problem with DVFS. The goal is to allocate tasks to available processors so as to minimize energy consumption under a given time constraint while meeting the precedence constraints of the tasks. GACSM differs from other algorithms in that it combines a heuristic modification algorithm with the improved CS algorithm, and it uses a single-population strategy together with the AFT method. GACSM utilizes a chaotic search strategy, a Gaussian random walk strategy and an adaptive discovery probability strategy, combined with the cost-to-time ratio modification strategy, to minimize energy consumption under a time constraint for task scheduling on heterogeneous computing systems with DVFS. The average complete computing time of our algorithm is shorter than that of two advanced algorithms across different graph sizes under 1000 evaluations over 30 runs. We perform extensive experiments using real-world graphs and 18 randomly generated graphs. The results verify that our algorithm has good search accuracy and search efficiency, and is superior to the state-of-the-art algorithms.
The remainder of this paper is organized as follows. Section 2 reviews some existing related studies on task scheduling on heterogeneous systems. Section 3 describes the model of heterogeneous systems. Section 4 presents our GACSM algorithm. Section 5 reports our experiment results. Section 6 concludes the paper.

II. RELATED WORK
Static task scheduling of applications on multiprocessors has been widely studied [5], [38]. The proposed scheduling algorithms can be classified as heuristic-based and metaheuristic. Heuristic-based scheduling algorithms typically find a scheduling scheme in polynomial time based on incomplete information [39]-[44]. Topcuoglu et al. in [38] proposed two classical algorithms: Heterogeneous Earliest Finish Time (HEFT) and Critical Path On a Processor (CPOP). Metaheuristic scheduling algorithms usually use random search techniques [45]-[48]. Metaheuristic algorithms usually generate schedules of better quality than heuristic-based algorithms; however, due to their lower search efficiency, their computation cost is much higher [3], [15].
Many studies have been conducted for energy-aware task scheduling on processors with DVFS (see Table 1). Most of them either focused on homogeneous computing systems [49] and independent task scheduling [49], [51], [52], [58], [61], or have very high computational cost [3], [53], [54]. For the continuous DVFS situation, there are some studies that considered reducing energy consumption [8], [50], [55]. However, since many scheduling problems are discrete in reality, energy-aware task scheduling becomes quite complex in this case. For the discrete DVFS situation, the authors in [3], [5], [7], [27], [49], [51]- [54], [56]- [62] investigated the scheduling problems. However, some of them adopted the strategy of shutting down the processors in the system [56], [57], which is unreasonable in reality [7]. Lee and Zomaya in [5] proposed an Energy-Conscious Scheduling (ECS) algorithm, and the authors in [27] proposed an Energy Aware task scheduling in the context of Service Level Agreement (EASLA). These approaches are mainly based on heuristic methods which are not agile for different application situations [7], [27]. The work [3] proposed a quantum-inspired hyper-heuristics algorithm (QHA), but its computational cost is too high and its scheduling results may violate the precedence constraint of the tasks [63].
The authors in [26] proposed an Improved Multi-Population Co-evolution Ant Colony Optimization (ICMPACO) algorithm, which is based on the multi-population strategy, co-evolution mechanism, pheromone updating strategy and pheromone diffusion mechanism. The ICMPACO algorithm uses a positive feedback mechanism, which is different from our GACS algorithm. In the ICMPACO algorithm, each individual can only perceive local information and cannot directly use global information, while our GACS algorithm can share information through the current optimal individual. In this work, we address the performance-constrained energy optimization problem for task scheduling on heterogeneous computing systems with DVFS by combining the GACS algorithm and a cost-to-time ratio modification strategy.

III. THE MODELS
In this section, we discuss the mathematical models of heterogeneous multiprocessor systems with dynamically variable voltage. We assume that the heterogeneous multiprocessor system in this work has the following characteristics [12]: (1) non-preemptive execution; (2) a fully interconnected network; (3) task duplication is prohibited; (4) communication links with different startup times and bandwidths; (5) each processor has an independent I/O unit that allows communication and computation to be performed simultaneously [9], [12].

A. SYSTEM MODEL
We assume that the system consists of a set of heterogeneous processors P = {P_1, P_2, ..., P_M} that are fully interconnected by a high-speed network, where M represents the number of heterogeneous processors and each processor P_k ∈ P is DVFS-enabled with a finite number h(k) of voltage supply levels [3]. Let V_k = (V_k1, ..., V_kh(k)) be the voltage supply vector of P_k, where V_kr is the voltage corresponding to the rth Voltage Supply Level (VSL) of processor P_k. Specifically, we denote by P_kr the processor P_k under supply voltage V_kr. When processor P_k is idle, its supplied voltage V_kh(k) is minimal [3].

B. APPLICATION MODEL
Let a task graph G = (T, E) be a Directed Acyclic Graph (DAG) composed of a set of tasks T = {T_1, ..., T_N}, where the vertex set T represents tasks, the edge set E represents execution precedences among tasks, and N is the number of tasks. We assume that each task can only be executed sequentially without preemption on the same processor. There is an entry task and an exit task in a DAG. The vertex weight D_w(T_i) represents the computation amount of task T_i. Each edge e_ij ∈ E represents a precedence constraint between T_i and T_j and implies that if T_i → T_j, then T_i is the predecessor of T_j and T_j is the successor of T_i [15], i.e., the output of T_i has to be transmitted to T_j before T_j starts its execution [5]. The edge weight C_w(T_i, T_j) represents the communication amount between tasks T_i and T_j. An example of a DAG is shown in Fig. 1, which shows a DAG of nine tasks that need to be assigned to the given number of available processors. The weight 3.2 of T_3 represents the computation amount of T_3, denoted as D_w(T_3) = 3.2, and the edge weight 2 between T_1 and T_3 indicates the communication amount, denoted as C_w(T_1, T_3) = 2. The precedence constraints of tasks are known a priori and remain unchanged during scheduling and task execution. When scheduling the tasks of a DAG to the processors, it is necessary to satisfy the precedence constraints among tasks and the availability of the processors.
Let B_ikr be the processing time of task T_i on P_k with V_kr, and Sr(V_kr) be the relative speed when T_i is executed on P_k with V_kr. Then B_ikr can be expressed as

B_ikr = D_w(T_i) / Sr(V_kr).

The communication between tasks assigned to different processors is performed through message passing over the bus [7]. If the data that a task needs to read is available in the local memory, inter-processor communication will not occur [9]. When T_i and T_j are scheduled to the same processor, the communication time is zero, as intra-processor communication can be ignored [3], [7], [9].

If T_i and T_j are assigned to different processors, a communication cost is incurred [3], [7]. Suppose that T_i is assigned to processor P_k and T_j to processor P_l. Let D(P_k, P_l) be the data transfer rate between processors P_k and P_l, and C_s(P_k) be the communication startup time of processor P_k [15]. The communication time C(T_i, T_j), which represents the time spent transferring data from T_i to T_j, is measured in seconds and can be expressed as

C(T_i, T_j) = C_s(P_k) + C_w(T_i, T_j) / D(P_k, P_l).

Let EFT(T_i) be the earliest finish time of task T_i on processor P_k given supply voltage V_kr. Then EFT(T_i) is defined as

EFT(T_i) = EST(T_i) + B_ikr,

where EST(T_i) represents the earliest start time of task T_i:

EST(T_i) = 0, if T_i = T_en;
EST(T_i) = max{ eavt(P_k), max_{T_j ∈ pr(T_i)} [ EFT(T_j) + C(T_j, T_i) ] }, otherwise,

where eavt(P_k) represents the earliest available time when P_k is ready for task scheduling, pr(T_i) is the set of immediate predecessors of task T_i, and T_en represents the entry task. Let Ms(G) be the makespan (schedule length) of G. Then Ms(G) is defined as [3]

Ms(G) = max_{T_i ∈ T} EFT(T_i).
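As an illustration of the recurrences above, the following sketch computes EST, EFT and the makespan for a given task order and processor assignment; the function and variable names are hypothetical, not the paper's implementation, and communication times are assumed precomputed per edge.

```python
# Sketch of the EST/EFT recurrences from this section (illustrative names).
# Tasks must be supplied in an order that respects the precedence constraints.

def schedule_times(order, proc_of, exec_time, comm_time, preds, num_procs):
    """Compute EFT per task and the makespan Ms(G)."""
    eavt = [0.0] * num_procs          # earliest available time of each processor
    eft = {}
    for t in order:
        # data from predecessors on *other* processors must arrive first;
        # intra-processor communication time is zero
        ready = max((eft[p] + (0.0 if proc_of[p] == proc_of[t]
                               else comm_time[(p, t)])
                     for p in preds[t]), default=0.0)
        est = max(eavt[proc_of[t]], ready)
        eft[t] = est + exec_time[t]
        eavt[proc_of[t]] = eft[t]     # processor is busy until the task finishes
    return eft, max(eft.values())     # makespan = latest finish time
```

For a three-task chain split across two processors, the recurrence reproduces the expected finish times by hand calculation.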

C. ENERGY CONSUMPTION MODEL
The energy consumed by processors, power supply modules, memory and fans varies with workload in high-performance computing systems [64], [65]. Some studies show that processors are the main consumers of system energy [64], [65]. In this work, we focus on the energy consumption of the processors. We assume the processors are based on Complementary Metal Oxide Semiconductor (CMOS) technology [3], [5], [66]. The power consumption is dominated by the dynamic power dissipation P_d, which is defined as

P_d = C V^2 F,

where C is the effective switched capacitance, V is the supply voltage, and F is the processor clock frequency. Since F ∝ V, we have P_d = λV^3, where λ represents a parameter that differs with each type of processor [3]. The dynamic energy consumption of all the executed tasks can be expressed as [3]

E_d = Σ_{k=1}^{M} Σ_{T_i ∈ U_k} λ_k V_kr^3 B_ikr,

where U_k is the set of tasks on processor P_k; obviously, U_k is a subset of T. The total idle energy consumption of all the idle slots is [3]

E_i = Σ_{k=1}^{M} λ_k V_kh(k)^3 ( Ms(G) − Σ_{T_i ∈ U_k} B_ikr ),

where V_kh(k) is the minimum supply voltage on P_k. Let E(G) be the total energy consumption of a task graph G; then it can be calculated as [3], [5]

E(G) = E_d + E_i.
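Under these assumptions (P_d = λV³, energy = power × time, and idle slots charged at the minimum voltage), the energy model can be sketched as follows; all names and the data layout are illustrative:

```python
# Hedged sketch of the CMOS energy model described above: dynamic power
# P_d = lam * V**3 per processor, energy = power * time (illustrative names).

def total_energy(assign, exec_time, volt, idle_volt, lam, makespan, num_procs):
    """assign: task -> processor; volt: task -> supply voltage used."""
    busy = [0.0] * num_procs
    e_dyn = 0.0
    for t, k in assign.items():
        e_dyn += lam[k] * volt[t] ** 3 * exec_time[t]   # dynamic energy of task t
        busy[k] += exec_time[t]
    # idle energy: time not spent computing, charged at the idle voltage V_kh(k)
    e_idle = sum(lam[k] * idle_volt[k] ** 3 * (makespan - busy[k])
                 for k in range(num_procs))
    return e_dyn + e_idle
```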

D. PROBLEM MODEL
The performance-constrained energy optimization problem studied in this paper is defined as:

minimize E(G)
subject to Ms(G) ≤ S,

where E(G) represents the total energy consumption of task graph G, and S represents the time constraint of task graph G.

IV. ALGORITHM FRAMEWORK
In this section, we present the framework of our improved cuckoo search algorithm (GACSM), which deploys a Gaussian random walk and adaptive discovery probability combined with a modification strategy. To take advantage of both GACS-based and heuristic-based algorithms while avoiding their disadvantages, we combine the GACS algorithm with heuristics. We apply the GACS algorithm to assign each task to a processor and its voltage state. After obtaining the task-to-resource mapping scheme, we use a widely used downward rank heuristic to calculate the task priorities according to the mapping results, and then we evaluate the fitness value f(x_i) and the constraint violation degree v(x_i).
Algorithm 1 GACSM
Require: Parameters for GACSM and task scheduling.
Ensure: A task schedule.
1: g = 0;
2: Call Algorithm 2 to create an initial population P^0;
3: repeat
4:   g = g + 1;
5:   if the random number r_2 ≤ rank(i)/Np then
6:     Generate new individuals by using Eq.(17) and
7:     obtain the new population P^{g−1}_new;
8:   end if
9:   P^g = ChooseBestIndividual(P^{g−1}, P^{g−1}_new) (Algorithm 3);
10:  Get a cuckoo x_k randomly by Lévy flights;
11:  Choose a nest j randomly from P^g;
12:  if x_k is better than x_j then
13:    Replace x_j by the new individual x_k;
14:  end if
15:  Abandon a fraction P_a of the worst nests by using Eq.(19) and build new ones via Eq.(20);
16:  Obtain the new population P^g_new;
17:  P^g = P^g_new;
18: until the stopping criterion is reached;
19: Call the modification strategy to further improve the population P^g (Algorithm 4);
20: return the best solution of the schedule.

Firstly, we call the chaos method to create an initial population P^0 (line 2). Secondly, if the random number r_2 ≤ rank(i)/Np, where rank(i) is the position of individual x_i^g when the population is sorted by fitness in ascending order and Np is the population size, we use the Gaussian random walk strategy of Eq.(17) to balance the exploration and exploitation capabilities of the algorithm (lines 5-8). Then, we select the Np best individuals from the populations P^{g−1} and P^{g−1}_new (line 9). Thirdly, in lines 10-16, we perform the CS operators; among them, we abandon a fraction P_a of the worst nests by using Eq.(19) and build new ones via Eq.(20) (line 15).
The loop iterates until the stopping criterion is reached. After performing GACS, we use the cost-to-time ratio strategy to further improve the result (line 19). The outline of GACSM is depicted in Algorithm 1.

A. CUCKOO SEARCH
The CS algorithm is an emerging bio-inspired heuristic algorithm proposed by Yang and Deb in 2009 which simulates the brood parasitism behavior of cuckoos. Because of its simplicity and easy implementation, CS has been successfully applied to practical problems such as engineering optimization, and is widely accepted in the field of intelligent algorithms [32]-[37]. Its main idea is as follows. When generating a new solution x_i^{g+1}, a Lévy flight is performed:

x_i^{g+1} = x_i^g + α ⊕ Lévy(β),   (12)

where α represents the step size, x_i^{g+1} the next-generation solution, x_i^g the current-generation solution, and ⊕ the entry-wise multiplication. Lévy(β) represents the Lévy random number. For convenience of calculation, the literature [29] uses Eq.(13) to calculate the Lévy random number:

Lévy(β) ~ μ / |ν|^{1/β},   (13)

where μ and ν are random numbers drawn from normal distributions satisfying μ ~ N(0, σ_μ²) and ν ~ N(0, 1), with σ_μ given by Mantegna's formula. CS discards some inferior solutions with a discovery probability P_a, and then regenerates the same number of new solutions by using preference random walks:

x_i^{g+1} = x_i^g + r_1 (x_j^g − x_k^g),

where x_j^g and x_k^g are two randomly selected solutions, and r_1 is a uniformly distributed random number in the interval (0,1).
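For illustration, Mantegna's method for drawing the Lévy step in Eq.(13) can be sketched as follows; the scaling of the move by the distance to the best solution is one common CS variant, assumed here rather than taken from the paper:

```python
import math
import random

# Mantegna's algorithm for Levy-stable step sizes, as commonly used in CS [29].
def levy_step(beta=1.5):
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    sigma_u = (num / den) ** (1 / beta)    # std dev of the numerator Gaussian
    u = random.gauss(0.0, sigma_u)         # mu ~ N(0, sigma_u^2)
    v = random.gauss(0.0, 1.0)             # nu ~ N(0, 1)
    return u / abs(v) ** (1 / beta)

def levy_flight(x, x_best, alpha=0.01):
    """One Levy-flight move of solution x (entry-wise, Eq.(12)-style)."""
    return [xi + alpha * levy_step() * (xi - xb) for xi, xb in zip(x, x_best)]
```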

B. CUCKOO SEARCH ALGORITHM BASED ON GAUSSIAN RANDOM WALK AND ADAPTIVE DISCOVERY PROBABILITY (GACS)
1) CHAOS METHOD
We use the chaos method to initialize the population. Chaos is random, unpredictable, yet regular in nature. Searching with a chaos method can help the algorithm jump out of local optima, maintain population diversity, and improve its global search ability [67]. In this paper, an ergodic chaotic mapping is introduced to transform the initial variables into chaos variables. The sinusoidal iteration formula Eq.(15) generates chaotic values cf^j_k, where cf^j_0 is a randomly generated number in the interval (0,1), j = 1, ..., N, k = 0, 1, ..., MaxCh, MaxCh is the maximum number of chaotic iterations, and N is the number of tasks.
The sinusoidal iteration formula Eq.(15) is introduced into the process of population initialization, and the population variables are obtained by the transformation

x_j = x_j^min + cf_j (x_j^max − x_j^min),   (16)

where x_j^min and x_j^max are the lower and upper limits of the jth dimension variable, respectively.
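A sketch of the chaotic initialization, assuming the common sinusoidal iterator cf_{k+1} = 2.3 cf_k² sin(π cf_k) for Eq.(15) (the paper's exact constants may differ) and the bound mapping of Eq.(16):

```python
import math
import random

# Chaotic population initialization. The sinusoidal map constants below are
# an assumption standing in for Eq.(15); the bound mapping follows Eq.(16).
def chaos_init(pop_size, dim, x_min, x_max, iters=10):
    pop = []
    for _ in range(pop_size):
        cf = [random.uniform(0.01, 0.99) for _ in range(dim)]
        for _ in range(iters):                       # iterate the chaotic map
            cf = [2.3 * c * c * math.sin(math.pi * c) for c in cf]
        # Eq.(16)-style mapping of each chaotic value into [x_min, x_max]
        pop.append([x_min[j] + (abs(cf[j]) % 1.0) * (x_max[j] - x_min[j])
                    for j in range(dim)])
    return pop
```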

2) GAUSSIAN RANDOM WALK STRATEGY
Since Gaussian random walk strategy has strong local exploitation ability [68], [69], we use this strategy to generate a new random population, which can balance the global exploration and local exploitation ability of the algorithm.
In Eq.(17), x_i^g is the ith candidate solution in the population, x_b^g is the best solution, r_3 is a random number in the interval [0, 1], and ζ is the adaptively adjusted deviation defined in Eq.(18). Using the best individual to guide a poor individual helps the poor individual move toward the best one, which speeds up the convergence of the algorithm. This strategy mainly operates on poor individuals with a high probability, which increases the efficiency of the evolution. In addition, the Gaussian distribution is controlled by the adaptively adjusted variance ζ: in the early stage of the algorithm, ζ is large, which helps maintain the global exploration ability of the algorithm; ζ decreases as the number of iterations g increases, which improves the local exploitation ability of the algorithm.
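The behaviour described here can be sketched as follows; since the exact forms of Eq.(17) and Eq.(18) are not reproduced above, the shrinking deviation ζ = |x_i − x_b|/g used below is an assumption that merely matches the described trend (large early, decreasing with g):

```python
import math
import random

# Hedged sketch of the Gaussian random-walk update: a poor individual is
# pulled toward the best one, with a deviation that shrinks over generations.
def gaussian_walk(x_i, x_best, g):
    r3 = random.random()
    out = []
    for xi, xb in zip(x_i, x_best):
        zeta = abs(xi - xb) / g                   # assumed adaptive deviation
        out.append(random.gauss(xb, zeta) + r3 * (xb - xi))
    return out
```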

3) ADAPTIVE DISCOVERY PROBABILITY
The CS algorithm discards some of the worse nests with a probability P_a and continues searching from the rest, so a suitable P_a must be chosen. If P_a is too small, it is difficult to generate new individuals; if P_a is too large, the algorithm degenerates into a pure random search. The convergence of the CS algorithm is therefore affected by the choice of P_a. In the standard CS algorithm, P_a is usually set to a constant, and intuitively a fixed P_a is likely to reduce the convergence performance of the algorithm. To overcome this problem, we use the following dynamic adaptive mechanism to adjust the discovery probability P_a:

P_a = P_min + (P_max − P_min) (f_i − f_min) / (f_max − f_min),   (19)

where f_i represents the fitness of the current solution x_i, f_min is the minimum fitness among all solutions, f_max is the maximum fitness among all solutions, and P_max, P_min are two parameters in the interval (0,1).
It can be seen from Eq.(19) that the closer a solution is to the optimal solution, the smaller its P_a, which makes the solution more likely to be retained in the next generation. When the difference between the fitness of the current solution and that of the optimal solution is large, P_a is large, and the solution is easily discarded.
Let r_4, r_5 be random numbers in the interval [0,1]. If r_4 ≤ P_a, then the individual x_i^{g+1} is updated according to Eq.(20), in which x_j^g and x_k^g are two randomly selected distinct individuals, r_6, r_7 are two uniformly distributed random numbers in the interval [0,1], and r_8 is a random number in the interval [0, 1]. Through this individual screening strategy, individuals with poor fitness are more likely to be discarded, and new individuals are generated according to Eq.(19). At the same time, Eq.(20) uses two different types of mutation operators, namely a random search operator and a mutation operator toward the optimal individual, to enhance the exploration ability of the algorithm while improving its exploitation ability.
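A minimal sketch of the adaptive discovery probability, assuming a linear interpolation between P_min and P_max that is consistent with the behaviour Eq.(19) is described to have (best solutions get a small P_a, worst ones a large P_a):

```python
# Hedged sketch of the adaptive discovery probability: solutions close to
# the best fitness (minimization) get a small P_a and tend to be kept.
def adaptive_pa(f_i, f_min, f_max, p_min=0.1, p_max=0.5):
    if f_max == f_min:                 # degenerate population: use the midpoint
        return (p_min + p_max) / 2
    return p_min + (p_max - p_min) * (f_i - f_min) / (f_max - f_min)
```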

C. ENCODING OF SOLUTIONS
We first show the priority queues for DAG applications, and then present the encoding mechanism of task scheduling.

1) TASK PRIORITY CALCULATION
We use a widely used downward rank heuristic strategy to calculate task priority [15]. Let Rk(T_i) be the downward rank of a task T_i; then Rk(T_i) can be defined recursively by Eq.(21):

Rk(T_i) = max_{T_j ∈ pr(T_i)} { Rk(T_j) + B_jkr + C(T_j, T_i) },   (21)

where pr(T_i) denotes the set of immediate predecessors of task T_i, and Rk(T_en) = 0 for the entry task.
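A sketch of the downward-rank recursion; the names are illustrative, and the per-task execution and per-edge communication costs are supplied by the caller:

```python
# Downward rank: length of the longest path from the entry task to T_i,
# accumulated over predecessor execution (w) and communication (c) costs.
def downward_ranks(preds, w, c):
    ranks = {}
    def rank(t):
        if t not in ranks:
            ranks[t] = max((rank(p) + w[p] + c[(p, t)] for p in preds[t]),
                           default=0.0)            # entry task has rank 0
        return ranks[t]
    for t in preds:
        rank(t)
    return ranks
```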

2) NEST REPRESENTATION
For mapping tasks to resources, we first divide the voltage supply levels of a processor into non-idle and idle voltage supply levels. Then we encode the non-idle voltage supply levels of all processors in turn. Each processor has several non-idle voltage supply levels, and each non-idle voltage supply level corresponds to a unique index number (see Table 3). Finally, we can determine the corresponding processor based on the index of the non-idle voltage supply level, and use Eq.(8) to calculate the total energy consumption of all idle nodes. The encoding of a solution is chosen randomly from 1 to N_tv, where N_tv is the total number of non-idle voltage supply levels. Cuckoo search works on problems with a continuous space, but graph scheduling is a discrete-space problem, so we need to discretize the space. In our algorithm, the dimension of an individual x_i = (x_i1, ..., x_iN) is N, which equals the number of tasks in the DAG. If there are N_tv non-idle voltage states, each task can be assigned a voltage state in the range 1, ..., N_tv. We set x_ij (j = 1, ..., N) in the range (0.5, N_tv + 0.5), and then round the value of x_ij to the nearest whole number. For example, the value 10.9 in the fourth dimension in Fig. 2 indicates that task T_4 is assigned to a processor of pair 3 with the non-idle voltage index of 11 (see Fig. 3), i.e., task T_4 is executed on a processor with a voltage of 1.9 and a relative speed of 0.85.
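Decoding a continuous nest position into a voltage-level index, as described above; the flat list of non-idle levels is an assumed layout standing in for Table 3:

```python
# Decode x_ij in (0.5, N_tv + 0.5) to a 1-based index into the flat list of
# non-idle voltage supply levels, as described in the encoding scheme above.
def decode(x_ij, levels):
    """levels: list of (processor, voltage, rel_speed) tuples; 1-based index."""
    idx = int(round(x_ij))               # round to the nearest voltage index
    idx = max(1, min(len(levels), idx))  # clamp into the valid range
    return idx, levels[idx - 1]
```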

D. CONSTRAINT HANDLING STRATEGY
We apply constrained optimization to the GACS search process. Constrained optimization problems are usually expressed as follows:

minimize f(x), subject to g_j(x) ≤ 0, j = 1, ..., q,

where x ∈ ω ⊆ Sp represents the decision vector, ω represents the feasible region, and Sp represents the search space. Usually, an equality constraint is transformed into the following inequality constraint:

|h_j(x)| − η ≤ 0,

where η represents a tolerance factor, usually greater than 0. Then the degree of constraint violation of an individual on the jth constraint can be evaluated as

v_j(x) = max(0, g_j(x)),

and the total degree of normalized constraint violation v(x) of individual x can be calculated by summing the normalized v_j(x) over all constraints, as in Eq.(25). To deal with the constrained optimization problem, the authors of [70] proposed an Adaptive Fitness Transformation (AFT) method that divides the population into three states: the infeasible, semi-feasible and feasible states.
(1) Infeasible state: the population contains only infeasible solutions. In this case, only the degree of constraint violation needs to be considered, and the fitness value is calculated from v(x) as in [70].
(2) Semi-feasible state: the population contains both feasible and infeasible solutions. In this case, the population is divided into a feasible solution set (W_1) and an infeasible solution set (W_2), and the objective function value f(x_i,g) of solution x_i,g is converted as in Eq.(27) [70], where φ is the feasible-solution ratio of the previous generation's population, and x_b^g, x_w^g represent the best and worst solutions of the feasible solution set W_1, respectively. Eq.(27) is then normalized as Eq.(28). The degree of constraint violation is calculated by Eq.(25) and normalized as Eq.(29). The fitness value f_fit(x_i) is then expressed by Eq.(30).
(3) Feasible state: all individuals in the population are feasible solutions. In this case, the fitness value is calculated from the objective function alone, as in Eq.(31) [70].

E. MODIFICATION STRATEGY
Inspired by the literature [1], we propose an improved cost-to-time ratio modification strategy to further reduce energy consumption under the time constraint. The cost-to-time function Ra(T_i, P_kr) is defined as follows [1]:

Ra(T_i, P_kr) = DiE(T_i, P_kr) / DiT(T_i, P_kr),   (32)

where DiE(T_i, P_kr) and DiT(T_i, P_kr) represent, respectively, the increase in energy consumption and in execution time when task T_i is moved from its currently assigned processor to P_kr [1]. Our strategy works as follows:
• Compute a critical path cp: T_i ⇝ T_j in G. The computation time of the critical path (CP) equals the makespan of G.
• If T(G) > S, we re-allocate the processor of a task selected from the CP to reduce the makespan. To obtain minimal energy consumption while meeting the time constraint, among the candidate moves with DiT(T_i, P_jr) < 0 we select the task and processor P_kl with the maximal ratio Ra(T_i, P_kl) and move the task to P_kl. Since the increase in execution time is negative, for the same amount of reduced execution time a larger ratio means a smaller increase in energy. For example, assume DiE(T_i, P_jr) = 10, DiE(T_i, P_kr) = 4, and DiT(T_i, P_jr) = DiT(T_i, P_kr) = −2; then according to Eq.(32), Ra(T_i, P_jr) = −5 and Ra(T_i, P_kr) = −2. Obviously, in this case, it is better to reassign T_i to resource P_kr. After the task assignment adjustment, the algorithm finds a new CP in G and tries to reduce the completion time until the time constraint is met or the makespan of G cannot be reduced any further.
• If T(G) ≤ S, we try to reduce the energy consumption by moving a task with the minimum ratio to a new processor and voltage index. To reduce energy consumption, the task reassignment must satisfy DiE(T_i, P_jr) < 0. If there exists Ra(T_i, P_jr) > 0, which implies DiT(T_i, P_jr) < 0, then we give priority to assigning to T_i the P_kl with the smallest positive ratio in the CP. For example, assume DiE(T_i, P_jr) = DiE(T_i, P_kr) = −10, and DiT(T_i, P_jr) = −5, DiT(T_i, P_kr) = −2; then according to Eq.(32), Ra(T_i, P_jr) = 2 and Ra(T_i, P_kr) = 5. Obviously, in this case, it is better to reassign T_i to resource P_jr. Otherwise, if all Ra(T_i, P_jr) ≤ 0, which implies DiT(T_i, P_jr) > 0, then we assign to T_i the P_kl with the smallest negative ratio in the CP. After reassigning a node, the algorithm attempts to find another node and continues until the energy consumption can no longer be reduced. The modification strategy also iterates over each idle processor to assign tasks with the minimum energy consumption within the time constraint.
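The selection rule for makespan-reducing moves can be sketched as follows, reusing the worked example above; the move list layout is illustrative:

```python
# Cost-to-time ratio test (Eq.(32)): Ra = DiE / DiT. Among moves that shrink
# the makespan (DiT < 0), the maximal ratio adds the least energy per unit of
# time saved, so it is the preferred move.
def ratio(die, dit):
    return die / dit

def best_speedup_move(moves):
    """moves: list of (task, proc, DiE, DiT); keep only makespan-reducing ones."""
    cands = [(t, p, ratio(e, d)) for (t, p, e, d) in moves if d < 0]
    return max(cands, key=lambda m: m[2]) if cands else None
```

With the example from the text (DiE = 10 and 4, DiT = −2 for both), the ratios are −5 and −2, and the move with ratio −2 is chosen.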
The description of the further improvement is depicted in Algorithm 3, where Gen represents the maximum number of generations. The space complexity of GACSM is O(Np × N), because an array of size N is needed to store each nest and there are at most Np nests.
V. EXPERIMENTS

A. EXPERIMENT SETUP
In the simulation environment, the target system comprises a set of fully interconnected, DVFS-enabled heterogeneous processors. In our experiments, processors are uniformly distributed among four different sets of voltage supply levels, which are listed in Table 3. The parameter λk of processor Pk is set as in [51].

Algorithm 4 Modification Strategy
if T(G) > S then
    repeat
        find a CP cp in G
        Tcp ← all tasks in cp
        for each Ti ∈ Tcp do
            for each Pjr ∈ P do
                if Pjr is a new index for Ti and DiT(Ti, Pjr) < 0 then
                    calculate Ra(Ti, Pjr)
                end if
            end for
        end for
        Ra(Ti, Pkl) ← the maximal ratio in cp
        Assign Pkl to Ti
    until T(G) ≤ S
else
    repeat
        for each Ti ∈ G do
            for each Pjr ∈ P do
                if Pjr is an available index for task Ti and DiE(Ti, Pjr) < 0 and T(G) ≤ S then
                    calculate Ra(Ti, Pjr)
                end if
            end for
        end for
        if there exists Ra(Ti, Pjr) > 0 then
            Ra(Ti, Pkl) ← the smallest positive ratio in CP
            Assign Pkl to Ti
        else
            Ra(Ti, Pkl) ← the smallest negative ratio in CP
            Assign Pkl to Ti
        end if
    until E(G) cannot be reduced
end if

We use two sets of task graphs to evaluate the algorithms. The first test set is the Modified Molecular Dynamics Code (MMDC) [3]. The second test set consists of randomly generated task graphs.
The parameters of the random graph generator are set as in [3]. The graph height of a random DAG is drawn from a uniform distribution with a mean value of √N/ψ, where N represents the number of tasks in the DAG [3] and ψ represents a parallelism factor. Let Di be the mean computation amount of task Ti; Di is generated randomly with a uniform distribution over [0, 2 × DG], where DG is the average computation amount of the given DAG. The computation amount of task Ti, i.e., Dw(Ti), is in the range [Di × (1 − δ/2), Di × (1 + δ/2)], where δ is the computation capacity heterogeneity factor. Then, the processing time of task Ti on Pk with Vkr, i.e., Bikr, is calculated as Dw(Ti)/Rs(Vkr). The communication time among tasks is generated with a uniform distribution over [0, 2 × DG × CCR], where CCR is the communication-to-computation ratio [3].
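The sampling scheme above can be sketched as follows. Function and variable names are ours, and the exact sampling intervals (a uniform height with the stated mean, and a symmetric heterogeneity interval around Di) are assumptions where the text leaves them implicit.

```python
import math
import random

def random_dag_parameters(N, psi, D_G, delta, CCR, rng=None):
    """Sample the per-task quantities of one random DAG instance."""
    rng = rng or random.Random(0)

    # Graph height: uniform with mean sqrt(N)/psi (psi = parallelism factor),
    # assumed here to be drawn from [0, 2 * mean] and rounded to at least 1.
    mean_height = math.sqrt(N) / psi
    height = max(1, round(rng.uniform(0, 2 * mean_height)))

    # Mean computation amount D_i of each task: uniform over [0, 2 * D_G].
    D = [rng.uniform(0, 2 * D_G) for _ in range(N)]

    # Computation amount D_w(T_i): spread around D_i by the heterogeneity
    # factor delta (assumed interval [D_i*(1 - delta/2), D_i*(1 + delta/2)]).
    Dw = [rng.uniform(d * (1 - delta / 2), d * (1 + delta / 2)) for d in D]

    # Communication time between tasks: uniform over [0, 2 * D_G * CCR].
    comm = rng.uniform(0, 2 * D_G * CCR)
    return height, Dw, comm
```

The processing time Bikr then follows by dividing Dw(Ti) by the relative speed Rs(Vkr) of the chosen voltage level.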
All simulations are performed on a PC with an Intel Core i7-3770 3.40 GHz CPU and 12.0 GB RAM. The algorithms are implemented in Python 2.7.

B. COMPARISON METRICS
Energy Consumption Ratio (ECR) is an important comparison metric. The ECR value of an algorithm is defined as

    ECR = E / Σ_{Ti ∈ G} min_{Pk ∈ P} ( λk × V_kh(k)² × B_ikh(k) ),    (34)

where E represents the energy consumption of an algorithm with DVFS, λk represents the parameter of processor Pk, V_kh(k) is the minimal voltage of Pk, and B_ikh(k) is the processing time of task Ti on Pk with V_kh(k). It can be seen from Eq.(34) that the denominator is a lower bound on the energy consumption of a given task graph. The energy-saving ratio (ESR) can also be used to measure the performance of the algorithms. The ESR value [27] is expressed as

    ESR = (E_HEFT − E) / E_HEFT,

where E_HEFT is the energy consumption when all tasks of the HEFT algorithm [38] are performed at the highest frequency. The makespan extension defines the time constraint:

    S = (1 + ζ) × MS_b,

where ζ is the makespan extension rate and MS_b is the makespan of a best-effort HEFT schedule. In our experiments, we set the makespan extension rate to 0, 0.1, 0.2, 0.3, and 0.4, respectively.
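A minimal sketch of the three metrics, assuming (consistently with the text around Eq.(34)) that the per-task energy at a processor's minimal voltage is λk × V² × B; the helper names are ours.

```python
def ecr(E, lam, V_min, B_min):
    """ECR = E / lower-bound energy. The lower bound sums, per task, the
    minimum over processors of lambda_k * V_kh(k)^2 * B_ikh(k).
    lam[k], V_min[k]: processor parameter and minimal voltage of P_k;
    B_min[i][k]: processing time of task i on P_k at its minimal voltage."""
    lower = sum(min(lam[k] * V_min[k] ** 2 * B_min[i][k]
                    for k in range(len(lam)))
                for i in range(len(B_min)))
    return E / lower

def esr(E, E_heft):
    """ESR = (E_HEFT - E) / E_HEFT: fraction of energy saved vs. HEFT
    run at the highest frequency."""
    return (E_heft - E) / E_heft

def time_constraint(MS_b, zeta):
    """S = (1 + zeta) * MS_b: deadline derived from the best-effort HEFT
    makespan and the makespan extension rate zeta."""
    return (1 + zeta) * MS_b
```

For instance, with ζ = 0.3 a best-effort makespan of 100 yields a time constraint of 130, and an algorithm using 80% of the HEFT energy has ESR = 0.2.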

C. PARAMETER SETTING
The setting of parameters greatly affects the experimental results; however, our main purpose is to illustrate the applicability of GACS to task scheduling. In this paper, the values of all experimental parameters are verified by repeated experiments or set by experience. To reduce randomness, the reported simulation results are the averages of 30 independent runs. We use the control-variable method to discuss the influence of parameters, i.e., we first fix the other parameter values, and then analyze the influence of the studied parameter on the algorithm. In this paper, we uniformly set the population size to 40 and the maximum number of iterations to 200. The parameters of GACS are set as follows: step size α = 0.10, maximum discovery probability Pmax = 0.50, and minimum discovery probability Pmin = 0.20. The ICMPACO [26] parameters, i.e., the pheromone factor, heuristic factor, volatility coefficient, pheromone amount, and initial concentration, are set to 1, 5, 0.1, 100, and 1.5, respectively. The QHA [3] parameters, i.e., NumP, SP, stasize, and σ, are set to 4, 10, 20, and 0.05, respectively.
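For concreteness, the GACS settings above can be collected into a configuration, together with the 30-run averaging used to reduce randomness. The dict keys and the `run_gacs` callable are hypothetical placeholders for the scheduler itself.

```python
import statistics

# GACS parameter values as stated in the text.
GACS_PARAMS = {
    "population_size": 40,
    "max_iterations": 200,
    "step_size_alpha": 0.10,
    "p_max": 0.50,   # maximum discovery probability
    "p_min": 0.20,   # minimum discovery probability
}

def average_energy(run_gacs, params, runs=30):
    """Average the objective value of `run_gacs` over independent seeded
    runs, mirroring the 30-run averaging used in the experiments."""
    return statistics.mean(run_gacs(params, seed=s) for s in range(runs))
```

A unit of the control-variable method then amounts to sweeping one key of `GACS_PARAMS` while holding the others fixed and comparing the averaged results.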

D. COMPLETE COMPUTING TIME
In this section, we compare the complete computing time of our proposed GACS with that of two random heuristic algorithms, QHA and ICMPACO. Fig. 4 depicts the average computing time of the different algorithms with respect to different graph sizes under 1000 evaluations over 30 runs. As can be seen from Fig. 4, the average complete computing time of GACS is shorter than that of QHA and ICMPACO by (13.8%, 7.4%), (13.7%, 9.1%), (10.7%, 6.0%), (10.1%, 5.3%), and (10.3%, 5.5%) for task numbers of 16, 32, 64, 128, and 256, respectively. The reason is that our proposed GACS algorithm evolves more easily than QHA and ICMPACO and has fewer parameters to adjust, so its speed is relatively high.

E. REAL WORLD APPLICATION GRAPHS
We use the application graph of a real-world problem, the modified molecular dynamics code (MMDC) [3], to evaluate the performance of GACSM.
We test the search effectiveness of the GACSM algorithm on MMDC problems. There are three state-of-the-art algorithms for solving performance-constrained energy optimization problems: EASLA [27], ICMPACO, and QHA. To ensure fairness, we first use the same modification strategy described in Algorithm 4 to improve the performance of EASLA, ICMPACO, QHA, and GACS, and denote the resulting algorithms as EASLAM, ICMPACOM, QHAM, and GACSM, respectively. We apply HEFT [38] and ECS [5] to the modified molecular dynamics code to obtain the makespan of the graph. The average ECR values on MMDC are shown in Fig. 5. The results of the algorithms with respect to different CCR values are shown in Fig. 5a. Our proposed GACSM algorithm is superior to EASLAM, ICMPACOM, and QHAM, while the QHAM algorithm is sometimes better and sometimes worse than ICMPACOM. The ECR values of the algorithms for different M and δ values are shown in Fig. 5b and Fig. 5c, respectively. Fig. 5 shows that our proposed GACSM algorithm outperforms the EASLAM, ICMPACOM, and QHAM algorithms.

F. RANDOM GENERATED APPLICATION GRAPHS
In this section, we use 18 randomly generated DAG instances to evaluate the performance of GACSM (see Table 4). The methods and parameters used for them are the same as those used in [3]. In these instances, we consider the impact of different application graphs and numbers of processors. There are three state-of-the-art algorithms for solving the performance-constrained energy optimization problem: EASLA [27], ICMPACO [26], and QHA [3]. To ensure fairness, we first use the same modification strategy described in Algorithm 4 to improve the performance of EASLA, ICMPACO, and QHA, and denote the resulting algorithms as EASLAM, ICMPACOM, and QHAM, respectively. We run the ICMPACO and QHA algorithms for the same number of iterations as GACSM. The experimental results for the randomly generated application graphs are shown in Table 5, which reports the statistical performance comparison of the four algorithms, where ''-'' indicates that the data is infeasible. Table 5 shows that GACSM achieves better performance than ICMPACOM and QHAM on many of the test instances, such as R2, R3, R4, R10, R12, R14, R17, and R18. In contrast, ICMPACOM and QHAM do not outperform GACSM on any instance. In addition, the feasible rate and mean values of GACSM are better than those of ICMPACOM and QHAM (see Table 5), which shows that our GACSM algorithm has strong global search performance and high robustness. Figures 6-8 plot the convergence of the energy consumption for the R2, R4, and R6 test cases, taken as representatives of the 18 test cases. Figures 6-8 also show that the convergence speeds are rather different, i.e., the GACSM algorithm converges faster than ICMPACOM and QHAM. It can be observed from the figures that the final energy consumption achieved by GACSM is better than that of the other two algorithms.
The reason is that our GACSM algorithm uses the optimal-individual-guided population search strategy of the Gaussian random walk, so its search speed is high, and GACSM has strong local search ability in the later stage.
In what follows, we use another metric, ESR, to compare the energy savings of EASLAM, ICMPACOM, and QHAM with those of our proposed algorithm. The energy saving results of the algorithms with respect to various makespan extension rates are shown in Fig. 9. The number of tasks is set to 300, 600, and 1000, respectively, and the number of processors is set to 16, 32, and 64, respectively. We can see from Fig. 9 that our GACSM algorithm outperforms the other three algorithms under different conditions. As the makespan extension rate increases, the energy savings of the four algorithms also increase. Our GACSM algorithm reduces energy consumption by 14.9%, 6.9%, and 8.4% compared with the EASLAM, ICMPACOM, and QHAM algorithms, respectively, when ζ = 0.3, N = 1000, and M = 64. The GACSM, ICMPACOM, and QHAM algorithms outperform EASLAM in all cases. The reason is that the EASLAM algorithm adopts a heuristic strategy, and it is not easy to find good solutions for complicated task graphs; in contrast, the GACSM, ICMPACOM, and QHAM algorithms adopt random search strategies, which can be applied to complicated problems and have better search ability in large solution spaces, so they can find better results than EASLAM.

VI. CONCLUSION
In this paper, we address the problem of energy-aware scheduling on heterogeneous computing systems with a time constraint. We propose an improved cuckoo search algorithm incorporating a heuristic strategy to solve task scheduling on heterogeneous multiprocessor systems with DVFS. We first present an improved cuckoo search algorithm based on a Gaussian random walk and an adaptive discovery probability to establish the mapping of tasks to processor voltage states. We then give a downward-rank heuristic strategy to find the corresponding scheduling sequence. Finally, we present a cost-to-time ratio modification strategy to further improve the performance of GACS. The simulation results show that our proposed algorithm exhibits better performance than the state-of-the-art algorithms.
In the future, we plan to consider new guided random search algorithms to solve the DVFS-based task scheduling problem. Moreover, we plan to find more effective and efficient scheduling algorithms which can reduce time complexity and improve energy efficiency.
ZIHAN YAN received the B.S. degree from Zhejiang Gongshang University. He is currently pursuing the M.S. degree with Sun Yat-sen University. His research interests include high-performance computing and machine learning.
HUIMIN HUANG received the B.S. degree from Shihezi University and the M.S. degree from Chongqing Jiaotong University. She is currently pursuing the Ph.D. degree with Sun Yat-sen University. Her research interests include influence diffusion in social networks, recommendation systems, and machine learning.
HONG SHEN received the B.Eng. degree from the Beijing University of Science and Technology, the M.Eng. degree from the University of Science and Technology of China, and the Ph.Lic. and Ph.D. degrees from Abo Akademi University, Finland, all in computer science. He was a Professor and the Chair of the Computer Networks Laboratory, Japan Advanced Institute of Science and Technology, from 2001 to 2006. He was a Professor (Chair) of computer science with Griffith University, Australia, where he taught for nine years beginning in 1992. He is currently a specially appointed Professor with Sun Yat-sen University, China, and a tenured Professor (Chair) of computer science with The University of Adelaide, Adelaide, SA, Australia. He has published more than 300 articles, including more than 100 articles in international journals, such as a variety of IEEE and ACM transactions. His main research interests include parallel and distributed computing, algorithms, data mining, privacy preserving computing, high-performance networks, and multimedia systems. He was a recipient of many honours and awards.