A Novel and Adaptive Transient Fault-Tolerant Algorithm Considering Timing Constraint on Heterogeneous Systems

Owing to their high performance and low power consumption, heterogeneous processors are widely used in real-time systems. In these systems, a task that is not completed before its deadline can cause disastrous consequences, so providing fault tolerance is important. This paper proposes a novel, adaptive, transient fault-tolerant scheduling algorithm for heterogeneous real-time systems that aims to improve system reliability within a given deadline. Since task replication is effective in improving system reliability, the proposed algorithm supports multiple replicas for each primary task and allows a primary task and its replicas to be scheduled on the same processor to increase reliability and lower latency. The algorithm also dynamically adjusts the number of replicas for each task to accommodate the deadline and ensure higher reliability. Simulation results show that the proposed algorithm achieves higher reliability than existing related fault-tolerant algorithms. Specifically, for the benchmark of sixty tasks, the proposed algorithm obtains a reliability of 89.37%, whereas the two existing algorithms DB-FTSA and FTSA obtain 47.05% and 84.75%, respectively, as detailed in Fig. 4 in the experiments.


I. INTRODUCTION

A. BACKGROUND
In the past decades, society has witnessed continually improving performance of computing systems, which caters to the demands of both industry and academia [1]. As performance has improved, however, the scale and complexity of these systems have also grown, which has led to more failures [2], [3]. According to the 54th edition of the TOP500 in November 2019, Oak Ridge National Laboratory's Summit system holds top honors with a High Performance Linpack (HPL) result of 148.6 petaflops and contains 2,414,592 cores. Heterogeneous systems are widely applied in many security-related real-time systems owing to their high performance and low cost, and as their scale and complexity grow, the number of possible failures increases. Therefore, it is no longer reasonable to ignore the fact that an application running on a very large system can crash. In particular, high reliability is an indispensable design goal for safety-critical real-time systems. In these systems, correct behavior depends not only on the computed results but also on the time at which the results are produced [4]. No matter how good the computation result is, it is invalid if the completion time exceeds the timing constraint. Worse, missing a deadline may cause a serious catastrophe. Fault-tolerant scheduling is an effective way to achieve fault tolerance. Different scheduling schemes may result in different reliability and completion times. Therefore, designing efficient fault-tolerant scheduling techniques has become essential for achieving better reliability within a timing constraint and has attracted great interest among researchers.
The associate editor coordinating the review of this manuscript and approving it for publication was Xiao-Sheng Si.
Faults may happen when computing systems are running [5]. They can be primarily classified into two categories: permanent faults and transient faults [6]. Once a permanent fault occurs, it persists in the system. The failed component cannot perform any task, and the unfinished task must be migrated to other non-faulty components [7]. Transient faults are short-lived and occur primarily due to temporary malfunctioning of components or external interference such as electrostatic discharge, electrical power drops, or overheating [8], [9]. Computing systems experience all kinds of faults, but the majority of them are transient [10]. Thus, this paper focuses on transient faults.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
There are several methods to deal with transient failures. One commonly used method is task duplication, which allows multiple replicas for each task. Task duplication has two forms. One is the active form, where replicas of a task can be assigned to multiple processors and executed in parallel to reduce failures [11]. A task is regarded as finished as soon as the task itself or one of its replicas executes successfully. The other is the passive form [12], in which the original task is called the primary task and all its replicas are called backup tasks. When a primary task fails, a backup task must be selected to restore the pre-failure state, resulting in a longer recovery time. This form also assumes that a fault detection mechanism is used to detect processor crashes. In contrast, the active form does not need to detect and handle failures [13], thereby saving time. In this paper, we adopt the active form to cope with transient failures. Unfortunately, increasing the reliability usually increases the completion time as well [14]. Therefore, achieving higher reliability within a given time period is a big challenge.

B. RELATED WORK
Fault-tolerant scheduling under different situations has been widely studied in the literature.
Some aim to maximize the system reliability. Luo et al. [15] develop a real-time fault-tolerant algorithm RDFTAHS to schedule preemptive periodic tasks on heterogeneous distributed systems to boost system reliability. Yan et al. [16] focus on the fault tolerance when task runtime is uncertain. They propose an efficient fault-tolerant scheduling algorithm DEFT for real-time tasks in the cloud, aiming to achieve both fault tolerance and resource utilization efficiency, whilst not taking hosts' communication time into account. Latiff et al. [17] propose a DCLCA technique for dynamic clustering fault tolerance aware intelligent scheduling using the LCA optimization algorithm, not considering timing constraint. Zhang et al. [18] devise a novel algorithm RMEC which incorporates task priority establishment, frequency selection, and processor assignment to maximize the system reliability with energy constraint. They do not consider timing constraint, either. Mottaghi and Zarandi [19] present a dynamic scheduling algorithm called DFTS for real-time tasks in multicore processors to tolerate single and multiple transient faults. However, DFTS is only used for independent tasks and identical processing cores. Wei et al. [20] propose a fault-tolerant algorithm to deal with the transient failures which allows at most one backup task and cannot make full use of deadline.
Some others focus on energy-efficient fault tolerance. Guo et al. [11], [21] develop energy-efficient fault-tolerance (EEFT) techniques to schedule periodic tasks running on systems with identical functionalities. Zhao et al. [22] propose the SHR-DAG algorithm for scheduling a set of frame-based real-time tasks with individual deadlines on a single-processor system to minimize energy consumption while preserving the system reliability. Xie et al. [23] study the energy-efficient fault-tolerant scheduling problem on heterogeneous distributed embedded systems. They aim to reduce the energy consumption while satisfying the reliability goal and do not consider timing constraints. Chatterjee et al. [24] study the fault-tolerant dynamic task mapping and scheduling problem for Network-on-Chip-based multicore platforms.
Apart from the works above, others have their different focuses. Nair et al. [25] and Devaraj et al. [26] study the fault-tolerant problem on independent tasks. Studies [27]- [30] tackle the fault-tolerant problem in homogeneous multicore systems. Benoit et al. [31] propose an efficient fault-tolerant scheduling algorithm named FTSA for heterogeneous systems based on an active replication scheme to minimize the latency given a fixed number of failures. They do not consider communication time. Zhao et al. [32] design a fault-tolerant scheduling algorithm called MaxRe for heterogeneous systems to satisfy users' reliability requirements with minimum resources. Nor do they consider communication time. Samal et al. [33] propose a hybrid GA for primary-backup fault-tolerant scheduling of hard real-time tasks on multiprocessor systems with identical processors to maximize system utilization and efficiency. Kurt et al. [34] present a fault-tolerant dynamic task graph scheduling algorithm that recovers from faults without global coordination to minimize the slowdown of the application in the presence of soft errors. Timing constraint is not taken into consideration, either.
In this paper, we focus on transient failures and applications represented by DAGs considering communication time, aiming to improve system reliability by using redundancy to tolerate faults within timing constraint.

C. OUR CONTRIBUTION
In this paper, we propose a novel, adaptive, transient fault-tolerant scheduling algorithm to solve the fault-tolerant problem with a timing constraint in heterogeneous systems, aiming to improve system reliability. Generally, the more replicas a task has, the higher the reliability that can be obtained. Moreover, tasks executed on the same processor incur smaller communication overhead than tasks executed on different processors. Based on these observations, our proposed algorithm allows multiple replicas for each primary task, and a primary task and its replicas can be assigned to the same processor. After determining the assignment of all primary tasks and their replicas under the given deadline, the algorithm calculates the maximum reliability that the system can achieve. We conduct a series of simulation experiments, including randomly generated graphs with various characteristics and real-world applications, to evaluate the proposed algorithm.
The main contributions of this paper are summarized as follows:
• We propose a novel, adaptive, transient fault-tolerant scheduling algorithm to solve the transient fault-tolerant problem. The algorithm can dynamically determine the number of replicas based on the given deadline and obtain as much reliability as possible.
• We propose a task assignment algorithm to assign each task to its best-fit processor.
• Simulation results show that the proposed algorithm can always keep a higher reliability than existing related fault-tolerant algorithms within the given deadline.
The remainder of this paper is organized as follows. Section II defines the models and the problem under discussion. Section III gives a motivational example to illustrate the importance of proper fault-tolerant scheduling. Section IV proposes a novel, adaptive, transient fault-tolerant scheduling algorithm. Section V compares the proposed algorithm with existing related algorithms to evaluate its performance, and Section VI concludes this paper.

II. MODELS AND PROBLEM DEFINITION
In this section, we first describe the system model, task model, and fault model. Then, we offer the definition of the problem discussed in this paper.

A. SYSTEM MODEL
The system targeted in this study is composed of a set of heterogeneous processors labeled $p_1, p_2, \ldots, p_M$, where $M$ is the total number of processors. Denote by $P = \{p_1, p_2, \ldots, p_M\}$ the set of processors. These processors are connected by communication links and can communicate with each other. Different links between processors have different data transmission rates. The transmission rate of a link (namely its bandwidth) is measured in bits per second [35], [36]. We suppose that the communication bandwidth between any two processors is symmetrical and use $B_{ij}$ to denote the communication bandwidth from processor $p_i$ to processor $p_j$; as a result, $B_{ij} = B_{ji}$. We also assume that all interprocessor communications are performed without contention.

B. TASK MODEL
An application is generally modeled by a weighted directed acyclic graph (DAG) $G = \langle V, E \rangle$. $V$ is the set of nodes and contains $N$ nodes $v_1, v_2, \ldots, v_N$. Each node represents a task. In this paper, we use the terms "node" and "task" interchangeably. $E \subseteq V \times V$ is the set of edges corresponding to precedence relations between tasks. For instance, the directed edge $(v_i, v_j)$ connecting node $v_i$ to node $v_j$ indicates that task $v_j$ cannot start executing until task $v_i$ has finished execution. The computational heterogeneity of tasks means that the execution time of a task differs across the available processors in a system. Denote by $ET_{ij}$ the execution time of task $v_i$ on processor $p_j$. When the redundancy technique is used to provide transient fault tolerance, a task may have multiple replicas. We use $v_i$ and $v_i^{b_j}$ to label the primary task and its $j$th replica, and they have the same execution time when executed on the same processor. In a DAG, a task may need the output generated by other tasks as its input, in which case data transfer happens. Let $data(v_i, v_j)$ be the data volume transferred from task $v_i$ to task $v_j$. If two tasks are mapped to the same processor, the communication time is zero. Otherwise, if $v_i$ is mapped to processor $p_m$ and $v_j$ is mapped to processor $p_k$, then the communication time from task $v_i$ to task $v_j$ is computed by
$$c_{ij} = \frac{data(v_i, v_j)}{B_{mk}}. \tag{1}$$
We assume that all the tasks have a shared deadline and that tasks are executed non-preemptively.
If there is an edge $(v_i, v_j)$ from node $v_i$ to node $v_j$, then $v_i$ is said to be a predecessor (parent) of $v_j$, and $v_j$ a successor (child) of $v_i$. A task $v_i$ may have multiple predecessors or multiple successors. We use $pre(v_i)$ to denote the set of all predecessors of task $v_i$ and $succ(v_i)$ to denote the set of all its successors. Fig. 1 gives an example of a DAG. In a DAG, a task without any predecessor is called an entry task, and a task without any successor is called an exit task. A DAG may have multiple entry tasks and multiple exit tasks. Without loss of generality, suppose there is only one entry task and one exit task in a DAG. The DAG is finished only when its exit task is finished. Therefore, the finish time (namely the makespan) is obtained by
$$makespan = FT(v_{exit}), \tag{2}$$
where $FT(v_{exit})$ is the finish time of task $v_{exit}$.
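A minimal Python sketch of the task model may help fix the notation. The class name `TaskGraph`, the function `comm_time`, and the four-task example values are illustrative assumptions and do not come from the paper; only the rules (zero cost on the same processor, otherwise data volume divided by the link bandwidth, and the definitions of predecessors, successors, entry and exit tasks) follow the text.

```python
from dataclasses import dataclass, field


@dataclass
class TaskGraph:
    """Weighted DAG: exec_time[i][j] is ET_ij, data[(i, j)] is data(v_i, v_j)."""
    exec_time: list                             # N x M matrix of execution times
    data: dict = field(default_factory=dict)    # edge (i, j) -> data volume

    def pre(self, i):
        """Predecessors of task i."""
        return sorted(a for (a, b) in self.data if b == i)

    def succ(self, i):
        """Successors of task i."""
        return sorted(b for (a, b) in self.data if a == i)

    def entry_tasks(self):
        """Tasks without any predecessor."""
        return [i for i in range(len(self.exec_time)) if not self.pre(i)]

    def exit_tasks(self):
        """Tasks without any successor."""
        return [i for i in range(len(self.exec_time)) if not self.succ(i)]


def comm_time(graph, bandwidth, i, j, m, k):
    """Communication time from task i on processor m to task j on processor k."""
    if m == k:
        return 0.0                      # same processor: no communication cost
    return graph.data[(i, j)] / bandwidth[m][k]


# Hypothetical 4-task diamond on 2 processors (values are illustrative only).
g = TaskGraph(
    exec_time=[[4, 6], [3, 5], [7, 2], [5, 5]],
    data={(0, 1): 8, (0, 2): 6, (1, 3): 4, (2, 3): 10},
)
B = [[0, 2], [2, 0]]                    # symmetric bandwidth, B[m][k] == B[k][m]
```

Storing edges in a dictionary keyed by `(i, j)` keeps the data volume next to the precedence relation; a production scheduler would more likely precompute adjacency lists.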

C. FAULT MODEL
At run-time, failures may occur due to various reasons, such as hardware failures, electromagnetic interference, and the effects of cosmic ray radiation. Based on the common exponential distribution assumption in reliability research [37], for every processor the arrival of failures follows a Poisson distribution, with the $\lambda$ value representing the expected number of failures per unit time. Different processors have different $\lambda$ values. Denote by $\lambda_1, \lambda_2, \ldots, \lambda_M$ the $\lambda$ values for processors $p_1, p_2, \ldots, p_M$, respectively, and let $\Lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_M\}$. The failure distribution in $t$ unit times for processor $p_j$ can be defined as
$$f(k, \lambda_j, t) = \frac{(\lambda_j t)^k e^{-\lambda_j t}}{k!}, \tag{3}$$
where $k$ is the number of actual failures in $t$ unit times. The reliability of task $v_i$ executed on processor $p_j$ is the probability of its successful execution, that is, the probability $e^{-\lambda_j ET_{ij}}$ that no failure occurs while it runs. Task $v_i$ is successfully executed if it or at least one of its replicas finishes successfully. So the probability of successful execution of $v_i$ is computed by
$$R(v_i) = 1 - \prod_{i' \in C_i} \left(1 - e^{-\lambda_{p(i')} ET_{i' p(i')}}\right), \tag{4}$$
where $C_i$ is the set of indexes for $v_i$ and its replicas, and $p(i')$ is the index of the processor that $v_{i'}$ is assigned to. Therefore, the reliability of a system with $N$ tasks is calculated by
$$R = \prod_{i=1}^{N} R(v_i). \tag{5}$$
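The fault model can be sketched numerically as follows. The sketch assumes that one copy of a task succeeds exactly when no Poisson failure arrives during its execution (probability $e^{-\lambda \cdot ET}$), that copies fail independently, and that the system succeeds only if every task does. The function names are hypothetical.

```python
import math


def copy_reliability(lam, exec_time):
    """Probability of zero Poisson failures during one execution: exp(-lam * ET)."""
    return math.exp(-lam * exec_time)


def task_reliability(copies):
    """copies: list of (lam, exec_time) pairs for the primary task and each replica.

    The task succeeds if at least one copy succeeds (independence assumed).
    """
    all_fail = 1.0
    for lam, et in copies:
        all_fail *= 1.0 - copy_reliability(lam, et)
    return 1.0 - all_fail


def system_reliability(tasks):
    """tasks: list of per-task copy lists; every task must succeed."""
    r = 1.0
    for copies in tasks:
        r *= task_reliability(copies)
    return r
```

For instance, `system_reliability([[(0.015, 4)], [(0.013, 6), (0.014, 6)]])` evaluates a two-task system in which only the second task has a replica; the rates and execution times in that call are made up for illustration.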

D. PROBLEM
We address the problem as follows: Given a system composed of a set of $M$ heterogeneous and connected processors $P = \{p_1, p_2, \ldots, p_M\}$, and an application represented by a DAG $G = \langle V, E \rangle$ with a shared deadline $D$, the goal is to find a fault-tolerant scheduling scheme that maximizes the system reliability while satisfying task dependencies and meeting the given deadline. The problem can be mathematically described as follows:
$$\text{maximize } R \quad \text{subject to} \quad makespan \le D \text{ and the precedence constraints in } E. \tag{6}$$
Since the problem is NP-hard [31], a heuristic algorithm is proposed to cope with it. Redundancy is an efficient technique to improve the system reliability. Thus, the heuristic adopts the active backup strategy. It first calculates the maximum number of replicas that the system can accommodate within the given deadline. Then it computes the final system reliability. Different schedules lead to different makespans and system reliability, and so do different numbers of replicas. This is illustrated in depth in the following section.

III. A MOTIVATIONAL EXAMPLE
An example is presented here to illustrate that different fault-tolerant scheduling schemes have a significant impact on the makespan of an application and on the system reliability when the application is executed on a heterogeneous system. Suppose that the application is shown in Fig. 1(a) and the execution times of its five tasks on a system with three heterogeneous processors are shown in Fig. 1(b). For simplicity, we assume that the communication bandwidth between any two processors is 1. In some schemes, each task has one replica, whereas in other schemes, each task may have multiple replicas. Denote by $v_i^j$ the $j$th replica of task $v_i$, $1 \le i \le 5$, $j \ge 1$. In the first scheme, shown in Fig. 2(a), every task has one replica, and the primary task and its replica are assigned to different processors: $v_1, v_2, v_3, v_4, v_5$ are assigned to processors $p_2, p_3, p_1, p_3, p_2$, respectively, while $v_1^1, v_2^1, v_3^1, v_4^1, v_5^1$ are assigned to processors $p_3, p_2, p_3, p_2, p_1$, respectively. The makespan is 30. In the second scheme, shown in Fig. 2(b), each task still has one replica, but unlike the first scheme, the primary task and its replica may be assigned to the same processor. The makespan is 26. In the third scheme, shown in Fig. 2(c), each task has multiple replicas. The makespan is 35. Assume that processors $p_1, p_2, p_3$ have the $\lambda$ values $\Lambda = \{0.015, 0.014, 0.013\}$. The reliability calculated by (5) for the three scheduling schemes in Fig. 2, from left to right, is 97.54%, 97.67%, and 99.76%, respectively.
Different scheduling schemes generate different reliability and different makespans. Usually, more replicas mean higher reliability. Compared with the first two schemes, the third scheme has more replicas and the highest reliability. This raises the question of whether more replicas are always better.
Fig. 3 shows what happens when the number of replicas increases from one to three using the third scheme above. In Fig. 3(a), each task has one replica, the makespan is 26 time units, and the reliability is 97.67%. In Fig. 3(b), each task has two replicas, the makespan is 35 time units, and the reliability is 99.76%. In Fig. 3(c), each task has three replicas, the makespan is 43 time units, and the reliability is 99.98%, but the given deadline of 35 time units is missed. From Fig. 3, we can see that (1) when the number of replicas increases, the reliability and makespan also increase; and (2) once the number of replicas reaches a certain value and the reliability is high enough (i.e., larger than 99%), adding more replicas does not significantly increase the reliability.
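The diminishing return of additional replicas can be reproduced with a small numerical experiment. The sketch below does not use the execution times of Fig. 1(b); it assumes five tasks whose copies all run for a hypothetical 5 time units on a processor with failure rate 0.014, and counts copies as the primary task plus its replicas.

```python
import math

LAM = 0.014       # failure rate per time unit (within the range used in the example)
ET = 5.0          # hypothetical execution time of every copy
N_TASKS = 5       # number of tasks, as in the motivational example


def system_reliability(copies_per_task):
    """Reliability when every task runs `copies_per_task` identical copies."""
    copy_fails = 1.0 - math.exp(-LAM * ET)           # one copy is hit by a fault
    task_ok = 1.0 - copy_fails ** copies_per_task    # at least one copy survives
    return task_ok ** N_TASKS                        # every task must succeed


levels = [system_reliability(c) for c in range(1, 5)]
gains = [b - a for a, b in zip(levels, levels[1:])]

for copies, r in enumerate(levels, start=1):
    print(f"{copies} copies per task: reliability = {r:.4f}")
```

Under these assumptions each extra copy raises the reliability by less than the previous one, which mirrors observation (2) above, while the makespan cost of every additional copy stays roughly the same.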
Although more replicas can provide higher reliability, they consume more resources such as CPU, time, and memory. Consequently, it is necessary and important to design an efficient fault-tolerant scheduling method that improves the system reliability under various constraints.
Notations used in this paper are summarized in Table 1.

IV. THE PROPOSED ALGORITHM
In this section, we propose a novel, adaptive and transient fault-tolerant scheduling algorithm (AFTSA) which is the third scheme mentioned in Section III to solve our stated problem. For the convenience of reading, we first explain some concepts and formulas, and then present the proposed algorithm in detail.

A. CONCEPTS AND FORMULAS
$EST(v_i, p_k)$, the earliest execution start time of task $v_i$ on processor $p_k$, is computed by
$$EST(v_i, p_k) = \max\{avail(p_k), RT(v_i, p_k)\}, \tag{7}$$
where $avail(p_k)$ is the time at which processor $p_k$ finishes its last assigned task and becomes idle. $RT(v_i, p_k)$ is the time when $v_i$ has received all the data from its predecessors and is ready to be executed on $p_k$. It is computed by Equation (8), where $pro(v_x)$ is the processor to which the primary task $v_x$ is assigned and $pro(v_x^{b_j})$ is the processor to which the $j$th replica $v_x^{b_j}$ is assigned. $c_{xi}$ ($c_{x^j i}$) is the communication time from $v_x$ ($v_x^{b_j}$) to $v_i$.
$EFT(v_i, p_k)$, the earliest execution finish time of task $v_i$ on processor $p_k$, is computed by
$$EFT(v_i, p_k) = EST(v_i, p_k) + ET_{ik}. \tag{9}$$
$AFT(v_x, pro(v_x))$, the actual finish time of primary task $v_x$ on its assigned processor $pro(v_x)$, is computed by
$$AFT(v_x, pro(v_x)) = \min_{p_k \in P} EFT(v_x, p_k), \tag{10}$$
where the minimizing $p_k$ is the processor on which $v_x$ has the minimum earliest execution finish time among all processors, so that $pro(v_x) = p_k$. $AFT(v_x^{b_j}, pro(v_x^{b_j}))$, the actual finish time of the $j$th replica of $v_x$ on processor $pro(v_x^{b_j})$, is computed by
$$AFT(v_x^{b_j}, pro(v_x^{b_j})) = \min_{p_k \in P} EFT(v_x^{b_j}, p_k). \tag{11}$$
$FT(v_i)$ is the finish time of $v_i$. If $v_i$ has no replicas, then we have
$$FT(v_i) = AFT(v_i, pro(v_i)). \tag{12}$$
Otherwise, the finish time of $v_i$ is the time when both itself and its replicas have finished execution. Thus, the finish time of $v_i$ is computed by
$$FT(v_i) = \max\Big\{AFT(v_i, pro(v_i)), \max_{v_i^{b_j} \in Backup(v_i)} AFT(v_i^{b_j}, pro(v_i^{b_j}))\Big\}, \tag{13}$$
where $Backup(v_i)$ is the set of replicas of task $v_i$.
$rank_u(v_i)$, the uprank of task $v_i$, is used to determine the scheduling order of $v_i$ and is computed as follows:
$$rank_u(v_i) = \overline{ET_i} + \max_{v_j \in succ(v_i)} \{\overline{c_{ij}} + rank_u(v_j)\}, \tag{14}$$
where $\overline{ET_i}$ is the average execution time of task $v_i$ on all processors and $\overline{c_{ij}}$ is the average time consumed when task $v_i$ communicates with task $v_j$. We have
$$\overline{c_{ij}} = \frac{data(v_i, v_j)}{\overline{B}}, \tag{15}$$
where $\overline{B}$ is the average data transmission rate among processors. It can be calculated by
$$\overline{B} = \frac{\sum_{i=1}^{M} \sum_{j=1, j \ne i}^{M} B_{ij}}{M(M-1)}. \tag{16}$$
A larger uprank value implies a higher priority. For example, if $rank_u(v_1) > rank_u(v_2)$, then the priority of $v_1$ is higher than that of $v_2$, and $v_1$ is scheduled earlier than $v_2$. If $rank_u(v_1) = rank_u(v_2)$, the tie is broken by executing the task with the smaller index first.
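The uprank can be computed by a memoized recursion from the exit task backwards. The sketch below assumes the standard upward-rank recurrence (average execution time plus the maximum, over successors, of average communication time plus the successor's rank, with an exit task's rank equal to its average execution time). The function names `upward_ranks` and `schedule_order` and the input layout are hypothetical.

```python
from functools import lru_cache


def upward_ranks(exec_time, data, avg_bandwidth):
    """Return rank_u for every task.

    exec_time[i][j] is ET_ij, data[(i, j)] is the data volume on edge (v_i, v_j),
    and avg_bandwidth is the average transmission rate among processors.
    """
    n = len(exec_time)
    succ = {i: [] for i in range(n)}
    for (i, j) in data:
        succ[i].append(j)
    avg_et = [sum(row) / len(row) for row in exec_time]

    @lru_cache(maxsize=None)
    def rank(i):
        # An exit task has no successors, so its rank is its average ET.
        tail = max(
            (data[(i, j)] / avg_bandwidth + rank(j) for j in succ[i]),
            default=0.0,
        )
        return avg_et[i] + tail

    return [rank(i) for i in range(n)]


def schedule_order(ranks):
    """Non-increasing uprank; ties are broken by the smaller task index."""
    return sorted(range(len(ranks)), key=lambda i: (-ranks[i], i))
```

Because every execution time is positive, a task always ranks strictly higher than its successors, so the resulting order is also a valid topological order of the DAG.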

B. ALGORITHM DESCRIPTION
The basic idea of the proposed AFTSA is as follows. First, it calculates the $rank_u$ values of all tasks and stores the tasks in descending order of their $rank_u$ values. Next, it calculates the makespan without backup tasks (replicas) by calling a scheduling function. Then, it calculates the maximum number of replicas for each primary task that the system can accommodate within the given deadline. Finally, it computes the maximum reliability that the system can achieve. AFTSA is advantageous in that it makes good use of the given time to dynamically generate the maximum number of replicas for each task and thus obtain higher reliability in limited time, allows the primary task and its replicas to be assigned to the same processor, and assigns the primary task and its replicas to the processors with the minimum earliest execution finish time among all processors. We now describe the proposed algorithm in detail.
Algorithm 1 outlines AFTSA. The input is a DAG $G = \langle V, E \rangle$, a deadline $D$, and $M$ processors $p_1, p_2, \ldots, p_M$. The output is a schedule for all tasks of $G$ that achieves the maximum system reliability. Firstly, the proposed algorithm calculates the $rank_u$ values for all tasks by Equation (14), stores the tasks in non-increasing order of $rank_u(v_i)$ in a list $list$, and initializes the parameters $makespan$, $CN$, and $Fmakespan$ to zero (lines 1-3). Secondly, it invokes the function shown in Algorithm 2 to get the makespan without considering replicas (line 4). Thirdly, it uses a while loop to calculate the maximum number of replicas for each task that the system can accommodate within the given deadline $D$ (lines 5-15). The inner while loop from line 7 to line 14 checks whether the system can tolerate $CN$ replicas for each task, until the makespan exceeds $D$ or all tasks have been copied. In order to reduce overhead, the maximum value of $CN$ is set to be two. That is, the number of replicas for each primary task shall be no larger than two. Fourthly, it checks whether the makespan exceeds the given deadline. If the makespan is larger than $D$, then the last task added to the list $clist$ is removed (line 17). Finally, it updates the makespan and calculates the reliability that the system can achieve under the final scheduling scheme by Equation (5) (lines 19-20).

Algorithm 1 AFTSA
Input: A DAG G = <V, E>, a common deadline D, P = {p_1, p_2, ..., p_M};
Output: A schedule scheme for all tasks of G that achieves maximum system reliability;
1: compute rank_u(v_i) for ∀v_i ∈ V by (14);
2: store all tasks by a non-increasing order of rank_u(v_i) in a list list;
3: makespan ← 0, CN ← 0, Fmakespan ← 0;
4: makespan ← TaskAssignment(list, ∅, P, CN);
5: while makespan <= D && CN < 3 do
6:     k ← 0, clist ← ∅, CN += 1;
7:     while makespan <= D && k < N do
8:         Fmakespan ← makespan;
9:         for j = 0 to CN − 1 do
10:            clist ← clist ∪ {list[k]};
11:        end for
12:        makespan ← TaskAssignment(list, clist, P, CN);
13:        k++;
14:    end while
15: end while
16: if makespan > D then
17:    clist ← clist − {list[k − 1]};
18: end if
19: makespan ← Fmakespan;
20: compute the system reliability R by (5).

Algorithm 2 shows how to map the primary tasks and their replicas to proper processors. The input is three arrays $list$, $clist$, $P$ and a parameter $CN$. The output is the makespan. While $list$ is not empty, the algorithm first selects the task with the maximum uprank value, calculates its earliest execution finish time on all processors, assigns it to the processor with the minimum earliest execution finish time, and updates the makespan. Then, the algorithm calculates the assignment for the replicas of the selected task and updates the makespan. The time complexity of Algorithm 2 is $O(NM|E|)$.
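The following Python sketch illustrates the overall idea under loudly stated assumptions and is not the authors' implementation. First, a task is assumed to wait for the data of every copy of every predecessor (one possible reading of the ready time). Second, the outer loop is a simplified variant in which replica counts accumulate round by round and the last feasible schedule is kept, which differs from a literal reading of Algorithm 1. The names `task_assignment`, `aftsa_sketch`, and `max_replicas` are hypothetical.

```python
def task_assignment(order, replicas, exec_time, data, bandwidth):
    """List-schedule primaries and replicas; return (makespan, placement).

    order must be a topological order (e.g. non-increasing uprank);
    replicas maps a task index to its number of replicas.
    """
    n_proc = len(bandwidth)
    avail = [0.0] * n_proc               # avail(p_k)
    placed = {}                          # task -> [(processor, finish_time), ...]
    pre = {}
    for (a, b) in data:
        pre.setdefault(b, []).append(a)

    def ready_time(i, k):
        # Assumption: v_i waits for the data sent by every copy of every predecessor.
        rt = 0.0
        for x in pre.get(i, []):
            for (m, ft) in placed[x]:
                c = 0.0 if m == k else data[(x, i)] / bandwidth[m][k]
                rt = max(rt, ft + c)
        return rt

    makespan = 0.0
    for i in order:
        chosen = []
        for _ in range(1 + replicas.get(i, 0)):      # primary first, then replicas
            # EFT = max(avail, RT) + ET on each processor; pick the minimum.
            eft = [max(avail[k], ready_time(i, k)) + exec_time[i][k]
                   for k in range(n_proc)]
            k_best = min(range(n_proc), key=lambda k: eft[k])
            avail[k_best] = eft[k_best]
            chosen.append((k_best, eft[k_best]))
            makespan = max(makespan, eft[k_best])
        placed[i] = chosen
    return makespan, placed


def aftsa_sketch(order, exec_time, data, bandwidth, deadline, max_replicas=2):
    """Greedily add replicas in priority order while the deadline still holds."""
    replicas = {}
    makespan, placed = task_assignment(order, replicas, exec_time, data, bandwidth)
    if makespan > deadline:
        raise ValueError("deadline cannot be met even without replicas")
    for level in range(1, max_replicas + 1):
        for i in order:
            trial = dict(replicas)
            trial[i] = level
            m, p = task_assignment(order, trial, exec_time, data, bandwidth)
            if m > deadline:             # keep the last feasible schedule
                return replicas, makespan, placed
            replicas, makespan, placed = trial, m, p
    return replicas, makespan, placed
```

Rebuilding the whole schedule for every trial keeps the sketch short; an implementation concerned with running time would update the schedule incrementally. The returned placement can then be fed to a reliability function such as Equation (5).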

V. EXPERIMENT
Algorithm 2 TaskAssignment(list, clist, P, CN)
Input: list, clist, P, CN;
Output: makespan;
1: while list != ∅ do
2:     v_i ← the task with maximum rank_u(v_i) in list;
3:     for j = 1 to M do
4:         calculate EFT(v_i, p_j) by (9);
5:     end for
6:     schedule v_i on processor pro(v_i) which makes v_i have the minimum EFT;
7:     compute AFT(v_i, pro(v_i)) by (10);
8:     if makespan < AFT(v_i, pro(v_i)) then
9:         makespan ← AFT(v_i, pro(v_i));
10:    end if
11:    for j = 1 to CN do
12:        for k = 1 to M do
13:            calculate EFT(v_i^{b_j}, p_k) by (9);
14:        end for
15:        if clist != ∅ && v_i ∈ clist then
16:            schedule v_i^{b_j} on processor pro(v_i^{b_j}) which makes v_i^{b_j} have the minimum EFT;
17:            compute AFT(v_i^{b_j}, pro(v_i^{b_j})) by (11);
18:            if makespan < AFT(v_i^{b_j}, pro(v_i^{b_j})) then
19:                makespan ← AFT(v_i^{b_j}, pro(v_i^{b_j}));
20:            end if
21:        end if
22:    end for
23: end while
24: return makespan.

To evaluate the effectiveness of the proposed algorithm, several series of experiments have been conducted on a computer with the 64-bit Windows 10 operating system, a dual-processor Intel(R) Core(TM) CPU @ 2.2 GHz, and 8 GB of RAM. Two algorithms, DB-FTSA [20] and FTSA [31], are selected as comparison baselines. FTSA is an excellent fault-tolerant algorithm for solving reliability problems with no timing constraint. It requires that the primary task and its replicas be assigned to different processors. DB-FTSA is a recent and efficient fault-tolerant algorithm that solves a problem similar to ours. It supports at most one replica and allows the primary task and its replica to be assigned to the same processor. Randomly generated application graphs with various characteristics and real-world applications are adopted as test instances. In the following, we present the experimental parameters and metrics.

A. EXPERIMENTAL SETTING
In the experiments, we use randomly generated directed acyclic graphs, which have been widely used in many studies, such as [38], [39]. To generate a directed acyclic graph, several parameters are needed:
• $N$, the number of tasks in a directed acyclic graph; its value ranges from 20 to 100 in increments of 20.
• The indegree of a task; its value is randomly generated from the interval [0, 4].
• The outdegree of a task; its value is randomly generated from the interval [0, 4].
• CCR, the communication-to-computation ratio, which is equal to the average communication time divided by the average computation time in a system. Like many other studies, the value of CCR is selected from {0.1, 0.5, 1, 2, 5}.
To account for communication heterogeneity in the system, the unit data delay of the processors is chosen uniformly from the range [0, 2]. In addition, we consider two system platforms. One is composed of four connected processors and the other consists of eight connected processors. The failure rate $\lambda$ of the processors is randomly generated from the range $[1 \times 10^{-2}, 2 \times 10^{-2}]$.
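The paper does not spell out the generation procedure, so the following Python sketch is only one plausible way to draw a test instance from the parameters listed above. The function names, the uniform spread of execution times around the average, and the scaling of data volumes by CCR (with unit bandwidth assumed) are all assumptions made for illustration.

```python
import random


def random_dag(n, ccr, avg_et, n_proc, max_degree=4, seed=0):
    """Draw a random DAG with at most `max_degree` parents and children per task."""
    rng = random.Random(seed)
    # Execution times spread around the requested average (assumed +/- 50%).
    exec_time = [[rng.uniform(0.5 * avg_et, 1.5 * avg_et) for _ in range(n_proc)]
                 for _ in range(n)]
    out_deg = [0] * n
    data = {}
    for j in range(1, n):
        # Only earlier tasks may be parents, so the graph is acyclic by construction.
        candidates = [i for i in range(j) if out_deg[i] < max_degree]
        k = min(rng.randint(0, max_degree), len(candidates))
        for i in rng.sample(candidates, k):
            # Data volume around ccr * avg_et, assuming unit bandwidth.
            data[(i, j)] = rng.uniform(0.5, 1.5) * ccr * avg_et
            out_deg[i] += 1
    return exec_time, data


def failure_rates(n_proc, seed=0):
    """Failure rate of each processor, drawn uniformly from [0.01, 0.02]."""
    rng = random.Random(seed)
    return [rng.uniform(1e-2, 2e-2) for _ in range(n_proc)]
```

A call such as `random_dag(60, ccr=1, avg_et=15, n_proc=4)` yields an instance in the spirit of the sixty-task benchmark; since a task may draw zero parents, the graph can have several entry tasks, which the task model permits.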
We choose two metrics to assess the performance of the proposed algorithm as follows:
• Reliability: it is an important metric to measure whether a fault-tolerant algorithm is effective. A higher reliability means a better performance in system reliability.
• Makespan: it is the time to complete all tasks of a given application. A shorter makespan means a lower system delay.
In what follows, we present and discuss the experimental results in detail. Fig. 4 shows the makespan and reliability produced by algorithms AFTSA, DB-FTSA, and FTSA for different benchmarks with different numbers of tasks when CCR = 1, the average execution time ET = 15, and the number of processors M = 4 under the given deadline. It is observed from Fig. 4(a) that DB-FTSA performs best in terms of makespan, while AFTSA performs slightly worse than DB-FTSA but better than FTSA. This is probably because DB-FTSA considers at most one replica while AFTSA and FTSA allow multiple replicas, and more replicas take more execution time. Also, AFTSA allows the primary task and its replicas to be assigned to the same processor, while FTSA requires the primary task and its replicas to be assigned to different processors, which incurs additional communication time and a longer makespan. As the number of tasks increases, the makespan produced by all three algorithms increases. Fig. 4(b) shows that, in terms of reliability, AFTSA always maintains higher reliability than the other algorithms. The reasons are manifold: AFTSA can generate multiple replicas for each primary task, which improves reliability; FTSA needs much more time to generate more replicas to provide higher reliability; and DB-FTSA cannot make full use of the given time, which leads to lower reliability. As the number of tasks increases, the reliability generated by all three algorithms decreases. Fig. 5 shows the makespan and reliability produced by algorithms AFTSA, DB-FTSA, and FTSA when CCR = 1, the average execution time ET = 15, and the number of processors M = 8 under the given deadline.
On the whole, Fig. 4 and Fig. 5 convey similar information: for all three algorithms, as the number of tasks increases, the makespan increases and the reliability decreases. It is worth pointing out that AFTSA obtains higher reliability. The difference is that, for the same benchmark, the platform with eight processors leads to a lower makespan and higher reliability than the platform with four processors. This is because more resources help to reduce the burden on the system and improve its performance, which is precisely the advantage of parallel computing with multiple processors. Table 2 shows the makespan and reliability of the benchmarks in Fig. 4 produced by algorithms AFTSA, DB-FTSA, and FTSA when the given deadline increases. It is observed that when the deadline increases, the makespan and reliability generated by all three algorithms also increase. Once the deadline reaches a certain value, the makespan and reliability generated by AFTSA and DB-FTSA remain unchanged as the deadline continues to increase. This is because both algorithms bound the number of replicas per task (DB-FTSA tolerates up to one replica): when the deadline is small, the two algorithms cannot reach the maximum number of replicas for each task and produce a small makespan and reliability; when the deadline is large enough, the two algorithms reach the maximum number of replicas for each task and produce a large makespan and reliability. If the deadline becomes larger still, the makespan and reliability stay the same. In contrast, the makespan and reliability of FTSA keep increasing because it can support more and more replicas as the deadline increases. Moreover, AFTSA obtains higher reliability and a substantially smaller makespan than FTSA, which helps to save time and resources.
Fig. 6 shows the reliability produced by AFTSA, DB-FTSA, and FTSA for different benchmarks when the average execution time and CCR vary on the platform with M = 4 processors under the given deadline. Fig. 6(a) shows the reliability obtained for benchmarks with N = 40, CCR = 1, and different average execution times. Table 3 shows the deadline and the makespan generated by AFTSA, DB-FTSA, and FTSA for each benchmark. AFTSA obtains the highest reliability, followed by FTSA and then DB-FTSA. This is because AFTSA can produce more replicas for each primary task and thus achieves higher reliability than FTSA, while DB-FTSA supports no more than two replicas for each primary task. More replicas usually mean higher reliability. Fig. 6(b) shows the reliability obtained for benchmarks with N = 50, ET = 15, and different CCR values. AFTSA still achieves better reliability than the other two algorithms.
Fig. 7 shows the reliability produced by AFTSA, DB-FTSA, and FTSA for different benchmarks when the average execution time and CCR vary on the platform with M = 8 processors under the given deadline. Table 4 shows the deadline and the makespan generated by AFTSA, DB-FTSA, and FTSA for each benchmark in Fig. 7. As in Fig. 6, AFTSA is more reliable than the other two algorithms within the deadline. The only difference is that, for the same benchmark, the platform with eight processors achieves higher reliability than the platform with four processors, since more resources reduce the load on the system and improve efficiency.
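The trade-off discussed above, in which more replicas raise reliability but consume more time, can be illustrated with a minimal sketch. It assumes a fault model that this section does not spell out: transient faults arrive as a Poisson process with a constant rate, so a single copy that runs for time t succeeds with probability exp(−λt), and a task succeeds if at least one of its copies succeeds. The function names and the fault rate are hypothetical and are not taken from AFTSA, DB-FTSA, or FTSA.

```python
import math


def copy_reliability(fault_rate: float, exec_time: float) -> float:
    """Probability that one copy finishes fault-free (assumed Poisson fault model)."""
    return math.exp(-fault_rate * exec_time)


def task_reliability(fault_rate: float, exec_time: float, copies: int) -> float:
    """A task succeeds if at least one of its copies (primary plus replicas) succeeds."""
    r = copy_reliability(fault_rate, exec_time)
    return 1.0 - (1.0 - r) ** copies


def serial_time(exec_time: float, copies: int) -> float:
    """Processor time used when all copies run back to back on one processor."""
    return exec_time * copies


if __name__ == "__main__":
    fault_rate = 0.005  # hypothetical transient fault rate per time unit
    exec_time = 15.0    # matches the average execution time ET = 15
    for copies in (1, 2, 3):
        rel = task_reliability(fault_rate, exec_time, copies)
        print(copies, round(rel, 6), serial_time(exec_time, copies))
```

Under these assumptions each extra copy increases the task reliability with diminishing returns while the time spent grows linearly, which is why a deadline limits how many replicas are worthwhile.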

C. EXPERIMENTAL RESULTS AND DISCUSSIONS FOR REAL-WORLD APPLICATIONS
This subsection considers three types of real-world applications, Gaussian elimination [40], fast Fourier transform (FFT) [39], and a molecular dynamics code [38], which are used to test the effectiveness of the proposed algorithm. Task graph instances for each of these three applications from [40] are shown in Fig. 8, Fig. 9, and Fig. 10.

1) GAUSSIAN ELIMINATION
The structure of the data-flow graph for Gaussian elimination applications is already known, so we only need the number of tasks N and the execution time of the tasks. N is computed as N = (MS^2 + MS − 2)/2 [40], where MS is the size of the coefficient matrix. Fig. 11 shows the makespan and reliability produced by AFTSA, DB-FTSA, and FTSA for Gaussian elimination applications with different matrix sizes when CCR = 1, ET = 15, and the deadline D = MS × (ET + ET × CCR) on the platform with four processors. The matrix size increases from 5 (corresponding to 14 tasks) to 20 (corresponding to 209 tasks) in increments of 3. In terms of makespan, DB-FTSA performs best, AFTSA performs slightly worse than DB-FTSA, and FTSA obtains a much larger makespan than the other two algorithms. In terms of reliability, AFTSA always performs best, and DB-FTSA does better than FTSA in most cases. Fig. 12 shows the makespan and reliability produced by AFTSA, DB-FTSA, and FTSA for Gaussian elimination applications with different matrix sizes when CCR = 1, ET = 15, and the deadline D = MS × (ET + ET × CCR) on the platform with eight processors. It tells a similar story to Fig. 11, with two differences. First, in terms of reliability, AFTSA performs best, followed by FTSA and then DB-FTSA. Second, on the platform with eight processors, all three algorithms obtain a smaller makespan and higher reliability, and the makespan generated by AFTSA is very close to that generated by DB-FTSA; when the matrix size is less than or equal to 14, AFTSA and DB-FTSA have the same makespan. The reason is that more resources generally provide more opportunities for reducing makespan and improving reliability.
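The task count and deadline used for the Gaussian elimination benchmarks can be checked with a short sketch. It evaluates N = (MS^2 + MS − 2)/2 and D = MS × (ET + ET × CCR) for the matrix sizes quoted above; the function names are hypothetical and only the two formulas come from the text.

```python
def gaussian_task_count(ms: int) -> int:
    """Number of tasks N = (MS^2 + MS - 2) / 2 for matrix size MS."""
    # MS^2 + MS = MS * (MS + 1) is always even, so integer division is exact.
    return (ms * ms + ms - 2) // 2


def gaussian_deadline(ms: int, et: float = 15.0, ccr: float = 1.0) -> float:
    """Deadline D = MS * (ET + ET * CCR)."""
    return ms * (et + et * ccr)


if __name__ == "__main__":
    # Matrix sizes 5, 8, 11, 14, 17, 20 (increment of 3).
    for ms in range(5, 21, 3):
        print(ms, gaussian_task_count(ms), gaussian_deadline(ms))
```

For MS = 5 and MS = 20 the sketch gives 14 and 209 tasks, matching the counts stated above, with deadlines of 150 and 600 time units when ET = 15 and CCR = 1.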

2) FAST FOURIER TRANSFORM
The structure of the data-flow graph for one-dimensional, recursive FFT applications is known [39], so we only need the number of tasks and the execution time of the tasks. The FFT algorithm with an input vector of size S has 2S − 1 recursive call tasks and S log₂ S butterfly operation tasks. Fig. 13 shows the makespan and reliability produced by AFTSA, DB-FTSA, and FTSA for FFT applications with different vector sizes when ET = 15, CCR = 1, and the deadline D = (2 × log₂ S + 1) × (ET + ET × CCR) on the platform with four processors. S varies from 2 to 32, doubling at each step. In terms of makespan, DB-FTSA performs best, followed by AFTSA and then FTSA, and the gap between them widens as the vector size increases. In terms of reliability, AFTSA always achieves the highest reliability among the three algorithms, especially when the vector size is large. After all, with limited resources, more replicas generally mean a longer makespan and higher reliability, and assigning a primary task and its replicas to different processors increases communication time and lengthens the makespan. Fig. 14 shows the makespan and reliability produced by AFTSA, DB-FTSA, and FTSA for FFT applications with different vector sizes when ET = 15, CCR = 1, and the deadline D = (2 × log₂ S + 1) × (ET + ET × CCR) on the platform with eight processors. Fig. 13 and Fig. 14 lead to similar observations. All three algorithms obtain a smaller makespan and higher reliability for each benchmark on the platform with eight processors, and the makespans obtained by AFTSA and DB-FTSA are very close to each other but much smaller than that of FTSA, especially when the vector size is large. The reason is that more processors offer more opportunities to reduce makespan and improve reliability.
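The FFT benchmark sizes and deadlines can be tabulated with a small sketch. It assumes that the logarithm in the butterfly count is base 2, as in the deadline formula, and that S is a power of two; the function names are hypothetical.

```python
def log2_exact(s: int) -> int:
    """Return log2(s) for a power of two s; raise otherwise."""
    if s < 1 or s & (s - 1) != 0:
        raise ValueError("vector size must be a power of two")
    return s.bit_length() - 1


def fft_task_count(s: int) -> int:
    """(2S - 1) recursive call tasks plus S * log2(S) butterfly tasks."""
    return (2 * s - 1) + s * log2_exact(s)


def fft_deadline(s: int, et: float = 15.0, ccr: float = 1.0) -> float:
    """Deadline D = (2 * log2(S) + 1) * (ET + ET * CCR)."""
    return (2 * log2_exact(s) + 1) * (et + et * ccr)


if __name__ == "__main__":
    # Vector sizes 2, 4, 8, 16, 32.
    for s in (2, 4, 8, 16, 32):
        print(s, fft_task_count(s), fft_deadline(s))
```

Under these assumptions the task count grows from 5 tasks at S = 2 to 223 tasks at S = 32, while the deadline grows only from 90 to 330 time units, which is consistent with the widening makespan gap at large vector sizes.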

3) MOLECULAR DYNAMICS CODE
The structure of the data-flow graph for molecular dynamics code applications and the number of tasks are known [40], so we only consider two factors in the experiments: CCR and the average execution time of the tasks. Fig. 15 shows the makespan and reliability generated by AFTSA, DB-FTSA, and FTSA for molecular dynamics code applications at different values of CCR when ET = 15 and M = 4 within the given deadline. AFTSA obtains the highest reliability, followed by FTSA and then DB-FTSA. FTSA has the largest makespan, followed by AFTSA and then DB-FTSA. This is because AFTSA and FTSA support more replicas for each task than DB-FTSA, and more replicas usually mean higher reliability and a longer makespan. In addition, AFTSA allows the primary task and its replicas to be assigned to the same processor, which reduces communication time and makespan. When CCR is smaller than or equal to 1, the gap between the makespans of any two algorithms is not very large; when CCR is larger than 1, the makespans of all three algorithms increase dramatically, and the makespans obtained by AFTSA and DB-FTSA are quite close to each other but much smaller than that of FTSA. Fig. 16 shows the makespan and reliability generated by AFTSA, DB-FTSA, and FTSA for molecular dynamics code applications at different values of CCR when ET = 15 and M = 8 within the given deadline. It conveys similar information to Fig. 15. The difference is that on the platform with eight processors, all three algorithms obtain a smaller makespan and higher reliability for the same benchmark. This is because more processors offer more opportunities to reduce makespan and improve reliability.
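The effect of CCR on makespan when copies are placed on different processors can be sketched as follows. The sketch assumes the common DAG scheduling convention that data transfer between two tasks on the same processor costs nothing, and that the average communication time equals ET × CCR; the function names and the CCR values are hypothetical.

```python
def data_ready_time(pred_finish: float, comm_time: float,
                    pred_proc: int, succ_proc: int) -> float:
    """Earliest time a successor can use its predecessor's output."""
    # Assumption: communication is free when both tasks share a processor.
    transfer = 0.0 if pred_proc == succ_proc else comm_time
    return pred_finish + transfer


def average_comm_time(et: float, ccr: float) -> float:
    """Assumed average communication time implied by the ratio CCR."""
    return et * ccr


if __name__ == "__main__":
    et = 15.0
    for ccr in (0.5, 1.0, 5.0):  # hypothetical CCR values
        comm = average_comm_time(et, ccr)
        same = data_ready_time(et, comm, pred_proc=0, succ_proc=0)
        cross = data_ready_time(et, comm, pred_proc=0, succ_proc=1)
        print(ccr, same, cross)
```

Under these assumptions the same-processor data-ready time is independent of CCR, while the cross-processor time grows linearly with it, which is consistent with the sharp makespan increase observed for CCR larger than 1.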
From the experimental results and analysis above, we conclude that AFTSA can always obtain higher reliability within limited time than DB-FTSA and FTSA, and that more resources provide more opportunities to reduce makespan and improve reliability.

VI. CONCLUSION
In this paper, we propose a novel, adaptive, transient fault-tolerant scheduling algorithm to solve the fault-tolerance problem on heterogeneous systems, aiming to optimize reliability within a given deadline. The proposed algorithm allows multiple replicas, which has been shown to improve reliability. The primary task and its replicas can be assigned to the same processor, which helps to decrease makespan. The proposed algorithm first works out the maximum number of replicas for each primary task that the system can tolerate within the given deadline. Then it assigns all primary tasks and their replicas to appropriate processors and computes the maximum reliability that the system can achieve. Simulation results show that the proposed algorithm maintains higher reliability than existing related fault-tolerant algorithms. However, when the number of tasks in an application is large and the given deadline is large enough, the reliability obtained by the proposed algorithm is lower than that obtained by FTSA. In the future, we will look for a better way to solve this problem. We will also explore how to reduce energy consumption while improving reliability, and use a more sophisticated fault model for this type of problem.

CHUNHUA DENG received the Ph.D. degree in pattern recognition and intelligent systems from the School of Automation, Huazhong University of Science and Technology, Wuhan, China, in 2016. He is currently a Lecturer with the School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan. His current research interests include computer vision, pattern recognition, and machine learning.
VOLUME 8, 2020