Type-Aware Federated Scheduling for Typed DAG Tasks on Heterogeneous Multicore Platforms

To utilize the performance benefits of heterogeneous multicore platforms in real-time systems, we need task models that expose the parallelism and heterogeneity of the workload, such as typed DAG tasks, as well as scheduling algorithms that effectively exploit this information. In this article, we introduce type-aware federated scheduling algorithms for sporadic typed DAG tasks with implicit deadlines running on a heterogeneous multicore platform with two different types of cores. In type-aware federated scheduling, a task is executed under one of three strategies: Exclusive Allocation, Semi-Exclusive Allocation, and Sequential and Share. In Exclusive Allocation, clusters of cores of both core types are exclusively allocated to tasks, while in Semi-Exclusive Allocation cores of only one type are exclusively allocated to tasks. The workload of the other type from tasks in Semi-Exclusive Allocation and the workload from tasks in Sequential and Share share the cores that are not exclusively allocated to any task. We prove that our type-aware federated scheduling algorithm has a capacity augmentation bound of 7.25. We also show that no constant capacity augmentation bound can be obtained without Semi-Exclusive Allocation. Compared to the state of the art, the type-aware federated scheduling algorithm achieves better schedulability, especially for task sets with skewed workloads.


INTRODUCTION
The development of heterogeneous multicore platforms has been thriving in recent years. A heterogeneous multicore platform consists of multiple types of execution units, each with different performance and energy characteristics. Heterogeneous platforms aim to provide higher performance and better energy efficiency than homogeneous platforms. One concrete example of heterogeneous computing systems is the integration of main processing units with accelerators. For example, NVIDIA Tegra [19] and Samsung Exynos [21] SoCs integrate ARM processors with GPUs, while Xilinx Versal [24] integrates processors with AI accelerators on one chip.
To fully utilize the potential of a heterogeneous multicore system, we can analyze the tasks and determine the appropriate execution unit for running each code segment. Typed directed acyclic graphs (typed DAGs) are commonly used to model parallel real-time tasks running on heterogeneous multicore platforms. In a typed DAG task, each vertex represents a code segment that must be executed sequentially on a particular type of execution unit. Fig. 1 shows an example of a typed DAG task.
Scheduling typed DAG tasks with real-time constraints on heterogeneous multicore platforms is an emerging research topic. Most of the prior work [3], [9], [16], [18] focuses on scheduling un-typed real-time DAG tasks on heterogeneous platforms, and proposes methods for determining the core type each code segment (i.e., each vertex in a DAG) should be executed on. For typed DAG tasks, Han et al. [13] analyze the worst-case response time (WCRT) of a typed DAG task running on heterogeneous multicore platforms, and propose WCRT bounds with self-sustainability [2]. In their follow-up paper [14], they propose a federated scheduling algorithm [17] for typed DAG tasks running on heterogeneous multicore platforms.
We note that the state of the art for federated scheduling of typed DAG tasks considers only two execution modes, i.e., heavy and light, independently of the heterogeneity of the workload distribution. However, in Section 3 we prove that no federated scheduling approach with only two execution modes (like that in [14]) can yield a constant capacity augmentation bound as long as tasks with a density greater than 1 are classified as heavy. Capacity augmentation bounds [17] are one of the standard metrics to quantify the performance of scheduling algorithms for real-time systems.
In this paper, we introduce a type-aware federated scheduling algorithm for scheduling sporadic typed DAG tasks with implicit deadlines on a heterogeneous multicore platform with two types of cores. In our type-aware federated scheduling, each task is executed following one of three strategies: 1) Exclusive Allocation: a cluster of cores consisting of both core types is exclusively allocated to the task. 2) Semi-Exclusive Allocation: a cluster of cores consisting of one core type is exclusively allocated to the task. Workload of the other type is scheduled sequentially on a single core shared with other tasks. 3) Sequential and Share: both types of workload in the task are scheduled with other tasks on shared cores. Workload within a task is executed sequentially. The formal definition of each execution strategy is given in Section 4. We explain how tasks are scheduled, and analyze their corresponding schedulability in Section 5. We then prove that our type-aware federated scheduling algorithm has a capacity augmentation bound of 7.25 in Section 6.
In the type-aware federated scheduling algorithm developed in Section 5, we adopt several rigid "enforcement rules" [8] to simplify the structure of the scheduling problem, and to allow the derivation of a capacity augmentation bound. Specifically, purely based on the parameters of a task, these enforcement rules determine the number of cores exclusively allocated to tasks in Exclusive Allocation and Semi-Exclusive Allocation strategies in Section 5.2. While such enforcement rules often yield constant capacity augmentation bounds, as reported by Chen et al. [8], they may also harm performance in practice by unnecessarily constraining scheduling.
Thus, in Section 7, we go on to explore an improved algorithm with the same capacity augmentation bound but without explicit enforcement rules. The improved algorithm is based on four principles: 1) a sequence of attempts is made to determine the most appropriate execution strategy instead of a greedy decision based solely on the parameters of a task, 2) preference is given to sharing over exclusive allocation where possible, 3) the number of exclusively allocated cores is minimized for Semi-Exclusive Allocation, and 4) combinatorial optimization is applied for Exclusive Allocation. By scheduling a task set with fewer dedicated cores, the improved type-aware federated scheduling algorithm can achieve higher schedulability in practice without sacrificing the augmentation bound.
To summarize, the contributions of this paper are as follows: We design a type-aware federated scheduling algorithm for scheduling sporadic typed DAG tasks with implicit deadlines on a heterogeneous multicore platform with two core types in Section 5. We prove a capacity augmentation bound of 7.25 for our type-aware federated algorithm in Section 6. We improve the type-aware federated scheduling algorithm by eliminating enforcement rules in Section 7. The improved algorithm maintains a capacity augmentation bound of 7.25 and is shown to exhibit better performance in our experimental evaluation. The evaluation results show that our type-aware federated scheduling algorithms achieve better schedulability on synthetic workload compared to the state of the art, especially for task sets with a skewed workload.

SYSTEM MODEL AND ANALYSIS BACKGROUND
In this paper, we focus on a heterogeneous multicore platform with two different types of execution units, i.e., cores. Let Q = {a, b} be the set of core types. Existing heterogeneous computing systems such as NVIDIA Tegra [19] and Samsung Exynos [21] integrate two types of execution units, i.e., CPUs and GPUs, on one chip.
We present the task model in Section 2.1 and the problem formulation in Section 2.2. Capacity augmentation bounds are defined in Section 2.3, followed by a summary of suspension-aware schedulability analysis in Section 2.4, which we rely on in our new type-aware federated scheduling algorithm.

Typed DAG Task
A typed DAG task t_i is characterized by a tuple (G_i = (V_i, E_i), g_i, v_i, T_i, D_i), whose components are as follows.
V_i is a set of vertices, in which a vertex corresponds to a piece of code that must be executed sequentially.
An edge (u, v) in E_i indicates a precedence constraint on the execution order of the vertices u and v in V_i, i.e., v cannot start its execution before u finishes when (u, v) ∈ E_i.
g_i : V_i → Q is a function that assigns each vertex v in V_i to its core type. Thus, in a typed DAG task, each vertex is explicitly bound to be executed on a specific type of core.
v_i : V_i → R+ is a function that defines the WCET of each vertex v in V_i on its assigned core type. We assume v_i(v) > 0 for any v in V_i.
T_i > 0 is the minimum amount of time between two consecutive releases of t_i.
D_i is the relative deadline of t_i. If a job of the task is released at time r_i, all of its vertices must finish their executions no later than r_i + D_i.

Fig. 1. Example of a typed DAG task with two types of vertices. Each circular and rectangular vertex represents a code segment that must be executed sequentially on an execution unit of the corresponding type. The number in a vertex indicates the WCET of the vertex.
Based on the information specified above, we can derive the WCET of a task t_i on type a and type b cores as
C_i^a = Σ_{v ∈ V_i : g_i(v) = a} v_i(v) and C_i^b = Σ_{v ∈ V_i : g_i(v) = b} v_i(v).
For example, in Fig. 1, C_i^a = 24 for type a (circular) and C_i^b = 12 for type b (rectangular) vertices.
We define a path p in G_i as a sequence of vertices connected via edges that starts at a source vertex, i.e., a vertex without predecessors, and ends at a sink vertex, i.e., a vertex without successors. We use Paths(G_i) to denote the set of all paths in G_i. The length of a path is the sum of the WCETs of the vertices on the path. The path with the longest length is called the critical path. We denote the critical path length of G_i as L_i. As the underlying graph is acyclic, L_i can be computed in linear time based on a topological ordering of the vertices. We further define L_i^a (respectively, L_i^b) to be the length of the critical path in G_i by considering only the execution times of type a (respectively, type b) vertices. That is,
L_i^a = max_{p ∈ Paths(G_i)} Σ_{v ∈ p : g_i(v) = a} v_i(v) and L_i^b = max_{p ∈ Paths(G_i)} Σ_{v ∈ p : g_i(v) = b} v_i(v),
where Paths(G_i) is the set of all paths through G_i. L_i^a (respectively, L_i^b) can be computed in the same way as L_i by temporarily setting the weights of all vertices of type b (respectively, type a) to zero and finding the longest path in the modified DAG, e.g., based on a topological ordering. By definition, L_i^a ≤ L_i, L_i^b ≤ L_i, and L_i ≤ L_i^a + L_i^b. In this paper, we consider implicit-deadline task systems, in which D_i = T_i for every task t_i. The utilizations of task t_i on type a and type b cores are defined as U_i^a = C_i^a / T_i and U_i^b = C_i^b / T_i, respectively. Furthermore, the worst-case response time R_i of task t_i is an upper bound on the response time of all jobs of t_i. Due to the assumption of implicit-deadline tasks, we have R_i ≤ T_i for every task t_i if t_i meets its deadline.
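To make the path-length definitions concrete, the following sketch computes L_i, L_i^a, and L_i^b in a single topological pass. The graph representation (vertex list, edge pairs, a type map, and a WCET map) is an illustrative choice, not prescribed by the paper.

```python
from collections import defaultdict, deque

def typed_path_lengths(vertices, edges, vtype, wcet):
    """Return (L, L_a, L_b): the longest path length of the DAG counting all
    weights, only type-a weights, and only type-b weights, respectively.

    vertices: iterable of vertex ids; edges: list of (u, v) precedence pairs;
    vtype: dict vertex -> 'a' or 'b'; wcet: dict vertex -> WCET (> 0).
    """
    succ = defaultdict(list)
    indeg = {v: 0 for v in vertices}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1

    # Counting only one type is equivalent to zeroing the other type's weights.
    def weight(v, kind):
        return wcet[v] if kind == 'all' or vtype[v] == kind else 0

    # dist[v][k]: longest length of a path ending at v under weighting k.
    dist = {v: {k: weight(v, k) for k in ('all', 'a', 'b')} for v in vertices}
    queue = deque(v for v in vertices if indeg[v] == 0)  # source vertices
    while queue:
        u = queue.popleft()
        for v in succ[u]:
            for k in ('all', 'a', 'b'):
                dist[v][k] = max(dist[v][k], dist[u][k] + weight(v, k))
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return tuple(max(dist[v][k] for v in vertices) for k in ('all', 'a', 'b'))
```

For a chain u → v → w with types a, b, a and WCETs 3, 2, 4, this yields L = 9, L^a = 7, and L^b = 2, satisfying L^a ≤ L, L^b ≤ L, and L ≤ L^a + L^b.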

Problem Formulation
We consider scheduling a set of N implicit-deadline sporadic typed DAG tasks T = {t_1, t_2, ..., t_N} on a heterogeneous multicore platform with M^a type a cores and M^b type b cores, where the parameters and characteristics of each sporadic DAG task are defined in Section 2.1. Our objective is to design a scheduling algorithm that generates a task-to-core mapping and a schedule for T, so that all jobs of the tasks in T finish before their deadlines. For simplicity of presentation, we implicitly assume that C_i^a > 0 and C_i^b > 0 for every task t_i ∈ T. The following theorem shows that the problem is NP-hard in the strong sense even for special cases.
Theorem 1. The studied problem is NP-hard in the strong sense, even when M^a = 1, M^b = 1, and the task set consists of a single task.
Proof. We show that this special case of the typed DAG scheduling problem is identical to a special case of the job shop scheduling problem. In the job shop scheduling problem, given n jobs and m machines, where a job must be processed on the machines in a given order, the objective is to minimize the makespan for completing all jobs. We consider job shop scheduling with two shops (type a and type b), in which each shop has one machine (M^a = 1 and M^b = 1). Specifically, the scheduling problem to minimize the makespan is denoted as J2|chains, p_ij = 1|C_max in the three-field classification notation of scheduling problems and is NP-hard in the strong sense [23, Table 3].
Next, we construct an input instance of the studied problem, reduced from the decision version of the J2|chains, p_ij = 1|C_max problem. Consider an instance with n jobs in the J2|chains, p_ij = 1|C_max problem, in which D is given as the makespan constraint of the schedule. Each job must be executed on the two shops, alternating several times. Each execution in a shop takes one time unit. We can reduce this instance to an input instance of the studied problem by mapping the n jobs to one single task with n chains and a deadline D. A chain is a sequence of vertices in which each vertex has only one predecessor and one successor, except for the head and tail vertices. In each chain, the execution alternates between type a and type b vertices, each with unit execution time. Since J2|chains, p_ij = 1|C_max can be reduced to the studied problem in polynomial time, we reach the conclusion. □

Capacity Augmentation Bound
The capacity augmentation bound, originally proposed in [17], is a metric for analyzing the quality of a scheduling algorithm. We first recall its definition for homogeneous multiprocessor systems. A scheduling algorithm has a capacity augmentation bound of 1/r (0 < r ≤ 1) if any task set T that satisfies the following conditions is schedulable by the algorithm on M cores:
Σ_{t_i ∈ T} U_i ≤ r · M, and L_i ≤ r · T_i for every task t_i ∈ T,
where L_i is the length of the critical path of task t_i. Since the total utilization Σ_{t_i ∈ T} U_i can be calculated in linear time, capacity augmentation bounds immediately yield efficient schedulability tests. Li et al. [17] proved that federated scheduling for implicit-deadline sporadic DAG tasks on homogeneous multiprocessor systems has a capacity augmentation bound of 2. They also showed that a scheduling algorithm with a capacity augmentation bound of 1/r also guarantees a resource augmentation bound (speed-up factor) of 1/r. For our studied problem with two core types, a scheduling algorithm has a capacity augmentation bound of 1/r (0 < r ≤ 1) if any task set T that satisfies the following conditions is schedulable by the algorithm on M^a type a cores and M^b type b cores:
Σ_{t_i ∈ T} U_i^a ≤ r · M^a, Σ_{t_i ∈ T} U_i^b ≤ r · M^b, and L_i ≤ r · T_i for every task t_i ∈ T.
Chen [4] showed that federated scheduling does not admit constant speedup bounds for constrained-deadline task systems. Therefore, we focus on implicit-deadline tasks.
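As a concrete illustration, the two-type conditions above can be checked in linear time; the dictionary-based task representation below is an assumption for illustration only.

```python
def satisfies_capacity_conditions(tasks, Ma, Mb, r):
    """Check the two-type capacity augmentation conditions for a given r.

    tasks: list of dicts with illustrative keys 'Ca', 'Cb', 'L', 'T'.
    An algorithm with capacity augmentation bound 1/r guarantees
    schedulability of any task set for which this returns True.
    """
    total_Ua = sum(t['Ca'] / t['T'] for t in tasks)  # sum of U_i^a
    total_Ub = sum(t['Cb'] / t['T'] for t in tasks)  # sum of U_i^b
    return (total_Ua <= r * Ma and total_Ub <= r * Mb
            and all(t['L'] <= r * t['T'] for t in tasks))
```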

Existing Suspension-Aware Analysis
We intend to analyze the studied problem using a technique originally applied to the analysis of self-suspending tasks under preemptive static-priority scheduling on uniprocessor systems. To motivate the application of suspension-aware analysis for uniprocessor systems, consider the following two example settings: 1) A Semi-Exclusive Allocation task is assigned two exclusive cores of type a and executes its type b workload sequentially on a single type b core shared with other tasks. In our example in Fig. 2a, core b_1 is the shared type b core. From the perspective of this core, the task's executions on its exclusive cores can be modeled as suspensions, and one is left with a uniprocessor scheduling problem on core b_1. The maximum suspension time can be analyzed separately, based on the number of exclusively assigned cores. 2) Similarly, a Sequential and Share task assigned to one shared type a core and one shared type b core can be modeled as a suspending task from each core's perspective. We now summarize an existing jitter-based suspension analysis for static-priority preemptive scheduling that we later employ in the analysis of Semi-Exclusive Allocation tasks. How to properly map a given Semi-Exclusive Allocation task to this self-suspension model is discussed in Section 4.2.
Let t_k be a dynamic self-suspending task with a worst-case execution time C_k > 0 and a maximum suspension time S_k ≥ 0. Suppose that hp(t_k) is the set of the higher-priority self-suspending tasks running on the same core as task t_k. Further assume that R_i is an upper bound on the worst-case response time of t_i with R_i ≤ T_i for t_i ∈ hp(t_k). A (sufficient) schedulability test of an implicit-deadline task t_k under static-priority preemptive scheduling due to Chen et al. [7] is
∃t with 0 < t ≤ T_k such that C_k + S_k + Σ_{t_i ∈ hp(t_k)} ⌈(t + R_i − C_i) / T_i⌉ · C_i ≤ t. (1)
We note that Chen et al. [7] further proposed a unifying schedulability test framework, which can also be applied in our analysis without affecting our theoretical results. Here, we use the jitter-based analysis for simplicity of presentation. We employ Eq. (1) in Theorem 9 to validate the schedulability of a Semi-Exclusive Allocation task.
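A minimal sketch of evaluating a test of the form of Eq. (1): it searches for the smallest t satisfying the condition via the standard fixed-point iteration of the right-hand side (an implementation choice, not prescribed by [7]).

```python
import math

def suspension_schedulable(Ck, Sk, Tk, hp):
    """Jitter-based schedulability test for a dynamic self-suspending task
    with WCET Ck, maximum suspension time Sk, and implicit deadline Tk.

    hp: list of (Ci, Ti, Ri) tuples for higher-priority tasks on the same
    core, with Ri <= Ti assumed. Returns the smallest fixed point t <= Tk
    satisfying the test (usable as a response time bound), or None if the
    test fails.
    """
    t = Ck + Sk  # lower bound on any fixed point
    while t <= Tk:
        # Right-hand side demand with release jitter R_i - C_i.
        demand = Ck + Sk + sum(
            math.ceil((t + Ri - Ci) / Ti) * Ci for (Ci, Ti, Ri) in hp)
        if demand <= t:
            return t  # condition holds for this t
        t = demand    # the right-hand side is monotone in t
    return None
```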

LIMITATION OF EXISTING METHODS
In federated scheduling, tasks are classified as heavy or light based on some metrics. For example, the state of the art proposed by Han et al. [14] classifies the tasks based on their density, i.e., the ratio of their total WCET to their deadline. In their paper, a task is heavy if its density is greater than 1; otherwise it is light. A heavy task is allocated to dedicated cores utilizing its DAG structure for potential parallel execution, whilst the vertices of a light task are sequentially executed on the remaining cores in competition with other light tasks.
We demonstrate that such federated scheduling with only two execution strategies does not yield a constant augmentation bound for scheduling typed DAG tasks on heterogeneous multi-core systems, due to tasks with skewed workload, i.e., heavy workload on one core type but extremely light workload on the other type. The heavy workload makes it impossible to schedule such tasks in the light execution strategy, while scheduling them in the heavy execution strategy wastes resources, which may ultimately result in non-schedulability due to an insufficient number of cores in the system.

Theorem 2. Federated scheduling with only the light and heavy execution strategies does not admit a constant capacity augmentation bound; on a platform with M^a type a cores and one type b core, its capacity augmentation bound is at least M^a.
Proof. Consider the following example. Suppose we have a heterogeneous multi-core system consisting of M^a > 1 type a cores and 1 type b core. We are given two fully parallel typed DAG tasks t_1 and t_2, i.e., there are no dependencies between vertices of different types within each task. Both of them have the same period T ≫ 1. Task t_1 has density h > 1 and is therefore classified as heavy, so the dedicated cores allocated to it must include the single type b core.
As no type b core remains available, it is not possible to execute the light task t_2. Therefore, federated scheduling with only the heavy and light execution strategies is not able to schedule both tasks whenever h > 1. Since T ≫ 1 and h > 1, the task set satisfies the conditions of a capacity augmentation bound of M^a / h. Therefore, the capacity augmentation bound of such federated scheduling is at least M^a / h, which approaches M^a when h approaches 1. □

EXECUTION MODES AND ALLOCATIONS
The limitation of existing federated scheduling shown in Theorem 2 can be overcome by introducing a third execution strategy, Semi-Exclusive Allocation, described in the introduction of this paper. As there are two types of cores, this third execution strategy has two concrete incarnations, resulting in four concrete execution modes in total. We use notation similar to that of federated scheduling in the literature to name these four execution modes (resulting from three execution strategies): Heavy^ab, Light, Heavy^a, and Heavy^b, in which the former two are adopted in the state of the art [14], whilst the latter two correspond to the Semi-Exclusive Allocation strategy introduced in this paper.

Fig. 2. Suspending behavior of a DAG task on shared cores. (a) A Semi-Exclusive Allocation task can be modeled as a self-suspending task running on core b_1; (b) suspending behavior of a Sequential and Share task from the perspectives of core a and core b.
For a task t_i in the Heavy^ab mode, a cluster of cores consisting of m_i^a type a cores and m_i^b type b cores is exclusively allocated to the task, where m_i^a and m_i^b are positive integers. (Section 4.1)
For a task t_i in the Light mode, both types of vertices are scheduled on one core of the corresponding type together with other tasks. Vertices within a task are executed sequentially. (Section 4.3)
For a task t_i in the Heavy^a mode, m_i^a type a cores are exclusively allocated to task t_i. Vertices of type b of task t_i are sequentially scheduled on one type b core together with other tasks. (Section 4.2)
For a task t_i in the Heavy^b mode, m_i^b type b cores are exclusively allocated to task t_i. Vertices of type a of task t_i are sequentially scheduled on one type a core together with other tasks. (Section 4.2)
In this section, we explain how tasks in these four modes are scheduled. How to determine which mode a task is in and how many cores are exclusively allocated to each task is discussed in Section 5.

Exclusive Allocation
A task t_i is in the Heavy^ab mode if m_i^a type a cores and m_i^b type b cores are exclusively allocated to the task. Under this scenario, there is no inter-task interference from other tasks, as only task t_i is executed on these cores. Therefore, the schedulability of task t_i depends only upon the internal schedule of t_i on the m_i^a type a cores and m_i^b type b cores. In this section, we assume that m_i^a and m_i^b are given. The details on the determination of m_i^a and m_i^b are discussed in Sections 5 and 7.
Deriving a feasible schedule meeting the timing constraint of t_i under the specified m_i^a and m_i^b is a challenging problem. One approach is to formulate the scheduling problem as a combinatorial problem and solve it with constraint programming [20], which incurs a high complexity per problem instance. As different combinations of m_i^a and m_i^b have to be considered in our algorithm, using constraint programming is a solution with very high complexity. Our paper adopts an alternative solution, which applies work-conserving scheduling algorithms to schedule task t_i on the dedicated cores. The list scheduling algorithm has been analyzed and adopted in the literature. Specifically, list scheduling for a DAG task executed on only one core type has been widely explored in real-time systems. As an example, consider that t_i is a DAG task with only type a workload. The analysis from Graham [11] shows that the makespan of a list schedule of a job of such a task on m_i^a cores is upper bounded by L_i + (C_i^a − L_i)/m_i^a, where L_i is the critical path length of task t_i. If this upper bound is no more than T_i, then the jobs of t_i always meet their timing constraints on the m_i^a cores assigned exclusively to t_i.
Extending the analysis of list scheduling to a typed DAG task with multiple core types, as in the problem studied in this paper, has been recently provided by Han et al. [13]. Their analysis for two core types can be summarized as follows:
Lemma 3. Suppose that a task t_i in the Heavy^ab mode is scheduled by list scheduling on its m_i^a dedicated type a cores and m_i^b dedicated type b cores. The worst-case response time of t_i is at most
max_{p ∈ Paths(G_i)} { L(p, a) + L(p, b) + (C_i^a − L(p, a))/m_i^a + (C_i^b − L(p, b))/m_i^b }, (2)
where Paths(G_i) is the set of all paths in G_i, and L(p, a) and L(p, b) are the sums of the WCETs of the type a vertices and the type b vertices on the path p, respectively. That is, L(p, a) is Σ_{v ∈ p : g_i(v) = a} v_i(v).
Proof. This comes from Theorem 3.1 in [13]. We rephrase their Eq. (4) by taking the maximum among all paths instead of defining a critical path. Moreover, Definition 3.1 in [13] defines a scaled graph in which the result is equivalent to the subtraction of the execution time (volume in their definition) by the contribution to the length on each core type. □
We can weaken the above condition in a way that is still sufficient for our capacity augmentation bound analysis. Specifically, an over-approximation of Eq. (2) can be achieved by separately considering the type a and the type b critical paths. The lengths of these paths are L_i^a and L_i^b, respectively.
Lemma 4. Suppose that a task t_i in the Heavy^ab mode is scheduled by list scheduling on its m_i^a dedicated type a cores and m_i^b dedicated type b cores. The worst-case response time of t_i is at most
L_i^a + L_i^b + (C_i^a − L_i^a)/m_i^a + (C_i^b − L_i^b)/m_i^b.
Proof. By definition, L(p, a) ≤ L_i^a and L(p, b) ≤ L_i^b for any path p ∈ Paths(G_i). We reach the conclusion as m_i^a ≥ 1 and m_i^b ≥ 1. □
Han et al. [13] show that the worst-case response time bound (the one we rephrased into Eq. (2)) is self-sustainable, i.e., adding more cores of one type for the exclusive execution of t_i does not make a feasible schedule of t_i infeasible. We note that Han et al. [13] also provide a tighter analysis. However, the weaker analysis suffices for our worst-case analysis here.
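The Lemma 4 bound is straightforward to evaluate. The sketch below uses the WCET values of Fig. 1 in its usage example (C^a = 24, C^b = 12), with hypothetical critical-path lengths and core counts chosen for illustration.

```python
def heavy_ab_response_bound(Ca, Cb, La, Lb, ma, mb):
    """Over-approximate worst-case response time of a Heavy_ab task under
    list scheduling on ma dedicated type-a and mb dedicated type-b cores
    (the Lemma 4 style bound)."""
    assert ma >= 1 and mb >= 1
    return La + Lb + (Ca - La) / ma + (Cb - Lb) / mb

def heavy_ab_schedulable(Ca, Cb, La, Lb, ma, mb, T):
    # Sufficient test: the bound must not exceed the period (= deadline).
    return heavy_ab_response_bound(Ca, Cb, La, Lb, ma, mb) <= T
```

For instance, with C^a = 24, C^b = 12, hypothetical L^a = 8 and L^b = 4, and a cluster of 4 type a and 2 type b cores, the bound is 8 + 4 + 16/4 + 8/2 = 20.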

Semi-Exclusive Allocation
If C_i^a ≥ T_i and C_i^b is positive but very small, exclusively allocating a type b core to such a t_i would be wasteful, possibly even rendering the task set unschedulable, as demonstrated by the example in the proof of Theorem 2. In this case, it is more resource-efficient to allow other tasks to also execute on the type b core that task t_i is assigned to. In other words, t_i should not be the only task executed on this type b core. Interestingly, the execution behavior of task t_i in this scenario can be modeled as a dynamic self-suspending task on a type b core, as shown in the example in Section 2.4.
Lemma 5. Suppose that t_i is in the Heavy^a mode, exclusively allocated m_i^a type a cores. The execution behavior of task t_i on the type b core with sequential execution can be modeled as a dynamic self-suspending task with worst-case execution time C_i^b and, under list scheduling, maximum suspension time S_i^b = L_i^a + (C_i^a − L_i^a)/m_i^a.
Proof. A task t_i in the Heavy^a mode can be modeled as a dynamic self-suspending task running on a type b core as follows. In task t_i, suspending from the execution on the type b core implies that only the workload of type a is executed. That is, the suspension time from the type b core is at most the amount of time needed to execute C_i^a on the dedicated m_i^a type a cores with the critical path length L_i^a. This is upper bounded by L_i^a + (C_i^a − L_i^a)/m_i^a using Lemma 4, as list scheduling is applied. By setting the worst-case execution time to C_i^b and the maximum suspension time to S_i^b, we can construct a dynamic self-suspending task t'_i running on a type b core which is a conservative approximation of task t_i. The symmetric case for a task in the Heavy^b mode is identical. □

Sequential Execution Without Exclusive Allocation
When both C_i^a and C_i^b are small, executing task t_i sequentially can also be a feasible option. In this treatment, we can consider that the vertices in V_i are ordered by a total order (obtained via topological sort) and executed one after another.
Definition 6. Suppose that a typed DAG task t_i is sequentially executed without any exclusive allocation of cores, i.e., t_i is in the Light mode. At any time t, if a job of task t_i is not completed yet, either the job executes or is blocked by some higher-priority job on a type a core (i.e., the job suspends from the type b core), or the job executes or is blocked by some higher-priority job on a type b core (i.e., the job suspends from the type a core). The execution can be modeled as a 3-tuple (C_i^a, C_i^b, T_i) with sequential executions in an interleaving manner on type a and type b cores. □

TYPE-AWARE FEDERATED SCHEDULING
In this section, we present our type-aware federated scheduling algorithm. We provide the schedulability analysis for each execution mode in Section 5.1. In Section 5.2, we describe how to determine the execution mode of a task, and how many cores are exclusively allocated to each task in the Heavy^ab, Heavy^a, and Heavy^b modes. As described in Section 4, there are four task execution modes: Heavy^ab, Heavy^a, Heavy^b, and Light. Table 1 summarizes the four execution modes.
We start with the definition of the type-aware federated static-priority preemptive scheduling for typed DAG tasks.
Definition 7. Type-aware federated static-priority preemptive scheduling for typed DAG tasks: 1) Each task is in one of the four task execution modes: Heavy^ab, Heavy^a, Heavy^b, and Light. 2) If task t_i is in the Heavy^ab, Heavy^a, or Heavy^b mode, a cluster of cores of the corresponding core types is dedicated to t_i. That is, each of these cores has only one task assigned to it.
3) For the core type without exclusive allocation to task t_i, the task is assigned to be executed on one assigned core. When multiple tasks are assigned to a core, static-priority preemptive scheduling is applied on the core. □
We use rate-monotonic priority assignment, i.e., a task t_i has a higher priority than a task t_k if T_i ≤ T_k, where ties are broken arbitrarily. For the rest of this paper, we use p to denote a core of type a and q to denote a core of type b. The sets of tasks that are assigned to core p and core q are denoted as C_p^a and C_q^b, respectively. C_p^a(t_k) (respectively, C_q^b(t_k)) is the set of tasks assigned to core p (respectively, q) that have higher priorities than t_k. Under type-aware federated static-priority preemptive scheduling, when a task t_i is in C_p^a(t_k) (respectively, C_q^b(t_k)) and the WCET of t_i on core p is C_i^a (respectively, on core q is C_i^b), a job of t_i can block a single job of t_k from execution for at most C_i^a time units on core p (respectively, C_i^b time units on core q).

Schedulability Analysis
Given the execution mode of a task, we can validate the schedulability of the task based on the following theorems.
Theorem 8. A task in the Heavy^ab mode meets its deadline if the worst-case response time bound in Lemma 3 or 4 is no more than its relative deadline.
Proof. As no other task is assigned to the cores exclusively allocated to the task, the theorem holds naturally. □
Theorems 9 and 10 are the schedulability tests for tasks in the Heavy^a, Heavy^b, and Light modes. As the tests in Theorems 9 and 10 require the worst-case response time R_i of a higher-priority task t_i, when analyzing the schedulability of t_k, we should also set the corresponding R_k after handling t_k accordingly. That is, R_k is set to the minimum t satisfying the corresponding condition applied to t_k in Eqs. (4), (5), or (6).
TABLE 1: Summary of the Four Execution Modes
Mode | Type a cores | Type b cores | Schedulability test
Heavy^ab | Exclusive allocation | Exclusive allocation | Response time analysis in [13] or constraint programming
Heavy^a | Exclusive allocation | Shared | Dynamic self-suspending task on a type b core, Eq. (4) in Theorem 9
Heavy^b | Shared | Exclusive allocation | Dynamic self-suspending task on a type a core, Eq. (5) in Theorem 9
Light | Shared | Shared | Sequential execution, based on [15] and restated in Theorem 10

Theorem 9. Suppose that R_i ≤ T_i for all t_i ∈ C_q^b(t_k). Task t_k in the Heavy^a mode assigned to core q (i.e., of core type b) meets its deadline if
∃t with 0 < t ≤ T_k such that C_k^b + S_k^b + Σ_{t_i ∈ C_q^b(t_k)} ⌈(t + R_i − C_i^b)/T_i⌉ · C_i^b ≤ t, (4)
where S_k^b = L_k^a + (C_k^a − L_k^a)/m_k^a as in Lemma 5. Suppose that R_i ≤ T_i for all t_i ∈ C_p^a(t_k). Symmetrically, t_k in the Heavy^b mode assigned to core p (i.e., of core type a) meets its deadline if
∃t with 0 < t ≤ T_k such that C_k^a + S_k^a + Σ_{t_i ∈ C_p^a(t_k)} ⌈(t + R_i − C_i^a)/T_i⌉ · C_i^a ≤ t, (5)
where S_k^a = L_k^b + (C_k^b − L_k^b)/m_k^b.
Proof. According to Lemma 5, the execution behavior of a task t_k in the Heavy^a mode (Heavy^b mode, respectively) can be modeled as a dynamic self-suspending task running on core q of type b (core p of type a, respectively). By substituting the worst-case execution time C_k and the maximum suspension time S_k with C_k^b and S_k^b (C_k^a and S_k^a, respectively) in the schedulability test in Eq. (1), we reach the conclusion. □
Theorem 10. Suppose that R_i ≤ T_i for all t_i ∈ C_p^a(t_k) ∪ C_q^b(t_k). Task t_k in the Light mode assigned to core p of core type a and core q of core type b meets its deadline if
∃t with 0 < t ≤ T_k such that C_k^a + C_k^b + Σ_{t_i ∈ C_p^a(t_k)} ⌈(t + R_i − C_i^a)/T_i⌉ · C_i^a + Σ_{t_i ∈ C_q^b(t_k)} ⌈(t + R_i − C_i^b)/T_i⌉ · C_i^b ≤ t. (6)
Proof. This comes from the symmetric view of execution on core p and on core q using Theorem 1 in [15] (stated in the Appendix, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TC.2022.3202748). In [15], Huang et al. provide a resource-centric symmetric timing analysis for real-time tasks on multi-core platforms with shared resources. We adopt their timing analysis for t_k by considering core p as a resource shared by all tasks in C_p^a. By setting B, i.e., the overhead for requesting the shared resource, in [15] to 0, the response time of t_k is upper bounded by X(t) + S(t) according to Theorem 1 in [15], where X(t) is the amount of time that t_k is accessing the shared resource (core p), and S(t) is the amount of time that t_k is suspended from the shared resource (core p). This also corresponds to Definition 6. Since X(t) is upper bounded by the execution of t_k and the higher-priority interference on core p, and S(t) is analogously bounded on core q, summing the two bounds yields the condition. □
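Combining the Lemma 5 suspension model with a jitter-based test of the form of Eq. (4) gives a simple sufficient test for a Heavy^a task. This sketch again uses smallest-fixed-point iteration as an assumed implementation choice; the parameter layout is illustrative.

```python
import math

def heavy_a_schedulable(Ca_k, Cb_k, La_k, ma_k, Tk, hp_on_q):
    """Sufficient schedulability test for a Heavy_a task t_k.

    Its type-a execution on ma_k dedicated cores is modeled as suspension
    time S_k^b = La_k + (Ca_k - La_k)/ma_k (Lemma 5 style), while its
    sequential type-b workload Cb_k competes on a shared type-b core q.
    hp_on_q: list of (Cb_i, Ti, Ri) for higher-priority tasks on core q.
    """
    Sb_k = La_k + (Ca_k - La_k) / ma_k  # maximum suspension time
    t = Cb_k + Sb_k
    while t <= Tk:  # smallest fixed point of the jitter-based recurrence
        demand = Cb_k + Sb_k + sum(
            math.ceil((t + Ri - Ci) / Ti) * Ci for (Ci, Ti, Ri) in hp_on_q)
        if demand <= t:
            return True
        t = demand
    return False
```

For example, a task with C^a = 8, L^a = 4, C^b = 1 on two dedicated type a cores has S^b = 4 + 4/2 = 6, so it passes with period 10 but fails with period 6 even without any higher-priority interference.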

Greedy Type-Aware Federated Scheduling Algorithm
After presenting the analysis and the scheduling philosophy, we present our scheduling algorithm based on a greedy approach. The algorithm consists of two parts.
In the first part, we classify the tasks in the input task set T into four classes based on a control parameter r, 0 < r ≤ 0.5: 1) Task t_i is in the Heavy^ab mode when C_i^a > rT_i and C_i^b > rT_i. 2) Task t_i is in the Heavy^a mode when C_i^a > rT_i and C_i^b ≤ rT_i. 3) Task t_i is in the Heavy^b mode when C_i^a ≤ rT_i and C_i^b > rT_i. 4) Task t_i is in the Light mode when C_i^a ≤ rT_i and C_i^b ≤ rT_i. For a given r, this step classifies the tasks in T into four disjoint sets. Here, we use H^ab, H^a, H^b, and LI to denote these four sets for simplicity. In the second part, we first determine the number of cores exclusively allocated to the tasks in H^ab, H^a, and H^b. If any of the resulting settings of m_i^a and/or m_i^b is negative, we return failure. Otherwise, we allocate the dedicated cores to the tasks in H^ab, H^a, and H^b accordingly. By enforcing the number of dedicated cores allocated to tasks with the above procedure, the derivation of the capacity augmentation bound is easier to present. This, however, may sacrifice schedulability.
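The classification step can be sketched as follows; the dictionary keys are illustrative, and the disjointness of the four classes follows directly from the two threshold comparisons.

```python
def classify(tasks, r):
    """Partition tasks into the four execution-mode classes for a control
    parameter r (0 < r <= 0.5). Each task is a dict with keys 'Ca', 'Cb',
    'T' (illustrative representation)."""
    H_ab, H_a, H_b, LI = [], [], [], []
    for t in tasks:
        heavy_a = t['Ca'] > r * t['T']  # type-a workload exceeds r*T_i
        heavy_b = t['Cb'] > r * t['T']  # type-b workload exceeds r*T_i
        if heavy_a and heavy_b:
            H_ab.append(t)
        elif heavy_a:
            H_a.append(t)
        elif heavy_b:
            H_b.append(t)
        else:
            LI.append(t)
    return H_ab, H_a, H_b, LI
```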
Let M♯_a and M♯_b be the remaining numbers of type a and type b cores after allocating the dedicated cores. If it is not possible to allocate enough dedicated cores, i.e., if M♯_a < 0 or M♯_b < 0, we return failure. Otherwise, we continue to schedule and partition the tasks in H_a, H_b, and LI to resolve the competition for the shared M♯_a type a cores and M♯_b type b cores. We order the priorities of the tasks in H_a, H_b, and LI in the rate-monotonic manner and start from the task with the highest priority (shortest T_i). Suppose the next task to be assigned is τ_k.
1) If τ_k is in LI, we try to assign it to a core p of core type a and a core q of core type b, and apply the schedulability test in Theorem 10. If no combination is feasible, we return failure; otherwise, one combination of p and q is selected, and τ_k is assigned onto it.
2) If τ_k is in H_a, we try to assign it to a core q of type b and apply the schedulability test in Theorem 9. If this is not possible, we return failure; otherwise, one q is selected, and τ_k is assigned onto it.
3) Symmetrically, if τ_k is in H_b, we try to assign it to a core p of type a as above.
After all tasks are feasibly assigned and scheduled, the task-to-core mapping is returned as a feasible solution.
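The rate-monotonic ordering and the greedy placement of the shared-core tasks can be sketched as below. The predicates `fits_light` and `fits_single` stand in for the schedulability tests of Theorems 10 and 9 and are hypothetical parameters of this sketch, not part of the paper's code.

```python
from collections import namedtuple

Task = namedtuple("Task", "name T")  # T is the period (implicit deadline)

def assign_shared(tasks_with_mode, M_a, M_b, fits_light, fits_single):
    """tasks_with_mode: list of (task, mode), mode in {'LI', 'H_a', 'H_b'}.
    Returns the per-core assignments, or None if some task cannot be placed."""
    cores_a = [[] for _ in range(M_a)]   # tasks assigned to each type a core
    cores_b = [[] for _ in range(M_b)]   # tasks assigned to each type b core
    # Rate-monotonic priorities: shorter period first.
    for task, mode in sorted(tasks_with_mode, key=lambda tm: tm[0].T):
        if mode == 'LI':                 # needs one core of each type
            pairs = [(p, q) for p in range(M_a) for q in range(M_b)
                     if fits_light(task, cores_a[p], cores_b[q])]
            if not pairs:
                return None              # no feasible combination: failure
            p, q = pairs[0]              # first fit; other strategies possible
            cores_a[p].append(task)
            cores_b[q].append(task)
        else:                            # H_a shares a type b core, H_b a type a core
            pool = cores_b if mode == 'H_a' else cores_a
            ok = [i for i, c in enumerate(pool) if fits_single(task, c)]
            if not ok:
                return None
            pool[ok[0]].append(task)
    return cores_a, cores_b
```

The sketch uses first fit for concreteness; the fitting strategies discussed below slot in at the point where one feasible core (or pair) is selected.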

Discussions of the parameters of the greedy algorithm
Selection of r: The greedy algorithm classifies the tasks into four execution modes according to the control parameter r. If we pick a small r, more tasks are in the Heavy_ab mode, which requires more exclusively allocated cores. On the other hand, a larger r results in more Light tasks. Therefore, r should be carefully selected; this is detailed in Section 6.
Fitting strategies: first fit, worst fit, best fit: For the tasks in the Light, Heavy a , and Heavy b modes, finding a core p of core type a and/or a core q of core type b can be formulated as a bin packing problem. We can apply existing fitting strategies such as First-Fit, Worst-Fit, or Best-Fit to find a core (or a pair of cores) that is schedulable for the task. For First-Fit, we assign the task to the first core (or first pair of cores in the Light mode) in the list that can schedule the task by following predefined indexes of cores. For Best-Fit (Worst-Fit, respectively), a task in the Heavy a or Heavy b mode is assigned to the feasible core that has the highest utilization (lowest utilization, respectively) after assigning the task. Furthermore, a task in the Light mode is assigned to the pair of feasible cores that have the highest sum of utilization (lowest sum of utilization, respectively) after assigning the task for Best-Fit (Worst-Fit, respectively). Note that we only assign the task to an unused core, i.e., a core that has not been assigned with any task yet, if the task is not schedulable on any used core.
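The three fitting strategies can be condensed into one selection routine. This is an illustrative sketch; `feasible` is assumed to be the list of `(core_index, utilization_after_assignment)` pairs that already passed the schedulability test.

```python
def pick_core(feasible, strategy):
    """Select one core among the feasible candidates.

    first: lowest predefined core index that fits.
    best:  highest utilization after assigning the task.
    worst: lowest utilization after assigning the task.
    """
    if not feasible:
        return None  # caller must fall back to an unused core or fail
    if strategy == 'first':
        return min(feasible, key=lambda c: c[0])[0]
    if strategy == 'best':
        return max(feasible, key=lambda c: c[1])[0]
    if strategy == 'worst':
        return min(feasible, key=lambda c: c[1])[0]
    raise ValueError(strategy)
```

For a Light task the same routine applies with pairs of cores and the sum of the two utilizations as the key.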
Time complexity: We can directly compute the number of type a and type b cores that are exclusively allocated to tasks in H_ab, H_a, and H_b. For tasks that share cores, i.e., tasks in the Light, Heavy_a, and Heavy_b modes, there are at most O(M_a · M_b) possible combinations of p and q when considering τ_k. The schedulability tests in Theorems 9 and 10 and their worst-case response time analyses require O(k · T_k) time. The time complexity of the fitting strategy to select cores for τ_k is O(M_a · M_b). Therefore, the overall time complexity is O(N² · M_a · M_b · T_N), where T_N is the maximum inter-arrival time of the given N tasks. The time complexity is pseudo-polynomial, which can be reduced by approximating the tests in Theorems 9 and 10 in polynomial time [7].

CAPACITY AUGMENTATION BOUND
In this section, we prove the capacity augmentation bound of our greedy type-aware federated scheduling algorithm from Section 5.2. First, we provide an upper bound on the total number of type a and type b cores that are exclusively allocated to tasks in Theorem 13. We then prove the schedulability of the workload that shares the remaining cores. The capacity augmentation bound of our greedy type-aware federated scheduling algorithm is proved to be 7.25 in Theorem 17.
To prove the upper bound on the total number of exclusively allocated type a and type b cores, we derive the following two lemmas. Recall that we enforced the number of dedicated cores allocated to the tasks in the Heavy_a mode and the Heavy_b mode, respectively.
Proof. The bound follows directly from the value of m^a_i set by the algorithm. The case for m^b_i is identical. □
Proof. For a task τ_i in H_a, consider the value of m^a_i set by the algorithm. If U^a_i < 1/3, then, since L^a_i/T_i ≤ r < U^a_i, the claimed bound follows. The case for m^b_i for a task in H_b is identical. □ By the above two lemmas, we have the following theorem.
Theorem 13. For any 0 < r ≤ 1/6, when 0 < L^a_i ≤ rT_i and 0 < L^b_i ≤ rT_i for every task τ_i ∈ T, the bounds of Lemmas 11 and 12 yield upper bounds on the total numbers of exclusively allocated type a and type b cores.

Proof. It follows from the combination of Lemmas 11 and 12. □

Next, we prove the schedulability of the tasks in the Heavy_a, Heavy_b, and Light modes on the remaining cores. We need the following lemma, which is based on the generalized utilization-based analysis framework by Chen et al. [5].
Lemma 14. Suppose that T_i ≤ T_k, y_i > 0 for every i ∈ Y, and x_k > 0. The following condition implies that

Proof. The proof can be achieved by following the suggested procedure in [5] and specifying the corresponding parameters. It can be found in the Appendix, available in the online supplemental material. □

Recall that M♯_a and M♯_b are the remaining numbers of type a and type b cores after allocating the dedicated cores. We have the following two lemmas.
Lemma 15. For any 0 < r ≤ 1/6, when 0 < L^a_i ≤ rT_i and 0 < L^b_i ≤ rT_i for every task τ_i ∈ T, if task τ_k in H_a is the first task that cannot be feasibly scheduled on one of the M♯_b > 0 type b cores under the greedy type-aware federated scheduling algorithm, then Eq. (17) holds. Similarly, if task τ_k in H_b is the first task that cannot be feasibly scheduled on one of the M♯_a > 0 type a cores, then Eq. (18) holds.

Proof. By Theorem 9, for every core q among the M♯_b type b cores, the schedulability condition is violated. Since the higher-priority tasks meet their deadlines before τ_k is considered, we have R_i ≤ T_i for every task τ_i ∈ C^b_q(τ_k). Recall the value of m^a_k set by the algorithm. To prove a lower bound on Σ_{τ_i ∈ C^b_q(τ_k)} U^b_i, we adopt the generalized utilization-based analysis framework stated in Lemma 14 by reformulating Eq. (20), using x_k as C^b_k + S^b_k and y_i as C^b_i for every τ_i ∈ C^b_q(τ_k). By adopting Lemma 14, the condition in Eq. (21) yields the lower bound. Since every higher-priority task τ_i is dedicated to at most one core q under the type-aware federated scheduling paradigm, the per-core bounds can be summed over the M♯_b cores. The other case, for M♯_a, follows as well. □

Lemma 16. For any 0 < r ≤ 1/7.25, when 0 < L^a_i ≤ rT_i and 0 < L^b_i ≤ rT_i for every task τ_i ∈ T, if task τ_k in LI is the first task that cannot be feasibly scheduled on one of the M♯_a > 0 type a cores together with one of the M♯_b > 0 type b cores under the greedy type-aware federated scheduling algorithm, then either Eq. (17) or Eq. (18) holds.
Proof. By Theorem 10 and the fact that R_i ≤ T_i for every higher-priority task τ_i before τ_k is considered, for every core p among the M♯_a type a cores and every core q among the M♯_b type b cores, the schedulability condition of Theorem 10 is violated. Therefore, the condition holds for all M♯_a · M♯_b combinations of p and q. By summing up these M♯_a · M♯_b inequalities and adopting Lemma 14 with the corresponding setting for τ_i ∈ H_a, we reach the conclusion. □

We are now ready to prove the capacity augmentation bound of the greedy type-aware federated scheduling algorithm.
Theorem 17. The capacity augmentation bound of the greedy type-aware federated scheduling algorithm is at most 7.25. That is, when r is 1/7.25, if the conditions in Eq. (27) hold, then the greedy type-aware federated scheduling algorithm is guaranteed to derive a feasible schedule.
Proof. Suppose, for contraposition, that the algorithm fails to derive a feasible solution.
As the first case, suppose that the failure is due to the exclusive allocation phase. By Theorem 13, we know that either M_a or M_b is smaller than the total number of dedicated cores required. This concludes the case.

For the second case, the exclusive allocation is successful and, therefore, M♯_a ≥ 0 and M♯_b ≥ 0. By Theorem 13, Equations (28) and (29) hold. Suppose that the algorithm fails when it tries to assign task τ_k. There are three further sub-cases:

Sub-case 1: τ_k ∈ H_a, i.e., no core q among the M♯_b type b cores can be found to assign τ_k to. The required bound holds trivially if M♯_b = 0, and by Lemma 15 otherwise. Together with Equation (29), we obtain Equation (30).

Sub-case 2: τ_k ∈ H_b, i.e., no core p among the M♯_a type a cores can be found to assign τ_k to. The required bound holds trivially if M♯_a = 0, and by Lemma 15 otherwise. Together with Equation (28), we obtain Equation (31).

Sub-case 3: τ_k ∈ LI, i.e., no pair of cores p, q among the M♯_a type a cores and the M♯_b type b cores can be found to assign τ_k to. If M♯_a = 0, then together with Equation (28) we obtain Equation (31). If M♯_b = 0, then together with Equation (29) we obtain Equation (30). If M♯_a > 0 and M♯_b > 0, then we apply Lemma 16: either Eq. (17) or Eq. (18) holds, and therefore we obtain Equation (31) or (30) as well.

Hence, we reach the conclusion that one of the assumptions in Eq. (27) is violated, and the theorem is proved. □

IMPROVED SCHEDULING ALGORITHM
The greedy type-aware federated scheduling algorithm in Section 5 is built on several rigid "enforcement rules," i.e., fixed choices of parameters, that guide the algorithm to achieve a capacity augmentation bound of 7.25. However, these enforcement rules may lead to poor performance in practical settings [8]. In this section, we introduce an improved type-aware federated scheduling algorithm without these enforcements on the parameters. The improved algorithm determines the execution mode of each task and assigns the task to cores, one task after another, based on four principles. We also prove that the capacity augmentation bound of the improved algorithm remains 7.25.

Algorithm Description
The improved algorithm works based on four principles:
1) P-Attempt: For each task τ_k, the algorithm attempts to determine the execution mode of the task with the following preference order: Light > (Heavy_a ∨ Heavy_b) > Heavy_ab. More precisely, when considering task τ_k, the algorithm first tries to assign τ_k in the Light mode and, in case of failure, follows with an attempt in the Heavy_a or Heavy_b mode. In case executing task τ_k in all of these three modes is not feasible (based on the schedulability tests presented in Section 5.1), task τ_k is assigned to the Heavy_ab mode, which is validated at the end of the algorithm by the principle P-Exclusive.
2) P-Share: The algorithm prefers to share cores, i.e., it tries to assign tasks to the shared cores already assigned with certain higher-priority tasks, and only assigns a task to a core without any task assigned to it when sharing is not possible.
3) P-Efficient: When a task is in the Heavy_a or Heavy_b mode, the number of exclusively allocated cores is minimized just to meet its deadline. That is, the suspension time from core q of type b (symmetrically, core p of type a) can be extended as long as the task meets its deadline.
4) P-Exclusive: For the tasks that are in the Heavy_ab mode, there are potentially multiple choices of m^a_i and m^b_i for τ_i. We search for the best combination of them by a combinatorial approach.
Algorithm 1 provides the pseudocode of the improved algorithm. After initialization, it tries to assign each task τ_k by following the P-Attempt principle, divided into four blocks: Lines 4-8 handle the Light mode attempt, Lines 9-37 the Heavy_a mode attempt, Lines 38-45 the Heavy_b mode attempt, and Line 46 temporarily assigns τ_k to the Heavy_ab mode. If an earlier attempt is successful, the flag success is marked as true, and there is no need for subsequent attempts.
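The P-Attempt cascade can be sketched in a few lines. The attempt functions are hypothetical stand-ins for Light_Attempt and the Heavy_a/Heavy_b attempts; in the actual algorithm they embed the schedulability tests of Section 5.1.

```python
def p_attempt(task, try_light, try_heavy_a, try_heavy_b):
    """Return the execution mode for `task` following the preference order
    Light > (Heavy_a or Heavy_b) > Heavy_ab."""
    if try_light(task):
        return 'Light'
    if try_heavy_a(task):
        return 'Heavy_a'
    if try_heavy_b(task):
        return 'Heavy_b'
    return 'Heavy_ab'   # validated at the end by P-Exclusive
```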
Light Mode Attempt LI* (Lines 4-8). The algorithm first tries to execute task τ_k in the Light mode if possible. The function Light_Attempt (pseudocode in the Appendix, available in the online supplemental material) returns a pair of cores (p, q) on which τ_k can be feasibly executed, validating the schedulability using Theorem 10. Following the P-Share principle, the algorithm prefers sharing cores: whenever possible, task τ_k is assigned to cores (p, q) which already have certain higher-priority task(s) assigned to them. Only if this is not possible does the algorithm try to assign task τ_k to a core p of type a and/or a core q of type b to which no higher-priority task has yet been assigned. If there are multiple valid core pairs, the algorithm applies a fitting strategy to choose one. Recall that finding a pair of cores for a task can be formulated as a bin packing problem, as discussed in Section 5.2; any fitting strategy can be applied.
Heavy_a Mode Attempt H_a* (Lines 9-37). If task τ_k cannot be scheduled in the Light mode and C^b_k ≤ rT_k, the algorithm tries to schedule the task in the Heavy_a mode with the objective of minimizing the number of exclusively allocated type a cores, following the P-Efficient principle. To minimize the number of type a cores exclusively allocated to τ_k, we use the function HeavyA_Attempt (pseudocode in the Appendix, available in the online supplemental material), which returns the minimum number of type a cores exclusively needed by task τ_k when its type b execution is assigned to a specified core q, together with the other tasks C^b_q assigned to it. This calculation is based on Lemma 5 and Theorem 9.
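The P-Efficient idea behind HeavyA_Attempt can be sketched as a search for the minimum number of exclusive type a cores. The predicate `schedulable` is a hypothetical stand-in for the self-suspension test of Theorem 9; since the test is assumed self-sustainable (monotone in the number of cores), a binary search could replace the linear scan below.

```python
def heavy_a_attempt(task, core_q_tasks, M_a, schedulable):
    """Minimum number of exclusive type a cores with which `task` still
    meets its deadline when its type b portion shares core q.
    Fewer type a cores mean a longer suspension on core q."""
    for m in range(1, M_a + 1):
        if schedulable(task, core_q_tasks, m):
            return m            # minimal m that meets the deadline
    return None                 # infeasible even with all type a cores
```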
The attempt also follows the P-Share principle by first trying to find a type b core q* which already has higher-priority task(s) assigned to it. Lines 11-23 therefore implement the P-Share principle, applied if sharing the execution of task τ_k on a type b core q* does not result in more than ⌈(C^a_k − L^a_k)/(T_k/3 − L^a_k)⌉ exclusively allocated type a cores.

Algorithm 1 (excerpt):
1: … // initialization
2: while T′ is not empty do
3:   pop out task τ_k with the smallest T_k from T′;
4:   if the pair (p, q) ← Light_Attempt(τ_k) can be found then
5:     LI* ← LI* ∪ {τ_k};
7:     continue; // Light mode successful
8:   end if
9:   if C^b_k ≤ rT_k then
10:     success ← false;
11:     if C^b_q = ∅ for every q then
12:       m^a*_k ← ∞;
13:     else
14:       m^a*_k ← min over q with C^b_q ≠ ∅ of HeavyA_Attempt(τ_k, q);
15:       q* ← argmin over q with C^b_q ≠ ∅ of HeavyA_Attempt(τ_k, q); // ties broken arbitrarily
16:     end if
17:     if m^a*_k ≤ ⌈(C^a_k − L^a_k)/(T_k/3 − L^a_k)⌉ then // try to assign τ_k with m^a*_k type a cores exclusively and to core q* of type b
18:       if there are m^a*_k cores p with C^a_p = ∅ then
19:         find m^a*_k cores p with C^a_p = ∅ and set C^a_p ← {τ_k};
21:         success ← true;
22:       end if
23:     end if
24:     if success is false and there exists q with C^b_q = ∅ then // try to assign τ_k with m^a_k type a cores exclusively and a new type b core q
25:       find a core q with C^b_q = ∅;
26:       m^a_k ← HeavyA_Attempt(τ_k, q);
27:       if there are m^a_k cores p with C^a_p = ∅ then
28:         C^b_q ← {τ_k};
29:         find m^a_k cores p with C^a_p = ∅, …

Heavy_ab Mode (Destiny) H_ab*. After examining all the tasks, those that were not assigned to the Light, Heavy_a, or Heavy_b modes have to be checked as to whether they can be executed in the Heavy_ab mode, based on the P-Exclusive principle. Let X_a and X_b be the remaining numbers of type a and type b cores with no task assigned to them, respectively. The function isFeasible_HeavyAB builds a dynamic programming table to assign cores to the tasks in H_ab* and validates whether the tasks in H_ab* can be feasibly scheduled under exclusive allocations. Furthermore, if task τ_k can be feasibly scheduled on m^a_k type a cores and m^b_k type b cores, the function Sch(τ_k, m^a_k, m^b_k) is True; otherwise, it is False.
The schedulability test for Sch(τ_k, m^a_k, m^b_k) can be performed by any of the approaches in Section 4.1. If the adopted schedulability test is self-sustainable, we can find feasible assignments for task τ_k by performing binary searches on the numbers of cores of both types. Note that there can be multiple pairs (m^a_k, m^b_k) for which Sch(τ_k, m^a_k, m^b_k) returns True, and we only care about the pairs that are not dominated by other pairs. We say that a pair (m^a_k, m^b_k) is dominated by (m^a′_k, m^b′_k) if m^a_k ≥ m^a′_k and m^b_k ≥ m^b′_k, since if τ_k is schedulable on (m^a′_k, m^b′_k) cores, then it must also be schedulable on (m^a_k, m^b_k) cores due to the self-sustainability of the schedulability test.
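Filtering the feasible (m^a_k, m^b_k) pairs down to the non-dominated ones is a small Pareto-front computation; a sketch (assuming no duplicate pairs in the input):

```python
def non_dominated(pairs):
    """Keep only pairs for which no other feasible pair needs no more
    cores of either type; dominated pairs are redundant by the
    self-sustainability of the test."""
    return [(a, b) for (a, b) in pairs
            if not any(a2 <= a and b2 <= b and (a2, b2) != (a, b)
                       for (a2, b2) in pairs)]
```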
We apply the standard dynamic programming approach to validate whether P(H_ab*, X_a, X_b) is True. If it is, we backtrack through the dynamic programming table and return a feasible assignment. The proposed algorithm then returns a feasible solution if the tasks in H_ab* can be feasibly scheduled, and returns failure otherwise.
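A minimal sketch of the feasibility check behind P(H_ab*, X_a, X_b): process the Heavy_ab tasks one by one and track which remaining core budgets are still reachable. Here `options[i]` is assumed to hold the non-dominated feasible (m_a, m_b) pairs of the i-th task; the paper's version additionally backtracks the table to recover one concrete assignment.

```python
def feasible_heavy_ab(options, X_a, X_b):
    """True iff every task can be given one of its (m_a, m_b) pairs
    without exceeding the X_a type a and X_b type b spare cores."""
    states = {(X_a, X_b)}                  # budgets before any allocation
    for pairs in options:
        nxt = set()
        for xa, xb in states:
            for ma, mb in pairs:
                if ma <= xa and mb <= xb:
                    nxt.add((xa - ma, xb - mb))
        if not nxt:
            return False                   # this task cannot be placed
        states = nxt
    return True
```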
Time complexity: The time complexity of the improved type-aware federated scheduling algorithm can be derived as follows. For tasks that share cores, i.e., tasks in the Light, Heavy_a, and Heavy_b modes, the time complexity is O(N² · M_a · M_b · T_N), where T_N is the maximum inter-arrival time of the given N tasks, as discussed in Section 5.2. For tasks with dedicated cores, we can construct the dynamic programming table with time complexity O(N · M_a · M_b) by keeping only the non-dominated entries in the table. The overall time complexity is O(N² · M_a · M_b · T_N), which is pseudo-polynomial and can be reduced by approximating the tests in Theorems 9 and 10 in polynomial time [7].

Capacity Augmentation Bound
In this section, we prove that the capacity augmentation bound of the improved type-aware federated scheduling algorithm is still 7.25. Due to the structural similarity between the greedy type-aware federated scheduling algorithm and the improved version, most of the results from Section 6 can be applied with proper restatements.
Proof. As Algorithm 1 does not allocate more cores to τ_i ∈ H_a* than the bound used in Lemma 12, the same procedure as in the proof of Lemma 12 yields the conclusion. □

Lemma 19 (Extension of Lemma 16). For any 0 < r ≤ 1/7.25, let task τ_k be the first light task in Algorithm 1, i.e., a task that is classified as a light task in LI but cannot be feasibly assigned during the Light mode attempt and is hence not in LI*. If such a task τ_k exists, then, with max{L^a_i, L^b_i} ≤ L_i ≤ rT_i, after Line 49 of Algorithm 1 one of the following three conditions holds:

Suppose that right after assigning task τ_k there are two type b cores with core utilization > 0 and < r. In the Appendix, available in the online supplemental material, we show that this is not possible. The condition that there is at most one type b core with core utilization > 0 and < r implies the claimed bound.

In this case, H_ab* ⊆ H_ab, and every task τ_i in H_ab* has U^a_i > r and U^b_i > r. The improved algorithm fails to derive a solution when P(H_ab*, X_a, X_b) is False due to an insufficient number of cores. One particular solution is to assign m^a_i type a cores and m^b_i type b cores to every task τ_i in H_ab*; in this case, P(H_ab*, Σ_{τ_i ∈ H_ab*} m^a_i, Σ_{τ_i ∈ H_ab*} m^b_i) is guaranteed to return True. Therefore, since the m^a_i and m^b_i are integers, either X_a < X_a + 1 ≤ Σ_{τ_i ∈ H_ab*} m^a_i or X_b < X_b + 1 ≤ Σ_{τ_i ∈ H_ab*} m^b_i. By applying Lemma 11, we obtain the corresponding utilization bound. Together with the three conditions in Lemma 21, we conclude that one of the claimed inequalities holds. Due to the above cases, the capacity augmentation bound of 7.25 is proven, as a contradiction is reached in every case. □

EVALUATION
In this section, we conduct an experimental evaluation of our proposed algorithms on synthetic task sets.

Environment Setting
Given M_a type a cores, M_b type b cores, and the target utilization U, we generate a task set as follows. The number of DAG tasks N is selected uniformly at random from [½ · max(M_a, M_b), 2 · max(M_a, M_b)]. We apply the Dirichlet-Rescale (DRS) algorithm [12] to determine the utilization of each of the N tasks. The period T_i of a task is selected uniformly at random from [100, 1000]. To generate a DAG, we first determine the number of nodes by selecting a number uniformly at random from [½ · (M_a + M_b), 5 · max(M_a, M_b)] and apply DRS to generate the utilization of each node. The G(n, p) algorithm [10] is used to generate the edges between nodes, with probability p_e ∈ [0.1, 0.9]. We generate task sets with different target utilizations, from 0 to 60% × (M_a + M_b) in steps of 5% × (M_a + M_b). For each target, 100 sets are generated and stored as pure_task_sets.

The type of each node in the pure_task_sets is determined as follows. To evaluate the effect of the skewness of the workload, we introduce two parameters: r controls the share of tasks that have a skewed workload, and P′ controls the skewness of the skewed tasks. More specifically, in each task set, we pick r% of the tasks to be skewed tasks. Skewed tasks are determined to be type a skewed with a probability of 50% and type b skewed otherwise. For a type a skewed task, P′% of the nodes in the task are selected uniformly at random to be of type b, while the rest of the nodes are assigned to type a. For non-skewed tasks, each node in a task has a probability M_a/(M_a + M_b) of being assigned type a and a probability M_b/(M_a + M_b) of type b. Note that assigning types to nodes does not change the structure of a DAG. We use task sets with different r and P′ as inputs in our evaluation. Note that this setup may result in many skewed tasks (depending on the parameter r), but the workload of a task set as a whole is still expected to be balanced.
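The first steps of this generation flow can be sketched as follows. The actual evaluation uses the Dirichlet-Rescale (DRS) algorithm [12] to split the target utilization; here a plain normalized random split stands in for DRS (it does not enforce DRS's per-task upper bounds), and the function name and signature are illustrative only.

```python
import random

def generate_task_set(M_a, M_b, U, seed=None):
    """Generate (utilization, period) pairs for one synthetic task set."""
    rng = random.Random(seed)
    m = max(M_a, M_b)
    N = rng.randint(m // 2, 2 * m)             # number of DAG tasks
    shares = [rng.random() for _ in range(N)]  # stand-in for DRS
    total = sum(shares)
    utils = [U * s / total for s in shares]    # task utilizations summing to U
    periods = [rng.uniform(100, 1000) for _ in range(N)]
    return list(zip(utils, periods))
```

Per-node utilizations and the G(n, p) edge generation would follow the same pattern for each task.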
We apply the following algorithms in our evaluation. FED-IMPROVED: the improved type-aware federated scheduling algorithm proposed in Section 7, with Theorems 8 and 9 as schedulability tests for tasks in the Heavy_ab and Heavy_a/Heavy_b modes, respectively. FED-GREEDY: the greedy type-aware federated scheduling algorithm proposed in Section 5.2. HAN-EMU/HAN-GREEDY: the federated scheduling algorithms proposed by Han et al. in [14]. HAN-EMU enumerates all possible combinations of (m_a, m_b) for each heavy task, while HAN-GREEDY applies a penalty-based greedy algorithm to find a feasible assignment. Note that their schedulability analysis for light tasks in Lemma 5.1 in [14] is unsafe, as it only considers the self-suspending behavior of the task under analysis but not the higher-priority self-suspending tasks. In addition, their analysis does not consider carry-in jobs.

Junjie Shi (Student Member, IEEE) received the master's degree in electronic technology and information technology from TU Dortmund University, Germany, in 2017, and is currently working toward the PhD degree at TU Dortmund University. His research interests are resource-sharing protocols for real-time systems, resource-aware scheduling for machine learning algorithms, and computation offloading for real-time systems.
Niklas Ueter received the master's degree in computer science from TU Dortmund University, Germany, in 2018, and is now working toward the PhD degree at TU Dortmund University, supervised by Prof. Dr. Jian-Jia Chen. His research interests are in the area of embedded and real-time systems, with a focus on real-time scheduling.
Mario Günzel (Student Member, IEEE) received the MSc degree from the Faculty of Mathematics, University of Duisburg-Essen, Germany, in 2019. He is currently working toward the PhD degree at the chair for Design Automation of Embedded Systems, TU Dortmund University, Germany, where he is supervised by Prof. Dr. Jian-Jia Chen. His research interest is in the area of embedded and real-time systems; currently, he focuses on the schedulability analysis of self-suspending task sets.