Dynamic Allocation/Reallocation of Dark Cores in Many-Core Systems for Improved System Performance

A significant number of processing cores in today's many-core systems, and likely in future ones, have to be switched off or forced to be idle, becoming dark cores, in light of ever-increasing power density and chip temperature. Although these dark cores cannot make direct contributions to the chip's throughput, they can still be allocated to applications currently running in the system for the sole purpose of heat dissipation, enabled by the temperature gradient between the active and dark cores. However, allocating dark cores to applications tends to add extra waiting time for applications yet to be launched, which in turn can have adverse implications for overall system performance. Another big issue with dark core allocation stems from the fact that application characteristics are prone to rapid changes at runtime, making a fixed dark core allocation scheme less desirable. In this paper, a runtime dark core allocation and dynamic adjustment scheme is therefore proposed. Built upon a dynamic programming network (DPN) framework, the proposed scheme attempts to optimize the performance of currently running applications while simultaneously reducing the waiting times of incoming applications, by taking into account both thermal issues and the geometric shapes of the regions formed by the active/dark cores. The experimental results show that the proposed approach achieves an average of 61% higher throughput than two state-of-the-art thermal-aware runtime task mapping approaches, making it a strong candidate for runtime resource management in many-core systems.

There have been a number of studies on mapping applications to both active and dark cores [4]-[9]. They basically allocate dark cores to applications so that the active cores are allowed to operate at higher frequency levels and thus achieve higher performance, at the cost of higher power consumption. However, these approaches fail to deliver optimized system performance for the following reasons. First, as reported in [10], application arrival rates vary significantly over time. In particular, as shown in Fig. 1(a), there is a vast gap between the maximum workload (the highest number of applications arriving at the system per hour) and the minimum, by as much as 200× [10]. Even over a single minute, as shown in Fig. 1(b), the ratio of the maximum number of applications to the minimum can be as high as 6:1 [10], which implies that the number of dark cores can also vary greatly over that short time span. However, for the sake of simplicity, both schemes in [6], [7] inaccurately assume that the number of dark cores remains unchanged over a long time interval, undermining the quality of the application mapping results.
Second, due to workload fluctuations, allocating dark cores to applications necessitates balancing a number of competing requirements, such as throughput and individual applications' waiting and completion times, as shown in Fig. 2. Assume that at initial time t_0, four applications (A_1-A_4) occupy core regions, each of which also includes one or a few dark cores, as shown in Fig. 2(a). With application A_5 arriving at time t_1, there are two possible allocation schemes that can map A_5 to the cores:
• In one scheme, the dark cores already bound to A_1-A_4 are reallocated to A_5 so that A_5 can run immediately at time t_1, but A_1-A_4 have to slow down because fewer dark cores are available to help their active cores dissipate heat (shown in Fig. 2(b));
• Alternatively, A_5 is asked to wait until some of the applications A_1-A_4 finish their executions and free up their cores for A_5 to grab (shown in Fig. 2(c)). In this case, A_1-A_4 can maintain their desired performance, but A_5 undergoes a longer waiting time before it starts its execution.
From Fig. 2, one can see that the mapping results generated by the two schemes differ significantly in terms of the performance (completion time) of currently running applications and the performance of newly arrived or future applications.
Third, an application's computation demand, measured by throughput in terms of instructions per cycle (IPC), varies with time. For instance, the computation demands of Facesim and Swaptions running simultaneously in the system, shown in Fig. 3(a), vary in different ways. Since Swaptions already has dark cores allocated to it and its computation demand decreases as time passes, the resource manager can reclaim some of the dark cores occupied by Swaptions and instead allocate them to Facesim at time t_5, as shown in Fig. 3(b). If the dark cores can be dynamically adjusted at runtime, both applications will be able to have their computation demands met.
In order to achieve optimized system performance by addressing the aforementioned challenges, we propose a runtime mapping scheme to dynamically allocate and adjust both active and dark cores. Here are the highlights of the proposed scheme.
• The proposed mapping algorithm takes the varying workloads, the waiting times of newly arrived applications, and the computation demands of applications into account, while the operating temperature is treated as a thermal constraint for safe and reliable operation of the chip. Instead of pushing each individual application's performance to its highest, our approach attempts to optimize the performance of currently running applications and the ones that are about to run.
• Based on a throughput model, a dynamic programming network framework is proposed to determine both the numbers of active and dark cores for the newly arrived applications and the number of dark cores allocated to executing applications, with the objective of maximizing the system performance.
• The mapping algorithm also includes region determination and task-to-core mapping. In general, the dark cores are placed near the cores that need to dissipate heat or run at higher frequencies. Moreover, the locations and geometric shapes of the core regions are regulated to minimize the communication latency and fragmentation of the free core regions, which further improves the system performance.
The remainder of the paper is organized as follows. Section II reviews the related work, and Section III describes the target system and provides the problem definition. Section IV presents an overview of the proposed method. Sections V, VI, and VII describe the details of the three steps of the proposed method. Extensive experiments are conducted to compare the proposed scheme against state-of-the-art thermal-aware runtime mapping methods, and the results are reported and analyzed in Section VIII. Finally, Section IX concludes this paper.

II. RELATED WORK
Runtime allocation of available system resources to tasks has been an active research area since the inception of the many-core era [11]. Based on whether remapping is allowed at runtime, the many resource allocation approaches that have been proposed can be broadly classified into two classes:
• Dynamic mapping without task migration, where no mapping change happens after the initial task-to-core mapping; and
• Dynamic mapping with task migration, where tasks can be mapped and remapped to different cores at runtime.

A. DYNAMIC MAPPING WITHOUT TASK MIGRATION
Dynamic mapping without task migration can be further classified into three categories according to their optimization goals: communication-oriented mapping, power-aware mapping, and thermal-aware mapping. Communication-oriented approaches (e.g., [12], [13]) aim at reducing network latency or minimizing traffic congestion, and they are similar to the contiguous mapping method [7]. However, these mapping approaches might lead to thermal hotspots in high power density chips since they do not consider the power budget [4].
Power-aware algorithms (e.g., [14], [15]) try to perform mapping under the thermal design power budget, which alone is not enough to avoid thermal violations, as found in [5]. As a fix, some thermal-aware approaches take the temperature of the cores into account during mapping [16].
The thermal-aware mapping approaches in [16], [17] try to minimize the power consumption and peak temperature.
As alluded to before, the existence of dark cores presents opportunities to optimize system temperature. Failing to take advantage of the availability of dark cores might lead to suboptimal performance, as in the cases of [16], [17]. To efficiently exploit dark cores, many dark-core-aware approaches have been considered [4]-[8]. The mapping approaches in [5], [6] assume that the system has a fixed number of dark cores, but in reality, the number of dark cores can vary significantly even over a short period of time [10]. The approaches in [5], [7], [8] do not consider the application arrival rate, and thus, their mapping results tend to cause applications to wait too long before they can start their execution. Although the work in [4] considers the application arrival rate when allocating dark cores to applications, a big drawback is that it cannot guarantee that the cores meet the changing computation demands of applications.
In short, none of the dynamic mapping algorithms described here delivers optimal performance, as they do not take full advantage of the dark cores in the system, workload variations, or the changing computation demands of applications.

B. DYNAMIC MAPPING WITH TASK MIGRATION
Recognizing the deficiencies of dynamic mapping without task migration, dynamic mapping approaches that allow task migration at runtime have been proposed to help improve runtime application performance. These approaches can be classified into three categories: fragmentation-aware migration, communication-aware migration, and thermal-aware migration. Fragmentation-aware migration schemes (e.g., [18], [19]) reallocate tasks with the hope of forming a contiguous region of cores, while communication-aware migration approaches (e.g., [20], [21]) focus on adjusting core allocation to minimize communication latency. Thermal-aware migration approaches (e.g., [22], [23]) move tasks from overheated cores to cooler ones to reduce hotspots. However, the above mapping approaches still do not exploit dark cores for better performance [8]. Although an early study in [9] presents a dark-core-aware migration algorithm that produces better computation performance, it does not address the changing computation demands of applications.
In the next section, we will present a runtime dark core allocation and adjustment scheme that addresses the outstanding issues of workload variations and applications' computation demands.

III. SYSTEM MODEL AND PROBLEM DEFINITION
A. THE TARGET MANY-CORE PLATFORM AND APPLICATION MODEL
Fig. 4(a) shows the target many-core platform, which has a set of homogeneous cores Q connected by a 2D mesh network. A core in Q is denoted as c_i. One core is designated as the resource manager; it has the authority and capacity to make any runtime core allocation and adjustment decisions. The many-core platform executes applications organized as a set A = {A_1, A_2, ..., A_N}. When an application is ready to execute at time t, it is placed in the system waiting queue (denoted as H(t)). When an application in H(t) is allocated to certain cores for execution, it is added to the running queue (denoted as T(t)). When an application in T(t) finishes its execution, it is deleted from this queue. The notations used throughout the paper are summarized in Table 1.
To model the time-varying features of computation demands, an application A_i is divided into multiple phases, and at a phase τ the application is represented as a task graph AG_i(τ) = (V_i(τ), E_i(τ)), as shown in Fig. 4(b). The task graphs at different phases can be obtained by the heartbeat framework [24]. V_i(τ) is the set of tasks associated with application A_i, and E_i(τ) is the set of edges governing the communications among tasks. v_ij is the j-th task of application A_i, and e_ijk is the edge connecting tasks v_ij and v_ik, indicating the communication between them. Each task v_ij ∈ V_i(τ) has a weight a(v_ij, τ), which gives its execution time at phase τ. An edge e_ijk = (v_ij, v_ik) ∈ E_i(τ) has a weight w(e_ijk, τ) that defines the communication volume, in terms of the number of packets, from task v_ij to task v_ik at phase τ. Mapping a task to a core is defined as a one-to-one mapping; that is, only one task can run on a core, and no tasks can share a core at any given time [13]. A mapping function M(v_ij) = c_i maps task v_ij to core c_i.
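The phase-dependent application model above can be sketched as a small data structure. This is a minimal illustration only; the names (`TaskGraph`, `map_tasks`) are assumptions and do not appear in the paper.

```python
from dataclasses import dataclass, field

@dataclass
class TaskGraph:
    """Sketch of AG_i(tau): tasks carry execution-time weights a(v_ij, tau),
    edges carry communication volumes w(e_ijk, tau) in packets."""
    tasks: dict = field(default_factory=dict)   # task id -> execution time
    edges: dict = field(default_factory=dict)   # (src, dst) -> comm. volume

    def add_task(self, tid, exec_time):
        self.tasks[tid] = exec_time

    def add_edge(self, src, dst, volume):
        # Both endpoints must already be tasks of the application.
        assert src in self.tasks and dst in self.tasks
        self.edges[(src, dst)] = volume

def map_tasks(graph, free_cores):
    """One-to-one mapping M(v_ij) = c_i: no two tasks share a core."""
    assert len(free_cores) >= len(graph.tasks)
    return dict(zip(sorted(graph.tasks), free_cores))
```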

B. THROUGHPUT MODEL
Application A_i's computation demand is measured by its throughput, which is the lumped throughput (IPC) of the cores running all the tasks of A_i. A throughput model is built to compute the throughput of application A_i when |B_i(t)| dark cores are assigned to it, where B_i(t) is the set of dark cores associated with application A_i. The model is a regression involving |B_i(t)| as well as ā_i and w̄_i, the average execution time and average communication volume of A_i's tasks, respectively.
The throughput model is used at runtime to estimate the throughput, given the number of dark cores. To find the regression coefficients β j , δ j , ϑ j , θ j and ε, the maximum likelihood method [25] can be used. The throughput model can be trained offline by running various applications. There are four steps.
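As a sketch of the offline training step: under the common assumption of Gaussian noise, the maximum-likelihood fit [25] of a model that is linear in its coefficients reduces to ordinary least squares. The feature set below (dark-core count, average execution time, average communication volume) and all function names are illustrative assumptions, not the paper's exact regressors or coefficients (β_j, δ_j, ϑ_j, θ_j, ε).

```python
import numpy as np

def fit_throughput_model(samples):
    """Fit a linear-in-coefficients throughput model by least squares.
    samples: list of (n_dark, avg_exec, avg_comm, measured_ipc).
    Assumed feature set, for illustration only."""
    X = np.array([[1.0, n, a, w] for n, a, w, _ in samples])
    y = np.array([ipc for _, _, _, ipc in samples])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def predict_throughput(coeffs, n_dark, avg_exec, avg_comm):
    """Runtime estimate of IPC given the number of dark cores."""
    return float(coeffs @ np.array([1.0, n_dark, avg_exec, avg_comm]))
```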
Step 1 (Find a Near-Square Shape): Since the throughput of application A_i is associated with its core region, and a square shape is ideal for addressing communication latency concerns [13], a core region close to a square shall be pursued when mapping applications. Let R_i(t) be the core region of application A_i, which includes the dark cores B_i(t) and the active cores. First, a basic square with side length α = ⌊√|R_i(t)|⌋ is found. If φ = |R_i(t)| − α² = 0, the region is a square, and this shape shall be selected as the shape of the core region for throughput modeling. A region with a non-square shape takes one of the shapes made of the basic square combined with one or two rectangles:
• Case 2: one rectangle. If 0 < φ ≤ α, the near-square shape consists of the basic square and a rectangle of size 1 × φ, as shown in Fig. 5(b).
• Case 3: two rectangles. If φ > α, the near-square shape consists of the basic square, a rectangle of size 1 × α, and a rectangle of size 1 × (φ − α).
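The near-square construction of Step 1 can be sketched as follows, assuming the floor of the square root for α and the square-plus-rectangles decomposition described above.

```python
import math

def near_square_shape(region_size):
    """Return (alpha, rectangles) for a region of |R_i(t)| cores:
    alpha x alpha basic square plus zero, one, or two 1-wide rectangles."""
    alpha = math.isqrt(region_size)          # side of the basic square
    phi = region_size - alpha * alpha        # cores left over
    if phi == 0:
        return alpha, []                     # a perfect square
    if phi <= alpha:
        return alpha, [(1, phi)]             # square + one rectangle
    return alpha, [(1, alpha), (1, phi - alpha)]  # square + two rectangles
```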
Step 2 (Determine the Task Positions): The mapping method described in Section VII can be used to determine the positions of the tasks. The cores in the core region of application A_i that are not occupied by tasks are powered off as dark cores.
Step 3 (Set Other Running Applications): In order to simulate the case that there are many other applications running simultaneously in the system, which also consume power, applications from PARSEC [26] are randomly picked and mapped to cores adjacent to the application of interest.
Step 4 (Set Voltage/Frequency Levels of Cores): When running applications, it is necessary to ensure that each core c_i runs safely with its power consumption below the maximum power capacity P_m(c_i), which is obtained from the thermal power capacity model in [4]. The total power consumption comes from the dynamic power P_d(c_i) and the leakage power P_l(c_i); that is, P(c_i) = P_d(c_i) + P_l(c_i). The leakage power P_l(c_i) can be obtained as in [27]. The dynamic power is determined by P_d(c_i) = µ_i · z_i · V_i² · f_i, where µ_i is the switching activity, z_i is the effective capacitance, f_i is the frequency of core c_i, and V_i is the supply voltage. The frequencies, power, and throughput of the dark cores are 0. Once the positions of the tasks are determined in the system, the method in [4] is used to set the voltage/frequency levels of the cores so that they can run at high speed without violating the temperature constraint and the maximum power capacity.
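A minimal sketch of the per-core power check in Step 4, assuming the classical CMOS dynamic-power expression P_d = µ·z·V²·f and treating leakage as a given per-core constant (the temperature-dependent leakage model of [27] is not reproduced; all names are illustrative).

```python
def dynamic_power(mu, z, volt, freq):
    """Classical CMOS dynamic power: P_d = mu * z * V^2 * f."""
    return mu * z * volt ** 2 * freq

def highest_safe_frequency(mu, z, volt, freqs, p_leak, p_max):
    """Pick the largest frequency level whose total power (dynamic + leakage)
    stays under the thermal power capacity P_m(c_i); None if none is safe."""
    safe = [f for f in freqs if dynamic_power(mu, z, volt, f) + p_leak <= p_max]
    return max(safe) if safe else None
```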

C. PROBLEM STATEMENT
The applications in set A arrive at the system at different times, and the objective is to maximize the overall system throughput of running the applications in set A.
Eqn. (6) can be transformed into maximizing the system throughput of the application set A(t) that can be executed at time t. A(t) contains the applications that are either in the running queue T(t) or in the waiting queue H(t) at time t. With the throughput model, the maximum throughput for the application set A(t) can be computed, where γ_i^{if_run} is a binary variable: if γ_i^{if_run} = 1, application A_i can start its execution immediately, as there are sufficient cores available in the system; if γ_i^{if_run} = 0, application A_i is put on hold and waits for core(s).
At each control time, a decision needs to be made regarding the number of dark cores to be allocated to each application, together with the task-to-core mapping.

IV. OVERVIEW OF THE PROPOSED METHOD
The decision to map a new application to cores, or to adjust the core regions of running applications, raises two challenges.
First, fragmentation of free cores [18] might occur. Dark cores released by other applications might not form a contiguous region, which increases the communication latency for newly arrived applications. Second, a near-square core region is ideal for communication latency concerns [13]. However, the shape tends to become irregular after adding or removing dark cores, which might lead to increased communication latency.
To address these challenges, a three-step algorithm is proposed as follows.
Step 1 (Dark Core Budgeting (Section V)): A dynamic programming framework is applied to decide the number of dark cores for the running applications and the newly arrived ones.
Step 2 (Region Determination (Section VI)): Given the number of dark cores from the previous budgeting step, the shape and location of each application's core region are determined and reallocated to avoid fragmentation.
Step 3 (Task Mapping (Section VII)): A task mapping algorithm maps each application's tasks within its core region and determines the locations of the dark cores.
The proposed method will be triggered at each control time.

V. DARK CORE BUDGETING
Dark core budgeting, which decides the numbers of dark cores that shall be allocated to applications for maximal throughput (defined in Eqns. (7) and (8)), can be transformed into a longest path problem in an acyclic network, over which a dynamic programming network (DPN) is built.

A. DYNAMIC PROGRAMMING NETWORK DEFINITION
The dynamic programming network (DPN) is denoted as a graph DPN(O, Y), as shown in Fig. 6, with O and Y representing the vertex and edge sets, respectively. We assume that dark cores are allocated to the applications in a set F(t). Here |B(t)| is the maximum number of dark cores in the system when the applications in F(t) are running. Two dummy vertices, source vertex s and destination vertex d, are added to indicate the start and the end of the DPN, respectively. The vertex o_{i,b} represents the optimal overall throughput after assigning a total of b dark cores to applications f_i(t), f_{i+1}(t), ..., f_{|F(t)|}(t). An edge connecting the vertices o_{i,b} and o_{i+1,k} corresponds to the decision of assigning b − k dark cores to application f_i(t); its utility is the throughput of f_i(t) with b − k dark cores, obtained from the throughput model defined in Section III-B. Each vertex at stage i is connected to at most |B(t)| + 1 vertices in the next stage i + 1.
The maximum throughput resulting from the dark core allocations for the application set F(t) can be computed by finding the longest path from vertex s to vertex d. Such a longest path can be found recursively in the form of Bellman equations [28]. By expanding Eqn. (11) from vertex s to vertex d (i.e., U(s, d)), the maximum throughput resulting from the dark core allocations for the application set F(t) is obtained. Algorithm 1 shows the computation of the Bellman equations given in Eqns. (11) and (12).
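The budgeting recursion can be sketched as a standard longest-path dynamic program over the DPN stages. The throughput table passed in is illustrative input standing in for the model of Section III-B, and the function names are assumptions.

```python
def budget_dark_cores(throughput, total_dark):
    """throughput[i][m]: estimated throughput of application f_i(t) given m
    dark cores. Returns (best total throughput, per-application allocation),
    i.e., the longest s-to-d path in the DPN via a Bellman-style recursion."""
    n = len(throughput)
    # U[i][b]: best achievable throughput from stage i with b cores left.
    U = [[0.0] * (total_dark + 1) for _ in range(n + 1)]
    choice = [[0] * (total_dark + 1) for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for b in range(total_dark + 1):
            best, arg = -1.0, 0
            for give in range(b + 1):        # edge o_{i,b} -> o_{i+1,b-give}
                val = throughput[i][give] + U[i + 1][b - give]
                if val > best:
                    best, arg = val, give
            U[i][b], choice[i][b] = best, arg
    # Recover the per-application allocation along the longest path.
    alloc, b = [], total_dark
    for i in range(n):
        alloc.append(choice[i][b])
        b -= choice[i][b]
    return U[0][total_dark], alloc
```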

B. FINDING THE RUNNING APPLICATION SET
To find the application set T*(t + 1) that is to be run at time t + 1 (equivalent to determining γ_i^{if_run} in Eqn. (7)), a three-step dark core budgeting algorithm is applied, as shown in Fig. 7.
Step 1: Let T(t) denote the set of all applications that have not completed their executions by time t. From T(t), the maximum number of applications that can be added into the running queue from the waiting queue H(t), denoted as n_wait, is computed following the order in which these applications joined the waiting queue (assuming that none of the applications running at time t + 1, including the applications in T(t), hold dark cores).
Step 2: Set T_l(t + 1) = T(t) ∪ {h_1(t), ..., h_l(t)}, l ∈ {0, ..., n_wait}, h_j(t) ∈ H(t). Here T_l(t + 1) is the set of the currently running applications after time t plus l applications from the waiting queue. These l applications are selected from the waiting queue H(t) according to the order in which they joined it. For each application set T_l(t + 1), l ∈ {0, ..., n_wait}, the maximum throughput U_l(s, d) in Eqn. (12) is computed by exploring the dynamic programming network (Algorithm 1).
Step 3: Among the candidate sets, the T_l(t + 1) with the largest U_l(s, d) is selected as the running application set T*(t + 1).
The worst-case time complexity of finding the running application set with the dark core budgeting scheme is O(n_wait · (|T(t)| + n_wait) · |B(t)|²).
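The three-step search for T*(t + 1) can be sketched as follows, with `evaluate` standing in for Algorithm 1's computation of U_l(s, d); the names are illustrative assumptions.

```python
def find_running_set(running, waiting, n_wait, evaluate):
    """Try admitting l = 0..n_wait waiting applications, in arrival order,
    and keep the candidate set with the best score (stand-in for U_l(s, d))."""
    best_set, best_score = None, float("-inf")
    for l in range(n_wait + 1):
        candidate = list(running) + list(waiting[:l])
        score = evaluate(candidate)          # placeholder for Algorithm 1
        if score > best_score:
            best_set, best_score = candidate, score
    return best_set, best_score
```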

VI. REGION DETERMINATION
The dark core budgeting algorithm computes the number of dark cores allocated to each application running at time t + 1. The currently running applications whose dark core counts are about to change, i.e., |B_i(t)| ≠ |B_i(t + 1)|, and the newly arrived applications need to find new core regions, following a three-step region determination algorithm, as shown in Fig. 8.
Step 1 (Find the Largest Contiguous Region): Starting with the largest contiguous region R_largest helps alleviate the core fragmentation problem. All of the applications whose core regions are not located in R_largest need to have their core regions adjusted.
Step 2 (Determine the Relocation Order): The applications that need to find a new region are prioritized.
Step 3 (Find the Core Region): For each application that needs to be adjusted, a three-step algorithm is performed.
• First, find all the possible locations of the cores that an application can be mapped to.
• Second, the candidate regions are formed, starting from each possible core location of an application.
• Third, choose the new region out of the candidate regions.

A. FINDING THE LARGEST CONTIGUOUS REGION
Let ψ ⊂ T*(t + 1) be a set that includes two types of applications: (1) those newly added into the running queue, and (2) the currently running ones whose number of affiliated dark cores is about to change. Let RC_i ∈ RC = {RC_1, RC_2, ..., RC_m, ...} be the i-th contiguous region occupied by the applications in the set K_no_adjust = T*(t + 1) − ψ, i.e., the currently running applications that will hold the same number of dark cores.
To find RC, the following steps are performed iteratively. For each RC_i ∈ RC, a core occupied by an application in K_no_adjust is first found and added to RC_i. Then, for each core c_j ∈ RC_i, each of its neighboring cores c_l is checked; if c_l is already running a task of an application in K_no_adjust, it is added to RC_i. Once all of the cores in RC_i have been checked, if Σ_{1≤i≤|RC|} |RC_i| is less than the total number of cores running tasks of applications in K_no_adjust, the iteration continues to find RC_{i+1}; otherwise the iteration terminates.
The largest contiguous region R_largest is the one in RC that has the maximum number of cores. All of the applications running at time t + 1 whose core regions are not located in R_largest are also added to the set ψ; their core regions need to be adjusted.
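Finding the contiguous regions RC, and hence R_largest, amounts to a flood fill over the mesh's 4-neighbour adjacency; a minimal sketch with assumed names:

```python
from collections import deque

def contiguous_regions(occupied):
    """occupied: set of (x, y) cores running tasks of K_no_adjust applications.
    Returns the contiguous regions, largest first (so result[0] is R_largest)."""
    remaining, regions = set(occupied), []
    while remaining:
        seed = next(iter(remaining))         # start a new region RC_i
        region, queue = {seed}, deque([seed])
        remaining.discard(seed)
        while queue:                         # breadth-first flood fill
            x, y = queue.popleft()
            for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if nb in remaining:          # neighbour also in K_no_adjust
                    remaining.discard(nb)
                    region.add(nb)
                    queue.append(nb)
        regions.append(region)
    return sorted(regions, key=len, reverse=True)
```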

B. DETERMINATION OF THE RELOCATION ORDER OF APPLICATIONS
To determine the relocation order, the applications in ψ are sorted in ascending order of the Manhattan distance between the geometric center of their core region R_i(t) and the geometric center of R_largest. Applications with shorter Manhattan distances have higher priority and are relocated earlier. If two applications have identical Manhattan distances, the one with more tasks is relocated earlier, since it is more difficult to find an appropriate core region for it than for applications with fewer tasks. The Manhattan distance for a newly arrived application is initially set to infinity, so such an application is mapped after the currently running applications. Note that a core c_i has a 2D coordinate <x_i, y_i>. The geometric center c(x_l, y_l) of core region R_i(t) can be approximately determined by averaging the coordinates of the cores in the region, i.e., x_l ≈ (Σ_{c_j∈R_i(t)} x_j)/|R_i(t)| and y_l ≈ (Σ_{c_j∈R_i(t)} y_j)/|R_i(t)|.
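The relocation ordering can be sketched as below, using coordinate averaging for the geometric centers and the tie-breaking rules above; the tuple-based application representation is an assumption for illustration.

```python
def center(region):
    """Geometric center of a region, approximated by coordinate averaging."""
    xs = [x for x, _ in region]
    ys = [y for _, y in region]
    return (sum(xs) / len(region), sum(ys) / len(region))

def relocation_order(apps, r_largest):
    """apps: list of (name, region_or_None, n_tasks); None = newly arrived.
    Sort by Manhattan distance to the center of R_largest (new arrivals get
    infinity), breaking ties by task count, larger first."""
    cx, cy = center(r_largest)
    def key(app):
        _, region, n_tasks = app
        if region is None:
            return (float("inf"), -n_tasks)
        ax, ay = center(region)
        return (abs(ax - cx) + abs(ay - cy), -n_tasks)
    return sorted(apps, key=key)
```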

C. FINDING THE APPLICATION'S CORE REGION
For each application A_i in ψ, a three-step algorithm is performed to find a core region.
Step 1 (Find All the Possible Core Locations That an Application Can Be Mapped To): Two classes of cores are first defined: periphery cores and internal cores. A periphery core is one that is physically located on the edge of the network, while an internal core is one that is at least one core away from the edge of the network. A core c_k that falls into one of the following two cases is a possible core location for application A_i and is added into the set of possible core locations for A_i.
• Case 1: c k is a periphery core and only one neighboring core is occupied.
• Case 2: c_k is an internal core, and c_k shares two occupied neighboring cores with another core c_j, where c_j is one of the cores located diagonally from c_k, i.e., at c(x_k + 1, y_k + 1), c(x_k − 1, y_k + 1), c(x_k + 1, y_k − 1), or c(x_k − 1, y_k − 1).
Step 2 (Form the Candidate Regions): For each core c_k in the set of possible core locations, cores are selected to form the candidate region R_ik. R_ik ∈ R_i is defined as the k-th candidate region for application A_i, starting from the possible core location c_k. To form R_ik, the |R_i(t)| − 1 cores with the minimal D_j are added into R_ik, where D_j is the sum of the Manhattan distances between free core c_j and all the cores already in R_ik; that is, D_j = Σ_{c_l∈R_ik} D(c_j, c_l), where D(c_j, c_l) is the Manhattan distance between cores c_j and c_l.
Step 3 (Choose the New Region Out of the Candidate Regions): From candidate region set R i , the region with the minimal migration cost is selected as the new core region for application A i . The migration cost of a candidate region is approximated as the Manhattan distance between the geometric center of application A i 's current core region and the geometric center of its candidate region. For the newly arrived application, select a core region randomly from the candidate regions as the new region.
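Step 2's greedy growth of a candidate region R_ik by minimal D_j can be sketched as follows (names assumed for illustration):

```python
def manhattan(a, b):
    """Manhattan distance D(c_j, c_l) between two cores on the mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def grow_candidate_region(seed, free_cores, region_size):
    """Starting from possible core location c_k (seed), repeatedly add the
    free core with minimal D_j, the sum of Manhattan distances to the cores
    already in the region, until |R_i(t)| cores are collected."""
    region, free = [seed], set(free_cores) - {seed}
    while len(region) < region_size and free:
        best = min(free, key=lambda c: sum(manhattan(c, r) for r in region))
        free.discard(best)
        region.append(best)
    return region
```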
Since the time complexity of determining the new region for an application is O(n_i · |R_i(t + 1)| · |B(t)|), where n_i is the number of possible core locations for A_i, and there are |ψ| applications that need to be mapped or remapped, the time complexity of region determination at each control time is O(|ψ| · n_i · |R_i(t + 1)| · |B(t)|).

VII. TASK MAPPING
The task mapping algorithm maps the tasks of an application in ψ to its core region while minimizing the communication latency and improving the computation performance. Specifically, there are two major steps in this algorithm, as shown in Fig. 9: for each application in ψ , (1) extend the task graph, and (2) perform the task-to-core mapping.

A. EXTENDING TASK GRAPH
The number of dark cores |B_i(t)| allocated to application A_i was obtained by running the dark core budgeting algorithm presented in Section V. Since application A_i can be described by its task graph, |B_i(t)| dummy tasks (nodes), all with a node weight of zero, are created, and each of these dummy nodes is connected to a task that has the maximal execution time in the task graph of application A_i. If the execution times of two tasks happen to be identical, the one with fewer neighboring tasks is selected first to connect with a dummy task. The k-th dummy task associated with task v_ij is denoted as h̃_{v_ij,k} ∈ H_{v_ij}. Binding dummy task h̃_{v_ij,k} to task v_ij does not change the characteristics of the task graph, as h̃_{v_ij,k} has only one neighbor, task v_ij, and the node weight of h̃_{v_ij,k} and the communication volume of edge (v_ij, h̃_{v_ij,k}) are both set to zero. Consider the task graph in Fig. 10(a): to add three dummy tasks into it, tasks v_12, v_14, and v_17 are found to have the longest execution times, and thus each of these three tasks is connected with a dummy task. The extended task graph is shown in Fig. 10(b).
The dummy task h̃_{v_ij,k} cannot be mapped until its neighboring task v_ij has been mapped, and h̃_{v_ij,k} is mapped to a core adjacent to the one running task v_ij. The positions of all the dummy tasks in H_{v_ij} associated with task v_ij are determined by the function L(H_{v_ij}, R_i(t)) with the following steps (Algorithm 2).
For each dummy task h̃_{v_ij,k} in H_{v_ij}, the following steps are run to find the set C_{v_ij,k} of possible cores that can be allocated to h̃_{v_ij,k}; a core is then randomly selected from C_{v_ij,k} to run h̃_{v_ij,k}.
Step 1: Build set C_1, which contains the free core(s) in core region R_i(t) whose Manhattan distance to the core M(v_ij) (occupied by task v_ij) is the shortest. The cores in C_1 are added into C_{v_ij,k} (Lines 3-4). If there is only one core in C_1, jump to Step 4.
Step 2: If there is more than one core in C_1, clear set C_{v_ij,k} and find set C_2 ⊆ C_1 such that each core c_l in C_2 is farthest away from all the currently mapped dummy tasks of application A_i. The cores in C_2 are added into C_{v_ij,k} (Lines 5-8). This step helps distribute the dark cores across the chip. If there is only one core in C_2, jump to Step 4.
Step 3: If there is more than one core in C_2, clear set C_{v_ij,k} and find set C_3 ⊆ C_2 such that each core in C_3 has the minimal number of available free neighboring cores (denoted as ℵ_{c_k}); all the cores in C_3 are added into C_{v_ij,k} (Lines 9-13). This step reduces the impact of dark cores on communication latency, since a dummy task h̃_{v_ij,k} does not incur any communication with other tasks.
Step 4: From set C_{v_ij,k}, a core is selected randomly, and dummy task h̃_{v_ij,k} is mapped to it (Lines 15-16). The core occupied by a dummy task is turned off as a dark core.
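The C_1 → C_2 → C_3 filter chain of Algorithm 2 can be sketched as follows; for brevity this sketch returns the first survivor deterministically instead of the paper's random choice in Step 4, and all names are illustrative.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def place_dummy(task_core, free_cores, placed_dummies):
    """Place one dummy task for the task running on task_core = M(v_ij).
    C1: free cores nearest to task_core; C2: those in C1 farthest from
    already-placed dummies; C3: those in C2 with fewest free neighbours."""
    free = list(free_cores)
    # Step 1: nearest to M(v_ij).
    d1 = min(manhattan(c, task_core) for c in free)
    c1 = [c for c in free if manhattan(c, task_core) == d1]
    if len(c1) == 1:
        return c1[0]
    # Step 2: farthest from already-mapped dummy tasks (spreads dark cores).
    def spread(c):
        return sum(manhattan(c, d) for d in placed_dummies)
    s_max = max(spread(c) for c in c1)
    c2 = [c for c in c1 if spread(c) == s_max]
    if len(c2) == 1:
        return c2[0]
    # Step 3: fewest free neighbouring cores (limits latency impact).
    free_set = set(free_cores)
    def free_nb(c):
        x, y = c
        return sum(nb in free_set
                   for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)))
    n_min = min(free_nb(c) for c in c2)
    return [c for c in c2 if free_nb(c) == n_min][0]
```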
B. TASK-TO-CORE MAPPING
Algorithm 3 shows the two-step task-to-core mapping, which takes the core region of application A_i and the sorted task set I as input and produces the mapping result.
Step 1: v_m, the task with the highest total communication volume, is mapped to the geometric center (c_center) of application A_i's core region R_i(t). If task v_m is connected with dummy tasks, their respective positions are determined by function L(·, ·) (Algorithm 2) (Lines 1-2).
Step 2: let I be the set of tasks in V_i(τ) sorted by their communication volumes in descending order; tasks connected with dummy tasks are mapped first. For each task v_ij in I, find set Z_{v_ij}, which includes the possible positions where v_ij can be mapped. A core c* is randomly selected from Z_{v_ij} to run task v_ij. There are two cases to consider when building set Z_{v_ij} (Lines 3-24).
• Case 1: (Lines 5-11) if at least one neighbor of task v_ij has been mapped, build set Z_1, where each core in Z_1 is closest to all the tasks in V_ij, the set of the already-mapped neighboring tasks of v_ij. The cores in Z_1 are next added into Z_{v_ij} (Lines 5-7). If there is more than one core in Z_1, clear set Z_{v_ij} and find set Z_2 from Z_1, where the number of available neighboring cores of each core c_l in Z_2 is closest to the number of unmapped neighboring tasks of task v_ij (i.e., β_{v_ij}). The cores in Z_2 are added into Z_{v_ij} (Lines 8-11).
• Case 2: (Lines 12-20) if none of v_ij's neighboring tasks is mapped yet, build set Z_1, where the number of available neighboring cores of each core c_l in Z_1 is closest to β_{v_ij} (the number of unmapped neighboring tasks of task v_ij). The cores in Z_1 are added into Z_{v_ij} (Lines 13-14). If Z_1 has more than one core, clear set Z_{v_ij} and find set Z_2 from Z_1, where each core in Z_2 is farthest away from the positions of all the tasks in V_i, the set of the already-mapped tasks of application A_i. The cores in Z_2 are added into Z_{v_ij} (Lines 15-19). After building set Z_{v_ij}, a core is randomly selected from Z_{v_ij}, and task v_ij is mapped to this selected core (Lines 21-22). After mapping task v_ij, check and find positions for its dummy tasks, using function L(·, ·) described in Algorithm 2 (Line 23).
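The two-case candidate construction above can be sketched as follows. The function and variable names (`candidate_cores`, `neighbors`, `beta`) are hypothetical, and the final random tie-break (Lines 21-22) is left to the caller:

```python
import random

def manhattan(a, b):
    """Manhattan distance between two grid positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def candidate_cores(task, mapping, neighbors, free_cores):
    """Build the candidate set Z for one task (Cases 1 and 2, sketched).

    task       : id of the task to place
    mapping    : dict task -> core position for already-mapped tasks
    neighbors  : dict task -> list of communicating (neighboring) tasks
    free_cores : set of free core positions (x, y)
    """
    mapped_nb = [mapping[t] for t in neighbors[task] if t in mapping]
    beta = sum(t not in mapping for t in neighbors[task])  # unmapped neighbours

    def free_neigh(c):
        x, y = c
        return sum((x + dx, y + dy) in free_cores
                   for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)))

    cores = list(free_cores)
    if mapped_nb:
        # Case 1: prefer cores closest to the already-mapped neighbours ...
        dist = {c: sum(manhattan(c, p) for p in mapped_nb) for c in cores}
        z = [c for c in cores if dist[c] == min(dist.values())]
        # ... then cores whose free-neighbour count best matches beta.
        if len(z) > 1:
            gap = {c: abs(free_neigh(c) - beta) for c in z}
            z = [c for c in z if gap[c] == min(gap.values())]
    else:
        # Case 2: match the free-neighbour count to beta first ...
        gap = {c: abs(free_neigh(c) - beta) for c in cores}
        z = [c for c in cores if gap[c] == min(gap.values())]
        # ... then prefer cores farthest from every already-mapped task.
        if len(z) > 1 and mapping:
            dist = {c: sum(manhattan(c, p) for p in mapping.values()) for c in z}
            z = [c for c in z if dist[c] == max(dist.values())]
    return z

# The algorithm then picks one candidate at random, e.g.:
# chosen = random.choice(candidate_cores(task, mapping, neighbors, free_cores))
```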
The time complexity of Algorithm 3 is O(…). Once the positions of the tasks for all of the applications in set ψ are determined, the method in [4] is used to set the voltage/frequency levels of the cores so that they can run at high speed without violating the temperature constraint and the maximum power capacity.

Fig. 11 shows a mapping example for the task graph with three dark cores in Fig. 10. First, map the task with the highest communication volume: task v_14 is mapped to core c_5, the geometric center of the core region, in Fig. 11(a). A core is then selected randomly from {c_2, c_4, c_6} for dummy task h_{v_14,1}, since c_2, c_4, and c_6 have fewer available neighboring cores than c_8, as shown in Fig. 11(b).
Second, map the tasks connected with dummy tasks. Task v_17 is mapped to c_7, since it has no mapped neighboring tasks and has four unmapped neighboring tasks, which is closest to ℵ_{c_7} = 3, the number of available neighboring cores of c_7. The dummy task h_{v_17,1} is mapped to core c_10, as the Manhattan distance from c_10 to core M(h_{v_14,1}), occupied by dummy task h_{v_14,1}, is larger than that from c_4 or c_8 to M(h_{v_14,1}), as shown in Fig. 11(c). In a similar fashion, task v_12 and its dummy task h_{v_12,1} are subsequently mapped, as shown in Fig. 11(d).
Third, map the tasks whose dummy task set H_{v_ij} is empty. Task v_15 is mapped to core c_8, as task v_15 is a neighbor of tasks v_14 and v_17. Similarly, tasks v_13, v_16, v_11, and v_18 are mapped onto the core region, as shown in Fig. 11(e).

VIII. PERFORMANCE EVALUATION A. EXPERIMENTAL SETUP
To model the task graph and application execution, we implement an event-driven C++ network simulator, whose configuration is summarized in Table 2. This simulator models packet delay and communication energy consumption in a cycle-accurate manner. HotSpot [29] is used for temperature simulation, and McPAT [30] is used as the power model. The power needed to turn on a dark core is set to the same value as in [31]. The floorplan of the underlying many-core system is adopted from [32].
We evaluate the proposed method on random and real workloads, as tabulated in Table 2. The task degree of the random applications ranges from 1 to 14, and the number of tasks per application varies from 4 to 15. The task graphs of the real applications are generated from traces of the PARSEC [26] and SPLASH-2 [33] benchmark suites. These traces are collected by executing the benchmarks in an NoC-based cycle-accurate many-core simulator [34], whose configuration is also reported in Table 2. The PARSEC and SPLASH-2 applications are run with 16 and 64 threads in the two network sizes, 4×4 and 8×8, respectively.
We compare our proposed method with the following methods: (1) Fixed_dark_core_allocation, which cannot adjust the number of dark cores after the initial task-to-core mapping and uses only the mapping method described in Section VII-B; (2) Bubble_budgeting [4], which uses virtual mapping to determine the number and positions of dark cores; and (3) Adboost [6], where a core region including dark cores is found for an application. The latter two schemes [4], [6] are the state-of-the-art thermal-aware runtime task mapping approaches that also consider dark core allocation. Hereinafter, our proposed scheme (including the adjustment) is termed Proposed.
In the following experiments, we compare the four methods in terms of throughput, communication latency, and waiting time under different network sizes and application arrival rates. Waiting time occurs when there are insufficient cores to run newly arrived applications.
where the two quantities are the throughputs obtained from the simulator and from the throughput model, respectively. From Fig. 12(a), one can see that the seventh-order polynomial regression has the lowest error (7.61%) among all orders considered. Therefore, in the following experiments, the seventh-order polynomial regression model is used as the throughput model.
VOLUME 8, 2020
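The model-selection step can be mimicked with a least-squares polynomial fit. The data below are synthetic stand-ins for the simulator's samples (a saturating log-like curve plus noise is an assumption for illustration), so only the procedure, not the paper's 7.61% figure, is reproduced:

```python
import numpy as np

# Synthetic (cores, throughput) samples standing in for the simulator data.
rng = np.random.default_rng(0)
x = np.linspace(1, 12, 40)                    # e.g. number of active cores
y = 10.0 * np.log1p(x) + rng.normal(0.0, 0.1, x.size)

def mean_rel_error(order):
    """Fit a polynomial of the given order and return the mean relative
    error between the measured and modelled throughput."""
    coeffs = np.polyfit(x, y, order)
    pred = np.polyval(coeffs, x)
    return float(np.mean(np.abs(y - pred) / y))

# Compare regression orders the way Fig. 12(a) does and keep the best one.
errors = {k: mean_rel_error(k) for k in range(1, 8)}
best_order = min(errors, key=errors.get)
```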

C. THE COMPARISON OF OFFLINE AND ONLINE THROUGHPUTS
It is possible that the core region used for training the throughput model (see Section III-B) differs from the one selected at runtime. In addition, the thermal profile of the runtime system might also differ from that used for throughput model training. Thus, the estimated throughput used in the dark core budgeting algorithm for application set A may differ from the online throughput obtained from application execution at runtime. Among the many different application sets executed, the difference between the two is within 6%, as shown in Fig. 12(b). Moreover, the experimental results show that the average aspect ratio of the application core regions determined online is 1.22, which is close to the average aspect ratio (close to 1) of the core regions used for training the throughput model. These results indicate that the throughput model gives a fairly accurate prediction of the throughput, which the dark core budgeting algorithm critically relies on.
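The two quantities compared in this subsection reduce to simple formulas; the function names below are illustrative, not the paper's notation:

```python
def throughput_gap(estimated, online):
    """Relative difference between the throughput estimated offline by the
    model and the throughput measured online."""
    return abs(estimated - online) / online

def aspect_ratio(width, height):
    """Aspect ratio of a rectangular core region, normalised to be >= 1."""
    return max(width, height) / min(width, height)

# E.g. a 6% model error and a 3x2 region with aspect ratio 1.5:
# throughput_gap(106.0, 100.0) -> 0.06, aspect_ratio(3, 2) -> 1.5
```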

D. FINDING THE INTERVAL LENGTH OF CONTROL TIME
Our approach is triggered at each control time to handle workload variation and changes in applications' computation demands. Fig. 13 shows how the throughput varies with the length of the interval between two control times (in million cycles). Applications with different execution times and communication volumes are executed under different system settings in terms of network size and application arrival rate. From Fig. 13, one can see that a control interval length of 75M cycles yields the best performance. Therefore, in the following experiments, we set the control interval length to 75M cycles.
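The interval selection amounts to a simple sweep. The throughput values below are hypothetical placeholders for the measurements in Fig. 13:

```python
# Hypothetical (interval length in Mcycles -> normalised throughput) pairs,
# standing in for the sweep reported in Fig. 13.
sweep = {25: 0.91, 50: 0.97, 75: 1.00, 100: 0.98, 125: 0.95}

def best_interval(measurements):
    """Pick the control-interval length with the highest throughput."""
    return max(measurements, key=measurements.get)

# best_interval(sweep) -> 75, matching the 75M-cycle interval chosen above.
```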

E. PERFORMANCE EVALUATION ON RANDOM BENCHMARKS
Fig. 14 compares the throughput, waiting time, and communication latency of the four methods in systems with different network sizes, running the random benchmarks where applications arrive at the system randomly. These results are normalized to those of the proposed method. It can be seen from Fig. 14(a) that the proposed method improves the throughput by 23.9%, 26.3%, and 29.2% over Fixed_dark_core_allocation for network sizes of 5 × 5, 8 × 8, and 12 × 12, respectively. The proposed algorithm can adjust the dark cores of each application at runtime to optimize both currently running applications and newly arrived ones; it thus considers all of the applications to make a sound global decision that redistributes the dark cores among them. Fixed_dark_core_allocation only takes the next arriving application into account and cannot change the dark core allocation in response to changing computation demands, which leads to sub-optimal performance. It can also be seen from Fig. 14(a) that, on average, the throughput of the proposed method is 1.45× and 1.82× that of Bubble_budgeting and Adboost, respectively. The reason is that Bubble_budgeting only optimizes an individual application, without considering all the currently running applications; it might therefore allocate an excessive number of dark cores to certain applications. Adboost, on the other hand, assumes the system has a fixed number of dark cores and cannot allocate cores to applications according to their computation demands.
As shown in Fig. 14(b), on average, the proposed approach reduces the waiting time by 33.0%, 44.7%, and 71.0% over Fixed_dark_core_allocation, Bubble_budgeting, and Adboost, respectively. The reason is that the proposed approach makes a global decision to balance the execution time of currently running applications and the waiting time of newly arrived ones on the fly. The communication latency of the proposed approach is also lower than those of the other three methods, as shown in Fig. 14(c), because the proposed approach adjusts the mapping scheme according to changing computation demands. Moreover, with the proposed method, the dark cores are placed such that they have little impact on communication latency. With large network sizes, the proposed approach achieves better performance in terms of waiting time and communication latency, since there are more dark cores that can be adjusted at runtime to accommodate workload variations.

Fig. 15 compares the throughput, waiting time, and communication latency of the four methods in a system running the random benchmarks with different application arrival rates. The results in Fig. 15(a), (b), and (c) are normalized to those of the proposed method. The throughput in Fig. 15(d) is normalized to that of Adboost at an application arrival rate of 1. The application arrival rate is defined as the number of applications arriving at the system per 10^5 cycles, which measures the workload of the system. It can be seen from Fig. 15(a) that, when the arrival rate is high, e.g., 2.78 applications per 10^5 cycles, the throughput of the proposed approach is 1.20×, 1.42×, and 1.96× that of Fixed_dark_core_allocation, Bubble_budgeting, and Adboost, respectively. On average, the proposed approach reduces waiting time by 83%, 96%, and 99% over Fixed_dark_core_allocation, Bubble_budgeting, and Adboost, respectively.
The proposed approach achieves better performance since it can adjust the dark cores to reduce the waiting time of newly arrived applications when application arrival rates are high.
It can also be seen from Fig. 15(d) that Adboost and Fixed_dark_core_allocation reach their throughput saturation points at arrival rates of 1.25 and 1.85 applications per 10^5 cycles, respectively, while the proposed approach and Bubble_budgeting both saturate at 2.50 applications per 10^5 cycles. The reason is that the proposed approach and Bubble_budgeting both take the application arrival rate into consideration. Moreover, the throughput of the proposed approach increases more rapidly than that of Bubble_budgeting, as it considers all of the applications to make a global optimization.

Fig. 16 compares the throughput, waiting time, and communication latency of the four approaches in systems with different network sizes, running the real benchmarks where applications arrive at the system randomly. These results are normalized to those of the proposed method. The throughputs of the proposed method are 1.15×, 1.40×, and 1.73× those of Fixed_dark_core_allocation, Bubble_budgeting, and Adboost on average, respectively. The proposed approach also shows substantially reduced waiting time and communication latency, as shown in Fig. 16, because it makes dark core allocation and adjustment decisions at runtime, which helps to optimize the performance of both currently running applications and newly arrived ones.

Fig. 17 shows the throughput, waiting time, and communication cost of the four methods when running the real benchmarks with different arrival rates. These results are normalized to those of the proposed method. When the arrival rate is high, the throughput achieved by our approach is about 1.16×, 1.35×, and 1.48× that of Fixed_dark_core_allocation, Bubble_budgeting, and Adboost, respectively. On average, the proposed approach reduces waiting time by 43%, 79%, and 86% over Fixed_dark_core_allocation, Bubble_budgeting, and Adboost, respectively.
The reasons are similar to those seen in the case of the random benchmarks: simply put, adjusting dark cores achieves higher system performance.

Fig. 18 evaluates the average peak temperatures of the four methods by running applications one hundred times in a system with different configurations. One can see that the peak temperatures of all four algorithms are below the temperature threshold of 80°C, but the proposed method achieves the lowest temperature: it reduces the average peak temperature by 1°C, 2°C, and 3°C compared with Fixed_dark_core_allocation, Bubble_budgeting, and Adboost, respectively. The reason is that the proposed mapping algorithm spreads the dark cores across the chip and redistributes them when needed at runtime, which has a positive impact on heat dissipation and brings down the chip temperature.

H. COST ANALYSIS OF THE PROPOSED ALGORITHM
The time penalties of running the three-step proposed approach, Bubble_budgeting, and Adboost are all on the order of 0.25M cycles. These figures are averaged over one hundred runs of the algorithms with different system parameters, such as network size, arrival rate, and communication volume of applications. In practice, most applications run for more than 10^8 cycles. Therefore, from the perspective of application execution time, the time penalty of running the proposed algorithm is quite low. The power consumption of running the proposed algorithm, also measured in the experiments, is 17.01W. The global average migration overhead over a control interval of 75M cycles is on the order of 0.2M cycles, which is also acceptably low.
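The reported overheads can be sanity-checked with back-of-the-envelope arithmetic using the cycle counts quoted above:

```python
# Figures quoted in this subsection (all in cycles).
algo_cycles = 0.25e6        # one invocation of the proposed algorithm
app_cycles = 1e8            # lower bound on typical application run time
migration_cycles = 0.2e6    # average migration overhead per control interval
interval_cycles = 75e6      # control interval length

# Overheads relative to useful work: both come out well below 1%.
algo_overhead = algo_cycles / app_cycles                 # = 0.0025 (0.25%)
migration_overhead = migration_cycles / interval_cycles  # ~ 0.0027 (0.27%)
```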

IX. CONCLUSION
In this paper, built upon a dynamic programming network (DPN) framework, a runtime dark core allocation and dynamic adjustment scheme was proposed, taking into account the application arrival rate as well as the variation of applications' computation demands. An efficient task mapping algorithm was also proposed to reduce the negative impact of dark cores on communication latency and fragmentation. The experiments confirmed that, compared with two existing runtime thermal-aware resource management approaches, the proposed approach improves the system throughput by 61% on average. The time penalty of running the proposed algorithm is very low, making it a suitable method for runtime resource management in many-core systems.