PEW: Prediction-Based Early Dark Cores Wake-up Using Online Ridge Regression for Many-Core Systems

Future many-core systems need to address the dark silicon problem, where some cores must be turned off to control the chip's thermal and power density, which effectively limits the performance gain from having a large number of processing cores. Task migration has previously been proposed to improve many-core system performance by moving tasks between active and dark cores. As task migration imposes a system performance overhead due to the large wake-up latency of the dark cores, this paper proposes a prediction-based early wake-up (PEW) technique to reduce the dark cores' wake-up latency during task migration. A window-based online ridge regression (RR) is used as the prediction model. The prediction model uses the past window's thermal, power, and core status (i.e., active or dark) data to predict the future core temperatures at run-time. If task migration is predicted in the next control period, the proposed PEW puts the dark cores in a power state with low wake-up latency. Thus, the proposed PEW reduces the time for the dark cores to start executing the tasks. The comparison results show that our proposed PEW reduces the completion time by up to 7.9% and 4.1% compared to non-early wake-up (NoEW) and a fixed threshold wake-up (FEW), respectively. They also show that the proposed PEW increases the MIPS/Watt by up to 5.5% and 2.3% over NoEW and FEW, respectively. These results show that the proposed PEW improves the many-core system's overall performance by reducing the dark cores' wake-up latency and increasing the number of executed instructions per Watt.


I. INTRODUCTION
Increasing the processor frequency, guided by Dennard scaling [1], was long the key means of increasing the performance of computing circuits. However, Dennard scaling ended around 2005, when the power per transistor could no longer scale down with the scaling of fabrication technology. This ended the frequency increase of single-core processors because of the high power density. To overcome this problem, many-core systems were introduced by integrating more cores with lower operating frequencies into the processor's chip to improve the overall computing performance.
The associate editor coordinating the review of this manuscript and approving it for publication was Songwen Pei.
Adding more cores by reducing the technology size, according to Moore's law [2], increases the total power of many-core systems, which results in higher chip temperatures. Thus, only a part of the many-core system can be in an active state (i.e., turned on) while the rest must remain in a dark state (i.e., turned off). Turning off some cores limits the performance gain from the increasing number of cores in many-core systems. This inability to use all the processing cores is called the dark silicon problem [3], which is expected to be a major issue in future many-core systems. According to Refs. [3], [4], over half of the cores in a many-core system-on-chip (MCSoC) would be dark cores in 8-nm technologies. This prediction led researchers to develop techniques that improve the performance of dark silicon many-core systems under either power budget constraints [5]-[14] or thermal constraints [15]-[19]. To avoid run-time thermal violations, most of these techniques use dynamic thermal management (DTM), such as task migration and dynamic voltage frequency scaling (DVFS). However, the techniques that use task migration avoid migrating tasks to dark cores because of the high wake-up latency of the dark cores: in the dark state all core components are off, and they need time to return to normal operation before the core can run the upcoming tasks. Previous studies [18], [19] show that migrating tasks from active to dark cores can improve the many-core system performance, but the dark cores' wake-up latency remains significant. The studies in Refs. [20], [21] proposed an early wake-up to address the wake-up latency of the dark cores. However, these studies use a fixed wake-up threshold, which may not suit applications with high thermal fluctuation.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This paper proposes a prediction-based early wake-up (PEW) technique to reduce the wake-up latency impact of dark cores during task migration. The proposed PEW consists of two parts: online ridge regression (RR) and an early wake-up (EW) algorithm. The online RR is used as a prediction model to predict the future core temperatures at run-time at every predefined interval, called the control period. Meanwhile, the EW algorithm is used to predict the likelihood of task migration in the next control period based on the predicted cores' temperatures. The proposed PEW sets the dark core power state to the one with a lower wake-up latency if task migration is expected to be used in the next control period. This reduces the time for the cores to start executing the tasks, which collectively improves the many-core system's overall performance. In summary, the contributions of this paper are as follows:
• This paper presents the PEW technique to reduce the dark cores' wake-up latency impact during task migration.
• The online RR is used as a prediction model to predict cores' temperatures in the next control period.
• The EW algorithm is used to put the dark cores in a power state with low wake-up latency based on the predicted temperatures.
• A comprehensive study using compute- and memory-intensive real-world applications has been conducted to validate the proposed PEW technique.
The remainder of this paper is structured as follows. Related works are discussed in Section II. The system model and problem definition are presented in Section III. The methodology of the proposed work is described in Section IV, while the performance of the proposed work is evaluated in Section V. Finally, the conclusion and future work are presented in Section VI.

II. RELATED WORK
The increased power densities in many-core systems due to technology node shrinking have resulted in the so-called dark silicon problem. The dark silicon problem limits the performance gain from using all available cores in a many-core system [22]. This problem has received considerable attention in recent years as a significant issue for many-core systems, and many techniques for optimizing the performance of dark silicon many-core systems have been proposed. These performance optimization techniques can be categorized into performance optimization under a power constraint and performance optimization under a thermal constraint.
The power-constraint techniques use thermal design power (TDP), which is a fixed per-chip power budget [5], [8]-[10], or thermal safe power (TSP), which is a fixed per-core power budget [6], [7], to avoid thermal violations. However, the use of a power budget can cause chip thermal violations since transient temperature and heat transfer between cores are excluded [23]. In contrast, the thermal-constraint techniques consider the transient temperatures and heat transfer between the cores to prevent chip thermal violations. In thermal-constraint techniques, task migration is one of the DTM techniques often used to balance the chip's temperature and prevent thermal violations at run-time. However, migrating a task to a dark core imposes an overhead due to the dark core's wake-up latency. As our proposed work focuses on reducing the dark cores' wake-up latency during task migration, the following paragraphs present related works that used task migration to maximize performance for dark silicon many-core systems.
To improve dark silicon many-core systems performance, some techniques use task migration and application mapping. Shafique et al. [17] introduced DaSiM, a variability-aware management technique for dark silicon many-core systems. DaSiM models the variations of core-to-core leakage power. It uses thread mapping and dark silicon patterning to activate or boost more cores by reducing the maximum temperature. DaSiM provides a lightweight prediction technique to predict the thermal distribution of a certain mapping and patterning solution at run-time. To handle thermal violation, DaSiM uses power-gating or task migration.
Some studies used a combination of DVFS and task migration for maximizing the dark silicon many-core performance. Hanumaiah et al. [24] proposed a run-time scheduling technique to improve many-core system performance. This technique uses task migration to allocate tasks to cores at run-time. During the first period of the task migration, it sets the DVFS levels of cores to a maximum level that does not violate the safe chip temperature. In a similar work, Wang et al. [9], [25] introduced a run-time thermal management technique to improve many-core system performance.
Based on model predictive control (MPC) decisions, this technique uses task migration to balance the chip's temperature by migrating tasks between active cores. DVFS is used instead if task migration cannot be used. The aforementioned techniques avoid task migration to dark cores because of the dark cores' high wake-up latency. However, migrating tasks among active cores may increase the migration overhead. For example, if two active cores exchange their tasks to balance the temperature, two tasks are migrated and the migration overhead is doubled. In contrast, migrating a task from an active core to a dark core moves only one task and halves this overhead.
Some studies used dark cores in task migration. The studies in Refs. [26]-[29] used virtual task migration to pattern the active and dark cores for optimizing the communication and computation performance of dark silicon many-core systems. These techniques move the location of the dark cores and not the actual tasks. Dark cores are used as bubbles to distribute the active cores' heat. In Ref. [19], a technique for optimizing dark silicon many-core systems called DTaPO was introduced. DTaPO uses task migration to swap tasks between active and dark cores to maintain high overall system performance and keep the many-core system temperature within a safe thermal operating range. However, none of these studies provided a solution for the wake-up latency of dark cores during task migration.
Bashir et al. [20] proposed a scheduling technique that optimizes system performance under a thermal constraint by reducing the wake-up time needed for task migration. Based on offline thermal results, the technique estimates the time needed to reach the threshold temperature so that the sleeping cores can be put in idle mode before task migration is performed. However, this technique is not suitable for uncharacterized applications. In another work, Bashir et al. [21] proposed an improved technique suitable for run-time performance optimization. In this technique, the temperature is sensed at run-time, and task migration is used to move tasks to dark cores to address thermal violations. Both works switch the dark cores to idle mode early and depend on a fixed early wake-up threshold. Although cores in idle mode can run the upcoming tasks immediately, early switching to idle mode may cause more performance degradation due to more frequent DTM calls. Moreover, a fixed wake-up threshold may not be suitable for applications that have high thermal fluctuation.
This paper provides a solution for the dark cores' wake-up latency overhead during task migration by proposing a prediction-based early wake-up (PEW) technique. Instead of using a fixed wake-up threshold, the proposed technique uses a prediction model to determine when to wake up the dark cores. An online sliding window-based ridge regression (RR) is used as the prediction model. If task migration is expected to be used in the next control period, the early wake-up (EW) algorithm uses the core's power states to put the dark cores in a power state with low wake-up latency (∼10 µs). Thus, it reduces the time for the dark cores to start running the tasks, which improves the many-core system's overall performance.

III. SYSTEM OVERVIEW AND PROBLEM DEFINITION
This section presents the dark silicon many-core system model, a background on core power states, as well as problem definition and formulation.

A. SYSTEM MODEL
The system model is presented in Fig. 1. The many-core system consists of 64 homogeneous cores. The many-core system supports preemptable tasks so that a task can be stopped and moved to another core to continue the execution. As this study targets the dark silicon many-core system, we assume that only half of the cores can be activated simultaneously. The active and dark cores are patterned like a chessboard so that dark cores surround each active core for better heat dissipation [23]. Although the chessboard pattern adds one hop for each active core to the communication latency, it has a lower peak chip temperature than the contiguous pattern [19].
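The chessboard patterning can be illustrated with a minimal Python sketch. It assumes an 8 × 8 grid (matching the mesh used in the experimental setup), row-major core numbering, and core 0 starting as active; these indexing choices and the function names are illustrative assumptions, not details taken from the system model.

```python
# Sketch: chessboard active/dark pattern on an 8 x 8 grid (64 cores).
# Row-major numbering and "core 0 is active" are assumptions for illustration.

GRID = 8  # 8 x 8 mesh, 64 cores


def chessboard_mask(grid=GRID, swapped=False):
    """Return a list of booleans: True = active core, False = dark core."""
    mask = []
    for row in range(grid):
        for col in range(grid):
            active = (row + col) % 2 == 0
            mask.append(active != swapped)  # swapped=True inverts the pattern
    return mask


def neighbours(core, grid=GRID):
    """Return the ids of the mesh neighbours (north, south, west, east)."""
    row, col = divmod(core, grid)
    result = []
    if row > 0:
        result.append(core - grid)
    if row < grid - 1:
        result.append(core + grid)
    if col > 0:
        result.append(core - 1)
    if col < grid - 1:
        result.append(core + 1)
    return result


mask = chessboard_mask()
active_cores = [c for c, is_active in enumerate(mask) if is_active]
dark_cores = [c for c, is_active in enumerate(mask) if not is_active]
```

In this sketch every mesh neighbour of an active core is dark, and calling `chessboard_mask(swapped=True)` exchanges the roles of the two sets, which mirrors swapping the locations of active and dark cores.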
DTaPO [19] is used to continuously track the many-core system status. Specifically, it monitors the active and dark cores' locations, voltage/frequency level, power, and transient temperature. DTaPO swaps the active and dark cores' locations using task migration to manage thermal violations. If no thermal headroom is available, it reduces the voltage/frequency level using DVFS. For more details about DTaPO, refer to Ref. [19].

B. CORE C-STATES
Modern many-core processors are designed to support a set of low-power states called C-states [30] to reduce power consumption. C-states are designated C0, C1, C2, . . . , Cn, where the processor's designer decides the value of n. The active state is C0, in which the core is in active mode. As the C-state number increases, further power-saving steps are taken, such as turning off more core components (e.g., caches).
According to the ACPI standard [30], as shown in Fig. 2, the C1 state lowers the core voltage and turns off the core's clock while preserving the L1/L2 cache contents. In the C2 state, the L1/L2 cache contents are flushed to the last-level cache (LLC). The core is completely dark or off in the C3 state. However, turning off more core components increases the time the core needs to return to a fully operational state (C0). The proposed technique assumes that our many-core system supports the C0, C1, and C3 power states.

C. PROBLEM DEFINITION
Migrating tasks to dark cores causes performance degradation due to the substantial wake-up latency of the dark core. Fig. 3 illustrates that migrated tasks must wait until the dark cores are ready to execute them. Fig. 3a shows that when the dark core is in the C3 state (dark state), a task migrated at time $t_m$ must wait until the starting time $t_s$. Thus, reducing the task waiting time $W_t = t_s - t_m$ improves the overall system performance. The proposed PEW technique aims to reduce $t_s$ by putting the dark cores in a power state with low wake-up latency, i.e., the C1 state, just before the task migration at $t_m$. Thus, the dark core will start executing the migrated task earlier, as shown in Fig. 3b. This minimizes the $W_t$ of the migrated task and improves the overall performance of the many-core system. Our aim can be mathematically expressed as follows:

$$\min \; W_t = t_s - t_m \tag{1}$$

IV. PROPOSED TECHNIQUE: PEW
The proposed PEW consists of a prediction model and an early wake-up (EW) algorithm. Fig. 4 shows how the proposed technique is integrated into the system model. The proposed PEW uses ridge regression (RR) as a prediction model to predict each core's temperature. The prediction model uses the core's current status (i.e., active/dark), power, and temperature to predict the core's temperature in the next control period. Based on the predicted temperatures, the proposed EW algorithm predicts whether there will be a migration in the next control period. If migration is predicted in the next control period, it puts the dark cores in a power state with a low wake-up latency. Thus, it reduces the waiting time for the dark cores to be ready for newly arriving tasks and improves the overall performance. On the other hand, if it predicts no migration in the next control period, it leaves the dark cores in a power state that saves power.

A. PREDICTION MODEL
Linear regression is one of the most widely used techniques for predictive modeling. It tries to find a linear relationship between the inputs (independent variables) and the output (dependent variable) according to the following formula:

$$Y = X\beta + \epsilon \tag{2}$$

where $Y$ is the dependent variable and $X$ is an $n \times p$ matrix representing the independent variables, where $n$ is the number of samples and $p$ is the number of features. Vector $\beta$ represents the regression coefficients. Vector $\epsilon$ represents the random errors, which are the residuals that are not explained by the $X\beta$ term. In this work, a type of linear regression called ridge regression [31] is used as the prediction model. A linear prediction model is used because the changes in the transient temperature are linear within the short control period (< 1 ms).

1) RIDGE REGRESSION
Ridge regression (RR) [31] is a type of multiple linear regression, as represented by Eq. (2). It is used when the independent variables are correlated, which is known as the multicollinearity problem [32]. To overcome the multicollinearity problem, it adds a penalty on the squared magnitude of the coefficients to the loss function:

$$\hat{\beta} = \arg\min_{\beta} \left( \lVert Y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 \right) \tag{3}$$

where $\lVert \beta \rVert_2^2 = \sum_{j=1}^{p} \beta_j^2$ is the penalty term, and $\lambda$ is the regularization parameter that controls the penalty. Ridge regression becomes ordinary linear regression when $\lambda \to 0$.
Ridge regression was chosen in our work because it fits our regression problem, where there is a correlation between the independent variables, i.e., the current temperature, power, and core status (i.e., active/dark). Lasso regression may also be used to eliminate the collinearity in a large number of independent variables by selecting a subset of them. However, it may eliminate some important collinear variables, which may affect the prediction accuracy, especially when the number of independent variables is small, as in our prediction system model.
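As a short worked step, and assuming the standard ridge objective without an intercept term (sum of squared residuals plus $\lambda$ times the squared $\ell_2$ norm of $\beta$), the minimizer has a closed form that follows from setting the gradient to zero:

```latex
\begin{aligned}
L(\beta) &= \lVert Y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 \\
\nabla_\beta L &= -2X^\top (Y - X\beta) + 2\lambda\beta = 0 \\
(X^\top X + \lambda I)\,\hat{\beta} &= X^\top Y \\
\hat{\beta} &= (X^\top X + \lambda I)^{-1} X^\top Y
\end{aligned}
```

For any $\lambda > 0$, the matrix $X^\top X + \lambda I$ is positive definite and therefore invertible even when the columns of $X$ are collinear, which is why the penalty helps with multicollinearity.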

2) ONLINE RIDGE REGRESSION
Ridge regression uses all available data samples to make an accurate prediction. However, using all data samples is computationally intensive and infeasible for online prediction, since ridge regression has $O(np^2)$ time complexity [33]. Moreover, for highly fluctuating input data such as the core's temperature, old data samples may be worthless. Therefore, considering only the most recent data samples through a sliding window reduces the time complexity, since the number of samples used for fitting is bounded by the window size:

$$n = \min(m, w) \tag{4}$$

where $m$ represents all data samples, and $w$ is the sliding window size. The sliding window starts to move when the number of data samples exceeds the window size. The time complexity now depends on $w$ and $p$. This is suitable for our online system model because it has only three independent variables ($p = 3$) and $w$ is small.
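A minimal Python sketch of a sliding-window ridge regression is given below. It is an illustration under stated assumptions, not the paper's implementation: it refits the standard closed-form ridge solution (no intercept term) on the samples currently in the window, and the class name, method names, and sample readings are hypothetical. The default values w = 30 and λ = 0.2 are the ones reported later in the experimental setup.

```python
from collections import deque


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build an augmented matrix so the inputs are not modified.
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


class SlidingRidge:
    """Ridge regression refitted on the last `window` samples only."""

    def __init__(self, window=30, lam=0.2):
        self.samples = deque(maxlen=window)  # old samples drop out automatically
        self.lam = lam
        self.beta = None

    def update(self, features, target):
        """Add one (features, target) pair and refit on the current window."""
        self.samples.append((list(features), target))
        p = len(features)
        # Start from lam * I, then accumulate X^T X and X^T y over the window.
        xtx = [[self.lam if i == j else 0.0 for j in range(p)] for i in range(p)]
        xty = [0.0] * p
        for x, y in self.samples:
            for i in range(p):
                xty[i] += x[i] * y
                for j in range(p):
                    xtx[i][j] += x[i] * x[j]
        self.beta = solve_linear(xtx, xty)

    def predict(self, features):
        return sum(b * x for b, x in zip(self.beta, features))


# Made-up readings: (temperature in deg C, power in W, status 1 = active).
history = [(60.0, 2.0, 1), (60.5, 2.1, 1), (61.1, 2.3, 1), (61.4, 2.2, 1)]
model = SlidingRidge(window=30, lam=0.2)
for current, following in zip(history, history[1:]):
    model.update(current, following[0])  # learn: current features -> next temperature
next_temperature = model.predict(history[-1])
```

Each refit touches at most w samples with p features, so the per-period cost depends only on w and p, as argued above.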

B. PROPOSED PREDICTION-BASED EARLY WAKE-UP
This subsection presents the second component of the proposed PEW, which is the EW algorithm. All symbols used in our proposed EW algorithm are defined in Table 1. Algorithm 2 describes the proposed EW algorithm in detail. The proposed algorithm receives the predicted temperatures $T_p$ from the prediction model. It also receives the set of all active cores $\{a_0, \ldots, a_{k-1}\}$, the set of all dark cores $\{d_0, \ldots, d_{k-1}\}$, the threshold temperature $T_{th}$, and the safe margin $\theta$.
In each control period, the proposed algorithm reads the predicted temperature of each active core, $T_p[a_i]$. If the predicted temperature of the active core is higher than the threshold temperature, it is most likely that DTaPO will perform either task migration or DVFS in the next control period. Thus, the proposed EW algorithm reads the predicted temperature of a destination dark core, $T_p[d_i]$. If the predicted temperature of the destination dark core is lower than the threshold temperature by $\theta$, or lower than the temperature of the active core by $\theta$ (this is the same condition that DTaPO [19] uses to perform task migration), it sets $H[t_i] = 1$ to indicate that the task on the core that exceeded the threshold temperature is movable (lines 4-5). Otherwise, it marks that task as non-movable by setting $H[t_i] = 0$ (line 8). If all the tasks in $H$ are movable, the proposed algorithm puts the dark cores in the C1 power state to reduce their wake-up latency (lines 12-14). Otherwise, it leaves the dark cores in the C3 power state to save more power (lines 15-17).
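The decision logic described above can be sketched in Python as follows. This is a hedged reading of the prose, not a transcription of Algorithm 2: pairing the i-th active core with the i-th dark core as its destination, treating "no core predicted above the threshold" as "no migration", and the function name `early_wakeup` are all assumptions made for illustration.

```python
def early_wakeup(active, dark, t_pred, t_th, theta):
    """Return 'C1' if task migration is predicted for the next period, else 'C3'.

    active, dark: lists of core ids (assumed paired by position)
    t_pred: predicted temperature per core id
    t_th: threshold temperature, theta: safe margin
    """
    movable = []  # plays the role of H: one flag per task predicted to overheat
    for a, d in zip(active, dark):
        if t_pred[a] <= t_th:
            continue  # no thermal violation predicted for this active core
        cool_enough = t_pred[d] < t_th - theta
        cooler_than_source = t_pred[d] < t_pred[a] - theta
        movable.append(cool_enough or cooler_than_source)
    # Wake the dark cores early only if at least one task is predicted to
    # overheat and every such task can be moved (assumed empty-case handling).
    if movable and all(movable):
        return "C1"
    return "C3"


# Illustrative call with made-up temperatures; theta is 5% of a 65 deg C threshold.
state = early_wakeup([0, 1], [2, 3], [66.0, 60.0, 55.0, 50.0], 65.0, 3.25)
```

Returning a single power state for all dark cores follows the description above, where the dark cores are switched together; a per-core decision would be a different design.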

Algorithm 2 Early Wake-up (EW) of Dark Cores
Input: the set of active cores, the set of dark cores, $T_p$, $T_{th}$, and $\theta$
Output: Power states of the dark cores

The time complexity of ridge regression is $O(np^2)$ [33]. The proposed PEW uses sliding-window ridge regression, where $p = 3$ is kept constant, whereas $n$ is a small number of samples bounded by the window size ($w$). Therefore, the time complexity of our online ridge regression is $O(w)$. The EW algorithm needs to check the predicted temperature of all cores, so its time complexity depends on the number of cores ($k$) and is $O(k)$.
The space complexity of ridge regression is $O(wp + w)$, as it needs to store the matrices $X$ and $Y$, of size $w \times p$ and $w \times 1$, respectively, to find $\beta$ according to Eq. (3). Since $p$ is constant, the space complexity of the online ridge regression is also $O(w)$. For the EW algorithm, the indices of the active and dark cores and a $1 \times k$ vector of predicted temperatures need to be stored. Therefore, the space complexity of the EW algorithm is $O(k)$.
The overall complexity of the proposed PEW is the summation of online ridge regression and EW algorithm complexities that are computed one after another. Hence, the time and space complexity of the proposed PEW technique are linear.

V. EXPERIMENTAL EVALUATION
Many experiments were conducted to evaluate our proposed work. The following subsections show how the experiments are set up, the comparison results, and the discussion of the comparison results.

A. EXPERIMENTAL SETUP
Our proposed work was evaluated on a many-core system that consists of 64 cores, of which 32 are active cores and 32 are dark cores. These cores are connected using an 8 × 8 mesh network-on-chip (NoC). All the cores share the same instruction set architecture (ISA) (homogeneous microarchitecture) and can operate at different frequencies (heterogeneous frequency). Every core can run at a maximum frequency of 4 GHz. The floorplan of the simulated system is shown in Fig. 5. Table 2 shows the summary of the system setup. The LifeSim tool, which integrates Sniper [36] with the HotSpot [37] thermal simulator, was used. Sniper is an architectural x86-64 many-core simulator (including the power framework McPAT). It is faster than cycle-accurate simulators, with a 25% average performance error compared to actual hardware.
McPAT is commonly used for modeling integrated power, area, and timing because it provides comprehensive design space exploration for multi/many-core processor configurations. HotSpot is the most widely used thermal simulator. It is built on the widely used stacked-layer packaging scheme used in modern very-large-scale integration (VLSI) systems, as shown in Fig. 7.
As Sniper does not support core power-gating, Sniper's scheduler was modified to assign the tasks to the active cores only, using the core mask pattern. Also, McPAT was modified to estimate only the power of the caches and the memory management unit for the C1 power state, as only the caches are active in the C1 state. The power of the dark state (C3 state) is not considered. The wake-up latencies for C1 and C3 are assumed to be 10 µs and 200 µs, respectively. These values are chosen according to Linux's intel_idle driver for the Nehalem microarchitecture, as our simulated cores are Nehalem-based. Thus, only 10 µs, i.e., the C1 wake-up latency, is added to the execution time for every migration that was predicted correctly.
System configurations, such as the number of cores, floorplan, and caches, are used to configure the simulated system. To generate performance traces, Sniper runs applications from the SPLASH-2 [39] and PARSEC [40] benchmark suites. These traces are used by McPAT to estimate each core's power consumption. HotSpot estimates the transient temperature using the estimated power traces. The HotSpot configuration parameters are listed in Table 3. In each control period, DTaPO is used to schedule the tasks and perform thermal management based on the transient temperature generated by HotSpot. To predict the temperature of each core in the next control period, the ridge regression uses the transient temperature generated by HotSpot, the power generated by McPAT, and the core states from DTaPO. Based on the predicted temperature, the early wake-up algorithm decides whether to wake up the dark cores.
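The effect of these latencies on the accumulated waiting time can be illustrated with a small Python calculation. The 10 µs and 200 µs values come from the text above; charging the full C3 latency to every migration that was not predicted, ignoring unnecessary wake-ups, and the function name are simplifying assumptions for illustration only.

```python
C1_LATENCY_US = 10   # wake-up latency when the dark core was pre-woken to C1
C3_LATENCY_US = 200  # wake-up latency when the dark core is still in C3


def total_wakeup_overhead_us(migrations, wakeup_accuracy):
    """Estimate the total wake-up latency (in microseconds) over a run.

    migrations: number of task migrations (hypothetical input)
    wakeup_accuracy: fraction of migrations predicted correctly, in [0, 1]
    """
    hits = round(migrations * wakeup_accuracy)  # dark core already in C1
    misses = migrations - hits                  # assumed to pay the C3 latency
    return hits * C1_LATENCY_US + misses * C3_LATENCY_US


# Without early wake-up every migration pays the C3 latency.
baseline_us = total_wakeup_overhead_us(100, 0.0)
with_prediction_us = total_wakeup_overhead_us(100, 0.9)
```

Under these assumptions the overhead falls linearly with the wake-up accuracy, which is why the accuracy of the migration prediction matters more than the raw temperature error.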
In our experiment, compute- and memory-intensive applications from the SPLASH-2 and PARSEC benchmark suites are used to evaluate the efficiency of the proposed technique. High-temperature tasks in compute-intensive applications can rapidly increase the core temperature, making them good candidates for validating our proposed algorithm. In contrast, memory-intensive applications have a large number of memory accesses, which exposes the task migration overhead due to cache misses. The experimental evaluation was done in two phases: a preliminary study and a comprehensive study. In both studies, the value of θ is set to 5% of the threshold temperature, and the control period interval length is set to 1 ms. To reduce the effect of randomness, the results reported in this study are the averages of ten runs of each experiment.
A preliminary study was carried out by executing a mix of four 8-thread applications: Bodytrack, Ocean, Radix, and Blackscholes. The threshold temperature was set to 70 °C. The threshold temperature was chosen based on the temperature profile of the studied applications on the target platform. There may be no migration when the threshold temperature is too high, and there may not be any cold cores to migrate to when the threshold temperature is too low. For more details on the impact of the threshold temperature on task migration, please refer to Ref. [19].
The preliminary study was carried out to determine the best fixed threshold for the fixed threshold early wake-up (FEW) [21] technique and to evaluate the performance of the proposed PEW technique under various prediction accuracy scenarios. In this preliminary study, a pre-known future temperature generated by the HotSpot simulator was used as the input to the proposed EW algorithm. These pre-known future temperatures represent a prediction model with 100% accuracy, which is the best-case scenario. In the other scenarios, this accuracy was reduced by 10% each time by introducing a uniformly distributed random error.
In the second phase, a comprehensive study was carried out by running eight 32-thread applications individually: Fluidanimate, Bodytrack, Cholesky, Blackscholes, Raytrace, FFT, Ocean, and Swaptions. The threshold temperature was lowered to 65 °C to show the efficiency of the proposed work. The comprehensive study used RR as the prediction model to predict the future temperature. This predicted temperature was used as the input to the proposed EW algorithm. There is a trade-off between the prediction accuracy and the window size: the bigger the window size, the better the prediction accuracy, but the prediction overhead increases as the window size increases. Therefore, the window size (w) is set to 30, which gives a good prediction accuracy and a low prediction overhead.
The value of the regularization parameter (λ) for RR is set to 0.2. This value was empirically determined by conducting experiments and choosing the value with a low average mean absolute error (MAE). Table 4 shows the MAE for the studied applications at different regularization parameter (λ) values. It can be seen that when λ = 0.2, the average MAE for all the studied applications is the lowest. In the comprehensive study, the computation efficiency in terms of completion time, power efficiency in terms of million instructions per second/Watt (MIPS/W), and average temperature were reported.
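The selection procedure described above can be sketched in Python. The MAE and RMSE definitions are the standard ones; the function names and the numbers in the example dictionary are made up for illustration and are not the values from Table 4.

```python
import math


def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def best_lambda(mae_by_lambda):
    """Pick the lambda whose MAE, averaged over all applications, is lowest.

    mae_by_lambda: {lambda value: [MAE of application 1, MAE of application 2, ...]}
    """
    def average(values):
        return sum(values) / len(values)

    return min(mae_by_lambda, key=lambda lam: average(mae_by_lambda[lam]))


# Hypothetical per-application MAE values for three candidate lambdas.
candidates = {0.1: [1.0, 2.0], 0.2: [0.5, 1.5], 0.5: [2.0, 2.0]}
chosen = best_lambda(candidates)
```

Averaging over applications before taking the minimum means the chosen λ is a compromise across workloads rather than the best value for any single one.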

B. COMPARATIVE RESULTS AND ANALYSIS
The proposed prediction-based early wake-up (PEW) was compared with our previous work that uses a non-early wake-up technique (NoEW) [19] and with the state-of-the-art fixed threshold early wake-up (FEW) [21] that uses a fixed threshold to wake up the dark cores. Moreover, for a fair comparison, the dark cores are switched to a C1 power state instead of an idle state in FEW.

1) PRELIMINARY RESULTS
The results from the preliminary study are shown in Fig. 8. These results are the relative completion times of executing the mix of four multi-threaded applications: Bodytrack, Ocean, Radix, and Blackscholes. The completion time is plotted relative to the proposed PEW with 100% prediction accuracy (the best case). The proposed algorithm was evaluated at different accuracy levels, from 100% down to 50%. FEW was also evaluated using two fixed early wake-up thresholds, 2 °C (FEW@2 °C) and 3 °C (FEW@3 °C) below the temperature threshold. For the studied application mix, the proposed technique outperforms NoEW by 3.1% and 1.5% at 100% and 50% accuracy, respectively. Thus, even at low prediction accuracy, prediction-based early wake-up still performs better than no early wake-up. Moreover, prediction-based early wake-up at 100% accuracy outperforms the fixed early wake-up thresholds FEW@2 °C and FEW@3 °C by 2% and 2.2%, respectively. In addition, FEW@2 °C reduces the completion time by 0.2% compared to FEW@3 °C. Thus, in the comprehensive study, the fixed early wake-up threshold in FEW was set to 2 °C below the threshold temperature.

2) COMPREHENSIVE RESULTS
In the comprehensive study, RR is used as the prediction model. Table 5 lists the average number of task migrations and the wake-up accuracy (i.e., the percentage of task migrations predicted accurately by the EW algorithm). It also shows the accuracy of the RR prediction model in terms of MAE and root mean square error (RMSE) for the cores' temperatures. Using the prediction model gives better wake-up accuracy than using a fixed wake-up threshold. On average, the proposed PEW predicts 91.42% of the task migrations accurately, compared to 76.62% using a fixed wake-up threshold. Fig. 9 shows the actual and predicted temperature of core 0 for all studied applications. The results of all cores are not presented because they show a similar trend, as shown in Fig. 10, which presents the results of three cores (cores 1-3) for Blackscholes and FFT. Although the prediction model does not fit well for some applications, it predicts well when the temperature exceeds the threshold temperature (65 °C), which is important for the EW algorithm to make the early wake-up decision.
For the Cholesky and Blackscholes applications, using a fixed wake-up threshold resulted in a higher wake-up accuracy than the prediction model. This is because these applications have a large percentage of serial phase, as shown in Fig. 11. For more characteristics of these applications, refer to Ref. [41]. In the serial phase, these applications run only one cool thread with a small number of task migrations, so our prediction model cannot fit well. Although Raytrace and FFT also have a large percentage of serial phase, the serial phase of these applications has a high number of task migrations. Thus, our prediction model fitted well with these applications.
The computational and power efficiency results in Fig. 12 are plotted relative to the proposed PEW technique and were obtained by executing the nine multi-threaded applications individually with the proposed PEW, NoEW, and FEW techniques. The computational efficiency in terms of relative completion time is shown in Fig. 12a. In all the studied applications, the proposed prediction-based early wake-up (PEW) reduces the completion time by 4  for Cholesky and 0.6% for Blackscholes because these applications have a large percentage of serial phase, as mentioned previously. In general, the overall completion time of the studied applications is improved because the waiting time (W_t) of the tasks is reduced. It is worth mentioning that all the comparison results assume a wake-up latency of 200 µs for the dark state, according to Linux's intel_driver. If the wake-up latency of the dark state is longer, as in the LEAT processor (261.77 ms), the improvement is expected to be larger.
The comparative results of the power efficiency in terms of relative MIPS/Watt are shown in Fig. 12b. On average, our proposed PEW performs better than NoEW and FEW by 3% and 1%, respectively. In all the studied applications except Bodytrack and Cholesky, our proposed PEW increases the MIPS/Watt by up to 5.5% and 2.3% over NoEW and FEW, respectively. The lower MIPS/Watt for Bodytrack and Cholesky is due to the high prediction RMSE for these applications, as shown in Table 5.
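The relative metrics used in Fig. 12 reduce to simple ratios, as in the sketch below. MIPS/Watt is computed here as millions of instructions per second divided by average power. The instruction count, completion times, and power values are made-up numbers chosen for easy arithmetic, not results from the paper.

```python
def mips_per_watt(instructions, completion_time_s, avg_power_w):
    # Millions of instructions per second, divided by average power in watts.
    mips = instructions / completion_time_s / 1e6
    return mips / avg_power_w

def relative_to(value, baseline):
    # Value normalized to a baseline (the figures normalize to PEW).
    return value / baseline

def gain_percent(new, old):
    # Percentage increase of `new` over `old`.
    return (new / old - 1.0) * 100.0

# Hypothetical run: same instruction count and power, different completion time.
pew = mips_per_watt(2e9, completion_time_s=1.0, avg_power_w=40.0)
noew = mips_per_watt(2e9, completion_time_s=1.25, avg_power_w=40.0)
print(relative_to(noew, pew), gain_percent(pew, noew))
```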
The thermal efficiency was also evaluated, as shown in Fig. 13. Fig. 13a shows the average, maximum, and minimum of the temperature variation between the coldest and the hottest core. Our proposed PEW, on average, exhibits less temperature variation than FEW and NoEW. Fig. 13b shows the average, maximum, and minimum of the cores' transient temperature. The average temperatures for the three techniques are identical because all three techniques use the same thermal management.

3) SIGNIFICANCE TEST
A significance test (t-test) was performed to verify the significance of the performance improvement in terms of task completion time. A paired t-test was conducted on the completion time of our proposed PEW against FEW and NoEW. The significance level (α) is set to the standard value of 0.05. The null hypothesis H_0 is tested against the alternative hypothesis H_a. H_0 assumes that the improvement is not significant, and H_a assumes that the improvement is significant. The null hypothesis H_0 is rejected if the p-value < α. Table 6 shows the significance test for the proposed technique's completion time against FEW and NoEW. It shows that the improvement of our proposed PEW over FEW is statistically significant for most of the studied applications. The improvement is not significant for Cholesky and Blackscholes, which suggests that the prediction model may need to be tuned to fit these applications; this is beyond the scope of this paper. The improvement of our proposed PEW over NoEW is statistically significant for all studied applications.
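The paired t-test described above can be sketched with the Python standard library alone. This is an illustration under stated assumptions: the samples are paired per repeated run of one application, the test is two-sided, and the p-value is obtained by numerically integrating the Student t density with Simpson's rule. The completion times are hypothetical, and the paper does not state which software or sidedness it used.

```python
import math
import statistics

def paired_t_statistic(a, b):
    # t statistic and degrees of freedom for paired samples a[i], b[i].
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1

def t_pdf(x, df):
    # Probability density of Student's t distribution with `df` degrees of freedom.
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p_value(t, df, steps=10000):
    # p = 1 - P(-|t| < T < |t|), with the integral over [0, |t|] computed
    # by Simpson's rule (steps must be even).
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3)

# Hypothetical completion times (seconds) over five paired runs.
pew = [10.1, 10.0, 10.2, 9.9, 10.0]
few = [10.4, 10.3, 10.6, 10.2, 10.3]
ALPHA = 0.05
t, df = paired_t_statistic(pew, few)
p = two_sided_p_value(t, df)
print(t, df, p, "reject H_0" if p < ALPHA else "fail to reject H_0")
```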

VI. CONCLUSION
This paper proposes a prediction-based early wake-up (PEW) technique for dark cores that utilizes an online sliding window-based ridge regression (RR) to reduce the dark cores' wake-up latency during task migration. RR predicts the future core temperatures based on the previous thermal, power, and core status. Based on these predicted temperatures, the proposed early wake-up (EW) algorithm puts the dark cores in a power state with low wake-up latency if task migration is expected in the next control period. Thus, our proposed PEW reduces the time for the dark cores to start running the tasks, which improves the many-core system's overall performance. The comparison results show that our proposed PEW reduces the task completion time by up to 7.9% and 4.1% compared to non-early wake-up (NoEW) and a fixed threshold wake-up (FEW), respectively. They also show that our proposed PEW increases the MIPS/Watt by up to 5.5% and 2.3% over NoEW and FEW, respectively. Moreover, a significance test shows that our improvements are statistically significant for all studied applications except those that our prediction model cannot fit well. For future work, we plan to propose a technique that dynamically tunes the prediction model parameters (window size and regularization parameter) according to the running application and to evaluate the impact of the chip floorplan on the temperature.
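The sliding-window ridge regression summarized above can be sketched as follows. This is a simplified illustration, not the paper's implementation: it models a single core with three features (temperature, power, and active/dark status), refits the closed-form solution w = (XᵀX + λI)⁻¹Xᵀy from scratch at every update, and penalizes the bias term together with the other weights. The class name, feature layout, and synthetic data are assumptions.

```python
from collections import deque

def solve(a, b):
    # Solve a x = b by Gaussian elimination with partial pivoting.
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

class WindowRidge:
    def __init__(self, window_size, lam):
        self.samples = deque(maxlen=window_size)  # oldest samples drop out
        self.lam = lam                            # regularization parameter
        self.weights = None

    def update(self, features, target):
        # Add one (features, next temperature) pair and refit on the window.
        self.samples.append((list(features) + [1.0], target))  # 1.0 = bias input
        d = len(features) + 1
        xtx = [[self.lam if i == j else 0.0 for j in range(d)] for i in range(d)]
        xty = [0.0] * d
        for x, y in self.samples:
            for i in range(d):
                xty[i] += x[i] * y
                for j in range(d):
                    xtx[i][j] += x[i] * x[j]
        self.weights = solve(xtx, xty)

    def predict(self, features):
        x = list(features) + [1.0]
        return sum(w * v for w, v in zip(self.weights, x))

def synthetic_next_temp(temp, power, active):
    # Made-up linear thermal behaviour used only to generate training data.
    return 0.9 * temp + 0.2 * power + 1.0 * active + 4.0

model = WindowRidge(window_size=8, lam=1e-6)
history = [(60, 10, 1), (62, 12, 1), (64, 8, 0), (61, 5, 0), (63, 15, 1),
           (65, 9, 1), (59, 11, 0), (66, 14, 0), (62, 7, 1), (64, 13, 0)]
for temp, power, active in history:
    model.update((temp, power, active), synthetic_next_temp(temp, power, active))
print(model.predict((64, 10, 1)))
```

The window size and regularization parameter in this sketch correspond to the two model parameters that the future work proposes to tune per application.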