3D-DNaPE: Dynamic Neighbor-Aware Performance Enhancement for Thermally Constrained 3D Many-Core Systems

The continuous scaling of silicon technology has enabled many-core systems to become ubiquitous, offering enormous computational power for various applications spanning from high-performance computing to mobile devices. However, this advancement resulted in increased power density that exacerbated the thermal challenges of dark silicon, where certain cores are turned off or become dark due to thermal constraints. While various methods have been put forward to enhance the performance of thermally constrained 2D many-core systems, 3D designs introduce more serious thermal issues due to heightened power density and challenges with heat dissipation in vertically stacked configurations. This paper introduces a dynamic neighbor-aware performance enhancement for thermally constrained 3D many-core systems (3D-DNaPE). 3D-DNaPE is a technique that improves the performance of a thermally constrained 3D many-core system where only a limited number of cores can be activated. Initially, it uses the proposed neighbor-aware pattern (NaP) algorithm to select the coldest core among the four adjacent dark cores suitable for task migration. Subsequently, it uses the proposed 3D dynamic thermal management (3D-DTM) algorithm to optimize system performance by considering the core and memory bank temperatures. A static non-uniform cache access (S-NUCA) configuration mitigates cache misses resulting from task migration. Comprehensive evaluations indicate that 3D-DNaPE performs better than its contemporaries, showing improvements reaching up to 43% in execution time, a 34% decrease in performance slowdown, and an up to 51% enhancement in energy efficiency. This research not only underscores the challenges faced by 3D many-core systems but also provides a robust solution with promising implications for future 3D many-core designs.


I. INTRODUCTION
The continual scaling of silicon technology in recent years has given rise to many-core systems, which integrate many processor cores onto a single chip. These systems offer great computational power and have become the driving force behind various applications, encompassing a broad spectrum from high-performance computing to mobile devices [1]. However, this increase in computational power has also led to an increase in power density, which has caused significant thermal challenges. One of these challenges is the dark silicon issue, which refers to the portion of cores that cannot be fully utilized due to thermal constraints [2]. In this paper, the terms dark silicon and thermally constrained are used interchangeably. The dark silicon issue is further exacerbated by the transition from 2D to 3D many-core architectures, a move aimed at overcoming off-chip memory bandwidth limitations. In these 3D architectures, cores and main memory/cache layers are vertically stacked [3], leading to even higher power density and decreased heat dissipation capabilities when layers are active [4], [5], [6], [7]. Consequently, 3D many-core systems face more significant thermal challenges than their 2D counterparts. Moreover, external environmental conditions, such as ambient temperature, play a pivotal role in influencing the system's thermal behavior. Additionally, within the many-core system itself, distinct zones or sections can exhibit varied thermal behaviors, necessitating targeted thermal management strategies for each zone.
The associate editor coordinating the review of this manuscript and approving it for publication was Mario Donato Marino.
Several techniques have been suggested to cope with the challenge of dark silicon or thermally constrained many-core systems, with a predominant focus on improving the performance of 2D thermally constrained many-core systems. Some of these 2D optimization techniques concentrate on mapping and pattern techniques [8], [9]. Another set of techniques uses the computation sprinting mechanism, which temporarily boosts core frequencies through dynamic voltage and frequency scaling (DVFS) [10], [11], [12], [13], [14], [15], [16]. Other techniques [17], [18], [19] leverage the dark cores to aggressively lower the system's temperature by migrating tasks from the active cores to these dark cores and turning off the active cores. In addition, these techniques use DVFS to progressively decrease the system temperature. However, all the 2D optimization techniques focus only on the temperature of the cores and ignore the temperature of the memory. On the other hand, only a few methods have been suggested to improve the performance of 3D many-core systems [20], [21], [22]. Some of them use a fixed power budget, ignoring transient temperature fluctuations and heat transfer across cores. Others depend on the applications' performance models being available at design time and therefore cannot be used for unknown applications. Moreover, to the best of our knowledge, no one has proposed task migration in a 3D dark silicon many-core system.
This study presents a dynamic neighbor-aware performance enhancement for thermally constrained 3D many-core systems (3D-DNaPE). The proposed technique comprises two stages. The first stage uses the proposed neighbor-aware pattern (NaP) algorithm to select the coldest of the four adjacent dark cores as the destination for task migration. This allows 3D-DNaPE to perform task migration selectively, only for the hot cores, rather than migrating the tasks of all cores as in our previously proposed DTaPO [17] for 2D dark silicon many-core systems. In the second stage, a 3D dynamic thermal management (3D-DTM) algorithm is used. This algorithm uses task migration to enhance the performance of the many-core system while ensuring that the operating temperature remains within safe thermal limits. Unlike DTaPO, which focuses solely on core temperatures, 3D-DNaPE considers both core and memory bank temperatures in 3D many-core architectures. If no surrounding core is cold enough, DVFS is used to progressively reduce the system temperature.
It is known that task migration leads to cache misses. A shared last-level cache (LLC) can be used to mitigate the cache misses resulting from task migration [17], [23]. In DTaPO, tasks were only migrated horizontally between the two adjacent cores that shared the same L3 cache. In contrast, the 3D-DNaPE technique allows tasks to be moved in all directions among the four adjacent cores, based on the coldest neighboring core. To enable this, a static non-uniform cache access (S-NUCA) [24] configuration is used as the LLC. In the S-NUCA architecture, the LLC banks are physically distributed across all cores within the many-core system, yet they logically form a single large cache shared by all cores. Such an architecture can be found in commercial many-core processors [25]. During task migration in an S-NUCA many-core system, only the cache lines in the private caches of the source core, from which the task is migrating, must be flushed to the LLC. The destination core can then access these cache lines via the shared LLC. This strategy reduces the task migration overhead that stems from cache misses. In summary, the key contributions of this paper can be outlined as follows: [...] the final remarks and provides insights into future research directions.

II. RELATED WORK
The dark silicon problem raises a vital question: how can we use available computational resources effectively in the face of power and thermal limitations? This issue has gained significant attention in computer architecture and design. Researchers and engineers are searching for innovative methods to enhance the performance of many-core systems within these power and thermal limitations.
Recent years have seen several studies improving the performance of thermally constrained many-core systems. Some have employed mapping and pattern techniques [8], [9], [26], [27], while others have harnessed the computation sprinting mechanism, briefly increasing core frequencies using DVFS [10], [11], [12], [13], [14], [15], [16]. Kanduri et al. [28] introduced adBoost, a thermal-aware performance-boosting technique that patterns dark cores among active ones to create thermal headroom. Raghunathan and Garg [29] developed a scheduler using queuing theory and job arrival rates to make run-time decisions for task and cluster optimization. Mohammed et al. [17], [18] introduced a dynamic thermal-aware performance optimization technique that uses task migration and DVFS to enhance the performance of thermally constrained many-core systems. Moreover, in [30], the researchers proposed a prediction-based early wake-up of dark cores to reduce the dark cores' wake-up latency and improve the overall performance of thermally constrained many-core systems. Several other researchers have also proposed techniques emphasizing dynamic power budgeting [31], [32], [33].
All the aforementioned techniques target thermally constrained 2D many-core systems. However, 3D many-core systems face more significant thermal challenges than their 2D counterparts. This is primarily due to higher power density and reduced heat dissipation when active layers are stacked vertically. Moreover, in addition to the cores, non-core components, such as memories and caches, play a significant role in generating heat [20], [22]. Several task scheduling-based techniques for dynamic temperature management in 3D many-core systems have been proposed [34], [35], [36], [37], [38]. The authors of [39] and [40] proposed performance optimization techniques under power and thermal constraints. Thermal management and performance optimization techniques for 3D many-core systems with hybrid SRAM/MRAM L2 caches were proposed by Lee et al. [41], [42]. Considering thermal-induced stress, Zou et al. [43] introduced a thermal management approach for 3D systems. Wang et al. [44] proposed an artificial neural network-based run-time stress estimator. In addition, STREAM was presented to optimize 3D many-core performance while considering thermal-induced reliability issues [45]. However, the techniques above were not aimed at thermally constrained 3D many-core systems, where only a part of the system's units can be active simultaneously.
Only a limited number of techniques currently aim to improve the performance of thermally constrained 3D many-core systems [20], [21], [22]. Asad et al. [20] consider the power consumption of cores and non-core components concurrently to enhance the performance of thermally constrained 3D many-core systems. However, they use a fixed power budget, which over-constrains the system's performance at run-time. Wan et al. [21] proposed a greedy-based core-cache co-optimization algorithm to optimize the performance of thermally constrained 3D many-core systems at run-time. However, it depends on the applications' performance models being available at design time and therefore cannot be used for unknown applications.
Siddhu et al. [22] proposed a dynamic thermal management approach called CoreMemDTM. CoreMemDTM is a joint approach to managing the thermal levels of a computing system's processor cores and memory. It builds on the idea that the core and memory are interdependent, so a dynamic thermal management (DTM) decision made for one can reduce the temperature of the other, thereby lowering overheads. CoreMemDTM uses a multilevel slack-balanced DVFS technique to control the cores (CoreDTM) and low-power states to manage the memory (MemDTM). CoreMemDTM activates an appropriate DTM policy if the temperature of the core or memory rises. When both the core and memory components overheat, CoreMemDTM executes DTM for the component with the lowest thermal slack. However, CoreMemDTM uses only DVFS and power gating and does not use task migration. Our previous work [17] has shown that task migration can substantially lower chip temperature without compromising overall system performance. This approach balances thermal loads across the cores, allowing for efficient utilization of system resources while adhering to thermal constraints.
In summary, most previous performance optimization techniques for thermally constrained many-core systems target 2D designs and are not suitable for 3D stacked layers, because they concentrate only on the temperatures of the cores and ignore the temperatures of the memory. On the other hand, 3D system performance optimization techniques do not target the dark silicon problem, where only part of a many-core system can be activated at the same time. Only a few works are aimed at thermally constrained 3D many-core systems. Some of them use a fixed power budget, ignoring transient temperature fluctuations and heat transfer across the cores. Others depend on the applications' performance models being available at design time and therefore fail to accommodate unknown applications that are not characterized at design time. Moreover, to the best of our knowledge, no one has used task migration in a 3D dark silicon many-core system. Our previous work [17] shows that task migration can aggressively decrease a chip's temperature while maintaining good overall performance in a thermally constrained many-core system.

III. SYSTEM AND APPLICATIONS OVERVIEW
This section describes the proposed system model for a 3D dark silicon many-core system, the applications under consideration, and the problem definition and formulation.

A. PROPOSED SYSTEM MODEL
This work focuses on addressing the thermal challenges and performance enhancement in 3D many-core systems. The proposed system model shown in Fig. 1 is used for evaluating the effectiveness of the proposed techniques. While the specific configuration may not directly reflect the exact architectures available in the market, the research provides insights into the challenges and potential solutions for thermally constrained 3D many-core systems. The 3D-stacked many-core system consists of three layers: one core layer and two memory layers. The core layer comprises 64 homogeneous cores, while the memory layers consist of 128 memory banks. An 8 × 8 mesh-based network-on-chip (NoC) is used as the communication medium. The memory layers are vertically stacked and interconnected through vertical channels. These channels consist of pathways built from through-silicon vias (TSVs) [46], which carry data between the memory layers and the cores. As this study aims to improve the performance of thermally constrained many-core systems, we assume that only half of the cores can be activated simultaneously. Previous studies [17], [18] show that using half of the cores in a thermally constrained environment can give better results than using all of the cores. Initially, the active and dark cores and memory channels are organized in a checkerboard pattern, where dark cores surround each active core to enhance heat dissipation by providing thermal headroom [47]. However, this pattern keeps changing during execution, because task migration rearranges the cores according to their temperatures. More details on how task migration changes the active and dark core patterns are discussed in Section IV.
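The initial checkerboard arrangement can be illustrated with a minimal Python sketch. The function name `checkerboard_mask` and the choice of which parity is active are assumptions made for illustration; the text only states that dark cores surround each active core on the 8 × 8 mesh.

```python
from typing import List


def checkerboard_mask(width: int = 8) -> List[List[bool]]:
    """Return a width x width grid; True marks an active core, False a dark core."""
    # Assumption: a core is active when its row and column indices have the same
    # parity, so the N/S/E/W neighbors of every active core are dark.
    return [[(row + col) % 2 == 0 for col in range(width)] for row in range(width)]


mask = checkerboard_mask(8)
print(sum(cell for row in mask for cell in row))  # number of active cores
for row in mask:
    print("".join("A" if cell else "." for cell in row))
```

On the 8 × 8 mesh this yields 32 active and 32 dark cores, matching the assumption that only half of the cores are active at once.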
The proposed 3D-DNaPE comprises two stages. The first stage uses the neighbor-aware pattern (NaP) algorithm to identify the coldest neighboring core to which tasks can be migrated. The second stage uses the 3D-DTM algorithm to perform DTM on the thermally constrained 3D many-core system, taking into account the individual temperatures of cores and memory banks. The temperatures of cores and memory banks must be monitored separately because their heat dissipation characteristics differ and depend on the specific applications being run. More details about the proposed algorithms are presented in Section IV. 3D-DNaPE continuously monitors the status of the many-core system at predefined control intervals while multi-threaded applications are running. More details about these multi-threaded applications are provided in the following subsection. Specifically, 3D-DNaPE tracks the locations of active and dark cores, the DVFS level, the power consumption of both cores and memory banks, and the transient temperatures of these components. Assuming that the many-core system supports preemptable tasks, 3D-DNaPE intervenes when potential thermal violations are detected. It halts the tasks and relocates them to another core selected by the first stage for continued execution. If no thermal headroom is available, it lowers the voltage/frequency level using DVFS.

B. MULTI-THREADED APPLICATIONS
We focus on multi-threaded applications, drawing from a range of scientific computing and engineering domains. These applications are not inherently designed with strict real-time constraints. They are represented in the SPLASH-2 [48] and PARSEC [49] benchmark suites. The SPLASH-2 suite encompasses multi-threaded applications spanning engineering, scientific, and graphics domains. The PARSEC suite, in turn, introduces a collection of emerging applications in recognition, mining, and synthesis (RMS) [50]. This suite also presents multi-threaded applications characteristic of commercial programs, including animation, media processing, enterprise servers, computer vision, and computational finance applications. Using both SPLASH-2 and PARSEC offers diversity in aspects such as working set size, cache miss rate, and instruction distribution [51].
A multi-threaded application encompasses multiple threads. Each thread is an independent task, and while all threads share a common data space, each possesses a unique thread ID, a register set, a stack, and a program counter [52]. This structure is visualized in Fig. 2. In this paper, the terms thread and task are used interchangeably. In this work, we use nine compute- and memory-intensive multi-threaded applications from the PARSEC and SPLASH-2 benchmark suites to assess the efficacy of our proposed work. These applications are Blackscholes, Bodytrack, Cholesky, Dedup, FFT, Fluidanimate, Ocean, Radix, and Raytrace. For more details about these applications' characteristics, please refer to Ref. [53].

C. PROBLEM FORMULATION
Consider a 3D many-core system, which consists of $C$ cores and $M$ memory channels, that runs multi-threaded applications. Given that only 50% of the cores and memory channels can be active at any given time due to the thermal constraints, the goal of our proposed technique is to minimize the total execution time $E_t$, which refers to the overall time taken for the execution of the multi-threaded applications in a 3D many-core system. Simultaneously, we aim to ensure that the temperatures of the cores and memory banks do not exceed a specified threshold temperature $T_{th}$. This goal can be expressed in mathematical terms as follows:

$$\min E_t \quad \text{subject to} \quad T_c \le T_{th} \;\; \forall c, \qquad T_m \le T_{th} \;\; \forall m \tag{1}$$

Here, $T_c$ represents the transient temperature of core $c$ and $T_m$ represents the transient temperature of memory channel $m$.
This formulation takes into account the transient temperatures of both the cores and memory banks.

IV. PROPOSED 3D-DNaPE TECHNIQUE
This section outlines the methodology of our proposed 3D-DNaPE technique. The objective of our technique is to dynamically enhance the performance of thermally constrained 3D many-core systems while respecting thermal constraints. Thus, it is crucial for our technique to be computationally lightweight. Task migration and DVFS are suitable lightweight options for managing chip thermal conditions in real time if used efficiently. Task migration can effectively lower the system temperature by shutting off hot cores and moving tasks to cooler ones. In contrast, DVFS can gradually reduce the system temperature by incrementally lowering the DVFS level of hot cores when there is insufficient thermal headroom for task migration.
The proposed 3D-DNaPE takes into account vertical heat transfer across all layers of the 3D stack. Therefore, the DTM techniques, namely task migration and DVFS, should consider the temperatures of both the core and memory bank layers in the 3D many-core system. Unlike DTaPO [17], which migrates tasks across all cores to maintain the checkerboard pattern, 3D-DNaPE uses a neighbor-aware pattern. It checks all the dark neighbor cores of the current core and selects the coldest one as the migration destination. Migrating tasks only to neighbor cores in thermally constrained 3D many-core systems has the advantage of reducing the search complexity and minimizing data transfer overhead.
To simplify the search for all surrounding neighbors, their indexes need to be found. To do that, first, the coordinates $(x, y)$ of the current core on the NoC are calculated based on the current core's position using Eqs. (2) and (3):

$$x = \left\lfloor \frac{C_{index}}{\sqrt{C}} \right\rfloor \tag{2}$$

$$y = C_{index} \bmod \sqrt{C} \tag{3}$$

where $x$ is the row coordinate of the current core, $y$ is the column coordinate of the current core, $C_{index}$ is the current core index, and $C$ is the total number of cores, so that $\sqrt{C}$ is the width of the square NoC mesh. After finding the current core coordinates, the surrounding neighbor core coordinates $(n_x, n_y)$ are calculated based on the current core's coordinates. Finally, Eq. (4) is used to find the neighbor core index:

$$N_{index} = n_x \sqrt{C} + n_y \tag{4}$$

where $N_{index}$ is the index of the neighbor core, $n_x$ is the row coordinate of the neighbor core, and $n_y$ is the column coordinate of the neighbor core. Further details regarding the use of these equations can be found in Algorithm 1.
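As a worked example, assume row-major core indexing on the 8 × 8 mesh of Section III, so that the mesh width is $\sqrt{C} = 8$. The core index 27 and the choice of the neighbor directly above it are illustrative values, not taken from the evaluation.

```latex
\begin{align*}
x &= \lfloor 27 / 8 \rfloor = 3, & y &= 27 \bmod 8 = 3,\\
(n_x, n_y) &= (x - 1,\; y) = (2, 3), & N_{index} &= 2 \cdot 8 + 3 = 19.
\end{align*}
```

A neighbor whose coordinates fall outside the range 0 to 7 (for instance, any neighbor above a core in row 0) has no valid index and is skipped.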
The proposed 3D-DNaPE uses task migration to transfer tasks to the coldest dark neighbor under two distinct conditions. The first condition is when the temperatures of both the coldest neighbor and its associated memory channel are lower than the predefined threshold temperature by a specific margin. The second condition is when the temperatures of the coldest neighbor and its associated memory channel are lower than the current core temperature by a specific margin. In scenarios that do not meet these conditions, 3D-DNaPE uses DVFS to reduce the frequency level of the current core.
131968 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
3D-DNaPE uses heuristic algorithms to optimize performance while managing thermal conditions in a complex, dynamic 3D many-core system. The complexity and dynamism of the system justify choosing heuristics over a search for an optimal solution. Heuristic algorithms aim to find satisfactory solutions quickly by leveraging problem-specific knowledge and simplifying assumptions, which suits the dynamic environments typical of many-core systems. In these environments, cores go through heating and cooling phases, tasks exhibit varying arrival and execution patterns, and a configuration that is optimal at one moment may lose its optimality shortly thereafter. The use of heuristic algorithms therefore trades optimality for computational efficiency while remaining robust under uncertainty and dynamic change. The details of the proposed algorithms are explained in the following subsection.

A. PROPOSED ALGORITHMS
The proposed 3D-DNaPE first identifies the coldest destination core and then applies DTM techniques, specifically task migration and DVFS. Therefore, we have proposed two algorithms: Algorithm 1 (NaP) and Algorithm 2 (3D-DTM). All symbols used in these algorithms are defined in Table 1. Algorithm 1 is used to find the index of the neighbor that has the lowest temperature. This information is crucial for 3D-DNaPE to pattern the active and dark cores based on their neighbor temperatures. Algorithm 1 takes the index of the current core $C_{index}$, the width of the NoC $w$, a vector of all core temperatures $T_c$, and a vector of all core statuses $S$. It searches for the destination index $D_{index}$ that has the lowest temperature. Initially, the active and dark cores are distributed evenly so that dark cores encircle each active core, as shown in Fig. 1.
To find the current core's coordinates on the NoC, Eqs. (2) and (3) are used (lines 1-2). A minimum temperature $T_{min}$ is initialized to the temperature of the current core (line 3). The coordinates of the eight neighboring cores around the current core are stored in the array $N$ (line 4). The algorithm then iterates over each neighbor in $N$. For each neighbor, it retrieves the row coordinate $n_x$ and the column coordinate $n_y$. If the neighbor coordinates are within the bounds of the NoC (i.e., $0 \le n_x < w$ and $0 \le n_y < w$), it calculates the neighbor index $N_{index}$ using Eq. (4). If the temperature at the neighbor index, $T_c[N_{index}]$, is less than $T_{min}$ and the neighbor is a dark core, the algorithm updates the destination index $D_{index}$ with the neighbor index and updates $T_{min}$ with the neighbor's temperature. Finally, the algorithm returns the index of the core that has the minimum temperature among the neighbors, which is used by Algorithm 2.

Algorithm 2 uses task migration to move tasks from hot active cores to cooler dark cores, allowing the 3D many-core system to operate at high performance without exceeding thermal limits. If there is insufficient thermal headroom for task migration, the algorithm applies DVFS for more progressive thermal reduction. The algorithm takes several inputs: the transient temperatures of all cores, $T_c$; the transient temperatures of all memory channels, $T_m$; the set of active cores, $A_c = \{a_0, \ldots, a_{n-1}\}$; the set of dark cores, $D_c = \{d_0, \ldots, d_{n-1}\}$; and the set of all active cores' frequency levels, $F = \{f_0, \ldots, f_{n-1}\}$. Additionally, the algorithm reads the threshold temperature $T_{th}$, the threshold frequency $f_{th}$, a safe-margin value $\alpha$, and a frequency level step $\delta$ from a configuration file.
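The neighbor search of Algorithm 1 described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `nap_coldest_dark_neighbor`, the row-major index convention, and returning `None` when no dark neighbor is cooler than the current core are all choices made here, since the text does not say what is returned in that case.

```python
from typing import List, Optional


def nap_coldest_dark_neighbor(c_index: int, w: int,
                              temps: List[float],
                              is_dark: List[bool]) -> Optional[int]:
    """Return the index of the coldest dark neighbor that is cooler than the
    current core, or None if there is none (assumed convention)."""
    x, y = divmod(c_index, w)      # row and column, assuming row-major indexing
    t_min = temps[c_index]         # start from the current core's temperature
    d_index = None
    # Offsets of the eight mesh positions surrounding the current core.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    for dx, dy in offsets:
        n_x, n_y = x + dx, y + dy
        if 0 <= n_x < w and 0 <= n_y < w:      # skip positions outside the NoC
            n_index = n_x * w + n_y
            if is_dark[n_index] and temps[n_index] < t_min:
                d_index, t_min = n_index, temps[n_index]
    return d_index


# Illustrative 3 x 3 mesh: core 4 (center) is hot, and its dark neighbors
# are cores 0, 2, 5, 6, and 8.
temps = [50.0, 60.0, 55.0, 58.0, 70.0, 52.0, 61.0, 57.0, 59.0]
dark = [True, False, True, False, False, True, True, False, True]
print(nap_coldest_dark_neighbor(4, 3, temps, dark))  # -> 0
```

The loop always visits at most eight positions, which is why the search cost does not grow with the number of cores.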
At a predefined control interval, the algorithm checks the temperature of each active core and its associated memory channel (lines 1-3).If any of them exceed the threshold temperature, Algorithm 1 is utilized to determine the destination index of a candidate dark core with the lowest temperature among its neighboring dark cores (line 4).
If the temperatures of the candidate dark core and its associated memory channel are lower than the threshold temperature by a sufficient margin $\alpha$, the candidate dark core is activated, and the current core is deactivated (lines 5-8). If this condition is not met, the algorithm checks whether the temperatures of the candidate dark core and its associated memory channel are lower than the current core temperature by $\alpha$. If this condition holds, the algorithm activates the dark core, reduces its frequency using DVFS, and deactivates the current core (lines 9-14).
Otherwise, if there is no thermal headroom available for task migration, the algorithm uses DVFS to reduce the frequency of the current core by $\delta$ (line 17). In cases where there are no thermal violations and the frequency is less than $f_{th}$, the algorithm increases the frequency of the current core by $\delta$ to improve system performance (lines 20-21).
In summary, the 3D-DNaPE algorithm periodically monitors the temperature of active cores and their associated memory channels, initiating appropriate actions based on the temperature readings of the cores and memory banks. Fig. 3 provides a visual representation of the proposed algorithms in action. For simplicity and without loss of generality, this illustration focuses solely on core temperatures. As depicted, [...]

Algorithm 2 3D-DTM algorithm
Input: $A_c$, $D_c$, $T_m$, $T_c$, $F$, $T_{th}$, $f_{th}$, $\alpha$, and $\delta$
Output: Updated $A_c$, $D_c$, and $F$
[...]
Use Algorithm 1 to find the destination index ($D_{index}$);
[...]
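Because most of the Algorithm 2 listing is missing here, one control interval of the policy is sketched below in Python from the prose description alone. It is an illustration under several assumptions rather than the authors' code: memory channel `i` is taken to be the channel associated with core `i`, `find_dest` stands in for Algorithm 1, `f_min` is a hypothetical lowest DVFS level, and the second migration condition follows the earlier statement in this section that the destination must be cooler than the current core by the margin.

```python
from typing import Callable, Dict, List, Optional, Set


def dtm_step(active: Set[int], dark: Set[int],
             t_core: List[float], t_mem: List[float], freq: Dict[int, int],
             t_th: float, f_th: int, alpha: float, delta: int, f_min: int,
             find_dest: Callable[[int], Optional[int]]) -> None:
    """One control interval; updates active, dark, and freq in place."""
    for c in sorted(active):                   # snapshot: the sets change below
        if t_core[c] <= t_th and t_mem[c] <= t_th:
            if freq[c] < f_th:                 # no violation: raise the frequency
                freq[c] += delta
            continue
        d = find_dest(c)                       # stage 1: coldest dark neighbor
        if d is not None and d in dark:
            t_dest = max(t_core[d], t_mem[d])  # core and its memory channel
            below_threshold = t_dest < t_th - alpha
            below_current = t_dest < t_core[c] - alpha
            if below_threshold or below_current:
                f_new = freq.pop(c)
                if not below_threshold:        # second condition: start slower
                    f_new = max(f_min, f_new - delta)
                active.remove(c)
                dark.add(c)
                dark.remove(d)
                active.add(d)
                freq[d] = f_new
                continue
        freq[c] = max(f_min, freq[c] - delta)  # no headroom: DVFS down


# Illustrative two-core example: core 0 is hot, core 1 is a cold dark neighbor.
active, dark, freq = {0}, {1}, {0: 3800}
dtm_step(active, dark, [70.0, 50.0], [60.0, 48.0], freq,
         t_th=65.0, f_th=3800, alpha=3.25, delta=200, f_min=1000,
         find_dest=lambda c: 1)
print(active, dark, freq)  # -> {1} {0} {1: 3800}
```

Passing the neighbor search in as a callable keeps the sketch independent of any particular implementation of Algorithm 1.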

B. COMPLEXITY ANALYSIS
The time complexity of the NaP algorithm is constant, O(1), irrespective of the input size. It performs a fixed sequence of operations to evaluate the thermal conditions of neighboring cores and identify a suitable candidate for task migration. The array of neighboring cores has a fixed size, so iterating through it, along with the other operations within the algorithm, takes constant time. The time complexity of the 3D-DTM algorithm, on the other hand, is determined by the for loop that iterates through all active cores in the set $A_c$, resulting in a linear time complexity of O(n), where n is the number of active cores. Within this loop, various operations are performed, including temperature checking and the invocation of Algorithm 1, both of which operate in constant time O(1). The task migration steps and frequency scaling operations within this loop are also constant-time operations, so the overall time complexity remains O(n).
Regarding space complexity, the NaP algorithm has a constant space complexity of O(1). It uses a fixed-size array to hold the coordinates of neighboring cores and a few other variables that do not scale with the input size. Conversely, the space complexity of the 3D-DTM algorithm is linear, O(n), primarily due to the data structures used to represent the active and dark cores, transient temperatures, and frequency levels. These data structures scale with the number of cores and memory channels in the system.

V. EXPERIMENTAL EVALUATION
Numerous experiments were carried out to assess the validity and efficacy of our proposed work.This section presents the details of the experimental setups, the obtained comparison results, and a thorough discussion and analysis of these outcomes.

A. EXPERIMENTAL SETUP
The proposed 3D-DNaPE was validated on a 3D many-core system comprising three layers: one core layer and two memory layers. The core layer contains 64 cores, which are evenly split into 32 active cores and 32 dark cores, all interconnected via an 8 × 8 mesh-based NoC. Although the cores share the same instruction set architecture (ISA), they operate at heterogeneous frequencies. Their maximum clock frequency is 4 GHz. Each core occupies an area of 8.70 mm², based on McPAT [54] modeling for the 22 nm technology node. Each core is equipped with a 32-kilobyte private L1 data cache, a 32-kilobyte private L1 instruction cache, and a 64-kilobyte private L2 cache. Additionally, an 8-megabyte S-NUCA cache is used as the LLC. The S-NUCA cache is shared among all cores, which corresponds to 128 kilobytes per core. The memory layers contain 128 memory banks, divided evenly into two layers with 64 memory banks each. There are 64 memory channels in total, with each channel servicing two memory banks. Table 2 provides a summary of this system's settings.
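The derived quantities in this setup can be cross-checked with a short Python sketch. The class name `SystemConfig` and its field names are illustrative and not part of any simulator configuration format; the numeric values are those stated above.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemConfig:
    cores: int = 64
    memory_banks: int = 128
    memory_layers: int = 2
    memory_channels: int = 64
    llc_kib: int = 8 * 1024      # 8-megabyte S-NUCA LLC, in kilobytes

    @property
    def mesh_width(self) -> int:
        return math.isqrt(self.cores)                     # side of the square NoC

    @property
    def llc_slice_kib(self) -> int:
        return self.llc_kib // self.cores                 # LLC share per core

    @property
    def banks_per_layer(self) -> int:
        return self.memory_banks // self.memory_layers

    @property
    def banks_per_channel(self) -> int:
        return self.memory_banks // self.memory_channels


cfg = SystemConfig()
print(cfg.mesh_width, cfg.llc_slice_kib, cfg.banks_per_layer, cfg.banks_per_channel)
# -> 8 128 64 2
```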
Fig. 4 illustrates the experimental framework of this study. We used the state-of-the-art CoMeT simulator for 3D many-core systems [55]. CoMeT is a toolchain that integrates Sniper [56], McPAT [54], CACTI [57], and HotSpot [58]. Sniper is a high-performance parallel simulator designed for the x86-64 architecture, capable of simulating multi-core and many-core systems efficiently. McPAT is widely adopted for integrated power, area, and timing modeling, as evidenced by its recent use [59], [60]. Its prominence stems from its ability to provide detailed low-level configuration insights for multi-core and many-core processors. CACTI is an architecture-level integrated framework for modeling the power, area, and timing of cutting-edge memory technologies, including 3D-stacked memories, as well as traditional 2D DRAM and caches. This framework significantly simplifies integration with architecture-level core performance simulators, facilitating comprehensive evaluations of novel memory technologies. The HotSpot simulator is one of the most commonly used tools for thermal simulations. This simulator is based on the well-known stacked-layer packaging technology.
To enable the modeling of a thermally constrained many-core system, some modifications were made to the Sniper simulator. Specifically, the Sniper scheduler was adjusted to allocate tasks exclusively to active cores through a core mask pattern. Additionally, McPAT was modified to account only for the static power of dark cores, in order to model the dark silicon state. Moreover, the wake-up latency, which is the time a core takes to transition from the dark state to the active state on each task migration, was modeled by adding 200 µs, following Linux's intel_driver. These enhancements allowed us to effectively study the implications of a thermally constrained 3D many-core system in our experiments.

The studied applications, drawn from benchmark suites, are Blackscholes, Bodytrack, Cholesky, Dedup, FFT, Fluidanimate, Ocean, Radix, and Raytrace. Furthermore, a combination of compute- and memory-intensive applications is used to represent a diverse spectrum of computing demands, memory access patterns, and workload sizes. In compute-intensive applications, tasks that run hot can rapidly increase the temperature of the cores. In contrast, memory-intensive applications, characterized by a high volume of memory accesses, tend to drive up the temperature of the memory banks. The mixed applications, grouped as MixApps, are Blackscholes, Bodytrack, Cholesky, and FFT.
These applications are a mix of compute- and memory-intensive applications. Although our experiments mainly focus on specific applications from these benchmark suites, the underlying mechanisms and techniques are designed to be generalizable to large-scale real-world applications.
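The simulator modifications described earlier in this subsection (scheduling onto active cores through a core mask, and a fixed wake-up latency per migration) can be sketched as follows. The checkerboard mask is an assumption chosen only because it yields 32 active and 32 dark cores on an 8 × 8 mesh; the text does not specify the actual mask pattern, and all function names are hypothetical rather than part of Sniper.

```python
from typing import Dict, List, Sequence

WAKEUP_LATENCY_US = 200.0  # wake-up latency added per task migration (from the text)


def checkerboard_mask(rows: int = 8, cols: int = 8) -> List[bool]:
    """Hypothetical core mask: True marks an active core, False a dark core."""
    return [(r + c) % 2 == 0 for r in range(rows) for c in range(cols)]


def assign_tasks(task_ids: Sequence[str], mask: Sequence[bool]) -> Dict[int, str]:
    """Map tasks to active cores only, in ascending core order."""
    active = [core for core, is_active in enumerate(mask) if is_active]
    if len(task_ids) > len(active):
        raise ValueError("more tasks than active cores")
    return dict(zip(active, task_ids))


def migration_overhead_us(n_migrations: int) -> float:
    """Total wake-up latency charged for a number of task migrations."""
    return n_migrations * WAKEUP_LATENCY_US


mask = checkerboard_mask()
placement = assign_tasks(["t0", "t1", "t2"], mask)
print(sum(mask), placement, migration_overhead_us(5))
```

Under this assumed mask, every active core has only dark mesh neighbors, which is one simple way to guarantee that a migration target exists next to each active core.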
In our experimental setup, we configured the threshold frequency $f_{th}$ and the frequency level $\delta$ to 200 MHz and 3800 MHz, respectively. The threshold temperature $T_{th}$ was set to 65 °C. The safe margin value $\varepsilon$ was set to 5% of the threshold temperature. The control period interval was set to 1 ms. The values of $f_{th}$, $\delta$, and $\varepsilon$ were determined empirically through several experiments in which different values were tried. These values were chosen to ensure that the system does not frequently switch between active and dark cores. The value of $T_{th}$ was selected to show the efficiency of the proposed technique, considering the temperature characteristics of the studied applications as observed on the target platform. Ref. [17] provides additional details regarding the impact of the threshold temperature on the application of DTM techniques. The results presented in this paper are averaged over ten runs of each experiment to mitigate the potential impact of random variations.
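As a worked example of the safe margin, 5% of the 65 °C threshold is 3.25 °C, which places the lower bound of the margin at 61.75 °C. The sketch below computes this and applies a simple three-way classification. Treating the margin as a hysteresis band, as well as the labels `hot`, `hold`, and `cool`, are assumptions made for illustration and are not the paper's exact decision rule.

```python
T_TH_C = 65.0            # threshold temperature from the text (degrees Celsius)
EPSILON_FRACTION = 0.05  # safe margin: 5% of the threshold temperature
CONTROL_PERIOD_MS = 1.0  # control period from the text


def safe_margin_c(t_th: float = T_TH_C, fraction: float = EPSILON_FRACTION) -> float:
    """Safe margin in degrees Celsius."""
    return t_th * fraction


def classify(temp_c: float, t_th: float = T_TH_C) -> str:
    """Hypothetical classification of a core temperature with a hysteresis band."""
    lower = t_th - safe_margin_c(t_th)
    if temp_c > t_th:
        return "hot"    # above the threshold: a DTM action is needed
    if temp_c < lower:
        return "cool"   # safely below the band: performance can be restored
    return "hold"       # inside the band: keep the current state


print(safe_margin_c(), T_TH_C - safe_margin_c(), classify(63.0))
```

A band of this kind is one way to avoid the frequent switching between active and dark cores that the parameter choice is meant to prevent.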

B. PERFORMANCE METRICS
The performance metrics used in our experimental validation are execution time $E_t$, performance slowdown $Perf_{slow}$, temperature, and power/energy. The execution time is the simulated time of a single simulation run, from its start at $t_s$ until it finishes at $t_f$:

$$E_t = t_f - t_s \tag{5}$$

The performance slowdown $Perf_{slow}$ is the penalty for using DTM techniques. $Perf_{slow}$ is the difference between the execution time when a DTM technique is used, $E_{t(DTM)}$, and the execution time when no DTM is used, $E_{t(noDTM)}$.
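The two timing metrics can be sketched as below. The absolute slowdown follows the definition in the text (a difference of execution times). The percentage form, normalized by the no-DTM execution time, is an assumption added here because the results are reported as percentages; the function names are hypothetical.

```python
def execution_time(t_s: float, t_f: float) -> float:
    """Eq. (5): simulated time between start t_s and finish t_f."""
    if t_f < t_s:
        raise ValueError("finish time precedes start time")
    return t_f - t_s


def perf_slowdown(e_t_dtm: float, e_t_no_dtm: float) -> float:
    """Slowdown as the difference between DTM and no-DTM execution times."""
    return e_t_dtm - e_t_no_dtm


def perf_slowdown_percent(e_t_dtm: float, e_t_no_dtm: float) -> float:
    """Assumed normalization: slowdown relative to the no-DTM execution time."""
    return 100.0 * perf_slowdown(e_t_dtm, e_t_no_dtm) / e_t_no_dtm


# Illustrative numbers only, not measured results.
print(execution_time(2.0, 5.0), perf_slowdown_percent(12.0, 10.0))
```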
The MIPS/W metric is the ratio between the instruction execution rate and the power consumption [61].
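A minimal sketch of this metric is given below, assuming the standard definition of MIPS (millions of instructions per second) divided by the average power in watts. The inputs correspond to the instruction count and execution time reported by Sniper and the power reported by McPAT and CACTI; the function names and the example values are hypothetical.

```python
def mips(instruction_count: int, execution_time_s: float) -> float:
    """Millions of instructions executed per second."""
    return instruction_count / (execution_time_s * 1e6)


def mips_per_watt(instruction_count: int, execution_time_s: float, power_w: float) -> float:
    """Energy efficiency: instruction execution rate per watt."""
    if power_w <= 0:
        raise ValueError("power must be positive")
    return mips(instruction_count, execution_time_s) / power_w


# Illustrative numbers only: 2 billion instructions in 0.5 s at 40 W.
print(mips_per_watt(2_000_000_000, 0.5, 40.0))
```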

C. COMPARATIVE RESULTS AND ANALYSIS
In this study, we evaluate the performance of the proposed 3D-DNaPE against the state-of-the-art CoreMemDTM [22], which takes into account both the cores and the memory banks. Moreover, we compute the performance overhead of both 3D-DNaPE and CoreMemDTM by comparing them with a baseline that does not incorporate any DTM technique. The execution time results were obtained by running the studied applications on the simulation setup outlined in the previous section and are shown in Fig. 5, which illustrates the normalized execution times of the studied applications under 3D-DNaPE and CoreMemDTM. As can be seen, 3D-DNaPE outperforms CoreMemDTM, with an improvement of up to 43% and an average improvement of 20%. The extent of the improvement depends on the specific characteristics of each application [53]. Notably, compute-intensive applications that benefit from high instruction-level parallelism capitalize on the increased frequencies offered by 3D-DNaPE, leading to substantial performance gains.
Conversely, CoreMemDTM activates all cores, which is an advantage for applications that require substantial thread-level parallelism. Fig. 6 depicts the execution phases of the studied applications. Notably, applications like Bodytrack, Ocean, and Radix exhibit considerable parallel phases and benefit from the large number of cores made available by CoreMemDTM. However, the disadvantage emerges when all cores are deactivated upon surpassing the maximum temperature threshold, resulting in overall system performance degradation.
To evaluate the performance slowdown caused by the DTM techniques, we use a baseline scenario in which the studied applications are executed without any thermal constraints, representing the maximum-performance situation. The execution time results of 3D-DNaPE and CoreMemDTM are then compared against this baseline according to Eq. (6). The percentage of performance slowdown for 3D-DNaPE and CoreMemDTM is shown in Fig. 7. Notably, 3D-DNaPE demonstrates a significantly lower performance slowdown, with reductions of up to 34% and an average of 13% compared to CoreMemDTM. As highlighted in the previous discussion, the percentage of improvement varies depending on the characteristics of each application. Typically, applications with a high percentage of parallel phases tend to increase the temperature of the many-core system. This, in turn, prompts more frequent activation of the DTM technique.
In terms of thermal management, both 3D-DNaPE and CoreMemDTM manage to keep the average system temperature below the predefined threshold. Fig. 8 shows the average transient temperature of the studied applications under a 65 °C thermal constraint. As shown in our results, 3D-DNaPE outperforms CoreMemDTM in most of the studied applications, by up to 9% and by 2% on average. In some applications, namely FFT, Fluidanimate, and Ocean, CoreMemDTM lowers the system temperature by 3%, 3%, and 5.7%, respectively. This temperature reduction is due to the deactivation of all cores. However, this leads to lower overall performance, as shown in Fig. 5.
The statistical distribution of temperature under 3D-DNaPE and CoreMemDTM can be observed from the box plots in Fig. 9. This figure shows the thermal behavior of the cores and memory banks for the applications under study. Each subfigure represents the thermal distribution of the cores and memory banks for a specific application. For the majority of applications, 3D-DNaPE exhibits up to 11 °C less thermal variability than CoreMemDTM. This observation implies that 3D-DNaPE has a more consistent temperature profile for these applications. However, there are exceptions, namely Bodytrack and Fluidanimate, where CoreMemDTM reduces the thermal variability by 3 °C and 2 °C, respectively. Also, applications like FFT and Fluidanimate show a slightly lower median temperature for CoreMemDTM. On the other hand, the range, as indicated by the whiskers, is generally tighter for 3D-DNaPE in compute-intensive applications like Blackscholes, Cholesky, and Radix. This indicates that extreme temperatures are less frequent with 3D-DNaPE for these applications.
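The quantities read off a box plot (median, quartiles, and whisker range) can be computed as in the sketch below. The text does not state how thermal variability is measured; taking it as the interquartile range (IQR) and placing whiskers at the most extreme samples within 1.5 × IQR of the quartiles (the common Tukey convention) are assumptions, as are the sample values.

```python
import statistics
from typing import Dict, Sequence


def box_stats(samples: Sequence[float]) -> Dict[str, float]:
    """Summary statistics of the kind shown in a box plot."""
    data = sorted(samples)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers extend to the most extreme samples that lie inside the fences.
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": inside[0], "whisker_high": inside[-1],
    }


# Made-up temperature samples in degrees Celsius, for illustration only.
steady = box_stats([60, 61, 62, 63, 64])
spiky = box_stats([60, 61, 62, 63, 64, 80])
print(steady["iqr"], spiky["whisker_high"])
```

In the second sample set, the 80 °C reading falls outside the upper fence and would be drawn as an outlier rather than extending the whisker.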
The energy efficiency is measured in MIPS/W according to Eq. (7). Fig. 10 shows the normalized performance, in terms of MIPS/W, of 3D-DNaPE and CoreMemDTM. 3D-DNaPE shows higher energy efficiency for almost all the studied applications. The improvement is up to 51%, with an average of 24%, compared to CoreMemDTM. This is because 3D-DNaPE activates only half of the cores, which leads to lower power consumption. However, for memory-intensive applications, CoreMemDTM matches the performance of 3D-DNaPE.
As with the thermal analysis, we use box plots to examine the statistical distribution of power for the cores and memory banks under 3D-DNaPE and CoreMemDTM. Fig. 11 shows the statistical distribution of power consumption for the studied applications, where each subfigure represents the power consumption distribution of the cores and memory banks for a specific application. For compute-intensive applications, 3D-DNaPE demonstrates up to 90 W less variability in power consumption than CoreMemDTM. This suggests that 3D-DNaPE has a more consistent power consumption profile for these applications. However, there are exceptions, such as Fluidanimate, where the variability for CoreMemDTM appears comparable to that of 3D-DNaPE. The range, as indicated by the whiskers, is generally tighter for 3D-DNaPE in applications like Blackscholes, Cholesky, and Radix. This indicates that extreme power values (both high and low) are less frequent with 3D-DNaPE for these applications.

VI. CONCLUSION
This paper presents 3D-DNaPE, a method to enhance the performance of thermally constrained 3D many-core systems. It selectively migrates tasks from hot cores to cooler dark cores, leveraging S-NUCA to reduce cache misses caused by thread migrations. Additionally, it adjusts DVFS to lower the system temperature when no nearby cold dark cores are available. In a comprehensive assessment against CoreMemDTM, 3D-DNaPE improves execution time by up to 43% and reduces performance slowdown by up to 34%. It also performs well in temperature regulation, with up to a 9% reduction in system temperature, and improves energy efficiency by up to 51%, achieved by activating only half of the cores. For future work, we plan to integrate a broader range of use cases representing different workload sizes. We also plan an in-depth analysis to enhance 3D-DNaPE's core activation strategy, potentially using machine learning algorithms for application-specific core activation. Furthermore, we plan to propose a hybrid DTM technique that combines the strengths of 3D-DNaPE and CoreMemDTM to achieve superior performance across a broader range of applications.

FIGURE 1. An illustration of the proposed system model.

FIGURE 2. An illustration of a multi-threaded application.

FIGURE 3. An illustrative example demonstrating the functionality of the proposed algorithms.

FIGURE 4. Experimental framework of the proposed work.

FIGURE 6. The percentage of serial and parallel execution phases within the studied applications.

FIGURE 8. The average transient temperature of the studied applications.

FIGURE 9. The temperature distributions of the 3D-DNaPE and CoreMemDTM techniques for the studied applications.

FIGURE 10. Normalized performance in terms of MIPS/W.

FIGURE 11. The statistical distribution of power consumption for the studied applications.


TABLE 1. Definitions of symbols used in the proposed work.

    … then
11      S[D_index] = active;
12      Reduce f[D_index] by δ;
13      Move the task from A_c[a_i] to D_c[D_index];
14      S[a_i] = dark;
15    end
16    else
17      Reduce f[a_i] by δ;
    …
    … c[a_i] < T_th − α and f[a_i] < f_th then
21      Increase f[a_i] by δ;
22    end
23  end
24 end

… the temperature of Core 27 (C27) exceeds the threshold, set at 65 °C for this example, prompting the NaP algorithm to find the coldest core among the neighboring inactive ones, which are C19, C26, C28, and C35. In this scenario, Core 19 (C19) is identified as the coldest. Subsequently, the 3D-DTM algorithm activates C19, migrates the task from C27 to C19, and then deactivates C27 to allow it to cool.
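The example above can be reproduced with a short sketch of the neighbor selection and the migrate-or-throttle decision. Row-major core numbering from 0 is an assumption (it is consistent with C27 having neighbors C19, C26, C28, and C35 on an 8 × 8 mesh). The function names, the temperatures, and the 200 MHz step used in the usage lines are hypothetical, and the sketch covers only the steps visible in the example, not the full 3D-DTM algorithm (memory bank temperatures are not modeled).

```python
from typing import Dict, List, Optional

MESH_ROWS, MESH_COLS = 8, 8  # 8 x 8 mesh from the system setup


def mesh_neighbors(core: int) -> List[int]:
    """Up to four adjacent cores, assuming row-major numbering from 0."""
    row, col = divmod(core, MESH_COLS)
    candidates = [(row - 1, col), (row, col - 1), (row, col + 1), (row + 1, col)]
    return [r * MESH_COLS + c for r, c in candidates
            if 0 <= r < MESH_ROWS and 0 <= c < MESH_COLS]


def coldest_dark_neighbor(core: int, temps: Dict[int, float],
                          state: Dict[int, str]) -> Optional[int]:
    """Neighbor-aware selection: the coldest adjacent dark core, if any."""
    dark = [n for n in mesh_neighbors(core) if state[n] == "dark"]
    return min(dark, key=lambda n: temps[n]) if dark else None


def dtm_step(core: int, temps: Dict[int, float], state: Dict[int, str],
             freq: Dict[int, float], t_th: float, delta: float) -> str:
    """One control decision for a single active core (hypothetical helper)."""
    if temps[core] <= t_th:
        return "no-op"
    target = coldest_dark_neighbor(core, temps, state)
    if target is None:
        freq[core] -= delta  # no dark neighbor available: throttle via DVFS
        return "throttle"
    state[target] = "active"  # wake the coldest dark neighbor
    state[core] = "dark"      # let the hot core cool down
    return f"migrate {core}->{target}"


# Illustrative state: C27 is hot and its four neighbors are dark.
state = {c: "active" for c in range(64)}
for c in (19, 26, 28, 35):
    state[c] = "dark"
temps = {c: 60.0 for c in range(64)}
temps.update({27: 66.0, 19: 48.0, 26: 52.0, 28: 55.0, 35: 50.0})
freq = {c: 4000.0 for c in range(64)}
result = dtm_step(27, temps, state, freq, t_th=65.0, delta=200.0)
print(result)
```

With these made-up temperatures, C19 is the coldest dark neighbor, so the step activates C19 and darkens C27, matching the example in the text.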

TABLE 2. Summary of system settings.

The MIPS/W metric is used to measure the energy efficiency of the proposed work. The instruction count $I_{count}$ and execution time $E_t$ are provided by Sniper. The power $P$ is provided by McPAT and CACTI.