DELICIOUS: Deadline-Aware Approximate Computing in Cache-Conscious Multicore

Enhancing result-accuracy in approximate computing (AC) based real-time systems, without violating the power constraints of the underlying hardware, is a challenging problem. Execution of such AC real-time applications can be split into two parts: (i) the mandatory part, whose execution provides a result of acceptable quality, followed by (ii) the optional part, which can be executed partially or fully to refine the initially obtained result and increase the result-accuracy, without violating the time-constraint. This article introduces DELICIOUS, a novel hybrid offline-online scheduling strategy for dependent AC real-time tasks. By employing an efficient heuristic algorithm, DELICIOUS first generates a schedule for a task-set with the objective of maximizing result-accuracy, while respecting system-wide constraints. During execution, DELICIOUS then introduces a prudential cache resizing that reduces the temperature of the adjacent cores by generating thermal buffers at the turned-off cache ways. DELICIOUS further trades off these thermal benefits by enhancing the processing speed of the cores for a stipulated duration, called V/F Spiking, without violating the power budget of the cores, to shorten the execution length of the tasks. The reduced runtime is exploited either to enhance result-accuracy by dynamically adjusting the optional part, or to reduce temperature by enabling sleep mode at the cores. While surpassing the prior art, DELICIOUS offers 80% result-accuracy with its scheduling strategy, which is further enhanced by 8.3% online, while reducing runtime peak temperature by 5.8°C on average, as shown by a benchmark-based evaluation on a 4-core multicore.


INTRODUCTION
In real-time systems, correctness depends not only on the result-accuracy, but also on the time at which the results are produced. For such time-critical scenarios, approximated results obtained on time are preferable over accurate results produced after the deadline. In plenty of application domains, such as multimedia computing, tracking of mobile targets, real-time heuristic search, information gathering, and control systems, an approximate result obtained before the deadline is usually acceptable [5]. For example, in video streaming, frames with lower quality are better than completely missing frames. In target tracking, an approximate estimate of the target's location generated within the deadline is better than an accurate location obtained too late. In these domains, a task is logically decomposed into a mandatory subtask and an optional subtask [9], [35], [37]. The entire mandatory subtask must be completed before the deadline to generate the minimally acceptable QoS, followed by a partial/complete execution of the optional part, subject to the availability of resources, to improve the accuracy of the initially obtained result within the deadline. The QoS increases with the number of execution cycles spent on the optional part.
Energy efficient scheduling of AC real-time task-sets that intends to improve result-accuracy without violating the underlying system constraints has become an active research avenue in the recent past. Stavrinides and Karatza were among the first to propose scheduling of an AC real-time task-set [44]. A recent theoretical analysis [37] shows how to improve system level result-accuracy through task-to-processor allocation and task adjustment constrained by an energy budget. However, limiting the energy usage does not ensure thermal safety of the chip, which can be tackled by incorporating a power constraint, like the thermal design power (TDP), together with a runtime power management that considers several architectural parameters. In an energy efficient approach to improve system level result-accuracy, Prepare [10], the authors considered the runtime architectural characteristics. However, the detailed runtime cache characteristics of the applications were not considered.
Researchers also employed integer linear programming (ILP) based scheduling strategies [10], [37] that can become prohibitively expensive for large problem sizes, which can be overcome by designing a computationally feasible heuristic strategy. In DELICIOUS, we devise an efficient scheduling heuristic to schedule approximated real-time tasks on a chip multiprocessor (CMP) platform, where the scheduling is constrained by task-dependency and deadlines. The entire strategy of DELICIOUS is summarized in Fig. 1. Our AC real-time application contains n dependent tasks (T_1 to T_n, shown at the top of Fig. 1) and the entire application has a deadline. Each task is equipped with multiple versions with a diverse set of result-accuracies, based on the respective execution length of the optional part that is executed. In DELICIOUS-Offline (shown on the left of Fig. 1), the scheduling information, i.e., which version of a task (Version ID) will be executed on which core (Processor ID) in a CMP and its starting time (Start-Time Instant), is generated for all the tasks with the objective of maximizing the overall system-level result-accuracy. All tasks are assigned a base voltage/frequency (V/F) level, which is the highest possible V/F (other than turbo mode [2]) for the underlying processor core. The generated schedule is next stored in a dispatch table (shown just below the DELICIOUS-Offline part in Fig. 1), from which task-executions are triggered.
With the objective of further enhancing the accuracy by exploiting runtime architectural characteristics (shown at the bottom of Fig. 1), DELICIOUS-Online judiciously selects and evicts dead blocks (i.e., cached data that will never be accessed again before eviction; detailed in Section 6) from the shared last level cache (LLC) and turns off spare LLC ways to reduce the temperature of the cores in their proximity. By considering the live thermal status, DELICIOUS attempts to execute tasks at a higher frequency than originally assigned for a stipulated duration (so called V/F Spiking, based on fine-grained DVFS [17]). V/F Spiking increases throughput and enables more of the optional part of a task to be executed, and thus improves the QoS without impacting the pre-determined schedule. To improve power and thermal efficiency further, DELICIOUS shuts down cores during unused slacks generated by the reduced execution times of the tasks.
The contributions of DELICIOUS are as follows: 1) Our intended problem is clearly formulated as an optimization problem, discussed in Section 4, subject to a set of constraints. 2) We present a real-time scheduling policy, DELICIOUS, for AC real-time precedence constrained task graphs (PTGs) on homogeneous CMPs. 3) We design a heuristic strategy for an AC real-time PTG on a CMP, where each task can have multiple versions with distinct degrees of accuracy (see Section 5).
In addition to delivering satisfactory performance, the strategy exhibits reasonable time complexity with comparatively low, polynomial time scheduling overheads. 4) We apply a power/thermal restriction (i.e., TDP) aware V/F Spiking technique (see Section 6), induced by online LLC-resizing, to improve the achieved QoS while keeping temperature in check, which we have empirically validated and reported in Figs. 11 and 12. 5) By shortening the execution time of each task, V/F Spiking incurs dynamic slacks, which are exploited either (i) to execute a higher task-version subject to availability, or (ii) to put the core in sleep mode to reduce core temperature (see Section 6). We further argue and empirically validate the efficacy of the task-scheduling heuristic of DELICIOUS in combination with the runtime mechanisms (see Section 7). For a set of tasks, the scheduling heuristic of DELICIOUS achieves 80% QoS, which is close to a recent ILP based optimal policy, Prepare [10], that achieves a QoS of 83%, while the running time of the ILP based optimal scheduling of Prepare is significantly higher than that of the scheduling heuristic of DELICIOUS (see Fig. 6). Our benchmark based evaluation with a 4-core baseline CMP (equipped with a 4 MB 16-way associative shared L2 cache) in our simulation setup (consisting of gem5 [8], McPAT [30], and HotSpot [48]) shows that the dynamic LLC-resizing induced and TDP aware V/F Spiking of DELICIOUS further boosts the achieved QoS by 8.3% and reduces core temperature by up to 9.2°C, while meeting the deadlines. Our empirical analysis shows that the online mechanism of DELICIOUS outperforms Prepare [10] and GDP [33] in terms of online QoS enhancement and peak temperature reduction.
To the best of our knowledge, DELICIOUS is the first scheduling mechanism that introduces a TDP aware V/F Spiking technique, induced by dead block eviction based LLC-resizing, for enhancing the QoS of a dependent AC real-time task-set without violating the deadline and the thermal constraints.
Before formulating the problem in Section 4, we discuss the relevant prior work in Section 2, and describe our system model and assumptions in Section 3. The core offline and online mechanisms of DELICIOUS are detailed in Sections 5 and 6, respectively. The evaluations of the offline and online mechanisms of DELICIOUS are presented in Section 7, before concluding the paper in Section 8. The acronyms used in our paper are expanded in Table 1.

STATE-OF-THE-ART
Minimizing energy in recent CMP based real-time systems has become a topic of paramount importance [38], [39]. Scheduling time-critical dependent tasks on a CMP platform while maintaining the energy/power constraint is gradually becoming more challenging with technology scaling [22]. Researchers recently attempted to devise energy-aware scheduling for real-time task-sets with various system-wide constraints [6], [23], [27]. In 2018, the concept of AC to meet the energy budget of a large scale real-time system was introduced for tasks without precedence constraints [9]. Other prior arts also explored AC task scheduling for embedded real-time systems while minimizing energy [9], [32], [49], for sets of independent tasks. Yu et al. first proposed the concept of "Imprecise Computation (IC)" [47], where individual tasks are decomposed into mandatory and optional parts, and their "dynamic-slack-reclamation" technique improves the system-wide QoS with more energy savings, but task-dependencies were not considered. To the best of our knowledge, in the very first attempt to schedule IC/AC dependent tasks [44], the authors measured the performance of conventional real-time scheduling techniques like Highest Level First (HLF) and Least Space Time First (LSTF) for a couple of task-sets, where one set contains the AC tasks, but energy efficiency was not considered. The energy aware scheduling of dependent AC tasks was considered in some prior works [36], [37] that employed DVFS at the cores.
Most of the prior energy/thermal management mechanisms [14], [17], [28] control the dynamic power of the cores in CMPs either by employing DVFS [41], [42] or by migrating tasks [13], [19], [20]. Recently, Roeder et al. [42] showed the effectiveness of offline-planned DVFS for a heterogeneous real-time system with a multi-version task-model, but energy efficiency can be enhanced dynamically based on the runtime characteristics of both the tasks and the system. Donald and Martonosi [14] have shown the efficacy of different DVFS techniques along with task migration policies to control temperature, where distributed DVFS applied with task migration is claimed to be the best. However, the underlying migration overheads at the caches were not accounted for. Hanumaiah et al. [26] proposed a thermal efficient thread migration that was integrated with DVFS to reduce the temperature of homogeneous CMPs [25]. Recently, Esmaili et al. also integrated DPM, DVFS, and task migration in constrained scheduling, but the power budget of the system was not included [16]. Another study shows how combining DVFS and DPM can significantly boost the system throughput and thermal efficiency of large sized CMPs [29]. A couple of recent attempts have tried to combine DVFS with cache based policies [11], [12], but their efficacy in improving the QoS of AC real-time systems has not been studied. Moreover, these studies did not consider the block recency before evicting blocks from the LLC, which we have studied in DELICIOUS.

DELICIOUS Over Prior Arts
In DELICIOUS, we investigated the potential of LLC way-shutdown in improving the thermal efficiency of a multicore system, and how this benefit can be traded off to improve core performance. DELICIOUS first proposes a novel heuristic-based offline scheduling algorithm for a set of dependent AC real-time tasks, with the objective of improving the QoS (see Section 5). The QoS is further boosted during execution by employing an LLC resizing based mechanism that shuts down cache ways to reduce the temperature of the cores in proximity, which assists the cores in maintaining a higher V/F for a stipulated time-span (see Section 6). Our results also illustrate that both the offline and online mechanisms of DELICIOUS surpass the recent techniques. To the best of our knowledge, DELICIOUS is the first technique that employs dynamic LLC-resizing for scheduled AC real-time tasks to reduce core temperature, which further offers room for V/F Spiking to enhance result-accuracy on-the-fly, while maintaining deadline and thermal safety.

SYSTEM MODEL AND ASSUMPTIONS
The considered CMP consists of m homogeneous cores, denoted as P = {P_1, P_2, ..., P_m}. Each core supports L distinct V/F levels, denoted as V = {V_1, V_2, ..., V_L} and F = {F_1, F_2, ..., F_L}, where V_i < V_{i+1} and F_i < F_{i+1}. The offline schedule is generated by considering a single base core V/F (V_L/F_L), at which a core can execute tasks until completion without any potential thermal threats [1]. However, in the online phase, during V/F Spiking, a core can execute tasks at a higher V/F than the base level for a stipulated duration.
Our application is represented as a precedence task graph (PTG) (see Fig. 2), G = (T, E), where T is a set of tasks (T = {T_i | 1 ≤ i ≤ n}) and E is a set of directed edges (E = {⟨T_i, T_j⟩ | 1 ≤ i, j ≤ n, i ≠ j}), representing the task-dependency or precedence relations between a distinct pair of tasks. An edge ⟨T_i, T_j⟩ implies a precedence, i.e., a task T_j can start its execution only after T_i is executed. Our single source and single sink tasks have no predecessors and no successors, respectively. Being a real-time application, G has to be completed within its deadline D_PTG [46]. Each task T_i consists of a mandatory part M_i and an optional part O_i, and is available in k_i versions; DELICIOUS selects a particular version among the k_i versions of T_i, and the selection procedure is detailed in the following section. For each optional part of a task (O_i), there exists a separate executable module that is executed after the execution of the mandatory portion (M_i) of the respective task T_i. The length of the j-th version of task T_i (len_i^j) can be defined as: len_i^j = M_i + O_i^j. Note that len_i^j includes the cycles required for accessing the LLC, which we obtain by executing an individual task for a particular configuration. We define the result-accuracy Acc_i^j of T_i^j as the executed optional part of the task, O_i^j (i.e., Acc_i^j = O_i^j). Thus, the overall system level result-accuracy (QoS) is defined as the sum of the executed optional cycles over all tasks [9], which can be represented as: QoS = Σ_{i=1}^{n} Acc_i^{z_i}, where z_i is the selected version of T_i. Note that, in addition to the execution of M_i for each task, we also need to execute at least one version of O_i within the deadline.
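To make the task model concrete, the following minimal sketch encodes len_i^j = M_i + O_i^j and the QoS definition; the cycle counts are hypothetical, not values from the paper:

```python
class ACTask:
    """An AC task with a mandatory part and multiple optional-part versions."""
    def __init__(self, name, mandatory, optional_versions):
        self.name = name
        self.M = mandatory              # mandatory cycles M_i
        self.O = optional_versions      # optional cycles O_i^j per version, ascending

    def length(self, j):
        """len_i^j = M_i + O_i^j: total cycles of the j-th version."""
        return self.M + self.O[j]

    def accuracy(self, j):
        """Acc_i^j = O_i^j: executed optional cycles of the j-th version."""
        return self.O[j]

def qos(tasks, chosen):
    """System-level QoS: sum of executed optional cycles over all tasks."""
    return sum(t.accuracy(chosen[t.name]) for t in tasks)

t1 = ACTask("T1", mandatory=10, optional_versions=[2, 5, 9])
t2 = ACTask("T2", mandatory=8, optional_versions=[1, 4])
print(qos([t1, t2], {"T1": 2, "T2": 1}))  # 9 + 4 = 13
```

Selecting a lower version of a task shrinks both its length and its accuracy contribution, which is exactly the trade-off the scheduler exploits.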

PROBLEM FORMULATION
In order to present a formal model of the problem and its objective, we have formulated it as a constraint optimization problem. Let us consider a binary decision variable Z_ikth, where i = 1, 2, ..., n; k = 1, 2, ..., k_i; t = 0, 1, ..., D_PTG; and h = 1, 2, ..., m. Here, the indices i, k, t, and h denote the task ID, the corresponding version ID, the timestamp, and the processor ID, respectively. The variable Z_ikth is 1 if the k-th version of T_i (T_i^k) starts its execution at the t-th timestamp on processor h. This eventually enforces that Z_ikth for T_i will be zero for all other possible combinations, i.e., it cannot start on any other processor with another version at any other timestamp. We now present the objective function with constraints on the binary variables to model the scheduling problem.

Maximize QoS (1a)

QoS = Σ_{i=1}^{n} Σ_{k=1}^{k_i} Σ_{t=0}^{D_PTG} Σ_{h=1}^{m} Acc_i^k · Z_ikth (1b)

Subject to:

Σ_{k=1}^{k_i} Σ_{t=0}^{D_PTG} Σ_{h=1}^{m} Z_ikth = 1, ∀i (1c)

Σ_{i=1}^{n} Σ_{k=1}^{k_i} Σ_{t'=max(0, t−len_i^k+1)}^{t} Z_ikt'h ≤ 1, ∀t, ∀h (1d)

et_i ≤ st_j, ∀⟨T_i, T_j⟩ ∈ E (1e)

et_i ≤ D_PTG, ∀i (1f)

st_i = Σ_k Σ_t Σ_h t · Z_ikth (1g)

el_i = Σ_k Σ_t Σ_h len_i^k · Z_ikth (1h)

et_i = st_i + el_i (1i)

Equation (1b) presents the objective function in the above formulation, whereas Equation (1c) enforces the constraint that each task must start its execution on a particular processor at a unique timestamp with a unique version. In this scheduling problem, resource bounds for processors must be satisfied at each timestamp. Any processor can execute at most one task at a given time without any preemption (Equation (1d)). Equations (1e) and (1f) enforce the execution dependency and deadline satisfaction constraints, respectively, whereas the start time (st_i), execution length (el_i), and end time (et_i) are defined in Equations (1g), (1h), and (1i), respectively.

Algorithm 1. DELICIOUS-Offline
Input: i. The task graph G(T, E); ii. the k_i versions of each task T_i; iii. D_PTG. Output: QoS(A).
1 Set z_i = k_i for each task T_i ∈ T; /* Start with the highest versions */
2 if Sched-Gen() returns TRUE then go to line 11;
3 Organize all tasks in a min-heap with PF(T_i, z_i) as key;
4 while Sched-Gen() returns FALSE and ∃ T_i with z_i > 1 do
5   if multiple tasks share the minimum PF value then
6     Order the tied tasks by their ALAP values;
7     Select T_i from the priority list (with highest ALAP value);
8   Extract the task T_j at the root of the min-heap;
9   z_j = z_j − 1; /* Decrease the current version of T_j by one */
10  Compute PF(T_j, z_j) and re-heapify;
11 Calculate QoS(A) as: QoS(A) = Σ_{i=1}^{|T|} Acc_i^{z_i};
12 Return QoS(A);
Our scheduling problem stated above lends itself to computation using a standard optimization tool, such as CPLEX. However, the presence of numerous decision variables and constraints makes the problem computationally highly complex. Therefore, solution techniques using standard optimizers like CPLEX are often computationally expensive in terms of time and space, even for moderate problem sizes with respect to the number of tasks, the number of processors, the nature of inter-task dependencies, etc. We reiterate here that the main motivation behind the above encoding of our problem is the clarity it lends to understanding and appreciating the structure of the scheduling problem at hand. Such a realization is immensely useful for designing and analyzing an efficient, lower overhead heuristic strategy for the problem. We next present DELICIOUS-Offline, an efficient heuristic algorithm for the problem discussed above.
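For intuition only, the semantics of the decision variables and constraints can be exercised by brute force on a toy instance; the two-task PTG, version lengths/accuracies, deadline, and single processor below are illustrative assumptions, not data from the paper:

```python
from itertools import product

# Toy instance: T1 -> T2 on m = 1 processor, deadline D_PTG = 10.
# Each version k is a (length, accuracy) pair; Z_ikth is emulated by picking,
# per task, exactly one (version, start time, processor) triple -- (1c).
versions = {"T1": [(3, 1), (5, 4)], "T2": [(2, 1), (4, 3)]}
edges = [("T1", "T2")]
D_PTG, procs = 10, [0]

best = (-1, None)
for choice in product(
        *(product(range(len(versions[t])), range(D_PTG + 1), procs)
          for t in ("T1", "T2"))):
    sched = dict(zip(("T1", "T2"), choice))
    end = {t: st + versions[t][k][0] for t, (k, st, h) in sched.items()}
    if any(end[t] > D_PTG for t in sched):              # (1f) deadline
        continue
    if any(end[u] > sched[v][1] for u, v in edges):     # (1e) precedence
        continue
    ivals = sorted((sched[t][1], end[t]) for t in sched)
    if any(nxt[0] < cur[1] for cur, nxt in zip(ivals, ivals[1:])):
        continue                                        # (1d) no overlap
    acc = sum(versions[t][k][1] for t, (k, st, h) in sched.items())
    if acc > best[0]:                                   # objective (1b)
        best = (acc, sched)
print(best[0])  # maximum achievable QoS on this toy instance
```

Even on this tiny instance the search space is (2 × 11)² per processor, which illustrates why enumerating Z_ikth directly does not scale and a heuristic is preferred.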

DELICIOUS-OFFLINE PHASE
Typically, list scheduling-based heuristic techniques are employed to compute feasible schedules for PTGs executing on multi-cores. They attempt to construct a static schedule for the given PTG, to minimize the overall schedule length, while satisfying resource and precedence constraints. In contrast, our heuristic strategy tackles the problem of scheduling a PTG consisting of task nodes with multiple versions, to maximize the overall system accuracy, while satisfying the deadline constraint. For this purpose, we devise our heuristic algorithm, DELICIOUS-Offline, to generate a schedule by setting all task nodes to their highest versions. Since DELICIOUS-Offline attempts to maximize the overall system level accuracy, the resulting schedule length may however violate the given deadline. This situation is then remedied by degrading the versions of tasks, while minimizing the impact on the overall system accuracy.

DELICIOUS-Offline Algorithm
Our heuristic algorithm for DELICIOUS-Offline is represented in Algorithm 1, which first attempts to generate a feasible schedule considering the highest version of all the tasks by calling Algorithm 3, Sched-Gen (line 1 to 2). Sched-Gen yields TRUE if a feasible schedule is possible by satisfying the resource and deadline constraints for each task, and returns FALSE along with the ALAP (As Late As Possible) times (generated by Algorithm 2) otherwise. Considering all tasks at their respective highest versions may not be feasible due to their high temporal requirements. If Sched-Gen yields FALSE, then DELICIOUS-Offline enters a while loop until a feasible schedule for a chosen set of task versions is generated, or all tasks have been reduced to their lowest versions (line 4 to 10). This while loop maintains the tasks in a priority queue organized as a min-heap with a parameter called the Penalty Factor (PF) as key (Equation (2)). For a given task T_i with its current version z_i, PF(T_i, z_i) is defined by the reduction in achieved accuracy as T_i's version is lowered from z_i to z_i − 1, and is calculated as: PF(T_i, z_i) = Acc_i^{z_i} − Acc_i^{z_i − 1} (2). If two tasks exhibit the same PF values, then the task with the higher ALAP value will be selected from the ordered list (line 6 to 7). This is mainly due to the fact that, in such a priority list based on the tasks' ALAP times, the actual value of the ALAP time of a task provides an estimate of the remaining computational demand before completion of the sink task. For any given deadline bound, a relatively lower ALAP time for a task indicates a higher remaining processing requirement. Hence, DELICIOUS-Offline attempts to lower the version of a task having a higher ALAP value, i.e., having a lower processing requirement and less dependency.
In each iteration of the loop, DELICIOUS-Offline extracts the task (T_j) at the root of the min-heap (the task with the minimum PF value), reduces its version by one, and checks whether Sched-Gen returns TRUE (line 9 to 10). If Sched-Gen yields TRUE, then a feasible schedule is obtained. DELICIOUS-Offline will then calculate and return the obtained system level QoS as output (line 11 to 12).
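The degradation loop of Algorithm 1 can be sketched as follows; the accuracy values, ALAP times, and the simple cycle-budget feasibility test standing in for Sched-Gen are all illustrative assumptions:

```python
import heapq

acc = {"T1": [2, 5, 9], "T2": [1, 4, 6], "T3": [3, 7, 8]}   # Acc_i^z per version
length = {t: [10 + a for a in acc[t]] for t in acc}          # len_i^z = M_i + O_i^z
alap = {"T1": 0, "T2": 5, "T3": 9}
BUDGET = 48                       # crude stand-in for Sched-Gen's feasibility test

z = {t: len(acc[t]) - 1 for t in acc}        # start with the highest versions

def feasible():
    return sum(length[t][z[t]] for t in z) <= BUDGET

def pf(t):
    """Penalty Factor: accuracy lost by lowering T_t one version (Eq. (2))."""
    return acc[t][z[t]] - acc[t][z[t] - 1]

# Min-heap keyed on (PF, -ALAP): PF ties break toward the higher ALAP value.
heap = [(pf(t), -alap[t], t) for t in z if z[t] > 0]
heapq.heapify(heap)
while not feasible() and heap:
    _, _, t = heapq.heappop(heap)
    z[t] -= 1                                # degrade T_t by one version
    if z[t] > 0:
        heapq.heappush(heap, (pf(t), -alap[t], t))  # recompute PF, re-heapify

print(feasible(), sum(acc[t][z[t]] for t in z))     # achieved QoS(A)
```

Each degradation removes the cheapest accuracy loss first, so the loop converges on a feasible schedule while giving up as little QoS as possible.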

Algorithm 2. ALAP Time Calculation
Input: i. The task graph G(T, E); ii. z_i: the selected version of each task T_i; iii. len_i^{z_i}: the execution length of the z_i-th version of T_i; iv. D_PTG: the deadline of the task graph.
Output: e_i^{la}: the latest start time of each task T_i.
1 for each task T_i ∈ T, in reverse topological order, do
2   if T_i is the sink task then
3     e_i^{la} = D_PTG − len_i^{z_i};
4   else
5     Calculate the minimum of the latest start times, min(e_j^{la}), ∀ T_j ∈ Succ(T_i); // Let task T_sc have the minimum value of the latest start times among all successors of T_i
6     e_i^{la} = e_sc^{la} − len_i^{z_i};
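A compact sketch of this ALAP computation over a small hypothetical PTG (task lengths and deadline are assumed, not taken from the paper):

```python
# Backward pass of Algorithm 2: the sink starts as late as D_PTG allows,
# every other task as late as its earliest-starting successor allows.
succ = {"T1": ["T2", "T3"], "T2": ["T4"], "T3": ["T4"], "T4": []}
length = {"T1": 5, "T2": 10, "T3": 7, "T4": 4}   # len_i^{z_i} of chosen versions
D_PTG = 30

e_la = {}
for t in ["T4", "T3", "T2", "T1"]:               # reverse topological order
    if not succ[t]:                              # sink task
        e_la[t] = D_PTG - length[t]
    else:
        # e_la_i = (minimum latest start among successors) - len_i^{z_i}
        e_la[t] = min(e_la[s] for s in succ[t]) - length[t]
print(e_la)
```

Here T1's ALAP time (11) is the smallest, matching the intuition that the task heading the longest remaining chain has the least scheduling slack.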

Schedule Generation (Sched-Gen)
DELICIOUS-Offline calls Sched-Gen (Algorithm 3) to determine a valid schedule for the stipulated set of task versions chosen by DELICIOUS-Offline. Initialization and Task Prioritization (line 1 to 8). Algorithm 3 begins its execution by creating a set, denoted as FP, which holds the currently free processors. Sched-Gen uses a relative priority order amongst all tasks based on the tasks' ALAP start times, considering each task T_i at its currently selected version z_i. This priority list based on the tasks' ALAP times ensures that inter-task precedence relationships are always satisfied (the ALAP time of a predecessor task is always less than the ALAP times of all its successors).
Task Mapping and Execution (line 9 to 29). Sched-Gen first assigns the tasks with no predecessors to separate processors. It then considers a task only when all of its predecessor task(s) have finished their executions. Such task-to-processor assignment ensures that a task begins at the latest finishing time of its predecessors. If a task has a single predecessor, then DELICIOUS can consider the task right after the finishing time of that predecessor. When a task has multiple predecessors, DELICIOUS considers the predecessor with the latest finishing time. The successor task may be assigned to the same processor as its predecessor with the latest finishing time. All tasks executing at a given time run in parallel on the available processors. A task (say, T_j) mapped to a processor (say, P_i) will continue its execution until the execution requirement of the task is finished. The variable PBP_i denotes the "Processor Busy Period," which provides the remaining execution requirement of T_j on P_i; thus, PBP_i becomes zero when T_j finishes its execution (line 20 to 21). After a task finishes its execution, it is added to the set FT and removed from T (line 22 to 23). The set FT is finally stored in the dispatch table. The above processes of task mapping and execution continue iteratively either until all tasks in T complete their executions, or until the deadline D_PTG is encountered. In line 24 to 29, DELICIOUS checks whether the number of finished tasks (FT) is equal to the number of tasks given in the input set T. Any mismatch infers an incomplete schedule; otherwise, the schedule is a successful one and DELICIOUS-Offline will return TRUE.
13 Select processor P_i with PL_i == FALSE;
14 Set PL_i = TRUE; /* Set P_i to busy */
15 Map T_j on processor P_i;
16 st_j = t; /* Set current time t as the execution start time of T_j */
17 PBP_i = len_j^{z_j}; /* Start execution of T_j; PBP_i: an integer variable denoting the Processor Busy Period, which holds the remaining time required to finish the current task on P_i */
18 FP = FP \ P_i; /* Remove P_i from set FP */
19 else

Our heuristic algorithm is associated with a few carefully selected, restricted design choices that assist in controlling the complexity. It can be observed that distinct schedules can be generated with each task (T_i) assigned to any of the available processors (P). Hence, the number of possible schedules depends on the number of tasks and processors. The schedule with enhanced accuracy could be any one of the subset of schedules that satisfy the precedence, resource, and timing constraints. However, to limit the complexity of this compute-intensive problem, our heuristic uses a phase-based approach. At first, it generates an accuracy-maximized schedule by restricting all tasks to their respective highest versions. Given the task order and task-to-processor assignments provided by the first phase, the task-versions are adjusted in the second phase, whilst meeting the deadline.
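The list-scheduling behavior described above (an ALAP-ordered ready list, non-preemptive execution, and a PBP_i counter per processor) can be sketched as follows; the task-set, lengths, and ALAP values are illustrative assumptions:

```python
# Minimal non-preemptive list scheduler in the spirit of Sched-Gen.
pred = {"T1": [], "T2": ["T1"], "T3": ["T1"], "T4": ["T2", "T3"]}
length = {"T1": 5, "T2": 10, "T3": 7, "T4": 4}
alap = {"T1": 11, "T2": 16, "T3": 19, "T4": 26}
D_PTG, m = 30, 2

finish, start, t = {}, {}, 0
pbp = [0] * m                      # Processor Busy Period per core
running = [None] * m
ready = lambda x: x not in finish and x not in running and \
                  all(p in finish and finish[p] <= t for p in pred[x])
while len(finish) < len(length) and t <= D_PTG:
    for i in range(m):             # retire tasks whose busy period expired
        if running[i] and pbp[i] == 0:
            finish[running[i]] = t
            running[i] = None
    todo = sorted((x for x in length if ready(x)), key=lambda x: alap[x])
    for i in range(m):             # map ready tasks onto free processors
        if running[i] is None and todo:
            task = todo.pop(0)
            running[i], start[task], pbp[i] = task, t, length[task]
    t += 1
    pbp = [max(0, p - 1) for p in pbp]
print(len(finish) == len(length), finish)   # TRUE iff the schedule completed
```

T2 and T3 run in parallel after T1, and T4 starts at the latest finishing time of its two predecessors, mirroring the mapping rule described in the text.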
Example DELICIOUS-Offline. Let us consider a representative example with the task-set given in Table 2, which is pictorially represented in Fig. 2. These tasks have to be scheduled on two processors (m = 2), with a deadline D_PTG = 70 time units. In Fig. 3 A, we have shown that, if the tasks are scheduled only with their respective highest versions, this will lead to a deadline failure. Hence, by choosing different versions of the tasks, our algorithm generates a feasible schedule, which is depicted in Fig. 3 B. Here, T_2 and T_6 are executed with lower versions to satisfy the deadline. Our total obtained QoS value is 48.

DELICIOUS-ONLINE PHASE
To improve the accuracy or the energy/thermal efficiency of the generated schedule, the selected V/F setting can be changed dynamically, but that might cause deadline failures if not managed carefully. DELICIOUS-Online attempts to reduce core temperatures by employing a dynamic LLC resizing that generates on-chip thermal buffers by shutting down cache ways in close vicinity to the cores (see Fig. 3 C). The gained thermal benefits are traded off by a TDP cognizant V/F scaling of the cores, named here as V/F Spiking, that reduces the execution length of the tasks. DELICIOUS-Online uses this performance increase either to improve task accuracy while the core temperature is kept in check, or to enhance energy and thermal efficiency by power gating the core (sleep mode) during the generated slack. The possible task level changes by DELICIOUS-Online are illustrated in Fig. 5, and we magnify the V/F Spiking induced version upgrade for a task (T_2) in Fig. 3 D. Our LLC resizing selectively evicts dead blocks through periodic runtime analysis and trims the LLC to improve the energy/thermal efficiency without any noticeable performance impact.

Detecting Dead Blocks and Thermal Management at LLC
It is a well known fact that much of the data stored in the LLC is dead, i.e., the data will never be accessed again before being evicted. In fact, a substantial fraction (more than 80%) of all cache blocks at any particular time are dead, and many are dead on arrival (DOA) [18], [31]. Hence, proactive eviction of dead blocks can offer a significant amount of spare cache space to the current application, which can either be used for more live blocks to enhance performance, or be turned off to save energy. However, as the LLC is the last line of defense before off-chip accesses, dead block detection and eviction should be done prudentially to maintain performance.
Detecting dead blocks at the block level granularity requires individual counters for each LLC block, where the size of the individual counters can incur implementation overheads. To simplify our implementation and considering time-criticality, we decided to detect only DOA blocks and eventually evict them. We employ a single bit, called the Dead_bit, to track whether a block is DOA. When a block is brought into the cache, the bit is set, and it is cleared if the block is accessed again. We periodically check the Dead_bit and evict the block if the bit is still set. The check is performed one block at a time, iterating through all blocks within the predetermined period. Note that, for checking the dead-bits and evicting the dead blocks, a small time-slice is reserved at the end of each period, called the back-up period (BackPer). For our baseline 16-way set-associative, 4 MB LLC, the storage overhead for implementing the Dead_bit is negligible, at around 0.2%.
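A minimal sketch of the Dead_bit bookkeeping (set on fill, cleared on reuse, swept during the back-up period); the tag set and data structures are illustrative, not the actual hardware implementation:

```python
class Block:
    def __init__(self, tag):
        self.tag, self.dead_bit = tag, True    # set on arrival (DOA candidate)

class DOAFilter:
    """Single-bit dead-on-arrival filter over a set of cached blocks."""
    def __init__(self):
        self.blocks = {}

    def fill(self, tag):
        self.blocks[tag] = Block(tag)          # new block: Dead_bit = 1

    def access(self, tag):
        if tag in self.blocks:
            self.blocks[tag].dead_bit = False  # reused: not dead-on-arrival

    def sweep(self):
        """Back-up period (BackPer): evict every block still flagged DOA."""
        doa = [t for t, b in self.blocks.items() if b.dead_bit]
        for t in doa:
            del self.blocks[t]
        return doa

llc = DOAFilter()
for tag in ("A", "B", "C"):
    llc.fill(tag)
llc.access("B")                                # only B is reused
print(sorted(llc.sweep()))                     # ['A', 'C'] evicted as DOA
```

A single bit per block suffices precisely because the policy distinguishes only "never reused" from "reused at least once," which keeps the stated overhead at roughly 0.2%.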
After detecting the dead blocks, DELICIOUS proactively evicts them from the LLC and turns off LLC ways to generate on-chip thermal buffers and to reduce the temperature of the cores in their vicinity [11], [12]. Basically, the temperature of any on-chip component is guided by the basic superposition and reciprocity principles of heat transfer, which is driven by three factors: (1) the component's own power consumption, (2) heat dissipation to the ambient, and (3) conductive heat transfer with its peers [45]. Hence, prudential selection of these LLC-ways for shutting down on-the-fly can potentially reduce the chip temperature [12], by (a) curtailing the LLC's own power consumption and (b) incorporating heat transfer with the peers at the generated on-chip thermal buffers, while maintaining performance. A significant number of LLC entries are DOA and, if evicted, free up a large portion of the LLC as spare. However, such proactively generated empty locations might be scattered throughout the LLC, and they have to be compacted to enable power gating of a complete cache way. This generates contiguous large thermal buffers, which help in reducing the temperature of the adjacent cores. Hence, we incorporate a simple but effective block swapping mechanism, discussed later, that prioritizes invalidation over write-back, and eventually empties an LLC way at the edge of the LLC bank before turning it off. By periodically monitoring the DOA blocks, and the availability of spare cache space after eviction, DELICIOUS-Online dynamically decides the number of LLC ways that can be power gated.
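The way-gating decision can be approximated as follows: after DOA eviction, an edge way is gated only if its remaining valid blocks can be absorbed by the spare capacity of the inner ways. The 8-way layout and occupancy numbers are illustrative assumptions, and the swap mechanism itself is not modeled here:

```python
occupancy = [6, 5, 4, 4, 3, 2, 1, 1]   # valid blocks left per way after DOA eviction
CAP = 8                                 # blocks per way (illustrative)

def ways_to_gate(occupancy, capacity):
    """Gate edge ways whose displaced blocks fit in the inner ways' spare space."""
    gated = 0
    for w in range(len(occupancy) - 1, 0, -1):     # from the bank edge inward
        spare = sum(capacity - o for o in occupancy[:w])
        if sum(occupancy[w:]) <= spare:            # displaced blocks can be swapped in
            gated = len(occupancy) - w
        else:
            break
    return gated

print(ways_to_gate(occupancy, CAP))
```

Gating proceeds from the bank edge inward so that the powered-off ways form one contiguous thermal buffer adjacent to the cores, as described above; at least one way is always kept on.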

V/F Spiking: Effects and Amelioration
Increasing the V/F for a short duration, so called V/F Spiking, can enhance result-accuracy if the core temperature can be kept in check by addressing the following issues: When should V/F Spiking be triggered? How long can the core maintain the increased V/F? To answer these questions, one should consider the dynamic and leakage power consumption of the cores at different V/F settings and temperatures, along with the TDP of the cores. During task execution, DELICIOUS evenly divides the entire execution span into multiple periods, where at the end of each period, a decision on V/F Spiking is taken. At the end of a period, if the core temperature is detected to be sufficiently below the critical temperature, then the power consumption of the core is evaluated to determine whether an increased V/F can be maintained without violating the power constraint. The dynamic power consumption (Dyn_pow) at the target increased V/F is derived by employing the following equation: Dyn_pow = a · C · V_dd² · f, where a and C are circuit related constants, and V_dd and f represent the supply voltage and the core-frequency, respectively. By considering the current temperature (T) and the target increased voltage, the leakage power consumption (Leak_pow) of the core at the end of the period can be derived through the following equation: Leak_pow = V_dd · (A_1 · T² · e^{(A_2 · V_dd + A_3)/T} + A_4 · e^{(A_5 · V_dd + A_6)}), where A_1 to A_6 are technology dependent constants. DELICIOUS inspects the available V/F levels and selects the maximum possible V/F setting for the upcoming period so that the TDP is not violated during the next period. The span of a period can be determined empirically or from processor characteristics, during which the core temperature can be assumed to remain unchanged.
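A hedged sketch of the spiking decision built from the two power equations above; all constants (a, C, A_1 to A_6), the TDP value, and the V/F levels are illustrative assumptions, not the paper's calibrated parameters:

```python
import math

a, C = 0.9, 1.2e-9                         # circuit-related constants (assumed)
A1, A2, A3, A4, A5, A6 = 1e-4, 2.0, -1.0, 1e-3, 1.5, -2.0
TDP = 18.0                                 # watts (assumed)
levels = [(0.9, 1.0e9), (1.0, 1.5e9), (1.1, 2.0e9), (1.2, 2.5e9)]  # (Vdd, f)

def dyn_pow(vdd, f):
    """Dyn_pow = a * C * Vdd^2 * f"""
    return a * C * vdd ** 2 * f

def leak_pow(vdd, temp):
    """Leak_pow = Vdd * (A1*T^2*e^((A2*Vdd+A3)/T) + A4*e^(A5*Vdd+A6))"""
    return vdd * (A1 * temp ** 2 * math.exp((A2 * vdd + A3) / temp)
                  + A4 * math.exp(A5 * vdd + A6))

def pick_vf(temp):
    """Highest V/F level whose projected total power respects the TDP."""
    best = levels[0]
    for vdd, f in levels:                  # levels are in ascending order
        if dyn_pow(vdd, f) + leak_pow(vdd, temp) <= TDP:
            best = (vdd, f)
    return best

print(pick_vf(330.0), pick_vf(370.0))      # a cooler core sustains a higher spike
```

Because leakage grows with temperature, the same TDP headroom admits the top spike level at 330 K but only the next level down at 370 K, which is the core intuition behind trading LLC-induced cooling for V/F Spiking.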
Maintaining a higher V/F setting for a period of time increases the core temperature, resulting in an increase in leakage power, which in turn generates heat in a self-reinforcing cycle and can potentially affect the functional correctness of the chip. Employing an analytical formulation that estimates the generated heat from the power values could determine the duration of the increased V/F residency [48]. However, the dynamic LLC resizing of DELICIOUS-Online, which significantly impacts the cores' thermal status, must be accounted for to correctly estimate the temperature, and this resizing depends on the application's cache access behavior. Our TDP based mechanism safeguards the core from thermal overshoot, but analytically determining the duration of the increased V/F residency might be unable to exploit the thermal benefits offered by LLC resizing. Hence, DELICIOUS-Online monitors the core temperature periodically using thermal sensors. Once the core temperature reaches the maximum threshold (Temp_Max), the V/F is reduced to the level at which the task is scheduled, and thus the duration of V/F Spiking is determined dynamically.

Proposed Online Technique
DELICIOUS-Online consists of two modules: the LLC Resizing module, implemented at the LLC controller of each LLC bank (discussed in Section 6.3.1), and the V/F Spiking module, implemented at the controller of the cores (discussed in Section 6.3.2). We illustrate the technique of DELICIOUS-Online in Algorithm 4. A complete schedule of the task-set, called a Frame, is generated offline and its details are kept in the dispatch table, where timing parameters of the tasks are converted into cycles prior to insertion.
As long as all tasks are not selected from the Dispatch Table, each task (T_i) within a Frame is fetched as per the schedule and its execution is initiated (lines 1-4). For each LLC bank, Algorithm 5 is executed simultaneously at the LLC controller, to create on-chip thermal buffers by prudentially managing dead blocks (lines 5-6). The gained thermal benefits are traded off to improve accuracy by employing V/F Spiking at each core during task execution (lines 8-10). Note that Algorithm 6 is executed at the respective core controllers, and is transparent to Algorithm 5.

Algorithm 4. DELICIOUS-Online
1 while all tasks are not selected from the Dispatch Table do
2   for each task T_i in the Dispatch Table do
3     Get schedule details of T_i from the Dispatch Table;
4     Fetch T_i and start execution;
5     for each LLC bank do
6       Call Algorithm 5;
7       # Execute simultaneously at each bank;
8     for each Core do
9       Call Algorithm 6;
10      # Execute simultaneously at each core;

LLC Resizing Technique
DELICIOUS-Online is primarily built on the LLC Resizing mechanism, which stimulates the thermal efficiency of the cores adjacent to the power gated LLC portions. Fig. 4 depicts the effects of the power gated ways by illustrating the heat transfer from the adjacent cores. Before gating the ways, DELICIOUS-Online proactively evicts prudentially selected dead blocks from the LLC. After eviction of these dead blocks, a number of cache ways are emptied by employing a swapping based compaction technique within each individual set. Once the selected way(s) is (are) empty, it is power gated. The entire LLC Resizing mechanism is illustrated in Algorithm 5. The whole task execution span is evenly divided into multiple time-intervals (Curr_Interval), and a small time-span, BackPer (back-up period), is reserved at the end of each Curr_Interval, during which all the resizing related operations are performed. On completion of each Curr_Interval − BackPer, the current performance of the bank (B) is determined by its miss ratio (ratio[B], line 4). If the miss ratio is less than a preset threshold (POWER_DOWN) and the number of turned off LLC ways (#Off_ways[B]) is within a preset limit (Limit), then a way (W) adjacent to a core is selected as the victim (lines 5 to 6). The location details of the LLC ways and their adjacency to the cores are determined from the floorplan of the CMP, which is an input to our algorithm [11], [12]. For each set (S), the presence of dead blocks (blk) is determined by inspecting whether their respective Dead_bit[blk] is set (line 8). If a dead block is clean, it is invalidated; otherwise it is written back to the main memory (lines 9 to 12).
On completion of the dead block eviction process, a set might still not have an empty location at the victim way W (line 13). Set S is then checked for an empty location, and once one is found, the block is moved there from W (line 15). However, if S does not have any empty location at the moment, a clean NMRU (CN) block is searched for in S and invalidated if present. Otherwise, an NMRU block is selected from W, if available, or from any other random location of S, and written back subsequently. Next, the block from W is moved to this empty location (lines 17 to 24). Once W is empty for all sets, it is gated and #Off_ways[B] is updated (line 25). If, at the end of a Curr_Interval, ratio[B] is higher than a preset threshold (POWER_UP) and B has at least one way turned off, a way is then turned on (lines 27 to 28). No LLC reconfiguration is permitted within Curr_Interval − BackPer or after completion of the resizing process (lines 29 and 31).
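The per-set compaction described above can be sketched as follows; the Block representation, the write-back modeling, and the assumption that a suitable clean non-MRU block always exists in the fallback are simplifications for illustration:

```python
# Illustrative sketch of the per-set compaction step that empties a victim way W:
# evict dead blocks first, then relocate any remaining block of W into an empty
# slot, freeing one by invalidating a clean non-MRU block if necessary.
from dataclasses import dataclass
from typing import List

@dataclass
class Block:
    valid: bool = False
    dirty: bool = False
    dead: bool = False
    mru: bool = False   # most-recently-used blocks are not eviction candidates

def compact_set(cache_set: List[Block], victim_way: int) -> None:
    # 1) Evict dead blocks: invalidate clean ones, write back dirty ones
    #    (write-back is modeled here as simply clearing the dirty bit).
    for blk in cache_set:
        if blk.valid and blk.dead:
            blk.valid = blk.dirty = False
    w = cache_set[victim_way]
    if not w.valid:
        return  # victim way is already empty in this set
    # 2) Find an empty slot outside the victim way.
    empty = next((i for i, b in enumerate(cache_set)
                  if not b.valid and i != victim_way), None)
    if empty is None:
        # 3) No empty slot: invalidate a clean non-MRU block to create one
        #    (such a block is assumed to exist in this sketch).
        empty = next(i for i, b in enumerate(cache_set)
                     if i != victim_way and not b.dirty and not b.mru)
    # 4) Move the victim way's block into the freed slot and clear the way.
    cache_set[empty] = Block(valid=True, dirty=w.dirty, dead=False, mru=w.mru)
    cache_set[victim_way] = Block()
```

Once `compact_set` has run for every set, the victim way holds no valid block anywhere in the bank and can be power gated.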

Algorithm 5. LLC Resizing
Input: POWER_DOWN, POWER_UP, Limit, BackPer, Floorplan
1 while a task (T_i) is being executed do
2   if Curr_Interval − BackPer is over then
3     for each LLC bank B do
4       ratio[B] = #misses(B) / #accesses(B);
5       if (ratio[B] < POWER_DOWN) and (#Off_ways[B] < Limit) then
6         # Select a way W as victim, in proximity to a core, which will be turned off;
7         for each set S do
8           for each block blk in S with Dead_bit[blk] set do ...

The block swapping needs to be performed by accessing the peripheral circuitry of the bank, and its performance is hence limited by the number of ports available per bank. However, the power and performance overheads incurred by this swapping mechanism are negligible [12]. Additionally, our LLC resizing technique can serve the outstanding cache requests during BackPer, unlike prior art [11]. The only difference is that, on an eviction caused by a cache miss, the way selected for eviction cannot be the victim way. The performance impact of LLC resizing is also included in our simulation.

Proposed V/F Spiking
The LLC resizing technique can potentially reduce the temperature (and hence the leakage power) of the cores adjacent to the gated LLC ways. The reduced core temperature offers enough room for maintaining an increased V/F through V/F Spiking for a certain amount of time while keeping the core temperature below the critical value. Algorithm 6 shows how DELICIOUS-Online exploits the thermal benefits of Algorithm 5 to enhance the core V/F without violating the thermal constraint.
Algorithm 6 takes Temp_Max, TDP, and T_Lim as inputs, where Temp_Max is the maximum allowable temperature for a core. We set Temp_Max to 2 °C lower than the critical temperature of the core, to ensure that the core temperature never reaches the critical value. During task execution, at the end of each Interval, each core's temperature (Temperature[C]) is observed (lines 2 to 5). If Temperature[C] is lower than Temp_Max by at least T_Lim, the leakage power of the core (Leak_pow[C]) is computed by considering Temperature[C] and the supply voltage (line 6). Next, the highest viable V/F level (V_H/F_H) is determined, such that the total (calculated) power consumption (Dyn_H_pow[C] + Leak_pow[C]) does not violate the TDP (line 7). Our algorithm also considers the power of the on-chip voltage regulator (VR_Pow). If such a V_H/F_H is available, the core's V/F is set to V_H/F_H and task execution resumes (lines 8 to 9). Executing tasks at a higher V/F leads to early completion, which changes the generated schedule. Basically, a higher V/F can execute more cycles within a certain time-span than execution at V_sched/F_sched. Hence, we employ a counter (Cyc_Ctr) to keep track of the cycles executed at the higher V/F (line 10). Note that the input T_Lim safeguards the core from any potential chattering effects in V/F by allowing V/F Spiking only when the core temperature is sufficiently below Temp_Max.
During task execution, the core temperature is monitored continuously, and once Temperature[C] reaches Temp_Max, the V/F is lowered to V_sched/F_sched[C] (line 15). To keep track of the extra cycles completed at the higher frequency, Cyc_Ctr is exploited at the end of each V/F spike. By computing the elapsed time and considering V_sched/F_sched[C], the number of extra cycles is derived (lines 14 to 17). This cycle surplus accrued while executing M_i is stored in D_cyc, which is then used for the execution of O_i. We illustrate the V/F Spiking process at the task level granularity in Fig. 5, which depicts when Cyc_Ctr is updated and how V/F Spiking helps in finishing the task early. As per our example in Fig. 5, M_i completes at t' with V/F Spiking, where its scheduled completion time was t (t' < t). Hence, to execute O_i, the time left is the summation of D_cyc (which can be executed during the interval (t', t) at V_sched/F_sched[C]) and the cycles left before execution of the next task, which we term the extended end time of T_i (Cyc_Ext_End_T_i) (line 18). Note that, for the sink task, Cyc_Ext_End_T_i is set to the end of the current Frame. However, if the highest version of T_i was not scheduled earlier, a check is performed on whether O_i can be upgraded (lines 19 to 23). After selecting the best possible O_i, execution starts at V_sched/F_sched[C] (line 26). Upgrading O_i may still leave slack before Cyc_Ext_End_T_i (lines 24 to 25), which can be utilized to power gate the core for improving energy/thermal efficiency. All the possible cases regarding upgrading O_i are depicted in Fig. 5. By employing a counter and considering the processor's Break_Even_Time (given as an input), the span of the power gating is traced, and the core is turned on again (line 31) before the starting time of the next task/frame.
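The surplus-cycle bookkeeping at the end of a spike can be sketched as below (names are illustrative): cycles executed at the spiked frequency are compared against what the scheduled frequency would have completed in the same wall-clock time, and the difference is banked as D_cyc for the optional part:

```python
# Sketch of the extra-cycle accounting at the end of a V/F spike.
# cyc_ctr plays the role of Cyc_Ctr; the returned value corresponds to D_cyc.

def surplus_cycles(cyc_ctr: int, f_spike: float, f_sched: float) -> int:
    """cyc_ctr: cycles executed at f_spike during the spike."""
    elapsed = cyc_ctr / f_spike               # wall-clock duration of the spike
    sched_cycles = int(elapsed * f_sched)     # what f_sched would have achieved
    return cyc_ctr - sched_cycles             # extra cycles gained (D_cyc)

# 36 M cycles at 3.6 GHz take 10 ms; 3.0 GHz would have completed 30 M cycles
# in the same time, so the spike banks 6 M surplus cycles for O_i.
print(surplus_cycles(36_000_000, 3.6e9, 3.0e9))  # → 6000000
```

The banked surplus plus the cycles remaining before the extended end time of the task form the budget against which an upgraded version of O_i is selected.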

Hardware Mechanism
Both Algorithms 5 and 6 can be implemented separately at their respective controllers. The way-shutdown logic at the LLC controller adopts power gating [40] at the way-level granularity of the LLC. Power gating is a conventional circuit based technique integrated with caches as well as cores in modern CMPs [4], [34]. By exploiting conventional control bits (e.g., the valid bit, dirty bits, etc.) and the existing performance monitoring counters at the LLC [21], the ratio and Dead_bit can be periodically monitored for LLC resizing. Moreover, implementing Dead_bit does not incur any noticeable overheads, as discussed earlier. To efficiently scale the V/F at the cores, on-chip voltage regulators [17] can be attached, which are also common in contemporary CMPs. Note that on-chip thermal sensors are used to observe the core temperature on-the-fly.

EVALUATION
In this section, we first show the efficacy of the DELICIOUS-Offline approach (Section 5), followed by the benchmark based evaluation of DELICIOUS-Online (Section 6).

DELICIOUS-Offline
First, we define Normalized Achieved QoS (NAQ), the ratio between the QoS actually achieved for the PTG and the maximum achievable QoS obtained by executing the highest versions of all tasks. We formulate NAQ as: NAQ = (Σ_{i=1}^{n} Acc_i^{j_i}) / (Σ_{i=1}^{n} Acc_i^{k_i}), where k_i represents the highest version of task T_i. Next, we model a multicore along with the task-set. Processor System: A homogeneous multicore platform equipped with 4 Intel x86 cores (i.e., m = 4) has been considered. The TDP of each core is scaled and set to 10.5 W by considering the Intel Xeon's datasheet [1], and the runtime core power is obtained through McPAT [30]. Task-set: The task characteristics have been taken from a prior technique, Prepare [10], that framed tasks by using PARSEC benchmark applications. The total execution requirement of a PTG (C_PTG) is the sum of the execution times of its subtasks, C_PTG = Σ_{i=1}^{n} ET_i. Thus, the utilization U_i of a PTG can be presented as C_PTG / D_PTG. The average utilization of a PTG is taken from a normal distribution, by considering a normalized frequency of 0.6. Given the PTGs' utilizations, we further obtain the total utilization of the system (Sys_uti) by summing up the utilizations of all PTGs. Given Sys_uti, the total system workload (Sys_WL), i.e., the system pressure, can be derived as: Sys_WL = Sys_uti / m. For a given Sys_uti, all of our PTGs have been generated by following the method proposed in Prepare [10]. Given a Sys_WL, a set of DAGs has been created. The number of DAGs (r) within a set can be calculated as: r = (m × Sys_WL) / U_i. In our generated PTGs, the minimum number of tasks is 5 and the maximum number of tasks is set to 20. For each PTG in the set, the number of tasks has been generated randomly within this preset limit. Note that, as the individual U_i of a DAG is lower than the given Sys_WL, the number of DAGs (r) within the set will always be higher than m.
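The two set-up formulas can be illustrated with a small worked example; the accuracy values and parameters below are invented for illustration:

```python
# Worked sketch of the two set-up formulas: NAQ as the ratio of achieved to
# maximum-achievable accuracy sums, and r = m * Sys_WL / U_i for the DAG count.
# All numeric values are illustrative.

def naq(achieved_acc, max_acc):
    """NAQ = sum of achieved accuracies / sum of highest-version accuracies."""
    return sum(achieved_acc) / sum(max_acc)

def num_dags(m, sys_wl, u_i):
    """r = (m * Sys_WL) / U_i, truncated to a whole number of DAGs."""
    return int(m * sys_wl / u_i)

# Three tasks whose selected versions reach 0.75, 0.5, 1.0 of a maximum 1.0 each:
print(naq([0.75, 0.5, 1.0], [1.0, 1.0, 1.0]))  # → 0.75
# 4 cores, 90% workload, per-DAG utilization 0.6 → 6 DAGs (> m, as the text notes).
print(num_dags(4, 0.9, 0.6))  # → 6
```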
Task Temporal Parameters: For each T_i, based on which portion of len_i is considered as the mandatory portion (M_i), we consider the following cases [15]: (i) man_low: M_i ∼ U(0.2, 0.4) × len_i (a low portion of task T_i's length (len_i) is mandatory); (ii) man_med: M_i ∼ U(0.4, 0.6) × len_i (a medium portion is mandatory); (iii) man_high: M_i ∼ U(0.6, 0.8) × len_i (a high portion is mandatory). Scalability Analysis of DELICIOUS-Offline: Fig. 6 depicts the mean solving time per number of tasks in each PTG for the scheduling heuristic of DELICIOUS and the ILP based scheduling of Prepare [10]. This result shows that our proposed heuristic scales better with the number of tasks than the ILP based algorithm. With a significantly lower running time, the heuristic generates a nearly optimal schedule like the ILP. In fact, with 20 tasks, the ILP based scheduling has almost 4× higher execution time than our scheduling heuristic.
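The three mandatory-portion cases can be sketched as uniform draws over the stated ranges; this is a minimal illustration, not the authors' task generator:

```python
# Sketch of drawing the mandatory portion M_i for the cases man_low, man_med,
# and man_high: a uniformly sampled fraction of the task length len_i.
import random

RANGES = {"man_low": (0.2, 0.4), "man_med": (0.4, 0.6), "man_high": (0.6, 0.8)}

def mandatory_portion(case: str, len_i: float, rng: random.Random) -> float:
    lo, hi = RANGES[case]
    return rng.uniform(lo, hi) * len_i

rng = random.Random(42)  # fixed seed for reproducibility
m = mandatory_portion("man_med", 100.0, rng)
assert 40.0 <= m <= 60.0  # man_med keeps M_i within 40-60% of len_i
```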
Effects of System Workload: Fig. 7 depicts the NAQ achieved by DELICIOUS-Offline for different values of Sys_WL. The NAQ is derived by running each of the DAGs belonging to the set and then averaging over the obtained individual NAQ values. We observe that DELICIOUS is able to achieve 80% QoS when the system workload is low. However, the QoS reduces by 20% on average when the workload is scaled up by 40%. Two other insightful observations can also be derived from this figure. First, as the system workload increases, in order to maintain the number of DAGs (r) in the system, the individual U_i also increases, which eventually contributes to low NAQ values. This happens because increasing U_i results in a higher execution length for each task, and thus the possibility of obtaining sufficient free slots in the scheduling period within the deadline reduces. Insufficient free slots in turn reduce the probability of obtaining feasible schedules that select higher task versions.
Second, in the case of man_high, the reduction in achieved NAQ while increasing Sys_WL is comparatively lower than for man_med and man_low. This can be attributed to the fact that when the mandatory portions of the individual tasks are high, the lengths of the optional portions are low, so the variance among the different versions of a task becomes small. Due to fewer variations among the optional parts of a task, there is less impact on the achieved accuracy. On the other hand, in the case of man_low, we observe that the reduction in NAQ is higher than for the other two, and man_med offers a performance between man_high and man_low. However, the NAQ decreases sharply as Sys_WL goes up. We also compared our strategy with the prior arts Task Deploy [37] and Prepare [10], and the results are shown in Fig. 8. Towards a fair comparison with Task Deploy, we computed the overall energy constraint based on the TDP considered in the experimental framework of DELICIOUS. This power limit is also used in the case of Prepare. Next, we perform our comparison by uniformly choosing M_i of the tasks between 20% and 80% of len_i. As the execution demand of individual tasks goes up (due to the increase in Sys_WL), DELICIOUS maintains improved QoS by achieving a higher NAQ than Task Deploy. DELICIOUS is able to maintain 70% QoS at 70% workload, where Task Deploy achieves 60% QoS. This is because the overall energy limit considered in Task Deploy would scale up with the higher Sys_WL. Moreover, Task Deploy also allows unlimited task migrations, which incur additional overhead. For all workloads, Prepare shows the best NAQ among all policies due to its ILP based optimal scheduling, but the heuristic-based strategy of DELICIOUS-Offline offers a performance close to these optimal values with a remarkably low computational time.

Simulation Setup
In this work, a homogeneous tiled CMP having 4 tiles is simulated in the gem5 full system simulator [8]. Each tile has an Intel x86 Xeon OoO core along with its private L1 data and instruction caches. The L2 cache is logically shared, yet physically distributed among the tiles, where each tile contains an L2 bank of the same size. The periodic performance traces collected from gem5 are sent to McPAT [30] to generate the power traces. Basically, we derive the dynamic power consumption of individual on-chip components by executing McPAT. As McPAT assumes a uniform on-chip temperature for estimating leakage power, which is impractical, we compute the component-wise leakage power by considering the temperatures of the individual on-chip components at the end of the last period [24], [25], [26]. Eventually, we derive the total power consumption from the dynamic and leakage power estimations, and the power values are sent to HotSpot 6.0 [48] to generate temperature traces. Based on prior analyses [11], [12], the span of this periodic interval is set to 0.33 ms (i.e., 1.0 M cycles at 3.0 GHz frequency), during which we assume the temperature across the CMP is stable. We set BackPer as the last 5% of the interval's time-span. The HotFloorPlan module of HotSpot 6.0 generates the floorplan of the CMP once at the beginning, by considering the component-wise area estimation from McPAT. Our detailed system parameters, considering a 22 nm technology node, are listed in Table 5. Table 3 lists the V/F values for Intel x86 Xeon cores, for which power values are obtained from McPAT. The changes in leakage power at different temperatures are also obtained from McPAT and are shown in Table 4, where the leakage increases at a higher rate at higher temperatures. To simplify our online computation in Algorithm 4, we adopt a piecewise linear approximation for each 10 °C range to compute the leakage consumption at any temperature [11], [12].
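The piecewise linear approximation of leakage can be sketched as a simple interpolation over per-10 °C table points; the table values below are illustrative, not the McPAT-derived numbers of Table 4:

```python
# Sketch of the piecewise linear approximation used to estimate leakage power
# at an arbitrary temperature from values tabulated every 10 °C.
# The (temperature, power) pairs below are illustrative placeholders.

LEAK_TABLE = [(40, 1.0), (50, 1.3), (60, 1.8), (70, 2.6), (80, 3.9)]  # (°C, W)

def leakage_at(temp_c: float) -> float:
    """Linearly interpolate within the 10 °C segment containing temp_c."""
    pts = LEAK_TABLE
    if temp_c <= pts[0][0]:
        return pts[0][1]          # clamp below the table
    for (t0, p0), (t1, p1) in zip(pts, pts[1:]):
        if temp_c <= t1:
            return p0 + (p1 - p0) * (temp_c - t0) / (t1 - t0)
    return pts[-1][1]             # clamp above the table

print(leakage_at(65.0))  # → 2.2  (midway between the 60 °C and 70 °C entries)
```

Keeping the segments at 10 °C matches the granularity at which the leakage table is populated, so the online computation reduces to one comparison chain and one multiply-add.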
In our simulation framework, each core runs at the Base V/F level with an effective frequency (f_eff) of 3.0 GHz. For our experiments, we also consider another V/F magnitude (Med) between Turbo and Base. Note that a core can execute tasks at all of these V/F levels; however, while a core can maintain the Base V/F without any potential thermal threats, the remaining two levels should only be maintained for particular time-spans, provided by the vendor. We set T_Lim (of Algorithm 6) to 4 °C.
To set Curr_Interval, we evaluated nine PARSEC applications for DOA blocks on our baseline system with interval lengths from 0.5 M to 2.0 M cycles in 0.5 M increments, by executing each application for 100 M cycles within the RoI; the results are shown in Fig. 9. The results show that the cache access patterns for DOA blocks converge at 1.0 M cycles for most of the applications, which is hence considered here as Curr_Interval, in line with prior research [12]. For a 1.0 M period length, our evaluation shows that 89-93% of the LLC blocks are DOA, on average. Such a salient presence of DOA blocks further justifies the sufficiency of using Dead_bit to detect the dead entries in Algorithm 5.

Task-Set
Our tasks are generated by using the PARSEC benchmark suite [7], which can be fitted into an AC based paradigm through the loop perforation technique [3], [43]. Based on these prior studies, we framed our task-set by defining each task with a couple of PARSEC applications, where the former is executed as M_i and the latter represents O_i. To create multiple versions of O_i, the latter application has different executable files with various execution lengths. We have constructed each M_i and O_i by using two copies of two different PARSEC applications; for example, for a task T_1, M_1 is framed by two copies of Black, whereas O_1 is constructed by two copies of Body. The task-set is detailed in Table 6, where the execution lengths (Exec_Length) are given in million cycles in the region of interest (RoI) for the respective M_i's and O_i's. For example, while running T_2 with the first version of its O_i (having a length of 100 M cycles), 2 copies of Stream are executed for 200 M cycles concurrently in our considered CMP to complete M_i, after which, to complete O_i, 2 copies of Can are executed concurrently on the same set of cores. Note that the execution length of each task in Table 6 is set by scaling the task lengths given in Table 2. The versions of O_i selected by DELICIOUS-Offline (Sel. O_i [EL]) are also given in Table 6. We have used a 4-core based CMP, where each task's M_i and O_i run on 2 cores. Two cores of this CMP correspond to a single processor-core, P_i, in Fig. 3.
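The task framing can be illustrated with a small data structure in the style of Table 6; the cycle counts and the version-selection helper below are our examples, not the actual table entries:

```python
# Illustrative sketch of how a task is framed from two PARSEC applications:
# the first runs as the mandatory part M_i, the second provides several
# optional-part (O_i) versions with different execution lengths (M cycles).
# All numbers are invented examples in the style of Table 6.

task_t2 = {
    "mandatory": {"app": "Stream", "copies": 2, "cycles": 200},
    "optional":  {"app": "Can", "copies": 2,
                  "version_cycles": [100, 150, 200]},  # ascending O_i versions
}

def pick_version(task, budget_cycles):
    """Highest O_i version whose length fits the remaining cycle budget."""
    fitting = [c for c in task["optional"]["version_cycles"] if c <= budget_cycles]
    return max(fitting) if fitting else None

print(pick_version(task_t2, 160))  # → 150
```

This mirrors the online upgrade check: a larger cycle budget (e.g., after V/F Spiking) admits a longer, more accurate version of O_i.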

LLC Resizing, Peak Temperature, and Performance Improvements

DELICIOUS-Offline schedules the task-set such that T_2 and T_6 are scheduled with a lower O_i. Both of these tasks' M_i's consist of memory intensive PARSEC applications (stream and x264). The presence of dead blocks at the LLC for stream and x264 enables Algorithm 5 to turn off a number of cache ways, which assists Algorithm 6 in maintaining the Turbo V/F for a longer time. We also experimented with the Med V/F level, higher than Base but lower than Turbo, by running the core at this level during V/F Spiking. The cores can execute tasks at Med for a longer time, as the rate of temperature change at this level is slower than at Turbo. Our simulation results in Fig. 10 show the reduction in the execution length of each task for Med and Turbo, where the thermal benefits offered at Med are compensated by the performance benefits of Turbo. Both Med and Turbo offer almost similar performance benefits by reducing the execution length by 8.5% and 8.2%, respectively, without violating the temperature threshold. However, the execution length at Turbo is slightly higher for T_4, a memory intensive task, which is able to maintain Turbo residency for a longer time in some initial execution phases; this results in a higher temperature, and thus it misses some chances of V/F Spiking later. In DELICIOUS, we have chosen Turbo for executing tasks during V/F Spiking; however, one can also choose Med as a promising alternative. Fig. 11 shows the average and minimum LLC sizes maintained for each task, along with the respective reductions in core temperature. Algorithm 5 is able to reduce the peak temperature by 5.8 °C on average by leveraging the thermal buffers generated through the gated LLC ways, which remarkably elongates the vendor defined span (of 10 ms) by 7% on average (Fig. 10) at Turbo. Overall, DELICIOUS-Online improves QoS by executing all tasks at their highest version, and the reduction in execution span also generates slack at the end of each task.
The amount of online slack generated is significant, in the range of 6.2-10.1% of the tasks' actual execution spans (generated offline). The updated versions and the amounts of generated slack are listed in Table 7. By employing LLC resizing induced V/F Spiking, DELICIOUS-Online noticeably improves the QoS achieved by DELICIOUS-Offline for the task-set by 8.3%.

Comparison With Prior Works
We compared DELICIOUS with two recent prior works: Prepare [10], which refines the schedule (generated offline) by employing an LLC miss induced DVFS technique, and GDP [33], which employs a threshold temperature based technique to apply DVFS. Fig. 12 depicts how DELICIOUS outperforms the prior policies in terms of the maximum reduction in the peak temperature of the cores during the slacks. The longer slack intervals in DELICIOUS offer a maximum reduction of up to 9.2 °C, compared with up to 7.8 °C and 6.7 °C for Prepare and GDP, respectively. Table 8 shows that DELICIOUS surpasses the prior techniques in terms of online QoS improvement, as the eviction of dead blocks also plays a significant role in boosting performance along with the V/F Spiking. Prepare offers an online QoS improvement of 5.3%, compared with 8.3% for DELICIOUS-Online (not applicable for GDP). In fact, our LLC resizing is also able to reduce the core peak temperature by 5.8 °C, compared with 5.1 °C and 4.9 °C for Prepare and GDP, respectively. The threshold temperature based DVFS in GDP scales down the core's V/F so that thermal overshoot is not allowed, whereas our V/F Spiking mechanism considers both the TDP and the critical temperature to prevent temperature overshoot while elongating the time-span at the Turbo frequency. Prepare, on the other hand, controls the peak temperature by introducing energy-adaptive DVFS at the cores.

CONCLUSION
Improving result-accuracy in AC based real-time paradigms without violating the power constraints of the underlying hardware has recently become an active research avenue. Execution of AC real-time applications is split into two parts: (i) the mandatory part, execution of which provides a result of acceptable quality, followed by (ii) the optional part, which can be executed partially or fully to refine the initially obtained result towards improving the result-accuracy without deadline violation.
In this paper, we introduce DELICIOUS, a novel hybrid offline-online scheduling strategy for AC real-time dependent tasks. By employing an efficient heuristic algorithm, DELICIOUS first generates a schedule for a dependent AC task-set at a base processing frequency with an objective to maximize the result-accuracy, while respecting the system-wide constraints. At runtime, DELICIOUS next employs a prudential way on-off based LLC resizing induced thermal management to enhance the processing speed of the cores for a stipulated time-span without violating the power budget, called V/F Spiking, to reduce the tasks' execution lengths. The slack generated by the reduced execution lengths can be exploited either to enhance QoS further by dynamically adjusting the optional part, or to reduce temperature by enabling sleep at the cores. In addition to surpassing the prior art, DELICIOUS offers 80% result-accuracy with our scheduling strategy, which is further enhanced by 8.3% online, while reducing the runtime peak temperature by 5.8 °C on average within the deadline, as shown by a benchmark based evaluation on a 4-core based CMP.

Sukarn Agarwal received the PhD degree in computer science and engineering from IIT Guwahati, India, in March 2020. He is a research associate with the School of Informatics, University of Edinburgh (U.K.). His research interests include emerging memory technologies, memory system design, network-on-chip design, and thermal aware chip management. He has published his research contributions in conferences such as ASAP, VLSI-SoC, GLS-VLSI, and ISVLSI, and in journals including IEEE Transactions on Very Large Scale Integration (VLSI) Systems, ACM Transactions on Embedded Computing Systems, IEEE Transactions on Computers, and ACM Transactions on Design Automation of Electronic Systems.
Rahul Gangopadhyay is a postdoctoral researcher with the Moscow Institute of Physics and Technology, Russia. Previously, he was a postdoc at St. Petersburg State University, Russia. His broad research domain is graph theory; specifically, his research interests include hypergraphs, rectilinear crossings, etc. He has published his research outcomes in journals such as Computational Geometry and Graphs and Combinatorics.
Magnus Själander received the PhD degree from the Chalmers University of Technology in 2008. He is working as a professor with the Norwegian University of Science and Technology (NTNU). Before joining NTNU in 2016, he was a researcher with the Chalmers University of Technology, Florida State University, and Uppsala University. His research interests include hardware/software co-design (compiler, architecture, and hardware implementation) for high-efficiency computing.
Klaus McDonald-Maier is currently the head of the Embedded and Intelligent Systems Laboratory and Director of Research at the University of Essex, Colchester, U.K. He is also the founder of UltraSoC Technologies Ltd., the CEO of Metrarc Ltd., and a visiting professor with the University of Kent. His current research interests include embedded systems and system-on-chip design, security, development support and technology, parallel and energy-efficient architectures, computer vision, data analytics, and the application of soft computing and image processing techniques to real-world problems. He is a member of the VDE and a fellow of the BCS and IET.