ACCURATE: Accuracy Maximization for Real-Time Multicore Systems With Energy-Efficient Way-Sharing Caches

Improving result accuracy in approximate computing (AC)-based real-time applications without violating deadlines has recently become an active research domain. The execution of an AC real-time task can be separated into a mandatory part, which yields a result of acceptable quality, followed by a partial or complete execution of an optional part that refines the initial result within the given deadline. However, obtaining higher result accuracy at the cost of longer execution time may lead to deadline violations, along with higher energy usage. We present ACCURATE, a novel hybrid offline-online approximate real-time scheduling approach that first schedules AC-based tasks on a multicore with the objective of maximizing result accuracy, and determines the operational processing speed for each task, constrained by a system-wide power limit, the deadline, and task dependencies. At runtime, by employing a way-sharing technique (WH_LLC) at the last level cache (LLC), ACCURATE improves performance, which is further leveraged to enhance result accuracy by executing more of the optional part and to improve the energy efficiency of the cache by turning off a controlled number of cache ways. ACCURATE also exploits slacks either to improve the result accuracy of the tasks, to enhance the energy efficiency of the underlying system, or both. ACCURATE achieves 85% QoS with a 36% average reduction in cache leakage consumption, a 24% average gain in energy-delay product (EDP), and a 6.4% average improvement in performance for a 4-core chip multiprocessor (CMP).


I. INTRODUCTION
IN REAL-TIME computing, correctness depends not only on the precision of the results but also on the time at which they are produced. For such critical systems, approximated results obtained on time are preferable to accurate results generated after the deadline has passed. For example, in a real-time video application, an inaccurate but acceptable-quality image is initially generated from the received data; then, based on the available resources, the obtained image may be refined further [1]. Thus, approximate computing (AC) approaches [2] can minimize the possibility of tasks missing their deadlines due to strict resource requirements. In AC approaches, a task is decomposed into a mandatory part followed by an optional part [3]. The mandatory part must be executed entirely in order to produce an acceptable result, while the result accuracy increases with the execution cycles spent on the optional part. Specifically, to obtain a substantial increase in result accuracy, a certain number of additional cycles needs to be executed from the optional part. In order to maximize the result accuracy while meeting the power and deadline constraints, proper scheduling approaches have to explore both the architectural characteristics of the system and the approximation tolerance of the applications.
Energy-efficient scheduling of approximated real-time tasks that targets maximizing result accuracy without violating the underlying system constraints has become a research topic of paramount importance in the recent past. Stavrinides and Karatza [4] were among the first to propose real-time scheduling of an AC-based task set. In a recent theoretical analysis [3], the authors improved system-level result accuracy through task-to-processor allocation and task adjustment constrained by a preset energy budget. However, restricting energy usage does not guarantee the thermal safety of the chip, which can be addressed by incorporating a power constraint together with a runtime power management technique that considers several architectural parameters. Comprehensive studies that combine the theoretical aspects of energy-efficient processing of approximated applications in the real-time paradigm with due consideration of runtime architectural characteristics (e.g., cache performance, instructions per cycle (IPC), etc.) have not been conducted so far.
A homogeneous chip multiprocessor (CMP) platform executes a set of AC real-time tasks that can be represented by precedence-constrained task graphs (PTGs), where each task is equipped with multiple distinct implementable versions offering various result-accuracy levels based on the respective amount of the optional part that is executed. By exploiting the start times and the versions of the individual task nodes, our work, ACCURATE, first determines the task-to-processor allocation with an appropriate version of each task, the operating voltage/frequency (V/F) level, as well as the order of execution, such that the system-level result accuracy (i.e., QoS) is maximized while meeting the deadline, precedence, and power constraints. After this offline phase, task executions are triggered as per the precomputed schedule and each task is executed at its assigned V/F level. During execution, the cache-based dynamic accuracy enhancement and energy minimization techniques of ACCURATE first attempt to improve performance by adopting a way-sharing mechanism at the last level cache (LLC). This LLC-based runtime strategy ensures that the improved performance of the way-shared LLC (WH_LLC) can finish a task early, which is traded off either: 1) to enhance result accuracy by executing a higher version of the task selected on-the-fly or 2) to improve energy efficiency by dynamically resizing the LLC.
Since contemporary applications [5]-[7] that include approximations spend a significant amount of time accessing memory, employing a way-shared LLC can reduce the total execution time of the tasks and can generate slacks. ACCURATE attempts to exploit such slacks to enhance the result accuracy by executing a higher optional version of the task (subject to availability), or by dynamically resizing the LLC to enhance energy efficiency while maintaining performance. Additionally, ACCURATE exploits slacks to enhance the energy efficiency of the system by enabling sleep/power-gated mode at the cores and the LLC. Notably, our performance-cognizant online approach enhances result accuracy for the tasks and improves energy efficiency without affecting the predetermined schedule. Fig. 1 depicts the working mechanism of ACCURATE.
The major contributions of the ACCURATE are thus summarized as follows.
1) We propose an integer linear programming (ILP)-based scheduling scheme, ACCURATE:Offline, for AC real-time PTGs on a power-constrained CMP with the objective of maximizing result accuracy, where each task is executed with a selected version (see Section IV-A).

2) We further propose a dynamic accuracy enhancement along with an online energy minimization technique, ACCURATE:Online (see Section IV-B), which improves the performance of the individual tasks, where the improved performance is traded off either: 1) to enhance result accuracy by executing a higher task version selected on-the-fly or 2) to improve energy efficiency by dynamic LLC resizing. Additionally, in the presence of sufficiently large slacks, the system is put into sleep/power-gated mode for further energy saving.

We argue and empirically validate the significance of our task scheduling approach in combination with our online cache-based strategy (see Section V). The benchmark-application-based evaluation with a 4-core baseline CMP (equipped with a 2MB 8-way associative shared L2 cache) in our simulation setup (consisting of gem5 [8] and McPAT [9]) exhibits that, through ILP-based task scheduling, ACCURATE achieves 85% QoS, and the cache-based online strategy reduces LLC leakage consumption by 36% on average with a 24% average gain in energy-delay product (EDP) combined with a 6.4% average performance improvement. The scheduling strategy of ACCURATE outperforms the prior Task_Deploy [3] scheduling mechanism, which offers a QoS of 55% for our considered task set with 70% system workload, while ACCURATE achieves a QoS of 70%. We further empirically justify the exploitation of the way-shared LLC (with a performance improvement of 10%) over another prior technique, Zcache [10] (with an average performance improvement of less than 6%), in ACCURATE.
To the best of our knowledge, ACCURATE is the first scheduling mechanism that trades off the performance gained by employing a way-sharing technique at LLC to improve both runtime energy efficiency and result accuracy of the AC real-time task set. After discussing the relevant related work in Section II, we show how ACCURATE is different from the state of the art.
Article Organization: After presenting the relevant related work in Section II, we model the system in Section III, where our processor and task models are discussed along with the scheduling criteria. The detailed mechanisms of ACCURATE are illustrated in Section IV, in which Sections IV-A and IV-B discuss the ILP-based scheduling mechanism and the dynamic LLC-based performance improvement and energy-efficiency techniques, respectively. The efficacy of ACCURATE is demonstrated in Section V along with a detailed description of our simulation setup. The article is concluded in Section VI. The acronyms used in this article are listed in Table I.


II. RELATED WORK

Nowadays, energy minimization in contemporary multiprocessor embedded systems has become a topic of paramount importance [11], [12]. Energy-efficient scheduling of time-critical tasks with precedence constraints on a multiprocessor platform imposes significant research challenges [13], [14]. Over the last few years, several research attempts [15]-[18] were undertaken to devise energy- and fault-aware real-time scheduling for sets of time-critical tasks.
Recently, Cao et al. [19] introduced the concept of AC to meet the energy budget of a large-scale real-time system that executes tasks without precedence constraints. Other prior efforts also explored energy-efficient scheduling of AC tasks [19]-[21] without considering the precedence relations among the tasks. Yu et al. [22] coined the concept of "imprecise computation (IC)" tasks, where tasks also have a mandatory and an optional portion. The authors further proposed a "dynamic-slack-reclamation" technique to improve the system QoS while incorporating more energy efficiency, but task dependencies were not considered. To the best of our knowledge, the first attempt to schedule IC/AC dependent tasks can be found in [4], where the authors compared the performance of conventional real-time scheduling approaches like highest level first (HLF) and least space-time first (LSTF) between two task sets, one of which contains AC tasks. However, this work did not consider energy efficiency.
The energy-aware scheduling of dependent AC tasks is considered in [3] and [23], which employ DVFS at the cores to improve energy efficiency. However, as DVFS curtails the supply voltage and frequency to save power, transient faults of the system can significantly raise reliability issues [24]. Hence, in ACCURATE, we first propose an offline task allocation technique that schedules AC real-time tasks with respective frequency levels by considering precedence, power, and temporal constraints. In addition, during execution, a way-sharing LLC strategy is employed to enhance performance, which is further traded off toward stimulating result accuracy as well as improving energy efficiency through dynamic cache resizing.
Zang and Gordon-Ross [25] and Mittal [26] surveyed a number of performance-cognizant low-power on-chip cache design techniques along with their pros and cons. By employing Gated-VDD [27] at the circuit level to power gate the cache lines, a prediction-based energy-efficient cache was proposed in [28] for a static nonuniform cache access (SNUCA)-based tiled CMP (TCMP) architecture, which incurs a remapping technique for the gated cache lines. To reduce cache leakage power significantly, a bank shutdown policy based on runtime bank usage was proposed in [29]. Fitzgerald et al. [30] and Zhou et al. [31] kept selected cache lines in a low-power drowsy/sleep mode to minimize cache leakage power, where the sleep mode consumes less power but retains stored data. In addition to effectively reducing the overall energy consumption of a CMP, dynamic cache resizing can also assist in reducing chip temperature significantly [32], [33].
Toward uniformly distributing cache loads across the cache sets, dynamic associativity management (DAM) techniques have been developed, in which heavily used sets benefit by utilizing the idle ways of underused ones. Several DAM-based approaches [10], [34], [35] have already been proposed with varying implementation overheads. Out of these, FS-DAM [35] has been adopted in our work for its lower implementation complexity along with its support for dynamic restructuring of the groups.
ACCURATE Over State of the Art: The majority of prior scheduling approaches attempted to minimize the makespan; however, in the case of AC-based precedence-constrained tasks, the objective becomes maximizing the overall result accuracy rather than makespan minimization. Moreover, most of these prior energy-efficient scheduling mechanisms employed DVFS at the cores but did not consider on-chip LLCs, which contribute significantly to the total on-chip power consumption [25]. As the majority of LLC power comes from leakage and a large portion of these LLCs remains underutilized during execution, prudential LLC resizing can be a viable knob for achieving energy efficiency [32], [33]. To exercise such energy-efficient mechanisms in real-time systems, promising techniques like DAM can be employed at the LLCs to safeguard performance. In ACCURATE, after generating the schedule of the tasks through an ILP-based strategy (see Section IV-A), we study the potential of a DAM-based way-sharing technique at the LLC for performance improvement of an AC real-time task set. During execution, ACCURATE further trades off this gained performance (see Section IV-B) either: 1) to save runtime energy by the selective shutdown of LLC ways, where ways are turned on again if performance degrades or 2) to improve result accuracy by executing a higher version of the optional parts of the tasks, subject to availability. ACCURATE also exploits sufficiently large slacks to save more energy by enabling power-gated/sleep mode at the cores and LLCs. Our results also show that ACCURATE surpasses state-of-the-art techniques. To the best of our knowledge, ACCURATE is the first technique that considers an LLC-based online mechanism to enhance both result accuracy and energy efficiency without violating the deadline constraint.

III. SYSTEM MODEL AND ASSUMPTIONS
We consider a CMP consisting of m homogeneous cores, denoted as P = {P_1, P_2, ..., P_m}. Each core supports L distinct V/F settings, denoted as V = {V_1, V_2, ..., V_L}. An AC real-time application A is modeled as a PTG comprising a set of task nodes T = {T_1, T_2, ..., T_n} and a set of directed edges representing the precedence relations between distinct pairs of tasks. An edge (T_i, T_j) refers to the fact that task T_j can begin its execution only after the completion of T_i. The source and sink tasks have no predecessors and no successors, respectively. Being a real-time application, A must be executed within the given deadline, D_PTG, by executing all of its associated task nodes within this interval.
The worst-case execution length, len_i, for each task T_i (1 ≤ i ≤ n) is logically decomposed into M_i cycles for the mandatory part and O_i, the maximum number of cycles for the optional part. We further assume that a task T_i may have k_i different versions, where the jth version, T_i^j, executes the mandatory part along with a version-specific portion of the optional part. Note that the length of T_i^j (i.e., len_i^j) includes the memory cycles needed to access the LLC, which has been obtained by executing the individual tasks for a particular configuration (see Fig. 4). The result accuracy Acc of a task is modeled, following [19], as a nondecreasing function of the number of optional cycles executed. If a task T_i executes at frequency F_i, then its execution time ET_i can be denoted as len_i/F_i, which is a bound on the task-execution time; we use this execution time for the offline phase. If F_a > F_b, then len_i/F_a < len_i/F_b. To enhance the result accuracy of an individual task while maintaining its deadline, a higher version of the task needs to be executed at a higher clock frequency of the core. However, increasing the clock frequency increases the power consumption (Pow), which might increase the core's temperature. Hence, we further assume an overall system-wide power limit (Pow_BGT), which includes both dynamic and static power, where the estimation of the static power in our theoretical model has been performed by considering a fixed temperature. Note that Pow_BGT includes the power consumption of both cores and caches, where the dynamic power consumption at the cores is higher than the static counterpart and the caches are accounted for their static power consumption [25], [32]. However, toward maintaining accuracy in estimating the power consumption, both dynamic and static power have to be considered. Hence, our runtime power consumption is modeled by employing the McPAT [9] tool, which estimates power consumption values (both dynamic and static) for both cores and caches for our specific system configuration detailed in Section V-B1.
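The execution-time bound and version selection described above can be sketched as follows. This is a minimal illustration of the model, not the paper's implementation; all cycle counts, frequencies, and function names are assumed values for the example.

```python
def exec_time(mandatory_cycles, optional_cycles, freq_hz):
    """Bound on task-execution time: (M_i + O_i^j) cycles at frequency F."""
    return (mandatory_cycles + optional_cycles) / freq_hz

def best_version(mandatory_cycles, optional_versions, freq_hz, time_budget):
    """Pick the largest optional part whose execution still fits the budget,
    mirroring the idea that accuracy grows with executed optional cycles."""
    chosen = None
    for o in sorted(optional_versions):
        if exec_time(mandatory_cycles, o, freq_hz) <= time_budget:
            chosen = o
    return chosen

M_CYCLES = 40_000_000                   # mandatory part (assumed)
VERSIONS = [0, 10_000_000, 30_000_000]  # optional cycles per version (assumed)
F_HZ = 1_000_000_000                    # 1-GHz core frequency (assumed)

# Highest version that fits a 60-ms slot at 1 GHz:
print(best_version(M_CYCLES, VERSIONS, F_HZ, 0.06))
```

At a higher frequency the same budget admits a larger optional part, which is exactly the frequency-accuracy tradeoff exploited by the offline phase.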

IV. ACCURATE
In this section, the working mechanism of ACCURATE is illustrated. After elaborating the ILP-based scheduling in Section IV-A, we discuss the runtime LLC-based power minimization and accuracy enhancement mechanism of ACCURATE in Section IV-B. First, ACCURATE generates the schedule and provides the following information: 1) the task-to-core mapping; 2) the start and end times of the individual tasks; 3) the assigned frequency; and 4) the respective tasks' versions. A dispatch table stores the generated scheduling information by arranging the tasks as per their execution order, which is used to execute the tasks at runtime. During execution, ACCURATE traverses the dispatch table and selects and fetches individual tasks to execute according to their start time stamps. While running the task set, ACCURATE:Online allows the measurement of release and completion times for each task. These time measurements correspond to the generated schedule, which is presented later in Table IV; the respective pictorial timing diagram is shown in Fig. 3. Note that the dispatch table is stored and maintained in a repository residing in memory.
To empirically validate ACCURATE, we first employ the CPLEX tool [36] to verify the constrained scheduling with an example task set represented as a DAG, where the tasks were created with PARSEC applications [5] (see Section V). After that, by accessing the dispatch table, the generated information for this task set is used in our online simulation framework consisting of gem5 [8] (a full-system simulator for performance traces) and McPAT [9] (a power simulator). Our evaluation framework for the online mechanism considers a TCMP [37] based on four out-of-order (OoO) cores (discussed further in Section V with the detailed simulation setup). To enable way gating, ACCURATE incorporates a power-gating mechanism [27] at the way-level granularity of each LLC bank, with negligible implementation overhead.
Let t_start(T_i) and t_finish(T_i) denote the start time and finish time of the task T_i, respectively, so that t_finish(T_i) = t_start(T_i) + ET_i. The required constraints on the decision variables to model our scheduling strategy are stated as follows.
1) Each task T_i is assigned to exactly one processor with a particular version and executed at one frequency level.

2) The application A meets its end-to-end absolute deadline D_PTG. Hence, the sink node T_n must finish by D_PTG, i.e., t_finish(T_n) ≤ D_PTG.

3) The peak power consumption of the system should not exceed the given power budget. Let Pow_peak represent the peak power consumption of the system:

Pow_peak = max Pow_sys (6)

where Pow_peak ≤ Pow_BGT.
Pow_sys is the power (both dynamic and static) consumption of all the busy cores.

4) The precedence relations must be respected: for every edge (T_i, T_j), task T_j can start only after T_i completes, i.e., t_start(T_j) ≥ t_finish(T_i).

5) To ensure that the tasks have no overlapping executions on the same processor, inequality (11) must be satisfied. Equation (11) prevents timewise overlap of two tasks on the same processor, i.e., T_j must start after the completion of T_i if T_i starts before T_j. If the tasks are executed in the opposite order, we use big-M nullification to deactivate the constraint, where M = max{len_i^k / F_l}, ∀i, ∀l.

Objective: The objective of the formulation is to choose the feasible solution that maximizes the QoS of the application, i.e., to maximize QoS(A), where QoS(A) aggregates the result accuracies of the individual tasks as per their executed versions, subject to the constraints presented in (4)-(11).

Complexity Analysis: We present the complexity analysis of our ILP in Table II. The second column of this table lists the upper bound on the number of constraints for each equation. The unique resource constraint in (4) must be enforced for all n tasks; hence, for a given PTG, n constraints are required overall. Similarly, the number of variables for this constraint can be represented as O(K · L · m), where K denotes the maximum number of possible versions of a task. However, as the number of processors (m) and the number of frequency levels (L) are typically constants for a given system, the complexity may be considered O(K). The deadline constraint in (5) must be checked only for the single sink node, and thus only O(1) constraints are required. In this way, the total complexity of the ILP (in terms of the number of constraints) can be represented as O(n^2). It may be noted that the complexity of the ILP is independent of the number of processing elements in the platform and the deadline of the PTG.

Example: This PTG application needs to be scheduled on two processors (m = 2), with a deadline D_PTG = 100 time units.
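The big-M nullification in the non-overlap constraint can be illustrated with a plain check. This is a sketch of the logic only, not the ILP itself: for two tasks on the same processor with an ordering binary y_ij (1 if T_i precedes T_j), the constraint t_start(T_j) ≥ t_finish(T_i) − M·(1 − y_ij) is active when y_ij = 1 and trivially satisfied ("nullified") when y_ij = 0. The numeric values are assumptions for the example.

```python
def no_overlap_ok(start_i, finish_i, start_j, y_ij, big_m):
    """Non-overlap constraint in big-M form:
    t_start(j) >= t_finish(i) - M * (1 - y_ij)."""
    return start_j >= finish_i - big_m * (1 - y_ij)

BIG_M = 100  # must dominate the largest task execution time, per the formulation

# T_i runs over [0, 30); T_j starts at 30 on the same core: constraint active.
print(no_overlap_ok(0, 30, 30, 1, BIG_M))

# Opposite execution order: y_ij = 0 deactivates the constraint
# even though 5 < 30 (the symmetric constraint with y_ji handles this pair).
print(no_overlap_ok(0, 30, 5, 0, BIG_M))
```

Choosing M as the maximum possible execution time, as in the formulation, keeps the relaxation tight while guaranteeing the deactivated constraint can never cut off a feasible schedule.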
Our assumed power budget for both processors is set as Pow_BGT = 50. Following the constrained scheduling strategy, the ILP solver CPLEX [36] generates the scheduling output shown in Fig. 3; the results are also represented in tabular form in Table IV. From Fig. 3, it can be seen that tasks T_1, T_3, and T_5 are executed with their highest versions on processor P_1. Out of these three tasks, T_5 executes at a lower V/F level (i.e., 0.5) to satisfy the power constraint. On the other hand, task T_2 is able to execute with its highest version (of the available three versions) on processor P_2 to maximize the overall QoS of the system. However, T_4 and T_6 are executed on P_2 with their respective lowest versions in order to maintain the temporal constraint. It is evident that the entire PTG finishes by 100 time units and thus D_PTG = 100 is met. The total obtained QoS value is 45.

B. ACCURATE:Online (Dynamic Accuracy Enhancement and Power Minimization)
Once the tasks are scheduled, execution is triggered and our runtime mechanism first boosts performance by incorporating a way-sharing-based technique (WH_LLC) [35] at the LLC (detailed in Section IV-B1). By logically increasing the cache associativity on-the-fly, WH_LLC reduces the number of cache misses, which limits the number of off-chip (memory) accesses. Thus, the running time of each task is reduced, generating a set of idle processor cycles (called the private slack of the individual task from here onward) at the end of the execution of each individual task in the predetermined schedule. Next, our online technique utilizes the private slack of each task in one of two ways (see Section IV-B2). Tasks that have been scheduled with their highest version exploit the private slack only for improving energy efficiency, by turning off a set of LLC ways on-the-fly to reduce LLC leakage power consumption. This dynamically trimmed LLC might affect performance by increasing the number of cache misses; however, our online mechanism periodically monitors performance and turns cache ways back on, if needed, to maintain the predetermined schedule. On the other hand, tasks scheduled with a result accuracy that leaves room for further improvement might exploit the private slack by running the highest possible versions of their optional parts to enhance the result accuracy. Note that, in both cases, the predetermined schedule is not violated. Moreover, our online mechanism can be tuned further to balance the power-performance tradeoff as per the system requirements.
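The private-slack policy just described can be condensed into a small decision rule. This is an illustrative sketch with hypothetical names, not the paper's algorithm (which is given later as Algorithms 1-6):

```python
def use_private_slack(at_highest_version, higher_version_fits):
    """Decide how a task's private slack is spent, per the policy above."""
    if at_highest_version:
        return "resize_llc"            # turn off LLC ways to cut leakage
    if higher_version_fits:
        return "upgrade_version"       # run a higher optional version instead
    return "sleep_if_large_enough"     # otherwise, slack may enable sleep mode

print(use_private_slack(True, False))   # task already at best accuracy
print(use_private_slack(False, True))   # accuracy still improvable
```

Note that the two uses are mutually exclusive for a given task, which is why LLC resizing is disabled whenever a version upgrade is possible.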
Before applying WH_LLC, we first analyzed nine PARSEC applications [5] by running them in gem5 [8] for a stipulated number of clock cycles with our simulation setup (see Section V-B). Most prior analyses of PARSEC cache access patterns have shown that 70-100M clock cycles are sufficient, as this window captures the overall trend of the cache access patterns for most of the PARSEC applications [5], [28], [32], [33]. In ACCURATE, we have used 80M clock cycles (in the region of interest (RoI)) for all of our simulations related to background analyses.
Our simulations show that these PARSEC applications spend a significant fraction of their execution time accessing memory, as shown in Fig. 4. In the case of memory-intensive applications, like Can, Ded, Fluid, and Stream, more than 50% of the execution time is spent on accessing memory. The adopted LLC-based way-sharing technique, WH_LLC, and a prior way-sharing policy, Zcache [10], significantly curtail the memory accesses by reducing capacity and conflict misses through better utilization of the LLC space and thus improve performance. We further implemented and compared WH_LLC and Zcache with our simulation setup (mentioned above) and show the performance improvements for the individual benchmarks in Fig. 5. As per this figure, WH_LLC outperforms Zcache for all nine applications with a 10.5% average improvement in IPC, whereas Zcache achieves a 5.6% average IPC improvement, which motivated us to adopt WH_LLC in the time-critical environment of ACCURATE.
1) Improving Performance at the LLC: Prior empirical analyses [35], [38] showed that, due to locality of reference, the LLC accesses of applications are distributed nonuniformly across the different granularity levels (bank, set, way, etc.) of the LLC, leaving a large portion of the LLC underutilized. Several DAM-based techniques [10], [35], [38] have evolved to logically handle such load distributions by giving heavily used cache sets the privilege of using the idle ways of underutilized ones. Fig. 6 illustrates the entire WH_LLC mechanism for an 8-way set-associative (A) cache having eight cache sets (S). First, a number of cache sets are grouped together to form a fellow group based on their usages, such that each group contains a mix of lightly and heavily used cache sets. Next, each of these cache sets is divided into two logical regions: 1) normal ways (NT) and 2) reserved ways (RT), where any cache set within a fellow group can use the RT portions of all member cache sets. In Fig. 6, cache sets 0, 1, 3, and 5 are in the same fellow group and can share their RT ways; similarly, cache sets 2, 4, 6, and 7 share their RT ways. Logically, the associativity of each cache set is thereby increased from the original 8 to 20 (its own 4 NT ways plus the 4 RT ways of each of the four group members), which drastically reduces the capacity and conflict misses at the heavily used cache sets and improves the overall system performance. Note that WH_LLC handles the existing diversities in cache set usage during different execution phases of a task by dynamically restructuring these fellow groups. The functional correctness of the addressing mechanism, along with a detailed discussion of this way-sharing mechanism, is out of the scope of this article. Fig. 7 illustrates how WH_LLC improves the performance in ACCURATE.
The darker task blocks for the individual tasks indicate the modified execution spans with WH_LLC in action, while the corresponding brighter portions with dotted borderlines represent the older schedule (see Fig. 3). We also show the generated private slack, for T_5 only. In practice, the improved memory latency achieved by employing WH_LLC boosts the overall performance, which is reflected in the reduced execution times of the individual tasks. The change in execution time for T_3 after applying WH_LLC is explicitly shown in the figure. Note that the performance improvements for the tasks in Fig. 7 are not to scale. Our simulation results in Section V show the changes in performance for the individual tasks constructed from PARSEC benchmarks [5] (see Table VI).
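Fellow-group formation (Section IV-B1) pairs heavily used cache sets with lightly used ones so that each group mixes both. The grouping heuristic below is a hypothetical illustration of that idea, not FS-DAM's actual policy; the per-set usage counts are assumed values.

```python
def form_fellow_groups(set_usages, group_size=4):
    """Group cache-set indices so each fellow group mixes lightly and
    heavily used sets (illustrative heuristic, not the FS-DAM algorithm)."""
    order = sorted(range(len(set_usages)), key=lambda s: set_usages[s])
    light, heavy = order[:len(order) // 2], order[len(order) // 2:]
    half = group_size // 2
    groups = []
    while light or heavy:
        # take half the group from the lightly used sets, half from the heavy
        groups.append(sorted(light[:half] + heavy[:half]))
        light, heavy = light[half:], heavy[half:]
    return groups

usages = [90, 80, 10, 70, 5, 60, 8, 12]   # accesses per set (assumed)
print(form_fellow_groups(usages))         # two groups of four sets each
```

Because usage patterns shift across execution phases, such groups would be recomputed periodically, matching WH_LLC's dynamic restructuring of fellow groups.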
2) Enhancing Power Efficiency and Result Accuracy: Incorporating WH_LLC logically divides each LLC set into two parts, as discussed earlier. Hence, shutting down a physical cache way has a different impact on a task's performance depending on whether it is an NT or an RT way. Fig. 8 shows how way shutdown changes the associativity for an 8-way LLC having a fellow-group size of 4 with four dedicated ways per set for RT. Shutting down two physical cache ways from the NT portion reduces the logical associativity to 18. On the other hand, if two physical ways are turned off from the RT part, the logical associativity is reduced by 2 × 4, i.e., 8, and finally becomes 12. Shutting down two ways each from NT and RT brings the logical associativity to 10, which is still higher than the original one (8). So, by employing WH_LLC, even after shutting down 50% of the physical ways of a cache bank, we can still maintain an associativity of 10. This can partially curtail the gained benefits of WH_LLC, but the performance is still maintained over the baseline while the power consumption is significantly reduced. Note that, in this work, we set the upper limit for way shutdown at 50% from each of the NT and RT ways: by considering our system configuration (see Section V-B1) and prior cache requirement analyses of PARSEC [5], we restrict ourselves to ensuring that at least 50% of the cache remains available during execution, although the value of this limit is application dependent. For all tasks that have been scheduled with their highest version, way shutdown is applied to reduce LLC power consumption. To avoid any implementation conflicts, ACCURATE does not allow concurrent execution of dynamic LLC resizing and reconstruction of the fellow groups in WH_LLC. Algorithms 1-6 present the complete procedure for performing way shutdown at the individual LLC banks along with the result-accuracy enhancement.
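The arithmetic behind Fig. 8 can be captured in one formula: a set's logical associativity is its remaining NT ways plus the remaining RT ways multiplied by the fellow-group size, since every group member shares its RT ways. The sketch below (function name ours) reproduces the numbers quoted above for the 8-way, group-of-4 configuration.

```python
def logical_assoc_after_shutdown(nt, rt, group, off_nt=0, off_rt=0):
    """Logical associativity of a WH_LLC set after way shutdown.
    Turning off an RT way is felt by all `group` member sets, so it costs
    `group` units of logical associativity; an NT way costs only one."""
    return (nt - off_nt) + (rt - off_rt) * group

print(logical_assoc_after_shutdown(4, 4, 4))            # all ways on: 20
print(logical_assoc_after_shutdown(4, 4, 4, off_nt=2))  # 2 NT ways off: 18
print(logical_assoc_after_shutdown(4, 4, 4, off_rt=2))  # 2 RT ways off: 12
print(logical_assoc_after_shutdown(4, 4, 4, 2, 2))      # 50% ways off: 10
```

This asymmetry explains why ACCURATE caps shutdown at 50% per region: even then, the logical associativity (10) stays above the original physical associativity (8).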
Once the schedule is generated, each task's start time and end time are determined. ACCURATE:Online then converts all such timing parameters to cycles and stores them in the dispatch table, where the duration (in cycles) of the deadline is named the FRAME.
Algorithm 1 takes the following parameters as inputs: Interval_length, Sleep_Thr, Turn_ON_OH, and #available_higher_versions_of_O_i. During execution, Algorithm 1 checks the LLC usage periodically at the end of every Interval_length cycles, which is set by considering prior analyses of LLC usage [28], [32], [33]. Sleep_Thr is the minimum threshold value for a slack span, also known as the processor's break-even time [39], whose value is architecture dependent. Turn_ON_OH represents the time taken for the core to be turned on from its sleep mode. The number of available higher versions of O_i of task T_i over its scheduled one is represented by #available_higher_versions_of_O_i. A variable, cycle_cntr, keeps track of the number of cycles within a FRAME. The #Off_ways_at_NT[B] and #Off_ways_at_RT[B] counters keep track of the number of turned-off NT and RT ways, respectively, at a particular LLC bank B. We also use a flag, No_LLC_resize_flag[T_i] (initialized to 0 at line 3), to decide whether LLC resizing for T_i is enabled. The end timestamp of each individual task (within a FRAME on the assigned core) is modified and called the extended end time (Extended_End_Time_T_i), which is defined as follows.
1) Extended_End_Time_T_i is the scheduled start time of the next task (say T_j) assigned on the same core, if the current task is not the last task on its assigned core within the same FRAME.  2) Extended_End_Time_T_i is set to the length of the FRAME if T_i is the last task of a particular core within the FRAME. For example, Extended_End_Time_T_2 at core P_2 in Fig. 3 is 46, which is the start time of T_4. Extended_End_Time_T_5 will be 100, as T_5 is the last task of the FRAME at P_1. For ease of understanding, all of these time values can be assumed to be cycles, e.g., 100 time units can be considered as 100 cycles.
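The two-case rule above can be sketched in a few lines. This is our own illustration (helper name and the start times are hypothetical): a task's extended end time is the start of the next task on the same core, or the FRAME length for the core's last task.

```python
# Minimal sketch of the extended-end-time rule for one core's task list.
def extended_end_times(start_times, frame_len):
    """start_times: scheduled start cycles of tasks on one core, in order."""
    return [start_times[i + 1] if i + 1 < len(start_times) else frame_len
            for i in range(len(start_times))]

# Hypothetical core running two tasks (starts 20 and 46) in a 100-cycle FRAME:
assert extended_end_times([20, 46], 100) == [46, 100]
```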
With the onset of the FRAME, the algorithm first checks whether any initial slack exists at the current core by looking at the dispatch table. Such slack can only exist if the tasks at the current core are waiting for the execution of the source task at some other core. For a sufficiently large init_slack, having a length of at least Sleep_Thr + Turn_ON_OH, sleep mode is enabled at the current core for the duration of the slack (lines 7-10). For enabling sleep mode at the core, the Sleep-Manager() subroutine, i.e., Algorithm 2, is called, which maintains a counter (gated_cycles) during sleep and turns the core on once the counter is exhausted (lines 1-6).
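The gating condition can be summarized in one predicate. The parameter names mirror the paper; the function itself and the numeric values are our illustration: the core sleeps only when the slack amortizes both the break-even time and the wake-up latency.

```python
# Hedged sketch of the initial-slack sleep decision from Algorithm 1:
# gate the core only if the slack covers the break-even time (Sleep_Thr)
# plus the wake-up overhead (Turn_ON_OH).
def should_sleep(init_slack, sleep_thr, turn_on_oh):
    return init_slack >= sleep_thr + turn_on_oh

# Hypothetical values, in cycles:
assert should_sleep(init_slack=5000, sleep_thr=3000, turn_on_oh=1500)
assert not should_sleep(init_slack=4000, sleep_thr=3000, turn_on_oh=1500)
```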
For each ready task (T_i), Algorithm 1 first checks whether the task is scheduled with its highest version, in which case execution starts directly (lines 11-14). If a task is not scheduled with its highest version, the system checks for the best possible schedulable higher version available for the task by executing the enhance-accuracy process given in Algorithm 3 (lines 1-5). Before inspecting the availability of a higher O_i, the algorithm starts executing M_i (line 18), and on completion the time left for executing O_i, i.e., Cycles_Left_O_i, is determined (line 19). Based upon the available higher versions that can fit within the time left, O_i is updated with the best possible one by calling Algorithm 3 and is executed accordingly (lines 20-22). In our example, we were able to dynamically schedule and execute the higher version for T_6 (see Fig. 9) by prudently exploiting its private slack (included in Cycles_Left_O_i). Note that our algorithm does not allow dynamic LLC resizing if a task's version can be updated online, which, if allowed, might lead to deadline violation. Hence, the flag No_LLC_resize_flag[T_i] is set to 1 for the tasks whose version can be updated dynamically (see line 16). Our algorithm also looks for the availability of a sufficiently large slack span after the execution of each task, and on availability of such slacks, sleep mode is enabled at the processor core by calling Algorithm 2 (lines 23-26). To execute tasks, Algorithm 1 calls the task-execution method given in Algorithm 4, which executes each task in the following manner. Once a task is fetched, the predetermined V/F level for this task is set at the assigned processor core and execution starts (see line 2).
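The version-upgrade step can be sketched as a simple best-fit selection. This is our own simplification (names and cycle counts are hypothetical): after the mandatory part finishes, pick the most accurate optional version whose cycle requirement still fits in Cycles_Left_O_i.

```python
# Illustrative sketch of the enhance-accuracy selection in Algorithm 3:
# choose the longest (most accurate) optional version that fits the
# remaining time budget; keep None if not even the lowest fits.
def best_fitting_version(version_cycles, cycles_left):
    """version_cycles: optional-part lengths, ascending with accuracy."""
    feasible = [v for v in version_cycles if v <= cycles_left]
    return max(feasible) if feasible else None

# Scheduled version needs 40 cycles; higher versions need 55 and 70:
assert best_fitting_version([40, 55, 70], cycles_left=60) == 55
```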
During the execution of a task, cycle_cntr is updated at each clock cycle, and this value is used to determine whether an Interval boundary has been reached and the current task is eligible for LLC resizing (i.e., No_LLC_resize_flag[T_i] = 0) (see line 4). Once cycle_cntr is at the Interval and the task is eligible for LLC resizing, the algorithm attempts to resize the LLC by calling Algorithm 5. ACCURATE is implemented with a multibanked LLC, in which our way-level dynamic LLC resizing strategy is enabled at each bank B. Hence, Algorithm 5 is called for all of the LLC banks (lines 6-8). Once resizing is done, execution proceeds normally.
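The trigger condition can be written as a single predicate. This is our own sketch (the interval length follows the evaluation setup; the function is hypothetical): resizing is attempted only at interval boundaries, and only for tasks whose version cannot be upgraded online.

```python
# Illustrative sketch of the per-interval resize trigger: a task is eligible
# when the cycle counter sits on an Interval boundary and its
# No_LLC_resize_flag is 0 (its version cannot be upgraded online).
INTERVAL_LENGTH = 2_000_000  # cycles, per the evaluation setup

def should_attempt_resize(cycle_cntr, no_llc_resize_flag):
    return cycle_cntr % INTERVAL_LENGTH == 0 and no_llc_resize_flag == 0

assert should_attempt_resize(4_000_000, 0)       # boundary, eligible task
assert not should_attempt_resize(4_000_000, 1)   # version may be upgraded
assert not should_attempt_resize(1_500_000, 0)   # mid-interval
```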
Existing diversities in cache access patterns across different execution phases of individual applications give rise to diverse cache requirements on-the-fly. As time criticality is enforced, keeping track of a task's cache requirements during different execution phases is inevitable, which can be monitored by considering the miss rate at bank-level granularity. Therefore, at first, a ratio is calculated as #misses(B)/#accesses(B) for each individual bank (B) on completion of an interval (Interval) (see line 2). If this ratio is smaller than POWER_DOWN (line 3), the algorithm first checks whether the number of turned-off NT ways (#Off_ways_at_NT[B]) is less than the maximum allowed (#Limit); if so, an NT way is selected as a victim and is eventually shut down after the invalidation or eviction of its blocks (lines 4-7). If #Off_ways_at_NT[B] has reached the maximum allowed (#Limit), and the number of turned-off ways in the RT portion (#Off_ways_at_RT[B]) is less than the maximum allowed (#Limit) (line 9), a way from RT is turned off after the invalidation or eviction of its blocks (lines 10-12). Note that, during the eviction of the blocks from the victim way, the bank can still serve external memory accesses. The main difference is that an eviction caused by a cache miss will not evict data from the victim way. On the other hand, if the ratio is larger than POWER_UP (line 14) and there exists at least one power-gated way in the RT portion, then one RT way is turned on (lines 15-17). If RT has no gated ways at present, our algorithm attempts to turn on a powered-off NT way (lines 17-20). Note that the incorporation of two separate limits for the ratio, where POWER_UP is larger than POWER_DOWN, reduces the chance of oscillating resizing, in which one (physical) way is repeatedly turned on and off during stable execution phases.
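The decision logic above can be sketched compactly. This is our own simplification of the per-bank policy (threshold values follow the evaluation setup; the limit of 2 assumes 50% of 4 ways per portion): a low miss ratio gates a way (NT first, then RT), a high miss ratio re-enables one (RT first, then NT), and the gap between the two thresholds provides hysteresis.

```python
# Sketch of the per-bank resize decision in Algorithm 5. Returns the updated
# (off_nt, off_rt) counters for one bank after one interval.
POWER_DOWN, POWER_UP, LIMIT = 0.025, 0.04, 2

def resize_bank(miss_ratio, off_nt, off_rt):
    if miss_ratio < POWER_DOWN:
        if off_nt < LIMIT:
            return off_nt + 1, off_rt          # gate one more NT way
        if off_rt < LIMIT:
            return off_nt, off_rt + 1          # NT limit hit: gate RT way
    elif miss_ratio > POWER_UP:
        if off_rt > 0:
            return off_nt, off_rt - 1          # wake an RT way first
        if off_nt > 0:
            return off_nt - 1, off_rt          # otherwise wake an NT way
    return off_nt, off_rt                      # between thresholds: no change

assert resize_bank(0.01, 0, 0) == (1, 0)   # low miss ratio: gate NT way
assert resize_bank(0.01, 2, 0) == (2, 1)   # NT limit hit: gate RT way
assert resize_bank(0.05, 1, 1) == (1, 0)   # high miss ratio: wake RT way
assert resize_bank(0.03, 1, 1) == (1, 1)   # in-band: stable (hysteresis)
```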
Depending on the system parameters and the average expected workload of the system, suitable values for Interval_length and the other thresholds (POWER_UP and POWER_DOWN) can be determined (see Section V-B1). Hence, these may either be set at design time or be made configurable. The number of sets that can be evicted per cycle during way shutdown is limited by the number of memory ports (per bank). Note that the block invalidation or eviction at the LLC ways is performed by the Evict-Way method in Algorithm 6 (lines 6 and 11). As long as blocks remain in the victim way, this algorithm either writes a block back to the main memory, if dirty, or invalidates it. Once this operation is done, the way is turned off (lines 1-6).

3) ACCURATE:Online Computational Overheads: Algorithm 1 is the heart of the ACCURATE:Online technique and executes at each core. It first inspects the dispatch table to identify whether a slack exists at the beginning of the FRAME. Such slacks can be determined just by looking at the dispatch table; hence, this incurs a computational overhead of O(1). A stepwise analysis of the computational overhead of Algorithm 1 due to the called functions/algorithms is as follows.

1) On presence of a slack at the beginning of the FRAME, the core is gated, only if the slack span is sufficiently large, by calling Algorithm 2, which keeps track of time during sleep. As the sleep duration typically takes a small value, Algorithm 2 incurs a computational overhead of O(1).
2) For all practical purposes, the computational overheads of the called algorithms may be considered constant; moreover, the implementation overheads for Algorithms 5 and 6 are limited [27].
3) Hence, the worst-case computational complexity of Algorithm 1 is O(n · k).
4) The number of processor cores is constant. Hence, at any FRAME, the total overhead for generating the schedules over all processor cores for the duration of a FRAME is O(n · k) in the worst case.

5) As the FRAME length is in O(D_PTG), the amortized complexity of ACCURATE:Online is O(n · k)/O(D_PTG), i.e., O(n · k / D_PTG).
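The Evict-Way drain loop of Algorithm 6, referenced in the overhead analysis above, can be sketched as follows. The data structures here are our own assumption (blocks as valid/dirty flags); the logic follows the description: write back dirty blocks, invalidate the rest, then gate the way.

```python
# Rough sketch of the Evict-Way drain loop: every valid block in the victim
# way is written back to main memory if dirty, otherwise simply invalidated;
# afterward the way is safe to power-gate.
def evict_way(way_blocks, write_back):
    """way_blocks: list of dicts with 'valid' and 'dirty' flags."""
    for blk in way_blocks:
        if blk["valid"]:
            if blk["dirty"]:
                write_back(blk)        # flush dirty data to main memory
            blk["valid"] = False       # invalidate the block
    return "off"                       # the way can now be turned off

flushed = []
blocks = [{"valid": True, "dirty": True}, {"valid": True, "dirty": False}]
assert evict_way(blocks, flushed.append) == "off"
assert len(flushed) == 1 and all(not b["valid"] for b in blocks)
```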
V. RESULTS AND ANALYSIS

In this section, we illustrate the efficacy of ACCURATE by evaluating the ILP-based task allocation and scheduling (see Section V-A) and the runtime energy efficiency and performance improvement (see Section V-B). Based upon the tasks' parameters (e.g., execution time spans, interdependencies among the tasks) and the number of available processor cores along with the V/F levels, the tasks are allocated by the ILP-based scheduling. Once the task allocation is over, with the onset of execution, our online cache-based policy trims the execution spans of the individual tasks by activating WH_LLC. If the current task is scheduled with its highest version, LLC-leakage consumption is reduced through selective power gating of the cache ways. On the other hand, if the task is scheduled with compromised accuracy, by trimming the execution span with WH_LLC, the highest possible version of the task is selected for execution. Toward standardizing our evaluations, we have considered task-execution parameters as per the AC real-time task model of [3] in the case of our offline strategy, whereas our online architectural technique is evaluated by employing a mixture of compute- and memory-bound PARSEC benchmark applications [5]. Moreover, a prior art claimed the eligibility of PARSEC in a real-time environment [40].

A. Evaluating ACCURATE: ILP-Based Scheduling
Performance evaluation has been carried out through a comprehensive set of simulation-based experiments, considering a homogeneous multiprocessor system that executes a set of real-time precedence-constrained tasks. Normalized achieved QoS (NAQ) is the principal metric based on which the evaluation has been performed. NAQ is defined as the ratio between the actually achieved QoS [see (2)] for the entire PTG and the maximum possible achievable QoS obtained by executing the highest version of each task node. Mathematically, NAQ can be formulated as NAQ = QoS_achieved / QoS_max. It can be inferred that NAQ contributes to derive a measure of the efficacy of the offline phase. Specifically, it determines how much of the optional portion of each task has been executed, depending upon the chosen version, while satisfying the constraints. Now, to show the efficacy of our offline technique, we model a multiprocessor system along with a task set as follows.

1) Processor System: For our experiment, we consider a multiprocessor platform equipped with 4 Alpha 21364 cores, where the per-core Pow_BGT is set at 2.7 W, which is obtained through power profiling of individual tasks in McPAT [9].
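As a tiny sanity check of the NAQ metric defined above (the QoS values here are hypothetical):

```python
# NAQ = achieved QoS for the PTG / QoS achievable when every task node
# executes its highest version.
def naq(achieved_qos, max_qos):
    return achieved_qos / max_qos

assert naq(achieved_qos=42.5, max_qos=50.0) == 0.85  # i.e., 85% QoS
```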
2) Task Characteristics: Each PTG consists of a set of subtasks under dependency constraints with a deadline D_PTG. Each subtask (T_i) is a multithreaded task (see Table VI), where all threads of a single task are executed on the same core (in a quasi-parallel manner), and is characterized by its execution time, ET_i. We also assume that a subtask can consume between 4 × 10^7 and 6 × 10^8 clock cycles [3]. Note that these WCET values of the tasks are assumed to be calculated by employing the framework stated in [41]. This framework enables quantifying the possible overestimation of WCET upper bounds obtained by static analysis, its prime objective being to derive a lower bound on the WCET to complement the upper bound. As ACCURATE employs a hybrid offline-online approach, such static analysis is beneficial for eliminating the overestimation, so we can expect a more realistic WCET.
It is further assumed that each task node can have a maximum of five versions, i.e., k = 5. The assumptions regarding execution lengths also include memory cycles for our individual tasks, consisting of PARSEC benchmark applications [5], [35]. The total execution requirement of a PTG (C_PTG) is defined as the sum of the execution times of its subtasks, C_PTG = Σ_{i=1}^{n} ET_i. Hence, the utilization U_i of a PTG can be denoted as C_PTG/D_PTG. The average utilization of a PTG is taken from a normal distribution by considering a normalized frequency of 0.5. Given the PTGs' utilizations, we can obtain the total system utilization (Sys_uti) by summing up the utilization of all the PTGs. Given the system utilization, the total system workload (Sys_WL), or system pressure, can be derived by: Sys_WL = (Sys_uti/m) × 100%. For a given system utilization, we have generated the PTGs by following the method proposed by Qamhieh and Midonnet [42]. Given a Sys_WL, a set of DAGs is created, where the number of DAGs (ρ) within the set follows from the given Sys_WL and the per-DAG utilizations. In our generated PTGs, the minimum number of tasks (nodes) is 5 and the maximum number of nodes is set to 20. For each of our PTGs in the set, the number of nodes has been randomly generated within the specified limits. It can also be noted that, as the individual utilization (U_i) of a DAG is lower than the given system workload (Sys_WL), the number of DAGs (ρ) within the set will always be higher than m. All of our experiments are carried out by using the CPLEX optimizer version 12.10.0, with a timeout of 5 h.
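The workload derivation above can be sketched directly from its definition (symbols follow the paper; the function and the sample utilizations are our illustration):

```python
# Sys_WL = (Sys_uti / m) * 100%, where Sys_uti is the sum of the PTG
# utilizations and m is the number of processor cores.
def system_workload(dag_utils, m):
    return sum(dag_utils) / m * 100.0

# Hypothetical set: fourteen DAGs of utilization 0.2 each on four cores
# yields a 70% system workload (and rho = 14 > m = 4, as noted above).
assert abs(system_workload([0.2] * 14, m=4) - 70.0) < 1e-9
```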

4) Frequency Level:
We have chosen two distinct normalized frequency levels, f_norm = 0.5 and 1, for task execution. The respective actual V/F settings for our considered cores are given in Table V.

Scalability Analysis of ILP: Fig. 10 depicts the average solving time versus the number of tasks (nodes) in each PTG. We observed that, when the number of tasks in each PTG is within 10, the average solving time remains comparable, i.e., it does not vary significantly with the number of tasks. However, when the number of tasks increases beyond 10, the average solving time also increases. This observation is also supported by the complexity analysis provided in Table II. Empirically, we further noticed that, with n = 20, the ILP generates on average 5000 constraints, for which the solving time reaches approximately 140 min.

1) Results: Fig. 11 shows the NAQ obtained by ACCURATE for various values of Sys_WL. It can be observed that ACCURATE is able to achieve 85% QoS when the system workload is low. However, QoS is reduced by 20% on average when the workload increases by 40%. Two other insightful observations can be derived from this figure. First, as the system workload increases, the average number of PTGs in the system also increases (as U_i is fixed at 0.2), which eventually contributes to low NAQ values. This happens because a higher number of tasks decreases the possibility of obtaining sufficient free slots in the scheduling period within the deadline. Insufficient free slots in turn reduce the probability of obtaining feasible schedules by selecting higher versions of the tasks. Second, in the case of man_high, the increasing value of Sys_WL imposes a less adverse effect on the achieved NAQ. This is because, when the mandatory portions of individual tasks are high, the lengths of the optional portions will be low. As a result, the variance among the different versions of a task becomes smaller.
Consequently, due to fewer variations among the optional portions of a task, there is less impact on the achieved result accuracy. In the case of man_low, we can observe the opposite trend, while man_med offers a performance between man_high and man_low; however, its NAQ decreases sharply as Sys_WL increases. We have also compared our policy with a prior strategy (Task_Deploy) [3], and the results for man_med are shown in Fig. 12. For a fair comparison with Task_Deploy, we first derived the overall energy limit based on the considered power budget (Pow_BGT) of ACCURATE's experimental framework. The same value is used as the energy limit for Task_Deploy as well. It can be observed that, as the number of tasks increases (due to the increase in Sys_WL), ACCURATE maintains higher QoS by achieving a higher NAQ than Task_Deploy. ACCURATE is able to maintain 70% QoS at 70% workload, where Task_Deploy achieves 55% QoS. This is because Task_Deploy did not consider any power limit, but assumed the energy budget would increase with a higher number of tasks. Moreover, Task_Deploy also allows unlimited task migration, which incurs extra overheads.

B. Evaluating ACCURATE:Online LLC-Based Technique
The evaluation of the WH_LLC-based dynamic accuracy-enhancement and power-minimization technique is carried out by employing architectural simulators, in which our entire online technique (discussed in Section IV-B) has been implemented. Before demonstrating our results, we first discuss the simulation setup.
1) Simulation Setup: We simulated two 4-core-based homogeneous TCMPs, each with four replicated tiles (see Fig. 13), in the gem5 full-system simulator [8] as our baseline system, where each of these TCMPs represents a single processing element (i.e., P_i in Fig. 13). Each tile of these TCMPs contains an in-order (InO) Alpha 21364 core together with its private L1 (data and instruction) caches. The whole L2 cache (the LLC in our case) is physically distributed/sliced uniformly among the tiles, as L2 banks, but logically the L2 banks share a single address space. The tiles are connected through a 2-D-mesh NoC; hence, each tile is also equipped with a router (depicted by the circles in Fig. 13). We implemented Algorithms 1-6 in the Ruby module of gem5, and the associated performance overheads for implementing these algorithms are also considered in our simulation. For estimating power/energy consumption (based on the 32-nm technology node), performance traces are fed to another simulator, McPAT [9]. The incurred energy overheads for implementing the online mechanism of ACCURATE are also derived from McPAT.
By considering prior empirical analysis based on cache locality [28], [33], the length of an interval (Interval_length in Algorithm 1) is set to 2 million clock cycles. To set POWER_UP and POWER_DOWN in Algorithm 5, the range of the miss ratio for nine PARSEC applications was observed over 80 million clock cycles (within the RoI), while applying FS-DAM at the LLC. Fig. 14 shows the ranges of this ratio for the individual PARSEC benchmarks. It can be noticed that the miss ratio varies between less than 1% and more than 8%, with an average of 2.75%. This small difference between the minimum and the average values indicates that, for most intervals, the miss ratio is small. For our evaluation, in this work, we set the values of POWER_UP and POWER_DOWN to 0.04 and 0.025, respectively, i.e., for a bank, a miss ratio of more than 0.04 will turn on a physical cache way, while a value of less than 0.025 will turn off a physical cache way in the LLC bank. Table V contains the configuration parameters for the processor cores and memories used in the evaluations. We generated our tasks by using the PARSEC benchmark suite [5], which fits the AC-based paradigm [7], [43]. In their work, Sidiroglou-Douskos et al. [43] showed how PARSEC benchmark programs can be used in the approximation paradigm through the loop perforation technique. To simulate our application (mentioned in Table III), we use six tasks, where each processor (i.e., each 4-core-based TCMP) executes the allocated tasks without any preemption. The tasks are framed by randomly combining executions of multiple PARSEC benchmark programs, where each one may also appear multiple times (see Table VI). Thus, each of our tasks is multiprogrammed; hence, our application (A) is a collection of multiprogrammed tasks. Basically, in Table VI, we show how each T_i in Fig. 2 (described in Section IV-A) is formed from PARSEC benchmark programs.
Toward simulating the whole system with PARSEC, we further scale up the values of M i , O i and D PTG by 100 million. Note that, the individual task cycles include both processor and memory cycles for the specific cache configuration given in Table V. Toward empirically validating and verifying ACCURATE with the contemporary workloads, we employ multithreaded PARSEC benchmark programs, where each individual program is executed with four threads. However, the discussion related to the detailed allocation of the benchmarks and their threads inside each task to the cores of the TCMP, which is internally managed by our simulation setup, is out of scope of this article.
The Baseline values in all of our results that evaluate the runtime techniques of ACCURATE are produced by executing the schedule generated by the ILP-based scheduling (discussed in Section IV-A) without incorporating any changes during execution. Also note that, as mentioned earlier, all timing parameters derived from the scheduling strategy are converted to clock cycles while filling up the dispatch table with the task details. The task details regarding the execution lengths (of the mandatory and optional parts) in cycles for a particular configuration of the processing platform need to be made available beforehand. Details of the processing platform include the number of cores per processor (e.g., 4 in ACCURATE), the available operational processing frequencies, cache configurations, and memory sizes (see Table V). The processor and memory cycles for each task are also derived prior to task scheduling through pre-executions of the tasks. The percentage of execution time spent on memory accesses is shown in Fig. 4 for the individual PARSEC benchmark programs.
2) Change in Performance at the Task Level: After implementing WH_LLC and the dynamic way-shutdown technique (Algorithms 1 and 5) in the Ruby module of gem5, we observed the changes in IPC at the task level during execution. Employing WH_LLC significantly boosts LLC performance by reducing capacity and conflict misses, which further reduces off-chip accesses and results in improved IPC. The incorporation of way shutdown (proposed in Algorithm 5) curtails part of the performance gained through WH_LLC; however, this performance degradation is compensated by a remarkable reduction in leakage consumption (discussed next). We further compared WH_LLC with another DAM-based prior work, Zcache [10], which yields an LLC associativity higher than the actual number of ways by increasing the number of replacement candidates. Fig. 15 shows the impacts on performance of WH_LLC, ACCURATE (WH_LLC + LLC resizing), and Zcache for the individual applications over the baseline. WH_LLC is able to improve performance by 10% on average for all tasks, with a minimum improvement of 9.5% in the case of T_1. This result also shows that ACCURATE curtails the performance gained by WH_LLC for individual applications, but it is still able to maintain a better IPC than the baseline, which ensures meeting the real-time constraints.
Among all of our tasks (mentioned in Table VI), T_2 and T_5 are memory intensive, whereas the other tasks comprise mixed (memory plus computational) workloads. Hence, the performance degradation is comparatively higher in the case of T_2 and T_5 in ACCURATE than in the other tasks. However, our dynamic way turn-on mechanism (in Algorithm 1) safeguards the executions from deadline violations by providing more cache space to the tasks on demand. Note that, even after shutting down cache ways on-the-fly, our technique still shows better performance than both the baseline and Zcache. ACCURATE still maintains a mean performance improvement of 6.4% over the baseline, which is 10.4% with only WH_LLC (over the baseline), whereas Zcache boosts performance by 5.7% over the baseline. Moreover, this empirical result implies that any task for which a higher version is available, with an additional execution span (in clock cycles) within 10% of the currently scheduled version, can enhance its result accuracy by executing the higher version. Additionally, energy efficiency can be enhanced by enabling sleep mode, subject to the availability of private slack.
3) Reduction in LLC Leakage: We set the upper limit for way shutdown to 50% in Algorithm 5 [44], which reduces around 36% of the leakage power on average across the applications. Fig. 16 exhibits the reduction in LLC-leakage consumption for the individual applications, where the leakage reduction is higher in the case of the mixed-workload tasks (T_1, T_3, T_4, and T_6). The requirement of higher runtime cache space curtails the leakage reduction for the memory-intensive tasks (T_2 and T_5), for which Algorithm 1 was unable to maintain a lower cache size for a long time span on-the-fly. Note that we executed all of these tasks with their respective highest versions (i.e., the best possible ones) along with the assigned V/F level at the core (determined by the scheduling mechanism in Section IV) to illustrate the efficacy of our online mechanism.

4) EDP Gains: For the same set of applications executing with their respective highest versions, our cache-based online technique shows smaller EDP gains in the cases of the memory-intensive tasks (T_2 and T_5), due to their comparatively smaller reduction in LLC leakage. On the other hand, the mixed workloads (T_1, T_3, T_4, and T_6) are able to provide higher EDP gains due to the higher reduction in LLC-leakage consumption while applying ACCURATE. Fig. 17 shows significant gains in EDP across the tasks while applying ACCURATE. Our online LLC-based strategy is able to offer a significantly higher average EDP gain of 24%, and this gain lies in the range of 19%-28% for our task set. Note that the EDP for each application includes the power consumed by both the processor cores and the two levels of caches.

C. Gains From ACCURATE in a Nutshell
The offline mechanism first generates the schedule and is able to achieve around 85% NAQ (see Section V-A) while maintaining the system constraints. Our online cache-based strategy shows a significant performance improvement of 6.4% on average (see Section V-B2) while reducing LLC-leakage power consumption by 36% on average (see Section V-B3) by shutting down a number of LLC ways. The overall performance improvement of the online policy ensures that the timing constraints determined by the offline scheduling are met. Moreover, while maintaining the deadline constraint, our cache-based online technique is able to reduce a significant amount of energy by generating private slacks, which are employed for sleep, enabling a noticeable overall energy reduction of 44% (see Fig. 18).
By employing Algorithms 1 and 5, we have modified the schedule online, as reported in Table VII. For tasks T_1, T_2, T_3, and T_5, Algorithm 1 applies WH_LLC along with way shutdown, whereas for T_4 and T_6, Algorithm 1 attempts to improve the result accuracy. The Scheduled Timespan column in Table VII shows the output of our offline technique, and the next two columns present the actual running times with WH_LLC and ACCURATE (which includes WH_LLC and dynamic LLC resizing), respectively. In our schedule, for T_4 and T_6 we have scope to improve the result accuracy, as they are not scheduled with their respective highest versions. Our algorithm is able to improve the result accuracy online for T_6, which is highlighted with a green background, whereas the red background in the case of T_4 implies that it cannot be executed with its higher version without violating the schedule. For T_6, the actual running times with WH_LLC and ACCURATE are lower than its predetermined execution spans; note that, for T_6, way shutdown was not performed. The private slacks generated at the end of the execution of any task are employed for sleep. Note that, during the execution of the source (T_1) and sink (T_6) tasks, only the core to which the source/sink task is assigned will be active, and the rest will be kept in sleep mode. By executing a higher version in the case of T_6, our technique is able to achieve a result accuracy of 47, which was 45 at the end of our offline scheduling. Finally, our overall energy savings at the individual task level are shown in Fig. 18. This figure shows that, by incorporating way shutdown and sleep, we achieve 44% savings in overall energy consumption for our task set. Thus, the amalgamation of these techniques in ACCURATE (offline plus online) offers an energy-efficient AC real-time task-allocation strategy with higher achievable QoS.
VI. CONCLUSION

QoS improvement in AC real-time systems without violating the precedence-power-temporal constraints has become an active research topic in recent times. The accuracy of such AC tasks can be improved by executing more of their optional parts in addition to their respective mandatory parts. In this article, ACCURATE proposed: 1) an efficient scheduling strategy toward maximizing result accuracy for a set of AC real-time applications modeled as PTGs on multicores, along with 2) an online cache-based mechanism toward further refinement of the result accuracy together with reducing the runtime energy of the underlying circuitry.
Once the tasks are allocated to the processor cores by employing an ILP-based scheduling technique, our online strategy orchestrates a DAM-based way-sharing mechanism at the shared LLC to significantly reduce the running time of the applications. This improved performance is traded toward enhancing result accuracy, by executing more workload from the optional parts of the applications, and toward energy efficiency, by dynamically turning off a controlled number of LLC ways, while respecting the system-wide constraints. Our evaluation reveals that the offline strategy of ACCURATE achieves 85% QoS while maintaining the system constraints, and the cache-based online mechanism reduces LLC leakage by 36% on average, with a 24% average gain in EDP and a 6.4% improvement in performance for our 4-core-based CMP baseline system.