Architectural-Space Exploration of Heterogeneous Reliability and Checkpointing Modes for Out-of-Order Superscalar Processors

State-of-the-art reliability techniques and mechanisms deploy full-scale redundancy, like double or triple modular redundancy (DMR, TMR), on different layers of the computing stack to detect and/or correct transient faults. However, techniques relying on full-scale redundancy incur significant area, performance, and/or power overheads, which might not always be feasible or practical due to system constraints such as deadlines and the available power budget for the full chip (or a processor core). In this work, we propose a novel design methodology to generate and explore the architectural space of heterogeneous reliability modes for out-of-order superscalar multi-core processors. These heterogeneous modes enable varying reliability and power/area trade-offs, from which an optimal configuration can be chosen at run time to meet the reliability requirements of a given system while reducing the corresponding power overheads (or to solve the inverse problem, i.e., maximizing the reliability under a given power constraint). Our experimental results show that a Pareto-optimal heterogeneous reliability mode reduces the core vulnerability by 87%, on average, across multiple application workloads, with area and power overheads of 10% and 43%, respectively. To further enhance the design space of heterogeneous reliability modes, we investigate the effectiveness of combining different processor state compression techniques like Distributed Multi-threaded Checkpointing (DMTCP), Hash-based Incremental Checkpointing (HBICT), and GNU zip, such that the correct processor state can be recovered once a fault is detected. We reduce the checkpoint sizes by a factor of $\sim 6\times$ using a unique combination of different state compression techniques.


I. INTRODUCTION
Reduced transistor sizes, due to technological advancements, have led to significant improvements in computational performance, especially in the latest generation of multi-core processors. As a drawback, this downscaling has led to an increased susceptibility towards several reliability problems, such as soft errors, at the hardware layer [1]. Soft errors are transient faults in the hardware that cause bit-flips in the data path or memory that may propagate to the application output or may terminate application execution [2]. The rate of occurrence of these soft errors is expected to increase with each new generation of processors being released into the market, due to their fabrication using continuously smaller technology nodes [3]. This is a major threat for industries that rely on dependability and reliability of electronics for application areas like aerospace or automotive.
A large body of research on techniques like redundancy and checkpointing targets the prevention, detection, and mitigation of soft errors across the computing stack, i.e., the hardware and software layers [4][5]. Reliability at the hardware layer is ensured through redundancy of execution paths and/or hardening of pipeline components, e.g., Double/Triple Modular Redundancy. Software-layer techniques realize spatial/temporal redundancy by executing multiple redundant threads of an application, thereby ensuring a reliable output [6]. However, these techniques incur significant performance overheads (temporal redundancy) and/or power overheads (spatial redundancy). Therefore, we propose to evaluate the individual properties and requirements of an application at design time to develop a soft error mitigation strategy that decreases the power/performance overhead while satisfying the reliability requirement of the application.
Towards this, we make the following novel contributions:
• Vulnerability analysis of an out-of-order superscalar core, to extract application-specific requirements and the AVF of pipeline components (Section III).
• A novel methodology for analyzing the application-specific properties and vulnerabilities of out-of-order superscalar cores. It enables the design of a wide range of heterogeneous reliability modes, which can be selected at run time, to increase the system reliability and decrease the power/area overhead (Section IV).
• To further enhance the processor reliability and increase the design space, we analyze and investigate efficient compression techniques. They can be used to decrease the storage size of checkpointing data in case of power emergencies, and enable an efficient rollback of the processor state to the last known safe state (Section IV.C).
We present a motivational case study that illustrates the need for heterogeneous reliability modes in Section III. Before that, we present our experimental setup to aid the understanding of the results.

II. EXPERIMENTAL SETUP
We extend the cycle-accurate simulator gem5 [7] to provide the following functionality: (1) estimate the vulnerable time of all pipeline components to determine their AVFs [8], (2) model heterogeneous reliability modes by hardening key pipeline components using TMR [9] instead of triplicating the full processor all the time, since component-level redundancy enables more fine-grained reliability management at run time, and (3) checkpoint processor states using mechanisms like DMTCP [10][11] and HBICT [12][13]. We use the ALPHA 21264 four-issue superscalar processor core [14] as our target platform. To account for a wide range of applications, we evaluate the proposed heterogeneous reliability modes using the MiBench application benchmark suite. An overview of our experimental setup is provided in Fig. 1.
In the next section, we analyze the vulnerability of an out-of-order superscalar ALPHA core processor to evaluate the application-specific Architectural Vulnerability Factor (AVF) for key pipeline components.

III. VULNERABILITY ANALYSIS OF OUT-OF-ORDER SUPERSCALAR ALPHA CORES
A component's Architectural Vulnerability Factor (AVF) is defined as the probability that a fault in the component propagates to the final output, resulting in an execution error. We evaluate the vulnerability of the Sha and Bit-counts applications from the MiBench application benchmark suite [15] by executing them on a single-core ALPHA 21264 superscalar processor [14]. We consider the following eleven key pipeline components for the vulnerability analysis: Re-order Buffer (ROB), Instruction (IQ), Load (LQ), and Store Queues (SQ), Integer (Int. RF) and Floating Point Register Files (FP RF), Rename Map (RM), Integer (Int. ALU) and Floating Point ALUs (FP ALU), and Integer (Int. MD) and Floating Point Multiply/Divide units (FP MD). The results of this experiment are presented in Fig. 2. From the experiment, we make the following key observations:
• The AVFs of the individual pipeline components are different for different applications.
• We have identified four key pipeline components (Integer ALU, Floating-point Register File, Store Queue, and Reorder Buffer) that are more vulnerable during the execution of Sha than during the execution of Bit-counts.
• These components have different AVFs because of the type of instructions being executed and their application-specific properties (compute- or memory-intensive, instruction-level parallelism, cache hit/miss rate, etc.). For example, components like the re-order buffer and the store queue are more vulnerable in Sha because of higher levels of instruction-level parallelism and more store instructions.
Based on this analysis, we can infer that hardening certain components of the pipeline increases the reliability of the processor more than hardening the other components. Therefore, we design a wide range of reliability-heterogeneous ALPHA cores that can be selected to increase the reliability of application executions while decreasing the area/power overhead.
In the next section, we present the methodology we used to evaluate the vulnerability of different pipeline components and devise heterogeneous reliability modes that can be selected at run-time to increase the reliability of the system.

IV. HETEROGENEOUS RELIABILITY MODES OF OUT-OF-ORDER SUPERSCALAR CORES
Fig. 3 presents an overview of our methodology for designing heterogeneous reliability modes for out-of-order superscalar processors. Our methodology targets two approaches for designing heterogeneous reliability modes: (1) Redundancy and (2) Checkpointing. First, to ensure reliable execution at the hardware layer, we propose heterogeneous reliability modes that harden the processor's highly vulnerable pipeline components depending on the target application. These pipeline components are selected based on initial fault-injection experiments or on AVFs that are estimated from the number of vulnerable bits and the vulnerable time of each component. Second, we ensure redundancy by investigating efficient checkpoint compression techniques to effectively reduce the size of the checkpointing data. The initial analysis for evaluating the AVF of different components is illustrated and discussed in Section III.
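The AVF-based selection step described above can be sketched as follows. This is a minimal illustration, not the gem5 extension itself: it assumes AVF is approximated as the fraction of a component's bit-cycles that were vulnerable, and the `ComponentTrace` type, its field names, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ComponentTrace:
    """Hypothetical per-component summary collected during simulation."""
    name: str
    total_bits: int             # storage bits in the component
    vulnerable_bit_cycles: int  # sum over cycles of bits that were vulnerable


def estimate_avf(trace: ComponentTrace, total_cycles: int) -> float:
    """Approximate AVF as vulnerable bit-cycles over all bit-cycles (assumption)."""
    if trace.total_bits <= 0 or total_cycles <= 0:
        raise ValueError("total_bits and total_cycles must be positive")
    return trace.vulnerable_bit_cycles / (trace.total_bits * total_cycles)


def rank_by_avf(traces: List[ComponentTrace], total_cycles: int) -> List[Tuple[str, float]]:
    """Return (name, AVF) pairs, most vulnerable component first."""
    pairs = [(t.name, estimate_avf(t, total_cycles)) for t in traces]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Made-up numbers for illustration only; they are not results from the paper.
traces = [
    ComponentTrace("SQ", total_bits=500, vulnerable_bit_cycles=50_000),
    ComponentTrace("ROB", total_bits=1000, vulnerable_bit_cycles=400_000),
    ComponentTrace("FP_ALU", total_bits=200, vulnerable_bit_cycles=0),
]
ranking = rank_by_avf(traces, total_cycles=1000)
for name, avf in ranking:
    print(f"{name}: AVF = {avf:.2f}")
```

Components at the top of such a ranking would be the first candidates for hardening; a fault-injection campaign could replace or validate the estimate.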

A. Full-Processor Vulnerability Factor (FPVF)
We extend the architectural vulnerability factor to evaluate the vulnerability of the complete processor. We define the Full-Processor Vulnerability Factor (FPVF) as the ratio of the total number of vulnerable bits in the processor pipeline (VulnerableBits) for the duration they are vulnerable (VulnerableTime) to the total number of bits in the processor pipeline (TotalBits) for the total duration of application execution (TotalTime). We estimate the vulnerability of our proposed heterogeneous reliability modes using the equation:

$$\mathrm{FPVF} = \frac{\mathrm{VulnerableBits} \times \mathrm{VulnerableTime}}{\mathrm{TotalBits} \times \mathrm{TotalTime}}$$
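As a worked illustration of the FPVF definition above, the following sketch accumulates the vulnerable bit-time over pipeline components and normalizes it by the total bit-time. It rests on two assumptions that are not stated in the text: the numerator is summed per component, and a TMR-hardened component contributes no vulnerable bit-time. The `Component` type and all numbers are hypothetical.

```python
from typing import Dict, FrozenSet, NamedTuple


class Component(NamedTuple):
    """Hypothetical per-component measurements."""
    total_bits: int       # bits in the component
    vulnerable_bits: int  # bits that can corrupt the output if flipped
    vulnerable_time: int  # cycles during which those bits are vulnerable


def fpvf(components: Dict[str, Component],
         total_time: int,
         hardened: FrozenSet[str] = frozenset()) -> float:
    """FPVF = sum(VulnerableBits * VulnerableTime) / (TotalBits * TotalTime).

    Assumption: hardened components are fully masked, so they add nothing
    to the numerator; TotalBits still counts the unhardened bit count.
    """
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    total_bits = sum(c.total_bits for c in components.values())
    if total_bits <= 0:
        raise ValueError("components must contain at least one bit")
    exposed = sum(c.vulnerable_bits * c.vulnerable_time
                  for name, c in components.items()
                  if name not in hardened)
    return exposed / (total_bits * total_time)


# Made-up numbers for illustration only; they are not results from the paper.
pipeline = {
    "ROB": Component(total_bits=1000, vulnerable_bits=400, vulnerable_time=800),
    "RM": Component(total_bits=200, vulnerable_bits=100, vulnerable_time=1000),
    "IQ": Component(total_bits=300, vulnerable_bits=60, vulnerable_time=500),
}
baseline = fpvf(pipeline, total_time=1000)
protected = fpvf(pipeline, total_time=1000, hardened=frozenset({"ROB", "RM"}))
print(f"unprotected FPVF = {baseline:.3f}, with ROB and RM hardened = {protected:.3f}")
```

Comparing the two values gives the relative vulnerability reduction of a candidate mode, which is the quantity traded off against area and power overheads.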

B. Heterogeneous Reliability Modes for ALPHA Cores
Based on the differences in vulnerability of various pipeline components, we propose to harden a combination of the key pipeline components, instead of all pipeline components, to increase processor reliability while reducing the area and power overheads of triple modular redundancy. Table 1 presents our list of nine proposed heterogeneous reliability modes (RM) and the components that are hardened in these modes using TMR.
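TMR hardening of a component amounts to triplicating it and taking a majority vote over the three outputs. The sketch below shows a bitwise majority voter in software for illustration only; the function names are hypothetical and the actual voter would be a hardware circuit.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three replica outputs.

    Each output bit is 1 exactly when at least two of the replicas agree on 1.
    """
    return (a & b) | (a & c) | (b & c)


def flip_bit(word: int, bit: int) -> int:
    """Model a soft error as a single bit-flip at position `bit`."""
    return word ^ (1 << bit)


golden = 0b1011_0010
faulty = flip_bit(golden, 3)

# A fault in one replica is outvoted by the two fault-free replicas.
masked = tmr_vote(golden, faulty, golden)

# The same bit flipped in two replicas defeats the voter.
unmasked = tmr_vote(faulty, faulty, golden)

print(f"single fault masked: {masked == golden}")
print(f"double fault masked: {unmasked == golden}")
```

The voter masks any single-replica fault per bit position, which is why the hardened component costs roughly three instances plus the voter in area and power.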
Hardened components have three instances with the same inputs, and a voter circuit at the output to determine the majority.

We evaluate the vulnerability of our heterogeneous reliability modes by executing applications from the MiBench application benchmark suite to estimate the FPVF for each scenario. We also evaluate the area and power overheads incurred by each reliability mode. These results are illustrated in Fig. 4. From these results, we make the following key observations:
• Different hardening modes reduce the processor vulnerability to different extents, depending upon the application properties and requirements. For example, reliability modes like RM2, RM6, and RM9 reduce the processor vulnerability of Sha by more than 50%, but not that of Dijkstra, even though the two applications have similar vulnerabilities in all other reliability modes.
• Hardening specific components in the pipeline can significantly reduce the overall processor vulnerability. For example, hardening key components like the Rename Map (RM) and Reorder Buffer (ROB) effectively reduces the FPVF for all applications, as shown by the heterogeneous reliability modes RM4, RM7, and RM8. However, utilizing these hardening modes incurs significant area and power overheads.

Using the data gathered from the simulation of our designs, we perform a design space exploration that trades off FPVF, area, and power overheads to extract the Pareto-optimal designs that suit the target application best. The corresponding results are illustrated in Fig. 5. The x-axis denotes the Full-Processor Vulnerability Factor (FPVF), whereas the y- and z-axes denote the power and area overheads, respectively. The design labeled Ⓤ in all applications is the unprotected core that is highly vulnerable to soft errors. As it does not deploy any redundancy measures, it has zero area and power overhead, and hence lies on the Pareto front. The Pareto-optimal reliability modes for the applications are presented in Table 2. RM4 is Pareto-optimal for all applications except Sha. During the execution of Sha, the register file is highly vulnerable to soft errors and needs to be hardened to reduce the vulnerability. The reliability mode RM7 is Pareto-optimal for all four applications and reduces the FPVF on average by 87% with average area and power overheads of 10% and 43%, respectively.
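The design space exploration can be sketched as a three-objective Pareto filter over (FPVF, power overhead, area overhead), followed by a run-time style selection of the lowest-power mode that meets a vulnerability bound. The design names and all values below are hypothetical placeholders, not the measured results of the reliability modes, and the selection policy is one plausible choice rather than the one used in this work.

```python
from typing import List, NamedTuple, Optional


class Design(NamedTuple):
    name: str
    fpvf: float            # lower is better
    power_overhead: float  # lower is better
    area_overhead: float   # lower is better


def dominates(a: Design, b: Design) -> bool:
    """True if `a` is no worse than `b` in every objective and better in one."""
    a_obj = (a.fpvf, a.power_overhead, a.area_overhead)
    b_obj = (b.fpvf, b.power_overhead, b.area_overhead)
    no_worse = all(x <= y for x, y in zip(a_obj, b_obj))
    better = any(x < y for x, y in zip(a_obj, b_obj))
    return no_worse and better


def pareto_front(designs: List[Design]) -> List[Design]:
    """Keep every design that no other design dominates."""
    return [d for d in designs
            if not any(dominates(other, d) for other in designs)]


def pick_mode(designs: List[Design], max_fpvf: float) -> Optional[Design]:
    """Lowest-power Pareto-optimal design meeting the FPVF bound, if any."""
    feasible = [d for d in pareto_front(designs) if d.fpvf <= max_fpvf]
    return min(feasible, key=lambda d: d.power_overhead) if feasible else None


# Made-up design points for illustration only.
designs = [
    Design("unprotected", fpvf=0.30, power_overhead=0.00, area_overhead=0.00),
    Design("mode_a", fpvf=0.10, power_overhead=0.20, area_overhead=0.05),
    Design("mode_b", fpvf=0.12, power_overhead=0.25, area_overhead=0.06),
    Design("mode_c", fpvf=0.04, power_overhead=0.45, area_overhead=0.10),
]
front = pareto_front(designs)
choice = pick_mode(designs, max_fpvf=0.11)
print("Pareto front:", [d.name for d in front])
print("selected:", choice.name if choice else None)
```

In this toy data set the unprotected design stays on the front because nothing beats its zero overheads, mirroring the observation made for Ⓤ above.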

C. State Compression Techniques
Checkpointing and rollback is an effective way of providing reliability at the software layer through both spatial and temporal redundancy. A checkpoint is a snapshot of the processor state at an instant in time. Checkpoints allow the system to roll back to a previous safe state when a failure is detected and to re-execute instructions from there. The checkpointing mechanism deployed by gem5 comes with certain caveats: the cache and pipeline states are not saved in a checkpoint, so frequent restoration from such checkpoints results in performance loss. Therefore, we explore techniques like DMTCP (Distributed Multi-Threaded Checkpointing), which checkpoints the Linux process. The back-end checkpointing mechanism of DMTCP is accessible to the programmer via numerous APIs. These APIs can be used in conjunction with the front-end gem5 pseudo-instructions for checkpoint creation/recovery.

Since these software-based checkpoints are often large, the checkpoint is compressed using gzip and HBICT (Hash Based Incremental Checkpointing Tool) to save memory. HBICT provides DMTCP with support for delta compression (relative to the previous checkpoint), and the resulting delta is further compressed using gzip (a combination of the lossless data compression algorithms LZ77 and Huffman coding). We investigate the effectiveness of these techniques in all possible combinations.

We evaluate the effectiveness of these state compression techniques on applications from the MiBench application benchmark suite by simulating them on the ALPHA core using gem5. The results of this experiment are presented in Fig. 6. It can be observed that the combination of DMTCP and gzip reduces the checkpoint size by ~6×. In comparison, the combination of DMTCP, HBICT, and gzip reduces the checkpoint size by ~5.7×.
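The idea of combining hash-based incremental checkpointing with gzip can be sketched with the Python standard library alone. This is a generic block-hash delta written for illustration; it does not use or reproduce the DMTCP API or HBICT's on-disk format, the block size and function names are assumptions, and the synthetic data says nothing about the compression ratios reported above.

```python
import gzip
import hashlib
from typing import Dict, List

BLOCK_SIZE = 4096  # assumed block granularity for change detection


def block_hashes(state: bytes) -> List[bytes]:
    """Hash every fixed-size block of a checkpoint image."""
    return [hashlib.sha256(state[i:i + BLOCK_SIZE]).digest()
            for i in range(0, len(state), BLOCK_SIZE)]


def make_delta(prev_hashes: List[bytes], current: bytes) -> Dict[int, bytes]:
    """Collect only the blocks whose hash differs from the previous checkpoint."""
    delta = {}
    for idx, start in enumerate(range(0, len(current), BLOCK_SIZE)):
        block = current[start:start + BLOCK_SIZE]
        is_new = idx >= len(prev_hashes)
        if is_new or hashlib.sha256(block).digest() != prev_hashes[idx]:
            delta[idx] = block
    return delta


def apply_delta(previous: bytes, delta: Dict[int, bytes], new_length: int) -> bytes:
    """Rebuild the current checkpoint from the previous one plus the delta."""
    blocks = [previous[i:i + BLOCK_SIZE]
              for i in range(0, len(previous), BLOCK_SIZE)]
    n_blocks = (new_length + BLOCK_SIZE - 1) // BLOCK_SIZE
    blocks = blocks[:n_blocks] + [b""] * (n_blocks - len(blocks))
    for idx, block in delta.items():
        blocks[idx] = block
    return b"".join(blocks)[:new_length]


# Synthetic 64 KiB "processor state"; a few bytes change between checkpoints.
checkpoint_1 = bytes(range(256)) * 256
mutable = bytearray(checkpoint_1)
mutable[5000:5008] = b"\xff" * 8
checkpoint_2 = bytes(mutable)

delta = make_delta(block_hashes(checkpoint_1), checkpoint_2)
compressed_delta = gzip.compress(b"".join(delta[i] for i in sorted(delta)))
restored = apply_delta(checkpoint_1, delta, len(checkpoint_2))

print(f"changed blocks: {sorted(delta)}")
print(f"full checkpoint: {len(checkpoint_2)} bytes, "
      f"gzip-compressed delta: {len(compressed_delta)} bytes")
```

Whether the delta stage pays off depends on how much of the state changes between checkpoints; when most blocks change, gzip on the full image can do as well or better, which is consistent with the two combinations being close in the results above.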

V. CONCLUSION
We presented a methodology for evaluating the vulnerability of a processor's pipeline components for a given set of applications to identify the highly vulnerable components. Using these results, we proposed heterogeneous reliability modes for an out-of-order superscalar processor that decrease the processor vulnerability by hardening only specific components in the pipeline, which reduces the power and area overheads compared to hardening the full pipeline. The Pareto-optimal reliability mode RM7 reduces the processor vulnerability by 87% on average, with area and power overheads of 10% and 43%, respectively. We also investigated effective state-compression techniques that reduce the size of the checkpoint by ~6×.

Fig. 2: Differences in AVF of ALPHA Core Components during Application Execution (Sha and Bit-counts).

Fig. 3: Overview of Our Methodology for Hardening Out-of-Order Superscalar Processors

Fig. 4: Full-Processor Vulnerability Factor (FPVF) and Power/Area Trade-off of Our Heterogeneous Reliability Modes for Different MiBench Applications