
2008 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2008)

Date: 20-22 April 2008


Displaying Results 1 - 25 of 28
  • [Covers]

    Page(s): i - ii
  • [Commentary]

    Page(s): iii - vi
  • Reviewer and referee listings

    Page(s): vii
  • Table of contents

  • [Commentary]

    Page(s): xiii - xiv
  • Quick Performance Models Quickly: Closely-Coupled Partitioned Simulation on FPGAs

    Page(s): 1 - 10

    In this paper we explore microprocessor performance models implemented on FPGAs. While FPGAs can help with simulation speed, the increased implementation complexity can degrade model development time. We assess whether a simulator split into closely-coupled timing and functional partitions can address this by easing the development of timing models while retaining fine-grained parallelism. We give the semantics of our simulator partitioning, and discuss the architecture of its implementation on an FPGA. We describe how three timing models of vastly different target processors can use the same functional partition, and assess their performance.

  • Program Phase Detection based on Critical Basic Block Transitions

    Page(s): 11 - 21

    Many programs go through phases as they execute. Knowing where these phases begin and end can be beneficial. For example, adaptive architectures can exploit such information to lower their power consumption without much loss in performance. Architectural simulations can benefit from phase information by simulating only a small interval of each program phase, which significantly reduces the simulation time while still yielding results that are representative of complete simulations. This paper presents a lightweight profile-based phase detection technique that marks each phase change boundary in the program's binary at the basic block level with a critical basic block transition (CBBT). It is independent of execution windows and does not explicitly employ the notion of a threshold to make a phase change decision. We evaluate the effectiveness of CBBTs for reconfiguring the L1 data cache size and for guiding architectural simulations. Our CBBT method is as effective at dynamically reducing the L1 data cache size as idealized cache reconfiguration schemes are. Using CBBTs to statically determine simulation intervals yields as low a CPI error as the well-known SimPoint method does. In addition, experimental results indicate the CBBTs' effectiveness with both self-trained and cross-trained inputs, demonstrating the CBBTs' stability across different program inputs.
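
    For orientation only, here is a minimal window-and-threshold phase detector built on basic block vectors (BBVs); it is the kind of baseline the CBBT technique improves on, since CBBTs mark boundaries directly in the binary and need neither execution windows nor an explicit threshold. All names and numbers below are illustrative.

        # Hypothetical baseline: detect phase changes by comparing basic block
        # vectors (BBVs) of consecutive fixed-size windows of executed blocks.
        from collections import Counter

        def phase_boundaries(bb_trace, window=1000, threshold=0.5):
            """Return trace indices where consecutive windows differ by more than
            `threshold` (normalized Manhattan distance between their BBVs)."""
            boundaries, prev = [], None
            for start in range(0, len(bb_trace) - window + 1, window):
                bbv = Counter(bb_trace[start:start + window])
                if prev is not None:
                    blocks = set(prev) | set(bbv)
                    dist = sum(abs(prev[b] - bbv[b]) for b in blocks) / (2 * window)
                    if dist > threshold:
                        boundaries.append(start)   # a phase change near this window
                prev = bbv
            return boundaries

        # Toy trace: blocks 0-3 dominate early, blocks 7-9 dominate later.
        trace = [i % 4 for i in range(5000)] + [7 + i % 3 for i in range(5000)]
        print(phase_boundaries(trace))             # [5000]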

  • An Adaptive Synchronization Technique for Parallel Simulation of Networked Clusters

    Page(s): 22 - 31

    Computer clusters are a very cost-effective approach for high performance computing, but simulating a complete cluster is still an open research problem. The obvious approach - to parallelize individual node simulators - is complex and slow. Combining individual parallel simulators implies synchronizing their progress of time. This can be accomplished with a variety of parallel discrete event simulation techniques, but unfortunately any straightforward approach introduces a synchronization overhead causing up to two orders of magnitude of slowdown with respect to the simulation speed of an individual node. In this paper we present a novel adaptive technique that automatically adjusts the synchronization boundaries. By dynamically relaxing accuracy over the least interesting computational phases we dramatically increase performance with a marginal loss of precision. For example, in the simulation of an 8-node cluster running NAMD (a parallel molecular dynamics application) we show an acceleration factor of 26x over the deterministic "ground truth" simulation, at less than a 1% accuracy error.
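
    A hedged sketch of the adaptive idea summarized above: node simulators synchronize at a barrier every quantum, and the quantum is widened during quiet phases and narrowed when inter-node traffic is heavy. The policy and constants are assumptions for illustration, not the authors' implementation.

        # Illustrative only: adapt the synchronization quantum of a parallel
        # cluster simulation to the inter-node traffic seen in the last quantum.
        MIN_QUANTUM, MAX_QUANTUM = 1_000, 1_000_000    # simulated cycles (assumed bounds)

        def next_quantum(current, messages_last_quantum):
            """Widen the quantum in quiet phases, shrink it when nodes interact a lot,
            trading a little timing accuracy for far fewer synchronization barriers."""
            if messages_last_quantum == 0:
                return min(current * 2, MAX_QUANTUM)   # relax: nodes run independently
            if messages_last_quantum > 100:
                return max(current // 4, MIN_QUANTUM)  # tighten: heavy communication
            return current

        quantum = 10_000
        for msgs in [0, 0, 5, 250, 300, 0]:            # toy per-quantum message counts
            quantum = next_quantum(quantum, msgs)
            print(msgs, "->", quantum)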

  • Conservative vs. Optimistic Parallelization of Stateful Network Intrusion Detection

    Page(s): 32 - 43

    This paper presents and experimentally analyzes the performance of three parallelization strategies for the popular open-source Snort network intrusion detection system (NIDS). The parallelizations include 2 conservative variants and 1 optimistic scheme. The conservative strategy parallelizes inspection at the level of TCP/IP flows, as any potential inter-packet dependences are confined to a single flow. The flows are partitioned among threads, and each flow is processed in-order at one thread. A second variation reassigns flows between threads to improve load balance but still requires that only one thread process a given flow at a time. The flow-concurrent scheme provides good performance for 3 of the 5 network packet traces studied, reaching as high as a 4.1x speedup and a 3.1 Gbps inspection rate on a commodity 8-core server. Dynamic reassignment does not improve performance scalability because it introduces locking overheads that offset any potential benefits of load balancing. Neither conservative version can achieve good performance, however, without enough concurrent network flows. For this case, this paper presents an optimistic parallelization that exploits the observation that not all packets from a flow are actually connected by dependences. This system allows a single flow to be simultaneously processed by multiple threads, stalling if an actual dependence is found. The optimistic version has additional overheads that reduce speedup by 25% for traces with flow concurrency, but its benefits allow one additional trace to see substantial speedup (2.4x on five cores).
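
    A minimal sketch of the conservative flow-concurrent scheme: each packet's TCP/IP 5-tuple is hashed to one worker queue, so any inter-packet dependences (which are confined to a flow) are preserved by in-order processing on a single thread. The code is illustrative and is not Snort's.

        # Illustrative flow-concurrent dispatch: one queue and worker per thread,
        # with all packets of a flow hashed to the same worker.
        import queue, threading

        NUM_WORKERS = 8
        queues = [queue.Queue(maxsize=1024) for _ in range(NUM_WORKERS)]

        def flow_id(pkt):
            # pkt is a dict holding the 5-tuple; hashing it keeps per-flow order
            return hash((pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"], pkt["proto"]))

        def inspect(pkt):
            pass                         # placeholder for the detection engine

        def worker(q):
            while True:
                pkt = q.get()
                if pkt is None:          # shutdown sentinel
                    break
                inspect(pkt)             # stateful signature matching, in flow order

        def dispatch(pkt):
            queues[flow_id(pkt) % NUM_WORKERS].put(pkt)

        threads = [threading.Thread(target=worker, args=(q,)) for q in queues]
        for t in threads:
            t.start()
        dispatch({"src": "10.0.0.1", "dst": "10.0.0.2", "sport": 1234, "dport": 80, "proto": 6})
        for q in queues:
            q.put(None)                  # drain and stop the workers
        for t in threads:
            t.join()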

  • Computer Aided Engineering of Cluster Computers

    Page(s): 44 - 53

    There are many scientific and engineering applications that require the resources of a dedicated supercomputer: drug design, weather prediction, simulating vehicle crashes, fluid dynamics simulations of aircraft or even consumer products. Cluster supercomputers can leverage commodity parts with standard interfaces that allow them to be used interchangeably to build supercomputers customized for these and other applications. However, the best design for one application is not necessarily the best design for other applications. Supercomputer design is challenging, but this problem is harder due to the huge range of possible configurations, volatile component availability and pricing, and constraints on available power, cooling, and floor space. Cluster design rules (CDR) is a computer-aided engineering tool that uses resource constraints and application performance models to identify the few best designs among the trillions of designs that could be constructed using parts from a given database. It uses a branch-and-bound strategy based on cluster design principles that can eliminate many inferior designs from the search without evaluating them. For the millions of designs that remain, CDR measures fitness by one of several user-specified application performance models. New application performance models can be added by means of a programming interface. This paper details the concepts and mechanisms inside CDR and shows how it facilitates model-based engineering of custom clusters.
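
    A toy branch-and-bound in the spirit of CDR: choose one part per category, maximize a performance model under a per-node cost cap, and prune any partial design whose optimistic completion cannot beat the incumbent. The parts database, prices, and performance model are invented for illustration.

        # Toy design-space search with branch-and-bound pruning.
        categories = [
            [("fat_node", 9000, 2.0), ("thin_node", 4000, 1.0)],   # (name, cost, perf factor)
            [("infiniband", 1500, 1.3), ("gige", 200, 1.0)],
            [("big_disk", 800, 1.1), ("small_disk", 300, 1.0)],
        ]
        COST_CAP = 11_000                                           # max cost per node

        best_factor = [max(p[2] for p in cat) for cat in categories]  # optimistic bounds
        best = {"perf": 0.0, "design": None}

        def search(i=0, cost=0, perf=1.0, chosen=()):
            if cost > COST_CAP:
                return                                              # infeasible branch
            bound = perf
            for j in range(i, len(categories)):
                bound *= best_factor[j]                             # best case for the rest
            if bound <= best["perf"]:
                return                                              # prune: cannot win
            if i == len(categories):
                best["perf"], best["design"] = perf, chosen
                return
            for name, c, f in categories[i]:
                search(i + 1, cost + c, perf * f, chosen + (name,))

        search()
        print(best)    # e.g. {'perf': 2.6, 'design': ('fat_node', 'infiniband', 'small_disk')}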

  • An Analysis of I/O and Syscalls in Critical Sections and Their Implications for Transactional Memory

    Page(s): 54 - 62

    Transactional memory (TM) is a scalable and concurrent way to build atomic sections. One aspect of TM that remains unclear is how side-effecting operations - that is, those which cannot be transparently undone by a TM system - should be handled. This uncertainty poses a significant barrier to the general applicability and acceptance of TM. Further, the absence of transactional workloads makes it difficult to study this aspect. In this paper, we characterize the usage of I/O, and in particular system calls, within critical sections in two large applications, exploring both the actions performed and the characteristics of the critical sections in which they are performed. Shared memory programs employing critical sections are the closest approximation available to transactional workloads, so using this characterization, we attempt to reason about how the behavior we observed relates to the previous proposals for handling side-effecting operations within transactions. We find that the large majority of syscalls performed within critical sections can be handled with a range of existing techniques in a way transparent to the application developer. We also find that while side-effecting critical sections are rare, they tend to be quite long-lasting, and that many of these critical sections perform their first syscall (and thus become side-effecting) relatively early in their execution. Finally, we show that while these long-lived, side-effecting critical sections tend to execute concurrently with many critical sections on other threads, we observe little concurrency between side-effecting critical sections.

  • Full-System Critical Path Analysis

    Page(s): 63 - 74

    Many interesting workloads today are limited not by CPU processing power but by the interactions between the CPU, memory system, I/O devices, and the complex software that ties all the components together. Optimizing these workloads requires identifying performance bottlenecks across concurrent hardware components and across multiple layers of software. Common software profiling techniques cannot account for hardware bottlenecks or situations where software overheads are hidden due to overlap with hardware operations. Critical-path analysis is a powerful approach for identifying bottlenecks in highly concurrent systems, but typically requires detailed domain knowledge to construct the required event dependence graphs. As a result, to date it has been applied only to isolated system layers (e.g., processor microarchitectures or message-passing applications). In this paper we present a novel technique for applying critical-path analysis to complex systems composed of numerous interacting state machines. We avoid tedious up-front modeling by using control-flow tracing to expose implicit software state machines automatically, and iterative refinement to add necessary manual annotations with minimal effort. By applying our technique within a full-system simulator, we achieve an integrated trace of hardware and software events with minimal perturbation. As a result, we can perform this analysis across the user/kernel and hardware/software boundaries and even across multiple systems. We apply this technique to analyzing network performance, and show that we are able to find performance bottlenecks in both hardware and software, including some surprising bottlenecks in the Linux 2.6.13 kernel.
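
    The kernel of any critical-path analysis is a longest-path computation over an event dependence graph. A minimal sketch on an invented hardware/software event graph (edge weights are event latencies):

        # Longest (critical) path through a DAG of dependent events.
        from collections import defaultdict

        edges = {                          # event -> list of (successor, latency)
            "send_syscall":   [("driver_tx", 3)],
            "driver_tx":      [("nic_dma", 5), ("kernel_softirq", 2)],
            "nic_dma":        [("wire", 10)],
            "kernel_softirq": [("wire", 1)],
            "wire":           [],
        }

        def critical_path(edges, source):
            dist, pred = defaultdict(lambda: float("-inf")), {}
            dist[source] = 0
            order, seen = [], set()
            def topo(v):                   # DFS post-order = reverse topological order
                seen.add(v)
                for w, _ in edges[v]:
                    if w not in seen:
                        topo(w)
                order.append(v)
            topo(source)
            for v in reversed(order):      # relax edges in topological order
                for w, lat in edges[v]:
                    if dist[v] + lat > dist[w]:
                        dist[w], pred[w] = dist[v] + lat, v
            end = max(dist, key=dist.get)
            path = [end]
            while path[-1] in pred:
                path.append(pred[path[-1]])
            return dist[end], path[::-1]

        print(critical_path(edges, "send_syscall"))
        # (18, ['send_syscall', 'driver_tx', 'nic_dma', 'wire'])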

  • Explaining the Impact of Network Transport Protocols on SIP Proxy Performance

    Page(s): 75 - 84

    This paper characterizes the impact that the use of UDP versus TCP has on the performance and scalability of the OpenSER SIP proxy server. The session initiation protocol (SIP) is an application-layer signaling protocol that is widely used for establishing voice-over-IP (VoIP) phone calls. SIP can utilize a variety of transport protocols, including UDP and TCP. Despite the advantages of TCP, such as reliable delivery and congestion control, the common practice is to use UDP. This is a result of the belief that UDP's lower processor and network overhead results in improved performance and scalability of SIP services. This paper argues against this conventional wisdom. This paper shows that the principal reasons for OpenSER's poor performance using TCP are caused by the server's design, and not the low-level performance of UDP versus TCP. Specifically, OpenSER's architecture for handling concurrent calls is responsible for most of the difference. Moreover, once these issues are addressed, OpenSER's performance using TCP is much more competitive with its performance using UDP.

  • Performance Analysis of ARQ Protocols using a Theorem Prover

    Page(s): 85 - 94

    Automatic-repeat-request (ARQ) protocols are widely used in modern data communications to guarantee reliable transmission over imperfect physical links. The behavior of an ARQ protocol largely depends on a number of network parameters, and traditionally simulation is used for their performance analysis. However, simulation provides less accurate results and usually requires an enormous amount of CPU time in order to attain reasonable estimates. To overcome these limitations, we propose to conduct the performance analysis of ARQ protocols in the environment of a higher-order-logic theorem prover (HOL). We present an approach to formally model the delay characteristics of ARQ protocols as a function of a geometric random variable in higher-order logic. In particular, we develop higher-order-logic models that describe the delay behavior of three basic types of ARQ protocols, i.e., Stop-and-Wait, Go-Back-N and Selective-Repeat. The paper also includes the verification of the average message delay relations for these three protocols in HOL.
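
    For reference, the textbook Stop-and-Wait relation of the kind the paper formalizes and verifies in HOL; the symbols are illustrative (p is the frame error probability and T the time for one transmission attempt, including the frame, propagation, and acknowledgment or timeout):

        % Number of transmissions of a message is geometric; the delay scales with it.
        N \sim \mathrm{Geometric}(1 - p), \qquad
        \mathbb{E}[N] = \frac{1}{1 - p}, \qquad
        \mathbb{E}[D_{\mathrm{SW}}] = T \cdot \mathbb{E}[N] = \frac{T}{1 - p}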

  • Investigating the TLB Behavior of High-end Scientific Applications on Commodity Microprocessors

    Page(s): 95 - 104

    The floating point portion of the SPEC CPU suite and the HPC Challenge suite are widely recognized and utilized as benchmarks that represent scientific application behavior. In this work we show that while these benchmark suites may be representative of the cache behavior of production scientific applications, they do not accurately represent the TLB behavior of these applications. Furthermore, we demonstrate that the difference can have a significant impact on performance. In the first part of the paper we present results from implementation-independent trace-based simulations which demonstrate that benchmarks exhibit significantly different TLB behavior for a range of page sizes than a representative set of production applications. In the second part we validate these results on the AMD Opteron implementation of the x86 architecture, showing that false conclusions about choice of page size, drawn from benchmark performance, can result in performance degradations of up to nearly 50% for the production applications we investigated.

  • Scientific Computing Applications on a Stream Processor

    Page(s): 105 - 114

    Stream processors, developed for the stream programming model, perform well on media applications. In this paper we examine the applicability of a stream processor to scientific computing applications. Eight scientific applications, each having different performance characteristics, are mapped to a stream processor. Due to the novelty of the stream programming model, we show how to map programs written in a traditional language, such as FORTRAN. In a stream processor system, the management of system resources is the programmers' responsibility. We present several optimizations, which enable mapped programs to exploit various aspects of the stream processor architecture. Finally, we analyze the performance of the stream processor and the presented optimizations on a set of scientific computing applications. The stream programs are from 1.67 to 32.5 times faster than the corresponding FORTRAN programs on an Itanium 2 processor, with the optimizations playing an important role in realizing the performance improvement.

  • Pinpointing and Exploiting Opportunities for Enhancing Data Reuse

    Page(s): 115 - 126

    The potential for improving the performance of data-intensive scientific programs by enhancing data reuse in cache is substantial because CPUs are significantly faster than memory. Traditional performance tools typically collect or simulate cache miss counts or rates and attribute them at the function level. While such information identifies program scopes that exhibit a large cache miss rate, it is often insufficient to diagnose the causes for poor data locality and to identify what program transformations would improve memory hierarchy utilization. This paper describes an approach that uses memory reuse distance to identify an application's most significant memory access patterns causing cache misses and provide insight into ways of improving data reuse. Unlike previous approaches, our tool combines (1) analysis and instrumentation of fully optimized binaries, (2) online analysis of reuse patterns, (3) fine-grain attribution of measurements and models to statements, loops and variables, and (4) static analysis of access patterns to quantify spatial reuse. We demonstrate the effectiveness of our approach for understanding reuse patterns in two scientific codes: one for simulating neutron transport and a second for simulating turbulent transport in burning plasmas. Our tools pinpointed opportunities for enhancing data reuse. Using this feedback as a guide, we transformed the codes, reducing their misses at various levels of the memory hierarchy by integer factors and reducing their execution time by as much as 60% and 33%, respectively.
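
    A minimal illustration of the reuse-distance measure these tools build on: for each access, count the distinct addresses touched since the previous access to the same address. This naive sketch is quadratic in trace length and is not the binary-instrumentation tool described in the paper.

        # Reuse distance per access; None marks a cold (first) access.
        def reuse_distances(trace):
            last_seen, distances = {}, []
            for i, addr in enumerate(trace):
                if addr in last_seen:
                    window = trace[last_seen[addr] + 1 : i]
                    distances.append(len(set(window)))   # distinct addresses in between
                else:
                    distances.append(None)
                last_seen[addr] = i
            return distances

        trace = ["a", "b", "c", "a", "b", "b", "d", "a"]
        print(reuse_distances(trace))
        # [None, None, None, 2, 2, 0, None, 2]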

  • Processor Performance Modeling using Symbolic Simulation

    Page(s): 127 - 138

    We propose a method of analytically characterizing processor performance as a function of circuit latencies. In our approach, we modify traditional simulation to use variables instead of fixed latencies for the internal functional units. The simulation engine then algebraically computes execution times, and the result is a mathematical equation which characterizes the performance space across numerous processor configurations. We discuss the computational complexity issues of this approach and show that instruction chunking and simple equation redundancy checking can make this approach feasible: we can model a large multi-dimensional design space with thousands to millions of design parameter combinations for about 10x the simulation time of a single conventional simulation run. We demonstrate our approach by exploring two different machines: a traditional MIPS-style in-order pipeline and the Intel Graphics Media Accelerator X3000.
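
    A toy illustration of simulating with symbolic latencies: per-operation costs are accumulated as expressions in the unknown unit latencies, so a single pass yields a formula that can then be evaluated for many configurations. The unpipelined cost model below is deliberately naive and is not the authors' simulator.

        # Toy symbolic simulation of a straight-line instruction sequence.
        import sympy as sp

        alu, mem, mul = sp.symbols("L_alu L_mem L_mul", positive=True)
        latency = {"add": alu, "load": mem, "store": mem, "mul": mul}

        program = ["load", "add", "mul", "add", "store", "load", "add"]
        total = sum(latency[op] for op in program)    # symbolic execution time
        print(total)                                  # 3*L_alu + 3*L_mem + L_mul

        # One formula, many design points, no re-simulation:
        print(total.subs({alu: 1, mem: 4, mul: 3}))   # 18
        print(total.subs({alu: 1, mem: 2, mul: 6}))   # 15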

  • Next-Generation Performance Counters: Towards Monitoring Over a Thousand Concurrent Events

    Page(s): 139 - 146

    We present a novel performance monitor architecture, implemented in the Blue Gene/P supercomputer. This performance monitor supports the tracking of a large number of concurrent events by using a hybrid counter architecture. The counters have their low-order data implemented in registers which are concurrently updated, while the high-order counter data is maintained in a dense SRAM array that is updated from the registers on a regular basis. The performance monitoring architecture includes support for per-event thresholding and fast event notification, using a two-phase interrupt-arming and triggering protocol. A first implementation provides 256 concurrent 64-bit counters, which offers up to a 64x increase in the number of counters compared to the performance monitors typically found in microprocessors today, and thereby dramatically expands the capabilities of counter-based performance tuning.
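
    An illustrative software model of the hybrid counter idea: a narrow register absorbs every event at full rate and is periodically folded into a wide backing count, which stands in for a word of the dense SRAM array. Field widths and the spill policy are assumptions.

        # Hybrid counter: fast low-order register + wide, lazily updated backing count.
        LOW_BITS = 12                              # width of the per-event register

        class HybridCounter:
            def __init__(self):
                self.low = 0                       # narrow register, updated every event
                self.high = 0                      # wide backing count ("SRAM" word)

            def event(self, n=1):
                self.low += n
                if self.low >> LOW_BITS:           # register about to overflow
                    self.spill()

            def spill(self):                       # also invoked by a periodic sweep
                self.high += self.low
                self.low = 0

            def read(self):
                return self.high + self.low        # full-width value

        c = HybridCounter()
        for _ in range(10_000):
            c.event()
        print(c.read())                            # 10000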

  • Configurational Workload Characterization

    Page(s): 147 - 156

    Although the best processor design for executing a specific workload does depend on the characteristics of the workload, it cannot be determined without factoring in the effect of the interdependencies between different architectural subcomponents. Consequently, workload characteristics alone do not provide an accurate indication of which workloads can perform close to optimally on the same architectural configuration. The primary goal of this paper is to demonstrate that, in the design of a heterogeneous CMP, reducing the set of essential benchmarks based on relative similarity in raw workload behavior may direct the design process towards options that result in sub-optimality of the ultimate design. It is shown that the design parameters of the customized processor configurations, what we refer to as the configurational characteristics, can yield a more accurate indication of the best way to partition the workload space for the cores of a heterogeneous system to be customized to. In order to automate the extraction of the configurational characteristics of workloads, a design exploration tool based on the Simplescalar timing simulator and the CACTI modeling tool is presented. Results from this tool are used to display how a systematic methodology can be employed to determine the optimal set of core configurations for a heterogeneous CMP under different design objectives. In addition, it is shown that reducing the set of workloads based on even a single widely documented benchmark similarity (between bzip and gzip) can lead to a slowdown in the overall performance of a heterogeneous-CMP design.

  • Characterizing the Unique and Diverse Behaviors in Existing and Emerging General-Purpose and Domain-Specific Benchmark Suites

    Page(s): 157 - 168

    Characterizing and understanding emerging workload behavior is of vital importance to ensure next generation microprocessors perform well on their anticipated future workloads. This paper compares a number of benchmark suites from emerging application domains, such as bio-informatics (BioPerf), biometrics (BioMetricsWorkload) and multimedia (MediaBench II), against general-purpose workloads represented by SPEC CPU2000 and CPU2006. Although these benchmark suites have been characterized before, prior work did not capture the benchmark suites' inherent (microarchitecture-independent) behavior, nor did it provide a phase-level characterization.

  • Independent Component Analysis and Evolutionary Algorithms for Building Representative Benchmark Subsets

    Page(s): 169 - 178

    This work addresses the problem of building representative subsets of benchmarks from an original large set of benchmarks, using statistical analysis techniques. The subsets should be developed so as to include only the information necessary for evaluating the performance of a computer system or application. The development of representative workloads is not a trivial procedure, since incorrectly selecting benchmarks for the representative subset can produce erroneous results. A number of statistical analysis techniques have been developed for identifying representative workloads. The goal of these approaches is to reduce the dimensionality of the original set of benchmarks prior to identifying similar benchmarks. In this work we propose a combination of independent component analysis (ICA) and an evolutionary algorithm (EA) as a more efficient way of reducing the computational complexity of the problem and the redundant information of the original set of benchmarks. Experimental results validate that the proposed technique generates more representative workloads than prior techniques.
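
    A toy (1+1) evolutionary search for a representative subset, under the simplifying assumption that "representative" means the subset's average metric vector matches the full suite's; the ICA dimensionality-reduction step used in the paper is omitted and all numbers are synthetic.

        # (1+1) EA: repeatedly mutate a k-subset and keep the child if it is no worse.
        import random
        random.seed(0)

        metrics = {f"bench{i}": [random.random() for _ in range(4)] for i in range(20)}
        names, K = list(metrics), 5

        def centroid(subset):
            return [sum(metrics[b][d] for b in subset) / len(subset) for d in range(4)]

        target = centroid(names)                   # profile of the full suite

        def error(subset):                         # drift of the subset's profile
            c = centroid(subset)
            return sum((c[d] - target[d]) ** 2 for d in range(4))

        current = random.sample(names, K)
        for _ in range(2000):
            child = current.copy()
            child[random.randrange(K)] = random.choice([b for b in names if b not in child])
            if error(child) <= error(current):
                current = child

        print(sorted(current), error(current))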

  • Characterization of SPEC CPU2006 and SPEC OMP2001: Regression Models and their Transferability

    Page(s): 179 - 190

    Analysis of workload execution and identification of software and hardware performance barriers provide critical engineering benefits; these include guidance on software optimization, hardware design tradeoffs, configuration tuning, and comparative assessments for platform selection. This paper uses model trees to build statistical regression models for the SPEC CPU2006 and the SPEC OMP2001 suites. These models link performance to key microarchitectural events. The models provide detailed recipes for identifying the key performance factors for each suite and for determining the contribution of each factor to performance. The paper discusses how the models can be used to understand the behaviors of the two suites on a modern processor. These models are applied to obtain a detailed performance characterization of each benchmark suite and its member workloads and to identify the commonalities and distinctions among the performance factors that affect each of the member workloads within the two suites. This paper also addresses the issue of model transferability. It explores the question: How useful is an existing performance model (built on a given suite of workloads) to study the performance of different workloads or suites of workloads? A performance model built using data from workload suite P is considered transferable to workload suite Q if it can be used to accurately study the performance of workload suite Q. Statistical methodologies to assess model transferability are discussed. In particular, the paper explores the use of two-sample hypothesis tests and prediction accuracy analysis techniques to assess model transferability. It is found that a model trained using only 10% of the SPEC CPU2006 data is transferable to the remaining data. This finding holds also for SPEC OMP2001. In contrast, it is found that the SPEC CPU2006 model is not transferable to SPEC OMP2001 and vice versa.
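
    A sketch of a transferability check in the spirit described above: fit a regression model on a small fraction of one suite's samples, then compare its prediction error on that suite's held-out samples against its error on a different suite. A plain decision-tree regressor stands in for the paper's model trees, and all data is synthetic.

        # Train on 10% of "suite P", test on the rest of P and on "suite Q".
        import numpy as np
        from sklearn.tree import DecisionTreeRegressor

        rng = np.random.default_rng(0)

        def make_suite(n, coeffs):
            X = rng.random((n, 3))                          # e.g. cache/branch/TLB event rates
            y = X @ coeffs + 0.01 * rng.standard_normal(n)  # CPI-like response
            return X, y

        Xp, yp = make_suite(1000, np.array([2.0, 0.5, 1.0]))   # suite P
        Xq, yq = make_suite(1000, np.array([0.2, 3.0, 0.1]))   # suite Q, different behavior

        model = DecisionTreeRegressor(max_depth=5).fit(Xp[:100], yp[:100])

        def mae(X, y):
            return float(np.mean(np.abs(model.predict(X) - y)))

        print("P held-out error:", mae(Xp[100:], yp[100:]))    # small: transferable within P
        print("Q error:         ", mae(Xq, yq))                # larger: not transferable to Q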

  • Dynamic Thermal Management through Task Scheduling

    Page(s): 191 - 201

    The evolution of microprocessors has been hindered by their increasing power consumption and the rate of on-die heat generation. High temperature impairs the processor's reliability and reduces its lifetime. While hardware-level dynamic thermal management (DTM) techniques, such as voltage and frequency scaling, can effectively lower the chip temperature when it surpasses the thermal threshold, they inevitably come at the cost of performance degradation. We propose an OS-level technique that performs thermal-aware job scheduling to reduce the number of thermal trespasses. Our scheduler reduces the number of hardware DTMs and achieves higher performance while keeping the temperature low. Our methods leverage the natural discrepancies in thermal behavior among different workloads, and schedule them to keep the chip temperature below a given budget. We develop a heuristic algorithm based on the observation that there is a difference in the resulting temperature when a hot and a cool job are executed in a different order. To evaluate our scheduling algorithms, we developed a lightweight runtime temperature monitor to enable informed scheduling decisions. We have implemented our scheduling algorithm and the entire temperature monitoring framework in the Linux kernel. Our proposed scheduler can remove 10.5-73.6% of the hardware DTMs in various combinations of workloads in a medium thermal environment. As a result, the CPU throughput was improved by up to 7.6% (4.1% on average) even under a severe thermal environment.
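
    An illustrative heuristic based on the ordering observation above: interleave hot and cool jobs so the core can dissipate heat between hot jobs. The thermal model and numbers are toy assumptions, not the authors' scheduler or its Linux implementation.

        # Compare peak temperature of a naive submission order vs. hot/cool interleaving.
        def thermal_order(jobs):
            """jobs: list of (name, heat contribution); alternate hottest and coolest."""
            ranked = sorted(jobs, key=lambda j: j[1], reverse=True)
            order, lo, hi, take_hot = [], 0, len(ranked) - 1, True
            while lo <= hi:
                if take_hot:
                    order.append(ranked[lo]); lo += 1
                else:
                    order.append(ranked[hi]); hi -= 1
                take_hot = not take_hot
            return order

        def peak_temp(order, ambient=40.0, cool_rate=0.5):
            """Toy model: each job adds its heat, then the core cools halfway to ambient."""
            temp = peak = ambient
            for _, heat in order:
                temp += heat
                peak = max(peak, temp)
                temp = ambient + (temp - ambient) * cool_rate
            return peak

        jobs = [("gcc", 12), ("bzip2", 10), ("mcf", 3), ("art", 4)]
        print(peak_temp(jobs), "naive order")                  # 56.0
        print(peak_temp(thermal_order(jobs)), "interleaved")   # 54.5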

  • Metrics for Architecture-Level Lifetime Reliability Analysis

    Page(s): 202 - 212

    This work concerns metrics for evaluating microarchitectural enhancements to improve processor lifetime reliability. A commonly reported reliability metric is mean time to failure (MTTF). Although the MTTF metric is simpler to evaluate, it does not provide information on the reliability characteristics during the relatively short operational life of commodity processors. An alternate metric is nTTF, which represents the time to failure of n% of the processor population. nTTF is a more informative metric for the (short) portion of the lifetime that is relevant to the end-user, but determining it requires knowledge of the distribution of processor failure times, which is generally hard to obtain. The goals of this paper are (1) to determine if the choice of metric has a quantitative impact on architecture-level reliability analysis and modern superscalar processor designs and (2) to build a fundamental understanding of why and when MTTF- and nTTF-driven analyses result in different designs. We show through an in-depth analysis that, in general, the nTTF metric differs significantly from the MTTF metric, and using MTTF as a proxy for nTTF leads to sub-optimal designs. Additionally, our analysis introduces the concept of a relative vulnerability factor (RVF) for different processor components to guide reliability-aware design. We show that the difference between nTTF- and MTTF-driven design largely occurs because the relative vulnerabilities of the processor components change over the processor lifetime, making the optimal design choice dependent on the amount of time the processor is expected to be used.
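
    A toy numerical contrast between the two metrics: MTTF is the mean failure time, while nTTF is the time by which n% of the population has failed; with an early-failure-heavy lifetime distribution the two diverge sharply. The Weibull parameters are invented, only the contrast matters.

        # Sampled failure times; shape < 1 gives many early ("infant mortality") failures.
        import random
        random.seed(1)

        samples = sorted(random.weibullvariate(12.0, 0.7) for _ in range(100_000))

        mttf = sum(samples) / len(samples)

        def nttf(n_percent):                       # time by which n% of parts have failed
            return samples[int(len(samples) * n_percent / 100)]

        print(f"MTTF  = {mttf:.1f} years")
        print(f"1TTF  = {nttf(1):.2f} years")      # far shorter than MTTF suggests
        print(f"10TTF = {nttf(10):.2f} years")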
