
2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN 2008)

Date: 24-27 June 2008


Displaying Results 1 - 25 of 78
  • Message from the conference general chair and coordinator

    Page(s): i
  • Message from the DCCS program chair

    Page(s): ii
  • Message from the PDS program chair

    Page(s): iii
  • William C. Carter Award

    Page(s): iv
  • DSN 2008 organizers

    Page(s): v
  • DSN 2008 Steering Committee

    Page(s): vi
  • DCCS 2008 Program Committee

    Page(s): vii - viii
  • DCCS reviewers

    Page(s): ix - x
  • PDS 2008 Program Committee

    Page(s): xi
  • PDS reviewers

    Page(s): xii
  • DSN 2008 subcommittees

    Page(s): xiii
  • Citation information

    Page(s): xiv
  • Dependability on the last frontier

    Page(s): xv
  • Automotive Enhanced Vision Systems - Issues arising from placing users in the loop

    Page(s): xvi
  • Coverage of a microarchitecture-level fault check regimen in a superscalar processor

    Page(s): 1 - 10

    Conventional processor fault tolerance based on time/space redundancy is robust but prohibitively expensive for commodity processors. This paper explores an unconventional approach to designing a cost-effective fault-tolerant superscalar processor. The idea is to engage a regimen of microarchitecture-level fault checks. A few simple microarchitecture-level fault checks can detect many arbitrary faults in large units, by observing microarchitecture-level behavior and anomalies in this behavior. Previously, we separately proposed checks for the fetch and decode stages, rename stage, and issue stage of a contemporary superscalar processor. While each piece hinted at the possibility of a complete regimen - for an overall fault-tolerant superscalar processor - this totality was not explored. This paper provides the culmination by building a full regimen into a superscalar processor. We show for the first time that the regimen-based approach provides substantial coverage of an entire superscalar processor. Analysis reveals vulnerable areas which should be the focus for regimen additions.

  • Analysis and solutions to issue queue process variation

    Page(s): 11 - 21

    The last few years have witnessed an unprecedented explosion in transistor densities. Diminutive feature sizes have enabled microprocessor designers to break the billion-transistors-per-chip mark. However, various new reliability challenges such as process variation (PV) have emerged that can no longer be ignored by chip designers. In this paper, we provide a comprehensive analysis of the effects of PV on the microprocessor's issue queue. Variations can slow down issue queue entries and result in as much as 20.5% performance degradation. To counter this, we look at different solutions that include instruction steering and operand- and port-switching mechanisms. Given that PV is non-deterministic at design time, our mechanisms allow the fast and slow issue-queue entries to co-exist, in turn enabling instruction dispatch, issue, and forwarding to proceed with minimal stalls. Evaluation on a detailed simulation environment indicates that the proposed mechanisms can reduce performance degradation due to PV to a low 1.3%.

  • Trace-based microarchitecture-level diagnosis of permanent hardware faults

    Page(s): 22 - 31

    As devices continue to scale, future shipped hardware will likely fail due to in-the-field hardware faults. As traditional redundancy-based hardware reliability solutions that tackle these faults will be too expensive to be broadly deployable, recent research has focused on low-overhead reliability solutions. One approach is to employ low-overhead ("always-on") detection techniques that catch high-level symptoms and pay a higher overhead for (rarely invoked) diagnosis. This paper presents trace-based fault diagnosis, a diagnosis strategy that identifies permanent faults in microarchitectural units by analyzing the faulty core's instruction trace. Once a fault is detected, the faulty core is rolled back and re-executes from a previous checkpoint, generating a faulty instruction trace and recording the microarchitecture-level resource usage. A diagnosis process on another, fault-free core then generates a fault-free trace, which it compares with the faulty trace to identify the faulty unit. Our results show that this approach successfully diagnoses 98% of the faults studied and is a highly robust and flexible way of diagnosing permanent faults.

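The rollback-and-compare step at the heart of this diagnosis strategy can be sketched in a few lines. This is an illustration only, not the paper's implementation: the trace entries below are invented (instruction, result) pairs, and `first_divergence` is a hypothetical helper name.

```python
# Sketch: locate where a faulty core's re-executed trace first diverges
# from a fault-free core's trace of the same checkpointed region.
def first_divergence(faulty_trace, golden_trace):
    """Return (index, faulty_entry, golden_entry) at the first mismatch,
    or None if the traces agree entry-for-entry."""
    for i, (f, g) in enumerate(zip(faulty_trace, golden_trace)):
        if f != g:
            return i, f, g
    return None

# Invented traces: (instruction text, destination value) per retired instruction.
golden = [("add r1,r2,r3", 7), ("mul r4,r1,r1", 49), ("sub r5,r4,r1", 42)]
faulty = [("add r1,r2,r3", 7), ("mul r4,r1,r1", 48), ("sub r5,r4,r1", 41)]

div = first_divergence(faulty, golden)  # first mismatch implicates the unit
                                        # that produced that result
```

In the paper's scheme the divergent entry, together with the recorded resource usage, is what points at the faulty microarchitectural unit; this sketch only shows the comparison itself.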
  • A study of cognitive resilience in a JPEG compressor

    Page(s): 32 - 41

    Many classes of applications are inherently tolerant to errors. One such class is applications designed for a human end user, where the capabilities of the human cognitive system (cognitive resilience) may compensate for some of the errors produced by the application. We present a methodology to automatically distinguish between tolerable errors in imaging applications, which can be handled by the human cognitive system, and severe errors, which are perceptible to a human end user. We also introduce an approach to identify non-critical spots in a hardware circuit that need not be hardened against soft errors, because errors that occur at these spots are tolerable. We demonstrate that over 50% of the flip-flops in a JPEG compressor chip are non-critical and require no hardening.

  • Scheduling algorithms for unpredictably heterogeneous CMP architectures

    Page(s): 42 - 51

    In future large-scale multi-core microprocessors, hard errors and process variations will create dynamic heterogeneity, causing performance and power characteristics to differ among the cores in an unanticipated manner. Under this scenario, naive assignments of applications to cores degraded by various faults and variations may result in large performance losses and power inefficiencies. We propose scheduling algorithms based on the Hungarian Algorithm and artificial intelligence (AI) search techniques that account for this future uncertainty in core characteristics. These thread assignment policies effectively match the capabilities of each degraded core with the requirements of the applications, achieving an ED² only 3.2% and 3.7% higher, respectively, than a baseline eight-core chip multiprocessor with no degradation, compared to over 22% for a round-robin policy.

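The Hungarian Algorithm mentioned above solves a minimum-cost assignment problem: match each thread to exactly one core so the total cost is minimized. The sketch below illustrates the problem, not the paper's scheduler; for a small core count it uses brute-force search instead of the Hungarian Algorithm proper, and the cost matrix is an invented set of per-(thread, core) ED² estimates.

```python
from itertools import permutations

def best_assignment(cost):
    """Brute-force minimum-cost one-to-one assignment of threads (rows)
    to cores (columns). Feasible only for small n; the Hungarian
    Algorithm solves the same problem in polynomial time."""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        c = sum(cost[t][perm[t]] for t in range(n))
        if c < best_cost:
            best_perm, best_cost = perm, c
    return best_perm, best_cost

# Hypothetical cost matrix: cost[t][c] = estimated ED^2 of thread t on core c,
# where some cores are degraded for some workloads.
cost = [
    [4.0, 2.0, 9.0],
    [3.0, 7.0, 5.0],
    [8.0, 6.0, 1.0],
]
assignment, total = best_assignment(cost)
# assignment maps thread index -> chosen core index
```

A round-robin policy ignores the cost matrix entirely, which is why it fares so much worse on degraded chips.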
  • Automated duplicate detection for bug tracking systems

    Page(s): 52 - 61

    Bug tracking systems are important tools that guide the maintenance activities of software developers. The utility of these systems is hampered by an excessive number of duplicate bug reports; in some projects as many as a quarter of all reports are duplicates. Developers must manually identify duplicate bug reports, but this identification process is time-consuming and exacerbates the already high cost of software maintenance. We propose a system that automatically classifies duplicate bug reports as they arrive to save developer time. This system uses surface features, textual semantics, and graph clustering to predict duplicate status. Using a dataset of 29,000 bug reports from the Mozilla project, we perform experiments that include a simulation of a real-time bug reporting environment. Our system is able to reduce development cost by filtering out 8% of duplicate bug reports while allowing at least one report for each real defect to reach developers.

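The surface-feature part of such a classifier can be sketched with plain word-count cosine similarity. This is a minimal illustration under invented report texts and a hypothetical threshold, not the paper's system, which additionally uses textual semantics and graph clustering.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two texts over bag-of-words counts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def flag_duplicates(new_report, existing, threshold=0.5):
    """Return existing reports similar enough to new_report to be
    candidate duplicates (threshold is an invented tuning knob)."""
    return [r for r in existing if cosine(new_report, r) >= threshold]

# Invented report summaries.
existing = [
    "crash when opening large attachment",
    "toolbar icons misaligned after resize",
]
dups = flag_duplicates("browser crash opening large attachment", existing)
```

A real deployment would tune the threshold so that, as in the paper's goal, filtering never suppresses the only report of a genuine defect.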
  • Validation of hardware error recovery mechanisms for the SPARC64 V microprocessor

    Page(s): 62 - 69

    The SPARC64 V microprocessor is designed for use in high-reliability, large-scale UNIX servers. In addition to implementing ECC for large SRAM arrays, the SPARC64 V microprocessor incorporates error detection and recovery mechanisms for processor logic circuits and smaller SRAM arrays. The effectiveness of these error recovery mechanisms was validated via accelerated neutron testing of Fujitsu's commercial UNIX server, the PRIMEPOWER 650. Soft errors generated in SRAM arrays were completely recovered by the implemented hardware mechanisms, and only 6.4% of the estimated neutron-induced logic circuit faults manifested as errors, 76% of which were recovered by hardware. From these tests, the soft error failure rate of the SPARC64 V microprocessor due to atmospheric neutron hits was confirmed to be well below 10 FIT.

  • Using likely program invariants to detect hardware errors

    Page(s): 70 - 79

    In the near future, hardware is expected to become increasingly vulnerable to faults due to continuously decreasing feature size. Software-level symptoms have previously been used to detect permanent hardware faults. However, they cannot detect a small fraction of faults, which may lead to silent data corruptions (SDCs). In this paper, we present a system that uses likely invariants to improve the coverage and latency of existing detection techniques for permanent faults. The basic idea is to use training inputs to create likely invariants based on value ranges of selected program variables and then use them to identify faults at runtime. Likely invariants, however, can have false positives, which makes them challenging to use for permanent faults. We use our on-line diagnosis framework to detect false positives at runtime and limit their number to keep the associated overhead minimal. Experimental results using microarchitecture-level fault injections in full-system simulation show a 28.6% reduction in the number of undetected faults and a 74.2% reduction in the number of SDCs over existing techniques, with reasonable overhead for checking code.

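The range-based likely-invariant idea can be shown in miniature. This is a sketch under invented training values, not the paper's tooling: learn a [min, max] bound for a variable from fault-free training runs, then treat an out-of-range value at runtime as a possible fault symptom (which may also be a false positive needing diagnosis).

```python
def train_invariant(observations):
    """Derive a likely [lo, hi] range invariant from values seen in
    fault-free training runs."""
    return min(observations), max(observations)

def check(value, invariant):
    """True if the value satisfies the invariant; False flags an anomaly
    (a real fault symptom or a false positive)."""
    lo, hi = invariant
    return lo <= value <= hi

# Hypothetical training-run values for one monitored program variable.
inv = train_invariant([12, 40, 27, 33])

in_range = check(30, inv)      # consistent with training behavior
suspect = check(500, inv)      # far outside the learned range: diagnose
```

Because the bounds come from a finite training set, legitimate values outside [12, 40] can occur; that is exactly why the paper pairs the invariants with on-line false-positive detection.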
  • Detouring: Translating software to circumvent hard faults in simple cores

    Page(s): 80 - 89

    CMOS technology trends are leading to an increasing incidence of hard (permanent) faults in processors. These faults may be introduced at fabrication or occur in the field. Whereas high-performance processor cores have enough redundancy to tolerate many of these faults, the simple, low-power cores that are attractive for multicore chips do not. We propose Detouring, a software-based scheme for tolerating hard faults in simple cores. The key idea is to automatically modify software such that its functionality is unchanged but it does not use any of the faulty hardware. Our initial implementation of Detouring tolerates hard faults in several hardware components, including the instruction cache, registers, functional units, and the operand bypass network. Detouring has no hardware cost and no performance overhead for fault-free cores.

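For the register case, the key idea reduces to rewriting code so a faulty register is never named. The toy below is only a gesture at that idea under an invented assembly syntax; a real translator (like the paper's) works on structured instructions, respects liveness, and handles far more than registers.

```python
def detour(instructions, faulty_reg, spare_reg):
    """Rewrite a textual instruction listing so that faulty_reg is never
    used, substituting spare_reg instead. Naive token replacement:
    fine for this toy syntax, too crude for a real ISA (e.g. it would
    also match 'r30' when retargeting 'r3')."""
    return [ins.replace(faulty_reg, spare_reg) for ins in instructions]

# Invented program: r3 is the faulty register, r7 is a known-good spare.
prog = ["add r3, r1, r2", "mul r4, r3, r3"]
patched = detour(prog, "r3", "r7")
# The dataflow through the old r3 now flows through r7; functionality
# is preserved while the faulty register is never touched.
```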
  • Towards assessing the security of DBMS configurations

    Page(s): 90 - 95

    Database management systems (DBMS) have a long tradition of high security. Several mechanisms needed to protect data have been proposed and consolidated in the database arena. However, the effectiveness of those mechanisms is very dependent on the actual configuration chosen by the database administrator. Tuning a large database is quite complex, and achieving high security is a very difficult task that requires a lot of expertise and continuous, proactive work. In this paper we analyze the security best practices behind the many configuration options available in several well-known DBMS. These security best practices are then generalized so as to be applicable to practically any DBMS available today. Finally, we use these best practices to define a set of configuration tests, which have been successfully used to evaluate four real database installations based on four well-known and widely used DBMS.

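A configuration test in this spirit is just a predicate over a setting. The sketch below uses invented setting names and thresholds to show the shape of such a test suite; it is not the paper's actual test set, which is derived from DBMS-specific best practices.

```python
# Hypothetical best-practice predicates, keyed by setting name.
BEST_PRACTICES = {
    "remote_root_login": lambda v: v == "disabled",
    "password_min_length": lambda v: v >= 12,
    "audit_logging": lambda v: v == "enabled",
}

def run_config_tests(config):
    """Return the names of settings that are missing or violate their
    best-practice predicate in a given installation snapshot."""
    return [name for name, check in BEST_PRACTICES.items()
            if name not in config or not check(config[name])]

# Invented snapshot of one installation's security-relevant settings.
config = {"remote_root_login": "enabled", "password_min_length": 14}
failures = run_config_tests(config)
```

Generalizing the predicates (rather than the setting names, which vary per DBMS) is what makes such a suite portable across products, which is the paper's point.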
  • Experiences with formal specification of fault-tolerant file systems

    Page(s): 96 - 101

    Fault-tolerant, replicated file systems are a crucial component of today's data centers. Despite their huge complexity, these systems are typically specified only in brief prose, which makes them difficult to reason about or verify. This paper describes the authors' experience using formal methods to improve our understanding of and confidence in the behavior of replicated file systems. We wrote formal specifications for three real-world fault-tolerant file systems and used them to: (1) expose design similarities and differences; (2) clarify and mechanically verify consistency properties; and (3) evaluate design alternatives. Our experience showed that formal specifications for these systems were easy to produce, useful for a deep understanding of system functions, and valuable for system comparison.
