Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352)

15-18 June 1999

Displaying Results 1 - 25 of 47
  • Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352)

    Publication Year: 1999
    Freely Available from IEEE
  • Algorithm based fault tolerance versus result-checking for matrix computations

    Publication Year: 1999, Page(s):4 - 11
    Cited by:  Papers (15)

    Algorithm Based Fault Tolerance (ABFT) is the collective name of a set of techniques used to determine the correctness of some mathematical calculations. A less well known alternative is called Result Checking (RC) where, contrary to ABFT, results are checked without knowledge of the particular algorithm used to calculate them. In this paper a comparison is made between the two using some practica...
  • An algorithm based error detection scheme for the multigrid algorithm

    Publication Year: 1999, Page(s):12 - 19
    Cited by:  Papers (2)

    In this paper an Algorithm Based Error Detection (ABED) scheme is applied to the multigrid algorithm which provides an iterative solution to a system of linear algebraic equations resulting from a finite difference discretization of a Poisson equation. Invariants are created to implement checking in the relaxation, restriction and interpolation operators. Modifications to invariants due to roundof...
  • MetaKernels and fault containment wrappers

    Publication Year: 1999, Page(s):22 - 29
    Cited by:  Papers (19)  |  Patents (1)

    This paper addresses the problem of using COTS microkernels in dependable systems. Because they are not developed with this aim, their behavior in the presence of faults is a main concern to system designers. We propose a novel approach to contain the effect of both external and internal faults that may affect their behavior. As microkernels can be decomposed into simple components, modeling of th...
  • Comparing the robustness of POSIX operating systems

    Publication Year: 1999, Page(s):30 - 37
    Cited by:  Papers (66)

    Critical system designers are turning to off-the-shelf operating system (OS) software to reduce costs and time-to-market. Unfortunately, general-purpose OSes do not always respond to exceptional conditions robustly, either accepting exceptional values without complaint, or suffering abnormal task termination. Even though direct measurement is impractical, this paper uses a multiversion comparison t...
  • Multiprocessor architecture using an audit trail for fault tolerance

    Publication Year: 1999, Page(s):40 - 47
    Cited by:  Papers (3)

    In order to deploy a tightly-coupled multiprocessor (TCMP) in the commercial world, the TCMP must be fault tolerant. Researchers have designed various checkpointing algorithms to implement fault tolerance in a TCMP. To date, these algorithms fall into 2 principal classes, where processors can be checkpoint dependent on each other. We introduce a new apparatus and algorithm that represents a 3rd cl...
  • Egida: an extensible toolkit for low-overhead fault-tolerance

    Publication Year: 1999, Page(s):48 - 55
    Cited by:  Papers (23)

    We discuss the design and implementation of Egida, an object-oriented toolkit designed to support transparent rollback-recovery. Egida exports a simple specification language that can be used to express arbitrary rollback recovery protocols. From this specification, Egida automatically synthesizes an implementation of the specified protocol by gluing together the appropriate objects from an availa...
  • Incremental messages: micro-kernel services for flexible and efficient management of replicated data

    Publication Year: 1999, Page(s):56 - 63
    Cited by:  Papers (2)

    Incremental messages have been designed to efficiently and flexibly manage replicated copies of critical data. We describe this new type of message and show how applications benefit from an extended message passing interface which provides support for creating, updating, and recovering data copies, combining the advantages of both kernel and user-level approaches. The paper describes the interface ...
  • Rigorous development of a safety-critical system based on coordinated atomic actions

    Publication Year: 1999, Page(s):68 - 75
    Cited by:  Papers (10)

    This paper describes our experience using coordinated atomic (CA) actions as a system structuring tool to design and validate a sophisticated control system for a complex industrial application that has high reliability and safety requirements. Our study is based on the "Fault-Tolerant Production Cell", which represents a manufacturing process involving redundant mechanical devices (provided in or...
  • The systematic improvement of fault tolerance in the Rio file cache

    Publication Year: 1999, Page(s):76 - 83
    Cited by:  Papers (11)

    Fault injection is typically used to characterize failures and to validate and compare fault-tolerant mechanisms. However, fault injection is rarely used for all these purposes to guide the design and implementation of a fault-tolerant system. We present a systematic and quantitative approach for using software-implemented fault injection to guide the design and implementation of a fault-tolerant s...
  • AR-SMT: a microarchitectural approach to fault tolerance in microprocessors

    Publication Year: 1999, Page(s):84 - 91
    Cited by:  Papers (208)  |  Patents (12)

    This paper speculates that technology trends pose new challenges for fault tolerance in microprocessors. Specifically, severely reduced design tolerances implied by gigahertz clock rates may result in frequent and arbitrary transient faults. We suggest that existing fault-tolerant techniques (system-level, gate-level, or component-specific approaches) are either too costly for general purpose computi...
  • Routing and wavelength assignment for establishing dependable connections in WDM networks

    Publication Year: 1999, Page(s):94 - 101
    Cited by:  Papers (9)

    This paper considers the problem of establishing dependable connections (D-connections) for fast failure recovery in wavelength-routed wavelength-division multiplexed (WDM) networks with dynamic traffic demand. A D-connection is realized by a primary lightpath and one or more backup lightpaths. Real time applications usually require timeliness and fault-tolerance. It is not a difficult task to gua...
  • Interference robust TCP

    Publication Year: 1999, Page(s):102 - 109

    With the exception of the fast retransmit and recovery algorithms, transmission control protocol (TCP) assumes congestion to be the only source of packet loss. When wireless networks experience packet loss due to interference or any other error, congestion control algorithms in TCP are triggered. Unnecessary and incorrect usage of congestion control algorithms results in a high performance penalty....
  • Effect of failures on optimal location management algorithms

    Publication Year: 1999, Page(s):110 - 117
    Cited by:  Papers (1)

    The current IS-41 standard PCS architecture uses a centralized database, the Home Location Register (HLR), to store service and location information of each mobile registered in the PCS network. However, this is a single point of failure; therefore, distributed database architectures have been proposed as possible architectures for future mobile networks. The performance of such architectures improv...
  • COTS hardware and software in high-availability systems

    Publication Year: 1999, Page(s):120 - 121
  • Efficient network-flow based techniques for dynamic fault reconfiguration in FPGAs

    Publication Year: 1999, Page(s):122 - 129
    Cited by:  Papers (8)

    In this paper we consider a "dynamic" node covering framework for incorporating fault tolerance in SRAM-based segmented array FPGAs with spare row(s) and/or column(s) of cells. Two types of designs are considered: one that can support only node-disjoint (and hence nonintersecting) rectilinear reconfiguration paths, and the other that can support edge-disjoint (and hence possibly intersecting) rec...
  • Two-step algorithms for maximal diagnosis of wiring interconnects

    Publication Year: 1999, Page(s):130 - 137
    Cited by:  Papers (2)

    We give two algorithms for maximal diagnosis of wiring networks without repair under a general fault model. Maximal diagnosis consists of identifying all diagnosable faults under the assumptions that each net can have multiple drivers and receivers and can be affected by any number of short and open faults. This process is equivalent to verifying all connections between inputs and outputs. Matrice...
  • Fault identification algorithmic: a new formal approach

    Publication Year: 1999, Page(s):138 - 145
    Cited by:  Papers (8)

    Much research has been devoted to system-level diagnosis. Two issues have been addressed. The first of these is diagnosability. The second is the design of fault identification algorithms. This paper focuses on the second of these concerns. This paper investigates the process of fault identification itself, introduces a new formal approach, and proposes a fault identification algorithm which runs ...
  • A fault tolerance framework for CORBA

    Publication Year: 1999, Page(s):150 - 157
    Cited by:  Papers (19)  |  Patents (1)

    We describe a fault tolerance framework for CORBA that provides fault tolerance management and core services, implemented above the ORB for ease of use and customization, and fault tolerance mechanisms, implemented beneath the ORB for transparency and efficiency. Strong replica consistency is facilitated by a multicast engine that provides reliable totally ordered delivery of multicast messages to...
  • The performance of database replication with group multicast

    Publication Year: 1999, Page(s):158 - 165
    Cited by:  Papers (16)  |  Patents (3)

    Replication with update-anywhere capability while maintaining global synchronization and isolation has long been thought impractical. Protocols have been proposed for distributed replicated databases that take advantage of atomic broadcast systems to simplify message passing and conflict resolution in hopes of making replication efficient. This paper presents performance measurements on a simulati...
  • Design, implementation and performance evaluation of a CORBA group communication service

    Publication Year: 1999, Page(s):166 - 173
    Cited by:  Papers (3)

    This paper describes the design, implementation, and performance evaluation of a CORBA group communication service. It also evaluates the effect of CORBA in implementing a group communication service. The main conclusion is that CORBA can be used to implement group communication services and thereby achieve interoperability in a heterogeneous computing environment. However, there is a substantial ...
  • An execution service for a partitionable low bandwidth network

    Publication Year: 1999, Page(s):174 - 181

    As the amount of scientific data grows to the point where the Internet bandwidth no longer supports its transfer, it becomes necessary to make powerful computational services available near data repositories. Such services allow remote researchers to start long-running parallel computations on the data. Current execution services do not provide remote users with adequate management facilities for t...
  • Automatic design of optimal concurrent fault detector for linear analog systems

    Publication Year: 1999, Page(s):184 - 191
    Cited by:  Papers (6)  |  Patents (2)

    This paper presents a generalized strategy for optimal design of an on-line integrated fault detector for linear analog systems. The method consists in processing the available node voltage signals to provide a residual signal that carries information about the faults. Contrary to previously proposed techniques dealing only with the particular case of state variable systems, the use of extra circuitr...
  • Some transmission time analysis for the parallel asynchronous communication scheme

    Publication Year: 1999, Page(s):192 - 199

    In asynchronous buses, the sender encodes a data word with a code word from an unordered code and transmits it on the parallel bus lines. In this paper some transmission time analyses for the above parallel asynchronous communication scheme are presented. It is proved that the average transmission time for a code word is a strictly increasing function of the weight, w, of the code word and it appr...
  • Fault tolerance, channel coding and arithmetic source coding combined

    Publication Year: 1999, Page(s):200 - 207
    Cited by:  Papers (3)

    A complete source and channel coding system is protected from both channel errors and errors emanating from internal hardware failures by introducing redundancy in the source encoding and decoding procedures as well as by frequently inserting parity symbols generated by a burst-detecting convolutional code. The combined protected system can detect errors in any significant subsystems whether from ...
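
The abstract of "Algorithm based fault tolerance versus result-checking for matrix computations" in the listing above contrasts two ways of checking a calculation: Result Checking looks only at inputs and outputs, while ABFT builds the check into the computation. The sketch below illustrates that distinction for integer matrix multiplication. It is an illustration under assumptions, not the method evaluated in the paper: the result check shown is a Freivalds-style randomized test, the ABFT variant is a row/column checksum scheme, and all function names are hypothetical. Exact integer arithmetic is assumed; floating-point data would need a tolerance instead of equality.

```python
import random


def matmul(a, b):
    """Plain triple-loop product; stands in for any multiplication routine."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def result_check(a, b, c, rounds=20, rng=random):
    """Result checking (assumed Freivalds-style test): verify A(Bx) == Cx
    for random 0/1 vectors x. Only the inputs and the claimed result are
    used; nothing is assumed about how c was computed."""
    cols = len(b[0])
    for _ in range(rounds):
        x = [rng.randint(0, 1) for _ in range(cols)]
        bx = [sum(row[j] * x[j] for j in range(cols)) for row in b]
        abx = [sum(row[k] * bx[k] for k in range(len(bx))) for row in a]
        cx = [sum(row[j] * x[j] for j in range(cols)) for row in c]
        if abx != cx:
            return False  # definitely wrong
    return True  # correct with high probability


def abft_multiply(a, b):
    """ABFT-style multiply (assumed checksum scheme): append a column-sum
    row to A and a row-sum column to B, so the product carries its own
    checksums in its last row and last column."""
    a_cs = a + [[sum(col) for col in zip(*a)]]
    b_rs = [row + [sum(row)] for row in b]
    return matmul(a_cs, b_rs)


def abft_verify(full):
    """Check that every row and column of the checksum product is
    consistent with its embedded checksum entry."""
    n, p = len(full) - 1, len(full[0]) - 1
    rows_ok = all(sum(row[:p]) == row[p] for row in full)
    cols_ok = all(sum(full[i][j] for i in range(n)) == full[n][j]
                  for j in range(p + 1))
    return rows_ok and cols_ok


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(result_check(a, b, matmul(a, b)))
    print(abft_verify(abft_multiply(a, b)))
```

The design difference is visible in the signatures: `result_check` can be applied to the output of any multiplication routine, whereas `abft_multiply` changes the computation itself so that a corrupted entry breaks a checksum in the result.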