Proceedings Pacific Rim International Symposium on Fault-Tolerant Systems

15-16 Dec. 1997

  • Proceedings Pacific Rim International Symposium on Fault-Tolerant Systems

    Publication Year: 1997
    Freely Available from IEEE
  • Index of authors

    Publication Year: 1997, Page(s): 243
    Freely Available from IEEE
  • Design of a fault-tolerant microprocessor: a simulation approach

    Publication Year: 1997, Page(s):161 - 166
    Cited by:  Papers (1)

    This paper presents an approach for assessing the merits and the cost of incorporating processor-level error detection and recovery mechanisms. The approach is exemplified by implementing several fault-tolerant mechanisms into a 32-bit, MIPS R3000-compatible, RISC microprocessor and conducting simulation-based fault injection experiments. The mechanisms are triple modular redundancy (TMR), retry o...
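
    As a rough illustration of the TMR mechanism named in this abstract, the sketch below shows a generic bitwise majority voter over three replica outputs. It is not taken from the paper: the function name `tmr_vote` and the 8-bit example values are assumptions chosen for illustration only.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three replica outputs.

    Each result bit is 1 when at least two of the three inputs have that bit set,
    so a fault confined to a single replica is masked.
    """
    return (a & b) | (a & c) | (b & c)


# Hypothetical example: one replica suffers a single bit flip.
good = 0b1011_0010
faulty = good ^ 0b0000_1000  # bit 3 flipped in one replica only
voted = tmr_vote(good, faulty, good)
```

    The voter masks any error confined to one replica; if two replicas fail in the same bit position, the majority is wrong, which is why fault injection experiments are needed to estimate coverage.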
  • Fault handling mechanisms in the RETHER protocol

    Publication Year: 1997, Page(s):153 - 159
    Cited by:  Papers (3)

    RETHER is a software-driven token-passing protocol designed to provide bandwidth guarantee for real-time multimedia applications over off-the-shelf Ethernet hardware. To our knowledge, it is the first all-software and fully-implemented real-time protocol on top of commodity Ethernet hardware. Because token passing is used to regulate network accesses, node crashes and/or packet corruption may lead...
  • Fault coverage estimation model for partially testable multichip modules

    Publication Year: 1997, Page(s):72 - 77

    This paper proposes a simple and efficient model for designers to estimate fault coverage for partially testable MCMs. This model relates fault coverage, test methodology, and the ratio and distribution of DFT dies (dies with design for testability features) in an MCM. Experimental results show that our model can efficiently predict the fault coverage of a partially testable MCM with less than 5% ...
  • Checkpointing Message-Passing Interface (MPI) parallel programs

    Publication Year: 1997, Page(s):147 - 152
    Cited by:  Papers (3)

    Many scientific problems can be distributed on a large number of processes to take advantage of low cost workstations. In a parallel system, a failure on any processor can halt the computation and require restarting all applications. Checkpointing is a simple technique to recover the failed execution. Message Passing Interface (MPI) is a standard proposed for writing portable message-passing par...
  • Time-lag duplexing - a fault tolerance technique for online transaction processing systems

    Publication Year: 1997, Page(s):202 - 207
    Cited by:  Papers (2)

    In this paper the concept of time-lag duplexing is proposed to achieve fault tolerance. Time-lag duplexing incorporates time and component redundancy to provide for transient errors both easy error recovery and tolerance against errors in common irredundant components. As a result, minimum performance and cost penalties are incurred. In this paper the fault detection and recovery algorithm using t...
  • A system solution to reducing frequency of memory repairs

    Publication Year: 1997, Page(s):53 - 58
    Cited by:  Papers (1)  |  Patents (1)

    Single symbol error correcting and double symbol error detecting (SSC-DSD) codes have been used in m-bit-per-chip computer memories for fault-tolerance and for savings in repair costs. In this paper, we present a solution to the reduction of memory repair actions for memories designed with SSC-DSD codes. We present a scheme that extends the basic SSC-DSD scheme to the data recovery of double symbo...
  • Formal verification of a TDMA protocol start-up mechanism

    Publication Year: 1997, Page(s):235 - 242
    Cited by:  Papers (19)  |  Patents (2)

    This paper presents a formal verification of the start-up algorithm of the DACAPO protocol. The protocol uses TDMA (Time Division Multiple Access) bus arbitration. It was verified that an ensemble of four communicating stations becomes synchronized and operational within a bounded time from an arbitrary initial state. The system model included a clock drift corresponding to ±10⁻³...
  • Adaptive system-level diagnosis and its application

    Publication Year: 1997, Page(s):66 - 71
    Cited by:  Papers (2)

    System-level diagnosis has been known for decades, and many solid results have been provided for this problem. This paper presents a new adaptive diagnosis strategy which requires far fewer tests than previous methods. Since tests can be applied concurrently in this method, the total time span for fault location can be further reduced. Although the practical implications of system-level diagnosis are few...
  • Fault-tolerant object on network-wide distributed object-oriented systems for future telecommunications applications

    Publication Year: 1997, Page(s):139 - 146
    Cited by:  Papers (3)

    This paper describes a fault-tolerant object using replication of objects in network-wide distributed object-oriented communications systems, and a mechanism for managing multiple objects that execute the target functions in the systems. This mechanism is located in the distributed processing platform that controls the execution of objects. The replication management mechanism combines fault detec...
  • Performance analysis of two time-based coordinated checkpointing protocols

    Publication Year: 1997, Page(s):194 - 201
    Cited by:  Papers (15)

    Time-based checkpointing protocols are a recently proposed way to improve a system's dependability. They claim to have the advantages of coordinated protocols without the normal costs of coordination. This paper investigates that claim, by analyzing and comparing two time-based checkpointing protocols. The analysis is performed by determining the forward progress of a system using each protocol, a...
  • Phaeton: a log-based architecture for high performance file server design

    Publication Year: 1997, Page(s):28 - 33
    Cited by:  Patents (2)

    An analysis of the construction of high performance network file servers is presented. Phaeton, a log-disk based architecture, is proposed to achieve sequential disk access speeds for both modification and non-modification requests, while maintaining strict semantics and recoverability in the presence of failure. The Phaeton architecture is capable of providing dramatic improvement of the single-se...
  • Optimized authenticated self-synchronizing Byzantine agreement protocols

    Publication Year: 1997, Page(s):122 - 129
    Cited by:  Papers (2)

    In order to make a dependable distributed computer system resilient to arbitrary failures of its processors, deterministic Byzantine agreement protocols (BAPs) can be applied. Many BAPs found in literature require that communication takes place in synchronized rounds of information exchange and require that all correct processors know the start of the BAP and start the protocol simultaneously. It i...
  • Performance analysis of a reliable real-time token-ring protocol

    Publication Year: 1997, Page(s):180 - 185

    This paper presents an analytical model using stochastic Petri nets for the performance of a priority based real-time protocol that uses data-link layer message logging for fast recovery in the event of station crashes. The advantage of using message logging at the data-link layer over traditional higher layer recovery mechanisms is demonstrated.
  • Double and triple error detecting capability of Internet checksum and estimation of probability of undetectable error

    Publication Year: 1997, Page(s):47 - 52
    Cited by:  Papers (3)

    The Internet checksum is calculated by 16-bit one's complement arithmetic. The occurrence of two or more errors, however, may not be detected. In this manuscript, we formulate the checksum procedure as a nonlinear code. Part of distance distribution of the nonlinear code is calculated. By using the results, we derived lower and upper bounds on the probability of an undetectable error when the nonl...
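
    To make the first two sentences of this abstract concrete, the sketch below computes a 16-bit one's complement checksum in the usual RFC 1071 style and shows one double-bit error that leaves it unchanged. The function name `internet_checksum` and the sample words are assumptions for illustration; nothing here reproduces the paper's nonlinear-code analysis or its bounds.

```python
def internet_checksum(words):
    """Return the complement of the 16-bit one's complement sum of 16-bit words."""
    total = 0
    for w in words:
        total += w & 0xFFFF
        # End-around carry: fold any carry out of bit 15 back into the low 16 bits.
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Hypothetical message of three 16-bit words.
original = [0x4500, 0x0001, 0x0000]

# Double error: bit 0 of word 1 flips 1 -> 0 and bit 0 of word 2 flips 0 -> 1.
# The two changes cancel in the sum, so the checksum cannot see them.
corrupted = [0x4500, 0x0000, 0x0001]

# Single error: only bit 1 of word 1 flips, which changes the sum.
single_error = [0x4500, 0x0003, 0x0000]

double_undetected = internet_checksum(original) == internet_checksum(corrupted)
single_detected = internet_checksum(original) != internet_checksum(single_error)
```

    The undetected case arises because the same bit position is flipped in opposite directions in two different words, so the one's complement sum is unchanged.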
  • An implementation of the FTAG model in concurrent ML

    Publication Year: 1997, Page(s):229 - 234

    Non-imperative programming models can simplify the development of fault-tolerant software, in part because of their potential for automatically generating concurrent implementations. This paper describes the design of a concurrent implementation of FTAG, a previously-described functional model for writing fault-tolerant software based on attribute grammars. The implementation involves translating ...
  • A cache error propagation model

    Publication Year: 1997, Page(s):15 - 21
    Cited by:  Papers (10)  |  Patents (1)

    Cache memory is a small, fast memory system that holds frequently used data. With increasing processor speed, aggressive design practices increase the probability of fault occurrence and the presence of latent errors as the processor allows a short duration for read and write. The fault may corrupt the cache memory system or lead to an erroneous internal CPU state. The authors investigate error p...
  • An extended binary tree quorum strategy for K-mutual exclusion in distributed systems

    Publication Year: 1997, Page(s):110 - 115
    Cited by:  Papers (2)

    In this paper, we propose two strategies called generalized binary tree quorum and extended binary tree quorum for k-mutual exclusion, which impose a logical structure on the network. Both of the proposed strategies are based on a logical binary tree structure. The quorum size constructed from both strategies is ⌈log₂(n/k)⌉ in the best case and is ⌈(n+k)/(2k)⌉ in the worst case, where n is th...
  • Reliability simulation of fault-tolerant software and systems

    Publication Year: 1997, Page(s):167 - 173
    Cited by:  Papers (8)

    Fault tolerance is a survival attribute of complex computer systems and software in their ability to deliver continuous service to their users in the presence of faults. Formulating an analytic model for dependability and performance evaluation of hardware/software fault tolerant architectures can be quite cumbersome. Also, in practice, isolating the effect of various parameters on a system, while...
  • Reliability modeling of structured systems: exploring symmetry in state-space generation

    Publication Year: 1997, Page(s):78 - 84
    Cited by:  Papers (4)

    A large number of systems are implemented using regular interconnected topologies. Markov analysis of such systems results in large state spaces. We explore symmetry, in particular rotational and permutational, of such systems to achieve a significant reduction in the size of the state space required to analyze them. The resulting much smaller state spaces allow analyses of very large systems. We ...
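
    The kind of reduction this abstract describes can be illustrated on a toy case. The sketch below assumes a ring of n identical nodes, each either up (1) or down (0), and merges states that differ only by a rotation of the ring. The ring model and the names `canonical` and `reduced_state_space` are assumptions for illustration, not the paper's method, and merging is only valid when failure and repair rates are the same for every node.

```python
from itertools import product


def canonical(state):
    """Return the lexicographically smallest rotation of a ring state (a tuple of 0/1)."""
    n = len(state)
    return min(state[i:] + state[:i] for i in range(n))


def reduced_state_space(n):
    """Group all 2**n up/down states of an n-node ring by rotation class."""
    classes = {}
    for state in product((0, 1), repeat=n):
        classes.setdefault(canonical(state), []).append(state)
    return classes


classes = reduced_state_space(4)
full_size = 2 ** 4           # 16 states before exploiting symmetry
reduced_size = len(classes)  # one representative per rotation class
```

    For a 4-node ring the 16 raw states collapse to 6 representatives; the gap widens quickly as n grows, which is what makes larger systems tractable.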
  • Increasing software reliability through rollback and on-line fault repair

    Publication Year: 1997, Page(s):208 - 213
    Cited by:  Papers (1)

    We propose a new paradigm for increasing the reliability of a software system by combining reactive and proactive approaches. The proposed approach employs rollback and restart for masking transient failure, and employs on-line software version change to remove faults from the software. A model for reliability analysis of a system employing the proposed approach is presented. The analysis shows that...
  • Concurrent error detection in priority queue managers for ATM networks

    Publication Year: 1997, Page(s):59 - 64
    Cited by:  Papers (2)

    In this paper, a concurrent error detection technique for priority queue managers for ATM switches is presented. A functional model for queue managers dealing with a mixture of real-time and non-real-time traffic is developed. Based on the functional model, functional errors due to physical failures in specific modules are explicitly stated. Errors in the queue managers are detected during normal ...
  • A fault-tolerant decoding procedure for real cyclic codes

    Publication Year: 1997, Page(s):35 - 40
    Cited by:  Papers (2)

    Real number cyclic codes which are constructed by imposing constraints in the discrete Fourier transform (DFT) domain are decoded by new estimation procedures. The codewords are assumed corrupted by small levels of roundoff noise and possibly by a few large excursions of random disturbances that represent the effects of failures. Error correction is separated into two parts, large activity detection followe...
  • The adaptable distributed recovery block scheme and a modular implementation model

    Publication Year: 1997, Page(s):131 - 138
    Cited by:  Papers (2)

    The purpose of adaptive fault-tolerance (AFT) is to meet the dynamically and widely changing fault-tolerance requirement by efficiently and adaptively utilizing a limited and dynamically changing amount of available redundant processing resources. In this paper we present one concrete AFT scheme, named the adaptable distributed recovery block (ADRB) scheme, which is an extension of the Distributed...