FTCS-23 The Twenty-Third International Symposium on Fault-Tolerant Computing

22-24 June 1993

Displaying Results 1 - 25 of 74
  • Digest of Papers - FTCS-23 The Twenty-Third International Symposium on Fault-Tolerant Computing [front matter]

    Publication Year: 1993, Page(s):i - xxii
    Freely Available from IEEE
  • FTCS-23 The Twenty-Third International Symposium on Fault-Tolerant Computing

    Publication Year: 1993
    Freely Available from IEEE
  • Application transparent fault management in fault tolerant Mach

    Publication Year: 1993, Page(s):10 - 19
    Cited by:  Papers (12)  |  Patents (4)

    A general purpose operating system fault management mechanism, the sentry, has been defined and implemented for the Mach 3.0 microkernel running a UNIX 4.3 BSD server. The value of a mechanism in the operating system domain is usually judged by two criteria: the suitability of the mechanism to support a wide range of policies and the performance cost of the mechanism. Similarly, in fault detection...
  • Faults, symptoms, and software fault tolerance in the Tandem GUARDIAN90 operating system

    Publication Year: 1993, Page(s):20 - 29
    Cited by:  Papers (44)  |  Patents (2)

    The authors present a measurement-based study of software failures and recovery in the Tandem GUARDIAN90 operating system using a collection of memory dump analyses of field software failures. They identify the effects of software faults on the processor state and trace the propagation of the effects to other areas of the system. They also evaluate the role of the defensive programming techniques ...
  • Increasing system availability through on-line software version change

    Publication Year: 1993, Page(s):30 - 35
    Cited by:  Papers (2)  |  Patents (11)

    The usual way of installing a new version of software is to shut down the running program and then install the new version. This necessitates a delay during which service is denied to the users of the software, and has the effect of reducing the overall availability of the system. An online software replacement system replaces parts of the software while it is in execution, thereby eliminatin...
  • Fast simulation of steady-state availability in non-Markovian highly dependable systems

    Publication Year: 1993, Page(s):38 - 47
    Cited by:  Papers (15)

    Efficient simulation techniques for estimating steady-state quantities in models of highly dependable computing systems with general component failure and repair time distributions are considered. Earlier approaches in this application setting for steady-state estimation rely on the regenerative method of simulation, which can be used when the failure time distributions are exponentially distribute...
  • Interval availability distribution computation

    Publication Year: 1993, Page(s):48 - 55
    Cited by:  Papers (8)

    Interval availability is a dependability measure defined by the fraction of time during which a system is in operation over a finite observation period. The computation of its distribution allows the user to ensure that the probability that a system will achieve a given availability level is high enough. As usual, the system is assumed to be modeled by a finite Markov process. The authors propose ...
  • Characterizing a lumping heuristic for a Markov network reliability model

    Publication Year: 1993, Page(s):56 - 65
    Cited by:  Papers (3)  |  Patents (1)

    Network reliability models are plagued by large state spaces. When combinatorial models are inapplicable, Markov models are generally used to evaluate network reliability. Standard numerical methods of Markov chain solution are not applicable due to the size of the state space. Alternate solution methods through state space reduction by lumping or a solution by simulation are required. The authors...
  • Comparison of hybrid modular redundant multiprocessor systems with respect to performabilities

    Publication Year: 1993, Page(s):66 - 75
    Cited by:  Papers (2)

    A hybrid modular redundant (HMR) processor consists of a VSD (Voter/Switch/Detector) module and computing modules. The authors discuss the performability of three typical multiprocessor systems defined using HMR processors: (1) NSMS in which HMR processors share no computing module, (2) CSMS in which HMR processors share all computing modules, and (3) PSMS in which adjacent HMR processors share two comp...
  • The design concepts and operational results of fault-tolerant computer systems for the Shinkansen train control

    Publication Year: 1993, Page(s):78 - 87
    Cited by:  Papers (7)

    The Shinkansen is a safe, reliable, and rapid mass public transportation system in Japan. The authors discuss COMTRAC in the Shinkansen. First, they look into the concept and technology of fail-safe operation in railways. Second, they explain COMTRAC and its fault-tolerant computer systems. Finally, they give several kinds of field data accumulated during the past several years about COMTRAC, which show it...
  • Development of a benchmark to measure system robustness

    Publication Year: 1993, Page(s):88 - 97
    Cited by:  Papers (18)

    An initial attempt at the development of a set of benchmarks to gauge a system's robustness as measured by its ability to tolerate errors is presented. Due to the large domain of system components whose intolerance to errors can lead to system failure, several primitive benchmarks that can be combined into a robustness benchmark suite are presented. Each primitive benchmark targets a system functi...
  • The variation of software survival time for different operational input profiles (or why you can wait a long time for a big bug to fail)

    Publication Year: 1993, Page(s):98 - 107
    Cited by:  Papers (27)

    Experimental and theoretical evidence for the existence of contiguous failure regions in the program input space (blob defects) is provided. For real-time systems where successive input values tend to be similar, blob defects can have a major impact on the software survival time because the failure probability is not constant. For example, with a random walk input sequence, the probability of fail...
  • A class of error locating codes for byte-organized memory systems

    Publication Year: 1993, Page(s):110 - 119
    Cited by:  Papers (3)

    Error locating codes (EL codes), first proposed by J.K. Wolf and B. Elspas (1963), have the potential to be used to identify the faulty module for fault isolation and reconfiguration in fault-tolerant computer systems. A new class of EL codes suitable for memory systems organized with b-bit (b ≥ 2) byte-organized semiconductor memory chips which are mounted on memory cards each having B-bit ...
  • Unidirectional error control codes

    Publication Year: 1993, Page(s):120 - 129
    Cited by:  Papers (1)  |  Patents (1)

    A new class of codes for unidirectional error control is presented. These codes are targeted towards unidirectional errors which affect at most t bits confined to at most u bytes of a code word. Such errors are termed t/u-unidirectional errors. t/u-unidirectional error control, in general, requires less redundancy than u-unidirectional byte error control codes. Codes that are capable of d...
  • A proposal for error-tolerating codes

    Publication Year: 1993, Page(s):130 - 136
    Cited by:  Papers (6)  |  Patents (3)

    An extended concept of error-tolerating codes is presented and some examples of error-tolerating codes are introduced. An erroneous codeword of the proposed error-tolerating code may occur in the codespace; however, in this case, the erroneous codeword is required to be in a defined neighborhood of the original codeword. When no error is detected in a word, the word may differ from the original co...
  • Progressive retry for software error recovery in distributed systems

    Publication Year: 1993, Page(s):138 - 144
    Cited by:  Papers (20)  |  Patents (5)

    A method of execution retry for bypassing software faults based on checkpointing, rollback, message reordering, and replaying is described. The authors demonstrate how rollback techniques, previously developed for transient hardware failure recovery, can also be used to recover from software errors by exploiting message reordering to bypass software faults. The approach intentionally increases the...
  • Nonblocking and orphan-free message logging protocols

    Publication Year: 1993, Page(s):145 - 154
    Cited by:  Papers (28)

    Currently existing message logging protocols demonstrate a classic pessimistic vs. optimistic tradeoff. It is shown that the optimistic-pessimistic tradeoff is not inherent to the problem of message logging. The authors construct a message-logging protocol that has the positive features of both optimistic and pessimistic protocols: the protocol prevents orphans and allows simple failure recovery; ...
  • Relaxing consistency in recoverable distributed shared memory

    Publication Year: 1993, Page(s):155 - 163
    Cited by:  Papers (13)

    Relaxed memory consistency models tolerate increased memory access latency in both hardware and software distributed shared memory systems. In recoverable systems, relaxing consistency has the added benefit of reducing the number of checkpoints needed to avoid rollback propagation. The authors introduce new checkpointing algorithms that take advantage of relaxed consistency to reduce the performan...
  • EXOP (Extended Operation): A new logical fault model for digital circuits

    Publication Year: 1993, Page(s):166 - 175
    Cited by:  Papers (3)

    A gate-level fault model for digital circuits is proposed that generalizes previous models and is expected to model defects not included in other fault models. Theoretical discussions and experimental results are presented to show the effectiveness of a test set for the new fault model in achieving very high coverage of commonly used single and multiple faults. It is shown that test generation for...
  • Efficient testing of tree circuits

    Publication Year: 1993, Page(s):176 - 185
    Cited by:  Papers (3)

    The authors investigate the testing properties of a class of regular circuits known as trees, which include parity circuits, multiplexers, and decoders. Viewing a tree as overlapping one-dimensional arrays makes it possible to extend the testing properties of these arrays to trees. The authors give conditions for individually testing the arrays within a tree with a constant number of tests. Under t...
  • A BDD-based algorithm for computation of exact fault detection probabilities

    Publication Year: 1993, Page(s):186 - 195
    Cited by:  Papers (20)

    Signal and fault detection probabilities are widely used in the area of testing. Due to the computational complexity, in most cases only approximate values are computed. A system called PLATO, which allows the computation of the exact values for many combinational circuits, is described. The implemented algorithms use the recently developed BDD packages as the data structure. Besides the description of...
  • Detection and discrimination of injected network faults

    Publication Year: 1993, Page(s):198 - 207
    Cited by:  Papers (4)

    Six hundred faults were induced by injection into five live campus networks at Carnegie Mellon University in order to determine whether or not particular network faults have unique signatures as determined by out-of-band monitoring instrumentation. If unique signatures span networks, then the monitoring instrumentation can be used to diagnose network faults, or distinguish among fault classes, wit...
  • Software fault injection and its application in distributed systems

    Publication Year: 1993, Page(s):208 - 217
    Cited by:  Papers (14)  |  Patents (2)

    The authors describe a software fault injector (SFI) developed to facilitate the validation of dependability mechanisms on an experimental distributed real-time system called HARTS. SFI introduces a number of extensions to previous work done on fault injection tools. In particular, it allows combinations of fault types to be injected in the nodes of a distributed system. It also allows control of ...
  • Simulation of software behavior under hardware faults

    Publication Year: 1993, Page(s):218 - 227
    Cited by:  Papers (25)  |  Patents (1)

    A simulation-based software model that permits application-specific dependability analysis in the early design stages is introduced. The model represents an application program by decomposing it into a graph model consisting of a set of nodes, a set of edges that probabilistically determine the flow from node to node, and a mapping of the nodes to memory. The software model simulates the execution...
  • Estimators for fault tolerance coverage evaluation

    Publication Year: 1993, Page(s):228 - 237
    Cited by:  Papers (21)

    The problem of estimating the coverage of a fault tolerance mechanism through statistical processing of observations collected in fault injection experiments is addressed. A formal definition of coverage is given in terms of the fault and activation sets that characterize the input space. Two categories of sampling techniques are considered for coverage estimation: sampling in the whole space and ...
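
As a rough illustration of the last entry's idea of estimating coverage from fault injection observations, the sketch below computes a point estimate and a confidence interval when experiments are sampled from the whole fault/activation space. It is not taken from the paper: the function name `estimate_coverage`, the assumption of independent, uniformly sampled experiments, and the normal-approximation interval are all assumptions made here for illustration.

```python
import math


def estimate_coverage(outcomes, z=1.96):
    """Estimate coverage from fault injection outcomes (hypothetical helper).

    outcomes: iterable of booleans, True when the fault tolerance
              mechanism handled the injected fault correctly.
    z:        normal quantile for the interval (1.96 gives roughly 95%).

    Assumes independent experiments sampled uniformly from the whole
    fault/activation space, so the covered count is binomial.
    """
    outcomes = list(outcomes)
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one fault injection experiment")

    covered = sum(1 for ok in outcomes if ok)
    c_hat = covered / n  # point estimate of coverage

    # Normal-approximation (Wald) interval, clipped to [0, 1].
    half_width = z * math.sqrt(c_hat * (1.0 - c_hat) / n)
    lower = max(0.0, c_hat - half_width)
    upper = min(1.0, c_hat + half_width)
    return c_hat, lower, upper


if __name__ == "__main__":
    # Made-up example data: 100 injections, 95 handled correctly.
    results = [True] * 95 + [False] * 5
    c_hat, lower, upper = estimate_coverage(results)
    print(f"coverage estimate {c_hat:.3f}, interval [{lower:.3f}, {upper:.3f}]")
```

The normal approximation tends to be unreliable when coverage is very close to 1 or the number of experiments is small, which is the usual situation for highly dependable systems, so this should be read only as a starting point.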