Proceedings of IEEE 13th Symposium on Reliable Distributed Systems

25-27 Oct. 1994

  • Proceedings of IEEE 13th Symposium on Reliable Distributed Systems

    Publication Year: 1994
  • Reliability analysis of a hardware and software fault tolerant parallel processor

    Publication Year: 1994, Page(s):74 - 83
    Cited by:  Papers (2)  |  Patents (1)

    Computer systems for critical applications must be designed to tolerate software faults as well as hardware faults. A unified approach to tolerating hardware and software faults is characterized by classifying faults in terms of duration (transient or permanent) rather than source (hardware or software). Errors arising from transient faults can be handled through masking or voting, but errors aris...

  • Processing of read-only queries at a remote backup

    Publication Year: 1994, Page(s):192 - 201
    Cited by:  Patents (7)

    Remote backup systems are often used to provide high data availability. Updates are typically propagated to the backup via a log, which decouples the backup from the primary. We show that this decoupling can lead to efficient installation of updates in batches and efficient processing of read-only queries, by eliminating or reducing access conflicts between updates and queries. We present several ...

  • Probabilistic validation using worst event driven and importance sampling simulation

    Publication Year: 1994, Page(s):84 - 93

    Probabilistic validation is an approach for the validation of highly dependable and complex systems. It relies on a partial analysis of a system model and tries to prove that the failed event occurrences have a sufficiently low probability. We define a probabilistic validation method using worst event driven and importance sampling simulation. The system which must be validated is modeled by a s...

  • Transparent load sharing in distributed systems: decentralized design alternatives based on the Condor package

    Publication Year: 1994, Page(s):202 - 211
    Cited by:  Papers (4)

    In recent years a number of load sharing (LS) mechanisms have been proposed or implemented to fully utilize system resources. We design and implement a decentralized LS mechanism based on the Condor package, and give in this paper a description of our design and implementation approaches. Two important features of the design are the use of region-change broadcasts in the information policy to prov...

  • PLinda 2.0: a transactional/checkpointing approach to fault tolerant Linda

    Publication Year: 1994, Page(s):96 - 105
    Cited by:  Papers (12)

    Robust parallel computation in Linda requires both tuple space and processes to be resilient to failure. In this paper, we present PLinda 2.0, a set of extensions to Linda to support robust parallel computation on loosely coupled processors communicating over a network. The principal extensions of PLinda 2.0 to Linda are transaction mechanisms for reliable tuple space and process-private logging mec...

  • Analysis of checkpointing schemes for multiprocessor systems

    Publication Year: 1994, Page(s):52 - 61
    Cited by:  Papers (9)

    Parallel computing systems provide hardware redundancy that helps to achieve low-cost fault tolerance, by duplicating the task on more than one processor and comparing the states of the processors at checkpoints. This paper suggests a novel technique, based on a Markov reward model (MRM), for analyzing the performance of checkpointing schemes with task duplication. We show how this techniq...

  • A replication-transparent remote invocation protocol

    Publication Year: 1994, Page(s):160 - 169
    Cited by:  Papers (6)

    Although many algorithms and implementations of replicated services have been developed, most have embedded aspects of the replication management in the invocation protocol. This makes it extremely difficult to modify the replication protocol without changing the protocol used by the clients, and causes an undesirable violation of both transparency and modularity. The GRIP protocol supports the fa...

  • Distributed reconfiguration of multiprocessor systems

    Publication Year: 1994, Page(s):212 - 221
    Cited by:  Papers (1)  |  Patents (3)

    We propose distributed algorithms for assigning tasks to processors in a distributed system, as well as algorithms for reassigning tasks from a failed processor to the remaining fault-free processors in the system. The assignment and reassignment of the tasks should 1) make efficient use of the capacity of the processors in the system, 2) distribute the load in a balanced way among the processo...

  • A methodology for constructing a stabilizing crash-tolerant application

    Publication Year: 1994, Page(s):12 - 21

    This paper is an exercise to construct a stabilizing mutual-exclusion protocol that withstands a single crash-failure. We begin with a collection of distributed processes arranged in a ring. The resulting protocol is stabilized by construction. Stabilizing protocols converge to a correct behavior regardless of their initial state. A faulty process is automatically removed from the system and, afte...

  • Nested dynamic actions: how to solve the fault containment problem in a cooperative action model

    Publication Year: 1994, Page(s):106 - 115
    Cited by:  Papers (2)

    Elements of transaction processing are becoming increasingly accepted as a basis for fault-tolerant distributed computing. We have developed an action concept with an extended functionality suited to better support the needs of Concurrent Engineering applications. This allows actions to access data produced by other, still active actions. In case of an action fault, however, fault containment may become...

  • An environment for importance sampling based on stochastic activity networks

    Publication Year: 1994, Page(s):64 - 73
    Cited by:  Papers (2)

    Model-based evaluation of reliable distributed and parallel systems is difficult due to the complexity of these systems and the nature of the dependability measures of interest. The complexity creates problems for analytical model solution techniques, and the fact that reliability and availability measures are based on rare events makes traditional simulation methods inefficient. Importance sampli...

  • Exploiting program semantics for efficient instrumentation of distributed event recognitions

    Publication Year: 1994, Page(s):181 - 190
    Cited by:  Papers (2)

    Event based monitors and debuggers for distributed systems automatically detect occurrences of user specified events that characterize the state of one or more processes in a distributed computation. The system monitors the occurrences of local events. Data pertaining to these occurrences is utilized to detect occurrences of distributed events which check the ordering or concurrency of local event...

  • Coordinated checkpointing-rollback error recovery for distributed shared memory multicomputers

    Publication Year: 1994, Page(s):42 - 51
    Cited by:  Papers (25)  |  Patents (4)

    Most recovery schemes that have been proposed for Distributed Shared Memory (DSM) systems require unnecessarily high checkpointing frequency and checkpoint traffic, which are sensitive to the frequency of interprocess communication in the applications. For message-passing systems, low overhead error recovery based on coordinated checkpointing allows the frequency of checkpointing to be determined ...

  • A protocol description language for customizing failure semantics

    Publication Year: 1994, Page(s):148 - 157
    Cited by:  Papers (8)

    To optimize performance in a fault-tolerant distributed system, it is often necessary to enforce different failure semantics for different components. By choosing a custom set of failure semantics for each component and then by enforcing the semantics with a minimal set of protocols for a particular architecture, performance may be maximized while ensuring the desired system behavior. We have deve...

  • Fault tolerance in a multisensor environment

    Publication Year: 1994, Page(s):2 - 11
    Cited by:  Papers (7)

    Replicating sensors is desirable not only to tolerate sensor failures but also to increase the expected accuracy of the ensemble of replicated sensors beyond that obtainable with a single one. Such replication is used in a multisensor environment or in a distributed sensor network. We model a continuous valued sensor as an interval of real numbers containing the physical value of interest. Given n...

  • Efficient treatment of failures in RPC systems

    Publication Year: 1994, Page(s):170 - 180
    Cited by:  Papers (2)

    This paper addresses extensions to be made to a basic remote procedure call system for the integration of primitive fault tolerance measures. Our main design goal is to introduce no performance penalty for remote procedure calls executing in the absence of failures, and to impose no significant overhead in the treatment of failures. Basically, extensions include a simple algorithm that finds and...

  • Reducing interprocessor dependence in recoverable distributed shared memory

    Publication Year: 1994, Page(s):34 - 41
    Cited by:  Papers (8)

    Checkpointing techniques in parallel systems use dependency tracking and/or message logging to ensure that a system rolls back to a consistent state. Traditional dependency tracking in distributed shared memory (DSM) systems is expensive because of high communication frequency. In this paper we show that, if designed correctly, a DSM system only needs to consider dependencies due to the transfer o...

  • Simulating fail-stop in asynchronous distributed systems

    Publication Year: 1994, Page(s):138 - 147
    Cited by:  Papers (7)

    The fail-stop failure model appears frequently in the distributed systems literature. However, in an asynchronous distributed system, the fail-stop model cannot be implemented. In particular, it is impossible to reliably detect crash failures in an asynchronous system. In this paper, we show that it is possible to specify and implement a failure model that is indistinguishable from the fail-stop m...

  • Probabilistic internal clock synchronization

    Publication Year: 1994, Page(s):22 - 31
    Cited by:  Papers (22)  |  Patents (7)

    We propose an improved probabilistic method for reading remote clocks in systems subject to unbounded communication delays and use this method to design a fault-tolerant probabilistic internal clock synchronization protocol. This protocol masks clock reading failures and arbitrary failures of processes. Because of probabilistic reading, our protocol achieves better synchronization precisions than ...

  • Interaction of formal design systems in the development of a fault-tolerant clock synchronization circuit

    Publication Year: 1994, Page(s):128 - 137
    Cited by:  Papers (1)

    We propose a design strategy that exploits the strengths of different formal approaches to establish a reliable path from a mechanically verified high-level description to a concrete gate-level realization. We demonstrate the use of this approach in the realization of a fault-tolerant clock synchronization circuit. We used the Digital Design Derivation system (DDD) to derive a major portion of the...

  • An efficient recovery scheme for locking-based distributed database systems

    Publication Year: 1994, Page(s):116 - 125

    This paper presents a checkpointing recovery scheme which effectively copes with media failures in distributed database systems, under the two-phase locking policy. Our scheme utilizes both the current database area and the checkpoint area, so that the transactions whose effects are to be included in the checkpoint and the ones whose effects are to be excluded from the checkpoint can separately re...

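
Illustrative note on the entry "Fault tolerance in a multisensor environment": its abstract models each sensor reading as an interval of real numbers containing the physical value. The Python sketch below shows one common way such intervals can be fused when some sensors may be faulty. It is an endpoint sweep in the spirit of Marzullo-style fault-tolerant interval intersection, chosen here as an assumption for illustration; the abstract is truncated before it states the paper's own method, so this should not be read as that method. The names fuse_intervals, readings, and max_faulty are hypothetical.

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, float]  # closed interval (low, high)


def fuse_intervals(readings: List[Interval], max_faulty: int) -> Optional[Interval]:
    """Return the smallest interval covering every point that lies in at
    least len(readings) - max_faulty of the input intervals, or None if
    no point is covered by that many intervals.

    Hypothetical helper for illustration; not taken from the paper.
    """
    needed = len(readings) - max_faulty
    if needed <= 0:
        raise ValueError("max_faulty must be smaller than the number of readings")
    for low, high in readings:
        if low > high:
            raise ValueError("each interval must satisfy low <= high")

    # Build endpoint events. Kind 0 (open) sorts before kind 1 (close) at
    # equal coordinates, so intervals that merely touch count as overlapping.
    events = sorted(
        [(low, 0) for low, _ in readings] + [(high, 1) for _, high in readings]
    )

    depth = 0          # number of intervals covering the current point
    fused_low = None   # first point where coverage reaches `needed`
    fused_high = None  # last point where coverage is still `needed`
    for x, kind in events:
        if kind == 0:
            depth += 1
            if depth >= needed and fused_low is None:
                fused_low = x
        else:
            if depth >= needed:
                fused_high = x
            depth -= 1

    if fused_low is None:
        return None
    return (fused_low, fused_high)


if __name__ == "__main__":
    # Three sensors roughly agree; the fourth is far off.
    print(fuse_intervals([(8, 12), (11, 13), (10, 12), (30, 31)], max_faulty=1))
```

With four readings and max_faulty=1, the result is the region covered by at least three intervals, so the outlying fourth reading does not influence the fused interval. Requiring agreement from all four (max_faulty=0) yields None for this input, which signals that at least one sensor must be faulty under the interval model.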