
Proceedings of the 15th Symposium on Reliable Distributed Systems, 1996

Date: 23-25 October 1996


Displaying Results 1 - 25 of 25
  • Proceedings 15th Symposium on Reliable Distributed Systems [front matter]

    Publication Year: 1996
  • Author index

    Publication Year: 1996 , Page(s): 229
  • Analysis of a multistage interconnection network using binary decision diagrams (BDD)

    Publication Year: 1996 , Page(s): 34 - 43

    The authors use the BDD to help derive a closed-form solution for the reliability of a multistage interconnection network with n stages. The BDD reveals repeated structures, the reliability of which can be encoded in a recursive formula. An exact solution of a network with an arbitrary number of stages can be computed in time proportional to the number of stages. They also provide results which include the concept of imperfect coverage, in which two mutually-exclusive failure modes (with different effects) are possible for certain switching elements.

  • Specialized N-modular redundant processors in large-scale distributed systems

    Publication Year: 1996 , Page(s): 12 - 21
    Cited by:  Papers (2)

    Computers are being used to achieve increasingly sophisticated control for large and complex systems. Many of these systems require a large shared state-space or database. Thus, handling real-time concurrent accesses to a shared database is an essential feature for modern fault-tolerant systems. Many fault-tolerant systems have been implemented for uniformly tolerating various types of failures, such as MAFT (Multicomputer Architecture for Fault Tolerance), FTP (Fault-Tolerant Processor), FTPP (Fault-Tolerant Parallel Processors) and Delta-4. However, most of these either lack the notion of a shared state-space or do not efficiently support parallel tasks that concurrently access a shared state-space. We use a processor-specialization approach to increase the effectiveness of replication and, consequently, achieve cost-effective fault tolerance in such systems. The SNMR (specialized N-modular redundancy) protocol has been developed based on these concepts. Compared to many existing Byzantine-resilient systems, the SNMR approach incurs less overhead and can be easily parameterized to fit various fault models.

  • Primary copy method and its modifications for database replication in distributed mobile computing environment

    Publication Year: 1996 , Page(s): 178 - 187
    Cited by:  Papers (2)

    Rapidly expanding cellular communication technology, wireless LANs and satellite services have made it possible for mobile users to access information anywhere and at any time. In a mobile computing environment, replication can be considered an essential technique for providing reliability, increased throughput and data availability. This paper addresses replica control protocols with an emphasis on workstation mobility issues. The modifications that have to be made to the primary copy method for replicated database management, in order to address the effect of mobility on existing replica control protocols, are analysed and proposed. A variation of the primary copy algorithm, called the virtual primary copy method, is proposed and shown to be well suited to the distributed mobile computing environment. The performance of the virtual primary copy method is compared with that of the traditional primary copy method using computer simulation.

  • A transparent light-weight group service

    Publication Year: 1996 , Page(s): 130 - 139
    Cited by:  Papers (4)

    The virtual synchrony model for group communication has proven to be a powerful paradigm for building distributed applications. Implementations of virtual synchrony usually require the use of failure detectors and failure recovery protocols. In applications that require a large number of groups, significant performance gains can be attained if these groups share the resources required to provide virtual synchrony. A service that maps user groups onto instances of a virtually synchronous implementation is called a light-weight group service. This paper proposes a new design for light-weight group protocols that enables this service to be used in a transparent manner. As a test case, the new design was implemented in the Horus system, although the underlying principles can be applied to other architectures as well. The paper also presents performance results from this implementation.

  • Analyzing dynamic voting using Petri nets

    Publication Year: 1996 , Page(s): 44 - 53
    Cited by:  Papers (8)

    Dynamic voting is considered a promising technique for achieving high availability in distributed systems with data replication. To date, stochastic analysis of dynamic voting algorithms has been restricted to either site or link Markov models, but not both, possibly because of the difficulty of specifying the state-space, which grows exponentially as the number of sites increases. Furthermore, to reduce the state-space, the assumption of “frequent updates” was normally made, which results in an overestimation of the availability. In this paper, we develop a Petri net model that considers both site and link failures and also relaxes the modeling assumption of frequent updates. We test our Petri net model on ring and star network topologies to analyze whether availability under dynamic voting can be seriously degraded when updates are not frequent, under various site and link failure/repair situations. Finally, we use the Petri net developed in the paper to determine the maximum achievable improvement in availability when null updates are introduced to augment regular updates and keep the status of availability up-to-date. (A generic sketch of the basic dynamic-voting update rule appears after this listing.)

  • The design of a CORBA group communication service

    Publication Year: 1996 , Page(s): 150 - 159
    Cited by:  Papers (34)  |  Patents (2)

    The Common Object Request Broker Architecture (CORBA) is becoming a standard for distributed application middleware, and there are increasing needs for enriching the basic functionalities of CORBA. While mechanisms for persistence, transactions, event channels, etc. have been designed and specified for CORBA, no standard support is provided to handle object replication. In this paper we discuss the issue of augmenting CORBA with group communication, which is considered an adequate paradigm to handle replication. We distinguish two main approaches: the integration approach and the service approach. We argue that the service approach is more appropriate to CORBA as it preserves the modularity of the architecture. We describe a proposal for a group communication service and discuss some implementation issues.

  • A fault-tolerant CORBA name server

    Publication Year: 1996 , Page(s): 188 - 197
    Cited by:  Papers (4)  |  Patents (3)

    OMG CORBA applications require a distributed naming service in order to install and to retrieve object references. High availability of the naming service is important since most CORBA applications need to access it at least once during their lifetime. Unfortunately, the OMG standards do not deal with availability issues; the naming services of many of the commercially available CORBA object request brokers introduce single points of failure. In this paper we describe the design and implementation of a replicated, highly-available CORBA name server that adheres to the OMG Common Object Services Specification. Our naming service can be replicated at run-time, while many applications are installing and retrieving object references. We compare our approach with the approaches taken by the ILU, NEO, Orbix, and DOME object request brokers. The performance of our name server is measured for various degrees of replication.

  • Dynamic fault tolerance in DCMA-a dynamically configurable multicomputer architecture

    Publication Year: 1996 , Page(s): 22 - 31

    This paper introduces a new architecture for a fault-tolerant computer system which connects high-end PCs or workstations by a high-speed network. To achieve platform independence, coupling is based on the widely used PCI bus. In contrast to commercially available fault-tolerant systems, we strongly emphasize mechanisms for tolerating transient and intermittent faults. To keep hardware costs low, the system is built with off-the-shelf computers, and the extensions to them are kept as small as possible. To reduce operational costs, the system can be dynamically adapted to different fault-tolerance demands on a program-by-program basis. Adaptation is done transparently to the application software by the operating system. We use a commercially available real-time operating system with a POSIX-compliant UNIX interface. The range of fault tolerance extends from a non-redundant system of stand-alone computers, through a master/checker configuration, to a TMR system. The high-performance network also allows the system to operate as a parallel multicomputer.

  • Hierarchical adaptive distributed system-level diagnosis applied for SNMP-based network fault management

    Publication Year: 1996 , Page(s): 98 - 107
    Cited by:  Papers (1)  |  Patents (1)

    Fault management is a key functional area of network management systems, but currently deployed applications often implement rudimentary diagnosis mechanisms. This paper presents a new hierarchical adaptive distributed system-level diagnosis (Hi-ADSD) algorithm and its implementation based on SNMP (Simple Network Management Protocol). Hi-ADSD is a fully distributed algorithm that has a diagnosis latency of at most (log₂N)² testing rounds for a network of N nodes. Nodes are mapped into progressively larger logical clusters, so that each node executes tests in a hierarchical fashion. The algorithm assumes a fully-connected network with no link faults, and imposes no bound on the number of faults. Both the worst-case diagnosis latency and the correctness of the algorithm are formally proved. Experimental results are given through simulation of the algorithm for large networks. The algorithm was also implemented on a small network using SNMP. We present details of the implementation, including device fault management, the role of the network management station, and the diagnosis management information base.

  • Strong and weak virtual synchrony in Horus

    Publication Year: 1996 , Page(s): 140 - 149
    Cited by:  Papers (14)

    This paper presents two variants of virtual synchrony which are supported by Horus. The first variant, called strong virtual synchrony, includes the property that every message is delivered within the view in which it is sent. This property is very useful in developing applications, since it helps to minimize the amount of context information that needs to be sent on messages and the amount of computation required to process a message. However, it is shown that in order to support this property, the application program has to block messages during view changes. An alternative definition, called weak virtual synchrony, which can be implemented without blocking messages, is then presented. This definition still guarantees that messages are delivered within the view in which they were sent, but it uses a slightly weaker notion of the view in which a message was sent. An implementation of weak virtual synchrony that does not block messages during view changes is also developed in this paper.

  • A proposal for ensuring high availability of distributed multimedia applications

    Publication Year: 1996 , Page(s): 220 - 227
    Cited by:  Papers (2)

    Recent advances in computing, such as high-speed networks and data compression, make extensible distributed multimedia applications a challenging application domain for distributed systems. Applications such as VoD (Video on Demand) or real-time conferencing are characterized by QoS (quality of service) requirements which depend on the quality of the video and sound transmitted to the client and on adherence to the time constraints associated with video and audio data. Much work has been done to provide system support aimed at meeting these requirements. However, existing proposals do not take into account the consequences of failures for the guaranteed QoS. To deal with this issue, we propose a resource reservation model that integrates the availability requirements of multimedia services in addition to the QoS constraints introduced above. Our paper details the resulting model together with its integration in a distributed system. In particular, we show how the model implementation can be customized in the case of a VoD server.

  • Implementation and performance of a stable-storage service in Unix

    Publication Year: 1996 , Page(s): 86 - 95
    Cited by:  Papers (4)

    This paper describes the design, implementation, and performance of a stable-storage service that has been implemented on top of the Unix operating system. This service allows servers to create, access, and delete persistent memory that survives server crashes. We describe its functionality and exported operations, discuss the experiences and performance of its implementation, and offer concrete examples of its use in implementing some real fault-tolerant distributed protocols. (A generic sketch of a classic two-copy stable-write scheme appears after this listing.)

  • Locating more corruptions in a replicated file

    Publication Year: 1996 , Page(s): 168 - 177
    Cited by:  Papers (1)

    When a data file is replicated at more than one site, we are interested in detecting corruption by comparing the multiple copies. In order to reduce the amount of messaging for large files, techniques based on page signatures and combined signatures have been explored. However, for 3 or more sites, the known methods assume that the number of corrupted page copies is at most [M/2]-1, where M is the number of sites. We point out that this assumption is unrealistic and that the corresponding methods are unnecessarily pessimistic. In this paper, we replace this assumption with another assumption which we show to be reasonable. Based on this assumption, we derive a distributed algorithm which in general achieves better performance than previously known results. Our system model is also more refined than that of previous work.

  • Ongoing fault diagnosis

    Publication Year: 1996 , Page(s): 108 - 117
    Cited by:  Papers (1)

    We consider a dynamic fault diagnosis problem: there are n processors, to be tested in a series of rounds. In every testing round we use a directed matching to have some processors report on the status (good or faulty) of other processors. Also, in each round up to t processors may break down, and we may direct that up to t processors be repaired. We show that it is possible to limit the number of faulty processors to O(t log₂ t), even if the system is run indefinitely. We present an adversary which shows that this bound is optimal.

  • A causal message ordering scheme for distributed embedded real-time systems

    Publication Year: 1996 , Page(s): 210 - 219

    In any distributed system, messages must be ordered according to their cause-and-effect relation to ensure correct behavior of the system. Causal ordering is also essential for services like atomic multicast and replication. In distributed real-time systems, not only must proper causal ordering be ensured, but message deadlines must be met as well. Previous algorithms which ensure such behavior include the Δ-protocol family and the MARS approach. However, both of these algorithms give large response times by delaying all messages for a fixed period of time. In this paper we show that for small- to medium-sized real-time systems (consisting of a few tens of nodes), as are commonly used for embedded applications, it becomes feasible to extend the Δ-protocol so that, instead of delaying all messages for a fixed period, each message is delayed according to its deadline. Our algorithm requires certain message deadlines to be adjusted by the application designer, and we show that for small-scale applications such as those used in embedded systems, this adjustment is feasible and can be automated by the use of proper CAD tools.

  • Exploiting data-flow for fault-tolerance in a wide-area parallel system

    Publication Year: 1996 , Page(s): 2 - 11
    Cited by:  Papers (5)

    Wide-area parallel processing systems will soon be available to researchers to solve a range of problems. In these systems, it is certain that host failures and other faults will be a common occurrence. Unfortunately, most parallel processing systems have not been designed with fault-tolerance in mind. Mentat is a high-performance object-oriented parallel processing system that is based on an extension of the data-flow model. The functional nature of data-flow enables both parallelism and fault-tolerance. In this paper, we exploit the data-flow underpinning of Mentat to provide easy-to-use and transparent fault-tolerance. We present results on both a small-scale network and a wide-area heterogeneous environment that consists of three sites: the National Center for Supercomputing Applications, the University of Virginia and the NASA Langley Research Center.

  • Improving the performance of coordinated checkpointers on networks of workstations using RAID techniques

    Publication Year: 1996 , Page(s): 76 - 85
    Cited by:  Papers (11)

    Coordinated checkpointing systems are popular and general-purpose tools for implementing process migration, coarse-grained job swapping, and fault-tolerance on networks of workstations. Though simple in concept, there are several design decisions concerning the placement of checkpoint files that can impact the performance and functionality of coordinated checkpointers. Although several such checkpointers have been implemented for popular programming platforms like PVM and MPI, none has taken this issue into consideration. This paper addresses the issue of checkpoint placement and its impact on the performance and functionality of coordinated checkpointing systems. Several strategies, both old and new, are described and implemented on a network of SPARC-5 workstations running PVM. These strategies range from very simple to more complex, borrowing heavily from ideas in RAID (Redundant Arrays of Inexpensive Disks) fault-tolerance. The results of this paper will serve as a guide so that future implementations of coordinated checkpointing can allow their users to achieve the combination of performance and functionality that is right for their applications. (A generic sketch of RAID-style parity over checkpoint blocks appears after this listing.)

  • Developing reliable applications on cluster systems

    Publication Year: 1996 , Page(s): 165 - 166

    A cluster is a group of computers which are loosely connected together to provide fast and reliable services. There have been many applications built on cluster systems, such as distributed/parallel database applications, telecommunication systems and, recently, internet/intranet servers. Cluster systems can deliver similar or better performance and reliability than traditional mainframes, supercomputers and fault-tolerant systems with a much lower hardware cost.

  • Diagnosing crosstalk-faulty switches in photonic networks

    Publication Year: 1996 , Page(s): 118 - 127

    A procedure for diagnosing crosstalk and crosstalk-faulty switches in photonic dilated Benes networks (DBNs) is presented. It obtains the crosstalk ratios of each and every switch in an N×N DBN in 4N tests, along with O(N·log₂N) calculations. One of its applications is to identify single or multiple switches in the DBN which are generating excessive crosstalk, or crosstalk-faulty. A recursive algorithm is used to configure the DBN for each test such that the necessary power measurements of the signals can be taken accurately. An important feature of the proposed diagnostic procedure is its suitability for automated test generation.

  • Fail-aware failure detectors

    Publication Year: 1996 , Page(s): 200 - 209
    Cited by:  Papers (7)

    In existing asynchronous distributed systems it is impossible to implement failure detectors which are perfect, i.e. which suspect only crashed processes and eventually suspect all crashed processes. Some recent research has, however, proposed that any “reasonable” failure detector for solving the election problem must be perfect. We address this problem by introducing two new classes of fail-aware failure detectors that are (1) implementable in existing asynchronous distributed systems, (2) not necessarily perfect, and (3) usable to solve the election problem. In particular, we show that there exists a fail-aware failure detector that allows the election problem to be solved and which is strictly weaker than a perfect failure detector.

  • Observations from 16 years at a fault-tolerant computer company

    Publication Year: 1996 , Page(s): 162 - 164

    Observations acquired from 16 years of experience working for a vendor of fault-tolerant computer systems are presented, along with two “war stories” that illustrate some of the principles.

  • Minimizing timestamp size for completely asynchronous optimistic recovery with minimal rollback

    Publication Year: 1996 , Page(s): 66 - 75
    Cited by:  Papers (1)

    Basing rollback recovery on optimistic message logging and replay avoids the need for synchronization between processes during failure-free execution. Some previous research has also attempted to reduce the need for synchronization during recovery, but these protocols have suffered from three problems: not eliminating all synchronization during recovery, not minimizing rollback, or providing these properties but requiring large timestamps. This paper makes two contributions: we present a new rollback recovery protocol, based on our previous work, that provides these properties (asynchronous recovery, minimal rollback) while reducing the timestamp size; and we prove that no protocol can provide these properties and have asymptotically smaller timestamps.

  • On-line testing for application software of widely distributed system

    Publication Year: 1996 , Page(s): 54 - 63
    Cited by:  Papers (3)

    Widely distributed systems are constructed step-by-step over a long time. These systems must permit on-line testing. On-line testing verifies newly added application software by receiving real data in the real environment without disrupting the operating subsystems. To enable testing during system operation, an on-line test technique based on an autonomous decentralized system structure was proposed. In this paper, functions to verify various types of application software are proposed: (1) for non-real-time application software, checking whether it communicates with every piece of operating application software; (2) for real-time application software, checking whether the timing of output data is within the timing deadline; (3) for new-version software, verifying whether the version of all functions is newer than that of the functions in the present software. An example of application software applied to the on-line test and the effectiveness of the technique are shown in a real system.

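The update rule analysed in “Analyzing dynamic voting using Petri nets” above can be pictured with a small generic sketch: a partition may install a new update only if it contains a majority of the sites that took part in the most recent update. This is the textbook dynamic-voting rule under simplifying assumptions (integer site identifiers, no tie-breaking variant); it is not the paper's Petri net model, and the function and names below are illustrative only.

    /* Illustrative sketch of the basic dynamic-voting update rule.
     * Assumes integer site identifiers and ignores the tie-breaking
     * used by some variants; not taken from the paper. */
    #include <stdbool.h>
    #include <stddef.h>

    /* Return true if the current partition contains a strict majority
     * of the sites that participated in the last installed update. */
    bool may_update(const int partition[], size_t partition_size,
                    const int last_quorum[], size_t last_quorum_size)
    {
        size_t present = 0;
        for (size_t i = 0; i < last_quorum_size; i++)
            for (size_t j = 0; j < partition_size; j++)
                if (last_quorum[i] == partition[j]) {
                    present++;
                    break;
                }
        return 2 * present > last_quorum_size;   /* strict majority */
    }

    /* After a successful update, the sites of the current partition
     * become the quorum against which the next update is checked. */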
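For “Implementation and performance of a stable-storage service in Unix” above, a classic way to obtain crash-surviving writes on Unix is to keep two checksummed replicas of each record and force each one to disk before touching the other, so that at least one valid copy survives a crash at any point. The sketch below shows only that generic two-copy scheme under assumed file names and record layout; it is not the service described in the paper.

    /* Generic two-copy stable write on Unix; file names, record layout
     * and checksum are illustrative assumptions, not the paper's design. */
    #include <fcntl.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Simple checksum used at recovery to tell a torn write from a good one. */
    static uint32_t checksum(const void *buf, size_t len)
    {
        const unsigned char *p = buf;
        uint32_t sum = 0;
        while (len--)
            sum = sum * 31 + *p++;
        return sum;
    }

    /* Write one checksummed replica and force it to disk. */
    static int write_replica(const char *path, const void *rec, size_t len)
    {
        uint32_t sum = checksum(rec, len);
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
            return -1;
        if (write(fd, &sum, sizeof sum) != (ssize_t)sizeof sum ||
            write(fd, rec, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            return -1;
        }
        return close(fd);
    }

    /* Update the replicas one after the other: a crash can corrupt at
     * most the replica currently being written, leaving the other valid. */
    int stable_write(const void *rec, size_t len)
    {
        if (write_replica("stable.0", rec, len) != 0)
            return -1;
        return write_replica("stable.1", rec, len);
    }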
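One of the RAID-inspired ideas behind “Improving the performance of coordinated checkpointers on networks of workstations using RAID techniques” above is to protect checkpoints with parity, so that the checkpoint lost with a crashed workstation can be rebuilt from the surviving ones. The sketch below shows only that generic XOR-parity computation over fixed-size blocks; the block size, names and in-memory layout are assumptions, not the placement strategies evaluated in the paper.

    /* Generic RAID-style XOR parity over n checkpoint blocks. The fixed
     * block size and in-memory layout are illustrative assumptions. */
    #include <stddef.h>
    #include <string.h>

    #define BLOCK_SIZE 4096

    /* XOR the checkpoint blocks of n processes into one parity block. */
    void compute_parity(unsigned char parity[BLOCK_SIZE],
                        unsigned char blocks[][BLOCK_SIZE], size_t n)
    {
        memset(parity, 0, BLOCK_SIZE);
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < BLOCK_SIZE; j++)
                parity[j] ^= blocks[i][j];
    }

    /* Rebuild the block of a single failed process by XOR-ing the parity
     * block with the blocks of all surviving processes. */
    void rebuild_block(unsigned char out[BLOCK_SIZE],
                       const unsigned char parity[BLOCK_SIZE],
                       unsigned char blocks[][BLOCK_SIZE],
                       size_t n, size_t failed)
    {
        memcpy(out, parity, BLOCK_SIZE);
        for (size_t i = 0; i < n; i++)
            if (i != failed)
                for (size_t j = 0; j < BLOCK_SIZE; j++)
                    out[j] ^= blocks[i][j];
    }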