Parallel and Distributed Systems, IEEE Transactions on

Issue 6 • Date June 1999

  • Editorial: special section on dependable real-time systems

    Page(s): 529 - 531
    Freely Available from IEEE
  • Real-time dependable channels: customizing QoS attributes for distributed systems

    Page(s): 600 - 612

    Communication services that provide enhanced Quality of Service (QoS) guarantees related to dependability and real time are important for many applications in distributed systems. This paper presents real-time dependable (RTD) channels, a communication-oriented abstraction that can be configured to meet the QoS requirements of a variety of distributed applications. This customization ability is based on using CactusRT, a system that supports the construction of middleware services out of software modules called micro-protocols. Each micro-protocol implements a different semantic property or property variant and interacts with other micro-protocols using an event-driven model supported by the CactusRT runtime system. In addition to RTD channels, CactusRT and its implementation are described. This prototype executes on a cluster of Pentium PCs running the OpenGroup/RI MK 7.3 Mach real-time operating system and CORDS, a system for building network protocols based on the x-kernel.

  • GUARDS: a generic upgradable architecture for real-time dependable systems

    Page(s): 580 - 599

    The development and validation of fault-tolerant computers for critical real-time applications are currently both costly and time-consuming. Often, the underlying technology is out-of-date by the time the computers are ready for deployment. Obsolescence can become a chronic problem when the systems in which they are embedded have lifetimes of several decades. This paper gives an overview of the work carried out in a project that is tackling the issues of cost and rapid obsolescence by defining a generic fault-tolerant computer architecture based essentially on commercial off-the-shelf (COTS) components (both processor hardware boards and real-time operating systems). The architecture uses a limited number of specific, but generic, hardware and software components to implement an architecture that can be configured along three dimensions: redundant channels, redundant lanes, and integrity levels. The two dimensions of physical redundancy allow the definition of a wide variety of instances with different fault tolerance strategies. The integrity level dimension allows application components of different levels of criticality to coexist in the same instance. The paper describes the main concepts of the architecture, the supporting environments for development and validation, and the prototypes currently being implemented.

  • Chameleon: a software infrastructure for adaptive fault tolerance

    Page(s): 560 - 579

    This paper presents Chameleon, an adaptive infrastructure, which allows different levels of availability requirements to be simultaneously supported in a networked environment. Chameleon provides dependability through the use of special ARMORs (Adaptive, Reconfigurable, and Mobile Objects for Reliability) that control all operations in the Chameleon environment. Three broad classes of ARMORs are defined: 1) Managers oversee other ARMORs and recover from failures in their subordinates. 2) Daemons provide communication gateways to the ARMORs at the host node. They also make available a host's resources to the Chameleon environment. 3) Common ARMORs implement specific techniques for providing application-required dependability. Employing ARMORs, Chameleon makes available different fault-tolerant configurations and maintains run-time adaptation to changes in the availability requirements of an application. Flexible ARMOR architecture allows their composition to be reconfigured at run-time, i.e., the ARMORs may dynamically adapt to changing application requirements. In this paper, we describe ARMOR architecture, including ARMOR class hierarchy, basic building blocks, ARMOR composition, and use of ARMOR factories. We present how ARMORs can be reconfigured and reengineered and demonstrate how the architecture serves our objective of providing an adaptive software infrastructure. To our knowledge, Chameleon is one of the few real implementations which enables multiple fault tolerance strategies to exist in the same environment and supports fault-tolerant execution of substantially off-the-shelf applications via a software infrastructure only. Chameleon provides fault tolerance from the application's point of view as well as from the software infrastructure's point of view. To demonstrate the Chameleon capabilities, we have implemented a prototype infrastructure which provides a set of ARMORs to initialize the environment and to support the dual and TMR application execution modes. Through this testbed environment, we measure the execution overhead and recovery times from failures in the user application, the Chameleon ARMORs, the hardware, and the operating system.

  • Overload management in real-time control applications using (m, k)-firm guarantee

    Page(s): 549 - 559

    Tasks in a real-time control application are usually periodic and they have deadline constraints by which each instance of a task is expected to complete its computation, even in the adverse circumstances caused by component failures. Techniques to recover from processor failures often involve a reconfiguration in which all tasks are assigned to fault-free processors. This reconfiguration may result in processor overload where it is no longer possible to meet the deadlines of all tasks. In this paper, we discuss an overload management technique which discards selected task instances in such a way that the performance of the control loops in the system remains satisfactory even after a failure. The technique is based on the rationale that real-time control applications can tolerate occasional misses of the control law updates, especially if the control law is modified to account for these missed updates. The paper devises a scheduling policy which deterministically guarantees when and where the misses will occur. The paper also proposes a methodology for modifying the control law to minimize the deterioration in the control system behavior as a result of these missed control law updates.

  • Design and evaluation of system-level checks for on-line control flow error detection

    Page(s): 627 - 641

    This paper evaluates the concurrent error detection capabilities of system-level checks, using fault and error injection. The checks comprise application and system level mechanisms to detect control flow errors. We propose Enhanced Control-Flow Checking Using Assertions (ECCA). In ECCA, branch-free intervals (BFIs) in a given high or intermediate level program are identified and the entry and exit points of the intervals are determined. BFIs are then grouped into blocks, the size of which is determined through a performance/overhead analysis. The blocks are then fortified with preinserted assertions. For the high level ECCA, we describe an implementation of ECCA through a preprocessor that will automatically insert the necessary assertions into the program. Then, we describe the intermediate implementation possible through modifications made on gcc to make it ECCA capable. The fault detection capabilities of the checks are evaluated both analytically and experimentally. Fault injection experiments are conducted using FERRARI to determine the fault coverage of the proposed techniques.

  • The timed asynchronous distributed system model

    Page(s): 642 - 657

    We propose a formal definition for the timed asynchronous distributed system model. We present extensive measurements of actual message and process scheduling delays and hardware clock drifts. These measurements confirm that this model adequately describes current distributed systems such as a network of workstations. We also give an explanation of why practically needed services, such as consensus or leader election, which are not implementable in the time-free model, are implementable in the timed asynchronous system model.

  • A real-time primary-backup replication service

    Page(s): 533 - 548

    This paper presents a real-time primary-backup replication scheme to support fault-tolerant data access in a real-time environment. The main features of the system are fast response to client requests, bounded inconsistency between primary and backup, temporal consistency guarantee for replicated data, and quick recovery from failures. The paper defines external and interobject temporal consistency, the notion of phase variance, and builds a computation model that ensures such consistencies for replicated data deterministically where the underlying communication mechanism provides deterministic message delivery semantics and probabilistically where no such support is available. It also presents an optimization of the system and an analysis of the failover process which includes failover consistency and failure recovery time. An implementation of the proposed scheme is built within the x-kernel architecture on the MK 7.2 microkernel from the Open Group. The results of a detailed performance evaluation of this implementation are also discussed.

  • Experimental evaluation of behavior-based failure-detection schemes in real-time communication networks

    Page(s): 613 - 626

    Effective detection of failures is essential for reliable communication services. Traditionally, non-real-time computer networks have relied on behavior-based techniques for detecting communication failures. That is, each node uses heartbeats to detect the failure of its neighbors and the end-to-end transport protocol (e.g., TCP) achieves reliable communication by acknowledgment/retransmission. Recently, there has been a growing demand for reliable "real-time" communication, but little research has been done on the failure detection problem. In this paper, we present two behavior-based failure-detection schemes, neighbor detection and end-to-end detection, for reliable real-time communication services and experimentally evaluate their effectiveness. Specifically, we measure and analyze the coverage and latency of these detection schemes through fault-injection experiments. The experimental results have shown that nearly all failures can be detected very quickly by the neighbor detection scheme, while the end-to-end detection scheme uncovers the remaining failures with larger detection latencies.

  • Periodically regular chordal rings

    Page(s): 658 - 672

    Chordal rings have been proposed in the past as networks that combine the simple routing framework of rings with the lower diameter, wider bisection, and higher resilience of other architectures. Virtually all proposed chordal ring networks are node-symmetric, i.e., all nodes have the same in/out degree and interconnection pattern. Unfortunately, such regular chordal rings are not scalable. In this paper, periodically regular chordal (PRC) ring networks are proposed as a compromise for combining low node degree with small diameter. By varying the PRC ring parameters, one can obtain architectures with significantly different characteristics (e.g., from linear to logarithmic diameter), while maintaining an elegant framework for computation and communication. In particular, a very simple and efficient routing algorithm works for the entire spectrum of PRC rings thus obtained. This flexibility has important implications for key system attributes such as architectural scalability, software portability, and fault tolerance. Our discussion is centered on unidirectional PRC rings with in/out-degree of 2. We explore the basic structure, topological properties, optimization of parameters, VLSI layout, and scalability of such networks, develop packet and wormhole routing algorithms for them, and briefly compare them to competing fixed-degree architectures such as symmetric chordal rings, meshes, tori, and cube-connected cycles.


Aims & Scope

IEEE Transactions on Parallel and Distributed Systems (TPDS) is published monthly. It publishes a range of papers, comments on previously published papers, and survey articles that deal with the parallel and distributed systems research areas of current importance to our readers.

Meet Our Editors

Editor-in-Chief
David Bader
College of Computing
Georgia Institute of Technology