Computers, IEEE Transactions on

Issue 2 • Date Feb. 2003

  • Guest editorial: special issue on reliable distributed systems

    Page(s): 97 - 98
    Freely Available from IEEE
  • Comments on "A fast and efficient processor allocation scheme for mesh-connected multicomputers"

    Page(s): 255 - 256

    In a recent paper by B.S. Yoo and C.R. Das (2002), the so-called stack-based allocation (SBA) algorithm is claimed to be, at worst, O(B^2) expensive. In this paper, we present an exception for which the time complexity of SBA is at least O(B^3). Furthermore, we point out the discrepancy in the complexity analysis.
  • Perfect failure detection in timed asynchronous systems

    Page(s): 99 - 112

    Perfect failure detectors can correctly decide whether a computer has crashed. However, it is impossible to implement a perfect failure detector in purely asynchronous systems. We show how to enforce perfect failure detection in timed asynchronous systems with hardware watchdogs. The two main system model assumptions are: 1) each computer can measure time intervals with a known maximum error and 2) each computer has a watchdog that crashes the computer unless the watchdog is periodically updated. We have implemented a system that satisfies both assumptions using a combination of off-the-shelf software and hardware. To implement a perfect failure detector for process crash failures, we show that, in some systems, a hardware watchdog is actually not necessary.
  • Building survivable services using redundancy and adaptation

    Page(s): 181 - 194

    Survivable systems (that is, systems that can continue to provide service despite failures, intrusions, and other threats) are increasingly needed in a wide variety of civilian and military application areas. As a step toward realizing such systems, this paper advocates the use of redundancy and adaptation to build survivable services that can provide core functionality for implementing survivability in networked environments. An approach to building such services using these techniques is described and a concrete example involving a survivable communication service is given. This service is based on Cactus, a system for building highly configurable network protocols that offers the flexibility needed to easily add redundant and adaptive components. Initial performance results for a prototype implementation of the communication service built using Cactus/C2.1 running on Linux are also given.
  • Reliable real-time communication in cooperative mobile applications

    Page(s): 166 - 180

    Embedded systems are expected to provide increasingly complex and safety-critical services that will, sooner or later, require the cooperation of several such systems for their fulfillment. In particular, coordinating the access to shared physical and information technological resources will become a general problem. Examples are mobile robots in industrial automation or car-to-car coordination for future traffic control applications. In such applications, cooperation is subject to strong real-time and reliability requirements. In this paper, we present an architecture that allows autonomous mobile systems to schedule shared resources in real time using their own wireless distributed infrastructure. In this architecture, there is a clear separation between the application-specific scheduling part and the application-independent communication part that constitutes the real-time and reliability hardcore of the system. The latter provides clock synchronization, real-time atomic multicast, and real-time group membership based on an IEEE 802.11 standard wireless LAN. An application prototype shows how the architecture can be used in future mobile cooperative applications.
  • Sustaining availability of Web services under distributed denial of service attacks

    Page(s): 195 - 208

    The recent tide of Distributed Denial of Service (DDoS) attacks against high-profile web sites demonstrates how devastating DDoS attacks are and how defenseless the Internet is under such attacks. We design a practical DDoS defense system that can protect the availability of web services during severe DDoS attacks. The basic idea behind our system is to isolate and protect legitimate traffic from a huge volume of DDoS traffic when an attack occurs. Traffic that needs to be protected can be recognized and protected using efficient cryptographic techniques. Therefore, by provisioning adequate resources (e.g., bandwidth) to legitimate traffic separated by this process, we are able to provide adequate service to a large percentage of clients during DDoS attacks. The worst-case performance (effectiveness) of the system is evaluated based on a novel game-theoretical framework, which characterizes the natural adversarial relationship between a DDoS adversary and the proposed system. We also conduct a simulation study to verify a key assumption used in the game-theoretical analysis and to demonstrate the system dynamics during an attack.
  • Peer-to-peer membership management for gossip-based protocols

    Page(s): 139 - 149

    Gossip-based protocols for group communication have attractive scalability and reliability properties. The probabilistic gossip schemes studied so far typically assume that each group member has full knowledge of the global membership and chooses gossip targets uniformly at random. The requirement of global knowledge impairs their applicability to very large-scale groups. In this paper, we present SCAMP (Scalable Membership protocol), a novel peer-to-peer membership protocol which operates in a fully decentralized manner and provides each member with a partial view of the group membership. Our protocol is self-organizing in the sense that the size of partial views naturally converges to the value required to support a gossip algorithm reliably. This value is a function of the group size, but is achieved without any node knowing the group size. We propose additional mechanisms to achieve balanced view sizes even with highly unbalanced subscription patterns. We present the design, theoretical analysis, and a detailed evaluation of the basic protocol and its refinements. Simulation results show that the reliability guarantees provided by SCAMP are comparable to previous schemes based on global knowledge. The scale of the experiments attests to the scalability of the protocol.
  • Performability evaluation of multipurpose multiprocessor systems: the "separation of concerns" approach

    Page(s): 223 - 236

    The aim of our work is to provide a modeling framework for evaluating performability measures of Multipurpose, Multiprocessor Systems (MMSs). The originality of our approach is in the explicit separation between the architectural and environmental concerns of a system. The overall dependability model, based on stochastic reward nets, is composed of 1) an architectural model describing the behavior of system hardware and software components, 2) a service-level model, and 3) a maintenance policy model. The two latter models are related to the system utilization environment. The results can be used for supporting the manufacturer design choices as well as the potential end-user configuration selection. We illustrate the approach on a particular family of MMSs under investigation by a system manufacturer for Internet and e-commerce applications. As the systems are scalable, we consider two architectures: a reference one composed of 16 processors and an extended one with 20 processors. Then, we use the obtained results to evaluate the performability of a clustered system composed of four reference systems. We evaluate comprehensive measures defined with respect to the end-user service requirements and specific measures in relation to the distributed shared memory paradigm.
  • Fault-tolerant mobile agent execution

    Page(s): 209 - 222

    Mobile agents have attracted considerable interest in recent years. In the context of mobile agents, fault tolerance is crucial to enable the integration of mobile agent technology into today's business applications. This article identifies two important properties for fault-tolerant mobile agent execution: nonblocking and exactly-once. Nonblocking ensures that the agent execution can proceed despite a single failure of the agent or the machine, for instance. Replication is the generally adopted mechanism to prevent blocking, but may lead to multiple executions of the agent (i.e., a violation of the exactly-once property), which is undesirable with operations that have side effects. Hence, we propose that fault-tolerant mobile agent execution be modeled as a sequence of agreement problems. Our approach is nonblocking and ensures exactly-once execution. FATOMAS, our prototype fault-tolerant mobile agent system, implements our approach. Its performance evaluation illustrates the overhead of the replication mechanisms.
  • Transparent recovery from intermittent faults in time-triggered distributed systems

    Page(s): 113 - 125

    The time-triggered model, with tasks scheduled in static (offline) fashion, provides a high degree of timing predictability in safety-critical distributed systems. Such systems must also tolerate transient and intermittent failures, which occur far more frequently than permanent ones. Software-based recovery methods using temporal redundancy, such as task reexecution and primary/backup, while incurring performance overhead, are cost-effective methods of handling these failures. We present a constructive approach to integrating runtime recovery policies in a time-triggered distributed system. Furthermore, the method provides transparent failure recovery in that a processor recovering from task failures does not disrupt the operation of other processors. Given a general task graph with precedence and timing constraints and a specific fault model, the proposed method constructs the corresponding fault-tolerant (FT) schedule with sufficient slack to accommodate recovery. We introduce the cluster-based failure recovery concept, which determines the best placement of slack within the FT schedule so as to minimize the resulting time overhead. Contingency schedules, also generated offline, revise this FT schedule to mask task failures on individual processors while preserving precedence and timing constraints. We present simulation results which show that, for small-scale embedded systems having task graphs of moderate complexity, the proposed approach generates FT schedules which incur about 30-40 percent performance overhead when compared to corresponding non-fault-tolerant ones.
  • Process recovery in heterogeneous systems

    Page(s): 126 - 138

    Heterogeneous computing environments, where computers may have different instruction set architectures, data representations, and operating systems, complicate checkpointing and recovery of processes. This paper describes an approach to recovery and an implementation, PREACHES, that provides portable checkpointing of single-process applications in heterogeneous systems using checkpoint propagation. The checkpoint propagation mechanism creates machine-dependent checkpoints for different architectures in the heterogeneous environment. A process is restored on a specific machine with the checkpoint that is appropriate for the architecture. An implementation of PREACHES has been evaluated on a heterogeneous network of workstations, including Sun, HP, and Pentium machines. The experimental results show that PREACHES achieves efficient checkpointing and rapid recovery.
  • Semantically reliable multicast: definition, implementation, and performance evaluation

    Page(s): 150 - 165

    Semantic reliability is a novel correctness criterion for multicast protocols based on the concept of message obsolescence: A message becomes obsolete when its content or purpose is superseded by a subsequent message. By exploiting obsolescence, a reliable multicast protocol may drop irrelevant messages to find additional buffer space for new messages. This makes the multicast protocol more resilient to transient performance perturbations of group members, thus improving throughput stability. This paper describes our experience in developing a suite of semantically reliable protocols. It summarizes the motivation, definition, and algorithmic issues and presents performance figures obtained with a running implementation. The data obtained experimentally is compared with analytic and simulation models. This comparison allows us to confirm the validity of these models and the usefulness of the approach. Finally, the paper reports the application of our prototype to distributed multiplayer games.
  • Reflective fault-tolerant systems: from experience to challenges

    Page(s): 237 - 254

    This paper presents research work performed on the development and the verification of dependable reflective systems based on MetaObject Protocols (MOPs). We describe our experience, we draw the lessons learned from both a design and a validation viewpoint, and we discuss some possible future trends on this topic. The main originality of this work lies in the combination of both design and validation issues for the development of reflective systems, which has led to the definition of a reflective framework for the next generation of fault-tolerant systems. This framework includes: 1) the specification of a MetaObject Protocol suited to the implementation of fault-tolerant systems and 2) the definition of a general test strategy to guide its verification. The proposed approach is generic and solves many issues related to the use and evolution of system platforms with dependability requirements. Two different instances of the specified MOP have been implemented in order to study the impact of the MOP architecture in the development of a reflective fault-tolerant system. As far as the test strategy is concerned, a different testing level is associated with each reflective mechanism defined in the MOP. For each testing level, we characterize the test objectives and the required test environments. According to this experience, several new research challenges are finally identified.

Aims & Scope

The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field.

Meet Our Editors

Editor-in-Chief
Albert Y. Zomaya
School of Information Technologies
Building J12
The University of Sydney
Sydney, NSW 2006, Australia
http://www.cs.usyd.edu.au/~zomaya
albert.zomaya@sydney.edu.au