Twenty-Seventh Annual International Symposium on Fault-Tolerant Computing (FTCS-27), 1997. Digest of Papers.

Date: 24-27 June 1997

  • Proceedings of IEEE 27th International Symposium on Fault Tolerant Computing [Front Matter and Table of Contents]

  • Managing dependencies-a key problem in fault-tolerant distributed algorithms

    Page(s): 2 - 10

    As computing environments become increasingly distributed and modular, fault-tolerance algorithms have to catch up with the consequences. In particular, they often have to deal with the management of dynamically evolving dependencies. In this paper, we identify this as a basic problem for the design of fault-tolerance algorithms. Our approach to coping with this problem is the specification and realization of a distributed dependency-management system. Realized as a generic software tool, it can be used as a customizable component that eases the design and implementation of existing and future algorithms. Our approach covers all relevant issues for the subject: modeling, consistency considerations, a generic interface, algorithms, and implementation.

  • An approach to fault-tolerant parallel processing on intermittently idle, heterogeneous workstations

    Page(s): 11 - 20

    We propose a novel approach to harnessing the idle cycles of workstations connected by LANs/WANs for long-running scientific computations, and present performance results for our prototype system, Persistent Linda (PLinda). PLinda offers low runtime overhead and migration among heterogeneous architectures, while retreating quickly when owners return to their workstations. PLinda achieves this by implementing a lightweight transaction model that lacks serializability and durability but preserves a basic guarantee: if a PLinda execution terminates, it has the same result as some failure-free Linda execution. Further, by storing the state of a PLinda process as a set of core variables as of each transaction commit, a PLinda process can migrate among different architectures. Within the space of lightweight transaction models, we offer three mechanisms that make different tradeoffs between failure-free performance and recovery time. All three mechanisms may be used at the same time in a single application, with each process using the mechanism best suited to its characteristics. Our experiments illustrate the tradeoffs of the three mechanisms as well as the overall performance of the system on applications from physics and finance.

  • Renegotiable quality of service-a new scheme for fault tolerance in wireless networks

    Page(s): 21 - 30

    In this paper we propose the concept that faults in telecommunications networks often manifest themselves as reductions in service quality, and can therefore be addressed using the notion of Quality of Service (QoS). In wireless ATM networks, the ability to provide QoS guarantees for high-priority traffic in the presence of noise or faults is of utmost importance. Moreover, there is a need for renegotiating existing QoS on an established connection, since the characteristics of a wireless link may well change during the lifetime of a connection due to mobile hosts' movements or external interference. We describe a general QoS strategy as a fault-tolerance mechanism and address the problems associated with providing QoS over a wireless link. We present a QoS scheme with renegotiation capability, define an API (application programming interface) for access to this scheme, describe our implementation of this QoS API on the SWAN system, a wireless ATM network, and summarize its performance using measurements from a series of experiments based on different fault scenarios.

  • VERIFY: evaluation of reliability using VHDL-models with embedded fault descriptions

    Page(s): 32 - 36

    A new technique for reliability evaluation of digital systems is presented by demonstrating the functionality and usage of the simulation-based fault injector VERIFY (VHDL-based Evaluation of Reliability by Injecting Faults efficientlY). This software tool introduces a new way of describing the behavior of hardware components in the presence of faults by extending the VHDL language with fault-injection signals together with their rates of occurrence. The accuracy of the results comes from using the same VHDL models that were developed during the conventional phases of hardware design. To demonstrate the capabilities of VERIFY, a VHDL model of a simple 32-bit processor (DP32) is used as an example to illustrate the individual steps of reliability evaluation.

  • Toward accessibility enhancement of dependability modeling techniques and tools

    Page(s): 37 - 41

    Although various dependability evaluation techniques and tools have been developed in the last two decades, no adequate attention has been paid to allowing system designers not well versed in analytic modeling to easily employ these techniques and tools. In this paper, we report our experiences on accessibility enhancement for off-the-shelf modeling techniques and tools. In particular, we discuss our approaches to the development of a user-friendly dependability-evaluation workbench which is intended to lead the user to exploit the features and capabilities of the modeling tool SHARPE.

  • Evaluation of a 32-bit microprocessor with built-in concurrent error-detection

    Page(s): 42 - 46

    This paper describes the results from heavy-ion testing of ERC32, a 32-bit processing core with on-chip concurrent error detection. The parity-based error-detection mechanisms succeeded in detecting more than 97.5% of all injected errors, significantly reducing the rate of undetected SEU errors. Most errors occurred in registers, but some errors in combinational logic could also be observed. The cross-section for errors in combinational logic is, however, too small to have an influence on the overall error rate. The conclusion is therefore that parity-based error detection is well suited to detecting SEU errors in VLSI devices for space applications.
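
    The detection principle can be sketched in a few lines: store each word with an even-parity bit and recompute it on every read. This is an illustration of parity checking in general, not the ERC32's actual circuitry; all names are made up.

```python
def parity(word: int) -> int:
    """Even-parity bit of a word: XOR of all its bits."""
    p = 0
    while word:
        p ^= word & 1
        word >>= 1
    return p

def store(word: int) -> tuple:
    """Keep a word together with its parity bit, as the hardware would."""
    return (word, parity(word))

def check(stored: tuple) -> bool:
    """Recompute parity on read; a mismatch flags a single-bit upset."""
    word, p = stored
    return parity(word) == p

reg = store(0xDEADBEEF)
assert check(reg)                              # clean word passes
assert not check((reg[0] ^ (1 << 7), reg[1]))  # a flipped bit is caught
```

    A single parity bit catches any odd number of flipped bits, which is why it works well against the single-event upsets that dominate in the tested environment.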

  • Probabilistic checkpointing

    Page(s): 48 - 57

    Many optimization schemes have been proposed to reduce the overhead of checkpointing. Incremental checkpointing based on memory-page protection has been one of the successful schemes used to reduce the overhead and to improve the performance of checkpointing. In this paper, we propose two checkpointing schemes, called "block encoding" and "combined block encoding", which further reduce the checkpointing overhead. The smallest unit of checkpoint data in our scheme is a block, which is smaller than a page; this reduces the amount of checkpoint data compared with page-based incremental checkpointing. One drawback of the proposed schemes is the possibility of aliasing in encoded words. In this paper, however, we show that the aliasing probability is near zero when an 8-byte encoded word is used. The performance of the proposed schemes is analyzed and measured through experiments. First, we construct an analytic model that predicts the checkpointing overhead. Using this model, we can estimate the block size that produces the best performance for a given target program. Next, the proposed schemes are implemented on libckpt, a general-purpose checkpointing library for Unix-based systems developed at the University of Tennessee. According to our experimental results, the proposed schemes reduce the overhead by 11.7% in the best case and increase it by 0.5% in the worst case compared with page-based incremental checkpointing. In most cases, the combined block encoding scheme shows an improvement over both block encoding and page-based incremental checkpointing.
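
    The block-encoding idea can be sketched as follows: keep an 8-byte encoded word per block, and re-save a block only when its word changes. The hash choice and names here are assumptions for illustration, not the paper's exact encoding; two different block contents mapping to the same word would be exactly the aliasing the paper bounds.

```python
import hashlib

BLOCK = 64  # bytes; deliberately smaller than a typical 4 KB page

def encode(block: bytes) -> bytes:
    """8-byte encoded word per block (here a truncated SHA-256)."""
    return hashlib.sha256(block).digest()[:8]

def incremental_checkpoint(memory: bytes, prev_codes: dict) -> tuple:
    """Save only blocks whose encoded word differs from the last checkpoint."""
    saved = {}
    codes = {}
    for off in range(0, len(memory), BLOCK):
        blk = memory[off:off + BLOCK]
        code = encode(blk)
        codes[off] = code
        if prev_codes.get(off) != code:
            saved[off] = blk        # block is (almost certainly) dirty
    return saved, codes

mem = bytearray(256)
saved, codes = incremental_checkpoint(bytes(mem), {})
assert len(saved) == 4              # first checkpoint saves every block
mem[100] = 1                        # touch one byte in the second block
saved, codes = incremental_checkpoint(bytes(mem), codes)
assert list(saved) == [64]          # only the modified block is re-saved
```

    Because a block is smaller than a page, touching one byte dirties 64 bytes of checkpoint data here rather than a whole page.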

  • Portable checkpointing for heterogeneous architectures

    Page(s): 58 - 67

    Current approaches for checkpointing assume system homogeneity, where checkpointing and recovery are both performed on the same processor architecture and operating system configuration. Sometimes it is desirable or necessary to recover a failed computation on a different processor architecture. For such situations, checkpointing and recovery must be portable. In this paper, we argue that source-to-source compilation is an appropriate concept for this purpose. We describe the compilation techniques that we developed for the design of the c2ftc prototype. The c2ftc compiler enables machine-independent checkpoints by automatically generating checkpointing and recovery code. Sequential C programs are compiled into fault-tolerant C programs whose checkpoints can be migrated across heterogeneous networks and restarted on binary-incompatible architectures. Experimental results on several systems provide evidence that the performance penalty of portable checkpointing is negligible for realistic checkpointing frequencies.
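
    c2ftc achieves portability with compiler-generated save/restore code; the underlying requirement — state written in a fixed, architecture-neutral layout — can be illustrated with a toy sketch (this is not c2ftc's actual checkpoint format):

```python
import struct

def save_state(x: int, y: float) -> bytes:
    """Write state with fixed sizes and big-endian byte order, so a
    binary-incompatible machine can read the checkpoint back."""
    return struct.pack(">qd", x, y)   # 8-byte signed int, IEEE-754 double

def restore_state(blob: bytes):
    return struct.unpack(">qd", blob)

ckpt = save_state(42, 3.5)            # would be written to stable storage
assert restore_state(ckpt) == (42, 3.5)
```

    Pointers and other machine-dependent layouts are what make the real problem hard; the compiler inserts code to rebuild them on recovery.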

  • A communication-induced checkpointing protocol that ensures rollback-dependency trackability

    Page(s): 68 - 77

    Considering an application in which processes take local checkpoints independently (called basic checkpoints), this paper develops a protocol that forces them to take some additional local checkpoints (called forced checkpoints) so that the resulting checkpoint and communication pattern satisfies the Rollback-Dependency Trackability (RDT) property. This property states that all dependencies between local checkpoints are on-line trackable by using a transitive dependency vector. Compared to other protocols ensuring the RDT property, the proposed protocol is less conservative in the sense that it takes fewer additional local checkpoints. It attains this goal through a subtle tracking of causal dependencies on already-taken checkpoints; this tracking is then used to prevent the occurrence of hidden dependencies. As indicated by a simulation study, the proposed protocol compares favorably with other protocols; moreover, it associates on-the-fly with each local checkpoint C the minimum global checkpoint to which C belongs.
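
    The transitive dependency vector at the heart of RDT can be sketched as follows (a minimal illustration; the protocol's forced-checkpoint rule is omitted and all names are mine):

```python
class Process:
    """Tracks rollback dependencies with a transitive dependency vector."""

    def __init__(self, pid: int, n: int):
        self.pid = pid
        self.dv = [0] * n      # dv[q] = highest checkpoint of q we depend on
        self.dv[pid] = 1       # current checkpoint interval of this process

    def take_checkpoint(self):
        self.dv[self.pid] += 1

    def send(self):
        return list(self.dv)   # piggyback the vector on every message

    def receive(self, piggyback):
        # componentwise max keeps all dependencies transitively trackable
        self.dv = [max(a, b) for a, b in zip(self.dv, piggyback)]

p0, p1, p2 = (Process(i, 3) for i in range(3))
p0.take_checkpoint()           # p0 enters interval 2
p1.receive(p0.send())          # p1 now depends on p0's checkpoint 2
p2.receive(p1.send())          # transitively, so does p2
assert p2.dv[0] == 2
```

    RDT holds when this on-line vector captures every dependency; hidden dependencies are the ones the vector would miss, and the forced checkpoints exist to prevent them.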

  • A method to automate user interface testing using variable finite state machines

    Page(s): 80 - 88

    A method has been developed that allows automating a portion of interface testing. A core component of this method is a simple, yet flexible way to specify a formal model of a user interface, named variable finite state machines (VFSM). The model can be converted into an equivalent finite state machine, from which tests can be generated. The method has been applied to the user interface of Navigator II, a wearable computer system employed by aircraft repair personnel. A VFSM model of the interface was constructed and used to generate 5,968 tests, each composed of an input sequence and its corresponding expected output sequence. The tests were then applied to an instrumented form of the interface, and the results were compared to the expected output. From the data collected, three error sources were detected in the original interface that had escaped months of previous debugging efforts and field usage.
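
    The test-generation step can be sketched on an ordinary FSM (the paper's VFSMs add variables, which are first expanded into such an FSM). The tiny interface model below is hypothetical, not the Navigator II model:

```python
# Toy interface FSM: (state, input) -> (next_state, output).
FSM = {
    ("idle", "press"): ("menu", "show_menu"),
    ("menu", "select"): ("idle", "run_item"),
    ("menu", "press"): ("idle", "hide_menu"),
}

def transition_cover(fsm, start):
    """One test per transition: a shortest input path from the start state
    to the transition's source, followed by the transition's own input."""
    paths = {start: []}        # shortest input sequence reaching each state
    frontier = [start]
    while frontier:
        s = frontier.pop(0)
        for (st, inp), (nxt, _) in fsm.items():
            if st == s and nxt not in paths:
                paths[nxt] = paths[s] + [inp]
                frontier.append(nxt)
    return [(paths[st] + [inp], out) for (st, inp), (nxt, out) in fsm.items()]

tests = transition_cover(FSM, "idle")
assert (["press"], "show_menu") in tests
assert (["press", "select"], "run_item") in tests
```

    Each generated pair is an input sequence plus the expected final output; running them against an instrumented interface and diffing the outputs is the automated part of the method.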

  • MetriStation: a tool for user-interface fault detection

    Page(s): 89 - 98

    Using extant methods of evaluation, user-interface fault detection is difficult, time consuming, error prone, often qualitative, and subject to evaluator bias. One of the more effective approaches, empirical evaluation, is data intensive, laboratory oriented, and requires up to 25 hours of skilled analyst time for every hour of testing time; hence it is seldom used. Preparing the evaluation is difficult; results can be inconsistent and hard to replicate. This paper presents MetriStation, a new, portable user-interface evaluation system that runs on an IBM-compatible desktop or notebook computer. MetriStation monitors user sessions, automatically capturing detailed data about user-system behavior including keystrokes, mouse clicks, mouse movements, video display context, user speech (think-aloud protocols), observer speech, and critical incidents. Hesitation analysis is used to automatically locate potential user-interface defects. Data can be analyzed and presented in up to three orders of magnitude less time than is conventionally possible; the uniformity of the data and analysis provides increased repeatability and accuracy over traditional evaluation methods. MetriStation's capabilities are demonstrated in a study of a simple interface.
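
    Hesitation analysis reduces to scanning the captured event stream for unusually long gaps; a minimal sketch (the 2-second threshold is an arbitrary assumption, not MetriStation's actual criterion):

```python
def hesitations(timestamps, threshold=2.0):
    """Flag gaps between consecutive user events that exceed the threshold
    (seconds); long pauses may mark points where the user got stuck."""
    return [(timestamps[i], timestamps[i + 1])
            for i in range(len(timestamps) - 1)
            if timestamps[i + 1] - timestamps[i] > threshold]

# Keystroke/click times in seconds; the 5-second gap is a candidate defect site.
events = [0.0, 0.4, 0.9, 1.1, 6.1, 6.3]
assert hesitations(events) == [(1.1, 6.1)]
```

    Flagged gaps are only candidates; an analyst still decides whether a pause reflects a defect or, say, the user reading output.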

  • Towards a statistical approach to testing object-oriented programs

    Page(s): 99 - 108

    Statistical testing is based on a probabilistic generation of test data: structural or functional criteria serve as guides for defining an input profile and a test size. Previous work has confirmed the high fault revealing power of this approach for procedural programs; it is now investigated for object-oriented programs. A method for incremental statistical testing is defined at the cluster level, based on the class inheritance hierarchy. Starting from the root class of the program, descendant class(es) are gradually added and test data are designed for (i) structural testing of newly defined features and, (ii) regression testing of inherited features. The feasibility of the method is exemplified by a small case study (a Travel Agency) implemented in Eiffel.

  • Work in Progress and Outrageous Opinions

    Page(s): 110

    Summary form only given, as follows. This session provides the opportunity for conference participants to make either of two types of short five-minute presentations. Work in progress: reports on formative new ideas that have not yet found their way into the literature, or work that is otherwise not to the point of publication. Outrageous opinions: deliberately provocative statements of opinion concerning some aspect of fault-tolerant computing; these must be topical, technically oriented, and in good taste, but otherwise anything is fair game. Titles and short abstracts of the presentations were submitted to the organizer by electronic mail or hardcopy, with a deadline close to the session time to encourage genuinely new ideas. Submissions have been selected based on perceived audience appeal, ability to stimulate discussion, and the need to maintain balance among the topics associated with FTCS. Some preference has been given to young researchers as a way to encourage new people to participate actively in the conference.

  • How hard is hard real-time communication on field-buses?

    Page(s): 112 - 121

    Continuity of service and bounded, known message-delivery latency are reliability requirements of a number of real-time applications, such as those served by field-buses. The analysis and design of such networks with respect to timing properties has been based on no-fault scenarios rather than on a performability perspective. In this paper, we present an analytical study of the inaccessibility of CAN and PROFIBUS. The study reveals that previous studies of the access delays of these field-buses were too optimistic from a performability viewpoint.

  • Experimental evaluation of failure-detection schemes in real-time communication networks

    Page(s): 122 - 131

    An effective failure-detection scheme is essential for reliable communication services. Most computer networks rely on behavior-based detection schemes: each node uses heartbeats to detect the failure of its neighbor nodes, and the transport protocol (such as TCP) achieves reliable communication by acknowledgment/retransmission. In this paper, we experimentally evaluate the effectiveness of such behavior-based detection schemes in real-time communication. Specifically, we measure and analyze the coverage and latency of two failure-detection schemes, neighbor detection and end-to-end detection, through fault-injection experiments. The experimental results show that a significant portion of failures can be detected very quickly by the neighbor-detection scheme, while the end-to-end detection scheme uncovers the remaining failures with larger detection latencies.
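
    The neighbor-detection scheme boils down to a heartbeat timeout per neighbor; a minimal sketch with illustrative names and timing values:

```python
def check_neighbors(last_heartbeat, now, timeout):
    """Declare a neighbor failed if no heartbeat arrived within `timeout`.

    last_heartbeat maps neighbor id -> arrival time of its latest heartbeat.
    """
    return [n for n, t in last_heartbeat.items() if now - t > timeout]

beats = {"A": 10.0, "B": 12.5, "C": 4.0}
assert check_neighbors(beats, now=13.0, timeout=5.0) == ["C"]
```

    The coverage/latency tradeoff the paper measures shows up directly in `timeout`: a short timeout detects failures faster but risks false suspicions under load.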

  • Graceful degradation in real-time control applications using (m, k)-firm guarantee

    Page(s): 132 - 141

    Tasks in a real-time control application are usually periodic and have deadline constraints by which each instance of a task is expected to complete its computation, even in the adverse circumstances caused by component failures. Techniques to recover from processor failures often involve a reconfiguration in which all tasks are assigned to fault-free processors. This reconfiguration may result in processor overload, where it is no longer possible to meet the deadlines of all tasks. In this paper, we discuss an overload-management technique that discards selected task instances in such a way that the performance of the control loops in the system remains satisfactory even after a failure. The technique is based on the rationale that real-time control applications can tolerate occasional misses of the control-law updates, especially if the control law is modified to account for these missed updates. The paper devises a scheduling policy that deterministically guarantees when and where the misses will occur, and proposes a methodology for modifying the control law to minimize the deterioration in control-system behavior resulting from these missed control-law updates.
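
    One way to decide deterministically which instances may be skipped under an (m, k)-firm guarantee is the evenly spaced mandatory-instance pattern from the (m, k)-firm literature; a sketch (the paper's own policy may differ in detail):

```python
from math import ceil, floor

def mandatory(a: int, m: int, k: int) -> bool:
    """Evenly spaced mandatory instances: at least m deadlines are met in
    any window of k consecutive instances of the task."""
    return a == floor(ceil(a * m / k) * k / m)

# (m, k) = (2, 3): the pattern repeats mandatory, mandatory, optional, ...
pattern = [mandatory(a, 2, 3) for a in range(6)]
assert pattern == [True, True, False, True, True, False]
# every window of 3 consecutive instances keeps at least 2 mandatory ones
assert all(sum(pattern[i:i + 3]) >= 2 for i in range(4))
```

    Because the skip positions are known in advance, the control law can be redesigned off-line to compensate for exactly those missed updates, which is the paper's second contribution.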

  • ACTIV-LOCSTEP: a test generation procedure based on logic simulation and fault activation

    Page(s): 144 - 151

    We present a test generation procedure for synchronous sequential circuits referred to as ACTIV-LOCSTEP. Like its predecessor LOCSTEP, ACTIV-LOCSTEP generates test sequences at low computational cost by using randomized search and avoiding fault-oriented test generation. However, ACTIV-LOCSTEP is fundamentally different from LOCSTEP, being based on the following observation. Consider an input sequence C that consists of a transient part C1, followed by a periodic part C2 that takes the fault-free circuit through a cycle of states. Suppose that a fault f is activated during the cycle. If the same input sequence does not create a cycle in the faulty circuit, or creates a cycle of a different length than the one traversed by the fault-free circuit, then f is likely to be detected after several repetitions of C2. In the resulting procedure, the test-sequence length is controlled by restricting the length of the input sequence C and the number of repetitions of the periodic part. Our experiments indicate that relatively short sequences and small numbers of repetitions of the periodic part allow large numbers of faults to be detected.

  • Robust search algorithms for test pattern generation

    Page(s): 152 - 161

    In recent years, several highly effective algorithms have been proposed for Automatic Test Pattern Generation (ATPG). Nevertheless, most of these algorithms rely heavily on different types of heuristics to achieve good empirical performance. Moreover, there has not been significant research on developing algorithms that are robust, in the sense that they can handle most faults with little heuristic guidance. In this paper, we describe an algorithm for ATPG that is robust and still very efficient. In contrast with existing ATPG algorithms, the proposed algorithm reduces heuristic knowledge to a minimum and relies on an optimized search algorithm to effectively prune the search space. Even though the experimental results are obtained using an ATPG tool built on top of a Propositional Satisfiability (SAT) algorithm, the same concepts can be integrated into application-specific algorithms.

  • Optimal structural diagnosis of wiring networks

    Page(s): 162 - 171

    We consider the problem of full diagnosis of all short faults among a set of nets using the maximum number of parallel tests. We study adaptive and non-adaptive algorithms for structural diagnosis, using information about the adjacency of the nets in the wiring network. The problem is formulated as a graph-search problem, and graph theory is used to derive optimal and near-optimal adaptive and non-adaptive algorithms. Our results indicate that adaptive diagnosis often uses exponentially fewer tests than traditional non-adaptive diagnosis, and structural diagnosis may use dramatically fewer tests than behavioral diagnosis.

  • A new approach to realizing fault-tolerant multiprocessor scheduling by exploiting implicit redundancy

    Page(s): 174 - 183

    In this paper we propose a new approach to fault-tolerant multiprocessor scheduling that exploits the implicit redundancy originally introduced by task duplication. In the new scheduling algorithm, we adopt two strategies: (1) some processing elements (PEs) are reserved only for realizing fault tolerance, and thus are not used for the original task scheduling (reserved scheduling); (2) the set of tasks is partitioned into several disjoint small subsets, and the algorithm is applied incrementally to each subset (phased scheduling). With this device, we can ensure that the finish times of schedules remain small even in the case of a single PE failure. We then apply the new scheduling algorithm to practical task graphs (LU decomposition and a Laplace-equation solver). The experimental results show that the obtained schedules can tolerate a single PE failure at the cost of a small degree of time redundancy.

  • Fault recovery mechanism for multiprocessor servers

    Page(s): 184 - 193

    Achieving higher reliability in open server computer systems at low cost has recently attracted increasing interest. To satisfy this demand, we propose a new fault-recovery mechanism. We extended the recovery-cache scheme to state-of-the-art multiprocessor server systems and built a system-level fault-recovery mechanism. It enables the system to recover from most intermittent hardware errors without rebooting. Furthermore, faulty processors can be isolated dynamically, and not only hardware errors but also many operating-system panics caused by unanticipated software errors can be recovered. The fault-recovery mechanism is implemented as an add-on hardware module with a controlling software module, and is fully transparent to application programs. Thus no modification is required to the basic hardware, and binary compatibility, which is mandatory for open systems, is maintained. System performance was evaluated using the TPC-C benchmark. We also built an experimental system with prototype hardware.

  • An object-oriented testbed for the evaluation of checkpointing and recovery systems

    Page(s): 194 - 203

    The paper presents the design and development of an object-oriented testbed for simulation and analysis of checkpointing and recovery schemes in distributed systems. An important contribution of the testbed is a unified environment that provides a set of specialized components for easy and detailed simulation of checkpointing and recovery schemes. The testbed allows a designer to mix and match different components, either to study the effectiveness of a particular scheme or to freely experiment with hybrid designs before the actual implementation. The testbed also facilitates the evaluation of interdependencies among various parameters, such as communication and application dynamics, and their effect on the performance of checkpointing and recovery schemes. The implementation of the testbed as an extension of DEPEND, an integrated design and fault-injection environment, provides for unique system-level dependability analysis under realistic fault conditions, unlike existing simulation tools. The authors illustrate the versatility of the testbed with four diverse applications, ranging from comparing the performance of two checkpointing and recovery schemes to studying the effect of checkpoint size.

  • Using non-volatile storage to improve the reliability of RAID5 disk arrays

    Page(s): 206 - 215

    RAID5 disk arrays are becoming popular solutions for providing fast access to data for transaction-processing applications. They provide good performance at low cost without sacrificing much data reliability. Their main drawback is poor performance for small write requests. Techniques using non-volatile RAM (NVRAM) have been proposed to improve the performance of these write requests. The paper shows that NVRAM and other non-volatile devices can also be used to improve the reliability of an array by significantly reducing the time required to repair a failed disk in the array in the event of a single disk failure. More importantly, proper use of these non-volatile devices allows the array to support heavier workloads than previously reported while still repairing a failed disk in a reasonable amount of time.

  • Using virtual links for reliable information retrieval across point-to-point networks

    Page(s): 216 - 225

    The problem of establishing reliable communication across point-to-point networks is addressed. Several protocols are given that use node-disjoint paths to implement a set of independent virtual links (u-links) between each node pair. The u-links can be used to establish communication between the node pair provided the number and type of failures in the network are limited. Failed devices are classified as either benign (delay and omission faults) or malicious (not benign). The use of u-links is demonstrated with a protocol for remote information retrieval. With b benign and m malicious faults, retrieval can always be achieved if the graph connectivity exceeds b + 2m. This bound is tight. The authors give a flexible retrieval protocol: it analyzes the responses received thus far and computes a minimum and maximum number of u-links to which to send (additional) retrieval requests. Maximal requests guarantee completion of the protocol upon their timeout. With minimal requests, protocol completion is barely possible with no further requests. The u-links are extended so that they bounce requests when there is no response, which strengthens the retrieval protocol.
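
    The role of the fault bound can be sketched with a simple acceptance rule over disjoint u-links (illustrative only; the paper's protocol is more refined):

```python
from collections import Counter

def accept(responses, m):
    """Accept a retrieved value once m+1 disjoint u-links return it
    identically: at most m malicious links can agree on a forged value,
    so m+1 matching replies include at least one correct one."""
    counts = Counter(r for r in responses if r is not None)  # None = omission
    for value, n in counts.items():
        if n >= m + 1:
            return value
    return None  # not enough agreement yet; query more u-links

# 5 disjoint paths, up to m = 2 malicious: two forged replies cannot win
replies = ["v", None, "forged", "v", "v"]
assert accept(replies, m=2) == "v"
assert accept(["forged", "forged", None], m=2) is None
```

    Benign faults only withhold or delay replies, which is why each of them costs one path while each malicious fault costs two in the b + 2m connectivity bound.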
