
FTCS-24: Digest of Papers, Twenty-Fourth International Symposium on Fault-Tolerant Computing, 1994

Date: 15-17 June 1994


Displaying Results 1 - 25 of 48
  • Durable memory RS/6000 system design

    Publication Year: 1994 , Page(s): 414 - 423
    Cited by:  Papers (10)  |  Patents (1)

    The DM/6000 prototype is a fault-tolerant/durable-memory RS/6000. The main storage of this system is battery-backed so as to maintain memory content across prolonged power interruptions. In addition, there are no single points of failure, and all likely multiple-failure scenarios are covered. The prototype is intended to match the data integrity and availability characteristics of RAID5 disks. Redundancy is managed in hardware and is transparent to the software; application programs and the operating system (AIX) can run unmodified. The prototype is based on the IBM PowerPC 601 microprocessor operating at 80 MHz and is equivalent in performance and software appearance to a conventional 4-way shared-bus, cache-coherent, symmetric multiprocessor (SMP), with 4 gigabytes of non-volatile main storage.

  • Test pattern generation for path delay faults in synchronous sequential circuits using multiple fast clocks and multiple observation times

    Publication Year: 1994 , Page(s): 456 - 465
    Cited by:  Papers (1)

    The problem of test generation for path delay faults in synchronous sequential circuits is addressed. In existing testing methods, a single fast clock cycle is used to activate path delay faults and a fault is said to be detected only if the fault free response is different from the faulty response at a single output and at a specified time unit in the test sequence. We refer to these methods as single fast clock cycle and single observation time testing methods. We show that testable faults may exist, which are untestable using a single fast clock cycle and a single observation time. Such faults are testable when multiple fast clock cycles and/or multiple observation times are used. A test generation procedure is given that uses multiple fast clock cycles and multiple observation times. Experimental results are presented on MCNC synthesis benchmarks to demonstrate the effectiveness of the proposed strategy in increasing the fault coverage and reducing the test length.

  • Dynamic state and objective learning for sequential circuit automatic test generation using recomposition equivalence

    Publication Year: 1994 , Page(s): 446 - 455
    Cited by:  Papers (13)  |  Patents (2)

    Automatic test pattern generation (ATPG) for sequential circuits involves making decisions in the state and combinational search spaces defined by a sequential circuit. The search spaces are exponential in the memory elements and primary inputs, respectively, making exhaustive search impractical. Since the circuit topology does not change, ATPG search for different faults may share identical decision spaces. However, existing sequential circuit ATPG algorithms are not capable of recognizing identical search decision spaces. Consequently, they reenter previously-explored decision spaces. We propose a dynamic learning algorithm that identifies previously-explored decision spaces during reverse-time sequential circuit test generation based on decomposition equivalences. This algorithm runs 2 and 3.3 times faster than GENTEST and HITEC, respectively, on the 1989 ISCAS benchmarks, compresses 24% of the learned information and identifies 85% of all previously-explored decision spaces by state covering. We provide theorems with proofs, examples and results.

  • Checkpoint/rollback in a distributed system using coarse-grained dataflow

    Publication Year: 1994 , Page(s): 424 - 433
    Cited by:  Papers (5)  |  Patents (5)

    The Common Spaceborne Multicomputer Operating System (COSMOS) is a spacecraft operating system for distributed memory multiprocessors, designed to meet the on-board computing requirements of long-life interplanetary missions. One of the main features of COSMOS is software-implemented fault tolerance, including 2-way voting, 3-way voting, and checkpoint/rollback. This paper describes the COSMOS distributed checkpoint/rollback approach, which exploits the fact that a COSMOS application program is based on a coarse-grained dataflow programming paradigm and therefore most of the state of a distributed application program is contained in the data tokens. Furthermore, all computers maintain a consistent view of this dynamic state, which facilitates the implementation of a coordinated checkpoint.

  • Highly available cluster: a case study

    Publication Year: 1994 , Page(s): 404 - 413
    Cited by:  Papers (5)  |  Patents (2)

    The methodology and design of a system that provides highly available data in a cluster is presented. A highly available cluster consists of multiple machines interconnected by a common bus. Data is replicated at a primary and one or more backup machines. Data is accessed at the primary, using a location-independent mechanism that ensures data integrity. If the primary copy of the data fails, access is recovered by switching to a backup copy. Switchover is transparent to the application and is hence called seamless switchover. The fault model is fail-stop. The entire cluster is resilient to at least single failures. Designating data as highly available is selective in scope, and the overhead of replication and recovery is incurred only by applications that access highly available data. An experimental prototype was implemented using IBM AS/400 machines and a high-speed bus with fiber-optic links.

  • Optimal fault-tolerant leader election in chordal rings

    Publication Year: 1994 , Page(s): 392 - 401
    Cited by:  Papers (3)

    Chordal rings (or circulant graphs) are a popular class of fault-tolerant network topologies which include rings and complete graphs. For this class, the fundamental problem of leader election has been extensively studied assuming either a fault-free system or an upper bound on the number of link failures. We consider chordal rings where an arbitrary number of links has failed and a processor can only detect the status of its incident links. We show that a leader election protocol in a faulty chordal ring requires only O(n log n) messages in the worst case, where n is the number of processors. Moreover, we show that this is optimal. If the network is not partitioned, the algorithm will detect it and will elect a leader. In case the failures have partitioned the network, a distinctive element will be determined in each active component and will detect that a partition has occurred; depending on the application, these distinctive elements can thus take the appropriate actions.

  • Fault-tolerant routing strategy in hypercube systems

    Publication Year: 1994 , Page(s): 382 - 391
    Cited by:  Papers (1)

    We investigate fault-tolerant routing which aims at finding feasible minimal paths in a faulty hypercube. The concept of an unsafe node is adopted to identify nodes that may cause routing difficulties. We propose a set of stringent criteria to reduce the number of fault-free nodes that are labeled unsafe. Routing efficiency can thus be improved. An algorithm is presented to facilitate the routing of messages. It only requires each node to maintain information about its local state, whereby each node is in one of four states. The issue of deadlock freeness is also addressed. Furthermore, an alternative deadlock-free routing algorithm, which requires a constant number of virtual networks (five) in wormhole routing, is presented.

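    To make the setting concrete, here is a minimal Python sketch of greedy minimal-path routing in a hypercube that simply avoids faulty nodes. The function names and fault set are illustrative; this is not the paper's unsafe-node labeling algorithm, and the greedy choice may fail where the paper's four-state scheme would still find a path.

        def next_hops(current, dest, dim):
            """Neighbors of `current` that lie on a minimal path to `dest` in a dim-cube."""
            diff = current ^ dest
            return [current ^ (1 << i) for i in range(dim) if diff & (1 << i)]

        def greedy_route(src, dest, dim, faulty):
            """Greedy minimal-path routing that skips faulty nodes (no backtracking);
            returns the hop-by-hop path, or None if every minimal next hop is faulty."""
            path, cur = [src], src
            while cur != dest:
                candidates = [n for n in next_hops(cur, dest, dim) if n not in faulty]
                if not candidates:
                    return None
                cur = candidates[0]
                path.append(cur)
            return path

        # Route 0000 -> 1111 in a 4-cube around two faulty nodes.
        print(greedy_route(0b0000, 0b1111, 4, faulty={0b0001, 0b0110}))
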
  • Connectivity and fault tolerance of multiple-bus systems

    Publication Year: 1994 , Page(s): 372 - 381

    We study multiple-bus computer systems that are fault tolerant in the sense that processors remain connected in the presence of component faults such as faulty processors and buses, and faulty links between processors and buses which may represent partial bus failures. We propose several graph-theoretic models for this purpose. A processor-bus-link (PBL) graph is introduced to represent a multiple-bus system; component adjacency graphs derived from the PBL graph exhibit the connectivity of the system's components. We then transform the problem of analyzing fault tolerance of multiple-bus systems into the simpler problem of finding the node connectivity of component adjacency graphs. Minimum critical fault sets, each of which is a minimum set of faulty components whose removal disconnects processors, are also characterized.

  • Training techniques to obtain fault-tolerant neural networks

    Publication Year: 1994 , Page(s): 360 - 369
    Cited by:  Papers (1)

    This paper addresses methods of improving the fault tolerance of feedforward neural nets. The first method is to coerce weights to have low magnitudes during the backpropagation training process, since fault tolerance is degraded by the use of high-magnitude weights; at the same time, additional hidden nodes are added dynamically to the network to ensure that desired performance can be obtained. The second method is to add artificial faults to various components (nodes and links) of a network during training. The third method is to repeatedly remove nodes that do not significantly affect the network output, and then add new nodes that share the load of the more critical nodes in the network. Experimental results have shown that these methods can obtain better robustness than backpropagation training, and compare favorably with other approaches.

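    The first two techniques (keeping weight magnitudes low and injecting artificial faults during training) can be sketched in a few lines of Python/NumPy. The toy network, data set, and hyperparameters below are assumptions made for illustration, not taken from the paper, and the dynamic addition of hidden nodes is omitted.

        import numpy as np

        rng = np.random.default_rng(0)

        # Toy data: learn XOR with a 2-4-1 feedforward network.
        X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
        Y = np.array([[0], [1], [1], [0]], dtype=float)

        W1 = rng.normal(0, 0.5, (2, 4)); b1 = np.zeros(4)
        W2 = rng.normal(0, 0.5, (4, 1)); b2 = np.zeros(1)

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        lam = 1e-3       # penalty coercing weights toward low magnitudes
        p_fault = 0.1    # probability of knocking out each hidden node per step

        for step in range(20000):
            # Artificial fault injection: randomly disable hidden nodes while training.
            mask = (rng.random(4) > p_fault).astype(float)
            h = sigmoid(X @ W1 + b1) * mask
            y = sigmoid(h @ W2 + b2)

            # Backpropagation for squared error plus the L2 magnitude penalty.
            dy = (y - Y) * y * (1 - y)
            dW2 = h.T @ dy + lam * W2
            db2 = dy.sum(axis=0)
            dh = (dy @ W2.T) * mask * h * (1 - h)
            db1 = dh.sum(axis=0)
            dW1 = X.T @ dh + lam * W1

            for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
                p -= 0.5 * g

        # Predictions for the XOR targets [0, 1, 1, 0].
        print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).ravel(), 2))
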
  • Experimental evaluation of the fail-silent behavior in computers without error masking

    Publication Year: 1994 , Page(s): 350 - 359
    Cited by:  Papers (32)

    Traditionally, fail-silent computers are implemented by using massive redundancy (hardware or software). In this research we investigate whether it is possible to obtain a high degree of fail-silent behavior from a computer without hardware or software replication by using only simple behavior-based error detection techniques. It is assumed that if the errors caused by a fault are detected in time it will be possible to stop the erroneous computer behavior, thus preventing the violation of the fail-silent model. The evaluation technique used in this research is physical fault injection at the pin level. Results obtained by the injection of about 20,000 different faults in two different target systems have shown that: in a system without error detection up to 46% of the faults caused the violation of the fail-silent model; in a computer with behavior-based error detection the percentage of faults that caused the violation of the fail-silent model was reduced to values from 2.3% to 0.4%; the results are very dependent on the target system, on the program under execution during the fault injection and on the type of faults.

  • On latching probability of particle induced transients in combinational networks

    Publication Year: 1994 , Page(s): 340 - 349
    Cited by:  Papers (56)  |  Patents (3)

    The question to what extent particle-induced transients in combinational parts of a circuit propagate into memory elements is addressed in this paper. An experimental method is presented in which the proportion of bit flips originating from heavy-ion hits in combinational logic is determined. It is proposed that a voltage pulse may only propagate through a limited number of transistor stages and still be latched. The proportion of all transients in combinational logic that were latched into registers was experimentally estimated to be between 0.7×10^-3 and 2×10^-3 for a custom-designed CMOS circuit. Very few multiple bit flips were observed during the experiments, which indicates that the single bit flip model used in many high-level simulations is reasonably accurate.

  • The impact of fault expansion on the interval estimate for fault detection coverage

    Publication Year: 1994 , Page(s): 330 - 337
    Cited by:  Papers (6)

    A high fault detection coverage is critical for systems with ultra-safe requirements, and fault injection is an effective technique for estimating the coverage. One difficulty of fault injection lies in the huge number of injections that need to be carried out in order to obtain statistically significant results. Fault expansion has been proposed as a means of reducing the number of injections by dividing faults into equivalence classes. We show that this intuitively appealing approach is only effective when each fault equivalence class is a significant portion of the fault population.

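    As a worked example of the arithmetic behind fault expansion, the Python sketch below injects one representative fault per equivalence class and weights the outcome by class size; the class sizes and outcomes are invented. The paper's point concerns how wide the confidence interval around such an estimate remains when the classes cover only a small portion of the fault population.

        def coverage_estimate(classes):
            """Coverage estimate when one representative fault is injected per
            equivalence class and the outcome is weighted by the class size."""
            total = sum(size for size, _ in classes)
            detected = sum(size for size, hit in classes if hit)
            return detected / total

        # (class size, representative detected?) for a made-up population of 1000 faults.
        classes = [(400, True), (350, True), (150, False), (100, True)]
        print(coverage_estimate(classes))   # 0.85
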
  • Identifying software problems using symptoms

    Publication Year: 1994 , Page(s): 320 - 329
    Cited by:  Papers (9)  |  Patents (1)

    This paper presents an approach to automatically identify recurrent software failures using symptoms, in environments where many users run the same software. The approach is based on observations that the majority of field software failures in such environments are recurrences and that failures due to a single fault often share common symptoms. The paper proposes the comparison of failure symptoms, such as stack traces and symptom strings, as a strategy for identifying recurrences. This diagnosis strategy is applied using the actual field software failure data. The results obtained are compared with the diagnosis and repair logs by analysts. Results of such comparisons using the failure, diagnosis, and repair logs in two Tandem system software products show that between 75% and 95% of recurrences can be identified successfully by matching stack traces and symptom strings. Less than 10% of faults are misdiagnosed. These results indicate that automatic identification of recurrences based on their symptoms is possible.

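    A minimal Python sketch of the general symptom-matching idea, grouping failure reports that share a stack trace and symptom string; the field names and example reports are invented for illustration and this is not Tandem's actual tooling.

        import hashlib
        from collections import defaultdict

        def symptom_key(stack_trace, symptom_string):
            """Reduce a failure report to a comparable symptom signature."""
            canon = "|".join(f.strip() for f in stack_trace) + "#" + symptom_string.strip()
            return hashlib.sha1(canon.encode()).hexdigest()

        def group_recurrences(reports):
            """Group failure reports that share the same stack trace and symptom string."""
            groups = defaultdict(list)
            for r in reports:
                groups[symptom_key(r["stack"], r["symptom"])].append(r["id"])
            return groups

        reports = [
            {"id": 1, "stack": ["alloc_buf", "open_file", "main"], "symptom": "SIGSEGV"},
            {"id": 2, "stack": ["alloc_buf", "open_file", "main"], "symptom": "SIGSEGV"},
            {"id": 3, "stack": ["flush_log", "shutdown"],          "symptom": "assert fail"},
        ]
        for key, ids in group_recurrences(reports).items():
            print(key[:8], ids)    # reports 1 and 2 are identified as a recurrence
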
  • A study of throughput degradation following single node failure in a data sharing system

    Publication Year: 1994 , Page(s): 310 - 319

    The data sharing approach to building distributed database systems is becoming more common because of its potentially higher processing power and flexibility compared to data partitioning. However, due to the large amounts of hardware and complex software involved, the likelihood of a single node failure in the system increases. Following a single node failure, some processing has to be done to determine the set of locks held by transactions which were executing at the failed node. These locks cannot be released until database recovery has completed on the failed node. This phenomenon can cause throughput degradation even if the processing power on the surviving nodes is adequate to handle all incoming transactions. This paper studies the throughput dropoff behavior following a single node failure in a data sharing system through simulations and analytical modeling. The analytical model reveals several important factors affecting post-failure behavior and is shown to match simulations quite accurately. The effect of hot locks (locks which are frequently accessed) on post-failure behavior is observed. Simulations are performed to observe system behavior after the set of locks held by transactions on the failed node has been determined and show that if the delay in obtaining this information is too large, the system is prone to thrashing.

  • On the use and implementation of message logging

    Publication Year: 1994 , Page(s): 298 - 307
    Cited by:  Papers (37)  |  Patents (1)

    We present a number of experiments showing that for compute-intensive applications executing in parallel on clusters of workstations, message logging has higher failure-free overhead than coordinated checkpointing. Message logging protocols, however, result in much shorter output latency than coordinated checkpointing. Therefore, message logging should be used for applications involving substantial interactions with the outside world, while coordinated checkpointing should be used otherwise. We also present an unorthodox message logging design that uses coordinated checkpointing with message logging, departing from the conventional approaches that use independent checkpointing. This combination of message logging and coordinated checkpointing offers several advantages, including improved failure-free performance, bounded recovery time, simplified garbage collection, and reduced complexity. Meanwhile, the new protocols retain the advantages of the conventional message logging protocols with respect to output commit. Finally, we discuss three "lessons learned" from an implementation of various message logging protocols.

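    For readers unfamiliar with message logging, the Python sketch below illustrates the general idea with a pessimistic, receiver-side log: every delivered message is saved to a stable log before it is processed, so a crashed process can be rebuilt from its last checkpoint plus the log. It is a generic illustration, not the protocols or the combined coordinated-checkpointing design studied in the paper.

        class LoggedProcess:
            """Pessimistic, receiver-based message logging: log each message to
            stable storage before processing it, so recovery = checkpoint + replay."""
            def __init__(self):
                self.state, self.stable_log, self.checkpoint = 0, [], 0

            def deliver(self, msg):
                self.stable_log.append(msg)   # logged before processing (pessimistic)
                self.state += msg             # "processing" is just accumulation here

            def take_checkpoint(self):
                self.checkpoint = self.state
                self.stable_log.clear()       # messages before the checkpoint can be pruned

            def recover(self):
                self.state = self.checkpoint
                for msg in self.stable_log:   # deterministic replay of logged messages
                    self.state += msg

        p = LoggedProcess()
        p.deliver(5); p.take_checkpoint(); p.deliver(7); p.deliver(2)
        before_crash = p.state
        p.state = None                        # simulate a crash wiping volatile state
        p.recover()
        assert p.state == before_crash
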
  • Faster checkpointing with N+1 parity

    Publication Year: 1994 , Page(s): 288 - 297
    Cited by:  Papers (28)  |  Patents (1)

    This paper presents a way to perform fast incremental checkpointing of multicomputers and distributed systems by using N+1 parity. A basic algorithm is described that uses two extra processors for checkpointing and enables the system to tolerate any single processor failure. The algorithm's speed comes from a combination of N+1 parity, extra physical memory, and virtual memory hardware so that checkpoints need not be written to disk. This eliminates the most time-consuming portion of checkpointing. The algorithm requires each application processor to allocate a fixed amount of extra memory for checkpointing. This amount may be set statically by the programmer, and need not be equal to the size of the processor's writable address space. This alleviates a major restriction of previous checkpointing algorithms using N+1 parity. Finally, we outline how to extend our algorithm to tolerate any m processor failures with the addition of 2m extra checkpointing processors.

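    The core N+1 parity arithmetic can be shown in a few lines of Python: XOR the checkpoint blocks held by the application processors into a parity block on a checkpointing processor, and rebuild any one lost block from the survivors. This is only a sketch of the parity idea; the incremental copy-on-write machinery, the second checkpointing processor, and the 2m-processor extension from the abstract are omitted, and the block contents are made up.

        from functools import reduce

        def parity_block(blocks):
            """XOR together equal-length checkpoint blocks (the N+1 parity block)."""
            return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

        def recover(surviving_blocks, parity):
            """Rebuild the single missing block from the survivors and the parity."""
            return parity_block(surviving_blocks + [parity])

        # Four application processors each hold an in-memory checkpoint block.
        ckpts = [bytes([v] * 8) for v in (3, 5, 7, 11)]
        parity = parity_block(ckpts)        # held by the extra checkpointing processor

        lost = 2                            # processor 2 fails
        rebuilt = recover(ckpts[:lost] + ckpts[lost + 1:], parity)
        assert rebuilt == ckpts[lost]
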
  • A cache protocol for error detection and recovery in fault-tolerant computing systems

    Publication Year: 1994 , Page(s): 278 - 287
    Cited by:  Papers (4)

    We propose an error detection and recovery protocol for redundant processor systems employing caches. The protocol allows cache-based systems to vote more often and thereby reduce the chance of losing synchronization. The scheme is based on cache data broadcasting of a dirty line after modification. The scheme effectively exploits the redundancy of a fault-tolerant system using hardware voting. It recovers from erroneous data written by a processor and this remedies the insufficiency of error-correcting codes. The protocol can also be used to speed up the resynchronization process for a temporarily failed processor in a redundant system. More than 60% of cache lines are fully covered for recovery from errors originating from the cache itself, including unrecoverable ECC errors. The performance overhead is to broadcast only 2-3% of the total memory references.

  • Efficient algorithmic circuit verification using indexed BDDs

    Publication Year: 1994 , Page(s): 266 - 275
    Cited by:  Papers (2)

    The Indexed Binary Decision Diagram (IBDD), a Boolean function representation scheme, provides a compact representation for functions whose OBDD representation is intractably large. In this paper, we present more efficient algorithms for satisfiability testing and equivalence checking of IBDDs. Efficient verification of Booth multipliers, as well as practical strategies for polynomial time verification of some classes of unsigned array multipliers, are demonstrated experimentally. Our results show that the verification of many instances of functions whose analysis is intractable using OBDDs, such as multipliers and the hidden-weighted-bit function, can be done efficiently using IBDDs.

  • Behavioral synthesis of testable designs

    Publication Year: 1994 , Page(s): 436 - 445
    Cited by:  Papers (5)

    High-level synthesis tools automatically produce RTL designs from algorithmic specifications. These designs, however, are not necessarily easy to test. In this paper we present TBINET, an algorithm for module and register binding, which generates RTL designs having low testability overheads. It obtains a heuristic solution to the binding problem by mapping it onto a sequence of minimum cost network flow problems which can be solved very quickly. A cost function that considers the testability of the design is defined in the paper. The results of experiments on various benchmarks show that the designs produced by our binding algorithm are indeed easier to test as compared to circuits designed without testability considerations.

  • Architectural timing verification and test for super scalar processors

    Publication Year: 1994 , Page(s): 256 - 265
    Cited by:  Papers (4)  |  Patents (1)

    We address the problem of verification and testing of super scalar processors, from the point of view of correctness of program execution time. Trace-driven architectural simulation methods are commonly used in current industrial practice to estimate cycles-per-instruction performance of a candidate processor organization, prior to actual implementation. We present a novel set of strategies for testing the timing correctness of processors as represented in an architectural timing model ("timer"). We focus on two main aspects of the theory: (a) deriving architectural test sequences to cover possible failure modes, defined in the context of a pipeline flow state transition fault model; and (b) deriving loop test kernels to verify steady-state (periodic) behavior of pipeline flow, against analytically predicted signatures. We develop the theory in the context of an example super scalar processor and its timer model.

  • Implementing fault tolerance with an attribute and functional based model

    Publication Year: 1994 , Page(s): 244 - 253
    Cited by:  Papers (2)

    Programs constructed using techniques that allow software or operational faults to be tolerated are typically written using an imperative computational model. Here, an alternative is described in which such programs are written using an attribute and functional based model called FTAG. The basic model is introduced, followed by a description of mechanisms that allow a variety of standard fault tolerance techniques to be realized in a straightforward way. Techniques that can be accommodated include replication and checkpointing to tolerate operational faults, and recovery blocks and N-version programming to tolerate software faults. Several examples are given to illustrate these techniques, including a replicated name server and a fault-tolerant sort that uses recovery blocks. These examples also serve to highlight the advantages of FTAG when compared to traditional imperative approaches, such as a declarative style, separation of semantic and syntactic definitions and the simplicity of a functional foundation. Finally, an outline of how this model can be implemented in a computer system containing multiple processors is given.

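    The recovery block technique mentioned above is easy to illustrate; the Python sketch below (the names and the deliberately buggy sort are invented) runs a primary routine, applies an acceptance test to its result, and falls back to an alternate on failure. FTAG's declarative, functional formulation is not reproduced here.

        def recovery_block(acceptance_test, primary, *alternates):
            """Try the primary, then each alternate, until one produces a result that
            passes the acceptance test. (A full recovery block would also roll state
            back to a checkpoint before each retry; the routines here are pure.)"""
            for attempt in (primary, *alternates):
                try:
                    result = attempt()
                    if acceptance_test(result):
                        return result
                except Exception:
                    pass
            raise RuntimeError("all alternates exhausted")

        # Fault-tolerant sort: a deliberately buggy primary with a simple alternate.
        data = [5, 3, 9, 1]
        buggy_sort = lambda: data[::-1]               # primary containing a software fault
        fallback_sort = lambda: sorted(data)          # independently written alternate
        is_sorted = lambda xs: all(a <= b for a, b in zip(xs, xs[1:]))

        print(recovery_block(is_sorted, buggy_sort, fallback_sort))   # [1, 3, 5, 9]
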
  • The performance of two-phase commit protocols in the presence of site failures

    Publication Year: 1994 , Page(s): 234 - 243
    Cited by:  Patents (4)

    Much of the existing literature on the two-phase commit protocol is restricted to discussing and analyzing the protocol (and its variants) in the absence of failure. Very little, especially in quantitative terms, has been written about its performance in the presence of site failures. We use a simulation test bed of a distributed database system to quantify the differences in the performances of four 2PC protocols (the generic 2PC, presumed commit, presumed abort, and early prepare). Our study covers both the no-failure case and the case of site failures. We present a number of interesting experimental results. One is that the performance of these protocols is highly dependent on the message-processing latency at the transaction coordinator site. Another is that the presumed abort protocol does not necessarily yield better performance in the presence of site failures.

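    For reference, the Python sketch below shows the skeleton of generic 2PC from the coordinator's side; the Site class and its failure handling are illustrative assumptions. The presumed-commit, presumed-abort, and early-prepare variants compared in the paper differ mainly in what is force-logged and when.

        def two_phase_commit(coordinator_log, participants):
            """Generic 2PC from the coordinator's point of view.
            Each participant exposes prepare() -> bool, commit(), and abort()."""
            # Phase 1: voting.
            votes = []
            for p in participants:
                try:
                    votes.append(p.prepare())
                except Exception:       # an unreachable/failed site counts as a "no" vote
                    votes.append(False)

            # Phase 2: decision, force-logged before being sent out.
            decision = "commit" if all(votes) else "abort"
            coordinator_log.append(decision)
            for p in participants:
                (p.commit if decision == "commit" else p.abort)()
            return decision

        class Site:
            def __init__(self, can_commit=True):
                self.can_commit, self.state = can_commit, "active"
            def prepare(self):
                self.state = "prepared"
                return self.can_commit
            def commit(self):
                self.state = "committed"
            def abort(self):
                self.state = "aborted"

        log = []
        print(two_phase_commit(log, [Site(), Site(can_commit=False)]))   # "abort"
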
  • On the effectiveness of residue code checking for parallel two's complement multipliers

    Publication Year: 1994 , Page(s): 219 - 228
    Cited by:  Papers (12)

    The effectiveness of residue code checking for on-line error detection in parallel two's complement multipliers has up to now only been evaluated experimentally for few architectures. In this paper a formal analysis is given for most of the current multiplication schemes. Based on this analysis it is shown which check bases are appropriate, and how the original scheme has to be extended for complete error detection at the input registers and Booth recoding circuitry. In addition, we argue that the hardware overhead for checking can be reduced by approximately one half if a small latency in error detection is acceptable. Schemes for structuring the checking logic in order to guarantee it to be self-testing, and thus achieve the totally self-checking goal for the overall circuit, are also derived.

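    The underlying residue check is simple enough to show as a worked example. The Python sketch below uses a mod-3 check base purely for illustration (the paper's analysis is precisely about which bases are appropriate for which multiplier schemes); note that any error that is a multiple of the check base escapes detection.

        def residue_check(a, b, product, modulus=3):
            """Concurrent check of a two's complement multiplier output: the residue
            of the product must equal the product of the operand residues."""
            return product % modulus == (a % modulus) * (b % modulus) % modulus

        a, b = -37, 21
        assert residue_check(a, b, a * b)            # fault-free multiplier passes
        assert not residue_check(a, b, a * b + 4)    # an erroneous output is flagged
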
  • Balanced codes for noise reduction in VLSI systems

    Publication Year: 1994 , Page(s): 212 - 218
    Cited by:  Papers (4)  |  Patents (6)

    In a balanced code each codeword contains an equal number of 1's and 0's. In this paper, methods for the construction of balanced codes, which use parallel encoding/decoding, are described. The codes use fewer check bits and lower hardware complexity than the existing codes.

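    As background, a classic (Knuth-style) way to balance an even-length word is to complement a prefix until the numbers of 1's and 0's match; the brute-force Python sketch below illustrates that idea only and is not the parallel construction with fewer check bits proposed in the paper.

        def balance(bits):
            """Balance an even-length binary word by complementing a prefix.
            Returns (k, word); k must also be encoded so a decoder can undo it."""
            n = len(bits)
            for k in range(n + 1):
                word = [1 - b for b in bits[:k]] + list(bits[k:])
                if sum(word) == n // 2:
                    return k, word
            raise ValueError("unreachable for even-length words")

        k, codeword = balance([1, 1, 1, 1, 0, 1, 0, 1])
        print(k, codeword, sum(codeword))   # the word now has as many 1's as 0's
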
  • A SEC-BED-DED code with byte plus bit error detection

    Publication Year: 1994 , Page(s): 208 - 211
    Cited by:  Papers (1)  |  Patents (2)

    A code is presented which is designed to protect a 64 bit word, consisting of 8 bytes of 8 bits each, against both byte errors and single bit errors occurring in computer memories. Two check bytes are appended to give an overall length of 80 bits. The code then provides the often studied SEC-BED-DED level of error protection; that is, single error correction, byte error detection and double error detection are available simultaneously. In addition, the code detects a byte error together with a single random error occurring concurrently in another byte. Some implementation issues such as systematic versus separable encoding are considered.
