
Proceedings of the Pacific Rim International Symposium on Fault-Tolerant Systems, 1997

Date: 15-16 Dec. 1997


Displaying Results 1 - 25 of 38
  • Proceedings Pacific Rim International Symposium on Fault-Tolerant Systems

  • Index of authors

    Page(s): 243
  • Fault tolerant constructive algorithm for feedforward neural networks

    Page(s): 215 - 220

    In this paper, a constructive algorithm for fault tolerant feedforward neural networks, called FTCA, is proposed. The algorithm starts with a network containing a single hidden neuron, and a new hidden unit is added whenever the network fails to converge. Before the new hidden neuron is inserted into the network, only the weights connecting it to the other neurons are trained (i.e. updated) until there is no significant reduction of the output error. To generate a fault tolerant network, the relevance of the synaptic weights is estimated in each cycle, and only the weights whose relevance is less than a specified threshold are updated in that cycle. The fault model assumes the loss of connections between neurons (equivalent to stuck-at-0 faults). The simulation results indicate that the network constructed by FTCA has significant fault tolerance.

  • The use of neurons with higher functionality to enhance the fault tolerance of neural networks

    Page(s): 221 - 228

    We have previously proposed fault-tolerant design techniques for neural networks based on the properties of conventional neurons (the component elements of neural networks). If each neuron offers higher functionality than a simple weighted sum of its inputs, we can design neural networks that tolerate a mixture of stuck-at-1 and stuck-at-0 faults without resorting to a triplication scheme.

  • Double and triple error detecting capability of Internet checksum and estimation of probability of undetectable error

    Page(s): 47 - 52

    The Internet checksum is calculated using 16-bit one's complement arithmetic. The occurrence of two or more errors, however, may not be detected. In this paper, we formulate the checksum procedure as a nonlinear code. Part of the distance distribution of the nonlinear code is calculated. Using these results, we derive lower and upper bounds on the probability of an undetectable error when the nonlinear code is used for error detection on a binary symmetric channel. For an IP header with no optional fields, the lower bound is almost equal to the upper bound when the bit error rate is less than about 1.0×10⁻³.

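    A minimal illustration of the 16-bit one's complement checksum discussed in the abstract above (in the style of RFC 1071); the sample header bytes are hypothetical, and this sketch does not reproduce the paper's nonlinear-code analysis.

```python
def internet_checksum(data: bytes) -> int:
    """Compute the 16-bit one's complement Internet checksum of `data`."""
    if len(data) % 2:                                 # pad odd-length input
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]         # 16-bit big-endian words
        total = (total & 0xFFFF) + (total >> 16)      # fold the carry (one's complement add)
    return ~total & 0xFFFF                            # one's complement of the sum

# Two or more bit errors that leave the one's complement sum unchanged go
# undetected -- the event whose probability the paper bounds on a binary
# symmetric channel.
header = bytes(range(20))                             # hypothetical 20-byte IP-like header
print(hex(internet_checksum(header)))
```
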
  • An embedded fail-safe interlocking system

    Page(s): 22 - 27

    The paper presents a fail-safe railway interlocking system embedded in an Area Control Center (ACC) system. The host of the system is a TANDEM NONSTOP HIMALAYA K200 computer. This fault tolerant computer aims at high safety, reliability and availability. In addition, the dispatcher management system, device supervision system, and train control system are integrated into the host computer to ensure high performance. Tens of stations can be controlled by the system concurrently. The paper also presents some of the software measures used to ensure the safety of the interlocking system. These measures have been verified by practical application in earlier versions of the interlocking system.

  • An extended binary tree quorum strategy for K-mutual exclusion in distributed systems

    Page(s): 110 - 115

    In this paper, we propose two strategies for k-mutual exclusion, called the generalized binary tree quorum and the extended binary tree quorum, which impose a logical structure on the network. Both strategies are based on a logical binary tree structure. The quorum size constructed by either strategy is ⌈log₂(n/k)⌉ in the best case and ⌈(n+k)/2k⌉ in the worst case, where n is the number of nodes in the system. Moreover, both strategies can tolerate up to n - k⌈log₂(n/k)⌉ node failures in the best case and k⌈log₂(n/k)⌉ node failures in the worst case. In our performance analysis, we show that the extended binary tree quorum strategy provides higher availability than the k-majority, cohorts, and DIV strategies almost all the time. Moreover, the quorum size of the extended binary tree quorum is always the smallest among these four strategies when n > 12, while in the worst case it is always smaller than that of the k-majority and DIV strategies.

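    The abstract does not spell out the quorum construction, so the sketch below shows only the classical binary tree quorum idea that such strategies build on: a quorum is a root-to-leaf path of live nodes, and a failed node on the path is covered by quorums of both of its subtrees. The node numbering, the availability map, and the restriction to single (k = 1) mutual exclusion are assumptions.

```python
def tree_quorum(up, i=0):
    """Return a set of live node indices forming a quorum, or None if no
    quorum exists.  `up` lists the status of a complete binary tree stored
    in heap order (children of node i are 2i+1 and 2i+2)."""
    n = len(up)
    if i >= n:                        # below a leaf: nothing further is required
        return set()
    left, right = 2 * i + 1, 2 * i + 2
    if up[i]:                         # live node: extend the path through one child
        for child in (left, right):
            sub = tree_quorum(up, child)
            if sub is not None:
                return {i} | sub
        return None
    # failed node: it must be covered by quorums of *both* subtrees
    l, r = tree_quorum(up, left), tree_quorum(up, right)
    if left < n and right < n and l is not None and r is not None:
        return l | r
    return None

nodes = [True] * 15                   # hypothetical 15-node system, all alive
print(tree_quorum(nodes))             # best case: a 4-node root-to-leaf path
nodes[0] = False                      # a root failure forces both subtrees
print(tree_quorum(nodes))
```
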
  • Behavior of a computer based interlocking system under transient hardware faults

    Page(s): 174 - 179

    The paper addresses the safety analysis and evaluation of a hard real-time railway interlocking control system. The major objective is to demonstrate an efficient methodology capable of capturing crucial system dependability characteristics while allowing meaningful results to be obtained within a reasonable time. The evaluation is done by simulating the execution of the control software under transient hardware faults. The study demonstrates that simulation is a valuable approach for the quick and accurate assessment of critical railway control systems. It allows the trade-offs between different alternatives to be studied and the impact of crucial system parameters (e.g., the scheduling policy for data exchange among redundant units in the system) on overall system dependability to be quantified.

  • Optimized authenticated self-synchronizing Byzantine agreement protocols

    Page(s): 122 - 129

    In order to make a dependable distributed computer system resilient to arbitrary failures of its processors, deterministic Byzantine agreement protocols (BAPs) can be applied. Many BAPs found in the literature require that communication take place in synchronized rounds of information exchange and that all correct processors know the start of the BAP and start the protocol simultaneously. It is hard to satisfy either requirement in a distributed system, and consequently hard to implement such BAPs. Authenticated self-synchronizing BAPs avoid this problem by guaranteeing Byzantine agreement while allowing arbitrary clock skew between the processors' clocks and not requiring correct processors to know the start of the BAP. However, authenticated self-synchronizing BAPs incur considerable communication overhead. Therefore, in this paper, we introduce so-called optimized authenticated self-synchronizing BAPs, which require fewer messages than the existing authenticated self-synchronizing BAPs.

  • Reliability modeling of structured systems: exploring symmetry in state-space generation

    Page(s): 78 - 84

    A large number of systems are implemented using regular interconnected topologies. Markov analysis of such systems results in large state spaces. We exploit the symmetry of such systems, in particular rotational and permutational symmetry, to achieve a significant reduction in the size of the state space required to analyze them. The resulting much smaller state spaces allow the analysis of very large systems. We define equivalence classes of states and develop an algorithm that generates small state spaces and the corresponding Markov chains for systems with permutation symmetries. The state space generation process is also simplified. We demonstrate the technique using several examples; it is very useful in the exact analysis of large systems.

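    As a toy illustration of the permutation symmetry the abstract refers to, consider n identical components with independent failure rate λ and repair rate μ: instead of the 2^n states of the full Markov model, a lumped chain only needs to track how many components have failed. The parameters below are hypothetical, and this lumping rule is far simpler than the paper's general algorithm.

```python
import numpy as np

n, lam, mu = 8, 1e-3, 1e-1            # hypothetical: 8 identical components
# Full state space: 2**8 = 256 states.  Under permutation symmetry, states with
# the same number of failed components are equivalent, so the lumped Markov
# chain needs only n + 1 = 9 states (k = number of failed components).
Q = np.zeros((n + 1, n + 1))
for k in range(n + 1):
    if k < n:
        Q[k, k + 1] = (n - k) * lam   # any of the n - k working components fails
    if k > 0:
        Q[k, k - 1] = k * mu          # any of the k failed components is repaired
    Q[k, k] = -Q[k].sum()

# Steady-state distribution of the lumped chain: solve pi Q = 0 with sum(pi) = 1.
A = np.vstack([Q.T, np.ones(n + 1)])
b = np.append(np.zeros(n + 1), 1.0)
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print("P(all components working) =", pi[0])
```
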
  • Fault handling mechanisms in the RETHER protocol

    Page(s): 153 - 159

    RETHER is a software-driven token-passing protocol designed to provide bandwidth guarantees for real-time multimedia applications over off-the-shelf Ethernet hardware. To our knowledge, it is the first all-software, fully-implemented real-time protocol on top of commodity Ethernet hardware. Because token passing is used to regulate network access, node crashes and/or packet corruption may lead to token loss, potentially shutting down the network completely. This paper describes the fault handling mechanisms built into RETHER to address this problem in both single-segment and multi-segment Ethernet environments. The emphasis of the paper is on the uniqueness of the target application context and the rationale behind the chosen solutions. We present the performance trade-offs of the fault detection/recovery schemes and our experience implementing the prototype.

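    The abstract does not give RETHER's actual recovery algorithm, so the following is only a generic sketch of timeout-based token-loss detection and regeneration of the kind such protocols use; the timing constant, node identifiers, and the lowest-ID regeneration rule are all assumptions.

```python
import time

TOKEN_TIMEOUT = 0.05          # hypothetical worst-case token rotation time (seconds)

class Node:
    """Generic sketch of a token-ring station with a token-loss watchdog."""
    def __init__(self, node_id, ring):
        self.node_id, self.ring = node_id, ring
        self.last_seen = time.monotonic()

    def on_token(self, token):
        self.last_seen = time.monotonic()      # token observed: reset the watchdog
        # ... transmit real-time / non-real-time traffic per the bandwidth schedule ...
        return token                           # pass the token to the next station

    def check_token_loss(self):
        """Called periodically.  If no token has been seen for too long, assume
        it was lost (node crash or packet corruption); to avoid duplicate tokens,
        only the lowest-numbered station regenerates it."""
        if time.monotonic() - self.last_seen > TOKEN_TIMEOUT:
            if self.node_id == min(self.ring):
                self.last_seen = time.monotonic()
                return {"holder": self.node_id, "seq": 0}   # regenerated token
        return None

ring = [1, 2, 3]
station = Node(1, ring)
time.sleep(0.06)                               # simulate a missing token
print(station.check_token_loss())              # -> a regenerated token
```
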
  • Performance analysis of a reliable real-time token-ring protocol

    Page(s): 180 - 185

    This paper presents an analytical model, using stochastic Petri nets, of the performance of a priority-based real-time protocol that uses data-link layer message logging for fast recovery in the event of station crashes. The advantage of message logging at the data-link layer over traditional higher-layer recovery mechanisms is demonstrated.

  • The adaptable distributed recovery block scheme and a modular implementation model

    Page(s): 131 - 138

    The purpose of adaptive fault tolerance (AFT) is to meet dynamically and widely changing fault-tolerance requirements by efficiently and adaptively utilizing a limited and dynamically changing amount of available redundant processing resources. In this paper we present one concrete AFT scheme, the adaptable distributed recovery block (ADRB) scheme, which is an extension of the distributed recovery block (DRB) scheme for reliable execution of real-time applications with tolerance of both hardware and software faults in distributed/parallel computer systems. An ADRB station dynamically switches its operating mode in response to significant changes in the resource and application modes; different operating modes have different resource requirements and yield different fault tolerance capabilities. A modular implementation model for the ADRB scheme is also presented. An efficient execution support mechanism for the ADRB scheme has been implemented as part of a timeliness-guaranteed kernel developed at the University of California, Irvine.

  • Time-lag duplexing-a fault tolerance technique for online transaction processing systems

    Page(s): 202 - 207

    In this paper the concept of time-lag duplexing is proposed as a means of achieving fault tolerance. Time-lag duplexing combines time and component redundancy to provide both easy recovery from transient errors and tolerance of errors in common, non-redundant components, so that only minimal performance and cost penalties are incurred. The fault detection and recovery algorithm based on time-lag duplexing is presented for both transient and permanent faults. We show the applicability of time-lag duplexing by developing a prototype online transaction processing (OLTP) system in a network computing environment; faults are injected into this prototype to demonstrate the viability of the technique.

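    The abstract gives only the concept, so the following is a speculative sketch of one way a time lag between duplexed units can aid recovery: the backup replays committed transactions a fixed number of steps behind the primary, and a detected transient error in the primary is repaired by rebuilding its state from the lagging backup plus the pending transactions. The class name, the lag length, and the transaction format are hypothetical.

```python
from collections import deque

LAG = 3                       # hypothetical: backup trails the primary by 3 transactions

class TimeLagDuplex:
    """Speculative sketch: the backup replays committed transactions LAG steps
    behind the primary, so a transient error detected at the primary can be
    repaired from the lagging backup plus the not-yet-replayed transactions."""
    def __init__(self):
        self.primary, self.backup = {}, {}
        self.pending = deque()                 # transactions not yet applied to the backup

    @staticmethod
    def apply(state, txn):
        key, delta = txn
        state[key] = state.get(key, 0) + delta

    def submit(self, txn):
        self.apply(self.primary, txn)
        self.pending.append(txn)
        if len(self.pending) > LAG:            # keep the backup exactly LAG steps behind
            self.apply(self.backup, self.pending.popleft())

    def recover_transient_error(self):
        self.primary = dict(self.backup)       # rebuild from the presumed-clean backup
        for txn in self.pending:               # replay the pending transactions
            self.apply(self.primary, txn)

d = TimeLagDuplex()
for _ in range(5):
    d.submit(("acct", 10))
d.primary["acct"] = -999                       # injected transient corruption
d.recover_transient_error()
print(d.primary)                               # {'acct': 50} restored
```
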
  • Formal verification of a TDMA protocol start-up mechanism

    Page(s): 235 - 242

    This paper presents a formal verification of the start-up algorithm of the DACAPO protocol, which uses TDMA (Time Division Multiple Access) bus arbitration. It was verified that an ensemble of four communicating stations becomes synchronized and operational within a bounded time from an arbitrary initial state. The system model included a clock drift of ±10⁻³. The protocol was modeled as a network of timed automata, and verification was performed using the symbolic model checker UPPAAL.

  • High performance fault tolerant computer and its fault recovery

    Page(s): 2 - 6

    The authors proposed a new architecture for a fault-tolerant computer (FTC) called QPR (Quad Processor Redundancy), in which duplicated CPUs operate in hardware lock step and duplicated I/Os are managed by software. A dual system bus connects the two duplicated areas. After recovery from a fault, the system must be resynchronized, so the contents of main memory must be copied from the normal CPU to the other CPU. The copying overhead must be small so that the normal CPU can continue running the application. The authors describe a fault recovery method, in particular a memory copying method: when a memory access occurs, the memory interface unit snoops the data and sends it to the other CPU over the dual system bus. They measured the copy time on a real machine and simulated the copy overhead under a heavy DMA load, obtaining a small overhead and low load dependency.

  • Phaeton: a log-based architecture for high performance file server design

    Page(s): 28 - 33

    An analysis of the construction of high performance network file servers is presented. Phaeton, a log-disk based architecture, is proposed to achieve sequential disk access speeds for both modification and non-modification requests while maintaining strict semantics and recoverability in the presence of failures. The Phaeton architecture can provide dramatic improvements for both single-server file service and multi-server replicated service. A new replication model, called the Symmetric Primary-Backup Model, that combines the network file service with the Phaeton storage architecture is also described. In previous research the authors implemented TOFF, a fault-tolerant network file service based on the Sun NFS protocol. TOFF-2 is a new version that uses both the Phaeton architecture and the Symmetric Primary-Backup Model to unleash the potential bandwidth of the replicated service and provide much better performance than TOFF or other single-server file services.

  • Engineering oriented dependability evaluation: MEADEP and its applications

    Page(s): 85 - 90

    MEADEP is a user-friendly dependability evaluation tool for measurement-based analysis of computing systems. Its features include a data processor for converting data in various formats to the MEADEP format, a statistical analysis module for graphical data presentation and parameter estimation, a graphical modeling interface for building reliability block diagrams and Markov chains, a library of dependability models for constructing custom models, and a model solution module for availability/reliability calculations with graphical parametric analysis. Use of the tool on measured failure data can provide objective dependability evaluations of critical systems while greatly reducing the specialized skills in data processing, analysis, and modeling required of the user. MEADEP has been applied to evaluate the availability of two air traffic control systems from operational failure data, and its results have provided valuable feedback to the management of these critical systems. MEADEP has also been used to analyze a nuclear power plant safety model based on the Eagle 21 architecture and its early field failure data; the results of a sensitivity analysis on this model are discussed.

  • Design of a fault-tolerant microprocessor: a simulation approach

    Page(s): 161 - 166

    This paper presents an approach for assessing the merits and the cost of incorporating processor-level error detection and recovery mechanisms. The approach is exemplified by implementing several fault tolerance mechanisms in a 32-bit, MIPS R3000-compatible RISC microprocessor and conducting simulation-based fault injection experiments. The mechanisms are triple modular redundancy (TMR), retry on duplication-comparison, and retry on parity-checking codes. Reliability gains and performance/area overheads are quantitatively evaluated for each error detection/recovery scheme. The fault injection results indicate that the highest fault coverage is achieved with the code-based retry technique.

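    A minimal sketch of the triple modular redundancy (TMR) voting named in the abstract; the module outputs are plain integers rather than microprocessor signals, and the behaviour when all three copies disagree is an assumption.

```python
def tmr_vote(a, b, c):
    """Majority vote over three redundant module outputs.  A single faulty
    module is masked; if all three disagree there is no majority and the
    voter flags an uncorrectable error (behaviour assumed here)."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise ValueError("no majority: more than one module appears faulty")

print(tmr_vote(42, 42, 7))   # a single fault in the third module is masked -> 42
```
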
  • Increasing software reliability through rollback and on-line fault repair

    Page(s): 208 - 213

    We propose a new paradigm for increasing the reliability of a software system by combining reactive and proactive approaches. The proposed approach employs rollback and restart to mask transient failures, and employs on-line software version change to remove faults from the software. A model for the reliability analysis of a system employing the proposed approach is presented; the analysis shows that a substantial reliability benefit can be obtained. A prototype system which incorporates the proposed approach is also described.

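    A minimal sketch of the combined reactive/proactive idea described above: state is checkpointed after each successful step, a detected failure triggers rollback and restart from the last checkpoint, and repeated failures at the same step trigger an on-line switch to a replacement routine (the version change). The failure model, the retry bound, and the two versions are hypothetical.

```python
import copy
import random

random.seed(1)

def version_1(state):          # hypothetical faulty routine: fails transiently at times
    if random.random() < 0.5:
        raise RuntimeError("transient failure")
    state["x"] += 1

def version_2(state):          # hypothetical replacement routine installed on-line
    state["x"] += 1

def run(steps, max_retries=3):
    state = {"x": 0}
    checkpoint = copy.deepcopy(state)
    impl = version_1
    for _ in range(steps):
        retries = 0
        while True:
            try:
                impl(state)
                checkpoint = copy.deepcopy(state)    # commit a checkpoint after success
                break
            except RuntimeError:
                state = copy.deepcopy(checkpoint)    # rollback and restart the step
                retries += 1
                if retries >= max_retries:           # persistent symptom: repair on-line
                    impl = version_2
    return state

print(run(10))   # {'x': 10} despite the injected transient failures
```
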
  • On transaction liveness in replicated databases

    Page(s): 104 - 109

    This paper makes a first attempt to give a precise characterisation of liveness in replicated database systems. We introduce the notion of liveness degrees, which express the expectation a database user might have about the termination of its transactions despite concurrency and failures. Our liveness degrees are complementary to the traditional transactional safety degrees (e.g., serializability) and lead to a better characterisation of the reliability of database replication protocols. We present a generic framework that abstracts several well-known replication protocols, and we point out an interesting trade-off between liveness and safety properties in these protocols.

  • A cache error propagation model

    Page(s): 15 - 21

    Cache memory is a small, fast memory that holds frequently used data. With increasing processor speed, aggressive design practices increase the probability of fault occurrence and the presence of latent errors, since the processor allows only a short duration for reads and writes. A fault may corrupt the cache memory system or lead to an erroneous internal CPU state. The authors investigate error propagation in the cache memory system due to transient faults in the cache memory itself, in the processor's registers, or both. The information gained from such an investigation should lead to more effective error recovery mechanisms against failures caused by transient faults in the machine's cache memory and register set. They establish that even though the computer system recovers from the effect of a single erroneous cache location or processor register about 50% of the time, the other 50% of the time recovery is achieved only through specific recovery mechanisms. The results are obtained using both a discrete-time Markov model and error injection on a real system.

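    A toy discrete-time Markov chain in the spirit of the model described above; the four states and the transition probabilities are hypothetical rather than the paper's measured values.

```python
import numpy as np

# Hypothetical states: 0 = fault-free, 1 = latent error in cache/register,
# 2 = recovered (error overwritten or masked), 3 = CPU failure.
P = np.array([
    [0.999, 0.001, 0.0,  0.0 ],   # a transient fault creates a latent error
    [0.0,   0.90,  0.05, 0.05],   # latent error persists, is masked, or propagates
    [0.0,   0.0,   1.0,  0.0 ],   # absorbing: recovered
    [0.0,   0.0,   0.0,  1.0 ],   # absorbing: failure
])

dist = np.array([1.0, 0.0, 0.0, 0.0])
for _ in range(10_000):           # iterate long enough to approach absorption
    dist = dist @ P
print("P(recovered):", round(dist[2], 3), " P(failure):", round(dist[3], 3))
```
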
  • Checkpointing Message-Passing Interface (MPI) parallel programs

    Page(s): 147 - 152

    Many scientific problems can be distributed over a large number of processes to take advantage of low-cost workstations. In a parallel system, a failure of any processor can halt the computation and require restarting all applications. Checkpointing is a simple technique for recovering a failed execution. The Message Passing Interface (MPI) is a standard proposed for writing portable message-passing parallel programs. In this paper, we present a checkpointing implementation for MPI programs that is transparent and requires no changes to the application programs. Our implementation combines coordinated, uncoordinated, and message-logging techniques.

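    A minimal sketch of coordinated checkpointing for an MPI program, written with mpi4py; the barrier ensures no application messages are in flight when each rank saves its state. The file layout and the state contents are hypothetical, and unlike the paper's implementation this sketch is not transparent to the application.

```python
# Run with, for example:  mpiexec -n 4 python checkpoint_demo.py
import pickle

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def coordinated_checkpoint(step, state):
    """All ranks synchronize so that no application messages are in transit,
    then each rank saves its local state; recovery restarts every rank from
    the same checkpoint number."""
    comm.Barrier()                                             # global coordination point
    with open(f"ckpt_rank{rank}_step{step}.pkl", "wb") as f:   # hypothetical file layout
        pickle.dump(state, f)
    comm.Barrier()                                             # checkpoint globally complete

state = {"iteration": 0, "partial_sum": 0.0}
for step in range(100):
    state["iteration"] = step
    state["partial_sum"] += rank * 0.5                         # stand-in for real computation
    if step % 25 == 0:
        coordinated_checkpoint(step, state)
```
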
  • A new class of t-error correcting/d-error detecting (d>t) and all unidirectional error detecting codes

    Page(s): 41 - 46

    In this paper, a new class of t-error correcting/d-error detecting and all unidirectional error detecting (t-EC/d-ED/AUED) codes is proposed. Compared to previously published results, this scheme in general needs a smaller or equal number of check bits. Further, the encoding and decoding algorithms for this class of codes can be implemented with faster and simpler hardware. In the case of a ROM implementation, the proposed scheme yields significant savings in word length.

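    The abstract does not give the paper's t-EC/d-ED/AUED construction, so the sketch below shows only the classical Berger code as background for the AUED part: appending the count of zero bits detects all unidirectional errors, because such errors move the recomputed zero count and the stored check value in opposite directions. The example words are hypothetical.

```python
def berger_encode(info_bits):
    """Berger code: append, in binary, the number of 0s in the information bits."""
    k = len(info_bits)
    check_len = max(1, k.bit_length())       # enough bits to count up to k zeros
    zeros = info_bits.count(0)
    check = [int(b) for b in format(zeros, f"0{check_len}b")]
    return info_bits + check, check_len

def berger_check(codeword, check_len):
    info, check = codeword[:-check_len], codeword[-check_len:]
    return info.count(0) == int("".join(map(str, check)), 2)

word, clen = berger_encode([1, 0, 1, 1, 0, 1])
print(berger_check(word, clen))                                     # True: no errors
corrupted = [0 if i in (0, 3) else b for i, b in enumerate(word)]   # two 1->0 errors
print(berger_check(corrupted, clen))                                # False: detected
```
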
  • Adaptive system-level diagnosis and its application

    Page(s): 66 - 71

    System-level diagnosis has been studied for decades, and many solid results have been obtained for this problem. This paper presents a new adaptive diagnosis strategy that requires far fewer tests than previous methods. Since tests can be applied concurrently in this method, the total time span for fault location can be further reduced. Although system-level diagnosis has had few practical applications, we find adaptive diagnosis useful for locating faults in field-programmable gate arrays (FPGAs), and we demonstrate this application in the paper.

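    The abstract does not describe the new strategy, so the following is only a sketch of the basic adaptive idea under the PMC fault model: test outcomes observed so far determine which tests to apply next, a chain in which every unit passed a test by its predecessor is maintained (discarding both units of any failing test), and the unit left on top of the chain, which is fault-free whenever n >= 2t + 1, then diagnoses all the others. The fault bound, system size, and simulated test function are assumptions.

```python
import random

random.seed(0)

# PMC fault model: a fault-free tester reports the true status of the tested
# unit; a faulty tester may report anything.  Hypothetical system: n units,
# at most t faulty, with n >= 2t + 1.
n, t = 9, 3
faulty = set(random.sample(range(n), t))          # hidden ground truth

def test(tester, testee):
    """Outcome of `tester` testing `testee` (0 = pass, 1 = fail)."""
    if tester in faulty:
        return random.randint(0, 1)               # faulty tester: arbitrary outcome
    return 1 if testee in faulty else 0

# Phase 1 (adaptive): maintain a chain in which every unit passed a test by its
# predecessor.  On a failing test discard both units involved, since at least
# one of them is faulty; with n >= 2t + 1 the unit left on top is fault-free.
chain = []
for u in range(n):
    if chain and test(chain[-1], u) == 1:
        chain.pop()                               # drop the tester and skip u
    else:
        chain.append(u)
reliable = chain[-1]

# Phase 2: the reliable unit tests every other unit; its outcomes are accurate.
diagnosed = {u for u in range(n) if u != reliable and test(reliable, u) == 1}
print("actual faulty   :", sorted(faulty))
print("diagnosed faulty:", sorted(diagnosed))
```
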