IEEE Transactions on Reliability

Issue 4 • October 1990

  • Empirically based analysis of failures in software systems

    Page(s): 444 - 454

    An empirical analysis of failures in software systems is used to evaluate several specific issues and questions in software testing, reliability analysis, and reuse. The issues examined include the following: diminishing marginal returns of testing; effectiveness of multiple fault-detection and testing phases; measuring system reliability versus function or component reliability; developer bias regarding the amount of testing that functions or components will receive; fault-proneness of reused versus newly developed software; and the relationship between degree of reuse and development effort and fault-proneness. Failure data from a large software manufacturer and a NASA production environment were collected and analyzed.

  • System availability monitoring

    Page(s): 480 - 485

    A process set up by Digital to monitor and quantify the availability of its systems is described. The reliability data are collected in an automated manner and stored in a database. The breadth of data gathered provides a unique opportunity to correlate hardware and software failures. In addition, several hypotheses have been tested, e.g. the relationship between crash rate and system load, the interdependence of crashes, the cause of crashes, and the effect of new releases of the operating system. It is concluded that the process (in operation since 1988) has yielded worthwhile information on the products monitored. The usual availability metrics are calculated regularly for the machines monitored. Trends in system fault occurrence have been identified, leading to suggestions for both software and hardware improvements. The monitoring process and analysis methodology are revised on an ongoing basis to improve the quality of information obtained and to extend the analysis to Digital's new systems. The recently announced VAX 9000 mainframe and fault-tolerant VAXft 3000 are two such systems.
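
    The "usual availability metrics" mentioned in the abstract are not spelled out there; a minimal sketch of how such metrics are commonly computed from interruption records, assuming an invented record format rather than Digital's actual monitoring schema, is given below.

        # Illustrative availability metrics from crash/interruption records.
        # The inputs (monitoring period, downtime, crash count) are hypothetical
        # and do not reflect the monitoring process described in the paper.

        def availability(total_hours, downtime_hours):
            """Fraction of the monitored period during which the machine was up."""
            return (total_hours - downtime_hours) / total_hours

        def mtbf(total_hours, downtime_hours, n_interruptions):
            """Mean time between failures: uptime divided by interruption count."""
            return (total_hours - downtime_hours) / n_interruptions

        def mttr(downtime_hours, n_interruptions):
            """Mean time to repair: downtime divided by interruption count."""
            return downtime_hours / n_interruptions

        # Example: one machine monitored for 30 days, 3 crashes, 5 hours of downtime.
        hours = 30 * 24
        print(f"availability = {availability(hours, 5.0):.4f}")          # ~0.9931
        print(f"MTBF = {mtbf(hours, 5.0, 3):.1f} h, MTTR = {mttr(5.0, 3):.2f} h")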

  • Experimental evaluation of the fault tolerance of an atomic multicast system

    Page(s): 455 - 467

    The authors present a study of the validation of a dependable local area network providing multipoint communication services based on an atomic multicast protocol. This protocol is implemented in specialized communication servers that exhibit the fail-silent property, i.e. a kind of halt-on-failure behavior enforced by self-checking hardware. The tests that have been carried out utilize physical fault injection and have two objectives: (1) to estimate the coverage of the self-checking mechanisms of the communication servers, and (2) to test the properties that characterize the service provided by the atomic multicast protocol in the presence of faults. The testbed that has been developed to carry out the fault-injection experiments is described, and the major results are presented and analyzed. It is concluded that the fault-injection test sequence revealed the limited performance of the self-checking mechanisms implemented on the tested NAC (network attachment controller) and justified (especially for the main board) the need for the improved self-checking mechanisms implemented in an enhanced NAC architecture using duplicated circuitry.
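
    The coverage estimate in objective (1) is, at bottom, a proportion over the sample of injected faults; a minimal sketch of that estimate with a normal-approximation confidence interval follows. The counts are invented for illustration and are not results from the paper.

        # Point estimate and rough ~95% confidence interval for fault-detection
        # coverage, i.e. the fraction of injected faults caught by the servers'
        # self-checking mechanisms. The tallies below are hypothetical.
        import math

        def coverage_estimate(detected, injected, z=1.96):
            """Return (coverage, half-width of the normal-approximation interval)."""
            c = detected / injected
            half_width = z * math.sqrt(c * (1.0 - c) / injected)
            return c, half_width

        detected, injected = 940, 1000      # hypothetical fault-injection tallies
        c, hw = coverage_estimate(detected, injected)
        print(f"estimated coverage = {c:.3f} +/- {hw:.3f}")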

  • Simulated fault injection: a methodology to evaluate fault tolerant microprocessor architectures

    Page(s): 486 - 491

    A simulation-based fault-injection methodology for validating fault-tolerant microprocessor architectures is described. The approach uses mixed-mode simulation (electrical/logic analysis) and injects transient errors at run time to assess the resulting fault impact. To exemplify the methodology, a fault-tolerant architecture which models the digital aspects of a dual-channel, real-time jet-engine controller is used. The level of effectiveness of the dual configuration with respect to single and multiple transients is measured. The results indicate 100% coverage of single transients. Approximately 12% of the multiple transients affect both channels; none result in controller failure, since two additional levels of redundancy exist.
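
    As a toy illustration of the general idea (not the mixed-mode electrical/logic simulation used by the authors), the sketch below injects random transient bit flips into a crude two-channel model and counts how often the cross-channel comparison catches the upset and how often both channels are hit. All of the model details are invented.

        # Toy Monte Carlo fault injection into a dual-channel computation.
        # This only illustrates the flavor of simulation-based injection; the
        # paper uses mixed-mode simulation of a real jet-engine controller model.
        import random

        WORD_BITS = 16

        def channel_output(x, flipped_bit=None):
            """Stand-in 'channel computation'; optionally corrupt one state bit."""
            if flipped_bit is not None:
                x ^= 1 << flipped_bit            # transient upset of one bit
            return (3 * x + 7) & 0xFFFF          # arbitrary deterministic function

        def run_trials(n_trials, n_transients):
            """Inject n_transients upsets per trial, each hitting a random channel."""
            detected = both_hit = 0
            for _ in range(n_trials):
                x = random.getrandbits(WORD_BITS)
                flips = {0: None, 1: None}       # last injected bit per channel
                for _ in range(n_transients):
                    flips[random.choice((0, 1))] = random.randrange(WORD_BITS)
                if channel_output(x, flips[0]) != channel_output(x, flips[1]):
                    detected += 1                # channel comparison flags the upset
                if flips[0] is not None and flips[1] is not None:
                    both_hit += 1
            return detected, both_hit

        trials = 10_000
        single_detected, _ = run_trials(trials, 1)
        _, double_both = run_trials(trials, 2)
        print(f"single transients detected: {single_detected / trials:.1%}")
        print(f"double transients hitting both channels: {double_both / trials:.1%}")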

  • A case study of Ethernet anomalies in a distributed computing environment

    Page(s): 433 - 443

    Fault detection and diagnosis depend critically on good fault definitions, but the dynamic, noisy, and nonstationary character of networks makes it hard to define what a fault is in a network environment. The authors take the position that a fault or failure is a violation of expectations. In accordance with empirically based expectations, operating behaviors of networks (and other devices) can be classified as either normal or anomalous. Because network failures most frequently manifest themselves as performance degradations or deviations from expected behavior, periods of anomalous performance can be attributed to causes assignable as network faults. In the half-year case study presented, a system automatically and systematically classified observations of distributed-computing network behavior as normal or anomalous, and anomalous behaviors were traced to faulty conditions. In a preliminary effort to understand and catalog how networks behave under various conditions, two cases of anomalous behavior are analyzed in detail. Examples are taken from the distributed file-system network at Carnegie Mellon University.
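
    A minimal sketch of the "violation of expectations" idea, assuming a simple numeric baseline and a fixed deviation threshold, is given below; the classification system used in the case study is considerably richer than this.

        # Classify an observation as normal or anomalous relative to empirically
        # based expectations. Generic illustration only, not the authors' system;
        # the metric, baseline values, and threshold are hypothetical.
        from statistics import mean, stdev

        def classify(history, observation, k=3.0):
            """Anomalous if the observation deviates from the empirical baseline
            by more than k standard deviations."""
            mu, sigma = mean(history), stdev(history)
            if sigma == 0:
                return "anomalous" if observation != mu else "normal"
            return "anomalous" if abs(observation - mu) > k * sigma else "normal"

        # Hypothetical packet-rate samples (packets/s) from a healthy period.
        baseline = [410, 395, 420, 430, 405, 415, 400, 425, 412, 418]
        print(classify(baseline, 417))   # normal
        print(classify(baseline, 900))   # anomalous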

  • Validating complex computer system availability models

    Page(s): 468 - 479

    The authors report on experiences in validating complex computer-system availability models. A validation process, the availability models, and the data-collection process are described. An iteration of the model-validation process, emphasizing discrepancies between observed system behavior from data and the model assumptions, is presented. Analysis of data from five sites revealed that interruption dependencies exist both on a device and across devices within a system. Furthermore, the frequency of, dependencies between, and downtime of device interruptions constitute differentiating characteristics of the causes of device interruptions. Data on partially available machine states, nonexponentially distributed machine downtimes, and nonstationarity of the machine interruption process are included. Validating the model assumptions is essential to model interpretation. Model accuracy is assessed. The assumptions of exponentially distributed times between interruptions on a device and exponentially distributed device downtimes are not significant with respect to average measures of availability. The benefits of and issues for availability-model validation are summarized.
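
    One of the assumption checks mentioned above, whether device downtimes look exponentially distributed, can be sketched crudely as follows; the coefficient-of-variation screen and the downtime figures are illustrative assumptions, not the validation procedure or data used in the paper.

        # Crude screen for the exponential assumption: an exponential distribution
        # has coefficient of variation (stdev/mean) equal to 1, so a sample CV far
        # from 1 casts doubt on the assumption. The downtimes below are invented.
        from statistics import mean, stdev

        def coefficient_of_variation(samples):
            return stdev(samples) / mean(samples)

        downtimes_minutes = [5, 7, 6, 240, 4, 8, 5, 180, 6, 7, 5, 6]
        cv = coefficient_of_variation(downtimes_minutes)
        print(f"sample CV = {cv:.2f}")
        if not 0.67 <= cv <= 1.5:
            print("downtimes look markedly non-exponential; revisit the assumption")
        else:
            print("CV alone gives no strong evidence against the exponential model")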

  • Predicting and eliminating built-in test false alarms

    Page(s): 500 - 505

    Failures detected by built-in test equipment (BITE) occur because of BITE measurement noise or bias as well as actual hardware failures. A quantitative approach is proposed for setting built-in test (BIT) measurement limits, and this method is applied to the specific case of a constant-failure-rate system whose BITE measurements are corrupted by Gaussian noise. Guidelines for setting BIT measurement limits are presented for a range of system MTBF (mean time between failures) values and BIT run times. The technique was applied to a BIT for an analog VLSI test system with excellent results, showing it to be a powerful tool for predicting tests with the potential for false alarms. It was discovered that, for this test case, false alarms are avoidable.
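
    The Gaussian-noise portion of the argument can be made concrete with a short calculation: if a two-sided BIT measurement limit is placed k standard deviations from the nominal value, the per-measurement false-alarm probability, and hence the expected number of false alarms over a campaign of BIT runs, follows directly. The k values and measurement count below are illustrative, not design values from the paper.

        # False-alarm probability for a two-sided BIT limit at k standard
        # deviations of zero-mean Gaussian measurement noise, and the expected
        # number of false alarms over a hypothetical number of BIT measurements.
        import math

        def false_alarm_probability(k):
            """P(|noise| > k*sigma) for zero-mean Gaussian noise."""
            return math.erfc(k / math.sqrt(2.0))

        measurements = 100_000                   # hypothetical campaign size
        for k in (3.0, 4.0, 5.0):
            p = false_alarm_probability(k)
            print(f"k = {k:.0f}: P(false alarm) = {p:.2e}, "
                  f"expected false alarms = {p * measurements:.1f}")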

  • Evaluation and design of an ultra-reliable distributed architecture for fault tolerance

    Page(s): 492 - 499

    The issues related to the experimental evaluation of an early conceptual prototype of the MAFT (multicomputer architecture for fault tolerance) architecture are discussed. A completely automated testing approach was designed to allow fault-injection experiments, including stuck-at and memory faults, to be performed. Over 2000 injection tests were run, and the system successfully tolerated all faults. Concurrent with the experimental evaluation, an analytic evaluation was carried out to determine whether higher levels of reliability could be achieved. The lessons learned in the evaluation phase culminated in a new design of the MAFT architecture for applications needing ultrareliability. The design uses the concept of redundantly self-checking functions to address the rigid requirements proposed for a future generation of mission-critical avionics. The testing of three subsystems critical to the operation of the new MAFT is presented; close to 50,000 test cycles were performed over 51 different IC devices to verify the designs.

  • Error log analysis: statistical modeling and heuristic trend analysis

    Page(s): 419 - 432

    Most error-log analysis studies perform a statistical fit to the data assuming a single underlying error process. The authors present the results of an analysis demonstrating that the log is composed of at least two error processes: transient and intermittent. The mixing of data from multiple processes requires many more events to verify a hypothesis using traditional statistical analysis. Based on the shape of the interarrival-time function of the intermittent errors observed from actual error logs, a failure-prediction heuristic, the dispersion frame technique (DFT), is developed. The DFT was implemented in a distributed system for the campus-wide Andrew file system at Carnegie Mellon University. Data collected from 13 file servers over a 22-month period were analyzed using both the DFT and conventional statistical methods. It is shown that the DFT can extract intermittent errors from the error log and uses only one-fifth of the error-log entry points required by statistical methods for failure prediction. The DFT achieved a 93.7% success rate in predicting failures in both electromechanical and electronic devices.
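
    The dispersion frame technique itself is defined by frame-based rules that the abstract does not spell out; as a rough stand-in for the underlying idea, the sketch below warns about a device when its recent error interarrival times shrink sharply, which is the kind of intermittent-error trend the DFT is designed to catch. This heuristic and its thresholds are assumptions for illustration, not the published DFT.

        # Simplified interarrival-time heuristic in the spirit of error-log trend
        # analysis: warn when a device's recent errors arrive markedly faster than
        # its earlier ones. NOT the published dispersion frame technique.

        def interarrival_times(timestamps):
            """Hours between consecutive error-log entries for one device."""
            return [b - a for a, b in zip(timestamps, timestamps[1:])]

        def warn(timestamps, window=3, speedup=4.0):
            """Warn if the mean of the last `window` gaps is at least `speedup`
            times shorter than the mean of the earlier gaps."""
            gaps = interarrival_times(timestamps)
            if len(gaps) <= window:
                return False
            recent = sum(gaps[-window:]) / window
            earlier = sum(gaps[:-window]) / (len(gaps) - window)
            return recent > 0 and earlier / recent >= speedup

        # Hypothetical error timestamps (hours) for one disk: sparse errors, then a burst.
        print(warn([0, 160, 330, 500, 505, 509, 512]))   # True: errors are accelerating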

  • A census of Tandem system availability between 1985 and 1990

    Page(s): 409 - 418

    A census of customer outages reported to Tandem has been taken; it shows a clear improvement in the reliability of hardware and maintenance. It indicates that software is now the major source of reported outages (62%), followed by system operations (15%). This is a dramatic shift from the statistics for 1985. Even after discounting systematic underreporting of operations and environmental outages, the conclusion is clear: hardware faults and hardware maintenance are no longer a major source of outages. As the other components of the system become increasingly reliable, software necessarily becomes the dominant cause of outages. Achieving higher availability requires improvement in software quality and software fault tolerance, simpler operations, and tolerance of operational faults.


Aims & Scope

IEEE Transactions on Reliability is concerned with the problems involved in attaining reliability, maintaining it through the life of the system or device, and measuring it.

Meet Our Editors

Editor-in-Chief
Way Kuo
City University of Hong Kong