International Conference on Dependable Systems and Networks (DSN 2001)

1-4 July 2001

Papers 1-25 of 55
  • Proceedings International Conference on Dependable Systems and Networks

  • The reliability of diverse systems: a contribution using modelling of the fault creation process

    Page(s): 5 - 14

    Design diversity is a defence against design faults causing common-mode failure in redundant systems, but we badly lack knowledge about how much reliability it will buy in practice, and thus about its cost-effectiveness, the situations in which it is an appropriate solution and how it should be taken into account by assessors and safety regulators. Both current practice and the scientific debate about design diversity depend largely on intuition. More formal probabilistic reasoning would facilitate critical discussion and empirical validation of any predictions: to this aim, we propose a model of the generation of faults and failures in two separately-developed program versions. We show results on: (i) what degree of reliability improvement an assessor can reliably expect from diversity; and (ii) how this reliability improvement may change with higher-quality development processes. We discuss the practical relevance of these results and the degree to which they can be trusted.
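
    As a flavor of the probabilistic modelling involved, the sketch below uses an Eckhardt-and-Lee-style "difficulty" distribution over demands (the distribution and all numbers are invented; this is not the paper's fault-creation model) to estimate how much a 1-out-of-2 diverse pair improves on a single version:

```python
import random

# Hypothetical per-demand "difficulty": the probability that an independently
# developed version fails on that demand (an Eckhardt/Lee-style model; the
# Beta distribution and its parameters are made up for illustration).
random.seed(1)
N = 100_000
difficulty = [random.betavariate(0.5, 200) for _ in range(N)]

p_single = sum(difficulty) / N                 # pfd of one version
p_pair = sum(d * d for d in difficulty) / N    # pfd of a 1-out-of-2 diverse pair

print(f"single version pfd : {p_single:.2e}")
print(f"diverse pair pfd   : {p_pair:.2e}")
print(f"actual gain        : {p_single / p_pair:.0f}x")
print(f"independence bound : {1 / p_single:.0f}x (rarely achievable)")
```

    The gap between the actual gain and the independence bound is exactly the kind of effect such models make explicit: variation in demand difficulty correlates the failures of independently developed versions.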

  • Difficulties measuring software risk in an industrial environment

    Page(s): 15 - 24

    Software risk is intended to reflect loss due to software failure. This has traditionally been computed by taking the product of two things: a probability of occurrence and the cost associated with failures. Applying these definitions in practice, however, may be much harder than it at first appears. There are two types of problems that affect the applicability and usefulness of such a computation: that the user has to know detailed information that is not normally available, and that most risk definitions do not use relevant information that is available, including information derived from testing. A definition of risk is introduced that will be usable in industrial settings. We also explore ways of incorporating information about how the software has been tested, the degree to which the software has been tested, and the observed results.
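
    A minimal sketch of the traditional probability-times-cost computation, crudely extended with test outcomes; every function, weight, and number here is hypothetical, not the paper's definition:

```python
# Illustrative only: "risk = P(failure) x cost", with the failure probability
# naively adjusted by test results (all numbers and the weighting invented).
def failure_probability(prior: float, tests_run: int, tests_failed: int) -> float:
    """Shrink the prior estimate as more tests pass; inflate it on failures."""
    if tests_run == 0:
        return prior
    observed = tests_failed / tests_run
    weight = min(tests_run / 1000.0, 1.0)   # trust tests more as coverage grows
    return (1 - weight) * prior + weight * observed

def risk(prior: float, cost_of_failure: float, tests_run=0, tests_failed=0) -> float:
    return failure_probability(prior, tests_run, tests_failed) * cost_of_failure

# A module with an assumed 1% failure probability and a $2M failure cost,
# after 500 tests with no observed failures:
print(f"risk estimate: ${risk(0.01, 2_000_000, tests_run=500):,.0f}")
```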

  • Techniques for estimation of design diversity for combinational logic circuits

    Page(s): 25 - 34

    Design diversity has long been used to protect redundant systems against common-mode failures. The conventional notion of diversity relies on "independent" generation of "different" implementations of the same logic function. This concept is qualitative and does not provide a basis to compare the reliabilities of two diverse systems. In a recent paper, we presented a metric to quantify diversity among several designs. The problem of calculating the diversity metric is NP-complete and can be of exponential complexity. In this paper we present techniques to estimate the value of the design diversity metric. For datapath designs, we have formulated very fast techniques to calculate the value of the metric by exploiting the regularity in the datapath structures. For general combinational logic circuits, we present an adaptive Monte-Carlo simulation technique for estimating bounds on the value of the metric. The adaptive Monte-Carlo simulation technique provides accurate estimates of the design diversity metric; the number of simulations used to reach this estimate is polynomial (instead of exponential) in the number of circuit inputs. Moreover, the number of simulations can be tuned depending on the desired accuracy.
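
    The sketch below conveys the Monte-Carlo flavor on a toy example: two copies of a 16-bit function with injected stuck-at faults, and a naive estimate of the probability that both fail identically. The paper's metric and its adaptive sampling are more refined; everything here is invented for illustration.

```python
import random

def impl(x):                        # the common specification, as a function
    return (x ^ (x >> 3)) & 0xFFFF

def with_stuck_bit(f, bit):         # model a stuck-at-1 fault at one output bit
    return lambda x: f(x) | (1 << bit)

def common_mode_prob(fa, fb, golden, trials=50_000, rng=random.Random(0)):
    """Fraction of random inputs where both faulty copies agree AND are wrong."""
    hits = 0
    for _ in range(trials):
        x = rng.getrandbits(16)
        ya, yb = fa(x), fb(x)
        hits += (ya == yb) and (ya != golden(x))
        # an adaptive scheme would stop once the running estimate stabilizes
    return hits / trials

same = common_mode_prob(with_stuck_bit(impl, 2), with_stuck_bit(impl, 2), impl)
diff = common_mode_prob(with_stuck_bit(impl, 2), with_stuck_bit(impl, 7), impl)
print(f"identical faults: {same:.3f}   different faults: {diff:.3f}")  # ~0.5 vs 0.0
```

    The contrast between the two estimates is the intuition the metric quantifies: identical faults in both copies fail in common mode often, while different faults almost never do.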

  • Concurrent error detection in fast unitary transform algorithms

    Page(s): 37 - 46

    Discrete fast unitary transform algorithms, of which the fast Fourier transform (FFT) and fast discrete cosine transform (DCT) are practical examples, are highly susceptible to temporary calculation failures because of their interconnected computational flows. Many error detection techniques for FFT algorithms have been reported, but fault tolerance issues for other important transforms have not been addressed as vigorously. A general design and analysis approach for all fast unitary transforms is presented. It relies on fundamental linear algebra methods coupled with associated dual space representations that are natural descriptions of real parity values. Basic output error patterns from single computational errors are used to define an equal-sized group of dual space basis vectors on which practical parity weighting functions may be evaluated. An iterative design approach leads to complete single error detection capabilities. FFT and fast DCT examples are given.
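
    For flavor, here is one concrete (unweighted) linear parity relation on the DFT; it detects errors injected in the outputs, whereas the paper designs weighted checks from dual-space bases that also cover internal butterfly errors:

```python
import numpy as np

# One simple linear parity relation on the unnormalized DFT: every column of
# the DFT matrix except the first sums to zero, so sum_k X[k] == N * x[0].
# Checking it costs only O(N) extra work on top of the transform.
rng = np.random.default_rng(0)
N = 1024
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)

X = np.fft.fft(x)
assert np.isclose(X.sum(), N * x[0], rtol=0, atol=1e-6)   # fault-free: holds

X[137] += 1e-3                                            # single output error
detected = not np.isclose(X.sum(), N * x[0], rtol=0, atol=1e-6)
print(f"error detected: {detected}")
```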

  • Fault-tolerant high-performance matrix multiplication: theory and practice

    Page(s): 47 - 56

    We extend the theory and practice regarding algorithmic fault-tolerant matrix-matrix multiplication, C=AB, in a number of ways. First, we propose low-overhead methods for detecting errors introduced not only in C but also in A and/or B. Second, we show that, theoretically, these methods will detect all errors as long as only one entry is corrupted. Third, we propose a low-overhead roll-back approach to correct errors once detected. Finally, we give a high-performance implementation of matrix-matrix multiplication that incorporates these error detection and correction methods. Empirical results demonstrate that these methods work well in practice while imposing an acceptable level of overhead relative to high-performance implementations without fault-tolerance.
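
    A sketch of the classic checksum idea underlying such schemes (Huang-Abraham style), with detection and location by row/column checksum mismatch and a naive recompute-one-entry correction; the paper's methods are more general and lower-overhead:

```python
import numpy as np

# Checksum-based detection for C = A @ B: the row sums of C must equal
# A @ (B @ 1), and the column sums must equal (1^T A) @ B, both O(n^2) checks.
rng = np.random.default_rng(1)
n = 256
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
ones = np.ones(n)

C = A @ B
C[17, 42] += 5.0                          # inject a single corrupted entry

row_check = C @ ones - A @ (B @ ones)     # nonzero only in the corrupted row
col_check = ones @ C - (ones @ A) @ B     # nonzero only in the corrupted column
tol = 1e-8 * n                            # allow for floating-point noise
if np.abs(row_check).max() > tol:
    i = int(np.argmax(np.abs(row_check)))
    j = int(np.argmax(np.abs(col_check)))
    print(f"error located at C[{i}, {j}]")        # -> C[17, 42]
    C[i, j] = A[i, :] @ B[:, j]                   # naive correction: recompute it
```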

  • How to model link failures: a perception-based fault model

    Page(s): 57 - 66

    We propose a new hybrid fault model for clock synchronization and single-round (approximate) agreement in synchronous distributed systems, which accurately captures both node and link faults. Unlike conventional "global" fault models, which rest upon the total number of faulty nodes in the system, it solely relies upon the number of faults in any two non-faulty nodes' "perceptions" (conveyed by the messages from all other nodes) of the system. This way, arbitrary node and communication faults, including receiver-caused omission and time/value faults, can be modeled properly. As an example, we show that the consistent broadcast primitive (and hence the clock synchronization algorithms) of Srikanth & Toueg (1987) can be analyzed under this model. As far as link faults are concerned, our analysis reveals that as few as 4f_a + 2f_s + 2f_o + 1 nodes are sufficient for tolerating at most f_a asymmetric, f_s symmetric, and f_o omission link faults at any receiving node.
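
    The quoted sufficiency bound is a direct formula; a one-line transcription:

```python
def nodes_required(f_a: int, f_s: int, f_o: int) -> int:
    """Abstract's bound: 4 nodes per asymmetric link fault, 2 per symmetric,
    2 per omission, plus one."""
    return 4 * f_a + 2 * f_s + 2 * f_o + 1

print(nodes_required(f_a=1, f_s=1, f_o=2))   # -> 11
```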

  • A checkpointing tool for Palm operating system

    Page(s): 71 - 76

    It is foreseeable that handheld devices will be involved in the arena of distributed computing in the near future. To provide a dependable computing environment, checkpointing and rollback recovery is a useful and important technique for fault-tolerant distributed computing systems. For the most popular platform among handhelds, Palm OS, its built-in HotSync tool can take a partial snapshot of a system state, but it synchronizes only the static data in the handheld with a PC. All dynamic data of applications are lost if a failure occurs and the Palm OS is reset. In order to accommodate mobile computing devices with checkpointing and rollback recovery capability, dynamic data such as global variables should be checkpointed to tolerate system reset/crash failures. Therefore, we developed a checkpointing tool, which provides a set of APIs to checkpoint Palm applications. Using the checkpointing tool, dynamic data in a Palm device can be saved and recovered after a system reset. We describe the tool and demonstrate its usefulness in four popular Palm applications.
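
    An illustrative checkpoint/recover pair in the same spirit (the real tool exposes C APIs on Palm OS; the names and file format here are invented):

```python
import os
import pickle

CKPT = "app.ckpt"

def checkpoint(state: dict) -> None:
    """Persist the application's dynamic data atomically."""
    with open(CKPT + ".tmp", "wb") as f:   # write-then-rename keeps the old
        pickle.dump(state, f)              # checkpoint valid even if we crash
    os.replace(CKPT + ".tmp", CKPT)        # mid-save

def recover(default: dict) -> dict:
    """Restore the last checkpoint, or start fresh on the first run."""
    try:
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default

state = recover({"score": 0, "level": 1})  # "dynamic data" to preserve
state["score"] += 10
checkpoint(state)                          # survives a reset of the process
```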

  • Model-based synthesis of fault trees from Matlab-Simulink models

    Page(s): 77 - 82

    We outline a new approach to safety analysis in which concepts of computer HAZOP are fused with the idea of software fault tree analysis to enable a continuous assessment of an evolving programmable design developed in Matlab-Simulink. We also discuss the architecture of a tool that we have developed to support the new method and enable its application in complex environments. We show that the method and the tool enable the integrated hardware and software analysis of a programmable system and that in the course of that analysis they automate and simplify the development of fault trees for the system. Finally, we propose a demonstration of the method and the tool and we outline the experimental platform and aims of that demonstration.

  • GOOFI: generic object-oriented fault injection tool

    Page(s): 83 - 88

    We present a new fault injection tool called GOOFI (Generic Object-Oriented Fault Injection). GOOFI is designed to be adaptable to various target systems and different fault injection techniques. The tool is highly portable between different host platforms since it relies on the Java programming language and an SQL compatible database. The current version of the tool supports pre-runtime software implemented fault injection and scan-chain implemented fault injection.
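
    A toy sketch of pre-runtime software-implemented fault injection, the simplest of the techniques mentioned (GOOFI itself is a Java tool that logs campaigns to an SQL database; everything here is invented for illustration):

```python
import random

def inject_bitflip(image: bytearray, rng: random.Random) -> tuple[int, int]:
    """Pre-runtime SWIFI: flip one randomly chosen bit of the program image."""
    byte = rng.randrange(len(image))
    bit = rng.randrange(8)
    image[byte] ^= 1 << bit
    return byte, bit

image = bytearray(b"\x90" * 64)           # stand-in for a real binary image
byte, bit = inject_bitflip(image, random.Random(42))
print(f"flipped bit {bit} of byte {byte}")
# a campaign would now run the faulty image, observe the outcome, and log it
```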

  • Quantifying the reliability of embedded systems by automated analysis

    Page(s): 89 - 94

    In embedded systems development, safety and reliability are important quality characteristics. It is thus necessary to determine the reliability and safety of a complete system including hardware and software. It is reasonable to analyze particular failures that may cause, for example, safety-critical situations. The basic idea of our approach is the automated generation of so-called fault trees based on the source code of the software. These may be combined with fault trees based on the electronic circuit design of the hardware. It is important to do the generation in such a way that the software and the hardware fault trees can be composed into a valid fault tree of the system. Fault tree generation is a largely automated approach. We have implemented a prototype of a fault tree generation tool that is capable of generating fault trees based on C++ code. The fault tree generation tool for electronic circuits has already been used to analyze failure situations in industrial automation applications. If a structural approach is not applicable, stochastic techniques may be used. We developed the reliability assessment tool RAT that supports reliability analysis of software systems.

  • Avoiding priority inversion on the processing of requests by active replicated servers

    Page(s): 97 - 106

    We consider the priority inversion problem in an actively replicated system. Priority inversion was originally defined in the context of non-replicated systems. Therefore we first introduce the concept of group priority inversion, which extends the concept of (local) priority inversion to the context of a group of processors that perform an actively replicated processing. We then present the properties of a request scheduling protocol to enforce a total ordering for the processing of requests while avoiding group priority inversions. These properties have been implemented in a protocol that relies on a timed asynchronous system model equipped with a failure detector of the class ◊S. The proposed solution allows one to replicate a critical server while ensuring that the processing of all the incoming requests is consistent (mechanisms for solving the atomic broadcast problem) and predictable (mechanisms for solving the group priority inversion problem). Thus, the described request scheduling protocol is a key component which can be used to develop fault-tolerant real-time applications in a timed asynchronous system.
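
    A toy illustration of the ordering property only: replicas that process requests in an agreed (priority, sequence) order never let a low-priority request overtake a high-priority one. Reaching agreement on that order is the hard part the protocol solves; this sketch simply assumes it:

```python
import heapq

class ReplicaQueue:
    """Deliver requests in an agreed (priority, sequence) total order."""
    def __init__(self):
        self._h = []
    def deliver(self, priority: int, seq: int, req: str) -> None:
        heapq.heappush(self._h, (-priority, seq, req))  # higher priority first
    def next(self) -> str:
        return heapq.heappop(self._h)[2]

q = ReplicaQueue()
q.deliver(1, 1, "log cleanup")
q.deliver(9, 2, "emergency call")
print(q.next())   # -> "emergency call", despite arriving later
```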

  • A dynamic replica selection algorithm for tolerating timing faults

    Page(s): 107 - 116

    Server replication is commonly used to improve the fault tolerance and response time of distributed services. An important problem when executing time-critical applications in a replicated environment is that of preventing timing failures by dynamically selecting the replicas that can satisfy a client's timing requirement, even when the quality of service is degraded due to replica failures and excess load on the server. We describe the approach we have used to solve this problem in AQuA, a CORBA-based middleware that transparently replicates objects across a local area network. The approach we use estimates a replica's response time distribution based on performance measurements regularly broadcast by the replica. An online model uses these measurements to predict the probability with which a replica can prevent a timing failure for a client. A selection algorithm then uses this prediction to choose a subset of replicas that can together meet the client's timing constraints with at least the probability requested by the client. We conclude with experimental results based on our implementation.
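
    A simplified version of the selection idea: estimate each replica's probability of meeting the deadline from measured response times, then add replicas until the combined probability reaches the client's target. The paper fits an online model rather than this naive empirical estimate, and the independence assumption below is ours:

```python
def p_meet(samples: list[float], deadline: float) -> float:
    """Empirical probability that a replica responds within the deadline."""
    return sum(t <= deadline for t in samples) / len(samples)

def select(replicas: dict[str, list[float]], deadline: float, target: float) -> list[str]:
    probs = {r: p_meet(ts, deadline) for r, ts in replicas.items()}
    chosen, p_all_miss = [], 1.0
    for r in sorted(probs, key=probs.get, reverse=True):   # best replicas first
        chosen.append(r)
        p_all_miss *= 1.0 - probs[r]        # assumes independent timing failures
        if 1.0 - p_all_miss >= target:
            return chosen
    raise RuntimeError("no subset meets the requested probability")

# hypothetical response-time samples (ms) per replica
replicas = {"r1": [12, 35, 18], "r2": [50, 55, 48], "r3": [20, 22, 90]}
print(select(replicas, deadline=30.0, target=0.85))   # -> ['r1', 'r3']
```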

  • Online reconfiguration in replicated databases based on group communication

    Page(s): 117 - 126

    In recent years, many replica control protocols have been developed that take advantage of the ordering and reliability semantics of group communication primitives to simplify database system design and to improve performance. Although current solutions are able to mask site failures effectively, many of them are unable to cope with recovery of failed sites, merging of partitions, or joining of new sites. This paper addresses this important issue. It proposes efficient solutions for online system reconfiguration, providing new sites with a current state of the database without interrupting transaction processing in the rest of the system. Furthermore, the paper analyzes the impact of cascading reconfigurations, and argues that they can be handled in an elegant way by extended forms of group communication.

  • Improving driver robustness: an evaluation of the Devil approach

    Page(s): 131 - 140

    To keep up with the frantic pace at which devices come out, drivers need to be quickly developed, debugged and tested. We have recently introduced a new approach to improve driver robustness based on an Interface Definition Language, named Devil. Devil allows a high-level definition of the communication with a device. A compiler automatically checks the consistency of a Devil specification and generates stubs that include run-time checks. We use mutation analysis to evaluate the improvement in driver robustness offered by Devil. To do so, we have injected programming errors, generated by mutation analysis, into Devil-based Linux drivers and the original C drivers. We assess how early errors can be caught in the development process by measuring whether errors are detected either at compile time or at run time. The results of our experiments on the IDE Linux disk driver show that nearly 3 times more errors are detected in the Devil driver than in the original C driver.

  • Robustness testing and hardening of CORBA ORB implementations

    Page(s): 141 - 150

    Before using CORBA (Common Object Request Broker Architecture) applications in mission-critical scenarios, it is important to understand the robustness of the Object Request Broker (ORB) being used, which forms the platform for CORBA applications. We have extended the Ballista software testing technique to test the exception-handling robustness of C++ ORB client-side application interfaces, and have tested two major versions of three ORB implementations on two operating systems, yielding robustness failure rates ranging from 26% to 42%. To improve ORB robustness, we also propose a probing method to harden object and pseudo-object related data types against exceptional inputs. Using these probes on omniORB 2.8 has proven to be effective in eliminating some cases of robustness failures found during testing. These results suggest that CORBA implementations currently have significant robustness vulnerabilities, but that some important classes of problems can be overcome with better exception-handling approaches.
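
    A miniature Ballista-flavored probe: drive an interface with exceptional parameter values and record how each call fails. A real harness isolates each call in its own process and also detects crashes and hangs; the target function below is arbitrary:

```python
import math

# Exceptional values in the Ballista style: nulls, boundary numbers,
# wrong types, and extreme magnitudes (this particular list is invented).
exceptional = [None, -1, 0, float("nan"), float("inf"), "junk", 10**100]

def probe(fn):
    """Call fn with each exceptional value and classify the outcome."""
    results = {}
    for v in exceptional:
        try:
            fn(v)
            results[v] = "ok"
        except Exception as e:          # a graceful, reported failure
            results[v] = type(e).__name__
    return results                      # crashes/hangs need process isolation

print(probe(math.sqrt))   # e.g. None -> TypeError, -1 -> ValueError
```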

  • Constructing self-testable software components

    Page(s): 151 - 160

    Component-based software engineering techniques are gaining substantial interest because of their potential to improve productivity and lower development costs of new software applications while still satisfying high reliability requirements. A first step to address such high reliability requirements consists in reusing reliable components. To merit the attribute "reliable", a component should be extensively validated. Since testing is the technique most commonly used for validation, this means that reusable components should be well tested. For tests to be applied efficiently and on time, a component should be testable. This paper presents an approach to improve component testability by integrating testing resources into it, and hence obtaining a self-testable component. A prototyping tool, Concat, was developed to support the proposed approach. The tool is intended for OO components implemented in C++. Some preliminary results of an empirical evaluation of the fault detection effectiveness of the proposed testing approach are also discussed.
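
    A toy self-testable component: the test cases and oracle are packaged with the component itself. The paper's approach and its Concat tool target C++ classes; this Python analogue is only illustrative:

```python
class Stack:
    """A component that carries its own built-in test resources."""
    def __init__(self):
        self._items = []
    def push(self, v):
        self._items.append(v)
    def pop(self):
        return self._items.pop()

    def self_test(self) -> bool:
        """Built-in test: exercise the component against embedded oracles."""
        probe = Stack()
        probe.push(1)
        probe.push(2)
        ok = probe.pop() == 2 and probe.pop() == 1   # LIFO oracle
        try:
            probe.pop()
            return False              # popping an empty stack must raise
        except IndexError:
            return ok

assert Stack().self_test()            # the component validates itself on site
```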

  • An approach for analysing the propagation of data errors in software

    Page(s): 161 - 170

    We present a novel approach for analysing the propagation of data errors in software. The concept of error permeability is introduced as a basic measure upon which we define a set of related measures. These measures guide us in the process of analysing the vulnerability of software to find the modules that are most likely exposed to propagating errors. Based on the analysis performed with error permeability and its related measures, we describe how to select suitable locations for error detection mechanisms (EDMs) and error recovery mechanisms (ERMs). A method for experimental estimation of error permeability, based on fault injection, is described, and the software of a real embedded control system is analysed to show the type of results obtainable by the analysis framework. The results show that the developed framework is very useful for analysing error propagation and software vulnerability and for deciding where to place EDMs and ERMs.
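
    A sketch of estimating one permeability-style figure by fault injection: the fraction of injected input errors that propagate to the module output (the module, the error model, and all numbers are invented):

```python
import random

def module(a: float, b: float) -> float:
    return max(0.0, a - b) * 0.5          # the software module under analysis

def permeability(module, workload, perturb_index, trials=10_000) -> float:
    """Fraction of injected errors on one input that reach the output."""
    rng = random.Random(0)
    propagated = 0
    for _ in range(trials):
        args = list(rng.choice(workload))
        clean = module(*args)                       # golden run
        args[perturb_index] += rng.gauss(0, 1.0)    # inject an input data error
        if module(*args) != clean:                  # did the error permeate?
            propagated += 1
    return propagated / trials

workload = [(3.0, 1.0), (0.5, 2.0), (4.0, 4.0)]     # representative inputs
for i in (0, 1):
    print(f"permeability of input {i}: {permeability(module, workload, i):.2f}")
```

    Inputs whose errors rarely reach the output need less protection; a high-permeability path is a candidate location for an EDM or ERM.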

  • Enhancing survivability of security services using redundancy

    Page(s): 173 - 182

    Traditional distributed system services that provide guarantees related to confidentiality, integrity, and authenticity enhance security, but are not survivable since each attribute is implemented by a single method. This paper advocates the use of redundancy to increase survivability by using multiple methods to implement each security attribute and doing so in ways that can vary unpredictably. As a concrete example, the design and implementation of a highly configurable secure communication service called SecComm are presented. The service has been implemented using Cactus, a system for building highly configurable protocols and services for distributed systems. Initial performance results for a prototype implementation on Linux are also given.

  • Distributing trust on the Internet

    Page(s): 183 - 192

    This paper describes an architecture for secure and fault-tolerant service replication in an asynchronous network such as the Internet, where a malicious adversary may corrupt some servers and control the network. It relies on recent protocols for randomized Byzantine agreement and for atomic broadcast, which exploit concepts from threshold cryptography. The model and its assumptions are discussed in detail and compared to related work from the last decade in the first part of this work, and an overview of the broadcast protocols in the architecture is provided. The standard approach in fault-tolerant distributed systems is to assume that at most a certain fraction of servers fails. In the second part, novel general failure patterns and corresponding protocols are introduced. They allow for realistic modeling of real-world trust assumptions, beyond (weighted) threshold models. Finally, the application of our architecture to trusted services is discussed.

  • Protection of software-based survivability mechanisms

    Page(s): 193 - 202

    Many existing survivability mechanisms rely on software-based system monitoring and control. Some of the software resides on application hosts that are not necessarily trustworthy. The integrity of these software components is therefore essential to the reliability and trustworthiness of the survivability scheme. We address the problem of protecting trusted software on untrustworthy hosts by software transformations. Our techniques include a systematic introduction of aliases in combination with a "break-down" of the program control-flow, transforming high-level control transfers to indirect addressing through aliased pointers. In so doing, we transform programs to a form that yields data flow information very slowly and/or with little precision. We present a theoretical result which shows that a precise analysis of the transformed program, in the general case, is NP-hard and demonstrate the applicability of our techniques with empirical results.
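
    A toy rendition of the transformation described: direct control transfers are replaced by jumps through a runtime-shuffled, aliased dispatch table, so recovering the control-flow graph requires precise data-flow and alias analysis (the paper operates on real pointers in compiled code; this Python analogue only mimics the shape):

```python
# Each "block" mutates state and computes the index of the next block,
# so the successor of a block is data-dependent, not statically visible.
def step_a(ctx): ctx["x"] += 1;      return ctx["x"] % 2
def step_b(ctx): ctx["x"] *= 3;      return 2
def step_c(ctx): ctx["done"] = True; return 0

table = [step_b, step_a, step_c]
alias = table                              # a second name for the same table
alias[0], alias[1] = table[1], table[0]    # shuffled at run time via the alias

ctx, slot = {"x": 4, "done": False}, 0
while not ctx["done"]:
    slot = table[slot](ctx)    # indirect transfer through the aliased table
print(ctx["x"])                # -> 15
```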

  • Intrusion-tolerant group management in Enclaves

    Page(s): 203 - 212

    Groupware applications require secure communication and group-management services. Participants in such applications may have divergent interests and may not fully trust each other. The services provided must then be designed to tolerate possibly misbehaving participants. Enclaves is a software framework for building such group applications. We discuss how the protocols used by Enclaves can be modified to guarantee proper service in the presence of nontrustworthy group members. We show how the improved protocol was formally specified and proven correct.

  • FATOMAS: a fault-tolerant mobile agent system based on the agent-dependent approach

    Page(s): 215 - 224

    Fault tolerance is fundamental to the further development of mobile agent applications. In the context of mobile agents, fault-tolerance prevents a partial or complete loss of the agent, i.e., it ensures that the agent arrives at its destination. We present FATOMAS, a Java-based fault-tolerant mobile agent system based on an algorithm presented in an earlier paper (2000). Contrary to the standard "place-dependent" architectural approach, FATOMAS uses the novel "agent-dependent" approach. In this approach, the protocol that provides fault tolerance travels with the agent. This has the important advantage of allowing fault-tolerant mobile agent execution without the need to modify the underlying mobile agent platform (in our case ObjectSpace's Voyager). In our performance evaluation, we show the costs of our approach relative to the single, non-replicated agent execution. Pipelined mode and optimized agent forwarding are two optimizations that reduce the overhead of a fault-tolerant mobile agent execution.

  • A framework for database audit and control flow checking for a wireless telephone network controller

    Page(s): 225 - 234

    The paper presents the design and implementation of a dependability framework for a call-processing environment in a digital mobile telephone network controller. The framework contains a data audit subsystem to maintain the structural and semantic integrity of the database and a preemptive control flow checking technique, PECOS, to protect call-processing clients. Evaluation of the dependability-enhanced system is performed using NFTAPE, a software-implemented error injection environment. The evaluation shows that for control flow errors in the client, the combination of PECOS and data audit eliminates fail-silence violations, reduces the incidence of client crashes, and eliminates client hangs. For database injections, data audit detects 85% of the errors and reduces the incidence of escaped errors. Evaluation of combined use of data and control checking (with error injection targeting the database and the client) shows coverage increase from 35% to 80% and indicates data flow errors as a key reason for error escapes.
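
    A sketch of preemptive control-flow checking in the spirit of PECOS: each block carries its set of legal successors, and the target of a control transfer is checked before the jump is taken (the blocks and transitions below are invented):

```python
# Legal control-flow graph of a toy call-processing client.
LEGAL = {
    "entry":   {"dial", "hangup"},
    "dial":    {"connect", "hangup"},
    "connect": {"hangup"},
    "hangup":  set(),
}

class ControlFlowError(RuntimeError):
    pass

def transfer(current: str, target: str) -> str:
    """Preemptive check: validate the jump target before taking it."""
    if target not in LEGAL[current]:
        raise ControlFlowError(f"illegal transfer {current} -> {target}")
    return target

block = "entry"
for nxt in ("dial", "connect", "hangup"):   # a legal call-processing path
    block = transfer(block, nxt)
print("path accepted")                      # an illegal jump would raise first
```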

  • Analysis of the effects of outages on the quality of service of GPRS network systems

    Page(s): 235 - 244

    The General Packet Radio Service (GPRS) extends the Global System for Mobile Communications (GSM) by addressing packet-oriented traffic. Availability is the most important dependability requirement for communication systems such as GPRS. Focusing on the contention phase, where users compete for channel reservation, this paper analyses GPRS with the objective of understanding its behaviour under critical conditions, as determined by periods of outage, which significantly impact the resulting dependability. In fact, during outages (service unavailability), users trying to access the service accumulate, leading to an overload of the system. When the system resumes operation, the accumulated users cause a higher probability of collisions on resource assignment (and therefore a degradation of the QoS perceived by users). Our analysis, performed using a simulation approach, allowed us to gain insight into the impact of outages on the QoS and into the overload that GPRS systems face after outages.
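
    A toy simulation of the effect analysed: users accumulate during an outage, then collide on a slotted random-access channel after recovery (ALOHA-style contention; all parameters are invented, and new arrivals during the drain are ignored):

```python
import random

def drain_time(backlog: int, p_tx: float, rng: random.Random) -> int:
    """Slots needed to serve a backlog under slotted random-access contention."""
    slots = 0
    while backlog > 0:
        attempts = sum(rng.random() < p_tx for _ in range(backlog))
        if attempts == 1:      # exactly one attempt -> channel reserved
            backlog -= 1
        slots += 1             # otherwise: idle slot or collision
    return slots

rng = random.Random(7)
arrival_rate = 0.05                            # users per slot (invented)
for outage in (0, 200, 1000):                  # outage length in slots
    backlog = int(arrival_rate * outage) + 1   # users piled up during outage
    avg = sum(drain_time(backlog, 0.1, rng) for _ in range(50)) / 50
    print(f"outage {outage:5d} slots -> backlog {backlog:3d}, ~{avg:.0f} slots to drain")
```

    The drain time grows much faster than the backlog itself, which is the post-outage overload and QoS degradation the paper quantifies for GPRS.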
