Designing High Availability Systems: DFSS and Classical Reliability Techniques with Practical Real Life Examples

Copyright Year: 2014
Author(s): Taylor, Z.; Ranganathan, S.
Publisher: Wiley-IEEE Press
Content Type: Books & eBooks
Topics: Computing & Processing; General Topics for Engineers; Power, Energy, & Industry Applications

    • Front Matter

      DOI: 10.1002/9781118739853.fmatter
      Copyright Year: 2014

      Wiley-IEEE Press eBook Chapters

      The prelims comprise:
      Half-Title Page
      IEEE Press
      Title Page
      Copyright Page
      Dedication Page
      Contents
      Preface
      List of Abbreviations

    • Introduction

      DOI: 10.1002/9781118739853.ch1

      No abstract.

    • Initial Considerations for Reliability Design

      DOI: 10.1002/9781118739853.ch2

      Prior to designing a high availability system, we must have a set of availability requirements or goals for our system. We then set out to design a system that meets these requirements. By decomposing the system into appropriate components, we can create a system reliability model and mechanisms for ensuring high availability. For each component in this model, we allocate a specific mean time between failures (MTBF), mean time to repair (MTTR), and other reliability information. We describe a few general methods that can be used to estimate and improve this reliability information. The more knowledge we have regarding the reliability of these components, the maintenance plan for the system, and the number of systems we expect to deploy, the more accurate our model will be in predicting actual system reliability and availability once deployed to the field. Now that we have introduced the mechanics of obtaining initial reliability information, the next several chapters will dwell on basic mathematical concepts that set the foundation for more advanced techniques and applications to build, predict, and optimize high availability systems.
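
As a rough illustration of allocating MTBF and MTTR to components, the sketch below uses the standard steady-state relation availability = MTBF / (MTBF + MTTR) and multiplies component availabilities for a series arrangement. The component figures and function names are hypothetical, and independence between components is assumed; the chapter itself may use a different model.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability of one repairable component."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series_availability(components):
    """Availability when every component must be up (independence assumed)."""
    result = 1.0
    for mtbf_hours, mttr_hours in components:
        result *= availability(mtbf_hours, mttr_hours)
    return result


def downtime_minutes_per_year(avail):
    """Expected yearly downtime implied by an availability figure."""
    return (1.0 - avail) * 365 * 24 * 60


# Hypothetical allocation: (MTBF hours, MTTR hours) per component.
allocation = [(50_000.0, 4.0), (100_000.0, 2.0), (20_000.0, 0.5)]
system = series_availability(allocation)
print(f"system availability: {system:.6f}")
print(f"expected downtime: {downtime_minutes_per_year(system):.1f} min/year")
```

Comparing the system figure against the availability requirement shows which component allocations would need to improve.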

    • A Game of Dice: An Introduction to Probability

      DOI: 10.1002/9781118739853.ch3

      This chapter introduces some of the fundamentals of probability theory, illustrated by an analysis of a game of dice and coin flip experiments. Analyzing these familiar examples in greater depth sheds light on some fundamental concepts of probability theory. These concepts are the building blocks for more practical and varied applications of reliability analysis.
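
A dice analysis of this kind can be sketched by enumerating the sample space. The example assumes two fair six-sided dice and looks at the distribution of their sum; the abstract does not say which game the chapter analyzes, so this setup is only illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely ordered outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

# Count how many outcomes produce each sum, then convert to exact probabilities.
counts = Counter(a + b for a, b in outcomes)
pmf = {total: Fraction(n, len(outcomes)) for total, n in counts.items()}

for total in sorted(pmf):
    print(total, pmf[total])
```

Using `Fraction` keeps the probabilities exact, which makes it easy to confirm that they sum to one.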

    • Discrete Random Variables

      DOI: 10.1002/9781118739853.ch4

      This chapter introduces the concept of random variables and discrete random variables in particular. We then discuss several important discrete probability distributions directly applicable to reliability theory, including a set of examples to illustrate their use. These distributions are important building blocks for reliability theory, and we will explore the application of these concepts to more practical problems of prediction and design in subsequent chapters.
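
The abstract does not name the distributions covered. As one common example, the binomial distribution gives the probability that at least k of n identical, independent units are working, which is a frequent reliability question. The function names and the 2-of-3 example are assumptions for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability that exactly k of n independent units work, each with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def at_least_k_of_n(k, n, p):
    """Probability that k or more of the n units work."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))


# Hypothetical 2-out-of-3 arrangement with unit reliability 0.9.
print(at_least_k_of_n(2, 3, 0.9))
```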

    • Continuous Random Variables

      DOI: 10.1002/9781118739853.ch5

      This chapter provides a brief introduction to continuous random variables and contrasts them with discrete random variables. Several important continuous random variables are applicable to reliability theory. We will see in later chapters how these concepts are applied to more practical reliability problems.
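
The abstract does not list the continuous distributions it covers. The exponential distribution is a widely used example in reliability work, so the sketch below assumes a constant failure rate and evaluates the resulting reliability and failure density. The names and the numeric values are illustrative.

```python
import math


def exp_reliability(t, failure_rate):
    """Probability of surviving beyond time t under a constant failure rate."""
    return math.exp(-failure_rate * t)


def exp_failure_density(t, failure_rate):
    """Probability density of the time to failure at time t."""
    return failure_rate * math.exp(-failure_rate * t)


# Hypothetical component: failure rate of 1e-4 per hour (MTTF of 10,000 hours).
rate = 1e-4
print(exp_reliability(1_000.0, rate))
print(exp_failure_density(1_000.0, rate))
```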

    • Random Processes

      DOI: 10.1002/9781118739853.ch6

      A random process is a set of discrete or continuous random variables that evolve over time. Reliability theory makes use of several important random processes, one of the most important being the Poisson process. The Poisson process is a counting process in which the rate of occurrence of a random event is the same over any interval of time; as the interval becomes very short, the expected rate of occurrence approaches a constant. This process is a building block for many reliability concepts. We derive the Poisson process and provide an example of its use.
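
For a Poisson process with constant rate, the number of events in an interval of length t follows a Poisson distribution with mean rate × t. The sketch below evaluates that probability; the function name and the failure-rate figures are hypothetical.

```python
import math


def poisson_pmf(k, rate, t):
    """Probability of exactly k events in an interval of length t at a constant rate."""
    mean = rate * t
    return mean**k * math.exp(-mean) / math.factorial(k)


# Hypothetical fleet: 0.5 failures per month on average, observed for 4 months.
for k in range(5):
    print(k, round(poisson_pmf(k, 0.5, 4.0), 4))
```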

    • Modeling and Reliability Basics

      DOI: 10.1002/9781118739853.ch7

      The important relationships between reliability distributions are captured in Table 7.2. We define reliability engineering and design and show their relationship with other related terminology. Reliability modeling is a tool used for reliability prediction, and a variety of these models are available. The modeling approaches for reliability assessment are described. Failure probability and failure density are introduced, followed by a description of the most important reliability terminology along with fundamental reliability equations and several examples. We also introduce Bayes's theorem in the context of reliability theory, followed by a discussion of Reliability Block Diagrams (RBDs).
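
A Reliability Block Diagram reduces to series and parallel combinations when blocks fail independently. The sketch below shows those two standard rules; the block reliabilities and the example layout are hypothetical.

```python
from math import prod


def series(reliabilities):
    """All blocks must work: multiply the block reliabilities."""
    return prod(reliabilities)


def parallel(reliabilities):
    """At least one block must work: one minus the chance that all fail."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Hypothetical diagram: a redundant pair of 0.9 blocks in series with a 0.99 block.
system = series([parallel([0.9, 0.9]), 0.99])
print(system)
```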

    • Discrete-Time Markov Analysis

      DOI: 10.1002/9781118739853.ch8

      In this chapter, we introduce the concept of a Markov process. We consider two types of Markov processes: the Discrete Time Markov Chain (DTMC) and the Continuous Time Markov Chain (CTMC). These two processes are differentiated by the time scales employed. A DTMC is based on discrete, equally spaced time intervals, where the interval can be any arbitrary but fixed time increment, whereas a CTMC is based on continuous time. Continuous-time Markov processes are analyzed in much more detail in the following two chapters. The state transition matrix and probability vector are introduced and the relevant equations derived. We then consider several examples of the application of DTMCs, including several practical reliability problems, in which these equations are applied. DTMCs are a useful tool that can be applied to many practical reliability problems.
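
The role of the state transition matrix and probability vector can be sketched with a two-state (up/down) chain: the probability vector after each interval is the previous vector multiplied by the transition matrix. The transition probabilities below are hypothetical.

```python
def step(p, P):
    """One DTMC step: new_p[j] = sum over i of p[i] * P[i][j]."""
    n = len(P)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]


def state_after(p0, P, steps):
    """Probability vector after the given number of intervals."""
    p = list(p0)
    for _ in range(steps):
        p = step(p, P)
    return p


# Hypothetical per-interval probabilities. State 0 = up, state 1 = down.
P = [
    [0.99, 0.01],  # up -> up, up -> down (failure)
    [0.50, 0.50],  # down -> up (repair), down -> down
]
p = state_after([1.0, 0.0], P, 200)
print(p)
```

After enough steps the vector settles near the stationary distribution, whose first entry is the long-run fraction of intervals spent in the up state.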

    • Continuous-Time Markov Systems

      DOI: 10.1002/9781118739853.ch9

      In this chapter, we expand on the concept of the continuous-time Markov process introduced in Chapter 8. Given the Markov properties, and by comparison with the Discrete Time Markov Chain (DTMC), the state transition rate matrix is derived. With this canonical form of the continuous-time Markov process, we are able to derive the dynamic properties of each defined state and explore how they evolve over time. A two-state Markov model for a single-component repairable reliability model is derived. We walk through the steps to create a Markov model and consider the characteristics of this model, including asymptotic behavior. We also provide a brief introduction to Markov reward models. The continuous-time Markov process is one of the most powerful tools we have for modeling system behavior. With an appropriately defined model, we are able to extract many useful characteristics of the system under analysis. We will apply this technique to a broad variety of problems in the latter half of this text. One of the drawbacks of this technique is state-space explosion. We will investigate this problem in the following two chapters and discuss ways in which we can mitigate it to make the problems more tractable.
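
For the two-state repairable model with constant failure rate λ and repair rate μ, the standard closed-form solution starting from the up state is A(t) = μ/(λ+μ) + λ/(λ+μ)·exp(−(λ+μ)t). The sketch below evaluates it to show the transient and asymptotic behavior; the rates are hypothetical and the chapter's notation may differ.

```python
import math


def availability_at(t, failure_rate, repair_rate):
    """Probability of being up at time t, starting up at t = 0."""
    total = failure_rate + repair_rate
    steady_state = repair_rate / total
    transient = (failure_rate / total) * math.exp(-total * t)
    return steady_state + transient


# Hypothetical rates per hour: MTBF of 1,000 hours, MTTR of 10 hours.
lam, mu = 0.001, 0.1
for t in (0.0, 10.0, 50.0, 500.0):
    print(t, round(availability_at(t, lam, mu), 6))
```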

    • Markov Analysis: Nonrepairable Systems

      DOI: 10.1002/9781118739853.ch10

      This chapter introduces applications of Markov analysis to nonrepairable systems and illustrates both dynamic and steady-state responses for nonrepairable and partially repairable systems with different configurations. Table 10.1 summarizes the availability models across different configurations.
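
As one example configuration, assume two identical components in active parallel with constant failure rate λ and no repair. The standard result is R(t) = 2·exp(−λt) − exp(−2λt), with mean time to failure 3/(2λ). The sketch checks that figure by numerically integrating R(t). The configuration and all names are assumptions, not taken from Table 10.1.

```python
import math


def parallel_pair_reliability(t, lam):
    """Reliability of two identical nonrepairable units in active parallel."""
    return 2.0 * math.exp(-lam * t) - math.exp(-2.0 * lam * t)


def mttf_numeric(reliability, horizon, steps):
    """Trapezoidal integral of a reliability function from 0 to horizon."""
    dt = horizon / steps
    total = 0.5 * (reliability(0.0) + reliability(horizon))
    for i in range(1, steps):
        total += reliability(i * dt)
    return total * dt


lam = 0.001  # hypothetical failure rate per hour
mttf = mttf_numeric(lambda t: parallel_pair_reliability(t, lam), 20_000.0, 20_000)
print(mttf, 3.0 / (2.0 * lam))
```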

    • Markov Analysis: Repairable Systems

      DOI: 10.1002/9781118739853.ch11

      This chapter introduces applications of Markov analysis to repairable systems and illustrates both dynamic and steady-state responses for repairable systems with different configurations. Table 11.1 and Table 11.2 summarize the reliability characteristics for the different configurations we considered.
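
As an illustration of a steady-state result for a repairable configuration, the sketch below solves a three-state birth-death Markov model for two identical units in active parallel. The states count failed units, and the system is down only when both have failed. The configuration, the repair-crew parameter, and the rates are assumptions, not values from Tables 11.1 or 11.2.

```python
def parallel_pair_unavailability(lam, mu, repair_crews=1):
    """Steady-state probability that both units of a parallel pair are down."""
    # Balance equations for a birth-death chain, relative to state 0 (none failed).
    w0 = 1.0
    w1 = w0 * (2.0 * lam) / mu                 # either of two units fails; one repair
    w2 = w1 * lam / (min(2, repair_crews) * mu)  # remaining unit fails; crews repair
    return w2 / (w0 + w1 + w2)


lam, mu = 0.001, 0.1  # hypothetical failure and repair rates per hour
print(parallel_pair_unavailability(lam, mu, repair_crews=1))
print(parallel_pair_unavailability(lam, mu, repair_crews=2))
```

With two repair crews the units behave independently, so the result matches the square of the single-unit unavailability λ/(λ+μ).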

    • Analyzing Confidence Levels

      DOI: 10.1002/9781118739853.ch12

      To increase the predictive usefulness of reliability models, we need good estimates of reliability parameters, such as Mean Time Between Failures (MTBF) and Mean Time to Repair (MTTR). A key source of information that can be applied to verify or improve our reliability models is test data. These data may be obtained from projects with characteristics similar to our model, or they may be test or field data from the system we are modeling. Goodness-of-fit (GoF) tests provide us with a measure of how well the assumed probability distributions in our model match the actual test data. If a large disparity is identified, then our model may need to be amended to model the actual system behavior more accurately. Confidence levels, on the other hand, tell us the range of values for reliability estimates, such as MTBF, that fall within a certain confidence level, for example, the minimum and maximum values at a 95% confidence level. The more data we have, the narrower the range between the maximum and minimum values, and the higher the required confidence level, the wider the range will be. The confidence level tells us how much we can rely on our model to predict future reliability based on the data we have, including the case where no failure data exist but the system(s) has been operational for a given period of time. In Chapter 18, we will apply these techniques to a case study.
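
The zero-failure case has a simple closed form if failures are assumed to follow an exponential distribution: with total exposure T and no failures, the one-sided lower confidence bound on MTBF at confidence level C is T / (−ln(1 − C)). The sketch below is a minimal illustration under that assumption; the function name and figures are hypothetical.

```python
import math


def mtbf_lower_bound_zero_failures(total_hours, confidence):
    """One-sided lower bound on MTBF after failure-free operation (exponential model)."""
    return total_hours / -math.log(1.0 - confidence)


# Hypothetical exposure: 3,000 failure-free unit-hours.
for confidence in (0.60, 0.90, 0.95):
    bound = mtbf_lower_bound_zero_failures(3_000.0, confidence)
    print(confidence, round(bound, 1))
```

The bound rises with more failure-free exposure and falls as the required confidence level increases, matching the behavior described above.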

    • Estimating Reliability Parameters

      DOI: 10.1002/9781118739853.ch13

      For our model to be as accurate as possible, we require good estimates of reliability parameters. Without any data related to the system we are designing, expert opinions of reliability engineers and analysts are used. These experts will usually base their estimates on reliability data from similar projects, systems, or components. Vendors may provide some reliability data as well. If we have test data from the system being built, or better yet, data from deployed systems, we use that information to update our estimates. Monte Carlo simulations are also helpful for our estimation. A powerful technique is the use of Bayes's theorem, which provides a method for combining previous estimates with new data to create a new estimate that reflects both.
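
One common way to combine a prior estimate with new data via Bayes's theorem is the gamma-exponential conjugate update. If the failure rate has a Gamma(shape, rate) prior and we observe n failures over T hours of exposure, the posterior is Gamma(shape + n, rate + T). The choice of prior and all figures below are assumptions for illustration, not the book's worked example.

```python
def update_failure_rate(prior_shape, prior_rate, failures, exposure_hours):
    """Posterior gamma parameters after observing failures over an exposure time."""
    return prior_shape + failures, prior_rate + exposure_hours


def mean_failure_rate(shape, rate):
    """Mean of a gamma-distributed failure rate."""
    return shape / rate


# Hypothetical prior equivalent to 2 failures in 20,000 hours (expert estimate),
# updated with field data: 1 failure in 10,000 hours.
shape, rate = update_failure_rate(2.0, 20_000.0, 1, 10_000.0)
print(mean_failure_rate(shape, rate))        # updated failure rate per hour
print(1.0 / mean_failure_rate(shape, rate))  # corresponding MTBF estimate
```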

    • Six Sigma Tools for Predictive Engineering

      DOI: 10.1002/9781118739853.ch14

      In this chapter, we introduce a variety of Design for Six Sigma (DFSS) tools and methods and discuss their practical applications and advantages. One key takeaway is that integrating and optimizing DFSS into an organization's DNA can enable a creative, efficient, high-performance team to deliver breakthrough products and services.

    • Design Failure Modes and Effects Analysis

      DOI: 10.1002/9781118739853.ch15

      Learning from each failure after it occurs in the lab or out in the field is both costly and time consuming. If the failure is reported by a customer, customer dissatisfaction leading to loss of current customers and future business is a very plausible scenario. The root cause of the problem must be identified before any fix can be attempted. Some problems, as engineers will attest, can be notoriously hard to debug. Once the problem has been identified, the preferred solution may be too risky, so a less optimal fix may be proposed initially, with a more complete solution provided in a later software release. Failures need to be prioritized according to how serious their consequences are, how frequently they occur, and how easily they can be detected. A Design Failure Modes and Effects Analysis (DFMEA) does this and also documents current knowledge and actions about the risks of failures for use in continuous improvement. DFMEA provides a systematic method of identifying a large number of potential failures using thought experiments. DFMEAs provide risk mitigation in both product and process development phases. Each potential failure is considered for its effect on the product, process, or customer and, based on the risk level, actions are determined to reduce, mitigate, or eliminate the risk prior to software implementation and deployment. The DFMEA drives a set of actions to prevent or reduce the severity or likelihood of failures, starting with the highest priority ones. Each failure scenario removed during the product design phase pays large dividends for the company when a more robust and reliable product is deployed.
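
Prioritizing by consequence, frequency, and detectability is conventionally done with a Risk Priority Number (RPN), the product of severity, occurrence, and detection scores. The sketch below ranks failure modes that way. The 1 to 10 scales and the example failure modes are hypothetical, and the book's scoring scheme may differ.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    description: str
    severity: int    # 1 (negligible) to 10 (most serious)
    occurrence: int  # 1 (rare) to 10 (frequent)
    detection: int   # 1 (easily detected) to 10 (hard to detect)

    @property
    def rpn(self):
        """Risk Priority Number: product of the three scores."""
        return self.severity * self.occurrence * self.detection


def prioritize(modes):
    """Order failure modes from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


modes = [
    FailureMode("hypothetical: log disk fills up", 4, 5, 2),
    FailureMode("hypothetical: memory leak in call handler", 7, 4, 6),
    FailureMode("hypothetical: failover not triggered", 9, 2, 8),
]
ranked = prioritize(modes)
for mode in ranked:
    print(mode.rpn, mode.description)
```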

    • Fault Tree Analysis

      DOI: 10.1002/9781118739853.ch16

      Fault Tree Analysis (FTA) is a top-down reliability technique for deriving failure nodes that, in combination with other failures, can result in the occurrence of the top event (e.g., system outage). FTA is an intuitive graphical approach that shows a clear relationship between faults and system failure. We apply FTA to the two-component nonrepairable model previously analyzed using Markov techniques. FTA has several significant limitations that we need to be aware of when applying this tool. We will explore FTA in several examples and case studies in Chapters 18, 21, and 22.
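
For independent basic events, the gate arithmetic of a fault tree is short: an AND gate multiplies the input failure probabilities, and an OR gate is one minus the product of their complements. The sketch applies this to a two-component nonrepairable model, assuming exponential failures. The rate, mission time, and names are hypothetical.

```python
import math


def and_gate(probabilities):
    """Output event occurs only if all independent input events occur."""
    result = 1.0
    for p in probabilities:
        result *= p
    return result


def or_gate(probabilities):
    """Output event occurs if any independent input event occurs."""
    none_occur = 1.0
    for p in probabilities:
        none_occur *= 1.0 - p
    return 1.0 - none_occur


def component_failure_prob(t, failure_rate):
    """Probability a nonrepairable component has failed by time t (exponential model)."""
    return 1.0 - math.exp(-failure_rate * t)


lam, t = 1e-4, 1_000.0  # hypothetical failure rate per hour and mission time
q = component_failure_prob(t, lam)
top_redundant = and_gate([q, q])  # outage requires both components to fail
top_series = or_gate([q, q])      # outage if either component fails
print(q, top_redundant, top_series)
```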

    • Monte Carlo Simulation Models

      DOI: 10.1002/9781118739853.ch17

      Monte Carlo simulation is a powerful technique used to characterize a range of system behavior based on assumed probability distributions of parameters associated with a model. Monte Carlo simulation has been applied to a wide range of problems, including financial prediction. In terms of reliability modeling, we combine the probability distributions for many reliability parameters of interest in our model to predict the range of possible system availability and the likelihood of each outcome. These data are plotted for visual analysis. Based on these data, we can determine the likelihood of the minimum availability requirement of our system being met. Sensitivity analysis shows us which parameters have the most impact on system availability. With this information, we can focus our design activities on improving system behavior, either by improving the reliability of these parameters or by minimizing their impact. Time-based Monte Carlo simulation shows us when outages are most likely to occur on average, given a large number of systems being simulated. A comparison of tools and simulation techniques helps us determine the benefits and limitations of each technique.
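
A minimal sketch of the idea: draw MTBF and MTTR from assumed distributions, compute availability for each draw, and estimate how often a requirement is met. The triangular distributions, their bounds, and the 0.9999 requirement are all hypothetical choices for illustration.

```python
import random


def simulate_availability(trials, seed=0):
    """Sample availability values from assumed MTBF and MTTR distributions."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    samples = []
    for _ in range(trials):
        mtbf = rng.triangular(20_000.0, 80_000.0, 40_000.0)  # low, high, mode (hours)
        mttr = rng.triangular(1.0, 8.0, 2.0)                 # low, high, mode (hours)
        samples.append(mtbf / (mtbf + mttr))
    return samples


def fraction_meeting(samples, requirement):
    """Fraction of simulated outcomes at or above the availability requirement."""
    return sum(a >= requirement for a in samples) / len(samples)


samples = simulate_availability(10_000)
print("min:", min(samples), "max:", max(samples))
print("P(availability >= 0.9999):", fraction_meeting(samples, 0.9999))
```

Rerunning with one input distribution narrowed at a time is a simple way to see which parameter moves the result most.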

    • Updating Reliability Estimates: Case Study

      DOI: 10.1002/9781118739853.ch18

      This chapter describes some practical techniques that can be used to analyze field data and revise the availability model and predictions. This analysis can be continued throughout the life of the system as more field data are obtained. The availability analysis can be periodically updated based on new outages and/or additional field exposure. Even with zero outages over a long period of time, the estimated system unavailability does not remain the same: the additional outage-free exposure increases the estimated availability. Conversely, if we have more outages, the overall availability of the system will decrease. The system availability is dynamic and must be updated as more information becomes available. It is good practice to monitor and track system availability regularly to ensure that a system is healthy and operating at satisfactory levels.

    • Fault Management Architectures

      DOI: 10.1002/9781118739853.ch19

      To improve the reliability and availability of our system, fault management techniques should be considered. When a fault occurs, can our system automatically detect this problem and then mitigate its impact on the system? Fault coverage, time to detect, redundancy schemes, recovery time, and robustness of the system in the event of failures are important considerations. This chapter focuses on these aspects of fault management. Redundancy is one of the most powerful design techniques for minimizing or eliminating the impact of failures on the overall system behavior. We discuss a variety of models used in the fields of aerospace and telecommunications engineering. We compare the system effects of redundancy with repair and redundancy without repair. In certain mission-critical systems, such as fly-by-wire aircraft flight control systems, the failed components cannot be repaired or replaced until after the mission has been completed. Comparisons of the various techniques, including several examples, are provided.

    • Application of DFMEA to Real-Life Example

      DOI: 10.1002/9781118739853.ch20

      This chapter provides a case study of a Design Failure Modes and Effects Analysis (DFMEA) for an actual telecommunications project. As a result of conducting this DFMEA, design faults were identified that otherwise might not have been discovered. Design changes were introduced to prevent these faults from being discovered after product introduction. Using the initial and revised Risk Priority Numbers (RPNs), the design improvements can be quantified. The lessons learned from this project were identified and used to improve the effectiveness and efficiency of the subsequent DFMEAs for this project.

    • Application of FTA to Real-Life Example

      DOI: 10.1002/9781118739853.ch21

      This chapter provides a case study of Fault Tree Analysis (FTA) for a telecommunications project. We note that a variety of factors could cause a system outage, including events outside the system itself, such as environmental or security issues. We examine one important aspect of system failure, call processing failure, and identify issues or faults that could result in this failure. A good approach is to initially capture the fault tree in a modeling tool, such as Relex, and then transfer these data and relationships to a spreadsheet for easier manipulation. We identify the high-risk nodes and use Design of Experiments (DOE) and Monte Carlo analysis to determine the impact of these nodes on the system. The simulation results show how likely it is that the system will achieve its five 9s availability requirement. The sensitivity analysis shows which failures have the largest impact on the system. Armed with this information, the design team can focus on the components that have the biggest impact on availability. By addressing these items, we greatly improve the probability of meeting our availability requirements.

    • Complex High Availability System Analysis

      DOI: 10.1002/9781118739853.ch22

      This chapter applies many of the techniques explored in this text to a more complicated but practical system architecture frequently employed in the telecommunications industry. We consider a variety of techniques, including Markov analysis, Monte Carlo simulation, fault tree analysis, sensitivity analysis, and Matlab simulation. We see firsthand how state-space explosion in Markov analysis can occur. Although the variety of tools available makes extracting information from the model straightforward, understanding all of the possible interactions becomes increasingly difficult as the state space grows. We see that many of these interactions are either insignificant or nonexistent. By partitioning and/or simplifying the model, we lose little but gain valuable information on the system behavior and how we might improve it. We compare the more complex models with the simplified models to show this. With the approach to solving complex analysis problems provided in this chapter, as well as the many other techniques for solving reliability problems and improving reliability explored in this text, the reliability analyst will be able to tackle a wide variety of reliability analysis and design problems. Applied well, these techniques will help improve the reliability of the systems and products the analyst encounters.

    • References

      DOI: 10.1002/9781118739853.refs

      No abstract.

    • Index

      DOI: 10.1002/9781118739853.index

      No abstract.