Failure analysis of a fault-tolerant 2-node server system

4 Author(s)
Jacob, D. (Relex Software Corp., Greensburg, PA); Simon, E.J.; Zhang, W.; Rose, D.

In this paper, we present an integrated model of hardware and software failures of a fault-tolerant 2-node server system used in a real-life archive system application. Each node runs a distinct component of the server application software and an identical copy of a fault monitoring service. The fault monitoring service on each node monitors the status of its local application software as well as the availability of the hardware and software on the other node. Upon a node failure, the fault monitoring service on the surviving node transfers the application software from the failed node to the surviving node. Upon the failure of an application software component or a fault monitoring service, the available fault monitoring service performs an automatic restoration. Failed nodes are restored on a first-come, first-served basis by a single repair facility. The failure and restoration processes of the hardware and software depend strongly on the status of other components and on the sequence of failure events. Therefore, we employ a decomposition method that combines combinatorial analysis with Markov-based state-space analysis to solve the problem. The proposed method allows the analysis to be extended easily to multiple nodes, multiple software components, and different repair policies.
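As a rough illustration of the Markov-based state-space part of such an analysis, the sketch below solves a deliberately simplified three-state model of two nodes sharing a single repair facility. It is not the paper's model: it ignores software failures, the fault monitoring service, and failover time, and it assumes exponential failure and repair times. The function names and the rates `lam` (per-node failure rate) and `mu` (repair rate) are hypothetical.

```python
def two_node_steady_state(lam, mu):
    """Steady-state probabilities of having 0, 1, or 2 failed nodes.

    Simplifying assumptions (not taken from the paper): each working node
    fails at rate lam, and a single repair facility restores one node at a
    time at rate mu, so the chain is a birth-death process on {0, 1, 2}.
    """
    if lam <= 0 or mu <= 0:
        raise ValueError("rates must be positive")
    rho = lam / mu
    # Balance equations: 2*lam*pi0 = mu*pi1 and lam*pi1 = mu*pi2.
    weights = (1.0, 2.0 * rho, 2.0 * rho * rho)
    total = sum(weights)
    return tuple(w / total for w in weights)


def availability(lam, mu):
    """Probability that at least one node is up (service assumed to fail over)."""
    pi0, pi1, _pi2 = two_node_steady_state(lam, mu)
    return pi0 + pi1
```

With hypothetical rates such as `availability(0.001, 0.1)`, the function returns the long-run fraction of time in which at least one node is working. Capturing sequence-dependent behavior, such as a monitoring service failing before the node it protects, would need a larger state space than this sketch uses.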

Published in:

Reliability and Maintainability Symposium, 2006. RAMS '06. Annual

Date of Conference:

23-26 Jan. 2006