Latent fault detection in large scale services

4 Author(s)
Gabel, M. (Dept. of Comput. Sci., Technion - Israel Inst. of Technol., Haifa, Israel); Schuster, A.; Bachrach, R.-G.; Bjorner, N.

Unexpected machine failures, with their resulting service outages and data loss, pose challenges to datacenter management. Existing failure detection techniques rely on domain knowledge, precious (often unavailable) training data, textual console logs, or intrusive service modifications. We hypothesize that many machine failures are not a result of abrupt changes but rather a result of a long period of degraded performance. This is confirmed in our experiments, in which over 20% of machine failures were preceded by such latent faults. We propose a proactive approach for failure prevention. We present a novel framework for statistical latent fault detection using only ordinary machine counters collected as standard practice. We demonstrate three detection methods within this framework. Derived tests are domain-independent and unsupervised, require neither background information nor tuning, and scale to very large services. We prove strong guarantees on the false positive rates of our tests.
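The abstract does not spell out the three detection methods, so the following is only an illustrative sketch of the general idea of comparing machines that do the same job using ordinary counters. It is not the authors' tests and carries none of their false positive guarantees. The function names, the median/MAD scoring, and the threshold value are all assumptions made for this example.

```python
from statistics import median


def outlier_scores(counters):
    """Score each machine by how far its counters sit from its peers.

    counters: dict mapping machine name -> list of counter values,
    with the same counters in the same order for every machine.
    Returns a dict mapping machine name -> mean robust deviation.
    (Hypothetical helper, not taken from the paper.)
    """
    machines = list(counters)
    n_counters = len(next(iter(counters.values())))
    scores = {m: 0.0 for m in machines}
    for j in range(n_counters):
        column = [counters[m][j] for m in machines]
        med = median(column)
        # Median absolute deviation: a spread estimate that a few
        # misbehaving machines cannot easily distort.
        mad = median(abs(v - med) for v in column)
        scale = mad if mad > 0 else 1.0  # avoid division by zero
        for m in machines:
            scores[m] += abs(counters[m][j] - med) / scale
    return {m: s / n_counters for m, s in scores.items()}


def flag_suspects(counters, threshold=5.0):
    """Return machines whose score exceeds an assumed threshold."""
    scores = outlier_scores(counters)
    return sorted(m for m, s in scores.items() if s > threshold)
```

A sketch like this needs no labeled training data or knowledge of what each counter means, which mirrors the unsupervised, domain-independent properties the abstract claims. Unlike the tests described there, though, it does require a hand-picked threshold.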

Published in:

Dependable Systems and Networks (DSN), 2012 42nd Annual IEEE/IFIP International Conference on

Date of Conference:

25-28 June 2012