The resiliency challenge presented by soft failure incidents

1 Author(s)
Caffrey, J.M.; IBM Systems and Technology Group, 2455 South Road, Poughkeepsie, NY 12601-5400, USA

A common problem observed on mainframe installations, and one which presents a significant challenge for resiliency and high availability, involves soft failure incidents. In contrast to catastrophic failures, soft failures involve some degree of system shutdown without an obvious cause. This has been described with the phrase: “Systems don't break; they just stop running, and we don't know why.” Extending a medical paradigm, this paper proposes a new method for solutions deployed on IBM z/OS™ systems to respond when either the system or the application stops running. The current approach is to treat the “disease” by determining the cause of the problem and taking action to prevent its recurrence. The new approach is to determine whether the system or application is behaving abnormally, identify the cause of this abnormal behavior, and take action to treat the “symptom.” This new approach uses machine learning and mathematical modeling to identify normal behavior, enabling the detection of abnormal behavior before it impacts the customer. Based on an analysis of critical problems and preliminary modeling work, the types of abnormal behavior identified are assigned to broad categories. In this paper, we describe the progress being made to address the challenge of soft failures by implementing this new paradigm.
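
The abstract does not say which models the paper uses. As a rough illustration of the general idea of learning normal behavior and then flagging departures from it, the following minimal Python sketch assumes a single numeric metric and a simple mean and standard deviation baseline. The function names, the sample values, and the threshold of 3 standard deviations are all hypothetical choices for this example, not taken from the paper.

```python
from statistics import mean, stdev


def fit_baseline(samples):
    """Summarize 'normal' behavior as the mean and sample standard deviation."""
    return mean(samples), stdev(samples)


def is_abnormal(value, baseline, threshold=3.0):
    """Flag a value that lies more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # No variation was observed during training, so any change is a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical metric observed during normal operation (e.g. transactions per second).
normal_rates = [98, 102, 100, 101, 99, 100, 103, 97]
baseline = fit_baseline(normal_rates)

# Check new observations against the learned baseline.
flags = [is_abnormal(v, baseline) for v in (101, 120)]
```

A real detector would likely need to account for time-of-day patterns, multiple correlated metrics, and workload changes; this sketch only shows the separation between a training step that captures normal behavior and a checking step that compares new observations against it.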

Note: The Institute of Electrical and Electronics Engineers, Incorporated is distributing this Article with permission of the International Business Machines Corporation (IBM) who is the exclusive owner. The recipient of this Article may not assign, sublicense, lease, rent or otherwise transfer, reproduce, prepare derivative works, publicly display or perform, or distribute the Article.  

Published in:

IBM Systems Journal (Volume: 47, Issue: 4)