A common problem observed on mainframe installations, and one that presents a significant challenge for resiliency and high availability, involves soft failure incidents. In contrast to catastrophic failures, soft failures involve some degree of system shutdown without an obvious cause. This has been described with the phrase: “Systems don't break; they just stop running, and we don't know why.” Extending a medical paradigm, this paper proposes a new method for solutions deployed on IBM z/OS™ systems to respond when either the system or the application stops running. The current approach is to treat the “disease” by determining the cause of the problem and taking action to prevent its recurrence. The new approach is to determine whether the system or application is behaving abnormally, identify the cause of this abnormal behavior, and take action to treat the “symptom.” This new approach uses machine learning and mathematical modeling to identify normal behavior, enabling the detection of abnormal behavior before it impacts the customer. Based on an analysis of critical problems and preliminary modeling work, the types of abnormal behavior identified are assigned to broad categories. In this paper, we describe the progress being made to address the challenge of soft failures by implementing this new paradigm.
Note: The Institute of Electrical and Electronics Engineers, Incorporated is distributing this Article with permission of the International Business Machines Corporation (IBM) who is the exclusive owner. The recipient of this Article may not assign, sublicense, lease, rent or otherwise transfer, reproduce, prepare derivative works, publicly display or perform, or distribute the Article.