Although part of the IBM System z™ strategy is to improve design and development processes to prevent errors from escaping to the field, improving recovery is another element in the strategy to keep a machine up and running should an error occur. The z9™ continues on an evolutionary path of enhancing I/O subsystem (IOSS) recovery to further advance the reliability, availability, and serviceability (RAS) of System z platforms. This paper presents an overview of recovery and how it interacts with other RAS functions—such as error-detection mechanisms in hardware, including automatic identification and recovery of failing elements—up to the point in time prior to the advent of the z9. It then presents the innovations to IOSS recovery and error detection in the z9 that further improve machine availability. The recovery infrastructure, which significantly reduces recovery time and makes recovery much less dependent on machine scaling for this and future generations of System z servers, is described. Also described are such innovative uses of this new infrastructure as improvements in error detection related to elusive firmware problems seen in prior machines, the ability to detect and recover from firmware hangs or lockups related to inadvertently leaving control blocks locked, and the capability to perform recovery in parallel by multiple system-assist processors.
Note: The Institute of Electrical and Electronics Engineers, Incorporated is distributing this Article with permission of the International Business Machines Corporation (IBM) who is the exclusive owner. The recipient of this Article may not assign, sublicense, lease, rent or otherwise transfer, reproduce, prepare derivative works, publicly display or perform, or distribute the Article.