The paper presents a solution for improving the availability of applications running on a Unix computer cluster with two or more nodes. Tandem's NCAPS (NonStop Clusters Application Protection System) consists of specialized system software that can recover applications after hardware, software, or operating system failures. The main component of NCAPS, the PPM (Process Pairs Manager), uses a primary and warm-backup approach to achieve recovery times in the range of 10 seconds (for nodes having access to all needed resources), regardless of the application initialization time. This is a clear improvement over the recovery times provided by existing high availability (HA) solutions, which are typically on the order of 1 minute plus the application reinitialization time. The PPM manages an application through a configurable, user-specified state model in which state changes are triggered by detected failures or by system administrator commands. Upon a state transition, the PPM sends a state change command message to the registered application processes. Communication between the application processes and the PPM is achieved through a set of API (application programming interface) calls provided by the OftLib (Open Fault Tolerance Library), also called FT-API. NCAPS is now available on Unix clusters composed of Tandem S4000 machines. A version that runs on NSC (NonStop Clusters), Tandem's SSI (Single System Image) product, for a cluster of Compaq ProLiant machines is under development.
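The state-model mechanism described above can be illustrated with a minimal sketch. It is not the PPM implementation and does not use the OftLib/FT-API calls, which the abstract does not specify. The class name `StateModelManager`, the state names (`primary`, `backup`, `down`), the event names, and the callback signature are all hypothetical, chosen only to show a transition table driven by failure and administrator events, with a state change notification delivered to each registered process.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical transition table: (current state, event) -> next state.
# In the system described, the state model is user-specified; these
# states and events are assumptions made for illustration only.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("backup", "peer_failed"): "primary",        # detected failure
    ("primary", "admin_switchover"): "backup",   # administrator command
    ("backup", "admin_switchover"): "primary",   # administrator command
    ("primary", "admin_shutdown"): "down",
    ("backup", "admin_shutdown"): "down",
}

# Callback invoked on a state change: (process name, old state, new state).
StateChangeCallback = Callable[[str, str, str], None]


class StateModelManager:
    """Toy manager that tracks one state per registered process."""

    def __init__(self, transitions: Dict[Tuple[str, str], str]) -> None:
        self._transitions = dict(transitions)
        self._procs: Dict[str, Tuple[str, StateChangeCallback]] = {}

    def register(self, name: str, initial_state: str,
                 on_state_change: StateChangeCallback) -> None:
        """Register a process with its initial state and notification hook."""
        self._procs[name] = (initial_state, on_state_change)

    def state_of(self, name: str) -> str:
        return self._procs[name][0]

    def dispatch(self, name: str, event: str) -> bool:
        """Apply an event to one process; notify it if its state changes."""
        state, callback = self._procs[name]
        new_state = self._transitions.get((state, event))
        if new_state is None:
            return False  # event is not valid in the current state
        self._procs[name] = (new_state, callback)
        callback(name, state, new_state)  # stands in for the command message
        return True


def demo() -> List[Tuple[str, str, str]]:
    """Promote the warm backup after a simulated primary failure."""
    log: List[Tuple[str, str, str]] = []
    manager = StateModelManager(TRANSITIONS)
    manager.register("app-a", "primary", lambda n, o, s: log.append((n, o, s)))
    manager.register("app-b", "backup", lambda n, o, s: log.append((n, o, s)))
    manager.dispatch("app-b", "peer_failed")
    return log


if __name__ == "__main__":
    print(demo())
```

Keeping the transitions in a plain table rather than in code mirrors the idea of a configurable state model: changing the recovery policy means editing data, not the manager. In this sketch an event that is not valid in the current state is ignored and `dispatch` returns `False`.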