As massively parallel processing (MPP) machines and their associated applications grow larger, more work on resiliency is needed if those applications are to run for significant lengths of time at the expected component failure rates. This paper describes an approach for protecting large, read-mostly, in-memory data structures from various forms of failure by applying software erasure-correcting codes. A prototype library for this scheme was implemented on the Cray XMT and applied to a sample application. The library is also portable to other global shared memory architectures that meet certain requirements, including the Cray XE.
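To make the erasure-correcting idea concrete, the following is a minimal sketch and not the prototype library itself. It assumes the simplest possible code, a single XOR parity block over equal-length data blocks, which can rebuild any one block whose location is known to be lost. The abstract does not say which code the library uses, and the names `xor_blocks`, `encode`, and `recover` are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Compute one parity block over the data blocks (assumed scheme: XOR parity)."""
    return xor_blocks(data_blocks)


def recover(blocks, parity):
    """Rebuild the single erased block, marked as None, from the survivors and parity."""
    missing = [i for i, block in enumerate(blocks) if block is None]
    if len(missing) != 1:
        raise ValueError("single XOR parity can repair exactly one erased block")
    survivors = [block for block in blocks if block is not None]
    return missing[0], xor_blocks(survivors + [parity])


# Illustrative data: three blocks of a read-mostly structure.
data = [b"node", b"edge", b"list"]
parity = encode(data)

# Simulate losing the second block, for example to a failed memory module.
damaged = [data[0], None, data[2]]
index, rebuilt = recover(damaged, parity)
```

A single parity block tolerates only one erasure per group; tolerating more simultaneous losses requires a stronger code, such as Reed-Solomon. A read-mostly structure suits this kind of protection because the parity must be recomputed only when the data is written.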