Abstract:
As exascale platforms are in sight, high-performance computing needs to take failures into account and provide fault-tolerant applications and environments. Checkpoint-restart approaches do not require modifying the application, but are expensive at large scale. Application-based fault tolerance is more specific to the application and is expected to achieve better performance. In this paper, we address fault-tolerant matrix factorization with algorithms that present good performance, both during failure-free executions and when failures happen. A challenge when designing fault-tolerant algorithms is to make sure they are resilient to any failure scenario. Therefore, we design a model for these algorithms and prove they can tolerate failures at any moment, as long as enough processes are still alive.
Published in: 2022 26th International Conference on Engineering of Complex Computer Systems (ICECCS)
Date of Conference: 26-30 March 2022
Date Added to IEEE Xplore: 03 May 2022