This paper describes our ongoing project, Unibus, in the context of facilitating fault-tolerant execution of MPI applications on computing chunks in the cloud. In general, Unibus focuses on resource access virtualization and automatic, user-transparent resource provisioning, which together simplify the use of the heterogeneous resources available to users. In this work, we present the key Unibus concepts (the Capability Model, composite operations, mediators, soft and successive conditionings, and metaapplications) and demonstrate how to employ Unibus to orchestrate resources provided by a commercial cloud provider into a fault-tolerant platform capable of executing message-passing applications. To support fault tolerance (FT), we use DMTCP (Distributed MultiThreaded CheckPointing), which enables user-level checkpointing. To demonstrate that the Unibus-created, FT-enabled platform can execute MPI applications, we ran the NAS Parallel Benchmarks and measured the overhead introduced by FT.