Self-Caring IT systems are those that proactively avoid system failures rather than reactively handling failures after they have occurred. In this paper, we are interested in failures in which a MapReduce job is unable to complete within its SLA-based completion time. The fault-tolerance capability provided by existing MapReduce frameworks is simple, and the penalty associated with handling failures can lead to excessive job execution times. Our goal in this paper is to bring out the severity of this penalty for different job characteristics and configurable framework parameters. We first quantitatively evaluate the execution-time penalty associated with node failures in the open-source MapReduce framework Hadoop, using the MRPerf simulator. This increase in execution time is particularly expensive in pay-as-you-go cloud infrastructures, where users are charged by the duration of resource usage. Our solution minimizes job-completion-time SLA violations by augmenting the existing fault-tolerance capability of the MapReduce framework with a dynamic resource scaling approach. This approach leverages the elastic properties of a cloud to mitigate execution-time penalties and hence proactively avoid a potential job failure. Using our proposed approach for various job and framework parameters, we show that performance penalties can be decreased by up to 78% in the case of single-node failures and by up to 100% in the case of 4-node failures, at minimal additional cost.