Skip to Main Content
In this work we are concerned with the cost associated with replicating intermediate data for dataflows in Cloud environments. This cost is attributed to the extra resources required to create and maintain the additional replicas for a given data set. Existing data-analytic platforms such as Hadoop provide for fault-tolerance guarantee by relying on aggressive replication of intermediate data. We argue that the decision to replicate along with the number of replicas should be a function of the resource usage and utility of the data in order to minimize the cost of reliability. Furthermore, the utility of the data is determined by the structure of the dataflow and the reliability of the system. We propose a replication technique, which takes into account resource usage, system reliability and the characteristic of the dataflow to decide what data to replicate and when to replicate. The replication decision is obtained by solving a constrained integer programming problem given information about the dataflow up to a decision point. In addition, we built a working prototype, CARDIO of our technique which shows through experimental evaluation using a real testbed that finds an optimal solution.