Skip to Main Content
Replication is the prevalent solution to tolerate faults in large data structures hosted on distributed servers. To tolerate f crash faults (dead/unresponsive data structures) among n distinct data structures, replication requires f + 1 replicas of each data structure, resulting in nf additional backups. We present a solution, referred to as fusion that uses a combination of erasure codes and selective replication to tolerate f crash faults using just f additional fused backups. We show that our solution achieves O(n) savings in space over replication. Further, we present a solution to tolerate f Byzantine faults (malicious data structures), that requires only nf + f backups as compared to the 2nf backups required by replication. We explore the theory of fused backups and provide a library of such backups for all the data structures in the Java Collection Framework. The theoretical and experimental evaluation confirms that the fused backups are space-efficient as compared to replication, while they cause very little overhead for normal operation. To illustrate the practical usefulness of fusion, we use fused backups for reliability in Amazon's highly available key-value store, Dynamo. While the current replication-based solution uses 300 backup structures, we present a solution that only requires 120 backup structures. This results in savings in space as well as other resources such as power.