Replication is a key technique for improving fault tolerance, but it can introduce considerable performance overhead under some circumstances. To explore the tradeoff between performance and failure resilience, we develop a calculus that accounts for the I/O characteristics of applications and the failure behavior of distributed storage nodes. Using this evaluation model, we then prescribe a file system replication strategy that maximizes the utilization of computational resources for long-running, compute-intensive grid applications.
Published in: 8th IEEE International Symposium on Cluster Computing and the Grid (CCGRID '08), 2008
Date of Conference: 19-22 May 2008