SelectiveEC: Towards Balanced Recovery Load on Erasure-Coded Storage Systems


Abstract:

Erasure coding (EC) is commonly used to offer high data reliability at low storage cost. Upon failures, the lost blocks are recovered in batches. Because the number of stripes in a batch is limited, the data layout within a batch is non-uniform. Combined with the random selection of source and replacement nodes for recovery tasks, this skews the recovery workload among live nodes within a batch, which severely slows down failure recovery. To solve this problem, we present SelectiveEC, a new recovery task scheduling module that provides provable network traffic and recovery load balancing for large-scale EC-based storage systems. It relies on bipartite graphs to model the recovery traffic among live nodes. It then selects tasks to form batches and determines where to read source blocks and where to store recovered ones, using graph-theoretic tools such as perfect or maximum matchings and k-regular spanning subgraphs. SelectiveEC supports both single-node and multi-node failure recovery, and can be deployed in both homogeneous and heterogeneous network environments. We implement SelectiveEC in HDFS and evaluate its recovery performance on a local cluster of 18 nodes and on 50 AWS EC2 virtual machine instances. SelectiveEC increases the recovery throughput by up to 30.68% compared with state-of-the-art baselines in homogeneous network environments. In heterogeneous network environments, it further achieves, on average, 1.32× the recovery throughput and 1.23× the benchmark throughput of HDFS, because balanced scheduling avoids stragglers.
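
To make the matching idea concrete, the following is a minimal sketch, not the paper's algorithm: it assumes a simplified model in which each recovery task in a batch lists the live nodes eligible to store its recovered block, and it uses a standard augmenting-path maximum bipartite matching (Kuhn's algorithm) so that no node receives more than one recovered block in the batch. The task names, node names, and the function max_bipartite_matching are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set


def max_bipartite_matching(candidates: Dict[str, List[str]]) -> Dict[str, str]:
    """Return a maximum matching (task -> node) via augmenting paths.

    `candidates` maps each recovery task to the nodes allowed to store
    its recovered block. This is an illustrative model, not SelectiveEC's
    actual data structure.
    """
    node_to_task: Dict[str, str] = {}

    def try_assign(task: str, visited: Set[str]) -> bool:
        for node in candidates[task]:
            if node in visited:
                continue
            visited.add(node)
            # Use a free node, or re-route the task that currently holds it.
            if node not in node_to_task or try_assign(node_to_task[node], visited):
                node_to_task[node] = task
                return True
        return False

    for task in candidates:
        try_assign(task, set())

    # Invert the node -> task map into task -> node for the caller.
    return {task: node for node, task in node_to_task.items()}


# Hypothetical batch: three recovery tasks and their eligible replacement nodes.
batch = {
    "t1": ["n1", "n2"],
    "t2": ["n1"],
    "t3": ["n2", "n3"],
}
assignment = max_bipartite_matching(batch)
print(assignment)
```

In this toy batch every task ends up on a distinct node, which is a perfect matching; when no perfect matching exists, the same routine returns a maximum one and leaves the remaining tasks unassigned, which under this simplified model would have to wait for a later batch.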
Published in: IEEE Transactions on Parallel and Distributed Systems (Volume: 33, Issue: 10, 01 October 2022)
Page(s): 2386 - 2400
Date of Publication: 23 November 2021
