Abstract:
In this paper, we propose an erasure-coded data archival system called aHDFS for Hadoop clusters, where RS(k + r, k) codes are employed to archive data replicas in the Hadoop distributed file system (HDFS). We develop two archival strategies in aHDFS, aHDFS-Grouping and aHDFS-Pipeline, to speed up the data archival process. aHDFS-Grouping, a MapReduce-based data archiving scheme, keeps each mapper's intermediate output key-value pairs in a local key-value store. With the local store in place, aHDFS-Grouping merges all the intermediate key-value pairs with the same key into a single key-value pair, then shuffles that single pair to the reducers to generate the final parity blocks. aHDFS-Pipeline forms a data archival pipeline using multiple data nodes in a Hadoop cluster. Each node in the pipeline delivers its merged single key-value pair to the subsequent node's local key-value store, and the last node in the pipeline is responsible for outputting the parity blocks. We implement aHDFS in a real-world Hadoop cluster. The experimental results show that aHDFS-Grouping and aHDFS-Pipeline speed up Baseline's shuffle and reduce phases by a factor of 10 and 5, respectively. When the block size is larger than 32 MB, aHDFS improves the performance of HDFS-RAID and HDFS-EC by approximately 31.8 and 15.7 percent, respectively.
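The merging step in aHDFS-Grouping can be illustrated with a minimal sketch. It rests on assumptions the abstract does not state: that each mapper emits partial parity contributions keyed by parity-block index, and that contributions sharing a key are combined by bytewise XOR, which is addition in GF(2^8) as used by typical Reed-Solomon implementations. The names `gf_mul`, `LocalStore`, and `map_blocks`, the field polynomial, and the coefficient matrix are all hypothetical and are not taken from aHDFS.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8), reducing by x^8+x^4+x^3+x^2+1 (0x11d).

    The choice of polynomial is an assumption made for this sketch.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


def scale(block: bytes, coeff: int) -> bytes:
    """Multiply every byte of a block by one GF(2^8) coefficient."""
    return bytes(gf_mul(x, coeff) for x in block)


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Add two equal-length blocks in GF(2^8), i.e. bytewise XOR."""
    return bytes(x ^ y for x, y in zip(a, b))


class LocalStore:
    """Hypothetical mapper-local key-value store that merges on insert."""

    def __init__(self):
        self._pairs = {}

    def put(self, key: int, value: bytes) -> None:
        # Pairs with the same key collapse into one pair by XOR.
        if key in self._pairs:
            self._pairs[key] = xor_bytes(self._pairs[key], value)
        else:
            self._pairs[key] = value

    def items(self) -> dict:
        return dict(self._pairs)


def map_blocks(data_blocks, coeffs):
    """Emit one merged pair per parity index instead of one per data block.

    coeffs[j][i] is the (assumed) coefficient of data block i in parity j.
    """
    store = LocalStore()
    for i, block in enumerate(data_blocks):
        for j, row in enumerate(coeffs):
            store.put(j, scale(block, row[i]))
    return store.items()


merged = map_blocks([b"\x01\x02", b"\x03\x04"], [[1, 1], [1, 2]])
```

With two data blocks and two parity rows, the mapper would otherwise emit four intermediate pairs; after merging, `merged` holds only two, one per parity block, which is the reduction in shuffled data that the grouping strategy aims for.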
Published in: IEEE Transactions on Parallel and Distributed Systems (Volume: 28, Issue: 11, 01 November 2017)
- IEEE Keywords
- Index Terms
- Data Repository
- Archiving System
- Hadoop Cluster
- Block Size
- Local Store
- File System
- Parity-check
- Intermediate Output
- Key-value Pairs
- Subsequent Nodes
- Hadoop Distributed File System
- Storage Systems
- Side Of Equation
- Fault-tolerant
- Intermediate Results
- Urban Network
- Storage Cost
- File Size
- Key Values
- Pipelining
- Reed-Solomon Codes
- Total Execution Time
- Data Block
- Reduction In Execution Time
- Map Tasks
- Phase Map
- Optimal Guidance
- Intermediate Data
- Multiple Mapping