The Hadoop Distributed File System (HDFS) is a popular choice for Big Data applications due to its reliability and fault tolerance. HDFS provides fault tolerance and availability guarantees by replicating each data block to multiple DataNodes. The current implementation of HDFS in Apache Hadoop performs replication in a pipelined fashion, which results in high replication times. Such long replication times adversely impact the performance of real-time, latency-sensitive applications. In this paper, we propose an alternative parallel replication scheme applicable to both the socket-based design of HDFS and the RDMA-based design of HDFS over InfiniBand. We analyze the challenges and issues in parallel replication and compare its performance with the existing pipelined replication scheme in HDFS over 1 GigE, IPoIB (IP over InfiniBand), 10 GigE, and RDMA (Remote Direct Memory Access) over InfiniBand. Experiments performed over high-performance networks (IPoIB, 10 GigE, and IB) show that the proposed parallel replication scheme outperforms the default pipelined design for a variety of benchmarks. We observe up to a 16% reduction in the execution time of the TeraGen benchmark and an increase of up to 12% in the throughput reported by the TestDFSIO benchmark. The proposed parallel replication also improves the performance of the HBase Put operation by 17%. However, for lower-performance networks such as 1 GigE and for smaller data sizes, parallel replication does not improve performance.
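The structural difference between the two schemes can be illustrated with a minimal sketch. It is a toy model, not HDFS code: the function names and node labels are hypothetical, and it assumes that in the parallel scheme the client writes each block directly to every DataNode, while in the pipelined scheme each node forwards the block to the next one in the chain.

```python
def pipelined_transfers(client, datanodes):
    """Pipelined scheme: client -> dn1 -> dn2 -> ... (each node forwards)."""
    chain = [client] + list(datanodes)
    return list(zip(chain, chain[1:]))


def parallel_transfers(client, datanodes):
    """Parallel scheme (assumed): the client sends to every DataNode itself."""
    return [(client, dn) for dn in datanodes]


def copies_sent_by(node, transfers):
    """Number of block copies a given node has to push onto the network."""
    return sum(1 for src, _dst in transfers if src == node)


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3"]  # hypothetical DataNode labels
    for name, plan in (
        ("pipelined", pipelined_transfers("client", nodes)),
        ("parallel", parallel_transfers("client", nodes)),
    ):
        print(name, plan, "client copies:", copies_sent_by("client", plan))
```

In this model the pipelined plan forms a chain whose length grows with the replication factor, whereas the parallel plan removes the chain but makes the client send one copy per replica. The sketch says nothing about actual timings, which depend on the network and data size as reported above.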