Cloud storage has become increasingly popular due to its convenience, cost-effectiveness, and scalability. It underpins a range of file hosting services that let users synchronize their files between servers and their devices. Naive file synchronization, however, requires the whole file to be transmitted to every other location (servers, devices) whenever the file is updated in one location. This wastes a large amount of bandwidth and significantly delays propagation of the update. We propose a method called HadoopRsync, which updates files incrementally instead of transmitting them in their entirety. The method is based on the rsync utility, originally proposed for file synchronization between two computers. The cloud storage scenario differs significantly from the one rsync was designed for, because files are stored in a distributed manner across multiple nodes in the cloud. We therefore propose a pair of algorithms, HadoopRsync Upload and HadoopRsync Download, which handle synchronization from the user's devices to the cloud and synchronization in the opposite direction, respectively. These algorithms transmit only the differences between the new version of the file and the old version, rather than the whole file. Our solution is built on Hadoop, the open-source framework for distributed processing of very large data sets across clusters of computers. The algorithms use the MapReduce facility provided by Hadoop to take full advantage of its massively parallel processing capability. In addition, we propose several optimization measures to reduce the I/O required for file updates. Extensive experiments show that HadoopRsync significantly outperforms the baseline methods.
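To make the idea of transmitting only the differences concrete, the following is a minimal single-machine sketch of rsync-style delta encoding. It is not the HadoopRsync Upload or Download algorithm, which this abstract does not specify, and it involves no Hadoop or MapReduce. The function names (`weak_sum`, `signature`, `delta`, `patch`), the tiny block size, and the use of MD5 as the strong checksum are all assumptions made for illustration. For clarity the weak checksum is recomputed at every offset, whereas rsync updates it in constant time as the window slides.

```python
import hashlib

BLOCK = 4  # hypothetical block size, kept tiny so the example is easy to trace


def weak_sum(data: bytes) -> int:
    """Cheap Adler-style checksum used only to pre-filter candidate blocks."""
    a = sum(data) % 65536
    b = sum((len(data) - i) * x for i, x in enumerate(data)) % 65536
    return (b << 16) | a


def signature(old: bytes, block: int = BLOCK) -> dict:
    """Map weak checksum -> list of (strong digest, block index) for the old file."""
    sig = {}
    for start in range(0, len(old), block):
        chunk = old[start:start + block]
        entry = (hashlib.md5(chunk).digest(), start // block)
        sig.setdefault(weak_sum(chunk), []).append(entry)
    return sig


def delta(new: bytes, sig: dict, block: int = BLOCK) -> list:
    """Describe the new file as 'copy old block k' and 'literal data' operations."""
    ops = []
    literal = bytearray()
    i = 0
    while i < len(new):
        chunk = new[i:i + block]
        match = None
        candidates = sig.get(weak_sum(chunk), [])
        if candidates:
            # Confirm a weak-checksum hit with the strong digest.
            strong = hashlib.md5(chunk).digest()
            for digest, index in candidates:
                if digest == strong:
                    match = index
                    break
        if match is None:
            # No old block starts here: keep one byte as literal data and slide on.
            literal.append(new[i])
            i += 1
        else:
            if literal:
                ops.append(("data", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", match))
            i += len(chunk)
    if literal:
        ops.append(("data", bytes(literal)))
    return ops


def patch(old: bytes, ops: list, block: int = BLOCK) -> bytes:
    """Rebuild the new file from the old file plus the delta operations."""
    out = bytearray()
    for kind, value in ops:
        if kind == "copy":
            out += old[value * block:(value + 1) * block]
        else:
            out += value
    return bytes(out)


if __name__ == "__main__":
    old_file = b"the quick brown fox jumps"
    new_file = b"the quick red fox jumps"
    operations = delta(new_file, signature(old_file))
    literal_bytes = sum(len(v) for k, v in operations if k == "data")
    print(operations)
    print(f"literal bytes sent: {literal_bytes} of {len(new_file)}")
```

In this sketch only the `data` operations carry file content over the network; each `copy` operation is a block index that the receiver resolves against its own copy of the old file. In the setting the abstract describes, the old file's blocks reside on multiple cloud nodes, which is the difference from plain rsync that the proposed algorithms address.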