Cluster-based delta compression of a collection of files


Abstract:

Delta compression techniques are commonly used to succinctly represent an updated version of a file with respect to an earlier one. We study the use of delta compression in a somewhat different scenario, where we wish to compress a large collection of (more or less) related files by performing a sequence of pairwise delta compressions. The problem of finding an optimal delta encoding for a collection of files by taking pairwise deltas can be reduced to the problem of computing a branching of maximum weight in a weighted directed graph, but this solution is inefficient and thus does not scale to larger file collections. This motivates us to propose a framework for cluster-based delta compression that uses text clustering techniques to prune the graph of possible pairwise delta encodings. To demonstrate the efficacy of our approach, we present experimental results on collections of Web pages. Our experiments show that cluster-based delta compression of collections provides significant improvements in compression ratio as compared to individually compressing each file or using tar+gzip, at a moderate cost in efficiency.
Date of Conference: 14-14 December 2002
Date Added to IEEE Xplore: 25 February 2003
Print ISBN: 0-7695-1766-8
Conference Location: Singapore
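
The graph reduction mentioned in the abstract can be illustrated with a small sketch. It is not taken from the paper: it assumes the common equivalent formulation in which an artificial "null" root node stands for compressing a file on its own, so that a minimum-cost arborescence rooted at that node corresponds to a maximum-weight branching over the savings. The function name `min_arborescence_cost`, the node numbering, and all byte sizes are hypothetical, and the routine is a plain O(VE) Chu-Liu/Edmonds contraction that returns only the total cost, not the chosen reference for each file.

```python
def min_arborescence_cost(n, root, edges):
    """Total cost of a minimum-cost arborescence rooted at `root`.

    `edges` is a list of (src, dst, cost) tuples over nodes 0..n-1.
    Returns None if some node cannot be reached from the root.
    """
    INF = float("inf")
    total = 0
    while True:
        # Step 1: pick the cheapest incoming edge for every node.
        in_w = [INF] * n
        pre = [-1] * n
        for u, v, w in edges:
            if u != v and w < in_w[v]:
                in_w[v] = w
                pre[v] = u
        for v in range(n):
            if v != root and in_w[v] == INF:
                return None
        in_w[root] = 0

        # Step 2: add those edges to the total and look for cycles among them.
        comp = [-1] * n   # component id assigned to nodes on a cycle
        vis = [-1] * n    # which start node last walked through this node
        cnt = 0
        for v in range(n):
            total += in_w[v]
            x = v
            while vis[x] != v and comp[x] == -1 and x != root:
                vis[x] = v
                x = pre[x]
            if x != root and comp[x] == -1:
                # x lies on a new cycle: label every node on it.
                y = pre[x]
                while y != x:
                    comp[y] = cnt
                    y = pre[y]
                comp[x] = cnt
                cnt += 1
        if cnt == 0:
            return total  # no cycles: the chosen edges form an arborescence

        # Step 3: contract each cycle into one node and reweight the edges.
        for v in range(n):
            if comp[v] == -1:
                comp[v] = cnt
                cnt += 1
        edges = [(comp[u], comp[v], w - in_w[v])
                 for u, v, w in edges if comp[u] != comp[v]]
        root = comp[root]
        n = cnt


# Hypothetical sizes in bytes. Node 0 is the artificial null reference, so an
# edge (0, i, c) means "compress file i on its own at cost c", while an edge
# (j, i, c) means "encode file i as a delta against file j at cost c".
ROOT = 0
example_edges = [
    (ROOT, 1, 100), (ROOT, 2, 100), (ROOT, 3, 100),
    (1, 2, 20), (2, 1, 25),
    (2, 3, 30), (3, 2, 30),
    (1, 3, 60), (3, 1, 60),
]
best_total = min_arborescence_cost(4, ROOT, example_edges)
```

With these made-up numbers, compressing every file on its own costs 300 bytes, while the cheapest arborescence stores file 1 on its own, file 2 as a delta against file 1, and file 3 as a delta against file 2. Filling in the edge list requires one delta computation per ordered pair of files, which suggests why pruning the graph of candidate pairs matters for large collections.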

1 Introduction

Delta compressors are software tools for compactly encoding the differences between two files or strings in order to reduce communication or storage costs. Examples of such tools are the diff and bdiff utilities for computing edit sequences between two files, and the more recent xdelta [16], vdelta [12], vcdiff [15], and zdelta [26] tools that compute highly compressed representations of file differences. These tools have a number of applications in various networking and storage scenarios; see [21] for a more detailed discussion. In a communication scenario, they typically exploit the fact that the sender and receiver both possess a reference file that is similar to the transmitted file; thus transmitting only the difference (or delta) between the two files requires a significantly smaller number of bits. In storage applications such as version control systems, deltas are often orders of magnitude smaller than the compressed target file.
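
The idea of encoding a target file against a shared reference can be sketched with Python's standard zlib module, using the reference as a preset dictionary. This is only a stand-in for illustration and not the algorithm of xdelta, vdelta, vcdiff, or zdelta; the function names `compress_alone`, `delta_encode`, and `delta_decode` are hypothetical. One limitation to keep in mind: deflate can only refer back to roughly the last 32 KB of the dictionary, so the sketch suits small files.

```python
import zlib


def compress_alone(target: bytes) -> bytes:
    """Compress the target with no knowledge of any reference file."""
    return zlib.compress(target, 9)


def delta_encode(reference: bytes, target: bytes) -> bytes:
    """Compress the target using the reference as a preset dictionary.

    Substrings shared with the reference become short back-references,
    so the output mostly describes what differs from the reference.
    """
    compressor = zlib.compressobj(level=9, zdict=reference)
    return compressor.compress(target) + compressor.flush()


def delta_decode(reference: bytes, delta: bytes) -> bytes:
    """Rebuild the target; the receiver must hold the same reference."""
    decompressor = zlib.decompressobj(zdict=reference)
    return decompressor.decompress(delta) + decompressor.flush()
```

In the communication scenario described above, the sender would call `delta_encode(reference, target)` and transmit the result, and the receiver would call `delta_decode` with its own copy of the reference. How much smaller the delta is than the output of `compress_alone` depends on how similar the two files are.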
