A Plain-Text Incremental Compression (PIC) Technique with Fast Lookup Ability


Abstract:

Data compression is a key aspect of computing applications such as online search engines, cloud computing and big data computing. In recent times, with the increasing popularity of remote and cloud-based computation, compression is becoming more important. Reducing the size of a data object in this context reduces not only the transfer time but also the amount of data transferred. The key figures of merit of a data compression scheme are its compression ratio and its compression, decompression and search (lookup) speeds. Traditional compression techniques achieve high compression ratios, but require decompression before a lookup can be performed, thus increasing the lookup time. In this paper, we propose an incremental compression technique for plain-text data objects (PIC) that uses variable-length encoding to compress data. The dictionary of possible words is sorted based on the statistical frequency of the words, and the words are encoded using the variable-length code-words. Words that are not in the dictionary are handled as well. The driving motivation of our technique is to perform significantly faster lookups without the need to decompress the compressed data object. PIC also facilitates string operations (such as concatenation, insertion, deletion and lookup-and-replacement) on compressed text without the need for decompression. In this manner, these string operations are incrementally compressed, resulting in greatly improved efficiency. We implement our technique in C++, and compare PIC with industry-standard tools like lz4, gzip and bzip2 in terms of compression ratio, lookup speed, and lookup-and-replace time. PIC is about 8.84×, 100.17× and 174.25× faster than lz4, gzip and bzip2, respectively, when the data is looked up and restored into a compressed format. A lookup-and-replace operation on a PIC-compressed file is shown to be 2.83× faster than on a plain-text file.
We formally prove that our method does not produce any false positive o...
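
The idea described in the abstract (a frequency-sorted word dictionary, variable-length code-words, and lookup on the compressed form) can be illustrated with a minimal sketch. This is not the paper's C++ implementation or its actual code-word format, which the abstract does not specify. The sketch assumes a base-128 variable-length byte code, whitespace-delimited words, and a reserved escape code for out-of-dictionary words, and every function name here is hypothetical.

```python
from collections import Counter

ESCAPE = 0  # assumed reserved code for words missing from the dictionary


def build_dictionary(corpus):
    """Rank words by descending frequency; frequent words get small ranks."""
    counts = Counter(corpus.split())
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ordered, start=1)}


def encode_number(n):
    """Base-128 code: 7 payload bits per byte, high bit set on all but the last byte."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def decode_number(data, pos):
    """Return (value, next position) for the code starting at pos."""
    value = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if byte < 0x80:
            return value, pos


def compress(text, dictionary):
    """Replace each word by its code; unknown words are stored raw after ESCAPE."""
    out = bytearray()
    for word in text.split():
        rank = dictionary.get(word)
        if rank is None:
            raw = word.encode("utf-8")
            out += encode_number(ESCAPE) + encode_number(len(raw)) + raw
        else:
            out += encode_number(rank)
    return bytes(out)


def tokens(data):
    """Yield (rank, raw bytes or None) per word without rebuilding the text."""
    pos = 0
    while pos < len(data):
        rank, pos = decode_number(data, pos)
        if rank == ESCAPE:
            length, pos = decode_number(data, pos)
            yield ESCAPE, data[pos:pos + length]
            pos += length
        else:
            yield rank, None


def lookup(data, word, dictionary):
    """Return the word positions of `word`, comparing codes rather than text."""
    rank = dictionary.get(word, ESCAPE)
    raw = word.encode("utf-8") if rank == ESCAPE else None
    return [i for i, tok in enumerate(tokens(data)) if tok == (rank, raw)]


def decompress(data, dictionary):
    """Rebuild the text (whitespace is normalized to single spaces)."""
    words = {rank: word for word, rank in dictionary.items()}
    return " ".join(words[r] if r != ESCAPE else raw.decode("utf-8")
                    for r, raw in tokens(data))
```

For example, with a dictionary built from `"the cat and the dog and the bird"`, compressing `"the dog and the zebra"` yields 11 bytes for 21 characters, and `lookup` finds `"the"` at word positions 0 and 3 by walking code-words only. Because the scan advances one whole code-word at a time, a match can never start in the middle of another word's code.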
Date of Conference: 07-10 October 2018
Date Added to IEEE Xplore: 17 January 2019
Conference Location: Orlando, FL, USA