With the increased use of next-generation sequencing, the problem of effectively storing and transmitting such massive amounts of data will need to be addressed. Repositories such as the Sequence Read Archive (SRA) currently use the FASTQ format and a general-purpose compression system (GZIP) for data archiving. In this work, we investigate how GZIP (and BZIP2) can be made more effective for read archiving by pre-sorting the reads. The improvement in compression effectiveness is a reduction of up to 12% when only the sequences are compressed and of up to 6% when the original FASTQ data is considered.
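The pre-sorting idea can be illustrated with a minimal sketch that compresses the same set of read sequences in their original order and in sorted order, then compares the byte counts. Several things here are assumptions rather than details from the paper: the sort is a plain lexicographic sort (the abstract does not state the ordering used), only the sequences are handled (no FASTQ identifiers or quality scores), and the helper names `compressed_sizes` and `synthetic_reads` as well as the randomly generated test reads are hypothetical. The sizes it prints say nothing about the figures reported above.

```python
import bz2
import gzip
import random


def compressed_sizes(reads):
    """Return byte counts for the reads in original and in sorted order.

    Assumption: a plain lexicographic sort stands in for the pre-sorting step.
    """
    results = {}
    for label, ordered in (("original", list(reads)), ("sorted", sorted(reads))):
        # One read per line, as in the sequence lines of a FASTQ file.
        payload = ("\n".join(ordered) + "\n").encode("ascii")
        results[label] = {
            "raw": len(payload),
            "gzip": len(gzip.compress(payload)),
            "bz2": len(bz2.compress(payload)),
        }
    return results


def synthetic_reads(n_reads=2000, read_len=36, genome_len=5000, seed=0):
    """Hypothetical test data: fixed-length reads sampled from a random genome."""
    rng = random.Random(seed)
    genome = "".join(rng.choice("ACGT") for _ in range(genome_len))
    starts = (rng.randrange(genome_len - read_len + 1) for _ in range(n_reads))
    return [genome[s:s + read_len] for s in starts]


if __name__ == "__main__":
    # Print raw, GZIP, and BZIP2 sizes for both orderings.
    for order, by_codec in compressed_sizes(synthetic_reads()).items():
        print(order, by_codec)
```

Sorting changes the order of the reads, so a real archive would either have to tolerate reordering or store enough information to restore the original order; this sketch ignores that cost.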
Published in:
Bioinformatics and Biomedicine Workshops (BIBMW), 2010 IEEE International Conference on
Date of Conference: 18 Dec. 2010