We present an improved minimum description length (MDL) learning algorithm, MDLCompress, for nucleotide sequence analysis. It compresses better than other grammar-based coding methods such as DNA Sequitur while retaining a two-part code that highlights biologically significant phrases. The phrases recursively added to the MDLCompress model are not necessarily the longest matches, nor the most frequently repeated phrases of a given length; each is chosen for a combination of length and repetition such that including it in the model maximizes compression. The deep recursion of MDLCompress, combined with its two-part coding, makes it uniquely able to identify biologically meaningful sequences without limiting assumptions. Because the cost in bits of each phrase in the MDL model can be quantified, the model supports prediction of fragile regions where single nucleotide polymorphisms (SNPs) may have the most impact on biological activity. MDLCompress improves on our previous algorithm in runtime performance, through an innovative data structure, and in specificity of motif detection (compression), through improved heuristics. We also discuss recent results from MDLCompress analysis of 144 genes known to be overexpressed in the breast cancer cell line BT474. Novel motifs, including potential microRNA (miRNA) binding sites, have been identified within certain genes and are being considered for in vitro validation studies.
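The selection criterion described above, a trade-off between phrase length and repetition measured in bits, can be illustrated with a minimal sketch. This is not the MDLCompress implementation: the cost model (a flat 2 bits per symbol, with one model entry plus one replacement symbol per occurrence) and the names `savings_in_bits` and `best_phrase` are assumptions made purely for illustration, and the exhaustive search stands in for whatever data structure and heuristics the algorithm actually uses.

```python
# Toy two-part-code cost model (an assumption, not the paper's accounting):
# every symbol, original or newly introduced, costs a flat 2 bits.
BITS_PER_SYMBOL = 2


def savings_in_bits(length, repeats):
    """Estimated bits saved by moving a phrase into the model.

    Before: `repeats` copies of `length` symbols in the data.
    After:  one copy of the phrase in the model, plus one
            replacement symbol per occurrence in the data.
    """
    before = repeats * length
    after = length + repeats
    return BITS_PER_SYMBOL * (before - after)


def best_phrase(seq, min_len=2, max_len=None):
    """Return (phrase, savings) for the phrase with the largest positive
    estimated savings, or (None, 0) if no phrase saves any bits."""
    if max_len is None:
        max_len = len(seq) // 2  # a phrase must fit at least twice
    best = (None, 0)
    seen = set()
    for length in range(min_len, max_len + 1):
        for start in range(len(seq) - length + 1):
            phrase = seq[start:start + length]
            if phrase in seen:
                continue
            seen.add(phrase)
            # str.count counts non-overlapping occurrences.
            repeats = seq.count(phrase)
            if repeats < 2:
                continue
            gain = savings_in_bits(length, repeats)
            if gain > best[1]:
                best = (phrase, gain)
    return best


if __name__ == "__main__":
    print(best_phrase("ACGTACGTACGT"))
```

Under this toy model, `best_phrase("ACGTACGTACGT")` selects `ACGT` (three occurrences, 10 bits saved) over shorter phrases such as `AC`, which repeat just as often but save only 2 bits. A recursive scheme would substitute the chosen phrase with a new symbol and repeat the search on the rewritten sequence until no candidate yields positive savings.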