One of the main goals of computational genomics is fast and accurate biological interpretation of newly sequenced genomic DNA. The complexity of the task varies among genomes, but it is never simple. Currently, a custom-built annotation pipeline is constructed for each new genome by integrating ab initio and comparative genomic methods. Still, a consistent solution of the jigsaw puzzle of genome annotation frequently requires additional experimental effort (such as EST/cDNA sequencing). Current ab initio gene finding algorithms use statistical analysis and optimization to solve the gene identification problem, restated as a search for the optimal parse of the genomic sequence into fragments with distinct statistical characteristics. This problem setting leads to a classic task for dynamic programming: the search for an optimal path through a network with weights/scores assigned to nodes and edges. The assignment of weights/scores plays a critical role and may present a significant challenge. It is equivalent to estimating the parameters of the statistical models (hidden Markov models) that represent the mosaic of functional sequences and sites in a given genome. This estimation is rather easy when large sets of validated training sequences are available. However, that is not the case for hundreds of currently unfolding genome sequencing and annotation projects. In the lecture we will consider the general schemes of ab initio gene prediction and discuss the estimation of model parameters without a training set. We will show that this unsupervised approach is possible and is becoming very important for two rapidly developing branches of genomics: (i) prokaryotic metagenomes, which are becoming a rich source of information about uncultivated microbial species, and (ii) "compact" eukaryotic genomes, such as those of fungi, whose relatively small size (less than 50 Mb) allows a complete genome sequence to be obtained in a relatively short time.
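As an illustration of the "optimal parse" formulation, the following is a minimal sketch of Viterbi decoding for a toy two-state hidden Markov model. It is not the lecture's model: the state labels (`N` for noncoding, `C` for coding), the function name `viterbi`, and every probability in the tables are hypothetical values chosen only so that the two states have visibly different base composition. Real gene finders use far richer state structures (codon position, splice sites, duration models).

```python
import math

# Hypothetical two-state model: N = noncoding, C = coding.
# All probabilities below are made-up illustration values, not estimates
# from any real genome.
STATES = ("N", "C")
START = {"N": 0.5, "C": 0.5}
TRANS = {
    "N": {"N": 0.9, "C": 0.1},
    "C": {"N": 0.1, "C": 0.9},
}
EMIT = {
    "N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},  # AT-rich background
    "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},  # GC-rich "genes"
}


def viterbi(seq):
    """Return the most probable state path for seq as a string of labels."""
    if not seq:
        return ""
    # Log-probability of the best path ending in each state at position 0.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best previous state to transition from into s.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    print(viterbi("AAAAAAGGGGGGGGAAAAAA"))
```

Here the transition and emission tables are set by hand, which stands in for the supervised case where parameters come from validated training sequences. In the unsupervised setting the abstract describes, those tables would instead have to be estimated from the unannotated sequence itself.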