
The 4M (Mixed Memory Markov Model) Algorithm for Finding Genes in Prokaryotic Genomes


4 Author(s)

In this paper, we present a new algorithm called 4M (mixed memory Markov model) for finding genes in the genomes of prokaryotes. The known coding regions of the genome are modeled as a set of sample paths of one multistep Markov chain, and the known non-coding regions as a set of sample paths of another multistep Markov chain. The new feature of the 4M algorithm is that different states are allowed to have different memory lengths, in contrast to the fixed multistep Markov model used in the various versions of GeneMark. At the same time, compared with an algorithm like Glimmer3, which interpolates Markov models of different memory lengths, the statistical significance of the conclusions drawn from the 4M algorithm is easy to quantify. Thus, when a whole genome annotation is carried out and several new genes are predicted, the predictions can readily be ranked by the confidence one has in them. The basis of the 4M algorithm is a simple rank condition satisfied by the matrix of frequencies associated with a Markov chain. The 4M algorithm is validated by applying it to 75 organisms belonging to practically all known families of bacteria and archaea. Its performance is compared with that of Glimmer3, GeneMark2.5d, and GeneMarkHMM2.6g. It is found that, in a vast majority of cases, the 4M algorithm finds many more genes than it misses, compared with any of the other three algorithms. Next, the 4M algorithm is used to carry out whole genome annotation of 13 organisms, using 50% of the known genes as the training input for the coding model and 20% of the known non-genes as the training input for the non-coding model. After this, all of the open reading frames are classified. It is found that the 4M algorithm is highly specific in that it picks out virtually all of the known genes, while predicting that only a small number of the open reading frames whose status is unknown are genes.
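
The two-model idea in the abstract can be illustrated with a minimal sketch. This is not the 4M algorithm itself: it uses a single fixed memory length for every state (closer to the fixed multistep model that the abstract attributes to GeneMark), and it omits the rank condition that 4M uses to choose memory lengths. The function names, the `order` and `pseudocount` parameters, and the toy training sequences are all assumptions made for illustration. An open reading frame is scored by the difference between its log-likelihood under the coding model and under the non-coding model.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(sequences, order=2, pseudocount=1.0):
    """Fit a fixed-order Markov chain and return a log-probability function.

    The returned function gives log P(base | previous `order` bases),
    estimated from counts with additive (pseudocount) smoothing.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0

    def log_prob(context, base):
        ctx_counts = counts.get(context, {})
        total = sum(ctx_counts.values()) + pseudocount * len(ALPHABET)
        return math.log((ctx_counts.get(base, 0.0) + pseudocount) / total)

    return log_prob


def log_likelihood(seq, log_prob, order=2):
    """Sum the log transition probabilities along the sequence."""
    return sum(log_prob(seq[i - order:i], seq[i]) for i in range(order, len(seq)))


def classify_orf(orf, coding_lp, noncoding_lp, order=2):
    """Label an ORF by the sign of its log-likelihood ratio; return (label, score)."""
    score = log_likelihood(orf, coding_lp, order) - log_likelihood(orf, noncoding_lp, order)
    return ("gene" if score > 0 else "non-gene"), score


# Toy training data (invented for this sketch, not taken from any genome).
coding_train = ["ATGGCGGCGGCGGCGTAA", "ATGGCGGCCGCGGCGTGA"]
noncoding_train = ["ATATATATTATATAAATAT", "TTTTAAATATATATTTAAA"]

coding_lp = train_markov(coding_train)
noncoding_lp = train_markov(noncoding_train)

label_a, score_a = classify_orf("ATGGCGGCGGCGTAA", coding_lp, noncoding_lp)
label_b, score_b = classify_orf("ATATATATATTTAAA", coding_lp, noncoding_lp)
print(label_a, label_b)
```

In a sketch like this, the magnitude of the score gives a natural way to order predictions, which loosely mirrors the abstract's point about ranking predicted genes by confidence. How 4M actually quantifies statistical significance is not specified in the abstract.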

Published in:

IEEE Transactions on Automatic Control (Volume: 53, Issue: Special Issue)