Metagenomic sequencing is becoming a powerful method for exploring environmental organisms without isolation and cultivation. The genomic sequence data generated by this technology are growing explosively, while computational methods for their analysis are still urgently needed. One of the first and most important steps is exhaustive gene prediction. Because metagenomic sequences are short, anonymous DNA fragments, their assembly usually has no fixed end point at which complete genomes are obtained and, moreover, is often not available. This makes annotation more complicated than for complete genomes. Here, we present a newly developed SVM-based algorithm that comprises a supervised universal model and a data-specific novel model. It uses entropy density profiles of codon usage, translation initiation signal scores and open reading frame length for model training. Tests on fixed-length artificial shotgun sequences of 700 bp showed an average sensitivity of 94.7% and an average specificity of 94.9%, indicating that our method achieves higher overall performance than the best current gene prediction methods. When applied to two metagenomic samples from the human gut community, it predicts thousands of additional genes. Furthermore, compared with other gene predictors, our algorithm predicts the most potential novel genes.
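To make the codon-usage feature more concrete, the following is a minimal sketch of how an entropy density profile could be computed for a single candidate open reading frame. The abstract does not state the formula, so the definition used here is an assumption: each codon with frequency p_i contributes s_i = -(p_i log p_i) / H, where H is the Shannon entropy of the codon frequencies. The function names `split_codons` and `entropy_density_profile` are hypothetical and are not taken from the published software.

```python
import math
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def split_codons(orf):
    """Split an in-frame sequence into complete codons (a trailing partial codon is dropped)."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    return [orf[i:i + 3] for i in range(0, usable, 3)]


def entropy_density_profile(orf):
    """Return {codon: s_i} under the ASSUMED definition s_i = -(p_i * ln p_i) / H.

    Stop codons and codons containing ambiguous bases are ignored.
    An empty dict is returned when the profile is undefined
    (no usable codons, or only one distinct codon so that H = 0).
    """
    counts = Counter(
        c for c in split_codons(orf)
        if c not in STOP_CODONS and set(c) <= set("ACGT")
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    freqs = {c: n / total for c, n in counts.items()}
    entropy = -sum(p * math.log(p) for p in freqs.values())
    if entropy == 0:
        return {}
    return {c: -p * math.log(p) / entropy for c, p in freqs.items()}


# Toy ORF used only for illustration; its length in bp is a second simple feature.
toy_orf = "ATGAAAGGGAAATAA"
profile = entropy_density_profile(toy_orf)
orf_length = len(toy_orf)
```

Under this assumed definition the profile values sum to 1 whenever H is positive, so fragments of different lengths yield comparable vectors. In a full pipeline, such a vector would be combined with the translation initiation signal score and the ORF length before being passed to an SVM classifier; those steps are not shown here.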