As biomedical research advances, it has become widely accepted that genetic variation plays a critical role in the pathogenesis of inherited human diseases. Nonsynonymous single nucleotide polymorphisms (nsSNPs), an important type of genetic variation, occur in protein-coding regions and lead to amino acid substitutions that can alter protein structure and function and potentially cause disease. Distinguishing disease-associated nsSNPs from neutral ones with machine learning approaches is therefore important for understanding the genetic basis of human diseases and, in turn, for improving their prevention, diagnosis, and treatment. In this paper, we formulate the identification of disease-associated nsSNPs as a binary classification problem. Using a set of 26 numeric features derived from protein sequence information, we compare the performance of five popular ensemble learning approaches (AdaBoost, LogitBoost, random forests, L2 boosting, and stochastic gradient regression) with that of two traditional classification methods (decision trees and support vector machines) on this problem. Systematic validation demonstrates that the ensemble learning approaches are in general more effective at identifying disease-associated nsSNPs, and that LogitBoost achieves the highest performance among all the methods compared.
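To make the comparison between a single weak classifier and a boosted ensemble concrete, the following is a minimal sketch of AdaBoost with decision stumps on a binary task with 26 numeric features. It is not the authors' pipeline: the data are synthetic, the labelling rule (the sign of the sum of the first three features), the fixed threshold grid, and all function names are assumptions made purely for illustration, and AdaBoost stands in for the whole family of ensemble methods named above.

```python
import math
import random

# Coarse candidate thresholds; assumes features are roughly standardized.
THRESHOLDS = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]

def make_data(n, n_features=26, seed=0):
    """Synthetic stand-in data: +1 = 'disease-associated', -1 = 'neutral'."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n)]
    # Hypothetical rule: the label depends jointly on the first three features.
    y = [1 if row[0] + row[1] + row[2] > 0 else -1 for row in X]
    return X, y

def stump_predict(stump, row):
    """A stump is (feature index, threshold, polarity)."""
    j, thr, polarity = stump
    return polarity if row[j] > thr else -polarity

def fit_stump(X, y, w):
    """Return the stump with the lowest weighted error (w must sum to 1)."""
    best_stump, best_err = None, float("inf")
    for j in range(len(X[0])):
        for thr in THRESHOLDS:
            # Weighted error of "predict +1 when feature j exceeds thr".
            err = sum(wi for row, yi, wi in zip(X, y, w)
                      if (1 if row[j] > thr else -1) != yi)
            # Flipping the polarity turns an error of err into 1 - err.
            for polarity, e in ((1, err), (-1, 1.0 - err)):
                if e < best_err:
                    best_stump, best_err = (j, thr, polarity), e
    return best_stump, best_err

def fit_adaboost(X, y, rounds=50):
    """Discrete AdaBoost: reweight the samples after every stump."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump, err = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, stump))
        # Misclassified samples gain weight, correctly classified ones lose it.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def ensemble_predict(ensemble, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score > 0 else -1

def accuracy(predict, X, y):
    return sum(predict(row) == yi for row, yi in zip(X, y)) / len(y)

X_train, y_train = make_data(300, seed=1)
X_test, y_test = make_data(300, seed=2)
single, _ = fit_stump(X_train, y_train, [1.0 / len(X_train)] * len(X_train))
boosted = fit_adaboost(X_train, y_train)
print("single stump:", accuracy(lambda r: stump_predict(single, r), X_test, y_test))
print("AdaBoost    :", accuracy(lambda r: ensemble_predict(boosted, r), X_test, y_test))
```

Because the synthetic label depends on three features at once, a single stump can use only one of them, whereas the weighted vote can combine several. This is one intuition for why ensembles may outperform a single tree, although the result on this toy data says nothing about the nsSNP benchmark itself.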