This study presents a new machine learning approach for sequence-based prediction of DNA-binding residues in proteins. The approach used both the labeled data instances collected from the available structures of protein-DNA complexes and the abundant unlabeled data found in protein sequence databases. The evolutionary information contained in the unlabeled sequence data was represented as position-specific scoring matrices (PSSMs) and several new descriptors. The sequence-derived features were used to train random forest (RF) classifiers, which can handle a large number of input variables and avoid model overfitting. Using evolutionary information significantly improved classifier performance. The RF classifier was further evaluated on a separate test dataset. The results suggest that the RF-based approach predicts DNA-binding residues more accurately than the methods of previous studies.
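As a rough illustration of how a position-specific scoring matrix can be turned into per-residue input vectors for a classifier, the sketch below concatenates the matrix rows in a window centered on each residue. It is a minimal sketch under stated assumptions, not the study's feature pipeline: the sliding-window encoding, the window size, the zero padding at the sequence ends, and the names `window_features` and `toy_pssm` are all hypothetical, and the toy matrix has 3 columns only to keep the example short.

```python
from typing import List


def window_features(pssm: List[List[float]], window: int = 5) -> List[List[float]]:
    """Build one feature vector per residue from a PSSM.

    pssm:   one row per residue, all rows the same length.
    window: odd number of residues per vector (an assumed value,
            not taken from the abstract).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")

    half = window // 2
    n_cols = len(pssm[0]) if pssm else 0
    pad = [0.0] * n_cols  # assumed padding for positions beyond the sequence ends

    features = []
    for i in range(len(pssm)):
        vec: List[float] = []
        # Concatenate the rows of the neighbours on both sides of residue i.
        for j in range(i - half, i + half + 1):
            row = pssm[j] if 0 <= j < len(pssm) else pad
            vec.extend(row)
        features.append(vec)
    return features


# Toy PSSM: 4 residues, 3 score columns (illustrative values only).
toy_pssm = [
    [1.0, 0.0, -1.0],
    [2.0, 1.0, 0.0],
    [0.0, -2.0, 3.0],
    [-1.0, 1.0, 1.0],
]
toy_features = window_features(toy_pssm, window=3)
```

Each vector in `toy_features`, paired with a binding or non-binding label for its central residue, could then be passed to any random forest implementation for training.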