We describe a support vector regression (SVR) approach to predicting the accessible surface area (ASA) of a protein from its sequence. Our approach encodes each protein residue as a vector of amino acid propensities derived from a multiple alignment of the subject protein with homologous proteins. The vector consists of the log-likelihood ratios of each of the twenty amino acids in the residue's multiple alignment column. We trained an SVR model on a reference set of proteins of known structure and, hence, known ASA. Each training sample consists of the fifteen log-likelihood vectors in a window of fifteen residues centered on the target residue, along with the "true" ASA value computed from the known structure. Applying the model to a protein of unknown structure requires only the subject protein's sequence. Our method uses PSI-BLAST to simultaneously determine a set of (putative) homologs and compute the log-likelihood vectors needed to encode the subject protein. We show that this method provides substantially improved accuracy in predicting ASA when compared with an earlier method.
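The window encoding described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `encode_windows`, the `profile` input (one row of twenty log-likelihood ratios per residue, such as might be read from a PSI-BLAST profile), and the zero-padding at the chain termini are all assumptions made for illustration, since the abstract does not say how windows that run past the ends of the sequence are handled.

```python
from typing import List, Sequence

WINDOW = 15    # window width stated in the abstract
N_AMINO = 20   # one log-likelihood ratio per amino acid


def encode_windows(profile: Sequence[Sequence[float]],
                   window: int = WINDOW) -> List[List[float]]:
    """Return one flattened feature vector (window * 20 values) per residue.

    `profile` is a hypothetical per-residue matrix of log-likelihood
    ratios, with one row of 20 values for each residue in the sequence.
    """
    half = window // 2
    # Assumption: positions beyond either terminus are filled with zeros.
    pad = [0.0] * N_AMINO
    features = []
    for i in range(len(profile)):
        row: List[float] = []
        # Concatenate the vectors of the residues from i - half to i + half.
        for j in range(i - half, i + half + 1):
            row.extend(profile[j] if 0 <= j < len(profile) else pad)
        features.append(row)
    return features
```

Under these assumptions each residue yields a 300-dimensional vector (15 positions times 20 values). For the reference proteins, each vector would be paired with the ASA value computed from the known structure and passed to any SVR implementation for training.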
Published in:
Engineering in Medicine and Biology Society, 2004. IEMBS '04. 26th Annual International Conference of the IEEE
(Volume: 2)
Date of Conference: 1-5 Sept. 2004