The accurate analysis of the proteome using mass spectrometry plays an important role in understanding many of the physiological processes that occur in an organism and has become a standard tool for identifying proteins. Protein identification is challenging and relies on bioinformatics tools to characterize proteins via their proteolytic peptides, which are identified from the characteristic mass spectra generated when their ions fragment in the gas phase within the mass spectrometer. An important problem in the accurate identification of peptides by mass spectrometry is whether a particular peptide is likely to be detected in a standard proteomics experiment. This can depend on a number of factors, including the physicochemical properties of the peptide itself and the mass spectrometer used in the experiment. We applied a machine learning approach to find peptide fragmentation patterns based on different properties of the peptide sequence, and we are able to predict which peptides are likely to be detected in a standard proteomics experiment. Protein identification is made even more challenging by partial enzymatic cleavage: proteases frequently fail to digest proteins to their limit peptides, resulting in peptides with internal missed cleavage sites. Typically, up to one of these "missed cleavages" is considered by bioinformatics search tools, usually after in silico digestion of the proteome by trypsin. Using rules derived from information theory, we were able to "mask" candidate protein databases so that confidently predicted missed cleavage sites need not be considered for in silico digestion. We show that this leads to an improvement in database searching with two different search engines.
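
To make the idea of in silico digestion with missed cleavages concrete, the following is a minimal Python sketch, not the implementation used in this work. It assumes the commonly used trypsin rule (cleave after K or R unless the next residue is P). The `masked` argument is a hypothetical representation of masking, given here as a set of residue indices at which cleavage is suppressed. The information-theoretic rules that would produce such a mask are not reproduced, and the example sequence is made up.

```python
def cleavage_sites(protein, masked=frozenset()):
    """Return 0-based indices i where the chain is cut after residue i.

    Assumed rule: trypsin cleaves C-terminal to K or R, but not before P.
    Indices listed in `masked` (hypothetical) are never treated as sites.
    """
    sites = []
    for i, residue in enumerate(protein[:-1]):  # no cut after the last residue
        if residue in "KR" and protein[i + 1] != "P" and i not in masked:
            sites.append(i)
    return sites


def digest(protein, max_missed=1, masked=frozenset()):
    """Return peptides containing up to `max_missed` internal missed cleavages."""
    # Boundaries between limit peptides, including both protein termini.
    cuts = [0] + [i + 1 for i in cleavage_sites(protein, masked)] + [len(protein)]
    peptides = []
    for start in range(len(cuts) - 1):
        for missed in range(max_missed + 1):
            end = start + 1 + missed
            if end >= len(cuts):
                break
            peptides.append(protein[cuts[start]:cuts[end]])
    return peptides


if __name__ == "__main__":
    example = "MAKGLRPEKTTR"  # made-up sequence for illustration
    print(digest(example, max_missed=0))              # limit peptides only
    print(digest(example, max_missed=1))              # allow one missed cleavage
    print(digest(example, max_missed=0, masked={2}))  # suppress the site after K at index 2
```

In this sketch, masking a site removes it from the list of cut points before the peptides are enumerated, so a masked database can be digested with `max_missed=0` and still contain the longer peptide that spans the masked site.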