Conditional random fields (CRFs) are undirected probabilistic graphical models introduced for sequence labeling and segmentation problems. CRFs have several advantages over other well-understood and widely used techniques such as hidden Markov models (HMMs) and maximum entropy Markov models (MEMMs). Being conditional models, CRFs do not explicitly model the input data sequences but use feature functions (features) to incorporate the arbitrary interactions and inter-dependencies that exist in the observation sequences. The number of possible features is extremely large, up to millions, and the feature set is usually specified in advance or produced by a feature-generating scheme based on domain knowledge. This paper introduces a feature subset selection method for CRFs based on genetic algorithms, in which a population of candidate feature function subsets is evolved to maximize CRF performance. The method was experimentally validated on the well-known bioinformatics problem of protein phosphorylation site prediction, phosphorylation being one of the most important protein modification mechanisms.
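The general idea of evolving feature subsets can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper: the function names, the bit-mask encoding, the tournament selection, one-point crossover, elitism, and all parameter values are assumptions chosen for illustration. In particular, `toy_fitness` is a hypothetical stand-in; in a real setting the fitness of a subset would be obtained by training a CRF restricted to the selected feature functions and measuring its performance on held-out data.

```python
import random


def evolve_feature_subsets(n_features, fitness, pop_size=20, generations=30,
                           mutation_rate=0.05, seed=0):
    """Evolve 0/1 masks over n_features (n_features >= 2).

    Returns (best_mask, best_score) seen during the run.
    """
    rng = random.Random(seed)
    # Each individual is a list of flags: 1 keeps a feature function, 0 drops it.
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best_mask, best_score = None, float("-inf")

    for _ in range(generations):
        scores = [fitness(ind) for ind in population]
        # Track the best subset found so far.
        for ind, score in zip(population, scores):
            if score > best_score:
                best_mask, best_score = ind[:], score

        def tournament():
            # Binary tournament: pick two individuals, keep the fitter one.
            a, b = rng.sample(range(pop_size), 2)
            return population[a] if scores[a] >= scores[b] else population[b]

        next_population = [best_mask[:]]  # elitism: carry the best forward
        while len(next_population) < pop_size:
            p1, p2 = tournament(), tournament()
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation, applied independently to each position.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            next_population.append(child)
        population = next_population

    # Score the final generation as well.
    for ind in population:
        score = fitness(ind)
        if score > best_score:
            best_mask, best_score = ind[:], score
    return best_mask, best_score


# Hypothetical stand-in for "CRF performance": rewards three arbitrarily
# chosen "informative" features and penalises subset size.
INFORMATIVE = {0, 3, 5}


def toy_fitness(mask):
    useful = sum(mask[i] for i in INFORMATIVE)
    return useful - 0.1 * sum(mask)


best_mask, best_score = evolve_feature_subsets(10, toy_fitness)
print(best_mask, best_score)
```

Because every fitness evaluation in the real setting involves training a CRF, the population size and number of generations largely determine the cost of the search; the small values used here are only suitable for the toy fitness.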