This paper describes an algorithm that uses evolved regular expressions to characterize and predict human gene splice sites without any prior knowledge. In contrast to previous pattern-based approaches to splice site detection, the patterns to be matched are not known in advance but are discovered through supervised learning. We use a genetic programming based system, PerlGP, to evolve regular expressions, together with suitable window lengths over a long sequence, such that the evolved expressions can effectively characterize and predict splice junctions. The gene splicing process is currently too complex to be fully understood, and the widely accepted consensus sequences reflect only partial statistical information around splice sites; they fall well short of defining a splice site. Our evolved regular expressions, however, may shed new light on the underlying rules that define splice sites. Experimental results demonstrate that the evolved regular expressions characterize splice junctions accurately and can also serve as a predictor of whether a sequence containing GT/AG is a splice site. The results further show that the performance of this approach in predicting human gene splice junctions is competitive with that of some traditional methods.
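As a rough illustration of how a regular expression and a window can act together as a predictor, the following Python sketch scans a sequence for a candidate dinucleotide (GT, the canonical donor dinucleotide, by default), cuts a fixed window around each occurrence, and reports the positions whose window matches the expression. The function names, the window sizes, and the pattern are assumptions made for this example only: the pattern is hand-written and loosely consensus-like, not an expression evolved by PerlGP, and the sketch does not reproduce the system described here.

```python
import re


def candidate_windows(seq, dinuc="GT", left=3, right=6):
    """Yield (position, window) for each occurrence of `dinuc`.

    The window spans `left` bases before the dinucleotide and `right`
    bases starting at it. Occurrences too close to either end of the
    sequence to give a full window are skipped. Window sizes here are
    illustrative assumptions, not evolved values.
    """
    seq = seq.upper()
    start = seq.find(dinuc)
    while start != -1:
        lo, hi = start - left, start + right
        if lo >= 0 and hi <= len(seq):
            yield start, seq[lo:hi]
        # Continue searching one base further so overlapping hits are found.
        start = seq.find(dinuc, start + 1)


def predict_sites(seq, pattern, dinuc="GT", left=3, right=6):
    """Return positions of `dinuc` whose whole window matches `pattern`."""
    regex = re.compile(pattern)
    return [
        pos
        for pos, window in candidate_windows(seq, dinuc, left, right)
        if regex.fullmatch(window)
    ]


# Hypothetical, hand-written pattern used only to exercise the code.
EXAMPLE_PATTERN = r"[AC]AGGT[AG]AGT"

if __name__ == "__main__":
    example = "TTCAGGTAAGTCCGTTTTTTAA"
    print(predict_sites(example, EXAMPLE_PATTERN))
```

In a supervised setting, a candidate expression of this kind could be scored by how well its predictions agree with labelled true and false sites, which is the sort of feedback a genetic programming system needs in order to evolve better expressions.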