We present the application of a regularized least-squares based algorithm, known as greedy RLS, to perform wrapper-based feature selection on an entire genome-wide association dataset. Wrapper methods were previously thought to be computationally infeasible for studies of this type. The running time of the method grows linearly in the number of training examples, the number of features in the original data set, and the number of selected features. Moreover, we show how it can be further accelerated using parallel computation on multi-core processors. We tested the method on the Wellcome Trust Case Control Consortium's (WTCCC) Type 2 Diabetes - UK National Blood Service dataset, consisting of 3,382 subjects and 404,569 single nucleotide polymorphisms (SNPs). Our method is capable of high-speed feature selection, selecting the top 100 predictive SNPs in under five minutes on a high-end desktop, and it outperforms typical filter approaches in terms of predictive performance.
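To make the idea of wrapper-based selection concrete, the following is a minimal sketch of greedy forward selection around a regularized least-squares (ridge) learner. It is an illustration under stated assumptions, not the greedy RLS algorithm itself: it refits the model for every candidate feature, so it does not attain the linear running time described above. The selection criterion (leave-one-out squared error), the function names, and the regularization parameter `lam` are all assumptions made for this example.

```python
import numpy as np


def loo_squared_error(X, y, lam):
    """Leave-one-out squared error of ridge regression without an intercept.

    Uses the standard hat-matrix shortcut: the LOO residual of example i
    equals the training residual divided by (1 - H[i, i]).
    """
    k = X.shape[1]
    gram = X.T @ X + lam * np.eye(k)          # regularized Gram matrix
    hat = X @ np.linalg.solve(gram, X.T)      # H = X (X^T X + lam I)^-1 X^T
    residuals = y - hat @ y
    loo_residuals = residuals / (1.0 - np.diag(hat))
    return float(np.sum(loo_residuals ** 2))


def greedy_forward_selection(X, y, n_select, lam=1.0):
    """Naive wrapper: repeatedly add the feature that most lowers LOO error.

    Hypothetical helper for illustration; each candidate is evaluated by a
    full refit, which is far slower than an incremental update scheme.
    """
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(n_select):
        best_j, best_err = None, np.inf
        for j in remaining:
            err = loo_squared_error(X[:, selected + [j]], y, lam)
            if err < best_err:
                best_j, best_err = j, err
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```

In contrast to a filter approach, which scores each feature in isolation, every candidate here is judged by the predictive performance of the model that would result from adding it to the features already chosen.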