The γ-OMP Algorithm for Feature Selection With Application to Gene Expression Data

Feature selection for predictive analytics is the problem of identifying a minimal-size subset of features that is maximally predictive of an outcome of interest. To be applicable to molecular data, feature selection algorithms need to scale to tens of thousands of features. In this paper, we propose γ-OMP, a generalisation of the highly scalable Orthogonal Matching Pursuit feature selection algorithm. γ-OMP can handle (a) various types of outcomes, such as continuous, binary, nominal, and time-to-event, (b) discrete (categorical) features, (c) different statistical-based stopping criteria, (d) several predictive models (e.g., linear or logistic regression), (e) various types of residuals, and (f) different types of association. We compare γ-OMP against LASSO, a prototypical, widely used algorithm for high-dimensional data. On both simulated data and several real gene expression datasets, γ-OMP is on par with, or outperforms, LASSO in binary classification (case-control data), regression (quantified outcomes), and time-to-event data (censored survival times). γ-OMP is based on simple statistical ideas, it is easy to implement and to extend, and our extensive evaluation shows that it is also effective in bioinformatics analysis settings.


INTRODUCTION
The problem of feature selection (hereafter FS), in the predictive analytics context, is the problem of identifying a minimal-size, optimally predictive subset S out of available features F for an outcome Y (calligraphic letters denote random variables). To solve this problem, we assume we are given a dataset X with n rows that correspond to samples and columns that correspond to features, such that each row is a vector x of values of X, and y is the vector of the corresponding values of Y for each sample.
FS algorithms for use in bioinformatics are required to scale up to tens or even hundreds of thousands of features (e.g., multi-omics data) and to maintain high quality (in terms of features selected and predictive performance), even when sample sizes are relatively small (e.g., a few hundred). Further, they have to be general enough to handle binary outcomes (e.g., classifying between disease cases and controls), multi-class outcomes (e.g., classifying between disease sub-types), continuous outcomes (e.g., quantitative traits such as height), and censored time-to-event outcomes (e.g., time to death, complications, relapse, metastasis). In addition, while features may often be only continuous (e.g., gene expressions), an FS algorithm should allow for discrete features too, in order to also consider clinical or other genetic data (e.g., single nucleotide polymorphisms).
The FS literature is vast (see some recent reviews in [1], [5], [6], [10], [43]); in this paper, however, we focus only on the most prominent algorithms that meet the requirements above, namely, scalability in the number of features and the ability to generalize to several types of outcomes and features. Most such high-dimensional algorithms are greedy in their selection (see [4], [16] for exceptions). Specifically, we focus on the Orthogonal Matching Pursuit (OMP) algorithm, a well-known algorithm in the signal processing literature [11], [12], [38], [41], and generalize it in six directions. This generalization is named γ-OMP. γ-OMP can treat numerous types of outcomes while allowing for numerous regression models to be used. It is able to accept statistical-based stopping criteria and works with various types of residuals produced by the regression models. In addition, the Pearson correlation is expanded to broader types of associations, and finally γ-OMP allows for the inclusion of discrete features.
We compare γ-OMP with LASSO [47], perhaps the most widely used FS algorithm in bioinformatics, machine learning and statistics. LASSO is computationally scalable to the typical feature sizes in bioinformatics, exhibits high predictive quality, and has been generalized to different types of outcomes (multivariate, binary, multinomial, ordinal, time-to-event, clustered), and also allows for discrete features [57]. In addition, LASSO is quite similar in spirit to γ-OMP; while LASSO solves a global optimization problem, this problem is solved in a greedy way similar to γ-OMP [13]. The results of the comparative evaluation show that γ-OMP is on par with, or outperforms, LASSO in several aspects. Specifically, our results are summarised as follows:
1) In simulated data where the FS solution is known, γ-OMP consistently returns fewer false positive features. It also outperforms LASSO in terms of predictive performance, for the same number of selected features.
2) In real gene expression data, γ-OMP produces predictive models of similar performance to LASSO but more parsimonious. When contrasted with LASSO on an equal footing (roughly equal numbers of selected features), γ-OMP again outperforms LASSO in terms of predictive performance.
3) γ-OMP is computationally more efficient than LASSO; it scales up to a higher number of dimensions and/or larger sample sizes than LASSO.

A UNIFYING VIEW OF GREEDY FEATURE SELECTION ALGORITHMS
For high-dimensional data, most FS algorithms rely on a greedy strategy to include the next feature in S, or to remove one from S. A first family of algorithms fit a model with all selected features, and identify the best next feature by building new models for each feature under consideration. Examples include the Forward Search, the Forward-Backward Search, and the recent Forward-Backward with Early Dropping (FBED) algorithm [7]. A second family of algorithms, to which OMP and γ-OMP belong, select the next feature as the one that maximizes the pairwise association with the residuals of the current model.

OMP AND LASSO ALGORITHMS
We now present the two algorithms most related to the current work, namely OMP, which forms the basis for the proposed γ-OMP, and LASSO, against which we compare γ-OMP.

OMP algorithm
OMP is a forward search algorithm that was first proposed for continuous outcomes in the context of signal reconstruction [11], [12], [41]. Conceptually, it minimizes a loss function whose penalty is the L0-norm of the regression coefficients (βs), i.e., the number of non-zero βs, and the optimization is performed in a greedy way. OMP assumes a continuous outcome y and a data matrix X of continuous features (its pseudo-code is in Algorithm 1). The ith column of X is denoted by X_i, while X_S denotes the matrix with all selected features. The algorithm initialises the current selection S ← ∅ and the residuals r ← y. At each iteration, the feature with the largest (in absolute value) Pearson linear correlation with r is selected for inclusion. If X is standardised, then it suffices to compute ⟨r, X_i⟩, the inner product of the two vectors. After selection, a new least squares regression model is fitted, resulting in new β coefficients, which are used to update the residuals. The procedure stops when the L2-norm (alternatively, the L1-norm) of the residuals is below some threshold ε.

Algorithm 1
The OMP algorithm
1: Input: Outcome values y, dataset X
2: Output: A subset S ⊆ X of selected features.
3: S ← ∅
4: r ← y
5: while ‖r‖₂ > ε do
6:   for each feature X_i ∉ S, compute |⟨r, X_i⟩|
7:   i* ← argmax_i |⟨r, X_i⟩|
8:   S ← S ∪ {X_i*}
9:   β ← least squares regression of y on X_S
10:  Update residuals r ← y − X_S · β
11: end while
12: return S
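For concreteness, the following is a minimal R sketch of Algorithm 1, assuming a numeric outcome vector y and a numeric feature matrix X; the tol and kmax arguments are our own additions (the residual-norm threshold ε and a safeguard on the number of iterations), not part of the original pseudo-code.

omp <- function(y, X, tol = 1e-6, kmax = 50) {
  S <- integer(0)
  r <- y
  Xs <- scale(X)                        # standardised copy: |<r, X_i>| ranks the correlations
  while (sqrt(sum(r^2)) > tol && length(S) < kmax) {
    score <- abs(crossprod(Xs, r))      # inner product of the residuals with each feature
    score[S] <- -Inf                    # never reselect an already chosen feature
    S <- c(S, which.max(score))
    beta <- stats::.lm.fit(X[, S, drop = FALSE], y)$coefficients
    r <- y - X[, S, drop = FALSE] %*% beta   # refit and update the residuals
  }
  S
}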

LASSO algorithm
Least Absolute Shrinkage and Selection Operator (LASSO) [47] is perhaps the most common FS algorithm, owing much of its success to its computational efficiency and high learning quality with both low and medium sample sizes. LASSO is formulated around the penalized maximization (regularization) of an objective function (usually the log-likelihood), which depends upon the type of the outcome variable. The level of penalization affects the magnitude of the regression coefficients, forcing them to shrink towards zero; hence feature selection and regularization of the coefficients are performed simultaneously.
For binary outcomes ({0, 1}) modelled via logistic regression, the objective function to be minimised over β is
$$-\frac{1}{n}\sum_{i=1}^{n}\left[y_i \mathbf{x}_i^\top\boldsymbol{\beta} - \log\left(1 + e^{\mathbf{x}_i^\top\boldsymbol{\beta}}\right)\right] + \lambda\|\boldsymbol{\beta}\|_1,$$
whereas with continuous outcomes and linear regression the objective function is
$$\frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \mathbf{x}_i^\top\boldsymbol{\beta}\right)^2 + \lambda\|\boldsymbol{\beta}\|_1.$$
With time-to-event (strictly positive continuous) data, using the partial log-likelihood of the Cox regression, the objective function becomes [15]
$$-\frac{1}{n}\sum_{i:\,\delta_i=1}\left[\mathbf{x}_i^\top\boldsymbol{\beta} - \log\sum_{j\in R_i} e^{\mathbf{x}_j^\top\boldsymbol{\beta}}\right] + \lambda\|\boldsymbol{\beta}\|_1,$$
where δ_i indicates an observed event, R_i is the set of indices j with y_j ≥ t_i (those at risk at time t_i), and β = (β_1, ..., β_p) denotes the vector of regression coefficients. In all cases, λ is a positive-valued hyper-parameter that must be tuned. LASSO was first solved effectively for continuous outcomes in 2004 by Efron et al. [13], using a modified version of Least Angle Regression (LARS), borrowing ideas from OMP. In 2010, though, Friedman et al. [15] proposed a coordinate descent method that is applicable to a wider range of outcomes.
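As an illustration, all three objectives can be fitted with the glmnet R package [15] cited above; this is a hedged usage sketch, with X, y_binary, y_cont, time and status as placeholder names, and Surv() supplied by the survival package.

library(glmnet)
library(survival)
# Logistic-regression LASSO for a binary outcome; lambda tuned by cross-validation.
fit_bin <- cv.glmnet(X, y_binary, family = "binomial")
# Linear-regression LASSO for a continuous outcome.
fit_lin <- cv.glmnet(X, y_cont, family = "gaussian")
# Cox-regression LASSO for right-censored time-to-event data.
fit_cox <- cv.glmnet(X, Surv(time, status), family = "cox")
# The non-zero coefficients at the best lambda are the selected features.
coef(fit_bin, s = "lambda.min")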

THE γ-OMP ALGORITHM
γ-OMP is a greedy feature selection algorithm that is based on simple ideas and a simple heuristic for selecting the next best feature. Yet, in this evaluation we show that it is not only scalable, but also competitive in terms of predictive performance. γ-OMP is easily implemented and customized. In contrast, LASSO's optimization problem is not convex for some outcomes or predictive models, and requires task-specific algorithms to generalize. For example, for time-course data, LASSO is non-convex and non-scalable [48].
The γ-OMP algorithm, presented in Algorithm 2, generalises OMP in the following directions: a) OMP has been defined for univariate/multivariate continuous and binary outcomes, using linear and logistic regression models. Quantitative traits and case-control data fall within these categories, but outcomes can also be nominal, ordinal, or time-to-event, among others. b) For each outcome type, γ-OMP can employ a suitable regression model, rather than being tied to least squares or logistic regression. c) The stopping criterion is generalised from a threshold on the norm of the residuals to statistical-based criteria, such as the log-likelihood ratio test or the change in the adjusted R². d) The raw residuals of OMP are generalised to the types of residuals each regression model produces, such as the Cox-Snell, response and martingale residuals [46]. For the case of ordinal regression we compute the probability residual [27], while another option is the surrogate residual [31]. This clearly points out that computing residuals is still an open problem for some regression models, indicating that the choice of residuals is an important aspect for a residual-based algorithm such as γ-OMP.
It is worth noting that all types of residuals are continuous. For multi-class problems, there is a residual for each sample but also for each class, and hence the residuals r form a matrix instead of a vector. The same holds for Euclidean multivariate outcomes or manifold-valued data (directional and compositional). e) OMP selects the candidate feature that maximizes the absolute linear correlation with the current residual vector. When both the residuals and the features are continuous, the Pearson or the Spearman correlation can be employed. In the case of multivariate residuals, we compute the sum of the inner products between each column of residuals and the feature. This association was also used by [32], but in a reverse order. γ-OMP can substitute the correlation with any type of pairwise association, denoted by the function Assoc in line 9 of Algorithm 2. Other correlation options include the distance correlation [44], which also captures non-linear relationships, but is computationally expensive. f) Associations between different types of features (mixed, continuous and discrete) will not be on the same scale. For instance, to examine the correlation of the residuals with a discrete feature, we propose the use of Analysis of Variance (ANOVA) [39], while for multinomial or multivariate outcomes that produce multivariate residuals, we suggest the use of Multivariate ANOVA (MANOVA) [35]. Hence, instead of comparing association values, we propose to convert all associations to p-values, which are on the same scale, by testing the hypothesis that the association is zero, thus allowing for continuous and discrete features. The residual-based heuristic then selects the feature with the smallest such p-value. Converting to p-values automatically adjusts for different degrees of freedom, for example in the case of discrete features with different numbers of values. When the sample size is relatively large, several p-values might be smaller than the machine epsilon and will numerically be computed as zero. To prevent the algorithm from choosing blindly among the relevant features, it is crucial to compute the logarithm of those p-values, as the sketch below illustrates.
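A minimal R sketch of this log-p-value heuristic for a single candidate feature follows; score_feature is our own illustrative name, handling a continuous feature via the Pearson correlation t-test and a discrete (factor) feature via the ANOVA F-test, both on the log scale to avoid underflow.

score_feature <- function(r, x) {
  n <- length(r)
  if (is.factor(x)) {
    fit <- anova(lm(r ~ x))                    # one-way ANOVA on the residuals
    stat <- fit[1, "F value"]
    df1 <- fit[1, "Df"]; df2 <- fit[2, "Df"]   # degrees of freedom adjust for the levels
    pf(stat, df1, df2, lower.tail = FALSE, log.p = TRUE)
  } else {
    rho <- cor(r, x)                           # Pearson correlation with the residuals
    stat <- abs(rho) * sqrt((n - 2) / (1 - rho^2))   # equivalent t statistic
    # two-sided p-value on the log scale, avoiding numerical zeros
    log(2) + pt(stat, df = n - 2, lower.tail = FALSE, log.p = TRUE)
  }
}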

SIMULATION STUDIES
Before moving to the empirical evaluation of γ-OMP, we present some interesting results of a small-scale simulation study using binary and continuous outcomes. For both types of outcome, we generated 50,000 features from a normal distribution and varied the sample size from 100 to 1,000 in steps of 100. The evaluation criteria are: a) the true positive rate (TPR), i.e., the percentage of relevant features selected, b) the false discovery rate (FDR), i.e., the percentage of falsely selected features, and c) the performance of the final predictive model produced by the features selected by each algorithm.
We randomly chose 10 features to linearly produce the outcome and then added Gaussian noise. Specifically, for the continuous outcome we used a high signal-to-noise ratio, equal to 32.5. This is practically a noiseless case, and with a relatively large sample size there should be no falsely selected features; a sketch of this design is given below.
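The following is our own reconstruction of this data-generating process in R, not the authors' exact script; the effect sizes in beta are assumptions, and the signal-to-noise ratio is taken as the ratio of the signal variance to the noise variance.

set.seed(1)
n <- 500; p <- 50000; k <- 10
X <- matrix(rnorm(n * p), nrow = n)    # 50,000 standard normal features
relevant <- sample(p, k)               # the 10 features that generate the outcome
beta <- runif(k, 1, 2)                 # assumed effect sizes
signal <- drop(X[, relevant] %*% beta)
sigma <- sqrt(var(signal) / 32.5)      # signal-to-noise ratio of 32.5
y <- signal + rnorm(n, sd = sigma)     # continuous outcome with Gaussian noise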
All experiments were performed on an Intel Core i5-4690K CPU @3.50GHz, 32GB RAM desktop computer.

Evaluation protocol
To assess the performance of each FS method, we perform 10-fold Cross-Validation (CV). We tune all hyper-parameters within the 10-fold CV procedure, to avoid overfitting. With the binary outcome, we used the log-likelihood ratio test (twice the difference in the log-likelihoods of two successive regression models) as the stopping criterion, and chose 10 threshold values for γ-OMP, spanning from the 95% quantile of the χ² distribution with 1 degree of freedom (3.84) up to the 99% quantile (6.63). These thresholds represent a range of critical values corresponding to different significance levels; the 1 degree of freedom is justified by the use of continuous features only.
With the continuous outcome, we used the adjusted R² (which is related to the F-test) as the stopping criterion and chose 10 equidistant stopping values, ranging from 0.05% up to 0.5%. We used the default 100 λ values for LASSO for both outcomes.
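Assuming the 10 thresholds correspond to equally spaced significance levels between 95% and 99% (the paper states only the two endpoints), they can be obtained in R as:

# Chi-squared critical values with 1 degree of freedom,
# from the 95% quantile (3.84) to the 99% quantile (6.63).
thresholds <- qchisq(seq(0.95, 0.99, length.out = 10), df = 1)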
We further assessed the predictive performance of the features selected by γ-OMP and LASSO on a fairer basis. γ-OMP performs FS only, whereas LASSO simultaneously performs FS and regularization. Following [7], we therefore also performed LASSO regularization on the features selected by γ-OMP. We chose 10 λ values, resulting in 100 (10 stopping values of γ-OMP × 10 λ values of LASSO) predictive models, the same number of predictive models produced by LASSO.
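A one-line sketch of this construction, with sel as a placeholder for the indices of the features selected by γ-OMP:

# LASSO regularisation restricted to the gamma-OMP selection, with 10 lambda
# values; 10 stopping values x 10 lambdas = 100 candidate models overall.
fit_star <- glmnet::glmnet(X[, sel, drop = FALSE], y, family = "gaussian", nlambda = 10)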

Predictive performance metrics
We used the Area Under the Curve (AUC) as the performance metric in the binary outcome scenario. The AUC equals the probability that a randomly chosen case is ranked above a randomly chosen control; it takes values between 0 and 1, where 0.5 denotes random assignment. Unlike the accuracy metric (the proportion of correctly classified samples), the AUC is not affected by the distribution of the two classes (cases and controls). We used the Mean Squared Error (MSE) as the predictive performance metric with continuous outcomes, where a zero value corresponds to perfect prediction, and random guessing occurs when the average (of the observed outcome) is used as the prediction.
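The probabilistic interpretation of the AUC above translates directly into the rank-sum (Mann-Whitney) identity; a minimal R sketch (our own helper, not from the paper), where y is 0/1 and score is the predicted probability or risk of class 1:

auc <- function(score, y) {
  r <- rank(score)                     # ranks of all predictions
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  # fraction of (case, control) pairs in which the case is ranked higher
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}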
When the sample sizes are limited to at most a few hundred, the final estimated predictive performance of the best model is optimistically biased (the performance of the chosen model is overestimated). To overcome this, we applied the bootstrap-based bias correction method [51], described below.
Upon completion of the cross-validation, the predicted values produced by all predictive models across all folds are collected in an n × M matrix P, where n is the total number of samples and M is the total number of trained models (corresponding to the different combinations of hyper-parameter values). We sample rows (predictions) of P with replacement, termed in-sample values, whereas the non-resampled rows are termed out-of-sample values. The performance of each trained model on the in-sample values is computed, and the model with the optimal performance is selected. Its performance is then computed on the out-of-sample values. The process of resampling and performance evaluation is repeated B times and the average performance (on the out-of-sample values) is returned. This estimated performance usually underestimates the true performance, but this negative bias is smaller than the optimistic bias of the uncorrected performance, i.e., the performance estimated from the cross-validation [51]. The low computational overhead makes the bootstrap-based bias correction method highly attractive compared to the computationally expensive nested cross-validation [53], while its bias correction is equally effective [51].
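A compact R sketch of this procedure under our reading of [51]; perf is an assumed higher-is-better performance function (e.g., the AUC helper above), and P is the n × M matrix of pooled out-of-fold predictions:

bbc <- function(P, y, perf, B = 1000) {
  n <- nrow(P)
  out <- numeric(B)
  for (b in seq_len(B)) {
    idx <- sample(n, n, replace = TRUE)        # in-sample rows
    oob <- setdiff(seq_len(n), idx)            # out-of-sample rows
    ins <- apply(P[idx, , drop = FALSE], 2, perf, y = y[idx])
    best <- which.max(ins)                     # pick the best configuration in-sample
    out[b] <- perf(P[oob, best], y = y[oob])   # evaluate it out-of-sample
  }
  mean(out)                                    # bias-corrected performance estimate
}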

Simulation studies results
Figure 1 presents the results for FDR and TPR. Overall, the results show the superiority of γ-OMP, in terms of FDR (lower is better) and TPR (higher is better), compared to LASSO. Especially for TPR, γ-OMP outperforms LASSO at larger sample sizes, while for smaller sample sizes their difference is not statistically significant. LASSO tends to select a higher number of irrelevant (not related to the outcome) features than γ-OMP. It has already been reported ([4], [45]) that LASSO tends to choose considerably more features than necessary, leading to an abundance of falsely selected features (false positives). On the contrary, γ-OMP achieves a 0% FDR in the continuous outcome scenario, whereas LASSO has an FDR equal to 50%. As for the TPR, with large sample sizes both γ-OMP and LASSO reach a TPR of 100%. Figure 2 shows the estimated performance of γ-OMP and LASSO for a range of sample sizes. For small to moderate sample sizes γ-OMP outperforms LASSO, while for large sample sizes they produce comparable results.

EMPIRICAL EVALUATION USING REAL GENE EXPRESSION DATA

Hastie et al. (2017) [22] performed a simulation study comparing LASSO with other competing algorithms using simulated datasets only. They concluded that LASSO was on par with, or outperformed, two state-of-the-art FS algorithms, FSR [56] and best-subset-selection [4]. The drawback of simulated data is that they do not necessarily portray the complex structure observed in real data. Borboudakis and Tsamardinos [7] compared FBED with LASSO using real high-dimensional data from various fields: biology, text mining and medicine. Their results showed that LASSO produced predictive models with higher performance at the cost of being more complex. They concluded that there is no clear winner between LASSO and FBED, and that the choice depends solely on the goal of the analysis.
Following Borboudakis and Tsamardinos [7], we also conducted extensive experiments with real, publicly available, pre-processed gene expression data from BioDataome [25]; the dataset names can be found in the Appendix, along with the estimated predictive performance of each algorithm for each scenario. For the case-control (and the continuous) outcome we searched for datasets with at least 100 samples, while for the time-to-event outcome we included datasets with at least 70 samples, of which no more than 50% were censored. We have implemented γ-OMP in the R package MXM [24], [50], while for LASSO with continuous features we employed the R package glmnet [15]. Finally, for the case of both continuous and discrete features, we used the R package gglasso [57], where Group LASSO (GLASSO) has been implemented.

The three outcome scenarios
We considered three different types of outcome variables: binary (case-control), continuous, and (right-censored) survival times (time-to-event). Censoring occurs when we have limited information about an individual's survival time; for example, the individual might have died of another cause or dropped out while the study was in progress.
• Case-control outcome. We focused on unmatched case-control gene expression data. The goal is to identify the minimal subset of genes that best discriminates between two classes. For γ-OMP we employed logistic regression and, similarly to Lozano et al. (2011) [32], we computed the raw residuals e_i = y_i − ŷ_i. In this scenario we investigated the performance of γ-OMP and LASSO using 53 gene expression datasets, with feature (probeset) sizes of the order of tens of thousands of features (17,000, 22,000, 33,000, 45,000 and 54,000).
• Continuous outcome. The next scenario is when the outcome of interest takes values in the continuum, for instance the BMI index. In this case, γ-OMP employed a linear regression model, again computing the raw residuals. We used the same 53 datasets as in the previous scenario; the outcome here is the feature (probeset) most correlated with the binary outcome.
• Time-to-event outcome. The event of interest can be death (as in our case), a disease relapse, or in general any time-related event. The aim is to identify the subset of features, e.g., genes, most correlated with the survival time. Both γ-OMP and LASSO employed the Cox proportional hazards model. This is the only outcome for which we did not compute the raw residuals but the more appropriate martingale residuals [46], due to the nature of the problem. We highlight that this is the first time that a variant of OMP treats survival outcomes. For this outcome scenario we used 10 gene expression datasets, whose dimensionality spans from a few thousand to tens of thousands of features. (This very small number of datasets and the peculiarity of this type of outcome are the two main reasons we did not perform simulation studies for it.)

Types of features
γ-OMP, as mentioned in Algorithm 2, is able to treat not only continuous, but also discrete features, via the computation of p-values from the appropriate hypothesis test, depending on the nature of the feature. Hence, we performed two types of experimental evaluation. The first used the original datasets, which contain continuous features only, whereas for the second we introduced discrete features: half of the features were randomly selected and discretised according to their 33% and 66% quantiles, as sketched below.
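A small R sketch of this discretisation (discretise is our own helper name):

# Cut one feature at its 33% and 66% quantiles into a 3-level factor.
discretise <- function(x) {
  q <- quantile(x, probs = c(0.33, 0.66))
  cut(x, breaks = c(-Inf, q, Inf), labels = c("low", "mid", "high"))
}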
Selecting among discrete features using OMP has been treated by Group OMP [32], [33], [34], which was devised for selecting groups of features, for instance when discrete features are binarised (to form dummy variables). However, this ignores the number of levels (values) of the discrete features, or equivalently the dimensionality of the group of dummy variables. On the contrary, our proposed approach (ANOVA) accounts for the different number of levels of each discrete feature. Moreover, the computation of the relevant p-values ensures that the selection among continuous and discrete features is on a more equal basis.

Evaluation protocol
We conducted a 10-fold cross-validation for the case-control and the continuous outcomes, but an 8-fold cross-validation for the time-to-event outcome, due to the relatively small sample sizes. We also estimated the performance of LASSO when it selects roughly as many features as γ-OMP, in order to compare them on a more equal basis; this is denoted by LASSO*.

Predictive performance metrics and number of selected features
We again used the AUC as the performance metric in the case-control outcome scenario and the MSE for the continuous outcome scenario. For the time-to-event outcomes we used the concordance index (C-index) [21] as the performance metric for model assessment. The C-index expresses the probability that, for a pair of randomly chosen comparable samples, the sample with the higher risk prediction will be the first to experience the event (e.g., death). It measures the percentage of pairs of subjects correctly ordered by the model in terms of their expected survival time. A model ordering pairs at random (without use of any feature) is expected to have a C-index of 0.5, while perfect ranking would lead to a C-index of 1. When there are no censored values, the C-index is equivalent to the AUC.
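A naive R sketch of the C-index described above (an O(n²) version that ignores ties in the event times; cindex is our own helper name, with time the observed times, status 1 for an event, and risk the model's risk score, higher meaning shorter expected survival):

cindex <- function(time, status, risk) {
  num <- den <- 0
  n <- length(time)
  for (i in 1:n) for (j in 1:n) {
    # a pair (i, j) is comparable if i's event is observed and occurs earlier
    if (status[i] == 1 && time[i] < time[j]) {
      den <- den + 1
      if (risk[i] > risk[j]) num <- num + 1        # concordant pair
      else if (risk[i] == risk[j]) num <- num + 0.5 # tied risk counts half
    }
  }
  num / den
}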
It is often the case that the primary goal of FS, especially in biological domains, is to identify the relevant features that assist in interpretability and knowledge discovery. Hence, we also examined the number of features each algorithm selected and related it to their predictive performance, aiming at both parsimonious and highly accurate predictive models.

Statistical evaluation of the predictive performance and the number of selected features
For each evaluation criterion separately, we calculated the differences between γ-OMP and LASSO and computed the p-value using 999 permutations. We repeated this process 1,000 times and reported the average p-value, as a means to draw safer conclusions.
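A hedged R sketch of one such permutation test, reconstructed as a paired sign-flip test on the per-dataset differences (the exact scheme used in the paper may differ):

perm_pvalue <- function(d, R = 999) {    # d: vector of per-dataset differences
  obs <- mean(d)
  # under the null, each difference is equally likely to have either sign
  perm <- replicate(R, mean(d * sample(c(-1, 1), length(d), replace = TRUE)))
  (1 + sum(abs(perm) >= abs(obs))) / (R + 1)   # two-sided p-value
}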

Computational efficiency
We measured the computational efficiency of each algorithm during the 8- or 10-fold cross-validation. We computed the speed-up factor of γ-OMP over LASSO (the computational cost of LASSO divided by the computational cost of γ-OMP) for a range of randomly selected subsets of features.

Predictive performance and number of selected continuous features
Figures 3(a)-(c) present the difference in the predictive performance for each outcome scenario as a function of the number of selected features. Figure 3(d) presents box-plots of the difference in the predictive performance of γ-OMP and LASSO*.
• Case-control outcome scenario. Figure 3(a) presents the results for the binary outcome scenario. LASSO achieved close or higher predictive performance (AUC) than γ-OMP in most cases. The p-value for the mean difference in the AUC was equal to 0.001, indicating that the average difference in the predictive performance of γ-OMP and LASSO, based on all 53 datasets, was statistically significant. LASSO, though, consistently selected more features than γ-OMP (p-value = 0.001), and the average predictive performance of LASSO* was not statistically significantly different from that of γ-OMP (p-value = 0.200). Figure 3(d) shows their differences in the AUC.
• Continuous outcome scenario. Figure 3(b) presents the results for the continuous outcome scenario. LASSO achieved close or better predictive performance (MSE) than γ-OMP in most cases. The p-value for the mean difference in the MSE was equal to 0.001, indicating that the average difference in the predictive performance of γ-OMP and LASSO, based on all 53 datasets, was statistically significant. Once again, LASSO selected statistically significantly more features than γ-OMP (p-value = 0.001). The predictive performance of LASSO*, though, was significantly different from that of γ-OMP (p-value = 0.001). This means that, on a fairer basis, when the two algorithms selected roughly equal numbers of features, γ-OMP performed significantly better than LASSO (see Figure 3(d)).
• Time-to-event outcome scenario. Figure 3(c) shows the relationship between the predictive performance (C-index) and the number of selected features. The p-value for the mean difference in the C-index was equal to 0.089, providing evidence that the predictive performance of γ-OMP was similar to that of LASSO. The p-value for the mean difference in the selected features was 0.922. The predictive performance of LASSO* was similar to that of γ-OMP (p-value = 0.116, Figure 3(d)).

Predictive performance and number of selected continuous and discrete features
• Case-control outcome scenario. Figure 4(a) presents the results for the binary outcome scenario. γ-OMP was on par with GLASSO in terms of predictive performance (p-value = 0.982), while selecting statistically significantly fewer features (p-value = 0.001). The predictive performance of γ-OMP was, though, statistically significantly higher than that of GLASSO* (p-value = 0.001; see Figure 4(d)). Hence, on a fairer basis, when the two algorithms selected roughly equal numbers of features, γ-OMP, on average, performed statistically significantly better than GLASSO.
• Continuous outcome scenario. Figure 4(b) presents the results for the continuous outcome scenario. γ-OMP achieved similar or better predictive performance (smaller MSE values) than GLASSO in most cases. The mean difference in the predictive performance based on all 53 datasets was statistically significant (p-value = 0.001), while GLASSO selected statistically significantly more features than γ-OMP (p-value = 0.001). The predictive performance of GLASSO* was significantly different from that of γ-OMP (p-value = 0.001, Figure 4(e)).
• Time-to-event outcome scenario. Figure 4(c) shows the relationship between the predictive performance and the number of selected features. In terms of predictive performance and number of selected features, γ-OMP achieved similar performance to GLASSO (p-value = 0.558), with a smaller set of features though (p-value = 0.001). The predictive performance of γ-OMP was also similar to that of GLASSO* (p-value = 0.101, Figure 4(d)).

Computational efficiency of the algorithms

Continuous features:
The speed-up factors, for a range of subsets of features, are graphically presented in Figure 5. For the case-control outcome, γ-OMP is, on average, 3 times faster than LASSO; for the continuous outcome scenario, γ-OMP is, on average, 5 times faster, whereas for the time-to-event outcome scenario, γ-OMP is, on average, 4.6 times faster than LASSO.

Continuous and discrete features:
The speed-up factors, for a range of subsets of features, are graphically presented in Figure 6. For the case-control outcome, γ-OMP is, on average, 83.8 times faster than GLASSO; for the continuous outcome scenario, it is 14.4 times faster, whereas for the time-to-event outcome scenario, γ-OMP is, on average, 18.7 times faster than GLASSO.

CONCLUSIONS AND FUTURE WORK
We introduced the γ-OMP feature selection algorithm and compared it to LASSO, when the data include only continuous features, and to GLASSO, when the data also include discrete features, in the case of binary, continuous and survival outcomes, using simulated data and real gene expression data. The key point of γ-OMP is that it can treat numerous types of outcome variables by employing various regression models, whereas LASSO is not that flexible, i.e., each regression model requires its own appropriate mathematical formulation. LASSO, on the other hand, due to its computational efficiency and predictive performance, has gained research attention in various data science fields, and numerous extensions and generalisations have been proposed over the years. LASSO is, however, heavily dependent upon the outcome type: each type requires a different approach and the development of new, algebraically tedious, algorithmic procedures. With circular data, for example, γ-OMP can employ the projected normal regression model [42], whereas LASSO has not yet been developed for such data. With compositional data, γ-OMP can employ the Zero Adjusted Dirichlet regression model [49] and easily bypass the problem of zero values, while LASSO requires the development of an appropriate algorithm, because the log-ratio transformation is not applicable.
To assess the quality of each FS algorithm we focused on three key elements: a) predictive performance, b) number of selected features, and c) computational efficiency. On the simulated data, γ-OMP outperformed LASSO in all aspects. In the real data examples there was no clear winner. With continuous features, LASSO outperformed γ-OMP, except when LASSO selected nearly the same number of features as γ-OMP. With discrete features, though, γ-OMP was on par with, or outperformed, LASSO in all cases. Our simulation and empirical studies clearly pointed out that γ-OMP can be used successfully instead of LASSO. LASSO is computationally efficient, with high predictive performance, at the price of producing complex models that include many irrelevant features. γ-OMP, similarly to LASSO, tackles the FS problem from a geometrical standpoint, and fits a small number of regression models, thus also being computationally highly efficient. γ-OMP produces predictive models of similar predictive performance to LASSO but more parsimonious. γ-OMP's extra advantage is its ability to employ various regression models, many of which can be found in the R package MXM [24], [50].
Due to page limitations, we could not assess all of the proposed generalisations of γ-OMP, such as more types of outcomes, regression models, stopping criteria, types of residuals and types of association. In addition, the LASSO solution neither exists for many types of regression models, nor makes use of various types of residuals.
We highlight, though, that γ-OMP is not devised solely for gene expression data, but also for the high-utility NGS data that have become popular. In fact, γ-OMP has been devised for most types of data, biological or not. For example, it can be applied to tumor classification [58] or to the task of face recognition [17].
In a different direction, Ein-Dor et al. (2004) [14] demonstrated that multiple, equivalent prognostic signatures for breast cancer can be extracted just by analyzing the same dataset with a different partition into training and test sets, showing the existence of several genes that are practically interchangeable in terms of predictive power. We provided evidence of this phenomenon, since γ-OMP and LASSO often achieved similar performance by selecting different sets of features. SES [24], [52] is one of the few FS algorithms that discover statistically equivalent feature sets. More recently, Pantazis et al. (2017) [40] proposed a computational-geometry-based solution for discovering equivalent sets of features using LASSO. Our ongoing research focuses on extending γ-OMP towards the identification of multiple sets of features that are statistically equivalent in terms of predictive performance.

Figure 1. FDR and TPR of γ-OMP and LASSO with binary and continuous outcome, for a range of sample sizes and 50,000 continuous features. The outcome is a linear function of a subset of the features. (a) FDR (y-axis) for a range of sample sizes (x-axis). (b) TPR (y-axis) for a range of sample sizes (x-axis). An algorithm ideally has 0% FDR and 100% TPR; hence, in (a) the points should lie close to 0%, whereas in (b) the points should lie close to 100%.

Figure 2. Predictive performance of γ-OMP and LASSO with binary and continuous outcome, for a range of sample sizes and 50,000 continuous features. The outcome is a linear function of a subset of the features. (a) Case-control outcome: higher values of AUC are better. (b) Continuous outcome (signal-to-noise ratio equal to 32.5): lower values of MSE are better.

Figure 3. Comparative evaluation of γ-OMP and LASSO on the predictive performance, with continuous features. (a)-(c): Predictive performance vs. number of selected features for the different outcome scenarios. The x-axis represents the percentage-wise ratio of selected features between γ-OMP and LASSO; values less than 100% indicate that γ-OMP selected fewer features than LASSO. The y-axis represents the predictive performance difference between γ-OMP and LASSO, with positive values indicating that γ-OMP performs better than LASSO, except for the continuous outcome scenario in (b), where negative values indicate that γ-OMP performs better than LASSO. γ-OMP outperforms LASSO in both performance and number of selected features in the top-left quadrant, shown in green. LASSO outperforms γ-OMP in both metrics in the bottom-right quadrant (red area). A1: γ-OMP has better predictive performance than LASSO while selecting more features. A2: LASSO has better predictive performance than γ-OMP while selecting more features. The rhombus sign corresponds to the median of the values on the axes. (d)-(e): Box plots of the differences in the predictive performance of γ-OMP and LASSO*. For the case-control and the time-to-event outcome scenarios, positive values indicate that γ-OMP outperforms LASSO*. For the continuous outcome scenario, γ-OMP outperforms LASSO* when the differences are negative.

Figure 4. Comparative evaluation of γ-OMP and GLASSO on the predictive performance, with continuous and discrete features. (a)-(c): Predictive performance vs. number of selected features for the different outcome scenarios. The x-axis represents the percentage-wise ratio of selected features between γ-OMP and GLASSO; values less than 100% indicate that γ-OMP selected fewer features than GLASSO. The y-axis represents the predictive performance difference between γ-OMP and GLASSO, with positive values indicating that γ-OMP performs better than GLASSO, except for the continuous outcome scenario in (b), where negative values indicate that γ-OMP performs better than GLASSO. γ-OMP outperforms GLASSO in both performance and number of selected features in the top-left quadrant, shown in green. GLASSO outperforms γ-OMP in both metrics in the bottom-right quadrant (red area). A1: γ-OMP has better predictive performance than GLASSO while selecting more features. A2: GLASSO has better predictive performance than γ-OMP while selecting more features. The rhombus sign corresponds to the median of the values on the axes. (d)-(e): Box plots of the differences in the predictive performance of γ-OMP and GLASSO*. For the case-control and the time-to-event outcome scenarios, positive values indicate that γ-OMP outperforms GLASSO*. For the continuous outcome scenario, γ-OMP outperforms GLASSO* when the differences are negative.

Figure 5. Speed-up factors of γ-OMP versus LASSO (y-axis) for a range of continuous features (x-axis).

Figure 6. Speed-up factors of γ-OMP versus GLASSO (y-axis) for a range of continuous and discrete features (x-axis).
Michail Tsagris received his BSc and MSc in statistics from the Athens University of Economics and Business (Greece) and his PhD in statistics from the University of Nottingham. He currently serves as Assistant Professor at the Department of Economics of UoC. He has published more than 30 papers in journals, conference proceedings and book chapters. His current research interests include feature selection algorithms, computational statistics, compositional and directional data analysis.

Zacharias Papadovasilakis received his BSc and MSc in mechanical engineering from the Aristotle University of Thessaloniki. He also holds an MSc in Bioinformatics from the Medical School of the UoC, and is currently a PhD student and data scientist at GnosisDA. His current research interests include scalable feature selection algorithms with applications to SNP and gene expression data.

Kleanthi Lakiotaki is a research collaborator at the Computer Science Department of UoC, working on the development of data analysis algorithms and methods for analyzing biological data. Formerly, she worked on the development of an electronic Molecular Diagnostics Assistant at the Institute of Computer Science, Foundation for Research and Technology Hellas. She holds a PhD degree in Information and Decision Sciences from University Paris Dauphine and the Technical University of Crete. She has over 30 publications in journals, book sections and conference proceedings. Her research interests include Bioinformatics, Recommender Systems, Data Mining and Machine Learning with applications in Biology and Medicine.

Ioannis Tsamardinos, Ph.D., is a Professor at the Computer Science Department of UoC. He obtained his Ph.D. (in 2001) from the Intelligent Systems Program of the University of Pittsburgh. Subsequently, he joined the faculty of the Department of Biomedical Informatics at Vanderbilt University until 2006, when he returned to Greece. His research interests lie in the field of Machine Learning, Data Science, and Bioinformatics, and particularly feature selection, causal discovery, and the automation of machine learning. Ioannis Tsamardinos has over 100 international refereed publications in journals, conferences, and edited volumes, 8,000+ citations in Google Scholar, and 2 US patents.