Generalizing Gain Penalization for Feature Selection in Tree-based Models

We develop a new approach for feature selection via gain penalization in tree-based models. First, we show that previous methods do not perform sufficient regularization and often exhibit sub-optimal out-of-sample performance, especially when correlated features are present. We then develop a new gain penalization idea that combines local and global regularization for tree-based models. The new method allows for more flexibility in the choice of feature-specific importance weights. We validate our method on both simulated and real data and implement it as an extension of the popular R package ranger.


Introduction
In many Machine Learning problems, features can be hard or economically expensive to obtain, and some may be irrelevant or poorly linked to the target. For these reasons, reducing the number of features is an important task when building a model: it benefits data visualization and model performance, while reducing storage and training time requirements [23]. However, for tree-based methods there is no standard procedure for feature selection or regularization in the literature, as one would find for Linear Regression and the LASSO [47], for example. Performing feature selection in trees can be difficult, as they struggle to detect highly correlated features and their feature importance measures are not fully trustworthy [30]. Several methods to tackle this problem have been proposed recently, including Diaz-Uriarte [18], Friedman & Popescu [20], and Deng & Runger [15].
In Friedman & Popescu [20], the authors treat trees as parametric models and use procedures analogous to LASSO-type shrinkage methods, penalizing the coefficients of the base learners and reducing the redundancy in each path from the root node to a leaf node. However, their selected features can still be redundant, since the focus is on reducing the number of rules rather than the number of features. Diaz-Uriarte [18] focuses on gene selection for classification. The authors propose an iterative tool that eliminates the least important features (in fractions of the number of features, p) and updates the algorithm at each iteration. The complication is that the method will always be either computationally expensive, if p is low, or will eliminate too many features at once, which can exclude useful or interaction features. Besides, the method does not generalize to other dataset contexts or tasks, such as regression. In the contrasting approach of Deng & Runger [15,16], the authors regularize Random Forests by gain penalization. Their method consists of letting features be picked by a Random Forest only if their penalized (weighted) gain is still high. They make recommendations on how to set the penalization coefficients and present their implementation in the RRF package for R. However, the authors give no further guidelines on how to generalize their method to other models and penalization types, and do not explore the influence of the hyperparameters on the algorithm.
We develop a gain penalization approach that is more general and widely applicable than those mentioned above. In particular, our contributions are:
• We provide a richer means by which local and global regularization of features can be balanced, and allow for bespoke local regularization functions for domain-specific applications.
• We generalize gain penalization to multiple tree-based methods (CART, Bagging, Random Forests), for both regression and classification.
• We propose different techniques for setting the regularization parameters and discuss how they affect the final results, with real and simulated examples.
• We make available a faster implementation of the regularization method, included in the very widely used ranger package.
The paper is structured as follows. Section 2 explains the problem setup, followed by the generalization of gain penalization in Section 3. In Section 4, we present the results for simulated and real data. Section 5 explains the implementation details, and Section 6 has the conclusions and future work.

Problem setup
Consider a set of training target-feature pairs (Y_i, x_i) ∈ R × R^p, with i = 1, . . . , N indexing the observations and p being the total number of features. In general, we can estimate a function f̂ that describes how the features x_i relate to Y_i and use it for prediction or inference. However, not all features need to be involved in f̂. Especially for tree-based models, the occurrence of noisy or correlated features can badly influence the results [46]. Our interest here lies in estimating f̂ such that it only uses the matrix x_A, composed of the sub-vectors of x ∈ R^p indexed by A, A ⊂ {1, . . . , p}, which should contain the optimal set of features (it produces prediction errors similar or equal to those of the full feature matrix) and is potentially of much smaller dimension.

Trees
Trees are a particular case of non-linear models that recursively partition the feature space, resulting in a local model for each estimated region [8]. They learn the partition directly from the training data, creating an adaptive basis function model (ABM) [31] of the form

f(x) = Σ_{r=1}^{R} w_r I(x ∈ R_r),

where R_r is the r-th region, w_r is the prediction given to that region, and each R_r is defined by its splitting features v_r and the corresponding splitting values. These algorithms are fitted using a greedy procedure that computes a locally optimal maximum likelihood estimator by finding the splits that minimize a cost function. For regression, the cost of a decision D is frequently defined as cost(D) = Σ_{i∈D} (y_i − ȳ)², where ȳ = |D|^{-1} Σ_{i∈D} y_i is the mean of the observations in the specified region, while for classification this function is replaced by the misclassification rate, cost(D) = |D|^{-1} Σ_{i∈D} I(y_i ≠ ŷ). The gain of a new split is a normalized measure of the cost reduction,

∆(i, t) = cost(D) − [(|D_LN| / |D|) cost(D_LN) + (|D_RN| / |D|) cost(D_RN)],

for feature i at splitting point t, where D relates to the previously estimated split, LN is the left candidate node, and RN is the right candidate node. The global importance value is given by accumulating the gain over a feature, ∆(i) = Σ_{t∈S_i} ∆(i, t), where S_i represents all the splitting points used in a tree for the i-th feature.
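To make the split evaluation concrete, the cost and gain computations above can be sketched as follows. This is a minimal Python illustration of the regression case, not the paper's implementation; weighting the child costs by their relative node sizes is the assumed normalization:

```python
def cost(y):
    """Regression cost of a node: sum of squared deviations from the node mean."""
    if not y:
        return 0.0
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)

def split_gain(y, x, t):
    """Cost reduction from splitting feature x at point t, with the child
    costs weighted by their relative node sizes."""
    left = [yi for yi, xi in zip(y, x) if xi <= t]
    right = [yi for yi, xi in zip(y, x) if xi > t]
    n = len(y)
    return cost(y) - (len(left) / n) * cost(left) - (len(right) / n) * cost(right)

# A perfectly separating split recovers the full parent cost (here, 1.0):
gain = split_gain([0.0, 0.0, 1.0, 1.0], [0.1, 0.2, 0.8, 0.9], t=0.5)
```

The greedy procedure simply evaluates `split_gain` over candidate features and split points and keeps the largest value.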

Ensembles
Trees are known to be high-variance estimators: small changes in the data can lead to the estimation of a completely different tree [31]. One way to increase stability is to use the property that an average of many estimates has a smaller variance than a single estimate, and grow many trees from re-samples of the data. Averaging such results gives us a bagged ensemble [5] of the form f̂(x) = (1/N_tree) Σ_{n=1}^{N_tree} f̂_n(x), where f̂_n corresponds to the n-th tree. The Random Forest [6] algorithm is defined by allowing only a random subset m of the features to be candidates at each split. For ensembles, the importance value of feature i is averaged over all the trees, ∆(i) = (1/N_tree) Σ_{n=1}^{N_tree} ∆_n(i). Moreover, the performance of the trees in a Random Forest relies on the number of features tried at each split, called mtry here: as mtry → 1, the variance of each tree increases, but the averaging process becomes more effective, and vice versa [30].
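The averaging step can be sketched in a few lines of Python; the stub "trees" below are illustrative stand-ins for fitted base learners, not real fitted models:

```python
def ensemble_predict(trees, x):
    """Bagged prediction: the plain average of the individual tree predictions."""
    return sum(tree(x) for tree in trees) / len(trees)

# Three stub base learners whose individual errors cancel out on average
trees = [lambda x: x + 1.0, lambda x: x - 1.0, lambda x: x]
pred = ensemble_predict(trees, 2.0)  # the +1 and -1 biases average away
```

The same averaging applies to the per-tree importance values ∆_n(i) to obtain the ensemble importance of a feature.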

Regularization by gain penalization
In Deng & Runger [15], the authors first discuss the regularization of Random Forests by gain penalization. The Regularized Random Forest (RRF) weights the gains of the splits during the greedy procedure, guiding the feature choices of the model. The regularized gain is defined as

Gain_R(X_i, t) = λ_i ∆(i, t) if i ∉ U, and Gain_R(X_i, t) = ∆(i, t) if i ∈ U,

where U is the set of indices of the features previously used, X_i is the candidate feature, and t the candidate splitting point. The λ_i ∈ (0, 1] parameter is the penalty coefficient that controls the amount of regularization each feature receives. A feature is penalized if it is new to the whole ensemble, as the method keeps a memory of which features were already used. Naturally, λ_i can be a constant value for all the features, but ideally there should be a regularization parameter for each feature that best represents the information it carries about the target. In Deng & Runger [16], the authors develop this idea by introducing the Guided RRF (GRRF). It consists of first running a Standard Random Forest (mtry ≈ √p, number of trees = 500), producing an importance measure for each feature and scaling it as Imp′_i = Imp_i / max_{j=1,…,p} Imp_j, where Imp_i is the importance measure calculated for the i-th feature in the Random Forest. The estimated normalized importance measures are then considered jointly with a regularization parameter to create the overall gain penalization.
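A minimal sketch of the penalized split choice, assuming a dictionary of per-feature raw gains and an ensemble-level memory set; the names and numeric values are illustrative, not taken from the RRF package:

```python
def regularized_gain(i, gain, used, lam):
    """Penalized gain: a feature that is new to the ensemble (i not in `used`)
    only competes with its raw gain shrunk by lambda_i in (0, 1]."""
    return gain if i in used else lam[i] * gain

# Greedy split choice with ensemble memory (illustrative values):
used = {0}                       # feature 0 was already picked somewhere
lam = {0: 0.8, 1: 0.8}           # constant penalty: the simplest RRF setting
candidates = {0: 0.50, 1: 0.55}  # raw gains of each feature's best split
best = max(candidates, key=lambda i: regularized_gain(i, candidates[i], used, lam))
# feature 1's raw gain (0.55) is shrunk to 0.44, so the already-used feature 0 wins
```

This illustrates the memory effect: a new feature must beat the already-used ones by a margin of at least 1/λ_i to enter the model.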

Generalizing Gain Penalization
One of our goals with this work is to show how Equation 4 can be fully generalized in two senses: in the algorithm type and in the penalization coefficients. For the algorithm, this means that the regularization method can be applied to any tree-based model (single trees such as CART, or ensembles). As for the penalization coefficients, we generalize the method by proposing that λ_i can be written as

λ_i = (1 − γ) λ_0 + γ g(x_i),

where λ_0 ∈ [0, 1) is interpreted as the baseline regularization, g(x_i) is a function of the i-th feature, and γ ∈ [0, 1) is their mixture parameter, with λ_i ∈ [0, 1). The equation balances how much all features should be jointly, or globally, penalized and how much should be due to the local g(x_i). When γ = 0 we return to what was proposed in Deng & Runger [15], while as γ → 1 the regularization becomes fully controlled by g(x_i). The g(x_i) should represent relevant information about the features, based on some characteristic of interest. It can include, for example, external information about the relationship between x_i and y; this is inspired by the use of priors in Bayesian methods. In the same fashion, the data will tell us how strong our assumptions about the penalization are: even if we try to penalize a truly important feature, its gain will be high enough to overcome the penalization and the feature will still get picked.
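The local-global mixture can be sketched directly in Python, with illustrative values for the baseline, mixture parameter, and local weight:

```python
def mixed_lambda(lambda0, gamma, g_i):
    """lambda_i = (1 - gamma) * lambda0 + gamma * g(x_i): a convex mix of the
    global baseline lambda0 and the local, feature-specific weight g(x_i)."""
    return (1 - gamma) * lambda0 + gamma * g_i

# gamma = 0 recovers the constant-penalty RRF; larger gamma hands control to g(x_i)
rrf_like = mixed_lambda(0.5, 0.0, 0.9)    # ignores g(x_i) entirely
local = mixed_lambda(0.1, 0.75, 0.8)      # mostly driven by g(x_i)
```

Because the mix is convex, λ_i stays inside the range spanned by λ_0 and g(x_i), so a bounded g keeps the coefficient valid.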

Choosing g(x i )
Correlation: A natural option for continuous features is to use g(x_i) as the absolute value of the marginal correlation between x_i and y, assuming a continuous target. It can be Pearson's, Kendall's, Spearman's, or another correlation coefficient of preference (the first is more suitable for continuous numeric inputs, while the others are more convenient for ordinal inputs [9]). We define it as g(x_i) = |corr(y, x_i)|.
Entropy and Mutual Information: A different situation is when the features are discrete. In information theory, Shannon's entropy [43] is a measure of the uncertainty of a (discrete) random feature. In short, if a discrete feature X has K states, its entropy is calculated as

H(X) = − Σ_{k=1}^{K} p(X = k) log p(X = k).

Higher entropy means more uncertainty, so it is reasonable to give more weight to features with lower uncertainties. One can use a normalized version of the entropy calculated for each feature, g(x_i) = 1 − H(x_i)/max_{j=1,…,p} H(x_j), compelling the features with lower entropy to have larger penalization coefficients. Similarly, we can use the Mutual Information function, which measures the similarity between a joint distribution p(X, Y) and the factored distribution p(X)p(Y), calculated as MutInf(X; Y) = Σ_x Σ_y p(x, y) log [p(x, y)/(p(x)p(y))] for two features X and Y [31]. Recalling the entropy equation, it is easy to see that the Mutual Information value is the reduction in the uncertainty about Y when we observe X, so it can be used directly as g(x_i) = MutInf(y, x_i)/max_{j=1,…,p} MutInf(y, x_j).

Combination: Another possibility is combining two or more g(x_i), since some functions will be more appropriate to one type of feature than others. As an example, one could combine a Boosted method with the marginal correlations between the target and each feature, letting the absolute values of the correlations compose g(x_i) when the correlation is over a certain threshold, and using Imp′_i from a previously run algorithm otherwise.
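Possible g(x_i) choices such as the absolute marginal correlation and a normalized entropy can be sketched as follows. These are pure-Python illustrations; in particular, the "1 − H/max H" normalization is one reading of the text that gives lower-entropy features larger coefficients, not necessarily the paper's exact code:

```python
from math import log, sqrt

def g_correlation(y, x):
    """|corr(y, x_i)|, Pearson's version; Spearman's or Kendall's coefficients
    are drop-in alternatives for ordinal inputs."""
    n = len(y)
    my, mx = sum(y) / n, sum(x) / n
    cov = sum((yi - my) * (xi - mx) for yi, xi in zip(y, x))
    sy = sqrt(sum((yi - my) ** 2 for yi in y))
    sx = sqrt(sum((xi - mx) ** 2 for xi in x))
    return abs(cov / (sy * sx))

def entropy(x):
    """Shannon entropy of a discrete feature, in nats."""
    n = len(x)
    counts = {}
    for v in x:
        counts[v] = counts.get(v, 0) + 1
    return -sum((c / n) * log(c / n) for c in counts.values())

def g_entropy(columns):
    """1 - H(x_i)/max_j H(x_j): lower-uncertainty features get larger weights."""
    H = [entropy(c) for c in columns]
    hmax = max(H)
    return [1.0 - h / hmax for h in H]
```

A mutual-information g(x_i) follows the same normalize-by-the-maximum pattern.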
Depth parameter: Growing very bushy trees with new features is not desirable when we want to use a small set of features. Following Chipman et al. [10], where the authors use prior distributions to decide whether a new feature should be picked in a Bayesian Regression Tree, we introduce the idea of increasing the penalization with the current depth of the tree. Their priors take the current depth of a tree into account, so that when a tree is already deep the priors get less concentrated in high-probability regions, resulting in less bushy trees. In our work, a similar idea is applied by setting

λ_{i,T} = λ_i^{d_T},

where d_T is the current depth of tree T, T = 1, . . . , ntree, when the i-th feature is evaluated; since λ_i < 1, the effective coefficient shrinks as the node gets deeper. The aim here is to reduce the gains of the features if they are to be picked in a deep node, preventing new features from appearing at the bottom of trees unless their gains are exceptionally high.
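One concrete choice, assumed here for illustration, is to raise the penalization coefficient to the power of the node depth, so that coefficients below one decay geometrically as the tree grows:

```python
def depth_lambda(lam_i, depth):
    """lambda_i ** d_T: with lambda_i below 1, the effective coefficient
    shrinks as the candidate node sits deeper in the tree."""
    return lam_i ** depth

# the same feature is penalized harder at depth 5 than at depth 1
deep, shallow = depth_lambda(0.9, 5), depth_lambda(0.9, 1)  # ~0.59049 vs 0.9
```

Any monotonically decreasing function of d_T would have the same qualitative effect; the power form keeps the coefficient inside (0, 1].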

Issues & Details
Feature masking effect: Tree-based models often suffer from feature masking effects (Louppe [30]). For example, in a single tree, some feature X_j might never be used if it leads to splits slightly worse than those of another feature X_i, so that if X_i is removed, X_j may suddenly occur prominently. This should be overcome by ensembles like Random Forests, since selecting only m candidate features at each split decorrelates the trees; but once we regularize the Random Forest, the problem remains. If weak features happen to be picked first by the model, their gains will have an unfair advantage over the other features, which will be penalized. This situation is easily fixed by tuning the mtry hyperparameter.
Correlated features: Random Forests are also biased towards giving high importance to correlated features [46]. Suppose we have a subset C ⊆ X of correlated features. Ideally, we want only one or just a few of these features to be selected, to avoid redundancy, but Random Forests are not able to detect and eliminate correlated features. The regularized method automatically deals with them: when one of the features in C gets picked, the algorithm becomes less likely to pick the other correlated features as well, given that a new feature needs to reduce the prediction error much more drastically to be selected.

Experiments
This section shows experiments that evaluate the effects of different regularization types in simulated and real datasets using the Random Forest algorithm.

Simulated data
Consider now a set X = (x_1, . . . , x_205) of features, all sampled from a Uniform[0, 1] distribution, with n = 1000. We generated a continuous target Y ∈ R inspired by the simulation equation proposed in Friedman [21], and added 45 features correlated with x_5, totaling 250 features. This framework produces interesting relationships between the target and the features: non-linearities (i = 1, 2, 3), decreasing importances (i = 6, . . . , 205) and correlations (i = 5, 206, . . . , 250), inducing a more complicated scenario. We created 10 datasets, all randomly split into train and test sets (80%/20%). For all the algorithms we fixed the number of trees at 100, varied mtry = (15, 45, 75, 105, 135, 165, 195, 225, 250), and our accuracy measure is the RMSE calculated on the test set. We used a standardized version of y and, in the following, the term number of selected features denotes any feature with importance ∆(i) > 0 in the final estimated model.
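A data-generating sketch in this spirit is given below. The classic Friedman [21] core is standard, but the decaying weights for x_6, …, x_205 and the noise level of the 45 correlated copies of x_5 are illustrative assumptions, not the paper's exact recipe:

```python
import random
from math import sin, pi

random.seed(42)
n, p_base = 1000, 205

# Base features x1..x205 ~ Uniform[0, 1]
X = [[random.uniform(0.0, 1.0) for _ in range(p_base)] for _ in range(n)]

y = []
for row in X:
    # Classic Friedman core: non-linear in x1..x3, linear in x4 and x5
    core = (10 * sin(pi * row[0] * row[1]) + 20 * (row[2] - 0.5) ** 2
            + 10 * row[3] + 5 * row[4])
    # Hypothetical decaying weights give x6..x205 decreasing importances
    tail = sum(row[j] / (j + 1) for j in range(5, p_base))
    y.append(core + tail + random.gauss(0.0, 1.0))

# 45 noisy copies of x5 form the correlated block x206..x250
for row in X:
    row.extend(row[4] + random.gauss(0.0, 0.05) for _ in range(45))
```

With a small noise scale, the last 45 columns are nearly redundant with x_5, reproducing the correlated-block difficulty described above.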

Standard Random Forest
As a benchmark, we ran a Standard Random Forest for each of the 10 datasets and all the different values of mtry. The first mtry value is the default of a Standard RF, since √250 ≈ 15, and the last is the total number of features available. The resulting number of features used is always the maximum available (250) (Table 1). Considering the correlated features issue, this means that too many features are being picked, since we know they become irrelevant in their joint presence. The test RMSE changes with mtry: the best results occur at mtry = 45, meaning that the default value (mtry = √p ≈ 15) is not the best option.

RRF and GRRF
The simplest version of the regularized algorithm arises when we set λ_i to a constant value (the RRF approach, Deng [14]), so that all features are penalized by the same factor. We now present the results of fitting this algorithm to the simulated data, varying λ_i = λ across the 10 datasets for the two types of models. We can see a continuous transition in the number of features picked by the two models, but they present inverse patterns regarding the mtry and regularization parameters. For the RRF, the lower the λ_i and the higher the mtry, the fewer features are picked, but the test RMSE gets compromised. For the GRRF, the same happens but for higher γ values. Though the numbers of selected features might be similar depending on the hyperparameter values, the test RMSE values are in general lower for the GRRF, demonstrating how a feature-specific λ_i improves the feature selection and giving the GRRF a clear advantage over the RRF.

Generalized Regularization in Random Forests
For this subsection, we vary λ_0 = (0.1, 0.5, 0.9) and γ = (0.001, 0.25, 0.5, 0.75, 0.99), using all combinations (γ × λ_0), first with g(x_i) = |corr(y, x_i)| and later with two Boosted versions of g(x_i), one obtained from a Standard Random Forest and one from a Support Vector Machine. In Figure 2 we see that the test RMSE values are either close to or below the 0.5 line. In comparison to Figure 1, our algorithms do better, as we can spot many cases where the test RMSE is low while very few features are used, especially when g(x_i) = Boosted SVM. When γ is low, the regularization is primarily controlled by λ_0, and we see a heavier influence of mtry on the number of selected features, which tends to decrease as λ_0 increases. When γ is high, the penalization values depend more on g(x_i), and the results vary less with the values of λ_0 and mtry.
A more in-depth analysis of the results can be seen in Table 2. We define the most informative features in the simulation as V = (x_1, x_2, x_3, x_4, x_5). We do not include the last 45 features, which are correlated and which we ideally want to avoid. We then calculate the percentage of the important features that were picked by each algorithm and, among the correlated ones, the percentage that was selected. So, for example, if an algorithm picked 10 features, 3 of them important, 5 from the correlated group and 2 "non-important", the proportion of important features recovered is 3/5 and the proportion of correlated ones is 5/45. Table 2 shows that the proportion of important features is considerably higher for our approach with g(x_i) = Boosted RF and g(x_i) = Boosted SVM, and when we use g(x_i) = |corr(y, x_i)| the algorithm picks fewer of the correlated variables. We also notice that the best results occurred mostly when λ_0 ≤ 0.5, when g(x_i) has a higher influence on the penalization coefficients, so they are really helping the feature selection. The GRRF algorithm usually picks more of the correlated features and, occasionally, more of the important ones, but this model also picks more features in general.

Real Data Classification
Our experiments with real data consider gene classification micro-array datasets ([40], [37], [48], [3], [22], [2], [41], [44], [28]), with an average of 4787 features, 67 observations and 3 classes (details are in Appendix A 1.1). These are classical examples of "large p, small n" datasets, but our method generalizes to data of any context or size. As the goal here is to find the best features to predict the gene classes, the experiment conducted in this section is different. We run the regularized models and extract their selected features, which are later used in a Standard Random Forest, with which the misclassification rates are calculated. This mimics how such an approach can be used in practice: first a discovery experiment is run to identify important features, then a subsequent algorithm is run on new data using only those features. We set γ = λ_0 = 0.5, attributing the same weight to the baseline regularization and to g(x_i). We vary mtry = (√p, 0.15p, 0.40p, 0.75p, 0.95p) and g(x_i) ∈ {Boosted RF, MutInf(y, x_i)/max_{j=1,…,p} MutInf(y, x_j)}. For comparison, we also run a Standard RF and a GRRF for each dataset, which were separated into 50 different train (2/3) and test (1/3) sets. We first find the average misclassification rates (MR) and number of features used for each of the 50 resamples, aggregating over mtry at this step. Out of that, we filter by the resample with the smallest MR. According to Table 5, the Standard RF uses more features, but does not always have the lowest MR. As for the regularization, using the mutual-information-based g(x_i) typically selects far fewer features. When this happens and such algorithms also have a MR that is low or very similar to that of a Standard RF, we reach an optimal situation, which happened for almost all the datasets.
The comparison to the LASSO [47] and varSelRF [18] is presented in Appendix A 1.2 (only 10 resamples were run for these models due to their computational cost).

Implementation
The implementation is included as an extension to the ranger package [49], which is written in C++ and is currently one of the fastest tree model implementations available for R. Furthermore, the package has a wide variety of model extensions, is actively maintained, and interfaces with Python. The speed and scalability discussion presented in Wright & Ziegler [49], and its comparison to the randomForest package [29], is analogous to a comparison between our regularization implemented in ranger and the one in the RRF package, so we do not repeat those experiments.
The downside of our approach is the addition of new hyperparameters and the question of how to choose them well. Future work involves deriving theoretical properties of certain gain penalization approaches, optimizing the parameters (using, e.g., Snoek et al. [45]), and comparing our approach to other methods with a similar context (e.g., Johnson & Zhang [27], Nan & Saligrama [32], Nan et al. [33]).

A.2 LASSO and varSelRF results
In Table 5, we present the classification results together with the LASSO and varSelRF results. Only 10 resamples were run for these algorithms due to their computational cost. We can see that the LASSO and varSelRF select very few variables, but at the expense of the prediction error. Their variable selection results are most closely comparable to our regularized model using g(x_i) = MutInf(y, x_i)/max_{j=1,…,p} MutInf(y, x_j), but their prediction error is much higher.