Ensemble Pruning via Quadratic Margin Maximization

Ensemble methods combine a typically large number of weak learners into a stronger composite model. The output of an ensemble method is the result of fitting a base-learning algorithm to a given data set and obtaining diverse answers by re-weighting the observations or by re-sampling them using a given probabilistic selection. A key challenge of using ensembles on large-scale multidimensional data lies in the complexity and the computational burden associated with them. The models created by ensembles are often difficult, if not impossible, to interpret, and their implementation requires more computational power than individual learning algorithms. Recent research effort in the field has concentrated on reducing ensemble size while maintaining predictive accuracy. We propose a method to prune an ensemble solution by optimizing its margin distribution while increasing its diversity. The proposed algorithm results in an ensemble that uses only a fraction of the original weak learners, with generally improved estimated generalization performance. We analyze and test our method on both synthetic and real data sets. The analysis shows that the proposed method compares favorably to the original ensemble solutions and to other existing ensemble pruning methodologies.


I. INTRODUCTION
Ensemble methods combine a large number of fitted values (sometimes in the hundreds) into a composite prediction. The output of an ensemble method is generally the combination of many fits of the same data set, obtained by re-weighting the observations, by following a path of gradient descent, or by using subsets of the original set obtained from bootstrapping, re-sampling or other probabilistic selections of the data. Empirical evidence points to ensemble performance being generally superior to individual or single learning algorithms [1]-[6], [6]-[8]. Boosting [9] is one of the most well-known ensemble frameworks. The term boosting refers to a family of methods that combine weak learners (classification algorithms that perform at least slightly better than random) into a strong-performing ensemble through weighted voting. AdaBoost [10], stochastic gradient boosting [11], XGBoost [12], LightGBM [13] and CatBoost [14] are the leading implementations of boosting algorithms. Among ensembles, bagging [2] and random forests (RFs) [8] are also strong performers in terms of their generalization ability. In addition to their ability to outperform individual learning algorithms, ensembles can also be very robust to overfitting, even when performing a large number of iterations [4], [5]. To explain the successful performance of ensembles, Breiman [15] suggested that boosting, bagging and RFs (which he referred to as arcing classifiers) reduce the variance, in the bias-variance decomposition framework; however, Schapire et al. [5] refuted this claim by providing empirical evidence that AdaBoost mainly reduces the bias. More importantly, Schapire et al. [5] showed that AdaBoost is especially effective at increasing the margins of the training data. Schapire et al. [5] developed an upper bound on the generalization error of any ensemble, based on the margins of the training data, from which it was concluded that larger margins should lead to lower generalization error, everything else being equal (sometimes referred to as the ''large margins theory''). The large margins theory has its roots in the margin separation framework of support vector machines (SVMs) [16].
The proliferation of large-scale, high-velocity data sets, often containing variables of different data types, creates challenges for most traditional statistical and machine learning algorithms, but it does so even more markedly for ensembles. The term ''big data'' has been used to describe large, diverse and complex data sets generated from various sources. The volume, variety and velocity (known as the 3Vs) are the main characteristics that distinguish big data problems from others [17]. A key drawback of fitting ensembles to large-scale multidimensional data (big data) is their computational burden. The iterative nature of ensembles, as well as the complexity of the resulting solutions, makes their implementation especially challenging. In addition, interpretations of ensemble predictions are not as straightforward as those of single learning algorithms, and the implementation of the resulting models requires fitting the data through all of the iterations (sometimes in the hundreds) of the ensemble. A high number of iterations is oftentimes necessary to reap the benefits of the improved generalization performance provided by ensembles [5], [7]. For this reason, recent research effort has concentrated on reducing ensemble sizes, also called ensemble pruning (or thinning), while trying to maintain or improve their predictive accuracy (see, e.g., [18]-[27]). There is also evidence that smaller ensembles can perform as well as, or better than, their large counterparts [28]. However, knowing how large they should be is still an open research question. Ensemble pruning generally places additional computational costs on the training phase of the ensemble, due to the additional emphasis on identifying a strong-performing sub-ensemble. A reduced ensemble, however, translates into a more manageable and computationally less prohibitive model in the implementation or prediction phase [21]. Of particular importance in ensemble pruning is obtaining an ensemble that takes into account not only the quality of the individual learners, but also their disagreement [27], [29]. In other words, the effectiveness of ensembles also depends on the diversity of their component-wise learners, with the premise that more diverse weak learners perform better. Therefore, for high-dimensional data sets, a more efficient algorithm could be constructed if only the most diverse weak learners of the ensemble solution are taken into consideration in the final combination. In this article we propose an algorithm that produces a reduced, strong-performing sub-ensemble by optimizing the diversity of the weak learners and maximizing its lower margin distribution. The proposed method is a weight-based quadratic optimization formulation that aims to tune the weights of a given ensemble, such that the pairwise correlations of the weak learners and the margin variance are minimized, while the lower percentiles of the margin distribution of the ensemble are maximized.

II. PRELIMINARIES
We assume we have a set of T weak learners, h_t(x), t = 1, 2, . . . , T, created from the (finite) space of classifiers H. Each weak learner takes a p × 1 input vector x and maps it to a prediction h_t(x) ∈ {−1, +1} for a binary response variable Y. The prediction of an ensemble with T weak learners for x is given by:

f(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right),

where sign : R → {−1, 0, +1}, such that sign(a) = −1 when a < 0, sign(a) = +1 if a > 0, and sign(a) = 0 if a = 0 (when sign(a) = 0, we randomly assign {−1, +1} to f); α_t is the weight associated with the t-th weak learner, where 0 ≤ α_t ≤ 1 and \sum_{t=1}^{T} \alpha_t = 1. The task of any ensemble or combined classifier f is to create a set of weak learners and determine their associated weights {α_1, α_2, . . . , α_T} based on a training sample of data pairs S = {(x_i, y_i), i = 1, 2, . . . , n}, generated independently and identically distributed (i.i.d.) according to an unknown joint distribution P_XY, so as to produce a combined prediction with small generalization error (also called the risk of the classifier) for a given loss function g. For the binary classification framework, the generalization error is defined as

R[f] = \mathbb{E}_{P_{XY}}\big[g(f(x), y)\big] = P_{XY}\big[f(x) \neq y\big],

where g(f(x), y) = 1 if f(x) ≠ y, and 0 otherwise. We denote P_XY[a] as the probability of event a under the unknown distribution P_XY, and \hat{P}_S[a] as the empirical probability of a under S. We use P[a] and \hat{P}[a] when it is clear which distribution we are referring to. The weights assigned to the weak learners {α_1, α_2, . . . , α_T} can be uniform, as in the case of RFs, or based on the accuracy of the component-wise learners, as in the case of boosting. We will refer to the classifiers h_t(x) contained in an ensemble as weak learners, base learners or (individual) classifiers, and they are generated by a base-learning algorithm B that maps the input vector x to the binary response variable Y. Base-learning algorithms can be decision trees, neural networks, or any other kind of learning or statistical method. The construction of an ensemble is based on two main steps, i.e., generating the weak learners and then combining them. The final combination is done with a linear function, but the final prediction can also be based on user-specified thresholds.
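For concreteness, the weighted-vote combination above can be expressed in a few lines of code. This is an illustrative sketch only (the function name, the matrix layout and the tie-breaking seed are ours, not from the paper); it assumes the weak-learner predictions for a sample are stored in an n × T matrix of ±1 values.

```python
import numpy as np

def ensemble_predict(H, alpha, seed=0):
    """Weighted-vote prediction f(x) = sign(sum_t alpha_t * h_t(x)).

    H     : (n, T) array of weak-learner predictions in {-1, +1}
    alpha : (T,) array of non-negative weights summing to one
    Ties (sign = 0) are broken at random, as described in the text.
    """
    rng = np.random.default_rng(seed)
    score = H @ alpha                     # weighted vote for each observation
    pred = np.sign(score)
    ties = pred == 0
    pred[ties] = rng.choice([-1.0, 1.0], size=int(ties.sum()))
    return pred

# toy usage: five observations, three weak learners, uniform weights
H = np.array([[1, -1, 1], [1, 1, 1], [-1, -1, 1], [-1, 1, -1], [1, -1, -1]])
print(ensemble_predict(H, np.full(3, 1 / 3)))
```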
To explain the, generally superior, performance of ensembles, Schapire et al. [5] showed that margins are an integral part of understanding how ensembles generalize. The margin of the i-th training observation is defined by:

m_i = m(x_i, y_i) = y_i \sum_{t=1}^{T} \alpha_t h_t(x_i).

A higher margin can be viewed as more confidence in the prediction for the i-th training observation. The margin is equal to the difference between the weighted proportion of weak learners correctly predicting the i-th observation and the weighted proportion of weak learners incorrectly predicting it, so that −1 ≤ m_i ≤ 1. A margin value of −1 indicates that all of the weak learner predictions were incorrect, while a margin value of +1 indicates that all of the weak learners correctly predicted the observation. Next, we briefly define two common and basic ensembles: AdaBoost [10] and RFs [8]. Many other ensembles are generalizations of these two, including stochastic gradient boosting [11], XGBoost [12], LightGBM [13], CatBoost [14], Rotation Forests [30] and Isolation Forests [31].
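The training margins can be computed directly from the same prediction matrix; again, this is a small illustrative sketch rather than the paper's own implementation.

```python
import numpy as np

def margins(H, y, alpha):
    """Margins m_i = y_i * sum_t alpha_t * h_t(x_i), so that -1 <= m_i <= 1."""
    return y * (H @ alpha)

# an observation predicted correctly by two of three equally weighted learners
# has margin 2/3 - 1/3 = 1/3
print(margins(np.array([[1, 1, -1]]), np.array([1]), np.full(3, 1 / 3)))
```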

A. BOOSTING ALGORITHMS
Boosting refers to the idea of converting a weak learning algorithm into a strong learner, that is, taking a classifier that performs slightly better than random chance and improving (boosting) it into a classifier with arbitrarily high accuracy. Boosting originated from the PAC (probably approximately correct) learning theory [32] and the question that Kearns and Valiant [33] posed on whether a ''weak'' learning algorithm can be boosted into an arbitrarily accurate ''strong'' learner, hence the name boosting. Boosting can be based on re-sampling or re-weighting [34]. The main goal of boosting methods is to give more voting power α_t to the weak learners or classifiers that perform best. AdaBoost, for example, achieves this by iteratively using the same base-learning classifier, only modifying the weights of the observations D_i^t at iteration t; therefore the base-learning algorithm B generally handles observation weights D_i as inputs. AdaBoost adaptively places more emphasis on the training observations that were misclassified in the previous iteration. The weight each observation receives in round t + 1 of the iterations is given by

D_i^{t+1} = \frac{D_i^{t}\,\exp\!\big(-\alpha_t\, y_i h_t(x_i)\big)}{Z_t}, (4)

where Z_t is a normalization constant chosen so that the weights sum to one. The weights of misclassified observations increase by a factor of exp(α_t) at iteration t. The AdaBoost algorithm is described in Algorithm 1. The voting power of each base learner is given by

\alpha_t = \frac{1}{2}\ln\!\left(\frac{1 - \epsilon_t}{\epsilon_t}\right),

where more emphasis is given to those base learners with lower misclassification error ε_t. Other boosting variations are based on modifying the observation weighting function (4) or on using gradient descent on a specified loss function (see, e.g., LogitBoost [35], MadaBoost [36], Gradient Boosting [37], Stochastic Gradient Boosting [11], Local Boosting [38], XGBoost [12], LightGBM [13], and CatBoost [14]). The applications of boosting methods extend easily to regression and multiclass problems; however, we focus only on the binary classification problem.
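The reweighting and voting-power formulas above can be collected into a compact AdaBoost sketch. It uses scikit-learn decision stumps as the base learner B purely for illustration and omits details of Algorithm 1 such as stopping rules; labels are assumed to be coded as ±1.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=100):
    """Minimal AdaBoost sketch with decision stumps; y must take values in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1.0 / n)                    # observation weights D_i^1 = 1/n
    learners, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=D)
        h = stump.predict(X)
        eps = np.clip(np.sum(D * (h != y)), 1e-10, 1 - 1e-10)  # weighted error
        a = 0.5 * np.log((1 - eps) / eps)      # voting power alpha_t
        D *= np.exp(-a * y * h)                # up-weight misclassified observations
        D /= D.sum()                           # normalization constant Z_t
        learners.append(stump)
        alphas.append(a)
    return learners, np.array(alphas)

def boosted_predict(learners, alphas, X):
    """Weighted vote of the fitted stumps (ties left at zero in this sketch)."""
    return np.sign(sum(a * h.predict(X) for a, h in zip(alphas, learners)))
```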

B. RANDOM FORESTS
Breiman [8] defines an RF as an algorithm ''consisting of a collection of tree structured classifiers h_t(x, θ_t), t = 1, . . . , T, where θ_t are independently and identically distributed random vectors.'' Each tree casts a unit vote for the most popular class at input x.
RFs inject randomness by growing each of the T trees on a random sub-sample of the training data, and also by using a small random subset of the predictors at each decision node split. The RF method is similar to boosting in that it combines classifiers trained on sub-samples or weighted subsets of the data, but the two differ in that boosting gives different weights to the base learners based on their accuracy, while RFs use uniform weights. There has been ample research on these ensemble methods and how they perform under different settings. For a more complete review of their performance, the reader is referred to [4], [6], [7], [39].
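In practice, an RF of the kind described here is typically obtained from an off-the-shelf implementation; the scikit-learn call below is one such option, with illustrative parameter values rather than those used in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Each tree is grown on a bootstrap sample and considers a random subset of the
# predictors (here sqrt(p)) at every split; the trees then cast unweighted votes.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
rf.fit(X, y)
print(rf.score(X, y))
```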

C. MARGINS AND GENERALIZATION PERFORMANCE
Schapire et al. [5] proved an upper bound on the generalization error of an ensemble that does not depend on the number of weak learners combined, T. The bound is described below.
Bound 1 [5]: Assuming that the base-classifier space H is finite, for any δ > 0 and θ > 0, with probability at least 1 − δ over the training set S of size n, every voting classifier f satisfies the following bound:

P_{XY}\big[m(x, y) \le 0\big] \;\le\; \hat{P}_S\big[m(x, y) \le \theta\big] + O\!\left(\frac{1}{\sqrt{n}}\left(\frac{\ln n \,\ln|H|}{\theta^{2}} + \ln\frac{1}{\delta}\right)^{1/2}\right),

where the term \hat{P}_S[m(x, y) ≤ θ] is the proportion of training set margins less than a value θ > 0. When the hypothesis space is infinite, the expression ln n ln|H| is replaced by d log²(n/d), where d is the VC-dimension of the space of all possible weak classifiers (a measure of complexity). Schapire et al. [5] use this bound to provide an explanation for the superior performance of AdaBoost, which they show is highly effective at increasing the margins. Based on this bound, Schapire et al. [5] concluded that larger margins should lead to lower generalization error, holding other factors constant, such as the cardinality of the hypothesis space |H|, the sample size n, and δ. This is sometimes referred to as the ''large margins theory'' and applies to all voting classifiers (ensembles) [5], [40]-[49]. Given that maximizing the minimum margin has not yielded positive results in terms of generalization performance (see, e.g., [5], [40]), many authors have proposed optimizing other functions of the ensemble margin distribution instead. For instance, Reyzin and Schapire [50] suggested maximizing the average or the median margin, while other researchers have proposed that minimizing the variance of the margins might be a key component in designing better performing ensembles (see, e.g., [42]). To illustrate these ideas, Figure 1 plots the margin distributions and test set error rates of AdaBoost and RF ensembles of increasing size on one of the benchmark data sets (see Table 1 for data set description) using full-grown trees. As the ensemble size grows, the test set error rate (which is an estimate of the generalization performance) decreases for both types of ensembles under these settings. More importantly, we can see that the variation of the margins does appear to decrease as T increases. An important observation from Figure 1 is that, as Schapire et al. [5] noted, ''boosting is especially aggressive at increasing the margins of the examples, so much so that it is willing to suffer significant reductions in the margins of those examples that already have large margins.'' This behavior is of particular significance, given that other researchers have shown that the lower margins play a pivotal role in ensemble performance (see, e.g., [51]). The improvement in the lower margins is evident in Figure 1 as the ensemble size grows, and it corresponds closely to better generalization performance. This is more markedly visible in AdaBoost than in the RF solution.
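The empirical term \hat{P}_S[m(x, y) ≤ θ] in Bound 1, and the cumulative margin distributions plotted in Figures 1 and 2, can be estimated directly from the training margins; a minimal sketch (function names are ours):

```python
import numpy as np

def empirical_margin_cdf(m, theta):
    """Empirical proportion P_S[m(x, y) <= theta] of training margins below theta."""
    return float(np.mean(np.asarray(m) <= theta))

def cumulative_margin_distribution(m, grid=None):
    """Grid of thresholds and the corresponding cumulative margin distribution."""
    grid = np.linspace(-1, 1, 201) if grid is None else grid
    return grid, np.array([empirical_margin_cdf(m, t) for t in grid])
```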

D. DIVERSITY AND ENSEMBLE PERFORMANCE
A different, but no less important, measure of effectiveness is how diverse the individual classifiers within an ensemble are. Several researchers provide evidence on the importance of diversity within ensembles [8], [52]-[57]. Li et al. [25] provided an upper bound on the generalization error of any ensemble based on the diversity of its individual classifiers. The bound is omitted here, but it suggests that an increase in diversity should improve generalization performance, holding other factors constant, such as the complexity of the weak learner space and the sample size.
Margineantu and Dietterich [52] define diversity or dissimilarity of ensembles based on either the probability distributions from which the weak learners are derived, or the agreement (disagreement) of the classifiers in their predictions. Several researchers have used the κ and κ-error diagrams proposed by Margineantu and Dietterich [52] to construct more diverse ensembles and/or evaluate their performance (see, e.g., [29], [30], [58]-[60]). Germain et al. [27] studied the relationship between ensemble diversity and risk, along with the first and second moments of the ensemble margins. They bounded the risk of an ensemble by the expected disagreement between the individual learners. Germain et al. [27] define M(x, y) as a random variable that, given an example (x, y) drawn according to P_XY, outputs the margin of the ensemble on that example, that is:

M(x, y) = y \sum_{t=1}^{T} \alpha_t h_t(x).

The generalization error, or risk of f, can then be defined in terms of the margins as the probability that the majority voter errs, R[f] = P_{XY}[M(x, y) ≤ 0]. An important characteristic of the random variable M(x, y) is its first moment, defined as:

\mu_M = \mathbb{E}_{(x,y)\sim P_{XY}}\big[M(x, y)\big],

whose empirical counterpart is \bar{m} = \frac{1}{n}\sum_{i=1}^{n} m_i, where m_i is the margin of the i-th training observation. As previously stated, several authors have concluded that maximizing \bar{m}, or maximizing the whole margin distribution, should improve the generalization performance, and that simply maximizing the minimum margin (min_i m_i(x_i, y_i)) does not result in improved generalization performance [40], [48], [50], [61]. The second moment of the distribution of M(x, y) is also of particular importance. We define the second moment as:

\mu_{M^2} = \mathbb{E}_{(x,y)\sim P_{XY}}\big[M(x, y)^2\big].

Germain et al. [27] provide an upper bound on the generalization error R[f] of an ensemble that relates the diversity or expected disagreement between voters (d_S), which is a particular measure of the diversity of the ensemble, to the second moment of the margin distribution μ_{M^2}. The bound is below.
Bound 2 [27]: For any distribution Q on a set of voters and any distribution D on X × Y, if μ_M > 0, we have:

R_D(B_Q) \;\le\; 1 - \frac{\mu_M^2}{\mu_{M^2}} \;=\; 1 - \frac{\big(1 - 2R_D(G_Q)\big)^2}{1 - 2 d_Q^D},

where B_Q denotes the Q-weighted majority vote (the ensemble f), R_D(G_Q) is the Gibbs risk, and 1 − 2d_Q^D relates the expected disagreement d_Q^D to the second moment of the margin distribution. The reader is referred to Germain et al. [27] for a more complete explanation and the derivation of the bound, but the work in Germain et al. [27] suggests that reducing the second moment μ_{M^2} of the margins of any given ensemble should produce a more diverse and better performing model. Hypotheses presented by Reyzin and Schapire [50], Shen and Li [42] and Germain et al. [27] all suggest that reducing the variation of the margins might also improve the generalization performance of ensembles. Shen and Li [42], for instance, proposed an algorithm named MD-Boost (Margin Distribution Boosting) that maximizes the average margin while reducing the variance of the margin distribution.
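A quick way to inspect these quantities on a training sample is to estimate the two moments and the resulting bound value 1 − μ_M²/μ_{M²}; the sketch below assumes the 1 − μ₁²/μ₂ form of Bound 2 given above and is intended only as a sanity check, not as code from the paper.

```python
import numpy as np

def margin_moments(H, y, alpha):
    """Empirical first and second moments of the margin M(x, y) on the sample."""
    m = y * (H @ alpha)
    return float(m.mean()), float(np.mean(m ** 2))

def empirical_c_bound(H, y, alpha):
    """Empirical value of 1 - mu_M^2 / mu_{M^2}; only meaningful when mu_M > 0."""
    mu1, mu2 = margin_moments(H, y, alpha)
    return 1.0 - mu1 ** 2 / mu2 if mu1 > 0 else 1.0
```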
We measure the diversity of an ensemble f, given its individual classifiers h_t, t = 1, . . . , T, as the weighted pairwise disagreement of the weak learners on the training sample:

d_S = \frac{1}{n}\sum_{i=1}^{n}\sum_{t=1}^{T}\sum_{s=1}^{T} \alpha_t \alpha_s \,\mathbf{1}\big[h_t(x_i) \neq h_s(x_i)\big]. (10)

E. ENSEMBLES UNDER NOISE
Ensembles, particularly AdaBoost, tend to be sensitive to outliers and noise. Grove and Schuurmans [40], Mason et al. [41] and Dietterich [7] provide evidence that AdaBoost does overfit and that its generalization error deteriorates rapidly when the data is noisy. Long and Servedio [62] proved that for any boosting algorithm with a convex potential loss function and any nonzero random classification noise rate, there is a data set that can be efficiently learned by the booster when there is no noise, but cannot be learned to accuracy better than 1/2 when random classification noise is present. Many methods that automatically handle noisy data and outliers have been proposed to alleviate the limitations of AdaBoost. Algorithms such as BrownBoost [63], LogitBoost [35], MadaBoost [36], LPReg-AdaBoost [64], ν-LP and ν-ARC [65] mostly attempt to accommodate noise by somehow allowing unusual observations to fall on the wrong side of the prediction in subsequent iterations of AdaBoost. Although bagging and RFs generally perform better than AdaBoost under noisy circumstances, they are still not completely robust to noise and outliers [39]. Deleting outliers (also called noise filtering or noise peeling) by pre-processing the data is preferable under certain high-noise circumstances [66].

III. ENSEMBLE PRUNING
The idea behind diversity-based pruning is to reduce the size of a given ensemble based on the similarity of its weak learners, working on the premise that a more diverse ensemble performs better; however, there are various types of ensemble pruning methods, not all of them based on diversity.
Most of them fall into either selection-based methods, or weight-adjusting methods [21].

A. SELECTION-BASED METHODS
The main purpose of selection-based pruning methods is to either reject or select each weak learner based on some criterion or criteria. The most common methodology used in selection-based pruning methods is to rank the weak learners within an ensemble according to some performance metric on a validation set and select a subset of the top T_r out of the original T weak learners. Margineantu and Dietterich [52] proposed several measures of diversity and ways to prune ensembles accordingly. They proposed the use of the Kullback-Leibler divergence (KL distance) [67] to prune ensembles, by maximizing the KL distance between the distributions upon which the classifiers were constructed. The KL distance between two probability distributions p and q is defined as:

D(p \,\|\, q) = \sum_{x} p(x) \ln\frac{p(x)}{q(x)}. (11)

To measure the agreement (or disagreement) of ensemble predictions, Margineantu and Dietterich [52] use the Kappa statistic (κ) [68]. Given two classifiers h_a and h_b, Margineantu and Dietterich [52] consider their agreement by constructing a contingency table with elements C_{ij}, where i ∈ {−1, +1} and j ∈ {−1, +1}, corresponding to the number of observations for which h_a(x) = i and h_b(x) = j, and the observed probability of agreement is

\theta_1 = \frac{\sum_{i} C_{ii}}{n}.

However, to account for class imbalances, the probability that the two classifiers agree by chance is defined as:

\theta_2 = \sum_{i}\left(\sum_{j}\frac{C_{ij}}{n}\right)\left(\sum_{j}\frac{C_{ji}}{n}\right).

Finally, a measure of agreement κ can be obtained by quantifying the observed agreement relative to the agreement expected by chance:

\kappa = \frac{\theta_1 - \theta_2}{1 - \theta_2}. (14)

The κ measure in (14) has become a standard way to measure diversity in ensembles, and has been used extensively in selection-based pruning methods. Examples of other metrics used in selection-based methods include test set performance. For instance, Martinez-Munoz and Suarez [69] use classification performance on a test set based on orientation ordering to select the best sub-ensemble, while Lu et al. [24] and Li et al. [25] proposed heuristics to order the weak learners based on both their accuracy and their diversity. Prodromidis and Stolfo [70] proposed reducing the size of ensembles by minimizing a cost complexity metric. Ordering the weak learners can also be based on the observation margins (see, e.g., [51]), or on other measures that apply to specific types of analyses, such as time series (see, e.g., [71]). The Diversity Regularized Ensemble Pruning (DREP) method [25] and κ-pruning [52] are two of the leading selection-based pruning algorithms. One of the main drawbacks of ordered-ensemble pruning methods such as κ-pruning is that we need to specify the size of the pruned sub-ensemble, and it might not necessarily be the optimal size. Several authors have considered that pruning close to 80% of a given ensemble yields the most consistently optimal results. Other selection-based methods include formulating the selection of the pruned sub-ensemble as an integer optimization heuristic (see, e.g., [19]). Recently, combinations of margin-optimizing and diversity-based pruning methods have been of special interest to researchers (see, e.g., [25], [72]-[76]). The list of selection-based pruning methods presented here is not exhaustive; the reader is referred to Tsoumakas et al. [22] for a more complete reference. It is important to note that one of the main limitations of many selection-based methods is that, as in κ-pruning, we must pre-specify the size T_r of the pruned sub-ensemble. The selection-based approach is straightforward, but does not guarantee the best possible sub-ensemble and does not necessarily guarantee optimal performance.
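The κ computation leading to (14) is straightforward to implement; the following sketch computes κ for two vectors of ±1 predictions (the function name is ours).

```python
import numpy as np

def kappa(ha, hb):
    """Agreement statistic kappa between two classifiers' predictions in {-1, +1}."""
    ha, hb = np.asarray(ha), np.asarray(hb)
    n = len(ha)
    labels = (-1, 1)
    # contingency table C_ij = number of observations with ha = i and hb = j
    C = np.array([[np.sum((ha == i) & (hb == j)) for j in labels] for i in labels])
    theta1 = np.trace(C) / n                                   # observed agreement
    theta2 = np.sum(C.sum(axis=1) * C.sum(axis=0)) / n ** 2    # agreement by chance
    return (theta1 - theta2) / (1 - theta2)
```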

B. WEIGHT-ADJUSTING METHODS
For weight-adjusting pruning methods, the main goal is not necessarily to prune the ensemble, but to adjust the weights of the weak learners, so that the generalization error is improved.
In the process, some of the weights get zeroed out and consequently the ensemble is reduced (pruned). The main limitation of most weight-adjusting methods is that they do not have any theoretical guarantee to diminish the size of the ensemble, nor is there any explicit formulation to do so in their heuristics. For example Grove and Schuurmans [40] used a linear programming technique to adjust the weights of the weak learners, so that the ensemble's minimum margin is maximized. Grove and Schuurmans [40] remark that the final ensemble was generally reduced significantly with their proposed algorithm. Demiriz et al. [77] also used a weight-adjusting linear program to optimize a generalization error bound, which they show is also effective at pruning the ensemble. Chen et al. [21] use expectation propagation to approximate the posterior estimation of the weight vectors.
There are other weight-adjusting pruning methods in the literature that merit attention (see e.g., [78]), and the reader is referred to Tsoumakas et al. [22] for a more exhaustive reference.

IV. PROPOSED PRUNING ALGORITHM
In this section, we propose a weight-adjusting method based on quadratic programming to reduce the fraction of weak learners utilized in a particular ensemble. In designing the proposed algorithm, we take into account the generalization error bound in Bound 1, which suggests that larger margins should lead to lower generalization error. Specifically, we focus on increasing the lower margin percentiles [5]. We also consider Bound 2, which states that reducing the second moment of the margins induces a more diverse and better performing combined classifier. The quadratic program formulation aims to tune the weights of the given ensemble, such that the pairwise correlations of the weak learners and the variance of the margins are minimized, while simultaneously maximizing the lower percentiles of the margins. To the best of our knowledge, this is the first research contribution with a weight-adjusting quadratic optimization formulation emphasizing an improvement over the lower margins, and increasing the diversity of the weak learners selected.

A. QUADRATIC MARGIN MAXIMIZATION (QMM) PRUNING ALGORITHM
We assume that we are given an ensemble solution, i.e., a set of weak learners h_t(x), t = 1, 2, . . . , T, and a set of weights α ∈ [0, 1]^T, α = (α_1, α_2, . . . , α_T)', with 0 ≤ α_t ≤ 1 and \sum_{t=1}^{T} α_t = 1, associated with the weak classifiers. The weights α_t can be normalized without loss of generality. Note that for the training sample S = {(x_i, y_i), i = 1, 2, . . . , n} used to produce the ensemble solution, the values of the weak learner predictions h_{it} and the weights α_t are fixed. We let h_{it} = ±1 denote the prediction of the t-th weak learner for the i-th observation in the training data, y = (y_1, y_2, . . . , y_n)' and m = (m_1, m_2, . . . , m_n)'. We define the matrix

H = \begin{pmatrix} h_{11} & h_{12} & \cdots & h_{1T} \\ h_{21} & h_{22} & \cdots & h_{2T} \\ \vdots & \vdots & \ddots & \vdots \\ h_{n1} & h_{n2} & \cdots & h_{nT} \end{pmatrix} \in \{-1, 1\}^{n \times T}

as the prediction matrix for the T weak classifiers within the ensemble. The error matrix E = H ∘ y is defined as:

E = \begin{pmatrix} y_1 h_{11} & y_1 h_{12} & \cdots & y_1 h_{1T} \\ y_2 h_{21} & y_2 h_{22} & \cdots & y_2 h_{2T} \\ \vdots & \vdots & \ddots & \vdots \\ y_n h_{n1} & y_n h_{n2} & \cdots & y_n h_{nT} \end{pmatrix} \in \{-1, 1\}^{n \times T},

with element E_{ij} = −1 when the prediction of the j-th classifier for observation i is incorrect, and +1 otherwise. Note that Eα = m. Let \hat{\Sigma} = cov(E) be the sample covariance matrix of E, where \hat{\Sigma} is a symmetric positive semidefinite matrix with the error variance of each weak learner h_t, \hat{\sigma}^2_{h_t}, on the diagonal, and the error covariance of weak learners i and j, \hat{\sigma}_{h_{ij}}, off the diagonal.
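The matrices H, E and \hat{\Sigma} are simple to construct once the weak-learner predictions on the training sample are stored column-wise; a short numpy-based sketch (names are ours):

```python
import numpy as np

def error_matrix_and_cov(H, y):
    """Build E = H ∘ y (element E_it = y_i * h_t(x_i)) and its sample covariance.

    H : (n, T) weak-learner predictions in {-1, +1};  y : (n,) labels in {-1, +1}.
    Note that E @ alpha recovers the margin vector m.
    """
    E = H * y[:, None]                  # multiply every row of H by its label
    Sigma = np.cov(E, rowvar=False)     # T x T covariance of the weak-learner errors
    return E, Sigma
```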
Let m_{(1)} < m_{(2)} < · · · < m_{(n)} denote the ensemble margins arranged in increasing order of magnitude, and let ϕ_υ ∈ (−1, 1)^{n_υ}, ϕ_υ = (m_{(1)}, m_{(2)}, . . . , m_{(n_υ)})', be the margin vector for a given percentile υ, where n_υ is the number of observations in the lower υ fraction of the margin distribution. We define the matrix E_υ ∈ (−1, 1)^{n_υ × T} as

E_υ = \begin{pmatrix} y_{(1)} h_{(1)1} & y_{(1)} h_{(1)2} & \cdots & y_{(1)} h_{(1)T} \\ y_{(2)} h_{(2)1} & y_{(2)} h_{(2)2} & \cdots & y_{(2)} h_{(2)T} \\ \vdots & \vdots & \ddots & \vdots \\ y_{(n_υ)} h_{(n_υ)1} & y_{(n_υ)} h_{(n_υ)2} & \cdots & y_{(n_υ)} h_{(n_υ)T} \end{pmatrix},

i.e., the rows of E corresponding to the n_υ smallest margins, such that E_υ α = ϕ_υ. The main goal is to minimize the error covariance \hat{\Sigma}, such that the lower υ margin percentiles are optimized. We call the quadratic program used to achieve this solution the QMM pruning algorithm, and it can be expressed in the following form:

minimize_w   w' \hat{\Sigma} w
subject to   E_υ w ≥ ϕ_υ,   w'1 = 1,   w ≥ 0, (19)

where w are the new weights for the weak learners, to be determined by solving the QP. The constraint E_υ w ≥ ϕ_υ guarantees that the choice of w generates lower υ margin percentiles E_υ w at least as large as, or larger than, the lower υ margin percentiles E_υ α = ϕ_υ generated by the original ensemble. Optimal solutions generally induce some of the w to equal zero, therefore reducing the final ensemble size, but the zeroing-out of the weights w is not guaranteed, as in most weight-adjusting pruning methods. The main hypothesis here is that reducing the error covariance and minimizing the margin variance will cause the optimization algorithm to zero out the weights corresponding to the less diverse, worse-performing weak learners.
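One way to solve formulation (19) in practice is with an off-the-shelf QP solver. The sketch below uses the cvxpy modeling package, which the paper does not prescribe; a tiny ridge is added to \hat{\Sigma} for numerical positive semidefiniteness, and the fallback to the original weights on infeasibility follows the behavior described in the text.

```python
import numpy as np
import cvxpy as cp

def qmm_prune(H, y, alpha, upsilon=0.25, ridge=1e-8):
    """Sketch of the QMM quadratic program (19); returns new weights w,
    or the original alpha when the program is infeasible."""
    alpha = np.asarray(alpha, dtype=float)
    E = H * y[:, None]
    Sigma = np.cov(E, rowvar=False)
    Sigma = 0.5 * (Sigma + Sigma.T) + ridge * np.eye(Sigma.shape[0])
    m = E @ alpha                                   # original ensemble margins
    n_ups = max(1, int(np.ceil(upsilon * len(m))))
    low = np.argsort(m)[:n_ups]                     # indices of the lowest margins
    E_ups, phi_ups = E[low], m[low]

    w = cp.Variable(len(alpha), nonneg=True)
    problem = cp.Problem(cp.Minimize(cp.quad_form(w, Sigma)),
                         [E_ups @ w >= phi_ups, cp.sum(w) == 1])
    problem.solve()
    if problem.status not in ("optimal", "optimal_inaccurate"):
        return alpha                                # fall back to the original weights
    return w.value
```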
One of the main issues that we can run into with the QMM pruning algorithm is the possibility of a less than full-rank H matrix because of column dependencies, which would consequently render the \hat{\Sigma} matrix rank-deficient and not positive definite. This happens when weak classifiers within an ensemble produce the same predictions. Chen et al. [21] use the least squares pruning method (lscov) as a baseline before their proposed method to alleviate the rank-deficient cases. We propose using a QR decomposition with pivoting to detect the column dependencies of H. If H has rank r < min(n, T), then there is an orthogonal matrix Q and a permutation matrix P such that

Q' H P = \begin{pmatrix} R_{11} & R_{12} \\ 0_{n-r,\, r} & 0_{n-r,\, T-r} \end{pmatrix},

with R_{11} an r × r nonsingular upper-triangular block.
The first r pivoted columns of H form a full-rank matrix H̃ comprised of r ≤ T classifiers, therefore also potentially pruning the ensemble. We also define the matrix H* as the matrix of the remaining T − r columns of H. Let h̃_j, j = 1, . . . , r, denote the j-th column of H̃, with associated weights α_j, and h*_k, k = 1, . . . , T − r, the k-th column of H*, with associated weights α*_k. If r < min(n, T), then for each retained column h̃_j let G_j denote the set of columns of H* whose predictions are identical to h̃_j. Finally, the j-th element of the weight vector α̃ for the retained classifiers is defined as:

α̃_j = α_j + \sum_{k \in G_j} α^*_k.

We would use H̃ and α̃ instead of H and α in the QMM optimization formulation in (19), with the covariance matrix \hat{\Sigma} = cov(H̃ ∘ y) and the corresponding Ẽ_υ and ϕ̃_υ. Therefore, even if H is less than full rank, the optimization would still be based on a full-rank \hat{\Sigma} matrix. It is also possible that \hat{\Sigma} is rank-deficient, and hence not positive definite, particularly when n < T. In that case, the QMM pruning algorithm will fail to find an optimal solution, and the resulting ensemble solution is simply the original one, with w = α. All the steps of the QMM pruning algorithm are summarized in Algorithm 2. We illustrate the performance of the QMM pruning algorithm in Figure 2 by plotting the cumulative margin distributions (CMDs) for the original AdaBoost and RF solutions and for the proposed QMM pruning algorithm on the Breast Cancer (BC) data set (see Table 1 for data set description). The QMM ensemble uses only 74 trees (4-node, depth = 2) out of the 200 produced by the AdaBoost solution, and 76 trees out of the RF solution of 200 trees, with equal error rates. We can visually inspect how the lower margin distribution has been optimized in the QMM pruned ensemble.
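The rank-reduction step can be sketched with a pivoted QR decomposition from SciPy; the weight folding for duplicated classifiers follows the aggregation described above, while the final renormalization is an addition of ours for convenience.

```python
import numpy as np
from scipy.linalg import qr

def deduplicate_ensemble(H, alpha, tol=1e-10):
    """Keep a full-rank subset of classifiers found by pivoted QR; weights of
    exact duplicates are folded into the retained copy (illustrative sketch)."""
    alpha = np.asarray(alpha, dtype=float)
    _, R, piv = qr(H.astype(float), pivoting=True)
    diag = np.abs(np.diag(R))
    r = int(np.sum(diag > tol * diag[0]))       # numerical rank of H
    keep, drop = np.sort(piv[:r]), np.sort(piv[r:])
    alpha_new = alpha[keep].copy()
    for k in drop:
        dup = np.flatnonzero((H[:, keep] == H[:, [k]]).all(axis=0))
        if dup.size:                            # identical predictions: fold weight in
            alpha_new[dup[0]] += alpha[k]
    alpha_new /= alpha_new.sum()                # renormalize to sum to one
    return H[:, keep], alpha_new
```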

B. CHOICE OF υ
The parameter υ determines the fraction of margins that will be maximized, that is, the QMM solution will constrain these margins to be at least as large as those of the original ensemble solution. Selecting a high value of υ requires a higher fraction of margins to be improved or maintained, and since AdaBoost and RFs are highly effective at increasing the training margins [5], the optimization will likely fail to find a feasible set. For υ = 1 we would likely obtain the original solution w = α, or even an infeasible solution in the optimization algorithm, in which case we also assign w = α. On the other hand, setting υ close to 0 would most likely result in a smaller but underperforming ensemble. The selection of υ can be based on cross-validation, but this results in higher computational costs and does not necessarily guarantee good generalization. Making use of the structure of the margin distribution for the particular ensemble might also give some useful insights. For instance, Wang et al. [43] developed an upper bound based on a single margin instance called the equilibrium margin (EMargin) as an explanation of the performance of ensemble methods. To explain the EMargin, we use the Bernoulli Kullback-Leibler function D(q ‖ p) as defined in (11). D(q ‖ p) is a monotone increasing function of p for a fixed q and q ≤ p < 1. We can also see that D(q ‖ p) = 0 when p = q and D(q ‖ p) → ∞ as p → 1. The bound that relates the EMargin to the performance of an ensemble is shown in Bound 3.
Bound 3 [43]: Assuming that the base-classifier space H is finite, for any δ > 0 and θ > 0, with probability at least 1 − δ over the training set S of size n, every voting classifier f satisfies the following bound:

P_{XY}\big[m(x, y) \le 0\big] \;\le\; \frac{\ln|H|}{n} + \inf_{q} D^{-1}\!\big(q \,\|\, u_\theta(q)\big), (23)

where D^{-1}(q ‖ ·) denotes the inverse of the Bernoulli Kullback-Leibler function with respect to its second argument, and

u_\theta(q) = \frac{1}{n}\left[\frac{8\ln|H|}{\theta^{2}(q)}\,\ln\!\left(\frac{2n^{2}}{\ln|H|}\right) + \ln|H| + \ln\frac{1}{\delta}\right].

The optimal value of q in (23), defined as q*, yields θ̂(q*), which is called the EMargin, while q* is called the EMargin error. Wang et al. [43] suggest that an ensemble with a higher EMargin θ̂(q*) and a lower EMargin error q* should perform better, holding everything else constant. With these results in mind, the value of υ can be set to υ = \hat{P}_S[m(x, y) ≤ θ̂(q*)].
The main drawback of using the EMargin to select the value of υ is again the extra computational cost, as well as the difficulty of obtaining the size of the base-classifier space |H|. Wang et al. [43] suggested using a pre-specified number of thresholds uniformly distributed on [0, 1] on each feature with a fixed classifier complexity, such as decision stumps, so that |H| can be computed exactly. Figure 3 illustrates the performance of the QMM pruning algorithm for the SPL data set under different values of υ for both AdaBoost and RFs. We fix the complexity and obtain the size of the hypothesis space by using 4-node (depth = 2) decision trees. We further normalize each feature of the SPL data set to [0, 1] and consider only 100 thresholds uniformly distributed on [0, 1] on each feature, so that |H| = (2 × 100 × p)³, where p = 60. We then obtain the fraction of trees used and the test set error rate for values of υ ∈ (0, 1] in increments of 0.01. This specific example shows that lower values of υ correspond to the highest ensemble pruning rate for both AdaBoost and RFs; however, this translates into worse performance for the AdaBoost algorithm. For higher values of υ, it becomes increasingly more difficult for the algorithm to prune the ensemble, and consequently the QMM pruning algorithm yields a higher fraction of trees selected. It is clear that the optimal value of υ depends on the ensemble type, as well as the data set used. The vertical dotted blue line in Figure 3 corresponds to the EMargin. The example in Figure 3 shows a typical performance of the QMM pruning algorithm: performance is better for higher values of υ with an AdaBoost solution, while values of υ ranging from 0.01 to 0.40 result in improved performance with RFs. Using the EMargin to set the parameter υ generally results in a higher pruning rate with acceptable performance; however, to reduce computational costs in our simulations, we have set υ = 0.5 for AdaBoost and υ = 0.25 for RFs, knowing that this value could be further optimized by means that are not limited to the use of the EMargin, but can also be derived by cross-validation, especially if computational costs for the specific problem are not an issue. For further simulations, we do not restrict the decision trees to a pre-specified number of thresholds.

V. EXPERIMENTS AND SIMULATIONS
The application of the QMM pruning algorithm is illustrated on several synthetic and real data sets to gauge its applicability and performance under different scenarios. We use AdaBoost and RFs as the ensembles to be pruned. If the QMM pruning algorithm does not result in a feasible solution, we fall back to alternative, decreasing values of υ. For AdaBoost, we have set the following options: υ ∈ {0.50, 0.25, 0.05, 0.01}, and for RFs: υ ∈ {0.25, 0.15, 0.05, 0.01}. These values consistently provide a workable range for optimal results. We compare the performance of the QMM pruning algorithm to the original ensemble using the test set error rate, the reduced ensemble size (the pruning rate), and the diversity of the ensemble as measured by (10). The ensembles are based on decision stumps (two-terminal-node decision trees, depth = 1), 4-node decision trees, and full-grown decision trees. We test how the proposed method compares to two of the leading ensemble pruning techniques: the Diversity Regularized Ensemble Pruning (DREP) method [25] and κ-pruning [52]. For κ-pruning we set the pruning rate to 80%.
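The fallback schedule over decreasing values of υ can be wrapped around any implementation of the QP, such as the qmm_prune sketch given earlier (passed in here as a callable; the wrapper itself is a hypothetical helper, not code from the paper).

```python
import numpy as np

UPSILON_ADABOOST = (0.50, 0.25, 0.05, 0.01)   # fallback schedule for AdaBoost
UPSILON_RF = (0.25, 0.15, 0.05, 0.01)         # fallback schedule for RFs

def qmm_with_fallback(H, y, alpha, prune, schedule=UPSILON_RF):
    """Try the QMM program at each upsilon in turn and keep the first feasible,
    non-trivial solution; otherwise return the original weights."""
    for ups in schedule:
        w = prune(H, y, alpha, upsilon=ups)
        if not np.allclose(w, alpha):          # the QP found a different solution
            return w, ups
    return np.asarray(alpha, dtype=float), None
```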

A. QMM UNDER NOISE
Given the findings in Long and Servedio [62], an ensemble pruning algorithm that aims to reduce the influence of noise on a given ensemble solution is desirable. To visually illustrate how the QMM pruning algorithm performs under a noisy scenario, we generate a synthetic two-dimensional data set consisting of 1000 data points uniformly distributed on the unit square, with y = −1 when x1 + x2 < 1, y = +1 when x1 + x2 > 1, and y assigned randomly from {−1, +1} when x1 + x2 = 1. The true boundary is the diagonal line x1 + x2 = 1. To generate noise, we randomly flip the response variable of 20 observations with y = −1 to y = +1. A test set of 1000 observations with no noise was also generated to gauge the performance of the methods. Figure 4 shows the decision boundaries for AdaBoost and RFs and the corresponding solutions for the QMM pruning algorithm using 500 full-grown trees. The black boundaries are those generated by AdaBoost or RFs, while the red boundaries are those generated by the QMM pruning algorithm. The QMM pruning algorithm improves slightly upon the performance of AdaBoost, since it is evident that AdaBoost overfits the data set by trying to fit a boundary closely around the noisy points. The RF solution does not overfit as closely; in this case the RF and QMM boundaries mostly overlap.
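The noisy two-dimensional example described above is easy to regenerate; the sketch below follows that description (the seed and function name are ours).

```python
import numpy as np

def make_noisy_diagonal_data(n=1000, n_flips=20, seed=0):
    """Uniform points on the unit square, labeled by the boundary x1 + x2 = 1,
    with n_flips randomly chosen y = -1 observations flipped to +1 as label noise."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, 2))
    y = np.where(X.sum(axis=1) > 1, 1, -1)     # the tie x1 + x2 = 1 has probability zero
    flip = rng.choice(np.flatnonzero(y == -1), size=n_flips, replace=False)
    y[flip] = 1                                # injected class-label noise
    return X, y

X_train, y_train = make_noisy_diagonal_data()                  # noisy training set
X_test, y_test = make_noisy_diagonal_data(n_flips=0, seed=1)   # clean test set
```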

B. QMM PERFORMANCE ON BENCHMARK DATA SETS
Table 1 shows the description of the benchmark data sets utilized. The data sets range in size and dimensionality. We have arbitrarily broken down the data sets into three main types: low-dimensional, mid-dimensional, and high-dimensional data sets. The comparison of the QMM algorithm to DREP and κ-pruning is summarized in Table 2 for decision stumps, 4-node trees and full-grown trees, using AdaBoost and RF ensembles of size T = 500. The simulations suggest that DREP prunes trees more aggressively than the QMM pruning algorithm; however, the QMM pruning algorithm performs better in terms of the test set error in most of the cases, which is not surprising given the primary emphasis of the algorithm on achieving better performance, as opposed to explicitly pruning the ensemble. In terms of test set error performance on AdaBoost ensembles, the average rank of the QMM pruning algorithm is 1.08, 1.46, and 1.54 for decision stumps, 4-node decision trees and full-grown trees respectively. The QMM pruning algorithm ranks better than DREP and κ-pruning for decision stumps, 4-node (depth = 2) and full-grown trees, and ranks on average better than the original AdaBoost solution for decision stumps and full-grown trees, with a tied performance on 4-node trees. The average pruning rate for the QMM is 92.29% for decision stumps, 79% for 4-node trees and 64.17% for full-grown trees, which is also expected, because full-grown trees are generally more diverse and tend to perform better individually. DREP on average prunes the trees more aggressively than the QMM pruning algorithm, as previously mentioned, with an average pruning rate of 97.58% for decision stumps, 89.43% for 4-node trees and 95.18% for full-grown trees.

The pruning rate for the QMM pruning algorithm shows a decreasing trend as the dimensionality increases for decision stumps, 4-node trees, and full-grown trees (87.85% average for low-dimensional data sets, 78.57% average for mid-dimensional data sets, and 62.73% average for high-dimensional data sets). For RF ensembles, the QMM pruning algorithm ranks best, on average, in terms of the estimated generalization performance for both decision stumps and 4-node (depth = 2) trees, while it ranks second for full-grown trees, after the original RF solution. κ-pruning outperforms DREP only for full-grown trees. In terms of the pruning rate, DREP also obtains on average a higher pruning rate under all decision tree types; however, the QMM pruning algorithm outperforms it in most cases. The QMM pruning algorithm also has a lower pruning rate for high-dimensional data when compared to lower-dimensional data sets (80.65% average for low-dimensional data, 78.96% average for mid-dimensional data, 60.29% average for high-dimensional data). The complete simulation results are shown in Tables 4 and 5.

FIGURE 6. Performance of the QMM pruning algorithm versus AdaBoost for low-dimensional data sets using υ = 0.5. The shaded grey represents the fraction of weak learners used given the ensemble size, while the red and black curves represent the test set error rates of the QMM pruning algorithm and AdaBoost respectively.
FIGURE 7. Performance of the QMM pruning algorithm versus AdaBoost for mid-dimensional data sets using υ = 0.5.
FIGURE 8. Performance of the QMM pruning algorithm versus AdaBoost for high-dimensional data sets using υ = 0.5.

C. QMM PERFORMANCE ON SYNTHETIC DATA SETS
Table 3 shows the description of the synthetic data sets utilized. The data sets are randomly generated at each simulation with a training set size of n = 300 and a test set size of n = 3000, for a total of 100 repetitions. The comparison of the QMM algorithm to DREP and κ-pruning is summarized in Table 8 for decision stumps, 4-node trees and full-grown trees, using AdaBoost and RF ensembles of size T = 200. As previously found, DREP is more aggressive in pruning trees than the QMM pruning algorithm. The QMM algorithm outperforms both DREP and κ-pruning in terms of the test set error performance, and performs on par with the original ensemble. For AdaBoost ensembles, the average test error rank of the QMM pruning algorithm is 1.33, 1.78, and 2.11 for decision stumps, 4-node decision trees and full-grown trees respectively. This performance is only bested by the original AdaBoost ensemble for 4-node and full-grown trees; however, AdaBoost and the QMM algorithm are generally not statistically different according to the Tukey-Kramer (all pairs) test, as shown in the full simulations in Table 7. The QMM algorithm prunes on average 61.05% of the decision stumps used by the AdaBoost ensemble, while pruning 48.74% of 4-node trees and 42.53% of full-grown trees. DREP, conversely, prunes 88.59%, 86.08% and 94.99% of decision stumps, 4-node trees, and full-grown trees from the AdaBoost ensembles. For RFs, the QMM pruning algorithm ranks better than DREP, κ-pruning and AdaBoost for decision stumps, 4-node and full-grown trees in most simulations in terms of the test set error. Using the Tukey-Kramer (all pairs) test, we show that the QMM pruning algorithm is statistically superior in most simulations against DREP and κ-pruning. The QMM algorithm also performs better than the original RF ensemble using decision stumps and 4-node trees, while performing similarly to the original RF ensemble on full-grown trees, as shown in Table 8. In terms of pruning capabilities, DREP shows more aggressive pruning than QMM; however, this high level of pruning results in poorer performance. With a fixed pruning level, κ-pruning occasionally performs better than DREP, but it is clear that the strategy does not hold in most cases, with wildly fluctuating test set error rates. Table 8 shows the full simulations, comparisons and statistical tests with RFs.

D. QMM PERFORMANCE BY ENSEMBLE SIZE
Thus far we have only looked at the performance of the proposed method for ensembles of size T = 200 and T = 500. We now check how the performance of the QMM algorithm changes as the ensemble size changes. For these simulations, the ensembles are grown to a size of T = 500, and the performance of the QMM algorithm is measured at each iteration T = 1, 2, . . . , 500. Specifically, we record the test set error and the fraction of trees utilized for decision stump, 4-node tree and full-grown tree ensembles as T changes. Figure 5 illustrates a typical performance of the QMM algorithm as the ensemble size increases. This particular example is based on a 4-node tree RF ensemble on the SPL data set. The shaded grey area in Figure 5 represents the fraction of weak learners used by the QMM algorithm as T changes, while the red and black curves represent the test set error rates of the QMM pruning algorithm and the RF ensemble respectively. We can see in Figure 5 that the test set error rate of the QMM pruning algorithm is generally better than that of the original RF ensemble. As T increases, the number of decision trees selected by the QMM pruning algorithm decreases, and this is a typical behavior of the pruning approach of QMM regardless of the tree topology. For the sake of completeness, Figures 6, 7, 8, 9, 10 and 11 illustrate the fraction of weak learners used and the error curves for T = 1, . . . , 500 on all the benchmark data sets, using the same approach as in Figure 5. The analysis on the benchmark data sets shows that pruning is more aggressive for ensembles of decision stumps, as expected. For instance, in low-dimensional data sets, an average of only 22 trees out of an AdaBoost ensemble of size T = 500 are assigned non-zero weights by the QMM pruning algorithm, a 95.6% reduction. The AdaBoost ensemble is pruned to only 40.2 out of 500 trees on average for 4-node trees, and to 120 trees out of 500 for full-grown trees. The pruning rate is similar for RF ensembles. We believe that as the complexity of the individual classifiers increases, more diverse trees are generated.
Conversely, decision stumps are more likely to be similar due to the less complex tree topology, and hence the QMM pruning algorithm can reduce the ensemble while keeping or improving the generalization performance. An interesting phenomenon occurs for the ION and SON data sets illustrated in Figure 7, which show a sudden increase in the percentage of trees used right after T > n. This is likely due to the fact that \hat{\Sigma} is not positive definite for data sets with n < T; nevertheless, the QR decomposition of the H matrix reduces the size of the ensemble even if the QP formulation fails to find a feasible solution to prune the ensemble further. Overall, the QMM pruning algorithm provides improvements both in the test set error and in the reduction of the size of the ensemble compared to the RF and AdaBoost solutions on average, regardless of the ensemble size T.

FIGURE 9. Performance of the QMM pruning algorithm versus RFs for low-dimensional data sets using υ = 0.25. The shaded grey represents the fraction of weak learners used given the ensemble size, while the red and black curves represent the test set error rates of the QMM pruning algorithm and RFs respectively.
FIGURE 10. Performance of the QMM pruning algorithm versus RFs for mid-dimensional data sets using υ = 0.25.
FIGURE 11. Performance of the QMM pruning algorithm versus RFs for high-dimensional data sets using υ = 0.25.

VI. CONCLUSIONS
Ensembles generally perform strongly in terms of their generalization ability compared to individual classifiers; however, the application of ensembles to large-scale, high-velocity data sets creates challenges, given the more complex nature of these learning algorithms. Recent research effort has concentrated on reducing the size of ensembles while attempting to maintain their predictive accuracy (ensemble pruning). In this paper, we proposed a quadratic program formulation that tunes the weights of a given ensemble solution, such that the pairwise correlations of the weak learners and the variance of the margin instances are minimized, while maximizing the lower percentiles of the margins. This strategy is based on the most current research in explaining ensemble performance. The proposed method results in an average pruning rate of 54.61% on ensembles of 200 trees, and an average pruning rate of 76.89% on ensembles of 500 trees. Simulations also suggest that the pruning rate increases as the ensemble size increases. More importantly, the estimated generalization performance of the proposed method (the QMM pruning algorithm) compares favorably to the other ensemble pruning methodologies analyzed here, and to the original ensemble solution, for any ensemble size. The proposed method, as with many ensemble pruning methodologies, also offers some improvements over the original ensemble on data sets containing noisy examples. The main limitation of the proposed methodology is the additional computational cost of the quadratic programming formulation, which needs to be performed on a given ensemble solution. Additionally, the proposed method decisively outperforms the original ensemble solutions on decision stumps and 4-node trees in terms of test set error performance; however, the proposed method performs very similarly in statistical tests to the original ensemble on full-grown trees, with the only advantage being the reduced size.