A Review and Experimental Comparison of Multivariate Decision Trees

Decision trees are popular as stand-alone classifiers or as base learners in ensemble classifiers, mostly because they are easy to explain. To improve the classification performance of decision trees, some authors have used Multivariate Decision Trees (MDTs), which allow combinations of features when splitting a node. While there is growing interest in the area, recent research on MDTs shares a common weakness: it does not provide an adequate comparison with related work. Authors either do not consider relevant rival techniques, or they test algorithm performance on an insufficient number of databases. As a result, claims lack statistical support and, hence, there is a lack of general understanding of the actual capabilities of existing MDT induction algorithms, which is crucial to improving the state of the art. In this paper, we report on an exhaustive review of MDTs. In particular, we give an overview of 37 MDT induction algorithms, 19 of which we have experimentally compared on 57 databases. We provide a statistical comparison over all databases and over subsets of databases grouped by number of classes, number of features, number of instances, and degree of class imbalance. This allows us to identify groups of top-performing algorithms for different types of databases.


I. INTRODUCTION
Decision trees (DTs) are popular classifiers, partly because their models are easy to explain and partly because they perform remarkably well. Their popularity has further increased with the growing need for white-box decision models: experts need to understand a model because, in several practical problems, it is mandatory to explain classification results [1]. Decision tree performance is highly competitive through the use of ensembles; in a recent survey [2], Random Forest [3] and eXtreme Gradient Boosting (XGBoost) [4] are among the top-ranked algorithms. Some applications of DTs or DT-based classifiers in such contexts include: predicting student dropout in subscription-based online learning environments [5], exploring customer purchasing patterns to evaluate the influence of product photos on sales [6], and evaluating the suitability of behavior change techniques in the context of mobile health applications [7].
The associate editor coordinating the review of this manuscript and approving it for publication was Sotirios Goudos.
A decision tree is a graph with a tree structure that has a single root node with directed links (branches) to children nodes, which may in turn have branches to other nodes. The terminal nodes, which do not have any branches, are commonly called leaves [8]. Each branch is tagged with a test, which evaluates to true or false for each object. For branches coming out of the same node, the tests define a partition of the database; so, for each object, one and only one of the tests evaluates to true. The tuple of tests tagging the branches from a node is known as a split because the tests are used to split the objects in a node into disjoint subsets during tree construction. Each subset of objects is assigned to a different child node. We also use split as a verb; to split a node is to select a split and generate the corresponding children nodes.
According to the number of features considered in a split, we can categorize decision trees into Univariate Decision Trees (UDTs) and Multivariate Decision Trees (MDTs). UDTs use only one feature in a univariate split (e.g., weight > 60, weight ≤ 60), while MDTs use more (e.g., 2 * height + 3 * weight > 40, 2 * height + 3 * weight ≤ 40). For decision tree classification, multiple authors have shown that MDTs achieve better accuracy than UDTs [9], [10]. This result is due to MDTs using multivariate splits, which often separate the classes better than univariate relations. As a result, publications on MDT induction algorithms have proliferated, with almost 30 algorithms introduced between 1977 and 2019 (see Fig. 1). Currently, there is no comprehensive comparison that determines the relative performance of existing MDTs, let alone identifies the top ones. This is both because there are no surveys about MDTs, and because recent papers introducing MDTs suffer from one of two main shortcomings in how they compare against previous work: authors do not compare their algorithm with relevant rival techniques, or they do so but not on enough databases, and hence results are insufficient for statistically validating the underlying hypothesis.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Our goal with this paper is to fill this gap; that is, we aim to evaluate the relative merit of MDT induction algorithms to identify how they compare with one another. We hope that our findings help the community select an MDT to use in a particular context.
To accomplish our goal, first, we have conducted a thorough review of MDT induction algorithms. Our review includes 37 MDT induction algorithms and is organized using an extension of the taxonomy proposed by Yildiz et al. [11], which groups algorithms into analytical and iterative.
Next, we conducted a thorough experimental comparison of prominent MDT induction algorithms surveyed in this paper. Our experimentation involved 19 MDTs, all of which are intended for general-purpose classification. We have tested these algorithms, using the implementations provided by their respective author(s), on 57 databases from the UCI repository [12]. The databases were carefully selected to ensure diversity. The largest databases have up to 20,000 objects and 856 features. While we have conducted a fair comparison of the MDT algorithms under study, future analysis of MDTs may involve more technological aspects, such as evaluating how well an MDT scales up to a large database, including ultra-high dimensional data.
We evaluated the algorithms on their classification performance according to the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve, since it is robust to class imbalance [13]. We made a statistical comparison of the algorithms using the Bayesian signed-rank test (see Section IV-B). We applied the statistical test over all databases and over subsets of databases grouped by number of classes, number of features, number of objects, and degree of class imbalance. This way, we were able to identify the top-performing algorithms for each group of databases.
Our conclusions are stronger than any other found in the literature since we compare nearly four times the number of algorithms as the most thorough MDT study, which compared 5 algorithms in 20 databases [11]. We also compare the algorithms in four more databases than the reviewed study with the largest number of databases [14].
Our main contributions in this paper are:
• We provide a sound and extensive review of MDT induction algorithms.
• We provide the most extensive MDT comparison, with results that are supported by statistical tests.
• We identify groups of top-performing MDTs for all databases and subsets of databases with common characteristics; these groups are such that their median probability of winning against other algorithms is high and their median probability of losing is low.
The rest of the document is organized as follows. In Section II, we present the notation used throughout the document and the taxonomy used to organize the MDTs. In Section III, we review MDT induction algorithms, show the widespread limitations of recent papers when comparing their algorithm to previous MDTs, and motivate the need for a thorough survey on the subject, hence motivating this paper. Next, in Section IV, we present the methodology used in our experimental comparison of MDTs, describe the databases, explain how we selected the MDT induction algorithms for our statistical comparison, and describe the measures and statistical tests used to compare the algorithms. In Section V, we provide a statistical comparison of the 19 selected MDT induction algorithms on 57 databases. Finally, in Section VI, we present our conclusions.

II. PRELIMINARIES
To organize our review, we categorize the algorithms by extending the taxonomy of Yildiz et al. [11]. We now need to introduce some notation and describe two distinctive elements of MDT induction: the form of candidate splits and the concept of feature selection. This notation will help us understand the taxonomy presented in Section II-B.

A. NOTATION
A training database D is assumed here to be composed of n instances with m real-valued features. An arbitrary instance is represented by the vector $\mathbf{x} = [x_1, x_2, \ldots, x_m]$, with $x_j \in \mathbb{R}$, $\forall j \in F$, where $F = \{1, 2, \ldots, m\}$ is the index set of features. Each $x_j$ represents the value that the feature with index $j \in F$ takes for an arbitrary instance $\mathbf{x}$. Each instance is tagged with a class from a predefined set of K classes.
A candidate split for a Univariate Decision Tree (UDT) takes the form
$$x_j \le v, \qquad x_j > v,$$
where v is called a univariate split point. So, we can vary the selected feature (represented by the index $j \in F$) and the split point v to generate candidate splits. In comparison, a Multivariate Decision Tree (MDT) considers multiple features, and there may be multiple coefficients involved in the combination of features.
The most common multivariate splits for MDTs are linear splits. Given a subset of features $F' \subseteq F$, for a binary MDT, a linear split takes the form
$$\sum_{j \in F'} w_j x_j \le v, \qquad \sum_{j \in F'} w_j x_j > v,$$
with $w_j \in \mathbb{R}$, $\forall j \in F'$. For each feature with index $j \in F'$, $w_j$ is its corresponding weight coefficient in the linear combination.
The weight coefficients of the linear combination are $\mathbf{w} = [w_1, w_2, \ldots, w_{|F'|}]$, and the scalar v is called the split point.
To find candidate splits, MDT induction algorithms need to find values for w, v, and $F'$. Searching for $F'$ is an optional step called feature selection; most algorithms lack feature selection and use all features in a linear combination, $F' = F$. The approaches used to search for w, v, and $F'$ are part of the taxonomy we will present.
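As an illustration of the two split forms, the following minimal sketch evaluates a univariate test and a linear test on a single instance, reusing the height/weight example from the Introduction. The function names and the dictionary representation of the weights are assumptions made for this example; feature indices are 0-based in the code, whereas the notation above is 1-based.

```python
from typing import Dict, Sequence


def univariate_split(x: Sequence[float], j: int, v: float) -> bool:
    """Return True for the branch x_j <= v and False for the branch x_j > v."""
    return x[j] <= v


def linear_split(x: Sequence[float], weights: Dict[int, float], v: float) -> bool:
    """Return True for the branch sum_{j in F'} w_j * x_j <= v.

    `weights` maps each selected feature index j in F' to its coefficient w_j
    (a representation chosen for this sketch, not taken from the paper).
    """
    return sum(w_j * x[j] for j, w_j in weights.items()) <= v


# Features are [height, weight]; the values are made up for illustration.
instance = [10.0, 5.0]
goes_left_uni = univariate_split(instance, j=1, v=60.0)            # weight <= 60
goes_left_lin = linear_split(instance, {0: 2.0, 1: 3.0}, v=40.0)   # 2*height + 3*weight <= 40
```

A univariate split is the special case of a linear split in which $F'$ contains a single feature with weight 1.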

B. TAXONOMY
To organize our discussion on MDT algorithms, we follow and extend the taxonomy proposed by Yildiz et al. [11]. The taxonomy groups the algorithms according to split type, approach to multi-class problems, the method for finding the weight coefficients w, method to find the split point v, branching factor, and split evaluation function. We extend the taxonomy by adding the feature selection strategy, if any.
• Split type. There are three possible split types: univariate, multivariate with linear combinations, and multivariate with non-linear combinations. Some MDT induction algorithms use different split types in different nodes.
• Approach to multi-class problems. When making a split, existing MDT induction algorithms have been designed to deal either only with two-class problems or with multi-class problems. Among the latter, some deal with multi-class problems by transforming them into two-class problems.
• Feature selection. MDT induction algorithms either use or do not use feature selection for multivariate splits. The algorithms that use feature selection find multivariate splits using subsets of features. Most feature selection algorithms rely on a greedy search. For example, Sequential Forward Selection (SFS) begins with an empty feature subset $F' = \emptyset$; then, it adds one feature at a time, provided that a split improves the evaluation function when using the feature in conjunction with all features already in $F'$. Brodley et al. [15] describe other prominent feature selection algorithms used in MDTs.
• Branching factor. The branching factor is the number of children of a node: it is either equal to 2 or to the number of classes K.
• Search for w. The search for the weight coefficients w can be either analytical or iterative.
• Search for v. The search for the split point v can also be analytical or iterative.
• Evaluation function. Some MDT induction algorithms generate more than one candidate split and need to use an evaluation function to choose one. Hernández et al. [16] have conducted an experimental comparison of evaluation functions, where they rank evaluation functions used with C4.5 by the classification performance achieved in terms of accuracy and AUC.
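The greedy search behind Sequential Forward Selection, mentioned in the feature selection item above, can be sketched as follows. This is only an illustrative sketch: the callable `evaluate` is a hypothetical stand-in for "find a split on this feature subset and score it", higher scores are assumed to be better, and picking the best single addition in each round is an assumption of this example rather than a detail stated in the text.

```python
from typing import Callable, FrozenSet, Iterable


def sequential_forward_selection(
    features: Iterable[int],
    evaluate: Callable[[FrozenSet[int]], float],
) -> FrozenSet[int]:
    """Greedy SFS: grow the subset while the best one-feature addition improves the score."""
    remaining = set(features)
    selected: FrozenSet[int] = frozenset()
    best_score = float("-inf")
    while remaining:
        # Score every one-feature extension of the current subset.
        candidates = [(evaluate(selected | {j}), j) for j in sorted(remaining)]
        score, j = max(candidates)
        if score <= best_score:
            break  # no addition improves the evaluation function
        selected, best_score = selected | {j}, score
        remaining.remove(j)
    return selected


# Toy evaluation function (hypothetical): features 0 and 2 help, the others hurt slightly.
USEFUL = {0, 2}


def toy_score(subset: FrozenSet[int]) -> float:
    return len(subset & USEFUL) - 0.1 * len(subset - USEFUL)


chosen = sequential_forward_selection(range(4), toy_score)
```

In an actual MDT induction algorithm, `evaluate` would build a multivariate split restricted to the given features and return the value of the split evaluation function.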

III. RELATED WORK
Since we want to identify common strategies for building MDTs, we have grouped split generation algorithms into two broad categories, according to their strategy for finding w.
The first category of algorithms, which we review in Section III-A, uses analytical solutions for finding w. The second category of algorithms, which we review in Section III-B, uses iterative approaches for finding w. In Section III-C, we briefly discuss algorithms related to MDTs. In Section III-D, we present the widespread limitations found in the MDT induction literature regarding the comparison of new algorithms against relevant rival MDTs. Finally, in Section III-E, we present the conclusions of our review.

A. ANALYTICAL MULTIVARIATE SPLIT GENERATION
Analytical algorithms use only analytical calculations to find the weight coefficients w. Some of the algorithms in this category also find the split point v through analytical calculations, while others find v through an iterative algorithm after finding w. Most algorithms in this category use Linear Discriminant Analysis (LDA). There are two different approaches to LDA: Fisher's linear discriminant and discriminant functions. Table 1 displays how we categorize each of the analytical algorithms considered in our investigation (one algorithm per row) in terms of the taxonomy presented in Section II-B. Note that the evaluation measure is left blank in some cases because some analytical algorithms only produce a single candidate split, so there is no need to use an evaluation measure. The reasons algorithms in this category may use a split evaluation measure are an iterative search for the split point v, the usage of feature selection, or deciding between a couple of candidate splits generated only through analytical calculations.
For analytical methods, we have identified the following common split generation strategies:
• Discriminant functions (Section III-A1). Algorithms in this category use discriminant functions to generate K-ary decision trees.
• New features through discriminant functions (Section III-A2).

The algorithms that split through discriminant functions are QUEST [17]; the Classification Rule with Unbiased Interaction Selection and Estimation tree (CRUISE) [18]; and the Generalized, Unbiased, Interaction Detection and Estimation tree (GUIDE) [24]. The three algorithms use linear discriminant functions to make a split; however, CRUISE builds K-ary trees, while QUEST and GUIDE build binary trees. QUEST and CRUISE apply Principal Component Analysis (PCA), dropping the principal components with small eigenvalues; in this way, the algorithms avoid the problem of near singular covariance matrices. The remaining principal components are used to find the splits through linear discriminant functions. A newer version of CRUISE [29] can also fit linear discriminant models in each terminal node.
TABLE 1. Properties of analytical MDT algorithms. The algorithms at the top are used in the experimental comparison in Section V. The column Split refers to the type of split used in the node: Univariate splits (Uni), Linear multivariate splits (Lin), Non-linear multivariate splits (Non), or any combination of the aforementioned split types. The column multi-class refers to the approach taken when working with multi-class databases: some algorithms can work directly with these databases, some algorithms need to transform the problem into one of two classes, and some algorithms cannot deal with multiple classes. The column w refers to the method for finding the weight coefficients w. The column v refers to the method for finding the split point v. The column Br refers to the branching factor, which is either equal to 2 or to the number of classes K.
GUIDE is an improvement upon QUEST and CRUISE. The main difference regarding multivariate splits is that GUIDE only allows for linear multivariate splits with two features. GUIDE can also fit models on the leaves; however, it uses kernel and nearest-neighbor node models.
In this paper, we focus on multivariate split generation strategies. However, a more extensive discussion on this family of algorithms can be found in Loh's survey [30].

2) MDTs GENERATING FEATURES THROUGH DISCRIMINANT FUNCTIONS. LTREE, QTREE, AND LgTree
Ltree [22], Qtree, and LgTree [31], in each node, generate discriminant functions for the classes with a number of objects exceeding two times the number of features. The difference between the algorithms is that Ltree uses linear discriminants, Qtree uses quadratic discriminants, and LgTree uses logistic discriminants (which results in linear splits). At each node, the discriminant functions are used to construct new features by projecting the data onto them. An exhaustive search is made to find a split for each feature, including the original features, features constructed in previous nodes, and features constructed in the current node.

Friedman [21], Linear Discriminant and Tabu Search (LDTS) [23], Scaling Up Recursive Partitioning with Sufficient Statistics (SURPASS) [10], Linear Discriminant Tree (LDT) [11], Fisher's Decision Tree (FDT) [26], and Multi-class Hellinger Linear Discriminant decision tree (MHLDT) [20] use Fisher's linear discriminant to generate splits. However, MHLDT uses a multi-class version of Fisher's linear discriminant that produces K − 1 eigenvectors used as candidates for w. MHLDT thus avoids grouping multiple classes into two groups.
LDTS and SURPASS find the split point v through analytical methods, while the rest of the algorithms in this section use exhaustive search to find it. The main difference between LDTS and SURPASS is that SURPASS is designed to work with databases large enough to exceed memory size; to work in this context, SURPASS removed the feature selection algorithm used by LDTS.
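To make the Fisher-based strategy concrete, the sketch below computes the two-class Fisher direction $\mathbf{w} = S_w^{-1}(\mu_1 - \mu_0)$ and then searches exhaustively for the split point v over midpoints of the projected values. It is a minimal illustration, not the implementation of any of the reviewed algorithms: the Gini impurity as evaluation function, the 0/1 class labels, the pseudo-inverse, and the toy data are all assumptions of this example.

```python
import numpy as np


def gini(labels: np.ndarray) -> float:
    """Gini impurity of a label vector (0.0 for an empty or pure node)."""
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return float(1.0 - np.sum(p ** 2))


def fisher_split(X: np.ndarray, y: np.ndarray):
    """Two-class Fisher direction w, followed by an exhaustive search for v."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter matrix S_w.
    Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    # Assumption of this sketch: a pseudo-inverse is used in place of a plain inverse.
    w = np.linalg.pinv(Sw) @ (mu1 - mu0)
    z = X @ w
    values = np.unique(z)  # sorted distinct projections
    best_v, best_imp = None, np.inf
    for v in (values[:-1] + values[1:]) / 2.0:  # midpoints between neighbours
        left, right = y[z <= v], y[z > v]
        imp = (left.size * gini(left) + right.size * gini(right)) / y.size
        if imp < best_imp:
            best_v, best_imp = float(v), imp
    return w, best_v, best_imp


# Toy data: class 1 lies roughly one unit above class 0 along an oblique band.
X = np.array([[0, 0.0], [1, 1.2], [2, 1.8], [3, 3.1],
              [0, 1.1], [1, 2.0], [2, 3.2], [3, 3.9]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
w, v, impurity = fisher_split(X, y)
```

On this toy data no single-feature threshold separates the classes, whereas the oblique split found through the Fisher direction does.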

4) MPSVM. GEOMETRIC DECISION TREE AND ZHANG's MPSVM
Geometric DT [25] generates candidate splits through an analytical algorithm, which the authors claim captures the geometric structure of the data better than algorithms relying on impurity measures. Given two classes, the algorithm uses the Multisurface Proximal SVM (MPSVM) algorithm to find a clustering hyperplane for each class, which is a hyperplane where the average Euclidean distance of all the points in the class to the hyperplane is minimized. If the clustering hyperplanes of both classes are parallel, the authors use the hyperplane between them to split the data. Otherwise, the authors use the angle bisectors of the two hyperplanes as candidate splits and keep the one that minimizes an impurity measure.
Zhang's MPSVM trees [32] borrow the main ideas from Geometric DT; however, they apply regularization methods to deal with singular covariance matrices. There are two versions of Zhang's MPSVM: using Tikhonov regularization or using a univariate split when a singular covariance matrix is found.

5) EFFICIENT DECISION TREE
The authors of Efficient trees [28] propose two analytical algorithms for finding w: selecting w randomly or as the dominant eigenvector of the covariance matrix. For both methods of finding w, the authors project the data onto w, then select v as the median of the projected data.
TABLE 2. Properties of iterative MDT algorithms. The algorithms at the top are used in the experimental comparison in Section V. The column Split refers to the type of split used in the node: Univariate splits (Uni), Linear multivariate splits (Lin), Non-linear multivariate splits (Non), or any combination of the aforementioned split types. The column multi-class refers to the approach taken when working with multi-class databases: some algorithms can work directly with these databases, some algorithms need to transform the problem into one of two classes, and some algorithms cannot deal with multiple classes. The column w refers to the method for finding the weight coefficients w. The column v refers to the method for finding the split point v. The column Br refers to the branching factor, which is either equal to 2 or to the number of classes K.
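The two strategies described for Efficient trees in this subsection can be sketched in a few lines. This is a hedged illustration rather than the authors' code: the function name, the use of a standard normal draw for the random direction, and the toy data are assumptions of this example.

```python
import numpy as np


def efficient_split(X: np.ndarray, rng=None):
    """Pick w at random (if `rng` is given) or as the dominant eigenvector
    of the covariance matrix; v is the median of the projected data."""
    if rng is not None:
        # Assumption: a random direction drawn from a standard normal, then normalised.
        w = rng.standard_normal(X.shape[1])
        w = w / np.linalg.norm(w)
    else:
        cov = np.cov(X, rowvar=False)
        _, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
        w = eigvecs[:, -1]                # dominant eigenvector
    v = float(np.median(X @ w))
    return w, v


# Toy data spread along the diagonal.
X = np.array([[0, 0.0], [1, 1.1], [2, 1.9], [3, 3.2], [4, 3.8], [5, 5.1]])
w_eig, v_eig = efficient_split(X)
w_rnd, v_rnd = efficient_split(X, rng=np.random.default_rng(0))
```

Because v is the median of the projections, the split sends about half of the objects to each child regardless of the class labels, which keeps the tree balanced.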

6) CLINE
The authors of Cline [19] propose six analytical algorithms for building MDTs based on finding two points, one for each class, and using the line passing through them as w. The cut point v is selected as the midpoint between the two selected points, projected onto w. The six variants choose the points A and B as follows: CL2 selects the nearest two points from different classes; CL4 uses four points instead of two; CLM selects the mean points of the classes; CLLDA first obtains w through Fisher's vector discriminant, then it selects the nearest two points from different classes after projecting them onto w; CLLVQ finds two centroids using Learning Vector Quantization; and CLMIX tests CLM, CLLDA, and CLLVQ in each node and uses the best one according to the split evaluation function.
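The CLM variant is the simplest to write down. The following sketch, which assumes two classes labeled 0 and 1 and uses made-up data, takes the class means as the points A and B, uses the direction from A to B as w, and projects the midpoint onto w to obtain v.

```python
import numpy as np


def clm_split(X: np.ndarray, y: np.ndarray):
    """CLM-style split: w joins the two class means, v is their projected midpoint."""
    a = X[y == 0].mean(axis=0)  # point A: mean of class 0
    b = X[y == 1].mean(axis=0)  # point B: mean of class 1
    w = b - a
    v = float(((a + b) / 2.0) @ w)
    return w, v


# Toy data: class 0 is centred at (0, 0) and class 1 at (2, 2).
X = np.array([[-1, 0], [1, 0], [0, -1], [0, 1],
              [1, 2], [3, 2], [2, 1], [2, 3]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
w, v = clm_split(X, y)
```

With this orientation of w, objects on the side of the class-0 mean satisfy $\mathbf{w} \cdot \mathbf{x} \le v$.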

7) HHCART
HHCART [27] calculates the eigenvectors of the covariance matrix of each class and uses the eigenvectors to build an m × m Householder matrix to project the data. Each column of the matrix is a candidate for w. For each candidate w, an exhaustive search is made to find a split point v, keeping the split that minimizes the impurity measure. Originally, HHCART had two variants: HHCART(A), which uses all eigenvectors as candidates for w; and HHCART(D), which uses only the dominant eigenvector as w.
A third version of the algorithm, HHCART(G) [33], is a variation of HHCART(D). HHCART(G) uses the angle bisector from the Geometric DT approach from Section III-A4, instead of the dominant eigenvector as w. For small sample sizes, where the angle bisector cannot be found using the original approach [25], the authors introduced a modified angle bisector. Of the three variations of HHCART, HHCART(G) has the highest average accuracy, and it is the most efficient.
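As a rough illustration of the Householder construction used by HHCART(D), the sketch below builds a reflection matrix from the dominant eigenvector d of one class. The specific formula $H = I - 2\mathbf{u}\mathbf{u}^T$ with $\mathbf{u} = (\mathbf{e}_1 - \mathbf{d}) / \lVert \mathbf{e}_1 - \mathbf{d} \rVert$ is stated here as an assumption about the method, since the text above only says that a Householder matrix is built from the eigenvectors; the function names and data are also illustrative.

```python
import numpy as np


def householder_from_eigenvector(d: np.ndarray) -> np.ndarray:
    """Return H = I - 2 u u^T with u = (e1 - d) / ||e1 - d||, so that H @ e1 = d."""
    d = d / np.linalg.norm(d)
    m = d.size
    e1 = np.zeros(m)
    e1[0] = 1.0
    diff = e1 - d
    norm = np.linalg.norm(diff)
    if norm < 1e-12:
        return np.eye(m)  # d is already axis-parallel; no reflection needed
    u = diff / norm
    return np.eye(m) - 2.0 * np.outer(u, u)


def hhcart_d_matrix(X_class: np.ndarray):
    """Householder matrix built from the dominant eigenvector of one class."""
    cov = np.cov(X_class, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    d = eigvecs[:, -1]
    return householder_from_eigenvector(d), d


X_class = np.array([[0, 0.0], [1, 1.1], [2, 1.9], [3, 3.2], [4, 3.8], [5, 5.1]])
H, d = hhcart_d_matrix(X_class)
# Each column of H is a candidate w: column j of X_class @ H is the projection onto H[:, j].
X_reflected = X_class @ H
```

Searching for axis-parallel splits in the reflected data `X_reflected` is then equivalent to searching for oblique splits in the original space, with the columns of H as weight vectors.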

B. ITERATIVE APPROACHES FOR MULTIVARIATE SPLIT GENERATION
In this section, we review iterative algorithms, which, given an initial solution for the weight coefficients w, use an iterative procedure to modify them to improve the split evaluation measure. Table 2 displays how each of the iterative algorithms considered in our investigation (one algorithm per row) is categorized in terms of the taxonomy presented in Section II-B. We have identified five general approaches used to search for candidate splits through iterative algorithms:
• Hill climbing (Section III-B1). These algorithms use hill climbing or some variation thereof, such as simulated annealing, to search for candidate splits.
• Linear discriminant functions (Section III-B2). These algorithms generate K-ary trees through linear discriminant functions. The linear combination for each discriminant function is obtained through an iterative algorithm.
• Neural networks (Section III-B3). These algorithms use neural networks to generate candidate splits; some of these algorithms generate non-linear splits.
• Evolutionary algorithms (Section III-B4). These algorithms run an evolutionary algorithm to improve an initial population of solutions.
• Linear programming (Section III-B5). These algorithms pose the problem of finding a candidate split as a linear programming problem.

1) HILL CLIMBING AND ANNEALING ALGORITHMS. CART-LC, SADT, OC1, APDT, FAT/MOC1
The oldest iterative algorithm we review, Classification and Regression Trees with Linear Combination (CART-LC), was proposed by Breiman et al. [34] as an extension of CART to linear combinations. CART-LC is deterministic. It starts with the best univariate split, normalizes the weight coefficients, and generates two candidate splits by modifying a single weight coefficient at a time by 0.25 and −0.25.
One problem with CART-LC is that it can easily get stuck in local optima because it is deterministic. To solve this problem, Heath et al. [37] proposed Simulated Annealing of Decision Trees (SADT), which uses simulated annealing to find the weight coefficients w and is therefore less likely to get stuck in local optima.

Murthy et al. [9] noted that SADT is computationally expensive, so they proposed Oblique Classifier 1 (OC1), an extension to CART-LC that includes two randomization procedures. Like CART-LC, OC1 starts with the best axis-parallel split, but it modifies the weight coefficients by small random amounts. The advantage of OC1 over SADT is that it is more efficient.
FAT and Margin OC1 (MOC1) [41] are DTs based on OC1; the main difference from OC1 is that they try to maximize the margin in each node. FAT first builds an OC1 tree; then, in each inner node, the objects are relabeled as right or left according to the child they fall in. The relabeled objects are linearly separable, so SVM is used to find the hyperplane with maximal margin. MOC1 modifies the split evaluation measure to include the size of the margin.
The Alopex Perceptron Decision Tree (APDT) [39] uses the Alopex algorithm, a variant of simulated annealing, to find a split. One difference from SADT, which also uses simulated annealing, is that in APDT all weights are randomly modified at each step.
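The deterministic search that this family builds upon can be sketched as a simple coordinate-wise hill climber. The code below is loosely inspired by the description of CART-LC given above and is not a faithful implementation of it or of OC1: it starts from the best axis-parallel split, changes one coefficient at a time by +0.25 or −0.25, and keeps a change only if the impurity strictly decreases. The Gini impurity, the exhaustive re-search of v after every change, the lack of coefficient normalization, and the toy data are assumptions of this sketch.

```python
import numpy as np


def gini(labels: np.ndarray) -> float:
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return float(1.0 - np.sum(p ** 2))


def best_threshold(z: np.ndarray, y: np.ndarray):
    """Exhaustive search for v on the projection z; returns (v, weighted impurity)."""
    values = np.unique(z)  # sorted distinct projections
    best_v, best_imp = float(values[0]), gini(y)  # fallback: no real split
    for v in (values[:-1] + values[1:]) / 2.0:
        left, right = y[z <= v], y[z > v]
        imp = (left.size * gini(left) + right.size * gini(right)) / y.size
        if imp < best_imp:
            best_v, best_imp = float(v), imp
    return best_v, best_imp


def hill_climb_split(X: np.ndarray, y: np.ndarray, step: float = 0.25, max_rounds: int = 10):
    m = X.shape[1]
    # Start from the best axis-parallel (univariate) split.
    starts = []
    for j in range(m):
        v_j, imp_j = best_threshold(X[:, j], y)
        starts.append((imp_j, j, v_j))
    imp, j, v = min(starts)
    w = np.zeros(m)
    w[j] = 1.0
    for _ in range(max_rounds):
        improved = False
        for k in range(m):
            for delta in (step, -step):
                cand = w.copy()
                cand[k] += delta  # perturb a single coefficient
                cand_v, cand_imp = best_threshold(X @ cand, y)
                if cand_imp < imp:  # accept only strict improvements
                    w, v, imp, improved = cand, cand_v, cand_imp, True
        if not improved:
            break  # local optimum for this step size
    return w, v, imp


X = np.array([[0, 0.0], [1, 1.2], [2, 1.8], [3, 3.1],
              [0, 1.1], [1, 2.0], [2, 3.2], [3, 3.9]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
w, v, impurity = hill_climb_split(X, y)
```

Because only strict improvements are accepted, the search stops as soon as no single-coefficient change helps, which is the local-optimum behavior that motivated the randomized variants discussed in this subsection.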

2) LINEAR DISCRIMINANT FUNCTIONS. LMDT, oRF
Brodley et al. [15] proposed using linear discriminant functions to generate candidate splits in Linear Machine Decision Trees (LMDT). Usually, discriminant functions are obtained analytically; however, Brodley et al. [15] proposed three algorithms for finding the coefficients $w_i$, $v_i$ iteratively. The first algorithm (RLS) uses the recursive least-squares procedure to find the parameters $w_i$, $v_i$; since this approach only works for two classes, the trees generated are binary. The other two algorithms generate K-ary trees by treating each linear discriminant function as a perceptron. Since the perceptron may not converge if the objects are not linearly separable, the authors use the Pocket algorithm as one solution for problems that are not linearly separable. As a second solution, the authors use Thermal Training, which is a variation on simulated annealing. The authors' experiments were restricted to two-class databases, where the RLS algorithm outperforms the others in accuracy.
Menze et al. [43] proposed Oblique Random Forests (oRF), which builds MDTs using ridge regression, where the regularization parameter λ is adjusted iteratively. The authors note that with $\lambda = 0$ the split is similar to one obtained through discriminant analysis approaches, while $\lambda \gg 1$ results in a split similar to one obtained through principal component analysis.
Both LMDT with recursive least squares and oRF are limited to two-class problems. The other versions of LMDT can work directly with multi-class problems.

3) NEURAL NETWORKS. BMDT, CTNNFE, OMNIVARIATE
Liu et al. [38] proposed BMDT, which transforms the problem of inducing binary multivariate decision trees into one of inducing binary univariate decision trees. The algorithm trains a 2-layer feed-forward neural network, where hidden units are used as new features
$$x'_i = \delta\Big(\sum_{j=1}^{m} w_{ij} x_j + w_{i0}\Big),$$
where $w_{ij}$ are the weights from the feature with index $j \in F$ to the hidden unit i, and δ is a non-linear squashing function. A univariate tree is built using the new features, where splits take the form $x'_i \le v'$. A linear split is obtained by taking the inverse of the squashing function,
$$\sum_{j=1}^{m} w_{ij} x_j + w_{i0} \le \delta^{-1}(v').$$
Guo et al. [36] proposed Classification Trees with Neural Network Feature Extraction (CTNNFE), which at each node builds a multilayer perceptron to generate non-linear splits. As an extension, Yildiz et al. [35] proposed Omnivariate decision trees, which can make univariate splits, linear multivariate splits, and non-linear multivariate splits. To make linear multivariate splits, the Omnivariate algorithm uses a single layer perceptron.
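The BMDT transformation described in this subsection relies on the squashing function being invertible and increasing, so that a threshold on a hidden unit is equivalent to a threshold on the underlying linear combination. The following sketch checks this equivalence numerically, assuming that the squashing function is the logistic sigmoid (the text only requires a non-linear squashing function); the weights and instances are made up.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def logit(p: float) -> float:
    """Inverse of the logistic sigmoid, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def hidden_feature(x: Sequence[float], w: Sequence[float], w0: float) -> float:
    """New feature produced by one hidden unit: sigmoid(sum_j w_j x_j + w0)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w0)


def linear_activation(x: Sequence[float], w: Sequence[float], w0: float) -> float:
    return sum(wj * xj for wj, xj in zip(w, x)) + w0


# Hypothetical hidden unit and univariate threshold on the new feature.
w, w0, v_prime = [2.0, 3.0], -40.0, 0.5
linear_threshold = logit(v_prime)  # threshold on the linear combination

instances = [[10.0, 5.0], [10.0, 10.0], [1.0, 1.0], [20.0, 2.0]]
univariate_decisions = [hidden_feature(x, w, w0) <= v_prime for x in instances]
linear_decisions = [linear_activation(x, w, w0) <= linear_threshold for x in instances]
```

Both lists of decisions coincide, which is why a univariate tree over the hidden units can be read as a tree with linear multivariate splits over the original features.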

4) EVOLUTIONARY ALGORITHMS. HBDT, OmniGA
Struharik et al. [44] proposed the HereBoy DT (HBDT) algorithm, which runs an evolutionary algorithm to find the optimal split point at each node. OmniGA [46] uses a genetic algorithm to generate trees based on Omnivariate trees [35]. The genetic algorithm is used to select the type of split made at each node, optimize the parameters of the split, and prune nodes.

The authors of Dipolar [40] propose to use the basis exchange algorithm, which is similar to linear programming, to minimize the dipolar criterion function. A dipole is a pair of objects from the database; a pure dipole is one with objects of the same class, while a mixed dipole has objects of different classes. The authors aim to select a hyperplane that divides a high number of mixed dipoles and a low number of pure dipoles, so they define the dipolar criterion function as a weighted sum of the cost of separating pure dipoles and the cost of not separating mixed dipoles.
The authors of Optimal Classification Trees (OCT) [14] formulate the problem of building UDTs and MDTs as a mixed-integer optimization problem. The objective function considers a trade-off between accuracy and model complexity.
The authors of Vertical Decision Trees (VDT) and Cutting Decision Trees (CDT) [42] also formulate the problem of building MDTs as a mixed-integer optimization problem. However, VDT is not allowed to grow in width by making each inner node have at least one leaf child. Given that the variables in the optimization problem grow at an exponential rate with respect to the depth of the tree, the authors propose the CDT, in which the number of variables grows linearly with the depth of the tree. In the CDT, only the leaf nodes at maximum depth may be impure.

The authors of Supervised Budgeted Tree (SBT) and Powerset Tree (PT) [45] analyze the error bound of an MDT and conclude that to decrease testing error, they must decrease training error, put a constraint on the weight coefficients, and enlarge the margin in each node. The authors create a Budget-aware classifier, which has these constraints incorporated into an optimization problem. In each node, the Budget-aware classifier is used to generate splits. For two classes, the SBT tree is built in a top-down manner; however, for multiple classes, the PT bottom-up algorithm is used, which generates only one leaf per class.

6) DTSVM
The authors of Decision Tree SVM (DTSVM) [47] build a tree using νSVM to generate splits. Since νSVM is designed to deal with two classes, the authors mention that they use a one-versus-others strategy to transform multiple classes into a binary class; however, it is not clear whether this strategy is used to generate several trees or whether it is used at each node to generate candidate splits.
After building the MDT, DTSVM encodes each object in a vector $v \in \{0, 1\}^m$, where each element corresponds to an inner node of the tree. If an object passes through node i, then its corresponding value is 1; otherwise, it is 0. This new feature space is used to classify the objects using a linear SVM.

7) BDTKS
Binary Decision Tree based on K-means Splitting (BDTKS) [48] applies K-means with k = 2 at each node and uses the centroids to calculate w, v. To calculate w, BDTKS obtains the hyperplane passing through both centroids; then, the centroids are projected onto w, and the midpoint is used as v. The split evaluation function is a modified impurity function that takes into account class imbalance.
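The geometric part of the BDTKS split can be sketched as follows. This is an illustrative sketch under several assumptions: a tiny deterministic version of Lloyd's algorithm stands in for K-means, the initialization rule and iteration count are arbitrary, w is taken as the direction joining the two centroids, and the modified impurity function mentioned above is not modeled.

```python
import numpy as np


def two_means(X: np.ndarray, n_iter: int = 20) -> np.ndarray:
    """Lloyd's algorithm with k = 2 and a deterministic initialisation (an assumption)."""
    c0 = X[0]
    c1 = X[np.argmax(np.linalg.norm(X - c0, axis=1))]  # point farthest from X[0]
    centroids = np.stack([c0, c1]).astype(float)
    for _ in range(n_iter):
        # Distance of every object to both centroids, then nearest-centroid assignment.
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = np.argmin(dist, axis=1)
        for k in (0, 1):
            if np.any(assign == k):
                centroids[k] = X[assign == k].mean(axis=0)
    return centroids


def kmeans_split(X: np.ndarray):
    """w joins the two centroids; v is their midpoint projected onto w."""
    c0, c1 = two_means(X)
    w = c1 - c0
    v = float(((c0 + c1) / 2.0) @ w)
    return w, v


# Toy data with two well-separated groups centred at (0, 0) and (2, 2).
X = np.array([[-1, 0], [1, 0], [0, -1], [0, 1],
              [1, 2], [3, 2], [2, 1], [2, 3]], dtype=float)
w, v = kmeans_split(X)
```

Unlike the class-mean construction used by Cline's CLM variant, the centroids here are found without looking at the class labels; the labels only enter through the split evaluation function.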

C. OTHER DTs WITH MULTIVARIATE DECISIONS
Other types of decision trees use multivariate decisions, such as Soft decision trees, Model trees, and Functional trees. We will briefly describe these trees; however, in this paper, our focus is on MDTs. Comparing these other trees deserves future study.
Soft decision trees [49] and Fuzzy decision trees [50], unlike the DTs we have discussed so far, do not tag branches with binary tests which tell us which branch to follow. Instead, each node has a gating function that gives probabilities or membership degrees to the children nodes. All paths from the root to the leaves are traversed with probabilities assigned by the gating function.
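The difference between a hard test and a gating function can be illustrated with a soft tree of depth one. The sketch below assumes a sigmoid gating function over a linear combination of the features and leaves that store class probability vectors; these choices, the function name, and all numbers are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def soft_tree_predict(
    x: Sequence[float],
    w: Sequence[float],
    b: float,
    leaf_left: Sequence[float],
    leaf_right: Sequence[float],
) -> List[float]:
    """Depth-1 soft tree: both leaves contribute, weighted by the gating probability."""
    activation = sum(wj * xj for wj, xj in zip(w, x)) + b
    g = 1.0 / (1.0 + math.exp(-activation))  # probability of taking the left branch
    return [g * pl + (1.0 - g) * pr for pl, pr in zip(leaf_left, leaf_right)]


# An object lying exactly on the gating hyperplane gets the average of both leaves.
probs_boundary = soft_tree_predict([2.0, 2.0], [1.0, -1.0], 0.0, [0.9, 0.1], [0.2, 0.8])
# An object far on one side is dominated by a single leaf.
probs_far = soft_tree_predict([30.0, 0.0], [1.0, -1.0], 0.0, [0.9, 0.1], [0.2, 0.8])
```

In a deeper tree, the same blending is applied recursively, so every root-to-leaf path contributes with the product of the gating probabilities along it.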
Model trees are univariate trees that make multivariate decisions at each leaf [51], while functional trees are a generalization of MDTs and Model trees [52]. Functional trees allow combinations of features at inner nodes and leaf nodes. Gama [52] compared his algorithm for Functional Trees with one algorithm from each of the other groups: CRUISE [18] for MDTs, and M5' [53] for Model trees. The algorithms were compared in 30 databases, and a Wilcoxon test with Bonferroni correction showed no significant difference between Functional Trees and the other algorithms. However, the Functional tree was ranked first, then the MDT, and lastly the Model tree.

D. LIMITATIONS OF RELATED WORK COMPARISON
When presenting a new classification algorithm, authors should attempt to determine how it compares, in terms of performance, against others that are a reference to the community. Moreover, for this experimental comparison to be statistically sound, all algorithms should be tested in enough databases, observing the conditions of the corresponding hypothesis test. This way, algorithms can be ranked in terms of performance, and so any upcoming algorithm could be compared against only a subset of top-rank algorithms. In the literature about MDT induction, however, we have noticed two widespread limitations on recent papers: the proposed new MDT algorithm is not compared with relevant existing rival MDTs, or the comparison does not involve testing the algorithms in enough databases to support any claim statistically. Fig. 1 displays each algorithm, ordered in terms of year of appearance, from 1977 to 2018. It aims to convey both the number of times the algorithm in question has been used as a reference to compare the behavior of others (orange bar) and the number of algorithms it was compared against upon introduction (blue bar). We can see that in most papers, authors compare their algorithm against at most a couple of other MDT induction algorithms; the most comprehensive study compares their algorithm against five others. Furthermore, only CART-LC, OC1, QUEST, and LMDT are used as a reference for comparison more than twice. Note that most algorithms have never been used in any experimental comparison. Fig. 2 is similar to Fig. 1, except that it displays the number of databases used when testing the proposed algorithm. Looking at it, we notice that only 9 algorithms use at least 20 databases. Given that including more databases increases the power of statistical tests [54], and there are publicly available database repositories such as the UCI repository [12], new publications should strive to include more databases in their experimental comparisons.

E. MDT INDUCTION CONCLUSIONS
In this section, we presented a taxonomy that enables us to identify common split generation strategies. We grouped the algorithms into analytical and iterative, according to their approach for generating the weight coefficients w for candidate splits. In both groups, we identified subgroups of strategies for finding w.
Although plenty of algorithms can use a mix of univariate and linear multivariate splits when building a tree, multivariate splits often involve all features because most algorithms lack feature selection. Feature selection is important to keep the models simple and is helpful because class separability is sometimes found on subsets of features [20]. Another problem of some algorithms is that they cannot work with multiple classes, or they group multiple classes into two groups, potentially losing information about class separability found on subsets of features. The widespread limitations in recent papers regarding the comparison of their algorithm against relevant rival MDTs prevent us from readily identifying the best strategies for building MDTs. Therefore, to achieve our goal of evaluating the relative merit of existing MDT induction algorithms, we will use 20 MDT induction algorithms and make a statistical comparison using 57 databases. In the following section, we describe the selection process and databases.

IV. EXPERIMENTAL SETUP
In this section, we present our proposed methodology to accomplish our research objectives of comparing MDTs to each other and identifying the top-performing algorithms. In Section IV-A, we define the measures used to evaluate a classifier. In Section IV-B, we show the methods used to compare classifiers using the measures defined in Section IV-A. In Section IV-C, we describe the databases and algorithms used in this study. Finally, we describe the evaluation protocol in Section IV-D.

A. EVALUATION MEASURES
The measures we use to evaluate classification performance can be obtained from the confusion matrix, which is a result of the classifier applied to a testing database. The confusion matrix is a k × k matrix, where k is the number of classes. The rows correspond to actual classes and the columns to predicted classes. We show an example of a confusion matrix in Table 3. From the row Iris-setosa, we can see that 49 objects were correctly classified as Iris-setosa, and one object was incorrectly classified as Iris-versicolor. Similar remarks hold for the two other classes.
Let C be a confusion matrix of size k × k. Each cell $c_{ij}$ in C counts the number of objects of class i classified as class j.
Accuracy is a popular measure to evaluate a classifier, defined as the number of correctly classified objects divided by the total number of objects in the testing database. It can be obtained by adding the objects of the main diagonal of the confusion matrix over the total number of objects, say n:

$$\mathrm{Accuracy} = \frac{1}{n} \sum_{i=1}^{k} c_{ii}$$

One important drawback of using accuracy is that it does not take into account class imbalance. A database is highly imbalanced if it has many objects of one class compared to the rest of the classes; for such a database, always classifying objects as the class with most objects, the so-called majority class, will result in high accuracy.
Since many real-world databases are imbalanced, we use the Area Under the ROC curve (AUC), which is less sensitive to class imbalance [13]. The AUC measure for discrete classifiers for two classes is defined using recall and specificity [13]. Let $(C_i, C_j)$, with $i \neq j$, be any pair of classes, where $C_i$ denotes the Positive class and $C_j$ the Negative one. The number of objects of the Positive class $C_i$ correctly classified, $c_{ii}$, is the number of True Positives. The number of objects of the Positive class $C_i$ incorrectly classified, $c_{ij}$, is the number of False Negatives. The number of objects of the Negative class $C_j$ correctly classified, $c_{jj}$, is the number of True Negatives. The number of objects of the Negative class $C_j$ incorrectly classified, $c_{ji}$, is the number of False Positives.
The Recall is then defined as the proportion of objects of the positive class correctly classified:

$$\mathrm{Recall}_{ij} = \frac{c_{ii}}{c_{ii} + c_{ij}}$$

Specificity is defined as the proportion of objects of the negative class correctly classified:

$$\mathrm{Specificity}_{ij} = \frac{c_{jj}}{c_{jj} + c_{ji}}$$

Finally, the AUC for two classes i, j is defined as:

$$\mathrm{AUC}_{ij} = \frac{\mathrm{Recall}_{ij} + \mathrm{Specificity}_{ij}}{2}$$

To extend the definition of AUC to multi-class problems, we take the recommended one versus the others approach [13]. This approach consists of averaging the AUC of all possible pairs of classes as follows:

$$\mathrm{AUC} = \frac{2}{k(k-1)} \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \mathrm{AUC}_{ij}$$

Since many of the databases tested are imbalanced, our performance indicator in this study is AUC. Now, we will describe how to compare algorithms using AUC as an evaluation measure.
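The measures of this section can be computed directly from a confusion matrix, as in the following sketch. It assumes that the two-class AUC of a discrete classifier is the mean of recall and specificity and that the multi-class value averages over all unordered pairs of classes. The function names are hypothetical, and only the first row of the example matrix follows the Iris-setosa row described in the text; the other rows are made up.

```python
from itertools import combinations


def accuracy(C):
    """Sum of the main diagonal over the total number of objects."""
    n = sum(sum(row) for row in C)
    return sum(C[i][i] for i in range(len(C))) / n


def pair_auc(C, i, j):
    """AUC with class i as Positive and class j as Negative
    (assumed to be the mean of recall and specificity)."""
    recall = C[i][i] / (C[i][i] + C[i][j])
    specificity = C[j][j] / (C[j][j] + C[j][i])
    return (recall + specificity) / 2.0


def multiclass_auc(C):
    """Average of the pairwise AUC over all unordered pairs of classes."""
    pairs = list(combinations(range(len(C)), 2))
    return sum(pair_auc(C, i, j) for i, j in pairs) / len(pairs)


# Rows are actual classes, columns are predicted classes.
C = [[49, 1, 0],
     [0, 45, 5],
     [0, 5, 45]]
acc = accuracy(C)
auc = multiclass_auc(C)
```

Note that `pair_auc` divides by zero when a class has no objects in the relevant cells; a practical implementation would need to handle that case.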

B. STATISTICAL COMPARISON
For comparing algorithms in multiple databases, we used the Bayesian signed-rank test as described in the tutorial by Benavoli et al. [56]. This test is the Bayesian counterpart to Wilcoxon's test. To understand this test, we briefly describe how two classifiers are compared in a single database.
The Bayesian tests described by Benavoli et al. [56] are based on three hypotheses: that classifier A is practically better than B, that the classifiers are practically equivalent, and that classifier B is practically better than A. To calculate the probabilities of the hypotheses for a specific database, the Bayesian correlated t-test is used to obtain a distribution of mean differences of AUC.
The probabilities $\theta_l$, $\theta_e$, $\theta_r$ correspond to the integral of the distribution on different intervals: the region (−∞, −r), where classifier A is practically better than B; the region (r, ∞), where classifier B is practically better than A; and the region [−r, r], where the classifiers are practically equivalent. The interval [−r, r] is known as the region of practical equivalence (rope). Benavoli et al. [56] use r = 0.01 for accuracy; we will use the same value for AUC given the similarity of the measure for balanced databases.
To compare the classifiers on multiple databases, the Bayesian signed-rank test is used. For this test, a distribution on the probabilities $\theta_l$, $\theta_e$, $\theta_r$ is computed by Monte Carlo sampling. For a given sample, there is a bias towards $\theta_i$ if $\theta_i > \max(\theta_j, \theta_k)$. So, if for all our samples we have $\theta_i > \max(\theta_j, \theta_k)$, we conclude with a probability equal to 1 that hypothesis i is true. Let us say we conclude that classifier B is practically better than classifier A with a probability equal to 1. This does not necessarily mean that the difference of AUC between classifier B and A is always greater than 0.01. This means that the probability $\theta_r$ is always greater than both $\theta_l$ and $\theta_e$; in other words, there is always a bias towards classifier B winning.
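The decision rule above can be illustrated by counting, over the Monte Carlo samples, how often each probability dominates the other two. In this sketch the samples are hypothetical stand-ins drawn from a Dirichlet distribution with made-up parameters (built from gamma draws); they are not the posterior that the test of [56] would produce, and the function names are assumptions.

```python
import random


def dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def region_proportions(samples):
    """Fraction of samples (theta_l, theta_e, theta_r) in which each entry is the largest."""
    counts = [0, 0, 0]
    for theta in samples:
        counts[theta.index(max(theta))] += 1
    return [c / len(samples) for c in counts]


rng = random.Random(0)
# Made-up parameters that favor theta_r; not derived from any experiment.
samples = [dirichlet([2.0, 3.0, 40.0], rng) for _ in range(5000)]
p_left, p_rope, p_right = region_proportions(samples)
```

Here `p_right` plays the role of the probability that classifier B is practically better; a value close to 1 means that almost every sample is biased towards B, not that every AUC difference exceeds the rope.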
Benavoli et al. [56] visualize $\theta_l$, $\theta_e$, $\theta_r$ for each sample using a simplex with vertices {(1, 0, 0), (0, 1, 0), (0, 0, 1)}. In Figure 3, we show one for a comparison between the classifiers CRUISE and CL2 on a subset of databases with a high number of features. If a point falls in a vertex, then it has $\theta_i = 1$ for the corresponding hypothesis i; in the figure, a point in the left corner is a sample where the difference in AUC between CRUISE and CL2 is greater than 0.01 with a probability equal to 1. There are three regions limited by $\theta_i > \max(\theta_j, \theta_k)$; so the left region corresponds to the case where there is a bias towards CRUISE. In each corner, the proportion of samples falling in the corresponding region is shown; since almost all samples fall in the region where CRUISE is better, we have p(CRUISE) ≈ 1. Some samples fall in the region where CL2 is better; however, the proportion of those samples is smaller than $1 \times 10^{-3}$.

C. DATABASES AND EVALUATED ALGORITHMS
We have found 57 numerical databases without missing values from the UCI repository [12]. The databases are diverse, with varying numbers of objects, number of features, number of classes, and degree of imbalance. The full description of the databases can be found in Appendix A, where we can verify the diversity of the databases. However, we summarize key characteristics of the databases in Tables 4, 5, and 6.
In the following section, we compare 19 implementations of MDT induction algorithms using the databases and algorithm comparison methods described in this section. The classifiers compared include seminal MDTs that were designed for general-purpose classification: CART-LC and OC1. The rest of the classifiers are also for general-purpose classification, and the original authors tested them in diverse databases, such as the ones of the UCI repository.
We used the original authors' implementation for each classifier included. The implementations were publicly available online, or the authors were kind enough to share with us an implementation for academic purposes. The authors of SBT/PT [45] shared with us the implementation for their classifier; however, the implementation works only for two-class databases and the published results are also only for two-class databases.

D. EVALUATION PROTOCOL
For each algorithm a ∈ A listed in Section IV-C, we executed a in each dataset d ∈ D using 5-fold Distribution Optimally Balanced-SCV (DOB-SCV). The k-fold DOB-SCV [57] is an alternative to k-fold cross-validation that tries to keep the data distribution as similar as possible. Lopez et al. [58] suggest using k-fold DOB-SCV instead of k-fold cross-validation to avoid having different distributions between testing and training databases. For each execution of algorithm a ∈ A in database d ∈ D, we obtain the AUC (see Section IV-A) for each fold and calculate the mean AUC over the 5 folds.
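As a rough illustration of how DOB-SCV keeps fold distributions similar, the sketch below follows one reading of the procedure in [57]: within each class, a randomly chosen object and its nearest unassigned same-class neighbors are placed in k different folds. The function name, the Euclidean distance, and the toy data are assumptions, and the original implementation may differ in details such as tie handling.

```python
import random


def dob_scv_folds(X, y, k=5, seed=0):
    """Return a fold index per object, spreading nearby same-class objects across folds."""
    rng = random.Random(seed)
    fold_of = [None] * len(X)
    for label in sorted(set(y)):
        pending = [i for i in range(len(X)) if y[i] == label]
        while pending:
            start = rng.choice(pending)
            # Order unassigned objects by squared Euclidean distance to the start
            # object (the start object itself has distance 0).
            pending.sort(key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], X[start])))
            group, pending = pending[:k], pending[k:]
            for fold, i in enumerate(group):
                fold_of[i] = fold  # each member of the group goes to a different fold
    return fold_of


# Toy data: 12 objects of class 0 and 8 objects of class 1.
X = [(float(i), float(i % 3)) for i in range(20)]
y = [0] * 12 + [1] * 8
folds = dob_scv_folds(X, y, k=5)
```

With this assignment, each fold receives a nearly equal share of every class, and neighboring objects of the same class end up in different folds.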
When executing the algorithms (listed in Section IV-C), some implementations crashed for some databases. We have found it difficult to assess why a specific implementation failed in one database, because some of the algorithms do not have publicly available source code, and the errors were not handled by the developers in a way that could provide a useful message. To deal with this issue, we ran two experiments: the full algorithms experiment and the full databases experiment.
The full algorithms experiment aims to compare all algorithms, and so we removed from analysis those databases for which at least one algorithm failed. In this case, we ended up with 40 databases. By contrast, the full databases experiment aims to preserve all databases, and so we removed from analysis those algorithms that failed on at least one database. Then, we were left with 15 algorithms. This way our analysis is more robust because, as shall be seen in Section V, we can understand the effect of adding or removing algorithms or databases from our experiments.
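The two filtering rules can be sketched as follows, assuming that results are stored as a nested mapping in which a crashed run is recorded as `None`. The function names, the algorithm and database names, and the AUC values are placeholders, not results from this study.

```python
def full_algorithms(results):
    """Keep all algorithms; drop every database on which any algorithm failed."""
    databases = set.intersection(
        *({d for d, auc in runs.items() if auc is not None} for runs in results.values()))
    return {a: {d: runs[d] for d in sorted(databases)} for a, runs in results.items()}


def full_databases(results):
    """Keep all databases; drop every algorithm that failed on any database."""
    return {a: runs for a, runs in results.items()
            if all(auc is not None for auc in runs.values())}


# Placeholder results: mean AUC per (algorithm, database), None for a crash.
results = {
    "alg_A": {"db1": 0.91, "db2": 0.85, "db3": 0.78},
    "alg_B": {"db1": 0.88, "db2": None, "db3": 0.80},  # crashed on db2
    "alg_C": {"db1": 0.90, "db2": 0.83, "db3": 0.75},
}
exp_algorithms = full_algorithms(results)
exp_databases = full_databases(results)
```

In this toy case, the first rule keeps three algorithms on two databases, while the second keeps two algorithms on three databases.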
For each experiment, we make a first comparison of the algorithms through key statistics of the AUC obtained for each database. We show these statistics with boxplots and rank the classifiers by their median AUC.
Next, for each experiment, we take each pair of algorithms $(a_i, a_j)$, $i \neq j$, considered in the experiment and apply the Bayesian signed-rank test (see Section IV-B) for the subset of databases considered in the experiment. The test gives three probabilities as a result: the probability that $a_i$ is practically better than $a_j$, in other words, the probability of $a_i$ winning; the probability that $a_j$ is practically better than $a_i$, in other words, the probability of $a_j$ winning or $a_i$ losing; and the probability that $a_i$ and $a_j$ are practically equivalent, in other words, the probability of a tie.
To identify in which databases the algorithms perform better, we also apply the Bayesian signed-rank test to subsets of the databases for each experiment. First, for each experiment, we compare the results in databases with two classes against databases with more than two classes. Therefore, we will have four subsets of databases, shown in Table 6, where we apply the test. The other comparisons are in databases with up to 20 features, against databases with more than 20 features; databases with up to 1,000 objects, against databases with more than 1,000 objects; and databases with up to two objects of the majority class for each object in the minority class, against databases with more than two objects of the majority class for each object in the minority class.

V. RESULTS AND DISCUSSION
In this section, we show the results of our comparison of the 19 MDT induction algorithms. As discussed in Section IV-D, we have conducted two experiments, which we call the full algorithms experiment and the full databases experiment.
In Figure 4, we show a boxplot of the distribution of AUC for the full algorithms experiment. The algorithms are ordered according to their median AUC, with algorithms with the highest values at the bottom. The boxplot helps us to visualize the distribution of AUC for each algorithm, showing the minimum and maximum values with the whiskers (left and right small vertical lines at the edge of each dashed line), the median (bold line inside the box), and the first and third quartiles (left and right edges of the box).
Since we want to maximize AUC, we aim to build algorithms with high median AUC and low variability. Visually, small boxes and whiskers closer to the median indicate low variability. MHLDT has the highest median AUC, which may indicate good performance. From Figure 4, we notice that the median AUC of most algorithms is in the range (0.8, 0.9), and there is great overlap between the boxes. Even so, a consistent difference in performance statistics of AUC (median, first quartile, third quartile, and minimum) indicates that one algorithm is performing better than the other. A consistent improvement in AUC of 0.01 might not be easy to notice visually, but we will detect such a difference with the statistical tests. Other differences are more noticeable; for example, Omni obtains worse results than MHLDT in at least 25% of the databases, because the first quartile of Omni is lower than the minimum AUC of MHLDT.
In Figure 5, we show a boxplot with the distribution of AUC for the full databases experiment. We notice that when adding 17 databases, the minimum, median, and first and third quartiles for AUC are generally lower. We also notice that the relative ranking according to the median AUC of some algorithms changes. The lowered performance of the 15 classifiers might be because the 17 added databases are harder to classify, which furthers our motivation for giving results on all databases for a subset of algorithms. The full results with the AUC of each algorithm for each database are shown in Appendix B.
TABLE 7. Bayesian signed-rank test results for the full algorithms experiment. Each cell shows the probability that the algorithm in the row is practically better than the algorithm in the column. For a pair of algorithms i, j, the probability of a tie is $1 - p_{ij} - p_{ji}$. The algorithms are ranked by the number of times their probability of winning against another algorithm is higher than their probability of losing.
In the boxplots, we used the median AUC to rank the algorithms; however, the median AUC alone does not guarantee good performance, and we need to consider the whole distribution of AUC. For example, by ranking the algorithms by median AUC, CRUISE is at sixth place and QUEST is at third place; however, the AUC of the first quartile is noticeably higher for CRUISE. Pairwise comparison of CRUISE and QUEST would confirm that CRUISE achieves higher AUC values more often than QUEST, which seems natural since CRUISE was published years later than QUEST by the same authors, with CRUISE preserving some successful characteristics from QUEST. Since it is difficult to assess which algorithm is better by only looking at AUC statistics, to make a fair comparison of the algorithms, we need to apply statistical tests.

A. STATISTICAL COMPARISON
In Section IV-B, we described the Bayesian signed-rank test to compare a pair of algorithms. In Table 7, we show the results of the statistical test for the full algorithms experiment. The same results for the full databases experiment are shown in Table 8. The number in each cell is the probability that the algorithm in the row is practically better than the one in the column. The algorithms are ranked by the number of times their probability of winning against another algorithm is higher than their probability of losing. From the probabilities of a pair of algorithms i, j winning against each other, we can obtain the probability of the algorithms being practically equivalent as $1 - p_{ij} - p_{ji}$.
For example, from cell (2, 9) of Table 7, we see that the probability that CRUISE is practically better than QUEST is 0.43. From cell (9, 2) of Table 7, we see that there is a probability that QUEST wins against CRUISE of 0.02. From the previous probabilities, we can obtain the probability that QUEST and CRUISE are practically equivalent as 1 − 0.43 − 0.02 = 0.55. In 55% of cases, there is a bias towards the algorithms being practically equivalent. However, with no additional information, we can conclude that CRUISE outperforms QUEST because CRUISE still wins in 43% of cases, and QUEST only wins in 2% of the cases.
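The tie computation and the ranking rule used in Tables 7 and 8 can be sketched as follows. Only the CRUISE and QUEST entries (0.43 and 0.02) come from the text; the third algorithm, its probabilities, and the function names are made up for illustration.

```python
def tie_probability(p, i, j):
    """Probability of practical equivalence: 1 - p_ij - p_ji."""
    return 1.0 - p[i][j] - p[j][i]


def rank_by_wins(p):
    """Order algorithms by how often their probability of winning exceeds that of losing."""
    wins = {i: sum(1 for j in p[i] if p[i][j] > p[j][i]) for i in p}
    return sorted(p, key=lambda i: wins[i], reverse=True), wins


# p[i][j]: probability that algorithm i is practically better than algorithm j.
p = {
    "CRUISE": {"QUEST": 0.43, "ALG_X": 0.70},
    "QUEST": {"CRUISE": 0.02, "ALG_X": 0.60},
    "ALG_X": {"CRUISE": 0.05, "QUEST": 0.10},  # hypothetical algorithm
}
tie = tie_probability(p, "CRUISE", "QUEST")
order, wins = rank_by_wins(p)
```

In this toy matrix, CRUISE is ranked first because its probability of winning exceeds its probability of losing against both other algorithms.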
In Figure 6, we show the median probability of winning or losing of each algorithm, as well as the median AUC (color and size of points). Although we are only showing the median values, we can now visually compare multiple algorithms simultaneously, which is challenging to do from the tables. Furthermore, we will see that we can visually identify groups of top-performing algorithms that match the ranking used in Tables 7 and 8.
For the full algorithms experiment, in the top plot of Figure 6, we notice a group of top-performing algorithms with a median probability of winning higher than 0.7 and a median probability of losing smaller than 0.3, namely, MHLDT, CRUISE, MPSVMpca, MPSVMparallel, CART-LC, MPSVMlda, and OC1. The algorithms in this group are also at the top of the ranking in Table 7.
TABLE 8. Bayesian signed-rank test results for the full databases experiment. Each cell shows the probability that the algorithm in the row is practically better than the algorithm in the column. For a pair of algorithms i, j, the probability of a tie is $1 - p_{ij} - p_{ji}$. The algorithms are ranked by the number of times their probability of winning against another algorithm is higher than their probability of losing.
We notice that CART-LC and OC1 are at positions 5 and 6 in the ranking of Table 7, which seems high. These algorithms are two of the most popular algorithms used in experimental comparisons, so we would expect most algorithms to outperform them. We even notice that, with a median probability greater than 0.6, CART-LC wins over 11 algorithms (labeled 9 to 19 in Table 7), and OC1 wins over 10 algorithms (labeled 10 to 19 in Table 7).
For the full databases experiment, in the bottom plot of Figure 6, we can identify a smaller group of five top-performing algorithms with a median probability of winning higher than 0.9 and a median probability of losing smaller than 0.1, namely, MHLDT, CRUISE, MPSVMparallel, MPSVMlda, and MPSVMpca. The algorithms in the group are also at the top of the ranking in Table 8. With this group of five algorithms in mind, we notice that the only difference in the ranking when considering all classifiers, for the full algorithms experiment, is that MPSVMparallel goes from rank 4 to 7.
We now want to identify common characteristics of MHLDT, CRUISE, MPSVMlda, and MPSVMpca, which are among the top-five ranked algorithms for the full algorithms experiment and the full databases experiment. The first common characteristic is that all algorithms use an analytical method to find the coefficients of the linear combination w. However, the specific procedure for finding w is different for each algorithm; MHLDT uses a multi-class version of Fisher's linear discriminant, CRUISE uses linear discriminant functions, producing K-ary splits, and all versions of Zhang's MPSVM use MPSVM to find clustering hyperplanes used to obtain w.
A second common characteristic is that the classifiers may use univariate splits in some cases. However, only MHLDT uses a feature selection method to obtain multivariate splits with few features, which may be an advantage. A third common characteristic between MHLDT and CRUISE, which are better ranked than MPSVM, is that they can work directly with multi-class problems.
The results shown have taken into account databases with a diverse number of features, objects, classes, and degrees of imbalance. However, an algorithm with low performance in a group of databases, such as high-imbalance databases, may have competitive performance in another group, such as low-imbalance databases. We now make statistical comparisons by type of database to identify top-performing algorithms in groups of databases.

B. ANALYSIS BY TYPE OF DATABASE
In this section, we will show the results of the statistical test for subsets of our databases according to the number of features, objects, classes, and degree of imbalance. Since we will have 16 comparisons, we only show the results with figures similar to Figure 6. The results in table form can be consulted in Appendix B.

1) NUMBER OF CLASSES
First, let us compare the performance of the classifiers in a subset of databases with two classes against a subset of databases with more than two classes. In Figure 7, the upper subplots correspond to database subsets for the full algorithms experiment, while the lower subplots correspond to database subsets for the full databases experiment. The left subplots have database subsets containing only two classes, while the right subplots have database subsets containing more than two classes. We notice that the median AUC for all classifiers is lower for databases with two classes than for databases with more than two classes; this suggests that the selected two-class databases may be more difficult to classify correctly.
For both experiments, when considering a subset of databases with more than two classes, we notice that the five algorithms with the highest median probability of winning and lowest median probability of losing are MHLDT, CRUISE, MPSVMlda, MPSVMparallel, and MPSVMpca. This group of algorithms matches the one obtained in the test considering all 57 databases for the full databases experiment, as can be verified in Figure 6.
For the case with only two classes, for the full databases experiment, we add QUEST to the group of five top-performing algorithms identified in Figure 6. OCT may also be considered because it has a median probability of winning slightly higher than CRUISE, but also a higher median probability of losing.
Identifying groups of algorithms that perform well in database subsets with common characteristics can help us select which algorithms to test given a specific database. For example, we would test QUEST before CRUISE for two-class databases since it has a slightly higher median probability of winning and a slightly lower median probability of losing. However, if the objective is to design a new algorithm that works well in databases with more than two classes, we would like to know what CRUISE is doing differently from QUEST to achieve an improvement in classification performance. Although CART-LC and OC1 are the oldest MDTs we compare, they manage to outperform most of the algorithms for two-class databases. However, CART-LC and OC1 have reduced performance for databases with more than two classes. Considering that the top-performing algorithms are analytical algorithms, this result may suggest that iterative algorithms may have difficulties finding good splits for databases with more than two classes.

2) NUMBER OF FEATURES
In Figure 8, the upper subplots correspond to database subsets for the full algorithms experiment, while the lower subplots correspond to database subsets for the full databases experiment. The left subplots have database subsets with up to 20 features, while the right subplots have database subsets containing more than 20 features. We notice that the median AUC for all classifiers is lower for databases with up to 20 features than for databases with more than 20 features; this suggests that the problems with more features are more difficult to classify correctly.
FIGURE 8. Statistical comparison of classification performance by the number of features, using the Bayesian signed-rank test. The upper subplots show the results for the full algorithms experiment, and the lower subplots show the results for the full databases experiment. The left subplots show the results for database subsets with up to 20 features, and the right subplots show the results for database subsets with more than 20 features. Each subplot shows the median probability of each algorithm winning and the probability of each algorithm losing; the best performing algorithms are at the bottom right corner, while the worst performing algorithms are at the upper left corner. We also show the median AUC through the size and color of the points.
For the case with up to 20 features, for both experiments, we identify a small group of three top-performing algorithms: MHLDT, MPSVMlda, and CRUISE.
For the case with more than 20 features, the group of three best performing classifiers changes. We can still identify CRUISE and MHLDT in a group of five top-performing algorithms for the full databases experiment, but we only identify MHLDT in the group of five top-performing algorithms for the full algorithms experiment.
The top-performing algorithms for databases with few features are all analytical algorithms. However, we do not notice a clear pattern for the top-performing algorithms in databases with many features. For the full algorithms experiment, the top-performing algorithms include an analytical algorithm with feature selection (MHLDT), an iterative algorithm with feature selection (CART-LC), an iterative algorithm without feature selection (OC1), and a tree optimization algorithm (OCT).
FIGURE 9. Statistical comparison of classification performance by the number of objects, using the Bayesian signed-rank test. The upper subplots show the results for the full algorithms experiment, and the lower subplots show the results for the full databases experiment. The left subplots show the results for database subsets with up to 1,000 objects, and the right subplots show the results for database subsets with more than 1,000 objects. Each subplot shows the median probability of each algorithm winning and the probability of each algorithm losing; the best performing algorithms are at the bottom right corner, while the worst performing algorithms are at the upper left corner. We also show the median AUC through the size and color of the points.

3) NUMBER OF OBJECTS
In Figure 9, the upper subplots correspond to database subsets for the full algorithms experiment, while the lower subplots correspond to database subsets for the full databases experiment. The left subplots have database subsets with up to 1,000 objects, while the right subplots have database subsets containing more than 1,000 objects. We notice that the median AUC for all classifiers is lower for databases with up to 1,000 objects than for databases with more than 1,000 objects; this suggests that the problems with fewer objects are more difficult to classify correctly.
FIGURE 10. Statistical comparison of classification performance by the degree of imbalance, using the Bayesian signed-rank test. The upper subplots show the results for the full algorithms experiment, and the lower subplots show the results for the full databases experiment. The left subplots show the results for database subsets with up to 2 objects of the majority class for each object of the minority class, and the right subplots show the results for database subsets with more than 2 objects of the majority class for each object of the minority class. Each subplot shows the median probability of each algorithm winning and the probability of each algorithm losing; the best performing algorithms are at the bottom right corner, while the worst performing algorithms are at the upper left corner. We also show the median AUC through the size and color of the points.
For the case with up to 1,000 objects, we notice that the five top-performing algorithms for the full databases experiment are again MHLDT, CRUISE, MPSVMparallel, MPSVMlda, and MPSVMpca. This group of algorithms can also be identified for the full algorithms experiment, which considers all algorithms.
For the case with more than 1,000 objects, for the full databases experiment, the group of five top-performing algorithms includes QUEST instead of MPSVMpca. For the full algorithms experiment, the group of top-performing algorithms changes, now including MHLDT, CART-LC, OCT, OC1, and CRUISE.
The top-performing algorithms for databases with few objects are again analytical algorithms. For the databases with many objects, we have similar results to the databases with many features; the top-performing algorithms MHLDT, CART-LC, OCT, and OC1, follow different approaches to generate splits.

4) IMBALANCE
In Figure 10, the upper subplots correspond to database subsets for the full algorithms experiment, while the lower subplots correspond to database subsets for the full databases experiment. The left subplots have database subsets with low imbalance, which we consider as databases with up to 2 objects of the majority class for each object of the minority class. The right subplots have database subsets with high imbalance, which we consider as databases with more than 2 objects of the majority class for each object of the minority class. We notice that the median AUC for all classifiers is lower for imbalanced databases than for balanced databases; this suggests that problems with high imbalance are more difficult to classify correctly.
For the case of balanced databases, for the full databases experiment, the five top-performing algorithms are CRUISE, MHLDT, CLDA, QUEST, and CLMIX. For the full algorithms experiment, the group changes, now including MHLDT, CRUISE, OCT, CLMIX, and OC1.
For the case of imbalanced databases, for both experiments, the five top-performing algorithms are again MHLDT, CRUISE, MPSVMparallel, MPSVMlda, and MPSVMpca.
The top-performing algorithms for imbalanced databases are analytical algorithms. However, the top-performing algorithms for balanced databases, including MHLDT, CRUISE, OCT, and OC1, follow different approaches for split generation.
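The low/high imbalance grouping used above can be illustrated with a minimal sketch. The function names and the toy label lists are hypothetical, and for databases with more than two classes the sketch assumes that the ratio is taken between the largest and the smallest class, which may differ from the exact procedure used in the experiments.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Objects of the largest class per object of the smallest class.

    Assumption: for multi-class data the ratio is largest / smallest class.
    """
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def imbalance_group(labels, threshold=2.0):
    """Return 'low' when the ratio is at most `threshold`, otherwise 'high'."""
    return "low" if imbalance_ratio(labels) <= threshold else "high"


# Toy label lists (hypothetical, not taken from the 57 databases).
balanced = ["a"] * 60 + ["b"] * 40   # 60 / 40 = 1.5 -> low imbalance
skewed = ["a"] * 90 + ["b"] * 10     # 90 / 10 = 9.0 -> high imbalance

print(imbalance_group(balanced), imbalance_group(skewed))
```

A database with exactly 2 majority objects per minority object falls in the low-imbalance group, matching the "up to 2" wording of the threshold.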

VI. CONCLUSION
In this paper, we identified a gap in Multivariate Decision Tree (MDT) literature. There are no surveys about MDTs, and recent papers introducing MDTs suffer from one or two main shortcomings in their comparison with previous work: authors do not compare their algorithm with relevant rival techniques, or they do so but not in enough databases, and hence results are insufficient for statistically validating the underlying hypothesis.
Our goal is to evaluate the relative merit of published MDT induction algorithms to identify how they compare to one another. By doing so, we aim to fill a gap in MDT literature, and we hope that our findings help the community to select an MDT to use in a particular context.
To accomplish our goal, we first conducted a survey of 37 relevant MDT induction algorithms. Then, we evaluated the classification performance of 19 general-purpose MDT induction algorithms in 57 databases and made a statistical comparison.
Our conclusions are stronger than any other found in the literature, since we compare almost four times as many algorithms as the most thorough MDT study, which compared 5 algorithms in 20 databases [11]. We also compare the algorithms in 4 more databases than the reviewed study with the largest number of databases [14].
Our main contributions in this paper are: (1) we provide a sound and extensive review of MDT induction algorithms; (2) we provide the most extensive MDT comparison to date, supporting our results through statistical tests; (3) we identify groups of top-performing algorithms for all databases and subsets of databases with common characteristics.
We provide the full results of the Bayesian signed-rank tests for each pair of algorithms, which can be used to find if there is a bias towards either algorithm winning or towards a tie. To summarize the results of the Bayesian signed-rank tests and compare the overall performance of the algorithms, we made plots with the median probability of an algorithm winning or losing against other algorithms. With these plots, we were able to identify groups of top-performing algorithms in all databases and in subsets of databases according to the number of classes, number of features, number of objects, and degree of class imbalance.
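As an illustration of how such a summary can be derived from a table of pairwise results, the following minimal sketch takes a matrix whose entry (i, j) is the probability that algorithm i is practically better than algorithm j, and computes the median probability of winning, the median probability of losing, and the ranking criterion stated in the table captions. The function names and the 3 x 3 matrix are hypothetical; the values are not taken from our results.

```python
import numpy as np


def summarize_pairwise(p, names):
    """Summarize a matrix p where p[i, j] = P(algorithm i is practically better than j)."""
    p = np.asarray(p, dtype=float)
    n = len(names)
    rows = []
    for i in range(n):
        rivals = np.arange(n) != i      # every algorithm except i
        win = p[i, rivals]              # probability that i beats each rival
        lose = p[rivals, i]             # probability that each rival beats i
        rows.append({
            "name": names[i],
            "median_win": float(np.median(win)),
            "median_lose": float(np.median(lose)),
            # Ranking criterion: rivals against which winning is more likely than losing.
            "wins": int(np.sum(win > lose)),
        })
    return sorted(rows, key=lambda r: r["wins"], reverse=True)


def tie_probability(p, i, j):
    """Probability of a tie for the pair (i, j): 1 - p_ij - p_ji."""
    return 1.0 - p[i][j] - p[j][i]


# Hypothetical probabilities for three placeholder algorithms.
names = ["A", "B", "C"]
p = [[0.0, 0.7, 0.6],
     [0.1, 0.0, 0.5],
     [0.2, 0.3, 0.0]]

summary = summarize_pairwise(p, names)
print([row["name"] for row in summary])  # ['A', 'B', 'C']
```

Plotting `median_win` against `median_lose` for each algorithm gives the kind of scatter plot described above, with the strongest algorithms toward high winning and low losing probabilities.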
The two top-performing algorithms, when considering all databases, are CRUISE [18] and MHLDT [20]. Both are analytical algorithms that can work directly with multi-class problems. However, we notice that MHLDT often outperforms CRUISE, and one differentiating characteristic is that MHLDT uses feature selection. Next in the ranking are some versions of Zhang's MPSVM [2], which is also an analytical algorithm, but lacks feature selection and cannot work directly with multi-class problems. Given these results, we would encourage exploring improvements in analytical MDTs, taking into account feature selection, and taking care of how to work with multi-class databases.
For specific types of databases, we found that MHLDT and CRUISE are often among the top-performing algorithms. We also found that CART [34] and OC1 [9] outperform many algorithms in some subsets of databases, even though they are the oldest algorithms in the comparison and new algorithms are often compared against them. Specifically, CART and OC1 are among the top-performing algorithms in two-class databases, in databases with more than 20 features, and in databases with more than 1,000 objects. This result highlights the importance of comparing new MDT algorithms against previous algorithms in a sufficient number of diverse databases to allow making a statistical comparison.

A. CHALLENGES FOR MDTs AND FUTURE WORK
We have identified the following challenges for MDTs that can motivate future work:

TABLE 11. Bayesian signed-rank test results for databases with two classes from the full algorithms experiment. Each cell shows the probability that the algorithm in the row is practically better than the algorithm in the column. For a pair of algorithms i, j, the probability of a tie is 1 − p_ij − p_ji. The algorithms are ranked by the number of times their probability of winning against another algorithm is higher than their probability of losing.

Bayesian signed-rank test results for databases with more than two classes from the full algorithms experiment. Each cell shows the probability that the algorithm in the row is practically better than the algorithm in the column. For a pair of algorithms i, j, the probability of a tie is 1 − p_ij − p_ji. The algorithms are ranked by the number of times their probability of winning against another algorithm is higher than their probability of losing.

TABLE 13.
Bayesian signed-rank test results for databases with two classes from the full databases experiment. Each cell shows the probability that the algorithm in the row is practically better than the algorithm in the column. For a pair of algorithms i, j, the probability of a tie is 1 − p_ij − p_ji. The algorithms are ranked by the number of times their probability of winning against another algorithm is higher than their probability of losing.

Bayesian signed-rank test results for databases with more than two classes from the full databases experiment. Each cell shows the probability that the algorithm in the row is practically better than the algorithm in the column. For a pair of algorithms i, j, the probability of a tie is 1 − p_ij − p_ji. The algorithms are ranked by the number of times their probability of winning against another algorithm is higher than their probability of losing.

the PrefixSpan algorithm and are used to measure similarities between trees. This approach may be extended to MDTs; however, the challenge of comparing multivariate items efficiently must be handled first.
When dealing with MDTs with linear combinations, the hyperplane generated in a split might be changed by adding small noise so that the original and new hyperplanes are almost parallel and divide the data in the same way. Hence, we should evaluate when to consider pairs of multivariate items as equivalent even if the weights and splitting points do not match exactly.
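One possible way to decide when two linear splits should count as equivalent is sketched below. This is only an illustration under stated assumptions, not a method evaluated in this work: splits are assumed to have the form w·x <= b, and both tolerance values (`max_angle_deg` and `min_agreement`) as well as the function names are hypothetical choices. Two splits are treated as equivalent when their hyperplanes are almost parallel and they send nearly all objects to the same branch.

```python
import numpy as np


def split_side(X, w, b):
    """Boolean mask: True for objects sent to the left branch (w . x <= b)."""
    return X @ w <= b


def are_equivalent(X, w1, b1, w2, b2, max_angle_deg=5.0, min_agreement=0.99):
    """Check near-parallel hyperplanes that divide the data X in (almost) the same way."""
    w1 = np.asarray(w1, dtype=float)
    w2 = np.asarray(w2, dtype=float)
    # Angle between the normal vectors of the two hyperplanes.
    cos = np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2))
    angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    # Fraction of objects assigned to the same branch by both splits.
    agreement = np.mean(split_side(X, w1, b1) == split_side(X, w2, b2))
    return bool(angle <= max_angle_deg and agreement >= min_agreement)


# Toy data (hypothetical): four objects with two features.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.9, 0.9], [1.0, 0.8]])

# A split and a slightly perturbed copy of it.
same = are_equivalent(X, [1.0, 1.0], 1.0, [1.0, 1.02], 1.01)
# A split with an orthogonal hyperplane.
different = are_equivalent(X, [1.0, 1.0], 1.0, [1.0, -1.0], 0.0)
print(same, different)  # True False
```

The agreement check depends on the data used to compare the splits, so two hyperplanes judged equivalent on one sample could still differ on another; the angle check alone does not account for the splitting point.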

5)
Training time is a measure of interest not included in this work. We believe that we cannot make a fair comparison of runtime at the moment, given the great diversity of platforms and programming languages used for implementing the algorithms, such as Matlab, Weka (Java), C, and Julia. Furthermore, the algorithms implemented in C had to be evaluated in a Linux system; however, due to the volume of the experiments, the rest of the algorithms had to be evaluated in Windows servers of the GIEE-ML group at Tecnologico de Monterrey.
To make a fair comparison of the runtime of MDTs, we propose to implement the MDTs using the same platform and programming language. A statistical comparison should be performed comparing the original implementation with the new implementation of each algorithm to ensure there are no statistically significant differences. Then, it is possible to make a fair comparison of runtime by executing the algorithms on servers with the same characteristics and operating systems.
In Section III-C, we briefly described Model trees, which make multivariate decisions only at the leaves, and Functional trees, which make multivariate decisions at inner nodes and the leaves. It is important to compare the top-ranked algorithms of each group of algorithms, given that they all make multivariate decisions in some of the tree nodes, and can be interchangeably used in some contexts; for example, in contrast pattern-based classification, we could extract multivariate contrast patterns [20] from any of the three types of tree. Gama [52] compared one algorithm of each group without finding statistically significant differences between Functional trees and the other algorithms. However, the algorithms compared by Gama [52] are not shown to be among the top-ranked in their respective groups, and more MDT and Model tree algorithms have been published since Gama's comparison. Therefore, Gama's comparison should be updated to include the top-ranked algorithms of each group.
In this work, we only compared single MDTs. However, authors have built ensembles using MDTs, such as random forests (Oblique Random Forest [43]), multivariate alternating decision trees [63], and pattern-based classifiers (Pattern-based classifier for class imbalance problems, PBC4cip, with MHLDT [20]). Further work is thus concerned with comparing different MDT-based ensembles to identify which strategies lead to better classification performance.

Bayesian signed-rank test results for databases with up to 20 features from the full algorithms experiment. Each cell shows the probability that the algorithm in the row is practically better than the algorithm in the column. For a pair of algorithms i, j, the probability of a tie is 1 − p_ij − p_ji. The algorithms are ranked by the number of times their probability of winning against another algorithm is higher than their probability of losing.

Bayesian signed-rank test results for databases with more than 20 features from the full algorithms experiment. Each cell shows the probability that the algorithm in the row is practically better than the algorithm in the column. For a pair of algorithms i, j, the probability of a tie is 1 − p_ij − p_ji. The algorithms are ranked by the number of times their probability of winning against another algorithm is higher than their probability of losing.

TABLE 17.
Bayesian signed-rank test results for databases with up to 20 features from the full databases experiment. Each cell shows the probability that the algorithm in the row is practically better than the algorithm in the column. For a pair of algorithms i, j, the probability of a tie is 1 − p_ij − p_ji. The algorithms are ranked by the number of times their probability of winning against another algorithm is higher than their probability of losing.

TABLE 18. Bayesian signed-rank test results for databases with more than 20 features from the full databases experiment. Each cell shows the probability that the algorithm in the row is practically better than the algorithm in the column. For a pair of algorithms i, j, the probability of a tie is 1 − p_ij − p_ji. The algorithms are ranked by the number of times their probability of winning against another algorithm is higher than their probability of losing.

TABLE 19.
Bayesian signed-rank test results for databases with up to 1,000 objects from the full algorithms experiment. Each cell shows the probability that the algorithm in the row is practically better than the algorithm in the column. For a pair of algorithms i, j, the probability of a tie is 1 − p_ij − p_ji. The algorithms are ranked by the number of times their probability of winning against another algorithm is higher than their probability of losing.

TABLE 20. Bayesian signed-rank test results for databases with more than 1,000 objects from the full algorithms experiment. Each cell shows the probability that the algorithm in the row is practically better than the algorithm in the column. For a pair of algorithms i, j, the probability of a tie is 1 − p_ij − p_ji. The algorithms are ranked by the number of times their probability of winning against another algorithm is higher than their probability of losing.

APPENDIX A DATABASES
Here we list each database used in the experimental comparison of algorithms in Table 9. All databases, except dataset3d, come from the UCI repository [12]. References for the databases with a citation request are included in the table. For each database, we show the number of classes, number of features, number of objects, and degree of class imbalance. The subset of databases used in the full algorithms experiment (see Section IV-D) can be distinguished in the table.
The names of the databases match exactly the names in the UCI repository; however, the number of objects and features may be different. The number of objects may differ because some databases have separate training and testing sets and only the number of objects of one set is reported, or because of inconsistencies between the reported number of objects and the actual number of objects in the files. The number of features may differ because some authors count the class as an additional feature (we do not), or some features, such as IDs, must be removed.

Bayesian signed-rank test results for databases with up to 1,000 objects from the full databases experiment. Each cell shows the probability that the algorithm in the row is practically better than the algorithm in the column. For a pair of algorithms i, j, the probability of a tie is 1 − p_ij − p_ji. The algorithms are ranked by the number of times their probability of winning against another algorithm is higher than their probability of losing.

TABLE 22.
Bayesian signed-rank test results for databases with more than 1,000 objects from the full databases experiment. Each cell shows the probability that the algorithm in the row is practically better than the algorithm in the column. For a pair of algorithms i, j, the probability of a tie is 1 − p_ij − p_ji. The algorithms are ranked by the number of times their probability of winning against another algorithm is higher than their probability of losing.

TABLE 23.
Bayesian signed-rank test results for databases with up to two objects of the majority class for each object from the minority class, for the full algorithms experiment. Each cell shows the probability that the algorithm in the row is practically better than the algorithm in the column. For a pair of algorithms i, j, the probability of a tie is 1 − p_ij − p_ji. The algorithms are ranked by the number of times their probability of winning against another algorithm is higher than their probability of losing.

APPENDIX B CLASSIFIER PERFORMANCE AND FULL BAYESIAN SIGNED-RANK TEST RESULTS
In Section V, we showed the distribution of AUC for each classifier. In Table 10, we show the AUC of each classifier in the 57 databases. The subset of databases used in the full algorithms experiment (see Section IV-D) can be distinguished in the table.
In Section V-A, we presented the full results of the Bayesian signed-rank test for all databases in Tables 7 and 8. The tables show the probability that each algorithm wins and loses against the rest. In Section V-B, we showed plots summarizing the Bayesian signed-rank test in subsets of databases according to the number of classes, number of features, number of objects, and degree of class imbalance. Here, we show the full results of the Bayesian signed-rank test in table format. The results according  Tables 23-26.

TABLE 24. Bayesian signed-rank test results for databases with more than two objects of the majority class for each object from the minority class, for the full algorithms experiment. Each cell shows the probability that the algorithm in the row is practically better than the algorithm in the column. For a pair of algorithms i, j, the probability of a tie is 1 − p_ij − p_ji. The algorithms are ranked by the number of times their probability of winning against another algorithm is higher than their probability of losing.

TABLE 25.
Bayesian signed-rank test results for databases with up to two objects of the majority class for each object from the minority class, for the full databases experiment. Each cell shows the probability that the algorithm in the row is practically better than the algorithm in the column. For a pair of algorithms i, j, the probability of a tie is 1 − p_ij − p_ji. The algorithms are ranked by the number of times their probability of winning against another algorithm is higher than their probability of losing.

TABLE 26.
Bayesian signed-rank test results for databases with more than two objects of the majority class for each object from the minority class, for the full databases experiment. Each cell shows the probability that the algorithm in the row is practically better than the algorithm in the column. For a pair of algorithms i, j, the probability of a tie is 1 − p_ij − p_ji. The algorithms are ranked by the number of times their probability of winning against another algorithm is higher than their probability of losing.

related decision tree approaches and clarifying the correct category for Logistic Model Trees. They also wish to express their gratitude to the members of the Grupo de Investigación con Enfoque Estratégico-Machine Learning (GIEE-ML) Group, Tecnologico de Monterrey, for providing computational resources for the experiments, and for useful suggestions and advice on earlier versions of the results presented in this article.