Decision Tree Algorithm Considering Distances Between Classes

The decision tree (DT) algorithm is a commonly used data mining method for classification and regression. A DT repeatedly divides a dataset into subsets based on impurity measures such as entropy and the Gini index, so that relatively “pure” partitions consisting of observations with (almost) the same class are obtained. The Gini index is one of the representative indices for measuring the impurity of data. However, the Gini index does not take into account distances between classes. If the distances between classes are considered when measuring impurity, the decision tree algorithm can distinguish observations with different classes more clearly. To this end, a new decision tree algorithm based on the Rao-Stirling index is proposed that considers distances between classes. When measuring data impurity, the Rao-Stirling index gives more weight to pairs of observations from more distant classes. Experimental results indicate that the proposed method is superior in terms of accuracy, implying that considering the distances between classes can help improve accuracy in DT.


I. INTRODUCTION
Decision trees (DTs), which are named after their treelike structure, are commonly used in data mining. A DT divides the whole data set into several subgroups containing instances with (almost) the same class. In general, a DT consists of parent nodes and child nodes, and the parent nodes break the data down into smaller and smaller child nodes (subsets) using specific variables selected by a split criterion. Partitioning proceeds in the direction that reduces the impurity of the child nodes until a stopping rule is reached.
DTs offer several advantages. First, they can create a non-parametric model because there is no assumption regarding the data. Second, DTs can be visualized using a tree structure, so it is easy to interpret the results and to know which variables are important. Finally, the computational cost of a DT is relatively low, and tree-based decision rules for large data sets can be generated relatively quickly.
The associate editor coordinating the review of this manuscript and approving it for publication was Senthil Kumar.
However, DTs are limited in that they consider only the impurity measure and do not take advantage of other data features when splitting data. For example, the concept of distance can be used to take advantage of data features, as shown in Figure 1. There are six different classes of data, and these can be partitioned based on the Gini index, the impurity-based splitting criterion that the CART algorithm uses to generate decision rules. For a detailed description of the Gini index, the reader is referred to Gini [21]. Figures 1(a)-(c) all have the same Gini index decrease of 1.667 when any one split is performed, so the existing algorithm using the Gini index as the impurity measure cannot distinguish (a)-(c) at all. If the distance between classes is considered, (c) has a larger decrease than (a) and (b), as expected. In other words, the distance between classes should be considered when splitting in a DT to obtain a splitting boundary that generalizes well.
Related studies have considered the distance in decision trees. Mantaras [18] used a distance as the impurity measure, applying the distance between the two partitions induced by a split criterion in the attribute selection process. Another related study was presented by Takahashi and Abe [4]. They proposed decision tree-based multiclass support vector machines that use the DT structure to solve multiclass classification problems with SVMs, where the distance between classes is used to separate distant classes first.
Related research has shown that the DT can obtain a general partition boundary if the distance between classes is considered, and this can improve the predictive performance with new data. Therefore, we propose a decision tree algorithm that considers the distance between classes as well as the impurity.
This paper is structured as follows. Section II explains the decision tree algorithms and their impurity measures. Section III describes the details of the proposed decision tree algorithm considering class distances. The experimental results are shown in Section IV, and the conclusions are given in Section V.

II. DECISION TREE AND RANDOM FOREST
This section describes how a decision tree works. A DT is a top-down approach to dividing a data set, in which a variable and a splitting boundary are selected at each stage of the process. The dataset is then repeatedly divided into purer subsets based on the impurity measure (see Figure 2 for the iterative splitting process of the DT). The DT defines the goodness of split as the difference between the degree of impurity before and after division. Therefore, the purer the divided data, the higher the goodness of split. As a result, the data set is split at the division boundary R with the highest goodness of split, defined as:
$$ G(T, R) = I(T) - I(T|R) $$
where T is the set of training examples, G(T, R) indicates the goodness of split when the training set T is divided by R, and I(T) and I(T|R) indicate the impurities before and after division based on the division boundary.
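The goodness of split can be illustrated with a minimal sketch. It assumes that the impurity I is the Gini index and that I(T|R) is the size-weighted average of the child impurities, as is common in CART-style trees; the function names `gini` and `goodness_of_split` are illustrative and do not come from the paper.

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum_c p_c^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def goodness_of_split(labels, left_mask, impurity=gini):
    """G(T, R) = I(T) - I(T|R).

    Assumption: I(T|R) is the size-weighted impurity of the two children.
    left_mask[i] is True when sample i falls on the left side of boundary R.
    """
    left = [y for y, m in zip(labels, left_mask) if m]
    right = [y for y, m in zip(labels, left_mask) if not m]
    n = len(labels)
    after = len(left) / n * impurity(left) + len(right) / n * impurity(right)
    return impurity(labels) - after

labels = ["a", "a", "b", "b"]
# A boundary that separates the two classes completely.
perfect = goodness_of_split(labels, [True, True, False, False])
# A boundary that leaves both children as mixed as the parent.
useless = goodness_of_split(labels, [True, False, True, False])
print(perfect, useless)
```

In this toy case the separating boundary recovers the full parent impurity, while the uninformative boundary yields no gain.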
The DT applies its goodness-of-split criterion to each split point and evaluates the reduction in impurity. Then, the DT selects the split point of the variable for which the reduction in impurity is the highest. Several impurity metrics can be used to determine the splitting boundary. They are defined according to information-theoretic and statistical approaches, such as the information gain, Gini index, gain ratio, and distance measure [19]. The information gain [20] is an impurity-based criterion that uses entropy (originating from information theory) as a measure of impurity. The Gini index measures the divergence between the probability distributions of the target attribute's values [6]. The gain ratio [7] is a measure that ''normalizes'' the information gain by dividing it by the entropy, and the distance measure [18] differs from the other measures in that it uses a distance to normalize the impurity measurement. However, to the best of our knowledge, none of the representative impurity-based splitting criteria considers the distance between classes.
DT methods have been developed over the years, e.g., ID3 [5], CART [6], C4.5 [7], and CHAID [8]. A detailed review of the structure of DTs and the developed methods was provided by Murthy [9]. These methods differ in the data type of the dependent variable and in the impurity measures used to select variables and division boundaries. A summary of the decision tree algorithms is provided in Table 1. Of those, CART adopts a statistical approach using a statistical index called the Gini index. The Gini index measures the probability that a randomly chosen sample is incorrectly classified. ID3 and C4.5 use entropy, and CHAID uses the chi-square statistic, as impurity measures.
Random forest (RF) is an ensemble method that trains many randomized decision trees. DT-based ensemble methods have been used effectively in various domains. A single DT can produce results with high variance, which leads to inconsistency and overfitting. Random forest methods alleviate some of these shortcomings by constructing many trees with various properties (see Figure 3 for the ''bagging'' construction). Such methods reduce the variance because bagging aggregates the results of multiple models. Random forest methods also adopt random feature selection: each subtree is built using only a randomly chosen subset of the features, and the outputs of all subtrees are then aggregated by averaging or voting. This process prevents the model from relying on only specific features and reduces the correlation between subtrees. As a result, the random forest model is robust against noise.
To construct diverse trees, random forest generates D subsets using the bootstrap technique, a sampling method that samples the dataset with replacement. RF then trains one classifier on each subset and combines the D generated classifiers into one classifier as:
$$ p(c|x) = \frac{1}{D} \sum_{d=1}^{D} p_d(c|x) $$
where p(c|x) is the probability of class c when x is given, and p_d(c|x) is the corresponding probability estimated by the d-th classifier. This structure is called bagging (bootstrap aggregating).
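The two ingredients of bagging can be sketched as follows. This is a simplified illustration, assuming that the ensemble averages the class probabilities of its D trees; the per-tree probabilities below are made-up values, and `bootstrap_sample` and `aggregate` are hypothetical helper names.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def aggregate(prob_list):
    """Average per-tree class probabilities: p(c|x) = (1/D) * sum_d p_d(c|x)."""
    D = len(prob_list)
    classes = set().union(*prob_list)  # every class seen by any tree
    return {c: sum(p.get(c, 0.0) for p in prob_list) / D for c in classes}

rng = random.Random(0)
data = [1, 2, 3, 4, 5]
sample = bootstrap_sample(data, rng)  # same size as data, repeats allowed

# Hypothetical class probabilities returned by D = 3 trees for one input x.
tree_probs = [{"a": 0.9, "b": 0.1}, {"a": 0.6, "b": 0.4}, {"a": 0.3, "b": 0.7}]
forest_prob = aggregate(tree_probs)
predicted = max(forest_prob, key=forest_prob.get)
print(sample, forest_prob, predicted)
```

Majority voting would be an alternative aggregation rule; averaging is used here only because it matches a probability-valued output.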

III. RAO-STIRLING BASED CLASSIFICATION TREE (RSCT)

A. RAO-STIRLING BASED IMPURITY
In this section, we propose a new decision tree algorithm that considers not only the impurity but also the distance between classes. First, we present a new impurity measure, the Rao-Stirling measure. The Rao-Stirling measure is one of a family of diversity measures used to consider distances between fields, and it is extensively used in interdisciplinary research [1]-[3]. The proposed Rao-Stirling measure multiplies each term of the Gini index, a representative measure of impurity in decision trees, by the distance between the corresponding classes. The Rao-Stirling based impurity is defined as:
$$ RS = \sum_{i \neq j} d_{ij} \, p_i \, p_j $$
where p_i and p_j denote the probabilities of the i-th and j-th classes, respectively, and d_ij denotes the distance between the i-th class and the j-th class.
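The link to the Gini index can be made explicit with a short derivation. Assuming the Rao-Stirling impurity takes the standard form of a distance-weighted sum over pairs of distinct classes, setting every distance to d_ij = 1 for i ≠ j recovers the Gini index, because the class probabilities sum to one:

$$ \sum_{i \neq j} d_{ij} \, p_i \, p_j \;\Big|_{d_{ij} = 1} = \sum_{i \neq j} p_i \, p_j = \Big( \sum_i p_i \Big)^2 - \sum_i p_i^2 = 1 - \sum_i p_i^2 $$

The Gini index can therefore be read as the special case in which all classes are equally far apart.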
With the Rao-Stirling measure, the splitting boundary is determined so that samples with similar classes are placed in the same partition. The simple example shown in Figure 1 can also be split using the Rao-Stirling based impurity. The decrease in the Gini index after the partition in Figure 1 is the same for (a)-(c), whereas the decrease in the Rao-Stirling based impurity is larger for (c) than for (a) and (b). In other words, the Rao-Stirling based impurity approach preferentially determines the splitting boundary so that observations of similar classes are gathered together.
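A small numerical sketch shows how two splits can tie under the Gini index yet differ under a distance-weighted impurity. The setup is hypothetical and is not the data of Figure 1: six equally frequent classes labelled 0 to 5, with the class distance assumed to be |i - j|, and child impurities weighted by child size.

```python
from collections import Counter

def class_probs(labels):
    n = len(labels)
    return {c: k / n for c, k in Counter(labels).items()}

def gini(labels):
    return 1.0 - sum(p * p for p in class_probs(labels).values())

def rao_stirling(labels, dist):
    """Sum over ordered pairs of distinct classes of d_ij * p_i * p_j."""
    probs = class_probs(labels)
    return sum(dist(i, j) * pi * pj
               for i, pi in probs.items()
               for j, pj in probs.items() if i != j)

def decrease(parent, left, right, impurity):
    """Impurity decrease, with children weighted by size (an assumption)."""
    n = len(parent)
    after = len(left) / n * impurity(left) + len(right) / n * impurity(right)
    return impurity(parent) - after

def dist(i, j):
    # Hypothetical class distance: classes lie on a line at positions 0..5.
    return abs(i - j)

def rs(labels):
    return rao_stirling(labels, dist)

parent = [0, 1, 2, 3, 4, 5]
near = ([0, 1, 2], [3, 4, 5])    # groups neighbouring classes together
mixed = ([0, 2, 4], [1, 3, 5])   # interleaves distant classes

gini_near = decrease(parent, *near, gini)
gini_mixed = decrease(parent, *mixed, gini)
rs_near = decrease(parent, *near, rs)
rs_mixed = decrease(parent, *mixed, rs)
print(gini_near, gini_mixed, rs_near, rs_mixed)
```

Under these assumptions the Gini decrease is identical for both splits, while the Rao-Stirling decrease is larger for the split that keeps neighbouring classes together.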
To calculate the distance between the classes, we use the center point of each class. Let X ∈ R^{n×p} be the given data set, where n is the number of observations and p is the number of features. The center point of the i-th class is defined as the vector of average feature values of that class:
$$ CP_i = \left( \bar{X}_i^1, \bar{X}_i^2, \ldots, \bar{X}_i^p \right) $$
where \bar{X}_i^k is the average value of the k-th feature over the observations of the i-th class. First, the center points are calculated, and then the distances between the classes are calculated from the obtained center points.
In this study, we use the Euclidean distance and cosine distance as the measure of the distance between classes.
The Euclidean distance d_ij is calculated as:
$$ d_{ij} = \lVert CP_i - CP_j \rVert_2 $$
where CP_i is the center point of the i-th class. The cosine distance d_ij is calculated as:
$$ d_{ij} = 1 - s_{ij} $$
where s_ij is the cosine similarity. The cosine similarity can be applied to any number of dimensions and is used to measure cohesion within clusters. It is defined as:
$$ s_{ij} = \frac{CP_i \cdot CP_j}{\lVert CP_i \rVert \, \lVert CP_j \rVert} $$
The value of the cosine similarity s_ij (for i, j = 1, . . . , M) ranges from −1 to 1: −1 means exactly opposite, 1 means exactly the same, and 0 indicates orthogonality between the two classes. The lower the degree of similarity, the farther apart the two classes are, so we use the cosine distance instead of the cosine similarity. The cosine distance ranges from 0 to 2. The closer the classes, the closer the value is to 0, and the farther apart the classes, the closer the value is to 2. The cosine distance behaves similarly to the Euclidean distance, but it does not satisfy the triangle inequality d_ij ≤ d_il + d_lj.
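The center points and the two class distances can be computed as in the following sketch, written with the standard library only. The helper names and the toy data are illustrative; the cosine distance is taken to be one minus the cosine similarity, which is consistent with the stated range of 0 to 2.

```python
import math
from collections import defaultdict

def class_centers(X, y):
    """Center point of each class: the mean of every feature over that class."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return {c: [sum(col) / len(rows) for col in zip(*rows)]
            for c, rows in groups.items()}

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """1 - cosine similarity; undefined if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy data: two features, two classes.
X = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
y = ["a", "a", "b", "b"]

centers = class_centers(X, y)
d_euc = euclidean_distance(centers["a"], centers["b"])
d_cos = cosine_distance(centers["a"], centers["b"])
print(centers, d_euc, d_cos)
```

Here the two center points are orthogonal, so the cosine distance equals 1 regardless of how far apart the centers are in the Euclidean sense.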

B. PROCEDURE
This section introduces the overall process of the proposed method, with the process map described in Figure 4. Steps (2)-(3) in Figure 4 comprise the preprocessing: they find the center point of each class and calculate the distances between the classes based on the center points. Then, Step (4) calculates the Rao-Stirling measure for all candidate splitting boundaries that divide the data into subsets. In Steps (5)-(6), the goodness of split is calculated for each candidate, and the division boundary with the largest goodness of split is selected. If the goodness of split is positive, the procedure repeats from Step (4). The procedure stops when the goodness of split is zero.
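The procedure can be sketched for a single node as follows. This is an illustrative reading of the steps, not the authors' implementation: it assumes numeric features, candidate thresholds at midpoints between consecutive distinct values, Euclidean class distances, and size-weighted child impurities. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def class_centers(X, y):
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return {c: [sum(col) / len(rows) for col in zip(*rows)]
            for c, rows in groups.items()}

def rao_stirling(labels, dist):
    n = len(labels)
    if n == 0:
        return 0.0
    probs = {c: k / n for c, k in Counter(labels).items()}
    return sum(dist[i][j] * pi * pj
               for i, pi in probs.items()
               for j, pj in probs.items() if i != j)

def best_split(X, y, dist):
    """Return (gain, feature, threshold) with the largest goodness of split.

    Returns (0.0, None, None) when no candidate has positive gain,
    which corresponds to the stopping condition.
    """
    n = len(y)
    parent = rao_stirling(y, dist)
    best = (0.0, None, None)
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            after = (len(left) / n * rao_stirling(left, dist)
                     + len(right) / n * rao_stirling(right, dist))
            if parent - after > best[0]:
                best = (parent - after, f, t)
    return best

# Toy data: one feature, three classes; class "c" lies far from "a" and "b".
X = [[0.0], [1.0], [2.0], [10.0], [11.0]]
y = ["a", "a", "b", "c", "c"]

# Preprocessing: center points, then pairwise class distances.
centers = class_centers(X, y)
dist = {i: {j: math.dist(ci, cj) for j, cj in centers.items()}
        for i, ci in centers.items()}

gain, feature, threshold = best_split(X, y, dist)
print(gain, feature, threshold)
```

On this toy data the selected threshold separates the distant class first. A full tree would apply `best_split` recursively to each child until the returned gain is zero; the class distances are computed once and reused.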

IV. EXPERIMENTAL RESULTS
Computational experiments were conducted using three data sets to compare the proposed Rao-Stirling based model with the Gini based model. In addition, two methods of measuring the class distance (Euclidean and cosine) were compared.

A. DATASETS
The datasets used for the computational experiments are the contraceptive method choice, car evaluation, and yeast datasets obtained from the UCI repository of machine learning databases. All attribute values are positive, and there are no missing values. The characteristics of the three datasets are summarized in Table 2. The contraceptive method choice (CMC) dataset is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. It is used to classify a woman's current contraceptive method based on her demographic and socioeconomic characteristics. The dataset contains nine attributes of categorical and integer types and 1,473 observations with three classes. The car evaluation dataset was donated by Marko Bohanec in 1997. It contains six categorical attributes and 1,728 instances with four classes. The task is to classify how acceptable a car is based on these attributes. The yeast dataset contains 1,484 instances with ten classes and eight real-valued attributes. It is used to predict the cellular localization sites of proteins in a yeast cell.

B. RESULTS
The performance of the proposed method is evaluated by comparing its accuracy with that of the existing method. For the Gini-based DT and the proposed Rao-Stirling based DT, the max depth d was varied such that d = 1, 2, 3, . . . , 12, and pruning was conducted. For the Gini-based RF and the proposed Rao-Stirling based RF, the number of trees n was varied such that n = 300, 310, 320, . . . , 500, and the max depth d was varied in the same way as for the DTs. For each combination of parameters, 10-fold cross-validation was performed. The best mean classification accuracies over the 10 folds are summarized in Table 3.
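The evaluation protocol can be sketched as a plain k-fold loop. The sketch is generic and makes several assumptions that the text does not specify: folds are contiguous and unshuffled, and the model is passed in as a hypothetical `fit_predict` callable. A majority-class predictor stands in for a real tree purely to keep the example self-contained.

```python
from collections import Counter

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_accuracy(X, y, fit_predict, k=10):
    """Mean accuracy over k folds; fit_predict(X_train, y_train, X_test) -> labels."""
    scores = []
    for test_idx in k_fold_indices(len(y), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        preds = fit_predict([X[i] for i in train_idx],
                            [y[i] for i in train_idx],
                            [X[i] for i in test_idx])
        correct = sum(p == y[i] for p, i in zip(preds, test_idx))
        scores.append(correct / len(test_idx))
    return sum(scores) / k

def majority_fit_predict(X_train, y_train, X_test):
    """Placeholder model: always predicts the most frequent training class."""
    majority = Counter(y_train).most_common(1)[0][0]
    return [majority] * len(X_test)

X = [[float(i)] for i in range(20)]
y = ["a"] * 15 + ["b"] * 5
acc = cross_val_accuracy(X, y, majority_fit_predict, k=10)
print(acc)
```

In practice the data would typically be shuffled or stratified before forming the folds, and the loop would be repeated for every combination of max depth and number of trees.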
For the CMC dataset, the best accuracy was achieved when using Rao-Stirling with the Euclidean distance and a max depth of 5, regardless of the approach taken (i.e., decision tree or random forest). For the car evaluation dataset, Rao-Stirling with the cosine distance and a max depth of 8 provides the best accuracy, regardless of the approach taken. The best accuracy on the yeast dataset was obtained when using Rao-Stirling (Euclidean) with a max depth of 6 in both the decision tree and the random forest. From Table 3, we observe that the proposed methods considering class distances are superior to the existing methods, regardless of the class distance measure and the approach taken (decision tree vs. random forest).
As shown in Figure 5, for the CMC dataset, the RSCT with the Euclidean class distance provides 1.9 percent higher performance than the Gini index-based DT. For the car evaluation dataset, the RSCT with the cosine class distance shows 0.5 percent higher performance than the existing Gini-based DT. For the yeast dataset, the DT using the Rao-Stirling measure and the Euclidean class distance yielded 1.7 percent higher performance than the Gini-based DT.
Figure 6 shows the classification accuracies for the random forest methods. The RF methods that consider class distances are more accurate than the RF methods that do not. In particular, for the CMC dataset, the Rao-Stirling based RF with the Euclidean class distance provides 1.2 percent higher performance than the Gini index-based RF. For the car evaluation dataset, the Rao-Stirling based RF with the cosine class distance shows 0.7 percent higher performance than the existing Gini-based RF. For the yeast dataset, the RF using the Rao-Stirling measure and the Euclidean class distance yielded 1.6 percent higher performance than the Gini index-based RF.
Figures 7(a)-(c) show how the accuracies of RF change with respect to the number of trees n for the CMC, car evaluation, and yeast data sets, respectively. Note that the accuracies start to stabilize when n is approximately equal to 100 for all three data sets. When the number of trees is greater than 100, the method that takes the class distances into account performs better than the method that does not consider the distance at all.
The Rao-Stirling based impurity approach determines splitting boundaries to classify samples with similar classes into the same partition. Therefore, in the Rao-Stirling based model, similar samples will be classified in a partition, and the distance between samples in a partition will be close.
To assess the effect of considering class distances in the Rao-Stirling based model, Table 4 reports how the sum of the distances between samples over all leaf nodes changes with the decision tree depth for both approaches (i.e., the Gini based model and the Rao-Stirling based model). Table 4 shows that for the CMC and car evaluation datasets, the sum of distances between samples over all leaf nodes in the Rao-Stirling based model is less than that in the Gini based model regardless of the decision tree depth. The results confirm that the Rao-Stirling based impurity approach, by taking the class distances into account, allows similar data to be grouped into the same partition.
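The within-leaf distance statistic can be computed as in the sketch below. It assumes that the statistic is the sum of Euclidean distances over all pairs of samples that share a leaf node; the exact definition behind Table 4 is not spelled out in the text, and the two leaf assignments are hypothetical.

```python
import math
from itertools import combinations

def within_leaf_distance(leaves):
    """Sum of Euclidean distances over all sample pairs inside each leaf."""
    return sum(math.dist(u, v)
               for leaf in leaves
               for u, v in combinations(leaf, 2))

# Two hypothetical ways of assigning the same four samples to two leaves.
compact = [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 5.0), (5.0, 6.0)]]
scattered = [[(0.0, 0.0), (5.0, 5.0)], [(0.0, 1.0), (5.0, 6.0)]]

total_compact = within_leaf_distance(compact)
total_scattered = within_leaf_distance(scattered)
print(total_compact, total_scattered)
```

A smaller total indicates that the samples sharing a leaf lie closer together, which is the behavior the comparison between the two models is meant to capture.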

V. CONCLUSION
A decision tree method based on the Rao-Stirling measure was developed to consider the distance between classes, and it was compared to existing DT methods using the CMC, car evaluation, and yeast datasets. While the Gini index used in existing DT methods considers only the impurity, the Rao-Stirling measure considers the class distances as well as the impurity. The experimental results show that the proposed approach performs consistently better than the existing approaches, regardless of the data set and class distance measure employed. This is an encouraging result, since merely considering the distances between classes can improve the performance of both decision tree and random forest analyses. Future work may include verifying the above findings using various datasets and considering other class distance measures.