Topological Forest

We propose a new ML model called Topological Forest that contains an ensemble of decision trees. Unlike a vanilla Random Forest, Topological Forest has a special training process that selects a smaller number of decision trees from a topological graph representation that TDA Mapper constructs. Compared to a vanilla Random Forest, Topological Forest significantly reduces the computational cost of inference because it uses a smaller ensemble and selects better decision trees while preserving their diversity. Our experiments show that Topological Forest can speed up inference by more than 100x on average at the cost of at most a 2% reduction in AUC.


I. INTRODUCTION
Random forests [1] are a type of ensemble learning model that use multiple decision trees to make predictions or perform classification tasks. In a random forest, each decision tree is trained on a subset of the features and training data, and the results of all the decision trees are combined to make more robust predictions.
Random forests have gained popularity in recent years due to their simplicity and efficiency [2], [3], [4], [5], [6], [7]. Unlike more complex models like gradient boosted decision trees [8] or deep neural networks [9], random forests can be easily parallelized on a cluster of machines, which makes them faster to train and use for inference. Additionally, the decision rules from the split nodes in a random forest can be easily interpreted by humans, which makes them more transparent than other, more complex models. These advantages make random forests attractive for many classification and regression tasks.
While random forests have many advantages, the qualities and contributions of their individual decision trees are often not considered. Previous attempts to tune the number of decision trees [10], [11] and understand the impact of ensemble size on performance [12] have focused on determining the upper bounds of ensemble size and the optimal number of decision trees, rather than analyzing the quality of individual trees or their contribution to the ensemble. In this paper, we propose a training process that takes into account the similarity and quality of decision trees in a random forest. By carefully selecting decision trees in this way, we achieve similar performance on prediction tasks with a smaller ensemble size. This would reduce the size of the model and improve computational efficiency during inference.
The associate editor coordinating the review of this manuscript and approving it for publication was Xujie Li.
In this paper, we propose the Topological Forest model, which uses a smaller ensemble of decision trees to perform regression and classification tasks. The Topological Forest model has a multi-step training process that begins by training a vanilla random forest. Next, we transform each tree in the random forest into a feature graph and extract key features from the graphs as high-dimensional vectors. We then use TDA Mapper [13] to transform these high-dimensional feature vectors into a topological cluster network. This allows us to identify representative decision trees that can be used to construct a smaller, more efficient random forest. By using this approach, we aim to achieve similar performance on prediction tasks with a smaller ensemble size, reducing the model size and improving computational efficiency.
TDA Mapper is a technique that allows for the soft clustering of high-dimensional data points while constructing a network view where similar clusters are connected by short paths. TDA Mapper transforms the high-dimensional input features into a low-dimensional space (in this case, 2D) using a lens or filter function. Common choices for the filter function include projection onto one or more axes using techniques like TSNE [14] or density-based methods. Once the filter function is applied, the data in the original feature space is transformed into 2D space. TDA Mapper then constructs a set cover of the projected space in the form of a set of overlapping intervals with constant length. Next, TDA Mapper constructs a mapper graph with clustered trees as vertices that represent each set in the cover of the projected space. An edge exists between two vertices if the two sets (due to the overlapping intervals) share some trees in common. Finally, we use different strategies to select decision trees from each cluster to discover a more representative tree ensemble with fewer trees. Our experimental results show that trees in a cluster make similar predictions on out-of-bag samples, which makes TDA Mapper well-suited to our tree selection problem.
VOLUME 10, 2022 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In particular, we list the contributions of this paper as follows:
• We propose a new machine learning model called Topological Forest, which uses a significantly smaller number of decision trees than traditional random forest models while achieving similar prediction quality (within 2%). This reduction in the number of decision trees allows for a smaller model size and improved computational efficiency.
• We introduce a novel graph representation of decision trees that can be used in various applications to take advantage of tree features and decision tree similarities. This graph representation allows for more effective analysis and interpretation of decision trees and can be used in a variety of contexts.
• We use TDA Mapper to map similar decision trees into the same clusters, allowing us to construct a more effective random forest by applying various ensemble selection strategies.
• From our experiments on several binary classification tasks, we find that the Topological Forest model significantly improves computational efficiency during inference, with minimal compromise on prediction quality.
The structure of this paper is as follows: The next section provides an overview of related work in the field. In Section III, we introduce the Topological Forest model and its training process. Section IV presents our experimental results. Finally, in Section V, we provide our conclusions and discuss potential future work.

II. RELATED WORK
Random Forest is a popular ensemble-based Machine Learning model that is used for classification or regression tasks. It works by constructing a prediction value from individual decision trees, using different aggregation policies such as majority voting, computing the mean, or median, depending on the type of problem. Because of its simplicity and efficiency, the Random Forest model has been applied successfully to many practical problems, such as patient health prediction, image classification, emotion recognition, malware detection, and user response prediction.
A significant advantage of Random Forest comes from its interpretability [17], [18], and there have been several works [19], [20], and [21] to improve the explainability of Random Forest. For example, decision tree split rules in Random Forest can easily be converted into human-readable rule format [22] to understand the relationship between feature space and the predicted outcome.
Parallelization of training and inference processes [23], [24] is another significant advantage of Random Forest. We can train all decision trees in parallel, and during the inference, we can invoke all decision trees concurrently to compute predicted outcomes. This makes Random Forest very efficient and suitable for large-scale applications that require processing big data [16], [25], [26], [27].
While Random Forest has the advantages mentioned above, the ensemble size can be unnecessarily large without a significant contribution to the prediction performance. Previous attempts to limit ensemble size focused on tuning the number of decision trees during the training process [10], [12], [28], [29]. Although these approaches yield improvements in the run-time efficiency of Random Forest, they don't analyze the individual quality of decision trees or the similarity between decision trees to boost quality.
We use topological data analysis (TDA) to analyze our decision trees and reduce the computational cost of random forest inference with a small compromise on prediction accuracy. TDA considers that data has a shape and aims to investigate the underlying manifold structure of the data rather than just using the statistical description [30]. TDA is mainly powered by persistent homology [31], which can capture topological differences across various scales and depict them in persistence diagrams. This novel data analysis approach is rapidly developing for different purposes in machine learning research.
TDA can be used to extract topological features from data to use them as inputs for machine learning models, or it can be used to improve a model's design and study some aspects of the model [32]. In terms of TDA feature extraction, Harer et al. [33] used the total persistence of a persistence diagram, Chen et al. [34] applied the p-norm, and Atienza et al. [35] used persistent entropy as a topological descriptor input feature. Rieck et al. [36] used Betti curves [37] to summarize the features. Chevyrev et al. [38] utilized this representation and persistence diagrams to develop a classifier using a random forest and a support vector machine. Bubenik et al. presented the persistence landscape, a new topological descriptor for mapping the persistence diagram to function space [39]. Zhao et al. used TDA features for graph classification [40]. These studies showed that TDA features could be well-suited as inputs for machine learning models to improve their accuracy.
Some studies applied TDA to design a machine learning model or improve some aspects of a model. Moor et al. presented a topological auto-encoder for low-dimensional representation of input data features [41]. Yuvaraj et al. used TDA to study complex multilayer networks [42] and to cluster them based on topological approaches, and Bulauan et al. [43] clustered complex multilayer networks with topological approaches. Chen et al. [44] introduced an approach for measuring the classification boundary of a classifier by using a topological complexity, and Hofer et al. [45] developed topological constraint to improve the generalization performance of their model. Our method is categorized in the second area of research on topological data analysis. TDA helps improve our Random Forest quality and considerably increases the model's performance.
Our proposed approach is unique because topological random forests can reduce the ensemble size and improve the speed of predictions compared to vanilla Random Forest at the cost of reducing at most 2% in the AUC metric. Topological Forest also keeps the diversity of decision trees by selecting individual decision trees that belong to different similarity classes.

III. TOPOLOGICAL RANDOM FOREST
In this section, we'll explain the details of the Topological Forest: We present a high-level overview of the training process for the Topological Forest. After that, we briefly describe a vanilla Random Forest and introduce the notation we use throughout the paper. Next, we explain the graph encoding of decision trees and discuss how a Random Forest ensemble is constructed from a topological clustering of encoded decision tree representations.

A. OVERVIEW OF THE TRAINING PROCESS
Building a Topological Forest from the training data is a multi-step process, as shown in Figure 1. In the first step, we train a vanilla random forest using the original dataset with all its features. After that, each decision tree in the vanilla random forest is transformed into a graph representation using the relationships between features. In the next step, we extract features from the graph representation of decision trees as N-dimensional vectors. These vectors contain topological features such as the number of edges or the average vertex degree.
Once feature extraction is completed, we use TDA Mapper to transform and cluster the N-dimensional feature vectors into a topological network that has a cluster of nodes (each node represents a decision tree). The TDA Mapper step first invokes a feature embedding task that uses TSNE as a filter function to transform the N-dimensional feature vectors into 2D. Once the compressed features in 2D space are computed, TDA Mapper generates a mapper graph with clustered trees as vertices by leveraging a set cover of the projected space in the form of overlapping intervals of fixed length.
As Figure 1 shows, we employ two types of graphs to build Topological Forest: a graph representation of a decision tree and a TDA Mapper produced graph (aka mapper graph) to cluster decision trees. The first graph is relatively smaller than the mapper graph and it contains parent and child relationships of features. On the other hand, the mapper graph is a topological network of soft decision tree clusters, and an edge between two clusters exists if they share common decision trees. In the final step of the entire process, we employ an ensemble selection task to build a topological forest from the mapper graph. In the ensemble selection task, we used different selection algorithms to pick decision trees to construct a topological forest from the cluster of decision trees in the mapper graph. In the following sections, we give details of each step.

B. TRAINING VANILLA RANDOM FOREST
We consider a binary classification scheme for ease of exposition, but our approach can be trivially generalized to multilabel classification. Let $\{x_n\}_{n \in \mathbb{Z}^+}$ be a set of data points, and let each point $x_n$ be associated with a pair $(x_n, y_n)$, where $x_n \in \mathbb{R}^D$ is a feature vector and $y_n \in \{0, 1\}$ is a label. There are $D$ features for each data point $x_n$, and a total of $N$ data points in the training set.

Definition 1 (Random Forest): A Random Forest is a classifier consisting of a collection of decision tree classifiers $\{h(X, Y, \Theta_k),\ k = 1, \ldots, N\}$ where the $\Theta_k$ are independent identically distributed random vectors [1].
Each tree in Random Forest is trained on a randomly selected subset of data through data bagging. Once the Random Forest is constructed, each tree casts a unit vote for a class of a given input $x_n \in X$.
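The unit-vote aggregation described above can be illustrated with a minimal sketch. The function name `majority_vote` and the tie-breaking rule (the smaller label wins) are assumptions made for this illustration and are not taken from the paper.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label that receives the most unit votes.

    tree_predictions: list of labels (0 or 1), one per decision tree.
    Ties are broken in favour of the smaller label, which is an
    assumption of this sketch rather than a rule from the paper.
    """
    counts = Counter(tree_predictions)
    best = max(counts.values())
    return min(label for label, count in counts.items() if count == best)


# Five trees vote on one input; label 1 receives three of the five votes.
votes = [1, 0, 1, 1, 0]
label = majority_vote(votes)
```

Because every tree contributes exactly one vote, removing trees that vote like their neighbours changes the outcome only when the removed votes would have tipped the majority.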

C. GRAPH ENCODING AND FEATURE EXTRACTION
In this step, we convert each decision tree into a graph structure. Our decision to convert a tree into a graph is motivated by two reasons:
• First, creating a graph from a decision tree allows us to generalize parent-child relations as weighted feature-feature edges, which create summary, interpretable representations of large decision trees (see Figure 5).
• Second, although extracting features of a decision tree is under-studied in the machine learning community, graph feature extraction is a well-studied problem. To illustrate, graph ML offers features for vertices (e.g., eccentricity [46]), edges (e.g., edge centrality [47]), subgraphs (e.g., modularity [48]) and graphs (e.g., clustering coefficient [49]), which can encode the trees of a random forest from various aspects. The encoding brings us closer to measuring the performance of tree ensembles in a principled way.
In the graph representation $G = (V, E)$ of each decision tree $T$, each vertex $v \in V$ corresponds to a feature $d$ of the dataset, and a directed edge $e = (v_i, v_j) \in E$ denotes a parent-child split $d_i \to d_j$ in the decision tree. A decision tree can split on the same feature multiple times at different levels; however, we do not create a duplicate vertex for each new split. Instead, we add duplicate edges from the parent vertex to the child vertex for new parent-child splits. As a result, the encoded graph may contain duplicate edges between vertex pairs. We also ignore feature values in splits during the entire encoding process because traditional graph features do not utilize attributes of vertices.
Consider the example in Figures 2 and 3 to understand how a decision tree is transformed into a graph representation. Since our goal here is to show the mapping between the decision tree and its graph representation, we omit the example data set and hyperparameters that are used to construct the decision tree in Figure 2. Assume that the feature $x_1$ at the root node of the decision tree in Figure 2 has a split value of 25. We do not retain this information in the graph representation of this decision tree in Figure 3. Note that there is only one node for each feature in the graph representation in Figure 3. However, there are two edges between $x_1$ and $x_2$ in Figure 3 because the root node, which splits on $x_1$, has two child nodes that split on $x_2$.
The graph kernel [50] or graph neural network [51] approaches can incorporate the multi-set of split values as vertex features in the learning process. However, such approaches may create high computational costs in the training process. For this reason, we leave the kernel and graph neural network approaches as future work.
We detail the process for encoding a graph from a decision tree in Algorithm 1, where we employ a breadth-first traversal to discover the parent-child relations in feature splits. The algorithm takes a decision tree as an input and outputs a directed graph structure such as the one shown in Figure 3. In the edge case where the decision tree has only a root node without any split, the algorithm creates a graph that consists of one vertex with a special empty feature label and no edges. In all other cases, the algorithm creates at least one feature vertex since there must be at least one split at the root node. In cases where the tree depth is more than 1, the algorithm creates at least one edge.
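The breadth-first encoding described above can be sketched as follows. This is a minimal illustration and not the released implementation: the `Node` class, the `encode_tree` function, and the `EMPTY` label are hypothetical names, and the tree is assumed to be binary with leaves marked by a missing feature.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical tree node: feature is None for a leaf."""
    feature: Optional[str] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


EMPTY = "<empty>"  # assumed label for a tree that has no split at all


def encode_tree(root):
    """Return (vertices, edges) of a directed multi-graph for one tree.

    Vertices are feature names (one vertex per feature). Edges are kept
    in a list so that repeated parent-child splits appear as duplicates.
    Split values are ignored, as in the text.
    """
    if root.feature is None:
        return {EMPTY}, []
    vertices, edges = set(), []
    queue = deque([root])
    while queue:
        node = queue.popleft()  # breadth-first: oldest node first
        vertices.add(node.feature)
        for child in (node.left, node.right):
            if child is not None and child.feature is not None:
                edges.append((node.feature, child.feature))
                queue.append(child)
    return vertices, edges


# Root splits on x1; both children split on x2 and end in leaves.
tree = Node("x1", Node("x2", Node(), Node()), Node("x2", Node(), Node()))
vertices, edges = encode_tree(tree)
```

In this example the graph has two vertices and two parallel edges from `x1` to `x2`, which mirrors the duplicate-edge behaviour described for Figure 3.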
We outline three approaches to graph encoding: traditional features, graph kernel [50], or graph neural network [51]. The traditional feature and graph kernel approaches are considered shallow graph encodings because the output is a low-dimensional feature vector. Graph neural networks create deep encodings that are high-dimensional feature matrices. Deep encoders frequently outperform shallow encoders in ML tasks (see Tables 3 and 4 in [52]); however, the performance comes at a great computational cost, which we want to avoid for this task.
We used two main approaches to encode a graph with traditional graph features. In the first approach, we extract individual vertex features, such as vertex degree. Next, we design a readout function, which can be as simple as averaging values, to pool vertex features. In the second approach, we ignore vertex features and directly extract graph-level features such as graph diameter and the number of strongly connected components. We follow a combination of both approaches and extract the following features on both vertex and graph levels:
• Vertex: in-degree of a vertex; out-degree of a vertex; degree of a vertex; vertex path distance to all vertices; betweenness centrality of a vertex
• Graph: diameter of the graph; number of vertices; number of edges; counts (16) of directed three-vertex motifs [53]; clustering coefficient
Additionally, we have tested several other vertex and graph features such as the numbers of strongly and weakly connected components and hub/authority scores [54]. For reporting purposes, we only provide features that improve the performance of Topological Forest on out-of-bag test data.
Algorithm 1 Graph Creation From a Decision Tree
Input: Decision tree T
Output: Directed multi-graph G(V, E)
Initialize G as a directed multi-graph;
Initialize S as a multi-set;
r ← the root of T;
// Insert root into multi-set S
S ← S ∪ {r};
while S is not empty do
    r_x ← pick the first element in S;
Computational cost: Extracting graph encodings of an individual decision tree is a non-trivial operation. Features like betweenness centrality require a time complexity of $O(|V| \times |E|)$ [55], and motif counting requires $O(|E|)$ [53]. We use the vanilla clustering coefficient implementation of the Jung library [56], whose time complexity is $O(|V|^3)$. Most datasets in the UCI repository have fewer than 50 features. As a result, the graphs that we extract from decision trees are quite small, with fewer than 50 vertices. Furthermore, decision tree depth is bounded by the number of training data points. As a result, fewer than 200 parent-child relationships are recorded as edges in these graphs. For these reasons, the graph representation creates quite small graphs (Figure 5), i.e., $|V| < 50$ and $|E| < 200$ for most datasets, and the computational costs of graph representation and encoding in Topological Forest are negligible.
We average vertex features to find the corresponding graph feature (e.g., average vertex degree in the graph). Combining five averaged vertex features and 20 graph features, we create a decision tree representation $e \in \mathbb{R}^D$ where $D = 25$. This representation allows us to compare and contrast decision trees which may use different features, split values, and tree depth.
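As a small illustration of the averaging readout, the sketch below computes three averaged degree features and two graph-level counts from a multi-graph given as a vertex set and an edge list. It covers only 5 of the 25 features; the function name `degree_features` and the ordering of the output vector are assumptions made for this example.

```python
def degree_features(vertices, edges):
    """Average in-, out-, and total degree, plus vertex and edge counts.

    vertices: iterable of vertex names.
    edges: list of (source, target) pairs; duplicates count separately
    because the encoded graph is a multi-graph.
    """
    in_deg = {v: 0 for v in vertices}
    out_deg = {v: 0 for v in vertices}
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    n = len(in_deg)
    avg_in = sum(in_deg.values()) / n
    avg_out = sum(out_deg.values()) / n
    # Total degree of a vertex is its in-degree plus its out-degree.
    avg_deg = avg_in + avg_out
    return [avg_in, avg_out, avg_deg, float(n), float(len(edges))]


# Three features; x1 splits into x2 twice, and x2 splits into x3 once.
features = degree_features(
    {"x1", "x2", "x3"},
    [("x1", "x2"), ("x1", "x2"), ("x2", "x3")],
)
```

Every tree yields a vector of the same length regardless of its depth or the features it uses, which is what makes trees comparable in the later clustering step.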

D. TOPOLOGICAL CLUSTERING
We employ the highly customizable TDA tool Mapper [57] to analyze and cluster decision trees in the original vanilla random forest. TDA Mapper complements traditional clustering and projection pursuit approaches with a systematic insight into data geometry and topology. It uncovers hidden data patterns that are otherwise inaccessible with conventional data analytic techniques.
The key idea behind TDA Mapper is as follows: Let $T$ be the total number of observed trees and $\{e_t\}_{t=1}^{T} \subset \mathbb{R}^D$ be a data cloud of tree encodings. For our dataset, $D = 25$. We employ t-distributed stochastic neighbor embedding (t-SNE) [14] as a lens to reduce the data to a two-dimensional space. t-SNE converts similarities between data points to joint probabilities and minimizes the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. Next, we select a function $\xi : \{e_t\}_{t=1}^{T} \to \mathbb{R}$ that filters data in one of the two dimensions.
Let $I$ be the range of $\xi$, that is, $I = [m, M] \subset \mathbb{R}$, where $m = \min_t \xi(e_t)$ and $M = \max_t \xi(e_t)$ in the dimension $d$. We place data into overlapping bins by dividing the range $I$ into a set $S$ of smaller overlapping intervals of uniform length, and let $u_j = \{t : \xi(e_t) \in I_j\}$ be the trees corresponding to features in the interval $I_j \in S$. For each $u_j$, we perform k-means clustering to form clusters $\{t_{jk}\}$.
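The binning step can be illustrated with a short sketch that builds a uniform overlapping cover of $[m, M]$ and assigns tree indices to bins. The parameterization (the overlap given as a fraction of the interval length) and the function names are assumptions of this sketch; a Mapper library may define overlap differently.

```python
def overlapping_cover(m, M, n_intervals, overlap):
    """Cover [m, M] with n_intervals intervals of equal length.

    Adjacent intervals share a fraction `overlap` of their length.
    The length follows from: length + (n - 1) * length * (1 - overlap) = M - m.
    """
    length = (M - m) / (1 + (n_intervals - 1) * (1 - overlap))
    step = length * (1 - overlap)
    return [(m + j * step, m + j * step + length) for j in range(n_intervals)]


def assign_to_bins(values, intervals):
    """Return, for each interval, the indices t whose filter value falls in it."""
    return [
        [t for t, x in enumerate(values) if lo <= x <= hi]
        for lo, hi in intervals
    ]


# Filter values of four trees, covered by three half-overlapping intervals.
filter_values = [0.0, 3.0, 5.0, 9.0]
intervals = overlapping_cover(0.0, 10.0, n_intervals=3, overlap=0.5)
bins = assign_to_bins(filter_values, intervals)
```

Trees 1 and 2 land in two bins each. Such shared members are what later produce edges between clusters.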
We analyze the empirical distribution of edge lengths where each cluster is merged to find the number of clusters. The merging criteria are based on the rationale that internal distances (i.e., within a cluster) are expected to be lower than external distances (i.e., in-between clusters), and distributions of internal and external distances are disjoint. Let $\{t_{jk}\}$ denote the $k$-th cluster of the $j$-th interval. We construct a cluster graph by transforming each cluster into a node and adding an edge between two nodes $k$ and $p$ if clusters $\{t_{jk}\}$ and $\{t_{lp}\}$ contain overlapping data points, i.e., $\{t_{jk}\} \cap \{t_{lp}\} \neq \emptyset$. Formally, the graph is called a TDA Mapper graph or a topological network.
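The edge rule above (two clusters are connected when they share at least one tree) can be sketched in a few lines. The cluster names and the `mapper_edges` function are hypothetical and serve only to illustrate the rule.

```python
from itertools import combinations


def mapper_edges(clusters):
    """Return edges between clusters whose tree sets intersect.

    clusters: dict mapping a cluster name to the set of tree ids it holds.
    """
    return [
        (a, b)
        for a, b in combinations(sorted(clusters), 2)
        if clusters[a] & clusters[b]  # non-empty intersection
    ]


# Tree 2 belongs to both c0 and c1; c2 shares no tree with the others.
clusters = {"c0": {0, 1, 2}, "c1": {2, 3}, "c2": {4, 5}}
edges = mapper_edges(clusters)
```

Here `c2` remains an isolated vertex, which corresponds to a disconnected component of the mapper graph.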
After the graph transformation of decision tree clusters, TDA Mapper produces a low dimensional representation of the underlying data structure in the form of ''cluster tree'' graph CT where each ''cluster'' is a branch of some single connected component rather than a disconnected component on its own as in conventional clustering analysis. In Fig. 4, we show an example of ''cluster tree'' graph that is constructed from Adult Dataset [58].
In the underlying mapper library, we can control the bin count with the n_cubes parameter and the interval overlap with the overlap parameter. A high overlap value creates more edges between vertices, whereas higher n_cubes and k values create more vertices in the graph. As explained in Section III-E, Topological Forest allows fine-tuning the inference process on the mapper network with various selection strategies that consider vertex sizes and edges. We consistently report good performance results across various datasets in our experiments with appropriate mapper parameters.

E. ENSEMBLE SELECTION
As the previous section III-D explained, the TDA mapper network contains vertices that are clusters of decision trees and edges that show the relationship between neighbor clusters. This type of mapper network encodes topological insights into the trained decision trees. In particular, the network shape and the positions of vertices on the network convey helpful information about the diversity and quality of decision trees. For example, consider the disconnected components of the mapper graph in Figure 4. Although we use a high overlap value of 0.6, which increases the probability of establishing an edge between two clusters, three groups of vertices (i.e., clusters) at the bottom right have no edges to the rest of the network; they form disconnected components. This phenomenon is inherently due to the dissimilarity of some decision trees and their encodings with respect to the rest of the decision trees. However, this observation of dissimilarity does not tell us anything about the utility of such isolated clusters and their decision trees. In one (and most likely) scenario, trees of the isolated vertices may have been built on useless features or noisy data points that add no predictive power to the forest. In a second scenario, the isolated trees may have been built on the most predictive features and data points. We design topological cluster selection strategies to test such hypotheses and evaluate the predictive power of clusters.
We will use Figure 6 to explain three selection strategies: random, greedy mapper, and quality. In all three strategies, we first compute a cluster graph CT with a set of clusters defined as vertices on the graph. For simplicity, we will use a graph notation and refer to a cluster k as v k ∈ CT .

1) RANDOM
We randomly select n clusters and build a Random Forest from the union of the trees of the chosen clusters.

2) GREEDY MAPPER
We build an ensemble from the trees of each cluster and test the predictive power of that ensemble on validation data. We then select the n clusters that individually yield the highest AUC and create an ensemble from the union of their decision trees. Finally, we test this ensemble on out-of-bag test data.
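A minimal sketch of the greedy selection step follows, assuming the per-cluster validation AUC has already been computed. The function name `greedy_select`, the cluster names, and the AUC values are illustrative and not taken from the experiments.

```python
def greedy_select(cluster_trees, cluster_auc, n):
    """Pick the n clusters with the highest validation AUC.

    cluster_trees: dict cluster name -> set of tree ids.
    cluster_auc: dict cluster name -> AUC of that cluster's own ensemble.
    Returns the chosen clusters and the union of their trees.
    """
    ranked = sorted(cluster_auc, key=cluster_auc.get, reverse=True)[:n]
    ensemble = set()
    for cluster in ranked:
        ensemble |= cluster_trees[cluster]  # soft clusters may share trees
    return ranked, sorted(ensemble)


# Illustrative values only.
cluster_trees = {"a": {0, 1}, "b": {1, 2}, "c": {3}}
cluster_auc = {"a": 0.81, "b": 0.90, "c": 0.75}
chosen, ensemble = greedy_select(cluster_trees, cluster_auc, n=2)
```

Taking the union as a set means a tree that belongs to two selected clusters enters the final ensemble only once.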

3) QUALITY
The Quality Strategy uses a homogeneity metric-based selection which we define as the average tree agreements over correct and incorrect predictions for all validation data points. If most trees agree on a label, homogeneity will be high. However, the trees may agree on true (correctly predicted) and false (incorrectly predicted) labels. Clusters whose trees are homogeneous on true labels are preferable to those of false labels. In both cases, we hypothesize that if decision trees of a cluster contradict each other in classification (i.e., low homogeneity), the cluster must have been formed out of ''bad trees''.
We split the set of data points $X = X_i \cup X_f \cup X_t$ w.r.t. the random forest classification. $X_t$ comprises data points classified correctly by the random forest by majority voting, whereas $X_f$ comprises data points that are misclassified by the random forest by majority voting. Lastly, $X_i$ includes points where the decision trees vote equally for true and false labels. Formally, we define true homogeneity as follows:
Definition 2: (True homogeneity of cluster $v_k$, where $h$ is the number of decision trees in cluster $v_k$ that correctly classify data point $x$)
Similarly, we define false homogeneity as follows:
Definition 3: (False homogeneity of cluster $v_k$, where $h$ is the number of decision trees in cluster $v_k$ that misclassify data point $x$)
We calculate a cluster quality score for all existing clusters based on the true and false homogeneity scores as follows:
Definition 4 (Cluster Quality Index):
In true homogeneity computations, we will use data points 1 and 2 for cluster 1, but only data point 1 for cluster 2. Data point 2 is misclassified by the majority of cluster 2 decision trees. As a result, data point 2 will be used to compute the false homogeneity of cluster 2.
Homogeneity neither punishes nor rewards tie cases, in which the number of misclassifying decision trees in a cluster equals the number of correctly classifying decision trees. After indexing all clusters, we select the highest quality clusters and create an ensemble out of their trees.
Quality Top-x: Differing from the Quality approach, after selecting the Top-K clusters based on the index score, we limit tree selection to the Top-1, Top-2, or Top-5 best trees in each cluster based on the tree score index. To calculate the tree score index inside each cluster, we count the positive and negative contributions of each tree. If a tree contributes to true homogeneity, it is rewarded with +1, and if it contributes to false homogeneity, it is penalized with −1. The top 1, 2, and 5 trees with the highest rewards are selected for an ensemble.
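The homogeneity idea in this section can be illustrated with a hedged sketch. The normalization used here (the fraction of agreeing trees, averaged over the points a cluster gets right or wrong by its own majority) is an assumption of this example and may differ from Definitions 2 and 3; the cluster quality index of Definition 4 is deliberately not reproduced. The function name `homogeneity` is hypothetical.

```python
def homogeneity(predictions, labels):
    """Return (true_homogeneity, false_homogeneity) for one cluster.

    predictions[t][i]: label predicted by tree t of the cluster for point i.
    labels[i]: ground-truth label of validation point i.
    ASSUMPTION: each point contributes the fraction of trees on the
    majority side; tie points are skipped (neither rewarded nor punished).
    """
    size = len(predictions)
    true_scores, false_scores = [], []
    for i, y in enumerate(labels):
        correct = sum(1 for tree in predictions if tree[i] == y)
        wrong = size - correct
        if correct > wrong:
            true_scores.append(correct / size)
        elif wrong > correct:
            false_scores.append(wrong / size)

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return mean(true_scores), mean(false_scores)


# Four trees, three validation points with labels 1, 0, 1.
labels = [1, 0, 1]
predictions = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
    [1, 1, 1],
]
true_h, false_h = homogeneity(predictions, labels)
```

Point 0 is classified correctly by all four trees, point 1 is misclassified by three of four, and point 2 is a two-against-two tie that contributes to neither score.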

IV. EXPERIMENTAL RESULTS
We have released our source code at https://github.com/cakcora/MultiverseJ where we have developed a complete classification Random Forest in Java.

A. DATASETS
For the experiments, we selected six classification datasets from the UCI Machine Learning Repository with two selection criteria to ensure robust classification results: the number of data points in a dataset must be more than 10K, and the dataset should have six or more features. As Table 1 shows, the selected datasets have diversified population sizes, attributes, and majority-class percentages. The Diabetes, Adult, and Nursery datasets have binary labels, while the other datasets have more than two classes.
Since our focus in this paper is binary classification tasks, we reduce the number of classes for non-binary datasets in the following way. For the Poker dataset, the nothing in hand, one pair, two pair, and three of a kind classes were relabeled as low probability to win; the other classes were relabeled as high probability to win. In the Connect-4 dataset, draw and loss were merged into a not-win class, alongside the existing win class. In the Letter Recognition dataset, the first 13 letters were merged into class 0, and the last 13 letters were merged into class 1.
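The relabeling rules above can be written as small mapping functions. This sketch assumes the UCI encodings (Poker hand classes 0 to 9 with 0 to 3 being nothing in hand through three of a kind, Connect-4 outcomes as the strings win, loss, and draw) and assumes that low probability to win and not-win map to 0; the numeric coding of those two datasets is not stated in the text.

```python
import string


def binarize_poker(hand_class):
    """Classes 0-3 (nothing, one pair, two pairs, three of a kind) -> 0, else 1."""
    return 0 if hand_class <= 3 else 1


def binarize_connect4(outcome):
    """'win' -> 1; 'draw' and 'loss' are merged into not-win -> 0."""
    return 1 if outcome == "win" else 0


def binarize_letter(letter):
    """First 13 letters (A-M) -> 0, last 13 letters (N-Z) -> 1."""
    return 0 if string.ascii_uppercase.index(letter.upper()) < 13 else 1


poker_labels = [binarize_poker(c) for c in [0, 3, 4, 9]]
connect4_labels = [binarize_connect4(o) for o in ["win", "draw", "loss"]]
letter_labels = [binarize_letter(ch) for ch in "AMNZ"]
```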

B. FEATURES AND GRAPH ENCODINGS
During the training phase (See Section III-C), we extract a graph from each decision tree of the vanilla Random Forest for all data sets. Next, we extract features from each graph and use the TDA Mapper to create a topological network.
We use hierarchical clustering to show the similarity of features over trees. Hierarchical clustering is not part of the proposed approach; however, we pass the graph features extracted from each decision tree to the TDA Mapper as input. If these features are highly correlated and clustered together in the feature space, the efficiency of the TDA Mapper step, which processes these features as its input, is reduced. Therefore, visualizing this clustering information is important for interpreting the experimental results.
In Figure 8 we show the relationship between the features where we hierarchically cluster the extracted features from graphs. Triads are the directed 3-vertex motifs [53], and they co-cluster well, except for triad 16, which is the strongest (closed) triangle motif. Unsurprisingly the median, average in (avgIn), and average out-degree (avgOut) co-cluster with the triads 12 and 13, which are closed triangle motifs. However, these degree-statistics-based features do not co-cluster with betweenness or edge count. This behavior arises because we one-hot encode categorical features of the data, creating many vertices (i.e., new features) in the graph. Such graphs have many edges, but they exhibit low clustering coefficients and few connected triads.

C. TUNING FOR TDA MAPPER
We experimented with a set of TDA Mapper parameters for each dataset and obtained the best overall AUC with n_cubes = 10, perc_overlap = 0.6, and number_of_clusters = 5. The performance is not sensitive to number_of_clusters within a cube or to n_cubes. However, perc_overlap determines how connected the topological network will be. A more connected network implies more decision trees shared between clusters. Clusters then contain less similar decision trees, which lowers classification homogeneity (i.e., trees vote differently on data points). Figure 9 shows that our methods are robust to the overlap percentage. Increasing the overlap percentage lowers homogeneity, but the decrease is not drastic (from 0.99 to 0.89), which shows that the features of the decision trees are diverse enough to form separate clusters even when we allow a higher overlap. As a result, trees in a cluster are similar and vote similarly. The high homogeneity is a significant result because, as we show in Section IV-D, it enables us to use fewer decision trees from a cluster while reaching similar classification performance.
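The effect of perc_overlap on connectivity can be illustrated with a simplified one-dimensional cover. The sketch below is not Kepler Mapper's exact cover construction: it assumes equally spaced intervals whose neighbors share a fraction perc_overlap of their length, and the function names are hypothetical. It counts how many intervals each point falls into; points that belong to several intervals are the ones that can be shared between clusters.

```python
import numpy as np


def interval_cover(lens, n_cubes, perc_overlap):
    """Cover the range of a 1-D lens with n_cubes equally spaced intervals.

    Simplifying assumption: adjacent intervals share a fraction
    perc_overlap of their length (requires 0 <= perc_overlap < 1).
    """
    lo, hi = float(np.min(lens)), float(np.max(lens))
    step = (hi - lo) / n_cubes               # spacing between interval starts
    length = step / (1.0 - perc_overlap)     # stretched interval length
    pad = (length - step) / 2.0              # extension on each side
    return [(lo + i * step - pad, lo + (i + 1) * step + pad)
            for i in range(n_cubes)]


def membership_counts(lens, intervals):
    """Return, for each lens value, the number of intervals containing it."""
    lens = np.asarray(lens, dtype=float)
    counts = np.zeros(lens.shape[0], dtype=int)
    for start, end in intervals:
        counts += ((lens >= start) & (lens <= end)).astype(int)
    return counts


rng = np.random.default_rng(0)
lens = rng.uniform(size=1000)                # synthetic lens values
for overlap in (0.0, 0.3, 0.6):
    counts = membership_counts(lens, interval_cover(lens, 10, overlap))
    print(f"perc_overlap={overlap}: mean intervals per point = {counts.mean():.2f}")
```

A higher mean membership means more points, here decision trees, appear in more than one interval, so neighboring clusters share more members and the resulting network is more connected.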

D. CLASSIFICATION RESULTS
Our experiments partition each dataset into 80% training, 10% validation, and 10% out-of-bag test subsets. We ran each experiment 30 times, using a different random seed to select the partitions in each replica. In each replica, a vanilla forest was trained on the training subset. We checked the vanilla forest for overfitting by comparing its overall accuracy on the training and test sets, which were close to each other.
In the next step, we apply Kepler Mapper [63], a Python implementation of TDA Mapper with built-in visualization, dimensionality reduction, and clustering options, to the decision trees of the vanilla forest to create and visualize our TDA results.
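The replicated 80/10/10 partitioning can be sketched as follows. The function name, the use of seeds 0 to 29, and the dataset size of 1000 are assumptions for illustration; the text only states that random seeds select the partitions.

```python
import numpy as np


def split_indices(n_samples, seed, train_frac=0.8, val_frac=0.1):
    """Shuffle sample indices with the given seed and cut them 80/10/10."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(train_frac * n_samples))
    n_val = int(round(val_frac * n_samples))
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]           # remaining ~10% of the samples
    return train, val, test


# One partition per replica; seeds 0..29 are an illustrative choice.
replicas = [split_indices(1000, seed) for seed in range(30)]
```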
We report our results in terms of ROC AUC and run-time performance. We compared six ensemble selection strategies with the vanilla Random Forest; each strategy is defined in Section III-E. Table 2 shows the mean and standard deviation of the AUC over the 30 replicas. Among all ensemble selection strategies, Greedy Mapper has the best AUC on four datasets. Greedy Mapper also loses only 2% of AUC while using a Topological Forest that is 10x smaller (30 trees) than the vanilla Random Forest, which has 300 decision trees. In the Poker and Letter Recognition datasets, where we have substantial class imbalances (99% and 80%, respectively), the AUC values of Greedy Mapper and Q Top5 are close to that of the Vanilla Forest, which offers evidence that selecting the best clusters can also help with class imbalance.
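For reference, the ROC AUC and its summary over replicas can be computed as in the sketch below. It uses the pairwise-comparison definition of AUC (the probability that a positive sample is scored above a negative one, with ties counted as one half). The function names are hypothetical, and the use of the sample standard deviation (ddof=1) is an assumption, since the text does not say which estimator is reported in Table 2.

```python
import numpy as np


def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")
    diff = pos[:, None] - neg[None, :]       # all positive-negative score gaps
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


def summarize(aucs):
    """Mean and sample standard deviation of per-replica AUC values."""
    aucs = np.asarray(aucs, dtype=float)
    return float(aucs.mean()), float(aucs.std(ddof=1))
```

The pairwise form needs memory proportional to the number of positive-negative pairs, so it suits small illustrations; a rank-based implementation is preferable for large test sets.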
We also report the prediction quality of three random forest policies that use 5%, 10%, and 100% of the decision trees, as a proxy for all approaches that tune the number of decision trees. Table 3 shows the results of selecting decision trees at random. Since Topological Forest uses 10% of the decision trees, comparing these three policies indicates the upper and lower bounds on prediction quality when only the number of trees is tuned. From these results, we find that the quality and diversity of the decision trees are very important factors in further improving the prediction quality of a random forest. As shown in Table 3, the Greedy Mapper approach produces a random forest with 10% of the decision trees that outperforms the random policy using 10% of the decision trees.
Table 4 shows the computational cost of inference for our best approach, Greedy Mapper, and for the Vanilla Random Forest when deployed on the test data. Here, the computational cost of inference is defined as the total CPU time spent on all computations, including the cost of inference from the individual decision trees. In the best case, the Adult dataset, Greedy Mapper reduces the computational cost by a factor of 217 compared to Vanilla Forest, whereas the improvement is around 22 times for the Binary Poker dataset. Greedy Mapper reduces inference time on the test data this sharply for two reasons. First, the number of decision trees in the ensemble is reduced by 90% or more after applying the ensemble selection strategies. Second, trees built on low-quality features grow overly complex in order to fit the training data better, and such deeper trees are unlikely to appear in the best-performing clusters. As a result, Greedy Mapper excludes them, which further improves run time.
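The random-selection policies compared in Table 3 can be sketched as follows, assuming the per-tree positive-class probabilities are available as an array of shape (n_trees, n_samples) and that the ensemble score is their plain average. The function name and the synthetic probabilities are hypothetical.

```python
import numpy as np


def random_subset_scores(tree_probs, fraction, seed=0):
    """Average the predictions of a uniformly random fraction of the trees.

    tree_probs: array of shape (n_trees, n_samples) holding each tree's
    predicted positive-class probability (an assumed input format).
    """
    tree_probs = np.asarray(tree_probs, dtype=float)
    n_trees = tree_probs.shape[0]
    k = max(1, int(round(fraction * n_trees)))     # number of trees to keep
    rng = np.random.default_rng(seed)
    chosen = rng.choice(n_trees, size=k, replace=False)
    return chosen, tree_probs[chosen].mean(axis=0)


# Synthetic stand-in for a 300-tree forest evaluated on 50 samples.
rng = np.random.default_rng(1)
tree_probs = rng.uniform(size=(300, 50))
chosen, scores = random_subset_scores(tree_probs, 0.10)
```

With fraction set to 0.05, 0.10, and 1.0, this reproduces the three ensemble sizes of the random policies (15, 30, and 300 trees for a 300-tree forest), against which a selection strategy with the same tree budget can be compared.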
We also compared the computational cost of training for all approaches in Table 5. Here, the computational cost of training is defined as the total CPU time spent on all computations, including training the individual decision trees and any downstream steps such as the graph encoding of decision trees or running TDA Mapper. While, on average, Greedy Mapper is 4.1 times slower than Vanilla Forest in training, it is 113 times faster than Vanilla Forest in inference. Thus, the trade-off between the higher training cost and the much lower inference cost favors the use of Topological Forest.
The efficiency of Topological Forest comes mainly from the computational time saved during inference. Topological Forest uses 10% of the trees and yields performance comparable to Vanilla Forest. Furthermore, Topological Forest outperforms Vanilla Forest on the Nursery and Letter Recognition datasets (Table 2), so our method also improves AUC on some datasets.
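The trade-off can be made concrete with a break-even estimate: the number of inference passes after which the extra training time is recovered. The sketch below is a back-of-the-envelope calculation, not a result from the paper. It assumes the average factors (4.1x and 113x) apply to the same workload, and the function name and the absolute CPU times passed to it are hypothetical.

```python
def break_even_passes(train_cpu_vanilla, infer_cpu_vanilla,
                      train_slowdown=4.1, infer_speedup=113.0):
    """Inference passes needed before the extra training cost is paid back.

    train_cpu_vanilla: CPU time to train the vanilla forest.
    infer_cpu_vanilla: CPU time of one vanilla inference pass (same units).
    """
    extra_training = (train_slowdown - 1.0) * train_cpu_vanilla
    saving_per_pass = infer_cpu_vanilla * (1.0 - 1.0 / infer_speedup)
    return extra_training / saving_per_pass


# Hypothetical workload: training costs ten times one inference pass.
print(round(break_even_passes(10.0, 1.0), 1))
```

Under these assumptions, the break-even point is roughly 3.1 times the ratio of vanilla training time to vanilla inference time per pass, so the selection step pays off whenever a trained model is queried many times.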

V. CONCLUSION AND FUTURE WORK
We have developed an open-source implementation of our novel ML method, Topological Forest. Our approach builds on a Vanilla Random Forest implementation but uses topological methods to create a refined ensemble with fewer and better decision trees. On average, Topological Forest speeds up inference by more than 100x at a cost of at most a 2% reduction in AUC. The results of our experiments suggest that Topological Forest is considerably faster than Random Forest at inference time. Moreover, it needs fewer resources and less effort than neural networks.
As future work, exploring the capabilities of Topological Forest in other machine learning tasks is a promising area of research. We will also use Topological Forest in our future research to address the distribution shift problem by developing more diverse random forests.