A Breast Cancer Diagnosis Method Based on VIM Feature Selection and Hierarchical Clustering Random Forest Algorithm

Breast cancer is a neoplastic disease that seriously threatens women’s health and is regarded as the most common cause of cancer death in women. Accurate detection and effective treatment are vital to lowering the death rate of breast cancer. In recent years, machine learning techniques have been considered effective for the accurate diagnosis of various diseases, and among them Random Forest (RF) has been widely applied. However, decision trees with poor classification performance and high similarity may be generated during training, which degrades the overall classification performance of the model. In this paper, a Hierarchical Clustering Random Forest (HCRF) model is developed. The similarity among all the decision trees is measured, and hierarchical clustering is used to group the trees into clusters. Representative trees are then selected from the clusters to construct a hierarchical clustering random forest with low similarity and high accuracy. In addition, we use the Variable Importance Measure (VIM) method to optimize the number of selected features for breast cancer prediction. The Wisconsin Diagnosis Breast Cancer (WDBC) database and the Wisconsin Breast Cancer (WBC) database from the UCI (University of California Irvine) Machine Learning repository are employed in this study. The performance of the proposed method is evaluated using accuracy, precision, sensitivity, specificity and AUC (Area Under ROC Curve). Experimental results indicate that classification based on the HCRF algorithm with VIM feature selection reaches the best accuracy, 97.05% on the WDBC dataset and 97.76% on the WBC dataset, compared with Decision Tree, Adaboost and Random Forest. The method proposed in this study is an effective tool for diagnosing breast cancer.


I. INTRODUCTION
Breast cancer is one of the most important problems in women's health and has become the malignant tumor with the highest incidence in women globally [1], [2]. According to the global cancer data for 2020, breast cancer has overtaken lung cancer as the most commonly diagnosed cancer worldwide.
Accurate and early diagnosis increases the probability that patients receive timely and effective treatment and thus reduces the mortality of breast cancer [3]. The diagnosis of breast cancer mainly includes pathological diagnosis and imaging diagnosis. Compared with pathological diagnosis, imaging diagnosis is a non-invasive diagnostic means that has attracted wide attention in recent years [4]-[6]. However, imaging diagnosis can often be confirmed only after the tumor has become visible, so early detection may be missed. Fine Needle Aspiration biopsy (FNA) is a minimally invasive pathological diagnosis method based on cell morphology [7], which has great potential to provide highly accurate diagnosis with few false positives. First, a fine needle is used to extract cells from the breast tumor. Then the cell size, thickness, uniformity, smoothness and other data are statistically analyzed. Finally, these data are used to make predictions for new cases.
The associate editor coordinating the review of this manuscript and approving it for publication was Gustavo Callico.
Machine learning is a process of utilizing data to discover latent information that may not be easily identified [8], which makes it suitable for prediction with FNA data.
Random forest is one of the ensemble learning methods widely used in disease detection. Its randomness is reflected in two aspects: the samples and the features used in each decision tree. Therefore, compared with a single decision tree, random forest reduces the possibility of over-fitting. Moreover, it is less susceptible to unbalanced samples, noise and outliers, and usually achieves high prediction accuracy [9]. Scholars have already applied random forest to the diagnosis of different kinds of diseases [10]-[14]. The existing approaches to improving random forest mainly focus on the improvement of the decision tree algorithm, the modification of the voting method, the preprocessing of the data set and the optimization of feature selection. However, a random forest classifier becomes more effective only if its decision trees are diverse [15].
Clustering analysis, the process of classifying objects into groups according to their similarities [18], is an important task in the field of data mining [16], [17]. Various clustering methods have been proposed, mainly including k-means [19], hierarchical techniques [20], [21], density-based techniques [22], [23] and grid-based algorithms [24]. Chavent [25] developed a technique that combines feature selection and variable clustering. Another study, on the application of a random clustering forest based on the extended belief rule-based (EBRB) system, has been published by Murugan et al. [26].
Feature selection is also a vital procedure before a classification task, since biomedical data sets are often high-dimensional and may include irrelevant and redundant features [27]-[29]. Hou proposed a sparse matrix regression (SMR) feature selection algorithm based on matrix data and sparse constraints [30], which was effective in scene classification. Luo proposed a semi-supervised feature selection method based on insensitive sparse regression (ISR) and applied it to video semantic recognition [31]. In order to efficiently process high-dimensional non-Gaussian data in face recognition, an adaptive discriminant analysis (ADA) method was proposed in [32], which can distinguish the importance of each data point. Variable Importance Measure (VIM) is a method of ranking the importance of features according to the Gini index [10], [33], which helps to pick out the most important features and thereby improve the classification performance.
In this paper, a breast cancer diagnosis methodology that uses VIM for feature selection and Hierarchical Clustering Random Forest (HCRF) for classification is proposed. We first generate a traditional random forest with several decision trees. Then, we group the decision trees into several clusters according to the similarity between them. Finally, the decision tree with the best performance in each cluster is retained to construct the HCRF model. VIM is adopted to extract the most significant tumor features of breast cancer for model construction. The optimal feature subset is obtained by deleting the less important features so as to improve the HCRF classifier in terms of training time, generalization ability and simplicity.
Finally, a grid search algorithm is used to optimize the parameters of our model. Experimental results show that classification based on the HCRF algorithm with VIM as the feature selection method is a practical approach to the early diagnosis of breast cancer.
The main contributions are summarized as follows:
1. An intelligent diagnosis algorithm, HCRF, is proposed for the detection of breast cancer.
2. A feature selection method called VIM which uses the Gini index to measure the importance of each feature is used before classification to help us select the optimal feature subset.
3. Hierarchical clustering is introduced to improve the diversity and classification ability of decision trees in the random forest. This proposed method has great reference value for designing structural diversity using other types of basic learners or other ensemble learning algorithms.
4. The developed model is superior to other classifiers such as decision tree, Adaboost and random forest, and performs better than other state-of-the-art machine learning models for breast cancer detection.
The remainder of the paper is arranged as follows. Section II describes the datasets and the proposed method in detail and gives the evaluation metrics. Section III evaluates the proposed method and presents the experimental results and discussion. Section IV summarizes the paper and outlines future work.

A. DATASET DESCRIPTION
The Wisconsin Diagnosis Breast Cancer (WDBC) and Wisconsin Breast Cancer (WBC) datasets used in this research were obtained from Dr. William H. Wolberg at the University of Wisconsin Hospitals, Madison [34], [35]. The WDBC database contains 569 instances. Each instance consists of 30 attributes and a class label. The features are computed from a digitized image of an FNA of a breast mass and describe the traits of the cell nuclei [36], [37]. The WBC dataset includes 699 samples. Each sample has 9 features and a class label. The 16 instances that contain missing values are removed from the WBC dataset. TABLE 1 and TABLE 2 give the detailed information of the two datasets.
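The cleaning step for the WBC data can be illustrated with a small sketch. It assumes a comma-separated layout with a leading sample ID, nine integer features and a class label, and it assumes that "?" marks a missing value; the rows are made up for the example, and `parse_complete_rows` is a hypothetical helper rather than code from this study. Applied to the full WBC data, such a filter would leave 699 - 16 = 683 instances.

```python
# Made-up rows in an assumed layout: ID, nine integer features, class label.
# "?" is assumed to mark a missing value.
RAW_ROWS = [
    "1,5,1,1,1,2,1,3,1,1,2",
    "2,8,4,5,1,2,?,7,3,1,4",
    "3,8,10,10,8,7,10,9,7,1,4",
]


def parse_complete_rows(lines, missing="?"):
    """Return (features, label) pairs, skipping rows with a missing field."""
    samples = []
    for line in lines:
        fields = line.strip().split(",")
        if missing in fields:
            continue  # drop the incomplete instance
        values = [int(v) for v in fields]
        samples.append((values[1:-1], values[-1]))  # discard the ID column
    return samples


complete = parse_complete_rows(RAW_ROWS)
print(len(complete), "of", len(RAW_ROWS), "rows kept")
```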

B. RANDOM FOREST CLASSIFIER
Random Forest (RF) is an ensemble learning method proposed by Breiman in 2001 [38]. It is based on the idea of Bagging. The Bootstrap sampling technique is used to draw several different training subsets from the original training set, and a decision tree is trained on each training subset. Together, these decision trees form a random forest. For a binary task, the final prediction is determined by the majority vote of all the trees.
VOLUME 10, 2022
Suppose that there is a training set X with M samples, and each sample consists of N input features and a classification label. The random forest is constructed as follows.
Step 1: Draw M samples from the training set X with replacement by applying the Bootstrap technique.
Step 2: Select n features (n < N) at random; among them, the feature with the minimum Gini value is used to split the node of the decision tree.
Step 3: Repeat Step 1 and Step 2 K times to obtain K decision trees.
Step 4: Combine the decision trees into a random forest and determine the classification result by voting.
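The four steps can be illustrated with a deliberately small sketch. To stay short, it replaces each full decision tree with a one-split tree (a stump), which is a simplification and not the tree learner used in this study. The names `fit_forest` and `fit_stump` and the toy data are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X, y, features):
    """One-split tree: choose the (feature, threshold) with minimum Gini."""
    # Fallback when no threshold splits the node: always predict the majority.
    best_score = None
    best_stump = (features[0], float("inf"), majority(y), majority(y))
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not right:
                continue  # this threshold does not split the node
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best_score is None or score < best_score:
                best_score = score
                best_stump = (f, t, majority(left), majority(right))
    return best_stump


def predict_stump(stump, row):
    f, t, left_label, right_label = stump
    return left_label if row[f] <= t else right_label


def fit_forest(X, y, n_trees, n_features, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):                                  # Step 3
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # Step 1: bootstrap
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        features = rng.sample(range(len(X[0])), n_features)   # Step 2: n < N
        forest.append(fit_stump(Xb, yb, features))
    return forest


def predict_forest(forest, row):
    """Step 4: majority vote over all trees."""
    return majority([predict_stump(stump, row) for stump in forest])


# Toy data (made up): every feature separates the two classes.
X = [[1, 1, 1], [2, 2, 2], [8, 8, 8], [9, 9, 9]]
y = [0, 0, 1, 1]
forest = fit_forest(X, y, n_trees=25, n_features=2)
print(predict_forest(forest, [1, 1, 1]), predict_forest(forest, [9, 9, 9]))
```

Individual stumps can be wrong when their bootstrap sample happens to contain a single class, which is exactly the situation that voting over many trees is meant to absorb.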

C. VARIABLE IMPORTANCE MEASURE (VIM) METHOD
Selecting the most discriminating features plays a crucial role in early cancer detection because it provides clinical information about potential biomarkers. Therefore, we need to find an optimal feature subset to improve the accuracy of classification [39], [40].
The VIM method is used to calculate the importance of all the features and rank them accordingly [10]. By deleting the less important features, the significant tumor features of breast cancer are extracted to form the optimal feature subset, which enhances the performance of the classifier.
Assume that the training set has N features $\{N_1, N_2, \ldots, N_N\}$. We use the Gini index to select the optimal partitioning feature at each node when constructing the decision trees of the random forest. The Gini index reflects the probability that two samples randomly selected from the subset obtained after a node split belong to different categories. The smaller the Gini index, the higher the purity of the subset, which means that the chosen partitioning feature is more conducive to classification. The Gini index of node m is calculated as

$$G_m = \sum_{c=1}^{C} p_{mc}\,(1 - p_{mc}) = 1 - \sum_{c=1}^{C} p_{mc}^{2} \tag{1}$$

where C is the number of categories in the training set and $p_{mc}$ is the probability of class c at node m. In a binary task, C = 2, and the Gini index of node m is

$$G_m = 2\,p_m\,(1 - p_m) \tag{2}$$

where $p_m$ is the probability of class 0 or 1. The feature importance of $N_i$ at node m is calculated by

$$\mathrm{VIM}_{im} = G_m - w_L G_L - w_R G_R \tag{3}$$

where $G_L$ and $G_R$ represent the Gini index of the left and right nodes after node m is split, respectively, and $w_L$ and $w_R$ are the weighted numbers of samples reaching the left and right nodes after node m is split, respectively. If feature $N_i$ is selected M times in the decision tree $T_j$, then the feature importance of $N_i$ on decision tree $T_j$ is calculated by

$$\mathrm{VIM}_{ij} = \sum_{m=1}^{M} \mathrm{VIM}_{im} \tag{4}$$

Finally, the feature importance of $N_i$ in the random forest is defined as

$$\mathrm{VIM}_{i} = \frac{\sum_{j=1}^{K} \mathrm{VIM}_{ij}}{\sum_{i'=1}^{N} \sum_{j=1}^{K} \mathrm{VIM}_{i'j}} \tag{5}$$

where K is the number of decision trees in the random forest and N is the number of input features in the training set.
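A short numerical sketch may help to make the Gini-based importance concrete. It assumes that the importance of a split is the Gini decrease G_m - w_L G_L - w_R G_R, with w_L and w_R taken as the fractions of the parent's samples reaching each child, and that the final importances are normalised to sum to one. These conventions, the function names and the toy labels are illustrative assumptions, not the exact implementation of this study.

```python
from collections import Counter


def gini(labels):
    """Gini index of a node: 1 - sum_c p_c^2 (equals 2p(1-p) for two classes)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_importance(parent, left, right):
    """Gini decrease of one split, G_m - w_L*G_L - w_R*G_R.

    Assumption: w_L and w_R are the fractions of the parent's samples that
    reach the left and right child.
    """
    w_left = len(left) / len(parent)
    w_right = len(right) / len(parent)
    return gini(parent) - w_left * gini(left) - w_right * gini(right)


def normalise(raw):
    """Scale per-feature importances so that they sum to one."""
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


parent = [0, 0, 0, 0, 1, 1, 1, 1]
# Feature "a" gives a weak split, feature "b" a perfect one (made-up labels).
raw = {
    "a": split_importance(parent, [0, 0, 0, 1], [0, 1, 1, 1]),
    "b": split_importance(parent, [0, 0, 0, 0], [1, 1, 1, 1]),
}
print(raw)             # {'a': 0.125, 'b': 0.5}
print(normalise(raw))  # {'a': 0.2, 'b': 0.8}
```

In a real forest, a feature's raw importance would be the sum of such decreases over every node that splits on it and over all K trees before normalisation.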

D. HIERARCHICAL CLUSTERING RANDOM FOREST CLASSIFIER
The generalization ability of random forest is positively correlated with the classification ability of an individual decision tree and the diversity among decision trees [38]. Therefore, we try to improve the random forest model from two aspects: one is to improve the classification accuracy of a single decision tree; the other is to reduce the correlation between decision trees. According to the random forest algorithm, we generate an initial random forest with K decision trees. We use the hierarchical clustering method to divide the decision trees into several clusters. We first regard each decision tree as an initial cluster. The two clusters with the highest similarity are found and they are combined into one cluster. Then the similarity between the new clusters is recalculated. The process of clustering is iterated multiple times and the decision trees in the initial random forest are clustered into several clusters. Finally, we select the decision tree with the highest AUC (Area Under ROC Curve) from each cluster as the representative of this cluster, and eliminate other decision trees from the clusters to obtain the HCRF model.
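The clustering and selection procedure described above can be sketched as follows, given any pairwise dissimilarity between trees and a per-tree AUC. The sketch assumes average linkage when the similarity between merged clusters is recalculated, since the text does not state which linkage is used; `build_hcrf` and the numbers in the example are hypothetical.

```python
def build_hcrf(dist, auc, n_clusters):
    """Return the indices of the trees kept in the pruned forest.

    dist[i][j]: dissimilarity between trees i and j (smaller = more similar).
    auc[i]:     AUC of tree i. Both are assumed to be computed beforehand.
    """
    clusters = [[i] for i in range(len(auc))]  # each tree starts as a cluster

    def linkage(a, b):
        # Average pairwise dissimilarity between two clusters (an assumption;
        # the text does not state which linkage is used).
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        pairs = [(p, q) for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        p, q = min(pairs, key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        clusters[p] = clusters[p] + clusters[q]  # merge the most similar pair
        del clusters[q]

    # Keep only the tree with the highest AUC in every cluster.
    return [max(cluster, key=lambda i: auc[i]) for cluster in clusters]


# Made-up values for four trees: trees 0/1 and trees 2/3 behave alike.
dist = [
    [0.0, 0.1, 0.6, 0.7],
    [0.1, 0.0, 0.5, 0.6],
    [0.6, 0.5, 0.0, 0.2],
    [0.7, 0.6, 0.2, 0.0],
]
auc = [0.90, 0.95, 0.93, 0.88]
print(build_hcrf(dist, auc, n_clusters=2))  # [1, 2]
```

Trees 0 and 1 merge first, then trees 2 and 3, and the better tree of each pair survives, so the pruned forest keeps two dissimilar, individually strong trees.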
The similarity between decision trees is calculated in this paper with the disagreement measure (DIS), since it is a simple and effective way to measure diversity in decision forests [15]. DIS is the ratio of the number of observations on which one classifier is correct and the other is incorrect to the total number of observations. It can be calculated with formula (6) using the variables defined in TABLE 3.
Suppose there are two decision trees $T_i$ and $T_j$; their classification results for the samples in the training set are listed in TABLE 3. $x_{11}$ is the number of training samples correctly classified by both $T_i$ and $T_j$, and $x_{00}$ is the number of training samples incorrectly classified by both $T_i$ and $T_j$. $x_{10}$ is the number of training samples correctly classified only by $T_i$, while $x_{01}$ is the number of training samples correctly classified only by $T_j$.
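The disagreement measure follows directly from these counts, as the minimal sketch below illustrates: only the cases where exactly one tree is right enter the numerator, and the denominator is the total number of samples. The function names and the toy predictions are made up for the example.

```python
def disagreement(pred_i, pred_j, y_true):
    """DIS between two trees: share of samples that exactly one tree gets right."""
    x10 = x01 = 0
    for a, b, y in zip(pred_i, pred_j, y_true):
        if a == y and b != y:
            x10 += 1  # only tree i is correct
        elif a != y and b == y:
            x01 += 1  # only tree j is correct
    return (x10 + x01) / len(y_true)  # denominator = x11 + x10 + x01 + x00


def disagreement_matrix(predictions, y_true):
    """K x K matrix of pairwise DIS values for K trees."""
    k = len(predictions)
    return [[disagreement(predictions[i], predictions[j], y_true)
             for j in range(k)] for i in range(k)]


y_true = [1, 1, 0, 0, 1]
predictions = [
    [1, 1, 0, 0, 0],  # tree 0: wrong on the last sample
    [1, 0, 0, 0, 1],  # tree 1: wrong on the second sample
    [1, 1, 0, 0, 0],  # tree 2: identical to tree 0
]
for row in disagreement_matrix(predictions, y_true):
    print(row)
# [0.0, 0.4, 0.0]
# [0.4, 0.0, 0.4]
# [0.0, 0.4, 0.0]
```

Identical trees get a value of 0, so they are the first candidates to be merged into one cluster.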
Then, the similarity between decision trees $T_i$ and $T_j$ is calculated as follows.

$$D_{i,j} = \frac{x_{01} + x_{10}}{x_{01} + x_{10} + x_{00} + x_{11}} \tag{6}$$

The smaller the value of $D_{i,j}$, the greater the similarity. According to the above equation, we calculate the similarities between the decision trees of the initial random forest and obtain a similarity matrix Sim, which is defined as

$$\mathrm{Sim} = \begin{bmatrix} D_{1,1} & \cdots & D_{1,K} \\ \vdots & \ddots & \vdots \\ D_{K,1} & \cdots & D_{K,K} \end{bmatrix} \tag{7}$$

The construction process of HCRF is summarized as follows.

    Input: initial random forest {T_i, i = 1, 2, ..., K}; number of clusters Q
    for i = 1, 2, ..., K do
        C_i = {T_i}
    end for
    for i = 1, 2, ..., K do
        for j = 1, 2, ..., K do
            Sim(i, j) = D(C_i, C_j)
        end for
    end for
    repeat
        Find the two clusters C_i and C_j with the highest similarity in the matrix Sim
        Merge C_i and C_j: C_i = C_i ∪ C_j
        Update the matrix Sim
        K = K - 1
    until K = Q
    for i = 1, 2, ..., Q do
        Select the decision tree T_i with the highest AUC as the representative of cluster C_i and delete the other decision trees
    end for
    Output: HCRF {T_i, i = 1, 2, ..., Q}

E. EVALUATION METRICS
For a binary classification, the model divides the samples into two categories: Positive and Negative. If the prediction and the ground truth are both positive, the case is called a True Positive (TP). If the prediction is negative but the ground truth is positive, it is called a False Negative (FN). If the prediction is positive but the ground truth is negative, it is called a False Positive (FP). If the prediction and the ground truth are both negative, it is called a True Negative (TN). From these four cases, a confusion matrix can be obtained (TABLE 4).
The evaluation metrics used in this paper include: accuracy, precision, sensitivity, specificity and AUC.
The TP, FP, TN and FN counts can be used to construct a Receiver Operating Characteristic (ROC) curve, which shows the trade-off between the TP rate and the FP rate as the decision threshold varies. The ROC curve is typically plotted as the TP rate against the FP rate. By calculating the area under the ROC curve, we obtain the AUC value [41].
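The metrics listed above can be computed from the confusion counts with their standard definitions, as sketched below. The AUC is computed here through its rank interpretation, the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, which equals the area under the ROC curve. The function names and the toy labels and scores are made up for the example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN and FN for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def basic_metrics(tp, fp, tn, fn):
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


def auc_score(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == positive]
    neg = [s for y, s in zip(y_true, scores) if y != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold of 0.5 (assumption)

counts = confusion_counts(y_true, y_pred)
print(counts)                     # (3, 1, 3, 1)
print(basic_metrics(*counts))     # every metric equals 0.75 here
print(auc_score(y_true, scores))  # 0.9375
```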

III. EXPERIMENTAL RESULTS AND DISCUSSION
In the experiment, we used 70% of the data as the training set and the remaining 30% as the test set. Feature selection was first carried out using the VIM method. Feature subsets of sizes from 1 to N were produced according to the ranking of the feature importance, where N denotes the number of all features in the dataset (a subset of size N means no feature selection). For each feature subset, a grid search algorithm was used to optimize the parameters of the models. FIGURE 2 shows the distribution of the feature importance for the WDBC and WBC datasets.
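The subset search can be sketched as follows: the features are ranked by importance, the nested subsets of size 1 to N are formed, and the subset with the best score is kept. The `evaluate` callback stands in for tuning, training and testing a model on a given subset; it, the feature names and the numbers are hypothetical.

```python
def ranked_subsets(importance):
    """Return the top-1, top-2, ..., top-N feature subsets by importance."""
    ranking = sorted(importance, key=importance.get, reverse=True)
    return [ranking[:size] for size in range(1, len(ranking) + 1)]


def best_subset(importance, evaluate):
    """Pick the subset with the highest score.

    `evaluate` is a hypothetical callback that would tune, train and test a
    model on the given features and return its accuracy.
    """
    return max(ranked_subsets(importance), key=evaluate)


# Made-up importances and accuracies, for illustration only.
importance = {"f1": 0.15, "f2": 0.50, "f3": 0.05, "f4": 0.30}
toy_accuracy = {1: 0.90, 2: 0.94, 3: 0.96, 4: 0.95}
chosen = best_subset(importance, evaluate=lambda feats: toy_accuracy[len(feats)])
print(chosen)  # ['f2', 'f4', 'f1']
```

Because the subsets are nested, only N models have to be evaluated instead of all 2^N - 1 possible feature combinations.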
Then, DT, Adaboost, RF and our HCRF model were tested on the WDBC and WBC datasets, and the accuracy obtained with each feature subset of size 1 to N is shown in FIGURE 3. The comparison of the four models shows that HCRF achieves a higher accuracy than the other models regardless of the feature subset chosen. The maximum accuracy was achieved by using the first 24 features on the WDBC dataset and the first 8 features on the WBC dataset. Thus, these features were regarded as the most discriminating features obtained using the VIM technique. Benign cells are usually monolayer while malignant cells are usually multilayered. Normal nucleoli describes small structures present in the nucleus; nucleoli are usually small but become prominent in malignant cells. Bare nuclei are nuclei that lack cytoplasm, and cells that exhibit this phenomenon are likely to be malignant. We consider these features the optimal feature subsets, which help doctors find the most essential features in each dataset [42].
We also optimize our HCRF model by selecting different numbers of clusters based on an initial random forest. As shown in FIGURE 4, the accuracy gradually rises as the number of clusters increases and reaches a peak when the number of clusters is 14 (WDBC) and 25 (WBC). The accuracy then drops as the number of clusters keeps increasing, indicating that either too many or too few clusters will reduce the accuracy of the HCRF model. FIGURE 5 shows the process of clustering according to the similarity between the clusters. In each cluster, the decision tree with the highest AUC is picked out to construct the HCRF. TABLE 6 lists the representative decision trees in each cluster when HCRF achieves the highest accuracy.
The comparison results are given in TABLE 7. The proposed HCRF model outperforms the other models in terms of accuracy, precision, sensitivity, specificity and AUC. The highest accuracy achieved by HCRF on the WDBC dataset is 97.05%, obtained with a subset of 24 features. The number of decision trees is reduced from 65 in the initial random forest to 14 in the hierarchical clustering random forest. Compared with DT, Adaboost and RF, the accuracy increases by 5.59%, 3.72% and 0.68%, respectively. For the WBC dataset, the highest accuracy of 97.76% is also obtained by HCRF, using a subset of 8 features. The number of decision trees is reduced from 75 in the initial random forest to 25 in the hierarchical clustering random forest. The accuracy increases by 3.37%, 2.74% and 0.5%, respectively, compared with DT, Adaboost and RF.
We also compare the performance of the HCRF algorithm with that of other published methods. The compared models also use feature selection methods to remove redundant features. TABLE 8 and TABLE 9 show the results of our proposed HCRF model and some published studies on the WDBC and WBC datasets. The tables show that our proposed model, which uses VIM as the feature selection method and HCRF as the classifier, achieves the best performance.

IV. CONCLUSION
In conclusion, we have developed a model for breast cancer diagnosis which uses VIM for feature selection and HCRF for classification. Both processes not only enhance the performance and generalization ability of the classifier but also reduce the complexity and testing time of the model. Our proposed method achieves 97.05% accuracy on the WDBC dataset and 97.76% accuracy on the WBC dataset. Compared to the traditional random forest, the proposed HCRF model increases accuracy by 0.68% and 0.5% on the WDBC and WBC datasets, respectively. This matters in real diagnostic scenarios, since more breast cancer cases can be detected in time and more lives could be saved.
Our proposed method has great reference value for designing structural diversity using other types of basic learners, such as neural networks and support vector machines, or other ensemble learning algorithms. The proposed method could also be applied to detecting other types of cancer and provide doctors with guidance for early diagnosis, and it has many useful applications in clinical breast tumor diagnosis. For patients with a history of breast cancer, such a model can lead to more rapid intervention with the most appropriate treatment.
As future work, we plan to visualize the decision trees and take structural diversity into consideration to further enhance the diversity among the decision trees of the random forest. Moreover, we will use heuristic algorithms to optimize the relevant parameters to make our method more intelligent.