Improving the Heart Disease Detection and Patients’ Survival using Supervised Infinite Feature Selection and Improved Weighted Random Forest

Heart disease is the leading cause of death worldwide. A Machine Learning (ML) system can detect heart disease in its early stages from clinical data and thereby help mitigate mortality rates. However, class imbalance and high dimensionality are persistent challenges in ML that prevent accurate predictive analysis in many real-world applications, including heart disease detection. This work proposes a new method to address these issues and improve the prediction of heart disease presence and patients' survival. It combines supervised infinite feature selection (Inf-FSs) to find the most significant features, an Improved Weighted Random Forest (IWRF) to predict heart disease, and Bayesian optimization to tune the new IWRF hyperparameters. Two public datasets, Statlog and the heart failure clinical records, were used to develop and validate the proposed model. The proposed model is compared with other hybrid models using performance metrics such as accuracy and F-measure. The results show that the proposed Inf-FSs-IWRF attains higher accuracy and F-measure than the other models on both datasets. Additionally, a comparative study against previous work shows that the proposed model outperforms earlier studies with accuracy improvements of 2.4% and 4.6% on the two datasets, respectively.

Various approaches to detecting cardiovascular disease (CVD) have been proposed in the literature. Earlier studies used different ML algorithms to detect CVD, focusing on feature selection (FS). These include rough sets (RS) to select the most significant features, which are then fed to a chaos firefly algorithm [5] or a backpropagation neural network (BPNN) [6] to predict heart disease (HD). In addition, Amin et al. and Chicco detected CVD by applying voting with Naive Bayes (NB) and logistic regression (LR) on selected features to predict HD presence and patients' survival [7,8]. The authors in [9] conducted a comparative analysis, employing different classifiers on various datasets, and concluded that the conditional inference tree forest (cforest) surpassed the other classifiers.
Haq et al. conducted a comparative analysis of hybrid models constructed from various feature selection strategies and ML models. Their research established that reducing the features affected the models' performance, and a Relief-LR combination delivered the maximum accuracy [10]. Gupta et al. created a machine-intelligence framework that includes factor analysis of mixed data (FAMD) and random forest (RF); FAMD was used to identify significant characteristics, and RF was used to predict CVD [11]. Khan and Algarni [12] developed an Internet of Medical Things (IoMT) framework to predict HD, using Modified Salp Swarm Optimization (MSSO) to optimize the parameters of an adaptive neuro-fuzzy inference system (ANFIS).
Also, Ali et al. proposed two stacked SVMs to predict CVD presence: the first SVM removes non-significant attributes, while the second, tuned with a hybrid grid search algorithm, indicates CVD presence or absence [13]. Tama et al. [14] developed a two-tier ensemble model to detect CVD; its stacked architecture combines the CVD forecasts of the selected ensemble learners XGBoost, RF, and gradient boosting machine (GBM), with particle swarm optimization employed to choose the most significant features. The authors in [15,16] developed an IoT framework to evaluate CVD status, using a modified deep convolutional neural network (MDCNN) to predict the patient's status from sensor data. However, since clinical datasets are imbalanced, the above studies have limitations in detecting CVD through the mentioned methods.
Some studies have developed reliable CVD detection and patient-survival models that address this issue. For example, Ishaq et al. applied the synthetic minority over-sampling technique (SMOTE) to balance the data distribution and extremely randomized trees (ET) on attributes selected by random forest importance ranking to predict patients' survival [17]. In addition, Fitriyani et al. developed a hybrid HD-detection method consisting of density-based spatial clustering of applications with noise (DBSCAN) to detect and remove outlier instances from the features selected by information gain; a hybrid SMOTE-ENN was then employed to balance the dataset, with extreme gradient boosting (XGBoost) for CVD detection [18]. Recently, Waqar et al. proposed SMOTE-based deep learning to predict heart disease, applying SMOTE to balance the dataset without the need for feature selection [19]. However, the SMOTE balancing method has limitations, including blind neighbour selection, instance overlapping, small disjuncts, and noise interference [20,21,22]. The related work is summarized in Table I in terms of the method used, feature selection, data balancing, validation method, and datasets.
Based on prior research, there is still a lack of models that address the imbalance issue at the algorithm level rather than the data level to improve the accuracy of CVD detection and survival prediction. Therefore, we propose an Improved Weighted Random Forest (IWRF) that tackles imbalanced classification through cost-sensitive learning to cope with these limitations. Moreover, we integrate the proposed IWRF with supervised infinite feature selection (Inf-FSs) for feature ranking and selection and with Bayesian Optimization (BO) to optimize the IWRF weighting coefficient. Prior studies have reported that prediction performance improves significantly when Inf-FS is integrated [23,24,25] and when ML models are optimized with BO [26,27]. However, to the best of our knowledge, no prior study has integrated Inf-FSs and BO with an IWRF to predict the presence of CVD and patients' survival.
Therefore, we propose an effective method to predict CVD and patients' survival: Inf-FSs ranks the features by importance and selects the best ones, IWRF predicts CVD, and BO finds the best weighting coefficient. Two public datasets were chosen to develop and test the model: the Statlog dataset [28] to detect the absence and presence of CVD, and the heart failure clinical records dataset [29] to predict patients' survival. We set out to develop ML algorithms that diagnose CVD and predict patient survival to assist healthcare professionals, so that early treatment can be implemented to avoid deaths caused by late CVD detection. The main contributions of this study can be summarized as follows:
• An Improved Weighted Random Forest (IWRF) is developed to deal with class imbalance.
• A decision-support model (Inf-FSs-BO-IWRF) is proposed to predict the presence of CVD and patients' survival.
• The proposed IWRF is evaluated against other ML models such as SVM, kNN, XGBoost, and SMOTE-RF, highlighting its superiority.
• The most important attributes in each dataset, which impact the ML system's performance, are identified.
• The effectiveness of the proposed Inf-FSs-BO-IWRF is evaluated on two public binary datasets.
The rest of the paper is structured as follows: Section 2 presents the related studies. Section 3 describes the proposed methodology. Section 4 presents the performance evaluation metrics, the results and discussion of the conducted experiments, and the state-of-the-art comparison. Finally, the conclusion and future work are presented in Section 5.

II. Related Studies
Weighted ensembles have been the subject of a wide range of research investigations. Pham et al. developed a weighted approach that generalizes bootstrap-aggregated ensemble learning to a weighted vote by evaluating various averaging methods [30]. Later, Pham et al. proposed a Cesàro-average-based approach to enhance RF for binary classification, motivated by the inherent instability of tree-based prediction averaging [31]. Next, Chen et al. introduced the weighted random forest (WRF), which incorporates cost-sensitive learning: it weights both the majority and minority instances in a training set, with a higher weight for minority instances. Chen et al. also pioneered the balanced random forest (BRF) approach, which involves sampling. BRF was developed to account for the likelihood that some bootstrap samples will include few or no minority cases; its core concept is to generate bootstrap samples through systematic under-sampling of the majority class [32].
To detect credit card fraud, Xuan et al. developed a Refined Weighted RF (RWRF). The enhancement is in two areas: they utilized all training data (both Out-of-Bag (OOB) and In-Bag (INB) data), on the grounds that the performance of the various base classifiers should be evaluated on the same dataset, and they utilized the gap between the probabilities of correctly predicting the true and false class labels to measure how far the expected number of votes for the correct label exceeds that for the incorrect label [33]. Kulkarni et al. surveyed efforts to improve the accuracy and training time of the RF classifier, based on disjoint partitioning of training datasets, the use of split measures or multiple-feature evaluation to build the RF base decision trees, weighted voting rather than majority voting, diversity in bootstrap datasets to create the most diverse classifiers, and dynamic programming to discover the best subset of the RF [34].
A probabilistic framework for combining classifiers was presented by Kuncheva et al. It covers four combination approaches (recall combiner, majority vote, Naive Bayes combiner, and weighted majority vote) and provides strict optimality conditions (lowest classification error) for each. The framework rests on the class-conditional independence of classifier outputs and on assumptions about their individual accuracies [35]. Gajowniczek et al. proposed a new weighting approach with tunable parameters that apply to each RF tree [36]. Babu et al. used a hybrid of NB and a sample-weighted RF (SWRF) for sub-acute ischemic stroke lesion segmentation, combined with a metaheuristic feature selection method: NB is trained and used to estimate the training-sample weights, and the weighted training samples are then used to train the SWRF [37].
Recently, Utkin et al. presented a weighted Random Survival Forest (RSF) to improve RF performance. The model's core idea is to replace the traditional averaging used to estimate the RSF hazard function with weighted averaging: each tree is given a weight, treated as a training parameter, computed by solving a conventional quadratic optimization problem that maximizes Harrell's C-index [38]. Bader-El-Den et al. introduced the Biased Random Forest (BRAF). Rather than oversampling minority instances in the data, BRAF oversamples the classification ensemble by increasing the number of classifiers that represent the minority class. The technique uses the kNN algorithm to determine the critical regions within a dataset, and the conventional RF is supplemented with additional random trees built around these critical regions [21].

III. Proposed Methodology
The proposed method is developed to obtain high-performance prediction of CVD presence and patients' survival. Figure 1 presents the flowchart of the proposed method.

A. Feature Selection Using Infinite Feature Selection
Feature selection is critical in ML since the performance of ML techniques is strongly reliant on the features selected; irrelevant features can obscure and entangle the data's explanatory components [40]. Among the many FS approaches in the literature, we adopt the recently developed supervised infinite feature selection (Inf-FSs) [41]. This is a graph-based feature-filtering approach that produces a ranking by considering all potential subsets of features, and it works in both supervised and unsupervised forms. It is constructed upon a fully connected weighted undirected graph G = (V, E), where the nodes V denote the features and the edges E reflect the pairwise relationships between them. G is represented by an adjacency matrix A, where each element a_ij (1 ≤ i, j ≤ n) represents the degree of confidence that features f_i and f_j are both potential candidates for selection:

a_ij = φ(f_i, f_j)

where the weight function φ is real-valued and specifies the value of each edge. In Inf-FSs, the weight function integrates the class labels using the Fisher criterion and mutual information. The weight φ(f_i, f_j) is therefore produced from three factors: the Fisher criterion (h_i), the normalized mutual information (m_i), and the normalized standard deviation (σ_i), calculated as

h_i = (μ_{i,1} − μ_{i,2})² / (σ²_{i,1} + σ²_{i,2})

m_i = Σ_{y∈Y} Σ_{z∈f_i} p(z, y) log( p(z, y) / (p(z) p(y)) )

σ_i = std_i / max_{j∈F}(std_j)

where σ_{i,g} and μ_{i,g} denote the standard deviation and mean of the ith feature over the instances of the gth class. The closer h_i is to 1, the less redundant the ith feature, since its class-conditional distributions overlap less. Y and p(z, y) denote the class labels and the joint probability distribution, respectively; in practice, m_i measures how much knowledge of a feature vector reduces the uncertainty about each class. The standard deviation std_i is normalized to the range [0, 1] by the maximum standard deviation over the feature set F.
Finally, the three factors are combined linearly:

s_i = α_1 h_i + α_2 m_i + α_3 σ_i,  with α_k ∈ [0, 1] and Σ_k α_k = 1

where the mixing coefficients α_k are set during the experiments. The score s_i indicates how relevant and non-redundant a feature is with respect to the classes. The weights of the adjacency matrix A are then constructed by coupling the corresponding scores s_i and s_j. Once the adjacency matrix is built, the ranking is computed while evaluating the redundancy of the features over all possible paths among the nodes; the Inf-FSs algorithm is described in detail in [41]. Finally, cross-validation (cv) is used to determine the mixing coefficients α_k on each training split of both datasets.
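To make the scoring concrete, the per-feature score s_i can be sketched in Python. This is a simplified reconstruction, not the authors' implementation: sklearn's `mutual_info_classif` stands in for the mutual-information term, the rescalings are illustrative, and the coupling of scores into the adjacency matrix A is assumed here to be an outer sum (see [41] for the exact form).

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def inf_fss_scores(X, y, alphas=(1/3, 1/3, 1/3)):
    """Sketch of the per-feature Inf-FSs score s_i = a1*h_i + a2*m_i + a3*std_i
    for a binary problem (normalizations here are illustrative)."""
    a1, a2, a3 = alphas
    c0, c1 = np.unique(y)              # binary-class sketch
    X0, X1 = X[y == c0], X[y == c1]

    # Fisher criterion h_i, rescaled to [0, 1]
    h = (X0.mean(0) - X1.mean(0)) ** 2 / (X0.var(0) + X1.var(0) + 1e-12)
    h = h / (h.max() + 1e-12)

    # Mutual information m_i between each feature and the labels, rescaled
    m = mutual_info_classif(X, y, random_state=0)
    m = m / (m.max() + 1e-12)

    # Standard deviation normalized by the maximum over the feature set
    s = X.std(0)
    s = s / (s.max() + 1e-12)

    return a1 * h + a2 * m + a3 * s

def adjacency(scores):
    """Couple the per-feature scores into A (pairwise coupling assumed
    here to be the outer sum s_i + s_j)."""
    return scores[:, None] + scores[None, :]
```

In this sketch a strongly class-separating feature dominates both the Fisher and mutual-information terms, so it receives the highest score.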

B. Improved Weighted Random Forest
Bagged (bootstrap-aggregated) DTs reduce overfitting and improve generalization by merging the outcomes of several DTs. In the bootstrap-aggregating learning concept, T base models (decision trees) are trained on subsets drawn with replacement from the dataset, and their outputs are combined by voting to produce the model's prediction. Bagging and voting reduce the model's variance without raising its bias, since the base models are given different training sets, creating a diverse ensemble [25]. For every tree t, with t ∈ [1, T], a bootstrap of M' samples is drawn randomly with replacement from the initial M training samples. During training, F' < F attributes are randomly chosen from the F available features at each tree node, and the optimal split is determined over those F' attributes. At test time, an unseen instance is run through all T trees in the forest, yielding T predictions that are pooled via voting to produce the final prediction.
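The bootstrap-and-vote procedure just described can be sketched with plain sklearn decision trees (a minimal illustration of standard bagging, not the paper's code; the values of T and max_features below are arbitrary):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, T=25, max_features="sqrt", seed=0):
    """Train T trees, each on a bootstrap of M samples drawn with
    replacement; F' features are tried at each split via max_features."""
    rng = np.random.default_rng(seed)
    M = len(X)
    forest = []
    for t in range(T):
        idx = rng.integers(0, M, size=M)   # bootstrap sample, with replacement
        tree = DecisionTreeClassifier(max_features=max_features, random_state=t)
        forest.append(tree.fit(X[idx], y[idx]))
    return forest

def predict_forest(forest, X):
    """Pool the T per-tree predictions by majority vote (binary 0/1 labels)."""
    votes = np.stack([tree.predict(X) for tree in forest])  # shape (T, n)
    return (votes.mean(axis=0) >= 0.5).astype(int)
```

Because each tree sees a different bootstrap and a random feature subset at each split, the vote averages out individual-tree errors, which is the variance-reduction effect described above.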
In imbalanced data classification, RF classifiers tend to be biased toward the majority class, since the standard RF treats both classes equally. However, several studies have shown that a weighted RF can deliver better prediction results. For this reason, this study presents an Improved Weighted Random Forest (IWRF), which assigns a weight to each class, with a higher weight for the minority class. The class weights can be computed from the inverse class frequencies in the training dataset:

CW_1 = M / (2 M_1),  CW_2 = M / (2 M_2)

where M is the total number of samples in the dataset and M_1 and M_2 are the numbers of samples in the majority and minority classes, respectively. We introduce a new coefficient, the weighting factor (α), into this computation, so the class weights become

CW_1 = α_1 M / (2 M_1),  CW_2 = α_2 M / (2 M_2)

where α_1 and α_2 are the weighting factors for the majority and minority classes, respectively; they vary in the range [0, 1], with default values of M_1/M_2 and one. To ensure that CW_2 is always greater than CW_1, so that misclassifying the minority class carries the heavier penalty, the weighting factors are subject to the constraint

α_2 M / (2 M_2) > α_1 M / (2 M_1)

The RF algorithm incorporates the class weights in two places: in the tree-induction step, they weight the Gini criterion used to detect splits, and at each tree's terminal nodes, the class prediction is established by a weighted majority vote. Moreover, under class imbalance there is a substantial chance that a bootstrap sample contains few or no minority-class instances, leading to trees that predict the minority class poorly. A new coefficient (p) is added to control the number of minority-class samples in each bootstrap: each bootstrap is drawn from both classes so that the minority class makes up at least one-third of the samples in the bootstrap. The value of p ranges in 1/3 ≤ p < 1/2, depending on the imbalance ratio between the majority and minority classes.
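As a sketch, the weighting scheme can be mapped onto sklearn's RandomForestClassifier, which already applies class weights both in the Gini criterion and in the leaf votes. The α-scaled inverse-frequency weights below are our illustrative reading of the scheme (the paper tunes the α values with BO), and the bootstrap coefficient p has no sklearn equivalent, so it would require a custom sampler not shown here.

```python
from sklearn.ensemble import RandomForestClassifier

def iwrf_class_weights(M1, M2, alpha1=0.5, alpha2=1.0):
    """Inverse-frequency class weights scaled by weighting factors
    (alpha values here are illustrative, not the tuned ones)."""
    M = M1 + M2
    cw1 = alpha1 * M / (2 * M1)   # majority class
    cw2 = alpha2 * M / (2 * M2)   # minority class
    assert cw2 > cw1, "minority class must carry the heavier penalty"
    return {0: cw1, 1: cw2}

# e.g. the heart failure class sizes (202 survived / 97 deceased)
clf = RandomForestClassifier(class_weight=iwrf_class_weights(202, 97),
                             random_state=0)
```

Passing a dict to `class_weight` lets the penalty ratio be controlled explicitly rather than relying on sklearn's built-in "balanced" mode.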

C. Bayesian Optimization
BO [42] is an iterative algorithm widely used for hyperparameter optimization (HPO) problems. It relies on two key components to select hyperparameter configurations: a surrogate model and an acquisition function [30]. The surrogate model fits all examined observations of the objective function. Given the surrogate model's predictive distribution, the acquisition function selects new points by balancing exploration and exploitation: exploration tests samples in regions that have not yet been tested, while exploitation tests the currently promising regions where the global optimum is most likely to lie according to the posterior distribution. Balancing the two lets BO home in on the currently most promising regions without losing better configurations in unexplored ones [43]. After each evaluation of the objective function, the surrogate model is updated. Because BO builds on previously tested points, it is a sequential process that is difficult to parallelize; still, within a few iterations BO can find near-optimal hyperparameter values [44].
A common surrogate model for BO is the tree-structured Parzen estimator (TPE) [45]. BO-TPE creates two generative models over the domain variables, l(x) and g(x) [46]. The observations are divided into good and poor results by a specified percentile y* of the objective values, and each set is modeled by a simple Parzen window [45]:

p(x | y) = l(x) if y < y*, and g(x) if y ≥ y*,  for x ∈ D

where D is the hyperparameter search space. The expected improvement of the acquisition function is then reflected by the ratio l(x)/g(x), which is used to select the next configurations to evaluate. The Parzen estimators are arranged in a tree structure so that the specified conditional dependencies are preserved; TPE therefore handles conditional hyperparameters naturally [44].
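The l(x)/g(x) mechanics can be illustrated with a small, self-contained TPE-style step, using scipy's `gaussian_kde` as the Parzen window. This is a toy sketch minimizing a one-dimensional function; production BO-TPE implementations (e.g., in Hyperopt) differ in many details such as candidate sampling and bandwidth handling.

```python
import numpy as np
from scipy.stats import gaussian_kde

def tpe_step(xs, ys, candidates, gamma=0.25):
    """One TPE-style iteration: split past observations at the gamma
    percentile y*, model the good/poor sets with Parzen (KDE) densities
    l(x) and g(x), and return the candidate maximizing l(x)/g(x)."""
    y_star = np.quantile(ys, gamma)
    l = gaussian_kde(xs[ys < y_star])     # good observations (minimizing)
    g = gaussian_kde(xs[ys >= y_star])    # poor observations
    ei = l(candidates) / (g(candidates) + 1e-12)   # expected-improvement proxy
    return candidates[np.argmax(ei)]

# Toy search: minimize f(x) = (x - 2)^2 over D = [0, 5]
def f(x):
    return (x - 2.0) ** 2

rng = np.random.default_rng(0)
xs = rng.uniform(0, 5, size=20)           # initial random observations
ys = f(xs)
for _ in range(30):
    candidates = rng.uniform(0, 5, size=50)
    x_next = tpe_step(xs, ys, candidates)
    xs = np.append(xs, x_next)
    ys = np.append(ys, f(x_next))
best = xs[np.argmin(ys)]
```

Each step pulls the next evaluation toward regions where the "good" density dominates the "poor" one, which is the exploitation/exploration trade-off described above.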

IV. Experimental Results
This section presents the feature selection results first, followed by the HD presence and survival classification performed on both datasets. The developed model was built and tested on the Statlog and heart failure clinical records datasets. The Statlog dataset consists of 14 attributes including the status label and 270 cases: 150 HD-absent and 120 HD-present. The heart failure clinical records dataset consists of 13 attributes including the survival label and 299 cases: 202 patients survived and 97 deceased. We used a 10-fold cv procedure in our experiments to avoid overfitting [47]. We evaluated the proposed model using six performance metrics derived from the confusion matrix, which comprises the True Negative (TN), True Positive (TP), False Negative (FN), and False Positive (FP) counts.
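For reference, the confusion-matrix metrics can be computed as below. The source text does not enumerate all six metrics explicitly, so this sketch includes accuracy, F-measure, and MCC (which the results report) together with precision, recall (sensitivity), and specificity as the usual companions:

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                  # sensitivity
    specificity = tn / (tn + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"accuracy": acc, "precision": precision, "recall": recall,
            "specificity": specificity, "f_measure": f_measure, "mcc": mcc}
```

Unlike accuracy, MCC stays informative under class imbalance, which is why it is reported alongside accuracy and F-measure throughout the results.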

A. Feature Selection Results
Inf-FSs-based feature selection is conducted at each stage of the 10-fold cv using the training data. The Inf-FSs method ranks and weights each feature in the pools of 13 and 12 features for the two datasets. Table II summarizes the features and their associated Inf-FSs weights for a given fold. The top ten attributes are chosen automatically from this ranking for each validation fold. For each dataset, nine features appear in the top ten of every one of the 10 training folds evaluated; as a result, these nine features are used for the presence and survival classifications. The selected features for both datasets are listed in Table III.
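The per-fold selection and cross-fold intersection described above can be sketched generically (our illustration, not the authors' code; `score_fn` is any feature-scoring function, e.g. an Inf-FSs-style ranker):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def stable_top_features(X, y, score_fn, k=10, n_splits=10, seed=0):
    """Rank features on each training split and keep only those that
    appear in the top-k of every fold."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    kept = None
    for train_idx, _ in skf.split(X, y):
        scores = score_fn(X[train_idx], y[train_idx])
        top_k = set(np.argsort(scores)[::-1][:k])   # indices of k best scores
        kept = top_k if kept is None else kept & top_k
    return sorted(kept)
```

Intersecting the per-fold rankings keeps only features whose importance is stable across training splits, which is how the final nine features were obtained.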

B. Classification Results
The developed IWRF model was applied to both datasets and showed a significant improvement in prediction accuracy over existing models. For comparison, we chose six distinct ML models (G-NB, LR, SVM, kNN, XGBoost, and RF) that are frequently used in this research field and have a proven record of accuracy and efficiency. The results, with and without FS, are presented in Table IV and Table V for the Statlog and HD clinical records datasets, respectively. IWRF performed better than the other ML models across both datasets, achieving accuracy, F-measure, and MCC of up to 95.5%, 94%, and 0.9 on Statlog and 93.3%, 86%, and 0.81 on the HD clinical dataset, respectively. All models improved when FS was used on both datasets, especially IWRF, which reached accuracy, F-measure, and MCC of up to 97.7%, 97%, and 0.95 on Statlog and 95.9%, 91.3%, and 0.88 on the HD clinical dataset. Furthermore, the results show that IWRF handled the imbalanced data better than the standard RF model, improving the performance for detecting CVD and patients' survival by 3.7% and 5%, respectively, after FS. Recognizing the minority class sufficiently during classification is difficult because the standard RF and the other models are biased toward the majority class. With the benefit of feature selection, doctors can forecast patients' survival and the presence of HD by assessing only the essential attributes.
As a further comparison, IWRF was evaluated against SMOTE, which is commonly used for handling imbalanced datasets. Like any sampling technique, SMOTE is not a stand-alone classifier but can be combined with any classifier; for a fair comparison, SMOTE was combined with RF and then compared with IWRF. Table VI and Table VII present the results of IWRF against the base RF with SMOTE for both datasets. Moreover, we employed BO to tune the SMOTE hyperparameters (sampling ratio and k-neighbors) and (α, p) for IWRF, while the other hyperparameters, such as n_estimators, max_depth, max_features, and min_samples_split, were left at their default values as in the Sklearn library. The findings show that IWRF achieved higher results than the base RF with SMOTE, since SMOTE has several drawbacks related to overlap and noisy information: it assigns a global k-neighbors value and ignores the local distribution of the data [48,49]. Hyperparameter tuning improved prediction accuracy for both, but it had more impact on SMOTE; increasing the k-neighbors value to compensate for the imbalance ratio can be effective for SMOTE. The results in Figure 2 show that the improvement achieved by the proposed IWRF over the base RF classifier is higher than that of SMOTE-RF.
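To make the criticized behaviour concrete, the core SMOTE interpolation can be sketched from scratch (a bare-bones illustration using a single global k, not the implementation used in the experiments):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Synthesize n_new minority samples: pick a random minority seed,
    pick one of its k nearest minority neighbours (a single global k),
    and interpolate a point between the two."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)          # column 0 is the point itself
    out = []
    for s in rng.integers(0, len(X_min), size=n_new):
        neighbour = X_min[idx[s, rng.integers(1, k + 1)]]
        lam = rng.random()                 # interpolation factor in [0, 1)
        out.append(X_min[s] + lam * (neighbour - X_min[s]))
    return np.vstack(out)
```

Because every synthetic point lies on a segment between two existing minority samples chosen with the same global k, the method can generate points in overlap regions and amplify noisy samples, which is exactly the limitation IWRF sidesteps by reweighting instead of resampling.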
The proposed model improved the performance of CVD detection by 3.62% and 4.82% on the Statlog dataset, and by 6.3% and 11.98% on the HD clinical records, in terms of accuracy and F-measure, respectively.
Finally, we compared our findings with those of previous studies. Because we utilized the same datasets as the previous research, we could take the results from prior works without re-implementing their methods. The comparison between the proposed model and the earlier studies is presented in Table VIII and Table IX for the Statlog and HD clinical records datasets. According to this assessment, the proposed model outperforms the current research on CVD detection and survival prediction, with accuracy improvements of 2.4% and 4.6% on the two datasets, respectively.

V. Conclusion
This article aims to present an accurate and efficient machine learning ensemble model for predicting the presence of CVD and cardiac patient survival. The proposed model integrates Inf-FSs, IWRF, and BO. Those three methods are utilized to select the most significant features, handle the imbalanced data classification issue found in medical datasets, and tune the weighting factor. The developed model is evaluated using two public datasets and benchmarked against previous studies.
The experimental results show that the proposed model achieved higher results without changing the data distribution, and the proposed IWRF improves the performance of detecting CVD by 3.62% and 6.3% compared with the standard RF. This research can significantly enhance the healthcare system and serve as a valuable tool for healthcare professionals in diagnosing HD and forecasting heart failure survival. In future work, we aim to develop a general ML-ensemble framework, including outlier detection and removal, and to optimize the critical hyperparameters of ML ensemble models to improve the detection and severity-level classification of various diseases using clinical data.