HDPF: Heart Disease Prediction Framework Based on Hybrid Classifiers and Genetic Algorithm

Supervised machine learning algorithms are powerful classification techniques commonly used to build prediction models that help diagnose disease early. However, challenges such as overfitting and underfitting must be overcome while building the model. This paper introduces hybrid classifiers using an ensemble model with a majority voting technique to improve prediction accuracy. Furthermore, a preprocessing technique and feature selection based on a genetic algorithm are proposed to enhance prediction performance and reduce overall time consumption. In addition, 10-fold cross-validation is used to overcome the overfitting problem. Experiments were performed on a dataset of cardiovascular patients from the UCI Machine Learning Repository. Through a comparative analytical approach, the study results indicated that the proposed ensemble classifier model achieved a classification accuracy of 98.18%, higher than that of the other relevant models considered in the study.


I. INTRODUCTION
There are several cardiovascular diseases, such as heart failure, angina, cardiomyopathy, and arrhythmia. Heart disease is a universal disease that affects many people, especially during middle or old age [1]. Heart diseases are more common among men than among women. According to WHO statistics [2], it is estimated that 30% of deaths in developing countries are caused by heart disease [3], [4]. One-third of deaths worldwide are due to heart disease [5]. Half of the deaths in the United States and other developed countries are due to heart disease. Every year, approximately 17 million people die from cardiovascular disease (CVD) worldwide [2].
Currently, we have a wealth of big data provided by patients' electronic health records. Technology has also provided us with many methods, techniques, and models that enable data scientists and researchers to contribute to medical development. Through analytics, these data can help determine the causes of disease and support the medical team's contribution to prevention by spreading community awareness.
The associate editor coordinating the review of this manuscript and approving it for publication was Christian Pilato.
By adopting preventative behavior, a person can better avoid disease. ''Prevention is better than cure''. Therefore, there are many challenges associated with this field, which can be summarized as follows: and Random Forest (RF). We adopted a set of metrics to evaluate the hybrid model. These metrics include accuracy (AC), sensitivity (SN), specificity (SP), F1-score (F1-S), and the Area Under Receiver Operating Characteristic Curve (AUC) [9], [10].
The contributions of this paper toward solving some of the problems in this field can be summarized as follows:
• We propose HDPF, which uses DBSCAN-based and resampling techniques (such as under-sampling, over-sampling, and hybrid sampling) to solve the imbalance problem and eliminate outliers.
• Analyzing and indexing a large amount of heart disease patient features, including elements that contain different categories in type and quantity, such as numbers, texts, etc. We use multiple datasets (two different large datasets).
• Preprocessing data in terms of deleting redundant data and treating missing data so that the classifier can work well.
• Extracting the most important features using a simple genetic algorithm, so that the number of features relied upon for classification is reduced without compromising accuracy and the time consumed is reduced.
• Using density-based spatial clustering of applications with noise (DBSCAN), which groups closely packed features together (features with many nearby neighbors) and marks as outliers the features that lie alone in low-density regions.
• Building a hybrid classifier from supervised machine learning algorithms (such as LR, SVM, KNN, DT, RF) to classify existing data and predict new data for similar cases.
• Performance analysis and comparison with state-of-the-art models.
The remainder of this manuscript is divided into five sections. Section II reviews previous work related to early diagnosis and health care for diseases in general, with a focus on heart diseases. Section III describes the data used and the challenges that must be overcome while dealing with this type of data. Section IV explains the different stages of the proposed framework and the novel algorithms. Section V describes the experimental results of the different test cases and discusses them. Finally, Section VI provides conclusions and references.

II. RELATED WORKS
In this section, some related studies regarding recent modalities in healthcare and disease diagnostics will be reviewed. Researchers, academic scholars, and data scientists have undertaken various research initiatives in predicting and screening medical data for heart diseases. Multiple ML and data mining algorithms have been used in recent studies to carry out these predictions.
These relevant works are reviewed below and compared with our proposed system. Most analysis and decision support systems are implemented using one of two approaches. The first approach combines many features, such as age, sex, chest pain, blood pressure, cholesterol, blood sugar, electrocardiographic results, heart rate, the number of major vessels colored by fluoroscopy, and thalassemia. The second approach reduces and restricts the input to patterns that can be easily measured [11], [12].
Desai et al. [13] built a classification model using a Backpropagation Neural Network (BPNN) and LR on the Cleveland heart disease dataset. Accuracies of 85.74% and 92.58% were recorded for BPNN and LR, respectively.
Padmanabhan et al. [12] proposed an approach that uses Auto-Machine Learning (AutoML) in addition to a human expert system. The authors evaluated performance on two cardiovascular disease datasets and compared the results of an AutoML library with those of the human expert system. The accuracy and area under the curve for AutoML are significantly higher than those of the human expert system. Additionally, the time consumed by AutoML to produce these results is significantly less than the time consumed by the human expert system.
Islam et al. [14] applied several data analysis techniques, such as Naive Bayes (NB), LR, and DT. In this case, LR provided the highest accuracy, 86.25%.
Abhishek et al. [15] built a heart disease prediction framework using the R programming language. The training and testing sets were produced by dividing the datasets into 70% and 30%, respectively. The test results showed that the NB classifier achieved the higher accuracy, 89%.
Rabbi et al. [16] conducted a comparative study of classification models currently used in data mining, such as SVM, KNN, and Artificial Neural Networks (ANN). Their test results showed that SVM, with 85% accuracy, outperformed both KNN and ANN.
Dwivedi [17] built a classification model using LR for predicting heart disease on the Cleveland dataset and achieved 85% accuracy.
Abdeldjouad et al. [15] used a Genetic Fuzzy System-Logit Boost (GFS-LB) and Fuzzy Hybrid Genetic-Based Machine Learning (FH-GBML). The performance evaluation of these algorithms was implemented using the WEKA [18] and KEEL tools. The highest accuracy, 80%, was obtained by majority voting.
Haq et al. [19] built a classification model using LR with some preprocessing techniques for predicting heart disease on the Cleveland dataset. They achieved 89% accuracy.
Ali et al. [20] suggested an expert system based on stacked SVMs to aid heart failure analysis. The first SVM model was applied to exclude irrelevant features, while the second was applied as a predictive model. They achieved 92% accuracy.
Gupta et al. [21] built a classification model using RF with some preprocessing techniques for predicting heart disease on the Cleveland dataset. They achieved 96.9% accuracy.
Fitriyani et al. [10] built a classification model using XGBOOST + DBSCAN with some preprocessing techniques for predicting heart disease on the Statlog and Cleveland datasets. They achieved an accuracy of 95% on the Statlog dataset and 98% using only the Cleveland dataset. Table 1 presents the comparison between recent related works.
Amin et al. [23] proposed a hybrid technique with Naïve Bayes and Logistic Regression to predict cardiovascular disease. This research aims to identify significant features to improve the accuracy. They achieved 87.4% accuracy.

III. GLOBAL CHALLENGES AND DESCRIPTIVE DATA ANALYSIS
A. GLOBAL CHALLENGES
Due to the nature of the dataset used, some challenges must be faced and overcome in the proposed model. These challenges are summarized as follows [10], [11]:
• The dataset contains widely varying and diverse features with high dimensionality. We use DBSCAN-based and resampling techniques to solve the imbalance problem and eliminate the outliers.
• Any diagnostic system requires high speed and accuracy to perform its tasks. We use a novel algorithm that extracts the semantic features using a simple genetic algorithm, reducing dimensionality without compromising accuracy so that the time consumed is reduced.
• There are redundant and missing data within the dataset being used. We use a novel preprocessing algorithm to solve this problem.
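The DBSCAN-based outlier elimination mentioned in the first challenge can be illustrated with a minimal sketch. This is not the implementation used in HDPF; the function name `dbscan`, the toy two-dimensional points, and the values of `eps` and `min_pts` are assumptions chosen only to show how points in low-density regions end up labelled as noise.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE            # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point: reachable but not core
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:     # j is a core point, so keep expanding
                queue.extend(nbrs)
    return labels

# Hypothetical, already-scaled records: two dense groups and one isolated record.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
cleaned = [p for p, label in zip(points, labels) if label != -1]  # drop outliers
```

In this toy run the isolated record receives the label -1 and is removed, while both dense groups are kept for the later stages.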

B. DATASET DESCRIPTION
We used the UCI database of cardiology. It contains four datasets that have been previously used by ML researchers. The ''target'' attribute indicates the presence or absence of heart disease in the patient [12]. The dataset contains 76 features, including smoking, body mass, physical activity, a healthy diet, cholesterol levels, blood pressure, and fasting blood glucose. These attributes are the same seven ideal measures that the American Heart Association has set to promote cardiovascular health and disease reduction [6]. The four databases contain redundant and sometimes missing data [30], [31]. We reduce the number of investigated attributes to 14 by using algorithms that select only the best 14 of the 76 attributes, which minimizes feature dimensionality. Data were collected from four different datasets: the Cleveland Foundation, Statlog, the Budapest Institute of Cardiology, the California Medical Center, and the Zurich Hospital. Table 2 shows the description of the 14 features used in our model.

C. DESCRIPTIVE DATA ANALYSIS
In this section, we present a statistical and exploratory analysis of the data used, including the distribution of the target feature across the remaining features. Figure 1 shows the distribution of the 'target' feature against the 'age' feature; the age group '55-60' occupies the distribution peak. The distribution of cardiovascular patients by sex is shown in Figure 2. We found that the largest number of people suffering from heart disease were between 41 and 64 years old, while patients in the 20-30 age group are less likely to suffer from heart disease [13]-[15].
Using the 'describe()' function from the Pandas library, we obtain descriptive statistics that exclude NaN values: the count, mean, standard deviation, minimum and maximum values, and quantiles. As shown in Table 3, most features are categorical. The mean gives the average value of each feature. Using the Python Matplotlib library, we can explore the correlation between the attributes of the dataset by visualizing it as the heat map in Figure 3. The correlation between 'target' and the rest of the variables is clearly weak.
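To make the descriptive statistics concrete, the following standard-library sketch computes the same kind of summary that 'describe()' reports for one numeric column, together with the Pearson correlation that underlies a correlation heat map. The helper names `describe` and `pearson` and the sample values are hypothetical and are not taken from the dataset; the quartile interpolation method is also an assumption.

```python
import math
import statistics

def describe(values):
    """Summary statistics for one numeric column, skipping missing values (None/NaN)."""
    clean = [v for v in values if v is not None and not math.isnan(v)]
    q1, median, q3 = statistics.quantiles(clean, n=4, method="inclusive")
    return {
        "count": len(clean),
        "mean": statistics.mean(clean),
        "std": statistics.stdev(clean),   # sample standard deviation
        "min": min(clean),
        "25%": q1, "50%": median, "75%": q3,
        "max": max(clean),
    }

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric columns."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical 'age' column with one missing entry, and a hypothetical 'target' column.
ages = [29, 41, 55, 58, 60, None, 64]
summary = describe(ages)
target = [0, 0, 1, 1, 0, 1]
age_target_corr = pearson([a for a in ages if a is not None], target)
```

A coefficient close to zero for a feature paired with 'target' corresponds to the weak correlation visible in the heat map.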

IV. HEART DISEASE PREDICTION FRAMEWORK (HDPF)
This section proposes an intelligent framework to diagnose heart disease using machine learning and a Simple Genetic Algorithm (SGA). The proposed framework aims to diagnose heart disease early and help doctors make appropriate decisions to reduce mortality. One of the main challenges is the wide variation and diversity of features in the data. Therefore, we process the data to extract features and derive new features that make machine learning more accurate and faster.
The proposed framework contains three phases. The first phase, called preprocessing, cleans the data by deleting duplicates, imputing missing data, and normalizing. The second phase includes primary processing, such as extracting features and deriving additional features from the data based on SGA. SGA uses crossover and mutation to generate synthetic chromosomes from the original population or set of factors. Chromosomes that produce high fitness values remain, while the others drop out. The mutation step is conducted at the end, in which the global search is maximized and the best value is found. The resulting chromosome describes the selected features. Finally, the third phase applies a hybrid method of machine learning algorithms to classify the data. Figure 5 shows the proposed framework.
Algorithm 1 has several steps. First, the performance vector is initialized using well-known performance metrics such as accuracy, AUC, precision, recall, and the F1-score. Next, as shown in Figure 6, the flowchart of HDPF has several steps, including Data Imputation and Partitioning, DBSCAN, SGA, Feature Extraction, Machine Learning (ML) Approach, and Performance Metric Evaluation. Finally, to make the dataset complete and reasonable for processing, data imputation fills the missing values of the features with new labels.
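The crossover, mutation, and survival steps of the second phase can be sketched as follows. This is an illustrative genetic algorithm over bit-mask chromosomes (1 = feature kept), not the paper's Algorithm 3. The crossover and mutation probabilities follow the values reported in Section V, while `toy_fitness`, `INFORMATIVE`, the population size, and the number of generations are hypothetical stand-ins; in practice the fitness would be a classifier's validation score.

```python
import random

def crossover(parent_a, parent_b, rng):
    """Single-point crossover: swap the tails of two bit-mask chromosomes."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:], parent_b[:point] + parent_a[point:]

def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in chromosome]

def evolve(fitness, n_features, pop_size=20, generations=30,
           p_crossover=0.5, p_mutation=0.001, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        offspring = []
        while len(offspring) < pop_size:
            a, b = rng.sample(population, 2)
            if rng.random() < p_crossover:
                a, b = crossover(a, b, rng)
            offspring += [mutate(a, p_mutation, rng), mutate(b, p_mutation, rng)]
        # Survivor selection: high-fitness chromosomes remain, the rest drop out.
        population = sorted(population + offspring, key=fitness, reverse=True)[:pop_size]
    return population[0]

INFORMATIVE = {0, 2, 5}  # hypothetical indices of the truly useful features

def toy_fitness(chromosome):
    # Reward useful features and lightly penalize every selected feature.
    return sum(chromosome[i] for i in INFORMATIVE) - 0.1 * sum(chromosome)

best = evolve(toy_fitness, n_features=8)
selected = [i for i, bit in enumerate(best) if bit]  # indices of the picked features
```

Because parents compete with their offspring for survival, the best fitness in the population never decreases from one generation to the next in this sketch.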
Second, we use algorithm 2 to perform data preprocessing such as data cleaning, data imputation, and feature normalization. Feature normalization uses the well-known equations for the mean, the standard deviation, and standardization, where µ = mean, σ = standard deviation, D = dataset, n = total number of values, and x = single feature value. The data features are normalized to zero mean and unit variance.
SGA is a computational model inspired by Charles Darwin's theory of natural selection [26]. Natural selection preserves only the fittest individuals over several generations. In machine learning, SGA is used to select the best subset of variables to produce a favorable result [29].
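As an illustration of the normalization step, the sketch below standardizes one feature column with the usual z-score, z = (x − µ) / σ. It assumes the population form of the standard deviation (dividing by n); the function name `standardize` and the sample values are hypothetical.

```python
import math

def standardize(column):
    """Rescale a numeric feature to zero mean and unit variance (z-score)."""
    n = len(column)
    mu = sum(column) / n
    # Population standard deviation (dividing by n) is an assumption here.
    sigma = math.sqrt(sum((x - mu) ** 2 for x in column) / n)
    if sigma == 0:
        return [0.0] * n  # a constant feature carries no information
    return [(x - mu) / sigma for x in column]

# Hypothetical resting blood pressure values.
resting_bp = [120, 130, 140, 150, 160]
z = standardize(resting_bp)
```

After this transformation every feature is expressed in units of its own standard deviation, so features measured on large scales do not dominate distance-based steps such as DBSCAN or KNN.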
Selecting the best subset of variables is a combinatorial optimization problem. The advantage of this method over others is that it allows the best solution to emerge from the best of the prior solutions. It is an evolutionary algorithm that improves the selection over time. The idea of SGA is to combine multiple solutions generation after generation to extract the best genes (variables) from each one.
GA has several other uses, such as hyperparameter tuning, finding the maximum (or minimum) of a function, or searching for a suitable neural network architecture (neuroevolution), among others [29]. To calculate the fitness value (FV), we include an optional weight W for the selection probability. By default, W = 1, which means that the candidate solutions' fitness fully determines the selection probability for a crossover. If W is set to a value smaller than 1, the importance of the individual fitness decreases. If W = 0, the selection probability is independent of the fitness, so the chance of being chosen for a crossover is equal for every candidate solution. The fitness value was calculated according to equation (5).
VOLUME 9, 2021
Equation 4 is used to estimate the fitness probability for a single gene type, where Fpro_i is the fitness probability and FV_i is the fitness value of the i-th gene. In Equation 5, the search space is denoted by x_i(t), where t represents time and i denotes the feature level. The cumulative sum of the fitness probabilities should be equal to 1. Pick the maximum fitness value j and check whether it satisfies the condition cs_j < cs_k, where cs_j is the cumulative sum and cs_k is the newly formed subsequence set, as shown in Figure 7. Convergence is the state where we arrive at an optimal solution with leading fitness values. Fourth, the feature selection step is applied in Algorithm 3.
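The role of the weight W in the selection probability can be illustrated with a small sketch. Since the exact form of the fitness equations is not reproduced in this text, the linear blend between a fitness-proportional probability and a uniform probability used below is an assumption that merely matches the described behavior at W = 1 and W = 0. The function names are hypothetical.

```python
import bisect
import itertools
import random

def selection_probabilities(fitness_values, w=1.0):
    """Blend fitness-proportional and uniform selection using weight w in [0, 1]."""
    n = len(fitness_values)
    total = sum(fitness_values)
    # Assumed form: w = 1 gives pure fitness-proportional, w = 0 gives uniform.
    return [w * fv / total + (1.0 - w) / n for fv in fitness_values]

def roulette_pick(probabilities, rng):
    """Pick an index by comparing a random draw against the cumulative sums."""
    cumulative = list(itertools.accumulate(probabilities))
    draw = rng.random() * cumulative[-1]
    index = bisect.bisect_right(cumulative, draw)
    return min(index, len(probabilities) - 1)  # guard against rounding at the top end

# Hypothetical fitness values of three candidate solutions.
fitness_values = [1.0, 3.0, 6.0]
proportional = selection_probabilities(fitness_values, w=1.0)
uniform = selection_probabilities(fitness_values, w=0.0)
chosen = roulette_pick(proportional, random.Random(0))
```

For any W between 0 and 1 the probabilities in this blend still sum to 1, which is the property the cumulative-sum check relies on.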
Third, SGA is applied to the data to obtain the original and newly derived features. This step is considered feature extraction. Figure 6 shows the flowchart of optimal feature search. Figure 8 shows the flowchart of the SGA general structure. Next, principal component analysis (PCA) is applied to obtain qualitative label features (QLF) and quantitative numeric features (QNF). The optimal features are selected by algorithm 3 by maximizing the sum of the squared correlation coefficients between QLF and FX, and between QNF and FX. Finally, the classification step is performed. We used a holdout split to divide the data into training and validation datasets, with factor = 0.2. We use 10-fold cross-validation to overcome overfitting problems. Several ML algorithms are applied, and the three algorithms with the highest accuracy are chosen for the ensemble process.
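The holdout split and the 10-fold cross-validation can be sketched with index bookkeeping alone. The code below is a generic illustration rather than the authors' implementation: the function names, the fixed seed, and the sample count of 100 are assumptions, and "factor = 0.2" is interpreted as the validation fraction.

```python
import random

def holdout_split(n_samples, validation_fraction=0.2, seed=0):
    """Shuffle sample indices and reserve a fraction of them for validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_samples * validation_fraction))
    return indices[n_val:], indices[:n_val]  # (training, validation)

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists so that every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

train_idx, val_idx = holdout_split(100)
splits = list(k_fold_indices(100, k=10))
```

Averaging a model's score over the ten test folds gives a less optimistic estimate than a single split, which is why cross-validation helps to detect overfitting.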

V. EXPERIMENTAL RESULTS ANALYSIS AND DISCUSSION
In this section, the results of our various experiments are explained and compared with relevant previous research. A heart disease dataset extracted from the UCI Machine Learning Repository was used and is described in Section III. All tests were conducted on an Intel Core i7 2.90 GHz CPU with 8 GB RAM. We used Python as the programming language to implement the different tasks.

A. RESULTS AND PERFORMANCE MEASURES
We used 10-fold cross-validation to avoid overfitting problems. We also analyzed and recorded the model's performance during the learning phase. Finally, the dataset was divided into training and test sets using a 70:30 ratio, i.e., 70% of the data for training and 30% for testing the model, which is a standard ratio for partitioning datasets. The advantage of this partitioning is that it provides sufficient data to train and test the proposed framework. Moreover, it avoids the under-fitting that can occur when the training partition is smaller than the testing partition, and the overfitting that can occur when the training partition is much larger than the testing partition.
We used classification metrics such as sensitivity (SN), specificity (SP), accuracy (AC), and F1-score (F1-S) to measure the model's efficiency. The equations for those metrics can be listed as follows:
$$SN = \frac{TP}{TP + FN}$$
$$SP = \frac{TN}{TN + FP}$$
$$AC = \frac{TP + TN}{TP + TN + FP + FN}$$
$$F1\text{-}S = \frac{2\,TP}{2\,TP + FP + FN}$$
where TP = true positive, FN = false negative, FP = false positive, and TN = true negative.
Our study used HDPF and SGA to determine the optimal features for our recommended framework. The initial SGA parameters are as follows: the initial population of 100 is generated randomly, and the number of generations is 100, with crossover and mutation probabilities of 0.5 and 0.001, respectively. The experimental outcomes showed that the proposed framework achieved an accuracy of 98.18%. The accuracy obtained by the suggested framework using SGA is 3.18% higher than the accuracy achieved by other related models. The recommended HDPF model also achieved 98% precision and a 0.98 F1-score.
In addition, we conducted experiments using several supervised machine learning algorithms and different numbers of extracted features. As shown in Table 4, we found that DT and RF achieved higher precision and accuracy than the other algorithms. They were also the least time-consuming to run.
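The metrics above follow directly from the four confusion-matrix counts. The short sketch below computes them using the standard definitions; the function name `classification_metrics` and the example counts are hypothetical and are not results from the experiments.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute SN, SP, AC, precision, and F1-S from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # recall on the positive (disease) class
    specificity = tn / (tn + fp)            # recall on the negative (healthy) class
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "precision": precision,
        "f1": f1,
    }

# Hypothetical counts for 100 test patients.
metrics = classification_metrics(tp=40, fn=10, fp=5, tn=45)
```

With these counts the sensitivity is 0.8, the specificity is 0.9, and the accuracy is 0.85, which shows how a model can look better on one class than on the other.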
Confusion matrices are drawn in Figure 9. From Figure 9(e), we find that the confusion matrix of the RF algorithm has the largest total of true positives and true negatives. It also has the lowest sum of false negatives and false positives compared with the rest of the participating algorithms.
LR and SVM came second in total TP and TN, as shown in Figures 9(a) and 9(c). However, the SVM algorithm was the best of all in type II error, achieving zero FN.
The DT and KNN algorithms achieved the lowest totals of TP and TN and, at the same time, had the highest type I and type II errors, as shown in Figures 9(b) and 9(d). Figure 10 represents the ROC curve for the members participating in the hybrid classification technique. The biggest    participating in this experiment. We notice that applying the same algorithm to different feature sets (11, 17, 23, and 28) affects the evaluation metrics.
According to Table 4, several features are extracted, and then supervised machine learning algorithms are applied. We found that DT and RF achieved higher accuracy than the other algorithms. Also, the average time consumption for DT and RF is lower than that of the other algorithms. Therefore, the proposed ensemble algorithm is a hybrid model between DT and RF.
The performance of the different ML methods changes according to the number of features used. For example, Table 3 and Figures 10, 11, 12, 13, and 14 show that RF and DT achieved higher accuracy than the other algorithms. Therefore, we applied DT to the training and validation datasets. Then, the majority voting ensemble technique was used to combine its result with that of RF to achieve high accuracy (98.18%).
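Majority voting itself is simple to express. The sketch below combines per-sample predictions from several classifiers by taking the most frequent label; it is a generic illustration with made-up predictions, not the ensemble code used in the experiments. The tie-breaking rule, in which the earliest-listed model wins a tie, is an assumption that matters when only two models vote.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label lists by taking the most common label per sample."""
    combined = []
    for votes in zip(*predictions_per_model):
        # On a tie, the label voted first (earliest-listed model) is kept.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical predictions of three classifiers for four patients (1 = disease).
dt_pred = [1, 0, 1, 1]
rf_pred = [1, 1, 0, 1]
lr_pred = [0, 0, 1, 1]
ensemble_pred = majority_vote([dt_pred, rf_pred, lr_pred])
```

An odd number of voters avoids ties for binary labels, which is one practical reason to ensemble three models rather than two.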
Several experiments were conducted using multiple machine learning algorithms with SGA. First, several features with different levels were extracted (feature sets 11, 17, 23, and 28). We found that the choice of feature set affects the evaluation metrics. For example, Table 4 and Figures 11 to 15 show that feature set 28 achieved higher or equal accuracy compared with the previous feature set. In most cases, the more features extracted, the higher the precision. However, feature sets 23 and 28 consumed more time.

C. COMPARISON BETWEEN PROPOSED FRAMEWORK AND PREVIOUS RELATED WORKS
The comparative analysis presented in Table 5 reveals a significant difference between the performance of the proposed HDPF and that of other models. As visualized in Figure 16, the proposed framework achieved the highest accuracy among the related works, 98.18%, while Fitriyani et al. [10], Gupta et al. [21], and Ali et al. [20] achieved accuracies of 95%, 93.4%, and 92%, respectively.

D. HEART DISEASE DECISION SUPPORT SYSTEM TO TEST THE PREDICTION SYSTEM
We designed and developed the proposed HDPF into a Decision Support System (DSS) to diagnose heart disease status effectively and efficiently. The DSS was developed using the PHP version 7.2 scripting language and a MySQL version 8.0 database. Figure 17 shows the general structure of the DSS model. In the DSS, the patient uses a web application through the local web server (WAMP server) to enter diagnosis data such as Age, Sex, CP, thal, etc. Then, the proposed model processes the input data using the proposed algorithms and the hybrid ensemble machine learning model to predict heart disease status. Figure 18 shows the result of the DSS.

VI. CONCLUSION
This paper introduces hybrid classifiers using an ensemble model with a majority voting technique to improve prediction accuracy. Furthermore, a preprocessing technique and feature selection based on a genetic algorithm are proposed to enhance prediction performance and reduce overall time consumption. Experiments were performed on a dataset of cardiovascular patients from the UCI Machine Learning Repository. Through a comparative analytical approach, the study results indicated that the proposed ensemble classifier model achieved a classification accuracy of 98.18%. In comparison, the average performance of the individual machine learning algorithms was 88%, 85%, 80%, 92%, and 93% for LR, SVM, KNN, DT, and RF, respectively. In future research, health status could be predicted in real time from health-related streaming data, such as Twitter heart disease streaming data. In that work, the proposed system would be developed using the Twitter Streaming API, Apache Kafka, Apache Spark, and various machine learning models. Also, a semantic ontology algorithm such as the one in [32] could be used to extract semantic features, to enhance accuracy and reduce overall processing time. Since this paper completes the first stage of the system, which identifies the best machine learning model, a real-time online prediction pipeline will be added as a second stage of the development work.