Effective Heart Disease Prediction Using Hybrid Machine Learning Techniques

Heart disease is one of the most significant causes of mortality in the world today. Prediction of cardiovascular disease is a critical challenge in the area of clinical data analysis. Machine learning (ML) has been shown to be effective in assisting in making decisions and predictions from the large quantity of data produced by the healthcare industry. We have also seen ML techniques being used in recent developments in different areas of the Internet of Things (IoT). Various studies give only a glimpse into predicting heart disease with ML techniques. In this paper, we propose a novel method that aims at finding significant features by applying machine learning techniques resulting in improving the accuracy in the prediction of cardiovascular disease. The prediction model is introduced with different combinations of features and several known classification techniques. We produce an enhanced performance level with an accuracy level of 88.7% through the prediction model for heart disease with the hybrid random forest with a linear model (HRFLM).


I. INTRODUCTION
It is difficult to identify heart disease because of several contributory risk factors such as diabetes, high blood pressure, high cholesterol, abnormal pulse rate and many other factors. Various techniques in data mining and neural networks have been employed to determine the severity of heart disease among humans. The severity of the disease is classified using various methods such as the K-Nearest Neighbor algorithm (KNN), Decision Trees (DT), Genetic Algorithm (GA), and Naive Bayes (NB) [11], [13]. The nature of heart disease is complex, and hence the disease must be handled carefully. Not doing so may affect the heart or cause premature death. The perspectives of medical science and data mining are used for discovering various sorts of metabolic syndromes. Data mining with classification plays a significant role in the prediction of heart disease and data investigation.
We have also seen decision trees used to predict events related to heart disease accurately [1].
The associate editor coordinating the review of this manuscript and approving it for publication was Jun Wu.
Various methods have been used for knowledge abstraction by applying known data mining methods to the prediction of heart disease. In this work, numerous readings have been carried out to produce a prediction model using not only distinct techniques but also by relating two or more techniques. These amalgamated techniques are commonly known as hybrid methods [14]. We introduce neural networks using heart rate time series. This method uses various clinical records for prediction, such as Left Bundle Branch Block (LBBB), Right Bundle Branch Block (RBBB), Atrial Fibrillation (AFIB), Normal Sinus Rhythm (NSR), Sinus Bradycardia (SBR), Atrial Flutter (AFL), Premature Ventricular Contraction (PVC), and Second Degree Block (BII), to find out the exact condition of the patient in relation to heart disease. The dataset with a radial basis function network (RBFN) is used for classification, where 70% of the data is used for training and the remaining 30% is used for classification [4], [15].
We also introduce a Computer Aided Decision Support System (CADSS) in the field of medicine and research. In previous work, the use of data mining techniques in the healthcare industry has been shown to take less time for the prediction of disease with more accurate results [16]. We propose the diagnosis of heart disease using the GA. This method uses effective association rules inferred with the GA through tournament selection, crossover and mutation, which results in the new proposed fitness function. For experimental validation, we use the well-known Cleveland dataset, which is collected from the UCI machine learning repository. We will see later on how our results prove to be prominent when compared to some of the known supervised learning techniques [5], [17]. Particle Swarm Optimization (PSO), a powerful evolutionary algorithm, is introduced and some rules are generated for heart disease. The rules have been applied randomly with encoding techniques, which results in an overall improvement of the accuracy [2]. Heart disease is predicted based on symptoms, namely pulse rate, sex, age, and many others. The ML algorithm with Neural Networks is introduced, whose results are more accurate and reliable, as we have seen in [8], [12].
Neural networks are generally regarded as the best tool for the prediction of diseases like heart disease and brain disease. The proposed method which we use has 13 attributes for heart disease prediction. The results show an enhanced level of performance compared to the existing methods in works like [3]. Carotid Artery Stenting (CAS) has also become a prevalent treatment mode in the medical field in recent years. CAS prompts the occurrence of major adverse cardiovascular events (MACE) in elderly heart disease patients. Their evaluation becomes very important. We generate results using an Artificial Neural Network (ANN), which produces good performance in the prediction of heart disease [6], [18]. Neural network methods are introduced, which combine not only posterior probabilities but also predicted values from multiple predecessor techniques. This model achieves an accuracy level of up to 89.01%, which is a strong result compared to previous works. For all experiments, the Cleveland heart dataset is used with a Neural Network (NN) to improve the performance of heart disease prediction, as we have seen previously in [9], [19].
We have also seen recent developments in ML techniques used for the Internet of Things (IoT) [43]. ML algorithms applied to network traffic data have been shown to provide accurate identification of IoT devices connected to a network. Meidan et al. collected and labeled network traffic data from nine distinct IoT devices, PCs and smartphones. Using supervised learning, they trained a multi-stage meta classifier. In the first stage, the classifier can distinguish between traffic generated by IoT and non-IoT devices. In the second stage, each IoT device is associated with a specific IoT device class. Deep learning is a promising approach for extracting accurate information from raw sensor data from IoT devices deployed in complex environments [44]-[47]. Because of its multilayer structure, deep learning is also appropriate for the edge computing environment [48], [49].
In this work, we introduce a technique we call the Hybrid Random Forest with Linear Model (HRFLM). The main objective of this research is to improve the accuracy of heart disease prediction. Many studies have been conducted that result in restrictions on feature selection for algorithmic use. In contrast, the HRFLM method uses all features without any restrictions on feature selection. Here we conduct experiments to identify the features of a machine learning algorithm with a hybrid method. The experimental results show that our proposed hybrid method has a stronger capability to predict heart disease compared to existing methods.
The rest of the paper is organized as follows. Section II discusses heart-related works and the existing methods and techniques available. We also provide an overview of our results in Section III. Section IV discusses HRFLM data pre-processing followed by feature selection, classification modeling and performance measures. Section V gives the algorithms used and the experimental setup. Section VI shows the evaluation of datasets and experimental setup. It also shows how the experiment was conducted and the results that were achieved. Section VII contains a discussion of the HRFLM method results and benchmarking of the proposed model. Finally, Section VIII ends with a conclusion of the current work and some notes on future enhancement.

II. RELATED WORK
There is ample related work in the fields directly related to this paper. ANN has been introduced to produce the highest accuracy prediction in the medical field [6]. The back propagation multilayer perceptron (MLP) of ANN is used to predict heart disease. The obtained results are compared with the results of existing models within the same domain and found to be improved [10]. The data of heart disease patients collected from the UCI laboratory is used to discover patterns with NN, DT, Support Vector Machines (SVM), and Naive Bayes. The results are compared for performance and accuracy with these algorithms. The proposed hybrid method returns results of 86.8% for F-measure, competing with the other existing methods [7]. Classification without segmentation using Convolutional Neural Networks (CNN) is introduced. This method considers the heart cycles with various start positions from the Electrocardiogram (ECG) signals in the training phase. CNN is able to generate features with various positions in the testing phase of the patient [22], [41]. A large amount of data generated by the medical industry has not been used effectively previously. The new approaches presented here decrease the cost and improve the prediction of heart disease in an easy and effective way. The various research techniques considered in this work for prediction and classification of heart disease using ML and deep learning (DL) techniques are highly accurate in establishing the efficacy of these methods [27], [42].

III. OVERVIEW OF METHOD AND RESULTS
In HRFLM, we use a computational approach with three association rule mining algorithms, namely Apriori, Predictive and Tertius, to find the factors of heart disease on the UCI Cleveland dataset. The available information points to the deduction that females have less of a chance of heart disease compared to males. In heart disease, accurate diagnosis is primary. However, the traditional approaches are inadequate for accurate prediction and diagnosis.
HRFLM makes use of ANN with back propagation along with 13 clinical features as the input. The obtained results are comparatively analyzed against traditional methods [20], [23]. The risk levels become very high and a number of attributes are used for accuracy in the diagnosis of the disease [24]. The nature and complexity of heart disease require an efficacious treatment plan. Data mining methods help in remedial situations in the medical field. The data mining methods are further used considering DT, NN, SVM, and KNN. Among the several methods employed, the results from SVM prove to be useful in enhancing accuracy in the prediction of disease [25]. A nonlinear method with a module for monitoring heart function is introduced to detect arrhythmias like bradycardia, tachycardia, atrial and atrial-ventricular flutters, and many others. The performance efficacy of this method can be estimated from the accuracy of the outcome results based on ECG data. ANN training is used for the accurate diagnosis of disease and the prediction of possible abnormalities in the patient [26], [34].
Diverse data mining approaches and prediction methods, such as KNN, LR, SVM, NN, and Vote, have been rather popular lately for identifying and predicting heart disease [23]. The novel method Vote in conjunction with a hybrid approach using LR and NB is proposed in this paper. The UCI dataset is used for conducting the experiments of the proposed method, which resulted in an accuracy of 87.4% in the prediction of heart disease [28], [36]. The Probabilistic Principal Component Analysis (PPCA) method is proposed for evaluation, based on the three UCI data sets of Cleveland, Switzerland, and Hungarian. The method extracts the vectors with high covariance, and vector projection is used for minimizing the feature dimension. The feature selection with minimized dimension is provided to a radial basis function, which supports kernel-based SVM. The results of the method are 82.18%, 85.82% and 91.30% on the UCI data sets of Cleveland, Switzerland and Hungarian respectively [29]. The hybrid method combining Linear Regression (LR), Multivariate Adaptive Regression Splines (MARS) and ANN is introduced with rough set techniques and is the main novel contribution of this paper. The proposed method effectively reduced the set of critical attributes. The remaining attributes are subsequently input to the ANN. The heart disease datasets are used to demonstrate the efficacy of the development of the hybrid approach [30], [38]. Heart disease prediction with a multilayer perceptron NN is proposed. This method uses 13 clinical attributes as the input and is trained by back propagation, giving very accurate results in identifying whether the patient has heart disease or not [39].
We also introduce the Apriori algorithm with SVM and compare it with nine other classification methods to predict heart disease more accurately. The results of the classification method have shown a higher degree of accuracy and performance in the prediction of heart disease compared to the other existing methods [32]. Feature selection plays a prominent role in the prediction of heart disease. ANN with back propagation is proposed for better prediction of the disease. The results obtained from the application of ANN are highly accurate and very precise [33]. The genetic algorithm with fuzzy NN, known as Recurrent Fuzzy Neural Network (RFNN), is introduced for the diagnosis of heart disease.
In the UCI data set, 297 instances of patient records in total are considered, of which 252 records are used for training and the remaining for testing. The results have been found to be satisfactory based on the assessment [35]. Heart disease prediction with SVM and ANN is proposed. In this approach, two methods are used on the premise of the accuracy and time of testing. The proposed model arranges the data records into two classes in SVM as well as ANN for further analysis, as shown in [37]. The Back Propagation Neural Network (BPNN) with classification method is introduced, where the hypertension gene sequence is generated and, thereafter, the exact gene sequence. The performance of the BPNN techniques has been measured in the training phase as well as the testing phase with various numbers of samples. The accuracy of this technique has improved in correspondence with the number of records [40].

IV. PROPOSED METHOD HRFLM
In this study, we have used R Studio Rattle to perform heart disease classification on the Cleveland UCI repository. It provides an easy-to-use visual representation of the dataset, the working environment and the building of predictive analytics. The ML process starts with a data pre-processing phase, followed by feature selection based on DT entropy, classification modeling, performance evaluation, and the results with improved accuracy. The feature selection and modeling are repeated for various combinations of attributes. Table 1 shows detailed information on the UCI dataset with the attributes used. Table 2 shows the data type and range of values. The performance of each model generated from the 13 features and the ML techniques used is recorded for each iteration. Section A summarizes the data pre-processing, Section B discusses the feature selection using entropy, Section C explains the classification with ML techniques and Section D presents the performance measures.

A. DATA PRE-PROCESSING
Heart disease data is pre-processed after the collection of various records. The dataset contains a total of 303 patient records, of which 6 records have some missing values. Those 6 records have been removed from the dataset and the remaining 297 patient records are used in pre-processing. The multi-class variable and binary classification are introduced for the attributes of the given dataset. The multi-class variable is used to check the presence or absence of heart disease. If the patient has heart disease, the value is set to 1; otherwise the value is set to 0, indicating the absence of heart disease in the patient. The pre-processing of data is carried out by converting medical records into diagnosis values. The results of data pre-processing for the 297 patient records indicate that 137 records show the value of 1, establishing the presence of heart disease, while the remaining 160 show the value of 0, indicating the absence of heart disease.
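As an illustration only, the pre-processing step described above can be sketched in a few lines of Python. The study itself was carried out in R Studio Rattle, so this is not the authors' code; the field names and the three toy records are hypothetical, and the mapping of the multi-class diagnosis attribute num (0 to 4) to a binary target is assumed from the dataset description.

```python
def preprocess(records):
    """Drop records with missing values and binarize the diagnosis.

    Each record is a dict; None marks a missing value (an assumption
    made for this sketch, not the file format of the UCI repository).
    """
    cleaned = []
    for rec in records:
        if any(value is None for value in rec.values()):
            continue  # incomplete records are removed, as in the text
        out = dict(rec)
        # num == 0 means no heart disease; 1..4 means disease present
        out["target"] = 1 if rec["num"] > 0 else 0
        cleaned.append(out)
    return cleaned


# Hypothetical toy records, not taken from the Cleveland dataset.
records = [
    {"age": 63, "chol": 233, "ca": 0, "num": 0},
    {"age": 67, "chol": 286, "ca": 3, "num": 2},
    {"age": 53, "chol": 203, "ca": None, "num": 1},
]
cleaned = preprocess(records)
```

Applied to the full dataset, a step of this kind would be expected to leave 297 of the 303 records.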

B. FEATURE SELECTION AND REDUCTION
Of the 13 attributes of the data set, two attributes, age and sex, are used to identify the personal information of the patient. The remaining 11 attributes are considered important as they contain vital clinical records. Clinical records are vital to diagnosis and to learning the severity of heart disease. As previously mentioned, several ML techniques are used in this experiment, namely NB, GLM, LR, DL, DT, RF, GBT and SVM. The experiment was repeated with all the ML techniques using all 13 attributes. Figure 2 shows the prediction method of HRFLM.

C. CLASSIFICATION MODELLING
The clustering of datasets is done on the basis of the variables and criteria of Decision Tree (DT) features. Then, the classifiers are applied to each clustered dataset in order to estimate its performance. The best performing models are identified from the above results based on their low rate of error. The performance is further optimized by choosing the DT cluster with a high rate of error and extracting its corresponding classifier features. The performance of the classifier is evaluated for error optimization on this data set.
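The selection logic in this paragraph can be sketched as follows. This is a minimal Python sketch under stated assumptions: the partition names, classifier labels and error rates are hypothetical placeholders rather than values reported in this paper, and the helper names are introduced only for illustration.

```python
def best_model_per_partition(error_rates):
    """Return, for each partition, the classifier with the lowest error rate."""
    best = {}
    for partition, errors in error_rates.items():
        model = min(errors, key=errors.get)
        best[partition] = (model, errors[model])
    return best


def hardest_partition(best):
    """Return the partition whose best classifier still has the highest error."""
    return max(best, key=lambda partition: best[partition][1])


# Hypothetical error rates (fractions), not results from the paper.
error_rates = {
    "d1": {"DT": 0.20, "RF": 0.15, "LM": 0.18},
    "d2": {"DT": 0.12, "RF": 0.14, "LM": 0.10},
}
best = best_model_per_partition(error_rates)
worst = hardest_partition(best)
```

The partition returned by hardest_partition is the one that, in the description above, would be examined further for error optimization.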

1) DECISION TREES
For training samples of data D, the trees are constructed based on high-entropy inputs. These trees are simple and quickly constructed in a top-down recursive divide and conquer (DAC) approach. Tree pruning is performed to remove the irrelevant samples on D.
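To make the entropy criterion concrete, the sketch below computes Shannon entropy and the information gain of a single threshold split. It is a generic illustration in Python, not the splitting rule implemented by the tool used in this study; the function names and the toy values in the usage lines are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = sum(len(part) / n * entropy(part)
                   for part in (left, right) if part)
    return entropy(labels) - weighted


# Toy feature values and binary labels, chosen only for illustration.
gain = information_gain([1, 2, 3, 4], [0, 0, 1, 1], threshold=2)
```

A top-down tree builder would evaluate such gains for candidate splits, choose the best one, and recurse on the two resulting subsets.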
2) LINEAR MODEL
For given input pairs $(x_i, y_i)$ with input vector $x_i$ of data D, the linear form of solution $f(x) = mx + b$ is solved by the following parameters: $b = \bar{y} - m\bar{x}$, where $\bar{x}$ and $\bar{y}$ are the means.
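A small worked example may help here. The intercept follows the relation given above; for the slope, the sketch assumes the ordinary least-squares estimate $m = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2$, which is the textbook choice and is not stated explicitly in this section. The function name fit_line and the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Fit f(x) = m*x + b by ordinary least squares (assumed estimator)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_bar - m * x_bar  # intercept from the means, as in the text
    return m, b


# Points lying exactly on y = 2x + 1, used only as a sanity check.
m, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```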

3) SUPPORT VECTOR MACHINE
Let the training samples be the dataset $Data = \{y_i, x_i\}$, $i = 1, 2, \ldots, n$, where $x_i \in \mathbb{R}^n$ represents the $i$th vector and $y_i \in \mathbb{R}^n$ represents the target item. The linear SVM finds the optimal hyperplane of the form $f(x) = w^T x + b$, where $w$ is a coefficient vector of the same dimension as $x$ and $b$ is an offset. This is done by solving the following optimization problem:
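For reference, a linear SVM of this form is commonly trained by solving the soft-margin primal problem shown here. This is the textbook formulation and may differ in detail from the one the authors used; it assumes class labels encoded as $y_i \in \{-1, +1\}$, and the penalty parameter $C$ and slack variables $\xi_i$ are symbols introduced only for this illustration.

$$\min_{w,\, b,\, \xi} \ \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i \left( w^T x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, \ldots, n$$

A larger $C$ penalizes margin violations more heavily, while a smaller $C$ favors a wider margin at the cost of more training errors.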

4) RANDOM FOREST
This ensemble classifier builds several decision trees and combines them to get the best result. For tree learning, it mainly applies bootstrap aggregating, or bagging. For a given data set $X$, the prediction for an unseen sample $x'$ is made by averaging the predictions $f_b(x')$, $b = 1, \ldots, B$, from the individual trees on $x'$: $\hat{f} = \frac{1}{B} \sum_{b=1}^{B} f_b(x')$. The uncertainty of the prediction of these trees is expressed through its standard deviation.
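The bagging steps can be sketched in plain Python as follows. The sketch assumes the sample standard deviation with a $B - 1$ denominator as the uncertainty estimate, which is a common convention rather than something stated in this section, and the per-tree predictions are hypothetical numbers rather than outputs of trained trees.

```python
import math
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def bagged_prediction(tree_predictions):
    """Average per-tree predictions and report their spread.

    Returns (mean, sample standard deviation); the B - 1 denominator
    is an assumption of this sketch.
    """
    B = len(tree_predictions)
    if B < 2:
        raise ValueError("need at least two trees to estimate spread")
    mean = sum(tree_predictions) / B
    variance = sum((p - mean) ** 2 for p in tree_predictions) / (B - 1)
    return mean, math.sqrt(variance)


rng = random.Random(0)                      # fixed seed for repeatability
sample = bootstrap_sample([10, 20, 30, 40], rng)

# Hypothetical predictions of four trees for one unseen sample.
mean, spread = bagged_prediction([1.0, 0.0, 1.0, 1.0])
```

In a full random forest, each tree would be fitted on its own bootstrap sample before its prediction enters the average.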

5) NAIVE BAYES
This learning model applies Bayes' rule under the assumption of independent features. Every instance of data D is allotted to the class with the highest posterior probability. The model is trained through the Gaussian function with prior probability $P(X_f) = \text{priority} \in (0, 1)$. Finally, the testing data is categorized based on the probability of association.

6) NEURAL NETWORK
The neuron components include inputs $x_i$, hidden layers and output $y_i$. The final result is produced through an activation function such as the sigmoid and a bias constant $b$.
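The neuron computation described above can be illustrated with a minimal forward pass. The layer sizes, weights and biases below are arbitrary values chosen for the sketch, not parameters from this study, and no training step is shown.

```python
import math


def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def forward(inputs, hidden_layer, output_unit):
    """One hidden layer followed by a single output neuron."""
    hidden = [neuron(inputs, w, b) for w, b in hidden_layer]
    w_out, b_out = output_unit
    return neuron(hidden, w_out, b_out)


# Hypothetical weights: two hidden neurons, one output neuron.
hidden_layer = [([0.5, -0.5], 0.0), ([1.0, 1.0], -1.0)]
output_unit = ([1.0, -1.0], 0.0)
y = forward([1.0, 1.0], hidden_layer, output_unit)
```

With 13 clinical attributes as input, each hidden neuron would simply carry 13 weights instead of two.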

D. PERFORMANCE MEASURES
Several standard performance metrics, such as accuracy, precision and classification error, have been considered for computing the performance efficacy of this model. Accuracy in the current context means the percentage of instances correctly predicted from among all the available instances. Precision is defined as the percentage of correct predictions in the positive class of the instances. Classification error is defined as the percentage of accuracy missing or error available in the instances. To identify the significant features of heart disease, three performance metrics are used, which help in better understanding the behavior of the various feature-selection combinations. The ML technique focuses on the best performing model compared to the existing models. We introduce HRFLM, which produces high accuracy and less classification error in the prediction of heart disease. The performance of every classifier is evaluated individually and all results are recorded for further investigation.

Algorithm 4 Apply Classifier on Extracted Features
Apply the hybrid method based on the error rate
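The three metrics defined in this section can be computed directly from the counts of a binary confusion matrix. The sketch below uses the conventional names tp, fp, tn and fn for true and false positives and negatives; the counts in the usage line are hypothetical and are not results from this paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision and classification error from confusion counts."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Classification error: share of instances predicted incorrectly.
    classification_error = (fp + fn) / total
    return {
        "accuracy": accuracy,
        "precision": precision,
        "classification_error": classification_error,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Accuracy and classification error always sum to one, so reporting both is mainly a matter of presentation.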

V. EXPERIMENTAL ENVIRONMENT

A. DATASETS
Heart disease data was collected from the UCI machine learning repository. There are four databases (i.e., Cleveland, Hungary, Switzerland, and the VA Long Beach). The Cleveland database was selected for this research because it is a database commonly used by ML researchers, with comprehensive and complete records. The dataset contains 303 records. Although the Cleveland dataset has 76 attributes, the data set provided in the repository furnishes information for a subset of only 14 attributes. The data source of the Cleveland dataset is the Cleveland Clinic Foundation. Table 1 depicts the description and type of attributes. There are 13 attributes that feature in the prediction of heart disease, and one further attribute serves as the output, or predicted attribute, for the presence of heart disease in a patient. The Cleveland dataset contains an attribute named num to show the diagnosis of heart disease in patients on different scales, from 0 to 4. In this scenario, 0 represents the absence of heart disease and all the values from 1 to 4 represent patients with heart disease, where the scaling refers to the severity of the disease (4 being the highest). Figure 1 shows the distribution of the num attribute among the identified 303 records.

B. EXPERIMENTAL SETUP FOR EVALUATION
We have used R Studio Rattle to perform the classification of heart disease from the Cleveland UCI repository. Figure 1 depicts the evaluation of the experiment in step-by-step stages. In the first step, the UCI dataset is loaded and the data becomes ready for pre-processing. The subset of 13 attributes (age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal), together with the target, is selected from the pre-processed heart disease data set. The three existing models for heart disease prediction (DT, RF, LM) are used to develop the classification. The evaluation of the model is performed with the confusion matrix.

VI. EVALUATION RESULTS
The prediction models are developed using 13 features and the accuracy is calculated for each modeling technique. The best classification methods are given below in Table 3. This table compares the accuracy, classification error, precision, F-measure, sensitivity and specificity. The highest accuracy is achieved by the HRFLM classification method in comparison with existing methods.
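For readers who want to reproduce the remaining columns of such a comparison, the sketch below derives sensitivity, specificity and F-measure from binary confusion-matrix counts. It assumes the usual definitions (sensitivity as recall on the positive class, F-measure as the harmonic mean of precision and sensitivity); the counts are hypothetical and do not come from Table 3.

```python
def extended_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity and F-measure from confusion counts."""
    # Sensitivity: share of actual positives that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: share of actual negatives that were correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    # F-measure: harmonic mean of precision and sensitivity.
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
    }


# Hypothetical counts for illustration only.
scores = extended_metrics(tp=40, fp=10, tn=45, fn=5)
```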

VII. DISCUSSION OF HRFLM TO IMPROVE THE RESULTS
The UCI dataset is further classified into 8 types of datasets based on classification rules. The classification rules are listed in Table 4. Each dataset is further classified and processed by R Studio Rattle. The results are generated by applying the classification rule for the dataset.
The classification rules are generated after data pre-processing is done. After pre-processing the data, the three best ML techniques are chosen and the results are generated. DT, RF and LM are applied to the various datasets to find the best classification method. Table 5 shows the results of the existing and proposed methods.
The results show that RF and LM are the best. The RF error rate for dataset 4 is high (20.9%) compared to the other datasets. The LM method for the dataset is the best (9.1%) compared to the DT and RF methods. We combine the RF method with LM and propose the HRFLM method to improve the results. Table 6 shows the results of the proposed method. Figure 3 shows the overall error rate of the dataset.
Figure 4 shows the overall classification error rate of the dataset.

A. BENCHMARKING OF THE PROPOSED MODEL
Benchmarking is needed to compare the performance of the existing models with that of the proposed model. It is used to identify whether the proposed method is the best and improves accuracy or not. The accuracy is calculated with the number of features selected and the results generated by the model. HRFLM has no restriction on the selection of features to use. All the features selected in this model accomplish the best results. Table 7 shows the comparison of various models with our proposed method. Figure 5 and Figure 6 show the performance comparison of the various models with respect to the proposed method.
Table 5 depicts the details of the features selected by various models from the UCI dataset for heart disease. The proposed method is used on all 13 attributes and classified based on the error rate. This result indicates that all the features selected and the ML techniques used are effective in accurately predicting heart disease in patients compared with known existing models.

VIII. CONCLUSION
Identifying the processing of raw healthcare data of heart information will help in the long-term saving of human lives and the early detection of abnormalities in heart conditions. Machine learning techniques were used in this work to process raw data and provide a novel discernment towards heart disease. Heart disease prediction is challenging and very important in the medical field. However, the mortality rate can be drastically controlled if the disease is detected at the early stages and preventative measures are adopted as soon as possible. Further extension of this study is highly desirable to direct the investigations to real-world datasets instead of just theoretical approaches and simulations. The proposed hybrid HRFLM approach combines the characteristics of Random Forest (RF) and Linear Method (LM). HRFLM proved to be quite accurate in the prediction of heart disease. The future course of this research can be pursued with diverse mixtures of machine learning techniques to obtain better prediction techniques. Furthermore, new feature-selection methods can be developed to get a broader perception of the significant features and so increase the performance of heart disease prediction.

Algorithm 1 Decision Tree-Based Partition
Require: Input: D dataset - features with a target class
for ∀ features do
    for each sample do
        Execute the Decision Tree algorithm
    end for
    Identify the feature space f_1, f_2, ..., f_x of dataset UCI. (9)
end for
Obtain the total number of leaf nodes l_1, l_2, l_3, ..., l_n with its constraints (10)
Split the dataset D into d_1, d_2, d_3, ..., d_n based on the leaf nodes constraints. (11)
Output: Partition datasets d_1, d_2, d_3, ..., d_n

Algorithm 2 Apply ML to Find Less Error Rate
Require: Input: Datasets with partition - d_1, d_2, d_3, ..., d_n
for ∀ apply the rules do
    On the dataset R(d_1, d_2, d_3, ..., d_n)
end for
Classify the dataset based on the rules C(R(d_1), R(d_2), ..., R(d_n)) (12)
Output: Classified datasets with rules C(R(d_1), R(d_2), ..., R(d_n))

7) K-NEAREST NEIGHBOUR
It extracts knowledge based on the Euclidean distance function $d(x_i, x_j)$ between samples and the majority vote of the k nearest neighbors.
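The k-nearest neighbour rule just described can be sketched in a few lines of Python. The training points, labels and the choice k = 3 are hypothetical, and ties in distance or in votes are resolved by list order, which is an implementation detail of this sketch rather than part of the method.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance d(x_i, x_j) between two feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Label the query by majority vote among its k nearest neighbours."""
    neighbours = sorted(
        zip(train_X, train_y),
        key=lambda pair: euclidean(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training set with binary labels.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [0, 0, 0, 1, 1, 1]
label = knn_predict(train_X, train_y, [0.5, 0.5], k=3)
```

Because the distance is scale-sensitive, clinical attributes with very different ranges would normally be rescaled before a rule of this kind is applied.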

FIGURE 3. Overall error rate of the dataset.

FIGURE 4. Overall classification error rate of the dataset.

FIGURE 5. Performance comparison with various models.

FIGURE 6. Performance comparison with various models.

TABLE 1. UCI dataset attributes detailed information.

TABLE 2. UCI dataset range and datatype.

TABLE 3. Result of various models with proposed model.

TABLE 5. Result of various models with proposed model.

TABLE 6. Results generated based on HRFLM.

TABLE 7. Comparison of various models with the proposed model.

TABLE 8. Data split based on DT.