Efficient Medical Diagnosis of Human Heart Diseases using Machine Learning Techniques with and without GridSearchCV

Predicting cardiac disease is considered one of the most challenging tasks in the medical field. Determining its cause takes considerable time and effort, especially for doctors and other medical experts. In this paper, machine learning algorithms such as LR, KNN, SVM, and GBC, together with GridSearchCV, are used to predict cardiac disease. The system uses a 5-fold cross-validation technique for verification, and a comparative study of these four methodologies is presented. The combined Cleveland, Hungary, Switzerland, and Long Beach V dataset and the UCI Kaggle dataset are used to analyse the models' performance. The analysis shows that the Extreme Gradient Boosting Classifier with GridSearchCV gives the highest and nearly comparable testing and training accuracies of 100% and 99.03% on both datasets (Hungary, Switzerland & Long Beach V and UCI Kaggle). Moreover, the XGBoost Classifier without GridSearchCV gives the highest and nearly comparable testing and training accuracies of 98.05% and 100% on both datasets. Furthermore, the analytical results of the proposed technique are compared with previous heart disease prediction studies. Among the proposed approaches, the Extreme Gradient Boosting Classifier with GridSearchCV produces the best hyperparameters for testing accuracy. The primary aim of this paper is to develop a unique model-creation technique for solving real-world problems.


I. INTRODUCTION
The availability of various medical data prompts us to consider whether there are efficient and effective methods for analyzing this data and deriving potentially innovative and applicable knowledge. One of the most significant challenges for data analytics is the diagnosis of various illnesses. Researchers concentrate their efforts in numerous directions, such as developing high-accuracy prediction models, extracting IF-THEN rules, and experimenting with new cut-off values for key input variables [1]. All of these directions are essential and can improve the diagnostic effectiveness of medical tests. According to medical data released by competent authorities worldwide, the most common cause of heart failure is myocardial infarction, and cardiovascular disease (CVD) is becoming increasingly common [2]. As reported by the WHO, CVDs affected roughly 17.9 million individuals in 2016, accounting for 31% of all worldwide mortality; heart attacks and strokes alone are responsible for 85% of these deaths. With early diagnosis and good medical care, many lives may be spared. Blood pressure is one of the most significant factors [3]. Heart disease can strike at any time without causing discomfort, and doctors are unable to forecast these silent attacks [4]. By 2030, the number of deaths due to cardiovascular disease is expected to climb to 23.3 million [5]. The blood vessels of the heart carry oxygen, and when these channels become blocked or narrowed, heart disease or stroke can occur [6]. High blood pressure, high cholesterol, stress, tension, alcohol use, a sedentary lifestyle, obesity, and diabetes are all factors that affect the heart, and they help in cardiovascular disease diagnosis. The walls of the arteries thicken when blood pressure rises, causing obstruction and potentially increasing mortality [7].
Every human being should receive correct treatment before their condition progresses; if a technology existed that could anticipate cardiac disease at an early stage, it might save a million lives. Heart-related disorders can be reduced with the use of medical data and machine learning techniques. Machine learning aids in finding patterns in enormous volumes of data that people are unable to detect, and the use of such sophisticated technology helps improve the current healthcare system [8]. A machine learning model learns from the historical data it receives and then creates prediction algorithms to forecast the outcome for fresh data that enters the system as input. The accuracy of these models is determined by the quality and amount of input data; a vast amount of data aids in the development of a better model that precisely predicts the output [9]. A range of supervised and unsupervised machine learning methods have been utilized by medical researchers to identify and predict heart illness [10]. The major goal of the algorithms suggested by various researchers is to extract patterns from data and forecast future outcomes. Early detection and treatment of cardiac disease can help patients avoid death or lower their mortality rate. Angiography is one of the most popular procedures for diagnosing abnormal narrowing of a cardiac artery. SMO, Naive Bayes, and ensemble methods were used to evaluate the symptoms, examination, and ECG characteristics, with an accuracy of 88.5 percent in predicting the existence of CAD [11]. Artificial intelligence is being used by governments and corporate health organizations for a variety of purposes, including lowering operating costs, increasing patient satisfaction, lowering the length of stay, and determining the etiology of sickness [12]. To predict heart attacks early, many data mining classification methods have been utilized [13].
Finding an ideal solution is quite difficult without thorough data analysis. The purpose of this study is to get as close to a zero prediction error rate as feasible. Many academics have previously utilized the dataset examined in this work, which is accessible in the UC Irvine machine learning repository. There are 76 distinct characteristics in the dataset; machine learning engineers grouped the most informative attributes from all relevant ones and created a 14-attribute dataset [14]. Data mining, machine learning, deep learning, and other automated methods for diagnosing cardiac disease are already available; accordingly, we present a short review of machine learning methodologies in this paper. We leverage machine learning resources to train the datasets. The development of cardiac disease can be predicted from a number of risk factors, including age, sex, blood pressure, cholesterol level, family history of coronary artery disease, diabetes, smoking, alcohol, obesity, heart rate, and chest discomfort. Currently, information about patients with medical reports is widely available in healthcare databases, and it is growing fast day by day. This unbalanced raw data is significantly redundant; it necessitates pre-processing in order to extract important features, reduce training algorithm execution time, and increase classification efficiency [15]. Recent advances in processing capacity and machine learning improve these processes and open up new research prospects in the healthcare sector [16], particularly for the early detection of diseases such as cardiovascular disease and cancer, in order to enhance survival rates. Machine learning is utilised in a wide range of applications, from identifying disease risk factors to developing superior automotive safety systems. To overcome the current limitations, machine learning provides the most widely used predictive modelling tools [17].
Machine learning has a lot of potential for transforming massive data and developing prediction algorithms. It uses a computer to learn complicated and non-linear relationships between attributes by minimising the difference between predicted and observed results [18]. The machine learns patterns from the features of an existing dataset and applies them to an unknown dataset to predict the outcome. Classification is one of the most effective machine learning prediction approaches: when trained with appropriate data, this supervised method is excellent at identifying illness [19]. The main contribution of this research was to use modern machine learning techniques to construct an intuitive medical prediction system for the diagnosis of heart disease. Different types of machine learning classifier algorithms were trained in this study, including logistic regression (LR), K-nearest neighbours (K-NN), support vector machine (SVM), and the Gradient Boosting Classifier (GBC), with and without GridSearchCV, to select the best predictive model for accurate heart disease detection at an early stage. To achieve the ideal collection of attributes that strongly influenced the performance of the classifiers when predicting the target class, model selection strategies were used, including a correlation-based feature subset evaluator and a Gradient Boosting Classifier evaluator. Finally, the whole attribute set and the optimal sets obtained via the attribute evaluators were used to tune the GBC classifier's hyperparameters with GridSearchCV.
One of the most challenging issues in medicine is predicting cardiac disease. It takes a lot of time and effort to figure out what's causing this, especially for doctors and other medical professionals. Researchers used a range of algorithms, including LR, KNN, SVM, and GBC, as well as the GridSearchCV, to predict cardiac disease.

Fig-1 The suggested model's flow diagram
The following is a breakdown of how the paper is structured: Section II examines previous heart disease research using a variety of machine learning approaches, Section III explains the database we use and our recommended model's approach analysis, and Section IV concludes with detailed results and comparisons to other methods. Finally, in Section V, the paper's conclusion and future study potential are discussed.

II. Literature Review
In the medical area, a lot of work has been done on illness prediction systems employing various machine learning techniques. Vikas Chourasia et al. [20] presented a work on early prediction of heart diseases using data mining techniques. They used classification and regression trees, iterative dichotomization, and decision tables, among other approaches; with 10-fold cross-validation, CART achieved the highest accuracy of 83 percent. Alizadeh Sani et al. [21] employed a rule-based classifier, a cost-sensitive method, and sequential minimal optimization (SMO) to identify coronary artery disease. Lakshmana Rao et al. [22] applied machine learning techniques for heart disease prediction, identifying the elements most likely to cause heart disease (blood pressure, diabetes, current smoking, high cholesterol, etc.). Distinguishing heart disease from other conditions is challenging, and a range of data mining and neural network techniques have been used to determine the severity of cardiac disease in patients. CHD is a difficult condition and should be addressed with caution: failure to recognize the disease early on might have serious consequences for the heart or even result in death. Data mining is also used in pharmaceutical research to locate various sorts of metabolic conditions. Machine learning is a strategy that allows a system to learn from prior data samples and models without being explicitly programmed. To construct an HD prediction system, Mohan et al. [23] created a hybrid machine learning technique; they also demonstrated a unique technique for collecting essential characteristics from data in order to train and evaluate machine learning classifiers, classifying correctly 88.07% of the time. Olaniyan et al. [24] developed a three-phase strategy for HD prediction in angina that achieved 88.89 percent accuracy using an artificial neural network.
Ratnasari et al. [25] utilised a gray-level threshold of 150, based on PCA and ROI, to reduce X-ray image features. A 13-feature dataset was used in the bulk of prior research. Every study employs classification to determine whether a patient has heart disease or not, and one of the most common trends is that Cleveland is the most commonly used dataset [26]. Gárate-Escamilla et al. [27] applied DNN and ANN with the χ2 statistical model; the clinical data parameters were utilized to ensure that the forecasts were accurate [28]. Using Hungarian and Cleveland datasets, several machine learning classifiers were used to predict heart disease, with PCA applied for dimensionality reduction and feature selection. For feature extraction, Zhang et al. [29] used an AdaBoost classifier combined with PCA, which enhanced prediction accuracy.

III. Materials and Methods
The proposed methodology, which includes dataset definition, data pre-processing, machine learning classifiers, attribute evaluators, and performance measures, is discussed in this part.

Proposed methodology
As outlined in the introduction, the main contribution of this research was to use modern machine learning techniques to construct an intuitive medical prediction system for the diagnosis of heart disease. Four types of machine learning classifier algorithms were trained in this study, logistic regression (LR), K-nearest neighbours (K-NN), support vector machine (SVM), and the Gradient Boosting Classifier (GBC), each with and without GridSearchCV, to select the best predictive model for accurate heart disease detection at an early stage.
To achieve the ideal collection of attributes that strongly influenced the performance of the classifiers when predicting the target class, model selection strategies were used, including a correlation-based feature subset evaluator and a Gradient Boosting Classifier evaluator. Finally, the whole attribute set and the optimal sets obtained via the attribute evaluators were used to tune the GBC classifier's hyperparameters with GridSearchCV. The system uses a 5-fold cross-validation technique for verification, and these four methodologies were applied in the comparative study. The models' performance is evaluated using the combined Cleveland, Hungary, Switzerland, and Long Beach V dataset, as well as the Heart Disease UCI Kaggle dataset. The Extreme Gradient Boosting Classifier with GridSearchCV generated the highest and nearly comparable accuracy results for both the Hungary, Switzerland & Long Beach V and Heart Disease UCI Kaggle datasets (100% and 99.03%, respectively), and it yields the optimal hyperparameters for testing accuracy. Performance indicators such as Accuracy, Recall, Precision, and F1-Score may be used to evaluate the models, and the findings were compared with previous cardiac prediction studies. In the future, we would like to develop the model so that it can be used with a variety of feature selection techniques; another option is to utilize GridSearchCV in conjunction with an Extreme Gradient Boosting Classifier. The study's main purpose is to build on past work by inventing a new and distinctive model-creation approach, and to make the model relevant and easy to apply in real-world circumstances. For Cleveland, Hungary, Switzerland, and Long Beach V, the models' performance is evaluated using the GridSearchCV technique with 5-fold cross-validation.
Our findings imply that cardiac disease may be consistently predicted using four machine learning models and optimization methodologies.

Recommended methodology
The datasets were acquired, and the collected data was pre-processed. The sequential forward selection approach was used to pick relevant characteristics. The hyperparameter optimization approach Grid Search was used to optimize the parameters of Logistic Regression, k-Nearest Neighbors, Support Vector Machine, and XGBoost. Finally, the models were tested and analyzed to see whether they could predict cardiac disease. The data in our proposed model is validated using 5-fold cross-validation. The suggested model's flowchart is shown in Fig. 2.

Dataset description: Cleveland, Hungary, Switzerland & Long Beach V heart disease dataset and heart disease UCI Kaggle dataset
The dataset utilized in this paper comes from [30], which is a shortened version of [31]. The dataset contains 76 characteristics, including the class attribute, for 1025 patients from Cleveland, Hungary, Switzerland, and Long Beach V; however, only a subset of 14 is used in this study. This heart disease prediction dataset thus has 14 columns: 13 independent attributes and one dependent target variable. The target variable separates patients into two groups, those who have heart disease and those who do not, and there are 1025 rows in all. For the heart disease UCI Kaggle dataset of 918 patients, only a subset of 12 attributes is used in this study: 11 independent attributes and one dependent target variable, again separating those who have heart disease from those who do not, with 918 rows in all. There are no missing values in either dataset. Table 1 contains a description of the dataset; an excerpt follows:
3. Chest pain (cp): 4 types of chest pain (1: typical angina; 2: atypical angina; 3: non-anginal pain; 4: asymptomatic)

Fig-2 The Heart Disease Prediction System Proposed Model
4. Resting blood pressure (trestbps): blood pressure in mm Hg on admission to the hospital
5. Serum cholesterol (chol): cholesterol level in mg/dl
6. Fasting blood sugar (fbs): fasting blood sugar > 120 mg/dl (0: false; 1: true)
7. Resting electrocardiograph (restecg): 0: normal; 1: having ST-T wave abnormality; 2: left ventricular hypertrophy
8. Maximum heart rate (thalach): maximum heart rate achieved
9. Exercise-induced angina (exang): 0: no; 1: yes

Exploratory Data Analysis
To comprehend data insights, some type of data analytics is required, which allows for a better understanding of data patterns. The data distribution of the dataset is examined in the next section.

Method-I Logistic Regression Analysis
The question may arise: why logistic rather than linear regression? Linear regression is unbounded, so the classifier can produce values outside the valid probability range; logistic regression, whose output resides between 0 and 1, avoids this [32]. The model predicts the probability that the target variable takes a given value from the data [33]. Logistic regression uses the sigmoid function: the linear function z = β0 + β1x, where x is the independent variable and β0, β1 are the coefficients to be learnt [34], is used as the input to the sigmoid

f(z) = 1 / (1 + e^(−z)).

To cast the problem in a form where the linear expression above can still be used, we first compute the odds of the outcome,

odds = p / (1 − p),

which moves a step closer to a continuous linear formulation, but the odds still take only positive values, while we need the range (−∞, +∞). That is achieved by taking the natural logarithm of the odds:

ln(p / (1 − p)) = β0 + β1x.
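A minimal sketch of the ideas above, using scikit-learn's logistic regression on synthetic data (the 13-attribute feature matrix is a stand-in for the heart-disease attributes described later, not the paper's actual dataset):

```python
# Sketch: logistic regression on a synthetic binary-classification problem.
# X and y are illustrative stand-ins for the 13-attribute heart-disease data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))           # 200 patients, 13 attributes
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic target variable

clf = LogisticRegression(max_iter=1000).fit(X, y)

# predict_proba returns the sigmoid output 1 / (1 + e^-(b0 + w.x)),
# which lies between 0 and 1 as described above.
proba = clf.predict_proba(X[:5])
print(proba.shape)  # (5, 2): probability of class 0 and class 1 per row
```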

Method-II K-Nearest Neighbors (KNN)
K-nearest neighbors is a straightforward method that stores all existing examples and categorizes new ones using a similarity metric (e.g., distance functions). KNN has been utilized as a non-parametric approach in statistical estimation and pattern recognition since the early 1970s. KNN (K-Nearest Neighbors) is a classifier technique based on "how similar" one data point (a vector) is to the others [35]. It is one of several supervised learning algorithms used in data mining and machine learning. The KNN's steps are as follows:
 Receive the unclassified data;
 Calculate the distance (Euclidean, Manhattan, Minkowski, or weighted) between the new data point and all previously classified data;
 Examine the K classified points with the smallest distances and count how many of each class appear among them;
 Take the class with the most appearances as the proper class;
 Assign the new data point to that class.
The figure below depicts all of these steps: you have unclassified data (in red) and all of your other data (in yellow and purple), each with its class (A or B). You calculate the distances between the new data point and all the others to see which ones are closest, then take the 3 (or 6) closest points and see which class appears most frequently. For example, in the image below, the closest points to the new data are those inside the first (inner) circle; there are 3 other classified points inside this circle, two purple and one yellow, so the most prevalent class is purple, and this new, previously unclassified data point is now classed as purple.
 If we choose K=3, we have two Class B observations and one Class A observation; as a result, the red star is assigned to Class B.
 If K=6, we have two observations in Class B and four in Class A; as a result, the red star receives a Class A classification.
Distance calculation: the distance between two points (the new sample and each point in the dataset) may be calculated in a variety of ways; as previously said, there are numerous ways to obtain this value, and in this paper we utilize the Euclidean distance [36]. Given two feature vectors with numeric values P = (p1, p2, …, pn) and Q = (q1, q2, …, qn), the Euclidean distance is

d(P, Q) = sqrt((p1 − q1)² + (p2 − q2)² + … + (pn − qn)²).
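The steps above can be sketched with scikit-learn's KNN classifier; the two small clusters below are illustrative stand-ins for the Class A / Class B points in the figure:

```python
# Sketch: K-nearest neighbors with k=3 and Euclidean distance on toy data
# mirroring the two-class (A/B) illustration above.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # class A cluster
              [4.0, 4.0], [4.2, 3.9], [3.8, 4.1]])  # class B cluster
y = np.array(['A', 'A', 'A', 'B', 'B', 'B'])

knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean').fit(X, y)

# The new point lies inside the class B cluster, so its 3 nearest
# neighbours are all class B and the majority vote assigns it to B.
new_point = np.array([[4.1, 4.0]])
print(knn.predict(new_point))  # ['B']
```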

Method-III Support Vector Machine (SVM)
Support Vector Machine (SVM) is a promising machine learning approach based on statistical learning theory. An SVM can classify both linear and nonlinear data. It separates the data into two categories (or classes) by generating a linear optimal separating hyperplane inside a higher dimension with the use of support vectors and margins. The original training data is mapped onto a higher dimension using an appropriate nonlinear mapping; within this higher dimension, a hyperplane can always be used to separate data from two classes [37]. If f is the SVM classification function over the domain P (here, the data set), the collection of n training tuples with associated class labels is {(x_i, y_i)}, where each y_i can take one of two values, +1 or −1, corresponding to the two classes. The main principle of SVM is to discover the hyperplane with the greatest margin distinguishing the collection of positive examples from the set of negative examples, as shown in Fig-9. The support vector machine computes a linear classification of the form f(x) = w·x + b, where w is a weight vector, x denotes a training sample, and b denotes the bias. The separating hyperplane may be expressed as f(x) = 0. As a result, any point from one class located above the separating hyperplane satisfies f(x) > 0; similarly, any point from the other class, below the separating hyperplane, satisfies f(x) < 0.
The above equations define the linearly separable set D, whose tuples satisfy the inequality y_i(w·x_i + b) ≥ 1; here the margin is m = 2/||w||.

Fig-9 Support vector machine

Maximizing the margin may be represented as the optimization problem: minimize (1/2)||w||² subject to y_i(w·x_i + b) ≥ 1. The dual Lagrangian formulation can be used to tackle this optimization challenge. The support vectors for linearly separable data are a subset of the actual training tuples; a dot product between the support vector x_i and the test tuple x appears in the Lagrangian formulation, and each Lagrange multiplier has a one-to-one correspondence with a training tuple. Not all data sets can be separated in a linear fashion: there may be no hyperplane separating the positive and negative cases. SVMs may also be used to create non-linear classifiers, whose output is computed from the Lagrange multipliers α_i as

f(x) = Σ_i α_i y_i K(x_i, x) + b,

where K is the kernel function. In this case, we employed the Radial Basis Function (RBF) kernel, written as

K(x_i, x_j) = exp(−γ||x_i − x_j||²).

The nonlinearity alters the quadratic form, but the dual objective function is still quadratic in the multipliers α_i; the resulting quadratic programming problem is solved using a sequential minimal optimization approach.
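A short sketch of a non-linear SVM with the RBF kernel described above, on a toy dataset (two concentric circles) that no linear hyperplane can separate:

```python
# Sketch: non-linear SVM with the RBF kernel K(xi, xj) = exp(-gamma*||xi-xj||^2)
# on data that is not linearly separable (two concentric circles).
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A linear classifier f(x) = w.x + b cannot separate the circles, but the
# RBF kernel implicitly maps them to a higher dimension where a hyperplane can.
clf = SVC(kernel='rbf', gamma='scale').fit(X, y)
print(clf.score(X, y))  # near-perfect accuracy on this toy set
```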

Method-IV XGBoost Classifier
XGBoost (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees technique. Gradient boosting is a supervised learning strategy that attempts to accurately forecast a target variable by combining an ensemble of estimates from a collection of simpler, weaker models. The XGBoost approach performs well in machine learning competitions due to its robust handling of a variety of data types, relationships, and distributions, as well as the variety of hyperparameters that can be fine-tuned. Regression, classification (binary and multiclass), and ranking problems may all be solved with XGBoost [38]. The XGBoost Classifier flow chart is shown in Fig-10.
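A minimal sketch of the gradient-boosting idea on synthetic data. It uses scikit-learn's GradientBoostingClassifier as a stand-in; xgboost's XGBClassifier exposes the same fit/predict interface, and the hyperparameter values below are illustrative, not the ones tuned in this study:

```python
# Sketch: an ensemble of weak tree learners combined into a strong classifier
# (gradient boosting). Synthetic data stands in for the heart-disease dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# n_estimators weak trees of depth 3, each correcting the previous ones'
# errors at the given learning rate.
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=0).fit(X_tr, y_tr)
print(round(gbc.score(X_te, y_te), 2))
```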

GridSearchCV
In virtually all machine learning applications, we train many models on a dataset and then choose the one that performs the best. However, we cannot state with certainty that this model is the best possible one for the situation at hand; as a result, our goal is to improve the model in whatever manner we can. One essential observation is that once we select proper values for the hyperparameters, a model's performance can increase considerably. GridSearchCV [39] may be used to determine optimal values for a model's hyperparameters.

Fig-10 XGBoost Classifier flow chart

Choosing the best hyperparameters has a big impact on model performance. There are a number of optimization techniques to choose from, each with its own set of advantages and disadvantages. Experiments were done on numerous optimization methodologies to discover the ideal hyperparameter combination, which was then applied to Logistic Regression, k-Nearest Neighbors, Support Vector Machine, and XGBoost. A machine learning model has two sorts of parameters: hyperparameters and model parameters. The user must set the hyperparameters before training the model; they regulate the learning process. For the same type of machine learning model, different learning rates or weights are used to govern the learning process and find hidden patterns in the data. To decrease error and improve model accuracy, these hyperparameters are fine-tuned: using a trial-and-error method, they are modified until the ideal values are found. Hyperparameter settings that achieve a balance between overfitting and underfitting are optimal, and selecting good hyperparameters helps to properly explore the search space. To improve the performance of the Logistic Regression, k-Nearest Neighbours, Support Vector Machine, and XGBoost models, the hyperparameters can be tuned.
Model parameters, in contrast, are learnt during the training phase. Grid search techniques are commonly used in hyperparameter optimization: Grid Search is a well-known method for evaluating all hyperparameter combinations. The learning rate and the number of layers are two of the most important parameters tuned with Grid Search. A collection of values is first determined for each hyperparameter; in each cycle, one combination of hyperparameters is evaluated; finally, the most successful combination is selected and used in the learning process.
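The grid-search cycle described above can be sketched with scikit-learn's GridSearchCV; the SVM grid values here are illustrative, not the ones tuned in the paper:

```python
# Sketch: GridSearchCV exhaustively tries every hyperparameter combination
# in the grid, scoring each with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# 3 values of C x 2 values of gamma = 6 combinations, each evaluated
# with cv=5 folds (30 fits in total).
param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.01]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5).fit(X, y)

print(search.best_params_)                 # combination with the best CV score
print(len(search.cv_results_['params']))   # 6
```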

Performance metrics
The performance of the models was assessed using a confusion matrix. The count of predicted values that correctly identify the presence of illness is known as True Positive (tp). The count of predicted values that correctly identify the absence of illness is known as True Negative (tn). The count of predicted values incorrectly labelled as positive (when they were actually negative) is known as False Positive (fp). The count of predicted values incorrectly labelled as negative (when they were actually positive) is known as False Negative (fn). Once the model has been trained, 5-fold cross-validation is used to predict and verify the risk of heart disease. The investigation employed the performance metrics Accuracy, Recall, Precision, and F-measure [40][41][42].
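From the tp/tn/fp/fn counts defined above, the four reported metrics follow directly; the counts below are made-up numbers for illustration only:

```python
# Sketch: computing Accuracy, Precision, Recall and F1 from confusion-matrix
# counts. The counts are illustrative, not results from this study.
tp, tn, fp, fn = 50, 40, 5, 5

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # fraction of correct predictions
precision = tp / (tp + fp)                    # how many predicted positives are real
recall    = tp / (tp + fn)                    # how many real positives are found
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean

print(accuracy)       # 0.9
print(round(f1, 3))   # 0.909
```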

IV. Results and Analysis
This study looked at the effect of hyperparameters on the prediction performance of four different machine learning models. The ML models Logistic Regression, k-Nearest Neighbours, Support Vector Machine, and XGBoost were improved using the hyperparameter optimization approach GridSearchCV. In the trials, hyperparameter techniques were used to compare the prediction performance of the four algorithms LR, K-NN, SVM, and XGBoost, and each approach was analysed using different hyperparameters. The findings of Randomized Search, Grid Search, and other hyperparameter optimization procedures were compared to those of established approaches. Data was split into training and testing sets in proportions of 70% and 30%, respectively, and we employed 5-fold cross-validation in our research.
Previous research has shown that 5-fold cross validation yields a generalized model while avoiding overfitting.
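The 5-fold verification procedure can be sketched as follows; the synthetic data and logistic-regression estimator are illustrative stand-ins:

```python
# Sketch: 5-fold cross-validation. cross_val_score trains on 4 folds and
# scores on the held-out fold, repeating 5 times so every fold is tested once.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=250, n_features=13, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(len(scores))  # 5: one accuracy score per fold
print(scores.mean())
```

Averaging the per-fold scores gives a less optimistic, more generalizable accuracy estimate than a single train/test split, which is why the study uses it to guard against overfitting.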

Fig-11 Graphical representation of five-fold cross-validation
The presence of heart illness is represented as one in our study, whereas the absence of heart disease is represented as zero. We used the scikit-learn Python package to implement the hyperparameter optimization strategies; scikit-learn has built-in support for hyperparameter tuning approaches.

Experimental results: training and testing of four machine learning models on the (Cleveland, Hungary, Switzerland & Long Beach V) and UCI Kaggle datasets
The training and testing confusion matrices are used to assess the Logistic Regression, k-Nearest Neighbours, Support Vector Machine, and XGBoost algorithms; a 2x2 matrix represents each classifier's correct and incorrect predictions. Fig-12 demonstrates the performance of the several classifiers.

[Chart: training and testing accuracy of the best model with and without GridSearchCV]

[Chart: training and testing accuracy on the Heart Disease UCI Kaggle dataset]

V. Conclusion
For verification, a 5-fold cross-validation approach is employed, and the four approaches are used in the comparative study. In the future, we would like to improve the model so that it may be used with several feature selection algorithms; another possibility is to utilize GridSearchCV with a Gradient Boosting Classifier. The major goal of this study is to improve on previous work by developing a new and unique model-creation method and to make the model relevant and easy to use in real-world situations.