A Stacking-Based Model for Non-Invasive Detection of Coronary Heart Disease

Coronary arteriography (CAG) is an accurate invasive technique for diagnosing coronary heart disease (CHD). However, because it is invasive, it is not suitable for detecting CHD during the annual physical examination. Building on the successful application of machine learning (ML) in various fields, our goal is to selectively integrate multiple ML algorithms and to verify the validity of feature selection methods using the personal clinical information commonly collected in the annual physical examination. In this study, a two-level stacking-based model is designed in which level 1 is the base level and level 2 is the meta level. The predictions of the base-level classifiers are used as the input of the meta level. The Pearson correlation coefficient and the maximum information coefficient are first calculated to find the classifier with the lowest correlation. An enumeration algorithm is then used to find the combination of classifiers that achieves the best result. The Z-Alizadeh Sani CHD dataset used here consists of 303 cases verified by CAG. Experimental results demonstrate that the proposed model obtains an accuracy, sensitivity and specificity of 95.43%, 95.84% and 94.44%, respectively, for the detection of CHD. The proposed method can effectively help clinicians distinguish subjects with normal coronary arteries from those with CHD.


I. INTRODUCTION
Coronary heart disease (CHD) remains one of the leading causes of cardiovascular death globally. At present, the diagnostic methods for CHD can be divided into invasive and non-invasive ones. Coronary angiography (CAG) is a relatively safe and reliable invasive diagnostic technique, which has been widely used in clinical practice as the gold standard for CHD diagnosis [1]. However, its invasive nature and relatively high cost make it difficult to apply in the annual physical examination. Electrocardiography (ECG) and echocardiography are non-invasive methods, but neither offers reliable accuracy [2]. Therefore, it is necessary to find new non-invasive methods to detect CHD.
In clinical cardiology, machine learning (ML) has proved to be an effective method for predicting all-cause mortality in patients with suspected CHD [3]. In subclinical cardiovascular epidemiology, ML can provide better prediction than standard cardiovascular risk scores in conjunction with phenotypic data points [4]. ML methods are widely used for handling existing data in medicine. In recent years, a number of ML algorithms for diagnosing CHD have been developed. Feshki and Shijani improved CHD diagnosis with an evolutionary algorithm and a feedforward neural network [5]. Davari et al. extracted features from ECGs by frequency and nonlinear domain methods to identify CHD symptoms with a support vector classifier (SVC) [6]. Vernekar et al. extracted Markov features along with other statistical and frequency domain features from phonocardiograms (PCG) and trained an ensemble of artificial neural networks and gradient boosting trees [7]. Kumar et al. also used ECG signals, but with the flexible analytic wavelet transform, to characterize CHD [8]. Verma et al. proposed a hybrid method that included risk factor identification using correlation-based feature subset selection with a particle swarm optimization search method and K-means clustering [9]. Alizadehsani et al. used three classifiers to detect stenosis of the three coronary arteries, i.e., the left anterior descending, left circumflex and right coronary arteries, to obtain higher accuracy for CHD diagnosis [2]. Davari et al. achieved 99.2% detection accuracy on the Long Term ST database, but the CHD patients in that database present various ST segment changes [6], whereas in clinical practice many CHD patients have a normal ST segment. Therefore, databases of patients with coronary artery disease but normal ST segments may be more helpful for applying artificial intelligence-based CHD diagnosis models in complex clinical situations. In addition, previous research usually employed only one kind of ML classifier to diagnose CHD automatically. However, many ML researchers, especially those participating in ML competitions, have successfully used classifier combination techniques to improve classification accuracy [10], [11].
The associate editor coordinating the review of this manuscript and approving it for publication was Chua Chin Heng Matthew.
Techniques for combining the predictions obtained from multiple base-level classifiers can be summarized into three combinatorial frameworks: voting (used in bagging and boosting), stacking and cascading [12]. For more complex data sets, a traditional classifier can be improved by various types of combination rules [13]-[16]. In stacking, the predictions of a collection of classifiers are given as inputs to the next-level learning algorithm [17]. The next-level algorithm is trained to combine the model predictions optimally and to produce the final set of predictions. Coupling relationships always exist between the different levels before the final prediction. We analyze the relationships between the models in the base level and find the optimal combination of models by an enumeration algorithm.
In summary, the main contributions of this work are as follows:
• Eight feature selection methods are investigated to evaluate their performance for automated CHD diagnosis. We find that the RFECV strategy achieves the highest predictive performance in repeated ten-fold cross-validation. The features selected by the RFECV method are of high reference value to cardiologists in their clinical CHD diagnosis.
• A total of 10 classification methods are utilized. By analyzing the results, it is found that the model combination with the best performance cannot be determined by directly calculating the Pearson correlation coefficient (PCC) and the maximum information coefficient (MIC). Therefore, a novel strategy for seeking the optimal combination is proposed, in which the model having the minimum correlation with the other models is first selected, and the optimal combination is then determined by enumerating every possible combination of the selected model with the others. Our results show that the proposed strategy yields satisfactory performance.
• The optimal model combination for automated CHD diagnosis is determined. Applying the proposed model combination strategy to 3 other data sets also gives satisfactory results, which demonstrates the generalization ability of the strategy.
The remainder of this paper is organized as follows. In Section II, the data source and the preprocessing methods are introduced. In Section III, the technical details of the proposed two-level stacking-based model are described. Experimental results are presented in Section IV, followed by a discussion in Section V.

II. MATERIAL
The Z-Alizadeh Sani dataset [http://archive.ics.uci.edu/ml/datasets/extention+of+Z-Alizadeh+sani+dataset] consists of 216 CHD patients and 87 healthy subjects represented by 54 different clinical and demographic features, as shown in Table 1 [18]. The dataset is imbalanced in the distribution of the target classes: there are roughly 2.5 times as many CHD patients as healthy subjects. The synthetic minority oversampling technique (SMOTE) is therefore employed to address the imbalance. The basic idea of SMOTE is to analyze the minority class and synthesize new minority-class samples by oversampling. The data of normal individuals are oversampled by SMOTE during cross-validation, not prior to it, so synthetic data are created only for the training set and the test set is left unaffected. If a feature has a variance that is orders of magnitude larger than that of the others, it might dominate the objective function and make the estimator unable to learn from the other features as expected [19]. Since the 54 features of the dataset include 23 numeric and 31 categorical features, maximum and minimum normalization is applied to standardize them. Maximum and minimum normalization is a common data processing method, defined as

$$x^{*} = \frac{x - \min}{\max - \min} \tag{1}$$

where x is the input feature, max represents its maximum value, min represents its minimum value, and x^{*} represents the output value after normalization. In this study, we use this approach to scale the features, which helps to reveal potential importance relationships among them.
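The preprocessing described above can be illustrated with a minimal, standard-library sketch. It is not the authors' code: the function names are hypothetical, the normalization bounds are assumed to be fitted on the training fold only (the text does not say), and `smote` is a simplified version of SMOTE that interpolates between a minority sample and one of its k nearest minority neighbours.

```python
import random

def minmax_fit(rows):
    """Return per-feature (min, max) pairs computed on the given rows."""
    return [(min(col), max(col)) for col in zip(*rows)]

def minmax_transform(rows, bounds):
    """Apply x* = (x - min) / (max - min); constant features map to 0."""
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0 for x, (lo, hi) in zip(row, bounds)]
        for row in rows
    ]

def smote(minority, n_new, k=5, seed=0):
    """Simplified SMOTE: build n_new synthetic minority samples.

    Each synthetic sample lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Call this on the training fold only, never on the test fold.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        # Rank the remaining minority samples by squared Euclidean distance.
        neighbours = sorted(
            others, key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m))
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy usage: three training rows with two features each.
train = [[1.0, 10.0], [3.0, 20.0], [5.0, 40.0]]
bounds = minmax_fit(train)
scaled = minmax_transform(train, bounds)
extra = smote(scaled, n_new=4, k=2)
```

Because every synthetic sample is a convex combination of two existing minority samples, it stays inside the normalized range when SMOTE is applied after scaling.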

III. METHODS
A. FEATURE SELECTION
Feature selection is of great importance in dealing with redundant features [20], [21]. The three common categories of feature selection methods are filter, wrapper and embedded methods. Filter methods calculate the relationship between the features and the label using statistical tools such as variance, mutual information and the chi-square test (CHI2) [22], [23]. Wrapper methods are closely tied to the classifier: they select the best feature subset according to the classifier's performance [24], [25]. Moreover, recursive feature elimination with cross-validation (RFECV) removes the need to set manually the number of features to retain. Embedded methods are integrated into the model training process and select features automatically. Extreme gradient boosting (XGB) has been widely used as an embedded feature selection method due to its high efficiency [26].
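As an illustration of the filter category only (not of RFECV, which the study finally adopts), the following sketch scores binary features against a binary label with the classical Pearson chi-square statistic on a 2x2 contingency table and keeps the k highest-scoring ones. The function names and the toy data are hypothetical, and the statistic is the textbook test of independence, which may differ from the CHI2 implementation used in the experiments.

```python
def chi2_score(feature, labels):
    """Pearson chi-square statistic for a binary feature vs. a binary label."""
    n = len(labels)
    observed = {(f, y): 0 for f in (0, 1) for y in (0, 1)}
    for f, y in zip(feature, labels):
        observed[(f, y)] += 1
    score = 0.0
    for f in (0, 1):
        for y in (0, 1):
            row_total = observed[(f, 0)] + observed[(f, 1)]
            col_total = observed[(0, y)] + observed[(1, y)]
            expected = row_total * col_total / n
            if expected > 0:
                score += (observed[(f, y)] - expected) ** 2 / expected
    return score

def select_k_best(columns, labels, k):
    """Return the (sorted) indices of the k features with the largest score."""
    scores = [chi2_score(col, labels) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Toy usage: feature 0 equals the label, feature 1 is independent of it,
# feature 2 is partially associated with it.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 1, 1, 1],
]
selected = select_k_best(columns, labels, k=2)
```

A filter of this kind ranks each feature independently of any classifier, which is what distinguishes it from the wrapper and embedded approaches described above.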

B. MODEL BUILDING
The proposed model consists of two levels: level 1 is the base level and level 2 is the meta level. The predictions of the base-level classifiers are used as the input of the meta level. The base level contains 10 models from scikit-learn: random forest (RF), extra trees (ET), AdaBoost (ADB), SVC, multi-layer perceptron (MLP), XGB, Gaussian process classification (GPC), Gaussian naive Bayes (GNB), logistic regression (LR) and gradient boosting (GB) [27]-[36]. The performance of a stacking scheme is affected by the number of base-level classifiers [37]. Generally, base-level classifiers with weakly correlated predictions yield good performance [37]. The PCC and MIC can be used to quantify the relevance and redundancy among features [38]-[40], with values closer to 0 indicating weaker correlation. We then use the enumeration algorithm to search for the best combination of classifiers. Two algorithms summarize the stacking and enumeration processes. The dataset is first shuffled randomly and split into 10 folds. In each round, one fold is treated as the test data ($S$) and the remaining folds are taken as $R$. The whole process is repeated 10 times. $R$ and $S$ are the inputs to Alg. 1. Alg. 1 mainly contains two loops. The first loop builds the ten base-level models, and the second loop is a 10-fold cross-validation that produces the training and test data for the meta level. $R$ is itself split into 10 folds. One fold is taken as the validation set ($R_{kv}$) and the remaining folds are treated as the training set ($R_{kt}$). $R_{kt}$ is used to train the base-level model $\xi_l$, and the predictions of $\xi_l$ on $R_{kv}$ are used to produce $train_l$. $S$ is then fed into the trained model $\xi_l$ to generate $test_l$. Since the loop repeats 10 times, $train_l$ is the concatenation of the predictions on the ten validation folds, whereas the ten predictions on $S$ are averaged to give $test_l$. Finally, the union of the training and test sets generated by the 10 different base models is taken as the output.
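The construction of the meta-level features (Alg. 1) can be sketched as follows. This is a hedged, standard-library illustration rather than the authors' implementation: `stack_features`, `kfold_indices` and the toy learner `mean_threshold` are hypothetical names, each base model is represented by a function that takes training data and returns a prediction function, and the data are assumed to be shuffled already so that contiguous folds are acceptable.

```python
def kfold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def stack_features(models, X_train, y_train, X_test, k=10):
    """Return (train_meta, test_meta) with one column per base-level model.

    train_meta holds out-of-fold predictions; test_meta holds the average
    of the k predictions made by the k fitted copies of each model.
    """
    n = len(X_train)
    folds = kfold_indices(n, k)
    train_meta = [[0.0] * len(models) for _ in range(n)]
    test_meta = [[0.0] * len(models) for _ in range(len(X_test))]
    for l, fit in enumerate(models):
        for val_idx in folds:
            held_out = set(val_idx)
            tr_idx = [i for i in range(n) if i not in held_out]
            predict = fit([X_train[i] for i in tr_idx], [y_train[i] for i in tr_idx])
            for i in val_idx:                # validation fold -> train_l
                train_meta[i][l] = predict(X_train[i])
            for j, x in enumerate(X_test):   # averaged over k fits -> test_l
                test_meta[j][l] += predict(x) / k
    return train_meta, test_meta

def mean_threshold(feature):
    """Toy base learner: threshold one feature midway between the class means."""
    def fit(X, y):
        pos = [x[feature] for x, t in zip(X, y) if t == 1]
        neg = [x[feature] for x, t in zip(X, y) if t == 0]
        mu_pos, mu_neg = sum(pos) / len(pos), sum(neg) / len(neg)
        cut = (mu_pos + mu_neg) / 2
        sign = 1 if mu_pos >= mu_neg else -1
        return lambda x: 1.0 if sign * (x[feature] - cut) > 0 else 0.0
    return fit

# Toy usage: feature 0 separates the classes, feature 1 is uninformative.
X = [[(i % 2) + 0.1 * (i % 3), float((i * 7) % 5)] for i in range(20)]
y = [i % 2 for i in range(20)]
X_test = [[0.0, 1.0], [1.1, 3.0]]
train_meta, test_meta = stack_features(
    [mean_threshold(0), mean_threshold(1)], X, y, X_test, k=5
)
```

Every row of `train_meta` is predicted by a model that never saw that row during training, which is the property that keeps the meta-level learner from simply memorizing the base-level fits.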
The output of Alg. 1 is taken as the new features of the meta level. Since it is unwise to use all the new features directly without filtering, Alg. 2 is employed to search for the optimal combination. Alg. 2 mainly contains two loops. In the first loop, there are 10 groups of possible combinations, namely $C_{10}^{1}, C_{10}^{2}, C_{10}^{3}, C_{10}^{4}, C_{10}^{5}, C_{10}^{6}, C_{10}^{7}, C_{10}^{8}, C_{10}^{9}, C_{10}^{10}$, which serve as the input of the second loop. All possible combinations are enumerated one at a time, without repetition, rather than being passed to the next loop all at once. In the second loop, the selected columns of the training data are used to train the meta-level model $H_m$. The LR model is applied at the meta level to reduce the complexity of the model [37]. The test data are then fed into the trained model $H_m$ to evaluate its performance on the test set. Finally, the model combination with the highest accuracy is determined.
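The enumeration step (Alg. 2) can be sketched as below. Summing the ten groups of combinations gives 2^10 - 1 = 1023 non-empty subsets, and fixing one model in advance leaves 2^9 = 512 of them. The code is an illustrative assumption, not the authors' implementation: the names are hypothetical, and `fit_logistic` is a tiny gradient-descent logistic regression that stands in for the LR meta-learner.

```python
from itertools import combinations
from math import exp

def enumerate_subsets(n_models, required=None):
    """Yield every non-empty subset of model indices, optionally forcing one in."""
    for size in range(1, n_models + 1):
        for subset in combinations(range(n_models), size):
            if required is None or required in subset:
                yield subset

def fit_logistic(X, y, lr=0.5, epochs=200):
    """Minimal batch-gradient-descent logistic regression (stand-in meta-learner)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, t in zip(X, y):
            p = 1.0 / (1.0 + exp(-(b + sum(wi * xi for wi, xi in zip(w, x)))))
            for j, xj in enumerate(x):
                grad_w[j] += (p - t) * xj
            grad_b += p - t
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return lambda x: 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def best_combination(train_meta, y_train, test_meta, y_test, fit_meta, required=None):
    """Return (accuracy, subset) for the best-scoring subset of meta-feature columns."""
    best = (-1.0, ())
    for subset in enumerate_subsets(len(train_meta[0]), required):
        tr = [[row[i] for i in subset] for row in train_meta]
        te = [[row[i] for i in subset] for row in test_meta]
        predict = fit_meta(tr, y_train)
        acc = sum(predict(x) == t for x, t in zip(te, y_test)) / len(y_test)
        if acc > best[0]:  # strict comparison keeps the smallest winning subset
            best = (acc, subset)
    return best

# Toy usage: column 0 is noise, column 1 reproduces the label.
meta = [[float((i // 2) % 2), float(i % 2)] for i in range(12)]
labels = [i % 2 for i in range(12)]
acc, subset = best_combination(meta, labels, meta, labels, fit_logistic)
```

Passing `required=` the index of the least-correlated model halves the search space, which is one way to read the role that the PCC and MIC play before the enumeration.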

IV. EXPERIMENTAL RESULTS AND ANALYSIS
The method is implemented in Python 3.7.2 using Jupyter Notebook, an open-source environment that provides code completion, code inspection and interactive computing across many programming languages. Following common practice in medical diagnosis, accuracy ($A_{cc}$), sensitivity ($S_e$), specificity ($S_p$), F1-score and the area under the ROC curve (AUC) are used to evaluate our method. Since 10-fold cross-validation is used, the results are expressed in the form of mean ± standard deviation.
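For reference, the threshold-based metrics named above follow from the confusion matrix as sketched below. The function names are hypothetical, the positive class (CHD) is assumed to be encoded as 1, and the population standard deviation is used for the mean ± standard deviation summary, which is an assumption because the text does not state which estimator was used.

```python
from statistics import mean, pstdev

def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and F1 for labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0

    se = ratio(tp, tp + fn)          # sensitivity (recall of the CHD class)
    sp = ratio(tn, tn + fp)          # specificity
    precision = ratio(tp, tp + fp)
    return {
        "acc": ratio(tp + tn, len(y_true)),
        "se": se,
        "sp": sp,
        "f1": ratio(2 * precision * se, precision + se),
    }

def summarize(values):
    """Format per-fold scores as 'mean ± standard deviation'."""
    return f"{mean(values):.2f} ± {pstdev(values):.2f}"

# Toy usage with eight predictions.
m = binary_metrics([1, 1, 1, 0, 0, 1, 0, 1], [1, 1, 0, 0, 1, 1, 0, 1])
fold_summary = summarize([90.0, 92.0, 94.0])
```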

A. RESULTS OF DATA PRE-PROCESSING
After the preprocessing described in the MATERIAL section, the raw data are standardized and the 'Processed data' have a range of 0 to 1. As shown in Table 2, the 'Processed data' give better performance than the 'Raw data' across the different classifiers. For the 'Processed data', the LR and ADB models have higher accuracy than the others, whereas XGB and GNB obtain better scores than the other models in sensitivity and specificity, respectively. These differences arise from the heterogeneity of the models, which is the foundation of stacking. Table 3 shows the features selected by three typical feature selection methods. The selected features can help doctors better understand the relative importance of each feature. Furthermore, therapeutic interventions can be targeted to reduce or even eliminate the harmful influence of some of the selected features.
Table 4 shows the accuracy of CHD diagnosis for different feature selection methods, including CHI2, mutual information, variance, RFE, SVC [41] and LR, with different k values. The classification algorithm is SVC (C = 1.0, kernel = 'linear'). The accuracy first increases and then decreases as k increases. The values k = 15, 17, 20 and 22 are highlighted, where the accuracy of the model is higher than 90%. Table 4 shows that when k = 15 an accuracy of 91.1% is obtained by the LR.
In Table 5, the results of two representative feature selection methods, the wrapper and embedded approaches, are compared. Table 5 shows that the best performance is achieved by RFECV, whose results also have a smaller standard deviation. Therefore, RFECV is chosen as our feature selection method.

C. RESULTS FOR THE PROPOSED METHOD AND OTHER METHODS
The data are split into a training set and a testing set with ratios of 7:3 and 4:6. The training set is then used to train the models and to calculate the pairwise PCC and MIC between the models. Fig. 1 and Fig. 2 show the PCC for the two split ratios of the same data. Fig. 3 and Fig. 4 show the MIC for the two split ratios of the same data. For both ratios, the PCC and MIC of GNB always take the minimum values, so GNB can be selected as one of the best combining classifiers. Seven optimal models (GNB, GB, RF, ET, ADB, MLP, XGB) are selected as the base-level models by Alg. 1 and Alg. 2. Table 6 compares the results of our method with those of other methods, including the one proposed by the publisher of the dataset. The proposed stacking-based model achieves significant improvements in nearly all measures. Our method achieves an accuracy, sensitivity, specificity and F1-score of 95.43%, 95.84%, 94.44% and 96.77%, respectively, for the detection of CHD. The model parameters we use are given in Table 7.
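The selection of the least-correlated base model can be sketched with the PCC alone (the MIC needs a dedicated library and is omitted here). The names below are hypothetical, and ranking models by their mean absolute PCC with all other models is an assumed reading of "minimum correlation", not a rule stated in the text.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant predictions carry no linear relationship
    return cov / sqrt(var_a * var_b)

def least_correlated(predictions):
    """Return the model whose mean |PCC| with all other models is smallest."""
    def mean_abs_pcc(name):
        others = [n for n in predictions if n != name]
        return sum(abs(pearson(predictions[name], predictions[n])) for n in others) / len(others)
    return min(predictions, key=mean_abs_pcc)

# Toy usage: models "A" and "B" mostly agree, model "C" behaves differently.
preds = {
    "A": [1, 1, 0, 0, 1, 0, 1, 0],
    "B": [1, 1, 0, 0, 1, 0, 0, 0],
    "C": [0, 1, 0, 1, 0, 1, 1, 0],
}
pick = least_correlated(preds)
```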

D. RECEIVER OPERATING CHARACTERISTIC CURVES
As shown in Fig. 5, the receiver operating characteristic (ROC) curve is also used to evaluate the proposed method. The ROC curve is a simple yet complete representation of all possible combinations of the relative frequencies of correct and incorrect decisions [50]. A series of sensitivity and specificity values is calculated, and the curve is drawn with sensitivity as the ordinate and (1 - specificity) as the abscissa. The larger the area under the curve is, the higher the diagnostic accuracy. On the ROC curve, the point closest to the upper left corner of the graph is the critical value with both high sensitivity and high specificity. Our proposed method has a high mean area under the curve (AUC) of up to 0.95, as shown in Fig. 5.
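The construction just described can be sketched in a few lines: sweep a decision threshold over the predicted scores, record (1 - specificity, sensitivity) at each step, and integrate with the trapezoidal rule. The function names and toy scores are hypothetical and the sketch is only meant to make the definition concrete.

```python
def roc_points(y_true, scores):
    """Return ROC points (1 - specificity, sensitivity), from (0, 0) to (1, 1)."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Toy usage: four subjects, two with the disease.
pts = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
area = auc(pts)
```

With these toy scores one diseased subject is ranked below one healthy subject, so the curve falls short of the upper left corner and the area is less than 1.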

E. APPLYING OUR PROPOSED METHOD ON OTHER DATASETS
We test our method on three other data sets to show that it does not depend on a particular data set. The first data set is the Statlog heart disease data set, consisting of 270 subjects [51]. Each subject is presented with 14 features including age, sex, chestpaintype, restbloodpressure, serumcholestoral, fastingbloodsugar, reselectrocardiographic, maxheartrate, exerciseinduced, oldpeak, slope, majorvessels, thal. Each subject is classified into one of two categories: normal and abnormal. Table 8 shows the comparative results of our method and other methods on the Statlog dataset. Our proposed method achieves an accuracy, sensitivity and specificity of 90.7%, 85.8% and 94.7%, respectively, for the detection of CHD on the Statlog dataset.
The second data set is the SPECTF heart data set, which describes CHD diagnosis with cardiac single photon emission computed tomography (SPECT) images [51]. The data set contains 44 continuous features derived from a total of 267 SPECT images, which can be classified into 2 categories: normal and abnormal. Table 10 shows the comparative results of our method and other methods on the SPECTF dataset. Our proposed method achieves an accuracy, sensitivity and specificity of 92.2%, 98.2% and 69.0%, respectively, for the detection of CHD on the SPECTF dataset.
The third data set is the cardiovascular disease database (https://www.kaggle.com/sulianova/cardiovascular-diseasedataset) from the Kaggle competition platform. It contains 70000 records of subjects with or without cardiovascular disease. Each patient is described by 11 features, which can be categorized into 3 types: objective, examination and subjective features. The objective features provide factual information, namely age, height, weight and gender. The examination features are results of medical examinations, comprising the systolic and diastolic blood pressures as well as the concentrations of cholesterol and glucose. The subjective features are information given by the subjects, including smoking status, alcohol intake and physical activity. Our proposed method achieves an accuracy, sensitivity and specificity of 73.2%, 69.3% and 77.0%, respectively, for the detection of CHD on this large dataset.
As shown in Table 8, Table 9 and Table 10, our proposed fusion model can significantly improve the performance. All algorithms are used with their default parameter settings. The three data sets have different features and different kinds of relationships among them. Nevertheless, the performance demonstrates the robustness of our algorithm, indicating that the results achieved on the Z-Alizadeh Sani dataset are not due to chance.

V. DISCUSSION
In our proposed method, stacking is combined with an exhaustive search over classifier combinations for CHD diagnosis. The performance for the detection of CHD is higher than that of the known approaches in the literature. Additionally, the standard deviation of the results obtained by our method is also the smallest, indicating better stability. A model that is more stable while also being more accurate is more suitable for clinical application.
The data are normalized in order to compare the features more reasonably. Feature evaluation by filter, wrapper and embedded approaches is applied to select several relatively important features for the construction of our proposed model. The experimental results of the eight feature selection methods have important reference value for other research in this field. When selecting the best combination in the base level, the PCC and MIC are calculated to find the classifier with the lowest correlation. This step greatly accelerates the training process of the model. Then, an enumeration process is used to determine the other models in the optimal combination. In other words, among the models included in the final optimal combination, one is determined via the PCC and MIC, and the others are selected by the proposed enumeration process. This enumeration procedure is given as pseudocode (Alg. 1 and Alg. 2) in the manuscript.
High sensitivity combined with low specificity means that more subjects without CHD will be misdiagnosed as patients. The proposed model achieves both high sensitivity and high specificity, which is clinically significant. The development of non-invasive measurements will be helpful for people who are suspected of having CHD. These people will not need to undergo CAG at the outset. They can be tested with our method first, and the doctor can then make a better decision on whether the patient needs to undergo CAG.
The decisions of 7 classifiers are combined in order to produce accurate recognition results. Our proposed method exploits the advantages of 7 ML algorithms. Owing to the complementarity among multiple models, the proposed method can serve as a reference in other applications. We also obtain good results on three other datasets, which demonstrates the robustness of our algorithm.
Our proposed method has two limitations. First, the model parameters we use are not optimal. We mainly focus on the way of searching for the model combination based on ten-fold cross-validation, but changes in the model parameters will also have a great impact on the final results. Second, training multiple models in each level costs a lot of time, and the method cannot narrow the search quickly. Because of the 10-fold cross-validation and the use of multiple models, the experiments take a long time to train all models in each fold. However, once the model is trained, testing is convenient. Since the data set is small, we add cross-validation at the beginning to prevent overfitting. This complicates the framework, but it may make the results more persuasive.
The size of the dataset and the availability of useful features are key issues in the field of ML. In the future, our goal is to build partnerships with hospitals to enlarge the CHD data set and to extract more features from physiological signals.

CONFLICT OF INTEREST
There is no conflict of interest in this work.
Since 2000, he has been a Professor in biomedical engineering with the School of Control Science and Engineering, Shandong University, where he was the Head of the Research Group of Noninvasive Evaluation of Cardiovascular Function. He has authored more than 100 articles. He also holds more than 15 Chinese invention patents. His research interests include novel solutions for noninvasive detection of cardiovascular function, biomedical measurements, and biomedical devices.
His research interests include biomedical signal processing, QT interval variability, and machine learning.
HAN LI received the B.S. degree in automation from Shandong University, Jinan, China, in 2014, where she is currently pursuing the Ph.D. degree with the School of Control Science and Engineering.
Her current research interests include the application of computational intelligence in the detection of cardiovascular diseases and biomedical signal processing.
HUAN ZHANG received the B.S. degree in electronic science and technology from Shanxi University, Taiyuan, China, in 2013. She is currently pursuing the Ph.D. degree with the School of Control Science and Engineering, Shandong University, China.
Her research interests include biomedical signal processing, machine learning, and early detection of coronary artery disease. VOLUME 8, 2020