Feature Selection by mRMR Method for Heart Disease Diagnosis

Heart disease has become a non-negligible threat to human health in recent years. Without timely diagnosis and treatment, patients often suffer disability or even death. Moreover, diagnostic accuracy depends heavily on individual doctors' experience, and the many factors associated with heart disease place a heavy burden on them, which makes the situation worse. Therefore, introducing computer-aided techniques to assist doctors in diagnosis is a feasible way to improve heart disease treatment. At present, researchers usually take the processed dataset (13 features), selected by doctors from the unprocessed dataset (74 features) in the UCI Machine Learning Repository, and apply feature selection methods to it; this is inappropriate because the feature scale is so small. The unprocessed dataset's value is neglected, although it may contain latent information. A comprehensive comparison is needed to demonstrate the unprocessed dataset's advantages, and the incremental feature combination method should be verified. As the minimum Redundancy - Maximum Relevance (mRMR) method has achieved great success in feature selection, applying it as a feature filter can enhance classification accuracy. Thus, in this research, we introduced the mRMR method as a filter for feature selection and made a comprehensive comparison against several methods, including Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Kendall, Random Forest, and other research works, on several metrics. The results show that, in most cases, the unprocessed dataset enhances the algorithms' performance, the incremental feature selection method is effective, and mRMR is superior to the other methods: it achieves the highest accuracies with the fewest supportive features, reaching 100% accuracy with 8 features on the Cleveland dataset, 98.3% with 14 features on Hungarian, and 99% with 9 features on Long-Beach-VA, respectively.
Furthermore, we find that some features regarded by doctors as useless play a part in classification, which should attract doctors' attention.


I. INTRODUCTION
every year due to their suddenness. Without timely treatment, patients are prone to suffer disability or even death. Therefore, diagnosing heart disease early and accurately is an effective way to save lives [1].

With the development of medical science, doctors have discovered many symptoms associated with heart disease, such as high blood pressure, stress, diabetes, etc. [2], [3], [4]. However, these factors only indicate that a patient has a high possibility of suffering from heart disease, and diagnostic accuracy that depends on the doctor's experience will lead to misdiagnoses and missed diagnoses. To overcome this deficiency in the current situation, and given the prosperity of AI [5], [6], researchers have tried to use AI's power to assist doctors in medical diagnosis [7], [8]. In some areas, AI's diagnostic ability has already surpassed human beings [9].

(The associate editor coordinating the review of this manuscript and approving it for publication was Donato Impedovo.)

In the beginning, some researchers only applied classifiers to the original dataset. Atallah et al. [10] directly adopted the majority-voting method to produce a more powerful classifier. This ensemble method consists of four machine learning methods: Stochastic Gradient Descent (SGD), K-Nearest Neighbor, Random Forest, and Logistic Regression. The highest prediction accuracy is 88%.

Considering the correlation between features, the correlations were classified into three categories: high, medium, and low; then a heuristic algorithm was used to search for the optimal combination. Results show that the feature selection method provides a high level of predictive classification performance. You et al. [20] reduced high-dimensional multi-category data by PLS-based local recursive feature elimination.
They considered that single-feature measures for evaluating feature importance are based on the assumption of independence among features, whereas the features' correlations will influence the prediction; such methods ignore the interaction within features.

Wrapper methods incorporate a classifier into a predetermined objective function that evaluates the appropriateness of the predictor subsets through an exhaustive search.

The optimization of feature combinations is an NP-hard problem. To obtain the optimal feature combination, all 2^N − 1 subsets would have to be tested by classifiers (assuming the feature number is N), so the time cost increases exponentially [21]. Amini et al. [22] used a two-layer feature selection method to solve a regression problem. The method consists of a genetic algorithm and an elastic net: the genetic algorithm, one of the heuristic methods, is the first layer and selects a locally optimal feature subset; the elastic net, as the second layer, eliminates redundant features using penalty factors. With current computational power, it is unrealistic to search for the optimal feature combination exhaustively [23].
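The exponential cost of the exhaustive wrapper search can be sketched as follows (a minimal illustration with hypothetical feature names; no real dataset is involved):

```python
# Enumerating every non-empty feature subset shows why exhaustive wrapper
# search is infeasible: the count is 2^N - 1 and grows exponentially in N.
from itertools import combinations

def all_subsets(features):
    """Yield all 2^N - 1 non-empty subsets of the feature list."""
    for r in range(1, len(features) + 1):
        for combo in combinations(features, r):
            yield combo

features = ["age", "sex", "trestbps", "chol"]  # N = 4, illustrative names
subsets = list(all_subsets(features))
print(len(subsets))  # 2**4 - 1 = 15 subsets a wrapper would have to evaluate
```

With the full 74-feature dataset, the same enumeration would require 2^74 − 1 evaluations, which is why heuristic or filter methods are used instead.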

Since the emergence of Shannon information entropy [24], filter methods based on this theory have played an important role in AI, for example in the famous decision tree algorithm. The Mutual Information Maximization (MIM) method was first proposed to reduce the uncertainty of class labels. However, this method merely considers a feature's relevancy and ignores its redundancy, so redundancy remains among the selected features. To promote mutual information's effect, Peng et al. [25] introduced the concept of feature redundancy into mutual information, which greatly improves its applicability. In heart disease datasets, missing values are a common phenomenon and have a profound negative influence on feature selection methods. The mRMR method can ignore specific feature values and thus resist the negative influence brought by missing values.

Current works still have these drawbacks: they mainly apply algorithms and feature selection methods to the 13-feature dataset. We think it is unnecessary to employ feature selection on so few features. The algorithms' performance is limited by the dataset, and the 74-feature dataset should contain more supportive information. The interaction within features must be considered, and the mutual information method is a good choice for this. Besides, the effectiveness of the incremental feature combination method should be verified. In this research, we emphasized the importance of the 74-feature dataset and removed the performance limitation imposed by the dataset. The mRMR method is selected for feature selection; it is a filter method with high efficiency that considers the interaction within features well. What is more, we verified the effectiveness of the incremental feature combination method and conducted comprehensive experiments.
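The greedy mRMR criterion of Peng et al. — pick the feature maximizing relevance to the label minus mean redundancy with already-selected features — can be sketched for discrete data as follows (a minimal illustration with synthetic arrays; the mutual-information estimator and the toy data are our own, not the paper's implementation):

```python
import numpy as np

def mutual_info(x, y):
    """Mutual information I(x; y) in nats for discrete arrays."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))  # joint probability
            if pxy > 0:
                px, py = np.mean(x == xv), np.mean(y == yv)
                mi += pxy * np.log(pxy / (px * py))
    return mi

def mrmr(X, y, k):
    """Greedy mRMR: maximize relevance minus mean redundancy, k rounds."""
    selected, remaining = [], list(range(X.shape[1]))
    relevance = [mutual_info(X[:, j], y) for j in remaining]
    for _ in range(k):
        def score(j):
            red = (np.mean([mutual_info(X[:, j], X[:, s]) for s in selected])
                   if selected else 0.0)
            return relevance[j] - red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy check: column 0 equals the label, columns 1-2 are independent of it.
y = np.array([0, 0, 1, 1, 0, 0, 1, 1])
X = np.column_stack([y,
                     [0, 1, 0, 1, 0, 1, 0, 1],
                     [0, 0, 0, 0, 1, 1, 1, 1]])
print(mrmr(X, y, 1))  # -> [0]: the informative feature is picked first
```

The key difference from MIM is the redundancy term: a feature that duplicates an already-selected one scores no better than noise, which is exactly how mRMR suppresses redundant selections.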
The within-class scatter matrix S_w is defined as:

S_w = Σ_{j=1}^{c} Σ_{i=1}^{N_j} (X_i^j − µ_j)(X_i^j − µ_j)^T

where X_i^j is the ith sample of class j, µ_j is the mean value of class j, c is the number of classes, and N_j is the number of samples in class j.
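These LDA scatter matrices can be sketched on toy data as follows (a hypothetical two-class example of our own; S_b weights the class-mean scatter by class size, and the projection maximizing the det|S_b|/det|S_w| ratio is taken from the eigenvectors of S_w^{-1} S_b):

```python
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal([0, 0], 0.5, size=(50, 2))   # class 0 samples
X1 = rng.normal([3, 1], 0.5, size=(50, 2))   # class 1 samples
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)  # per-class means
mu = np.vstack([X0, X1]).mean(axis=0)        # mean over all samples

# Within-class scatter S_w: sum of each class's sample scatter
Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
# Between-class scatter S_b: class-size-weighted scatter of class means
Sb = 50 * np.outer(mu0 - mu, mu0 - mu) + 50 * np.outer(mu1 - mu, mu1 - mu)

# The ratio is maximized by eigenvectors of S_w^{-1} S_b (S_w nonsingular)
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w = eigvecs[:, np.argmax(eigvals.real)].real  # discriminant direction
```

Projecting both classes onto w separates their means far beyond their spreads, which is the stated goal of the largest inter-class and smallest intra-class differences.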
The between-class scatter matrix S_b is defined as:

S_b = Σ_{j=1}^{c} N_j (µ_j − µ)(µ_j − µ)^T

where µ is the mean of all classes. To achieve the largest inter-class differences and the smallest intra-class differences, we can maximize the ratio det|S_b| / det|S_w|. If S_w is a nonsingular matrix, this ratio is maximized when the column vectors of the projection matrix W are the eigenvectors of S_w^{-1} S_b.

Kendall's τ correlation is an index for determining whether there is a correlation between two variables. It is defined as the difference between the number of concordant and discordant pairs of values, normalized by the total number of pairs. Let x = (x_1, ..., x_n) and y = (y_1, ..., y_n) be the two sequences to compare, and define A as the set of all pairs of indices:

A = {(i, j) : 1 ≤ i < j ≤ n}, with n_A = |A| = n(n − 1)/2

With n_C concordant pairs and n_D discordant pairs, Kendall's τ rank correlation is defined as:

τ = (n_C − n_D) / n_A

The value of τ ranges from 1, when n_C = n_A, to −1, when n_D = n_A. Sometimes, to eliminate the influence of tied pairs with x_i = x_j or y_i = y_j, Kendall's correlation (τ_b) is instead defined as:

τ_b = (n_C − n_D) / sqrt((n_A − n_x)(n_A − n_y))

where n_x and n_y are the numbers of tied pairs in x and y, respectively.

Many factors linked with heart disease can appear in one person, and high values of these factors usually mean the person has a high potential risk of heart disease. Besides, in the dataset, label 0 represents health and label 1 means illness, so treating the label values as the extent of the disease conveniently satisfies Kendall's application condition. Because Kendall's correlation method only judges the strength of the connection between two variables, it can handle the interference of non-linear correlation. As far as we know, no researcher has introduced Kendall correlation into heart disease prediction. The detailed Kendall correlation feature selection is described in Algorithm 1. We defined a rule for eliminating some features directly, such as features missing too many values.

Random Forest works as follows: 1) randomly drawing sample subsets from the dataset; 2) using the subsets to train the decision trees.
In the training process, the split rule for each node is to randomly select k features from all features and then choose the optimal split node among these k features to divide the sub-trees.

Algorithm 1 Kendall τ
Input: the heart disease dataset D, selected feature number n;
Output: dataset D_p with the target features;
1: load the original dataset D, extract label L from dataset D, D_t = ∅;
   ...
   add (h_f, τ_f) into dict d;
10: end for
11: sort d by τ_f in descending order;
   ...
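The Kendall-based selection of Algorithm 1 can be sketched end to end as follows (a self-contained illustration with hypothetical toy data; τ is computed directly from the concordant/discordant pair counts defined above, and the feature container and names are our own):

```python
def kendall_tau(x, y):
    """tau = (n_C - n_D) / n_A over all n(n-1)/2 index pairs (ties count as neither)."""
    n = len(x)
    n_c = n_d = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                n_c += 1   # concordant pair
            elif s < 0:
                n_d += 1   # discordant pair
    return (n_c - n_d) / (n * (n - 1) / 2)

def kendall_select(features, label, k):
    """Rank features by |tau| against the label, descending; keep the top k."""
    scores = {name: abs(kendall_tau(col, label)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

label = [0, 0, 0, 1, 1, 1]          # 0 = health, 1 = illness
features = {
    "f_relevant": [1, 2, 2, 5, 6, 7],  # rises with the label
    "f_noise":    [3, 1, 4, 1, 5, 2],  # unrelated to the label
}
print(kendall_select(features, label, 1))  # -> ['f_relevant']
```

Because the label is binary, only cross-class pairs contribute, which is exactly the "treat the label as the extent of the disease" reading used in the text.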

The decision trees are generated repeatedly until the forest is complete.

The mRMR method is compared with the other methods through a paired two-tailed t-test with the P-value threshold set at 5%. '+', '−', and '=' indicate that mRMR performs better than, worse than, and equal to the corresponding method, respectively.

Table 2 shows the results on the Cleveland, Hungarian, and Long-Beach-Va datasets.

In Fig. 2, we test the incremental feature combination method's performance based on the Kendall τ and Random Forest methods on the three datasets. In this experiment, 30 selected features are accumulated in order; the initial number of features is five, and the more important features are added later, so the first feature added is the 25th most important, then the 24th, and so on. With the increase of the feature number, the accuracy on Cleveland and Hungarian grows fast, while there is no obvious accuracy growth on the Long-Beach-Va dataset until the feature number exceeds 25. In particular, when the feature number is below 15, the accuracy growth is small and stable, which indicates that unimportant features have little influence on the prediction and that most of the accuracy gain comes from the important features. When the feature number approaches 30 on the Cleveland and Long-Beach-Va datasets, each feature addition brings an obvious accuracy improvement, while this phenomenon nearly disappears on the Hungarian dataset. From these results, we conclude that the incremental combination of important features is reasonable and effective for obtaining a comparatively optimal feature combination.

On the Cleveland dataset (Fig. 3), we can see that the results of RF, Kendall, and mRMR are similar.
Before reaching the highest accuracy, the curves of RF and Kendall show ladder-like growth, which indicates that the addition of some features is useless and that redundancy exists among the features, while mRMR's accuracy grows rapidly, indicating that it accurately catches the key features. Thus, we conclude that the mRMR method has the best performance, since it attains both the highest accuracy and the smallest feature number (8). PCA performs worst among the four feature selectors. As for the specific classifiers, KNN shows the same tendency in all four figures: its accuracy grows at the beginning and then decreases as features are added. Except for KNN, nearly all methods maintain high accuracy.
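The incremental evaluation described above can be sketched as follows (a minimal illustration on synthetic data; the tiny nearest-centroid classifier only keeps the example self-contained and stands in for the paper's actual classifiers, and all names are our own):

```python
import numpy as np

class NearestCentroid:
    """Tiny stand-in classifier: predict the class with the nearest mean."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        d = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        return self.classes_[np.argmin(d, axis=1)]

def incremental_accuracy(ranked, X_tr, y_tr, X_te, y_te, start=5):
    """Grow the feature set one ranked feature at a time; record test accuracy."""
    accs = []
    for k in range(start, len(ranked) + 1):
        cols = ranked[:k]
        clf = NearestCentroid().fit(X_tr[:, cols], y_tr)
        accs.append(float(np.mean(clf.predict(X_te[:, cols]) == y_te)))
    return accs

# Synthetic data: the first 5 of 8 features track the label, the rest are noise.
rng = np.random.default_rng(0)
y_tr, y_te = rng.integers(0, 2, 100), rng.integers(0, 2, 50)
X_tr = np.hstack([y_tr[:, None] + rng.normal(0, 0.5, (100, 5)),
                  rng.normal(0, 1, (100, 3))])
X_te = np.hstack([y_te[:, None] + rng.normal(0, 0.5, (50, 5)),
                  rng.normal(0, 1, (50, 3))])
accs = incremental_accuracy(list(range(8)), X_tr, y_tr, X_te, y_te, start=5)
```

The returned curve is exactly what Figs. 2-4 plot: one accuracy value per feature count, starting from the five initial features.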

In Fig. 4, none of the classifiers' results show obvious growth except Random Forest, which indicates that most of the selected features are redundant and that Random Forest has an advantage in mining latent information compared with the other classifiers. Besides, nearly all the accuracies are lower than the results on Cleveland, which validates the previous experimental results; the phenomenon of KNN's accuracy decline also disappears on the Hungarian dataset. In terms of accuracy, mRMR should have the best performance, and RF achieves the highest accuracy, more than 98%.

The feature rankings are listed in Table 4. We sort the features by their importance: in the above methods, a smaller serial number means a more important feature. To display the feature differences between the above algorithms and the medical view, we highlight the processed features in bold for Kendall, Random Forest, and mRMR. The first 13 features provided by the selectors are quite different from the medical view. For example, 3 (age), 4 (sex), 10 (resting blood pressure), and 12 (serum cholesterol in mg/dl) usually rank behind 13, so we think they may not play a vital role from the medical view. In contrast, 60 (ladprox), 61 (laddist), 63 (cxmain), and 67 (rcaprox) often appear in the first 13 columns for Kendall and Random Forest. According to the description of the dataset, these four features are recorded but their meanings are not explained; they should contain some latent information that helps diagnose the disease.
VOLUME 10, 2022

Overall, the mRMR method should be the best one. In addition, the best methods of these feature selectors are the same on the three datasets. Concerning the best accuracies, different methods show their best performances on different datasets.
Compared with PCA's results, the best classifier on Cleveland is ANN, with accuracy increased by 1.4% and the feature number decreased by 11; on Hungarian it is Random Forest, with accuracy increased by 4.9% and the feature number increased by 1; on Long-Beach-Va it is GB, with accuracy increased by 11.7% and the feature number decreased by 17. We can also observe that the optimal feature number is usually below 17 after feature selection. What is more, when we compare the results in the tables above, nearly the first ten features can achieve more than 90% of the best performance value. With the same number of features and the same classifier, the accuracy of mRMR is higher than that of the other methods, which demonstrates that it selects features more accurately.

In Table 6, we compare our work with other researchers' on the Cleveland dataset. Most researchers apply the dataset with 13 features. By comparison, the mRMR method not only has the best accuracy among these methods but also uses the fewest features. Furthermore, mRMR enables nearly all the classifiers to achieve their highest accuracy.

As for which method has more advantages than the others, we reckon it depends on the data and the task type. What is more, we also find that some features ignored by doctors promote the prediction accuracy; these features may contain information that helps doctors diagnose the disease. In the future, we plan to collect larger-scale heart disease datasets and verify the mRMR method on them.

The authors would like to thank the experts who have contributed to the development of the heart disease database.

GAOSHUAI WANG is currently pursuing the Ph.D. degree with UTBM. His research interests include machine learning, computer vision, optimization, and heuristic learning.

He is an Associate Professor HDR with the Université de Technologie de Belfort-Montbéliard (UTBM) and the Head of Software Engineering. He is the Deputy Director of the Nanomedicine Laboratory, Imagery & Therapeutics, of the Université de Franche-Comté (UFC), and the Research Team Leader of Health Systems Organization. His research interests include data mining and machine learning for decision support in the field of e-Health. He is a member of the Science Steering Committee of the annual conferences IADIS e-Health and e-Medisys. He was the Co-Chair of the First International Conference eTelemed'09 and has been its Advisory Chair since 2010. He has organized many conferences and chaired several technical sessions. He is an expert to the ANRT France, ARI Alsace, and CNRST Morocco. He is an editorial board member of four international journals and the author/coauthor of three books and many international publications in refereed journals and conferences.