A New Medical Decision Support System for Diagnosing HFrEF and HFpEF Using ECG and Machine Learning Techniques

As heart failure (HF) appears to be a growing epidemic, no case should be overlooked in its diagnosis. The two subtypes of HF defined by left ventricular ejection fraction (LVEF) are HF with reduced ejection fraction (HFrEF) (LVEF <inline-formula> <tex-math notation="LaTeX">$\le40$ </tex-math></inline-formula>%) and HF with preserved ejection fraction (HFpEF) (LVEF <inline-formula> <tex-math notation="LaTeX">$\ge50$ </tex-math></inline-formula>%). HFrEF is easier to diagnose, whereas the diagnosis of HFpEF is more complex, is difficult even for specialists, and remains an open problem in medicine. Since LVEF appears normal in HFpEF (LVEF <inline-formula> <tex-math notation="LaTeX">$\ge50$ </tex-math></inline-formula>% is also found in healthy individuals), HFpEF can be confused with chest diseases because of some similar symptoms. The diagnosis of HF subtypes is ideally made using echocardiography. Echocardiography should be performed in all patients with HF; however, it is expensive and requires specialists. Even in high-resource regions, this test is not always performed, and treatment may need to be initiated before the echocardiographic data are obtained. For such situations, economical and practical systems are required. In this study, a medical decision support system was developed to detect HFrEF and HFpEF cases using only 3-lead ECG. From the ECG data of 61 volunteers, 37 features were extracted, of which 16 were Yule-Walker and Burg’s method parameters and 21 were time-domain features. The 37 features were then reduced by feature selection, and three-class classification was performed with maximum accuracy using only 4 features. This study aimed to determine whether individuals with HF symptoms had HFrEF, had HFpEF, or were healthy. Four machine learning algorithms were used for classification. 
The best classification accuracy rate was 100% for k-NN, and significant results were also obtained from the other three algorithms: SVMs, Decision Trees, and Ensemble Bagged Trees.


The associate editor coordinating the review of this manuscript and approving it for publication was Vicente Alarcon-Aquino.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

I. INTRODUCTION
Ejection fraction (EF) is the fraction of the blood in a ventricle that is ejected with each beat of the heart. It can be formulated as (Stroke Volume) / (End-Diastolic Volume). Left ventricular ejection fraction (LVEF) is a measure of pumping efficiency into the systemic circulation, while right ventricular ejection fraction is a measure of pumping efficiency into the pulmonary circulation. EF is usually measured by echocardiography and is used as an overall measure of a person's heart function. LVEF is usually low in patients with systolic heart failure, and it is an important determinant of the severity of systolic heart failure. Unlike the heart rate in a healthy person, which can be high or low and vary in its daily course, low LVEF is always associated with disease [1].
Heart failure (HF) can be defined as a functional or structural disorder that leaves the heart unable to supply sufficient oxygen to the tissues [2]. Clinical HF is described in the European HF Guidelines as a clinical syndrome arising from a functional or structural disorder in the heart, in which patients have typical symptoms (such as shortness of breath, ankle swelling, and fatigue) and signs (such as apex beat displacement, pulmonary crepitation, and increased jugular venous pressure).
As systolic dysfunction worsens, LVEF gradually decreases, and end-systolic and end-diastolic volumes generally increase. LVEF is important not only as a manifestation of the disease but also because patients are identified by LVEF in most clinical studies.
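The relation between ejection fraction and the ventricular volumes mentioned above can be written out explicitly. The following is the standard textbook form, where SV is the stroke volume, EDV the end-diastolic volume, and ESV the end-systolic volume; the numerical values are purely illustrative and are not taken from this study.

```latex
\mathrm{EF} = \frac{\mathrm{SV}}{\mathrm{EDV}} \times 100\%
            = \frac{\mathrm{EDV} - \mathrm{ESV}}{\mathrm{EDV}} \times 100\%,
\qquad
\text{e.g. } \mathrm{EDV} = 120\ \mathrm{mL},\ \mathrm{ESV} = 48\ \mathrm{mL}
\;\Rightarrow\;
\mathrm{EF} = \frac{72}{120} \times 100\% = 60\%.
```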
The principal clinical studies of patients with systolic HF, or HF with reduced EF (HFrEF), generally included patients with LVEF < 40%, and currently proven treatments are effective only in this group of patients. On the other hand, studies have also been conducted with HF patients with LVEF > 40-45% and no other cardiac disorders. In some of these patients, LVEF is completely normal (usually > 50%) and there is no significant reduction in systolic function. The term HF with preserved ejection fraction (HFpEF) was therefore introduced to define these patients [3], [4].
The underlying pathophysiological abnormality in HFpEF patients is thought to be LV diastolic dysfunction. No single echocardiographic parameter is sufficiently precise and reproducible to diagnose LV diastolic dysfunction. Therefore, a comprehensive echocardiographic examination that correlates two-dimensional and Doppler data is suggested [3], [5], [6].
Doppler ultrasound devices are expensive, and the echocardiographic examination is laborious and requires an expert. Because HFpEF is a complicated syndrome that may result from several functional and structural cardiac disorders rather than from a single disease, correct diagnosis can be difficult even for HF specialists [7].
By applying AI to ECG, it was shown that subtle changes in the QRS complex could be correlated with cardiac conditions such as myocardial fibrosis and congestive HF, as well as with the efficacy of diuresis treatment, so that faster and less costly evaluation could be made [23]. Studies were carried out to predict HF-related death and hospital readmission using AI and ML algorithms [24]. Some of those studies used echocardiographic data [25], or many parameters simultaneously, such as clinical phenotyping, laboratory, ECG, and echocardiography data [26]. Congestive heart failure (CHF) includes HFrEF and HFpEF; in [27], a binary classification was made as the presence or absence of CHF. In [28], the patient class in the data set included HFrEF, HFpEF, and coronary artery disease. However, the classification was made as the presence or absence of HF, and the patient class was also classified according to the New York Heart Association (NYHA) functional classification. None of the studies in [23], [24], [25], [26], [27], and [28] focused on the diagnosis of HFrEF and HFpEF.
A 10-year HF risk estimation was made for HFrEF and HFpEF with an AI-based system developed using 12-lead ECG data. In that work, LVEF < 50% was accepted for HFrEF and LVEF ≥ 50% for HFpEF. That is, HF with mid-range ejection fraction (HFmrEF) (41% ≤ LVEF ≤ 49%) was also included in HFrEF. There was no evidence that this system could help distinguish between HFpEF and HFrEF [29].
HFrEF cases are easier to diagnose. HFrEF is diagnosed when LVEF ≤ 40% is detected by echocardiography. The underlying phenotypic heterogeneity in HFpEF is more complex than in HFrEF [30]. The diagnosis of HFpEF is difficult even for specialist physicians. Because of some similar symptoms, cases of HF can be confused with cases of chest diseases. Since LVEF appears to be normal in HFpEF, a patient who presents to the hospital with the symptoms of HF can be referred to a pulmonologist. When the patient is evaluated for chest diseases and the findings are normal, no action is taken, and the case of HFpEF is overlooked. If the pulmonologist is careful, considers the possibility of HFpEF, and orders BNP and NT-proBNP blood tests, the case of HFpEF will be detected. (Brain natriuretic peptide (BNP) and N-terminal pro b-type natriuretic peptide (NT-proBNP) blood tests are often used to diagnose HF.) To prevent such confusion and overlooked cases, the aim was to design a medical decision support system based on 3-lead ECG, which is simple to measure and gives results within seconds. Thus, a system was designed that, before echocardiography or troublesome blood tests are performed, provides the doctor with preliminary information on whether the individual has HFrEF, has HFpEF, or is healthy. It was thought that the system, advantageous both economically and in terms of patient comfort, would support the physician in situations that even specialists have difficulty diagnosing.
Only features extracted from 3-lead ECG data were used in the classification. To investigate whether these features alone could be used for the medical decision support system, without demographic or echocardiographic data or other detailed examinations, the difference between classes was examined by statistical methods. The statistically most distinctive features were determined, and classification studies were carried out with ML algorithms using the reduced feature set. The classification was carried out with high accuracy with the k-NN method.

II. MATERIAL AND METHODS
The data were collected from patients who visited the cardiology outpatient clinic of Sakarya University Training and Research Hospital as well as from those who were admitted to the inpatient service. To carry out the study, the ethics committee report numbered 16214662/050.01.04/123 was obtained from the Sakarya University Faculty of Medicine Dean's Office. Informed consent was obtained from each volunteer. Data were collected and recorded with a Biopac MP36 device. Data were obtained from individuals whose echocardiographic results were interpreted by a specialist cardiologist and who were eligible for inclusion in our study.
ECG signals from 61 volunteers (25 years old or older) were recorded from the right ankle and the right and left wrists (Standard Bipolar Lead I). The sampling frequency of the signal was 200 Hz. Recordings of 10 s were taken from each individual. Data recorded from the same volunteer at different time intervals were also used. The dataset contained a total of 180 recordings, of which 60 were HFrEF, 60 were HFpEF, and 60 were from healthy individuals with LVEF above 50%. The demographic information of the volunteers is given in Table 1.

A. PREPROCESSING THE DATA
The workflow diagram of the study is provided in Figure 1. The study followed this flow diagram. After data were acquired from the volunteers, the data were labeled by a specialist cardiologist. First, noise and artifacts were removed from the ECG signal. For this, a Chebyshev Type II band-pass filter was applied in the range of 0.25-100 Hz with a stopband attenuation of 60 dB. A notch filter in the range of 49-51 Hz, also with a stopband attenuation of 60 dB, was applied to remove the mains noise at 50 Hz. Finally, a moving average filter was applied to the signal.
A graphical representation of the ECGs of individuals with HFrEF, with HFpEF, and of healthy individuals (LVEF above 50%) is given in Figure 2. The periodograms of the ECGs, computed with the Fast Fourier Transform, are also given in the figure. Although the ECGs were very similar to each other in the time domain, they were differentiated in the periodogram. Therefore, the periodogram is a useful representation for machine learning methods. Then, feature extraction from the ECG was performed. Statistical analysis was performed and features were selected. Finally, classification was made with the selected features and the diagnosis algorithm was developed.
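The filtering chain described above can be sketched in Python with SciPy (the study itself used MATLAB). This is a minimal sketch under stated assumptions, not the authors' implementation: the filter order and the moving-average window length are not given in the text and are hypothetical here, and because the upper band edge of 100 Hz coincides with the Nyquist frequency at a 200 Hz sampling rate, only the 0.25 Hz lower edge is realised, as a Chebyshev Type II high-pass stage.

```python
import numpy as np
from scipy import signal

FS = 200.0  # sampling frequency in Hz, as stated in the text


def preprocess_ecg(x, fs=FS, order=4, ma_len=5):
    """Baseline removal, mains notch, and smoothing (order and ma_len are assumed)."""
    x = np.asarray(x, dtype=float)
    # Chebyshev Type II high-pass: 60 dB stopband attenuation, stopband edge 0.25 Hz.
    # Assumption: stands in for the 0.25-100 Hz band-pass, since 100 Hz = fs / 2.
    sos_hp = signal.cheby2(order, 60, 0.25, btype="highpass", fs=fs, output="sos")
    # Chebyshev Type II band-stop over 49-51 Hz with 60 dB attenuation (mains notch).
    sos_notch = signal.cheby2(order, 60, [49.0, 51.0], btype="bandstop",
                              fs=fs, output="sos")
    y = signal.sosfiltfilt(sos_hp, x)      # zero-phase (forward-backward) filtering
    y = signal.sosfiltfilt(sos_notch, y)
    # Moving average filter; the window length is a hypothetical choice.
    kernel = np.ones(ma_len) / ma_len
    return np.convolve(y, kernel, mode="same")
```

Second-order sections are used because the 0.25 Hz edge lies very close to zero frequency relative to the sampling rate, where transfer-function coefficients can become numerically fragile.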
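The periodogram comparison mentioned above could be reproduced along the following lines. This is a sketch only: the function name is hypothetical, and SciPy's default rectangular window is assumed because the text does not specify one.

```python
import numpy as np
from scipy import signal


def ecg_periodogram(x, fs=200.0):
    """Return frequencies (Hz), power spectral density, and the peak frequency."""
    freqs, psd = signal.periodogram(np.asarray(x, dtype=float), fs=fs)
    peak_freq = float(freqs[np.argmax(psd)])  # frequency with the largest power
    return freqs, psd, peak_freq
```

For a 10 s recording at 200 Hz (2000 samples), the frequency resolution of this estimate is 0.1 Hz.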

B. ECG FEATURES
First, 21 features were extracted from the ECG in the time domain. These features are given in Table 2 [31]. In the table, the feature numbers are in the first column, the feature names are in the second column, and the feature formulas are in the third column. In addition to these, 16 parametric features were added: 8 Yule-Walker output parameters (4 normalized autoregressive (AR) parameters corresponding to a model of order 4, 1 estimated variance of the white noise input, and 3 reflection coefficients) and 8 Burg's method output parameters (4 normalized AR parameters corresponding to a model of order 4, 1 estimated variance of the white noise input, and 3 reflection coefficients). A total of 37 features were extracted. These features were also calculated using MATLAB.
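To illustrate the two parametric estimators, the following NumPy sketch implements the Yule-Walker method (biased autocorrelation with the Levinson-Durbin recursion) and Burg's method for an AR model of order 4. It is not the MATLAB code used in the study. The function names are chosen for illustration, and the selection of the first three reflection coefficients in `ar_features` is an assumption, since the text does not say which three of the four were kept.

```python
import numpy as np


def aryule(x, order):
    """Yule-Walker AR estimate: returns (a, noise_variance, reflection_coeffs)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Biased autocorrelation estimates r[0..order]
    r = np.array([np.dot(x[: n - k], x[k:]) / n for k in range(order + 1)])
    a, e, refl = np.array([1.0]), r[0], []
    for m in range(1, order + 1):
        k = -np.dot(a, r[m:0:-1]) / e          # reflection coefficient of stage m
        a = np.concatenate([a, [0.0]])
        a = a + k * a[::-1]                    # Levinson-Durbin update
        e *= 1.0 - k * k
        refl.append(k)
    return a, e, np.array(refl)


def arburg(x, order):
    """Burg AR estimate: returns (a, noise_variance, reflection_coeffs)."""
    x = np.asarray(x, dtype=float)
    f, b = x[1:].copy(), x[:-1].copy()         # forward / backward prediction errors
    a, e, refl = np.array([1.0]), np.dot(x, x) / len(x), []
    for _ in range(order):
        k = -2.0 * np.dot(f, b) / (np.dot(f, f) + np.dot(b, b))
        a = np.concatenate([a, [0.0]])
        a = a + k * a[::-1]
        f, b = (f + k * b)[1:], (b + k * f)[:-1]
        e *= 1.0 - k * k
        refl.append(k)
    return a, e, np.array(refl)


def ar_features(x, order=4):
    """16 features: per method, 4 AR parameters, 1 variance, 3 reflection coefficients."""
    feats = []
    for estimator in (aryule, arburg):
        a, e, refl = estimator(x, order)
        # Assumption: the first three reflection coefficients are the ones kept.
        feats.extend([*a[1:], e, *refl[:3]])
    return np.array(feats)
```

Both functions return the AR polynomial with a leading 1, so the four AR parameters are the entries after the first.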

C. STATISTICAL ANALYSIS
The ECG signal is not normally distributed, so it was statistically analyzed with non-parametric test methods. The Kruskal Wallis test is preferred when the data are non-normally distributed and there are three or more classes. The test is the nonparametric version of the one-way ANOVA [32]. The aim of using this test was to investigate whether there is a significant difference between the classes (HFrEF, HFpEF, and Healthy). The asymptotic significances (p values) obtained for each feature as a result of the Kruskal Wallis test are provided in Table 3. According to the table, p < 0.05 for features 2, 6, 7, 9, 10, 12, 13, 15, 20, 21, 27, 28, 30, 31, 32, 34, 35, 36, and 37. That is, these features were distinctive for at least two of the three classes being compared.
The Mann Whitney U test is a non-parametric test used to determine whether two sampled groups are from the same population [33]. As post hoc tests, pairwise Mann Whitney U tests were performed to determine whether the features that appeared distinctive in the Kruskal Wallis test were distinctive only for certain pairs of classes or for all three classes. The Mann Whitney U test results are provided in Tables 4, 5, and 6, in which the features are ranked. Because each of the three pairwise comparisons carries its own error rate, the significance level 0.05 is divided by 3 (≈ 0.0167); this is known as the Bonferroni correction [30]. When the p-value of the Mann Whitney U test for a feature is lower than 0.0167, the feature is considered distinctive for those two classes. For example, features 2, 7, 9, 10, 12, 13, 15, 20, 21, 28, 30, 31, 32, 34, 36, and 37 were distinctive for the HFpEF-Healthy binary classification because p < 0.0167 for them.
Then, different numbers of the most relevant features were taken as input, and experiments were conducted. The optimum was determined as the trade-off between fewer features and higher accuracy. Consequently, classification was performed with high accuracy using features 2, 32, 36, and 37. These features ranked highly in the Mann Whitney U test ranking. The average performance parameters of the three-class classification were calculated [35] and are provided in Table 7.
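The first selection stage can be sketched with `scipy.stats.kruskal`. This is an illustrative sketch: the function name and the layout of the feature matrix (recordings in rows, features in columns) are assumptions, and the returned indices are 0-based, whereas the feature numbers in the text are 1-based.

```python
import numpy as np
from scipy import stats


def kruskal_screen(X, y, alpha=0.05):
    """Return 0-based indices of features with Kruskal-Wallis p < alpha, plus all p-values."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    p_values = np.array([
        stats.kruskal(*[X[y == c, j] for c in classes]).pvalue
        for j in range(X.shape[1])
    ])
    return np.flatnonzero(p_values < alpha), p_values
```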
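The post hoc step with the Bonferroni-corrected threshold can be sketched as follows for a single feature, using `scipy.stats.mannwhitneyu`. The function name and the return structure are hypothetical, and a two-sided alternative is assumed because the text does not state one.

```python
from itertools import combinations

import numpy as np
from scipy import stats


def pairwise_mannwhitney(values, labels, alpha=0.05):
    """Pairwise Mann-Whitney U tests for one feature with a Bonferroni threshold."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    pairs = list(combinations(np.unique(labels).tolist(), 2))
    threshold = alpha / len(pairs)  # 0.05 / 3 ≈ 0.0167 for three classes
    results = {}
    for a, b in pairs:
        p = stats.mannwhitneyu(values[labels == a], values[labels == b],
                               alternative="two-sided").pvalue
        # Store the p-value and whether the feature separates this class pair.
        results[(a, b)] = (float(p), bool(p < threshold))
    return threshold, results
```

Sorting features by these pairwise p-values gives one possible ranking of the kind reported in the tables.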

D. CLASSIFICATION
The dataset was organized with individuals in rows and features in columns. During labeling, a label column was added after the feature columns. The data were split into 80% training and 20% testing (validation). Balanced test data were created by randomly taking 20% of the data in each class. Performance parameters were calculated by comparing the results obtained from the classifier with the predetermined label column.
Four different machine learning algorithms were applied: k-NN, Support Vector Machine (SVM), Decision Trees, and Ensemble Bagged Trees.
The k-NN classifier was fine k-NN [36], with the Euclidean distance metric. Cubic SVM was applied [37]: the kernel function was polynomial, the polynomial order was 3, the kernel scale was automatic, and the box constraint was 1. For the Fine Decision Tree [38], the split criterion was Gini's diversity index, the maximum number of splits was 100, and surrogate decision splits were off. Ensemble Bagged Trees, the ensemble classifier, was built with the bag method [39], with Decision Tree as the learner type. The maximum number of splits was 179 and the number of learners was 30.
Performance parameters for the three-class classification made with each classifier are provided in Table 7. The best results were obtained by the k-NN algorithm. Parameters were calculated separately for each of the three classes. For each parameter, the final value indicated in the table is the mean over the three classes.
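A class-balanced 80/20 split of this kind can be sketched in NumPy as below. The function name and the fixed random seed are illustrative assumptions; the text does not describe how the random draw was implemented.

```python
import numpy as np


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with the same test fraction drawn from each class."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    test_idx = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        n_test = int(round(test_fraction * len(idx)))
        # Draw the test recordings of this class without replacement.
        test_idx.extend(rng.choice(idx, size=n_test, replace=False).tolist())
    test_idx = np.sort(np.array(test_idx, dtype=int))
    train_idx = np.setdiff1d(np.arange(len(labels)), test_idx)
    return train_idx, test_idx
```

With 60 recordings per class, this yields 12 test and 48 training recordings in each class.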
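For illustration, a nearest-neighbor classifier with the Euclidean metric can be written in a few lines of NumPy. This sketch assumes that the "fine" preset corresponds to a single neighbor (k = 1), which the text does not state; the function name is hypothetical and feature scaling is omitted.

```python
import numpy as np


def knn_predict(X_train, y_train, X_test, k=1):
    """Predict labels by majority vote among the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)
    # Pairwise Euclidean distances, shape (n_test, n_train)
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    preds = []
    for row in y_train[nearest]:
        classes, counts = np.unique(row, return_counts=True)
        preds.append(classes[np.argmax(counts)])  # ties go to the lowest label
    return np.array(preds)
```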
The parameters were calculated using the following formulas [35]. For class $i$, with $C$ the $3 \times 3$ confusion matrix:
$\sum_{i=1}^{3} C_{ii}$: the total number of elements correctly predicted;
$\sum_{i=1}^{3} \sum_{j=1}^{3} C_{ij}$: the total number of elements;
$p_i = \sum_{k=1}^{3} C_{ik}$: the number of times that class $i$ was predicted (column total);
$t_i = \sum_{k=1}^{3} C_{ki}$: the number of times that class $i$ truly occurs (row total).
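The per-class calculation followed by averaging over the three classes can be sketched as follows. The choice of metrics (sensitivity, specificity, precision, F1, accuracy) and the orientation of the matrix (true classes in rows, predicted classes in columns) are assumptions made for this example.

```python
import numpy as np


def macro_metrics(C):
    """Per-class metrics from a confusion matrix, averaged over classes.

    Assumption: rows are true classes and columns are predicted classes.
    """
    C = np.asarray(C, dtype=float)
    total = C.sum()
    tp = np.diag(C)
    fn = C.sum(axis=1) - tp        # row total minus diagonal
    fp = C.sum(axis=0) - tp        # column total minus diagonal
    tn = total - tp - fn - fp
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    prec = tp / (tp + fp)
    f1 = 2 * prec * sens / (prec + sens)
    return {
        "accuracy": float(tp.sum() / total),
        "sensitivity": float(sens.mean()),
        "specificity": float(spec.mean()),
        "precision": float(prec.mean()),
        "f1": float(f1.mean()),
    }
```

A class that is never predicted makes its precision undefined (division by zero); handling that case is left out of this sketch.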

III. RESULTS AND DISCUSSION
In this study, a new diagnostic algorithm for HFrEF and HFpEF was developed. A machine learning based system was developed using features derived from ECG. A total of 37 features were derived. However, it is difficult to derive so many features in real-time systems. Therefore, the aim was to reduce the number of features and improve the system. To increase the performance of the classifier, features were selected from the 37 by the Kruskal Wallis test and the Mann Whitney U test.
The Kruskal Wallis test determined which of the features were distinctive for the three-class classification (Table 3). As a result of this test, the number of features was reduced from 37 to 19. Then, these features were evaluated by the Mann Whitney U test for the different pairs of classes, and ranked (Tables 4, 5, 6).
After that, different numbers of the most relevant features were taken as input, and experiments were carried out. The optimum was determined as the trade-off between fewer features and higher accuracy. As a result, classification with high accuracy was achieved using features 2, 32, 36, and 37. These features ranked highly in the Mann Whitney U test ranking. The classification performance parameters are provided in Table 7. The results in the table were obtained with only four features, which is a promising performance for a system that may be used in practice.
Patients undergo various diagnostic tests, invasive procedures, and therapies during their illness and generate large amounts of data that can be collected in registries or other institutional databases to assess healthcare utilization, quality and cost of care, and prognosis [40]. For traditional analytical methods, the size, dynamic nature, and complexity of this 'big data' can be difficult to make sense of [41]. ML methods can handle temporal, large-volume, and multimodal data [42]. ML methods are well equipped for handling high-dimensional datasets with numerous variables, which complicate traditional statistical approaches such as regression. ML can also process collinear or correlated data points and evaluate complex interactions between predictors. ML algorithms enable accurate, higher-performing computation of nonlinear relationships [43]. ML comprises computational techniques that can extract patterns from data, acquire knowledge, and apply that knowledge to tasks such as risk prediction [44]. Providing the right care to patients with HF is challenged by diagnostic ambiguity, variability in treatment, complexity in risk stratification, and restricted integration of information about care. ML can play an important role in filling these gaps in HF and has significant advantages over traditional human-derived models [45]. For these reasons, ML methods were preferred for classification in this study. It is thought that this study could be applied to larger datasets in the future and could be integrated into measurement devices.
The results for the four ML algorithms are given in detail for each classifier in Table 8. The classification processes were carried out in order. First, classification was performed with all 37 ECG features, without any feature selection, and the performance parameters of the classifier were calculated and recorded in the relevant column. Then, the 37 features were reduced to 19 in the first feature selection stage (Kruskal Wallis test) and the same process was repeated. Finally, the 19 features were reduced to four in the second feature selection stage (Mann Whitney U test) and the same process was repeated. The performance parameters for the classifiers were calculated and are provided in the table.
According to the table, the best results were obtained with the k-NN classifier. Reducing the number of features with feature selection generally did not decrease the system performance but increased it. For the k-NN, SVM, and Ensemble classifiers, this increase was quite evident. In addition, obtaining the highest performance parameters with 4 features instead of 37 or 19 is important for real-time systems, which then carry less workload and work more effectively. Although feature selection produced no significant increase in the overall performance parameters for the Decision Trees classifier, the AUC increased for each class. It was also advantageous that the performance obtained with 37 or 19 features could be obtained with only 4 features.
Receiver operating characteristic (ROC) curves for the classifiers are provided in Figure 4. A ROC curve may be read as follows: if the curve is closer to the left axis, the classifier can better diagnose the positive class; if the curve is closer to the upper axis, it can better identify the control group. For example, in the ROC curve for the Healthy class in the figure, the positive class is ''Healthy'' and the control class is ''HFrEF & HFpEF''. According to Figure 4, when there were 37 features for k-NN, the AUC for all three classes was 0.938. AUC values increased for all three classes when there were 19 features: HFpEF was ideal, the positive class was better discriminated for the Healthy class, and the control class (HFpEF & Healthy) was better discriminated for HFrEF. With 4 features, ideal results were achieved for all classes.
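The one-vs-rest reading of the ROC curve described above, with one class as positive and the other two pooled as the control class, can be sketched as follows. The function name is hypothetical, and the sketch assumes continuous classifier scores without ties; tied scores would need to be grouped before accumulating the curve.

```python
import numpy as np


def roc_curve_ovr(scores, is_positive):
    """One-vs-rest ROC points and AUC for one class (assumes no tied scores)."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    order = np.argsort(-scores, kind="mergesort")   # descending score
    hits = is_positive[order]
    tpr = np.concatenate([[0.0], np.cumsum(hits) / hits.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(~hits) / (~hits).sum()])
    # Trapezoidal area under the (fpr, tpr) curve
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    return fpr, tpr, auc
```

For the Healthy curve, for instance, `is_positive` would mark the Healthy recordings and `scores` would be the classifier's score for the Healthy class.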
When there were 37 features for SVM, the control classes were better differentiated for HFpEF and Healthy classes, while the positive class was better differentiated for HFrEF. With 19 features, HFpEF was ideal, while AUC was slightly reduced for other classes. HFrEF was ideal when there were 4 features, and the situations were close to ideal for HFpEF and Healthy classes.
When there were 37 features for the Decision Tree, the positive class was better differentiated for HFpEF, while the control classes were better differentiated for the Healthy and HFrEF classes. The AUC value for HFrEF was significantly reduced when there were 19 features. Better results were obtained for all three classes when there were 4 features.
When there were 37 features for the Ensemble classifier, HFpEF was ideal, and the control class was better distinguished for HFrEF. When there were 19 features, the positive class was better differentiated for HFrEF. With 4 features, HFpEF was ideal, the control class was better differentiated for HFrEF, and the positive class was better differentiated for the Healthy class.
The approach in this study is considered suitable for implementation in measurement devices in the future. The k-NN classifier was the focus because it is widely used in industrial implementations [46]. The three other algorithms were used to support the study and to assess the feasibility of this three-class classification using 3-lead ECG. According to the results, this classification could be done with high accuracy, and the best results were obtained with the k-NN method.
In contrast to other studies in the field, the classification in this study has three classes, namely HFrEF, Healthy, and HFpEF. It answers not only whether HF is present but also, if HF is present, which subtype it is.
This study contributed to the literature by showing that HFrEF and HFpEF could be diagnosed with 3-lead ECG alone. It is thought that it will be a pioneer for future studies on this subject. The dataset used in the study was not very large. Therefore, traditional ML methods were used. New studies can be done using newer ML methods with a larger dataset.

IV. CONCLUSION
The echocardiogram and the electrocardiogram (ECG) are the most frequently used tests in patients with suspected HF. LVEF is important not only as a manifestation of the disease but also because patients are defined by LVEF in most clinical studies. LVEF is generally determined by an echocardiogram. Rarely, SPECT (Single Photon Emission Computerized Tomography) and radionuclide ventriculography or radionuclide angiography (multiple-gated acquisition, MUGA) are also used. These devices are expensive, some of the methods are invasive, and all of them require an expert. In addition, access to these devices may be limited in some conditions. On the other hand, ECG is inexpensive, easy to acquire, and gives fast results. Therefore, a medical decision support system was developed using 3-lead ECG so that, for example, when a patient admitted to the emergency service is suspected of HF, the doctor can be informed whether the case is HFpEF, HFrEF, or healthy before detailed examinations are carried out. Thus, diagnosis will be easier for the doctor, and time will be saved by preventing unnecessary and expensive tests. The system is therefore useful in terms of both patient comfort and economics.
For HFrEF and HFpEF, which are difficult to diagnose and can be overlooked, a medical decision support system that does not require an expert and facilitates the doctor's diagnosis was developed using only 3-lead ECG. It is thought that this study will fill the gap in the literature regarding the simultaneous evaluation of HFrEF and HFpEF diagnosis using only ECG.