Clinical Implication of Machine Learning in Predicting the Occurrence of Cardiovascular Disease Using Big Data (Nationwide Cohort Data in Korea)

Machine learning (ML) and large-scale big data are key factors in developing an accurate prediction model for cardiovascular disease (CVD). Although CVD risk often depends on race and ethnicity, most previous studies considered only US or European populations for CVD risk prediction. In this work, to complement previous research, we analyzed the Korean National Health Insurance Service–National Health Sample Cohort (KNHSC) data and studied the characteristics of ML and big data for predicting CVD risk. More specifically, we assessed the effectiveness of various ML methods in predicting the 2-year and 10-year risk of CVD such as atrial fibrillation, coronary artery disease, heart failure, and stroke. To develop prediction models, we considered the usual medical examination data, questionnaire survey results, comorbidities, and past medication information available in the KNHSC data. We developed various ML-based prediction models using logistic regression, deep neural networks, random forests, and LightGBM, and validated them using various metrics such as receiver operating characteristic curves, precision-recall curves, sensitivity, specificity, and F1 score. Experimental results showed that all ML models outperformed the baseline method derived from the ACC/AHA guidelines for estimating the 10-year CVD risk, demonstrating the usefulness of ML methods. In addition, in our analysis, the prediction accuracy of the ML models was comparable across models regardless of whether the past medication information was included as a feature. Because physicians' use of medications provides important information on the occurrence of disease, all prediction models achieved a slightly higher prediction accuracy when it was included as a feature.


I. INTRODUCTION
Representative cardiovascular disease (CVD) includes myocardial infarction, atrial fibrillation, heart failure, and stroke. The occurrence of CVD is affected by various risk factors such as race, ethnicity, age, sex, weight, height, body mass index, and blood test results including kidney function, liver function, and cholesterol levels [1]-[4]. These factors are often intertwined and affect the development of various diseases in complicated ways. Hence, prediction models based on conventional statistical methods often cannot reflect all the complex causal relationships between the various risk factors [5], [6].

(The associate editor coordinating the review of this manuscript and approving it for publication was Zhe Xiao.)
The recent standardization of medical big data and the systematization of national health examination data have made it possible to analyze previously unknown risk factors that may have a statistically significant association with the occurrence of disease, which may in turn allow us to trace back various disease mechanisms. Moreover, big data analysis is crucial in developing accurate prediction models for the occurrence of disease [7]. Traditionally, various statistical methods have been used to develop prediction models and to discover important risk factors. Recently, however, artificial intelligence (AI) and big data have gained considerable attention and are increasingly being used to develop prediction models for various diseases [5], [6], [8]-[12]. One possible limitation of AI-based prediction models is that, since they are often black-box models, it is challenging to analyze the causal relationships between risk factors and the occurrence of disease. Nevertheless, by learning complex patterns and regularities from big data, AI-based prediction models often improve the accuracy of predicting the occurrence of disease.

(VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/)
For CVD risk prediction using machine learning (ML) and big data, most previous work considered only US or European populations [5], [6], [8], even though CVD risk often depends on race and ethnicity. In this work, to complement previous research, we developed various ML models based on logistic regression, deep neural networks, random forests [13], and LightGBM [14] to predict the risk of CVD using systematically organized large-scale nationwide health examination data in Korea [15], [16]. In particular, since most cardiovascular risk factors change constantly and a short-term risk prediction has higher clinical significance from physicians' clinical perspective, we compared the results of the 2-year and the standard 10-year risk predictions of the proposed ML models. Validation results under various metrics showed that all proposed ML models outperformed the baseline method derived from the American College of Cardiology/American Heart Association (ACC/AHA) guidelines for the estimation of the 10-year risk of CVD [17].
In addition, the retrospective medical data on which most previous ML prediction models are based [5], [6], [8] often include physicians' bias in the use of disease-related medications. To evaluate the effect of this bias, we considered the use of cardiovascular drugs as a feature variable. In our analysis, when the past medication information was exploited, all ML models achieved a slightly higher prediction accuracy, partly because the use of medications by physicians provided important information on the occurrence of diseases.
The main contributions of the paper are summarized as follows:
• We studied the characteristics of ML and big data, in particular for a less studied Asian population, for CVD risk prediction.
• We developed various ML-based prediction models and experimentally showed that they outperformed the standard baseline method, thus demonstrating the usefulness of ML methods in predicting CVD.
• We analyzed the effect of the short-term and long-term predictions by comparing the results of the 2-year and 10-year risk predictions.
• We also considered the past cardiovascular medication information to evaluate the physicians' bias in the use of disease-related medications.
The rest of the paper is organized as follows. Section 2 describes the study population and feature variables. It also presents the employed ML methods and experimental settings. Section 3 presents experimental results by comparing the performance of the proposed ML-based prediction models using various metrics. Section 4 discusses the experimental results and related work. Finally, Section 5 concludes.

A. STUDY POPULATION
In this study, we developed ML-based prediction models for CVD such as atrial fibrillation (AF), coronary artery disease (CAD), heart failure (HF), and stroke by analyzing the Medical Check-up Cohort DB ver 1.0 provided by the Korean National Health Insurance Service [15], [16] (NHIS-2016-2-263). This cohort database is a non-personally identifiable research DB consisting of general medical examination results from 2002 to 2013 (12 years) for about 510,000 qualified individuals who were between 40 and 79 years old as of 2002. The baseline year was set to 2003 to allow 10 years of follow-up, as suggested by the ACC/AHA guidelines. Among the 310,210 subjects who underwent a health medical examination (HME) in 2003, we excluded those diagnosed with AF (I48), CAD (I21-25), HF (I50), hemorrhagic stroke (HS) (I60-62), or ischemic stroke (IS) (I63-69) before their 2003 HME. We also excluded those subjects whose HME data contained a null or incorrect value. Consequently, the analysis cohort consisted of a total of 297,875 subjects. We conducted 2-year and 10-year follow-up cohort analyses. For each analysis, all subjects were divided into two groups: the disease group, consisting of those patients who were diagnosed with CVD during the follow-up period, and the no disease group, consisting of the rest of the subjects (Fig. 1).

B. FEATURE ENGINEERING
Feature extraction was conducted as follows. First, the following twelve features were extracted from the HME results: age, sex, blood sugar, body mass index (BMI), systolic blood pressure (SBP), diastolic blood pressure (DBP), gamma-glutamyltransferase (GGT), hemoglobin level (HMG), presence of gross proteinuria (OLIG_PROTE), serum aspartate aminotransferase (SGOT_AST), serum alanine aminotransferase (SGPT_ALT), and total cholesterol. Second, the following five features were derived from the results of the questionnaire survey included in the HME: the exercise and smoking statuses, estimated total amount of smoking (pack-years), average number of drinking days per week (DRNK_HABIT), and average alcohol consumption. Next, information on eight diseases was extracted from the accompanying treatment DB, which contained the treatment information for each subject at the clinic. The following diseases were considered: AF, CAD, cancer (C00-96), diabetes mellitus (DM) (E10-14), HF, hypertension (HTN) (I10-15), HS, and IS. When predicting the risk of CVD such as AF, CAD, HF, HS, and IS, the remaining three comorbidities, i.e., cancer, DM, and HTN, were exploited as binary input features of the prediction models. In particular, each comorbidity was considered only if the subject had been hospitalized more than once or had visited the clinic more than twice due to the comorbidity before the HME date. Lastly, we considered whether the subject took cardiovascular drugs such as antiarrhythmics, anticoagulants, antiplatelets, cardiotonics, and statins before the HME date. Note that all the subjects were free from CVD before their baseline HME.

FIGURE 1. Study population and data extraction procedures. The analysis cohort was divided into disease and no disease groups, which were then divided into training and test sets. The training set was used to build a prediction model and the test set was used to validate the model.
In total, 25 input variables and one target variable were considered.
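The comorbidity rule above (a flag is set only if the subject had more than one hospitalization or more than two clinic visits for that disease before the HME date) can be sketched in pandas. This is a minimal illustration on made-up data; the column names (subject_id, is_hospitalization, is_clinic_visit) are assumptions, not the actual KNHSC schema.

```python
import pandas as pd

def comorbidity_flag(visits: pd.DataFrame) -> pd.Series:
    """True per subject if >1 hospitalization or >2 clinic visits."""
    counts = visits.groupby("subject_id").agg(
        n_hosp=("is_hospitalization", "sum"),
        n_visit=("is_clinic_visit", "sum"),
    )
    return (counts["n_hosp"] > 1) | (counts["n_visit"] > 2)

# Illustrative claim records for three subjects, all before the HME date.
visits = pd.DataFrame({
    "subject_id":         [1, 1, 2, 2, 2, 3],
    "is_hospitalization": [1, 1, 0, 0, 0, 1],
    "is_clinic_visit":    [0, 0, 1, 1, 1, 0],
})
flags = comorbidity_flag(visits)
# Subject 1: 2 hospitalizations -> flagged; subject 2: 3 clinic visits ->
# flagged; subject 3: 1 hospitalization, 0 visits -> not flagged.
```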

C. DATASETS
From the analysis cohort, we constructed two datasets: one for the 2-year follow-up analysis and the other for the 10-year follow-up analysis. Both datasets consisted of the same subjects. The main difference is that, in the former dataset, those who were diagnosed with CVD more than two years after their HME date were assigned to the no disease group, since only 2 years of follow-up were considered for this dataset. For each dataset, all subjects were exclusively divided into training and test sets. More specifically, for the training set, 80% of the subjects were randomly selected from both the disease and no disease groups; the remaining 20% were used as the test set. The training set was used to build prediction models and the test set to evaluate them. Fig. 1 illustrates the study population and data preprocessing steps, and Table 1 summarizes the baseline characteristics of the two datasets, where we omit multi-valued categorical variables such as OLIG_PROTE, the smoking status, and the average number of drinking days per week.
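Sampling 80% of the subjects from both the disease and no disease groups is equivalent to a stratified 80/20 split on the binary target, which can be sketched with Scikit-learn. The data here are synthetic placeholders, not the cohort data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 25))              # 25 input variables
y = (rng.random(1000) < 0.12).astype(int)    # ~12% disease prevalence

# stratify=y draws 80% from each of the disease / no disease groups,
# so the disease rate is preserved in both sets (up to rounding).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```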

D. LOGISTIC REGRESSION
Logistic regression (LR) is a linear model for binary classification that estimates the probability of a binary dependent (target) variable from a linear combination of independent variables (features). More specifically, given the independent variables x_1, x_2, ..., x_n, the dependent variable y is modeled as

y = σ(w_1 x_1 + w_2 x_2 + · · · + w_n x_n + b),

where σ is the sigmoid function σ(t) = 1/(1 + e^(−t)). That is, LR models the risk of the target disease as a probability between 0 and 1. Using the training set, the best values of the parameters w_1, w_2, ..., w_n, b are computed by maximizing the likelihood of the data, i.e., by minimizing the cross-entropy loss. To develop an LR model, we used the LogisticRegression class provided in the Python Scikit-learn library (https://scikit-learn.org/stable/) with a gradient-descent-based solver.
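A minimal sketch of such an LR model with Scikit-learn, on synthetic stand-in data (the actual study used 25 KNHSC features, which are not reproduced here); 'saga' is one of Scikit-learn's gradient-based solvers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 25 features, binary target driven by two of them.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 25))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.25 * rng.normal(size=500) > 0).astype(int)

# Standardizing the features helps the gradient-based solver converge.
model = make_pipeline(StandardScaler(),
                      LogisticRegression(solver="saga", max_iter=1000))
model.fit(X, y)
risk = model.predict_proba(X)[:, 1]  # estimated risk as a probability in [0, 1]
```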

E. DEEP NEURAL NETWORK
A deep neural network (DNN) is one of the most widely used deep learning models and can encode nonlinear relationships between independent and dependent variables. A DNN normally consists of an input layer, several hidden layers, and an output layer (Fig. 2). Each layer consists of several nodes, each of which is connected to every node in the previous and next layers, i.e., the layers are fully connected. The output of each node is defined as a linear combination of the output values from the previous layer followed by a nonlinear activation function. To develop a DNN model, we used Python 3 and the Keras 2.1.6 deep learning library (https://keras.io/) with TensorFlow [18] as a backend. In particular, the hyperparameters of the DNN model were determined by a grid search followed by a manual search. We varied the number of hidden layers (from 1 to 20), the number of nodes in each layer (from 10 to 200), the batch size, learning rate, activation function, and regularization methods. To this end, we used only 80% of the training data for learning the model and the remaining 20% as validation data. Then, the hyperparameters that led to the best performance on the validation data were used in the experiments. More specifically, we used three hidden layers, each of which was composed of a fully connected layer with 30 nodes followed by batch normalization [19], the ELU activation function [20], and dropout [21]. In the experimental results, the prediction performance of the DNN model was measured using the test data. Since the performance of the DNN model varies slightly with the initial parameter values, all measurements for the DNN model were averaged over 10 sample runs.
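The selected architecture (three hidden blocks of Dense(30) followed by batch normalization, ELU, and dropout, with a sigmoid output) can be sketched as follows. This is written with tf.keras rather than the standalone Keras 2.1.6 used in the study, and the dropout rate and optimizer here are illustrative assumptions, not values reported in the paper.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_dnn(n_features: int = 25) -> keras.Model:
    inputs = keras.Input(shape=(n_features,))
    x = inputs
    for _ in range(3):                       # three hidden layers
        x = layers.Dense(30)(x)              # fully connected, 30 nodes
        x = layers.BatchNormalization()(x)
        x = layers.Activation("elu")(x)
        x = layers.Dropout(0.3)(x)           # rate 0.3 is an assumption
    # Single sigmoid output: estimated probability of developing CVD.
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam",          # optimizer is an assumption
                  loss="binary_crossentropy",
                  metrics=[keras.metrics.AUC()])
    return model

model = build_dnn()
```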

F. RANDOM FORESTS
A random forest (RF) is a widely used ensemble learning method based on bagging (bootstrap aggregating) [13]. It consists of multiple decision trees, each of which is built from a random sample drawn with replacement (a bootstrap sample) from the training set, where the sample size equals the training set size. When each tree is built, only a random subset of the features is considered to split each node. The prediction of the RF model is then given as the averaged prediction of all decision trees, which reduces the variance and yields an overall better model. To develop an RF model, we used the RandomForestClassifier class provided in the Python Scikit-learn library. The hyperparameters of the RF model were determined by a grid search with 5-fold cross-validation on the training data. We varied the number of decision trees in the RF model (from 100 to 1,000), the maximum size of the random subset of features used to split each node, the maximum depth of each tree, and the criterion function used to measure the quality of a split.
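The grid search with 5-fold cross-validation over those four hyperparameters can be sketched as follows; the grid here is a deliberately tiny subset of the reported search ranges, fit on synthetic stand-in data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 25))
y = (X[:, 0] > 0).astype(int)

param_grid = {
    "n_estimators": [50, 100],         # the paper searched 100-1,000 trees
    "max_features": ["sqrt", 0.5],     # size of the random feature subset
    "max_depth": [5, None],            # maximum depth of each tree
    "criterion": ["gini", "entropy"],  # split-quality criterion
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="roc_auc", n_jobs=-1)
search.fit(X, y)
best_rf = search.best_estimator_   # refit on all training data
```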

G. LIGHTGBM
LightGBM is a fast, high-performance gradient boosting framework based on tree-based learning algorithms [14]. A gradient boosting decision tree (GBDT) [22] is also a widely used ensemble learning method but, unlike bagging, it builds decision trees sequentially to reduce the bias of the model. Specifically, each new tree is built so as to reduce the prediction error of the previously built trees. LightGBM is an efficient and effective implementation of GBDT with techniques called gradient-based one-side sampling and exclusive feature bundling. To develop a LightGBM model, we used the LGBMClassifier class provided in the Python LightGBM package (https://lightgbm.readthedocs.io). As with the RF model, the hyperparameters of the LightGBM model were determined by a grid search with 5-fold cross-validation on the training data. We varied the boosting type, the number of boosted trees (from 50 to 200), the maximum number of tree leaves, the boosting learning rate, the subsample ratio of the columns when constructing each tree, and the regularization terms.

H. EXPERIMENTAL SETTINGS
All experiments were conducted on a workstation equipped with two octa-core Intel Xeon E5-2630 v3 2.40 GHz CPUs, 96 GB of main memory, and an NVIDIA GeForce GTX 1080 Ti GPU with 11 GB of memory. The host operating system was Ubuntu 16.04.3 LTS (64-bit), and all prediction models were implemented using Python 3, the Scikit-learn machine learning library, and the Keras 2.1.6 deep learning library. For the 2-year and 10-year follow-up analysis data, we implemented ML-based prediction models using LR, DNN, RF, and LightGBM, both with and without medication features. To validate their effectiveness, we compared them with the baseline method derived from the ACC/AHA guidelines for the estimation of the 10-year risk of CVD [17]. More specifically, the baseline method was a simple logistic regression model built using the following seven features: age, sex, SBP, total cholesterol, smoking status, DM, and HTN. The performance of all prediction models was compared using the following metrics: receiver operating characteristic (ROC) curves, precision-recall (PR) curves, sensitivity, specificity, and F1 score. Moreover, to find the most important features for estimating the risk of CVD, we analyzed the feature importance of each prediction model using the Shapley additive explanations (SHAP) method, a game-theoretic, unified method for explaining the output of any ML model [23].
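The evaluation metrics listed above can be computed with Scikit-learn as sketched below, on illustrative scores. AUPRC is computed here as average precision, a common (assumed) estimator of the area under the PR curve, and the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, f1_score,
                             recall_score, roc_auc_score)

# Illustrative labels and predicted risks (not study data).
y_true  = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 0])
y_score = np.array([0.1, 0.2, 0.15, 0.3, 0.8, 0.7, 0.4, 0.55, 0.05, 0.25])
y_pred  = (y_score >= 0.5).astype(int)   # threshold is an assumption

auroc = roc_auc_score(y_true, y_score)              # area under ROC curve
auprc = average_precision_score(y_true, y_score)    # area under PR curve
sensitivity = recall_score(y_true, y_pred)          # true positive rate
specificity = recall_score(y_true, y_pred, pos_label=0)  # true negative rate
f1 = f1_score(y_true, y_pred)
# Here every positive outranks every negative, so all metrics equal 1.0.
```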

A. PREDICTION PERFORMANCE
Figs. 3 and 4 show the performance of every prediction model, including the baseline model, on the test datasets in terms of ROC and PR curves. Fig. 3 shows the results of the 2-year CVD risk prediction with and without medication features, while Fig. 4 shows the results of the 10-year risk prediction. We also included the ROC curve regions of interest [24], that is, the regions where the false positive rate, i.e., 1 − specificity, is less than or equal to the imbalance ratio of the dataset: 0.03 for the 2-year follow-up dataset and 0.125 for the 10-year follow-up dataset. Table 2 presents the area under the ROC curve (AUROC), the area under the PR curve (AUPRC), and the sensitivity, specificity, and F1 score of each prediction model.
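Restricting the ROC curve to such a region of interest can be sketched as follows: keep only the points with false positive rate at most the imbalance ratio. Scikit-learn can also summarize this region as a standardized partial AUC via the max_fpr argument. The data are synthetic; 0.03 mimics the 2-year follow-up imbalance ratio.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_true = (rng.random(2000) < 0.03).astype(int)       # ~3% positives
y_score = y_true * 0.5 + rng.random(2000) * 0.6      # informative but imperfect

fpr, tpr, _ = roc_curve(y_true, y_score)
mask = fpr <= 0.03                     # keep only the region of interest
roi_fpr, roi_tpr = fpr[mask], tpr[mask]

# Standardized partial AUC over the same region (McClish correction):
# 0.5 corresponds to chance, 1.0 to a perfect classifier in the region.
pauc = roc_auc_score(y_true, y_score, max_fpr=0.03)
```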
First, as shown in Figs. 3 and 4, every ML-based prediction model outperformed the baseline model in terms of ROC and PR curves, regardless of whether medication features were used or not. In particular, as shown in Table 2, with medication features, DNN achieved the best AUROC values, which are 2.36% and 1.31% higher than those of the baseline for 2-year and 10-year risk prediction, respectively. Meanwhile, with medication features, LightGBM achieved the best AUPRC values, which are 3.45% and 2.59% higher than those of the baseline, for 2-year and 10-year risk prediction, respectively. Moreover, the performance of the proposed ML models was mostly comparable to each other. In particular, there was no clear winner in terms of sensitivity, specificity, and F1 score as shown in Table 2.
Second, every ML-based prediction model achieved better performance for 2-year CVD risk prediction than for 10-year risk prediction in terms of AUROC and sensitivity. For example, with medication features, the ML models achieved 3.04%-3.59% higher AUROC values for 2-year risk prediction than for 10-year risk prediction. Conversely, every ML-based prediction model achieved better performance for 10-year risk prediction than for 2-year risk prediction in terms of AUPRC and F1 score. For example, without medication features, the ML models achieved 18.37%-19.21% higher AUPRC values for 10-year risk prediction than for 2-year risk prediction. The reason for this large difference in AUPRC values and F1 scores is that the 2-year follow-up dataset is much more imbalanced than the 10-year follow-up dataset: subjects with CVD account for only 3.09% of the former dataset but 12.34% of the latter. Thus, the precision of the 2-year risk prediction was much lower than that of the 10-year risk prediction, and consequently so were the AUPRC values and F1 scores. Third, the medication features improved the performance of the ML models; for example, LightGBM achieved a 2.12% higher AUROC value and a 3.49% higher AUPRC value with medication features than without them for the 2-year follow-up dataset. In particular, medication features were more effective for 2-year CVD risk prediction than for 10-year risk prediction.

B. FEATURE IMPORTANCE
To estimate the contribution of each feature to the prediction, we analyzed the feature importance of each prediction model using the SHAP method [23]. Tables 3 and 4 show SHAP feature importance for each model for the 2-year and 10-year follow-up training datasets, respectively, where features with larger Shapley values are more important. Due to the high computational cost, the Shapley values for the DNN and RF models were computed using bootstrapping; thus, in Tables 3 and 4, the Shapley values should not be compared across different prediction models.
From the results, we make the following observations. First, each ML-based prediction model differently utilized the input features, due to their different characteristics. That is, an important feature in one prediction model was not necessarily important in another model. For example, in Table 3, exercise was the second most important feature for the LR model, but rarely utilized for the RF and LightGBM models.
Second, among the seven features suggested in the ACC/AHA guidelines, namely, age, sex, SBP, total cholesterol, smoking status, DM, and HTN, only age, SBP, and HTN were effectively utilized by every prediction model for both the 2-year and 10-year follow-up datasets; all three ranked among the top 11 most important features. Third, medication features were utilized more effectively for 2-year CVD risk prediction than for 10-year risk prediction. In particular, for 2-year risk prediction, antiplatelets and cardiotonics were among the top 4 most important features for DNN, RF, and LightGBM, and among the top 9 for LR. Finally, BMI, DRNK_HABIT, and exercise were more important for 10-year risk prediction than for 2-year risk prediction; all ranked among the top 9 most important features in Table 4, except that DRNK_HABIT ranked 13th for LightGBM.

IV. DISCUSSION
In this study, we developed various ML-based prediction models for estimating 2-year and 10-year risk of CVD by analyzing the Korean National Health Insurance Service-National Health Sample Cohort (KNHSC) data. Specifically, we developed prediction models based on LR, DNN, RF, and LightGBM and compared them with the baseline method derived from the ACC/AHA guidelines. We trained the ML-based prediction models with and without past cardiovascular medication information and compared their performance under various metrics. Every ML model achieved higher prediction accuracy with the medication features than without them, and significantly outperformed the baseline method when trained with the medication features. The SHAP feature importance analysis in Tables 3 and 4 also confirmed that the pre-existing use of cardiovascular drugs such as antiplatelets and cardiotonics was an important feature variable. With the medication features, for both 2-year and 10-year CVD risk prediction, the DNN model achieved the highest AUROC values, whereas the LightGBM model achieved the highest AUPRC values. Still, all the ML-based prediction models were comparable to each other in general in terms of AUROC, AUPRC, sensitivity, specificity, and F1 score.
Several previous studies also analyzed the KNHSC data to assess the risk of various cardiovascular events, but they mainly used statistical analyses [25]-[29]. Recently, many researchers have been extensively studying the use of ML and deep learning methods to build more accurate prediction models [5], [6], [8], [10]-[12], [30]-[35], mostly by analyzing medical images and wave signal data such as those obtained from MRI, CT, and electrocardiography. However, most of these studies considered a much smaller number of subjects, from hundreds to tens of thousands. In contrast, this study analyzed 297,875 subjects, so its results generalize more readily. Our work is similar in spirit to [5], [6] in that they also analyzed clinical big data and compared various ML-based prediction models for the 5-year or 10-year risk of CVD. The main difference is that [5], [6] analyzed European populations, whereas we analyzed a Korean population; thus, our work complements [5], [6] from the perspective of race and ethnicity. Moreover, to analyze the effect of short-term versus long-term prediction, we compared the results of 2-year and 10-year risk prediction. In addition, we further analyzed the effect of past medication information on the prediction accuracy using SHAP feature importance, which was not considered in [5], [6].
The prediction accuracy and performance of ML models vary depending on the data used. For example, a DNN-based model significantly outperformed LR- and RF-based models in [32], whereas an RF-based model outperformed other ML-based models, including LR and neural network models, in [36], [37]. In this study, the performance of the proposed ML models was mostly comparable across models under various metrics. In many applications, DNNs often achieve higher prediction accuracy than LR and tree-based models such as RF and LightGBM. However, this was not the case in this study, partly because the KNHSC dataset is a simple tabular dataset, so there seems to be no particularly complex nonlinear relationship between the features considered here. If we also considered spatial or sequential data such as images, regularly collected laboratory data, and electrocardiograms, then a DNN would likely be much more effective than the other models. In such cases, convolutional neural networks or recurrent neural networks can be even more effective [12], [30], [31], [35].
Some clinically interesting points were also found in our study. When we excluded the past cardiovascular medication information, the prediction accuracy of all prediction models significantly decreased. Note that the use of medication is an important part of a physician's judgment on a patient: if a doctor examines a patient and determines, based on various clinical data, that CVD is likely to occur, he or she may prescribe cardiovascular drugs as a preventive treatment. In other words, the use of such medications is the result of a doctor's analysis of various interrelated risk factors and the occurrence of diseases. Therefore, the prediction accuracy of all ML models improved significantly when we included the medication features (see Figs. 3 and 4 and Table 2). Since we found a seemingly paradoxical regularity, namely that the incidence of CVD increased when cardiovascular drugs had been used, we also developed prediction models on a dataset without the medication features, to avoid the bias of a doctor's judgment and to check the performance of the ML models themselves (see Figs. 3 and 4 and Table 2). The SHAP feature importance analysis in Tables 3 and 4 also confirmed this finding. For example, the DNN, RF, and LightGBM models ranked the use of antiplatelets and cardiotonics among the top 4 of the 25 input features for 2-year risk prediction. Moreover, the medication features were more effective for 2-year risk prediction than for 10-year risk prediction; that is, the importance of past medication information decreased for long-term prediction.

V. CONCLUSION
In this study, we analyzed the Korean National Health Sample Cohort big data and developed various ML-based prediction models to estimate the 2-year and 10-year risk of CVD. When we included the past medication information as input features, all proposed ML models significantly outperformed the baseline method derived from the ACC/AHA guidelines for estimating the 10-year risk of CVD, thus demonstrating the effectiveness of ML methods in predicting CVD. However, whether or not the medication features were used, the performance of the ML models was mostly comparable across models; therefore, as future work, it will be interesting to investigate more effective ML methods for CVD risk prediction. Meanwhile, since the use of medications by physicians provided important information on the occurrence of diseases, all ML models achieved a higher prediction accuracy when the past medication information was included as input features. In particular, the past medication information was more effective for short-term prediction than for long-term prediction.
YEONGJIN SONG received the B.S. and M.S. degrees in computer science from Kangwon National University, South Korea, in 2018 and 2020, respectively. He is currently an Engineer with Classmethod, Japan. His research interests include programming languages, machine learning, and precision medicine.
HYEONSEUNG IM received the B.S. degree in computer science from Yonsei University, South Korea, in 2006, and the Ph.D. degree in computer science and engineering from Pohang University of Science and Technology (POSTECH), South Korea, in 2012. From 2012 to 2015, he was a Postdoctoral Researcher with the Laboratory for Computer Science, Université Paris-Sud, and Tyrex Team, Inria, France. He is currently an Associate Professor with the Department of Computer Science and Engineering, Kangwon National University, South Korea. His research interests include programming languages, logic in computer science, big data analysis and management, machine learning, precision medicine, and network security.

JUNBEOM PARK is currently an Associate Professor with the College of Medicine, Ewha Womans University. He is also the Director of the Cardiac Electrophysiology Laboratory, Ewha Womans University Medical Center. His research interests include mechanisms and predictors of atrial fibrillation, sinus node dysfunction, and the clinical implication of AI in cardiovascular diseases.