An efficient prediction method for coronary heart disease risk based on two deep neural networks trained on well-ordered training datasets

This study proposes an efficient prediction method for coronary heart disease (CHD) risk based on two deep neural networks trained on well-ordered training datasets. Most real datasets include an irregular subset with higher variance than the rest of the data, and predictive models do not learn well from such datasets. While most existing prediction models learn from the whole or a randomly sampled training dataset, our method prepares training datasets by separating the regular and highly biased subsets so that accurate prediction models can be built. We use a two-step approach to prepare the training dataset: (1) divide the initial training dataset into two groups, commonly distributed and highly biased, using Principal Component Analysis; (2) enrich the highly biased group using Variational Autoencoders. Then, two deep neural network classifiers learn from the separated training groups independently. The well-organized training groups make it possible to build more accurate prediction models. When predicting the risk of coronary heart disease from a given input, only one appropriate model is selected, based on the reconstruction error of the input on the Principal Component Analysis model. The dataset used in this study was collected from the Korean National Health and Nutrition Examination Survey. We conducted two types of experiments on this dataset. The first showed how the Principal Component Analysis and Variational Autoencoder models of the proposed method improve the performance of a single deep neural network. The second compared the proposed method with existing machine learning algorithms, including Naïve Bayes, Random Forest, K-Nearest Neighbor, Decision Tree, Support Vector Machine, and Adaptive Boosting. The experimental results show that the proposed method outperformed the conventional machine learning algorithms, with an accuracy of 0.892, specificity of 0.840, precision of 0.911, recall of 0.920, f-measure of 0.915, and AUC of 0.882.

CHD is also highly ranked in South Korea, where it is the second leading cause of death [2]. In CHD, a waxy substance called plaque builds up inside the coronary arteries that deliver oxygen and nutrients to the heart muscle. This plaque narrows the arteries and limits the flow of oxygen-rich blood to the heart muscle [3]. Over time, the heart arteries narrow further and block the blood flow, and a heart attack or sudden death can occur because of the blockage. CHD usually progresses over many years without any symptoms.
Therefore, most people are diagnosed in the middle or late stage, after feeling symptoms such as chest pain, shortness of breath, or fatigue. If CHD reaches a serious condition, advanced treatments become necessary, such as stent surgery to keep the coronary arteries open and reduce the chance of a heart attack, coronary artery bypass grafting to restore blood flow to the heart muscle, or a heart transplant [4]. In the early stage, a healthy diet, active exercise, and appropriate medicines and care can help keep CHD from progressing.
Recently, many studies have predicted the risk of CHD using machine learning and deep learning approaches. The machine learning-based methods mainly proposed single or ensemble classification algorithms [6], and some of them used feature selection or feature extraction techniques to improve performance [7], [13], [14]. More recently, deep learning techniques have been used successfully to diagnose CHD [15]-[19]. Most existing methods first split an experimental dataset into two parts for training and testing. Then, they build predictive models from the whole or a randomly sampled training dataset using classification algorithms. As a result, the models fit the regularly distributed data better and misclassify irregularly distributed (biased) data.
Therefore, we focus on this problem by using distinct predictive models for regular and biased inputs. In our previous study [20], the proposed method consisted of four deep learning models: two Stacked Autoencoder (SAE) models and two deep neural network (DNN) models. First, we divided a training dataset into two groups based on the reconstruction errors given by the first SAE model. Next, two DNN models were trained on these groups, combining a new reconstruction error-based feature with the other risk factors to predict the risk of developing CHD. The main idea was to extract the reconstruction error-based feature from the second SAE model for the two DNN models. In this study, the presented method does not perform feature extraction for the DNN models. Instead, it focuses on the data distribution to improve performance, and it successfully improves on the performance of the previous study.
We propose a prediction method for CHD risk based on a combination of DNN, Variational Autoencoder (VAE), and Principal Component Analysis (PCA) models. We address the following problems related to improving the prediction performance: (1) Previous studies used the whole or a randomly sampled training dataset for model training. However, some data can differ significantly from other data with the same label, and this highly biased subset degrades the performance of predictive models trained on it. Therefore, the proposed method divides the training dataset into two groups, regular and highly biased, using the reconstruction error (RE) of the PCA model. (2) The highly biased subset separated from the training dataset may not be sufficient for model building because it accounts for a small percentage of the total dataset. The proposed method addresses this problem by enriching the highly biased subset via two deep VAE models.
In this study, we improve the prediction performance by preparing the training dataset efficiently, solving the problems mentioned above. The main contributions of this study are as follows:
- We propose a novel method for predictive analysis and apply it to the Korea National Health and Nutrition Examination Survey (KNHANES) dataset to predict CHD risk. The proposed method consists of one PCA model and four deep learning models: two VAE and two DNN models. The combination of these models is more effective than any one alone; in other words, the performance of a single DNN model was improved by using them together.
- The proposed method was evaluated through two kinds of experiments. First, each model was tested independently to show how it improves the performance. Second, the proposed method was contrasted with several machine learning algorithms.
The rest of the paper is organized as follows. Section II provides an overview of existing methods for CHD risk prediction. The proposed method is detailed in Section III. Section IV presents the evaluation metrics, the experimental dataset, and the parameter tuning of the compared algorithms. Section V provides a performance evaluation of the compared algorithms on the KNHANES dataset. Finally, Section VI concludes the paper.

II. LITERATURE REVIEW OF CHD RISK PREDICTION METHODS
The early detection of CHD increases the chance of successful treatment, and many researchers have focused on finding efficient algorithms for the CHD prediction task. This section provides an overview of CHD prediction methods. First, we discuss machine learning-based methods used for CHD. Then, an overview of deep learning-based CHD prediction methods is given. Finally, CHD risk prediction methods evaluated on the KNHANES dataset are discussed.
Machine learning-based approaches have been commonly used for predicting CHD. Soni et al. compared several algorithms, such as Decision Tree (DT), Naïve Bayes (NB), K-Nearest Neighbors (KNN), and Neural Network (NN), on the Cleveland Heart Disease dataset using a free data mining tool named Tanagra; DT showed the highest accuracy of 89%, followed by NB [6]. The authors of [7] compared classification methods, namely NN, Support Vector Machine (SVM), Classification based on Multiple Association Rule (CMAR), DT, and NB, to predict CHD on two kinds of datasets consisting of ultrasound images of Carotid Arteries (CAs) and Heart Rate Variability (HRV) of the electrocardiogram signal. First, they extracted feature vectors from the CAs dataset, the HRV dataset, and a combination of the two. The extracted vector from the CA+HRV dataset showed higher accuracy than the separate feature vectors of CAs and HRV, and the SVM and CMAR classifiers outperformed the other compared classifiers with accuracies of 89.51% and 89.46%, respectively. Gonsalves et al. studied the NB, SVM, and DT algorithms on the South African Heart Disease dataset with 462 instances. Based on 10-fold cross-validation, the NB algorithm gave a promising result for detecting CHD, with a sensitivity of 63% and specificity of 76% [8]. Beunza et al. implemented several data classification models, including Chi-squared Automatic Interaction Detection, SVM, C5.0, and Random Tree (RT), for CHD prediction using the Z-Alizadeh Sani dataset with 303 records from the UCI machine learning repository [10]. The RT model showed the best accuracy of 91.47% and an AUC of 96.70%. These studies generally proposed and compared conventional machine learning algorithms on publicly available heart disease datasets.
PCA has been widely used for the dimension reduction of high-dimensional data. Recently, several studies have used PCA as a feature extractor to improve classification performance [13]-[14]. The authors of [13] improved the performance of the SVM, NB, and DT algorithms by reducing the data dimension from 10 to 6 using PCA on the Cleveland heart disease dataset. In [14], the combination of the Chi-square test and PCA showed promising results for detecting CHD: important features were first selected using the Chi-square test, and their dimension was then reduced using PCA. Another application of PCA is anomaly detection. Heiko Hoffmann [21] modeled the distribution of the training dataset with kernel PCA to detect anomalies; the proposed approach computed the RE in feature space and used it as a novelty measure. In [22], the authors detected anomalies in hyperspectral imagery by computing the errors made when reconstructing the original image from PCA projections.
VAE is a kind of neural network that is used not only as a generative model but also as a classifier. In [23], a VAE was proposed for generating synthetic electronic health records (EHRs). The authors confirmed that the performance of an LSTM model trained on the synthetic data is similar to that of models trained on real EHRs containing over 250,000 records. The authors of [24] used data generated by a VAE for missing data imputation to identify abnormal carotid arteries. They also removed some labels from the test dataset and generated the missing labels with the VAE. As a result, the VAE-based classifier outperformed supervised classifiers, including the SVM, LR, and RF algorithms.
Tama et al. proposed a two-tier ensemble model for CHD prediction and evaluated it on the Z-Alizadeh Sani, Statlog, Cleveland, and Hungarian datasets [11]. The first tier was constructed with the RF, Gradient Boosting Machine (GBM), and Extreme Gradient Boosting (XGBoost) classifiers. These classifiers predicted CHD in parallel, and their outputs fed the second tier, where the final prediction was made by a Generalized Linear Model (GLM). The comparison results showed that the proposed method outperformed the DT, RT, Classification and Regression Trees (CART), RF, GBM, and XGBoost algorithms on all datasets except Cleveland. Wang et al. designed a two-level stacking-based model and evaluated it on the Z-Alizadeh Sani dataset [12]. The predicted outputs of the first-level (base-level) classifiers, including RF, Extra Trees (ET), AdaBoost, SVM, Multi-layer Perceptron (MLP), XGBoost, Gaussian Process Classification (GPC), NB, and LR, were given as input to the second-level (meta-level) classifier based on LR. The proposed method outperformed the compared machine learning algorithms with an accuracy, sensitivity, and specificity of 95.43%, 95.84%, and 94.44%, respectively. By using an ensemble approach, these studies outperformed single machine learning algorithms on the Z-Alizadeh Sani dataset.
In recent years, deep learning techniques have been successfully used to diagnose and predict disease. Deep learning is derived from the conventional neural network but is designed to use numerous hidden layers without requiring any human-designed rules [25]. Atkov et al. developed an NN-based model with two hidden layers (four neurons in each) for predicting CHD using genetic and non-genetic CHD risk factors [15]. The authors built ten predictive models from different risk factors; the accuracy reached 93% on data from 487 patients of Central Clinical Hospital No. 2 of Russian Railways. Samuel et al. proposed a combination of an Artificial Neural Network (ANN) and the Fuzzy Analytic Hierarchy Process (Fuzzy-AHP) for heart failure risk prediction [16]. The Fuzzy-AHP technique was used to compute global weights for the attributes based on the fuzzy triangular membership function. Then, the global weights, which represent the contributions of the attributes, were applied to train the ANN. The performance of the proposed method was evaluated on the Cleveland Heart Disease dataset with 297 patients. The proposed method showed an accuracy of 91.10%, which is 4.4% higher than that of a conventional ANN. Darmawahyuni et al. used a DNN for CHD prediction on the Cleveland Heart Disease dataset [17]. The authors varied the number of hidden layers from one to five, with a hundred neurons in each layer. The best-performing model had three hidden layers, and its accuracy, sensitivity, and specificity reached 96%, 99%, and 92%, respectively. Ayon et al. compared LR, SVM, DNN, DT, NB, RF, and KNN for predicting CHD [18]. The Statlog and Cleveland heart disease datasets, retrieved from the UCI machine learning repository, were used in the experimental study. The DNN with four hidden layers (14, 16, 16, and 14 neurons, respectively) showed the highest accuracy of 98.15%, sensitivity of 98.67%, and precision of 98.01%.
Khaneja et al. addressed the class imbalance problem using NNs with two hidden layers of 256 nodes each [19]. The proposed method consisted of two identical NNs that work together and share their weights. First, an input pair was prepared by combining two records selected using random numbers. Next, the pair was given to the model, with each NN receiving one sample from the pair. The distance between the outputs of the two NNs was then calculated and used in the loss. In an experiment on the Framingham Heart Study dataset, which contains 4,240 samples with 16 columns, the accuracy of the proposed method was 99.66%. M.A. Khan proposed an Internet of Things (IoT) framework for CHD prediction based on a deep Convolutional Neural Network (MDCNN) classifier optimized by an Adaptive Elephant Herd Optimization (AEHO) algorithm [26]. First, smart watch and heart monitor devices were attached to a patient to monitor blood pressure and the Electrocardiogram (ECG). The MDCNN was then used to classify the received sensor data as normal or abnormal. It outperformed the compared algorithms, such as the DNN and LR classifiers, and its accuracy reached 93.3%, 98.2%, and 96.3% on the Cleveland, Framingham, and Sensor datasets, respectively.
Recently, several studies have been conducted on the KNHANES dataset, which covers the Korean population. Kim et al. developed a CHD prediction model based on Fuzzy Logic and DT for Koreans [27]. The model used the Framingham risk factors (gender, age, Systolic Blood Pressure (SBP), Diastolic Blood Pressure (DBP), Total Cholesterol (TCL), High-Density Lipoprotein cholesterol (HDL), obesity, smoking) and diabetes as input. The proposed model was contrasted with the NN, SVM, LR, and DT classifiers and gave the highest accuracy and sensitivity scores, 69.51% and 93.10%, respectively. Lim et al. proposed an optimized DBN model to predict CHD risk on the KNHANES-VI dataset with 748 instances using the Framingham risk factors [28]. The optimal numbers of nodes and layers in the DBN were derived through a genetic algorithm. They compared the Optimized-DBN with the NB, LR, RF, and FRS algorithms. The proposed approach showed the highest performance, with an accuracy of 89.24%, specificity of 74.40%, sensitivity of 85.49%, and AUC of 76.20%. Amarbayasgalan et al. proposed a deep learning-based CHD risk prediction model (DAE-NNs) consisting of a Deep Autoencoder (DAE) and two DNN models [29]. The DAE-NNs used the Framingham risk factors as model input and was evaluated on the fifth and sixth KNHANES datasets, including 25,990 patients. First, the training dataset was divided into two groups by an RE-based threshold from the DAE model. Then, DNN classifiers were trained on each group. The performance measurements, including accuracy, f-measure, and AUC, reached 83.53%, 84.36%, and 84.02%, respectively. An NN with feature correlation analysis (NN-FCA) approach has also been presented [30]. The authors performed statistic-based feature selection on the sixth KNHANES dataset with 4,146 records.
The selected features, such as age, Body Mass Index (BMI), TCL, HDL, SBP, DBP, triglyceride, smoking status, and diabetes, were given as input to an NN model with three hidden layers. Compared to the Framingham Risk Score (FRS) and an LR model, their proposed model showed high performance, with an accuracy of 82.51% and AUC of 74.9%. On the KNHANES-VI dataset with 4,244 records, Kim et al. proposed a CHD risk prediction method based on statistics and a Deep Belief Network (DBN) [31]. First, important features such as age, SBP, DBP, HDL, diabetes, and smoking were selected by statistical analysis. Then, a DBN with two hidden layers worked as a predictor using the selected features. The Statistical-DBN outperformed NB, LR, SVM, RF, and DBN, and its accuracy and AUC reached 83.9% and 79.0%, respectively. The authors of [20] proposed a CHD risk prediction model using Autoencoder and DNN models. The first Autoencoder model was trained on the data labeled as risky for feature extraction. The second Autoencoder model was trained on the whole dataset to select an appropriate prediction model from two DNN classifiers. They selected fourteen risk factors, such as age, knee joint pain status, lifetime smoking status, waist circumference, neutral fat, BMI, weight change in one-year status, SBP, TCL, obesity status, frequency of eating out, HDL, marital status, and diabetes, from the KNHANES dataset using an Extremely Randomized Trees classifier. The proposed method outperformed the machine learning algorithms; its accuracy, precision, recall, f-measure, and AUC reached 86.33%, 91.37%, 82.90%, 86.91%, and 86.65%, respectively.
Most methods proposed in previous studies were trained on the whole training dataset. The method proposed in this study is different: it focuses on the data distribution to prepare training datasets efficiently using PCA and VAE models. First, the training dataset is partitioned into two groups by their divergence, using the RE-based threshold estimated from the PCA model. PCA reduces the dimensionality of a dataset by projecting the high-dimensional space into a lower-dimensional space, and an RE occurs when the lower-dimensional representation of the data is transformed back to its original dimension. In other words, data with high RE (highly biased) and low RE (regular) are grouped separately. The VAE models are employed to enrich the highly biased group by generating samples similar to the normal (labeled as 0) and risky (labeled as 1) data in that group. Finally, the first DNN classifier learns from the regular group, which includes data with low RE, and the second DNN classifier learns from the enriched highly biased group. At prediction time, only one appropriate classifier is employed to predict the CHD risk of the given input. To select the appropriate DNN classifier, the proposed method checks whether the given input is closer to the highly biased group based on its reconstruction error on the PCA model. First, the input data is given to the PCA model to obtain its reconstruction error. If the returned reconstruction error exceeds the threshold calculated by equation (2), the DNN model trained on the highly biased training group is used; otherwise, the DNN model based on the regular training group predicts the class label. By preparing two well-ordered training groups, the proposed method successfully improves the performance of a single DNN classifier trained on the whole training dataset.
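The prediction-time routing described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the data is synthetic, the function and variable names are ours, and two constant stand-in classifiers replace the trained DNN-regular and DNN-biased models.

```python
import numpy as np
from sklearn.decomposition import PCA

def predict_chd_risk(x, pca, t_re, dnn_regular, dnn_biased):
    """Route the input to one of two classifiers by its PCA reconstruction error."""
    x = np.atleast_2d(x)
    x_rec = pca.inverse_transform(pca.transform(x))
    re = float(np.mean((x - x_rec) ** 2))
    model = dnn_biased if re > t_re else dnn_regular
    return model.predict(x)

class ConstantModel:
    """Stand-in classifier for the demo; always returns the same label."""
    def __init__(self, label):
        self.label = label
    def predict(self, x):
        return self.label

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 12))          # synthetic stand-in for the 12 risk factors
pca = PCA(n_components=6).fit(X)        # component count is an assumption
re = np.mean((X - pca.inverse_transform(pca.transform(X))) ** 2, axis=1)
t_re = re.mean() + re.std()             # RE-based threshold, as in equation (2)

label = predict_chd_risk(X[0], pca, t_re,
                         ConstantModel("regular"), ConstantModel("biased"))
```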

III. THE PROPOSED METHOD FOR CHD RISK PREDICTION
The proposed method consists of three modules, as shown in FIGURE 1. The first module (preparation of two training groups) splits the whole training dataset into highly biased and regular groups; the second module (enrichment of the highly biased training group) generates samples similar to both the normal and risky data in the highly biased group; and the third module (CHD risk predictor) builds two DNN classifiers to predict the output for given unseen data as normal or risky.

A. PREPARATION OF TWO TRAINING GROUPS
In this module, two groups of training datasets are prepared from the initial training dataset. The whole training dataset is divided into two subsets by their divergence, using the PCA model. PCA is a dimensionality reduction technique that transforms the input variables into a lower-dimensional space that retains most of the information of the input variables. It is possible to reconstruct the original space of the input from the lower-dimensional data. The RE is the difference between the input data and its inverse transformation (reconstruction) on the PCA model. The proposed method uses the RE to distinguish the highly biased subset of the training dataset. First, the PCA model is trained on the whole training dataset; thus, it is more suitable for commonly distributed data than for highly biased data, projecting common data into the lower-dimensional space with less information loss and reconstructing it with a smaller error. It is therefore possible to separate the highly biased subset from the dataset based on the RE of the PCA model. The RE is calculated as the mean of the squared differences between the input features and their reconstructions; it can be defined as (1):

RE = (1/n) Σ (xᵢ − x′ᵢ)²  (1)

where n is the number of input features; xᵢ is the i-th feature; x′ᵢ is the reconstruction of the i-th feature.
First, we calculate the RE of each instance of the training dataset on the PCA model. Then, a threshold to split the training dataset is estimated from the mean and standard deviation of these REs; it can be described as (2):

TRE = (1/k) Σ REᵢ + σRE  (2)

where k is the number of instances in the training dataset; REᵢ is the reconstruction error of the i-th training instance; and σRE is the standard deviation of the REs. As a result of this module, two different groups of training data are prepared, and the RE-based threshold is estimated for further analysis. Later, the threshold is also used to select the appropriate CHD risk prediction model from the two DNN models trained on the prepared groups.
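The RE-based split of this module can be sketched as follows. This is a hedged sketch on synthetic data: the number of principal components is our assumption (the paper does not state it), and the threshold is computed as the mean of the training REs plus one standard deviation, matching equation (2).

```python
import numpy as np
from sklearn.decomposition import PCA

def reconstruction_errors(pca, X):
    """Mean squared difference between each row and its PCA reconstruction, eq. (1)."""
    X_rec = pca.inverse_transform(pca.transform(X))
    return np.mean((X - X_rec) ** 2, axis=1)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 12))   # synthetic stand-in for the 12 risk factors

pca = PCA(n_components=6).fit(X_train)  # component count is an illustrative choice
re = reconstruction_errors(pca, X_train)

# Threshold from eq. (2): mean plus standard deviation of the training REs
t_re = re.mean() + re.std()

regular = X_train[re < t_re]            # trains DNN-regular
biased = X_train[re >= t_re]            # trains DNN-biased (after VAE enrichment)
```

Because the threshold sits one standard deviation above the mean, the highly biased group is typically the smaller of the two, which is why the next module enriches it.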

B. ENRICHMENT OF THE HIGHLY BIASED TRAINING GROUP
Instead of using the prepared training groups directly, two VAE models enrich the highly biased training group. The first VAE model generates samples labeled as risky; the second generates samples labeled as normal. FIGURE 2 presents the process of enriching the highly biased training group using the two VAE models. In this figure, the datasets in green are the results of the previous module (preparation of two training groups). All data with an RE greater than or equal to TRE are assigned to the highly biased group, which is then bisected into normal and risky sections according to the class label. Each section is used to train one of the VAE models, named VAE-normal and VAE-risky, as shown in FIGURE 2. The VAE was first introduced by [32], and its architecture consists of encoder and decoder parts. The encoder compresses the data into the encoded space, also named the latent space, whereas the decoder decompresses it. In a VAE, the encoder is trained to return the mean and variance that describe a normal distribution, so it encodes an input as a distribution instead of as a fixed vector. The loss function of the VAE consists of two terms: the reconstruction loss, calculated from the difference between the original data and its reconstructed output, and the Kullback-Leibler (KL) divergence, which quantifies how much the latent distribution differs from the standard normal distribution. The VAE minimizes this loss during training so that the latent distribution becomes as close as possible to the standard normal distribution. The loss can be calculated as (3):

Loss = Σ (xᵢ − x′ᵢ)² − (1/2) Σ (1 + log σ² − μ² − σ²)  (3)

where n is the number of instances; xᵢ is the i-th instance; x′ᵢ is the reconstruction of xᵢ; and μ and σ are the mean and standard deviation of the latent distribution. FIGURE 3 represents the architecture of the two VAE models. Each hidden layer uses the ReLU activation function given in (4), and the output layer uses the sigmoid activation function shown in (5).
The ReLU activation function is usually used in hidden layers. It can be described as (4):

ReLU(x) = max(0, x)  (4)

The sigmoid activation function converts an input x into a value between 0 and 1, and it is especially used to predict the probability of an output. It can be described as (5):

sigmoid(x) = 1 / (1 + e^(−x))  (5)

First, the input is encoded as a distribution over the latent space. Second, the input of the decoder, z, is randomly sampled from the latent distribution. Then, the sampled point z is decoded into the output. In this study, the latent distribution is chosen to be normal, and the encoder is trained to return the mean and variance that describe this normal distribution. To generate samples using the VAE model, ε is sampled randomly from the standard normal distribution, multiplied by the standard deviation (σ) of the latent distribution, and added to the mean (μ) to obtain z, as described in (6). Finally, the sampled point z is decoded to get new data; the decoded output of z is a generated sample.

z = μ + σε  (6)

where ε is a random value from the standard normal distribution; μ and σ are the latent distribution's mean and standard deviation.
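The sampling step in equation (6) and the KL term of the loss in equation (3) can be illustrated in a few lines of NumPy. This is a minimal sketch: the (μ, log σ²) values are made up, since a real encoder network would produce them for each input.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Reparameterization of eq. (6): z = mu + sigma * eps, eps ~ N(0, I)."""
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def kl_divergence(mu, log_var):
    """KL term of the VAE loss: distance of N(mu, sigma^2) from the N(0, I) prior."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))

rng = np.random.default_rng(42)
mu = np.zeros(2)                       # illustrative encoder output, 2-D latent space
log_var = np.zeros(2)                  # log variance 0 means sigma = 1
z = sample_latent(mu, log_var, rng)    # a latent point the decoder would turn into a new sample
```

When the latent distribution exactly matches the standard normal prior (μ = 0, σ = 1), the KL term is zero; it grows as the encoder's distribution drifts away from the prior, which is what pushes the VAE's latent space toward N(0, I) during training.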

C. CHD RISK PREDICTOR
In this study, we use DNN models to predict CHD risk. The NN model was first proposed by Warren McCulloch and Walter Pitts in 1943 [33]. It has been applied successfully to speech recognition [34], emotion recognition [35], disease prediction [36], and so on. FIGURE 4 shows an example NN that has an input layer with n neurons, a hidden layer with two neurons, and an output layer with one neuron. The input layer is composed of neurons that represent the input features, whereas the neurons in the hidden and output layers receive the result of an activation function applied to the weighted summation of the neurons of the previous layer. The output of the NN represented in FIGURE 4 can be written as (7):

y = a(wx + b)  (7)

where a is an activation function, w is the weight matrix, x is the input vector, and b is the bias.
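A worked instance of equation (7) for a small network like the one in FIGURE 4 follows; the weights and inputs are arbitrary values chosen for illustration, and the sigmoid is used as the activation function a.

```python
import numpy as np

def sigmoid(v):
    """Activation a from eq. (7); squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

x = np.array([0.5, -1.0, 2.0])       # input vector, n = 3 features
W1 = np.array([[0.1, 0.4, -0.2],
               [0.3, -0.1, 0.2]])    # 2 hidden neurons x 3 inputs
b1 = np.array([0.0, 0.1])            # hidden-layer biases
w2 = np.array([0.7, -0.5])           # output-layer weights
b2 = 0.2                             # output-layer bias

h = sigmoid(W1 @ x + b1)             # hidden activations: a(Wx + b), eq. (7)
y = sigmoid(w2 @ h + b2)             # network output, a value in (0, 1)
```

Each layer is one application of equation (7): a weighted sum of the previous layer's outputs, plus a bias, passed through the activation.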
In the CHD risk predictor module, two DNN models are trained on the training groups prepared by splitting the whole training dataset. In practice, a dataset can include a subset with higher variance than most of the data, and this highly biased subset degrades the performance of predictive models. Therefore, we isolate the highly biased subset from the common subset using the RE of the PCA model. This also makes it possible to train two distinct predictive models, targeted at regular and biased inputs separately. By using two distinct classifiers for the regular and biased distributions, the method can classify both regular and biased input data well. Moreover, we augment the biased section with new samples generated by the VAE models; as a result, the performance of the predictive model trained on the biased section is further improved.
The two proposed DNN models share the same architecture, as shown in FIGURE 5. Each model has six hidden layers with 71, 51, 31, 11, 5, and 3 neurons, respectively, and all hidden layers use the ReLU activation function. The input layer consists of 12 neurons, one for each CHD risk factor used to predict the target variable. The output layer uses the sigmoid activation function for the binary classification problem; it returns the probability associated with class 1 as a value from 0 to 1, and the class with the higher probability is selected as the output. The prediction steps are as follows: first, the input data is given to the PCA model, and its RE on PCA is calculated. If the RE exceeds the threshold (TRE) estimated by (2), then DNN-biased, trained on the highly biased training group, is employed; otherwise, DNN-regular, trained on the regular training group with low RE, predicts the CHD risk.
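As a rough sketch, the described classifier architecture can be approximated with scikit-learn's MLPClassifier as a stand-in (the paper does not name its deep learning framework, and early stopping, epochs, and other training details are simplified here). The hidden layer sizes follow FIGURE 5; the data and labels are synthetic.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))                 # synthetic stand-in for the 12 risk factors
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # synthetic binary labels for the demo

# Six ReLU hidden layers (71, 51, 31, 11, 5, 3) and a sigmoid output,
# as described in the text; Adam, lr 0.001, batch 32 also follow the text.
dnn = MLPClassifier(hidden_layer_sizes=(71, 51, 31, 11, 5, 3),
                    activation="relu", solver="adam",
                    learning_rate_init=0.001, batch_size=32,
                    max_iter=200, random_state=0)
dnn.fit(X, y)
proba = dnn.predict_proba(X[:1])[0, 1]         # probability of the risky class (class 1)
```

For binary targets, MLPClassifier uses a logistic (sigmoid) output unit, which matches the described output layer; the proposed method would build two such models, one per training group.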

IV. EXPERIMENTAL STUDY
To evaluate the proposed method, we conducted two types of experiments. In the first, we showed how each model improves the prediction performance; in other words, the purpose of this experiment is to show the contribution of each model to the performance improvement. Therefore, we first trained a DNN-based predictive model on the whole training dataset without any other models, and it was used as a baseline. In the proposed method, we prepare training groups from the initial training dataset using the PCA and VAE models to improve on the baseline model. We showed step by step how the prediction performance was improved, first by the PCA model and then by the VAE models. The following models were compared in this experiment:
- A single DNN model trained on the whole training dataset, with the same architecture as the two DNNs used in the proposed method. It was used as the baseline model.
- Two DNN models trained on the training groups divided by the PCA model (the first step of preparing well-ordered training groups in the proposed method).
- Two DNN models trained on the training groups divided by the PCA model, where the highly biased training group was enriched by the two VAE models (the second step of preparing well-ordered training groups in the proposed method).
The comparison between the baseline model and the two-DNN approach shows that the two DNNs improve the performance of the single DNN model significantly. After that, we showed the further performance improvement of the two DNNs from enriching the highly biased training group with samples generated by the VAE models.
In the second kind of experiment, we compared the proposed method with machine learning-based algorithms, including NB, RF, KNN, DT, SVM, and AdaBoost.

A. EVALUATION METRICS
This section describes performance measurements for prediction models on the test dataset. The confusion matrix is a table to visualize the performance of classification models when data labels are available. It represents the total number of correct (True Positives (TP) and True Negatives (TN)) and incorrect predictions (False Positives (FP) and False Negatives (FN)).
Accuracy is the proportion of correct predictions among all data. It is defined by (8):

Accuracy = (TP + TN) / (TP + TN + FP + FN)  (8)

The True Positive Rate (TPR), also known as "Sensitivity" or "Recall", is the fraction of positive instances predicted correctly by the model, defined by (9):

Recall = TP / (TP + FN)  (9)

Precision is the fraction of TP predictions among all positive predictions; it evaluates the effectiveness of the positive predictions. Precision can be defined as (10):

Precision = TP / (TP + FP)  (10)

However, it is difficult to compare a model with low precision and high recall against one with high precision and low recall. Thus, the F-measure combines precision and recall, where a high value indicates a good result. It can be defined as (11):

F-measure = (2 × Precision × Recall) / (Precision + Recall)  (11)

The ROC curve is a graphical representation of the balance between the TPR (y-axis) and FPR (x-axis) of a classifier. It allows the performance of several classifiers to be compared and evaluated, and it indicates how well a classification model can distinguish between classes [37]. If the model is perfect, the area under the ROC curve (AUC) is close to 1; a model with a larger AUC is better.
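The metrics in equations (8)-(11) and the AUC can be computed with scikit-learn; the labels and scores below are a small made-up example for illustration.

```python
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]                     # ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                     # hard predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3]    # predicted probabilities

acc = accuracy_score(y_true, y_pred)     # (TP + TN) / all, eq. (8)
rec = recall_score(y_true, y_pred)       # TP / (TP + FN), eq. (9)
prec = precision_score(y_true, y_pred)   # TP / (TP + FP), eq. (10)
f1 = f1_score(y_true, y_pred)            # harmonic mean of precision and recall, eq. (11)
auc = roc_auc_score(y_true, y_score)     # area under the ROC curve
```

Here TP = 3, TN = 3, FP = 1, FN = 1, so accuracy, recall, precision, and F-measure all equal 0.75, while the AUC is computed from the probability scores rather than the hard predictions.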

B. DATASET
KNHANES is a nationwide program to evaluate the health and nutritional status of Koreans. It has been conducted continuously by the Korea Centers for Disease Control and Prevention (KCDC) since 1998 [38]. The KNHANES dataset consists of 3 parts, described in TABLE I: a medical examination, a health survey, and a nutrition survey. We analyzed samples spanning the years 2010-2015. In this experiment, we used a total of 25,340 records from the KNHANES dataset without a history of previous myocardial infarction or angina: if a patient had been diagnosed with myocardial infarction or angina and the age at first diagnosis was younger than the current age, the record was removed. The final output of the proposed method is a prediction of whether there is a risk of CHD for the given input. The final experimental dataset consisted of 10,991 men and 14,349 women; of these, 15,796 records were high risk and 9,544 records were normal. Risk factors including age, knee joint pain status, waist circumference, neutral fat, BMI, weight change in one-year status, SBP, TC, obesity status, frequency of eating out, HDL, and marital status were used to predict CHD risk [20]. General descriptions of the risk factors used in the experimental study are shown in TABLE II.

C. PARAMETER TUNING FOR COMPARED MACHINE LEARNING ALGORITHMS
To compare the proposed method with other machine learning algorithms, we used the sklearn library [39]. The following Python implementations and parameters were used for the compared machine learning algorithms:
- DT: criterion — "gini" for the Gini impurity and "entropy" for the information gain; they were used to identify the best splitting candidate. Selected: criterion = "entropy".
- RF: n_estimators — the number of trees in the forest, configured between 10 and 200 in increments of 10; criterion — "gini" and "entropy" were used as splitting criteria. Selected: n_estimators = 80, criterion = "entropy".
- SVM: kernel — specifies the kernel type to be used in the algorithm. It must be one of "linear", "poly", "rbf", or "sigmoid".
The proposed DNN model was trained with the Adam optimizer [40], a learning rate of 0.001, a batch size of 32, and 1000 epochs. Early stopping [41] with validation accuracy as the stopping criterion and a patience of 500 epochs was applied. The proposed method uses 90% of the data for training, 10% of the training set for validation, and the remaining 10% of the data for testing. The VAE models were trained with the Adam optimizer, a learning rate of 0.001, a batch size of 8, and 1000 epochs.
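The compared baselines with the stated settings can be sketched with sklearn as below. The selected SVM kernel is not reported above, so "rbf" is an assumption here, and all parameters not listed in the text fall back to sklearn defaults:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Parameters stated in the text; the SVM kernel value is an assumption,
# and everything not mentioned uses the sklearn default.
models = {
    "NB":       GaussianNB(),
    "RF":       RandomForestClassifier(n_estimators=80, criterion="entropy"),
    "KNN":      KNeighborsClassifier(),
    "DT":       DecisionTreeClassifier(criterion="entropy"),
    "SVM":      SVC(kernel="rbf", probability=True),  # probability for AUC
    "AdaBoost": AdaBoostClassifier(),
}
```

Each model is then fitted and scored per fold of the 10-fold cross-validation with the metrics of Section A.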

V. EXPERIMENTAL RESULTS
We conducted two types of experiments to evaluate the proposed method. The first experiment shows how the components of the proposed method work together to improve prediction performance. The second experiment compares the proposed method with machine learning algorithms such as NB, RF, KNN, DT, SVM, and AdaBoost.

A. RESULTS OF THE FIRST EXPERIMENT
This section details how the two DNN models trained on the prepared training groups improve on the baseline single DNN model. We compared the baseline model to the Two-DNN and VAE-Two-DNN models.
The single DNN was trained on the whole training dataset. The Two-DNN was trained on the highly biased and regular training groups separated by the PCA model. For VAE-Two-DNN, we enriched the highly biased training group with samples generated from the VAE models; 3,000 samples were generated for each class label (normal and risky).

1) PREPARE HIGHLY BIASED AND REGULAR GROUPS BY SPLITTING THE TRAINING DATASET USING PCA
According to the proposed method, the training dataset is divided into two groups based on the PCA model. First, the PCA model was trained on the whole training dataset, so it fits the commonly distributed data better. Therefore, data that differs from the majority yields a higher RE on the PCA model than common data. Based on this characteristic, we distinguished the highly biased section of the training dataset. By separating highly biased and commonly distributed data, the high-divergence subset can be modeled independently to improve prediction performance. In this experiment, the PCA model used 6 principal components, which explain 95% of the input variance.
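The split described above can be sketched with sklearn's PCA on synthetic stand-in data. The exact RE formula and the threshold multiplier are assumptions (the text only states that the threshold is estimated from the mean and standard deviation of the training REs):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)

# Toy stand-in for the training features: a regular majority lying near a
# low-dimensional subspace, plus a small high-variance ("highly biased") subset.
W = rng.normal(size=(4, 12))
X_regular = rng.normal(0, 3, (900, 4)) @ W + rng.normal(0, 0.1, (900, 12))
X_biased  = rng.normal(0, 4, (100, 12))
X = np.vstack([X_regular, X_biased])

# Fit PCA on the whole training set; the paper uses 6 principal components,
# chosen to explain 95% of the input variance on its data.
pca = PCA(n_components=6).fit(X)

# Reconstruction error (RE): distance between a sample and its projection
# back from the principal subspace (Euclidean distance is an assumption).
X_hat = pca.inverse_transform(pca.transform(X))
re = np.sqrt(((X - X_hat) ** 2).sum(axis=1))

# Threshold from the mean and standard deviation of the training REs;
# the multiplier 2.0 is an assumption.
threshold = re.mean() + 2.0 * re.std()

highly_biased = X[re > threshold]   # modeled by the "irregular" DNN
regular       = X[re <= threshold]  # modeled by the "regular" DNN
```

Because the PCA subspace is fitted mostly by the common data, samples far from that subspace (the biased subset here) receive large REs and end up in the highly biased group.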
Each group contained data labeled both risky and normal. Across the partitions of the 10-fold cross-validation, the highly biased group accounted for approximately 9.07% of the whole training dataset (from 8.94% to 9.42% in each fold), and about 82.6% of it was labeled as risky. FIGURE 7 reports the mean values (and standard deviations) of risk factors such as age, waist circumference, neutral fat, body mass index, systolic blood pressure, total cholesterol, and high-density lipoprotein cholesterol in the two groups. The deviation of the risk factors in G1 was lower than in G2, and about 60.3% of G1 was labeled as risky. Moreover, the average neutral fat, total cholesterol, and systolic blood pressure increased significantly in G2 [25]. For both compared models, the difference in AUC was statistically significant (p-value < 0.000001).

2) ENRICH HIGHLY BIASED TRAINING GROUP USING VARIATIONAL AUTOENCODERS
The distinguished highly biased training group is one of the groups prepared by the PCA model. It consists of data with high RE and may not be sufficient for building a model because it accounts for a small percentage of the total dataset. In the experiment, 9.07% of the whole training dataset belonged to the highly biased group, of which 82.6% was risky and 17.4% was normal. The proposed method addresses this problem by enriching both the risky and normal instances in the highly biased training group via two deep VAE models.
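The enrichment can be sketched as follows. A trained VAE encoder outputs a mean and log-variance per latent dimension; new samples are drawn in latent space with the reparameterization trick and passed through the decoder. The linear decoder below is a hypothetical stand-in (the method's VAEs are deep networks) and the prior parameters are illustrative; the sketch shows only the sampling and augmentation bookkeeping:

```python
import numpy as np

rng = np.random.RandomState(0)
latent_dim, n_features = 2, 12

# Hypothetical stand-in for a trained decoder: a single linear map.
# In the actual method this is a deep VAE decoder, trained per class.
W_dec = rng.normal(size=(latent_dim, n_features))
b_dec = rng.normal(size=n_features)

def decode(z):
    return z @ W_dec + b_dec

def sample_from_vae(n_samples, mu, logvar):
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.normal(size=(n_samples, latent_dim))
    z = mu + np.exp(0.5 * logvar) * eps
    return decode(z)

# The method generates 3,000 samples per class label (risky / normal),
# each from its own VAE; standard-normal latent parameters are assumed here.
generated_risky  = sample_from_vae(3000, mu=np.zeros(latent_dim),
                                   logvar=np.zeros(latent_dim))
generated_normal = sample_from_vae(3000, mu=np.zeros(latent_dim),
                                   logvar=np.zeros(latent_dim))

# Enrich the highly biased training group with the generated samples.
X_biased = rng.normal(size=(500, n_features))      # placeholder group
X_enriched = np.vstack([X_biased, generated_risky, generated_normal])
```

Training each class with its own VAE keeps the generated samples label-consistent, so the enriched group can be used directly as supervised training data.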
This section describes the performance improvement of the Two-DNN model introduced in the previous section when using the enriched training group. We generated 3,000 samples for each class label (normal and risky) using two different VAE models. Even though 82.6% of the highly biased group was risky, the number of risky instances was still not large enough for training; therefore, we used two VAE models to generate both risky and normal data. FIGURE 9 compares the Two-DNN and VAE-Two-DNN methods. In Two-DNN, two DNN models were trained independently on the divided training groups directly. In VAE-Two-DNN, the highly biased training group was first augmented with newly generated samples, and then the two DNN models were trained independently on these groups. As a result, VAE-Two-DNN outperformed Two-DNN on every metric, increasing accuracy, precision, recall, specificity, and f-measure by 1.9%, 1.07%, 2.12%, 1.77%, and 1.59%, respectively.

In summary, every step of the proposed method improved the prediction performance of the baseline model. In the first step, the irregular (highly biased) dataset was distinguished from the training dataset using the RE from the PCA model, and two DNN models were trained on the separated groups. When predicting CHD risk with these two models, the PCA model receives the input first and returns its RE. If the returned RE is higher than the threshold, the DNN model trained on the irregular training group predicts the CHD risk; otherwise, the DNN model trained on the regular training group is employed. In the second step, the irregular training group was enriched with samples generated from the VAE models because it contained too few instances to build a predictive model. After that, the two DNN models were trained on the regular and the enriched highly biased training groups separately.
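The prediction-time routing can be sketched as below. Two logistic-regression classifiers stand in for the paper's two DNNs, the data and the threshold multiplier are assumptions, and the groups are given rather than derived; only the routing logic is the point:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(1)

# Stand-in training groups: a regular majority and a high-variance subset,
# with synthetic binary labels (1 = risky, 0 = normal).
X_reg = rng.normal(0, 1, (400, 8))
X_irr = rng.normal(0, 4, (60, 8))
y_reg = (X_reg[:, 0] > 0).astype(int)
y_irr = (X_irr[:, 0] > 0).astype(int)

pca = PCA(n_components=4).fit(np.vstack([X_reg, X_irr]))

def recon_error(X):
    X_hat = pca.inverse_transform(pca.transform(X))
    return np.sqrt(((X - X_hat) ** 2).sum(axis=1))

re_train = recon_error(np.vstack([X_reg, X_irr]))
threshold = re_train.mean() + 2 * re_train.std()   # multiplier assumed

# Two classifiers stand in for the paper's two DNNs, one per group.
clf_regular   = LogisticRegression().fit(X_reg, y_reg)
clf_irregular = LogisticRegression().fit(X_irr, y_irr)

def predict_chd_risk(x):
    """Route the input to the model matching its PCA reconstruction error."""
    x = x.reshape(1, -1)
    model = clf_irregular if recon_error(x)[0] > threshold else clf_regular
    return model.predict(x)[0]
```

At inference time only the PCA model and one of the two classifiers run per input, so the routing adds little overhead.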
FIGURE 10 shows the improvement of the baseline model step by step.

B. RESULTS OF THE SECOND EXPERIMENT
We compared six machine learning algorithms, namely KNN, NB, DT, RF, AdaBoost, and SVM, with the proposed method using 10-fold cross-validation. Recall measures what proportion of the data labeled as risky was predicted correctly, and precision measures what proportion of all risky predictions was correct. The proposed VAE-Two-DNN improved the recall, precision, and f-measure by 5.68%, 5.2%, and 5.44%, respectively. Therefore, VAE-Two-DNN successfully improved the prediction of both normal and risky cases.

A comparative evaluation of the proposed method against existing CHD risk prediction methods on our experimental dataset is limited because the existing methods are not publicly available. Therefore, we did not run the existing methods on our experimental dataset; instead, the comparison was made by taking the results reported in the corresponding papers. TABLE IX shows the comparison between the existing methods from previous studies and the proposed method on the KNHANES dataset; the highest evaluation scores are marked in bold.

VI. CONCLUSIONS
In this study, we proposed a CHD risk prediction method based on two DNN models and applied it to the KNHANES dataset. The proposed method prepares an efficient training dataset by distinguishing and enriching the highly biased subset that degrades model performance, using the PCA and VAE models. First, we separated the highly biased subset from the whole training dataset using the PCA model, because this subset degrades the performance of predictive models. A single predictive model trained on the whole training dataset can be improved upon by two predictive models trained on the highly biased and the remaining common subsets, and to achieve this separation we used the RE from the PCA model. As a result, the performance of the CHD risk predictor based on a single DNN was improved by the two DNN models trained on the partitioned training groups. To further improve prediction performance despite the insufficient number of instances in the highly biased training group, the proposed method uses two deep VAE models. The performance of the CHD risk predictor based on the two DNN models was improved again by training them on the VAE-enriched highly biased training group and the regular training group (accuracy: 0.892, precision: 0.911, recall: 0.920, specificity: 0.844, f-measure: 0.915, AUC: 0.882).
To evaluate the proposed method, it was compared with various machine learning algorithms. The evaluation results showed that the proposed method improved the accuracy, specificity, f-measure, and AUC of NB by 13.3%, 19.2%, 18.3%, and 7.5%, of SVM by 11.2%, 19.4%, 16.5%, and 5.4%, and achieved comparable improvements over DT, KNN, AdaBoost, and RF. In short, this study proposed a comprehensive prediction method using PCA, VAE, and DNN models. The two DNNs trained on the training groups partitioned according to the PCA model significantly improve performance, and the proposed method raises performance further using the VAE-based enriched training group. We showed the performance improvement obtained from the PCA and VAE models in the first experiment, and the comparison between the proposed method and other machine learning algorithms in the second experiment.
A limitation of the proposed method is that it does not handle missing values. Therefore, in our future study we will focus on handling missing values by generating replacements with the VAE model. In addition, the reconstruction-error-based threshold was estimated from the mean and standard deviation of the reconstruction errors of the training dataset on the PCA model; finding the optimal threshold remains a challenge for this module.