Precision Clinical Medicine through Machine Learning: Using High and Low Quantile Ranges of Vital Signs for Risk Stratification of ICU Patients

Remote monitoring of patients in the intensive care unit (ICU) is a crucial observation and assessment task that is necessary for precision medicine. In this work, we follow the state-of-the-art in risk stratification through machine learning-based prediction, but with minimal features that rely on vital signs, the most commonly used physiological variables obtained inside and outside hospitals. We provide a formal representation of a feature engineering algorithm and report the development and validation of three reproducible machine learning prediction models: ICU patient readmission, abnormality, and next-day vital sign measurements. For the readmission model, we proposed two solutions for data with imbalanced classes and applied five binary classification algorithms to each solution. For the abnormality model, we applied the same five algorithms to predict whether a patient will show abnormal health conditions. Our findings indicate that we can still achieve a reasonable performance using these machine learning models by focusing on low and high quantile ranges of vital signs. The best accuracy achieved in the readmission model was around 67.53%, with an area under the receiver operating characteristic (AUROC) of 0.7376. The highest accuracy achieved in the abnormality model was around 67.40%, with an AUROC of 0.7379. For the next-day vital sign measurements model, we provide three approaches for selecting model predictors and apply the eXtreme Gradient Boosting (XGB) and Random Forest Regression (RFR) algorithms to each solution. We found that, in general, the use of the most recent vital sign measurements achieves the least prediction error. 
Considering the large investment from the medical industry in patient monitoring devices, the developed models will be incorporated into an Intelligent ICU Patient Monitoring (IICUPM) module that can potentially facilitate the delivery of high quality care by implementing cost-efficient policies for handling the patients who utilize ICU resources the most.


I. INTRODUCTION
PRECISION health and medicine [1] is a relatively new initiative that aims to improve equitable care and overall individual and population health through targeted observation and assessment, early detection, prevention, and treatment, as well as precision health promotion and engagement. Intensive care units (ICUs) often collect and use a high volume of patients' data to enable physicians and ICU nurses to make timely decisions in delivering high-quality critical care. The use of artificial intelligence (AI) and machine learning (ML) methods to improve the care and treatment of critically ill patients has grown substantially in recent years [2], [3]. ML approaches for analysis of clinical datasets offer great promise for the delivery of personalized medicine and targeted treatment, but these approaches must be customized and optimized [4], [5] for each specific application.

FIGURE 1: The intelligent remote patient monitoring (IRPM) framework comprises intelligent ICU patient monitoring (IICUPM) and out-of-hospital modules. MIMIC stands for the Medical Information Mart for Intensive Care.

A. INTELLIGENT REMOTE PATIENT MONITORING (IRPM) FRAMEWORK
We have recently developed a cloud-based intelligent remote patient monitoring (IRPM) framework [6] that consists of several modules (Figure 1):
1) Intelligent ICU patient monitoring (IICUPM) module: through the IICUPM module's interface, the hospital system can load clinical and demographic characteristics for either an individual patient or a patient population. The module provides five predictive ML models to process the data and return risk scoring results.
2) Out-of-hospital module: unlike the IICUPM module, this module targets individual patients. Through the module's interface, a patient's readings from wearable devices (e.g., heart rate and SpO2) are uploaded into an abnormality ML prediction model, which returns an overall assessment or risk scoring result.
3) IRPM core system and database: the IICUPM and out-of-hospital interfaces interact with the core IRPM system, which sends the data to the ML model development modules. After processing the data, the ML models send the results back to the IRPM framework to record them in the database.

B. CLINICAL CONSIDERATIONS
The use of prognostic models is a common practice in precision clinical medicine to formally combine predictors from which risks of a specific endpoint can be calculated for individual patients. The clinical goal is to ensure that patients are placed on the appropriate care pathway, including proper ICU type and level of care. The proposed framework provides not only healthcare provider-focused services but also patient-facing digital health services through the out-of-hospital module. This article sheds light on a portion of the ongoing development of the IRPM framework. We developed a prototype for a dashboard that can be used for triaging patients and navigating care to support clinical decision-making by using predicted risk to preemptively triage patients across different levels of the healthcare system. The results of applying the feature engineering algorithm and prognostic ML models will help determine the proper level of care based on the predicted risk level. Therefore, high-risk patients can be managed at higher-level care facilities while lower-risk patients can be managed at lower levels of care.

C. TECHNICAL CONTRIBUTIONS
The contributions of this article are three-fold:
• Building on our recent development of the IRPM framework, we provide three more ML models in the IICUPM module for risk stratification of ICU patients: readmission, abnormality, and next-day vital sign measurements.
• In addition to providing three transparent and reproducible ML models, we present details of a feature engineering algorithm that can be deployed in different critical care settings.
• We provide two different solutions for data with imbalanced classes and three different variations for predicting next-day vital sign measurements to show that our solutions can predict health outcomes with reasonable performance. We evaluated all proposed solutions and applied them to ICU patient data from a publicly available database to determine the best use on our IICUPM module.

II. RELATED WORK
We discuss research efforts on building ML models to predict ICU-related outcomes. Some researchers have studied patient readmission to the ICU as an outcome. For example, Rajkomar et al. [7] developed a deep learning (DL) model to predict 30-day unplanned hospital readmission. In our study, we predict readmission to the ICU during the same hospitalization rather than readmission after discharge from the hospital. Some research has involved the development and deployment of ML models to predict health status using the Medical Information Mart for Intensive Care (MIMIC) database [8]-[10]. However, most of these investigators have used an exhaustive list of features to achieve higher accuracy in their models. We discuss some of them below. According to the MIMIC website, only a few studies describe the development and deployment of prediction models for ICU readmission using the MIMIC database. To the best of our knowledge, predicting ICU patient readmission and abnormality based on balanced classification using only vital signs and demographic attributes from MIMIC has not been studied previously. Nor are we aware of any research performed on the prediction of ICU patients' next-day vital sign measurements.
Lin et al. [11] developed models to predict ICU patient readmission within 30 days of discharge using a recurrent neural network (RNN) with long short-term memory (LSTM). They used several features from the MIMIC-III database, including 17 chart events, 4 demographic features, and the International Classification of Diseases, 9th Revision (ICD-9) code. In our study, we predict patient readmission without prior knowledge of their medical conditions or diagnoses.
Fialho et al. [12] developed a model for predicting ICU readmission using patient data before ICU discharge. Their dataset included data from 893 non-readmitted patients and 135 readmitted patients. They used several features from MIMIC-II, including 6 monitoring signals, the results of 15 laboratory tests, and urine output. In our research, we used data from only the first day of each patient's ICU stay and only 11 input features. We also apply class-imbalance techniques to obtain a balanced dataset with the same number of samples in the readmitted and non-readmitted patient classes.
Shin et al. [13] built three prediction models using two ML algorithms to predict hospital readmission in pediatric asthma patients from a regional hospital in Memphis, TN. They also used 12 features, including demographic attributes, biomarkers, and socioeconomic factors. The goal was to predict readmission within one year of the initial hospitalization. They compared a model based solely on socioeconomic features derived from the patients' residential neighborhood to a model that is based on clinical features derived from the patient record. They found that the model based on socioeconomic factors achieved accuracy that is as good as that of the biomarker-based model.
Golmaei and Luo [14] examined a novel DL framework based on a deep patient representation. They assessed the framework on 30-day hospital readmission using patient data extracted from the MIMIC-III database. Their results show that the novel framework achieves a better predictive power than do the baseline models.
Pakbin et al. [15] developed risk-of-ICU-readmission models for predicting readmission to the ICU at different target times. They studied patient ICU readmission after a hospital discharge and during single hospital admission and extracted several features from the MIMIC III database.
Momenzadeh et al. [16] applied cluster centroids as an under-sampling method to reduce the number of majority class samples. They used k-means clustering to partition the majority class into k clusters and varied the number of clusters: equal to the size of the minority class, and two and three times that size. They report that this strategy affects accuracy and avoids model over-fitting.

III. THE QUANTILES APPROACH
In this section, we introduce the quantiles approach, which not only focuses on a patient's baseline characteristics but also performs feature engineering steps to add richer features to the data.

A. PATIENT BASELINE VITAL SIGN FEATURES
In general, a patient will have a set of vital sign readings in the baseline that are often normally distributed [17]. Vital signs include body temperature (BT), heart rate (HR), respiration rate (RR), arterial systolic blood pressure (SBP), arterial diastolic blood pressure (DBP), peripheral oxygen saturation (SpO2), and blood glucose level (GL). These vital signs are readily obtainable from the electronic health record (EHR) because they are measured frequently.

B. PATIENT DETERIORATING CONDITIONS
Vital signs are recorded in a sequential manner, and research studies often process these features either with time series analyses or by aggregating them for each patient using some measure (e.g., mean or median). Thus, when dealing with sequential vital sign readings, some researchers [12] use the mean value of the vital sign observations rather than using observations that may deviate far from the median. In our approach, we argue that a patient's deteriorating condition often happens at a high or low level of measurement. Thus, we believe that these observations are essential, as they capture dramatic changes in a patient's health status.

C. FEATURE ENGINEERING ALGORITHM IN THE QUANTILES APPROACH
We have previously proposed the notion of the quantiles approach [6], in which we perform feature engineering by emphasizing high and low quantiles of a patient's vital sign observations. Our previous work showed that the quantiles approach provides a richer dataset by engineering a list of new features. Algorithm 1 shows the steps performed by the feature engineering algorithm. The algorithm has three inputs: a list of patient samples P, and two desired probabilities for the percent point function (PPF), PPF_H and PPF_L. The algorithm iterates through all patients in P and, for each ICU stay S, extracts the baseline vital sign features V_baseline from the first day only, normalizes the observations in those features using the probability density function (PDF), and sorts them in ascending order. The algorithm next extracts two discrete values using the PPF function, DiscreteValue_L and DiscreteValue_H, and uses these values as thresholds to extract observations that fall in low and high quantile ranges. For each vital sign in V_baseline, it calculates the following new features: the list of modified means ModMeans, the list of modified standard deviations ModSDs, and the list of quantile percentages QPercents of each baseline vital sign feature. Adding the new features to the baseline features V_baseline produces a new list V_quantiles that achieves a better predictive power than would be obtained from only baseline vital sign features. The quantiles percentage, Q_percent, is calculated according to equation (1). ObsInQ1 and ObsInQ4 represent the vital sign observations that occur in the first and fourth quantiles, respectively.
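The feature engineering steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the default probabilities 0.25 and 0.75 are placeholders for PPF_L and PPF_H, the thresholds come from the inverse CDF (the PPF) of a normal distribution fitted to the readings, and the quantile percentage is assumed to be the share of readings that fall in the low or high range, since equation (1) is not reproduced in this text.

```python
from statistics import NormalDist, mean, stdev


def quantile_features(observations, ppf_low=0.25, ppf_high=0.75):
    """Return (mod_mean, mod_sd, q_percent) for one vital sign.

    ppf_low / ppf_high play the role of PPF_L / PPF_H; the defaults are
    illustrative assumptions, not values taken from the paper.
    """
    obs = sorted(observations)
    if len(obs) < 2 or stdev(obs) == 0:
        # Too few or constant readings: there are no tails to extract.
        return mean(obs), 0.0, 0.0
    # Fit a normal distribution; its inverse CDF is the PPF.
    fitted = NormalDist(mean(obs), stdev(obs))
    low_cut = fitted.inv_cdf(ppf_low)    # DiscreteValue_L
    high_cut = fitted.inv_cdf(ppf_high)  # DiscreteValue_H
    # Keep only readings in the low or high quantile range.
    tails = [x for x in obs if x <= low_cut or x >= high_cut]
    if not tails:
        return mean(obs), 0.0, 0.0
    mod_mean = mean(tails)
    mod_sd = stdev(tails) if len(tails) > 1 else 0.0
    # Assumed form of the quantile percentage (share of tail readings).
    q_percent = 100.0 * len(tails) / len(obs)
    return mod_mean, mod_sd, q_percent


def engineer_features(first_day_vitals, ppf_low=0.25, ppf_high=0.75):
    """Map {vital name: first-day readings} to baseline plus engineered features."""
    features = {}
    for name, readings in first_day_vitals.items():
        mod_mean, mod_sd, q_percent = quantile_features(readings, ppf_low, ppf_high)
        features[name + "_mean"] = mean(readings)  # baseline feature
        features[name + "_mod_mean"] = mod_mean
        features[name + "_mod_sd"] = mod_sd
        features[name + "_q_percent"] = q_percent
    return features


example = engineer_features({"HR": [60, 62, 65, 70, 72, 75, 80, 95, 110, 120]})
```

Each vital sign contributes three engineered features in this sketch, so seven vital signs yield 21 engineered features in total.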

A. POPULATION SELECTION AND DATA EXTRACTION
We extracted hospital admission data from the publicly available ICU adult patient database, MIMIC-III (v1.4) [9]. The database is structured such that each hospital admission may contain one or more ICU stays and each ICU stay may span several days. On each day, a patient may have several recorded observations. We started with a total of 61,532 ICU stays (Figure 2) and extracted data recorded on only the first day of a patient's ICU stay, which resulted in a total of 45,254 unique ICU stays with demographic features (age, sex, height, and weight). We combined that with 59,241 ICU stay encounters that contained data pertaining to seven vital sign features (BT, HR, RR, SBP, DBP, SpO2, and GL). The total after merging was 44,626.

Table 1 summarizes the rationale behind each model. The goal of the readmission model is to predict readmission risk, which might be an indicator that the current ICU type is not the best choice for the patient. The hospital system might decide to delay a patient discharge and increase the care level or move a patient from one ICU type to another. The outcome variable for the readmission model is a binary feature indicating whether a patient has been readmitted to the ICU (readmission = 1) or not (readmission = 0) within a single hospital admission. We define readmission as an incident in which a patient had been admitted to the ICU, discharged to the appropriate hospital population, and readmitted again to the ICU during a single hospital admission. If a patient had been admitted to the ICU and discharged once during a single hospital admission, that patient is not considered readmitted.

The goal of the abnormality model is to predict the pool of patients who risk showing undesired health conditions. For the abnormality model, the outcome is the abnormality of a patient, which is a binary feature indicating whether a patient is normal (normality = 0) or abnormal (normality = 1).
We define abnormality as the presence of one or more of the following undesired health conditions in a patient record: death, readmission to the ICU, or prolonged ICU stay.
Readmission and abnormality are different clinical goals. To measure abnormality, we use 3 metrics, one of which is the readmission flag. To measure readmission, we use only the readmission flag as a metric. The "within a single hospital admission" criterion applies to the readmission flag in both models.
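The two outcome definitions above can be illustrated with a short labeling sketch. The record fields (`hadm_id`, `died`, `los_days`) and the `prolonged_days` threshold are hypothetical; the text does not state the cutoff for a prolonged ICU stay, and flagging every ICU stay of a multi-stay admission as readmitted is a simplifying assumption.

```python
from collections import Counter


def label_stays(stays, prolonged_days=7.0):
    """Return one (readmission, abnormality) pair per ICU stay.

    stays: list of dicts with hypothetical keys
      hadm_id  - hospital admission identifier
      died     - True if the patient died
      los_days - ICU length of stay in days
    prolonged_days is an assumed threshold, not taken from the paper.
    """
    # Count ICU stays within each hospital admission.
    stays_per_admission = Counter(s["hadm_id"] for s in stays)
    labels = []
    for s in stays:
        # Readmission: more than one ICU stay in a single hospital admission.
        readmission = int(stays_per_admission[s["hadm_id"]] > 1)
        prolonged = s["los_days"] > prolonged_days
        # Abnormality: death, readmission to the ICU, or prolonged ICU stay.
        abnormality = int(s["died"] or readmission == 1 or prolonged)
        labels.append((readmission, abnormality))
    return labels


demo_stays = [
    {"hadm_id": 1, "died": False, "los_days": 2.0},
    {"hadm_id": 1, "died": False, "los_days": 3.0},
    {"hadm_id": 2, "died": False, "los_days": 1.0},
    {"hadm_id": 3, "died": True, "los_days": 1.0},
    {"hadm_id": 4, "died": False, "los_days": 10.0},
]
demo_labels = label_stays(demo_stays)
```

In this sketch a readmitted stay is always abnormal, while an abnormal stay need not be readmitted, which mirrors the relationship between the two flags described above.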
The goal of the next-day vital sign measurements model is to predict ICU patients' physiological variables daily, thus providing early warnings about potential deterioration in a patient's condition based on prior physiological measurements.

Algorithm 1 Feature engineering algorithm
1: INPUT: P, PPF_H, PPF_L
2: OUTPUT: V_quantiles
3: for p_i ∈ P do
4: Means ← Means + Mean_k
...
12: DiscreteValue_H ← PPF(v_k, PPF_H)
...
15: for obs_i ∈ v_k do
16: if obs_i ≤ DiscreteValue_L then
17: QObs_k ← QObs_k + obs_i
...

We used the six main vital sign measurements (BT, HR, RR, SBP, DBP, and SpO2) along with GL and the four demographic attributes (age, sex, weight, and height) as predictor variables. Table 2 summarizes the cohort characteristics for the different variables used in these models. The vital sign measurements are calculated based on the first day of the patient's ICU stay.

C. READMISSION MODEL
The readmission prediction reduces to a binary classification problem with two classes: non-readmitted (N = 41,597 stays) and readmitted (N = 3,029 stays). This results in data with imbalanced classes [18] with 93.21% of the patients in the non-readmitted class and only 6.79% in the readmitted class.

1) Sampling Approaches
Several re-sampling approaches address data with imbalanced classes. Two of the most popular are under-sampling of the majority class and over-sampling of the minority class. The readmitted patients in our case represent the minority class (3,029 samples), while the non-readmitted patients represent the majority class (41,597 samples). Under-sampling discards patient data, which may cause model under-fitting, while over-sampling generates new samples by duplicating samples in the minority class. The latter may require more computational power to process the extra samples (83,194 samples compared to only 6,058). Thus, we elected to use the under-sampling approach. To overcome the limitation of this approach, we also apply a second approach that clusters patients in the majority class into equally sized groups before merging them back together. This mitigates the lack of variation that under-sampling can introduce (e.g., selecting patients that belong to a single ICU type or to a single hospital).
• Under-sampling of majority class re-sampling: in this approach, we kept the 3,029 patients in the minority class and randomly removed patients from the majority class to retain only 3,029 of the total 41,597 patients in that class.
• Clustering of majority class re-sampling: in this approach, we applied the k-means clustering algorithm to cluster the 41,597 patients in the majority class into equally sized groups (in this case, 5 clusters of about 606 samples each) before merging them back together to obtain 3,029 patients (Figure 3). We selected 5 clusters because, after applying the elbow method, we observed that the within-cluster sum of squares (WCSS) drops significantly at 5 clusters.
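The two re-sampling strategies above can be sketched as follows. This is an illustrative outline, not the authors' code: the function names are hypothetical, and the cluster labels are assumed to come from a separate k-means step that is not shown here.

```python
import random
from collections import defaultdict


def random_undersample(majority, n_keep, seed=0):
    """Randomly keep n_keep items of the majority class (no duplicates)."""
    return random.Random(seed).sample(majority, n_keep)


def cluster_undersample(majority, cluster_labels, n_keep, seed=0):
    """Draw a near-equal share of samples from every cluster.

    cluster_labels[i] is the cluster of majority[i]; the labels are assumed
    to be produced beforehand (e.g., by k-means with k chosen via the elbow
    method). Raises ValueError if a cluster is smaller than its share.
    """
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for item, label in zip(majority, cluster_labels):
        by_cluster[label].append(item)
    # Split n_keep as evenly as possible across the clusters.
    base, extra = divmod(n_keep, len(by_cluster))
    picked = []
    for i, label in enumerate(sorted(by_cluster)):
        quota = base + (1 if i < extra else 0)
        picked.extend(rng.sample(by_cluster[label], quota))
    return picked


# Toy stand-in for the 41,597 majority-class stays with 5 cluster labels.
majority_ids = list(range(41597))
toy_labels = [i % 5 for i in majority_ids]
balanced_random = random_undersample(majority_ids, 3029)
balanced_clustered = cluster_undersample(majority_ids, toy_labels, 3029)
```

Because 3,029 is not divisible by 5, an even split in this sketch gives four clusters 606 samples and one cluster 605.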

2) Classification Model Development
We built two variations of the readmission model, one using the baseline approach and one using the quantiles approach. We ran each model variation on both the imbalanced patient population before applying any re-sampling approaches and the balanced population after applying the re-sampling approaches described above. We used supervised learning techniques because the model outcomes are labeled. In particular, we applied five common ML binary classification algorithms: 1) logistic regression (LR); 2) linear discriminant analysis (LDA); 3) random forest (RF); 4) k-nearest neighbors (KNN); and 5) support vector machine (SVM).

3) Model Evaluation
For the imbalanced dataset, we randomly split the dataset into 75% training (N = 33,469) and 25% test set (N = 11,157). We then trained the readmission models using the training set and 10-fold cross-validation to avoid over-fitting and validated their performance using unseen data as a test set.
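The split sizes quoted above follow from a 75/25 partition of the 44,626 stays. The sketch below shows one way to generate such a split and the 10 cross-validation folds on the training portion; the helper names are hypothetical and the shuffling scheme is an assumption, since the text does not describe how the random split was drawn.

```python
import random


def train_test_split_indices(n, train_fraction=0.75, seed=0):
    """Shuffle indices 0..n-1 and cut them into train and test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * train_fraction)
    return idx[:n_train], idx[n_train:]


def k_fold_indices(train_idx, k=10):
    """Yield (fit, validation) index lists for k-fold cross-validation."""
    folds = [train_idx[i::k] for i in range(k)]
    for i in range(k):
        fit = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield fit, folds[i]


train_idx, test_idx = train_test_split_indices(44626)
cv_splits = list(k_fold_indices(train_idx, k=10))
```

Only the training indices are passed to the cross-validation helper, so the test set stays unseen until the final evaluation.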
For the datasets resulting from the under-sampling and clustering approaches, we randomly split the datasets into training (N = 4,543) and test (N = 1,515) sets. We then trained the readmission models using the training sets and 10-fold cross-validation and measured the performance on an unseen test set from the same population. We used the accuracy along with 95% confidence interval (CI), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) as performance metrics.
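For reference, the performance metrics listed above can all be derived from the four cells of a binary confusion matrix. The sketch below is illustrative; the function name is hypothetical, and the 95% CI uses the normal (Wald) approximation as an assumption, since the text does not state which interval method was used.

```python
import math


def classification_metrics(y_true, y_pred):
    """Compute accuracy (with an approximate 95% CI) and related metrics."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = tp + tn + fp + fn

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    accuracy = ratio(tp + tn, n)
    # Wald interval for a proportion (assumed method).
    half_width = 1.96 * math.sqrt(accuracy * (1 - accuracy) / n)
    return {
        "accuracy": accuracy,
        "ci95": (accuracy - half_width, accuracy + half_width),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "ppv": ratio(tp, tp + fp),          # positive predictive value
        "npv": ratio(tn, tn + fn),          # negative predictive value
    }


demo_metrics = classification_metrics(
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0, 1, 1],
)
```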

4) Hyper-Parameter Tuning
We performed hyper-parameter tuning by running a grid search to select the best parameters in the different algorithms.
• For the under-sampling approach, we set the maximum number of features to consider when searching for a good split in RF to four and the number of trees in the RF to 300 for models in both the baseline and quantiles approaches. We used the radial basis function as the kernel type for SVM and set the penalty parameter of the error term, C, to 0.20 for models in the baseline approach and to 0.60 for models in the quantiles approach.
• For the clustering re-sampling approach, we set the number of trees in the RF to 200 for models in both the baseline and the quantiles approaches, and the number of features was selected automatically. We used the radial basis function as the kernel type for SVM and set C to 0.10 in the baseline approach and to 0.50 in the quantiles approach.

D. ABNORMALITY MODEL
The abnormality prediction reduces to a binary classification problem with two classes: normal (N = 19,620 stays) and abnormal (N = 25,006 stays). This results in data with imbalanced classes, with 43.97% of ICU stays in the normal class compared to 56.03% in the abnormal class. We randomly removed 5,386 samples from the abnormal class to obtain a balanced dataset with a total of 39,240 ICU stays across both classes (19,620 in each).

1) Classification Model Development
We built two variations of the abnormality model, one using the baseline approach and the other using the quantiles approach. Since the model outcome is labeled, we ran supervised learning algorithms on each model variation. We applied the same five ML algorithms used in the readmission model: LR, LDA, RF, KNN, and SVM.

2) Model Evaluation
To assess the goodness of fit in our classification models and validate them on an unseen test set from the same population, we compared the accuracy of the test set and the mean accuracy of the trained models along with corresponding 95% CI. We compared the sensitivity, specificity, PPV, and NPV on the test set. We also examined the difference in AUROC scores between the test and training sets.
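As a reminder of what the AUROC comparison measures, the score can be computed directly as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The sketch below uses this pairwise definition; the function name is hypothetical and the quadratic loop is meant for illustration rather than for large test sets.

```python
def auroc(y_true, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative scores only; these are not results from the paper.
train_auc = auroc([0, 0, 1, 1], [0.1, 0.2, 0.7, 0.9])
test_auc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
auc_gap = train_auc - test_auc
```

A large positive gap between the training and test AUROC would suggest over-fitting, which is the purpose of comparing the two.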

3) Hyper-Parameter Tuning
For RF, we set the maximum number of features to consider when searching for a good split to four and the number of trees to 600 for both the baseline and quantiles approaches. For SVM, we used the radial basis function as the kernel type and set the penalty parameter of the error term, C, to 0.20 in the baseline approach and to 0.60 in the quantiles approach.

E. NEXT-DAY VITAL SIGN MODEL
Since the outcome variable for the next-day vital sign measurements prediction is numeric, it is considered a regression problem. We built predictive models to predict the mean of the readings for each vital sign for the next day for each patient using only the baseline approach.

1) Sub-Population Selection
In this model, we considered vital signs for more than just the first day of the ICU stay. Here we extracted every day's vital signs after taking the mean of the readings for each sign. Also, since the goal is to predict the mean of each vital sign's readings over several days, we included only patients who stayed in the ICU for more than one week. The total number of ICU stays corresponding to those patients was N=7,438 (of the total N=44,626 in Figure 2).

2) Reference Time (Baseline) Selection Approaches
We explored three variations of the next-day vital sign measurement prediction (the values considered for evaluating the model on the test set):
• Day-by-day approach: we considered readings from the most recent day to predict the readings for the following day. More formally, given patient j, who stayed in the ICU for k days, to predict vital sign measurements on day k+1, we use vital sign measurements from day k as model inputs.
• Average measured approach: we considered long-term readings from all days in the most recent week to predict readings on the following day. More formally, given patient j, who stayed in the ICU for k days, to predict the vital sign measurements on day k+1, we use the average of the measurements from day 1 to day k as model inputs. In our case, we considered measurements for 1 week.
• Error adjustment approach: we ignored readings from the distant past and focused on short-term readings from the most recent day along with any error introduced in predicting the readings for that day. More formally, given patient j, who stayed in the ICU for k days, to predict the vital sign measurements of day k+1, we use the measurements of day k and the error between the actual and predicted values of day k.
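The three ways of building model inputs described above can be sketched for a single vital sign of a single patient. The function names are hypothetical, and the sign convention for the error term (actual minus predicted) is an assumption that the text does not specify.

```python
def day_by_day_input(daily_means):
    """Use only the most recent day's mean reading (day k)."""
    return [daily_means[-1]]


def average_measured_input(daily_means, window=7):
    """Use the average of the daily means over the most recent week."""
    recent = daily_means[-window:]
    return [sum(recent) / len(recent)]


def error_adjustment_input(daily_means, predicted_last_day):
    """Use day k's mean plus the error made when predicting day k.

    The error is taken as actual minus predicted (assumed convention).
    """
    last = daily_means[-1]
    return [last, last - predicted_last_day]


# Daily mean heart rate for one hypothetical patient over days 1..7.
hr_daily_means = [80, 82, 85, 90, 88, 86, 84]
x_day_by_day = day_by_day_input(hr_daily_means)
x_average = average_measured_input(hr_daily_means)
x_error_adjusted = error_adjustment_input(hr_daily_means, predicted_last_day=86.5)
```

Each function returns the feature vector that a regressor would receive to predict the mean reading on day k+1.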

3) Classification Model Development
We used supervised ML algorithms since we rely on previously known vital sign measures. In particular, we used the extreme gradient boosting (XGB) and RF regression (RFR) algorithms to develop six next-day vital sign measure prediction models: three models that apply XGB on the three approaches discussed above and three models that apply RFR on those three approaches.

4) Model Evaluation
We split the ICU stay dataset (N=7,438) into training (N=5,578) and test (N=1,860) sets, trained the models using the training set, and measured the performance on the test set. We report the error between the predicted and actual values in the test set using the coefficient of variation of the root mean squared error (CV-RMSE), which normalizes the RMSE by the mean of the dependent variable: CV-RMSE = RMSE / (mean of the real observations on the test set).
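The CV-RMSE metric defined above is straightforward to compute. The following sketch uses a hypothetical function name and made-up readings purely to illustrate the calculation.

```python
import math


def cv_rmse(actual, predicted):
    """RMSE divided by the mean of the actual (observed) values."""
    n = len(actual)
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
    return rmse / (sum(actual) / n)


# Illustrative values only; these are not results from the paper.
demo_cv_rmse = cv_rmse([80, 90, 100, 110], [82, 88, 103, 107])
```

Because the error is divided by the mean of the observed values, the metric is unitless, which makes errors comparable across vital signs measured on different scales.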

A. READMISSION MODEL
We present the results of the two patient population selection approaches in the readmission model.

1) Under-sampling Of Majority Class
The accuracy of predicting readmission on the test set was 55.45% using RF and the quantiles approach and 55.25% using SVM and the quantiles approach. The highest improvement in accuracy of the readmission model using RF and the quantiles approach on the test set was 2.57%. RF also achieved the highest specificity (0.59), which indicates that the model using the RF algorithm and the quantiles approach can identify which patients will not be readmitted to the ICU better than the other algorithms. SVM achieved the highest sensitivity (0.56), which indicates that the model using the SVM algorithm and the quantiles approach can identify patients at risk of ICU readmission better than the other algorithms. The SVM algorithm using the quantiles approach also produced the highest AUROC (0.59) in predicting the ICU patients' readmission on the test set.

2) Clustering Re-sampling Approach
The highest readmission model accuracy using the clustering re-sampling approach on the test set was 67.53% using the RF algorithm and the quantiles approach, while SVM and the quantiles approach achieved 64.10% accuracy (Table 3). When comparing the baseline and quantiles approaches, the improvement in accuracy on the test set was 6.03% using RF and 2.00% using SVM. This suggests that the clustering re-sampling approach achieved better improvement in model accuracy than did the under-sampling approach.
SVM achieved the highest sensitivity (0.758), which indicates that the model using SVM and the quantiles approach can identify patients at risk of readmission better than the other algorithms. RF, on the other hand, achieved the highest specificity (0.606), which indicates that the model using RF and the quantiles approach can identify patients who will not have ICU readmission risk better than the other algorithms.
We ran Cohen's kappa score function to express the level of agreement between the predicted and real outcome of interest on the data set (observed and predicted for cases in the test set). We found that the highest score was 0.353 using the quantiles approach and the RF algorithm. Table 4 shows the AUROC results of the readmission model on the training and test sets using the baseline and quantiles approaches and the five ML algorithms.
VOLUME 4, 2016
Figure 4 depicts a comparison between the ROC curves for the five ML algorithms using the baseline (left) and quantiles (right) approaches. The figures show that RF had the best AUROC with both approaches, but it improved from 0.67 for the baseline approach to 0.74 using the quantiles approach.
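Cohen's kappa, used above to quantify agreement between predicted and observed outcomes, corrects the observed agreement for the agreement expected by chance. A minimal sketch with a hypothetical function name and toy labels:

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    observed = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    # Chance agreement from the marginal label frequencies.
    expected = sum(
        (list(y_true).count(c) / n) * (list(y_pred).count(c) / n) for c in labels
    )
    if expected == 1.0:
        raise ValueError("kappa is undefined when both label sets are constant")
    return (observed - expected) / (1 - expected)


# Toy labels only; these are not results from the paper.
demo_kappa = cohen_kappa(
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0, 1, 1],
)
```

In the toy example the observed agreement is 0.7 and the chance agreement is 0.5, giving a kappa of 0.4.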
Our two proposed population selection solutions show that the clustering re-sampling approach achieved a higher AUROC score than the under-sampling solution, indicating that the model using the clustering re-sampling approach was better at distinguishing between the positive and negative classes. Table 5 summarizes the AUROC and accuracy results in each of the two solutions and the corresponding improvement in AUROC for the clustering re-sampling approach over the under-sampling approach.

B. ABNORMALITY MODEL
The accuracy of predicting abnormality on the test set with the quantiles approach was 67.40% using the RF algorithm and 66.86% using the SVM algorithm. Table 6 shows the abnormality model performance on both the training and test sets using the baseline and quantiles approaches and the different ML algorithms.
The best improvement in model accuracy on the test set was 3.30% using RF, while it was 2.73% using SVM. The RF algorithm achieved the highest sensitivity (0.657) using the quantiles approach, which indicates that the model using the RF algorithm and the quantiles approach can identify abnormal patients better than the other algorithms. SVM achieved the highest specificity (0.723), which indicates that the model using the SVM algorithm and the quantiles approach can identify normal patients better than the other algorithms.
We found that the highest Cohen's kappa score was 0.348 using the quantiles approach and the RF algorithm. Table 7 shows the AUROC results of the abnormality model on both the training and test sets using the baseline and quantiles approaches for the different ML algorithms. Figure 5 shows the ROC curves for the algorithms in the baseline approach and the quantiles approach, respectively. The RF algorithm using the quantiles approach produced the highest AUROC (0.74).

C. NEXT DAY VITAL SIGN MODEL
We present the detailed results for the RFR algorithm since it outperformed XGB across all three approaches. The bottom rows of Tables 8, 9, and 10 show the mean error for each vital sign using the CV-RMSE metric after applying the RFR algorithm using the three approaches for next-day vital sign measurements. Figure 6 depicts a visual representation of the error comparison curves using the three approaches and the RFR algorithm for the seven vital signs.
Overall, the RFR algorithm produced less error, on average, in all three approaches compared to XGB. While the average errors for the three approaches are very close, the average measured approach produced the highest error among all three. The error adjustment approach achieved the lowest mean error in diastolic BP, respiration rate, body temperature, SpO2, and glucose level, while the day-by-day approach achieved the lowest mean error in heart rate and systolic blood pressure.
These findings indicate that RFR performed better than XGB, and that the error adjustment approach performed best while the average measured approach performed worst among all three approaches. This may indicate two things: i) the distant past does not help as much as the recent past in predicting next-day vital sign measurements; ii) considering the error of the previous day helps reduce the prediction error for the next day.

VI. DISCUSSION
There are other sampling approaches that we could have applied in the readmission model to handle the imbalanced classes. For instance, the synthetic minority oversampling technique (SMOTE) [19] over-samples the minority class by generating synthetic instances. However, SMOTE suffers from over-generalization because it expands the region of the minority class (readmitted to the ICU) without considering the majority class (not readmitted to the ICU). We therefore built our own solution, the clustering re-sampling approach, which uses k-means to handle the imbalanced classes in the readmission model while maintaining the variety of the population without duplication and without using under- or over-sampling techniques.
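To make the over-generalization concern concrete, the following is a simplified sketch of the SMOTE interpolation step: each synthetic instance is placed on the line segment between a minority sample and one of its nearest minority neighbors. It is an illustration under stated assumptions (the function name, parameters, and toy points are hypothetical) and not a full implementation of [19]. Note that the majority class never enters the computation, so synthetic points may land in regions occupied by majority samples.

```python
import math
import random


def smote_sketch(minority, k=2, n_new=4, seed=0):
    """Generate n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Nearest neighbors are searched within the minority class only.
        neighbors = sorted(
            (m for m in minority if m is not x),
            key=lambda m: math.dist(x, m),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(xi + gap * (ni - xi) for xi, ni in zip(x, neighbor))
        )
    return synthetic


# Toy two-feature minority class (made-up values, not study data).
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
for point in smote_sketch(minority):
    print(point)
```

Every generated point lies inside the bounding box of the minority samples, regardless of where the majority samples are located.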
In the abnormality model, we relied on three criteria to define abnormality. It would be worthwhile to explore the impact of adding other relevant abnormality indicators such as cardiac problems and organ disorders. The patient abnormality model might improve if we included more models in the IICUPM module.

1) Qualitative Comparison with Other Approaches
We compared the performance of our models to those of other researchers. For the readmission model, we achieved 67.53% accuracy and an AUROC of 0.74 using only seven vital sign features, four demographic attributes, and 21 features engineered from those original features. Other researchers [...] day of patients' ICU stay, and we provide two solutions to handle the class imbalance problem. We apply class-imbalance techniques to produce a balanced classification problem with the same number of samples in the readmitted and non-readmitted patient classes. Moreover, we achieved an AUROC of 0.73757. Rajkomar et al. [7] developed a DL model to predict 30-day unplanned hospital readmissions and achieved an AUROC between 0.75 and 0.76. In our study, we predict ICU readmission during the same hospitalization rather than readmission after hospital discharge.
Pakbin et al. [15] developed different imbalanced models for predicting ICU readmission at various time points using patient data before ICU discharge. They used all data available from MIMIC-III, including ICD-9 admission diagnosis codes. They achieved an AUROC of 0.76 for risk of ICU readmission after discharge at 72 hours. They achieved an AUROC of 0.84 for risk of ICU readmission within a single hospital admission. In our study, we built a balanced ICU readmission model with fewer features.

2) Limitations
Our approach, like other ML approaches, has a generalizability limitation. In this study, we trained our models on only the MIMIC database, which represents a single hospital in Boston, MA. Had we applied the models to patient data from different demographic backgrounds and locations, we could have obtained different results. Also, all prognostic models developed in clinical settings are prone to producing false positives, and our approach is no exception. The method often used to mitigate this in clinical settings is to involve a human in the loop, usually by having a critical care specialist manually provide a qualitative review of the false positive alerts.

3) Clinical Implications
The focus of this work is to extend the IICUPM module functionalities by incorporating three more prediction models. After exploring the best available approach for achieving our clinical goals (i.e., abnormality, readmission, and next-day vital sign risk stratification and prediction), we use our findings to choose the best approaches for our IRPM framework. We incorporated our prediction models into a cloud-based Intelligent Remote Patient Monitoring (IRPM) framework. By integrating the predictive functionalities of the IICUPM module into existing decision support systems already used in clinical workflows, we may provide significant practical implications for cost reduction and quality of care improvement.

4) Time Complexity
Our feature engineering algorithm runs in $\Theta(P \cdot V_{base} \cdot S \cdot v_k)$ time since it requires four nested loops: a loop through all patients ($P$); a loop through the ICU stays $S$ of each patient; a loop through each vital sign in $V_{base}$; and a loop through each observation $Obs_i$ of each vital sign $v_k$. Thus, the algorithm depends heavily on the population size and the number of observations for each vital sign. However, our setting permits several simplifying assumptions. First, we always focus on only seven vital signs, so $V_{base}$ is constant. Second, the number of ICU stays in our case is, on average, constant. Therefore, the time complexity can be reduced to $\Theta(P \cdot v_k)$. Finally, we mitigate the overhead caused by the number of vital sign observations by pre-processing observations within each vital sign feature (Section III-C).
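The four nested loops described above can be sketched as follows. This is an illustration of the loop structure only: the nested-dictionary layout, the `cutoffs` argument, and the low/high count features are hypothetical stand-ins for the feature engineering algorithm defined earlier in the paper, and the `steps` counter is added solely to make the iteration count visible.

```python
def engineer_features(patients, cutoffs):
    """Count low- and high-range observations per patient, stay, and vital.

    patients: {patient_id: {stay_id: {vital_name: [observations]}}}
    cutoffs:  {vital_name: (low_cutoff, high_cutoff)}  (assumed given)
    Returns (features, steps), where steps is the number of innermost
    loop iterations executed.
    """
    features = {}
    steps = 0
    for patient_id, stays in patients.items():        # loop over patients
        for stay_id, vitals in stays.items():         # loop over ICU stays
            for vital, observations in vitals.items():  # loop over vitals
                low_cut, high_cut = cutoffs[vital]
                low = high = 0
                for value in observations:            # loop over observations
                    steps += 1
                    if value <= low_cut:
                        low += 1
                    elif value >= high_cut:
                        high += 1
                features[(patient_id, stay_id, vital)] = {
                    "low": low,
                    "high": high,
                }
    return features, steps


# Made-up records and cutoffs, not study data.
patients = {
    "p1": {"s1": {"heart_rate": [55, 80, 120], "spo2": [99, 88]}},
    "p2": {"s1": {"heart_rate": [70, 75]}},
}
cutoffs = {"heart_rate": (60, 100), "spo2": (92, 100)}
features, steps = engineer_features(patients, cutoffs)
print(steps, features[("p1", "s1", "heart_rate")])
```

With the number of vital signs and the number of stays per patient held roughly constant, `steps` grows with the number of patients and the number of observations per vital sign, which matches the reduced bound stated above.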

VII. CONCLUSION
Machine learning approaches applied to clinical datasets offer great promise for the delivery of personalized medicine for targeted treatment of human disease. Building on top of our recent development of an intelligent ICU patient monitoring (IICUPM) framework, we provide three reproducible risk stratification ML models. Our findings indicate that we can build balanced prediction models for ICU patient readmission and abnormality with better accuracy using a combination of ML and a quantiles approach that relies on only vital signs. To avoid inaccurate results and poor accuracy in the readmission model, we proposed two solutions for the data with imbalanced classes: one that uses the under-sampling method and one that uses the clustering re-sampling method. We also provide three approaches for selecting predictors of next-day vital sign measurements in reference to a baseline. We applied two different regression algorithms to each approach. In general, we found that the error adjustment approach performed best while the average measured approach performed worst. The result indicates that, generally, using the most recent vital sign measurements achieves the least error, especially when we account for previous errors. In addition to providing three transparent and reproducible ML models, this work contributes a feature engineering algorithm that can be deployed in different critical care settings.