AEP-DLA: Adverse Event Prediction in Hospitalized Adult Patients Using Deep Learning Algorithms

Early prediction of clinical deterioration, such as adverse events (AEs), improves patient safety. The National Early Warning Score (NEWS) is widely used to predict AEs based on the aggregation of 6 physiological parameters. We took the same parameters as the features for AE prediction using deep learning algorithms (AEP-DLA) among hospitalized adult patients. The aim of this study is to achieve better performance than traditional naïve mathematical calculations by introducing novel vital sign data preprocessing schemes. We retrospectively collected the data from our electronic medical record data warehouse (2007–2017). The AE rate across all 99,861 admissions was 6.2%. The dataset was divided into a training set (2007–2015) and a testing set (2016–2017). In real-life clinical care, physiological parameters are not recorded every hour and are frequently missing, for example, the Glasgow Coma Scale (GCS). The domain expert suggested that a missing GCS be rated as 15. We used two strategies (stack series records and align by hour) in the data preprocessing and tripled the values of negative samples for class balancing (CB). We used the last 28 hours of serial data to predict AEs 3 hours later with Random Forest, XGBoost, Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN). CNN with CB and align by hour achieved the best results compared with the other methods. The precision, recall and area under the curve were 0.841, 0.928 and 0.995 respectively. The performance of the model is also better than that of the models proposed in the published literature.


I. INTRODUCTION
The Institute of Medicine published ''To Err Is Human'' in 1999. Two decades later, David W. Bates and Hardeep Singh reported on the progress of patient safety and pointed out that health information technology (HIT) can help prevent many types of patient safety errors [1]. However, HIT also introduces new problems, including ensuring the safety of the technology itself; the safe use of the technology by clinicians, staff members, and patients; and the effective use of it to improve patient safety. An early warning system (EWS), one form of clinical decision support, is used to recognize patients at risk of clinical deterioration and then to trigger aggressive actions to prevent adverse events (AEs). EWS was at first calculated manually. With the advances in electronic health record systems, EWS has been embedded into clinical health information systems [2] and has successfully decreased mortality [3].
The associate editor coordinating the review of this manuscript and approving it for publication was Li He.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Recently, two reviews, by Goldstein et al. [2] and Gerry et al. [4], found that many studies of early warning scores have methodological weaknesses. Both pointed out the challenge of missing data. Many of the studies did not specify how missing data were handled or imputed. Some studies used only the cases with complete data. In real-life clinical care, physiological parameters are not recorded every hour and are frequently missing. This reduces the accuracy of HIT-based EWS in predicting clinical deterioration and also challenges clinical deployment.
In this study, we retrospectively collected cohort data from the electronic health record (EHR) from 2007 to 2017 at Taichung Veterans General Hospital, Taiwan. We collected the parameters of the National Early Warning Score (NEWS 2 [5], [6]), which standardizes the assessment of acute-illness severity in the NHS. We used the last 28 hours of serial data to predict clinical deterioration 3 hours later with machine learning algorithms including Random Forest, XGBoost, Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN). We used two strategies (stack series records and align by hour) to handle missing data during data preprocessing. The study focused on the effect of different methods of managing missing data, with a physician acting as the domain expert, aiming not only at precise prediction but also at easy clinical deployment. The key to achieving higher performance with deep learning is a good data preprocessing strategy for vital signs.

A. BACKGROUND KNOWLEDGE
Several machine learning algorithms [7] exist; they are categorized as supervised or unsupervised, and by task as classification, regression, clustering or dimension reduction. In AEP-DLA, we predict a numeric value and use supervised learning, with classification and with regression for continuous outcomes. Random forest is a suitable algorithm that averages the outputs of many decision trees; the individual trees perform well only when combined. Although it requires expertise to interpret its results, it is known for high-quality results and fast training. XGBoost is also a classifier, similar to gradient boosting but with several advantages: penalization, proportional shrinking, Newton boosting and extra randomization parameters for the decision trees. Neural networks are known for exchanging messages between interconnected neurons, and deep learning is one such structure that connects several layers serially. It is specially designed to handle complex tasks. The challenge is to train the multiple layers and to understand the predictions.

B. APPLICATIONS FOR AEP-DLA CAN BE APPLIED IN VARIOUS SCENARIOS

1) SMART HOSPITAL
Smart hospitals are equipped with several continuous, connected health data monitoring devices. Dedicated sections for each disease type, critical care units, groups of expert doctors, and advanced surgery and scanning equipment help provide dedicated care and emergency operations for patients.

2) SMART CLINICS
A smart clinic can be connected to specialized hospitals. Results can also be communicated live to super-specialty hospitals to obtain better treatment and health checkup analysis for patients suspected to be critical.

3) OLD AGE HOME
Many senior citizens live in old age homes. They can be grouped by specific diseases and treated accordingly on site. Continuous or fixed-interval monitoring of blood pressure, heart rate, consciousness, etc. is possible using a smart watch. Video-based surveillance and consulting has recently been made widely available by private hospitals.

4) SPORTS STADIUM CLINIC
Players who are injured or unwell can be monitored and consulted. Data can be collected during transfer to hospital or until the patient becomes critical, which helps to analyze the condition before an emergency.

5) SMART AMBULANCE
While patients are being transferred to a nearby hospital, a smart ambulance can be used to monitor multiple parameters, and the nurse or doctor present can give temporary medication. It can also support remote checking and analysis of the patient's health, which is recorded and reused later.

6) REMOTE CENTER CLINIC
Remote clinics in rural areas can be used to collect data at specific intervals from patients at an early stage of illness. Emergency units can monitor health from the remote center and provide consultation, avoiding travelling time for curable situations.

II. LITERATURE SURVEY
The literature survey presents well-known models that are referenced for this work. Table 1 presents a detailed comparison with some recent research. A systematic review of early warning scores for detecting clinical deterioration, focused on methodology, was conducted by Gerry et al. [4]. It concluded that poor methods and inadequate reporting were found in most studies, and that all studies were at risk of bias. Methodological problems could result in scoring systems that perform poorly in clinical practice, which might have detrimental effects on patient care. One of their recommendations was that multiple imputation is the best-practice approach for handling missing data in the analysis.
The National Early Warning Score (NEWS) by the Royal College of Physicians [5], [6] was developed for adult patients to detect clinical deterioration from physiological parameters that are part of routine measurements. A final score is calculated by aggregating the weighted scores of the physiological parameters. The monitoring frequency of patients, the urgency of the clinical response based on triggers, and the escalation of care levels are scaled according to acute illness severity. A standardized chart was developed for recording the routine NEWS parameters, supported by online training. NEWS has been widely used, with or without minor modification, in many hospitals. Machine learning for early prediction of circulatory failure in the intensive care unit (ICU) is presented by Hyland SL et al. [6]. The data preprocessing consisted of artifact removal, category-based variable merging, adaptive imputation and state annotation.
Variables with high feature importance were classified by a gradient boosting classifier, which was evaluated by specificity, precision, recall and frequency. Artificial intelligence for detecting patient health deterioration in a rapid response system is demonstrated by Cho KJ et al. [12]. The dataset consists of vital signs for predicting in-hospital cardiac arrest and ICU admission, which are classified using RNN, long short-term memory (LSTM), rectified linear unit (ReLU), four fully connected layers and softmax for binary (0, 1) evaluation. The results are presented as sensitivity, specificity and ROC curve. Explainable artificial intelligence (xAI) for evaluating EHR records in predicting acute critical illness is presented by Lauritsen SM et al. [13]. AI models showed a trade-off between sensitivity and specificity; to overcome it, an xAI model was constructed, and data from each disease category were analyzed using AUROC and AUPRC. The xAI model showed better performance than SOFA, MEWS and gradient boosting on vital signs.
AI for Covid-19 prognostic modelling in the UK is demonstrated by Abdulaal A et al. [14]. A mortality risk scoring system from hospital admission was developed using an artificial neural network (ANN) with hyper-parameter tuning, with feature importance obtained from Shapley additive explanations (SHAP) values. The evaluation was performed using k-fold cross-validation, (training and loss) vs. epoch curves, and AUROC. AI for critical care prediction during prehospital services is presented by Kang DY et al. [15]. The data are normalized by z-score and given as input to a feedforward network with the Adam optimizer and TensorFlow as the backend. In terms of the ROC curve, the model outperformed NEWS, MEWS, the emergency severity index (ESI) and the Korean triage and acuity system (KTAS). Machine learning applied to EWS for cardiac arrest is demonstrated by Chang HK et al. [16]. Feature selection was performed on CPR as well as non-CPR patients using sequential forward selection, and the selected features were applied to decision trees and random forest, validated by a higher ROC curve.
A review of EWS scores for deterioration of patients' clinical signs is presented by Smith et al. [17]. It systematically reviewed 21 articles and concluded that early warning system tools perform well for predicting death and cardiac arrest within 48 hours, but the impact on in-hospital health outcomes and utilization of resources remains uncertain, owing to methodological limitations. The Modified Early Warning Score (MEWS) study by Galen et al. [18] addressed the recognition of clinical deterioration in hospitalized patients by analyzing, in real-life settings, adherence to the MEWS protocol and by measuring MEWS daily to determine its predictive value for Serious Adverse Events (SAEs): ICU admissions and readmissions, cardiac arrests and death. A critical score was set up, and its presence and absence were compared across 6 different wards of hospitalized patients with a 30-day follow-up. The study suggested that MEWS needs to be modified according to different diseases. The Early Deterioration Indicator (EDI) was proposed by Ghosh et al. [19]: mortality and ICU level of care can be reduced by keeping patients in general wards through timely interventions. EDI uses continuous risk scores calculated from the log-likelihood risk of vital signs. EDI, which was trained using data mining on large datasets, was validated by comparing it with MEWS and NEWS. However, missing data was a problem in constructing the EDI model. Churpek et al. [20] evaluated the quick sepsis-related organ failure assessment (qSOFA) score, which is intended to identify high-risk patients outside the ICU. It was compared with EWS and the systemic inflammatory response syndrome (SIRS) criteria for detecting clinical deterioration. The study concluded that EWS is more accurate than qSOFA in non-ICU patients.
The electronic Cardiac Arrest Risk Triage (eCART), NEWS and MEWS scores for deterioration of ward patients are compared by Green et al. [21]. Electronic algorithms can help provide immediate attention to critical patients. The study considers data from five different hospitals in the US from 2008 to 2013. For predicting the composite outcome of ICU transfer, cardiac arrest or death within 24 hours of an observation, eCART was found to be more accurate than the paper-based scores based on AUROC. Sepsis prediction in the ICU using machine learning by Desautels et al. [22] uses a minimal variable set for prediction and is compared with existing scoring systems, including an investigation of data sparsity. In the data preprocessing, missing values are imputed by carrying values forward into subsequent bins, and otherwise by back-filling from the first subsequent bin. The Multi-parameter Intelligent Monitoring in Intensive Care (MIMIC)-III dataset is used to predict sepsis from ICU patient data using machine learning classification. This classifier is compared with recent scoring systems, i.e. qSOFA, MEWS, SIRS, SOFA and SAPS-II, for sepsis prediction and is found to perform well, as evaluated by AUROC, even with randomly missing data. Wellner et al. [23] proposed a machine learning approach for predicting unplanned transfers to the ICU. The data were taken from three children's hospitals and used to assess the performance of different predictor variables in predicting unplanned ICU transfers. Different training and testing data were used, with cases drawn from patients suspected of meeting one or more of five criteria for unplanned transfer and from those transferred to the ICU from the floor. Neural networks and logistic regression were used as classification models, and prediction horizons of 1-16 hours were used for evaluating model performance.
Accuracy, measured by AUROC, was determined up to 16 hours before the patient's deterioration. A multicenter comparison of conventional regression and machine learning methods for predicting clinical deterioration by Churpek et al. [24] shows that ML outperforms conventional regression. In the data preprocessing, missing values are taken from prior blocks; if there are no previous blocks, median values are imputed. Different ML techniques are used in a discrete-time survival analysis that predicts the outcome from health parameters. Different training and testing data were used; the random forest was found to be more accurate by AUC than the others, including MEWS, whereas logistic regression with splines had a higher AUC than the other regression models. The study concluded that improved identification of critical patients is achieved. Machine learning for predicting the quality of life (QOL) of hemodialysis patients is presented by Saadat et al. [25], using algorithms based on naïve Bayes and classification trees. The classification tree was found to perform better by AUC when considering the environmental and psychological domains of QOL.
The limitations of the previous papers studied in the above survey are as follows: 1. Insufficient data preprocessing techniques are used to handle missing data. 2. Machine learning or deep learning could probably construct a better algorithm to predict clinical deterioration precisely.
A deep learning based method is designed for our AEP-DLA. Our goal is to include the features of NEWS and increase the accuracy of AEP-DLA. The rest of the paper is organized as follows. Section III, Methodology, focuses on the system model and the algorithms used for machine learning and deep learning. Section IV, Experiments, discusses data preprocessing and the performance of the various algorithms with a comparison. The paper ends with the Conclusion, followed by the Acknowledgement and References.

1) Settings:
The study was conducted at Taichung Veterans General Hospital (TCVGH), a 1500-bed academic hospital in central Taiwan. The ethical committee/institutional review board (Institutional Review Board (II) 107-B-08 Board Meeting) approved the study protocol (protocol no./IRB TCVGH No: CE18209B). Therefore, written informed consent from the participants was waived. Patient information was anonymized and de-identified prior to analysis in the study.
2) Patients and Data Retrieving:
The inclusion criteria were (1) hospitalization in general wards and (2) age ≥ 20 years. Patients meeting any of the following exclusion criteria were excluded: (1) hospitalization of one day or less (≤ 24 hours), (2) direct admission to the ICU, or (3) an artificial airway at admission. We retrospectively collected the data from our electronic medical record (EMR) data warehouse for 2007/01/01 to 2017/12/31.

3) Definition of Adverse Event:
The deteriorating patients were grouped as ''adverse event (AE)'' if they received cardiopulmonary resuscitation, were transferred to ICU with unexpected deterioration, or died. The other patients were regarded as ''no adverse event (NAE)''. Those patients with scheduled admission to ICU after surgery were regarded as NAE. Only the first episode of AE was studied.

4) Vital Sign Data:
Vital sign data were measured at various frequencies according to clinical needs. For a time point with no measured data, we used the closest previous measurement as the representative. The data collected from the patient or provided by life-support equipment can be categorized as listed in Table 2.
The method for our model is shown in Figure 1, the AEP-DLA system model. It is divided into four parts: input data as vital signs, data preprocessing, training model and emergency warning system predictor. Once the vital signs are collected, the data are cleaned first so that they become suitable for data preprocessing. We define data cleaning separately, as it is used to identify and correct improper records acquired as input. Data considered inaccurate can be replaced, modified or deleted as per the criteria. The purpose of data cleaning is to make the dataset consistent, and hence valid, for processing by the predictor system. Inaccuracy in the data can be caused by entry errors, data corruption, transmission errors, etc. Cleaning can also involve harmonizing or standardizing data, that is, mapping terms to short codes and vice versa. Data cleaning is a basic requirement before data preprocessing, as the latter involves validation rather than accuracy. VS_HR_PreEvent indicates, for all features, the time limits before adverse events. The subspecialty indicates the section responsible for diagnosing disorders of the specific organ.
In the following section, we present the working model of adverse event prediction using machine learning and deep learning algorithms, i.e. Random Forest, XGBoost, CNN and RNN. The input to the working model is data collected from each patient in the hospital ward, which is then preprocessed and passed to the learning algorithms for evaluation.

A. INPUT DATA FOR VITAL SIGNS
The input provided to the model is the series record shown in the bottom left of the system model. In the series data, the x-axis represents the data collected from different patients and the y-axis represents the time, in hours, at which the data were recorded. Each item collected from the patient, or provided by machine settings, is shown on a scale covering the last 28 hours; because we use a prediction algorithm, past records are needed as input.

B. DATA PREPROCESSING
The input to data preprocessing is the cleaned data. Data preprocessing is required for data validation. As an important step in machine learning, it checks whether loosely controlled data gathering has produced out-of-range values, missing values or improper data combinations. Redundant data can also be handled by preprocessing; data quality is considered an important factor for AEP.
Broadly, data preprocessing consists of cleaning the data, selecting appropriate instances, normalizing the data for smooth processing, transforming the data from the original to the required form, extracting features from the acquired data and selecting the most appropriate data values. Data preprocessing is therefore a crucial part of the final interpretation, as it can affect the outcome. The feature selection is as presented in Table 2. Once the features are captured, we either recover the missing values or directly use the patient data recorded 3 hours before the current time. This paper proposes two strategies in the data preprocessing stage to deal with missing values: the stack series records method and the align by per hour method. To predict the early warning signs 3 hours later, the algorithm requires a total of 28 records from the vital sign data of the last 3 to 30 hours. If only one record of the patient's vital signs, from the last 3 hours, is available, with 15 features derived from Table 2 (12 features, with GCS replaced by 1 total feature and the category value replaced by 6 one-hot encoded features), the early warning sign can be predicted with the machine learning algorithms random forest or XGBoost. If 28 records of the patient's vital signs from the last 3 to 30 hours are available, each with 15 features, we can use the deep learning algorithms: CNN without class balance, CNN with class balance, hourly alignment based CNN (with class balance) and hourly alignment based RNN (with class balance).
Missing values can be recovered during data preprocessing by the following methods:

1) STACK SERIES RECORD
As shown in Figures 1 and 2, data from three patients are available, i.e. data 1, data 2 and data 3. Data 1 is stacked from series record data 1. After stacking, data 1 exceeds the limit of 28 records, so the records beyond the threshold are deleted, as only 28 records are required for processing. Data 2 reaches exactly the required 28 records, so no operation is needed. Data 3, when stacked from its series record, does not have enough records to reach the threshold of 28; this is an underflow, so data 3 is padded with the value −1.
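The stacking rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `stack_series` is hypothetical, and the choices to keep the newest records on overflow and to append the padding after the real records are assumptions, since the text only states that surplus records are deleted and short series are padded with −1.

```python
from typing import List, Sequence

PAD_VALUE = -1.0   # padding value stated in the text
STACK_SIZE = 28    # number of records required per sample


def stack_series(records: Sequence[Sequence[float]],
                 stack_size: int = STACK_SIZE,
                 pad_value: float = PAD_VALUE) -> List[List[float]]:
    """Return exactly `stack_size` records for one patient.

    `records` is assumed to be ordered from oldest to newest, and every
    record is assumed to hold the same number of features.
    """
    rows = [list(r) for r in records]
    if len(rows) > stack_size:
        # Overflow: delete surplus records (assumption: keep the newest).
        return rows[-stack_size:]
    if len(rows) < stack_size:
        # Underflow: fill the missing slots with rows of -1.
        n_features = len(rows[0]) if rows else 0
        padding = [[pad_value] * n_features
                   for _ in range(stack_size - len(rows))]
        return rows + padding
    # Exactly stack_size records: nothing to do.
    return rows
```

With 15 features per record, the result is a 28 × 15 block per patient regardless of how many records were originally charted.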

2) ALIGN BY PER HOUR
As shown in Figures 1 and 2, the series records are kept as they are and, instead of stacking, only padding is performed in this method. For the vital signs of patient data 1, the blue bars represent recorded data and the grey bars represent padded data. When multiple values are recorded in one hour, they are averaged into a single value for that hour. In the case of data 2, padding with −1 is applied when no previous record exists within the last 30 hours. In the case of data 3, when data are missing in the middle part, padding is done with the most recent record. In short, whenever an alignment operation is performed, the data are always filled forward in time. Here, only data from the last 3 to 30 hours are considered, so data recorded within the most recent 3 hours are not required for the prediction and can be deleted. When previous records are present, the most recent record is used for padding until a new record is obtained. The records of the last 3 to 30 hours can thus be obtained by this method even when values are missing.
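The alignment rule can be sketched for a single vital sign as follows. This is an illustrative sketch, not the authors' code: the function name `align_by_hour` and the input format, a list of (hours before prediction time, value) pairs, are assumptions. Readings outside the 3 to 30 hour window are simply ignored here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

PAD_VALUE = -1.0
OLDEST_HOUR = 30   # window start: 30 hours before the prediction time
NEWEST_HOUR = 3    # window end: 3 hours before the prediction time


def align_by_hour(readings: Iterable[Tuple[int, float]]) -> List[float]:
    """Return 28 hourly values, ordered from oldest to newest hour."""
    by_hour: Dict[int, List[float]] = defaultdict(list)
    for hours_before, value in readings:
        if NEWEST_HOUR <= hours_before <= OLDEST_HOUR:
            by_hour[hours_before].append(value)

    aligned: List[float] = []
    last_seen = PAD_VALUE  # -1 until the first reading in the window
    for hour in range(OLDEST_HOUR, NEWEST_HOUR - 1, -1):
        if hour in by_hour:
            # Several readings in the same hour are averaged.
            last_seen = sum(by_hour[hour]) / len(by_hour[hour])
        # Hours without a reading reuse the most recent earlier value.
        aligned.append(last_seen)
    return aligned
```

For example, two heart-rate readings of 80 and 90 taken 28 hours before the prediction time yield −1 for hours 30 and 29, then 85 for hour 28 and every later hour until a new reading appears.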

3) GLASGOW COMA SCALE
The Glasgow Coma Scale (GCS) represents three features of the patient's condition and can be considered an initial parameter. Eye opening (1-4), verbal response (1-5) and motor response (1-6) are scored within these ranges; the higher the value, the better the response. When GCS values are missing in the input to the machine learning/deep learning model, they are rated as a total of 15.0, which is considered normal, instead of rating each component separately.
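A minimal sketch of this rule is shown below. The function name `gcs_total` is hypothetical, and treating the whole assessment as missing when any single component is absent is an assumption; the text only states that a missing GCS is rated as a total of 15.0.

```python
from typing import Optional

GCS_NORMAL_TOTAL = 15.0  # 4 (eye) + 5 (verbal) + 6 (motor)


def gcs_total(eye: Optional[float],
              verbal: Optional[float],
              motor: Optional[float]) -> float:
    """Collapse the three GCS components into one total feature.

    Assumption: if any component is missing (None), the assessment is
    treated as missing and rated as the normal total of 15.0.
    """
    if eye is None or verbal is None or motor is None:
        return GCS_NORMAL_TOTAL
    return float(eye + verbal + motor)
```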

4) CATEGORY VALUE
Missing category values in the dataset are replaced with a binary class matrix by one-hot encoding. One-hot encoding converts categorical variables into a form suitable as input to the machine learning algorithm, which results in better prediction. To avoid the label encoding problem, in which a category with a higher value is treated as a superior category, one-hot encoding binarizes each category and uses it as a training model feature.
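One-hot encoding of a single categorical feature can be sketched as follows. The six category labels are placeholders, since the text only says that the categorical value becomes 6 features, and mapping a missing value to an all-zero vector is an assumption made for illustration.

```python
from typing import List, Optional, Sequence

# Hypothetical labels: the source does not name the six categories.
CATEGORIES = ("A", "B", "C", "D", "E", "F")


def one_hot(value: Optional[str],
            categories: Sequence[str] = CATEGORIES) -> List[int]:
    """Return one binary feature per category.

    Assumption: a missing (None) or unknown value yields all zeros.
    """
    return [1 if value == c else 0 for c in categories]
```

Because every category gets its own 0/1 column, no category is numerically larger than another, which is the label-encoding problem the paragraph describes.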

C. ALGORITHMS
In data preprocessing, both the Glasgow Coma Scale and Category Value steps are applied to the data. Stack Series Record and Alignment by Per Hour, however, are alternative preprocessing methods; both were tested and evaluated in the experiment section. We also normalize continuous data using min-max. Min-max normalizes all numeric values to the range between 0 and 1 by feature scaling. It is also known as unity-based normalization. The range can be generalized to restrict the values between two arbitrary points a and b. The min-max(X) can be expressed as given in equation 1, where X_min and X_max are the minimum and maximum values of the feature:
$$\text{min-max}(X) = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \tag{1}$$
The pseudo-code of algorithm 1 presents the data preprocessing performed on the patient's records using the stack series method. In step 1, the raw data (DRaw) collected as the patient's health record are given as input to the algorithm. In step 2, the stack size (Stack size) required for preprocessing is input. In step 3, the processed data (DProcessed id), a stack, is declared to store the output of the algorithm. In step 4, the DProcessed stack is initialized to NULL for storing output. In step 5, the Stack size is declared as the part of the input from step 2. In step 6, the first IF condition checks whether the number of records in the particular patient's raw data DRaw id is greater than the Stack size. If so, in step 7, the records in excess of the size are deleted from DRaw id and the result is stored in DProcessed id. In step 8, Else-IF checks whether DRaw id is smaller than the Stack size. If so, in step 9, padding by −1 is applied to fill the values up to the stack size. In step 10, the Else condition covers the case where the sizes are equal, and in step 11 the data are copied from DRaw id to DProcessed id. In step 12, the pre-processed DProcessed id is returned by the algorithm.
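The min-max scaling described above, including its generalization to an arbitrary range [a, b], can be sketched as follows. The function name `min_max_scale` and the handling of a constant feature are assumptions for illustration; the source does not specify how padded −1 values interact with normalization, so this sketch does not treat them specially.

```python
from typing import List, Sequence


def min_max_scale(values: Sequence[float],
                  a: float = 0.0,
                  b: float = 1.0) -> List[float]:
    """Rescale `values` linearly so that min -> a and max -> b."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature maps to the lower bound.
        return [a for _ in values]
    return [a + (v - lo) * (b - a) / (hi - lo) for v in values]
```

For instance, heart rates of 60, 80 and 100 map to 0.0, 0.5 and 1.0 with the default range. In a train/test setting the minimum and maximum would normally be taken from the training data only.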
The pseudo-code of algorithm 2 presents the data preprocessing performed on the patient's records using the alignment by per hour method. In step 1, the input given to the algorithm is the patient's raw per hour record data (DRaw PH). In step 2, the input is the top of the stack (Stack Top) value. In step 3, the input is the fixed block size in each stack (Block Size). In step 4, the output is given as the preprocessed patient's record data (DProcessed). In step 5, the processed data stack is initialized to NULL. In step 6, the stack top and block size are declared as part of the input. In step 7, the IF condition checks whether the patient's per hour raw record contains multiple recordings, i.e. more than the block size (Block Size). If true, in step 8, the data are averaged and stored as single-hour data for that particular patient's ID (DRaw ID,PH) in the processed data (DProcessed ID,PH). In step 9, Else-If checks whether the per hour raw data of a patient is NULL and lies at the top of the stack (Stack Top). If so, in step 10, padding by −1 is applied to that raw record and stored in the respective preprocessed data (DProcessed ID,PH). In step 11, Else-If checks whether the per hour raw data of a patient is NULL and lies not at the top of the stack (Stack Top) but below it. If so, in step 12, the per hour record above the current record in the stack is replicated to the current record block and updated in that particular patient's preprocessed record (DProcessed ID,PH). In step 13, if none of the above conditions is matched, the raw record (DRaw ID,PH) is assumed to be valid and in step 14 is added directly to the per hour preprocessed record (DProcessed ID,PH). In step 15, the per hour pre-processed records (DProcessed ID,PH) are returned by the algorithm.
Algorithm 3 presents the final adverse event prediction (AEP) pseudocode. The AEP algorithm combines the two data preprocessing methods, stack series records and align by per hour. In step 1, the input required by the algorithm is the vital signs of the patient, which are recorded as part of the hospitalization process. In step 2, the threshold determines the limit beyond which an alert needs to be raised by the AEP system. In step 3, the score is the output generated by the AEP system, using the best score from either the machine learning or the deep learning algorithms. In step 4, the alert is raised as a warning to indicate an approaching adverse event for the patient. In step 5, the local variables are initialized: raw data (DRaw), processed data (DProcessed), cleaned data (DCleaned), candidate score (CandidateScore), machine learning/deep learning Score and Message. In step 6, the vital sign data are cleaned by checking whether there are any missing values. In step 7, the IF condition checks whether the data are clean (DCleaned is true). In step 8, if so, the model is trained with machine learning and deep learning to get the best score. In step 9, the Else branch handles missing values: the vital sign data (DCleaned) are examined in more detail through feature selection, including NULL values, normalization, extraction, transformation, etc., and in step 10 are stored in the raw data (DRaw ID) with the identity of the particular patient. In step 11, if the raw data (DRaw ID) to be processed cover only the last 3 hours or less, then in step 12 the random forest algorithm is applied and its results are stored in CandidateScore1. In step 13, the XGBoost algorithm is applied to the same raw data (DRaw ID) and the results are stored in CandidateScore2. In step 14, the best score of CandidateScore1 and CandidateScore2 is selected and stored in Score.
In step 15, Else-If checks whether the recorded raw data (DRaw ID) span the last 3 hours to the last 30 hours. If so, in step 16 the stack series data preprocessing algorithm is applied and the result is stored as preprocessed data (DProcessed Series). Similarly, in step 17, align by per hour data preprocessing is applied to the raw data (DRaw ID) and the result is stored in processed data (DProcessed Aligned). In step 18, class imbalance is addressed by approximately balancing the proportion of positive and negative samples, and the preprocessed data are stored in DProcessed. In step 19, the convolutional neural network is applied to the preprocessed data (DProcessed) and the results are stored in CandidateScore3. Similarly, in step 20, the recurrent neural network is applied to the preprocessed data (DProcessed) and the results are stored in CandidateScore4. In step 21, the best score from the previous two steps is stored in Score. In step 22, if the Score obtained from any of the previous results is less than or equal to the threshold value, the patient is considered Safe/No Warning. In step 24, Else, when the Score is greater, the system raises an Adverse Event Prediction (AEP) Alert. Finally, the message is returned by the AEP algorithm.
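The final selection and alerting steps of the AEP algorithm can be sketched as follows. This is only an illustration of the decision logic: the function name `aep_decision` and the dictionary of candidate scores are hypothetical, and interpreting "best score" as the highest score is an assumption. Model training and preprocessing are outside the scope of this sketch.

```python
from typing import Dict, Tuple


def aep_decision(candidate_scores: Dict[str, float],
                 threshold: float) -> Tuple[str, float, str]:
    """Pick the best candidate score and compare it with the threshold.

    Assumption: "best" means the highest score among the candidates.
    Returns (model name, score, message).
    """
    if not candidate_scores:
        raise ValueError("at least one candidate score is required")
    best_model = max(candidate_scores, key=candidate_scores.get)
    score = candidate_scores[best_model]
    # A score at or below the threshold is safe; above it raises an alert.
    message = "Safe/No Warning" if score <= threshold else "AEP Alert"
    return best_model, score, message
```

For a patient with 28 hourly records, the candidates would be the CNN and RNN scores; for a patient with a single recent record, the random forest and XGBoost scores.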

D. MATHEMATICAL ANALYSIS OF THE DEEP LEARNING TRAINING MODELS
1) CONVOLUTIONAL NEURAL NETWORK (CNN)
CNN is known for automatically learning a hierarchy of features for the purpose of classification [26], [27]. The feature maps let the higher layers learn complex features in a hierarchical manner that is invariant to translation and distortion. A neural network basically consists of an (L+1)-layer perceptron with D input units, C output units and many hidden units, where the units are arranged in layers.
The i-th unit in layer l computes its output as given in equation 2, where the weighted connection from the k-th unit in layer (l−1) to the i-th unit in layer l is denoted $w_{i,k}^{(l)}$. The sigmoid activation function σ given in equation 3 gives the high-dimensional network its non-linear properties, at the cost of slow convergence. It basically maps complex functions between the input z and the response variables. The input times the weight is therefore added to the bias and passed through the activation.
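As a worked illustration of the unit computation just described (weighted inputs plus bias, passed through the sigmoid), the following sketch uses plain Python. The function names are illustrative, and the logistic form of σ is the standard definition, assumed here to match equation 3.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Standard logistic sigmoid, mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def unit_output(inputs: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Output of one unit: sigmoid of the weighted input sum plus the bias."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)
```

For example, inputs (1.0, 2.0) with weights (0.5, −0.25) and zero bias give z = 0 and therefore an output of 0.5.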
The network weights are determined by supervised training so that the network approximates a specific target mapping g. In practice g is unknown, so the mapping is given through a training data set. The training set is given in equation 4, where $x_n$ is the input value and $t_n \approx g(x_n)$ is the output value, possibly noisy.
The initialization of the weights w is crucial for the iterative optimization technique. The weights are chosen randomly within the range given in equation 5. The input distribution of each unit is assumed to be Gaussian, which ensures that the actual input is approximately of order unity. Under these conditions, the logistic sigmoid activation function allows optimal learning.
The objective function to be minimized, also known as the loss function, measures the quality of the predicted outcome. The Mean Square Error (MSE), or quadratic loss, is one of the most commonly used regression losses. MSE is given in equation 6 as the sum of squared distances between the target variable and the predicted values.
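A short sketch of the quadratic loss follows. It assumes the usual normalization by the number of samples (the "mean" in MSE); the function name and the toy values in the note below are illustrative only.

```python
from typing import Sequence


def mean_squared_error(targets: Sequence[float], predictions: Sequence[float]) -> float:
    """Average of the squared distances between targets and predictions."""
    if len(targets) != len(predictions) or len(targets) == 0:
        raise ValueError("targets and predictions must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return sum(squared) / len(squared)
```

With targets (1, 0, 1) and predictions (0.5, 0.5, 1.0), the squared distances are 0.25, 0.25 and 0, so the loss is 0.5 / 3.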
Table 3 shows the hyperparameter configuration for the CNN of Figure 3.I. It lists the settings used for convolutional layer 1 (Conv 1), convolutional layer 2 (Conv 2) and the fully connected layer (FC), which has a flattened output of 2304 and sigmoid as the final activation function.

2) RECURRENT NEURAL NETWORK (RNN)
RNNs are connectionist models that capture sequence dynamics through cycles in the network of nodes [28]. They are basically used to retain state from a large context window over a sequence. LSTM makes them trainable and optimizable for large-scale learning.
Recurrent edges connect adjacent time steps: a node receives the current input $x^{(t)}$ from the data point and the value $h^{(t-1)}$ of the hidden node from the previous state, and gives $\hat{y}^{(t)}$ as output. In equation 7, the weights between the input and the hidden layer are given by the matrix $W_{hx}$, whereas the weights between the hidden layer and itself at adjacent time steps are given by $W_{hh}$. Each node learns its offset through the bias vectors $b_h$ and $b_y$ in equations 7 and 8 respectively.
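One recurrent time step as described above can be sketched in plain Python. Several details are assumptions made for illustration: the hidden activation is taken to be the logistic sigmoid, the output uses a sigmoid to produce a scalar risk score, and the hidden-to-output matrix `W_yh` is a name introduced here because the text names only $W_{hx}$, $W_{hh}$, $b_h$ and $b_y$.

```python
import math
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def matvec(m: Matrix, v: Sequence[float]) -> List[float]:
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def rnn_step(x_t: Sequence[float], h_prev: Sequence[float],
             W_hx: Matrix, W_hh: Matrix, b_h: Sequence[float],
             W_yh: Matrix, b_y: Sequence[float]) -> Tuple[List[float], List[float]]:
    """One time step: new hidden state from x_t and h_prev, then the output."""
    pre = [a + b + c for a, b, c in zip(matvec(W_hx, x_t), matvec(W_hh, h_prev), b_h)]
    h_t = [sigmoid(p) for p in pre]                                   # hidden state
    y_t = [sigmoid(o + b) for o, b in zip(matvec(W_yh, h_t), b_y)]    # output
    return h_t, y_t
```

Calling `rnn_step` once per record and feeding each returned `h_t` back in as `h_prev` unrolls the network over a sequence.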
The Long Short Term Memory (LSTM) network used here replaces each hidden-layer node with a memory cell c that serves as intermediate storage and contains a node with a self-connected recurrent edge of weight one. In equation 9, the input node $g_c$ takes the current input $x^{(t)}$ and $h^{(t-1)}$ and runs the weighted input through a tanh activation function. The value of the input gate $i_c$ multiplies the value of the input node. The internal state $s_c$ at the core of each memory cell has linear activation and a self-loop recurrent edge of unit weight, also known as the constant error carousel. The forget gate $f_c$ is used to flush the contents of the internal state for continuously running networks. Hence, equation 16 represents the forward-pass calculation of the internal state. The value $v_c$ produced by a memory cell is the internal state $s_c$ multiplied by the value of the output gate $o_c$, where the internal state is first run through a tanh activation function so that each cell's output has the same dynamic range as an ordinary tanh hidden unit.
Equations 9 and 10 represent the complete algorithm of an LSTM network with forget gates. A simpler LSTM without forget gates is obtained by setting $f^{(t)} = 1$ for all t. Here the input node g uses the tanh activation function φ. In the forward pass, the LSTM learns when to let activation in and out through the input and output gates. Hence, activation is trapped in the memory cell when both the input and output gates are closed, and in the same way the gates learn when to let error in and out. Thus LSTM is preferred over plain RNNs because of its remarkable ability to learn long-range dependencies.
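The gate interactions described above can be made concrete with a single scalar memory cell. This is a simplified sketch under stated assumptions: one input and one cell rather than vectors, a standard forget-gate LSTM formulation, and parameter names (`w_gx`, `b_f`, and so on) that are hypothetical and not taken from the paper.

```python
import math
from typing import Dict, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_cell_step(x_t: float, h_prev: float, s_prev: float,
                   p: Dict[str, float], use_forget_gate: bool = True) -> Tuple[float, float]:
    """One forward step of a scalar LSTM memory cell; returns (h_t, s_t)."""
    g = math.tanh(p["w_gx"] * x_t + p["w_gh"] * h_prev + p["b_g"])  # input node
    i = sigmoid(p["w_ix"] * x_t + p["w_ih"] * h_prev + p["b_i"])    # input gate
    o = sigmoid(p["w_ox"] * x_t + p["w_oh"] * h_prev + p["b_o"])    # output gate
    if use_forget_gate:
        f = sigmoid(p["w_fx"] * x_t + p["w_fh"] * h_prev + p["b_f"])
    else:
        f = 1.0                      # simpler LSTM: f = 1 for all t
    s_t = g * i + s_prev * f         # internal state (constant error carousel)
    h_t = math.tanh(s_t) * o         # cell output
    return h_t, s_t
```

With `use_forget_gate=False` the previous internal state is carried over unchanged, which is the simpler variant mentioned in the text.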
Figure 3.II shows the 2-layer stacked RNN with LSTM of 128 hidden units per layer, an input size of 15, 28 time steps and a single scalar output predicted with the sigmoid function. As the vital signs data are captured from the patients in sequential form, the convolutional network structure presented in Table 3.I (CNN hyperparameter configuration) consists of two tuned convolutional layers with different output shapes and one fully connected layer. The 1D-CNN used here has a (3 × 3) kernel size and reads the time series data from medical devices/sensors. The model shown in Figure 3.a (structure of the convolutional neural network) is used to extract features from the sequence data and to map its internal features. For deriving features from fixed-length segments of the dataset, the 1D-CNN [32] is found to be effective irrespective of the feature location within the segment.

E. EARLY WARNING SYSTEM (EWS)
All the preprocessed data, after training of the machine learning/deep learning algorithms, are then used to predict an outcome through the EWS [2]. The EWS is used to predict which patients' health conditions are at risk. These predictions are then used to start a new treatment and reduce the health risk, so that a critical situation can be avoided. The detailed results are discussed next in Section IV, Results and Discussion.

IV. RESULTS AND DISCUSSION
A. DATASET
In this section, we present all of the experiments that we conducted as supporting results for the methodology section of this paper. All of the experiments were performed using the system configuration shown in Table 4.

B. DATA PREPROCESSING
In this section, we first discuss the results of the data preprocessing techniques defined in the methodology section. These techniques must be applied before the machine learning and deep learning operations are carried out for data classification, regression and ultimately prediction. In the second part, we present the results produced by the learning algorithms for classification, comparing their effects, their handling of the imbalance problem and their ROC curves.
As shown in Figure 4.a, missing values occur in the Glasgow Coma Scale (GCS), which is evaluated through COMASCALE_E, COMASCALE_V and COMASCALE_M. The E, V and M variables correspond to eye opening, verbal response and motor response respectively. These parameters are basic evaluations of a patient's current condition. The eye, verbal and motor responses are rated from 1-4, 1-5 and 1-6 respectively, totaling at most 15, which is considered normal. Because the value is normal for most patients, a missing GCS is treated as normal, and all missing records are imputed with 15. In Figure 4.b, the category value is replaced with a binary class matrix by one-hot encoding. As can be seen, the columns GS, CM, CRS, GI, HEMA and NEPH start as an empty matrix that is then filled in as a binary matrix: the value 1 appears on the diagonal starting at the top left corner where the category in HCURSVCL matches the column term, such as HEMA. Binarizing the matrix represents the category values in a balanced way and makes them suitable as input to the machine learning algorithms. This matters because the input strongly affects the output.
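The two preprocessing steps above, GCS imputation and one-hot encoding, can be sketched as follows. This is an illustration under assumptions: each missing GCS component is filled with its maximum (4, 5, 6) so that a fully missing GCS totals 15, records are represented as dictionaries, and the category list is taken from the column names mentioned in the text.

```python
from typing import Dict, List, Optional, Sequence

# Normal (maximum) value of each GCS component; together they total 15.
GCS_NORMAL = {"COMASCALE_E": 4, "COMASCALE_V": 5, "COMASCALE_M": 6}

# Category columns named in the text for the HCURSVCL field.
CATEGORIES = ["GS", "CM", "CRS", "GI", "HEMA", "NEPH"]


def impute_gcs(record: Dict[str, Optional[int]]) -> Dict[str, Optional[int]]:
    """Return a copy of the record with missing GCS components set to normal."""
    filled = dict(record)
    for key, normal in GCS_NORMAL.items():
        if filled.get(key) is None:
            filled[key] = normal
    return filled


def one_hot(value: str, categories: Sequence[str] = CATEGORIES) -> List[int]:
    """Binary indicator vector with a 1 in the position of the matching category."""
    return [1 if value == c else 0 for c in categories]
```

A record with every GCS component missing is imputed to 4 + 5 + 6 = 15, and `one_hot("HEMA")` places its single 1 in the HEMA column.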
Note: HCASE No. is not a feature used for training/testing.

C. EVALUATION METRIC
The precision, recall and F1-score are used to evaluate the performance of the different proposed models, as shown in equations 11, 12 and 13 respectively:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{11}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{12}$$

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{13}$$

where TP (true positives) indicates that the outcome of the patient is adverse and the model predicted it correctly, and TN (true negatives) indicates the same for a healthy outcome. False positives (FP) indicate that the model predicted an adverse outcome while the actual outcome is healthy. False negatives (FN) indicate that the model predicted a healthy outcome while the actual outcome is adverse. Accuracy depends on these four values, as shown in equation 14:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{14}$$

that is, the true positives plus the true negatives divided by the total number of values, which determines how often the classifier is correct. The loss can be simply calculated as 1 minus accuracy.
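The four counts defined above translate directly into the evaluation metrics. The sketch below uses the standard definitions; the function name is illustrative, and returning 0.0 for an undefined ratio is a convention chosen here, not something stated in the paper.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Precision, recall, F1-score, accuracy and loss from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "loss": 1.0 - accuracy}
```

For a toy confusion matrix with 8 true positives, 2 false positives, 2 false negatives and 88 true negatives, precision, recall and F1 are all 0.8 while accuracy is 0.96, which shows how accuracy alone can look favorable on imbalanced data.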

D. EVALUATION OF DIFFERENT METHODS
In this subsection, we present the evaluation of the machine learning and deep learning algorithms on the hospital dataset. The hospital dataset is divided into training and testing datasets from 2007-2015 and 2016-2017 respectively. The experimental results of the different algorithms are shown in Table 5. We used two machine learning models, (a) Random Forest and (b) XGBoost in Table 5, each considering the records from the last 3 hours, whereas the deep learning models such as CNN take 28 records because of the class imbalance problem.
In Table 5.c, a convolutional neural network model takes stack series records as input data, with a learning rate of 0.01 and a batch size of 128. The CNN has two conv2d layers with no pooling, a configuration chosen heuristically to better fit the results. The next version was a CNN with class balance, taking stack series records with class balance as input data and again using two conv2d layers. The CNN with class balance and align by per hour input uses the same two conv2d layers. In Table 5.d, the RNN takes align by per hour data with class balance as input and has a structure of two stacked LSTM layers with 128 units each, 15 features and 28 time steps.
Figure 5 presents the performance statistics for the adverse event prediction. The x-axis is the output value, i.e. the probability score produced by the softmax at the last level of the CNN with CB and per hour data, and the y-axis is the number of patients treated in the hospital. In Figure 5.a, selecting the highest cut-off range attempts to save more patients, as the recall score achieved is the highest (0.96), but the precision suffers (0.71). Therefore, in Figure 5.b, the cut-off value range is tuned to its optimum, and a recall score of 0.95 is observed. The cut point is chosen to balance accuracy and the positive range, so that patients are saved while false alarms are kept manageable. The choice of cut point is important: shifting it to the left saves patients' lives up to a certain limit, beyond which precision is lost.
In Figure 5.c, by contrast, setting the cut-off range value too high, i.e. shifting to the right, leads to many false alarms within the system and a lower recall score (0.90). The values for Figure 5 are detailed in Table 6, which lists the experiment results for different cut-off lengths. To deploy the model in real-life clinical practice, we have to allow some false negative cases to appear, in order to lower the barriers for physicians caused by alarm fatigue. Table 6 provides adequate information to trade off the gains and losses of the implementation. It was also observed by the doctors and hospital staff from Figures 5.a and 5.c that an imprecise output range generated many false alarms, which led to chaos and caused misunderstandings with the deep learning system. Therefore, the training provided by the AEP-DLA research team to the group of doctors and nurses was crucial for analyzing and learning the system, tuning it to the optimal cut-off range, interpreting the results and making further decisions in the treatment process.
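The effect of the cut point on precision and recall can be explored with a small helper that scores a set of predictions at a given cut-off. This is a generic sketch: the scores and labels in the note below are toy values, not the study data, and a patient is assumed to be flagged when the score is strictly greater than the cut-off, mirroring the threshold rule of the AEP algorithm.

```python
from typing import Sequence, Tuple


def precision_recall_at(scores: Sequence[float], labels: Sequence[int],
                        cutoff: float) -> Tuple[float, float]:
    """Precision and recall when a patient is flagged if score > cutoff."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s > cutoff and y == 1)
    fp = sum(1 for s, y in pairs if s > cutoff and y == 0)
    fn = sum(1 for s, y in pairs if s <= cutoff and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

With toy scores (0.1, 0.4, 0.35, 0.8, 0.9) and labels (0, 0, 1, 1, 1), a cut-off of 0.3 flags four patients and a cut-off of 0.5 flags two, so evaluating the helper at several cut-offs produces the kind of trade-off table that supports choosing an operating point.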
In Table 7, the experiment results of the various algorithms are presented. Table 7.a presents the precision, recall and AUC generated by the random forest algorithm. The parameters considered for this model are n estimators of 300, max depth of 27, min samples split of 30, max features of 3, oob score set to true and random state of 10. Table 7.b presents the XGBoost classifier with a learning rate of 0.1, n estimators of 240, max depth of 4, min child weight of 5, random state of 10, subsample of 0.9, column sample by tree of 0.6, gamma of 2, regularization alpha of 0.1 and AUC as the evaluation metric. It is shown that, for the last 3 hours, CNN with CB (class balance) and per hour pre-processing has low precision, while RNN with CB and per hour pre-processing has better precision but lower recall and AUC. Therefore, we chose the method with the best recall and AUC score for the last 3 hours' adverse event prediction. Among the machine learning algorithms, RF has the highest precision, whereas for recall, CNN + CB + Per Hour is the best. CNN + CB + Per Hour has the better recall and AUC score in the comparison, and using the per hour data provides better results than the other methods.
Table 8 presents the experiment results obtained from the AI model for the reference benchmark comparison with RNN and CNN. The red highlighting indicates the highest score achieved in the respective algorithm comparison. AEP-DLA comes out ahead in the benchmark comparison for recall and accuracy, which supports its implementation.
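For reproducibility, the Random Forest and XGBoost settings reported for Table 7 can be written down as keyword dictionaries. The paper does not name the libraries used; the key names below assume scikit-learn's `RandomForestClassifier` and the scikit-learn style `XGBClassifier` from XGBoost, and should be adapted if a different implementation was used.

```python
# Assumed keyword names (scikit-learn RandomForestClassifier); values from the text.
RANDOM_FOREST_PARAMS = {
    "n_estimators": 300,
    "max_depth": 27,
    "min_samples_split": 30,
    "max_features": 3,
    "oob_score": True,
    "random_state": 10,
}

# Assumed keyword names (XGBoost XGBClassifier); values from the text.
XGBOOST_PARAMS = {
    "learning_rate": 0.1,
    "n_estimators": 240,
    "max_depth": 4,
    "min_child_weight": 5,
    "random_state": 10,
    "subsample": 0.9,
    "colsample_bytree": 0.6,
    "gamma": 2,
    "reg_alpha": 0.1,
    "eval_metric": "auc",
}
```

Under those assumptions the dictionaries would be unpacked into the constructors, for example `RandomForestClassifier(**RANDOM_FOREST_PARAMS)`. Depending on the XGBoost version, `eval_metric` may need to be passed at fit time instead.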

E. EVALUATION OF OUR PROPOSED METHOD
To verify the performance of our proposed method, in this section we compare the model with other methods. The EDI [19] uses Naïve Bayes to calculate continuous risk scores from vital signs data. NEWS [5] is a popular scoring system and a standard adopted worldwide for assessing the severity of acute illness in patients, used with logistic regression to predict early clinical deterioration after ICU transfer, while MEWS [20] is an improved version of NEWS that also uses logistic regression as the proposed method to predict injury severity, ICU resource usage, air transport and mortality. Kwon et al. [29], [30] proposed a 3-layer LSTM for cardiac arrest during hospitalization. The risk stratification tool is one of the analysis methods for medical data [31]. They used 8 hours of data for their model, while the other three methods mentioned above use 1 hour of data to predict the risk.
Our model uses 28 hours of data in time series format, collecting the vital sign features of the patient continuously from the time of admission to the hospital, instead of comparing just a single preliminary record from the registration report as in adjacent research studies. We ran the methods mentioned above on our data to predict the risk 3 hours later. The results in Table 7 show that the convolutional neural network of AEP-DLA performs better than the other models in recall and AUC for the last 3 hours. Our method has the best performance on the hospital data, with both precision and recall better than those of the other methods. Therefore, AEP-DLA uses the convolutional neural network as the preferred choice for the system model.
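The windowing implied by "28 hours of data to predict the risk 3 hours later" can be sketched as a slice over hourly records. The layout is an assumption made for illustration: `hourly` holds one record per hour since admission, `event_hour` is the index of the hour being predicted, and the window ends `lead` hours before it.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def build_window(hourly: Sequence[T], event_hour: int,
                 window: int = 28, lead: int = 3) -> Optional[List[T]]:
    """Return the `window` hourly records that end `lead` hours before
    `event_hour`, or None when the stay is too short to fill the window."""
    end = event_hour - lead   # exclusive end index of the window
    start = end - window
    if start < 0 or end > len(hourly):
        return None
    return list(hourly[start:end])
```

For an event at hour 40, the window covers hours 9 through 36, so the prediction is issued with a 3-hour lead. Stays shorter than 31 hours before the target hour yield no window.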
The results of our model for the last 28 hours, taken as the health trend before the adverse event, show that vital signs data collected continuously from the patient's admission can be used effectively to support early diagnosis and treatment.

V. CONCLUSION
Adverse event prediction is considered crucial for saving patients' lives and improving their health conditions. Physicians usually check the latest data each day and decide what to do next. Therefore, we fed the last 28 hours of vital signs into the AEP-DLA model. Hence, once the first 28 hours after the patient's admission have passed, the algorithm starts processing the vital signs and then predicts the results. Thus, the alert raised by the system from the last 28 hours of data helps to draw the doctor's attention to an emergency situation and to handle the critical case carefully through treatment. To address the various health risks, the proposed AEP-DLA provides several data pre-processing capabilities: it handles missing values and processes single records and multiple records using machine learning and deep learning classifiers respectively to predict better outcomes. The key to achieving better performance with deep learning was to apply a good data pre-processing strategy, i.e. the stack series record method and the align by per hour record method, so that the patients' vital sign records are available in an appropriate form as input for the prediction. In this way, admitted patients' health records are analyzed, disease severity can be determined and the adverse event can be predicted a substantial amount of time in advance. The method using CNN + CB + Per Hour pre-processing achieved the best result in the benchmark comparison, with 92.8% recall and a 99.5% AUC score, as the data were first sorted by hour and then pre-processed and class balanced. The experiments conducted show that the method not only includes most of the features but also provides better prediction performance on the hospital data. In the future, we plan to apply explainable AI to improve this model and provide detailed design insights.

APPENDIX
Table 9 shows the Analysis of Variance (ANOVA) significance test for the dataset, which compares the spread of the means across the different features/columns. As F > F crit, we reject the null hypothesis. Therefore, the differences across the features are significant. Figure 6 presents the ROC curves for all the models referenced in Table 8 for in-depth empirical analysis, indicating that AEP-DLA performs best in the comparison. ROC is used to evaluate the performance of a binary classifier, whereas the AUC score summarizes that performance in a single value.
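To illustrate how an AUC score condenses a ROC curve into one number, the sketch below computes it as the fraction of (positive, negative) pairs that the classifier ranks correctly, counting ties as one half. This pairwise formulation is a standard equivalent of the area under the ROC curve; it is not necessarily how the reported values were computed, and it is quadratic in the sample count, so it suits small examples only.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # correctly ranked pair
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))
```

With toy scores (0.1, 0.4, 0.35, 0.8) and labels (0, 0, 1, 1), three of the four positive-negative pairs are ranked correctly, giving an AUC of 0.75.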