Deep Interpretable Early Warning System for the Detection of Clinical Deterioration

Assessment of physiological instability preceding adverse events on hospital wards has been previously investigated through clinical early warning score systems. Early warning scores are simple to use, yet they treat data as independent and identically distributed random variables. Deep learning applications are able to learn from sequential data; however, they lack interpretability and are thus difficult to deploy in clinical settings. We propose the ‘Deep Early Warning System’ (DEWS), an interpretable end-to-end deep learning model that interpolates temporal data and predicts the probability of an adverse event, defined as the composite outcome of cardiac arrest, mortality or unplanned ICU admission. The model was developed and validated using routinely collected vital signs of patients admitted to the Oxford University Hospitals between 21st March 2014 and 31st March 2018. We extracted 45 314 vital-sign measurements as a balanced training set and 359 481 vital-sign measurements as an imbalanced testing set to mimic a real-life setting of emergency admissions. DEWS achieved higher accuracy than the state-of-the-art system currently implemented in clinical settings, the National Early Warning Score, in terms of the overall area under the receiver operating characteristic curve (AUROC) (0.880 vs. 0.866) and when evaluated independently for each of the three outcomes. Our attention-based architecture was able to recognize ‘historical’ trends in the data that are most correlated with the predicted probability. With high sensitivity, improved clinical utility and increased interpretability, our model can be easily deployed in clinical settings to supplement existing EWS systems.


I. INTRODUCTION
In recent years, increased access to Electronic Health Records (EHR) has motivated the development of data-driven systems that detect physiological derangement to secure timely response. Early Warning Score (EWS) systems assess a patient's degree of illness by assigning scores to routinely collected vital-sign measurements based on pre-determined normality ranges. The National Early Warning Score (NEWS), which is currently used in hospitals and recommended by the Royal College of Physicians in the United Kingdom [1], has shown superior performance in comparison to other EWS systems in detecting the composite outcome of unplanned ICU admission, cardiac arrest, and mortality [2]. EWS systems assign an independent score to each vital-sign variable and assume that vital-sign measurements are independent and identically distributed (I.I.D.) random variables. Given their simplistic nature, traditional EWS systems do not learn any spatio-temporal information from the vital signs. We hypothesized that the use of deep learning may improve the accuracy of predicting clinical outcomes by recognizing complex patterns in the data.
Deep learning has achieved significant improvements over clinical scores and static machine learning models, for example in predicting ICU mortality in pediatric patients [3] and in detecting sepsis [4]. Long Short Term Memory (LSTM) networks in particular have shown superior performance across various benchmarks [5]-[7]. Most of the relevant work, however, was primarily based in intensive care settings. We designed an EWS framework that could generalize across a heterogeneous patient population in non-critical care wards, from pre-processing sparse vital-sign variables to predicting the probability of an outcome.
Additionally, the decision-making process of the previously proposed deep learning models lacked interpretability, and as such they are viewed as 'black box' models by clinical staff since they do not provide any insight on the patterns learned from the data. We defined interpretability as the ability of the clinician to inspect 'trends' of vital signs that most contribute to the model's predicted probability. Inspired by natural language processing, our approach incorporated an 'attention' mechanism with recurrent architectures that highlights those parts of the input time-series that are most relevant to the output. This is useful since interpretability is considered to be a core component of clinical utility [8].
The physiological data recorded in an EHR is often sparse, noisy, and incomplete, especially when collected in non-critical care wards, which is challenging for recurrent deep learning architectures that require regularly-spaced data points. Regularly sampled values can be obtained by interpolation using naive methods, such as carrying the most recent value forward (CF) and linear interpolation (LI). Although such approaches are computationally inexpensive, they may impose bias and error [9] and do not account for the uncertainty of the imputed data. In a probabilistic approach, Gaussian Process Regression (GPR) was used to model irregularly-sampled physiological time-series data [10]-[12], to interpolate the posterior mean and variance at unseen time points. In our work, we evaluated the benefit of GPR modeling in comparison to CF and LI.
Unlike currently implemented EWS systems that were originally designed in a heuristic fashion, we developed and validated an interpretable end-to-end Deep Early Warning System (DEWS) that alerts for clinical deterioration, defined as the composite outcome of unplanned ICU admission, mortality, and cardiac arrest.

A. Related Works and Contributions
Attention-based deep learning models improve interpretability, can model extended long-term dependencies, and have been widely used in computer vision and natural language processing [13], [14]. Within clinical settings, attention models have gained limited recognition and have been used for classifying atrial fibrillation in ECG data [15], [16], predicting a future diagnosis [17], [18], or predicting high risk vascular disease, using both diagnosis codes and medication data [19]. The limitation of using diagnosis codes is that they may not be readily available in a real-time setting, as in retrospective databases. We aimed to use information in routinely collected vital-sign data, as in currently implemented EWS systems, constituting multiple sequential inputs.
In traditional sequence-to-sequence modelling problems, attention enables the model to learn differentially from more and less important parts of the input sequence; i.e., words in the case of sentence translation, or sentences for document classification [20]. Our goal was to learn different content from different time-series signals, and then fuse the information to predict the probability of an outcome. To the best of our knowledge, no existing attention-based deep learning model focuses on learning from a combination of vital-sign time-series data to indicate a patient's health status in real-time.
The primary contribution of this work is a novel deep learning architecture with high clinical utility to predict clinical outcomes. The model learned from regularly-sampled mean and variance features interpolated by modelling sparse vital-sign data via GPR. We evaluated the framework's ability in detecting deterioration prior to the composite outcome, as in previous studies [21], [22], achieving state-of-the-art results in comparison to the clinical benchmark.
The rest of the paper is organized as follows: Section II describes the methodology pipeline in terms of feature extraction and outcome prediction, Section III describes the datasets used for training and testing, Section IV describes the experimental observations using the proposed models, and Section V discusses findings and presents concluding remarks.

II. PROPOSED METHODS
We framed the problem of detecting clinical deterioration as a binary classification task, such that the model assigns a binary label, based on a computed probability and a pre-defined alerting threshold, to each vital-sign measurement. For each measurement, we would like to predict the probability of the composite outcome within the next $N$ hours. An event window was defined as a vital-sign measurement that was within $N$ hours of a composite outcome and its preceding $w$ hours of observations. A non-event window was defined as a vital-sign measurement that was not within $N$ hours of a composite outcome and its preceding $w$ hours of observations. We set $N = 24$ hours in our study, which is a common evaluation window in the development of EWS systems [21], [22], and we evaluated $w$ at 24, 12, and 6 hours. Assume $W = [x_i, y_i]_{i=1}^{n}$ is an unlabelled sample window, where $x_i$ is the $i$th time instance and $y_i \in \mathbb{R}^m$ denotes the feature space consisting of $m$ vital-sign sequences. The recurrent classification model required the vital-sign data to be measured at a fixed set of regularly sampled time instances to compute the output escalation label $l \in \{0, 1\}$. However, each vital-sign sequence $j$ was temporally irregular due to the nature of EHR data. Hence, we deployed a patient-specific GPR for each vital-sign sequence to interpolate the mean and variance at fixed regularly-spaced time instances $P = [x^*_t]_{t=1}^{T}$. These posterior mean and variance estimates were concatenated for all the vital signs to obtain $Y_\mu = [y_{\mu,j}]_{j=1}^{m}$ and $Y_\sigma = [y_{\sigma,j}]_{j=1}^{m}$, where $Y_\mu, Y_\sigma \in \mathbb{R}^{m \times T}$ and $y_{\mu,j}$ and $y_{\sigma,j}$ are the GPR mean and variance for the $j$th vital sign, such that $j = 1, \ldots, m$. Attention-based encoders learned from each interpolated sequence in $Y_\mu$ and $Y_\sigma$ to obtain the sequence-level context vectors $[c_{\mu,j}]_{j=1}^{m}$ and $[c_{\sigma,j}]_{j=1}^{m}$, respectively. Finally, the summary context vectors $c_\mu$ and $c_\sigma$ summed up the sequence-level context vectors and were fed to decoding layers to compute the probability of an outcome. If the probability exceeded the pre-defined alerting threshold, then $D_W = 1$. The proposed framework is described in Algorithm 1. We now describe each step in more detail.
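The labelling and alerting rules above can be illustrated with a minimal sketch. The function names `label_window` and `alert` are hypothetical and do not come from the original implementation; the sketch assumes that a measurement taken exactly at the outcome time or exactly 24 hours before it still counts as an event window, which the text does not specify.

```python
from datetime import datetime, timedelta
from typing import Optional


def label_window(measurement_time: datetime,
                 outcome_time: Optional[datetime],
                 horizon_hours: float = 24.0) -> int:
    """Return 1 for an event window, 0 for a non-event window."""
    if outcome_time is None:  # admission without a composite outcome
        return 0
    time_to_outcome = outcome_time - measurement_time
    # Assumption: both boundaries (0 h and horizon_hours) are inclusive.
    within_horizon = timedelta(0) <= time_to_outcome <= timedelta(hours=horizon_hours)
    return int(within_horizon)


def alert(probability: float, threshold: float) -> int:
    """Binary escalation label: 1 if the probability exceeds the threshold."""
    return int(probability > threshold)
```

For example, a measurement taken 18 hours before an outcome would be labelled as an event window, while one taken 48 hours before would not.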

A. Patient-Specific Feature Transformation
Each window of length $w$ hours was modelled using GPR. It was also modelled using CF and LI for benchmarking purposes, and an overview of the modelling techniques is shown in Fig. 1. GPR generalizes multivariate Gaussian distributions to infinite dimensionality and offers a probabilistic and non-parametric approach to model a sparse vital-sign time-series as a function of time from admission. We adopted a radial basis function (RBF) with added white noise as our covariance function to map the similarity between pairs of data points $x$ and $x'$,
$$k(x, x') = \sigma_f \exp\left(-\frac{(x - x')^2}{2 l^2}\right) + \sigma_n \, \delta(x, x'), \tag{1}$$
such that $\delta(x, x')$ is the Kronecker delta function and $\Theta = \{l, \sigma_f, \sigma_n\}$ is the set of hyperparameters, where $l$ is the lengthscale, $\sigma_f$ is the variance of the RBF, and $\sigma_n$ is the variance of the added white noise. Since the work involved patient-specific modelling for each sequence per window across a large-scale population, we adopted a Bayesian approach to increase the modelling efficiency. First, we defined the expected value, or the mean function, of each vital-sign GPR as a constant function equivalent to the population mean of patients with the same age and sex. Second, we used lognormal distributions as priors to constrain each hyperparameter to be clinically meaningful. The models were optimized by minimizing the negative log likelihood with respect to the hyperparameters. After fitting the GPR to the training data $[x_i, y_i]_{i=1}^{n}$ of vital sign $j$, the GPR kernel is applied to interpolate missing values at equally-spaced time steps $[x^*_t]_{t=1}^{T}$ across the input window, such that the posterior mean is
$$y_{\mu,j} = K_*^{\top} K^{-1} y, \tag{2}$$
with variance
$$y_{\sigma,j} = \operatorname{diag}\left(K_{**} - K_*^{\top} K^{-1} K_*\right), \tag{3}$$
where $y$ is the training data $[y_i]_{i=1}^{n}$, $K$ represents the similarity measure between all training values, $K_*$ represents the similarity measure between all training and missing values, and $K_{**}$ represents the similarity measure between all missing values.
Finally, we obtained a set of equally-spaced measurements as an input to the neural network.
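As an illustration of the interpolation step, the following NumPy sketch computes a GPR posterior mean and variance on a regular grid for one vital sign. It is not the authors' GPy implementation: the hyperparameters are fixed by hand rather than optimized under priors, the constant mean function is added back explicitly, the white-noise term is included in the predictive variance, and the function names and example values are assumptions made for illustration.

```python
import numpy as np


def rbf_white_kernel(xa, xb, lengthscale, sigma_f, sigma_n, same_inputs=False):
    """RBF kernel with variance sigma_f; white noise with variance sigma_n is
    added on the diagonal only when both arguments are the same input set."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    k = sigma_f * np.exp(-0.5 * sq_dist / lengthscale ** 2)
    if same_inputs:
        k = k + sigma_n * np.eye(len(xa))
    return k


def gpr_interpolate(x, y, x_star, prior_mean, lengthscale, sigma_f, sigma_n):
    """Posterior mean and variance at the grid times x_star."""
    K = rbf_white_kernel(x, x, lengthscale, sigma_f, sigma_n, same_inputs=True)
    K_star = rbf_white_kernel(x, x_star, lengthscale, sigma_f, sigma_n)
    K_star_star = rbf_white_kernel(x_star, x_star, lengthscale, sigma_f,
                                   sigma_n, same_inputs=True)
    solved = np.linalg.solve(K, K_star)              # K^{-1} K_*, shape (n, T)
    mean = prior_mean + solved.T @ (y - prior_mean)  # constant prior mean added back
    var = np.diag(K_star_star - K_star.T @ solved)
    return mean, var


# Hypothetical heart-rate observations (hours from start of window, beats/min).
x_obs = np.array([0.0, 3.0, 7.0, 12.0, 20.0])
y_obs = np.array([82.0, 85.0, 90.0, 88.0, 95.0])
grid = np.arange(0.0, 24.0, 2.0)                     # one point every 2 hours
mean, var = gpr_interpolate(x_obs, y_obs, grid, prior_mean=80.0,
                            lengthscale=np.exp(1.0), sigma_f=25.0, sigma_n=1.0)
```

Far from any observation the posterior mean reverts to the prior mean and the variance grows towards its prior value, which is the uncertainty information that CF and LI cannot provide.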
For any point of prediction with historical data spanning less than w hours, we pre-padded the sequence with the population mean and maximum variance of the respective vital sign. Discrete variables were modeled by CF and LI, which only interpolated mean values. The extracted features, through GPR, CF, and LI, were then scaled for the training and testing of the classifiers.
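The two naive interpolators and the pre-padding rule can be sketched as follows. This is a minimal illustration, assuming time-sorted observations; the function names and the `fill_value` argument (used when no earlier observation exists) are hypothetical.

```python
import numpy as np


def carry_forward(x, y, grid, fill_value):
    """Last observed value at or before each grid time; fill_value if none."""
    idx = np.searchsorted(x, grid, side="right") - 1
    values = y[np.clip(idx, 0, None)]
    return np.where(idx >= 0, values, fill_value).astype(float)


def linear_interp(x, y, grid):
    """Piecewise-linear interpolation; edge values are held outside the data."""
    return np.interp(grid, x, y)


def pre_pad(mean, var, n_steps, population_mean, max_variance):
    """Pre-pad short sequences with the population mean and maximum variance."""
    n_missing = max(n_steps - len(mean), 0)
    padded_mean = np.concatenate([np.full(n_missing, population_mean), mean])
    padded_var = np.concatenate([np.full(n_missing, max_variance), var])
    return padded_mean, padded_var
```

Padding with the maximum variance marks the padded steps as the least certain inputs, in contrast to the interpolated steps that are supported by observations.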

B. Model Architecture
We here describe the architecture of the proposed DEWS method. The interpolated mean and variance features of each vital-sign input were first processed through a Bi-directional LSTM (BiLSTM) network [23], in order to maximize information retrieval in the forward and backward directions. An attention-based BiLSTM model previously performed well for classifying sequential healthcare data [15], and we extended upon it by customizing the mechanism in the attention block and accounting for the uncertainty of the input.
The BiLSTM consisted of two layers which processed each mean and variance input in forward and reverse directions and yielded two hidden layer states $h_{t,f}$ and $h_{t,r}$. The average of $h_{t,f}$ and $h_{t,r}$, denoted as $h_t$, served as the input of our attention mechanism. While the definition of attention varies across the literature, we adopted the definition in [15], [20], to learn the most important parts of each sequence. For each vital sign $j$, $e_{t,j}$ measured the importance of the information at each time step $t$:
$$e_{t,j} = a(W_j h_t + b_j)^{\top} U_j, \tag{4}$$
by computing its similarity with $U_j$, a trainable sequence-level context vector, where $a$ is the rectified linear unit. Next, $e_{t,j}$ was used to derive $\alpha_{t,j}$, or the normalized weights assigned to the hidden states, as:
$$\alpha_{t,j} = \frac{\exp(e_{t,j})}{\sum_{t'=1}^{T} \exp(e_{t',j})}, \tag{5}$$
and $\alpha_{t,j}$ was further employed to derive the final context vector $c_j$:
$$c_j = \sum_{t=1}^{T} \alpha_{t,j} h_t. \tag{6}$$
Equations (4)-(6) were applied to the BiLSTM outputs of each mean and variance input of each vital sign $j$. The sequence-level context vectors obtained from all vital signs were then aggregated by the trainable weights $V_\mu$ for the mean features and $V_\sigma$ for the variance features. The aggregated context vectors were then summed and processed by two dense layers consisting of a rectified linear unit and a sigmoid function, respectively. Finally, the computed probability of an outcome within the next 24 hours was compared to a pre-defined alerting threshold and a binary label was assigned to the window. The model schematic is shown in Fig. 2. During training, the weights $U_k$, $W_k$, $b_k$, $V_\mu$ and $V_\sigma$ were optimized, where $V_\mu$ and $V_\sigma$ accounted for the correlations across the vital signs since they combined their respective context vectors. We chose the other hyperparameters, such as the number of output nodes $e$ per encoder timestep shown in Fig. 2, and the alerting threshold through experimentation.
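A forward pass of the attention block for a single vital-sign encoder can be sketched in NumPy as below. The sketch assumes the common formulation in which each hidden state is passed through a ReLU-activated linear map, scored against the trainable context vector, normalized with a softmax over time, and used to form a weighted sum of hidden states; the exact parameterization and the array shapes are assumptions, and the weights here are random rather than trained.

```python
import numpy as np


def attention_context(h, W, b, U):
    """h: (T, d) averaged BiLSTM states; W: (d, d_a); b: (d_a,); U: (d_a,).
    Returns the context vector c (d,) and the attention weights alpha (T,)."""
    scores = np.maximum(h @ W + b, 0.0) @ U  # ReLU projection scored against U
    scores = scores - scores.max()           # shift for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    c = alpha @ h                            # attention-weighted sum of states
    return c, alpha


rng = np.random.default_rng(0)
T, d = 12, 12                                # 12 time steps, 12 output nodes
h = rng.normal(size=(T, d))
W = rng.normal(size=(d, d))
b = np.zeros(d)
U = rng.normal(size=d)
context, weights = attention_context(h, W, b, U)
```

The vector `weights` is what would be visualized per vital sign: when every time step scores equally, the weights are uniform and the context vector reduces to the mean hidden state.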

C. Model Evaluation
We evaluated our model using training and testing sets. First, we assessed the modelling quality of GPR, CF, and LI. For each sequence in the training windows, we randomly held out 20% of the data as test points and modelled the rest using each interpolation technique. We then calculated the root mean squared error (RMSE) comparing the true values and the interpolated values at the held out test points.
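The held-out evaluation of interpolation quality can be sketched as follows, using carry-forward as the example interpolator. The helper names are hypothetical, and the fallback to the first training value when no earlier observation exists is an assumption made so that the sketch always returns a prediction.

```python
import math
import random


def rmse(true_values, predictions):
    squared = [(t - p) ** 2 for t, p in zip(true_values, predictions)]
    return math.sqrt(sum(squared) / len(squared))


def carry_forward(train, x_query):
    """Most recent training value at or before x_query (train is time-sorted)."""
    earlier = [yi for xi, yi in train if xi <= x_query]
    return earlier[-1] if earlier else train[0][1]  # assumed fallback


def holdout_rmse(x, y, interpolate, test_fraction=0.2, seed=0):
    """Hold out a random fraction of points, model the rest, score the held-out."""
    rng = random.Random(seed)
    n_test = max(1, round(test_fraction * len(x)))
    test_idx = set(rng.sample(range(len(x)), n_test))
    pairs = list(zip(x, y))
    train = [p for i, p in enumerate(pairs) if i not in test_idx]
    test = [p for i, p in enumerate(pairs) if i in test_idx]
    predictions = [interpolate(train, xt) for xt, _ in test]
    return rmse([yt for _, yt in test], predictions)
```

Any interpolator with the same `(train, x_query)` signature, such as a linear or GPR-based one, could be passed in place of `carry_forward` to compare techniques on identical held-out points.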
We evaluated the performance of our classifiers using the area under the receiver operating characteristic curve (AUROC), sensitivity, and specificity on the testing set. All metrics were computed using a bootstrapping technique without replacement [24] with a fixed number of bootstraps (nb). We compared the performance of the models across patients aged 16-45 years and patients aged >45 years, and across the three outcomes independently.
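A resampling-based confidence interval for the AUROC can be sketched in plain Python as below. Because the text specifies resampling without replacement, the sketch draws random subsamples of the test set; the subsample fraction, the number of resamples, and the percentile method for the interval are assumptions, not values reported in the study.

```python
import random


def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def subsampled_auroc_interval(labels, scores, n_resamples=100,
                              fraction=0.8, seed=0):
    """Approximate 95% interval from subsamples drawn without replacement."""
    if len(set(labels)) < 2:
        raise ValueError("both classes are required")
    rng = random.Random(seed)
    n = len(labels)
    k = max(2, int(fraction * n))
    estimates = []
    while len(estimates) < n_resamples:
        idx = rng.sample(range(n), k)            # without replacement
        sub_labels = [labels[i] for i in idx]
        if 0 < sum(sub_labels) < k:              # need both classes present
            estimates.append(auroc(sub_labels, [scores[i] for i in idx]))
    estimates.sort()
    lower = estimates[int(0.025 * (len(estimates) - 1))]
    upper = estimates[int(0.975 * (len(estimates) - 1))]
    return lower, upper
```

The pairwise AUROC is quadratic in the number of windows, so a rank-based implementation would be preferable for a test set of the size described here.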
To assess the clinical utility of DEWS in comparison to the clinical benchmark, we plotted the percentage of generated 'triggers', or windows at or above a given EWS score, on the y-axis against sensitivity on the x-axis [22]. We also assessed the proposed model's decision-making process by visualizing the attention weights computed for a case study. Finally, we compared the average normalized NEWS score and the average DEWS probability for the first 120 hours from admission and the last 24 hours prior to an outcome.
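The trigger-rate comparison can be illustrated with a small sketch that finds the highest threshold reaching a target sensitivity and reports the fraction of windows at or above it. The function name and the toy data are hypothetical.

```python
def trigger_rate_at_sensitivity(labels, scores, target_sensitivity):
    """Return (threshold, trigger_rate) for the highest threshold whose
    sensitivity is at least target_sensitivity."""
    n_pos = sum(labels)
    for threshold in sorted(set(scores), reverse=True):
        flagged = [s >= threshold for s in scores]  # windows at or above score
        true_pos = sum(1 for f, l in zip(flagged, labels) if f and l == 1)
        if true_pos / n_pos >= target_sensitivity:
            return threshold, sum(flagged) / len(scores)
    raise ValueError("target sensitivity cannot be reached")


# Toy example: five event windows followed by five non-event windows.
example_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
example_scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.65, 0.3, 0.1, 0.05, 0.01]
threshold, rate = trigger_rate_at_sensitivity(example_labels, example_scores, 0.8)
```

In this toy example, reaching 80% sensitivity requires a threshold of 0.6, at which half of all windows trigger an alert; a model with a lower trigger rate at the same sensitivity would generate fewer alerts for staff to review.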

III. DATASET
This section describes the data retrieved from a retrospective large database of routinely collected observations from concluded hospital admissions between 21st March 2014 and 31st March 2018 within the Hospital Alerting Via Electronic Noticeboard (HAVEN) project (REC reference: 16/SC/0264 and Confidential Advisory Group reference 08/02/1394). The database included the vital-sign measurements of adult patients admitted to four Oxford University Hospitals: the John Radcliffe Hospital, Horton General Hospital, Churchill Hospital, and the Nuffield Orthopaedic Hospital, collected by the System for Electronic Notification and Documentation (SEND, Sensyne Health) [25]. We extracted the vital-sign measurements and the occurrences of outcomes to develop and validate a model that is analogous to EWS systems.
Each vital-sign measurement was recorded manually by hospital staff and consisted of 5 continuous variables: heart rate (HR), systolic blood pressure (SBP), respiratory rate (RR), temperature (TEMP), and oxygen saturation (SpO2), and 2 discrete variables: the Alert, Voice, Pain and Unconscious (AVPU) score and a binary variable indicating whether supplemental oxygen was provided. We defined the time of a composite outcome as the time of the first occurring event of unplanned ICU admission, mortality and cardiac arrest. In the case of multiple occurrences of adverse events, we removed observations recorded after the first event.
We split the dataset by time as recommended by TRIPOD guidelines [26]: $D_1$ (21 March 2014-31 October 2017) for training and validation, and $D_2$ (1 November 2017-31 March 2018) for testing, roughly corresponding to 85% and 15% of the overall dataset, respectively. We labelled each vital-sign measurement as an event or non-event window. To overcome class imbalance, we performed random under-sampling of the non-event windows in $D_1$ to match the maximum number of event windows. $D_2$ remained imbalanced to mimic a real-life testing set, yet it excluded patients who were well enough to be discharged on the day of admission, elective admissions with scheduled visits, and admissions with no vital-sign measurements collected in the last 24 hours prior to an outcome because such patients are likely to be on terminal care pathways. These are the same exclusion criteria adopted in related previous works [21], [22], as EWS systems aim to assess acutely-ill patients.
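The temporal split and the random under-sampling can be sketched as follows. The record layout (a dictionary with `date` and `label` keys) and the function names are assumptions made for illustration; only the split date is taken from the text.

```python
import random
from datetime import date

SPLIT_DATE = date(2017, 11, 1)  # first day of the testing period


def temporal_split(windows):
    """Windows before the split date go to training, the rest to testing."""
    train = [w for w in windows if w["date"] < SPLIT_DATE]
    test = [w for w in windows if w["date"] >= SPLIT_DATE]
    return train, test


def undersample(windows, seed=0):
    """Randomly drop non-event windows until both classes are the same size."""
    events = [w for w in windows if w["label"] == 1]
    non_events = [w for w in windows if w["label"] == 0]
    rng = random.Random(seed)
    kept = rng.sample(non_events, min(len(events), len(non_events)))
    return events + kept
```

Under-sampling is applied to the training portion only, so that the testing portion keeps the class imbalance expected in practice.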

IV. EXPERIMENTAL OBSERVATIONS
In this section, we summarize the main findings of our study pertaining to experimental and design choices and performance evaluation.

A. Experimental Setup
1) Data Modelling:
The patient admissions had varying lengths of stay, ranging between 0.01 and 1,165 days, and the number of timestamped observations per admission ranged between 1 and 1,901 observations. Across the extracted vital-sign measurements, the missing values for HR, SBP, TEMP, RR, SpO2, AVPU and supplemental oxygen were 1.94%, 1.79%, 10.29%, 3.23%, 1.99%, 4.74%, and 3.63%, respectively. The characteristics and demographics of $D_1$ and $D_2$ are shown in Table I and the distributions of their vital signs are shown in Table II. Both datasets have a similar mean age and proportion of females. Since $D_1$ was balanced while $D_2$ was imbalanced, we observe differences between the distributions of some variables. Lognormal priors over the hyperparameters were selected to ensure that the modelled vital signs fell within the expected ranges in a clinical setting. The lognormal distributions chosen as priors for $l$ were (μ = 1.0, σ = 0.1) for HR, RR, TEMP, and SpO2, and (μ = 1.5, σ = 0.1) for SBP. The lognormal distributions chosen as priors for $\sigma_f$ were (μ = 0.0, σ = 0.1) for HR, SBP, and SpO2, (μ = 1.5, σ = 0.1) for RR, and (μ = 3.5, σ = 0.1) for TEMP. The lognormal distributions chosen as priors for $\sigma_n$ were (μ = 0.0, σ = 4.0) for HR, SBP, and SpO2, (μ = 0.0, σ = 0.1) for RR, and (μ = 1.5, σ = 0.1) for TEMP. Applying population-based lognormal distributions to the priors of the three hyperparameters enabled us to efficiently fit patient-specific vital-sign GPR models in a large-scale dataset. All GPR models were built using GPy (v 1.9.6) [27].
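To make the role of the priors concrete, the sketch below tabulates the lognormal parameters quoted above and evaluates the log prior density of a candidate hyperparameter set. It assumes that μ and σ are the mean and standard deviation of the underlying normal distribution on the log scale; the dictionary keys and function names are hypothetical, and the sketch does not use GPy.

```python
import math

# (mu, sigma) of the lognormal priors, keyed by hyperparameter and vital sign.
PRIORS = {
    "lengthscale": {"HR": (1.0, 0.1), "RR": (1.0, 0.1), "TEMP": (1.0, 0.1),
                    "SPO2": (1.0, 0.1), "SBP": (1.5, 0.1)},
    "rbf_variance": {"HR": (0.0, 0.1), "SBP": (0.0, 0.1), "SPO2": (0.0, 0.1),
                     "RR": (1.5, 0.1), "TEMP": (3.5, 0.1)},
    "noise_variance": {"HR": (0.0, 4.0), "SBP": (0.0, 4.0), "SPO2": (0.0, 4.0),
                       "RR": (0.0, 0.1), "TEMP": (1.5, 0.1)},
}


def lognormal_logpdf(value, mu, sigma):
    """Log density of a lognormal distribution (parameters on the log scale)."""
    return (-math.log(value * sigma * math.sqrt(2.0 * math.pi))
            - (math.log(value) - mu) ** 2 / (2.0 * sigma ** 2))


def log_prior(vital, lengthscale, rbf_variance, noise_variance):
    """Sum of the log prior densities of the three hyperparameters."""
    values = {"lengthscale": lengthscale, "rbf_variance": rbf_variance,
              "noise_variance": noise_variance}
    return sum(lognormal_logpdf(values[name], *PRIORS[name][vital])
               for name in PRIORS)
```

With σ = 0.1 the prior is tight around exp(μ), for instance a lengthscale of roughly exp(1.0) ≈ 2.7 for HR, whereas σ = 4.0 leaves the noise variance of HR, SBP, and SpO2 almost unconstrained. Subtracting `log_prior` from the negative log likelihood would give one possible objective for fitting each patient-specific model.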
Using the optimized GPR models, we interpolated the posterior mean and variance every 2 hours across the $w$-hour window, in keeping with national guidelines of alerting at least every other hour. We modelled a truncated window of up to $w$ hours to reduce the number of timesteps per input for the recurrent neural network. This reduces the complexity of the architecture, which is essential given our dataset size. Additionally, modelling a large number of timesteps would require the storage of a correspondingly large GPR kernel matrix for each window, which would increase the computational cost. The model performed best with windows of length 24 hours, i.e., $w = 24$ hours, after evaluating its performance for lengths of 6, 12, and 24 hours.
After sampling equally-spaced measurements, we experimented with standard scaling, min-max scaling and scaling by the maximum absolute value. The best classification performance was achieved through min-max scaling of mean features into the range [−1,1] and maximum absolute scaling of variance features into the range [0,1]. During training, we used 20% of $D_1$ as a validation set, and so the scaling and shifting operations were obtained through the other 80% and then applied to the validation set and the test set $D_2$.
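The two scaling operations can be sketched in NumPy as follows, with the statistics fitted on the training portion only and then reused for validation and test data. The function names are hypothetical, and the sketch assumes that each feature varies in the training data so that no division by zero occurs.

```python
import numpy as np


def fit_min_max(train_means):
    """Per-feature minimum and maximum from the training portion."""
    return train_means.min(axis=0), train_means.max(axis=0)


def apply_min_max(means, lo, hi):
    """Map [lo, hi] linearly onto [-1, 1]; unseen data may fall outside."""
    return 2.0 * (means - lo) / (hi - lo) - 1.0


def fit_max_abs(train_vars):
    """Per-feature maximum absolute value from the training portion."""
    return np.abs(train_vars).max(axis=0)


def apply_max_abs(variances, scale):
    """Non-negative variances are mapped into [0, 1] on the training data."""
    return variances / scale
```

Values in the validation or test data that lie beyond the training range are mapped outside the nominal interval rather than clipped, which is one possible convention and not necessarily the one used in the study.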
The mean and standard deviation of the RMSE of the training windows are summarized in Table III to compare the data interpolation quality of GPR, CF, and LI. In DEWS, AVPU and supplemental oxygen were interpolated using CF with no time limit. If no previous value was available, then we assumed 'Alert' for AVPU and that supplemental oxygen was not provided. We used CF because it is less computationally expensive than LI, considering that they both resulted in a similar mean RMSE.
2) Model Variants: We designed several deep learning architectures to compare to DEWS. The first set of models consisted of simple architectures, namely Logistic Regression (LR), a single-layer LSTM network, and a single-layer BiLSTM network. Y μ was the input to the models' first single layer. Despite its simplicity, the BiLSTM architecture lacked interpretability.
Therefore, the second set of models included attention mechanisms applied to Y μ . BiLSTM-ATT-1 consisted of a simple BiLSTM followed by one attention (ATT) module. This approach was similar to language modelling, since it consisted of a single input feature space, however we were unable to identify the individual contributions of each vital sign. BiLSTM-ATT-2 thus processed each vital sign independently using a dedicated BiLSTM and attention mechanism. The context vectors of the vital signs were then summed and decoded.
The third set of models consisted of 'Uncertainty-Aware' (UA) models since attention was not only applied to the mean features, but also to the variance features $Y_\sigma$. UA-BiLSTM-ATT-1 consisted of two BiLSTM layers, where one BiLSTM processed $Y_\mu$ and the other processed $Y_\sigma$. Each BiLSTM was then followed by one ATT module. Finally, UA-BiLSTM-ATT-2, our proposed model DEWS, had one BiLSTM-ATT per mean and variance features of the vital signs as shown in Fig. 2. We compared all models to NEWS and a simple logistic regression (LR) which used I.I.D. features as inputs; i.e., the last recorded set of vital-sign measurements.

3) Deep Learning Experiments:
We tried training the model using data from emergency admissions only, as in $D_2$, yet the model performed best when the exclusion criteria were not applied. Despite differences in characteristics of vital signs in $D_1$ and $D_2$, the model was able to learn complex patterns since we utilized patient-specific modelling.
The hyperparameters of the models, including the number of hidden layers, units per layer, and activation functions, were optimized empirically using the training and validation set $D_1$. Within the encoder units, the number of output nodes ($e$) at each timestep of the BiLSTM that resulted in the best performance was 12. The dense layers that aggregated all context vectors consisted of 5 units, while the final dense layer consisted of 1 unit. For the similarity function of the attention block, we compared the hyperbolic tangent function and ReLU, and the latter performed better for our application.
All deep learning architectures were trained with early stopping by monitoring the loss on the validation set to avoid overfitting, and with a batch size of 128, after experimenting with batch sizes of 32, 64, and 128. Each batch consisted of sequences of the same length, whereby length of sequence refers to the number of sampled equidistant data points prior to padding. The models were optimized using the Adam optimizer. All deep learning models were implemented using Keras (v 2.2.2) [28] with a TensorFlow backend (v 1.5.0) [29].
Table IV shows the performance results of all models on $D_2$. DEWS performed best compared to all models with an AUROC of 0.880. DEWS performed better than NEWS for both age groups, as shown in Table V, especially for 16-45 years old patients, with 0.820 AUROC [95% CI 0.818-0.822] compared to 0.760 AUROC [95% CI 0.757-0.762], respectively. DEWS also performed better than NEWS across all outcomes, both in terms of AUROC and sensitivity.
Fig. 3 shows the percentage of triggers, or positive alerts, produced by our best performing model, DEWS, and NEWS at different sensitivity values (x-axis). Across the 16-45 years old patients, Fig. 3(a), NEWS had a trigger rate of approximately 59% while DEWS had a trigger rate of approximately 37%, at a fixed sensitivity of 80%. This shows that DEWS reduced the trigger rate by approximately 22 percentage points, which could directly ease staff burden. Across the >45 years old, DEWS reduced the trigger rate by approximately 3%.

D. Case Studies
The attention weights of two windows are visualized in Fig. 4, where the box with blue borders shows the vital signs after feature transformation through GPR modelling and scaling. The two windows belonged to the same patient, where the first row was a non-event window, since the time of prediction was not within 24 hours of an outcome, while the second row was an event window since the time of prediction was within 24 hours of an outcome. In the attention weights of the non-event window, we observe that time steps 6-10 gained more importance than other time steps for SBP. When compared to the raw data, we notice that this trend corresponds to an increase followed by a decrease in SBP across the respective time steps. All other uniform distributions indicate that the model equally values each time step. The probability of an event produced by DEWS for this window was 28%, compared to a score of 6 by NEWS. This window was thus classified as a true negative by DEWS and a false positive by NEWS. As for the second row, the event window, the attention weights for RR, SBP, and TEMP varied similarly, decreasing from left to right. In the original data, RR and TEMP sharply increased in the earlier time steps. SBP, on the other hand, decreased from a high value. In this scenario, DEWS produced a probability of 98%, while NEWS produced a score of 15, and as such both models produced a true positive.
In Fig. 5, the mean probability produced by the DEWS model and the normalized NEWS score appear to be aligned in terms of overall trends. We notice in (a) that in the first 24-hour window from admission, where the thick dashed black line represents 24 hours from admission, the probability produced by DEWS decreases, which suggests that DEWS gains confidence as the patient's length of stay increases. In (b), in the last 24-hour window prior to an outcome, DEWS maintained a mean probability greater than its alerting threshold of 0.66, while NEWS maintained a mean normalized score at around 0.3. When comparing the models' AUROC against time to outcome, we observe that DEWS performed better than NEWS across the 24-hour window prior to outcome, as shown in Fig. 6.

V. CONCLUSION AND DISCUSSION
We propose an attention-based neural network that learns from historical trends of vital signs through interpolated mean and variance features to alert for clinical deterioration. Our proposed architecture DEWS achieved state-of-the-art performance, even while considering a limited set of features. DEWS decreased the number of triggers in comparison to NEWS, especially amongst younger patients. This would ease the burden on clinical staff in such a demanding environment. Furthermore, our model performed best across the composite outcome and the three individual outcomes (unplanned ICU admission, cardiac arrest, and mortality). Improving the performance to alert for each outcome independently will be further investigated in future work by training outcome-specific models. However, in this paper, we chose a composite outcome to avoid further class imbalance and to remain consistent with the literature [21].
Existing EWS systems only assess the most recently collected vital signs, as I.I.D. data. We demonstrate how historical trends of vital signs can provide potentially beneficial supplementary information. By examining the attention weights assigned for each vital sign, in Fig. 4, we were able to demystify the decision-making process of our deep learning model. For example, the clinician could examine why DEWS was alerting by inspecting the time frames where the attention weights were highest in Fig. 4, such as increasing RR or decreasing SBP. Such trend analysis can support designing interventions on hospital wards. The alignment of scores between NEWS and DEWS, which is further illustrated in Fig. 5, emphasizes their supplementary purposes. We envisage the system to provide the NEWS score, DEWS probability, and an overview of the importance of historical trends to the clinicians.
We also accounted for the correlations across vital signs using the trainable weights $V_\mu$ and $V_\sigma$, which learned the relationships across the aggregated context vectors of the vital signs. This incurred further computational complexity during training of the deep learning model, but only represented a forward pass during testing. Future work includes experimenting with multi-task GPR (MGP) to account for correlations during feature transformation. However, the computational cost of MGP is $O(m^3 n^3)$, which is prohibitive for low-resource settings in comparison to $m \times O(n^3)$ for the univariate GPR [30].
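The gap between the two cost estimates can be made concrete with a short calculation that ignores constant factors. The function names and the example sizes are illustrative assumptions.

```python
def univariate_gpr_cost(m: int, n: int) -> int:
    """m independent GPR fits, each cubic in the n observations."""
    return m * n ** 3


def multitask_gpr_cost(m: int, n: int) -> int:
    """One joint fit over all m * n observations, cubic in their product."""
    return (m * n) ** 3


m_vitals, n_obs = 5, 10  # e.g. 5 continuous vital signs, 10 observations each
ratio = multitask_gpr_cost(m_vitals, n_obs) // univariate_gpr_cost(m_vitals, n_obs)
```

The ratio between the two estimates is $m^2$ regardless of $n$, so with five continuous vital signs the joint model would be roughly 25 times more expensive per window under these assumptions.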
From a deployment perspective, the score can be easily incorporated into existing hospital devices, such as bedside monitors or handheld devices, since it uses the same data streams as EWS systems such as NEWS, but it calculates the score differently. Displaying the attention weights on the screen would require an ergonomics-specific analysis, which is beyond the scope of this paper.
We focused on utilizing physiological time-series data to establish analogous grounds with EWS systems currently deployed in clinical settings. We considered incorporating diagnosis codes as in previous studies [17], [31], however we decided that the inclusion of diagnosis codes may be impractical in real-life clinical settings because they are usually assigned at discharge for billing purposes. One study introduced a relevant attention-based model to predict in-hospital mortality in ICU [32]. However, their proposed attention-based multivariate LSTM model did not achieve a better AUROC than its variant without attention for predicting mortality.
Other scores have incorporated laboratory tests [33], [34], yet the main objectives of our work were to inspect vital signs trends in real-time as in EWS and to use routinely-collected variables. We hypothesize that incorporating laboratory tests may marginally improve performance, and this is an area of future study.
Our proposed model was developed and validated on a private dataset as there are currently no publicly available datasets for non-ICU settings. Further assessment of the proposed methodology's generalisability is required with larger datasets and other toy classification problems using a multivariate input. We would also like to test our methods on other clinical prediction tasks through transfer learning.