Machine Learning for Clinical Outcome Prediction

Clinical decision-making in healthcare is already being influenced by predictions or recommendations made by data-driven machines. Numerous machine learning applications have appeared in the latest clinical literature, especially for outcome prediction models, with outcomes ranging from mortality and cardiac arrest to acute kidney injury and arrhythmia. In this review article, we summarize the state-of-the-art in related works covering data processing, inference, and model evaluation, in the context of outcome prediction models developed using data extracted from electronic health records. We also discuss limitations of prominent modeling assumptions and highlight opportunities for future research.


I. INTRODUCTION
Recent artificial intelligence (AI) developments seek to positively impact medicine and clinical practice [1]. Machine learning (ML), an application of AI, recognizes patterns within large quantities of medical data to make future predictions, with many successful applications in natural language processing, computer vision, and automatic speech recognition [2]-[5]. Applications of ML have been successful across several medical domains, such as disease prediction [6] using various data modalities, including speech signals and medical imaging [7]-[10], as well as clinical outcome prediction to detect deterioration, such as cardiac arrest, mortality, or intensive care unit (ICU) admission [11]-[14]. The intention of this paper is to provide a technology survey of recent works on clinical outcome prediction models that illustrate the respective areas of the fields in which they are described.
In general, designing an ML system involves a multidisciplinary effort that extends from data engineering to training and evaluating a predictive model. We consider the general model as a mapping of an input to an output, y = f(X; Θ), where f(.) is a function consisting of parameters Θ, X is the input and y is the output. For example, X can consist of vital signs measurements of the patient, such as heart rate, blood pressure, and respiratory rate, and y can represent a binary label indicating the occurrence of ICU transfer or cardiac arrest during the patient's hospital stay [14]. Fig. 1 depicts the typical pipeline of an ML application, starting from the input X, and ending with the corresponding output represented by y. The first task learns to extract intermediary features (Section IV) while the second task learns from patterns in the data to produce the predicted label (Section V). Such models are usually assessed based on clinical utility and interpretability (Section VI).
As we discuss related works throughout this review, we also provide an intuitive explanation of the ML techniques used for feature extraction or predictive inference. In general, 'learning' how to map the input to the output involves approximating the parameters Θ of the model f(.) using a loss function L(y,ŷ|Θ) and an optimization algorithm. The loss function L(y,ŷ|Θ), also known as the cost function, measures the dissimilarity between the true labels y and the values ŷ predicted by the approximated model (e.g., mean square error or binary cross-entropy). An optimization algorithm, such as gradient descent [15], minimizes L(y,ŷ|Θ) in an iterative manner based on the examples present in the dataset.
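To make the interplay between the model, the loss function, and the optimizer concrete, the following minimal sketch fits a logistic regression model by batch gradient descent on the binary cross-entropy. It is illustrative only: the synthetic data, the learning rate, the number of iterations, and all function names are assumptions introduced here and are not taken from the surveyed works.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # Clip predictions so that log(0) never occurs.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat)))

def fit_logistic_regression(X, y, learning_rate=0.1, n_iterations=500):
    n_samples, n_features = X.shape
    # Append a constant column so the last parameter acts as the bias.
    X_aug = np.hstack([X, np.ones((n_samples, 1))])
    theta = np.zeros(n_features + 1)
    losses = []
    for _ in range(n_iterations):
        y_hat = sigmoid(X_aug @ theta)                # model output f(X; theta)
        losses.append(binary_cross_entropy(y, y_hat))
        gradient = X_aug.T @ (y_hat - y) / n_samples  # gradient of the loss w.r.t. theta
        theta -= learning_rate * gradient             # gradient descent update
    return theta, losses

# Synthetic, hypothetical data: two standardized features and a binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

theta, losses = fit_logistic_regression(X, y)
print(f"initial loss {losses[0]:.4f}, final loss {losses[-1]:.4f}")
```

With all parameters initialized to zero, every prediction is 0.5, so the first recorded loss equals ln 2; subsequent iterations reduce it.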

II. CLINICAL CONTEXT AND FRAMEWORKS OF OUTCOME PREDICTION MODELS
Care pathways within hospitals vary largely due to the diversity of admitted patients. Thus, an understanding of the clinical context is key for developing machine learning models that can be incorporated within existing medical processes. As shown in Fig. 2, a patient may be hospitalized as an emergency or elective admission, where the latter constitutes a routine procedure. During hospitalization, different types of data are routinely collected from the patient for monitoring purposes.
Patient monitoring tools, such as early warning systems [16], are widespread across different hospital wards to continuously assess for patient deterioration. The definition of what exactly constitutes clinical deterioration has evolved over time based on the data collection and processing techniques. Early attempts to define clinical deterioration focused on medical neglect and its end result of clinical complications [17]. Subsequent studies focused on more discrete clinical events, such as severe sepsis, unexpected cardiac arrest, ICU admission or mortality [18], [19], and tend to select one or more end-point measures of clinical deterioration. Such events incur high costs of prolonged hospital stays, litigation, staff time, impact on patients and staff, and broader economic consequences [20]. The latter definition is the most popular one, as it enables researchers to group patients into discrete classes, such as deteriorating (i.e., those who experience an outcome) and non-deteriorating (i.e., those who are discharged without experiencing any outcomes), and as such infer the y labels.
The framework of outcome prediction models also varies across the literature. Some studies predict the risk of an outcome only once using the patient's first N hours of data after admission, such as 24 or 48 hours [21]. Others choose to predict the risk of an outcome, such as ICU readmission, using the patient's last N hours of data prior to discharge. Another common methodology is to develop a real-time alerting score, which computes the risk of deterioration every time a set of clinical observations is collected [22], as in clinical early warning systems [23].

III. ELECTRONIC HEALTH RECORDS
Various types of data can be used to develop outcome prediction models, such as imaging, speech, or claims data [24]. Here, we focus on data extracted from electronic health records (EHR), which are being increasingly deployed in hospitals worldwide. EHRs are used in hospitals to store longitudinal information of patients collected in a care delivery setting. Such information includes patient demographics, vital signs, medications, laboratory data, and description of any outcomes that may have occurred to the patient during hospitalization, or shortly after discharge.
Data extracted from an EHR database can be used to develop and evaluate ML models. The dataset is typically split into a training set and a test set, either by a random or a nonrandom split based on location or time. According to the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement, the nonrandom split by time is the strongest evaluation technique as it avoids random variations between the training and testing sets [25]. During model learning, the training set is used to optimize the parameters Θ of the model. The trained model is then evaluated on the held-out test set using various performance metrics. Fig. 3 shows the overall dataset sizes, in terms of number of patient admissions, reported in studies published in the last decade (arranged in chronological order from left to right), extracted from EHRs. There is an increase of six orders of magnitude between 2008 and 2019, which highlights the increased accessibility to EHR data for research purposes. Most datasets are reported to be private, and there have only been a few notable efforts to release open access datasets, such as the MIMIC-III database [26]. Data and resource sharing is important for the advancement of the field.
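A nonrandom split by time can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout, the field names `admission_id` and `admitted_on`, and the cutoff date are hypothetical and do not correspond to any particular EHR schema.

```python
from datetime import date

def temporal_split(admissions, cutoff):
    """Admissions before the cutoff form the training set, the rest the test set."""
    train = [a for a in admissions if a["admitted_on"] < cutoff]
    test = [a for a in admissions if a["admitted_on"] >= cutoff]
    return train, test

# Hypothetical admissions, one record per hospital admission.
admissions = [
    {"admission_id": 1, "admitted_on": date(2017, 3, 1)},
    {"admission_id": 2, "admitted_on": date(2017, 11, 20)},
    {"admission_id": 3, "admitted_on": date(2018, 6, 5)},
    {"admission_id": 4, "admitted_on": date(2019, 1, 15)},
]

train_set, test_set = temporal_split(admissions, cutoff=date(2018, 1, 1))
print([a["admission_id"] for a in train_set])  # [1, 2]
print([a["admission_id"] for a in test_set])   # [3, 4]
```

Because every test admission occurs after every training admission, the evaluation mimics prospective use of the model.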
It is also commonly agreed that data in EHRs may reflect the recording process present in the hospital rather than being a direct reflection of patient physiology [27]. First, EHRs are complex as they may include structured and unstructured data; an example of the latter is textual information which could require natural language processing techniques to process [28]. Additionally, categorical data, such as diagnostic coding, may adopt different coding systems across different institutions.
Another important dimension is data completeness, which may be defined as the proportion of observations that are actually recorded in the system [29]. Incompleteness of EHRs can be a result of health service fragmentation due to inefficient communication following patient transfer among institutions, the recording of data taking place only during healthcare episodes that correspond to illness, or the increased personalisation of attributes per patient [27], [30]. Completeness may also vary across institutions based on adopted protocols.
The third challenge is the accuracy of the data, or "the proportion of recorded observations in the system that are correct" [29]. Errors can occur while clinical staff observe a patient or record data, and their occurrence may be influenced by random and systematic errors such as billing requirements or avoidance of liability [27]. The accuracy of EHRs can be assessed by checking agreement between different elements within the EHR (such as assigned diagnosis and supplied medications), or by verifying whether values are within expected ranges [31].
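The range-based accuracy check mentioned above can be sketched as a simple plausibility filter. The variable names and numeric limits below are placeholders chosen for illustration; they are assumptions, not clinically validated reference ranges.

```python
# Hypothetical plausibility limits (illustrative only, not clinical guidance).
PLAUSIBLE_RANGES = {
    "heart_rate": (20.0, 300.0),       # beats per minute
    "respiratory_rate": (4.0, 80.0),   # breaths per minute
    "temperature": (25.0, 45.0),       # degrees Celsius
}

def flag_implausible(observation, ranges=PLAUSIBLE_RANGES):
    """Return the names of recorded variables whose values fall outside the expected range."""
    flagged = []
    for name, value in observation.items():
        if name not in ranges or value is None:
            continue  # unknown variables and missing values are not range-checked
        low, high = ranges[name]
        if not (low <= value <= high):
            flagged.append(name)
    return flagged

observation = {"heart_rate": 72.0, "respiratory_rate": 180.0, "temperature": 36.8}
print(flag_implausible(observation))  # ['respiratory_rate']
```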
Finally, it is important to verify whether the data was recorded within a reasonable period of time [31]. For example, the recorded collection time of vital signs may precede the time of admission. Although this aspect of data quality is highly dependent on the efficiency of the clinical staff, it also depends on the workflow protocols adopted at different institutions. Timeliness of data must be assessed to evaluate the chronology of data elements in relation to admission or discharge decisions. For example, laboratory results prior to admission may be considered as part of the subsequent admission, or death within 24 hours of discharge can be considered as in-hospital mortality.
These issues impose challenges on the usability of the data, which usually requires preliminary data pre-processing as shown in Fig. 4. The first step is to define inclusion and exclusion criteria to extract the patient cohort of interest. The second step involves setting assumptions to aid the analysis of the heterogeneous data, such as defining a minimum length of stay. Finally, meaningful features as input variables to the ML model can be extracted using a variety of techniques.

IV. FEATURE EXTRACTION
The performance of clinical predictive models relies on the feature representation of the data, as in other domains [32]. As reported in related works, feature extraction generally involves at least one of domain expertise for hand-crafted features (Section IV-A), data standardization (Section IV-B), or representation learning (Section IV-C).

A. Hand-Crafted Features
Domain expertise is commonly used to provide guidance on the design of the data pre-processing pipeline. This involves (i) preliminary feature selection from the input space, (ii) designing hand-crafted features, and (iii) incorporating prior knowledge of the structure of the data in the model design.
Previous research also computed time series features from waveform data [12], [34], [39], [40]. Those features can be categorized into four types: data adaptive, non-data adaptive, model-based and data-dictated approaches [41]. Fourier and wavelet transforms, for instance, decompose raw signals into frequency components and wavelets, respectively. Time domain, Poincaré nonlinear, cross-correlation analysis and geometric measures have also been used to investigate variability of vital signs [12], [34].
Deriving hand-crafted features is a powerful tool in the design of ML models and has been used extensively over the years. However, it is a time-consuming and labor-intensive process, requires expert knowledge, and may not scale well to new problems.
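As a small illustration of hand-crafted variability features, the sketch below computes two time-domain measures and the two standard Poincaré descriptors (SD1 and SD2) from a short series. The function name, the choice of measures, and the example values are assumptions made for illustration; the cited studies may define their features differently.

```python
import numpy as np

def variability_features(series):
    x = np.asarray(series, dtype=float)
    diffs = np.diff(x)                     # successive differences x[n+1] - x[n]
    # Time-domain measures.
    sd = np.std(x, ddof=1)                 # overall standard deviation
    rmssd = np.sqrt(np.mean(diffs ** 2))   # root mean square of successive differences
    # Poincare plot of (x[n], x[n+1]): spread across and along the identity line.
    sd1 = np.std((x[1:] - x[:-1]) / np.sqrt(2.0), ddof=1)
    sd2 = np.std((x[1:] + x[:-1]) / np.sqrt(2.0), ddof=1)
    return {"sd": sd, "rmssd": rmssd, "sd1": sd1, "sd2": sd2}

# Hypothetical heart rate readings in beats per minute.
heart_rate = [72.0, 75.0, 71.0, 74.0, 73.0]
features = variability_features(heart_rate)
print({name: round(float(value), 3) for name, value in features.items()})
```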

B. Data Standardization
ML algorithms require further data preparation steps to ensure stability of learning. Here, related works reduce the noise, sparsity and irregularity of the clinical data, as well as align the scales of the various predictor variables.

1) Time-Series Modeling:
Time-series modeling is widely used in studies pertaining to early warning models [42], [43]. It is often used either (i) to infer a pattern of the physiological trajectory or (ii) as an interpolation technique to overcome the sparsity and irregularity of physiological data.
Linear dynamic systems have been previously used to model physiological variables for ICU monitoring [44] and detection of sepsis [45]. Hidden Markov Models (HMMs) were also used to model health trajectories of patients [46], [47]. However, such models cannot easily adapt to irregularly sampled vital-sign data. Additionally, each hidden state in an HMM only depends on the previous state [48]. Another approach for modeling similar data is the kernel-based support vector regression [42].
One of the most popular techniques for time series modeling within the clinical domain is Gaussian Process Regression (GPR). GPR is based on a non-parametric stochastic process that offers a probabilistic approach for time-series modeling by providing confidence intervals for estimated values at unobserved time instances. A comprehensive introduction to GPR can be found in [49]. Previous studies illustrate the robustness of the single-task GPR [42], [50], [51] in modeling a single physiological time-series variable. Others focus on multi-task GPR [43], [52], [53], which learns similarities across several time series and models them simultaneously. The use of GPR relies heavily on the choice of the kernel that encodes prior knowledge of any nonlinear time-series dynamics that might be hypothesized to exist in the data.
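The following sketch shows single-task GPR used as an interpolation tool for irregularly sampled observations, returning a posterior mean and standard deviation at unobserved times. It assumes a zero-mean prior on the centered data and a squared-exponential kernel; the kernel choice, its hyperparameters, the noise variance, and the example measurements are all assumptions for illustration and are not tuned or taken from the cited works.

```python
import numpy as np

def rbf_kernel(t1, t2, length_scale=2.0, signal_variance=25.0):
    # Squared-exponential covariance between two sets of time points.
    sq_dist = (t1[:, None] - t2[None, :]) ** 2
    return signal_variance * np.exp(-0.5 * sq_dist / length_scale ** 2)

def gp_posterior(t_obs, y_obs, t_new, noise_variance=1.0):
    """Posterior mean and standard deviation of the latent function at t_new."""
    y_mean = y_obs.mean()
    K = rbf_kernel(t_obs, t_obs) + noise_variance * np.eye(len(t_obs))
    K_s = rbf_kernel(t_obs, t_new)
    L = np.linalg.cholesky(K)
    # Solve K alpha = (y - mean) using the Cholesky factor.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs - y_mean))
    mean = y_mean + K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    variance = np.diag(rbf_kernel(t_new, t_new)) - np.sum(v ** 2, axis=0)
    return mean, np.sqrt(np.maximum(variance, 0.0))

# Hypothetical, irregularly sampled heart rate (hours since admission, beats per minute).
t_obs = np.array([0.0, 1.5, 2.0, 5.0, 9.0, 10.5])
y_obs = np.array([82.0, 85.0, 86.0, 95.0, 90.0, 88.0])
t_new = np.linspace(0.0, 12.0, 25)  # regular half-hourly grid

mean, std = gp_posterior(t_obs, y_obs, t_new)
lower, upper = mean - 1.96 * std, mean + 1.96 * std  # approximate 95% interval
```

The posterior standard deviation is smallest near observed times and grows in the gaps between them, which is the uncertainty information that simple interpolation does not provide.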
Most recently, neural processes, a class of neural latent variable models, were also introduced as a probabilistic regression approach [54], which generalizes GPR through the use of generative models from deep learning.
Modeling the physiological trajectory of patients has become increasingly popular for further use in classification [43] or clustering applications [46], [50].
2) Feature Scaling: Empirical studies show that the performance of predictive models relies on the statistical normalisation of the input space [55]. Z-score normalisation with zero mean and unit standard deviation is a widely used tool in feature scaling of numeric clinical variables [13], [56]-[59]. Min-max normalisation performs a scaling of the feature values to lie within a range, such as [0,1] in [11]. A rigorous comparison of the different normalisation techniques in the context of clinical deterioration does not exist. The current practice is to choose the normalisation technique based on its effect on the performance of the respective classifier. This presents an opportunity for future research.
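Both normalisation schemes can be sketched in a few lines. The example assumes that the scaling statistics are estimated on the training set only and then reused on held-out data, which is common practice but is an assumption here rather than a statement from the cited studies; the values are hypothetical.

```python
import numpy as np

def fit_zscore(X_train):
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    return mean, np.where(std == 0, 1.0, std)  # guard against constant features

def apply_zscore(X, mean, std):
    return (X - mean) / std

def fit_minmax(X_train):
    return X_train.min(axis=0), X_train.max(axis=0)

def apply_minmax(X, low, high):
    span = np.where(high - low == 0, 1.0, high - low)
    return (X - low) / span

# Hypothetical columns: heart rate (bpm) and temperature (degrees Celsius).
X_train = np.array([[60.0, 36.5], [80.0, 37.0], [100.0, 38.5]])
X_test = np.array([[120.0, 37.2]])

mean, std = fit_zscore(X_train)
train_z = apply_zscore(X_train, mean, std)

low, high = fit_minmax(X_train)
test_mm = apply_minmax(X_test, low, high)
print(test_mm)  # the heart rate maps to 1.5, outside [0, 1]
```

The last line shows that min-max scaling does not bound unseen values: a test measurement above the training maximum maps outside the target range.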

C. Representation Learning
Learning a suitable lower-dimensional embedding or representation of a high-dimensional input space is a fundamental component of ML research [32]. The embedding can represent a medical concept [60] or summarize a patient's hospital visit [61]. It often performs better than the raw input for learning subsequent tasks [62]-[64]. We now provide an overview of the techniques for obtaining embeddings in related medical applications: (i) standard dimensionality reduction techniques, (ii) distributed representations used in language modelling, (iii) using embedding layers as part of a larger model, or (iv) through the latent space of autoencoders and their variants. Such compact representations are then further used as inputs for classification or clustering purposes (covered in Section V).

1) Standard Dimensionality Reduction Techniques:
One of the most popular statistical dimensionality reduction techniques is principal component analysis (PCA) [127]. PCA transforms a set of possibly correlated variables to a set of linearly uncorrelated components. It has been used to extract features for various clinical applications [40], [65], [66], such as for the detection of hypotensive episodes [34], mortality prediction across stroke patients [67], or prediction of hospital readmission [68]. The main limitation of PCA is that it extracts linear features that may not well represent non-linear relationships present in complex clinical data [32]. Another popular technique is independent component analysis (ICA) [69], [70], which transforms the variables to a set of independent components.
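A minimal sketch of PCA via the singular value decomposition of the centered data matrix is given below. The synthetic data, the loadings, and the function name are assumptions used only to show that the resulting component scores are linearly uncorrelated.

```python
import numpy as np

def pca(X, n_components):
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    scores = X_centered @ components.T          # projection onto the components
    explained_variance = S ** 2 / (X.shape[0] - 1)
    ratio = explained_variance[:n_components] / explained_variance.sum()
    return scores, components, ratio

# Synthetic example: three correlated variables driven by one latent factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 1))
loadings = np.array([[1.0, 0.8, -0.5]])
X = latent @ loadings + 0.1 * rng.normal(size=(100, 3))

scores, components, ratio = pca(X, n_components=2)
print(np.round(ratio, 3))                        # variance explained per component
print(np.round(np.cov(scores, rowvar=False), 6)) # off-diagonal entries are ~0
```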
2) Distributed Concept Representations: Patient records may contain discrete categorical codes, such as diagnosis, medication, or treatment codes. Several studies [71]-[73] propose learning from such variables using embedding techniques derived from the distributional hypothesis in semantic modeling. The distributional hypothesis states that words that appear in similar contexts in large samples of language data are semantically similar [74]. The skip-gram algorithm learns the co-occurrence of information inside a context window of a fixed size [128]. It has been used to convert medical codes to dense representations in [60], [71], [75]. Similar to skip-gram, the Global Vectors (GloVe) algorithm was also used to learn the global co-occurrence matrix of medical codes [76].
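The context-window idea behind skip-gram can be illustrated by generating the (target, context) training pairs from a sequence of medical codes. This sketch covers only the pair generation step, not the learning of the embeddings; the function name, the window size, and the example code sequence are hypothetical.

```python
def skipgram_pairs(codes, window_size=2):
    """Return (target, context) pairs for codes that co-occur within the window."""
    pairs = []
    for i, target in enumerate(codes):
        start = max(0, i - window_size)
        end = min(len(codes), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, codes[j]))
    return pairs

# Hypothetical ordered codes recorded for one patient visit.
visit = ["I10", "E11.9", "N18.3", "metformin"]
pairs = skipgram_pairs(visit, window_size=1)
for target, context in pairs:
    print(target, "->", context)
```

With a window size of 1, each code is paired only with its immediate neighbours, giving six pairs for this four-code sequence. A skip-gram model would then be trained to predict the context code from the target code.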

3) Embedding Layers:
Embedding layers can also be integrated as part of a larger model to transform high-dimensional features into a lower-dimensional space. The embedding can consist of a simple linear transformation [77], [78] or a fully-connected (deep) network [11], [73], [77]. One study projected the input into a higher-dimensional space using a convolutional layer [72].

4) Autoencoders and Their Variants:
An autoencoder is a neural network architecture that is often used for dimensionality reduction or feature extraction [79]. It first transforms the input space to a (typically noise-free) lower-dimensional representation using an encoder, and then reconstructs the input from this compact representation. The sparse autoencoder (SAE) enforces a sparsity constraint on the learned representation, and it has been used to learn latent representations of clinical data [61], [80]. The denoising autoencoder (DAE) reconstructs the input from a partially corrupted version of the input. The stacked DAE, which consists of several autoencoders that are initially pre-trained independently then connected in one network, has also been used for clinical applications [56], [70], [81], [82]. Another popular variant of autoencoders is the variational autoencoder [83], which is a generative model that learns a probabilistic latent space, unlike the previously mentioned deterministic autoencoders.
In Table I, we summarize the feature extraction techniques in related outcome prediction studies. In terms of variable selection, we observe that free clinical text is the least-used data type. That may be due to the limited availability of datasets and the complexity of processing free text, for example because of the prevalence of abbreviations. However, recent works have looked into pre-training well-known natural language processing architectures, such as BERT [84], with clinical text for related tasks. Therefore, we expect textual data to become increasingly popular in future clinical applications as research in natural language processing develops. We also note that representation learning has gained popularity from approximately 2013 onwards, and we expect it to continue to be an active area of research in the near future, since it can also support the development of end-to-end models. The consistent use of hand-crafted features over the years indicates their effectiveness in improving the learning of ML models. Unlike hand-crafted features that are easy to compute, time-series modeling is not as widely used since it requires extensive hyperparameter tuning (e.g., choice of kernel). Both hand-crafted features and time-series modeling limit end-to-end training of the pipeline, since they are usually incorporated as stand-alone intermediate data processing steps. Another interesting trend is that more types of predictor variables are being included in prediction models over time, due to the increased availability of EHR data and computational resources.

V. PREDICTIVE INFERENCE
The extracted features can then be used to train an outcome prediction model. The task can be posed either as a classification (Section V-A) or clustering (Section V-B) problem. Table II summarizes the different classification models that have been used to predict various clinical outcomes, as presented in recent papers. Most papers compare the performance of their models to those of simple ML techniques, such as regression [58], [78], which have been useful statistical techniques since long before the rise of ML. We also observe a trend in ML model selection over time, where sophisticated deep learning models, such as convolutional neural networks or long short-term memory networks, were used most recently. We also note that predictions are often defined within a particular future time-frame, ranging from short-term 48-hour prediction windows [11] to 6 months, in order to frame the problem as a classification task. The varying definitions in the literature of what exactly constitutes an outcome make it challenging to compare methods directly. Additionally, some studies tend to focus on specific patient subgroups, such as pediatrics [33].

1) Regression Models:
Logistic regression is one of the simplest linear classifiers [86] and is often considered as a standard benchmark for sophisticated clinical models [87]. Previous studies used logistic regression to predict hemodynamic instability [35], imminent mortality [88], or the composite outcome of cardiac arrest, unplanned ICU admission, and mortality [19]. However, logistic regression cannot learn non-linear relationships and assumes independence across the input variables.
Decision tree learning involves the stratification of the feature space based on a criterion defined by information theory, such as entropy. One study developed an early warning score based on decision trees, using seven routinely-collected laboratory tests [89], while another constructed an ensemble model with gradient tree boosting and adaptive boosting to predict the likelihood of transfer to pediatric ICU [33]. Despite the high interpretability of the models in the aforementioned studies, they heavily rely on task-specific hand-engineered features and do not learn complex patterns in the data.
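The entropy criterion can be illustrated with a single split on one numeric feature. The sketch below computes the information gain of candidate thresholds and selects the best one; it is a simplified, single-node illustration under assumed data, not the procedure of any cited study, and the laboratory values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    # Split the samples into those at or below the threshold and those above it.
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

def best_threshold(values, labels):
    # Exclude the largest value so that the right branch is never empty.
    # Assumes at least two distinct feature values.
    candidates = sorted(set(values))[:-1]
    return max(candidates, key=lambda t: information_gain(values, labels, t))

# Hypothetical laboratory values and binary outcome labels.
lab_values = [0.9, 1.1, 1.4, 2.8, 3.5, 4.2]
outcomes = [0, 0, 0, 1, 1, 1]

threshold = best_threshold(lab_values, outcomes)
print(threshold, information_gain(lab_values, outcomes, threshold))  # 1.4 1.0
```

A full decision tree applies this selection recursively to each resulting subset and across all features.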

2) Kernel Methods:
Kernel methods rely on a user-defined kernel function that estimates the 'similarity' between pairs of data [90]. The support vector machine is a popular example of kernel methods. It projects data into a higher-dimensional space and finds the optimal discriminatory hyper-planes between classes [91]. The use of support vector machines heavily relies on the choice of the kernel and regularization, and they have shown strong performance in recent clinical applications [36], [39], [92], [93]. Computing the kernel matrix for all pairs of data may be computationally expensive for large clinical datasets, especially when a non-linear kernel is used. Further work must investigate approximation techniques for applications involving large-scale medical data.
3) Deep Learning: Deep learning models are also becoming increasingly popular for outcome prediction tasks [12], [14], [37], [43], [96]. The simplest form of neural networks is the multi-layer perceptron (MLP), which consists of fully-connected perceptrons. The main limitation of the MLP is its inability to account for temporal dependencies. Recurrent neural networks and their variants seek to model temporal behaviour through feedback connections. Both Long Short Term Memory (LSTM) networks [43], [97], [98] and Gated Recurrent Units (GRU) [71], [77] were constructed to predict (and alert in advance of) clinical outcomes. There is also a growing interest in developing 'end-to-end' architectures that can jointly extract features and perform classification [78], [85], [99]. Although deep learning techniques are typically characterized by strong performance, their decision-making process lacks interpretability.

B. Clustering for Abnormality Detection
Clustering algorithms are unsupervised learning techniques that group data based on similarity measures. With the increased availability of EHR databases, such techniques have been useful for patient phenotyping and disease subtyping [80], [100]. As for detecting deterioration prior to adverse events, most existing works adopt the novelty detection framework using vital signs, also known as 'one-class classification.' A full review of novelty detection methods can be found in [101].
Such approaches involve creating a 'dictionary' or cluster of healthy patients and computing a similarity metric for new patients. Kernel density estimators are non-parametric methods that can estimate the underlying probability distribution from multi-variate data. Early works demonstrated the use of an unconditional probability density function, a one-class support vector machine, and Gaussian mixture models to assess the patient's status using routine measurements of vital signs with respect to a 'normal' distribution [102], [103]. Another study used a weighted sum of Gaussian kernels to estimate the distribution of the vital signs of 'normal' patients, and the departure from normality was quantified using a novelty score based on likelihood [104]. Later works focused on assessing the patient based on a time-series representation of the vital-sign data. Some considered clustering of GPR-derived latent representations to model vital-sign data trajectories, and computed the similarity of a new test trajectory based on its local likelihood with respect to the training set [50] or the Kullback-Leibler (KL) divergence [105], [106]. There are other statistical similarity metrics that can be used to compare distributions, such as the Bhattacharyya distance [107]. Most of the aforementioned related clustering works are based on vital-sign data only and involve small-scale datasets.
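A likelihood-based novelty score of the kind described above can be sketched with a Gaussian kernel density estimate fitted to 'normal' data. The sketch assumes equally weighted isotropic kernels, a fixed bandwidth, and synthetic standardized measurements; these are simplifying assumptions and do not reproduce the weighting scheme or data of the cited study.

```python
import numpy as np

def log_kde(x, reference, bandwidth):
    """Log density at x under an equally weighted Gaussian kernel density estimate."""
    d = reference.shape[1]
    sq_dist = np.sum((reference - x) ** 2, axis=1)
    log_kernels = (-0.5 * sq_dist / bandwidth ** 2
                   - 0.5 * d * np.log(2.0 * np.pi * bandwidth ** 2))
    # Log-mean-exp for numerical stability when x is far from all reference points.
    m = log_kernels.max()
    return m + np.log(np.mean(np.exp(log_kernels - m)))

def novelty_score(x, reference, bandwidth=0.5):
    # Higher score means lower likelihood under the 'normal' distribution.
    return -log_kde(x, reference, bandwidth)

# Synthetic 'normal' population: two standardized vital signs per observation.
rng = np.random.default_rng(0)
normal_observations = rng.normal(size=(500, 2))

typical = np.array([0.0, 0.0])
abnormal = np.array([4.0, 4.0])
print(novelty_score(typical, normal_observations))
print(novelty_score(abnormal, normal_observations))
```

An alert could then be raised when the score exceeds a threshold chosen on held-out data.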

VI. PERFORMANCE EVALUATION
The performance of supervised outcome prediction models on the testing set is evaluated using various statistical methods. Those statistical methods mainly assess the performance of the model in terms of accuracy metrics. In recent years, model interpretability has also become an area of interest as it directly reflects how we translate technologies into clinical practice [108].

A. Performance Metrics
Model discrimination refers to the model's ability to separate the classes of interest. In the context of outcome prediction models, we will here refer to patients who experience an adverse outcome as the positive class, and those who do not as the negative class. Many ML models are trained to compute the probability of the positive class, which is then converted to a binary value by fixing a decision threshold. The predictions are then compared to the true labels and can be classified into one of four categories: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Accuracy, defined as (TP + TN) / (TP + TN + FP + FN), summarizes the proportion of correctly classified samples across all samples and is highly biased when using highly imbalanced datasets. Therefore, other metrics are usually considered. Sensitivity, or the True Positive Rate (TPR), defined as TP / (TP + FN), assesses the model's ability to correctly predict the positive class.
Specificity, also known as the True Negative Rate (TNR), defined as TN / (TN + FP), assesses the model's ability to correctly predict the negative class.
The receiver operating characteristic (ROC) curve plots the TPR on the vertical axis and (1-TNR), also known as the False Positive Rate (FPR), on the horizontal axis. The integral under the curve is the Area Under the Receiver Operating Characteristic Curve (AUROC) [109]. The AUROC assesses the model's overall diagnostic ability as the decision threshold is varied. An AUROC of 0.5 means that the model is making predictions at random in a two-class setting. One related study mentions that an AUROC higher than 0.8 implies that the model has good diagnostic ability and an AUROC higher than 0.9 means that the model has excellent diagnostic ability [110].
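The construction of the ROC curve by sweeping the decision threshold, and the AUROC as the area under it, can be sketched as follows. The function names and the four-sample example are assumptions for illustration; the area is computed with the trapezoidal rule.

```python
import numpy as np

def roc_points(y_true, scores):
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    fpr, tpr = [0.0], [0.0]  # start at the origin (nothing predicted positive)
    # Lower the threshold step by step; each step can only add predicted positives.
    for threshold in np.sort(np.unique(scores))[::-1]:
        predicted_positive = scores >= threshold
        tpr.append(np.sum(predicted_positive & (y_true == 1)) / n_pos)
        fpr.append(np.sum(predicted_positive & (y_true == 0)) / n_neg)
    return np.array(fpr), np.array(tpr)

def auroc(y_true, scores):
    fpr, tpr = roc_points(y_true, scores)
    # Trapezoidal rule over consecutive ROC points.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(y_true, scores))  # 0.75
```

In this example three of the four positive-negative pairs are ranked correctly, which matches the AUROC of 0.75.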
Precision, also known as the Positive Predictive Value (PPV), is defined as TP / (TP + FP), where TP and FP denote the numbers of true and false positives. It assesses the proportion of predicted positives that truly belong to the positive class.
The Precision-Recall curve, where recall is equivalent to sensitivity, plots the TPR on the horizontal axis and the Precision on the vertical axis. The integral under the curve is the Area Under the Precision-Recall Curve (AUPRC). Unlike the AUROC, the AUPRC and PPV are highly sensitive to class imbalance. Outcome prediction models are generally characterized by low AUPRC and PPV values [111]. Due to low PPV values, such systems should be considered as risk stratifiers rather than predictors [34].
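The sensitivity of the PPV to class imbalance follows directly from its definition: for fixed sensitivity and specificity, the PPV falls as the prevalence of the positive class decreases. The short sketch below works through this arithmetic; the sensitivity, specificity, and prevalence values are hypothetical.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    # Expected fractions of the population that are true and false positives.
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)

for prevalence in (0.5, 0.1, 0.01):
    ppv = positive_predictive_value(0.8, 0.9, prevalence)
    print(f"prevalence {prevalence:.2f}: PPV {ppv:.3f}")
# prevalence 0.50: PPV 0.889
# prevalence 0.10: PPV 0.471
# prevalence 0.01: PPV 0.075
```

With an outcome prevalence of 1%, fewer than one in ten alerts corresponds to a true event even though sensitivity and specificity are unchanged.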
There are other commonly assessed metrics, such as the F1-score [96], [112] and the likelihood ratio [113]. Some studies also report the false positives to true positives ratio [11] and the inverse of the PPV known as the work-up-to-detection ratio [58], [114]. The efficiency curve [89], [115] is a qualitative summary that plots the number of positives generated at different decision thresholds against the sensitivity of the model. This tool is essential to evaluate the trade-off between the total number of positives and the number of false positives.

B. Interpretability
Despite the good performance of recently introduced ML models, interpretability remains a challenge for their clinical utility [108]. There are various definitions of interpretability in existing literature and they refer to several distinct ideas [116], [117]. Most of these ideas pertaining to the clinical domain revolve around trustworthiness of the results and transparency of the model. In the context of this review, we summarize the efforts of outcome prediction models that considered interpretability as a key component of model assessment.
Mimic learning assumes that shallow models, such as linear models, are interpretable. It aims to identify the features that are potentially relevant to the prediction. It involves first training a deep learning model for a specific clinical task. It then trains a shallow model, such as gradient boosting trees, to mimic the behaviour of the deep learning model [82], [118]. The local interpretable model-agnostic explanation (LIME) [119] generates a local explanation of the model behaviour using a shallow model. It has also been used to explain ML models for the prediction of in-hospital mortality [120]. However, it has also been argued that linear models, rule-based models, and decision trees are not intrinsically interpretable [116]. Other post-hoc interpretability techniques such as saliency maps rely on qualitative visual interpretations commonly used in computer vision applications.
It is often argued that deep learning models compromise interpretability for high accuracy [121]. Thus, there have been recent breakthroughs in developing inherently interpretable deep learning models instead of performing post-hoc interpretation [122]. For instance, attention mechanisms are incorporated within deep learning models and assign normalised weights to a set of features. The weights indicate the feature importance for the prediction of a future diagnosis [72], [76], [99] or high-risk vascular diseases [112]. Other works impose non-negativity [61] or sparsity [80] constraints on the learned embedding space of medical data.
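The normalised weights produced by an attention mechanism can be illustrated with a simple dot-product form followed by a softmax. This is one common formulation chosen here as an assumption; the cited models may use different scoring functions, and the feature matrix and query vector below are random placeholders rather than learned parameters.

```python
import numpy as np

def attention_pool(features, query):
    """Return softmax-normalised weights over the rows of `features` and their weighted sum."""
    scores = features @ query
    scores = scores - scores.max()           # subtract the maximum for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ features             # weighted combination of the feature vectors
    return weights, context

# Placeholder inputs: four feature vectors (e.g., four visits) of dimension three.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3))
query = rng.normal(size=3)                   # would be learned during training

weights, context = attention_pool(features, query)
print(np.round(weights, 3), weights.sum())
```

Because the weights are non-negative and sum to one, they can be read as the relative contribution of each feature vector to the pooled representation.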

VII. MOVING FORWARD
The prediction of clinical outcomes is essential to detect deterioration in a timely manner and to ease the burden on clinical staff. The development of ML pipelines and their subsequent performance can also be improved by accounting for a few considerations.

A. Noisy Outcome Labels
To train outcome prediction models, outcome labels are currently being defined based on the occurrence of discrete clinical events. However, such labels may be noisy or inaccurate since EHRs only reflect parts of the hospital experience. For example, while a patient may experience cardiac arrest, the patient may be on terminal care pathways with 'do not resuscitate' orders, and such information may not be present in the available dataset.
Additionally, outcome labels are defined based on a specific time-window, where the features are associated with a positive outcome label only if they are within N hours to an outcome. This creates a strict cut-off where data collected prior to this N -hours window is not associated with a future outcome. Realistically speaking, deterioration is likely to develop gradually over time, yet this is the state-of-the-art approach in developing outcome prediction models within clinical practice. Future work should consider both classification and time-to-event analysis, where the latter focuses on predicting the time until the occurrence of an outcome, rather than just predicting a binary label [123].
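The difference between the two target definitions can be made concrete with a small sketch. The function names, the use of hours as the time unit, and the treatment of discharge as the censoring time are assumptions made for illustration; observations are assumed to precede the event.

```python
def window_labels(obs_times, event_time, window_hours):
    """Binary labels: 1 only if the observation falls within the N-hour window before the event."""
    labels = []
    for t in obs_times:
        within = event_time is not None and 0 <= event_time - t <= window_hours
        labels.append(1 if within else 0)
    return labels


def time_to_event_targets(obs_times, event_time, discharge_time):
    """Time-to-event targets as (duration, observed) pairs.

    observed = 1 means the outcome occurred; observed = 0 means the patient
    was censored at discharge without the outcome.
    """
    targets = []
    for t in obs_times:
        if event_time is not None:
            targets.append((event_time - t, 1))
        else:
            targets.append((discharge_time - t, 0))
    return targets


# Hypothetical patient: observations at these hours, outcome at hour 48, N = 24.
obs = [0, 12, 30, 40, 47]
print(window_labels(obs, event_time=48, window_hours=24))
print(time_to_event_targets(obs, event_time=48, discharge_time=60))
```

In this example the observation at hour 12 receives a negative label even though the outcome follows 36 hours later, whereas the time-to-event target retains that information as a duration.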

B. Personalized Predictive Models
Most outcome prediction models are developed and evaluated population-wide, and recent advances show only marginal improvements. As more data is collected per patient, we hypothesize that the predictive power of such models could improve by developing patient-specific models that account for individual-, disease-, and organizational-based factors [129]. At the individual level, factors may include demographics, lifestyle, coexisting medical conditions, or genetic information. Disease-related factors may include degree of severity, medications and therapy, rate of progression, interventions, surgeries, and procedures. Organizational factors may include type of hospital, time of the day, staff ratio, or staff training. This also motivates the advancement of the internet of things in healthcare to enhance the collection of integrated data, which would help us move towards 'precision medicine.'
Additionally, in the development of machine learning and deep learning models, it is assumed that the data samples are independent and identically distributed (i.i.d.). However, this may not be the case in practice, since some data samples may belong to the same patient and spatio-temporal patterns may be indicative of deterioration prior to an outcome.
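One practical consequence of samples sharing a patient is that a random sample-level split can place the same patient in both the training and test sets. A common safeguard, sketched below, is to split at the patient level. The function name, the default test fraction, and the example identifiers are hypothetical, and this addresses only the evaluation side of the i.i.d. issue, not the modelling of temporal dependence.

```python
import random


def split_by_patient(patient_ids, test_fraction=0.25, seed=0):
    """Split sample indices so that no patient appears in both train and test."""
    unique_patients = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_patients)
    n_test = max(1, round(len(unique_patients) * test_fraction))
    test_patients = set(unique_patients[:n_test])
    train_idx = [i for i, p in enumerate(patient_ids) if p not in test_patients]
    test_idx = [i for i, p in enumerate(patient_ids) if p in test_patients]
    return train_idx, test_idx


# Hypothetical dataset: one entry per sample, several samples per patient.
sample_patient_ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p4"]
train_idx, test_idx = split_by_patient(sample_patient_ids)
print("train samples:", train_idx)
print("test samples:", test_idx)
```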

C. General Learning Models
Deep neural networks are powerful processing techniques. However, most state-of-the-art models learn to predict a specific outcome or perform a particular task, which can generally be referred to as 'narrow AI.' While some of the motivation behind using representation learning has been to learn general patient representations as inputs for downstream predictive tasks, more work is needed on developing generalized models that can automatically learn from heterogeneous EHR data to perform diverse tasks simultaneously, such as disease diagnosis and patient prognosis.
Additionally, end-to-end models have shown recent success in applications such as speech recognition and natural language processing [124]- [126], since they can bypass intermediate data processing steps that are typically present in traditional ML pipelines. In the context of clinical outcome prediction models, this requires major improvements in the collection and curation of EHR data across several dimensions, especially completeness, complexity, and accuracy. To overcome the challenge of manually designing ML pipelines, some works have suggested frameworks to automatically optimize the configuration of the pipeline, such as AutoPrognosis [123]. Future work should further investigate the extension of end-to-end training to EHR data to improve efficiency and minimize biases.
While recently developed ML models perform well within retrospective studies, validating their success in practice requires prospective analysis. The progress of the field relies on increased multidisciplinary collaboration between ML research scientists and clinicians. While it will take time for both parties to speak the same language, we hope that this review helps to demystify the overall ML pipeline and summarizes the assumptions and techniques of the state-of-the-art.