SECTION I

## INTRODUCTION

The rapid pace of development in “e-health” technologies within integrated healthcare systems (such as electronic patient records) has far outpaced their uptake in clinical practice. There is a perceived “plague of pilots” [1], in which prototype systems do not penetrate into clinical use, and there is a consequent lack of the evidence required for adoption at scale [2].

We address this problem by describing a large clinical trial in which the care of 10 000 patients in the Emergency Department of a major research hospital switches over to the use of integrated healthcare systems that we have designed around best-practice principles in machine learning.

Adopting new healthcare systems at scale in a clinical environment is a time-consuming and resource-intensive process, particularly in building the large bodies of evidence required to support adoption. This paper describes the trial needed to provide this evidence, in which algorithms for detecting physiological deterioration are embedded within an integrated healthcare system and are compared with the existing standard of hospital care (where the latter is introduced in Section II). The infrastructure of the system is described in Section III; Section IV investigates the shortcomings of existing methods of using that infrastructure for patient care, and proposes techniques to overcome these shortcomings. We describe how methodologies for evaluating the success of electronic systems in clinical studies are inadequate: Section V describes the dangers involved with using traditional evaluation methods, and proposes a methodology that avoids these problems. Results, obtained from comparing existing and proposed methods, are presented and discussed in Section VI. Conclusions are drawn, and future work is considered, in Section VII.

SECTION II

## EXISTING STANDARD OF CARE

Before the introduction of novel “e-health” systems at scale becomes possible, considerable evidence must be acquired concerning the existing standards of care, which may be obtained, for example, from clinical trials. For systems that monitor patient physiology, we must determine the efficacy of existing methods of patient observation.

Adverse events occur when the physiological condition of patients is not recognized or acted upon [3]. This has resulted in clinical guidance being provided for the U.K. in which the monitoring of certain vital signs was recommended, followed by the suggestion that manual “early warning score” (EWS) systems are used [4]. The latter involve the clinician making a manual observation of a patient’s vital signs, applying univariate scoring criteria to each vital sign in turn (e.g., “score 3 if heart rate exceeds 140 beats per minute”), and then escalating care to a higher level if any of the scores assigned to individual vital signs, or the sum of all such scores, exceeds some threshold. There are several disadvantages of this existing standard of care, against which novel methods must be evaluated:

1. The scores assigned to each vital sign, and the thresholds against which the scores are compared, are mostly determined heuristically, according to clinical opinion. This leads to significant differences between EWS systems used in different hospitals, or even between wards in the same hospital [5]. An attempt has recently been made to define these quantities using a large evidence base of vital-sign data [6]. We argue that standardization of the approach, using an evidence-based procedure, can overcome the disadvantages of heuristic methods.
2. EWS systems are used only after routine observation of patient vital signs, which may be performed as infrequently as once every 4 h in some wards. Patients may deteriorate significantly between observations. It may be that automated systems can operate continuously; however, existing bed-side monitoring systems suffer from a false-alarm rate of up to 86% [7]. This motivates an integrated approach, based on machine learning methods, which can drive down the false-positive alarm rate to clinically useful levels, as we will describe later in this paper.
3. Many hospitals use EWS systems manually, where the clinician assigns scores using a lookup table, adds the scores, and then compares the resulting totals to predefined thresholds. There is a significant error-rate associated with this manual arithmetic [8], [9], [10], suggesting that automated methods can avoid such errors.
4. Scores in EWS systems are univariate (so that they may be used easily by the ward staff), and, therefore, do not take into account the covariance between vital signs. When comparing the sum of scores from individual vital signs to a threshold, an assumption is implicitly made that the vital signs are independent, which is unlikely to be true in practice. Automated methods, such as those proposed in Section III, can learn the dependence between vital signs, and thereby better assess patient physiology than basic EWS systems.
5. The evidence base for evaluation of EWS systems is poor, with most hospitals using scores that have been derived heuristically, rather than implementation being based on clinical data [5]. The clinical trial described in this paper aims to address this need for evidence.
6. EWS systems are population dependent, by necessity. Some systems exist for pediatric patients or other specific populations. Automated systems offer the possibility of patient-specific alerting, which is discussed later.
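The manual procedure described above (score each vital sign against univariate bands, sum the scores, then compare against thresholds) can be sketched as follows. This is a minimal illustration only: the bands in `HYPOTHETICAL_BANDS` and the two thresholds are invented for the example and are not the contents of Table I or Table II; only the rule "score 3 if heart rate exceeds 140 beats per minute" is taken from the text.

```python
# Hypothetical scoring bands, for illustration only (NOT Table I or Table II).
# Each entry is (low, high, score) and applies when low < value <= high.
INF = float("inf")
HYPOTHETICAL_BANDS = {
    "heart_rate": [(-INF, 40, 3), (40, 50, 1), (50, 100, 0),
                   (100, 120, 1), (120, 140, 2), (140, INF, 3)],
    "resp_rate": [(-INF, 8, 3), (8, 11, 1), (11, 20, 0),
                  (20, 24, 2), (24, INF, 3)],
    "spo2": [(-INF, 85, 3), (85, 90, 2), (90, 94, 1), (94, 100, 0)],
}


def score_vital(name, value):
    """Return the univariate score for one vital sign."""
    for low, high, score in HYPOTHETICAL_BANDS[name]:
        if low < value <= high:
            return score
    raise ValueError(f"{name}={value} is outside every band")


def ews(observation, single_threshold=2, total_threshold=3):
    """Score one manual observation and decide whether to escalate care.

    Care is escalated if any single score, or the sum of all scores,
    exceeds its (hypothetical) threshold.
    """
    scores = {name: score_vital(name, v) for name, v in observation.items()}
    total = sum(scores.values())
    escalate = max(scores.values()) > single_threshold or total > total_threshold
    return scores, total, escalate


if __name__ == "__main__":
    print(ews({"heart_rate": 145, "resp_rate": 16, "spo2": 97}))
    print(ews({"heart_rate": 110, "resp_rate": 23, "spo2": 92}))
```

Because each vital sign is scored in isolation and the scores are simply added, the sketch makes the independence assumption of item 4 explicit: no term in `ews` depends on more than one vital sign.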

Two EWS systems are considered by the work described in this paper: an older system in use in the ED of the John Radcliffe Hospital at the time of initial data collection, and the evidence-based method described in [6] that has since been implemented in the ED. These two EWS systems are shown in Tables I and II, respectively.

TABLE I EWS SYSTEM USED IN THE ED DURING INITIAL DATA ACQUISITION

TABLE II EVIDENCE-BASED EWS SYSTEM SUBSEQUENTLY ADOPTED IN THE ED

SECTION III

## INFRASTRUCTURE

This paper considers an integrated system that interfaces to a peer-to-peer network of bed-side monitors, hand-held PDAs, and wall-mounted touch screens, being introduced into the ED of the John Radcliffe Hospital as part of a 10 000-patient clinical trial, approved by the local medical ethics committee. Bed-side monitors acquire vital-sign data from patients in real time using ECG electrodes, pulse oximeters, and sphygmomanometers, at sampling intervals of approximately $T_s = 20$ s (and where BP is measured approximately every 30–60 min). A secure wireless network communicates vital-sign data to a central server, where algorithms process the data, identify periods of communication and sensor failure, and then analyze the data with respect to models of patient physiology (as described in Section IV).

As with the manual EWS systems described in Section II, the goal is to identify periods of patient deterioration, and so each patient is assigned a patient status index (PSI), which takes low values when patient physiology is stable, and which takes higher values when patient physiology is deemed to be indicative of patient deterioration. The PSI and other results of analysis are then displayed at the patient bedside, and can be further summarized on the wall-mounted touch screens (for providing an overview of the health status of all patients in a ward), and to clinicians’ hand-held PDAs.

It is important to note that fully automated patient monitoring systems cannot entirely replace the process of manual patient observation, because the ward staff are required to perform patient reviews; in many wards, nurses are encouraged to make physical contact with the patient to estimate pulse rate and to determine respiration rate. However, rather than relying on the traditional paper-based methods of recording and scoring patient physiology, our trial includes the facility for clinicians to input vital-sign data from manual patient reviews into their hand-held PDAs. These data are then automatically scored using an EWS (as shown in Table II), and transmitted to the central server for more detailed multivariate analysis combined with the continuous data acquired from bed-side monitors, as described in Section IV. This provides two of the benefits identified in Section II: continuous, automated assessment of patient condition, and a multivariate approach that does not treat the vital signs as if they were independent random variables.

During the initial phase of the trial, over 105 GB of patient data were acquired, totaling approximately 2170 h of patient data. The data completeness for each vital sign is shown in Table III. Such data loss is typically caused by disconnection of sensors from the patient, either by accident (such as a pulse oximeter falling from its place on the earlobe or finger) or by intention of the clinician.

TABLE III DATA QUANTITY AND DATA LOSS DURING INITIAL DATA ACQUISITION

We note that the inclusion of continuous temperature measurement into the electronic patient record, and its use in both manual and automated analyses, is problematic. Continuous temperature measurement using sensors is difficult; our trial initially attempted to collect data using a skin-mounted thermistor, to collect skin temperature as a proxy for the core temperature typically used by clinicians in the diagnosis of patients. However, the drop-out rate for this channel exceeded 75%, the standard deviation of the resulting temperature signal exceeded 2 °C, and the mean varied significantly between patients (sometimes exceeding a mean shift of 5 °C between patients). These poor signal statistics made identification of patient deterioration using continuous temperature data impossible, where changes of 1–2 °C can be significant indicators of physiological distress.

We therefore used manual measurements of temperature. While manual methods of tympanic measurement (measured using an in-ear probe) exhibit low variance, are close to core temperature, and are the “gold standard” for the existing standard of care, the resultant data are often inappropriate for use due to device configurations that can cause significant mean shifts that could make “healthy” patients seem “abnormal” in their temperature, and “unhealthy” patients seem “normal.” For example, a time series of tympanic measurements from devices at the Oxford University Hospitals NHS Trust is shown in Fig. 1(a). The measurement devices were known to have had settings switched, causing a small offset to be applied to the $N$ temperature data ${\bf X} = \{x_m\}$, $m = 1\ldots N$ at some point during the period shown in the figure. A Bayesian change-point detector [11] can be used to find a posterior distribution $p(m\vert {\bf X})$ over the indices $m$ of the data

$$\begin{aligned} p(m\vert {\bf X}) &= \int_0^\infty \int_{-\infty }^\infty \int_{-\infty }^\infty p(m, \mu_1, \mu_2, \sigma \vert {\bf X}) \, \hbox{d}\mu_1 \, \hbox{d}\mu_2 \, \hbox{d}\sigma && \hbox{(1)} \\ &\propto {1\over \sqrt{B}} \left[\sum_{i=1}^N x_i^2 - {1\over m} S_l^2 - {1\over N-m} S_r^2 \right]^{(1-{N\over 2})} && \hbox{(2)} \end{aligned}$$

where $S_l = \sum_{i=1}^m x_i$ and $S_r= \sum_{i=m+1}^N x_i$ are the sums of the data up to and after change-point $m$, respectively, and where constant $B = m(N-m)$. Here, we have assumed that the data up to and after the change-point have distributions $N(\mu_1,\sigma^2)$ and $N(\mu_2,\sigma^2)$, respectively, and we have used (1) to integrate out the nuisance parameters $\mu_1, \mu_2, \sigma$, leaving us with the posterior change-point distribution over $m$. Fig. 1(b) shows that a step-change occurring after 15 months has been detected.
We note that the formulation has constrained $p(m\vert {\bf X})$ to a single change-point, and hence short-term deviations, such as a temporary decrease at around 12–13 months, do not affect detection of the change-point. Accurate detection of regime changes, such as the simple reconfiguration of temperature probes, allows us to cope with such step-changes in input in a principled manner. We note here that the time series of temperatures was taken over a long period, and over many patients, such that the variability between individual patients can be assumed to be a constant effect over the whole time series, and thus not affect change-point detection of a mean shift.
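As an illustration of how (2) might be evaluated in practice, the following sketch computes the normalized posterior over $m = 1\ldots N-1$ in the log domain, which avoids underflow from the exponent $(1 - N/2)$ when $N$ is large. The function name `changepoint_posterior` and the synthetic temperature series are assumptions made for the example; they are not taken from the trial.

```python
import math


def changepoint_posterior(x):
    """Posterior p(m | X) of Eq. (2) for a single change in mean.

    Returns a list whose element k is the posterior for m = k + 1,
    i.e. the change occurs after the first m samples (m = 1 .. N-1).
    """
    n = len(x)
    total = sum(x)
    total_sq = sum(v * v for v in x)
    log_post = []
    s_l = 0.0
    for m in range(1, n):          # both segments must be non-empty
        s_l += x[m - 1]            # S_l: sum of the first m samples
        s_r = total - s_l          # S_r: sum of the remaining N - m samples
        b = m * (n - m)            # B = m (N - m)
        bracket = total_sq - s_l ** 2 / m - s_r ** 2 / (n - m)
        log_post.append(-0.5 * math.log(b) + (1.0 - n / 2.0) * math.log(bracket))
    peak = max(log_post)           # subtract the peak before exponentiating
    weights = [math.exp(lp - peak) for lp in log_post]
    norm = sum(weights)
    return [w / norm for w in weights]


if __name__ == "__main__":
    # Synthetic series: 60 readings around 37.0, then 40 shifted up by 0.4.
    x = [37.0 + 0.1 * math.sin(1.7 * i) for i in range(60)]
    x += [37.4 + 0.1 * math.sin(1.7 * i) for i in range(60, 100)]
    post = changepoint_posterior(x)
    m_hat = 1 + max(range(len(post)), key=post.__getitem__)
    print("most probable change-point: after sample", m_hat)
```

The bracketed term in (2) is the residual sum of squares about the two segment means, so the sketch requires some variation within each segment; a perfectly constant segment pair would make the logarithm undefined.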

Fig. 1. (a) Time series of tympanic temperature measurements, ${\bf X} = \{x_m\}$, $m=1\ldots N$. (b) Corresponding change-point distribution $p(m\vert {\bf X})$.

SECTION IV

## NOVELTY DETECTION

The aim of clinical practice, in its use of integrated EWS systems, can be framed in machine learning terms as a novelty detection task, in which deviations away from normality are identified. The EWS performs this task by assigning scores to vital signs as they depart from normality. However, novelty detection may also be performed using principled machine learning techniques, appropriate for use in the automatic analysis of large quantities of physiological data.

This paper takes a novelty detection approach, in which a model of “normal” patient physiology is constructed. Test data are then compared to this model, and deemed “abnormal” if they differ significantly from it, where abnormality is defined differently according to the various novelty detection approaches described later. Novelty detection is typically performed in preference to a multiclass approach to classification when there is an insufficient quantity of data to model abnormal states accurately. This may occur when the abnormal states are so numerous that each cannot be fully specified (as in the case of high-dimensional datasets acquired from complex systems, such as human patients), or when examples of failure are rare (as in the case of some physiological conditions).

This section describes several approaches to novelty detection, which will be compared for use with the data obtained at the central server described in Section III.

### A. Inference With Kernel Estimates

Previous work [12] has modeled the joint distribution $p({\bf x})$ of vital signs ${\bf x} \in \mathbb{R}^5$, for the vital signs shown previously. Each vital sign was standardized with respect to its own mean and variance, $x^{\prime } = (x-\mu)/\sigma$. The joint distribution of the (normalized) training data was estimated using a mixture of Gaussian distributions: a kernel estimate [13] with 500 components. The procedure used to estimate this distribution involved first summarizing (using the $k$-means clustering algorithm [13]) a set of approximately $2.3\times 10^6$ data, corresponding to over 3000 h of vital-sign data acquired from acute patients. This summarization process is necessary due to the large size of the dataset.
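The two preprocessing steps just described, per-channel standardization followed by $k$-means summarization, might look as follows. This is a sketch under stated assumptions: it uses plain Lloyd iterations with random initialization, and the function names, iteration count, and seed are illustrative choices rather than details reported in [12].

```python
import math
import random


def standardise(data):
    """Scale each column to zero mean and unit standard deviation."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in data) / n)
            for j in range(d)]
    scaled = [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in data]
    return scaled, means, stds


def k_means(data, k, iterations=50, seed=0):
    """Summarise `data` by k centres using Lloyd's algorithm."""
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(data, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in data:
            # Assign each point to the centre with the smallest squared distance.
            nearest = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centres[c])))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # leave a centre unchanged if its cluster is empty
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return centres


if __name__ == "__main__":
    scaled, means, stds = standardise([[60.0, 95.0], [80.0, 99.0], [100.0, 97.0]])
    print(means, stds)
    print(sorted(k_means([[0.0], [0.2], [10.0], [10.2]], k=2)))
```

The means and standard deviations returned by `standardise` would need to be retained so that test data can be scaled with the training statistics rather than their own.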

The likelihood $p({\bf x} \vert \bm \theta)$ of previously unseen test data ${\bf x}$ is then evaluated with respect to the kernel estimate (parameterized by $\bm \theta$) and used to generate a corresponding novelty score, $z({\bf x}) = -\log p({\bf x}\vert \bm \theta)$. This novelty score takes high values when the test data are “abnormal” with respect to the model of normality. Thus, the novelty score may be seen as a probabilistic version of the manual EWS, with the advantages that the vital signs are not treated independently (because the joint density of all vital signs is estimated), and that it is not heuristic.
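A minimal sketch of the novelty score $z({\bf x}) = -\log p({\bf x}\vert \bm \theta)$ is given below, assuming that the kernel estimate consists of $K$ equally weighted, normalized isotropic Gaussian kernels of common width $\sigma$ centered on the summarized data. The function name and the example centers are hypothetical; the log-sum-exp step is an implementation choice to keep the score finite far from the training data.

```python
import math


def novelty_score(x, centres, sigma):
    """Return z(x) = -log p(x) under an equal-weight Gaussian kernel estimate."""
    k = len(centres)
    d = len(x)
    log_norm = -0.5 * d * math.log(2.0 * math.pi * sigma ** 2)
    log_terms = []
    for y in centres:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
        # log of (1/K) * N(x; y, sigma^2 I)
        log_terms.append(-math.log(k) + log_norm - sq_dist / (2.0 * sigma ** 2))
    peak = max(log_terms)
    log_p = peak + math.log(sum(math.exp(t - peak) for t in log_terms))
    return -log_p


if __name__ == "__main__":
    centres = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # hypothetical kernel means
    print(novelty_score([0.3, 0.3], centres, sigma=0.5))  # close to the model
    print(novelty_score([6.0, 6.0], centres, sigma=0.5))  # far from the model
```

Because all channels enter through the joint distance $\Vert {\bf x} - {\bf y}_i \Vert$, a combination of vital signs can score highly even when no single channel is extreme.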

A threshold $\kappa$ is defined on $z$ such that test data ${\bf x}$ are deemed “abnormal” with respect to the joint pdf if $z({\bf x}) > \kappa$. Communication failures, network and sensor noise, and other transients can cause temporary, artifactual spikes in continuous novelty scores. To avoid false-positive alerts caused by these transients, our method only communicates a novelty alert to the clinician (thus calling them back to the patient’s bedside) when $z({\bf x}) > \kappa$ for 4 minutes out of any 5-minute window of data. The value of $\kappa$ was similarly selected using cross-validation with an independent validation set, selected from over 18 000 h of vital-sign data acquired from acute patients [12].
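The persistence rule can be sketched as a sliding-window count. The sketch assumes regularly spaced scores at the 20 s sampling interval of Section III, so that a 5-minute window holds 15 samples and "4 minutes" is read as at least 12 of them exceeding $\kappa$; this sample-count interpretation, and the absence of any handling for missing samples, are simplifications made for the example.

```python
from collections import deque


def persistent_alerts(scores, kappa, window=15, required=12):
    """Return one boolean per score: True when an alert would be raised.

    An alert is raised when at least `required` of the most recent
    `window` novelty scores exceed the threshold `kappa`.
    """
    recent = deque(maxlen=window)   # exceedance flags for the last `window` samples
    alerts = []
    for z in scores:
        recent.append(z > kappa)
        alerts.append(sum(recent) >= required)
    return alerts


if __name__ == "__main__":
    transient = [0.0] * 20 + [5.0] * 3 + [0.0] * 20   # one-minute artifact
    sustained = [0.0] * 5 + [5.0] * 12                # four minutes above kappa
    print(any(persistent_alerts(transient, kappa=1.0)))
    print(persistent_alerts(sustained, kappa=1.0).index(True))
```

A short artifactual spike therefore never reaches the clinician, at the cost of a delay of up to four minutes before a genuine, sustained abnormality is reported.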

### B. Modified Kernels

The summarization process described previously, in which an integrated dataset of physiological data is reduced to a smaller set of distribution means $\{{\bf y}_i\}$, $i=1\ldots K$, can adversely affect the quality of the resulting estimate of the joint distribution, $p({\bf x}\vert \bm \theta)$. Fig. 2 illustrates this disadvantage with an exemplar set of univariate data (shown as crosses on the $x$-axes of the plots in the figure). The required distribution $p(x)$ is shown in the upper plot; however, summarizing the dataset using a smaller number of distribution centers $\{y_i\}$, and then forming a kernel estimate of the distribution of those centers $p(y)$, is a poor approximation of $p(x)$. This is shown in the lower plot in the figure, where the $k$-means algorithm has placed $k=2$ distribution centers that describe each of the two clusters of data, even though one has a significantly smaller mode than the other in the desired distribution $p(x)$. This problem is caused by the $k$-means algorithm, which minimizes squared distances between the dataset $\{x_i\}$, $i=1\ldots N$ and the centers $\{y_j\}$, $j=1\ldots K$, and by the kernel estimate from Section IV-A which assigns equal prior probabilities $\pi_j = K^{-1}$ to each of the kernels.

Fig. 2. Dataset $\{x\}$ and its distribution $y = p(x)$ are shown in the upper plot. $K=2$ kernel means $\{y\}$ and the resulting kernel estimate $p(y)$ are shown in the lower plot.

The appropriate weighting of the kernels in the density estimate may be obtained using a modified kernel estimate [14]

$$p({\bf x}\vert {\bf y}_i)=\sum_{i=1}^{K}{\pi_i\over \sigma } \Phi_i\left({\Vert {\bf x}-{\bf y}_{i}\Vert \over \sigma } \right) \qquad \hbox{(3)}$$

which is the kernel estimate from Section IV-A to which priors $\pi_i$ have been added, and where $\sum_{i=1}^K\pi_i = 1$. In the above, the $i$th kernel $\Phi_i$ is a Gaussian distribution, characterized by mean ${\bf y}_i$ and isotropic covariance $\Sigma$ with diagonal elements $\sigma$. The priors are determined from the proportion of data ${\bf x}_i$ that fall within the $i$th cluster, $\pi_i = N^{-1} \sum_{j\in {\bf y}_i}\mathbb{1}$, where $\mathbb{1}$ is the indicator function.
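The effect of the priors can be illustrated with a univariate example in the spirit of Fig. 2: a large cluster and a small cluster, each summarized by one kernel. The sketch below assumes normalized isotropic Gaussian kernels (one reading of $\Phi_i$ in (3)) and assigns each datum to its nearest center; the data, centers, and function names are invented for the illustration.

```python
import math


def cluster_priors(data, centres):
    """pi_i = fraction of the data whose nearest centre is y_i."""
    counts = [0] * len(centres)
    for x in data:
        nearest = min(range(len(centres)), key=lambda i: sum(
            (a - b) ** 2 for a, b in zip(x, centres[i])))
        counts[nearest] += 1
    return [c / len(data) for c in counts]


def weighted_density(x, centres, priors, sigma):
    """Prior-weighted sum of normalised isotropic Gaussian kernels."""
    d = len(x)
    norm = (2.0 * math.pi * sigma ** 2) ** (-d / 2.0)
    return sum(
        pi * norm * math.exp(-sum((a - b) ** 2 for a, b in zip(x, y))
                             / (2.0 * sigma ** 2))
        for pi, y in zip(priors, centres))


if __name__ == "__main__":
    data = [[0.0]] * 90 + [[5.0]] * 10        # one large and one small cluster
    centres = [[0.0], [5.0]]
    equal = [0.5, 0.5]                        # unweighted estimate
    weighted = cluster_priors(data, centres)  # population-based priors
    for name, priors in (("equal", equal), ("weighted", weighted)):
        ratio = (weighted_density([0.0], centres, priors, 0.5)
                 / weighted_density([5.0], centres, priors, 0.5))
        print(name, "mode ratio:", round(ratio, 3))
```

With equal priors the two modes have the same height, which is the failure shown in the lower plot of Fig. 2; with population-based priors the ratio of the modes follows the ratio of the cluster sizes.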

Fig. 3 shows 2-D visualizations of the results of applying the modified kernel estimate to the 4-D training set of 3000 h of patient data described previously, where the visualization has been performed using the SASS [15] projection, which attempts to preserve distances between data before and after projection. Noting that the training set comprises both “normal” and “abnormal” patients, we follow [12] in discarding those 20% of the resulting kernels (i.e., 100 of the $K=500$ kernels) deemed to be “outlying.” We will consider two definitions of “outlying”: one where the 100 kernels with the lowest associated priors $\pi_i$ have been discarded (and so which have the lowest population), and one where the 100 kernels furthest from the centroid of the data have been discarded. These are shown in the left and right plots, respectively, in Fig. 3. We will refer to these modified kernel methods as $K_{w1}$ and $K_{w2}$, respectively, and to the original (unmodified) kernel estimate from Section IV-A as $K_0$.
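The two definitions of "outlying" can be sketched as two ranking rules applied to the same set of kernels. Two details below are assumptions rather than statements from the text: the priors of the retained kernels are renormalized to sum to one, and the centroid of the data is computed as the prior-weighted mean of the kernel means (which equals the data mean when each ${\bf y}_i$ is the mean of its cluster). The function names mirror $K_{w1}$ and $K_{w2}$ for readability only.

```python
def _discard(centres, priors, ranking, fraction):
    """Drop the `fraction` of kernels with the largest `ranking` values."""
    n_drop = int(round(fraction * len(centres)))
    order = sorted(range(len(centres)), key=lambda i: ranking[i])
    keep = sorted(order[:len(centres) - n_drop])
    kept_priors = [priors[i] for i in keep]
    total = sum(kept_priors)  # renormalise (an assumption of this sketch)
    return [centres[i] for i in keep], [p / total for p in kept_priors]


def k_w1(centres, priors, fraction=0.2):
    """Discard the kernels with the lowest priors (smallest populations)."""
    return _discard(centres, priors, [-p for p in priors], fraction)


def k_w2(centres, priors, fraction=0.2):
    """Discard the kernels furthest from the centroid of the data."""
    d = len(centres[0])
    centroid = [sum(p * y[j] for p, y in zip(priors, centres)) for j in range(d)]
    dist = [sum((a - b) ** 2 for a, b in zip(y, centroid)) for y in centres]
    return _discard(centres, priors, dist, fraction)


if __name__ == "__main__":
    centres = [[0.0], [1.0], [2.0], [3.0], [10.0]]   # hypothetical kernel means
    priors = [0.3, 0.3, 0.2, 0.05, 0.15]
    print(k_w1(centres, priors)[0])   # drops the least populated kernel
    print(k_w2(centres, priors)[0])   # drops the most distant kernel
```

The example shows that the two rules need not agree: a well-populated kernel far from the centroid survives under the first rule but not under the second.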

Fig. 3. Two-dimensional visualizations of 4-D kernel means $\{{\bf y}_i\}$. Those kernel means with the 100 smallest populations and 100 greatest distances to the centroid of the data are shown by × in the left and right plots, respectively.

### C. One-Class Support Vector Machines

We also consider the use of a one-class support vector machine (SVM), trained using the same data as that from which the kernel estimate described previously was obtained. We used the method proposed by Schölkopf et al. [16], in which the objective function is defined by separating the training data from the origin in the feature space defined by the SVM kernel, where we have used the squared-exponential kernel.

The degree to which the SVM objective function is penalized by misclassifications (and thus the curvature of the decision boundary) is controlled by the “C” parameter in the conventional nomenclature, the value of which, along with the width parameter $\sigma$ shared by all of the isotropic kernels in the model, was selected using tenfold cross validation.

A quantity $N$ of $d$-dimensional data $\{{\bf x}_1, \dots, {\bf x}_N\} \in \mathbb{R}^d$ are mapped into a (potentially infinite-dimensional) feature space $\mathbb{F}$ by some nonlinear transformation $\phi$: $\mathbb{R}^d \rightarrow \mathbb{F}$. A kernel function $\Phi$ provides the dot product between pairs of transformed data in $\mathbb{F}$, such that $\Phi ({\bf x}_i,{\bf x}_j) = \phi ({\bf x}_i) \cdot \phi ({\bf x}_j)$. A Gaussian kernel allows a point to be separated from the origin in $\mathbb{F}$ [16], hence is chosen for use in the work described by this paper: $\Phi ({\bf x}_i,{\bf x}_j) = \exp {(-\Vert {\bf x}_i-{\bf x}_j \Vert^2 / 2\sigma^2)}$.

The decision boundary between “normal” and “abnormal” subspaces in $\mathbb{F}$ is $z({\bf x}) = w_o \cdot \phi ({\bf x}) - \rho_o$, with parameters

$$\begin{aligned} w_o &= \sum_{i=1}^{N_s} \alpha_i \phi ({\bf s}_i) && \hbox{(4)} \\ \rho_o &= {1\over N_s} \sum_{j=1}^{N_s} \sum_{i=1}^{N_s} \alpha_i \Phi ({\bf s}_i, {\bf s}_j) && \hbox{(5)} \end{aligned}$$

where ${\bf s}_i$ are the support vectors, of which there are $N_s$, and where $\Phi$ is the Gaussian kernel. Here, $w_o \in \mathbb{F}$, $\rho_o \in \mathbb{R}$, and the $\alpha_i$ are Lagrangian multipliers used to solve the dual formulation, more details of which may be found in [16] and which are not reproduced here. Test data ${\bf x}$ arriving in the electronic patient record are classified as being “abnormal” if $z({\bf x}) > 0$, and “normal” otherwise. In order to allow a fair comparison with the probabilistic methods $K_0$, $K_{w1}$, and $K_{w2}$ described in Sections IV-A and IV-B, an alert was generated if test data were classified “abnormal” for 4 minutes in any 5-minute window of test data, in order to avoid false-positive alerts due to transients arising from sensor artifact.

SECTION V

## EVALUATING MONITORING SYSTEMS

The evaluation of automated patient monitoring systems is not straightforward and remains an area of contention. This section describes the challenges involved, and proposes an evaluation strategy appropriate for the complex evaluation of an automated system running in real time.

### A. Clinical Labels

In order to evaluate system performance, we would ideally have accurate labels of “normal” and “abnormal” episodes of data. The “gold standard” in classification problems is often a set of labels provided by domain experts—in the case described by this paper, such experts are clinical specialists. However, exhaustive labeling is typically not possible in practice due to the size of the datasets and the difficulty in determining patient abnormality from inspection of the vital signs alone. Furthermore, intra- and interexpert variability is such that the subjective nature of the labeling process becomes significant. This is a fundamental obstacle to the evaluation of automated systems, which are based on the communication and analysis of very large quantities of data.

Therefore, a one-sided approach is sometimes taken to evaluating large-scale systems, in which clinical experts are asked to review data in the electronic patient record corresponding to periods of suspected patient abnormality. These could be taken from, for example, “hard outcomes” such as death, unforeseen admission to intensive care, etc. However, the number of patients with these outcomes is typically small unless data are acquired from large numbers of patients, which may take many years. Therefore, this study takes the approach in which clinical escalations were taken as being possible indications of patient abnormality. These escalations are events that took place during the patient’s stay, where the patient’s clinical notes indicate that care was escalated to a high level due to some perceived abnormality.

There are many reasons for which a patient’s care may be escalated in practice, only some of which are likely to correspond to derangement of the vital signs, and which could, therefore, be expected to be identified by an automatic method. Two experts in emergency medicine independently reviewed patient records (but not the continuous vital-sign data acquired from sensors and bed-side monitors) and identified those periods that corresponded to escalations that should be expected to have a corresponding change in the vital signs. Any differences in opinion between the two experts were resolved by the independent assessment of a third clinical expert. This labeling of data in the electronic patient record by experts is extremely time consuming, and is a primary reason why such large clinical trials are not undertaken for the principled evaluation of automated monitoring systems.

### B. Evaluation Methodology

The receiver operating characteristic (ROC) curve originated in communications theory, and has since become the primary means of evaluating the decision outputs of medical devices. However, the evaluation of performance with respect to events occurring within time-series data is not straightforward.

#### 1) Independence

The classical method for evaluating classifier performance is to construct a confusion matrix, which quantifies the number of true and false, positive and negative classifications (TP, FP, TN, and FN) made by the classifier, with respect to some “ideal” classification. The sensitivity and specificity of the classifier may then be plotted as a function of some variable of its operation (typically a parameter that controls its decision threshold), to give the ROC curve. Indeed, some EWS systems [18] were evaluated by maximizing the area under the ROC curve (AUROC), which corresponds to maximizing the accuracy of the classifier.
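For reference, the classical sample-based construction can be sketched as follows: each threshold yields one confusion matrix, and hence one (sensitivity, specificity) point on the ROC curve. The scores and labels in the example are invented, and the function name is hypothetical.

```python
def roc_points(scores, labels, thresholds):
    """Return (threshold, sensitivity, specificity) for each threshold.

    `labels` holds 1 for "abnormal" and 0 for "normal"; a sample is
    classified "abnormal" when its score exceeds the threshold.
    """
    points = []
    for kappa in thresholds:
        tp = sum(1 for s, l in zip(scores, labels) if l == 1 and s > kappa)
        fn = sum(1 for s, l in zip(scores, labels) if l == 1 and s <= kappa)
        fp = sum(1 for s, l in zip(scores, labels) if l == 0 and s > kappa)
        tn = sum(1 for s, l in zip(scores, labels) if l == 0 and s <= kappa)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        points.append((kappa, sensitivity, specificity))
    return points


if __name__ == "__main__":
    scores = [0.1, 0.4, 0.35, 0.8]   # invented novelty scores
    labels = [0, 0, 1, 1]            # invented "ideal" classification
    for point in roc_points(scores, labels, thresholds=[0.3, 0.5]):
        print(point)
```

Every sample contributes one count to the confusion matrix, which is precisely why the construction presumes that the samples are independent.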

Such approaches can be appropriate when evaluating the classification of independent entities, such as mammograms or blood samples taken from different patients, but are problematic when used to analyze time-series data, such as those considered in this study within the electronic patient record, which are not independent. The results could be biased, for example, by a small number of “abnormal” patients with long hospital stays (which often occurs, because length-of-stay is correlated with physiological abnormality); these patients contribute a large proportion of data to the set of “abnormal events,” but those events are largely dependent. Thus, the performance of the classifier would be skewed toward how well it performed for this small subset of patients.

We suggest that there is no “right” answer to the problem of how to evaluate a time-series classifier, and that, ultimately, it is probably inappropriate to reduce the performance of a system down to a single metric (e.g., accuracy/AUROC).

#### 2) Patient-Based Analysis

If we wish to use ROC-based performance metrics in an evaluation (without reducing them to a single statistic), we must select a basic unit of analysis other than individual samples of vital-sign data within the electronic patient record. Violation of the assumption of independence between basic units can be avoided by performing the analysis on a per-patient basis. In this study, we adopt the following convention:

“Event” patients: This group comprises all patients with one or more “events,” that is, those with escalations related to their vital signs that occurred during their stay in the ED. Section VI will provide results from phase I of our study, which comprised 476 patients, 52% of whom were male. The mean age was 61 years (range 18–108, IQR 43–79). There were 34 escalations in this population, indicating the scarcity of labeled event data.

“Normal” patients: This group comprises all patients who had no clinical escalation of any kind, and corresponds to 217 (46%) of the 476 patients in phase I of our study.

We note that there is a set of patients that belongs to neither of the two sets described previously, being those who had no escalations due to their vital signs, but who were also not “definitely normal”; for example, they may have had escalations for reasons other than those related to vital signs, and which an automated monitoring system could, therefore, not be expected to detect. Similarly, some patients had no vital-sign data transmitted by our system, and could, therefore, not be used in either of the sets described previously.

We define a TP classification to be an “event patient” for which the first event was successfully detected; conversely, we define an FN classification to be an “event patient” for which the first event was not successfully detected.

A confounding factor in the evaluation of monitoring systems is determining which machine alerts count as “early warning” of a forthcoming event, and which alerts are merely false positives. This definition must depend on the dynamics of the system being monitored; in the assessment of patient vital signs, for example, if an alert was generated by a machine within 1 h of a clinical escalation, we might deem that the alert was predictive of the escalation; however, if the alert occurred 10 h before the escalation, we might deem it to be a false alert. Clinicians have no standard definition of “early warning,” and so we define an event to have been successfully detected if an alert occurred within some time $\tau$ prior to that event. We will consider the performance of each system as $\tau$ is varied in the interval $\tau \in [0, 60]$ min, representing the range of times that could constitute “early warning” of an event in the context of patient vital-signs monitoring. We argue that this approach allows the comparison of various integrated systems without dependence on any single choice of this window length, $\tau$.

We define a TN classification to be a “normal patient” for which there were no alerts generated; conversely, we define an FP classification to be a “normal patient” for which one or more alerts were generated; i.e., the “normal” set of patients should have had no alerts generated for them, because they were deemed to be physiologically stable throughout their connection to the monitoring system.
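The patient-based definitions of TP, FN, TN, and FP can be collected into a short sketch. Times are in minutes from the start of monitoring; the data structures, function names, and the choice to treat the window as closed at both ends (so that an alert coinciding with the event counts when $\tau = 0$) are assumptions made for the example.

```python
def event_detected(alert_times, event_time, tau):
    """True if any alert falls within `tau` minutes before the event."""
    return any(event_time - tau <= t <= event_time for t in alert_times)


def patient_based_counts(event_patients, normal_patients, tau):
    """Return (TP, FN, TN, FP) with one count per patient.

    event_patients:  list of (alert_times, first_event_time) pairs.
    normal_patients: list of alert_times lists, one per patient.
    """
    tp = sum(1 for alerts, event in event_patients
             if event_detected(alerts, event, tau))
    fn = len(event_patients) - tp
    fp = sum(1 for alerts in normal_patients if alerts)  # any alert at all
    tn = len(normal_patients) - fp
    return tp, fn, tn, fp


if __name__ == "__main__":
    # Invented example: three "event" patients and three "normal" patients.
    event_patients = [([100.0], 130.0), ([10.0], 300.0), ([], 50.0)]
    normal_patients = [[], [5.0], []]
    for tau in (0, 20, 40, 60):
        print(tau, patient_based_counts(event_patients, normal_patients, tau))
```

Only the TP and FN counts change with $\tau$; the FP and TN counts depend solely on whether a "normal" patient generated any alert, which is why the FP results can be reported separately from the TP curves.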

SECTION VI

## RESULTS

Fig. 4 shows the TP and FP results for the various analysis methods described previously, when evaluated using the large-scale methodology described in Section V. We compare the kernel estimate $K_0$ from Section IV-A, the two modified kernel estimates $K_{w1}$ and $K_{w2}$ from Section IV-B, the SVM from Section IV-C, the heuristic EWS system that was used in the hospital at the time of the study (termed $\hbox{EWS}_a$), and the “evidence-based” EWS system proposed in [6] and since adopted throughout the Oxford University Hospitals NHS Trust (termed $\hbox{EWS}_b$).

Fig. 4. TP numbers for manual and automatic patient systems, with a maximum TP = 29 event patients: (a) automatic methods $\hbox{SVM}$, $K_{w2}$, $K_{w1}$, and $K_0$ are shown from uppermost to lowermost, respectively (i.e., in green, blue, red, and black, respectively); and (b) $\hbox{SVM}$, and manual methods $\hbox{EWS}_{a2}$, $\hbox{EWS}_{b2}$, $\hbox{EWS}_{a1}$, and $\hbox{EWS}_{b1}$ are shown from uppermost to lowermost, respectively (i.e., by green, blue solid, red solid, blue dashed, and red dashed lines, respectively).

$\hbox{EWS}_a$ and $\hbox{EWS}_b$, which would be applied manually in practice, were evaluated by applying them to the continuous data at intervals of 30 min and 2 h. We refer to the 2-h EWS systems as $\hbox{EWS}_{a1}$ and $\hbox{EWS}_{b1}$ for the old and new EWS systems, respectively. Similarly, we refer to the 30-min EWS systems as $\hbox{EWS}_{a2}$ and $\hbox{EWS}_{b2}$ for the old and new EWS systems, respectively.

The SVM and kernel-based methods were both trained using data obtained from a previous clinical study, as described in Section IV. These training data were acquired from patients deemed representative of those encountered in the ED; studies indicate that “stable” patient physiology across high-dependency departments in the U.S. and U.K. is generally similar [6].

Examination of the results shown in Fig. 4 demonstrates that the SVM is the superior classifier in terms of identifying patient deterioration. The remaining automatic methods are, in order of decreasing sensitivity to patient deterioration, $K_{w2}$, $K_{w1}$, and $K_0$. The discriminative power of the SVM is closely matched by the modified kernel estimate $K_{w2}$, in which those kernels furthest from the centroid of the training data are discarded. Of the automated methods, the basic kernel estimate is the least sensitive to patient deterioration in the results from the phase I data obtained from the clinical trial [19], [20].

The figure also shows a comparison of the manual methods, where it may be seen that the EWS used during the period of data acquisition applied at 30-min intervals ($\hbox{EWS}_{a2}$) is more sensitive to patient deterioration than the new EWS proposed in [6] applied at 30-min intervals ($\hbox{EWS}_{b2}$). Unsurprisingly, when these two methods are applied less frequently, at 2-h intervals ($\hbox{EWS}_{a1}$ and $\hbox{EWS}_{b1}$), the sensitivity for both EWS systems decreases significantly. While these EWS systems can reach the sensitivity of automated methods, they must be performed at unrealistically frequent intervals, which could not be supported by the typical resources available in emergency care.

Table IV shows the number of FPs for each method, obtained from examining their performance with the set of entirely “normal” patients, as described in Section V, for whom we would not expect physiological deterioration to be detected. The table shows that, while manual EWS methods can approach the sensitivity of integrated automatic methods (if performed sufficiently frequently), they generate large numbers of FP alerts. The results also demonstrate that the evidence-based EWS system $\hbox{EWS}_b$ of [6] has a lower FP rate than all other methods considered; this is a particular advantage for an EWS, which probably has to err on the side of being insensitive, because every alert generated by the system greatly increases the workload of the ward staff. We note that these values of FP are significantly lower than the false-positive rates from conventional bedside monitors [7], making them attractive for use in a busy clinical environment, such as an ED.

TABLE IV SUMMARY OF FALSE POSITIVES FOR ALL 217 “NORMAL” PATIENTS

A case study of a patient who deteriorated during phase I of the clinical trial is shown in Fig. 5, along with corresponding outputs from the various automatic methods under consideration. We note that this patient had no temperature data available. An escalation of care occurred at the end of the period shown, following periods of erratic SpO2 (including many desaturations to low oxygen levels), a shift in baseline HR of approximately 20 bpm, and transient hypertension (where the systolic-diastolic average BP exceeds 130 mmHg in this example). Note that the RR signal drops out in the first half of the interval shown, due to poor signal quality obtained from the measurement probes. The outputs of the automated methods demonstrate that this deterioration was detected sufficiently early to be useful to clinicians.

Fig. 5. Vital signs (SpO2, HR, BP, RR from upper to lower, first plot, in blue, green, red, and black, respectively) acquired via e-health system for an example ED patient, who deteriorated and for whom care was escalated at the time shown by the vertical red dashed line. The plots underneath show the outputs of the automatic methods, where shaded regions indicate intervals during which an alert was raised.

SECTION VII

## CONCLUSION

This paper has described a large clinical trial, undertaken in order to 1) evaluate automatic methods for assessment of patients, based on an electronic patient record augmented with machine learning algorithms and 2) address the perceived lack of evidence in e-health research, which has been suggested as one of the primary reasons that such methods have not yet been adopted at scale. The trial that we have described is the first of its kind, and brings with it many challenges that we have had to address, and which have been described in this paper.

Integrated, automatic systems require some sort of supervised training in order to perform useful analyses, and so that resulting systems can be evaluated. While semisupervised techniques have been demonstrated to be effective in some areas [21], this is still an active research area within e-health, and remains future work. Therefore, there is a dependence on the labeling of patient data such that periods of instability can be learned. This requires teams of expert clinicians to review data from the electronic patient record, which is extremely time consuming and costly. We have described methods of reducing this workload, by requiring clinical labeling of those events that are deemed to have been “escalations” in practice; while this is not an exhaustive labeling of the data (which is impractical, for a 105-GB dataset, exceeding 2170 h of data), it has allowed us to train and evaluate machine learning techniques within the integrated healthcare system deployed in the ED.

We have framed the existing standard of care (EWS systems) as a novelty detection task, in which periods of “abnormal” physiology are scored; this allows a direct comparison with our integrated, automatic methods. Such approaches are, by nature, population-based; patient-specific algorithms that learn from individual test patients in real time remain an area of ongoing research. However, the population-based approach is already clinically accepted, because the EWS systems in widespread use in hospitals adopt such methods. Hence, our proposed system may be seen as a continuous, automatic version of the existing clinical standard of care, with the advantage that vital signs can be treated in a multivariate sense, using models that take into account dependence between vital signs; this is a considerable advantage over the existing standard of care, in which vital signs are treated as being independent.

Future work will involve improving the algorithms used to perform data communication and analysis, and further building the body of evidence required before integrated patient monitoring systems can be adopted at scale. Related clinical trials in the physiological monitoring of other cohorts of patients are underway between the Institute of Biomedical Engineering, Oxford and the Oxford University Hospitals NHS Trust; these trials will adopt the methodologies developed within this paper for assessment and evaluation of evidence.

## Footnotes

The work of D. A. Clifton was supported by the Centre of Excellence in Personalized Healthcare funded by the Wellcome Trust and EPSRC under Grant WT 088877/Z/09/Z. The work of D. Wong and L. Clifton was supported by the NIHR (National Institute for Health Research) Biomedical Research Centre Programme, Oxford.

D. A. Clifton, D. Wong, L. Clifton, and L. Tarassenko are with the Department of Engineering Science, Institute of Biomedical Engineering, University of Oxford, Oxford, OX3 7DQ, U.K. (e-mail: david.clifton@eng.ox.ac.uk; wong@robots.ox.ac.uk; lei.clifton@eng.ox.ac.uk; lionel@robots.ox.ac.uk).

S. Wilson, R. Way, and R. Pullinger are with the Oxford University Hospitals NHS Trust, Oxford, OX3 9DU, U.K. (e-mail: sarah.wilson@ouh.nhs.uk; rob.way@ouh.nhs.uk; rick.pullinger@ouh.nhs.uk).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

1John Radcliffe Hospital, Oxford University Hospitals NHS Trust, Oxford, U.K.

2Heart rate (HR) measured in beats per minute, respiration rate (RR) measured in breaths per minute, peripheral blood-oxygen saturation (SpO2) measured as a percentage, systolic blood pressure (BP) measured in mmHg, and body temperature.

3These are vital signs where the systolic blood pressure was replaced by the mean of systolic and diastolic blood pressure (the systolic–diastolic average, or SDA) in order to take into account changes in both blood pressure measurements that are typically acquired, rather than just the systolic.

4We note in passing that this is statistically equivalent to generating a novelty alert if $r_{20} > \kappa$, where $r_n$ is the $n$th order statistic, taken from the order statistics of a window of scores $z({\bf x})$ of test data ${\bf x}$.

5This method typically performs similarly to the other popular one-class SVM formulation, the support vector data description, as proposed by Tax and Duin [17].

6Accuracy is defined to be (TP + TN)/(TP + TN + FP + FN).
