PatientFlowNet: A Deep Learning Approach to Patient Flow Prediction in Emergency Departments

Emergency Department (ED) crowding is a major public health challenge, as it can seriously impact patient outcomes; accurate prediction of patient flow in EDs is therefore essential for improving operational efficiency and quality of care. We present a deep learning framework to predict patient flow rates in EDs, namely the rates of arrival, treatment, and discharge for patients across all triage levels. Our model detects short-term and long-term temporal dependencies within the time series of a given patient-flow variable, as well as dependencies between the time series of different patient-flow variables. We implement this framework as a convolutional neural network, which we call PatientFlowNet. Our proposed model learns simultaneously from multiple flow variables over a long temporal window and predicts future values of arrival, treatment, and discharge rates in the ED. We benchmark our model against state-of-the-art methods on data from EDs in three different hospitals. Results show that PatientFlowNet achieves superior prediction accuracy compared to the baseline methods, yielding a mean absolute error that is 4.8% lower than the leading baseline. Furthermore, we provide a visual and interpretable representation of the dependencies learned by our model between patient-flow variables in EDs.


I. INTRODUCTION
Emergency Department (ED) crowding is a public health challenge [1], [2]. It is caused by external factors, such as fluctuations in patient arrivals, and by internal factors, such as lack of available beds or unexpected human delays [3]. This problem is also exacerbated by unexpected external shocks such as a global pandemic [4], [5] due to unusually high patient flow [6], [7]. In such circumstances, the need to rapidly modify admission procedures [8], [9] demands accurate forecasts of patients' arrival rates and their movements through the system until their discharge.
A large body of research has been dedicated to prediction of ED crowding [10]–[12] and to mitigation of its effects [3], [13]. Sources of ED crowding include unexpectedly large volumes of patients arriving at the ED, delays in triaging patients and starting their treatment, and impediments in discharging patients whose treatment has been completed. Accurate prediction of arrival, treatment, and discharge rates helps understand when and where such crowding occurs, leading to efficient allocation of resources or development of cost-effective interventions to streamline the processes.
(The associate editor coordinating the review of this manuscript and approving it for publication was Hualong Yu.)
Accurate estimation of flow rates can also help predict other ED workflow variables, such as time-to-treatment and length-of-stay [14], when the ED is modeled as a queueing system. In [15], flow variables that mimic arrival and treatment rates are combined with machine learning to estimate time-to-treatment for each patient, and in [16], [17], a queueing-theoretic version of that approach is used to estimate the length-of-stay for each patient. Approaches that model the ED as a queueing system are hindered by unexplained queue preemption and delays (see Section III). Thus, a more aggregated approach that considers per-interval patient-flow rates instead of per-patient timelines is more desirable.
Most studies on prediction of patient-flow rates either use time-series methods such as variations of autoregressive integrated moving average (ARIMA) [18]–[21], or encode some seasonality information into parameters (such as day of the week or a moving-window average) and perform a regression [22], [23]. Such models either have a rigid structure or require manual feature engineering. ARIMA's use of a small number of difference terms means that only short-term dependencies are considered, while the use of manually encoded parameters leads to loss of information before regression is applied. Such deficiencies suggest that a more general model that does not require feature engineering (such as a deep learning model) may be more suitable.
Deep learning models are more general than classical machine learning models and can better capture dependencies when data is not directly amenable to feature engineering. In fact, recurrent neural networks (RNNs) have been used to predict patient-flow rates with mixed degrees of success [24]–[26]. The models used in these papers are vanilla versions of long short-term memory (LSTM) networks, and they outperformed baseline time-series methods because they use sequence input data and can learn from a large temporal window. Nevertheless, they suffer from two main shortcomings: (1) slow training and (2) lack of cross-learning from multiple time series. The latter is significant, as a patient-flow variable in the ED depends on other factors besides its own history (see Section III).
Convolutional models that use dilations, such as WaveNet [27], offer a viable solution to problem (1), as they can be trained faster than RNNs. In addition, the exponentially large receptive fields of WaveNets enable them to learn long-term dependencies. The size of such models grows logarithmically with the dilation size, i.e., they can capture long-term dependencies without becoming excessively large. In [28], convolutional models are shown to outperform RNNs on a variety of sequence modeling tasks. Variations of WaveNet have been used in other time-series tasks: in [29], parametrized skip connections are added to WaveNet to extract dependencies between financial tickers; in [30], pre- and post-processing tools are combined with WaveNet to estimate blood glucose levels; and in [31], interleaved dilated temporal and spatial convolutions are proposed to classify multiple time-series variables.
However, while WaveNet is very effective in single-variable sequence-prediction tasks such as audio synthesis, it only uses historical values of a single variable to predict its future values, leaving shortcoming (2) of RNNs unaddressed. In this paper, we address this by proposing a deep learning model that learns simultaneously in time (from the history of a flow variable) and in space (by extracting dependencies between different flow variables). We implement this model as a convolutional neural network, which we call PatientFlowNet. The model relies on stacks of flow convolutions, i.e., 2-dimensional convolutional filters that are exponentially dilated in time. This enables our model to learn from multiple patient-flow variables over a large temporal window. Furthermore, we will show that the convolutional filters in our design allow for interpretability by visualizing the dependencies between patient-flow variables in EDs.
Using ED data from electronic medical records of three different hospitals, we will show that PatientFlowNet outperforms the state-of-the-art supervised learning methods in one-step-ahead prediction of patient-flow rates. We also use our model to extract dependencies between different patient-flow variables in EDs.
The remainder of this paper is organized as follows. We discuss existing methods for patient-flow prediction in Section II. Section III contains the problem statement and describes the ED workflow as well as the variables of interest in the dataset. In Section IV, we develop a mathematical model for multivariate time-series forecasting and describe how it is realized as PatientFlowNet, a deep learning model. We describe our dataset in Section V, and in Section VI, we benchmark PatientFlowNet against existing methods using data from three different EDs. We conclude with remarks in Section VII.

II. OTHER RELATED WORKS
Classical Machine Learning Models: Classical machine-learning approaches to patient-flow prediction have mostly centered around variations of ARIMA, exponential smoothing, and regression [22], [23], [32], [33]. In [18], flow data from one ED in Brazil was used to show that simple seasonal exponential smoothing was the most accurate at jointly predicting arrival rates for all triage levels, while seasonal ARIMA was more accurate at predicting arrival rates for a specific triage level. In [34], a hybrid model of ARIMA and linear regression was developed that outperforms variations of either model in predicting arrival rates at two EDs in China. In [19], flow data from one ED in China was used to show that a combination of single exponential smoothing and seasonal ARIMA is more accurate than either model alone. In [21], a host of machine learning methods were applied to data from one ED in Portugal to show that seasonal ARIMA outperforms moving-window average and exponential smoothing in predicting daily arrival counts. The main drawback of the models in the above-cited works is that their input size is small, meaning that predictions were either based on a short temporal window or required manual feature engineering. As we will show, more flexible models that can use a larger input size can provide more accurate predictions.
Deep Learning Models: Feed-forward and recurrent neural networks have also been used to predict patient flow in hospitals. In [35], a hybrid model of ARIMA and a feed-forward neural network was developed that outperforms ARIMA, linear regression, and a hybrid of the two in predicting arrival rates in one ED in Turkey. In [36], a feed-forward network whose feature selector is based on a genetic algorithm was developed that outperforms a host of models, including ARIMA and linear regression, in predicting flow rates in one ED in Hong Kong. In [25], it was noted that random forest is more accurate than long short-term memory (LSTM) in predicting discharge rates, as it can combine data from multiple flow variables. An extensive study of neural network models was done in [24], where it was shown that a sequence-to-sequence LSTM that uses a long history of flow variables outperforms a host of baselines, including seasonal ARIMA, linear regression, and a feed-forward network, in predicting arrival rates in one pediatric clinic. The two drawbacks of this approach are that LSTM does not combine information from multiple variables and that it is slow to train. In contrast, our model provides more accurate predictions by learning from multiple variables, while training faster due to its convolutional design.
While convolutional models have been used for other time-series prediction and classification tasks [27]–[29], [31], they have not been used for patient-flow prediction. Our convolutional model considers dependencies between multiple patient-flow variables and utilizes their long-term history to provide accurate predictions.

III. BACKGROUND AND PROBLEM STATEMENT
A. ED WORKFLOW
Patient flow in EDs is shown in Figure 1. Once a patient arrives at the ED, she is first registered (arrival time) and then triaged (triage time), where she is assigned a triage level from 1 to 5 denoting the acuity of her condition; level 1 indicates the highest acuity and level 5 the lowest. The patient then waits in a first-come-first-serve (FCFS) queue to start her treatment (treatment time). High-acuity patients (triage levels 1 and 2) are given preemptive priority over low-acuity patients (triage levels 3 to 5) and usually jump to the front of the queue. The hospitals in this study implement a separate fast-track queue during certain periods of the day, whereby patients who require minimal treatment (usually triage levels 4 and 5) are processed separately to reduce the overall wait time. Fast-track is in effect from 8am to 11pm daily. Once treatment of a patient is completed, she is discharged and leaves the ED. From the above, EDs may appear to be simple queueing systems, but our data in Section V indicates that this paradigm is not closely followed. Table 1 shows the probability of a patient jumping ahead of the line, grouped by expected scenarios. Red cells correspond to high-acuity patients being admitted, where the probability of jumping ahead is expected to be close to 1. Yellow cells correspond to patients admitted via fast-track; since fast-track is in operation only from 8am to 11pm on weekdays, we expect these values to be somewhere between 0 and 1. Green cells correspond to patients not expected to jump ahead, with values expected to be near zero.
Red cells generally have larger values than the rest, but there are numerous unexplained violations of the aforementioned expectations. Factors such as unavailability of facilities or the deteriorating condition of a waiting patient may be conducive to such violations. The above examples indicate that a model-based (e.g., queueing-model) approach in which per-patient flow metrics are considered may not work well, because violations of the model (e.g., preemptions) are unexplained. Thus, we focus on a more data-driven approach for prediction of patients' arrival, treatment, and discharge rates in fixed-length consecutive intervals.
TABLE 1. The value in row i, column j denotes the probability that in Hospital 1 an arriving patient in triage level i will be served ahead of a patient in triage level j who has arrived earlier and is in the queue.

B. PROBLEM STATEMENT
We wish to accurately forecast patient flow in EDs. Specifically, our focus is on predicting future values of arrival (λ), treatment (µ), and discharge (δ) rates by utilizing historical values of these flow variables. The term rate is defined as the raw count of patients in a specific triage level over a fixed interval of 1 hour. For instance, µ_{i,t} is the number of patients of triage level i who are treated in the 1-hour interval from t − 1 to t. Our aim is to predict the aforementioned rates for all triage levels at time t + 1, using the history of all rates at all triage levels in a fixed window of length k. Suppose we have a flow variable α that we wish to predict, where α ∈ {λ, µ, δ}. Given a set S = {(x_t, y_{t+1}) | t_1 ≤ t ≤ t_2}, where x_t = (λ_{1,t}, ..., λ_{5,t}, µ_{1,t}, ..., µ_{5,t}, δ_{1,t}, ..., δ_{5,t}) and y_{t+1} = (α_{1,t+1}, ..., α_{5,t+1}), we want to find the mapping f* : R^15 → R^5 that minimizes the loss. That is,

f* = arg min_f Σ_{(x_t, y_{t+1}) ∈ S} g(f(x_t), y_{t+1}),

where g is a loss function. This enables us to predict the number of patients in different triage levels who arrive, are treated, and are discharged from the ED in the next hour. The one-hour-ahead prediction window is sufficient for predicting short-term ED workflow variables such as wait times. For longer-term predictions, one can feed the predicted values back into the model to obtain estimates farther ahead, as is done in sequence-to-sequence models.
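The construction of the set S above amounts to sliding a window of length k over the stacked rate matrices. The following is a minimal NumPy sketch; the dict layout, function name, and toy Poisson data are our own illustrative assumptions, not part of the paper:

```python
import numpy as np

def make_dataset(rates, k, target):
    """Build (x_t, y_{t+1}) pairs from flow-rate matrices.

    rates  -- dict with keys 'lambda', 'mu', 'delta', each a (5, T) array
              of hourly counts per triage level (hypothetical layout)
    k      -- length of the history window
    target -- which rate to predict: 'lambda', 'mu', or 'delta'
    """
    stacked = np.vstack([rates['lambda'], rates['mu'], rates['delta']])  # (15, T)
    T = stacked.shape[1]
    X = np.stack([stacked[:, t - k:t] for t in range(k, T)])  # (T-k, 15, k)
    y = rates[target][:, k:].T                                # (T-k, 5)
    return X, y

# toy example: 5 triage levels, 100 hourly intervals
rng = np.random.default_rng(0)
rates = {name: rng.poisson(3.0, size=(5, 100)) for name in ('lambda', 'mu', 'delta')}
X, y = make_dataset(rates, k=24, target='mu')
print(X.shape, y.shape)  # (76, 15, 24) (76, 5)
```

Each row of X holds the k most recent observations of all 15 rate series, and the matching row of y holds the next hour's five triage-level counts of the target rate.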

IV. METHOD
The ED has no control over the arrival rates and once a patient arrives, it is not possible to turn her away. Thus, the arrival rate is exogenous to the system, but the treatment and discharge rates depend on the system, and are therefore endogenous. In what follows, we first develop a model to predict future values of an exogenous variable (e.g., λ) by using its own history. We then expand this model to predict future values of an endogenous variable (e.g., µ and δ) from the history of multiple time-series variables. Finally, we explain how this model is implemented as a neural network, namely PatientFlowNet.

A. PREDICTING EXOGENOUS VARIABLES
Let a be an integer-valued time-series variable. We wish to predict its value at time t + 1, denoted by a_{t+1}, from its k previous observations. One can use the conditional probability p(a_{t+1} | a_{t−k+1}, ..., a_t) to predict a_{t+1}. By sliding a window of length k over the time-series data, one can estimate the conditional probability distribution (CPD). Let N = max_i(a_i) denote the upper bound of a. There are N^k possible combinations of the values a_{t−k+1}, ..., a_t. At the same time, k must be large to observe a long enough history. Hence, a fully non-parametric approach to CPD estimation requires exponentially much data (scaling with N^k for large k), which is infeasible.
To mitigate the above-mentioned infeasibility, we use a parametric paradigm in which parameters are inter-related via a tree structure. We first model a_{t+1} as a linear combination of past observations of a, i.e.,

a_{t+1} = Σ_{i=1}^{k} θ_i a_{t−i+1},   (1)

where the values of θ_i are to be learned. This approach is computationally more efficient in the sense that it only needs k parameters instead of N^k to predict a_{t+1}, but may be less accurate when the future value is not best modeled by a linear combination of past observations. Since the size of θ scales with k, if one aims to look at a large enough (e.g., 2^11) window of past variables, learning θ soon becomes infeasible. Thus, in what follows, we propose a second assumption to reduce the number of parameters representing θ. A linear combination can be viewed as applying a 1-dimensional convolution filter in a neural network with no padding, so we write (1) as θ ∗ a, where θ, a ∈ R^k. Next, we enrich this convolution with a series of nested convolutions, explained below. Let k = 2^ℓ for some ℓ. Normally, the value of θ ∗ a is obtained for some θ that requires 2^ℓ parameter values. In what follows, we show how to stack convolutions together and use 2ℓ parameter values instead. Specifically, for γ^(1), ..., γ^(ℓ) ∈ R^2, we obtain convolutions in a serial manner. Let a^(0) = a, and obtain

a^(i)_j = γ^(i)_1 a^(i−1)_{j−2^{i−1}} + γ^(i)_2 a^(i−1)_j.   (2)

Since the interval between the indices of a increases exponentially with i, these are exponentially dilated convolutions. This is graphically shown in Figure 2. It follows that

a^(ℓ)_j = Σ_{i=0}^{2^ℓ−1} θ̃_i a^(0)_{j−i},   where θ̃_i = Π_{k=1}^{ℓ} γ^(k)_{2−f_{i,k}}

and f_{i,k} is the kth digit from the right in the binary representation of i. Note that a^(ℓ)_j is the convolution of θ̃ with the past 2^ℓ values of a. To put things in perspective, we made two assumptions in (1) and (2) that reduce the required data from N^{2^ℓ} to 2^ℓ and then from 2^ℓ to 2ℓ, respectively. The setup in (2) ensures that the convolutions are causal, i.e., they only look at past values.
This allows us to slide the stacked convolutions over the input stream to obtain a causal output stream, as shown by the dotted arrows in Figure 2. This scheme is also used in WaveNet [27], where residual and skip connections are added to a stack of causal, exponentially dilated convolutions, and the convolutions are interleaved with an activation function that mimics the switching mechanism in gated recurrent units [37] to improve accuracy.
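The nested-convolution construction can be checked numerically: ℓ two-tap filters, dilated by 2^(i−1) at level i, reproduce a single 2^ℓ-tap causal filter. The NumPy sketch below uses our own names (`dilated_stack`, `gammas`) and random weights purely for illustration:

```python
import numpy as np

def dilated_stack(a, gammas):
    """Apply a stack of two-tap causal convolutions, dilated by 2**(i-1)
    at level i, as in the paper's nested-convolution construction."""
    out = a.astype(float)
    for i, (g1, g2) in enumerate(gammas, start=1):
        d = 2 ** (i - 1)                                    # dilation of level i
        shifted = np.concatenate([np.zeros(d), out[:-d]])   # out[j - d], zero-padded
        out = g1 * shifted + g2 * out
    return out

rng = np.random.default_rng(1)
a = rng.standard_normal(64)
gammas = [rng.standard_normal(2) for _ in range(3)]  # ell = 3, receptive field 2**3 = 8

y = dilated_stack(a, gammas)

# the stack is equivalent to one 8-tap causal filter theta~,
# whose taps are products of gamma entries selected by the lag's binary digits:
theta = np.zeros(8)
for idx in range(8):
    w = 1.0
    for lvl in range(3):
        bit = (idx >> lvl) & 1                  # lvl-th binary digit of the lag
        w *= gammas[lvl][0] if bit else gammas[lvl][1]
    theta[idx] = w
y_direct = np.array([sum(theta[i] * (a[j - i] if j - i >= 0 else 0.0)
                         for i in range(8)) for j in range(64)])
print(np.allclose(y, y_direct))  # True
```

Note that the stack needs only 2ℓ = 6 weights, while the equivalent direct filter has 2^ℓ = 8 taps; the gap widens exponentially for larger ℓ.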

B. PREDICTING ENDOGENOUS VARIABLES
While the above scheme works for an exogenous variable such as the arrival rate, it lacks the cross-learning needed for endogenous variables. For instance, the treatment rate depends not only on its own history, but also on that of the arrival rate. If there is an uptick or a slowdown in arrivals, a similar pattern will be observed in treatment rates with a delay. Thus, we need a model that can learn from multiple time-series variables to predict future values of an endogenous variable.
We adapt our scheme to accommodate endogenous variables that depend not only on their own history, but on other time-series variables as well. In doing so, we predict the value of an endogenous variable b at time t + 1, denoted by b_{t+1}, from the previous k observations of m input variables, one of which is the past observations of b itself. Collectively, we denote these inputs as a, with entries a_{i,j} for i = 1, ..., m and j = t − k + 1, ..., t. Note that the input size is m × k. One can estimate the conditional probability p(b_{t+1} | a_{1,t−k+1}, ..., a_{m,t}) by sliding a window of size k over the m time-series variables. Let N = max_{i,j}(a_{i,j}) be an upper limit on a. We need to collect probabilities over all combinations of N values at the m × k input positions, resulting in N^{mk} probability values. As this is infeasible, we model b_{t+1} as a linear combination of past values, i.e.,

b_{t+1} = Σ_{i=1}^{m} Σ_{j=1}^{k} θ_{i,j} a_{i,t−j+1},   (3)

where θ ∈ R^{m×k} is to be learned. This can be represented as applying a 2-dimensional convolution filter in a neural network with no padding on the input data, meaning that (3) can be written as θ ∗ a. We aim to learn long-term dependencies of time-series variables, which requires k to be large. Since the size of θ scales with k, learning θ soon becomes infeasible. To mitigate this, we propose a systematic representation of θ that requires a number of parameters in the order of log k. To do so, we use a stack of 2-dimensional convolutions spanning multiple time-series variables (the spatial dimension) over a fixed time window (the temporal dimension). These convolutions are causal and exponentially dilated in the temporal dimension to create a receptive window large enough to utilize long-term history while maintaining causality between the input and the output. We call such convolutions flow convolutions. Let a^(0) = a, where a_j denotes the jth column of a, and obtain a^(i) for i = 1, ..., ℓ by

a^(i)_j = γ^(i)_1 a^(i−1)_{j−2^{i−1}} + γ^(i)_2 a^(i−1)_j,   (4)

where γ^(i)_1, γ^(i)_2 ∈ R^{m_i × m_{i−1}} (with m_0 = m) and a^(i)_j ∈ R^{m_i}. The indexing of elements in (4) ensures that the elements of a^(i) are derived from past values in the lower layer, hence causality is maintained. The exponentially increasing dilation ensures that the temporal receptive window grows as we go higher in the stack, while the spatial length is m_i. This stacking of flow convolutions is shown in Figure 3. Thus, the output of the stack, a^(ℓ)_t, is the result of convolutions of the past 2^ℓ observations of a^(0). Note that by setting a^(ℓ)_t = b_{t+1}, we have replicated (3) using a stack of γ^(i)s. Since the size of γ^(i) is 2 m_{i−1} m_i, the number of parameters needed for this scheme is 2 Σ_{i=1}^{ℓ} m_{i−1} m_i, which is in the order of ℓ = log k. Thus, by expressing future values as a linear combination of past values and utilizing a stack of flow convolutions, we reduce the number of required parameters from an order of N^{mk} to an order of log k. This structure ensures that the output is learned from multiple time-series variables over an exponentially large temporal window, and it depends on the γ^(i), which represent a set of flow convolutions. This is illustrated in Figure 4. Learning in the temporal dimension is as in Section IV-A for a single time-series variable, but we now have concurrent learning in the spatial dimension across multiple time-series variables. By using a stack of such convolutions, we obtain an exponentially large temporal receptive field, and we can change the size of the spatial dimension as we go higher in the stack by changing the value of m_i.
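A flow-convolution stack can be sketched in a few lines of NumPy: each level applies two matrix-valued taps, dilated in time, that mix the spatial dimension from m_{i−1} down to m_i channels. The function name and the toy sizes (15 input series, 5 outputs) are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def flow_conv_stack(a, filters):
    """Stack of flow convolutions: two matrix-valued taps per level,
    dilated by 2**(i-1) in time (a sketch of the Eq. (4)-style recursion).

    a       -- (m, T) array of m time-series variables
    filters -- list of (G1, G2) pairs; at level i, G1 and G2 are
               (m_i, m_{i-1}) matrices mixing the spatial dimension
    """
    out = a.astype(float)                      # a^(0), shape (m_0, T)
    for i, (G1, G2) in enumerate(filters, start=1):
        d = 2 ** (i - 1)
        shifted = np.zeros_like(out)
        shifted[:, d:] = out[:, :-d]           # a^(i-1)_{j - 2^(i-1)}, zero-padded
        out = G1 @ shifted + G2 @ out          # a^(i)_j, shape (m_i, T)
    return out

rng = np.random.default_rng(2)
m_sizes = [15, 16, 16, 5]                      # m_0 = 15 inputs down to m_3 = 5 outputs
a = rng.standard_normal((15, 128))
filters = [(rng.standard_normal((m_sizes[i + 1], m_sizes[i])),
            rng.standard_normal((m_sizes[i + 1], m_sizes[i])))
           for i in range(3)]
out = flow_conv_stack(a, filters)
print(out.shape)  # (5, 128)
```

The parameter count here is 2(16·15 + 16·16 + 5·16), matching the 2 Σ m_{i−1} m_i bound, and the zero-padded shift keeps every output causal with respect to the input columns.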

C. PatientFlowNet ARCHITECTURE AND IMPLEMENTATION
We use the framework developed so far, drawing inspiration from [27], to build our convolutional neural network, namely PatientFlowNet. The core idea behind PatientFlowNet is using a stack of flow convolutions. The input is passed to the first layer of the stack; in each layer, we first apply the flow convolution filter, followed by independent applications of the tanh and sigmoid activation functions and subsequent multiplication of the results, which mimics the activation function in [27]. We discovered that an additional 1-dimensional convolution across time-series variables (i.e., in the spatial dimension) improves performance. The output of this spatial convolution is passed to the next layer, and with each layer the size of the receptive field doubles. We continue adding layers until the size of the receptive field matches the desired temporal input size. We also add residual and skip connections between and within layers to enable training deeper networks. The output of the topmost layer is the result of a causal convolution of all elements in the receptive field, as noted in Section IV-B. We subject this output to a series of additional spatial convolutions with ReLU activations. Finally, we apply a dropout layer to avoid overfitting, followed by a 1-dimensional spatial convolution and a linear activation function to produce a continuous output. This architecture is shown in Figure 5.
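The gating inside each layer (tanh and sigmoid branches multiplied together, after [27]) and the residual connection can be sketched as follows. This is a simplified NumPy illustration in which plain matrix products stand in for the dilated flow convolutions; all names and sizes are hypothetical:

```python
import numpy as np

def gated_activation(filter_out, gate_out):
    """WaveNet-style gate: a tanh branch modulated by a sigmoid branch."""
    return np.tanh(filter_out) * (1.0 / (1.0 + np.exp(-gate_out)))

def layer(x, Wf, Wg, Ws):
    """One PatientFlowNet-style layer (sketch): Wf@x and Wg@x stand in
    for the two dilated flow-convolution branches; gating is followed by
    a 1x1 spatial convolution Ws, a residual sum, and a skip output."""
    z = gated_activation(Wf @ x, Wg @ x)
    skip = Ws @ z                        # 1-dimensional spatial convolution
    return x + skip, skip                # residual output, skip output

rng = np.random.default_rng(3)
x = rng.standard_normal((16, 24))        # 16 channels, 24 time steps
Wf, Wg, Ws = (rng.standard_normal((16, 16)) for _ in range(3))
res, skip = layer(x, Wf, Wg, Ws)
print(res.shape, skip.shape)  # (16, 24) (16, 24)
```

The sigmoid branch acts as a learned per-element switch on the tanh branch, which is the mechanism borrowed from gated recurrent units that the paper cites for improved accuracy.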

V. DATASET
The data used in this paper comes from EDs in three teaching hospitals in New York City that did not give permission to be named, and so are called Hospitals 1, 2, and 3. The datasets contain per-patient information, namely triage level, arrival time, treatment time, and discharge time, over a roughly 2-year period from 2011 to 2013. We discard invalid patient records, i.e., those whose discharge time is prior to their treatment time or whose treatment time is prior to their arrival time. We also discard the data of any patient whose time-to-treatment is longer than 24 hours. These exclusions lead to discarding 1.23% of the data.
For each hospital, we extract the number of arrivals (λ), treatments (µ), and discharges (δ) for different triage levels over fixed consecutive intervals of τ = 60 minutes. Let λ_{i,k} be the number of patients in triage level i who arrive during the kth interval. Similarly, let µ_{i,k} and δ_{i,k} be the number of patients in triage level i who are treated and discharged during the kth interval, respectively. Thus, when there are T intervals, we have λ, µ, δ ∈ R^{5×T}. The data has a long-tailed distribution, as shown in Figure 6a. Hence, we apply a log1p transformation (defined by log1p(x) ≡ log(1 + x)) to shrink the long tail.
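The log1p transformation and its exact inverse, expm1, can be illustrated as follows; the toy count distribution is our own, chosen only to mimic a long tail:

```python
import numpy as np

rng = np.random.default_rng(4)
# long-tailed toy counts (product of Poisson and geometric draws)
counts = rng.poisson(2.0, size=10_000) * rng.geometric(0.3, size=10_000)

transformed = np.log1p(counts)           # log(1 + x), well defined at x = 0
recovered = np.expm1(transformed)        # exact inverse, back to the count scale

print(np.allclose(recovered, counts))    # True
print(counts.max(), round(transformed.max(), 2))  # the tail shrinks to O(log)
```

Using log1p rather than a plain log keeps the many zero-count intervals well defined (log1p(0) = 0) while compressing the rare large counts.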

VI. RESULTS
In this section, we present experiments on the three datasets described in Section V. For each dataset, we report the prediction error on arrival (λ), treatment (µ), and discharge (δ) rates. In each experiment, we predict the one-step-ahead values of each of the above rates for all triage levels, given the values of the last k observations of all rates and triage levels for a given value of k. That is, the input data consists of the past k observations of λ, µ, and δ for triage levels 1, . . . , 5, and the output is the next observation of the variable of interest for triage levels 1, . . . , 5. Since some of the baseline methods cannot effectively utilize a long input size, we conduct two sets of experiments: 1) Short-term experiments, whose input size k is short.
2) Long-term experiments, whose input size k is long.
Since the data exhibits a strong 24-hour cyclic behavior, as shown in Figure 6b, we set the input window size to k = 24 for short-term experiments. For long-term experiments, we consider a window size of k = 2^11, which roughly covers 3 months of observations.
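The strength of the 24-hour cycle can be gauged with a seasonal-naive forecast that simply repeats the value observed 24 hours earlier. This baseline is not one of the paper's methods; the sketch below, on synthetic data, only illustrates how much signal the daily cycle alone carries:

```python
import numpy as np

def seasonal_naive(series, season=24):
    """MAE of predicting each hour by the value one season (24 h) earlier.
    Not one of the paper's baselines; shown only to illustrate the cycle."""
    preds = series[:-season]
    actual = series[season:]
    return np.abs(actual - preds).mean()

# toy series with a daily cycle plus noise (~3 months of hourly data)
t = np.arange(24 * 90)
series = 5 + 3 * np.sin(2 * np.pi * t / 24) \
    + np.random.default_rng(5).normal(0, 0.5, t.size)

print(round(seasonal_naive(series, 24), 2))
```

On this toy series the naive forecast's error reduces to the noise level, since the sinusoidal daily component cancels exactly; any model using a k = 24 window has at least this much structure available to it.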

A. EXPERIMENT SETUP
For each hospital, we segment the corresponding dataset (consisting of the time-series values of λ, µ, and δ as described in Section V) into 4 non-overlapping partitions, where each partition consists of a train/validation/test split as shown in Figure 7. There are 720 labels (equal to 30 days) in each of the training, validation, and test sets, and these labels do not overlap. We use a walk-forward approach [38], where the value of each label is predicted using the k previous observations.
FIGURE 7. The 4-fold training/validation/test walk-forward split used in benchmarking. The labels (in blue) are predicted using previous observations (in red). Each blue tile corresponds to 1 month of labels and each red tile corresponds to 3 months of observations.
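The walk-forward segmentation can be sketched as index arithmetic. The exact fold layout below (adjacent 720-hour train/validation/test tiles, shifted by one tile per fold) is our reading of the description and should be treated as a hypothetical reconstruction:

```python
import numpy as np  # only for parity with the rest of the examples

def walk_forward_splits(n_hours, label_len=720, history_len=3 * 720, n_folds=4):
    """Generate walk-forward (train/validation/test) index ranges: in each
    fold the three label blocks are adjacent, non-overlapping 720-hour
    (30-day) tiles, and folds walk backward from the end of the data.
    Hypothetical layout, not taken verbatim from the paper."""
    folds = []
    for f in range(n_folds):
        test_end = n_hours - f * label_len
        test = (test_end - label_len, test_end)
        val = (test[0] - label_len, test[0])
        train = (val[0] - label_len, val[0])
        if train[0] - history_len < 0:
            break                        # not enough history for the k inputs
        folds.append({'train': train, 'val': val, 'test': test})
    return folds

folds = walk_forward_splits(n_hours=24 * 700)
for f in folds:
    print(f)
```

Each label block is preceded by enough history (here three months) to supply the k previous observations, and no test label ever appears in any training or validation block of the same fold.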
Since patient-flow patterns depend on external and internal factors in the ED, we expect short-term correlation between patient-flow patterns. Factors such as ambulance routing or patient registration processes may change over time, which in turn may significantly change patient-flow patterns over longer periods; this makes patient-flow rates non-stationary. When the training set and the test set are far apart, their distributions become inherently different, and a model trained on the training set would not be a good predictor on the test set. To avoid such instances, the train/validation/test splits need to be temporally close to each other. Hence, the train/validation/test boundaries are chosen in such a way that their corresponding labels (marked in blue in Figure 7) are adjacent to each other and do not overlap. The input data in different sets (marked in red) may overlap, but the blue labels of the test set, on which the prediction errors are reported, do not overlap with the other data. This is in fact standard evaluation practice [38].
For each segment, we train the models on the training set, monitor the errors on the validation set, and report the error over the test set for the model that has the lowest validation error. We use several error metrics to compare the models in our experiments, namely the mean absolute error (MAE), the mean absolute percentage error (MAPE), the root mean square error (RMSE), and the coefficient of determination (R^2). We use the Adam optimizer [39] and train each model for up to 4000 epochs using early stopping with a tolerance of 100 epochs on the validation loss. We repeat each experiment 10 times with random initializations and report the mean error over the test set, which offsets the influence of random events.
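The four error metrics can be sketched in NumPy as follows. Masking zero labels when computing MAPE is our own choice, motivated by MAPE's sensitivity to near-zero counts; the paper does not specify its handling:

```python
import numpy as np

def metrics(y_true, y_pred):
    """MAE, MAPE, RMSE, and R^2 (sketch). MAPE is averaged only over
    nonzero labels, since it blows up on the very small high-acuity
    counts (our choice of handling, not the paper's)."""
    err = y_pred - y_true
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    nz = y_true != 0
    mape = 100.0 * np.abs(err[nz] / y_true[nz]).mean()
    r2 = 1.0 - (err ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum()
    return mae, mape, rmse, r2

y_true = np.array([4.0, 2.0, 0.0, 6.0, 3.0])
y_pred = np.array([3.0, 2.5, 0.5, 5.0, 3.0])
mae, mape, rmse, r2 = metrics(y_true, y_pred)
print(round(mae, 2), round(rmse, 2))  # 0.6 0.71
```

MAE and RMSE share the units of the (log1p-transformed) labels, which is why comparable magnitudes between them indicate a tamed tail; R^2 is unitless and compares the residuals against a constant-mean predictor.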

B. EXPERIMENTS
We compare the performance of our model against the following state-of-the-art baselines in patient-flow prediction: • Gaussian Process Regression (GPR) consisting of exp-sine-squared and white kernels. The exp-sine-squared kernel has a length scale of 1 and a periodicity of 24, and the white kernel has a noise level of 1.
• Seasonal ARIMA [20] with autoregressive order of 1, differencing order of 1, moving average of 2, and seasonality period of 24.
• Feed forward network (FF) [24] with one hidden layer of 64 units and ReLU activation function.
• WaveNet-Long (WN-L) [27] with 11 layers and a filter size of 2 in each layer to match the input size of k = 2^11.
• PatientFlowNet-Long (PFN-L) as in Section IV-C with input size k = 2^11, 11 layers, and 16 filters with a temporal length of 2.
Parameter values for GPR, PFN-S, and PFN-L were chosen based on cross-validation on the portion of the datasets from all three hospitals that was not used for the rest of the experiments. Parameter values for the remaining baselines are exactly the same as those in their respective cited papers. We used the TensorFlow and Scikit-learn Python packages for our experiments.

C. DISCUSSION
In what follows, we present observations based on the MAE values in Table 2 for the test sets. Experiments are repeated for other loss metrics, namely the mean absolute percentage error (MAPE) and the root mean squared error (RMSE), and the results are included in Table 2. RMSE values are of the same order of magnitude as MAE values, indicating that our log1p transformation was effective in reshaping the long-tailed distribution. The MAPE values are sensitive to small label values, and we observe inflated MAPE values because patient-flow rates are very small for high-acuity triage levels. We observe similar performance gaps between models regardless of the choice of error metric; hence, we focus on MAE values.
In short-term experiments, GPR, Lasso-LR, WN-S, and PFN-S are the most accurate in predicting the arrival rate (λ). Models in long-term experiments, such as LSTM, WN-L, and PFN-L, outperform those in short-term experiments, as they take a longer history as input to predict future values. Note that with respect to λ, there is not a large gap in prediction accuracy between PFN-L and WN-L, despite the fact that PFN-L learns from multiple time-series inputs. This is because λ is exogenous and its dependence on other time-series variables is minimal. Thus, cross-learning from other time-series variables is not significant, while temporal learning from a longer history provides better predictions for λ.
We observe the reverse for treatment (µ) and discharge (δ) rates, which are endogenous. As endogenous variables depend on other variables, cross-learning is, as expected, significant. Models in short-term experiments that learn from multiple time-series variables, even with a shorter receptive field, generally outperform models in long-term experiments that learn from a single variable, such as LSTM and WN-L. This indicates that dependencies between endogenous variables are mostly short-term, and a 24-hour observation window can capture most of the cross-learning. Note that PFN-L outperforms even the models in short-term experiments by learning from multiple time-series variables over a long temporal receptive field.
We observe that PatientFlowNet produces more accurate predictions than the rest of the baselines when using a short input window, and that it further improves its accuracy when the length of the input window is increased. PFN-S has a higher prediction accuracy than the rest of the models in short-term experiments for all rates, indicating that it better learns dependencies between different flow variables when using a short input window. Furthermore, WN-L and PFN-L outperform their respective short-term versions. Therefore, while there is a strong 24-hour cyclic behavior in the data, these models use their large receptive fields to learn patterns beyond the 24-hour cycle and make more accurate predictions. This serves as an ablation study of our model, indicating that its superior prediction accuracy derives from both its cross-learning and its large temporal window. When averaged over all rates in all hospitals, PFN-L has an MAE that is 4.8% lower than that of the leading baseline.

Table 2: Comparison of models used in short-term experiments (Gaussian process regression (GPR), Lasso linear regression (Lasso-LR), random forest (RF), feed-forward networks (FF), WaveNet-Short (WN-S), and PatientFlowNet-Short (PFN-S)) and long-term experiments (ARIMA (AR), LSTM, WaveNet-Long (WN-L), and PatientFlowNet-Long (PFN-L)) in terms of MAE, MAPE, RMSE, and R² (corresponding to models that minimize MAE) for the arrival (λ), treatment (µ), and discharge (δ) rates in hospitals H1, H2, and H3.
The MAE values along with standard errors are shown in Figure 8. Note that deep learning models generally exhibit higher variation than classical machine learning methods, such as regression and random forests, owing to their more stochastic initialization. For the exogenous variable, i.e., λ, deep learning models generally outperform classical machine learning models, but the difference is within the margin of error. For the endogenous variables, i.e., µ and δ, while the error bars for deep learning models are still large, the performance gap between PFN-L and the remaining models is significant. Thus, by efficiently learning from multiple flow variables over a long temporal window, PFN-L has a slightly better prediction accuracy for the exogenous variable, while having a significantly better accuracy in predicting endogenous variables in the ED.
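For reproducibility, the error bars follow the usual 2 × standard-error construction across cross-validation folds. The helper and fold values below are a minimal sketch with made-up numbers, not our measured results.

```python
import numpy as np

def error_bar(fold_errors):
    # Mean test error and an error-bar length of 2 x the standard
    # error (sample std / sqrt(n)) across the folds.
    e = np.asarray(fold_errors, dtype=float)
    return e.mean(), 2.0 * e.std(ddof=1) / np.sqrt(e.size)

# Hypothetical MAE values from 4 cross-validation folds.
mean_mae, bar = error_bar([0.52, 0.49, 0.55, 0.50])
print(mean_mae, bar)
```

Two models are then considered distinguishable only when the gap between their mean errors exceeds these bars, which is the criterion behind the "within the margin of error" statement for λ.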
Since PatientFlowNet can simultaneously learn from multiple time-series variables, we can extract dependencies between patient-flow variables in retrospect. To do so, we examine the flow convolution filters in PatientFlowNet. Figure 9 shows the normalized values of the flow convolution filters in the first layer of PatientFlowNet-Long for each hospital. The values are normalized so that the values in each filter sum to 1; thus, darker shades denote stronger dependency. Note that λ is best estimated using its own history, µ is best estimated using past values of λ and µ, and δ is highly dependent on past values of µ and δ. This validates our earlier assumption on the exogenous nature of λ. The lack of dependency of µ on δ indicates that slowdowns and speedups in discharging patients do not significantly affect treatment rates. Similarly, the lack of dependency of δ on λ indicates that patients are not necessarily rushed out when arrival rates spike. This analysis is based on historical data in our data sets collected during non-distress times, and the dependencies may change during distress times.

Figure 8: Average test error and its confidence interval for predicted rates for different models in the three hospitals. The length of the error bar for each variable is 2 × the standard error of the test error across 4 folds. Note that for the endogenous variables (µ and δ), PatientFlowNet-Long outperforms the other models by a large margin. Its average MAE for the exogenous variable (λ) is also lower than that of the other models, but within the margin of error.

We also perform an ablation study on the set of input parameters to our model to assess their importance in predicting future values. Table 3 shows MAE values of PFN-L for predicting patient-flow parameters using different input configurations in the three hospitals in our study.
We observe the same dependency pattern discussed above: the exogenous variable (λ) is best predicted using its own history, while the endogenous variables (µ and δ) are best predicted using the history of all variables. While the use of additional input parameters is expected to add uncertainty to the model and reduce its prediction accuracy, we note that for the exogenous variable this reduction is minimal, as PFN-L learns and adapts to such dependencies, as observed in Figure 9.
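The normalization used to read dependencies off the first-layer filters can be sketched as follows. The weight tensor here is a random stand-in with an assumed (output variable, input variable, kernel width) layout, not the learned parameters shown in Figure 9.

```python
import numpy as np

# Illustrative first-layer flow-convolution weights, shaped
# (n_output_vars, n_input_vars, kernel_width) for (lambda, mu, delta).
# Random stand-ins, not parameters learned from hospital data.
rng = np.random.default_rng(1)
weights = rng.random((3, 3, 5))

# Aggregate absolute weight per (output, input) pair, then normalize
# each output's row to sum to 1; larger entries indicate stronger
# dependency of that output variable on that input variable's history.
dep = np.abs(weights).sum(axis=-1)
dep = dep / dep.sum(axis=1, keepdims=True)
print(np.round(dep, 2))
```

Plotting `dep` as a heatmap, with darker cells for larger entries, yields the kind of dependency matrix shown in Figure 9.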

VII. CONCLUSION
In this paper, we presented a convolutional neural network model, called PatientFlowNet, to forecast patient flow in emergency departments. The design of PatientFlowNet enables it to learn simultaneously from multiple flow variables over an exponentially large input window while keeping the model size manageable. We have shown that PatientFlowNet achieves better prediction accuracy than the current state-of-the-art models used for patient-flow forecasting. While PatientFlowNet offers only a slight improvement in prediction accuracy for exogenous variables, such as the patient arrival rate, it produces substantially more accurate predictions for endogenous flow variables, such as the treatment and discharge rates. The short-term predictions by PatientFlowNet can be used to estimate workflow variables such as wait times. We also described how dependencies between flow variables in the emergency department can be deduced, in a data-driven fashion, by inspecting the learned parameters in the first-layer filters of PatientFlowNet.