Combining Deep Neural Networks and Classical Time Series Regression Models for Forecasting Patient Flows in Hong Kong

Deep Neural Networks (DNNs) has been dominating recent data mining related tasks with better performances. This article proposes a hybrid approach that exploits the unique predictive power of DNN and classical time series regression models, including Generalized Linear Model (GLM), Seasonal AutoRegressive Integrated Moving Average model (SARIMA) and AutoRegressive Integrated Moving Average with eXplanatory variable (ARIMAX) method, in forecasting time series in reality. For each selected time series regression model, three different hybrid strategies are designed in order to merge its results with DNNs, namely, Zhang’s method, Khashei’s method, and moving average filter-based method. The real seasonal time series data of patient arrival volume in a Hong Kong A&ED center was collected for the period July 1, 2009, through June 30, 2011 and is used for comparing the forecast accuracy of proposed hybrid strategies. The mean absolute percentage error (MAPE) is set as the metric and the result indicates that all hybrid models achieved higher accuracy than original single models. Among 3 hybrid strategies, generally, Khashei’s method and moving average filter-based method achieve lower MAPE than Zhang’s method. Furthermore, the predicted value is an important prerequisite of conducting the rostering and scheduling in A&ED center, either in the simulation-based approach or in the mathematical programming approach.


I. INTRODUCTION
Nowadays, Accident and Emergency Departments (AEDs) play a significant role in the urgent assessment and treatment of illness and injury. Especially in Hong Kong, the AED can The associate editor coordinating the review of this article and approving it for publication was Simone Bianco. be regarded as a special healthcare department, which provides high-quality specialized emergency care, with the help of available resources, to patients by a team of well-trained and qualified staff. According to the Hong Kong Hospital Authority, AEDs deal with all kinds of emergencies and adopt a 5-point Triage system to assess and prioritize patients in need of urgent treatment. Under the challenges of VOLUME 7, 2019 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/ ageing population, AEDs are facing more and more pressure from the unbalance between supply and demand. A recent figure from NHS England reveals the pressure on AEDs in the UK. Only 92.6% patients were seen within four hours, which is far below the performance target. An analogous situation is common in other countries and territories, such as Hong Kong. In this circumstance, it is essential to ensure a high quality and reliable AED operation in order to utilize the limited resources to the most. A number of studies have been carried out to determine optimal staffing levels and schedules in AEDs, to maximize resource utilization and patient satisfaction. Therefore, knowing the patient arrival pattern in advance is important for decision-making across the whole AED system and is a prerequisite for rostering and scheduling, whether using a simulation-based approach [1] or a mathematical programming approach [2]. In order to solve this problem, a number of approaches were proposed to fit the daily patient flow by widely used time series regression model, such as Generalized Linear Method (GLM), Seasonal AutoRegressive Integrated Moving Average (SARIMA) method, AutoRegressive Integrated Moving Average with eXplanatory variable (ARI-MAX) method, etc. However, since the patient arrival pattern is influenced by human behavior, the relationship between the patient flow pattern and input features (i.e. contributing variables or co-variables) contains high-complexity and cannot be easily fitted by a single classical time series model. In the meantime, combining forecast models is gradually considered as a successful alternative to using just an individual forecast method. For instance, in the machine learning field, a growing number of advanced hybrid strategies, such as bootstrapping, bagging, stacking and boosting, show their better predictive performance than a single method. Hence, in this study, we propose a series of hybrid approaches to solve this real problem, in which those widely used time series regression models were combined respectively with two types of deep neural networks (DNN). Additionally, we introduced three types of hybrid strategies, which are based on Zhang's hybrid model [3], Khashei and Bijari's hybrid model [4] and a moving average filter-based hybrid method [5], respectively, in order to reduce the risk of unreasonable combination and find the best hybrid strategy with minimum validation error.
The paper is organized as follows. Section 2 shows a retrospective study of forecast models in this topic and proposed hybrid approaches. Section 3 presents the detail of our DNN architectures and three different hybrid strategies. Section 4 presents the result of a case study and meanwhile discusses some new findings. The conclusion and the future work are shown in Section 5.

A. PATIENT FLOW PREDICTION
After conducting a systematic literature review of articles concerning patient flow prediction, we find that almost all of the forecast models abstracted from reviewed articles can be labeled as the single classical prediction model. According to incomplete statistics, the ARIMA model [6], [7] is the most widely used forecast model. Among those regression-based models, most approaches choose the Multiple Linear Regression (MLR) model [8], [9] and get acceptable outcomes though the model seems quite simple. Some approaches [10], [11] choose an improved version of Poisson regression with the idea that the patient arrival volume is count number. It is notable that a number of Artificial Neural Network (ANN)-based methods [12]- [14] (Aladag & Aladag, 2012; Menke et al., 2014; Xu, Wong, & Chin, 2013) are applied in this problem because of its flexibility of approximating any linear and nonlinear function to any desired degree of accuracy. However, all these approaches choose traditional shallow ANNs which have only one hidden layer and cannot learn the deep structure of real data sets. Until now, we have not found a proposed approach using hybrid model.

B. FORECAST MODEL COMBINATION
Based on the result of Hibon's study [15], not only do the best possible combinations perform better than the best possible single model, but also combined forecasts reduces the risk of selecting an individual forecast model in practice. After a retrospective study of proposed hybrid models via literature review, the basic idea of combining forecast models is to utilize each model's unique feature to capture different patterns in the data. Wedding and Cios [16] combined the result of Radial Basis Function (RBF) neural networks [17] with Box-Jenkins (UBJ) models [18] to predict future values of data. Terui and Van Dijk [19] investigated combinations of forecasts generated by linear and some nonlinear models and indicate that combined forecasts perform well. It is worth to mention that the method with time varying coefficients dominates marginal forecasts for inside sample performance. In the meantime, a huge number of hybrid strategies are proposed to improve the performance of model combination. Zhang [3] proposes a hybrid strategy that combines both ARIMA and ANN models to take advantage of the unique strength of ARIMA and ANN models in linear and nonlinear modeling. Then, a novel hybridization of artificial neural networks and ARIMA model is proposed by Khashei and Bijari [4], which identify and magnify the existing linear structure in data. Babu and Reddy [5] propose a moving-average filter based method, in which the nature of volatility was explored using a moving-average filter, and then an ARIMA and an ANN model were suitably applied. Therefore, in our approach, all these three kinds of strategies are applied to our models and compared by real data sets.

A. FRAMEWORK OF DNN
A DNN is defined as an ANN with multiple hidden layers of units between the input and output layers. Similar to shallow 118966 VOLUME 7, 2019 ANNs that have only one hidden layer, DNNs can still deal with complex linear and non-linear relationship. Although the universal approximation theorem states that a feedforward network with at least one hidden layer with any activation function can approximate any function, the theorem does not restrict the size of this network, and exponential number of hidden units might be required with the increasing complexity of objective function. Hence, the shallow ANN may fail to fit and generalize correctly with an infeasibly large number of hidden units whereas the DNN can reduce the number of hidden units and also the amount of generalization error. Moreover, when each sequential hidden layer is connected with the one below it, each hidden layer becomes a projection of the layer below it as well. The information from the layer below is nicely transferred by a non-linear, optimally weighted, projection in each subsequent layer of the DNN. Therefore, the extra hidden layers ensure the potential capacity of modeling complex data structure with fewer units than a traditional shallow network.
In our approach, we adopt two types of DNNs. One is a Multi-Layer Perception (MLP) with a rescaled and shifted logistic activation function: This activation function is symmetric around 0 and allows the training algorithm to converge faster. We implement the DNNs by using an R package called ''H2o'' in R x64 3.1.2.
The optimizer for weights update is a parallelized version of Stochastic Gradient Descent (SGD) [20]. This algorithm is faster and more efficient than the standard SGD as it is implemented in a multi-threaded computer system. The other one is a fully connected recurrent neural network (RNN) where the output is to be fed back to input. Each node in the fully connected RNN is connected with a one-way connection to every other node in the next layer. Similar with MLP, each node has a time-varying real-valued activation and each connection has a modifiable real-valued weight. The feedback structure of RNN allows it to exhibit temporal dynamic behavior for a time sequence. Unlike MLP, RNN can use its internal state to process sequences of inputs. The activation function is as same as the one of MLP and we introduced a state-of-the-art optimizer Kingma and Ba [21] as the learning algorithm of RNN. Both MLP and RNN contains three hidden layers, which is determined based on the scale of training datasets and some initial comparative tests.

B. TECHNIQUES TO AVOID OVERFITTING
With more hidden layers, it is obvious that DNNs are often much harder to train than shallow neural networks. DNNs are prone to get over-fitting because of the added layers of abstraction, which allow them to model rare dependencies in the training data. In general, a model containing high-complexity does not necessarily include the target function or the true data generating process. In the modeling process, we almost never have access to the true data generating process, which means that it is usually difficult to adjust the complexity of the model with deep architecture. Not simply to control the size or the number of parameters, in this approach, we aim to fit a large DNN which has been regularized appropriately. In this way, over-fitting can be avoided while the flexibility of the DNN is maintained.
Based on the motivation mentioned above, we apply two kinds of techniques to avoid over-fitting. The first one is L 1 and L 2 regularization methods, in which we modify the loss function as B|j). (2) In this way, those weights which are used to fit the sampling error are cut off and then a smoother model is formulated. L 2 regularization represents the sum of squares of all the weights and biases in the network. This parameter norm penalty is commonly known as weight decay and the mechanism is summarized as follows. The input X is perceived to have higher variance by adding the L 2 regularization. This makes the learning algorithm shrink the weights on features whose covariance with the output target is low. L 1 regularization represents the sum of absolute values of the weights and biases in the DNNs. Compared with L 2 regularization, L 1 regularization results in a solution that is sparser (i.e. more parameters have an optimal value of zero). Similar to LASSO, the sparsity property induced by L 1 regularization has been considered as a feature selection mechanism. The constant λ 1 and λ 2 control the degree of penalties.
The second one is a more recent technique named as dropout method [22]. In dropout, some number of units are randomly omitted from the hidden layers during training. In detail, with different training examples, each neuron in the input layer or the hidden layers can suppress its activation with a given probability. Then, an exponentially large number of models are produced because each training example trains a different model. Finally, the weights and biases in these models are averaged as an ensemble. Therefore, the rare dependencies which is possible to occur in the training data are prevented.

C. THREE COMBINATION STRATEGIES 1) ZHANG'S METHOD
First, we design a naive hybrid strategy which is based on G.P. Zhang's idea. He assumes that it may be reasonable to consider a given time series to be composed of a linear component and a nonlinear component, which can be described as In equation (3), L t denotes the linear component and N t denotes the nonlinear component. These two components can be estimated from the data as followed. We let 3 linear models, GLM, SARIMA and ARIMAX, as mentioned above, to model the linear component respectively, and then the residuals from the linear model will contain only the nonlinear component with the assumption that the linear model can  (4), whereL t is the forecast value for time t from the fitted linear model.
After modeling e t using DNNs, nonlinear relationships can be discovered. Denoting the forecast value from DNNs asN t , the combined forecast will bê

2) KHASHEI'S METHOD
This combination strategy can be regarded as a modified version of Zhang's method, in which the existing linear structure in data is magnified and DNNs is used to determine a model to capture the potential data generating process. Unlike the Zhang's assumption shown in equation (3) Similar as Zhang's method, the main aim of the first stage is linear modeling. Therefore, 3 linear models are used to model the linear component and the residual series e t is acquired. In second stage, the main aim is nonlinear modeling and the data generating process modeling. The DNNs are utilized to conduct this task and the combined forecast is shown in equation (7), where e t is the residual at time t from the three linear models, andL t is the forecast value for time t; z t = (1−B) d (y t −µ) can be regarded as a linear structure abstracted from time series; m, l and n are integers that are determined in design process of final DNNs.

3) MOVING AVERAGE FILTER-BASED METHOD
We design a moving-average filter based hybrid strategy, which is based on C. Narendra Babu's study [5]. With the opinion that whether a given sequence is normally distributed or not must be examined via Jarque-Bera normality test, a time series is firstly divided into a low-volatility component and a high-volatility component by moving-average filter.
In this process, the order of moving-average filter is set by checking the kurtosis of the sequence. Then, the lowvolatility component is fitted by three linear models, respectively, and the high-volatility component is fitted by the DNNs. Finally, the results of two components are combined and the detail is shown in Figure 1.

IV. CASE STUDY
Administrative data was obtained from a certain A&ED center in Hong Kong from 1 st July 2009 to 31 th March 2011. According to the ''Triage System in Hong Kong A&EDs'', patients are divided into five categories according to their injury severity, and in this study, we concerned patients in Category 3 and 4, and also the total volume. The reason is that the number of patients in Category 1 and 2 is far less than the total volume, and the corresponding patient arrival pattern is usually fixed. In the meantime, the patients in Category 5 do not need the service at once thus the related patient flow does not influence the schedule plan of A&ED staff.    From Figure 2, we can capture obvious distinct monthly pattern and weekly pattern on daily patient flow. Therefore, we add both ''month of a year'' and ''day of a week'' in the initial feature set, as the calendar-based factors. As is mentioned in literature part, we also include the holiday related factors (i.e. whether it is holiday and whether it is before or after holiday). All the calendar-based co-variables are converted into binary dummy variables. For the meteorological factors, we use air temperature, mean dew point, mean relative humidity, mean amount of cloud, total rainfall, mean wind speed as relevant features. We add the change of above index into the initial feature set as the lag effect of these meteorological factors.
To validate the performance of proposed hybrid model, the whole data set is firstly divided into training set and testing set, with the ratio around 7:3. The prediction horizon is fixed to 28 days ahead, which satisfies the real demand from the decision makers in the healthcare system. Considering   the application environment and the sequential instances, cross-validation is not a good choice in this approach because it is impractical to shuffle the training set and testing set after the model are implemented in hospital. Based on this constraint, to maintain the reliability of testing error, the cumulative training set is implemented. The whole testing set is firstly divided into several sequential segments and then each segment is added to the current training set after the last prediction is finished. In this way, each proposed hybrid model is evaluated five times. A series of hybrid models are established by combining two types of DNNs and Poisson Regression, SARIMA, ARIMAX, respectively. For each pair, 3 hybrid strategies are used and compared with the benchmark algorithm, i.e., the original single model. The metric is set as the Mean Absolute Percentage Error (MAPE) of five testing sets and prediction horizon is fixed as 28 days ahead, which is based on the real demand. The result is shown in Figures 3 to 5. Corresponding values are listed in Tables II to IV, in which relative increment with respect to the single time series regression model is also calculated in order to show the improvement of hybrid models. For the similarity between results achieved by MLP and RNN in all three datasets (see Figures 3 to 5), only values related with RNN are recorded in this paper.
The result demonstrate that all hybrid models perform better than original single models, except for a few abnormal data points. Based on different comparisons between hybrid models and the original single model, it can be concluded that introducing DNNs to linear models benefits the performance with lower MAPE, which validates the predictive power and VOLUME 7, 2019  universal property of DNNs indirectly. By comparing the performance obtained by MLP with different models and RNN with these models, it indicates that RNN helps the hybrid model to achieve slightly lower MAPE in some cases (4 th testing segment in the case that DNNs combined with Poisson Regression). By comparing three hybrid strategies, generally, Khashei's method and MA filter-based method achieve lower MAPE than Zhang's method. As for MA filter-based method, it may occasionally bring some unexpected forecast values on the 5th testing segment in the 118972 VOLUME 7, 2019 case that DNNs are combined with ARIMAX or SARIMA model. By using the Khashei's method, hybrid models obtain comparatively low MAPEs with the highest robustness. Therefore, the combination between DNN and classical forecast models via Khashei's method possess strong predictive power and universal property, which can be deployed as a universal tool to conduct patient flow forecasting in Hong Kong A&ED.

V. CONCLUSION
In this study, a series of hybrid models are designed and compared to forecast daily patient flows in a certain A&ED in Hong Kong. The result indicates that combining DNNs and other linear models really improves the forecast accuracy of original single linear models. Theoretically, our study not only validates Hibon's findings [15], but also proposes new methodologies to utilize DNNs to model time series containing both linear and nonlinear component. Practically, forecasting varying daily patient flow patterns is a significant premise for other decision makings in A&ED, for example, nurse scheduling, doctor rostering, etc. The present forecast models and corresponding combination strategies can be generalized to other kinds of patient flow forecast problem, or other analogous time series forecast problem.