High-Performance Time Series Prediction With Predictive Error Compensated Wavelet Neural Networks

Machine learning (ML) algorithms have gained prominence in time series prediction problems. Depending on the nature of the time series data, it can be difficult to build an accurate ML model with the proper structure and hyperparameters. In this study, we propose a predictive error compensated wavelet neural network (PEC-WNN) model for improving the prediction accuracy of chaotic and stochastic time series data. In the proposed model, an additional network is used to predict the error of the main network and thereby compensate for the overall prediction error. The main network takes the time series data as inputs through moving frames at multiple scales. The same structure and hyperparameter set are applied to four quite distinct types of problems to verify the robustness and accuracy of the proposed model. Specifically, the Mackey-Glass, Box-Jenkins, and Lorenz Attractor benchmark problems, as well as drought forecasting, are used to characterize the performance of the model for chaotic and stochastic data. The results show that the PEC-WNN provides significantly more accurate predictions on all compared benchmark problems than conventional machine learning and time series prediction methods, without any change to the hyperparameters or the structure. In addition, the time and space complexity of the PEC-WNN model is lower than that of all other compared ML methods, including long short-term memory (LSTM) and convolutional neural networks (CNNs).


I. INTRODUCTION
Time series prediction is an important area that has attracted the attention of researchers from different fields, such as business, economics, finance, science, and engineering [1], [2].
In this study, we propose an efficient ML structure for time series prediction problems that provides considerably higher accuracy and lower time complexity than conventional algorithms such as long short-term memory (LSTM) networks and convolutional neural networks (CNNs). In addition, the proposed algorithm is capable of finding accurate solutions for different types of problems without changing the hyperparameter set or the network structure.
In recent decades, ML methods, including artificial neural network (ANN) models, have attracted growing attention in the domain of time series forecasting and have been widely used in place of various traditional time series models. ANN models are intended to resolve non-linear functional dependencies between past time series data and their future values [6]. ANN models can be classified by network structure into feed-forward and recurrent neural networks [7]. The most widely used feed-forward neural networks in time series prediction are multilayer perceptrons (MLPs) [5]. The MLP structure demands a large number of parameters to solve complex non-linear problems, which results in slow learning and poor generalization [5]. Achieving better accuracy in time series prediction requires NN models that are adaptive to changes occurring in the data over time. Several neural networks (NNs) and their modified models have been applied to non-linear time series prediction to overcome these drawbacks [8]–[15].
Convolutional neural networks (CNNs) [16] are widely used for learning non-linear mapping functions from complex data. They can be applied to a variety of problems, mapping inputs such as images or time series to the desired outputs. CNNs can learn and extract the most important features owing to their special convolution operations.
Recurrent neural networks (RNNs) possess an internal memory that makes them capable of incorporating changes through internal recurrence [17]. RNNs are computationally more powerful than feed-forward networks. Despite the efficiency of NN, CNN, and RNN models in time series prediction, two main problems can be identified. First, the performance of these networks depends strongly on their architecture and hyperparameters. Second, the appropriate design of CNN, RNN, and NN models becomes more difficult depending on the nature of the time series data. The prediction performance is therefore determined by the choice of appropriate network parameters.
In this study, we propose a predictive error compensated wavelet neural network (PEC-WNN) model consisting of two NNs. The motivation for using two separate NNs comes from the following observations. First, forecasting models face growing uncertainties, such as the lack of information for making more accurate predictions and the accumulation of errors. A well-known drawback of recursive methods is their sensitivity to estimation errors, since predicted values are fed back into the model in place of the target values [18]. In contrast, in the proposed approach the models are trained independently and are hence not prone to accumulating errors. We show that compensating the prediction error through a second NN enhances the overall prediction performance. The PEC-WNN uses time series input data in multiple time windows. Sampled time series data in the moving time window are first transformed into a set of wavelet coefficients using a discrete wavelet transform (DWT) and then fed into the NNs. The DWT is applied separately to each window, analyzing the signal in both the time and frequency domains. The results show that using multi-dimensional time windows improves the prediction performance without increasing the algorithm complexity. The PEC-WNN improves accuracy while at the same time preventing overfitting by taking advantage of the multi-resolution DWT and the NN.
The main contributions of the proposed method can be summarized as follows:
1. The prediction accuracy for chaotic and stochastic time series data is improved using multiple neural networks, where the secondary network is trained on the shifted time series prediction error of the primary network, so that overfitting due to an increase of recurrence-related feedback inputs can be avoided.
2. The same structure and hyperparameter set can be applied to a broad range of time series prediction problems using moving frames at multiple scales.
3. In predictive error compensation, applying the DWT yields a better accuracy improvement than feeding the time series data directly to the neural network.
In the next section, we explain the proposed PEC-WNN model for time series prediction. The time series problems, namely the Mackey-Glass chaotic time series, the gas furnace data (series J) of the Box-Jenkins benchmark problem, the Lorenz Attractor time series data, and the drought forecasting problem, are presented in Section III along with the corresponding results. The time and space complexity of the proposed model with respect to models found in the literature is discussed in Section IV. Section V presents the concluding remarks.

II. PREDICTIVE ERROR COMPENSATED WAVELET NEURAL NETWORK MODEL
The predictive error compensated wavelet neural network (PEC-WNN) model used in this study comprises two separate wavelet-preprocessed neural networks, as shown in Fig. 1. The current input is shifted to the previous value using the unit delay operator z^{-1} (see Fig. 1.a). Along with the four consecutive values, we compute the average values of different time intervals obtained by applying the unit delay operator (Fig. 1.b), in the same manner as in Fig. 1.a. The input data consist of two different time windows that are preprocessed in accordance with the time frame using the discrete wavelet transform (DWT). The rationale for applying the DWT is its ability to analyze a signal in both the time and frequency domains. Unlike the Fourier transform (FT), which provides insight only into the frequency content, wavelet analysis automatically adapts itself to a suitable resolution and overcomes the limitations of the FT [19], [20].
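To make the windowing concrete, the following NumPy sketch (our illustration, not the authors' code; the helper names and the exact interval bounds are assumptions based on Fig. 1) builds the two input frames from a series: four consecutive delayed values and four averages over delayed intervals.

```python
import numpy as np

def consecutive_frame(x, t, n=4):
    """Four consecutive values x(t), x(t-1), ..., obtained by unit delays (Fig. 1.a)."""
    return np.array([x[t - i] for i in range(n)])

def averaged_frame(x, t, windows=((0, 4), (5, 9), (10, 14), (15, 19))):
    """Averages of delayed intervals [i, j] of the series (Fig. 1.b)."""
    return np.array([x[t - j : t - i + 1].mean() for (i, j) in windows])

# Example: both frames at time step t, concatenated as the raw network input
x = np.sin(0.1 * np.arange(200))   # placeholder series
t = 100
frame = np.concatenate([consecutive_frame(x, t), averaged_frame(x, t)])
```

Each frame is then passed through the DWT described next before it reaches the networks.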
The DWT is a linear signal processing technique that transforms a signal from the time domain to the "wavelet" domain. Wavelets characterize a family of functions generated from a single function ψ(t) by the operations of dilation and translation. The mother wavelet function, localized in both the time and frequency domains, is denoted by ψ(t), the scaling function is denoted by ϕ(t), and the parameters j and k are the scale and translation parameters used to generate the families of functions given in (1) and (2):

ψ_{j,k}(t) = 2^{j/2} ψ(2^j t − k)  (1)

ϕ_{j,k}(t) = 2^{j/2} ϕ(2^j t − k)  (2)
First, we sample the data into a window W_i as in (3):

W_i = [x(i), x(i+1), ..., x(i + l_x − 1)]  (3)

where x(i) is the i-th value of the input and l_x is the length of the signal part covered by the current window. Based on the selected scaling functions and multi-resolution analysis, we obtain the two-scale equations given in (4) and (5):

ϕ(t) = Σ_n g_n √2 ϕ(2t − n)  (4)

ψ(t) = Σ_n h_n √2 ϕ(2t − n)  (5)

Applying the DWT to the current window, we compute the scaling coefficients c_{i,j,k} and the wavelet coefficients d_{i,j,k} using the following equations:

c_{i,j,k} = Σ_n g_{n−2k} c_{i,j+1,n}  (6)

d_{i,j,k} = Σ_n h_{n−2k} c_{i,j+1,n}  (7)

where the coefficients g_{n−2k} and h_{n−2k} are obtained from (4) and (5). Equations (6) and (7) are the mathematical expressions of filtering a signal through low-pass (g[n]) and high-pass (h[n]) filters, which corresponds to convolution with the impulse response of k-tap filters. Subsequently, the signal can be reconstructed by:

c_{i,j+1,n} = Σ_k c_{i,j,k} g_{n−2k} + Σ_k d_{i,j,k} h_{n−2k}  (8)

In this work, the symmetric Haar wavelet function is used as a common wavelet basis function. It beneficially diminishes the distortion rate during signal decomposition and reconstruction, and it also reduces the processing and computational time significantly [21]. Mallat's pyramidal algorithm provides the high- (h_n) and low- (g_n) frequency components of a given signal, which are used for the decomposition of the input signal. The low- and high-frequency components are used together as input to the forecasting model to capture valuable information during the training process. A block diagram of the multilevel wavelet decomposition is presented in Fig. 2, together with the coefficients used as input to the first NN.

The prediction of n-step-ahead time series data is obtained by the main network, which consists of three layers: an input, a hidden, and an output layer. Mathematically, a hidden layer with activation function g(·) and k hidden neurons can be represented as given in (11):

h_j = g(w_j^T x + b_j),  j = 1, ..., k  (11)

where w_j = [w_{j1}, w_{j2}, ..., w_{jk}]^T is the weight vector that connects the jth hidden neuron with the inputs, and b_j is the bias value of the jth hidden neuron. The output of the jth output neuron can be computed as in (12):

o_j = Σ_{i=1}^{k} β_{ji} h_i + b_j  (12)

where β_j = [β_{j1}, β_{j2}, ..., β_{jm}]^T denotes the weight vector connecting the hidden neurons with the jth output neuron, and b_j is the bias value of the jth output neuron. The total number of output neurons is N. The activation function g(·) approximates the relationship between the input x_i and the target output t_i. Consequently, there exist β_i, w_i, and b_i such that:

Σ_{i=1}^{k} β_i g(w_i^T x_j + b_i) = t_j,  j = 1, ..., N  (13)

The employed network uses the Rectified Linear Unit (ReLU) activation function. Compared to the widely used sigmoid and hyperbolic tangent activation functions, the ReLU significantly improves the performance of feed-forward networks [22]. The ReLU returns the value provided as input if that value is greater than zero, and zero otherwise, as given in (14):

g(x) = max(0, x)  (14)
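A minimal one-level-at-a-time Haar decomposition in the spirit of Mallat's pyramidal algorithm is sketched below (our illustration; libraries such as PyWavelets provide equivalent functionality). Approximation (scaling) and detail (wavelet) coefficients are obtained by filtering followed by downsampling by two, and the split is reapplied to the approximation coefficients at each level.

```python
import numpy as np

def haar_dwt_level(signal):
    """One Haar DWT level: low-pass (scaling) and high-pass (wavelet)
    coefficients, i.e. filtering followed by downsampling by two."""
    s = np.asarray(signal, dtype=float)
    even, odd = s[0::2], s[1::2]
    approx = (even + odd) / np.sqrt(2.0)   # scaling coefficients c
    detail = (even - odd) / np.sqrt(2.0)   # wavelet coefficients d
    return approx, detail

def haar_dwt(signal, levels):
    """Multilevel pyramidal decomposition; all resulting coefficients
    are used together as input to the first NN (cf. Fig. 2)."""
    coeffs = []
    approx = np.asarray(signal, dtype=float)
    for _ in range(levels):
        approx, detail = haar_dwt_level(approx)
        coeffs.append(detail)
    coeffs.append(approx)
    return coeffs

# Example: decompose an 8-sample window into 2 levels
window = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
print(haar_dwt(window, levels=2))
```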
Stochastic gradient descent (SGD) is used for optimization, with a learning rate of 0.05 and a momentum of 0.75. The SGD maintains a single learning rate, applied to every network weight, that does not vary during training. The second NN is used to improve the forecasting performance obtained by the first NN. The input data of the second NN are constructed from the DWT-preprocessed prediction errors.
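With the stated settings (learning rate 0.05, momentum 0.75), one update step in the classical momentum form can be sketched as follows (our illustration of the update rule, not the authors' training code):

```python
import numpy as np

LEARNING_RATE = 0.05
MOMENTUM = 0.75

def sgd_momentum_step(weights, gradient, velocity):
    """One SGD-with-momentum update; the same fixed learning rate is
    applied to every weight throughout training."""
    velocity = MOMENTUM * velocity - LEARNING_RATE * gradient
    weights = weights + velocity
    return weights, velocity

# Example: a single parameter vector and its gradient
w = np.zeros(3)
v = np.zeros_like(w)
g = np.array([0.2, -0.1, 0.05])
w, v = sgd_momentum_step(w, g, v)
```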
The prediction error at time (t + 1) is shifted by applying the unit delay operator z^{-1} (Fig. 1.c). The output of the second NN is the prediction of the error at time (t + 1) from the prediction errors at times (t), (t − 1), (t − 2), and (t − 3). Finally, the predicted value from the first NN at time (t + 1) and the predicted error at (t + 1) from the second NN are used together to obtain the compensated predicted value at time (t + 1). The main equations of the proposed model can be expressed as follows:

x_p(t + 1) = f_1(x(t), x(t − 1), x(t − 2), x(t − 3), x̄_{[i,j]}, ...)  (15)

x̄_{[i,j]} = (1 / (j − i + 1)) Σ_{n=i}^{j} x(t − n)  (16)

e(t) = x_p(t) − x(t)  (17)

e_p(t + 1) = f_2(e(t), e(t − 1), e(t − 2), e(t − 3))  (18)

x_c(t + 1) = x_p(t + 1) − e_p(t + 1)  (19)

where x_p(t + 1) is the value predicted by the first NN at time (t + 1), f_1 and f_2 denote the mappings realized by the first and second NNs (including the DWT preprocessing), e(t) is the prediction error, e_p(t + 1) is the predicted error, and x_c(t + 1) is the compensated predicted value.
The four consecutive values (x(t), x(t − 1), x(t − 2), x(t − 3)) in (15) form the input of the first NN. The average values of an interval [i, j] are computed using (16). The input data of the second NN consist of four errors obtained from the predicted values of the first NN using (17). The compensated predicted value is computed by subtracting the predicted error at time (t + 1) from the predicted value at time (t + 1), as given in (19).
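Putting (15)–(19) together, the compensation step can be sketched as below (a schematic illustration; `main_net` and `error_net` are placeholders standing for the two trained networks together with their DWT preprocessing):

```python
def pec_wnn_predict(main_net, error_net, x_hist, e_hist):
    """Compensated prediction at time t+1 following (15)-(19):
    x_c(t+1) = x_p(t+1) - e_p(t+1).

    x_hist: preprocessed inputs of the first NN (consecutive values,
            interval averages, DWT coefficients)
    e_hist: DWT-preprocessed prediction errors e(t), ..., e(t-3)
    """
    x_p = main_net(x_hist)    # prediction of the main network, (15)
    e_p = error_net(e_hist)   # prediction of its error, (18)
    return x_p - e_p          # compensated value, (19)

def prediction_error(x_true, x_pred):
    """e(t) = x_p(t) - x(t); used to build the second NN's training data, (17)."""
    return x_pred - x_true
```

Because the second network is trained on the errors of an already-trained first network, the two models remain independent and errors do not accumulate recursively.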

III. MATERIALS AND RESULTS
The performance of the proposed model is verified on the Mackey-Glass, Box-Jenkins gas furnace (series J), and Lorenz Attractor time series data, and on the global SPEI index for the drought forecasting problem. The data sets are applied to different models: a simple neural network model (hereafter NN), a predictive error compensated neural network model (PEC-NN), a wavelet neural network (WNN), and the predictive error compensated wavelet neural network (PEC-WNN).
The data sets are scaled using the minimum/maximum normalization method given in (20):

x_norm(t) = (x(t) − min(x)) / (max(x) − min(x))  (20)

where x(t) represents the real value, and min(x) and max(x) are the minimum and maximum values of the series. The forecasting performance is evaluated with the root mean square error (RMSE), the mean absolute percentage error (MAPE), and the directional accuracy (DA). The RMSE is given in (21):

RMSE = √((1/n) Σ_{i=1}^{n} (X_obs,i − X_model,i)^2)  (21)

where X_obs is the observed value and X_model is the modeled value at time i. The number of data samples is given by n. The DA is computed as DA = (100/n) Σ_i d_i, where d_i is given by:

d_i = 1 if (X_obs,i − X_obs,i−1)(X_model,i − X_model,i−1) > 0, and d_i = 0 otherwise.

A. THE MACKEY-GLASS CHAOTIC TIME SERIES DATA

The chaotic Mackey-Glass time series data (Fig. 3) have typically been used as a benchmark problem before considering the suitability of a specific approach to real-world forecasting problems [23]. The time series data are generated from the following differential equation (22):

dx/dt = α x(t − τ) / (1 + x^10(t − τ)) − β x(t)  (22)

where x (unitless) is the series in time t and τ is the time delay. The parameters are set as α = 0.2, β = 0.1, and τ = 17. Note that for τ ≥ 17 the time series shows chaotic behavior [24]. The initial condition x(0) = 1.2 is used to generate the data points with the fourth-order Runge-Kutta method and a time step of 0.1. The work in [24] uses non-consecutive values with a constant time interval T = 6 for the prediction of short-term outputs; the inputs x(t − 18), x(t − 12), x(t − 6), and x(t) are used to predict x(t + 6). Out of 1000 samples, the authors used 500 for training the model and 500 for testing its performance. Similarly, [23] considered four sequential input variables to estimate a single output variable at time x(t + 5). Out of 300 samples, half served for training and the remaining half for testing.
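The normalization (20) and the reported metrics can be computed as in the sketch below (our illustration; the DA follows the sign-agreement convention of d_i given above):

```python
import numpy as np

def minmax_scale(x):
    """Min/max normalization of (20): maps the series into [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def rmse(obs, model):
    """Root mean square error over n samples, (21)."""
    return np.sqrt(np.mean((obs - model) ** 2))

def mape(obs, model):
    """Mean absolute percentage error, in percent."""
    return 100.0 * np.mean(np.abs((obs - model) / obs))

def directional_accuracy(obs, model):
    """DA: percentage of steps where the predicted change has the same
    sign as the observed change (d_i = 1 on agreement, else 0)."""
    d = (np.diff(obs) * np.diff(model)) > 0
    return 100.0 * np.mean(d)
```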
Different from previous studies, in this work we construct data sets that contain the averages of different window sizes together with sequential values. The first data set contains only four successive values, obtained by (23). In order to observe the effect of the average values on the forecasting performance, in the following data sets we include the average values of window sizes 5 and 10, obtained using equations (24) and (25), respectively. If the samples are considered as daily data, then by using the average values of size 5 we gain a business-week resolution, and by extending to the four shifted average values we obtain a monthly resolution.
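A sketch of generating the series from (22) with a fourth-order Runge-Kutta step of 0.1 is given below (our illustration; the delayed term is held fixed over each integration step, a common simplification for this benchmark, and the unit-interval sampling is an assumption). The input sets (23)–(25) can then be assembled with the windowing helpers sketched in Section II.

```python
import numpy as np

def mackey_glass(n_samples, a=0.2, b=0.1, tau=17.0, h=0.1, x0=1.2):
    """Generate the Mackey-Glass series of (22) with an RK4 step h = 0.1."""
    d = int(round(tau / h))              # delay expressed in integration steps
    buf = [x0] * (d + 1)                 # constant initial history x(t) = 1.2
    steps_per_sample = int(round(1.0 / h))
    out = []
    x = x0
    for step in range(n_samples * steps_per_sample):
        x_tau = buf[-(d + 1)]            # x(t - tau) from the history buffer
        f = lambda x_now: a * x_tau / (1.0 + x_tau ** 10) - b * x_now
        k1 = f(x)
        k2 = f(x + 0.5 * h * k1)
        k3 = f(x + 0.5 * h * k2)
        k4 = f(x + h * k3)
        x += (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        buf.append(x)
        if (step + 1) % steps_per_sample == 0:
            out.append(x)                # sample at unit time intervals
    return np.array(out)

# Example: 1000 samples, split equally into training and test sets as in [24]
series = mackey_glass(1000)
train, test = series[:500], series[500:]
```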
In our study, the forecasting intervals range from the next value (t + 1), through the fifth (t + 5), the sixth (t + 6), and the forty-second (t + 42), up to the eighty-fourth (t + 84) value. As in [24], 1000 samples are used, divided equally into training and test sets. The constructed data sets are:

X_1(t) = (x(t), x(t − 1), x(t − 2), x(t − 3))  (23)

X_2(t) = (x(t), x(t − 1), x(t − 2), x(t − 3), x̄_{[0,4]}, x̄_{[5,9]}, x̄_{[10,14]}, x̄_{[15,19]})  (24)

X_3(t) = (x(t), x(t − 1), x(t − 2), x(t − 3), x̄_{[0,9]}, x̄_{[10,19]}, x̄_{[20,29]}, x̄_{[30,39]})  (25)

The RMSE results for the constructed data sets and their forecasting time intervals are presented in Tab. 2. The results show that using averages along with consecutive values significantly reduces the error. In addition, the average values used in conjunction with the successive values improve the forecasting performance of the proposed model (PEC-WNN). The selection of the window size has a strong impact that depends on the prediction time interval: small window-size averages give better results for short-term predictions, while expanding the window gives better performance for long-term forecasts. Similarly, the MAPE comparison (Tab. 3) confirms that appending the averages to the consecutive values improves the results and reduces the forecasting errors. The directional accuracy results (DA, Tab. 4) show decent results for all applied models, with the PEC-WNN model leading.

The Mackey-Glass time series results show that the PEC-WNN achieves the lowest RMSE at all forecasting intervals. The second-best result, found in the literature [25], for short-term forecasting (t + 1) is 0.0327. The same forecast with the PEC-WNN yields an RMSE of 0.0013, a 95% reduction. Similar results are observed for the (t + 6) forecasting term, where the best result found in the literature is 0.0055 [26], obtained with the dynamic cell structures and local linear models (DCS-LLM) proposed in [26]. The PEC-WNN RMSE for the time interval (t + 6) is 0.0027, which is 49% better than the results found in the literature. For the long-term forecasting interval (t + 84), the proposed model achieves an RMSE of 0.028, whereas for the same interval Cudy et al. [26] reached an RMSE of 0.03 using the DCS-LLM model.

B. THE BOX-JENKINS TIME SERIES DATA
The Box-Jenkins time series data set is another frequently used benchmark for prediction algorithms [24]. The method refers to the iterative application of a three-stage modeling approach: 1) model identification and selection, 2) parameter estimation, and 3) statistical model checking [29]. The first stage determines the stationarity of the data; plots of the dependent time series data are used to decide which autoregressive or moving average components should be applied. In the second stage, the parameters of the selected model are estimated by maximum likelihood or non-linear least squares. In the last stage, statistical model checking, we examine whether the model satisfies the conditions of a stationary univariate process. The data used in this study are the well-known gas furnace data (series J) prediction problem. The output of the Box-Jenkins gas furnace time series data set is shown in Fig. 4.
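To make the three-stage procedure concrete, a hedged sketch using the statsmodels library is given below (an illustration of the classical methodology, not the authors' pipeline; the data file name is hypothetical, and order selection in practice relies on inspecting the ACF/PACF and information criteria):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller, acf, pacf

y = np.loadtxt("gas_furnace_co2.txt")   # hypothetical file with the series J output

# 1) Identification: check stationarity and inspect ACF/PACF
print("ADF p-value:", adfuller(y)[1])
print("ACF:", acf(y, nlags=10))
print("PACF:", pacf(y, nlags=10))

# 2) Estimation: fit a candidate ARMA(p, q) model by maximum likelihood
model = ARIMA(y, order=(2, 0, 1)).fit()

# 3) Checking: residuals of an adequate model should behave like white noise
print(model.summary())
```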
The inputs proposed in the literature, given in (26), are applied first. Subsequently, we use successive values of the methane gas flow (27) to forecast the next value. We expand the investigation by examining how an increased amount of input data, together with its average values, affects the forecasting performance. For that purpose, we apply the averages of window sizes five and ten together with four successive values of the methane gas flow and the CO2 concentration in the gas. Note that the forecast value is always the next (t + 1) value of the CO2 concentration in the gas, while the input data set differs.
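A sketch of assembling such input sets is given below. Since equations (26)–(29) are not reproduced here, the exact forms are assumptions: the literature input pair for series J is commonly taken as the CO2 concentration y(t − 1) and the methane flow u(t − 4), and the extended set follows the description above.

```python
import numpy as np

def bj_inputs_literature(u, y, t):
    """Commonly used series-J inputs: y(t-1) and u(t-4) to predict y(t)
    (assumed form of (26))."""
    return np.array([y[t - 1], u[t - 4]])

def bj_inputs_extended(u, y, t):
    """Four successive methane-flow values plus the CO2 concentration,
    extended with window-size-5 averages of u (assumed form)."""
    consec = [u[t - i] for i in range(4)] + [y[t - 1]]
    avg5 = [u[t - j : t - i + 1].mean() for (i, j) in ((0, 4), (5, 9), (10, 14), (15, 19))]
    return np.array(consec + avg5)
```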

C. THE LORENZ ATTRACTOR
The Lorenz Attractor represents a classical multivariate time series prediction problem consisting of the three ordinary differential equations given in (30)–(32):

dx/dt = σ(y − x)  (30)

dy/dt = x(ρ − z) − y  (31)

dz/dt = xy − βz  (32)
The equations are derived from the Navier-Stokes equations used in fluid mechanics. The parameter settings that exhibit chaotic behavior are σ = 10, β = 8/3, and ρ = 28, with initial conditions [x(0), y(0), z(0)] = [0, 1, 1.05], as studied by Lorenz [35]. Different Lorenz maps with the same general dynamics can be obtained by using different initial conditions and parameter values. The Lorenz map is given in Fig. 5.
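The trajectories can be generated by numerically integrating (30)–(32); a fourth-order Runge-Kutta sketch with the stated parameters and initial conditions follows (our illustration; the step size dt is an assumption, as [35] fixes only the parameters and initial conditions):

```python
import numpy as np

def lorenz_rk4(n_steps, dt=0.01, sigma=10.0, beta=8.0 / 3.0, rho=28.0,
               state0=(0.0, 1.0, 1.05)):
    """Integrate the Lorenz system (30)-(32) with a fourth-order
    Runge-Kutta scheme; returns an (n_steps, 3) trajectory."""
    def f(s):
        x, y, z = s
        return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

    traj = np.empty((n_steps, 3))
    s = np.array(state0, dtype=float)
    for i in range(n_steps):
        k1 = f(s)
        k2 = f(s + 0.5 * dt * k1)
        k3 = f(s + 0.5 * dt * k2)
        k4 = f(s + dt * k3)
        s = s + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        traj[i] = s
    return traj

# Example: the 10,000-sample multivariate data set
data = lorenz_rk4(10_000)
```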
The data set contains 10,000 multivariate data samples. The interdependencies between the time series can be seen from the plots of each trajectory (Fig. 6). Xiu et al. [36] applied a multivariate data set as input to predict a single variable as the output. We applied similar single-variable and multivariate inputs to our model. The input sets are given in (33)–(35); the output is the next (t + 1) value of a single variable:

X_1(t) = (x(t), y(t), z(t))  (33)

X_2(t) = (x(t), ..., x(t − 3), y(t), ..., y(t − 3), z(t), ..., z(t − 3))  (34)

X_3(t) = (x(t), ..., x(t − 3), y(t), ..., y(t − 3), z(t), ..., z(t − 3), x̄_{[0,9]}, x̄_{[10,19]}, x̄_{[20,29]}, x̄_{[30,39]}, ȳ_{[0,9]}, ȳ_{[10,19]}, ȳ_{[20,29]}, ȳ_{[30,39]}, z̄_{[0,9]}, z̄_{[10,19]}, z̄_{[20,29]}, z̄_{[30,39]})  (35)

The constructed data sets are divided into training and test sets, with 80% of the data used for training and 20% for testing. A significantly low RMSE (Tab. 10) is achieved using only one previous value of each trajectory to forecast the next value. As in the previous experiments, we observe the impact of adding average values over different time windows to the consecutive values. The growth in the number of variables, from a single variable to multivariate input, increases the RMSE. On the other hand, using multivariate average values together with consecutive multivariate values reduces the RMSE in comparison to the successive multivariate input alone. The lowest MAPE (Tab. 11) is obtained by the PEC-WNN model. The best result for the Lorenz Attractor data set using the multivariate input data is an RMSE of 0.0013 for the PEC-WNN model, which is 64% lower than that of a similar experiment found in the literature, Xiu et al. [36]. The Lorenz multivariate time series data with their natural structure used as input outperform the predictions in which a single-variable sequence is used.

D. THE DROUGHT FORECASTING
In this section, we demonstrate the performance of the proposed model when it is applied to stochastic time series data. The benchmark problems explained previously, the Mackey-Glass, the Box-Jenkins gas furnace (series J), and the Lorenz Attractor, represent chaotic problems with deterministic models: their output can be determined from the underlying mathematical models when the initial conditions and the model parameters are known. For performance comparison, the proposed PEC-WNN model is also applied to a stochastic problem. The standardized precipitation-evapotranspiration index (SPEI) drought index developed by Begueria et al. [38] is selected for that purpose. Drought identification and forecasting are very important for limiting drought effects. However, accurate drought prediction remains a scientific challenge due to the nature of the data. The SPEI is an index that quantifies the drought condition over a given area. The index can be calculated on several time scales to adapt to the characteristic drought response time of the target natural and economic systems, determined by their drought resistance [38]. The data set evaluates accumulated precipitation minus potential evapotranspiration (PET) over multiple time scales between 1 and 48 months, with global coverage at a 0.5-degree resolution [39]. The advantages of the used data set are that (a) it improves the spatial resolution of the unique global drought data set at a global scale; (b) it is spatially and temporally comparable to other data sets, given the probabilistic nature of the SPEI; and (c) it enables the identification of various drought types, given the multiscalar character of the SPEI [39]. The analyzed period is from January 1901 until December 2015. The 1-month, 4-month, and 6-month data are used (Fig. 7). The input-output relations for the prediction model are given in (36) and (37). The input data sets consist of eight inputs with different window sizes: the first data set contains one-month and four-month data (36), and the second contains one-month and six-month data (37). In both cases, we forecast the drought six months ahead. In the equations, x_t represents the SPEI values of the one-month data, x_4t the SPEI values of the four-month data, and x_6t the SPEI values of the six-month data:

(x_t, x_{t−1}, x_{t−2}, x_{t−3}, x_{4t}, x_{4t−1}, x_{4t−2}, x_{4t−3}) → x_{t+6}  (36)

(x_t, x_{t−1}, x_{t−2}, x_{t−3}, x_{6t}, x_{6t−1}, x_{6t−2}, x_{6t−3}) → x_{t+6}  (37)

The PEC-WNN model contains the same hyperparameters and the same number of inputs as applied to the previous chaotic time series problems. In addition to the previously mentioned models, for the drought forecasting problem we also used the LSTM model proposed in [13] and multivariate linear regression (LR) for performance comparison. The proposed PEC-WNN achieves a significantly low RMSE (Tab. 14) with the monthly and four-month data used as inputs, as given in (36). The PEC-WNN model also provides the lowest MAPE, as seen in Tab. 15.
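A sketch of assembling the eight-dimensional inputs of (36) and (37) is given below (our illustration; the exact lag pattern is an assumption based on the description in the text):

```python
import numpy as np

def spei_inputs(spei1, spei_k, t):
    """Eight inputs of (36)/(37): four values of the 1-month SPEI and
    four values of the k-month SPEI (k = 4 or 6), used to forecast the
    SPEI six months ahead."""
    return np.array([spei1[t - i] for i in range(4)] +
                    [spei_k[t - i] for i in range(4)])

# Example: data set (36) pairs the 1-month with the 4-month SPEI;
# the target is the 1-month SPEI six months ahead, spei1[t + 6].
# x_in = spei_inputs(spei_1month, spei_4month, t)
```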
The results of forecasting the SPEI index show reasonable prediction accuracy for a six-month time scale, considering the uncertainty level of the stochasticity. Increasing the scale of the SPEI input data from four-month to six-month averages does not improve the performance of the proposed method. Using SPEI data evaluated at different time scales simultaneously increases the performance of the proposed method.

IV. DISCUSSION
The PEC-WNN, a time series prediction model in which a separate NN is used for predictive error correction of the main NN, has been applied to different kinds of deterministic, chaotic, and stochastic time series problems. To demonstrate its prediction performance, the introduced method has been compared to twenty time series prediction methods found in the literature, and the PEC-WNN model demonstrates the lowest RMSE. Predictive error compensation alone reduces the RMSE, but when applied together with the wavelet transform as a preprocessing mechanism it surpasses the other applied methods and those found in the literature. The PEC-WNN has been applied to different problems without changing the network structure or hyperparameters. Although the PEC-WNN uses two NNs in its structure, it is less computationally expensive and less time-consuming than the other ML methods found in the literature; its complexity in terms of the number of parameters is relatively low (Tab. 18). The results are consistent with one of the conclusions in [40], which states that simple models tend to outperform complex models. The proper arrangement of the input data sets can significantly improve the forecasting performance of the proposed model: the results show that input data frames of different sizes used together with consecutive values improve the forecasting performance.

V. CONCLUSION
In this work, a predictive error compensated wavelet-preprocessed NN model for time series prediction problems is proposed. The model consists of at least two separate NNs, in both of which the input data are preprocessed using the DWT. It has been demonstrated that the second, predictive error compensating network significantly improves the overall accuracy of the proposed model on all benchmark problems. The Mackey-Glass, Box-Jenkins, and Lorenz Attractor problems are used to evaluate the prediction performance for the chaotic time series case, and the global drought forecasting problem for the stochastic case. The results show that the PEC-WNN model provides 64% less RMSE for the Lorenz Attractor, 78% less RMSE for the Box-Jenkins, and 95% less RMSE for the Mackey-Glass benchmark problems. The proposed method also achieves reasonable results in forecasting the global drought SPEI index. An additional advantage of the proposed model is its low sensitivity to hyperparameter and structural settings over a broad range of time series prediction problems; the same PEC-WNN network structure has been used in all of the given benchmark evaluations. Both the time and space complexity of the proposed model were lower than those of the other compared machine learning methods in all cases. Although the proposed PEC-WNN method has demonstrated promising results, further improvements can be achieved through fusion with additional cascaded predictive error compensating networks for multidimensional data sets.
BURAK BERK USTUNDAG (Member, IEEE) received the B.Sc. degree in electrical engineering and the M.Sc. and Ph.D. degrees in control systems and computer engineering from Istanbul Technical University (ITU). He is currently a Professor with the Computer Engineering Department, Faculty of Computer and Informatics, Istanbul Technical University (ITU). He has served as a Science and Technology Advisor to ministers and governmental institutions for more than 15 years. He has more than 100 scientific publications and patents. His research interests include data fusion, artificial intelligence, signal processing, global optimization, cognitive communication, and agricultural information systems. He is a member of the IEEE Communications Society and the IEEE Computational Intelligence Society.

She is also a Research Assistant with the Computer Engineering Department, Graduate School of Science Engineering and Technology, Istanbul Technical University. She has published papers at international conferences and in journals on data fusion. The published works are mainly about data fusion models for time series, remotely sensed, and ground-based measurement data of agricultural sensor networks, using preprocessing techniques such as wavelet transformation together with neural networks for estimation. She is interested in further research on the development of deep wavelet networks for estimation. Her current primary fields of investigation are the forecasting of time series data and natural events, data fusion, and machine learning techniques.