Deep Hybrid Neural Network and Improved Differential Neuroevolution for Chaotic Time Series Prediction

Chaos is widespread in non-linear systems such as finance, energy, and weather. In a chaotic system, a variable changing over time generates a chaotic time series, which contains a wealth of information about the non-linear system and helps us analyze and understand it. Traditional hybrid models for chaotic time series prediction based on neural networks suffer from low prediction accuracy and difficulty in determining network topologies. In recent years, chaotic time series prediction has attracted the attention of researchers in the area of deep learning. In this paper, we use a deep hybrid neural network (DHNN) based on a convolutional neural network (CNN), a gated recurrent unit (GRU) network, and an attention mechanism to predict chaotic time series, and we use the idea of neuroevolution to optimize the topologies of the DHNN. In the DHNN, the CNN captures spatial features from the phase space reconstruction of the chaotic time series; these spatial features are then combined with the original chaotic time series. The GRU extracts spatio-temporal features from the combined sequence, and an attention mechanism with a non-linear activation function is designed to capture critical spatio-temporal features. In addition, we propose an improved differential evolution (IDE) algorithm to optimize the topologies of the DHNN, including the filter sizes of the CNN and the numbers of hidden neurons of the GRU. The IDE uses an adaptive mutation operator and a dynamic chaos crossover operator, which improve convergence speed and reduce optimization time. We use the theoretical Lorenz dataset, the monthly mean total sunspot number dataset, and an actual coal-mine gas concentration dataset to verify the prediction accuracy of the proposed model. Experimental results show that the proposed prediction model performs well in chaotic time series forecasting.


I. INTRODUCTION
Chaotic time series prediction (CTSP) is involved in various domains of the social and natural sciences, such as copper metal prices, oilfield water injection, wind power, and rainfall [1]–[4]. Over the last decade, CTSP has been applied to the study of blood glucose, disease, and gait in humans [5]–[7]. CTSP has also been applied to cyber-information tasks such as retweeting [8], information diffusion [9], and DoS and DDoS attack detection [10].
The associate editor coordinating the review of this manuscript and approving it for publication was Hisao Ishibuchi .
The application of CTSP in the real world is becoming more significant and more widespread.
Theoretical and empirical studies reported in the literature suggest that the hybrid model is one of the best ways to improve the accuracy of time series forecasting [11], [12]. A hybrid network model combining a support vector machine (SVM) and an echo state mechanism (ESM) was proposed for CTSP [13]. Ardalani et al. [14] proposed a hybrid Elman-NARX neural network to forecast chaotic time series. Said Jadid et al. developed an unscented Kalman filter and NARX neural network to analyze and predict the Lorenz time series [11]. Combined with a smoothing approach considering entropic information, a noisy forecast method was applied to chaotic rainfall time series [4]. Nhabangue et al. proposed a functional link extreme learning machine for CTSP [15], and Xu et al. applied a hybrid regularized echo state network to forecast multivariate chaotic time series [16]. Yan et al. developed a hybrid empirical mode decomposition and neural network model for maritime time series prediction [17]. These hybrid models have performed well in CTSP. More recently, deep learning algorithms such as the long short-term memory neural network (LSTM) [18], the convolutional neural network (CNN) [19], and hybrid CNN-LSTM neural networks have been applied to CTSP [20]–[22]. Yan Li et al. applied hybrid empirical mode decomposition, adaptive regrouping, and LSTM to forecast port cargo throughput time series [23]. Compared to hybrid machine learning models, hybrid deep learning models have better performance [22].
In the last few years, a particular hybrid model named neuroevolution has once again caught the attention of researchers. Unlike other hybrid models, neuroevolution can be used to design neural networks [24], [25]. Genetic algorithms (GA) and evolution strategies (ES) have yielded excellent performance in optimizing the topology and weights of neural networks [26], [27]. Through neuroevolution, we can determine the appropriate network structure for a specific problem and achieve excellent predictive performance [28]. At present, the research on evolutionary algorithms for neuroevolution is continuously developing.
In previous studies, we observed that the attention mechanism has achieved exceptional performance on sequence-model tasks, such as machine translation and textual entailment [29], [30]. The attention mechanism can extract key spatio-temporal features from spatial and temporal features [29], and it can also solve some long-term memory problems [31]. Thus, it is taken into account when using neural networks to extract temporal features from sequence models. More recently, the attention mechanism has been applied to time series prediction. Youru et al. introduced an evolutionary attention learning approach to transfer shared parameters of LSTM [32], and a multistage attention network was designed to capture the influence information and the variation law of data over time [33]. Yao et al. proposed a dual-stage attention-based recurrent neural network (DA-RNN) to address long-term temporal dependencies and select the relevant driving series for prediction [34]. Yeqi et al. developed a dual-stage two-phase attention-based recurrent neural network (DSTP-RNN) for long-term and multivariate time series prediction, which can capture spatio-temporal correlations and long-term temporal dependencies [35]. In deep learning, an attention mechanism with function mapping was designed to capture mutation information in the target time series, which can process the fusion of historical hidden state and cell state information for LSTM [36].
The hybrid models mentioned in the previous literature considered only the spatial or the temporal characteristics of chaotic time series. In this paper, a deep hybrid neural network based on deep learning is proposed for CTSP, which considers both spatial and temporal characteristics. In the proposed model, spatial characteristics are acquired by a CNN, and a gated recurrent unit (GRU) [37] is used to extract temporal characteristics. We apply the differential evolution (DE) [38] algorithm to design the topologies of the hybrid neural network and to search for appropriate time-steps for forecasting. However, we observed that simple DE and adaptive DE [39] have slow convergence and long running times. Thus, we improve DE by changing the mutation operator and crossover operator; Section II describes the specific improvements in detail. Finally, the attention mechanism is used to extract spatio-temporal features within the hybrid deep neural network.
The paper is organized as follows. We introduce the hybrid neural network, CNN, GRU, attention mechanism, and improved DE in section II. In section III, we describe the specific details of the experiments. In section IV, we analyze and discuss the experimental results. The conclusion is summarized in section V.

II. HYBRID MODEL
As shown in Fig.1, various kinds of neural networks play different roles in the proposed hybrid model. The reconstructed phase space of the chaotic time series contains spatial features of the chaotic system, while the original sequence also contains rich temporal features. Therefore, the CNN model is used to capture the spatial features of the reconstructed phase space, and the GRU extracts the spatio-temporal features conditioned on the spatial features. The attention model is used to capture the critical spatio-temporal features. Meanwhile, we use the improved DE to design the topology of the hybrid network, including the kernel sizes of the CNN and the number of hidden neurons of the GRU neural network. In this paper, we make one-step predictions, and the time-steps (the number of data points used to forecast) often affect the prediction accuracy. Therefore, we use the improved DE to search for a fitting lookback for forecasting. The details are described as follows.

A. CONVOLUTIONAL NEURAL NETWORK
A convolutional neural network (CNN) is a specialized kind of deep learning neural network that can process data with a known grid-like topology [19], [40]. CNNs are widely used in the fields of time series analysis, computer vision, and natural language processing [40]. CNNs can be categorized into 1-D (one-dimensional), 2-D, and 3-D convolutions according to the data streams they process. In this paper, we use a 1-D convolutional neural network, which is widely used in time series analysis and natural language processing [41]. As shown in Fig.2(a), a simple 1-D CNN consists of the essential input and output layers, the convolutional and pooling layers, which are the most critical, and a necessary fully connected layer. In a 1-D CNN, the function of convolution can be understood as extracting the translation features of the data along a particular direction, where the essence of the convolution operation is a sliding multiplication and summation, expressed by the following formula:

y(k) = Σ_{i=1}^{N} u(i) · h(k − i + 1)

where y, h, and u are series. As shown in Fig.2(b), h and u are rows of a multivariate time series, and they are convolved from top to bottom; k represents the step of the convolution, and N is the length of u.
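The sliding multiply-and-sum described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the function name and the averaging kernel in the example are ours, and, as in deep learning frameworks, the kernel is applied without flipping (cross-correlation).

```python
import numpy as np

def conv1d_valid(u, h):
    """1-D 'valid' convolution as a sliding multiply-and-sum.

    u : input series of length N
    h : kernel (filter) of length M <= N
    Returns a series of length N - M + 1; the k-th output is the dot
    product of h with the k-th window of u (no kernel flip, as in CNNs).
    """
    u, h = np.asarray(u, float), np.asarray(h, float)
    N, M = len(u), len(h)
    return np.array([np.dot(u[k:k + M], h) for k in range(N - M + 1)])

# Example: a length-2 averaging kernel slid over a short series
y = conv1d_valid([1, 2, 3, 4], [0.5, 0.5])  # -> [1.5, 2.5, 3.5]
```

Each output point averages one window of the input, which is exactly the "translation feature in a particular direction" the text refers to.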

B. GATED RECURRENT UNIT
As an extension of the feed-forward neural network, the recurrent neural network (RNN) can handle variable-length sequence data via hidden state units [42]. However, the vanishing and exploding gradient problems limit the training of RNNs [43]. The long short-term memory network and the gated recurrent unit were proposed to solve these problems. LSTM and GRU are gated recurrent neural networks, which use various gates to capture long-term dependencies in sequence data. LSTM has three gates: an input gate, an output gate, and a forget gate. Unlike LSTM, GRU has two gates. As shown in Fig.3, GRU uses an update gate u to control the forgetting factor and the decision to update the state unit simultaneously, while a reset gate r controls how much historical information to forget. The update equations are the following:

u^{<t>} = σ(W_u x^{<t>} + U_u h^{<t−1>} + b_u)
r^{<t>} = σ(W_r x^{<t>} + U_r h^{<t−1>} + b_r)
h̃^{<t>} = tanh(W_h x^{<t>} + U_h (r^{<t>} ⊙ h^{<t−1>}) + b_h)
h^{<t>} = (1 − u^{<t>}) ⊙ h^{<t−1>} + u^{<t>} ⊙ h̃^{<t>}

where W, U, and b stand for weights and biases, σ is the sigmoid function, tanh is the hyperbolic tangent activation function, ⊙ denotes element-wise multiplication, h^{<t>} is the output of the GRU cell, and t represents the current time step.
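A single GRU step can be sketched directly from standard gate equations. This is a toy NumPy sketch of the standard GRU cell (Cho et al.'s convention for the update gate), not the paper's Keras implementation; all variable names, shapes, and the random initialization are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, W, U, b):
    """One GRU time step.

    W, U, b are dicts keyed 'u' (update), 'r' (reset), 'h' (candidate);
    W[k] has shape (hidden, input), U[k] has shape (hidden, hidden).
    """
    u = sigmoid(W['u'] @ x_t + U['u'] @ h_prev + b['u'])              # update gate
    r = sigmoid(W['r'] @ x_t + U['r'] @ h_prev + b['r'])              # reset gate
    h_tilde = np.tanh(W['h'] @ x_t + U['h'] @ (r * h_prev) + b['h'])  # candidate state
    return (1.0 - u) * h_prev + u * h_tilde                           # new hidden state

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = {k: 0.1 * rng.normal(size=(n_hid, n_in)) for k in 'urh'}
U = {k: 0.1 * rng.normal(size=(n_hid, n_hid)) for k in 'urh'}
b = {k: np.zeros(n_hid) for k in 'urh'}
h = gru_cell(rng.normal(size=n_in), np.zeros(n_hid), W, U, b)
```

Because the new state is a convex combination of the old state and the candidate, gradients flow through the `(1 - u)` path largely unattenuated, which is how the GRU mitigates the vanishing-gradient problem described above.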

C. ATTENTION MECHANISM
The human brain tends to focus on certain parts of a thing when observing it, and these focused parts are the key to acquiring information from the thing, which is very useful for recognizing similar things. The attention mechanism is a method that mimics this cognitive process [44]. The attention mechanism has been applied to computer vision and natural language processing [29], [30], and here we apply it to the analysis of chaotic time series. In CTSP, we use the CNN to extract spatial features from the reconstructed phase space of the chaotic time series, and then use the GRU to extract spatio-temporal features based on the spatial features. However, prediction accuracy is affected by too many or non-critical features. Thus, we apply the attention mechanism to extract the key features from the hybrid CNN-GRU model. As shown in Fig.4, the attention mechanism is a crucial feature extractor that performs a weighted sum operation. It gives high weight to important features and weakens useless features. The vector c contains the extracted key features, and its formula is as follows:

c = Σ_{i=1}^{m} β_i v_i

where m is the number of input time-steps of the GRU, v_i is the feature vector output by the GRU at step i, and β_i represents the weight of the vector v_i.
In order to obtain β, we add a small neural network a(v) with a softmax activation function to the attention model; the formula is as follows:

β_i = exp(e_i) / Σ_{k=1}^{m} exp(e_k)

where e_i = a(v_i).
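The weighted-sum operation above amounts to a softmax over scores followed by a matrix-vector product. The sketch below assumes a generic scoring function in place of the paper's small network a(v); the names and the toy scoring lambda are ours.

```python
import numpy as np

def softmax(e):
    e = np.asarray(e, float)
    z = np.exp(e - e.max())          # subtract max for numerical stability
    return z / z.sum()

def attention_context(V, score):
    """Weighted sum of GRU outputs.

    V     : (m, d) matrix, one feature vector v_i per time step
    score : callable v_i -> scalar e_i (stand-in for the small network a(v))
    """
    e = np.array([score(v) for v in V])
    beta = softmax(e)                # attention weights beta_i, sum to 1
    c = beta @ V                     # context vector c = sum_i beta_i * v_i
    return c, beta

V = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
c, beta = attention_context(V, score=lambda v: v.sum())  # toy scoring function
```

The softmax guarantees that the weights β_i are positive and sum to one, so c is a convex combination that emphasizes the highest-scoring time steps.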

D. DIFFERENTIAL EVOLUTION AND ITS IMPROVEMENT
The differential evolution algorithm is a stochastic heuristic algorithm that is simple to use and has strong robustness and global search ability [45]. Rainer Storn and Kenneth Price proposed the original algorithm and several variants of differential evolution [31], [38], [39], [46], [47], defining the notation DE/x/y/z, where x specifies the mutation method, y the number of difference vectors, and z the crossover method.

1) STANDARD DE/BEST/1/BIN
In this paper, we use standard DE/best/1/bin [46] as the underlying algorithm template, in which the mutation method uses the best population individual to generate vectors, and bin indicates that DE obtains the trial population using the binomial-distribution crossover method. In DE, we denote the population size, the number of generations, and the mutation and crossover operators by NP, G, F, and CR, respectively. For each D-dimensional target vector, the calculation equations are the following:

v_{i,G+1} = x_{best,G} + F · (x_{r1,G} − x_{r2,G})  (9)

u_{ji,G+1} = v_{ji,G+1}, if randb(j) ≤ CR or j = rnbr(i)  (10)
u_{ji,G+1} = x_{ji,G}, otherwise  (11)

where x_{i,G} is the ith population individual of generation G; the mutant vector v_{i,G+1} is generated according to (9); x_{best,G} is the best individual of generation G; r1 and r2 are random indexes in the range {1, 2, ..., NP}; and F is a constant operator ∈ [0, 2], which determines the magnification ratio of the difference vector. As shown in (10) and (11), the trial vector u_{i,G+1} is selected from the mutant vector v_{ji,G+1} and the original vector x_{ji,G}. CR is a constant operator ∈ [0, 1], randb(j) is the jth evaluation of a uniform random number generator with outcome in [0, 1], and rnbr(i) is a randomly selected index ∈ {1, 2, ..., D}.
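One generation of DE/best/1/bin mutation and binomial crossover, i.e. Eqs. (9)-(11), can be sketched as follows. This is an illustrative NumPy sketch under a toy sphere objective, not the paper's implementation (the paper uses the geatpy toolbox); the function name and default F, CR values are ours.

```python
import numpy as np

def de_best_1_bin_trial(pop, fitness, F=0.5, CR=0.9, rng=None):
    """Generate one trial population with DE/best/1/bin.

    pop     : (NP, D) current population
    fitness : (NP,) fitness of each individual (lower is better)
    """
    rng = rng or np.random.default_rng()
    NP, D = pop.shape
    best = pop[np.argmin(fitness)]                       # x_best,G
    trial = pop.copy()
    for i in range(NP):
        r1, r2 = rng.choice([j for j in range(NP) if j != i], size=2, replace=False)
        v = best + F * (pop[r1] - pop[r2])               # mutation, Eq. (9)
        jrand = rng.integers(D)                          # forced index rnbr(i)
        mask = rng.random(D) <= CR                       # randb(j) <= CR
        mask[jrand] = True                               # binomial crossover, Eqs. (10)-(11)
        trial[i] = np.where(mask, v, pop[i])
    return trial

rng = np.random.default_rng(1)
pop = rng.normal(size=(6, 3))
fit = (pop ** 2).sum(axis=1)                             # toy sphere objective
new = de_best_1_bin_trial(pop, fit, rng=rng)
```

The forced index `jrand` guarantees that each trial vector inherits at least one component from the mutant, as required by Eq. (10).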

2) IMPROVED DIFFERENTIAL EVOLUTION ALGORITHM
In DE/best/1/bin, F and CR are real constant factors, which are difficult to choose during the search process. Adaptive DE (ADE) [39], [48], [49] provides a way to solve this problem; it uses an adaptive strategy over generations to choose F, expressed as [49]:

F = F_0 · 2^λ,  λ = e^{1 − G_m / (G_m + 1 − G)}  (12)

where F_0 is the original mutation operator, G_m is the maximum generation, G is the current generation with 1 ≤ G ≤ G_m, and F decreases toward and eventually equals F_0. It is efficient to choose a proper mutation operator in this way. However, we found that F tends to a large value when F_0 is large, which affects the efficiency of the search process. Thus, we improved (12) to obtain the mutation operator (13), which has the same parameters and limiting behavior but varies over a smaller range. As shown in Fig.5, AF is the operator described by Eq. (12) and IF is the mutation operator computed via Eq. (13). It is apparent that AF varies over a wide range, while IF varies over a smaller range, which not only maintains population diversity in the initial stage but also ensures search efficiency.
In this paper, we use the Logistic chaotic mapping equation to compute CR. The chaotic disturbance not only allows CR to control the crossover probability and diversify the population but also accelerates convergence. The chaotic CR is calculated from the following formula:

CR_{G+1} = μ · CR_G · (1 − CR_G)  (14)

where μ is a parameter; Eq. (14) is chaotic when μ is close to 4, and following the literature we set μ = 4. The change curve of CR is shown in Fig.5.

3) OPTIMIZATION OF HYBRID NEURAL NETWORK USING IMPROVED DIFFERENTIAL EVOLUTION
In this paper, we use the improved differential evolution algorithm to optimize the topologies and time-steps of the hybrid neural network. In the optimization, the mean square error (MSE) is used as the evaluation criterion to select the best individual; that is, MSE is the fitness function, which is computed as:

MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²  (15)

where y_i and ŷ_i stand for the raw and predicted values, and n is the number of predicted points.
Algorithm 1 describes the process of IDE optimizing the hybrid neural network.

Algorithm 1 Improved Differential Evolution Optimizes Hybrid Neural Network
Step 1: Set the control parameters: mutation factor F_0, crossover operator CR_0, population size NP, and maximum generation MAX_G.
Step 2: Randomly initialize a population of NP individuals (c1, c2, g1, g2, g3, l), where c1 and c2 are the sizes of the CNN filters, g1, g2, and g3 are the numbers of neurons of the GRU layers, and l is the time-steps for forecasting. Set the generation number G = 1.
Step 3:
while G ≤ MAX_G do
    for each individual in the population do
        Generate a trial individual by mutation and crossover, updating F via (13) and CR via (14)
        Train the hybrid neural network encoded by the trial individual and evaluate its fitness via (15)
        Compare the trial individual with the original one; the one which has the smaller value will be selected
    end for
    G = G + 1
end while
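Algorithm 1 can be sketched end to end once the DHNN training step is replaced by a stand-in fitness function. In the sketch below, the fitness is a hypothetical toy function (distance to an arbitrary "good" configuration), the bounds are invented, and the adaptive F follows Eq. (12) rather than the paper's narrower-range Eq. (13); in the paper, the fitness would train the DHNN and return the MSE of Eq. (15).

```python
import numpy as np

def ide_optimize(fitness, bounds, NP=10, MAX_G=20, F0=0.5, CR0=0.3, seed=0):
    """Sketch of Algorithm 1: IDE search over (c1, c2, g1, g2, g3, l).

    fitness : maps an integer hyperparameter vector to a scalar (lower is better)
    bounds  : list of (low, high) integer bounds, one pair per hyperparameter
    """
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds)
    D = len(bounds)
    pop = rng.integers(bounds[:, 0], bounds[:, 1] + 1, size=(NP, D))  # Step 2
    fit = np.array([fitness(x) for x in pop])
    CR = CR0
    for G in range(1, MAX_G + 1):                                     # Step 3
        F = F0 * 2.0 ** np.exp(1.0 - MAX_G / (MAX_G + 1.0 - G))       # adaptive F (Eq. (12) here)
        CR = 4.0 * CR * (1.0 - CR)                                    # chaotic CR, Eq. (14)
        best = pop[np.argmin(fit)]
        for i in range(NP):
            r1, r2 = rng.choice([j for j in range(NP) if j != i], size=2, replace=False)
            v = best + F * (pop[r1] - pop[r2])                        # DE/best/1 mutation
            jrand = rng.integers(D)
            mask = rng.random(D) <= CR
            mask[jrand] = True                                        # binomial crossover
            u = np.where(mask, v, pop[i])
            u = np.clip(np.rint(u), bounds[:, 0], bounds[:, 1]).astype(int)
            fu = fitness(u)
            if fu < fit[i]:                                           # greedy selection by Eq. (15)
                pop[i], fit[i] = u, fu
    return pop[np.argmin(fit)], fit.min()

# Hypothetical stand-in fitness: distance to an arbitrary target configuration
target = np.array([32, 64, 50, 50, 50, 10])
best_x, best_f = ide_optimize(lambda x: float(((x - target) ** 2).sum()),
                              bounds=[(8, 128)] * 2 + [(10, 200)] * 3 + [(2, 30)])
```

Rounding and clipping keep the trial vectors inside the integer hyperparameter ranges, since filter sizes, neuron counts, and time-steps must be integers.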

III. EXPERIMENTS
In this section, we introduce the details of data access, data preprocessing, and evaluation criteria.

A. DATA ACCESS
In this paper, we use three datasets to verify the predictive performance of the proposed model: the theoretical Lorenz dataset, the monthly mean total sunspot number dataset, and an actual coal-mine gas concentration dataset.

1) LORENZ CHAOTIC TIME SERIES
The Lorenz chaotic system is defined by the following equations:

dx/dt = a(y − x)
dy/dt = x(c − z) − y
dz/dt = xy − bz

The initial values are selected as x = y = z = 1, and the parameters a = 10, b = 8/3, c = 28 ensure chaos. We discard the first 10,000 samples and select the following 3,000 samples as experimental data. Fig.6 shows an example of the Lorenz time series, and we use the X variable of the Lorenz system to train and test the proposed deep hybrid model.
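Generating such a series is straightforward with a numerical integrator. The sketch below uses a fourth-order Runge-Kutta scheme; the integration step dt = 0.01 is an assumption, since the paper does not state its step size.

```python
import numpy as np

def lorenz_series(n, dt=0.01, a=10.0, b=8.0 / 3.0, c=28.0, discard=10000):
    """Generate the X component of the Lorenz system with 4th-order Runge-Kutta.

    dx/dt = a(y - x), dy/dt = x(c - z) - y, dz/dt = xy - bz
    Starts from (1, 1, 1), discards the transient, keeps n samples.
    (dt is an assumption; the paper does not state its step size.)
    """
    def f(s):
        x, y, z = s
        return np.array([a * (y - x), x * (c - z) - y, x * y - b * z])

    s = np.array([1.0, 1.0, 1.0])
    xs = []
    for i in range(discard + n):
        k1 = f(s); k2 = f(s + dt / 2 * k1)
        k3 = f(s + dt / 2 * k2); k4 = f(s + dt * k3)
        s = s + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        if i >= discard:
            xs.append(s[0])
    return np.array(xs)

x = lorenz_series(3000)
```

Discarding the initial transient, as the paper does, ensures the samples lie on the chaotic attractor rather than on the approach to it.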

2) MONTHLY MEAN TOTAL SUNSPOT NUMBER
Sunspots are common phenomena on the Sun's photosphere that appear as spots darker than the surrounding areas. We collected monthly mean total sunspot numbers from 1749 to 2019; 3,240 records are valid and used in this paper. Fig.7 shows the curve of the monthly mean total sunspot number.

3) CHAOTIC COAL-MINE GAS CONCENTRATION TIME SERIES
In this paper, we also use a chaotic coal-mine gas concentration series to verify the proposed model. We captured the actual data of a mining face in Xingtai Mine; 1,464 records are valid and used in this paper. Fig.7 shows the curve of the gas concentration dataset.

B. DATA PREPROCESSING 1) PHASE SPACE RECONSTRUCTION
The theory of phase space reconstruction provides a theoretical basis for forecasting chaotic time series. The basic idea of phase space reconstruction is that any variable in the system is determined by the other variables interacting with it; therefore, the development and change of any variable contain information on the development and change of the other variables [50]. Packard et al. proposed that the phase space can be reconstructed using the delayed coordinates of a single variable of the dynamical system [50]. Takens demonstrated that the dynamics of the original system can be recovered with an appropriate embedding dimension [51]. In this paper, we use the mutual information method [52] and the Cao method [53] to determine the delay time τ and the embedding dimension m. Given a time series {x_1, x_2, ..., x_N}, delay time τ, and embedding dimension m, the phase space reconstruction matrix is represented as follows:

X_i = (x_i, x_{i+τ}, ..., x_{i+(m−1)τ}),  i = 1, 2, ..., N − (m − 1)τ

2) NORMALIZATION
It is necessary to normalize datasets in deep learning, which not only eliminates differences in magnitude and unifies the data to the same scale but also enhances the convergence speed and prediction accuracy of the model. In this paper, we normalize both the phase space reconstruction datasets and the original chaotic time series to the range (0, 1) with the following function:

x′ = (x − x_min) / (x_max − x_min)

Fig.8 shows the process of data preprocessing.
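Both preprocessing steps are short array operations. The sketch below is illustrative; in practice m and τ come from the Cao and mutual information methods, while the toy series and the values m = 3, τ = 2 here are ours.

```python
import numpy as np

def phase_space(x, m, tau):
    """Delay-coordinate embedding: row i is (x_i, x_{i+tau}, ..., x_{i+(m-1)tau})."""
    x = np.asarray(x, float)
    n_rows = len(x) - (m - 1) * tau
    return np.array([x[i:i + (m - 1) * tau + 1:tau] for i in range(n_rows)])

def min_max(x):
    """Min-max normalization to [0, 1], as in the equation above."""
    x = np.asarray(x, float)
    return (x - x.min()) / (x.max() - x.min())

X = phase_space(np.arange(10.0), m=3, tau=2)  # toy series; m, tau from Cao/MI methods
X_norm = min_max(X)
```

Each row of the embedding matrix is one reconstructed phase-space point, so the CNN in Section II-A can treat the matrix as a multivariate sequence of spatial features.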

C. EVALUATION CRITERIA
We use two criteria to evaluate the performance of the prediction model: the root mean square error (RMSE) and the mean absolute percentage error (MAPE). The calculation equations are the following:

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² )

MAPE = (1/n) Σ_{i=1}^{n} |(y_i − ŷ_i) / y_i|

where y_i and ŷ_i stand for the raw and predicted values, and n is the number of predicted points.
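The two criteria translate directly into NumPy; the toy input values in the example are ours.

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean square error."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.sqrt(np.mean((y - y_hat) ** 2))

def mape(y, y_hat):
    """Mean absolute percentage error; assumes no true value is zero."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean(np.abs((y - y_hat) / y))

r = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # -> sqrt(4/3)
p = mape([1.0, 2.0, 4.0], [1.0, 2.0, 5.0])  # -> (0 + 0 + 0.25)/3
```

RMSE is scale-dependent, which makes it suitable for comparing models on the same dataset, while MAPE is relative and so comparable across the three datasets, but it is undefined when a true value is zero, which matters for series like gas concentration that can approach zero.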

D. TRAINING OF PREDICTION MODEL
In this paper, we choose the first 80% of each dataset as the training data and the remaining 20% as the testing data.
In the training phase of the DHNN experiments, we use the improved differential evolution algorithm to infer optimal topologies and time-steps for the proposed model. We use Keras to code the experimental programs and implement the ES, GA, and DE using geatpy [54], a high-performance genetic and evolutionary algorithm toolbox for Python; we also implement the improved DE with geatpy. The mean square error is used as the loss function, i.e., the quantity that the model seeks to minimize during training, and Adam [55] is used to optimize the stochastic objective function.

IV. ANALYSIS AND DISCUSSION
In this section, we analyze and discuss the prediction accuracy of the proposed hybrid model through two experiments. One of them is to optimize the hybrid neural network through different evolutionary algorithms, and the other is to compare the proposed model with other variant models.

A. VARIOUS EVOLUTION ALGORITHMS FOR HYBRID NEURAL NETWORK
In this part, we analyze and discuss the predictive performance of the hybrid neural network optimized by different evolutionary algorithms. We use the improved differential evolution (IDE) algorithm, the adaptive differential evolution (ADE) algorithm, the standard differential evolution (DE) algorithm, the evolution strategy (ES) [56], and the genetic algorithm (GA) [57] to infer optimal topologies and time-steps for the hybrid neural network. As shown in Fig.9, the IDE not only has a faster convergence speed but also achieves the lowest target value compared with the other algorithms. From Fig.10 (a), it can be seen that the convergence values of IDE and ADE are similar, but the convergence rate of IDE is faster than that of ADE, and both are better than DE, ES, and GA. From Fig.10 (b) and (c), it is clear that the IDE has the smallest convergence value. From Table 1, it is obvious that IDE runs faster than DE, ADE, ES, and GA. Thus, the IDE proposed in this paper improves the convergence speed and reduces the optimization time.
In Table 1, we can notice that the RMSE, MAPE, and average prediction error of the IDE-DHNN model are the lowest. The RMSE of the IDE-DHNN model is obviously lower than that of the other models, while its MAPE values differ only slightly from those of the other models. From Fig.12, it is clear that the values predicted by IDE-DHNN are close to the actual values, and Fig.11 shows that the maximum percentage error of IDE-DHNN is 1.5%, which is lower than that of the ADE-, DE-, ES-, and GA-DHNN models. On the other hand, IDE-DHNN also has higher forecasting accuracy on the sunspot dataset: Table 1 shows that the RMSE of IDE-DHNN is the lowest with a value of 3.3834, and its MAPE value of 0.0419 is the minimum. Fig.14 and Fig.15 show that the accuracy of IDE-DHNN is higher than that of the other models. Fig.13 and Fig.16 respectively show the curves of actual versus predicted values and of prediction error on the gas concentration dataset. We notice that IDE-DHNN performs well, with a maximum error of 5%, which is far lower than that of the other forecasting models. All of the evaluation criteria verify that IDE-DHNN has excellent forecasting performance and is a good choice for chaotic time series prediction.

B. COMPARISON OF VARIANT PREDICTION MODELS
The hybrid neural network proposed in this paper includes three parts: the CNN, the GRU, and the attention model. We use three variant forecasting models to verify the forecasting accuracy of the proposed hybrid neural network: the hybrid CNN-GRU model, the single CNN model, and the single GRU model. Table 2 shows the RMSE, MAPE, and average error of the various prediction models.
As described in Table 2, the RMSE, MAPE, and average error of the CNN-GRU-Att prediction model are far lower than those of the CNN-GRU, CNN, and GRU models. CNN-GRU-Att has the lowest RMSE, MAPE, and average error on all three datasets.
On the Lorenz dataset, the RMSE of the various prediction models differs considerably. The RMSE of CNN-GRU-Att is the lowest with a value of 0.0756, and its MAPE is obviously lower than that of the other models. CNN-GRU also performs well, better than CNN and GRU. From Fig.17 and Fig.18, it is evident that CNN-GRU-Att has higher prediction accuracy, with the forecasting error controlled within 1.5%. Fig.19 and Fig.20 respectively show the curves of prediction error and true versus predicted values of the four models. As shown in Fig.20 (a), the values forecasted by CNN-GRU-Att are close to the actual values, and the prediction error curve shows that the proposed CNN-GRU-Att model has higher prediction accuracy. Table 1 shows that the RMSE and MAPE of CNN-GRU-Att are the lowest, with values of 3.3834 and 0.0419, respectively.
From Table 2, Fig.21, and Fig.22, it is evident that the hybrid model with CNN, GRU, and attention performs very well on the gas concentration dataset. As expressed in Fig.22, the prediction accuracy of CNN-GRU-Att is higher than that of the others, and the prediction error is controlled within 5%. It is also clear that the prediction errors of the CNN-GRU, CNN, and GRU models are higher than those of the CNN-GRU-Att model. It is worth noting that the gas concentration dataset is significantly smaller than the other two datasets, yet the proposed hybrid neural network still achieves high predictive accuracy.

C. DISCUSSION
Based on the previous trial results, we can summarize the following findings. (1) Neuroevolution is an excellent choice for designing topologies and searching for hyperparameters of neural networks. The improved differential evolution proposed in this paper performs well in optimizing the hybrid neural network: the IDE improves the convergence speed and reduces the running time simultaneously, and, most importantly, the network optimized by IDE achieves better prediction performance and higher prediction accuracy. (2) The hybrid structure plays an essential role in the prediction model; prediction accuracy is low when using a single CNN model, a single GRU model, or the hybrid neural network without the attention model. This is because the chaotic time series expands into high dimensions in the phase space reconstruction, where the chaotic system contains rich spatial information, while the original chaotic time series also contains rich temporal characteristics. It is difficult to fully extract the temporal or spatial characteristics of the chaotic system with a single CNN or GRU model. Moreover, even though the hybrid CNN-GRU model can extract both temporal and spatial features, its prediction accuracy is affected by the many non-key features. At this point, the attention model plays a crucial role in extracting spatio-temporal features: it gives high weights to critical features while weakening non-critical ones. From the previous trial results, it is evident that the hybrid neural network optimized by IDE and equipped with the attention mechanism has high prediction accuracy.

V. CONCLUSION
In this paper, we propose a hybrid model to forecast chaotic time series, which includes a convolutional neural network, a gated recurrent unit, an attention mechanism, and an improved differential evolution algorithm. The proposed hybrid model consists of two parts: the deep hybrid neural network and neuroevolution based on IDE. In the deep hybrid neural network, we use the CNN and GRU to extract spatial and temporal features from the phase space reconstruction and the original time series, respectively, and the attention model extracts critical spatio-temporal features, which improves prediction accuracy. For the neuroevolution, we first develop the IDE with an adaptive mutation operator and a dynamic chaos crossover operator, which improve convergence speed and reduce optimization time, and then use the IDE to infer appropriate topologies and time-steps for the deep neural network. Simulation results show that the IDE improves convergence speed, reduces optimization time, and achieves a lower prediction error. Thus, the deep hybrid neural network is an excellent choice for chaotic time series prediction.