Effect of Multi-Scale Decomposition on Performance of Neural Networks in Short-Term Traffic Flow Prediction

Numerous studies employ multi-scale decomposition to improve the prediction performance of neural networks, but the grounds for selecting a particular decomposition algorithm are rarely explained, and the effects of decomposition algorithms on other aspects of neural network performance have received little further study. This paper studies the influence of commonly used multi-scale decomposition algorithms, including EMD (Empirical Mode Decomposition), EEMD (Ensemble Empirical Mode Decomposition), CEEMDAN (Complete Ensemble Empirical Mode Decomposition with Adaptive Noise), VMD (Variational Mode Decomposition), WD (Wavelet Decomposition), and WPD (Wavelet Packet Decomposition), on the performance of neural networks. The decomposition algorithms are adopted to decompose traffic flow data into component signals, and K-means is then used to cluster the component signals into volatility components, periodic components, and residual components. A Bi-directional LSTM (BiLSTM) neural network is adopted as the standard model for training and forecasting. Finally, three metrics, namely prediction performance, robustness, and generalization performance, are proposed to evaluate the influence of multi-scale decomposition algorithms on neural networks comprehensively. By comparing the evaluation results of the different hybrid models, this study provides some useful suggestions on proper multi-scale decomposition algorithm selection in short-term traffic flow prediction.


I. INTRODUCTION
With the acceleration of urbanization, the car parc is increasing year by year. The resulting traffic congestion and unequal distribution of the right of way seriously affect the safety and efficiency of travel. Accurate short-term traffic prediction is the core problem of the ITS (intelligent transportation system), as it can provide technical support for traffic flow surveillance and early warning [1], thereby alleviating congestion. In order to achieve high-precision prediction of short-term traffic flow, a large number of prediction models have been proposed, which can be mainly divided into three categories: parametric models (statistical theoretical models), non-parametric models (machine learning theoretical models), and hybrid models.
The associate editor coordinating the review of this manuscript and approving it for publication was Qichun Zhang.
Parametric models each have characteristic strengths: Kalman filtering [2], [3] is particularly good at dealing with noisy traffic flow prediction; the grey forecasting model [4], [5] can achieve high prediction accuracy even when the number of samples is relatively small; and the autoregressive integrated moving average (ARIMA) model [6] realizes traffic flow prediction by stabilizing the data through differencing. Its improved variant, the seasonal autoregressive integrated moving average (SARIMA) model [7], adds seasonal factors to better extract the periodic features of traffic flow and further improves prediction performance. However, parametric models are usually only applicable to linear data, and their performance in predicting complex nonlinear traffic flow is often unsatisfactory. Moreover, when dealing with large-scale data [8], the efficiency of parametric models is very low, which cannot meet the timeliness requirements of short-term traffic flow prediction.
Non-parametric models have become the mainstream short-term traffic flow forecasting models due to their strong nonlinear fitting ability and self-learning characteristics. Decision tree models, including Random Forest (RF) [9], [10] and the Gradient Boosting Decision Tree (GBDT) [11], are typical non-parametric models that are not prone to overfitting and have a high anti-interference capacity. The support vector machine (SVM) [12]-[14] is widely used in short-term traffic flow prediction because of its strong nonlinear fitting performance and strict mathematical interpretation. K-Nearest Neighbors (KNN) [15], [16], with its short training period, is also a commonly used prediction model. The Artificial Neural Network (ANN) [17], [18] is known for its adjustable structure and powerful generalization performance. As deep learning theory has developed, RNN [19] and its improved variants GRU [20] and LSTM have been widely adopted in short-term traffic flow prediction [1], [21], [22].
Tselentis et al. [23] found that each prediction model has a specific application range; in predicting short-term traffic flow, combined models are more accurate and more widely applicable than stand-alone models. Tang et al. [24] combined double exponential smoothing with SVR and verified its advantages over a stand-alone model. Saiqun et al. [25] combined ARIMA and LSTM to learn the linear and nonlinear characteristics of traffic flow respectively. Linjiang et al. [26] proposed a dynamic spatial-temporal feature optimization method for big data based on GBDT.
Since the traffic flow is easily disturbed by various factors such as weather and traffic detectors during the acquisition process [27], the collected data often contain a lot of noise. Numerous studies have shown that decomposition algorithms can reduce the impact of noise on the prediction model and improve prediction accuracy. One decomposition method is to assume that the traffic flow has a periodic part that can be modeled by trigonometric regression; the residual part is then obtained by subtracting the periodic part from the original traffic flow, achieving the traffic flow decomposition. Xiaoxue et al. [28] adopted a combination of sines and cosines to model the periodic part, thus decomposing the traffic flow into a periodic part and a residual part, and indicated that the hybrid prediction approach is more effective. Reference [22] uses the Fourier transform to separate the traffic flow into a periodic part and a volatility part, and proved that the prediction accuracy of the hybrid model after decomposition is better than that of the traditional model. Zhang et al. [29] introduced variance and assumed that the traffic flow can be decomposed into three parts. Another way is multi-scale decomposition, which migrated from the field of signal processing; it does not require assuming the period T of the function and is completely determined by the characteristics of the data themselves. Ghosh et al. [30] adopted WD to reduce the sample size required for training. Yan et al. [31] introduced the Maximal Overlap Discrete Wavelet Transform to decompose time series, and verified that it is superior to the traditional deep learning mechanism. Mehmet et al. [32] concluded that WD achieves higher accuracy than EMD when the wavelet type is selected correctly. Jiang and Adeli [33] adopted WPD to reduce the noise of traffic data effectively. Cheng et al. [34] proved that WPD can effectively separate historical data. Wang et al. [35] demonstrated that the hybrid EMD-ARIMA model outperforms traditional forecasting models in different scenarios. Qiu et al. [36] proposed a hybrid model combining EMD and a deep learning method, and proved its superiority by comparing different prediction methods. Tang et al. [37] compared five denoising schemes and proposed that EEMD is superior to the other algorithms. Reference [27] also indicated that EEMD can reduce the prediction error. Yanan et al. [38] showed that hybrid models based on deep learning methods and CEEMDAN have great potential in the field of traffic flow prediction. VMD is a state-of-the-art decomposition algorithm that is seldom employed in short-term traffic flow prediction, although it is widely utilized in the fields of wind speed [39], runoff [40], chemistry [41], power load [42], and so on.
Currently, a large number of studies show that multi-scale decomposition algorithms can reduce the influence of noise on the prediction model and improve prediction accuracy. However, none of these studies explains why a particular decomposition algorithm is chosen over the others, or when it should be utilized. Reference [37] compares five algorithms, but its purpose is to test denoising performance, and the more recent CEEMDAN and VMD are not involved. At the same time, plenty of research concentrates on using multi-scale decomposition to improve the prediction accuracy of neural networks, but lacks further research on other important properties of neural networks, such as robustness and generalization performance.
The rest of the paper is organized as follows. The second part briefly introduces the multi-scale decomposition algorithms. The third part presents an empirical analysis of the data. Finally, the fourth part gives the conclusions and suggestions.

A. WAVELET DECOMPOSITION AND WAVELET PACKET DECOMPOSITION
Wavelet analysis can decompose signals at multiple scales by translating and scaling the wavelet basis function to obtain signal components at different scales, which overcomes the single-resolution shortcoming of the short-time Fourier transform. Wavelet packet decomposition is a more refined signal decomposition method based on orthogonal wavelet analysis; in essence, it further decomposes the high-frequency signals obtained by wavelet decomposition. There is neither redundancy nor omission in wavelet packet decomposition, which improves the time-frequency analysis capability over the signal spectrum. Although wavelet packet decomposition can fully decompose both the high-frequency and low-frequency parts of the data, it is generally considered that the high-frequency part of the data contains a lot of noise. Therefore, the choice of multi-scale decomposition algorithm differs between studies [32], [34].
In multi-scale analysis, if the scale function φ(t) and the wavelet basis function ψ(t) satisfy the two-scale equations, they can be written as:

φ(t) = Σ_k H(k) φ(2t − k), ψ(t) = Σ_k G(k) φ(2t − k) (1)

where H(k) and G(k) are the low-pass and high-pass filter coefficients respectively, j is the decomposition scale, k is the position index, and t is time. The Mallat decomposition algorithm can then be expressed as:

cA_{j+1}(k) = Σ_n H(n − 2k) cA_j(n), cD_{j+1}(k) = Σ_n G(n − 2k) cA_j(n) (2)

where cA_j is the low-frequency approximation coefficient of the signal at the j-th layer and cD_j is the high-frequency detail coefficient at the j-th layer. On this basis, feeding the obtained low-frequency coefficient cA_j back into Eq. (2) yields multi-layer wavelet decomposition; if the high-frequency coefficient cD_j is decomposed by Eq. (2) at the same time, the result is multi-layer wavelet packet decomposition. The decomposition structure is shown in FIGURE 1.
Its reconstruction can be expressed as:

cA_j(k) = Σ_n h(k − 2n) cA_{j+1}(n) + Σ_n g(k − 2n) cD_{j+1}(n)

where s(t) is the original signal (recovered at the top layer as cA_0), h and g are the wavelet reconstruction coefficients, and their relationship with the decomposition filter coefficients can be expressed as:

h(k) = H(−k), g(k) = G(−k)
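To make the Mallat analysis and synthesis steps above concrete, the following sketch implements one level of decomposition and its perfect reconstruction with the simplest (Haar) filter pair rather than the Db4 wavelet used later in the paper; the function names are our own, and a multi-level `wavedec` shows how WD recursively splits only the approximation part.

```python
import numpy as np

def haar_step(s):
    # One Mallat analysis step with the Haar filters
    # H = [1/sqrt(2), 1/sqrt(2)] (low-pass), G = [1/sqrt(2), -1/sqrt(2)] (high-pass)
    s = np.asarray(s, dtype=float)
    cA = (s[0::2] + s[1::2]) / np.sqrt(2)   # approximation (low-frequency)
    cD = (s[0::2] - s[1::2]) / np.sqrt(2)   # detail (high-frequency)
    return cA, cD

def haar_inverse(cA, cD):
    # One Mallat synthesis step: perfect reconstruction of the signal
    s = np.empty(2 * len(cA))
    s[0::2] = (cA + cD) / np.sqrt(2)
    s[1::2] = (cA - cD) / np.sqrt(2)
    return s

def wavedec(s, level):
    # Multi-level WD: recursively split only the approximation cA
    coeffs = []
    cA = np.asarray(s, dtype=float)
    for _ in range(level):
        cA, cD = haar_step(cA)
        coeffs.append(cD)
    coeffs.append(cA)
    return coeffs
```

Splitting `cD` as well at every level, instead of only `cA`, would turn this into the wavelet packet scheme.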

B. EMPIRICAL MODE DECOMPOSITION AND IMPROVED ALGORITHM
As an adaptive signal time-frequency processing method, EMD [43] stabilizes the signal into a series of data sequences with local characteristics at different time scales. Each sequence is an intrinsic mode function (IMF). Although EMD has an excellent processing effect on nonlinear and non-stationary signals, its inherent defect of mode mixing limits its wide application. EEMD [44] solves mode mixing by adding white noise. On this basis, CEEMDAN [45] adds adaptive white noise at each decomposition stage to improve the completeness of EEMD and reduce the reconstruction error. The basic principles of the three algorithms are the same and differ only in the added noise, so the latest of them, CEEMDAN, is introduced as an example.
(1) Determine all local maxima and minima of the original signal s(t). Cubic spline interpolation is employed to fit the local maxima and minima; the upper and lower envelopes formed by the cubic splines are denoted u_0(t) and l_0(t) respectively.
(2) Compute the mean of the upper and lower envelopes, m_0(t) = (u_0(t) + l_0(t))/2. The first IMF component can then be expressed as

IMF_1(t) = s(t) − m_0(t)

(3) Steps (1) and (2) are the rationale of EMD. The main difference among the three algorithms is that EMD adds no noise to s(t), while EEMD adds Gaussian white noise w_i(t) with zero mean and constant variance, and CEEMDAN further introduces a noise coefficient ε_0, so the signal becomes

s_i(t) = s(t) + ε_0 w_i(t)

Decomposing each s_i(t) by EMD yields its first component, defined as an intrinsic mode function (IMF), and averaging over the noise realizations gives IMF_1(t).
(4) Calculate the first residual signal:

r_1(t) = s(t) − IMF_1(t)

(5) Let the operator E_i(·) denote the i-th modal component generated by EMD, and let IMF_i likewise denote the i-th modal component generated by CEEMDAN. Decompose r_1(t) + ε_1 E_1(w_i(t)) by EMD to obtain the second component IMF_2(t).
(6) Repeat steps (4) and (5) until the acquired residual signal can no longer be decomposed; the criterion is that the number of extremal points of the residual signal does not exceed two. The final residual signal is

R(t) = s(t) − Σ_{i=1}^{m} IMF_i(t)

where m is the total number of modes of CEEMDAN, and the reconstruction of the original signal s(t) can be expressed as:

s(t) = Σ_{i=1}^{m} IMF_i(t) + R(t)

C. VARIATIONAL MODE DECOMPOSITION
VMD is a state-of-the-art time-frequency signal decomposition algorithm that is adaptive and completely non-recursive. By constructing a variational model, the modal functions u_k(t) and the corresponding center frequencies ω_k are alternately and iteratively updated to find the optimal solution of the constrained variational model, and all the components are obtained in this process. VMD overcomes the endpoint effect and mode mixing of EMD and has a solid mathematical foundation. The algorithm can be represented as follows:
(1) Perform the Hilbert transform on each u_k(t) to obtain its unilateral spectrum:

(δ(t) + j/(πt)) * u_k(t)

where δ(t) is the Dirac distribution and * denotes convolution.
(2) The signal of each mode is multiplied by e^(−jω_k t) at the estimated center frequency, shifting the spectrum to baseband:

[(δ(t) + j/(πt)) * u_k(t)] e^(−jω_k t)

(3) The bandwidth of each mode function is estimated through Gaussian smoothing of the demodulated signal, i.e., the squared L2-norm of its gradient, and the constrained variational model can then be expressed as:

min over {u_k}, {ω_k} of Σ_k ‖∂_t [ (δ(t) + j/(πt)) * u_k(t) ] e^(−jω_k t)‖₂², subject to Σ_k u_k(t) = s(t)

where ∂_t denotes the partial derivative with respect to t and k indexes the components.
(4) Considering the influence of Gaussian noise, a quadratic penalty factor α and a Lagrangian multiplier λ are introduced to ensure the reconstruction accuracy of the original signal s(t), transforming the constrained variational formula into an unconstrained one:

L({u_k}, {ω_k}, λ) = α Σ_k ‖∂_t [ (δ(t) + j/(πt)) * u_k(t) ] e^(−jω_k t)‖₂² + ‖s(t) − Σ_k u_k(t)‖₂² + ⟨λ(t), s(t) − Σ_k u_k(t)⟩

(5) The components u_k(t), the corresponding center frequencies ω_k, and the Lagrangian multiplier λ are updated by the alternating direction method of multipliers, searching for the saddle point of the extended Lagrangian function. The update formulas are:

û_k^(n+1)(ω) = ( ŝ(ω) − Σ_{i≠k} û_i(ω) + λ̂(ω)/2 ) / ( 1 + 2α(ω − ω_k)² )

ω_k^(n+1) = ∫₀^∞ ω |û_k^(n+1)(ω)|² dω / ∫₀^∞ |û_k^(n+1)(ω)|² dω

λ̂^(n+1)(ω) = λ̂^n(ω) + τ ( ŝ(ω) − Σ_k û_k^(n+1)(ω) )

where ŝ(ω), û_k^n(ω), λ̂(ω) are the Fourier transforms of s(t), u_k^n(t), λ(t), ω is frequency, n is the iteration number, and τ is the update step of the Lagrange multiplier.
(6) Repeat step (5) until the convergence accuracy ε is satisfied, so as to obtain K narrowband components. The iteration termination condition can be expressed as:

Σ_k ‖û_k^(n+1) − û_k^n‖₂² / ‖û_k^n‖₂² < ε
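To make steps (1)-(6) concrete, here is a deliberately simplified VMD sketch that performs the mode and center-frequency updates in the Fourier domain. It omits refinements of the full algorithm (mirror extension of the signal, analytic-signal handling), and the parameter defaults are illustrative, not the settings used in this paper.

```python
import numpy as np

def vmd(signal, K=2, alpha=2000.0, tau=0.0, n_iter=300, tol=1e-7):
    """Simplified VMD: alternate Wiener-filter mode updates and
    power-weighted center-frequency updates in the Fourier domain."""
    N = len(signal)
    freqs = np.fft.fftfreq(N)                  # normalized frequency axis
    s_hat = np.fft.fft(signal)
    u_hat = np.zeros((K, N), dtype=complex)    # mode spectra
    omega = 0.5 * np.arange(K) / K             # initial center frequencies
    lam = np.zeros(N, dtype=complex)           # Lagrange multiplier spectrum
    pos = freqs >= 0
    for _ in range(n_iter):
        u_prev = u_hat.copy()
        for k in range(K):
            others = u_hat.sum(axis=0) - u_hat[k]
            # Wiener-filter-like update of mode k, centered at omega[k]
            u_hat[k] = (s_hat - others + lam / 2) / (1 + 2 * alpha * (freqs - omega[k]) ** 2)
            # center frequency: power-weighted mean over positive frequencies
            p = np.abs(u_hat[k, pos]) ** 2
            omega[k] = np.sum(freqs[pos] * p) / (np.sum(p) + 1e-12)
        lam = lam + tau * (s_hat - u_hat.sum(axis=0))   # dual ascent step
        change = np.sum(np.abs(u_hat - u_prev) ** 2) / (np.sum(np.abs(u_prev) ** 2) + 1e-12)
        if change < tol:                                # termination criterion
            break
    modes = np.real(np.fft.ifft(u_hat, axis=1))
    return modes, np.sort(omega)
```

On a synthetic sum of two cosines the recovered center frequencies settle near the true normalized frequencies, which is the behavior the constrained variational model is designed to produce.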

D. BI-DIRECTIONAL LONG SHORT-TERM MEMORY
LSTM is a representative recurrent neural network. A gating mechanism is introduced to realize long- and short-term memory and to effectively alleviate the vanishing and exploding gradient problems. The input gate, forget gate, and output gate control when to forget historical information or update the cell state with new information, which gives LSTM an excellent ability to model nonlinear time series and makes it widely applied in short-term traffic flow prediction. Its process can be expressed as:

i_t = σ(W_i x_t + U_i h_(t−1) + b_i)
f_t = σ(W_f x_t + U_f h_(t−1) + b_f)
o_t = σ(W_o x_t + U_o h_(t−1) + b_o)
c̃_t = tanh(W_c x_t + U_c h_(t−1) + b_c)
c_t = f_t ⊙ c_(t−1) + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

where i_t, f_t, o_t, and c_t are the activations of the input gate i, forget gate f, output gate o, and cell state c at time t respectively, σ is the sigmoid function, ⊙ denotes element-wise multiplication, W, U, and b are weights and biases, x_t is the input, and h_t is the hidden state.
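As a concrete illustration of the gate equations (a numpy sketch, not the implementation used in this paper), the following code runs one LSTM step and wraps it into a BiLSTM by processing the sequence in both directions and concatenating the hidden states, which is what "bi-directional training" refers to.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b stack the four gate parameter sets in order [i, f, o, c~]
    z = W @ x_t + U @ h_prev + b
    H = h_prev.shape[0]
    i = sigmoid(z[0:H]); f = sigmoid(z[H:2*H]); o = sigmoid(z[2*H:3*H])
    c_tilde = np.tanh(z[3*H:4*H])
    c_t = f * c_prev + i * c_tilde      # cell state update
    h_t = o * np.tanh(c_t)              # hidden state
    return h_t, c_t

def bilstm(seq, params_fw, params_bw, H):
    # run one LSTM forward and another backward, concatenate hidden states
    def run(s, p):
        h = np.zeros(H); c = np.zeros(H); out = []
        for x in s:
            h, c = lstm_step(x, h, c, *p)
            out.append(h)
        return out
    fw = run(seq, params_fw)
    bw = run(seq[::-1], params_bw)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(fw, bw)]
```

Each output vector has twice the hidden dimension, since the forward and backward passes see the past and the future of the sequence respectively.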

2) DECOMPOSITION PARAMETERS
Reference [37] indicated that methods of the Daubechies type achieve higher accuracy on short-term traffic flow with periodicity. Accordingly, we adopted the Db4 wavelet, selected by testing. WD carried out a 4-layer decomposition to obtain four high-frequency components and one low-frequency component. WPD carried out a 3-layer decomposition, obtaining the same four high-frequency components as WD; at the same time, the noise-bearing high-frequency components are further decomposed by WPD to obtain four additional low-frequency components.
EMD adaptively decomposes the traffic flow data into 14 components without preset parameters. The three parameters of EEMD (the noise standard deviation, the number of realizations, and the maximum number of sifting iterations) are set to 0.2, 100, and 1000 respectively, and 15 components are obtained after decomposition. CEEMDAN adopts the same decomposition parameters as EEMD and ultimately also yields 15 components.
Unlike wavelet decomposition, for which numerous studies are available to guide hyperparameter selection, the hyperparameters of VMD lack prior knowledge in the short-term traffic flow field. The number of decomposition modes is set to 15, the same as for EEMD and CEEMDAN. Considering the reconstruction error and the center frequency distribution, the secondary penalty factor is set to 200.
TABLE 1 lists the reconstruction error of each decomposition algorithm, expressed as the mean absolute error (MAE). EEMD has serious noise residue and a large reconstruction error. Although the latest VMD overcomes mode mixing, it cannot fully maintain the integrity of the traffic flow. In contrast, EMD has almost no reconstruction error, and CEEMDAN overcomes the defect of EEMD, with a reconstruction error of almost 0. The wavelet-based decomposition algorithms (WD, WPD) maintain good integrity after reconstruction.
After decomposition, multiple additional prediction models must be established, and since the number of models differs for each decomposition algorithm, the algorithms cannot be compared fairly. Chen and Yu [47] found that each component may represent a certain pattern of traffic flow. In this article, we suppose that the traffic flow is composed of a volatility component, a periodic component, and a residual component:

s(t) = V(t) + P(t) + R(t)
where V (t) is the volatility component representing the data noise during the collection process, such as weather, sensor failure, etc. P(t) is the periodic component representing the seasonal part of the traffic flow. R(t) is the residual component representing the overall change trend of traffic flow.
To unify the number of models and improve the efficiency of the prediction models, K-means is employed to divide the components obtained by each decomposition algorithm into 3 clusters with similar patterns, as shown in FIGURE 3.
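A sketch of this clustering step follows. The feature fed to K-means is our own assumption, since the paper does not state it: each component is summarized by its mean absolute first difference, a crude volatility proxy, and the resulting one-dimensional features are clustered into 3 groups.

```python
import numpy as np

def kmeans_1d(x, k=3, n_iter=100):
    # deterministic init: spread the centers over the feature range
    centers = np.quantile(x, np.linspace(0, 1, k))
    for _ in range(n_iter):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        new = np.array([x[labels == j].mean() if np.any(labels == j) else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

def cluster_components(components):
    # volatility proxy (assumed feature): mean absolute first difference
    feats = np.array([np.mean(np.abs(np.diff(c))) for c in components])
    return kmeans_1d(feats)
```

Components assigned to the same cluster are then summed and modeled by one BiLSTM, so every decomposition algorithm ends up with the same three sub-models regardless of how many raw components it produced.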

3) TRAINING PARAMETERS
The BiLSTM neural network is adopted for testing. The number of iterations is set to 100 and the initial learning rate to 0.005. To prevent the model from oscillating back and forth when the learning rate is too large, the learning rate is dynamically adjusted by an attenuation method: it decreases by 50% every 25 iterations. Dropout is set to 0.2 to avoid overfitting. The loss function is the root mean square error (RMSE). The time step is set to 3 and the output layer has 1 neuron; that is, we adopt the preceding 5, 10, and 15 minutes of traffic flow to predict the present traffic flow. Considering model efficiency, the number of hidden layer neurons is 64.
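The step-decay schedule described above (initial rate 0.005, halved every 25 iterations) can be written compactly; the helper name is our own.

```python
def learning_rate(epoch, lr0=0.005, drop=0.5, every=25):
    # step decay: multiply the rate by `drop` every `every` iterations
    return lr0 * drop ** (epoch // every)
```

Over the 100 training iterations this yields the four plateaus 0.005, 0.0025, 0.00125, and 0.000625.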

B. PREDICTION PERFORMANCE EVALUATING
The prediction performance is the most important metric for evaluating a short-term traffic flow prediction model and directly determines the quality of the model. It is usually evaluated by the error between the predicted value and the measured value. Three metrics are adopted to evaluate the prediction performance comprehensively. The mean absolute percentage error (MAPE) reflects the relative deviation between the predicted value and the actual value and can directly measure the prediction results; it is widely utilized to evaluate the quality of prediction models, but it cannot directly reflect the absolute difference between the predicted value and the measured value. The root mean square error (RMSE) directly reflects that absolute difference and is sensitive to extremely large or extremely small errors, so it effectively makes up for the deficiency of MAPE. The determination coefficient (R²) reflects the degree of similarity between the predicted value and the measured value. These metrics are calculated by the following formulas:

MAPE = (100%/n) Σ_{t=1}^{n} |y_t − ŷ_t| / y_t (24)

RMSE = √( (1/n) Σ_{t=1}^{n} (y_t − ŷ_t)² ) (25)

R² = 1 − Σ_{t=1}^{n} (ŷ_t − y_t)² / Σ_{t=1}^{n} (y_t − ȳ)² (26)

where ŷ_t is the predicted value, y_t is the measured value, and ȳ is the mean of the measured values.

The results are listed in TABLE 2, and the combination model with the best effect is marked in bold. Firstly, considering the three components separately: for the volatility component, WD has the lowest error and the best fitting effect, EMD and CEEMDAN show similar performance, EEMD is less effective, and VMD and WPD have the largest errors. For the periodic component and the residual component, the wavelet-based decomposition algorithms are clearly not competent: WD and WPD cannot effectively separate these two components from the traffic flow, which leads to poor K-means clustering results and much greater errors than EMD, EEMD, etc. Relatively speaking, VMD also performs poorly, slightly better than WD and WPD but not as well as EMD. In particular, the effect of WPD on trend component prediction is very poor.
Secondly, after the reconstruction of the three components, the prediction error of the volatility component almost determines the total error. Finally, from TABLE 2 we notice that decomposition algorithms can significantly improve the prediction of neural networks, reducing RMSE by between 0.6232 and 8.2114. WD has the best performance and outperforms EMD and EEMD, which agrees with the conclusions of references [32] and [37] respectively. Surprisingly, the newly proposed VMD shows excellent performance, close to that of WPD and second only to WD; CEEMDAN has a mediocre performance; EMD and EEMD have the worst effects. We ranked the decomposition algorithms according to their prediction errors from best to worst.
Multi-step prediction of traffic flow on several time horizons allows the predictions to be applied more widely. To test the influence of multi-scale decomposition on the multi-step prediction of the model, we measured the variation of prediction accuracy over four horizons of 5, 10, 15, and 20 minutes. The RMSE and MAPE are given in FIGURE 5. It can be found that, with multi-scale decomposition, the longer horizons did not deteriorate the prediction accuracy significantly. Besides, VMD performed well in multi-step prediction: even when the prediction horizon reaches 20 min (4 steps ahead), the error is not much larger than that of 1-step-ahead prediction, implying that VMD is a relatively stable method for multi-step prediction.
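Multi-step forecasting of this kind can be sketched as iterated one-step-ahead prediction, feeding each prediction back into the input window (the recursive strategy; the paper does not spell out which multi-step strategy it uses, so this is an assumption):

```python
def multi_step_forecast(predict, history, steps, window=3):
    # recursive multi-step prediction: each output becomes an input
    buf = list(history)
    preds = []
    for _ in range(steps):
        y = predict(buf[-window:])   # one-step model over the last `window` values
        preds.append(y)
        buf.append(y)                # feed the prediction back in
    return preds
```

With `window=3` this matches the time step of 3 used in training; errors compound across steps, which is why stability over the 20 min horizon is a meaningful property.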
Baseline Models: To verify the improvement of neural network performance by multi-scale decomposition, commonly used prediction models are introduced for comparison, including long short-term memory (LSTM), random forest (RF), and the particle swarm optimization least squares support vector machine (PSO-LSSVM). FIGURE 6 is processed in the same way as FIGURE 4, and the following conclusions can be drawn from it. Firstly, the prediction performance of the plain LSTM neural network differs little (by no more than 1% in MAPE) from that of traditional machine learning models such as RF and LSSVM. Secondly, bi-directional training significantly improves the prediction accuracy of short-term traffic flow: compared with LSTM, the RMSE of BiLSTM is reduced by 7.2494, MAPE by 4.67%, and R² is up by 5.63%. Finally, the combination of multi-scale decomposition and neural networks further reduces the prediction error. Taking the WD-BiLSTM hybrid model as an example, compared with LSTM its RMSE and MAPE are reduced by 15.4617 and 9.91% respectively, and R² increases by 9.33%. The effect is remarkable.
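The three evaluation metrics defined in this subsection can be implemented directly; this is a straightforward sketch, not the authors' code.

```python
import numpy as np

def mape(y_true, y_pred):
    # mean absolute percentage error, in percent
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def rmse(y_true, y_pred):
    # root mean square error
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    # coefficient of determination
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot
```

Note that MAPE is undefined when a measured value is 0, which is one reason RMSE is kept alongside it for traffic counts that can drop to zero at night.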

C. ROBUSTNESS EVALUATING
The robustness of a neural network refers to the model's ability to maintain good performance under certain disturbances; in the field of short-term traffic flow, robustness manifests as the ability to resist data noise. In order to evaluate the influence of multi-scale decomposition on the robustness of neural networks, two evaluation methods are designed. One simulates random disturbance of the detector: Gaussian white noise with different signal-to-noise ratios (SNR) is added. The other simulates random failure of the detector: different proportions of the traffic flow data are randomly set to 0. FIGURE 7 (a) and (b) show similar results: when white Gaussian noise with SNR from −5 dB to −25 dB is added, the overall error of all decomposition algorithms barely increases, since a small amount of added noise is equivalent to data augmentation and plays a regularization role. When white noise with an SNR of −5 dB is added, the errors of all prediction models are reduced to a certain extent. EEMD responds most strongly to this regularization, followed by CEEMDAN and WPD. The regularization has no obvious effect on WD, which had the best prediction performance originally. At the same time, the prediction performance of the model without a decomposition algorithm is not improved by adding noise, which shows that multi-scale decomposition can improve the sensitivity of neural networks to regularization. When the SNR reaches −35 dB and −45 dB, the prediction accuracy of the models falls off a cliff and they lose predictive ability; the noise disturbance affects EMD, CEEMDAN, and WD most seriously. FIGURE 7 (c) and (d) show that, for missing data, the error of all decomposition algorithms first increases linearly and then flattens out. It can be noted that multi-scale decomposition is quite sensitive to missing data.
When the missing rate is 5%, the prediction error of half of the combined models is already higher than that of BiLSTM. As the missing rate increases further, the prediction errors of all combined models gradually exceed that of BiLSTM; when the missing rate reaches 25%, the BiLSTM neural network has the smallest prediction error. In particular, WD, which has the best prediction effect, is the most sensitive to missing data: when the missing rate reaches 5%, its error triples, and at a 25% missing rate it is the model with the largest error. EEMD performs well in the noise-resistance test but poorly against missing data, with an error growth rate second only to WD. The missing-data resistance of EMD is also poor, with errors close to those of WPD. VMD and CEEMDAN show good performance in this test, superior to the other decomposition algorithms.
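The two disturbance generators used in this subsection can be sketched as follows (the function names are ours; the noise is scaled from the signal power so that the requested SNR holds in expectation):

```python
import numpy as np

def add_noise_snr(x, snr_db, seed=0):
    # white Gaussian noise scaled so that 10*log10(P_signal/P_noise) = snr_db
    rng = np.random.default_rng(seed)
    p_signal = np.mean(x ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))
    return x + rng.normal(0.0, np.sqrt(p_noise), size=x.shape)

def drop_random(x, miss_rate, seed=0):
    # simulate random detector failure: zero out a fraction of the samples
    rng = np.random.default_rng(seed)
    y = x.copy()
    y[rng.random(x.shape) < miss_rate] = 0.0
    return y
```

A negative SNR in dB means the injected noise power exceeds the signal power, which is why the −35 dB and −45 dB settings destroy predictive ability.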

D. GENERALIZATION PERFORMANCE EVALUATING
At present, in studies of short-term traffic flow prediction, the earlier part of the data from a section is taken as the training set and the later part as the test set to evaluate the generalization performance of the neural network, so that generalization performance is equated with prediction accuracy. However, under big data circumstances, the short-term traffic flows of the same section show similar patterns: a trained neural network may achieve extremely high accuracy on that section but a poor effect on other sections, that is, overfitting occurs. Therefore, this evaluation of generalization performance is not comprehensive enough. Our goal is to obtain a prediction model for the road network; therefore, 3 of the remaining 23 sections are selected randomly for prediction with the trained models, so as to comprehensively evaluate the generalization performance of the neural network and the influence of the decomposition algorithms. The results in FIGURE 8 show that when the models are adopted to predict different sections, the prediction error increases significantly and to different degrees, which indicates that evaluating generalization performance only by training and testing on the same section is not comprehensive. At the same time, multi-scale decomposition greatly improves the generalization performance of the neural networks. When the traditional BiLSTM is applied to other sections, the prediction error increases greatly and fluctuates greatly between sections; no matter which multi-scale decomposition algorithm is utilized to optimize the neural network, the error fluctuation decreases obviously. WD maintains good generalization performance in general and has the minimum average MAPE, but its prediction error fluctuates greatly across sections.
On the other hand, VMD shows excellent generalization performance and stability: its average MAPE is second only to WD, but its stability is far better. Besides, the stability of CEEMDAN and EEMD differs little, and the prediction error of CEEMDAN is slightly smaller than that of EEMD. The generalization performance of EMD and WPD is unsatisfactory, and they are not suitable for road network prediction. VMD and WD are the optimal decomposition algorithms for road network traffic flow prediction.

E. COMPREHENSIVE COMPARATIVE EVALUATION
For convenience of comparison, the performance of each model under the different tests is ranked and scored; models that perform well achieve higher scores. Prediction performance and missing-data resistance are evaluated comprehensively by RMSE and MAPE. For noise resistance, the regularization effect is taken as the main standard, with the error from −35 dB to −45 dB used for auxiliary evaluation. For generalization performance, the error fluctuation across the different sections is the main factor, supplemented by the average MAPE.
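A minimal version of this rank-and-score scheme, with the exact point values being our own assumption, is:

```python
def score_models(errors):
    # rank models by error; the best (smallest-error) model gets the highest score
    order = sorted(errors, key=errors.get)
    n = len(order)
    return {model: n - rank for rank, model in enumerate(order)}
```

Scores from the separate tests can then be summed per model to give the comprehensive comparison plotted in FIGURE 9.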
From the results in FIGURE 9, several interesting findings can be summarized as follows: (1) In most cases, the multi-scale decomposition algorithm can effectively improve the performance of the neural network, although these algorithms are quite sensitive to missing data. However, when forecasting traffic flow in real situations, interpolation or machine learning is often used to fill in missing data, so the impact of missing data is relatively light.
(2) VMD and WD are the multi-scale decomposition algorithms with the best comprehensive performance. VMD shows stable performance in all tests and can deal with various complex traffic flow conditions, giving it wide applicability. Although WD is extremely sensitive to missing data, if the original data is properly preprocessed it is the most superior decomposition algorithm.
(3) The performance of CEEMDAN and WPD is relatively close overall, and they can substitute for each other. If higher prediction accuracy and noise resistance are required, WPD can be employed; if the focus is on the generalization performance of the model, CEEMDAN is the better choice.
(4) The overall performance of EMD and EEMD is mediocre. EMD in particular performs relatively poorly in almost all tests, and a superior alternative can always be found, so we do not recommend employing it as a first choice under any circumstances. While performing poorly in the other tests, EEMD has unmatched noise resistance and can be considered when the raw traffic flow data contains a lot of noise that cannot be effectively separated.

IV. CONCLUSION
This study assessed the impact of multi-scale decomposition algorithms on neural network models based on the combination of BiLSTM and each decomposition algorithm, using traffic flow data collected from PeMS. K-means clustering is utilized to unify the input parameters and achieve dimensionality reduction. Three metrics, prediction performance, robustness, and generalization performance, are proposed to comprehensively evaluate the influence of the multi-scale decomposition algorithms. Several conclusions can be drawn from the results of this study. Firstly, multi-scale decomposition is quite effective in improving the performance of the neural network in all respects. Secondly, VMD is the most widely applicable algorithm, WD is excellent after proper data preprocessing, and EEMD has unmatched noise resistance. Finally, WPD can be used when pursuing prediction accuracy and anti-noise performance, and CEEMDAN when pursuing generalization performance, but in general both are inferior to VMD. EMD is not recommended in any case.
By comparing the performance of the different combination models, this study provides some useful suggestions on proper multi-scale decomposition algorithm selection in short-term traffic flow prediction. However, in the current study we only consider historical traffic flow and do not examine the response of decomposition algorithms to external factors such as weather and holidays, which will be the focus of our further research.