A Period-Specific Combined Traffic Flow Prediction Based on Travel Speed Clustering

Short-term traffic flow forecasting has long been an active research topic in the field of Intelligent Transportation Systems. This paper presents a time-based combined traffic flow prediction model based on field data collected by loop detectors at signalized intersections, which are used for signal optimization, route choice, traffic monitoring, etc. Firstly, the hourly traffic flow and the corresponding travel speed are processed for error elimination and correlation analysis. Secondly, the time of day is divided into three groups (peak, flat-peak and low-peak periods) by clustering the hourly travel speed, so that a prediction formula can be developed separately for each period and the overfitting of a single 24-hour model can be avoided. Then, a combined prediction model based on this time partition is proposed for 24-hour traffic flow forecasting, which adopts the grey theory model for the flat-peak and low-peak periods and a back-propagation artificial neural network for the peak hours. Finally, in tests that used field data from Xingzhong Rd, Zhongshan, China, the developed combined method based on speed clustering shows promise in reducing the mean absolute error, mean absolute percentage error and mean squared error. Further comparison experiments indicate that the period-specific combined model yields a more accurate and reliable prediction than the individual models and the existing combined models that apply the same structure to all 24 hours.


I. INTRODUCTION
Urbanization has been causing serious traffic congestion in numerous metropolitan areas and large cities around the world. Thus, it is necessary to carry out urban road infrastructure construction and advanced traffic management to meet travel demand [1]. Most effective strategies for traffic congestion mitigation depend on accurate and timely traffic prediction, such as traffic flow for traffic organization and signal timing optimization, and travel time or speed for vehicle routing guidance.
Since the early 1980s, short-term traffic prediction has become one of the most important components of Intelligent Transportation Systems (ITS), with a prediction time window ranging from a few minutes to a few hours into the future based on road geometry, traffic information, control strategies, etc. [2]. In the literature of the past few decades, forecasting models can be roughly classified into two categories: single models and combined ones.
The associate editor coordinating the review of this manuscript and approving it for publication was Edith C.-H. Ngai.
The former is usually dedicated to one certain kind of formula that considers current and past traffic information. The existing literature can be divided into two subgroups. One category is parametric models, which can be described by a finite number of parameters, such as the exponential smoothing model [3], the historical average algorithm [4], Autoregressive Integrated Moving Average (ARIMA) [5], Kalman Filtering (KF) [6], [7], and the Grey theory model (GM) [8]. Among them, GM is suitable for predicting systems with poor information and uncertainty, such as traffic flow [9]. Okutani and Stephanedes [10] began to employ the Kalman filter to forecast traffic flow on the road network in Nagoya, Japan. Moreover, Sun et al. [11] developed a linear regression model to forecast flow on the US-290 freeway in Houston, USA, and found that it outperformed the k-nearest neighbor method and the kernel smoothing method. The other category, non-parametric models, assumes that the structure of traffic parameters is not fixed and mainly follows statistical regularity drawn from abundant field data, such as Support Vector Machine (SVM) [1], artificial neural networks [12]-[14], non-parametric regression [15], and Gaussian maximum likelihood [16]. Among them, the Artificial Neural Network (ANN) is one of the most widely used methods because it can capture traffic fluctuation [17]. Liang and Wei [18] modeled traffic flow on freeways based on simple recurrent networks, also known as the Elman Network. Tan et al. [19] showed that the k-Nearest Neighbor (k-NN) model can outperform ARIMA and the exponential smoothing model based on field data from Guangzhou, China. Later, Lv et al. [20] first applied the Stacked Auto-Encoder (SAE) to short-term traffic flow prediction and trained the model by a greedy layer-wise algorithm. Recently, Yu et al. [21] analyzed the temporal-spatial characteristics of traffic flow, and then employed a Convolutional Neural Network (CNN) to predict short-term flow based on location partition.
Subsequently, Xu et al. [22] developed an artificial fish swarm algorithm to optimize support vector machine regression for flow forecasting. With the development of computer techniques, machine learning has also been used for flow forecasting on the basis of huge historical datasets. For example, Dai et al. [23] proposed a many-to-many deep learning model for traffic prediction, namely DeepTrend 2.0, which takes multi-sensor information as input and simultaneously generates predicted results for all sensors. Meanwhile, Li et al. [24] proposed a deep feature learning approach for the following multiple steps by using supervised learning techniques. Zhao et al. [25] predicted traffic flow on four road segments in Beijing by using LSTM, and found that it outperformed ARIMA and RNN (Recurrent Neural Network). Polson and Sokolov [26] pointed out that deep learning architectures are able to capture the nonlinear spatial-temporal effects resulting from the transitions between free flow, breakdown, recovery, and congestion in traffic flow.
Different from the previous single forecasting techniques, most combined models can yield greater benefits than individual models of the same kind because they exploit the advantages of two or more methods. For example, some researchers preferred to choose ANN as the underlying model and integrate other methods into it, such as GM [27], [28], ARIMA [29], k-NN [30], Support Vector Regression (SVR) [31], clustering algorithms [32], and simple statistical approaches [33]. Also, Feng et al. [34] combined a wavelet function and the Extreme Learning Machine (ELM) to propose a short-term prediction method which outperformed ANN on field data from a Canadian highway. Recently, Wu et al. [35] proposed a combined CNN-RNN deep learning model by considering the weekly/daily periodicity and spatial-temporal characteristics of traffic flow. By decomposing the traffic flow time series into one periodic sequence and two random ones, Zheng et al. [36] proposed a hybrid prediction with Back Propagation (BP)-based ANN, ε-SVR and LSTM models. Chang and Tsai [37] reported a composite method that incorporates generalized auto-regressive conditional heteroscedasticity into a GM tuned by adaptive support vector regression.
Although researchers worldwide have reported a large variety of methods for traffic flow prediction, short-term traffic flow forecasting along urban signalized corridors is still a challenging task because traffic flows carry many uncertainties (e.g. they are time-varying, highly oscillating, nonlinear and non-stationary). In particular, most studies are dedicated to one single model for 24-hour forecasting by quantifying the relationship between the predictor and dependent input variables, which might result in model overfitting due to the high fluctuations of traffic flow over hours. Therefore, this study proposes a new period-based combined scheme of GM and BP based on historical field datasets. The main contents of this paper can be divided into two parts: (i) the k-means clustering method is employed to divide the 24 hours of one day into multiple time periods based on the hourly travel speed time series; and (ii) a combined method of GM and BP, namely GM-BP, is developed to forecast the hourly traffic flow for each time period, which can capture the fluctuation and prevent overfitting.
The remainder of the article is organized as follows: Section 2 describes the field dataset and data processing. In Section 3, the detailed prediction method is developed and discussed. The case study in Section 4 demonstrates model performance with the field data from the city of Zhongshan. The last section presents the conclusions.

II. DATA COLLECTION AND ANALYSIS
It is important to investigate traffic flow prediction based on actual traffic data. However, it is time-consuming and expensive for local governments or traffic engineers to collect large-scale traffic data in practice. In the past decades, local governments in many Chinese cities have installed a great deal of infrastructure and developed application systems based on the idea of ITS.

A. DATA SOURCE
As one of the earliest pilot cities of ITS in China, the city of Zhongshan in Guangdong Province is able to automatically collect city-level traffic flow at signalized intersections. Therefore, this study collected hourly traffic flow and link travel speed from the ITS with Internet Plus operated by the Zhongshan Traffic Police Detachment.
In detail, the test site is located on Xingzhong Rd, a two-way road with six motorized lanes, which is the busiest and most congested south-north corridor in the Zhongshan downtown area. There are many government agencies, commercial buildings, and activity centers along Xingzhong Rd. The dataset, with a time interval of one hour, was recorded from February 27 to March 26, 2017, and the total sample size is 672. It includes the southbound traffic flow collected by loop detectors installed several meters before the southbound stop-line at the signalized intersection between Xingzhong Rd and Songyuan Rd, and the link travel speed from Sunwen East Road to Songyuan Road along Xingzhong Road, as shown in Figure 1.

B. ABNORMAL DATA IDENTIFICATION
Basic data analysis shows that the raw traffic flow and average travel speed contain some abnormal data, as depicted in Figure 2, which could be caused by detector failures due to power loss, communication interruption, etc. In detail, the traffic volume suddenly dropped from about 1200 vehs/h at 11:00 to 100 vehs/h at 12:00, and dramatically increased to 1750 vehs/h at 13:00 on March 2. Similarly, the average travel speed reached 82.8 km/h at 8:00 on February 11, and then dropped to 0 km/h in the next four hours. Thus, it is necessary to identify and eliminate these abnormal data before prediction.
Generally, the Wright criterion (i.e. the 3σ criterion) [38] is a very effective method for discriminating outliers in the case of a normal distribution. This study adopts a data processing procedure based on this criterion. Firstly, the residual between the hourly traffic volume detected by loop detectors and its average is defined by:
$$e(i) = q(i) - \bar{q} \tag{1}$$
where q(i) represents the detected traffic volume at the ith hour, and $\bar{q}$ is the mean of the total sample data. If the absolute residual of the ith sample is greater than three times the standard deviation σ of the residuals, i.e. $|e(i)| > 3\sigma$, the sample is marked as abnormal and needs to be calibrated by other methods. This procedure is also applied to the link travel speed in this paper.
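The 3σ screening described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `flag_outliers_3sigma` and the sample volumes are hypothetical, and the population standard deviation of the residuals is assumed as σ.

```python
from statistics import mean, pstdev


def flag_outliers_3sigma(values):
    """Return the indices of samples lying more than 3 standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation of the residuals
    if sigma == 0:
        return []  # constant series: nothing can be flagged
    return [i for i, x in enumerate(values) if abs(x - mu) > 3 * sigma]


# Hypothetical hourly volumes: twenty ordinary readings followed by one spike.
volumes = [1000] * 20 + [5000]
print(flag_outliers_3sigma(volumes))
```

Note that with very short series the criterion cannot fire at all, since a single sample can never deviate by more than 3σ when only a handful of points are available; the screening is meant for the full multi-week series.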

C. DATA CORRELATION ANALYSIS
Many variables relevant to traffic flow prediction have been considered in the literature, such as historical flow [39], [40], travel speed [41], [42], traffic state [43], congestion levels [44], and occupancy [45]. Moreover, it is expensive and difficult to collect traffic signal timing plans because in practice they are sometimes adaptive or actuated based on control logic. Therefore, this article aims to develop a feasible prediction method based on the data available in Zhongshan. In order to determine the prediction formula and inputs, we first analyzed the three-week dataset of hourly traffic flow and travel speed, and illustrate some key findings via the one-week data from March 20 to March 26 shown in Figure 3. Figure 4 shows the link travel speed estimated from floating car data for the link from Sunwen East Road to Songyuan Road along Xingzhong Road in Figure 1. During the analyzed time windows, the traffic system was quite stable, without special incidents along the targeted corridor such as holidays, major events, or school opening or closing. Comparing Figures 3 and 4, the travel speed time series was quite stable and high from 22:00 to 6:00 the next day, namely the late-night off-peak hours. The speed then dropped and traffic congestion occurred from 7:00 to 8:00 and from 17:00 to 18:00, namely the morning and evening peak hours, respectively; the remaining hours are regarded as a transition process with high fluctuation of volume and speed. In addition, there is a slight decline in speed at noon on Saturday and Sunday.
In order to further explore the characteristics of hourly traffic flow, a correlation analysis is conducted via the Pearson coefficient by using the data from February 27 to March 26, 2017. Overall, within the same day, the correlation coefficients become smaller as the time difference increases, as shown in the last column of Table 1. The coefficients between two adjacent hours are greater than 0.8 across different times of day, and the coefficients for the last three intervals also exceed 0.5. The results show that the volume of the current interval has a significant correlation with the past three ones, which should be considered in model development.
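As an illustration of this kind of analysis, the Pearson coefficient between an hourly series and a lagged copy of itself can be computed as sketched below. The helper names `pearson` and `lag_correlation` are hypothetical, and treating the analysis as a lag-k correlation over one flat hourly series is an assumption; the layout actually used for Table 1 may differ.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)  # undefined when either series is constant


def lag_correlation(series, lag):
    """Correlation between q(i) and q(i - lag) over one flat hourly series (lag >= 1)."""
    return pearson(series[lag:], series[:-lag])


# Hypothetical usage: correlation of the flow with itself one, two and three hours back.
flow = [120, 90, 60, 50, 70, 200, 700, 1150, 1200, 1100, 1180, 1220]
print([round(lag_correlation(flow, k), 3) for k in (1, 2, 3)])
```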
Meanwhile, one can also observe in Table 2 that the coefficients over the past week are higher than 0.85 regardless of the day of the week. Different from other roads in Zhongshan, the correlation between weekdays and weekends remains high because the many activity centers located on either side of the arterial attract many local residents, and Xingzhong Rd is also the most important south-north arterial for interzone travelers. Therefore, it is possible to use historical time series to estimate missing or abnormal data.
Thus, in this study, abnormal or missing travel speed and flow values are set to the average value of the same time interval on the same day of the week over the past three weeks.
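The replacement rule above can be sketched as follows, assuming the data are stored as one flat hourly list so that the same hour one week earlier lies exactly 168 samples back. The function name `impute_from_past_weeks` and its parameters are hypothetical.

```python
def impute_from_past_weeks(series, bad_indices, weeks=3, hours_per_week=168):
    """Replace flagged samples with the mean of the same hour and weekday in earlier weeks."""
    filled = list(series)
    bad = set(bad_indices)
    for i in sorted(bad):
        donors = []
        for k in range(1, weeks + 1):
            j = i - k * hours_per_week
            if j >= 0 and j not in bad:  # skip donors that are missing or flagged themselves
                donors.append(series[j])
        if donors:
            filled[i] = sum(donors) / len(donors)
        # with no usable donor the original value is left untouched
    return filled


# Hypothetical four-week series whose value depends only on the hour of the week.
series = [float(i % 168) for i in range(4 * 168)]
series[3 * 168 + 5] = -1.0  # a corrupted reading in the fourth week
print(impute_from_past_weeks(series, [3 * 168 + 5])[3 * 168 + 5])
```

Samples in the first weeks of the record have fewer than three donors; this sketch simply averages whatever earlier weeks exist, which is a design choice rather than something stated in the text.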

III. MODEL DEVELOPMENT

A. MODEL STRUCTURE
Traffic flow cannot directly reflect the traffic condition unless it is combined with other parameters, such as the number of lanes, saturation flow rate, and signal timing. Travel speed, however, is a popular variable for effectively representing traffic congestion. Thus, this study presents a speed-clustering method to identify traffic congestion and decompose the 24 hours of one day into multiple periods, and develops a specific flow prediction algorithm for each period, namely period-specific prediction. The scheme of the entire prediction logic is developed as follows:

1) DATA PROCESSING
The collected dataset of traffic flow and speed is filtered according to the Wright criterion, and the abnormal or missing data are estimated with the average of the same time interval on the same day of the week over the past three weeks.

2) TIME DECOMPOSITION BASED ON SPEED-CLUSTERING
This study applies k-means clustering to the travel speed time series to divide the 24 hours of one day into multiple time periods in order to identify the peak and off-peak hours.

3) PERIOD-SPECIFIC PREDICTION FORMULATION
Based on the previous clustering results, the GM and BP model are combined for flow prediction at all divided time periods.

B. TIME PERIOD DECOMPOSITION BASED ON SPEED-CLUSTERING
As a partition-based clustering method, the k-means algorithm has the advantage of efficiently processing huge datasets and discovering patterns. Based on the Euclidean distance, the objective function of the clustering method can be expressed as follows:
$$SSE = \sum_{k=1}^{K} \sum_{v(i) \in C_k} \left( v(i) - c_k \right)^2 \tag{2}$$
where SSE is the sum of squared errors, which serves as the objective function for measuring clustering quality; K is the total number of clusters; $C_k$ denotes the dataset of the kth cluster; v(i) represents the link travel speed at the ith hour in one day; and $c_k$ is the centroid of cluster k.
Herein, this paper performs k-means clustering on the travel speed to search for each cluster center. The entire procedure is decomposed into the following steps:
• Step 2: Randomly choose K data samples of v(j) as the initial cluster centers.
• Step 3: Calculate the distance from each sample in the dataset to all cluster centers, and then allocate this sample to the nearest center based on distance.
• Step 4: Update all cluster centers based on sample reallocation.
• Step 5: Repeat Steps 3 and 4 until the SSE objective function reaches its minimum, and obtain the final K clusters.
For the total of 672 data samples, the clusters converge after 10 iterations, and the travel speed dataset is divided into three categories. If the samples for the same time interval on different days belong to more than one cluster, this study employs majority voting to assign the interval to a single cluster. Finally, the cluster centers are 27.9 km/h for the peak period, 36.5 km/h for the flat-peak period, and 45.8 km/h for the low-peak period. Correspondingly, the low-peak period ranges from 23:00 to 6:00, the flat-peak period from 9:00 to 16:00 and from 19:00 to 22:00, and the peak periods from 7:00 to 8:00 and from 17:00 to 18:00, as shown in Figure 5.
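The clustering and majority-voting procedure above can be sketched in plain Python for the one-dimensional speed data. The function names `kmeans_1d` and `periods_by_hour`, the fixed random seed and the synthetic speeds in the usage lines are all assumptions made for illustration; they are not the implementation used in this study.

```python
import random
from collections import Counter


def kmeans_1d(values, k, max_iter=100, seed=0):
    """Plain k-means on scalar samples; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(sorted(set(values)), k)  # random initial centers taken from the data
    labels = []
    for _ in range(max_iter):
        # Assign every sample to its nearest center (squared Euclidean distance).
        labels = [min(range(k), key=lambda c: (v - centers[c]) ** 2) for v in values]
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:  # centers stopped moving: converged
            break
        centers = new_centers
    return centers, labels


def periods_by_hour(speeds, k=3, hours_per_day=24):
    """Cluster hourly speeds, then give each hour of day its majority cluster.

    Clusters are ranked by center speed, so 0 is the slowest (peak) period.
    """
    centers, labels = kmeans_1d(speeds, k)
    order = sorted(range(k), key=lambda c: centers[c])
    rank = {c: r for r, c in enumerate(order)}
    votes = [Counter() for _ in range(hours_per_day)]
    for i, lab in enumerate(labels):
        votes[i % hours_per_day][rank[lab]] += 1
    return [v.most_common(1)[0][0] for v in votes], sorted(centers)


# Hypothetical one-week speed profile (km/h) with three clean levels.
def hour_speed(h):
    if h in (7, 8, 17, 18):
        return 28.0
    return 46.0 if (h == 23 or h <= 6) else 36.0


speeds = [hour_speed(h) for _ in range(7) for h in range(24)]
periods, centers = periods_by_hour(speeds)
print(centers, periods)
```

Because the initial centers are random, k-means on noisier data can settle in a local minimum of the SSE; running several restarts and keeping the lowest SSE is a common remedy that this sketch omits.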

C. PERIOD-SPECIFIC PREDICTION MODEL
Based on the decomposed three periods in Figure 5, the period-specific combined predicted model (CPM) for 24-hour traffic flow is developed as follows:

1) GM-BASED PREDICTION FOR LOW-PEAK PERIOD
From 23:00 to 6:00 each day, the average vehicle speed is close to the free-flow speed, and the traffic volume is very low. In particular, the volume during this period is almost always decreasing, so the GM is appropriate for capturing this kind of downward trend with uncertainty. The grey theory model, the core component of grey system theory, has been widely used in the field of transportation, especially for small-sample time series prediction or estimation [9]. GM(1,1) is the typical form of grey theory, and can be formulated by the following Equations (3)-(7) [8]. Firstly, the original time series of n observations is defined as follows:
$$Q^{(0)} = \left\{ q^{(0)}(1), q^{(0)}(2), \ldots, q^{(0)}(n) \right\} \tag{3}$$
Then, the one-time accumulated series of traffic flow can be described by:
$$Q^{(1)} = \left\{ q^{(1)}(1), q^{(1)}(2), \ldots, q^{(1)}(n) \right\}, \quad q^{(1)}(m) = \sum_{j=1}^{m} q^{(0)}(j) \tag{4}$$
where the superscript (1) means that the traffic flow is processed with the accumulated generating operation from the original series. Subsequently, the following expression is defined:
$$Z^{(1)} = \left\{ z^{(1)}(2), z^{(1)}(3), \ldots, z^{(1)}(n) \right\} \tag{5}$$
where $Z^{(1)}$ is the mean sequence of $Q^{(1)}$ calculated by $z^{(1)}(m) = 0.5\left(q^{(1)}(m) + q^{(1)}(m-1)\right)$. Then, the basic GM(1,1) can be formulated by the following expression:
$$q^{(0)}(m) + a z^{(1)}(m) = b \tag{6}$$
where a and b are the grey coefficients, which can be calibrated by the conventional least-squares method. Therefore, the predicted traffic flow can be expressed as follows:
$$\hat{q}^{(0)}(m+1) = \left(1 - e^{a}\right)\left(q^{(0)}(1) - \frac{b}{a}\right)e^{-am} \tag{7}$$
Based on the previous correlation analysis in Section 2, this study takes the traffic flow time series of the past three hours as model inputs, and the output is the flow of the current hour.
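A minimal sketch of the standard GM(1,1) procedure is given below: accumulate the series, build the mean sequence, estimate a and b by ordinary least squares, and restore the one-step-ahead value. The function names `gm11_fit` and `gm11_predict_next` and the sample series are hypothetical, and the closed-form two-parameter regression is one common way to carry out the least-squares step.

```python
from math import exp


def gm11_fit(q0):
    """Estimate the grey coefficients (a, b) of q0(m) + a * z1(m) = b by least squares."""
    if len(q0) < 3:
        raise ValueError("GM(1,1) needs at least three observations")
    q1, total = [], 0.0
    for x in q0:  # accumulated generating operation
        total += x
        q1.append(total)
    z = [0.5 * (q1[m] + q1[m - 1]) for m in range(1, len(q0))]  # mean sequence
    y = list(q0[1:])
    mean_z = sum(z) / len(z)
    mean_y = sum(y) / len(y)
    szz = sum((zi - mean_z) ** 2 for zi in z)
    szy = sum((zi - mean_z) * (yi - mean_y) for zi, yi in zip(z, y))
    slope = szy / szz  # regression of y on z; the model says slope = -a
    return -slope, mean_y - slope * mean_z


def gm11_predict_next(q0):
    """One-step-ahead GM(1,1) forecast for the value following the series q0."""
    a, b = gm11_fit(q0)
    if abs(a) < 1e-12:
        return b  # degenerate case: no growth or decay, the series is flat
    n = len(q0)
    return (1.0 - exp(a)) * (q0[0] - b / a) * exp(-a * n)


# Hypothetical steadily growing flow; the true continuation would be about 146.4.
print(round(gm11_predict_next([100.0, 110.0, 121.0, 133.1]), 1))
```

The sketch uses four observations in the usage line; with only three inputs, as in this study, the regression passes exactly through the two available points of the mean sequence.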

2) GM-BASED PREDICTION FOR FLAT-PEAK PERIOD
During the flat-peak period between 9:00 and 16:00, the average vehicle speed is moderate compared with the other two periods, but it is accompanied by a rapidly rising or falling trend. Therefore, the GM is also suitable for this period, and the inputs are the same as for the low-peak period because the correlation coefficients of traffic flow between the current hour and the past three hours exceed 0.5.

3) BP-BASED PREDICTION FOR PEAK PERIOD
Artificial Neural Network can capture the fluctuation of traffic flow affected by the uncertain and nonlinear noises due to the capability of handling complex non-linear mapping, flexible network structure, and learning ability [46]. As a common neural network style, BP has the characteristics of signal forward transmission and error back propagation, and can capture the uncertainty and nonlinearity in traffic flow.
The remaining four hours belong to the peak hours (j = 7, 8, 17, and 18), during which traffic volumes are much larger and fluctuate more than in the other periods. Thus, the BP neural network might be suitable for capturing the fluctuation of traffic flow during peak hours. However, the data-driven BP model, being a black box, needs many more samples to calibrate its parameters. As Table 2 shows, the correlation between the current hour volume and the volume at the same hour on the same day of the week in previous weeks is over 0.85, and thus historical data at the same hour in the past several weeks are employed to train the BP model. The developed BP model has the popular structure of one input layer, one hidden layer, and one output layer.
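To make the forward-transmission and error back-propagation idea concrete, the following sketch trains a network with one hidden layer by stochastic gradient descent. It is a generic illustration, not the trained model of this study: the function name `train_bp`, the sigmoid hidden units, the linear output, the learning rate, and the assumption that inputs and targets are scaled to roughly [0, 1] are all choices made here, and the layer sizes are left as parameters.

```python
import math
import random


def train_bp(samples, n_hidden, epochs=300, lr=0.05, seed=0):
    """Train a one-hidden-layer network on (inputs, target) pairs; returns a predictor."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b2 = 0.0

    def forward(x):
        # Signal forward transmission: sigmoid hidden layer, linear output neuron.
        h = [1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
             for row, b in zip(w1, b1)]
        return h, sum(w * hi for w, hi in zip(w2, h)) + b2

    for _ in range(epochs):
        for x, target in samples:
            h, y = forward(x)
            err = y - target  # gradient of 0.5 * (y - target)^2 with respect to y
            for j in range(n_hidden):
                # Error back propagation through hidden neuron j (uses w2[j] before its update).
                grad_h = err * w2[j] * h[j] * (1.0 - h[j])
                w2[j] -= lr * err * h[j]
                for i in range(n_in):
                    w1[j][i] -= lr * grad_h * x[i]
                b1[j] -= lr * grad_h
            b2 -= lr * err
    return lambda x: forward(x)[1]


# Hypothetical scaled data: three inputs per sample, target equal to their mean.
rng = random.Random(1)
data = []
for _ in range(20):
    x = [rng.random() for _ in range(3)]
    data.append((x, sum(x) / 3))
model = train_bp(data, n_hidden=12)
print(round(model([0.2, 0.4, 0.6]), 3))
```

In practice, volumes would be normalized before training and mapped back to vehs/h afterwards; that scaling step is omitted here.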
It is difficult to decide the number of neurons in the hidden layer of a BP network. According to the characteristics of the input and output data, the number of neurons in the hidden layer is initially determined by:
$$A = \sqrt{B + C} + D \tag{8}$$
where A is the number of neurons in the hidden layer; B and C are the numbers of neurons in the input layer and output layer, respectively; and D denotes a constant integer from 0 to 10. After testing A values from 5 to 15, this study obtained the final value of 12, at which the fitting error is the smallest. So, the structure of the network is BP(3,12,1), namely 3 input neurons, 12 hidden neurons and 1 output neuron. In this study, 504 of the total 672 samples are selected for training, and the remaining samples are used for prediction. Finally, the period-specific prediction model can be formulated by the following expression:
$$\hat{q}(i+1) = \begin{cases} f_{GM}\left(q(i), q(i-1), q(i-2)\right), & \text{low-peak and flat-peak periods} \\ f_{BP}\left(q(i), q(i-1), q(i-2)\right), & \text{peak period} \end{cases} \tag{9}$$
where $\hat{q}(i+1)$ is the predicted flow, and $f_{GM}$ and $f_{BP}$ denote the GM(1,1) and BP(3,12,1) predictors, respectively. If i-1 or i-2 is less than zero, the time series of the previous day is used as input. For example, if i = 0, q(i-1) and q(i-2) represent the volume at 23:00 and at 22:00 on the previous day, respectively.
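The period-specific switching logic can be sketched as a small dispatcher over a flat hourly series, in which looking back past midnight automatically reaches the previous day. The names `PEAK_HOURS` and `predict_next_hour` are hypothetical, the two predictors are passed in as arbitrary callables, and selecting the model by the hour being predicted (rather than by the hour of the latest observation) is an assumption of this sketch.

```python
PEAK_HOURS = {7, 8, 17, 18}  # peak period identified by the speed clustering


def predict_next_hour(series, t, gm_model, bp_model, hours_per_day=24):
    """Predict the flow at index t + 1 from the three latest observations.

    `series` is a flat hourly list, so indices t - 1 and t - 2 fall on the
    previous day whenever t is at the start of a day.
    """
    if t < 2:
        raise ValueError("need at least three past observations")
    inputs = series[t - 2:t + 1]  # [q(i-2), q(i-1), q(i)]
    target_hour = (t + 1) % hours_per_day
    model = bp_model if target_hour in PEAK_HOURS else gm_model
    return model(inputs)


# Hypothetical stand-ins for the two trained predictors.
def gm_stub(x):
    return sum(x) / len(x)


def bp_stub(x):
    return max(x)


flow = list(range(48))  # two days of dummy hourly values
print(predict_next_hour(flow, 6, gm_stub, bp_stub))   # target hour 7 uses the BP stand-in
print(predict_next_hour(flow, 24, gm_stub, bp_stub))  # inputs span midnight, GM stand-in
```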

IV. EXPERIMENTAL ILLUSTRATION
To evaluate the effectiveness of the proposed model, this study used the field data from Zhongshan described in Section 2 for testing, and compared the model with popular existing models in terms of the following Measurement of Effectiveness (MOE) indexes: the Mean Absolute Error (MAE) between the actual volume and the predicted one,
$$MAE = \frac{1}{n}\sum_{i=1}^{n} \left| q(i) - \hat{q}(i) \right| \tag{10}$$
the Mean Absolute Percentage Error (MAPE),
$$MAPE = \frac{100\%}{n}\sum_{i=1}^{n} \frac{\left| q(i) - \hat{q}(i) \right|}{q(i)} \tag{11}$$
and the Mean Squared Error (MSE),
$$MSE = \frac{1}{n}\sum_{i=1}^{n} \left( q(i) - \hat{q}(i) \right)^2 \tag{12}$$
where n is the total number of tested data samples, q(i) is the actual volume and $\hat{q}(i)$ is the predicted one.
Combined models are denoted by the names of their component models (e.g. ARIMA-BP or A-B). Notably, in order to evaluate these models fairly, the related GM and BP models have the same underlying structure of GM(1,1) and BP(3,12,1), and the parameters of all models are retrained on the sampled dataset. The comparison of the prediction accuracy is shown in Figure 6. It can be concluded that the combination proposed in this paper obtains much better results than the others regardless of the MOE. The MAPEs of BP and BP-ARIMA exceed 20%, and the MAPEs of the other models except CPM range from 13.2% to 19.5%. Moreover, the performance of the GM, GM-ARIMA and GM-BP models, which share similar fundamental inputs from the GM prediction, is better than that of the BP or ARIMA models because the latter methods might not capture the upward or downward trend of traffic flow over time and converge more easily to a local optimum. Notably, LSTM performs better than the existing ARIMA-like, BP-like and GM-like models, but a little worse than the period-specific CPM developed in this study. Overall, the GM model provides good prediction accuracy, and thus this study combines GM and BP to predict traffic flow separately for different times of day. The results show that the MAPE of CPM is the lowest among the compared models, at 8.5%.

B. CPM VS GM-LIKE MODELS OVER TIME
As shown in Figure 7, the GM model has a strong ability to track the fast descending or ascending trends of traffic volume, and shows much better accuracy than the others. For example, from 10:00 to 12:00 on March 24, the actual hourly flow is about 1217 vehs/h, 1218 vehs/h, and 1043 vehs/h, respectively, and the corresponding prediction from the GM model is 1219 vehs/h, 1128 vehs/h, and 1044 vehs/h, respectively. However, for the peak-hour period at 7:00 or 17:00, the low-peak period between 3:00 and 4:00, and the stationary period at 14:00 and 15:00, the overall prediction precision is not acceptable due to the overfitting problem of GM. On the contrary, the proposed period-specific combination model can suppress the overfitting under dramatic flow fluctuation by introducing BP.
The ARIMA-GM and BP-GM models have a certain tracking ability and high accuracy during the periods when traffic flow increases or decreases rapidly because the underlying model is GM, which is similar to the behavior of the single GM model. After integrating ARIMA into GM, the overall forecasting accuracy shows a steady trend, and the overfitting problems for the maximal and minimal traffic flow prediction are reduced compared with the single GM model. After integrating the BP model into GM, the overall prediction accuracy is generally improved, but a time-lag problem occurs. Further, the LSTM has an under-fitting problem at the peaks and valleys of the flow curve.
In detail, the ARIMA-GM model shows low prediction accuracy, with the largest error of 40.7%, during morning peak hours from Tuesday to Saturday, especially at 7:00 on March 24 (Friday). What is more, the proposed CPM with GM can yield much better accuracy than the other two models (ARIMA-GM and BP-GM); in particular, its prediction error drops to 0.3% at 7:00 on Friday.

C. CPM VS BP-LIKE MODELS OVER TIME
As shown in Figure 8, the performance of the individual BP model is better than that of the GM and ARIMA models during morning peak hours between 7:00 and 9:00 and evening peak hours between 17:00 and 19:00. However, while the flow is increasing or decreasing, it shows lower accuracy and a time lag of about one hour. For example, the hourly traffic flows from 4:00 to 8:00 on March 22 are 41 vehs/h, 75 vehs/h, 196 vehs/h, and 1152 vehs/h, respectively, and the corresponding values predicted by BP are 50 vehs/h, 53 vehs/h, 891 vehs/h, and 1205 vehs/h, respectively. However, after other methods are introduced into BP, the ARIMA-BP and GM-BP models perform poorly, with a prediction error larger than 30% during low-peak hours from 0:00 to 5:00. The same finding for LSTM as in Figure 7 can be reached in Figure 8.

D. CPM VS ARIMA-LIKE MODELS OVER TIME
The ARIMA model provides quite stable prediction accuracy over time, as shown in Figure 9. However, the time lag of ARIMA is very obvious between 3:00 and 6:00 every day. Meanwhile, its tracking ability is weak in several specific periods (after 22:00, before 7:00, and 19:00-21:00) when the volume has a significant upward or downward trend, where the forecast value is much smaller than the observed one. The main reason is that ARIMA can only describe the linear and stationary relationship of data and cannot capture the real-time fluctuation of traffic flow. The MAPE of LSTM (13.2%) is better than that of GM-ARIMA (14.4%) and BP-ARIMA (20.5%). However, during peak-hour periods when traffic flow reaches its maximum, the error of GM-ARIMA is much higher than that of BP-ARIMA because of the overfitting of the GM.

E. RELIABILITY ANALYSIS OF 10 MODELS
From the cumulative distribution function (CDF) curves of the prediction error under the proposed CPM and the other nine models shown in Figure 10, the period-specific CPM shows promise in prediction accuracy and reliability regardless of the time of day. In particular, the probability of a MAPE of less than 10% in a week is up to 71.4% for CPM, while the probability of a MAPE larger than 25% is no greater than 5.8% on weekdays and 9.7% at weekends, respectively. Notably, the LSTM has the potential to achieve much better prediction performance than the other traditional methods except CPM.
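The MOE indexes and the error-distribution statistics used in this section can be computed as sketched below. The function names and the three-sample example are hypothetical; `share_within` returns one point of the empirical CDF of the absolute percentage error, and all percentage measures assume non-zero actual volumes.

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error in percent (actual values must be non-zero)."""
    return sum(100.0 * abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def share_within(actual, predicted, threshold_pct):
    """Fraction of samples whose absolute percentage error is at most threshold_pct."""
    hits = sum(1 for a, p in zip(actual, predicted)
               if 100.0 * abs(a - p) / a <= threshold_pct)
    return hits / len(actual)


# Hypothetical actual and predicted hourly volumes (vehs/h).
actual = [100, 200, 400]
predicted = [110, 180, 400]
print(mae(actual, predicted), mape(actual, predicted), mse(actual, predicted))
print(share_within(actual, predicted, 5), share_within(actual, predicted, 15))
```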

V. DISCUSSIONS AND CONCLUSION
The objective of this study is to improve the traditional 24-hour prediction logic and thereby develop a novel time-decomposition prediction method according to the fluctuation of traffic flow over time. In a test case, field traffic volume and link travel speed at an interval of 1 hour were collected in Zhongshan. Firstly, the abnormal and missing data were processed according to the Wright criterion. Then, temporal correlation analysis was performed to show that the current traffic parameters have a significant correlation with those of the last three hours and with those at the same time of day in the past seven days. After that, cluster analysis was conducted based on link travel speed, and the 24 hours of the day were divided into three periods, namely peak flow, flat-peak flow and off-peak flow. Finally, a period-specific prediction method with the grey theory model and the BP artificial neural network was formulated based on the characteristics of each period. A comprehensive experiment was conducted to validate the developed model, and the mean absolute percentage error was found to be about 8.46%. Most importantly, the probability of a MAPE of no more than 10% is close to 71.66% on weekdays and 71.30% at weekends, respectively. What is more, the other nine models (namely, the ARIMA-like, BP-like and GM-like models and LSTM) are evaluated and compared in terms of MOEs and reliability, which can offer valuable insight for both academia and industry. The proposed combined method can provide reliable information for traffic police departments on signal timing optimization and traffic guidance. Future work will focus on how to avoid the over-fitting and under-fitting of prediction models for specific periods.

PENGHAO LI is currently a Deputy Detachment Leader with the Traffic Police Detachment, Zhongshan. She has been engaged in urban road traffic management and control for many years. She is currently responsible for collecting intelligent traffic data, business data and internet data, and for using statistical analysis, data mining, artificial intelligence, and other technologies to move traffic management from experience-based governance to data-based governance.

VOLUME 8, 2020