Travel Time Prediction Using Hybridized Deep Feature Space and Machine Learning Based Heterogeneous Ensemble

Travel Time Prediction (TTP) has become an essential service that people use in daily commutes. With precise TTP, individuals, logistics companies, and transport authorities can better manage their activities and operations. This paper presents a novel Hybridized Deep Feature Space (HDFS) based TTP ensemble model (HDFS-TTP) for accurate travel time prediction. In the first step, extensive endogenous and exogenous data sources are augmented with traffic data obtained using sensors. Next, we used Principal Component Analysis (PCA) and a Deep Stacked Auto-Encoder (DSAE) for feature reduction. We generated feature spaces of deep learning models, namely Convolutional Neural Network (CNN), Multi-Layer Perceptron (MLP), Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU), and fed them to a model based on Support Vector Regression (SVR) for predicting travel times. The two best-performing models are selected, and their feature spaces are hybridized to form a boosted feature space, on which we employed SVR for the final prediction. Our proposed HDFS-TTP ensemble can learn complex non-linearities in traffic data through its varied architectural design. On test data, our proposed HDFS-TTP ensemble using the hybridized and boosted feature spaces showed significant improvement in terms of Root Mean Square Error (62.27 ± 1.58), Mean Absolute Error (13.38 ± 1.09), Maximum Absolute Error (104.66 ± 2.77), Mean Absolute Percentage Error (2.50 ± 0.03), and Coefficient of Determination (0.99714 ± 0.00044).

• We have augmented exogenous features with Floating-Cars Data (FCD) to enhance the overall performance of our proposed ensemble.

• We have also extracted Principal Component Analysis (PCA) features and encoded GPS trajectories using a Deep Stacked Auto-Encoder (DSAE) to boost the feature space. The comparative analysis with baseline architectures shows considerable improvement in metrics like RMSE, MAE, Max. AE, MAPE, and R² on the FCD dataset for our feature-based LSTM-GRU ensemble.
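As an illustration of the PCA step, the sketch below reduces 2-D coordinates (such as pickup locations) to their first principal component in pure Python, using the closed-form eigen-decomposition of a 2x2 covariance matrix. The data is synthetic; in practice a library implementation such as scikit-learn's PCA would be used.

```python
import math

def first_pc_projection(points):
    """Project centered 2-D points onto the first principal component."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    xs = [x - mx for x, _ in points]
    ys = [y - my for _, y in points]
    # 2x2 covariance matrix [[a, b], [b, c]]
    a = sum(v * v for v in xs) / n
    b = sum(u * v for u, v in zip(xs, ys)) / n
    c = sum(v * v for v in ys) / n
    # largest eigenvalue of a symmetric 2x2 matrix (closed form)
    lam = ((a + c) + math.sqrt((a - c) ** 2 + 4 * b * b)) / 2
    # corresponding eigenvector, normalized
    if b != 0:
        ex, ey = b, lam - a
    else:
        ex, ey = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(ex, ey)
    ex, ey = ex / norm, ey / norm
    return [u * ex + v * ey for u, v in zip(xs, ys)]

# Synthetic points lying on the line y = x: all variance sits on one component.
proj = first_pc_projection([(1, 1), (-1, -1), (2, 2), (-2, -2)])
```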

The remainder of this paper is organized as follows: Section II provides the historical background of the study. Section III presents the proposed methodology. Section IV elaborates on the results of our research. Section V contains the concluding remarks and future directions.

Most of the earlier work on TTP employed segment-based approaches focusing on predicting TT on a selected set of routes or a specific freeway segment/region. Loop detector data has been extensively used to predict segment/link TT. Various approaches, including pattern matching [11], least-squares minimization [12], Hidden Markov Models [13], Gradient Boosting Decision Trees [14], and XGB [15], have been proposed to model segment-based TT. Data fusion has also been studied to improve the prediction accuracy in [16]. However, the major drawback associated with segment-based approaches is that link delays at intersections and transition times from one link to another are not considered in the prediction process. This limitation restricts the applicability of these approaches to freeway scenarios.

Path-based approaches address the limitation of segment-based approaches to some extent by splitting the entire path into sub-paths and computing TT for each sub-path using historical trajectories to get the final prediction [2], [5], [17], [18]. Rahmani et al. in [2] presented an idea to concatenate sub-paths to estimate the entire path. The authors in [17] decomposed the entire trajectory path into a pathlet dictionary and then reconstructed the complete path with fewer pathlets, from which TT is estimated. Li et al. in [18] extended the work towards personalized prediction of TT using a pathlet dictionary and learned congestion patterns. However, the performance of these path-based studies could be impacted by the data sparsity problem.

In the last decade, data-driven approaches have been widely used in traffic forecasting with the surge in data collection technologies like hand-held devices and vehicle navigation systems.
These approaches solve the problem by learning the hidden spatiotemporal features of traffic data in an end-to-end fashion. The model in [41] is followed by a fully-connected layer to predict TT. Shen et al. [42] employed LSTM as a prediction layer on features learned using CNN-RNN models. The authors in [43] hybridized DBN with quantile regression for highway TT prediction. In addition to hybridized models, ensemble-based approaches have also been developed for TTP. The outputs of GRU and XGB are combined in [8]. In another study, Zou et al. [10] combined the outputs of Light Gradient Boosting Machine (LightGBM) and MLP using a decision tree model for TTP. Likewise, the authors in [9] reported better results for an ensemble involving LightGBM and XGB as base regressors for the urban road network. In [4], Wide-Deep-Recurrent (WDR) models have been proposed that combine three models, namely linear, MLP, and LSTM models, to predict TT. All the above ensemble approaches have analyzed the impact of the decision scores of machine learning and deep learning models for TTP. However, the impact of feeding the feature spaces of deep learning models to an ML model, i.e., SVR, for TTP has not been studied in prior literature. Moreover, exogenous features have not been extensively examined for TTP on a network scale. In our current work, we have augmented exogenous features, including weather conditions, calendar data, peak hours data, and fastest route data, to our map-matched trajectories. Moreover, PCA features are extracted from pickup and drop-off location features. Finally, DSAE is employed to learn and encode GPS trajectories in a lower dimension.
On the final feature set obtained after augmentation of exogenous features, PCA features, and encoded trajectories, we trained our meta-model, in which SVR is used as a meta-regressor and the feature spaces generated by LSTM and GRU are fed as input to SVR for the final prediction. The results demonstrate the superior performance of our proposed meta-learning based approach.

Travel time prediction is a challenging task as it is affected by several exogenous and endogenous factors like the choice of route, time of the day (peak/non-peak hour), day of the week (weekday/weekend), and weather conditions (usually more time is needed to reach a destination in bad weather). Ensembles are now widely considered the most advanced solution to many machine learning problems and address the limitations of a single model by adding diversity using multiple base learners (either homogeneous or heterogeneous), ultimately improving overall predictive performance. This diverse learning leads to a more robust model that sufficiently captures the data's variance (distribution). Different approaches like voting, ensemble selection, and stacking have been used to combine base learners to form an ensemble model [44].

Likewise, some studies consider weather information but do not take into account other exogenous features like peak hours, calendar information, fastest route data, etc. [49], [50]. Moreover, some studies have incorporated weather and calendar information only for a freeway [51] or corridor [52], or present an OD-based prediction [19], [32]. Travel time is affected by weather conditions, time of day, day of the week, route choice, peak or non-peak hour, etc. We extracted and aggregated various spatio-temporal and weather-related features in our integrated dataset. For instance, the trip's geospatial area and the vehicle's route during a trip strongly impact TT. We extracted geospatial features such as the total distance, segments, and intersections traversed by a vehicle during a trip using map-matching. Similarly, another important type of feature that affects TT is temporal features. For example, the TT during peak/rush hours differs markedly from, and is typically longer than, that during non-peak hours.
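Temporal features of this kind can be derived directly from a trip's start timestamp; a minimal sketch using Python's standard library (the feature names are illustrative, not the exact columns of our dataset):

```python
from datetime import datetime

def temporal_features(ts: datetime) -> dict:
    """Derive simple temporal features from a trip-start timestamp."""
    return {
        "hour_of_day": ts.hour,        # time of the day
        "day_of_week": ts.weekday(),   # 0 = Monday ... 6 = Sunday
        "day_of_month": ts.day,
        "month_of_year": ts.month,
        "is_weekend": ts.weekday() >= 5,
    }

feats = temporal_features(datetime(2022, 3, 18, 8, 30))  # a Friday morning
```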
For temporal information, we extracted the time of the day, day of the week, day of the month, and month of the year features. Weather conditions also affect TT [53], so we included 18 weather conditions in our final feature set. These features are listed in Table 1. Other useful features contributing to accurate TTP are is_peak_hour, is_holiday, fastest_route_distance, and fastest_route_time. The fastest route features, as described in [54], are extracted using the OSRM fastest route Application Programming Interface (API). The is_peak_hour feature is calculated with the help of the Directorate of Traffic Engineering and Transportation Planning Islamabad, and our data is then used to validate this feature. The trajectory data was encoded into eight features, and these features were appended to the final feature set. After data aggregation and feature representation, we removed anomalous trips with durations less than a minute (extremely short) and greater than two hours before final experimentation. Our data comprises trips between 0.5 and 60 kilometers. In our dataset, the longest trip contains 99 GPS locations (latitude-longitude pairs), which corresponds to 198 latitude and longitude points. After feature augmentation, as discussed in Section III-B, we have increased the dimensionality of our dataset.

We trained four deep learning models (CNN, MLP, LSTM, and GRU) and used SVR [58] as a meta-regressor. We extracted the feature spaces of the individual models and fed them to SVR for the final prediction, as it is based on structural risk minimization theory. SVR seeks to reduce test error and enhances the model's ability to generalise, in contrast to models based on empirical risk minimization theory [59].
To create a hybrid learning-based boosted feature space, we chose the two best-performing models, i.e., LSTM and GRU, as our base feature extractors.
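The hybridization step itself is a per-sample concatenation of the two learned feature vectors, after which the meta-regressor (SVR in our case) is fit on the widened representation. A minimal sketch; the feature values below are placeholders, and the SVR fit itself is omitted:

```python
def hybridize(lstm_feats, gru_feats):
    """Concatenate per-sample LSTM and GRU feature vectors into one boosted feature space."""
    assert len(lstm_feats) == len(gru_feats), "one feature vector per sample from each extractor"
    return [lf + gf for lf, gf in zip(lstm_feats, gru_feats)]

# Two samples, with 3 LSTM features and 2 GRU features each (illustrative values);
# the hybridized space has 5 features per sample, on which SVR would then be trained.
X_hybrid = hybridize([[0.1, 0.4, 0.2], [0.3, 0.1, 0.9]],
                     [[0.7, 0.5], [0.2, 0.6]])
```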
where ⊗ represents point-wise multiplication, and µ and σ are the tanh and sigmoid activation functions.

The training speed of GRU is increased by the model's simplified architecture, which results in fewer parameters to train. The update gate in GRU replaces the input and forget gates of the LSTM. We employed a two-layer GRU model in this experiment. The structure of the GRU cell is depicted in Fig. 6, and the mathematics for the two gates of GRU controlling the flow of information within the cell can be seen in Equations 8-11. The equations of GRU are taken from [68].
where u_t denotes the update gate, r_t the reset gate, h̃_t the current memory content, and h_t the final memory content at time t; σ and µ are the sigmoid and tanh activation functions. The symbol ⊗ denotes element-wise multiplication, whereas W_u and U_u are the respective weight matrices of the two gates.
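For concreteness, Equations 8-11 can be sketched as a single scalar GRU step in plain Python. The weight names mirror the text, the (1 − u_t) mixing convention follows the common formulation, and all parameter values here are illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x_t, h_prev, p):
    """One scalar GRU step; p maps weight names (W_u, U_u, ...) to floats."""
    u_t = sigmoid(p["W_u"] * x_t + p["U_u"] * h_prev)                 # update gate
    r_t = sigmoid(p["W_r"] * x_t + p["U_r"] * h_prev)                 # reset gate
    h_tilde = math.tanh(p["W_h"] * x_t + p["U_h"] * (r_t * h_prev))   # current memory content
    return (1.0 - u_t) * h_prev + u_t * h_tilde                       # final memory content

# With all-zero weights: u_t = 0.5 and h_tilde = 0, so the step halves h_prev.
zero = {k: 0.0 for k in ("W_u", "U_u", "W_r", "U_r", "W_h", "U_h")}
print(gru_step(1.0, 2.0, zero))  # → 1.0
```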

This section begins with a description of the data, followed by an explanation of the models that were used to analyse it and their results. The trajectory data was collected using a GPS chipset (U-Blox EVA-M8M). Our study uses data spanning the peak and off-peak hours, from 6:00 am to 11:00 pm.

Details of the dataset are given in Table 2.

The data distribution scheme of our proposed HDFS-TTP approach is demonstrated in Fig. 7.
The R² indicates how much of the variation is learned by the model and is shown in Equation 16.
Here TT_m refers to the mean travel time value. For the best prediction, the ideal value for RMSE and MAE is zero (or close to zero), and close to one for R². In addition to that, three related ensemble approaches, [8], [10], and [9], were also implemented to compare with our proposed HDFS-TTP approach.
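The evaluation metrics used throughout have standard definitions; a pure-Python sketch, with TT_m computed as the mean of the ground-truth travel times as described above:

```python
import math

def rmse(y, p):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, p)) / len(y))

def mae(y, p):
    return sum(abs(a - b) for a, b in zip(y, p)) / len(y)

def mape(y, p):
    return 100.0 * sum(abs((a - b) / a) for a, b in zip(y, p)) / len(y)

def r2(y, p):
    tt_m = sum(y) / len(y)  # mean travel time, TT_m
    ss_res = sum((a - b) ** 2 for a, b in zip(y, p))
    ss_tot = sum((a - tt_m) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot
```

A perfect prediction gives RMSE = MAE = MAPE = 0 and R² = 1, matching the ideal values stated above.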

E. HYPER-PARAMETER SETTING
Table 3 lists the parameters we specified for our baseline NNs. A trial-and-error method was used to obtain these values; the best value for each parameter of the models presented in Table 3 was obtained after multiple experimental runs. For our base regressors, we tuned the learning rate, the number of hidden layers, the number of neurons in each hidden layer, and the batch size. The activation function and optimizer have been set to 'relu' and 'adam', respectively. Our proposed approach's results are validated using hold-out cross-validation (see Fig. 7). Our proposed approach, in contrast to the baselines, includes a machine learning-based meta-model (SVR) with pseudo-random behaviour (like other machine learning models). In order to demonstrate the robustness of our approach, we ran the experiment 10 times using the best parameters and reported the results with confidence intervals in Tables 6, 7, 8, and 9.
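The reported confidence intervals over repeated runs can be computed with the standard library; a sketch assuming a normal approximation with the usual 1.96 factor for a 95% interval (the RMSE values below are made up for illustration):

```python
import statistics

def mean_ci(values, z=1.96):
    """Mean and half-width of an approximate 95% confidence interval."""
    m = statistics.mean(values)
    s = statistics.stdev(values)        # sample standard deviation
    half = z * s / (len(values) ** 0.5)
    return m, half

# Illustrative RMSEs from 10 repeated runs of the same configuration.
runs = [61.9, 62.4, 63.1, 61.5, 62.8, 62.0, 62.6, 61.8, 63.0, 62.2]
m, half = mean_ci(runs)
print(f"RMSE: {m:.2f} ± {half:.2f}")
```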

525
The results of various deep learning models used as feature extractors for SVR on the entire dataset are presented in this section. Table 5 shows the outcome on the overall data. As can be seen, LSTM and GRU perform better as feature extractors than CNN and MLP, as discussed in Section IV-F. The performance of these recurrent learning models can be further enhanced by concatenating their feature spaces [61]. The results across the day, from 6:00 am to 11:00 pm, are shown in Fig. 8.

In order to demonstrate the generalizability of our proposed HDFS-TTP, we performed two experiments. In the first experiment, we analyzed the impact of weather features.

In the second experiment, we tested our model on weekdays data only. In both scenarios, only a slight degradation in model performance is reported. The details are discussed in the next sections.

The results indicate that weather features have a considerable effect on the overall prediction of TT. Fig. 9 shows the RMSE of our proposed feature-based LSTM-GRU ensemble and the baselines. The impact of weather features on our proposed models and the baselines is readily apparent.

The second experiment is performed on weekdays data. The results are reported in Table 8. There is a slight degradation in overall performance, which could be caused by the reduction in data size. An RMSE of 64.02 ± 1.14 is reported for our proposed HDFS-TTP ensemble. The RMSE of our proposed feature-based ensemble and the baselines is demonstrated in Fig. 10. Even with weekend data removed, our model performs better than its counterparts.

The performance of [8], [10], and [9] deteriorates slightly on weekdays data. The ensemble proposed in [8] has an RMSE and MAE of 74.11 and 31.94, respectively. The ensemble proposed in [10] has an RMSE and MAE of 78.87 and 30.26, respectively. Similarly, the RMSE and MAE of the ensemble proposed in [9] are 65.24 and 23.78, respectively. It is evident from the results shown in Table 9 that our proposed boosted feature space-based ensemble (HDFS-TTP) outperforms these approaches.

Our proposed approach differs from the ensemble approaches presented in the literature, which rely on the decision scores of the base regressors. Investigating other ML models as meta-regressors can further enhance the results. In addition to that, variants of DSAE, such as variational AE and denoising AE, can be used to enhance the feature spaces prior to model training. In the future, we plan to incorporate decision scores with the feature spaces of recurrent learning models. We also plan to evaluate the performance of graph-based NNs on the same dataset.

The authors acknowledge the support of the Khalifa University of Science and Technology.