Transportation Mode Recognition With Deep Forest Based on GPS Data

Transportation mode recognition (TMR) is a common but critical task in the human behavior research field, which provides decision support for urban traffic planning, public facility arrangement, travel route recommendations, etc. The rapid development of urban information technology, mobile sensors and artificial intelligence has generated solutions for TMR; however, they rely on extra sensors and Geographic Information System (GIS) information, which are not always available. Moreover, recognition is usually simplified by disregarding the trajectories that span transportation mode change points. In this paper, we propose an ensemble learning-based approach to automatically recognize transportation modes (including a hybrid mode) using only Global Positioning System (GPS) data. A total of 72 features were extracted to better distinguish different transportation modes. Furthermore, we exploited a deep forest to combine various types of classification models, which facilitates robust learning with different trajectory samples and modes. The experimental results on the GeoLife dataset show the efficiency of our approach, and the improved deep forest model achieved the best performance among all the experiments that we conducted, with 88.6% accuracy.


I. INTRODUCTION
The motion behaviors of residents have a certain regularity over time, and the hidden patterns and trends are crucial for urban development and governance. Transportation mode recognition (TMR) can help reveal these patterns by showing how individuals move among points of interest (POIs). At the urban level, TMR aids decision-making for traffic system management, regional function division, public facility layout, etc. For individuals, by inferring their life patterns and preferences from past trajectories, TMR can provide support for many applications, such as travel route recommendations.
For the past few years, the methods employed for acquiring transportation mode information were questionnaires or telephone interviews. However, the information collected in these traditional ways was not universal for all urban residents. The data quality is strongly affected by respondents' memories, which are commonly inaccurate and incomplete, especially when the transportation mode changes frequently within a trajectory [1]. The rapid development of urban informatization, mobile sensors and artificial intelligence has generated solutions for TMR; however, they rely on extra sensors and Geographic Information System (GIS) information, which are not always available. In recent years, vast amounts of Global Positioning System (GPS) data have been produced due to the popularity and refinement of GPS devices [2]. GPS data have a high sampling rate and accuracy and can describe individual travel behaviors more completely and in greater detail.
Two primary issues in machine learning-based TMR research have been addressed: feature extraction from raw GPS data and selection of a classification model. Various kinematic or statistical features have been extensively applied in the representation of TMR, and they can be divided into two categories: point features, including velocity, acceleration, turning angle, and sinuosity; and segment features, such as mean velocity or maximum acceleration. For different combinations of these features, decision trees [3]-[8], k-nearest neighbors [5], SVMs [3], [5], [9], [10], fuzzy systems [11], [12], ensemble learning [5], [9], [13], [14], and deep learning methods [15]-[17] have been applied to enhance the identification capability of the TMR task. However, the following problems still exist. First, trajectory segments that span transportation mode change points are usually disregarded in the training process, which simplifies the recognition task but causes errors when the transportation mode changes. Additionally, in previous studies, Zheng [4] employed the maximum velocity and acceleration and the mean, variance, and expectation of the velocity as input features. Based on these features, researchers further selected more statistical measures of the velocity and acceleration, which better explain the differences between modes but introduce interference from human experience [18]. The increase in the number of features may also increase the complexity of the model and the computational costs. Finally, the performance of the classification model remains unsatisfactory in some real situations due to noise and outliers; therefore, the generalization ability and robustness of the models need to be improved.
More recently, a decision tree-based ensemble approach referred to as the deep forest [19] has been proposed. The approach has fewer hyperparameters and lower computational costs than deep neural networks (DNNs) and provides competitive performance. Inspired by its promising capacity in classification tasks, we propose an improved deep forest method to automatically recognize transportation modes, including a hybrid mode, with only GPS data. In this framework, we employ 72 global trajectory features extracted by statistical methods to distinguish transportation modes. Our research contributes to the field in the following ways: 1) In addition to a few general transportation modes, including walking, riding a bike, taking a bus, driving a car, and taking a subway or train, our approach is able to identify whether a segment belongs to a hybrid mode, which makes it better able to determine when and where an individual is likely to change his/her transportation mode. 2) We develop a deep forest method that combines various types of classification models, including the random forest (RF), completely random forest (CRF), SVM and XGBoost, to facilitate robust learning with different trajectory samples and modes, and that does not require the substantial hyperparameter-tuning effort needed for DNNs while still performing comparably to or better than these models.
The remainder of our paper is organized as follows: Section 2 reviews relevant references. Section 3 presents our model, and the experimental results and discussion are shown in Section 4. The conclusions are presented in Section 5.

II. LITERATURE REVIEW
Over the last decade, in the field of TMR, the GeoLife dataset has become well known for its quality and quantity [3], [4], [20]. Some researchers have either employed data released by governments or independently collected data. Biljecki et al. [11] used a dataset with 17 million GPS points from the Netherlands and elsewhere in Europe. Bantis et al. [21] collected GPS data via a customized smartphone application (app).
Based on the collected GPS dataset, TMR can be considered a multicategory classification using a machine learning scheme; thus, two types of research efforts have been commonly employed: extraction of trajectory features and classification of TMR using these features.

A. EXTRACTING FEATURES BASED ON GPS DATA
Many researchers have extracted the statistical values of a trajectory segment as global features. Zheng et al. [3] employed common statistical features, including the mean velocity, expected velocity, top three velocities and top three accelerations, to identify four transportation modes (bike, bus, car, and walking). Based on their previous results, Zheng [4] then selected three more advanced features: the Heading Change Rate (HCR), the Stop Rate (SR), and the Velocity Change Rate (VCR). The results show the capability of these features to improve the robustness of TMR models. Dodge [22] extracted a total of 58 features, including global features and local features, via a statistical method and profile decomposition. Xiao et al. [14] further increased the number of features to 111.

B. SELECTION OF CLASSIFICATION MODELS FOR SPECIFIC DATASETS
In addition to traditional machine learning methods, deep learning, which has developed rapidly in recent years, is considered a new solution for TMR. Endo et al. [23] first employed a DNN for the TMR task with time information only; that is, no kinematic characteristics were considered. Song et al. [24] employed heterogeneous city data to build a long short-term memory (LSTM) network that simulates and predicts the movement of people across the whole city. To address the problem of human bias when creating efficient features in traditional machine learning methods, Dabiri et al. [25] proposed a convolutional neural network (CNN) architecture with 84.8% accuracy. Wang et al. [15] further presented the CNN-BiGRU, which combines a CNN with a bidirectional gated recurrent unit (Bi-GRU), to better mine the temporal characteristics of a trajectory. Additionally, Vu et al. [17] proposed an improved recurrent neural network (RNN) model referred to as the Control Gate-based Recurrent Neural Network (CGRNN), an end-to-end model that works directly with raw signals from an embedded accelerometer.

III. METHODOLOGY

A. FORMULATION
A GPS point can be denoted by three parameters as p i = (l i , g i , t i ), where l i , g i and t i denote the latitude, longitude, and timestamp, respectively, of point p i . A trajectory segment S i of length n is a sequence of n points paired with the corresponding transportation mode label.
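One way to encode this notation in code is sketched below (the class and field names are our own illustration, not from the paper):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class GPSPoint:
    lat: float     # l_i, latitude
    lon: float     # g_i, longitude
    t: datetime    # t_i, timestamp

@dataclass
class Segment:
    points: list   # [GPSPoint, ...] of length n
    mode: str      # transportation mode label for the segment
```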

B. THE FRAMEWORK OF OUR METHOD
In this paper, we propose a deep forest-based method to automatically recognize seven transportation modes: walking, bicycle, bus, car, subway, train, and hybrid. The flowchart of our method (as shown in Figure 1) includes three modules: data processing, feature extraction and classification & evaluation. In the data processing module, the cleaned GPS data are segmented into a number of trajectory segments. For a single segment, we first extract the kinematic parameters, including the velocity, acceleration, turning angle and sinuosity, which are then applied to calculate a variety of statistical measures to serve as trajectory features. For the 72 extracted features, a deep forest model is adopted to classify different transportation modes in the classification & evaluation module.

C. DATA PROCESSING
In our study, only surface transportation modes are included; thus, the trajectories with the ''airplane'' and ''boat'' labels are pruned, and the ''taxi'' mode is merged into ''car'' due to their similarity.
For each trajectory, three preprocessing steps were conducted. First, if a GPS point has a time interval exceeding 15 min relative to the previous point, it is treated as the start of a new trajectory. Second, these trajectories are divided into segments with a fixed length of m = 300 points. Finally, segments with fewer than 15 GPS points are removed. The resulting segment set is utilized as the input for the following blocks.
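A minimal sketch of these three preprocessing rules is shown below (function names and data layout are our own; the paper does not publish code):

```python
from datetime import datetime, timedelta

GAP = timedelta(minutes=15)   # time gap that starts a new trajectory
SEG_LEN = 300                 # fixed segment length m
MIN_POINTS = 15               # segments shorter than this are dropped

def split_trajectories(points):
    """points: list of (lat, lon, datetime) tuples sorted by time."""
    trajectories, current = [], [points[0]]
    for prev, cur in zip(points, points[1:]):
        if cur[2] - prev[2] > GAP:       # large gap -> new trajectory
            trajectories.append(current)
            current = []
        current.append(cur)
    trajectories.append(current)
    return trajectories

def segment(trajectory):
    """Cut one trajectory into fixed-length segments, dropping stubs."""
    segs = [trajectory[i:i + SEG_LEN]
            for i in range(0, len(trajectory), SEG_LEN)]
    return [s for s in segs if len(s) >= MIN_POINTS]
```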

D. FEATURE EXTRACTION
Based on a previous study [22], the features are categorized into global features and local features. The global features represent the descriptive statistics of the entire trajectory, while the local features reveal more details about the movement behavior. However, global features have been determined to be more important for trajectory segments after theoretical analysis, which was also shown in [14] via feature importance ranking; therefore, our study selects 68 global features in terms of 17 statistical measures of the velocity v t , the acceleration a t , the turning angle θ t (difference between the azimuth angles of two consecutive points) and the sinuosity s t (winding path divided by distance). To obtain the previously mentioned 68 features, first, v t , a t , θ t and s t were calculated for each GPS point. The following 17 statistical measures were then respectively estimated for the four previously mentioned kinematic parameters for all GPS points included in a segment.
1) Mean: The mean value reflects the general value of the data in a trajectory segment.
2) Standard deviation: The standard deviation reflects the degree of dispersion in a data set.
3) Mode: The mode is the most frequently occurring value in the statistical distribution.
4)-9) Three max and three min values: These parameters aim to reduce the impact of abnormal points with positional errors.
10) Range: The maximum value minus the minimum value.
11)-12) Percentiles: Measures of the positions of the data, which describe how the dataset is distributed between the minimum value and the maximum value. In this paper, we selected the 25th percentile (lower quartile) and the 75th percentile (upper quartile).
13) Interquartile range: The difference between the upper quartile and the lower quartile.
14) Skewness: Skewness is the digital characteristic of the degree of asymmetry of a statistical data distribution, which measures the direction and degree of the deviation. It is defined as
$$\mathrm{Skew} = E\!\left[\left(\frac{x-\mu}{\sigma}\right)^{3}\right],$$
where $\mu$ represents the mean value of the data and $\sigma$ represents the standard deviation of the data.
15) Kurtosis: The kurtosis measures the flatness of a data distribution. If the distribution is steeper than a normal distribution, the value is greater than 0; otherwise, it is less than 0. It is defined as
$$\mathrm{Kurt} = E\!\left[\left(\frac{x-\mu}{\sigma}\right)^{4}\right] - 3.$$
16) Coefficient of variation: The coefficient of variation reflects the degree of data dispersion, similar to the standard deviation, but eliminates the influence of the measurement scale and dimension, whereas the standard deviation does not. It is defined as
$$c_{v} = \frac{\sigma}{\mu}.$$
17) Autocorrelation coefficient: Autocorrelation is a mathematical representation of the degree of similarity between a given time series and a lagged version of itself over successive time intervals.
The autocovariance at lag $k$ is formulated as
$$\gamma(k) = \frac{1}{n}\sum_{t=1}^{n-k}(x_{t}-\mu)(x_{t+k}-\mu),$$
and the autocorrelation coefficient is defined as
$$\rho(k) = \frac{\gamma(k)}{\gamma(0)},$$
where $\mu$ represents the mean value of the data.
We further employed three advanced features proposed by Zheng [4], the HCR, SR, and VCR, to further improve the robustness of the classification model. The thresholds for the HCR, SR, and VCR are denoted by $H_{t}$, $V_{s}$, and $V_{c}$, whose values were obtained according to the accuracy changes when the HCR, SR and VCR were selected for classification; they were set to 19°, 3.4 m/s, and 0.26 m/s, respectively.
1) HCR: The HCR can be regarded as the frequency (per unit travel distance) with which individuals change their heading direction by more than $H_{t}$. The HCR can better distinguish motorized and nonmotorized transportation modes:
$$\mathrm{HCR} = \frac{|P_{c}|}{D},$$
where $D$ is the travel distance of the segment and $P_{c} = \{p_{i} \mid p_{i} \in P,\ p_{i}.H > H_{t}\}$.
2) SR: By setting the threshold $V_{s}$, the SR represents how often individuals stop along their trajectories, which, like the VCR, effectively distinguishes walking from other modes:
$$\mathrm{SR} = \frac{|P_{s}|}{D},$$
where $P_{s} = \{p_{i} \mid p_{i} \in P,\ p_{i}.V < V_{s}\}$.
3) VCR: Similar to the SR, the VCR measures the frequency with which individuals change their velocity by more than the threshold $V_{c}$:
$$\mathrm{VCR} = \frac{|P_{v}|}{D},$$
where $P_{v} = \{p_{i} \mid p_{i} \in P,\ |v_{i+1}-v_{i}|/v_{i} > V_{c}\}$.
In addition, the length of each segment is included in the feature set. Thus, a feature set of 72 global features is constructed for the downstream classification task.
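The 17 statistical measures can be sketched for one kinematic parameter as follows (a minimal illustration with our own function name; the velocity series stands in for any of the four parameters, and a non-constant series is assumed so the standard deviation is nonzero):

```python
import numpy as np

def stats_17(x, lag=1):
    """Compute the 17 statistical measures listed above for a 1-D series."""
    x = np.asarray(x, dtype=float)
    srt = np.sort(x)
    vals, counts = np.unique(x, return_counts=True)
    mu, sigma = x.mean(), x.std()
    z = (x - mu) / sigma
    centered = x - mu
    autocov = np.mean(centered[:-lag] * centered[lag:])   # lag-k autocovariance
    feats = [
        mu,                           # 1) mean
        sigma,                        # 2) standard deviation
        vals[np.argmax(counts)],      # 3) mode
        *srt[-3:],                    # 4)-6) three max values
        *srt[:3],                     # 7)-9) three min values
        srt[-1] - srt[0],             # 10) range
        np.percentile(x, 25),         # 11) lower quartile
        np.percentile(x, 75),         # 12) upper quartile
        np.percentile(x, 75) - np.percentile(x, 25),  # 13) interquartile range
        np.mean(z ** 3),              # 14) skewness
        np.mean(z ** 4) - 3,          # 15) (excess) kurtosis
        sigma / mu,                   # 16) coefficient of variation
        autocov / sigma ** 2,         # 17) autocorrelation coefficient
    ]
    return np.array(feats)
```

Applying this to the velocity, acceleration, turning angle and sinuosity series yields the 68 statistical features; the HCR, SR, VCR and segment length complete the 72-feature vector.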

E. DEEP FOREST-BASED CLASSIFICATION MODEL
On top of the feature layer, we apply a deep forest as the classifier due to its strong and robust prediction performance. The deep forest [19], which is also known as the Multi-Grained Cascade Forest (gcForest), is an ensemble learning method based on decision trees. Inspired by deep learning, gcForest employs sliding windows of different sizes to achieve representation learning and inputs the resulting transformed features into cascade forests. The deep forest does not incur computational costs as high as those of DNNs but achieves prediction performance comparable to that of other deep learning methods. In addition, it shows promising performance even on small-scale datasets.
Apart from the training strategies, the performance of the deep forest is strongly affected by the selection of its component learners. However, the original deep forest proposed by Zhou [19] uses only the RF and the CRF as component learners. Although the CRF can increase the generalization ability of the model, it cannot effectively avoid the interference of noise and outliers in the dataset. Considering the scale and dimension of the selected feature vector, we employ the RF, CRF, SVM and XGBoost as the component learners. Among these, the RF, CRF and XGBoost are ensemble learning models based on decision trees. They differ in that XGBoost focuses on reducing bias, while the RF concentrates more on reducing variance. Compared with the RF, XGBoost is suitable for sparse data and is less likely to overfit since it adds a regularization term to control the complexity of the model. Compared to the regular RF, the CRF helps improve the generalization ability of the deep forest since it randomly selects a single feature from the full feature space when nodes are split. The SVM is a traditional classifier that, like the RF, can work well in high dimensions; however, it usually does not perform well on large-scale datasets. Therefore, the joint utilization of these four learning models can facilitate better generalization and robustness of the proposed deep forest across various cases of available data.
The architecture of the proposed deep forest is shown in Figure 2. The 72 global features serve as the input of the cascade forest structure, which consists of n levels of component learners. Each level is composed of an SVM, an XGBoost, an RF and a completely random forest to ensure the diversity of the component learners. The output vectors of the component learners in the same level are concatenated with the raw feature vector and then passed to the next level as the feature representation for each learner. This layer-by-layer processing of the features continues until there is no significant performance gain, and an averaging strategy is adopted to accomplish the classification at the last level.
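One cascade level can be sketched as follows. This is our own minimal illustration using scikit-learn estimators as stand-ins (ExtraTrees approximates the completely random forest, GradientBoosting stands in for XGBoost); the real system would also stop adding levels once validation accuracy plateaus:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier)
from sklearn.svm import SVC

def cascade_level(X_train, y_train, X):
    """One cascade level: each learner's class-probability vector is
    concatenated with the raw features and passed to the next level."""
    learners = [
        RandomForestClassifier(n_estimators=50, random_state=0),
        ExtraTreesClassifier(n_estimators=50, random_state=0),   # ~ CRF
        SVC(probability=True, random_state=0),
        GradientBoostingClassifier(random_state=0),              # ~ XGBoost
    ]
    probas = [clf.fit(X_train, y_train).predict_proba(X) for clf in learners]
    return np.hstack([X] + probas)

# toy data in place of the 72-feature segment vectors
X, y = make_classification(n_samples=60, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_aug = cascade_level(X, y, X)   # 8 raw features + 4 learners x 3 classes
```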

1) RF/CRF
The RF was first proposed in 2001 [26]. As a representative bagging method, which is shown in Algorithm 1, the RF uses decision trees as component learners and takes the majority vote as the final result. The training procedure of the RF can be briefly described as follows: 1) Sample m instances each time by bootstrap sampling to construct n sample subsets. 2) Construct n decision trees from the n sample subsets, and let each tree grow without pruning. 3) In contrast to a regular decision tree, each node considers only a feature subset (instead of all features) when splitting. The nodes of the trees in the CRF use a single randomly selected feature from all features when splitting.
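The split-time difference between the RF and the CRF maps onto scikit-learn hyperparameters roughly as follows (our illustration, with extremely randomized trees approximating the CRF, not the authors' code):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# RF: each node searches a random subset of features for the best split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            bootstrap=True, random_state=0)
# CRF-like: each node considers a single randomly chosen feature.
crf = ExtraTreesClassifier(n_estimators=100, max_features=1,
                           bootstrap=True, random_state=0)

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rf.fit(X, y)
crf.fit(X, y)
```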

2) SVM
An SVM [27] aims to find a hyperplane that linearly separates the samples in the original sample space or in a high-dimensional feature space. The model can be described as
$$f(x) = \operatorname{sign}\!\left(\sum_{i=1}^{m}\alpha_{i}y_{i}\kappa(x, x_{i}) + b\right),$$
where $\alpha_{i}$ is the Lagrange multiplier, $b$ is the displacement term that determines the distance between the hyperplane and the origin, and $\kappa(x, x_{i})$ is the kernel function. The commonly employed kernel functions are listed in Table 1.

3) XGBOOST
The Classification And Regression Tree (CART) is a type of decision tree that uses the Gini index to select partition attributes. By combining a certain number of CARTs, each of which gives a prediction score at its leaves, XGBoost [28] calculates the final score by summing the individual trees' prediction scores:
$$\hat{y}_{i} = \sum_{k=1}^{K} f_{k}(x_{i}), \quad f_{k} \in \mathcal{F},$$
where $K$ is the number of trees, $f_{k}$ is a function in the functional space $\mathcal{F}$, and $\mathcal{F}$ is the set of all possible CARTs. The objective function to be optimized is
$$\mathrm{Obj} = \sum_{i} l(y_{i}, \hat{y}_{i}) + \sum_{k=1}^{K}\Omega(f_{k}),$$
where $l$ is the training loss and $\Omega$ is a regularization term that penalizes the complexity of each tree.

IV. EXPERIMENTS

A. DATASET
The GPS trajectory dataset employed in this paper was collected by the GeoLife project (Microsoft Research Asia) from 182 users over a period of more than three years (from April 2007 to August 2012) [3], [4], [29], [30]. The GPS trajectories in this dataset are represented by sequences of time-stamped points, each of which contains the latitude, longitude and altitude. According to the statistical results, the whole dataset contains 17,621 trajectories with a total distance of approximately 1.2 million kilometers and a total duration of more than 48,000 hours. These trajectories were recorded by GPS devices or smartphones and have a variety of sampling rates, most of which (91%) are densely logged, e.g., every 1-5 seconds or every 5-10 meters per point. In general, raw GPS data contain noise, outliers and gaps to some extent. Thus, we first cleaned the raw dataset by removing duplicate records with the same timestamp. Second, we empirically set velocity and acceleration thresholds for each transportation mode (as shown in Table 2).
With the previous conditions, Table 3 shows the number of segments for each transportation mode.

B. MODEL EVALUATION
To better compare the performance of different classification models, we use the precision, recall, F-score, confusion matrix, receiver operating characteristic (ROC) curve and area under the curve (AUC) as the evaluation metrics. For the multiclass classification task, by matching the labels with the prediction results, all samples in the test dataset fall into four categories: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Among these criteria, precision indicates how many of the samples predicted to be positive by the model are actually positive, while recall represents how many of all the positive samples are successfully identified. The precision and recall are defined as follows:
$$\mathrm{Precision} = \frac{TP}{TP+FP}, \qquad \mathrm{Recall} = \frac{TP}{TP+FN}.$$
The precision and recall evaluate the model from two aspects; however, they are generally contradictory. The F-score is extensively employed because it considers both. The closer the F-score is to 1, the better the model's performance:
$$F = \frac{(1+\alpha^{2})\cdot P\cdot R}{\alpha^{2}\cdot P + R}.$$
In particular, when $\alpha = 1$, P and R have the same weight:
$$F_{1} = \frac{2\cdot P\cdot R}{P + R}.$$
The ROC curve uses the False Positive Rate (FPR) and True Positive Rate (TPR) as the horizontal and vertical axes, respectively, and is drawn by traversing all thresholds. It works well for class-imbalance problems. We calculate the area under this curve, referred to as the AUC value, to quantify the performance.
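As a worked example of these definitions (with made-up counts, not results from the paper):

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall and F1 (the alpha = 1 case)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# hypothetical counts for one class
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
# p = 0.80, r = 80/90 ~ 0.889, f1 ~ 0.842
```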
To assess the performance of the proposed method, we compare our method with a set of machine learning algorithms according to the previously mentioned evaluation metrics. Additionally, in our experiment, all classification models were implemented in Python.

C. EXPERIMENTAL RESULTS AND DISCUSSION
To better evaluate the effectiveness of our study, we constructed a set of comparative experiments, including the RF, XGBoost, a CNN and the regular deep forest. The RF was included because it is commonly employed in the TMR field and performs well under most experimental conditions. The XGBoost and CNN experiments were conducted based on the studies carried out in [14] and [25]; these two methods were chosen because they achieved the best results as representatives of ensemble learning and deep learning methods, respectively. In addition, the regular deep forest was employed to verify our improvements to the ensemble structure. To ensure an unbiased and consistent test, we trained the 5 models on the same preprocessed dataset and tuned their parameters respectively.
The dataset was randomly divided into a training set containing 80% of the segments and a testing set with the remaining 20%. A 5-fold cross-validation method was employed. The details of the different models can be found in Table 4, Table 5, and Table 6.
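This evaluation protocol can be sketched as follows (our illustration: synthetic data stands in for the 72-feature segment vectors, and an RF stands in for the full model set):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# synthetic stand-in for the 7-class, 72-feature segment dataset
X, y = make_classification(n_samples=500, n_features=72, n_informative=20,
                           n_classes=7, random_state=0)

# 80/20 split, then 5-fold cross-validation on the training portion
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0),
                         X_tr, y_tr, cv=5)
```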
The performance of the proposed model and the baselines is evaluated in terms of the previously mentioned metrics. From the confusion matrices shown in Table 7, Table 8, Table 9, and Table 10, we can see that our deep forest achieves the highest accuracy at 88.6%, which is 4.6%, 0.8%, and 14.6% higher than that of the RF, XGBoost and CNN, respectively. In terms of the precision rate, our deep forest ranks first for ''Bicycle'', ''Car'', ''Train'' and ''Hybrid'' and ranks second for ''Walk'' and ''Bus''. In terms of the recall rate, our model performs even better. The improved deep forest ranks first in six categories-''Walk'', ''Bicycle'', ''Bus'', ''Car'', ''Subway'' and ''Train''-and ranks second for the ''Hybrid'' mode by a slight numerical margin. Although the recall rate of our model for this mode is slightly lower than that of XGBoost, it achieves a higher precision rate, which yields a higher AUC. These results demonstrate the better performance of our method in recognizing ''Hybrid'' trajectory segments.
As a representative deep learning method, the CNN has been shown to perform well in the TMR task [25]. However, the confusion matrix in Table 9 and the ROC curve in Figure 6 show that the performance of the CNN lags well behind that of the RF, XGBoost and our deep forest; its overall accuracy was only 74.2%. The poor performance of the CNN may be attributed to the fact that its performance is positively correlated with the scale of the training set: the segment length in our method (300) is longer than that of Dabiri [25] (200), which produces a 40% reduction in the scale of the dataset.
Considering the condition of class imbalance in the dataset, the unweighted average precision (UAP) and unweighted average recall (UAR) were calculated, as shown in Figure 7.
The results indicate that the improved deep forest is more robust to class-imbalanced data. For example, compared with the RF on the smallest category (subway, ∼4% of samples), although the precision rate of the deep forest is not better than that of the RF, it has a higher recall rate, which yields a larger AUC. This finding illustrates that the deep forest tends to be more sensitive to the positive samples in class-imbalance cases.
Based on the experimental results, we observe the following: 1) Most misclassification occurs between ''walking'' and other modes, which may be caused by the similarity between ''walking'' and ''driving'' in the case of heavy traffic or of transfers at a bus station.
2) Misclassification commonly occurs between the ''hybrid'' mode and single transportation modes because the differences in the kinematic characteristics between the constituent transportation modes are diluted or amplified according to their share of the travel distance.
3) There is a high correlation between the classification performance and the number of samples of a specific mode. For example, the large number of samples and the unique characteristics of the ''walking'' mode keep its recall rate stable at approximately 99% for the RF, XGBoost and improved deep forest. However, the recall rate of the ''subway'' mode, which has the fewest samples, fluctuated greatly and remained low: only 73% for the improved deep forest, although this is still well above the RF's 50%.
In terms of running time, the CNN takes the longest and the RF the shortest, which is consistent with model complexity. Between XGBoost and the deep forest, the deep forest takes longer, since XGBoost is one of its component learners.
For the same travel distance, a higher sampling rate means a larger dataset and a more refined description of the movement, both of which generally have a positive impact on the feature representation and model performance, leading to better classification. However, this influence becomes limited once the sampling rate reaches a certain level.
The overall accuracies are shown in Table 11; the improved deep forest that we propose is 14.6%, 4.6%, 4%, and 0.8% higher than the CNN, RF, regular deep forest, and XGBoost, respectively, which demonstrates the effectiveness of our method.

V. CONCLUSION
In this paper, we presented a TMR model based on a deep forest and global trajectory features that uses only raw GPS data to recognize transportation modes. In this ensemble learning framework, the SVM and XGBoost are employed as component classifiers in addition to the RF and CRF models to enhance the diversity of the component classifiers. A total of 7 transportation modes can be determined by the proposed model: walking, bicycle, bus, car, train, subway and hybrid. In particular, our method can effectively recognize hybrid-mode segments, which helps to infer an individual's transfer time and location and can improve the accuracy of travel behavior recognition. In the evaluation on the GeoLife dataset, we compared our model to state-of-the-art conventional and deep learning baselines, XGBoost and a CNN, as well as the commonly employed RF. The evaluation results showed that our model achieves the highest accuracy of 88.6%. The combination of ensemble learning and deep learning shows the potential to serve as a new solution for prediction tasks in other applications.