CNN and KPCA-Based Automated Feature Extraction for Real Time Driving Pattern Recognition

Driving conditions greatly affect the energy control and fuel economy of a hybrid electric vehicle (HEV). In this paper, an automated feature extraction scheme based on convolutional neural networks (CNNs) and kernel PCA (KPCA) is proposed for real time driving pattern recognition (RTDPR), in order to achieve consistent energy management performance. Firstly, a dimension expanding strategy transforms one-dimensional speed sequences into a two-dimensional dataset. Then, the transformed data is sent to the CNN and KPCA based feature extractor. Finally, the feature extractor automatically selects the most representative features for classification. To improve the generalization of the CNN on a small sample dataset, the structure of the typical CNN is adjusted by adding a KPCA layer that reduces the number of model parameters. The model is trained and evaluated in simulation, and it is tested for RTDPR in the real world. Simulation and experimental results show that the proposed automated feature extraction strategy achieves state-of-the-art recognition accuracy and outperforms the conventional driving pattern recognition algorithms based on manual feature extraction.


I. INTRODUCTION
A driving pattern is typically defined as the driving cycle of a vehicle in a particular environment [1], [2]. Since the current driving pattern has a great impact on the energy management strategy of a hybrid electric vehicle (HEV) [3], [4], it is effective to use prior knowledge of the driving cycle to achieve real time driving pattern recognition (RTDPR) and thereby enhance the control performance of the HEV [5], [6]. RTDPR has been studied extensively [2], [7]-[10]. The conventional way is to manually extract features from the historical speed data to characterize the driving patterns [2]. Then classical machine learning models such as k-means [7], hidden Markov models [8], fuzzy c-means [9], and their variants [10] are used to classify the extracted features into different categories. Therefore, the quality of the feature extraction algorithm has a great impact on the classification accuracy. However, the manually extracted features usually include average speeds, average accelerations and other features that are directly calculated using physical models [11], while more complex, high level features are hard to represent. In practice, such low level features are unable to effectively characterize complex driving patterns. Additionally, to reduce the time cost of RTDPR, only a limited number of features (e.g., 16) are selected to characterize the driving patterns [12]. Based on the above analysis, it can be concluded that the recognition accuracy of the conventional methods is significantly affected by the selected features. Recently, with the development of deep learning and its strong classification ability [13]-[15], the convolutional neural network (CNN) has been widely used in pattern recognition [16]-[18] and has achieved good performance. The CNN can achieve end-to-end recognition without manual feature extraction, but it has not yet been widely applied in RTDPR, partly due to the lack of large numbers of training samples.
Motivated by the CNN, we do not manually generate the feature vectors from the historical speed data to build the model. Instead, the model learns to extract the features itself from the datasets [19]. During the training process, the model learns to select the most representative features, and their number, automatically. The simulation results indicate that the features selected automatically by the model are more representative than those that are manually designed. The standard CNN is a nonlinear model with typically thousands of parameters, which can easily overfit when the training samples are insufficient [20]. Most of the parameters are concentrated in the fully-connected layers, which hold much redundancy. To solve this problem, we design an automated feature extractor that retains the former part of the CNN and removes the fully-connected layer. A kernel PCA (KPCA) layer is then added to supply compact features, so that the redundancy is removed and the classification is simplified. Additionally, we apply a linear shift to the speed data to expand the dataset, which also proves to be very effective in avoiding overfitting.
The associate editor coordinating the review of this article and approving it for publication was Alireza Sadeghian.
In this work, we firstly collect the training samples from the historical speed data with a sliding window. The size and step of the window are adjusted in the training process. Secondly, we transform the training samples into a two-dimensional dataset so that the CNN based model can effectively deal with the speed information. Thirdly, the two-dimensional dataset is divided into batches to fit the feature extractor. Finally, the extracted features are utilized for RTDPR. The specific contributions of this paper are as follows: (1) We have improved the generalization of the standard CNN on a small dataset by adding the KPCA layer. (2) We have achieved an end-to-end strategy for RTDPR instead of manually designing features. (3) The historical speed sequence is transformed to two dimensions to extract spatial features. (4) We have achieved state-of-the-art accuracy for RTDPR.
The structure of this paper is as follows. The details of the CNN + KPCA architecture are described in Section II. Our model based on CNN + KPCA is then reported in Section III. Section IV presents the applications on four typical patterns (congested urban, flowing urban, suburban and highway) and in a real environment, and the results are compared with those of other typical classifiers. Finally, Section V gives the conclusions of this paper.

II. THE CNN + KPCA ARCHITECTURE

A. THE STANDARD CNN CLASSIFIER
The CNN model is a complex nonlinear function that maps the input samples to the corresponding driving patterns. The overall structure of the CNN is shown in Fig. 1; it includes one input layer, several middle layers and one output layer. The input layer of the CNN handles the two-dimensional samples. The middle layers include the convolution layers and a fully-connected layer. Within each convolution layer, the convolution operation is performed and is immediately followed by the max-pooling operation. The outputs of the last convolution layer are then flattened to one dimension and used as the inputs of the fully-connected layer, which adds further nonlinearity. The output layer contains four neurons that represent the different driving patterns. The details of the calculation process are described as follows.
Suppose that we have an $n \times n$ input sample represented as the two-dimensional array in (1), where $x_{i,j}$ is the pixel at position $(i, j)$. A two-dimensional kernel, represented by the $m \times m$ array of shared weights in (2), connects the input layer to the convolution layer. A two-dimensional feature map is then obtained as described in (3).
Here, $C^1_{i,j,k}$ represents the $(i, j)$-th value of the feature map of the $k$-th kernel in the first convolution operation, $f$ is the activation function of the neuron, which is described in (4), and $b_k$ is the shared bias of the $k$-th kernel. Finally, $x_{i+a_1,j+a_2}$ denotes the pixel at position $(i + a_1, j + a_2)$. Sharing the weights and bias of a kernel means that the same feature is detected at different locations of the input sample. Different kernels generate different feature maps, which together comprise the former part of the convolution layer.
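As an illustration of the feature map computation described above, the following minimal sketch applies one shared kernel and bias at every position of a small binary sample. It rests on assumptions that are not taken from the paper: the activation $f$ is replaced by a ReLU as a stand-in, no padding is used (so the map shrinks to $(n - m + 1) \times (n - m + 1)$), and the function names and toy values are hypothetical.

```python
def relu(v):
    """Rectified linear unit, used here only as a stand-in for the activation f."""
    return v if v > 0.0 else 0.0


def conv_feature_map(x, w, b, f=relu):
    """Unpadded 2-D convolution of an n x n sample x with one m x m kernel w.

    Entry (i, j) of the result is
    f(b + sum over a1, a2 of w[a1][a2] * x[i + a1][j + a2]).
    """
    n, m = len(x), len(w)
    size = n - m + 1
    out = []
    for i in range(size):
        row = []
        for j in range(size):
            s = b
            for a1 in range(m):
                for a2 in range(m):
                    s += w[a1][a2] * x[i + a1][j + a2]
            row.append(f(s))
        out.append(row)
    return out


# Toy 4 x 4 binary sample and a 2 x 2 kernel that responds to a diagonal stroke.
x = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
w = [[1.0, 0.0], [0.0, 1.0]]
feature_map = conv_feature_map(x, w, b=-1.0)
```

Because the weights and the bias are shared, the kernel fires wherever the diagonal stroke appears, which is the translation property mentioned in the text.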
The max-pooling operation produces condensed feature maps from the former part of the convolution layer. These condensed feature maps are stacked to form the latter part of the convolution layer. Each neuron in the pooling operation outputs the maximum activation in a region of $e \times e$ neurons of the feature map. The output of the $(i, j, k)$-th (zero-based) neuron in the first max-pooling operation is given by (5). The second convolution layer is obtained by performing the convolution operation $C^2$ and the max-pooling operation $M^2$, as described in (3) and (5), respectively. Then $M^2$ is flattened from three dimensions to one dimension, i.e., the neurons in this layer are arranged in a line. The $i$-th activation in the flattened layer is denoted $M^2_i$. Finally, the fully-connected layer connects every neuron of the flattened max-pooling layer to every one of the 4 output neurons [21]. Assuming that there are $h$ neurons in the fully-connected layer, the $i$-th activations of the fully-connected layer ($F_i$) and of the output layer ($Y_i$) are described in (6) and (7), respectively.
In (6) and (7), $\phi^1$ and $\theta^1$ are the biases and weights between the flattened max-pooling layer and the fully-connected layer, respectively, and $\phi^2$ and $\theta^2$ are the biases and weights between the fully-connected layer and the output layer, respectively. $s$ is the softmax function described in (8).
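The pooling, flattening and softmax steps of this subsection can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the pooling regions are non-overlapping, the hidden fully-connected layer is skipped so that the flattened activations feed the four output neurons directly, and the weights `theta` and biases `phi` are hypothetical toy values.

```python
import math


def max_pool(feature_map, e):
    """Non-overlapping e x e max-pooling of one square feature map."""
    size = len(feature_map) // e
    return [
        [
            max(
                feature_map[i * e + a1][j * e + a2]
                for a1 in range(e)
                for a2 in range(e)
            )
            for j in range(size)
        ]
        for i in range(size)
    ]


def flatten(maps):
    """Arrange a list of 2-D maps (one per kernel) into a single 1-D list."""
    return [v for fmap in maps for row in fmap for v in row]


def dense(inputs, weights, biases):
    """Affine map: one output per row of weights, before any activation."""
    return [
        b + sum(w_ij * x_j for w_ij, x_j in zip(w_row, inputs))
        for w_row, b in zip(weights, biases)
    ]


def softmax(z):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


pooled = max_pool([[1, 3, 0, 2], [4, 2, 1, 1], [0, 0, 5, 6], [1, 2, 7, 8]], e=2)
flat = flatten([pooled])
# Hypothetical weights: four output neurons, one per driving pattern.
theta = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
phi = [0.0, 0.0, 0.0, 0.0]
probs = softmax(dense(flat, theta, phi))
predicted_class = probs.index(max(probs)) + 1  # classes numbered from 1
```

The softmax output is a probability distribution over the four patterns, so the recognized pattern is simply the index of its largest entry.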

B. THE CNN + KPCA FEATURE EXTRACTOR
As shown in Fig. 2, the number of parameters in the convolution layers is determined by $k_1$ and $k_2$, the numbers of kernels in the first and second convolution layers, respectively. In the flatten layer, the number of activations is $(n/4)^2 \times k_2$, which is on the order of thousands because of the width $n$ of the input. The number of hidden neurons in the subsequent fully-connected layer is commonly set to 1024 or 2048 [22]. Thus the number of parameters in the fully-connected layer can reach the million level, which easily results in overfitting when classifying a small sample set. We propose two measures to reduce the parameters while keeping the performance of the CNN. Firstly, we remove the fully-connected layer, which contains large redundancy. Secondly, we retain the convolution layers, whose output is projected by the KPCA operation, i.e., the KPCA layer. The KPCA projection reduces the dimension of the flattened convolution layer considerably, from thousands to about ten, while the major feature information is retained. The combined convolution and KPCA layers are treated as the feature extractor, as described in Fig. 2. The output of the flattened convolution layer, denoted $M^2$, is calculated by (1)-(4). The KPCA projection that reduces the feature dimension is calculated by (9)-(13).
In (9), $k$ is the kernel function and $K_{ij}$ represents the $(i, j)$-th element of the kernel matrix. The kernel matrix is then centralized by (10), where $N$ is the number of the training samples. The eigenvalues $\lambda_i$ and the corresponding eigenvectors $a_i$ of $K$ are then calculated, and the eigenvectors are normalized as in (11).
The normalized eigenvectors corresponding to the largest eigenvalues $\lambda$ are then selected. The number of selected eigenvectors, $l$, is decided by the accumulated contribution rate calculated in (12).
Finally, the feature dimension is reduced by (13),
where $\alpha = [a_1, a_2, \ldots, a_l]$ and $M^2_{kpca}$ denotes the features extracted after the KPCA layer. The extracted features are then utilized for the subsequent classification. Given a speed sequence of a driving vehicle with an unknown pattern, our goal is to recognize the driving pattern. The CNN + KPCA model is applied as a classifier that provides the probability distribution over the driving patterns for each input sequence, as illustrated in Fig. 3.
Each neuron of the output layer represents a driving pattern. The driving pattern with the maximum probability is the final recognized result.
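Parts of the KPCA projection described above can be sketched in a few lines. The sketch below is an illustration under assumptions rather than the paper's procedure: a Gaussian (RBF) kernel is assumed because the kernel function is not specified here, the centring uses the standard feature-space form for kernel PCA, the 0.9 threshold is arbitrary, and the eigen-decomposition and final projection are omitted since they would normally be delegated to a linear algebra library. All function names are hypothetical.

```python
import math


def rbf_kernel(u, v, gamma):
    """Gaussian (RBF) kernel; the kernel choice is an assumption of this sketch."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def kernel_matrix(samples, gamma):
    """K[i][j] = k(sample_i, sample_j) over all N training samples."""
    return [[rbf_kernel(u, v, gamma) for v in samples] for u in samples]


def center_kernel(K):
    """Centre K in feature space: K - 1K - K1 + 1K1, with 1 = ones(N, N) / N."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    col_mean = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [
        [K[i][j] - row_mean[i] - col_mean[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


def num_components(eigenvalues, threshold):
    """Smallest l whose leading eigenvalues reach the cumulative contribution rate."""
    ordered = sorted((v for v in eigenvalues if v > 0.0), reverse=True)
    total = sum(ordered)
    running = 0.0
    for l, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return l
    return len(ordered)


# Four toy feature vectors standing in for flattened convolution outputs.
samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
K = kernel_matrix(samples, gamma=0.5)
K_centered = center_kernel(K)
l = num_components([5.0, 3.0, 1.5, 0.5], threshold=0.9)
```

After centring, every row and column of the kernel matrix sums to zero, which corresponds to subtracting the mean in the implicit feature space before the principal directions are computed.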

III. MODEL BUILDING
Since fitting the KPCA requires the projection space to remain unchanged, the model building includes two separate stages. In the first stage, we train the standard CNN model with stochastic gradient descent methods. After the parameters of the CNN are optimized, we add the KPCA layer between the convolution layer and the fully-connected layer of the CNN. In the second stage, we fine-tune the fully-connected layer with the convolution layer frozen. The parameters of the fully-connected layer are updated iteratively by gradient descent with the whole training batch. The computation is described in detail in (14)-(20).

A. TYPICAL CNN MODEL BUILDING
We use the categorical cross-entropy in the loss function for the multi-class classification problem [23], which is detailed in (14).
In (14), $G$ is the ground truth of the sample and $z$ represents the parameters to be optimized, i.e., $z = [w, b, \ldots]$. All the operations on the vectors are element-wise. An algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments [24], is used to optimize these parameters. In (15), $g_{z,t}$ is the gradient of $z$ at iteration $t$. Denoting by $m_{z,t}$ the exponential moving average of the gradient $g_{z,t}$ and by $v_{z,t}$ the exponential moving average of the squared gradient $g^2_{z,t}$, a momentum term is introduced to update the values of $m_{z,t}$ and $v_{z,t}$ as described in (16) and (17), where $\beta_1, \beta_2 \in [0, 1)$ are two hyper parameters that control the exponential decay rates of the averages of the gradient and the squared gradient, respectively. Because the moment estimates are initialized as zero, they are biased toward zero, especially during the first iterations and when the $\beta$s are close to one. To avoid this, $m_{z,t}$ and $v_{z,t}$ are bias-corrected as in (18) and (19), where $\hat{m}_{z,t}$ and $\hat{v}_{z,t}$ are the bias-corrected first moment estimate and the bias-corrected second moment estimate, respectively. Finally, the bias-corrected estimates are used to update the parameters as in (20), where $\alpha$ and $\epsilon$ represent the learning rate and the target error, respectively.
It is worth mentioning that the learning process in the two training phases is very similar; only the parameters being optimized differ slightly.
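The loss and the moment-based update described in this subsection can be sketched as below. The update follows the widely published form of the adaptive moment estimation method cited as [24]; the default constants, the function names and the toy quadratic objective are assumptions made for illustration and are not the settings used in the paper.

```python
import math


def categorical_cross_entropy(y_true, y_pred, tiny=1e-12):
    """Loss for one sample: -sum_i G_i * log(Y_i), with a guard against log(0)."""
    return -sum(g * math.log(max(p, tiny)) for g, p in zip(y_true, y_pred))


def adam_step(z, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One element-wise update of the parameters z; returns the new (z, m, v)."""
    new_z, new_m, new_v = [], [], []
    for z_i, g_i, m_i, v_i in zip(z, grad, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g_i        # moving average of gradient
        v_i = beta2 * v_i + (1.0 - beta2) * g_i * g_i  # moving average of gradient^2
        m_hat = m_i / (1.0 - beta1 ** t)               # bias-corrected first moment
        v_hat = v_i / (1.0 - beta2 ** t)               # bias-corrected second moment
        new_z.append(z_i - alpha * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_z, new_m, new_v


# Loss of a prediction that puts probability 0.7 on the true class.
loss = categorical_cross_entropy([0, 1, 0, 0], [0.1, 0.7, 0.1, 0.1])

# Toy problem: minimise (z - 3)^2, whose gradient is 2 * (z - 3).
z, m, v = [0.0], [0.0], [0.0]
for t in range(1, 2001):
    grad = [2.0 * (z[0] - 3.0)]
    z, m, v = adam_step(z, grad, m, v, t, alpha=0.05)
```

The iteration counter `t` starts at 1 so that the bias-correction denominators are never zero; the same step function could serve both training stages, with only the parameter list `z` changing.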

B. CNN + KPCA MODEL BUILDING
After the standard CNN has been built, we extract the convolution layers and combine them with the KPCA layer to form the feature extractor. We then obtain all the features of the dataset, which are used to develop the classifier. The training process is the same as that of the standard CNN described in (14)-(20), except for the input and the parameters to be optimized. The feature extractor remains fixed in this development phase, and the parameters in (21) are optimized in the way described in (14)-(20).

B. THE DATASET PROCESS
To expand the dataset, a sliding window is defined as shown in Fig. 5, where the width is the width of the window and the step is the sliding step of the window. The width controls the size of the samples: the larger the width, the more information each sample contains, but the longer it takes to acquire a sample. The step controls the size of the dataset: the smaller the step, the more samples are collected from the dataset, but the more similar the samples become to each other. Therefore, the width and step are two variables that need to be selected carefully. In this work, the width and step are experimentally set to 40 and 5 seconds, respectively. To collect samples, the window slides through the speed-time sequence. At each step, a sequence that represents the corresponding driving pattern is obtained, e.g., the first window is [0, 40], the second is [5, 45], the third is [10, 50], etc. Each sequence is then labeled with the corresponding ground truth. We use a four-bit binary code to represent the ground truth (e.g., 0001 for label 1 and 0010 for label 2).
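The sample collection described above can be sketched as follows. The sketch assumes a speed trace sampled at 1 Hz, so that a width of 40 and a step of 5 are counted in samples; the trace values and the function names are hypothetical.

```python
def sliding_windows(speeds, width, step):
    """Return (start, window) pairs of length `width`, advancing by `step` samples."""
    return [
        (start, speeds[start:start + width])
        for start in range(0, len(speeds) - width + 1, step)
    ]


def one_hot(label, num_classes=4):
    """Four-bit code with a single 1, e.g. label 1 -> [0, 0, 0, 1]."""
    code = [0] * num_classes
    code[num_classes - label] = 1
    return code


# Hypothetical 1 Hz speed trace of 60 samples (values are illustrative only).
speeds = [float(t % 20) for t in range(60)]
samples = sliding_windows(speeds, width=40, step=5)
starts = [start for start, _ in samples]
labels = [one_hot(2) for _ in samples]  # every window inherits the cycle's label
```

With these settings a 60 s trace yields five overlapping windows, which shows how a small step multiplies the number of samples at the cost of making neighbouring samples similar.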
In the one-dimensional sequence, only the speed information is taken into consideration. To exploit the spatial structure of the speed distribution, we examine the sequences at the pixel level. A two-dimensional array is used to reconstruct the sequence as shown in Fig. 5, where the 1s mark the pixels that contain speed information from the one-dimensional sequence and the 0s mark the pixels that contain none. The resolution of the two-dimensional data is $n \times n$, which controls the amount of information in the transformed samples. The transformed samples, together with their labels, are used to optimize the parameters of the model.
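One plausible way to realize the two-dimensional reconstruction is sketched below. The exact mapping of Fig. 5 is not reproduced in the text, so the layout used here is an assumption: each column is one of $n$ evenly spaced time samples, the row encodes the quantised speed, and `v_max` is a hypothetical normalization constant.

```python
def speed_to_image(window, n, v_max):
    """Map a speed window to an n x n binary array with one 1 per column.

    Column j is the j-th of n evenly spaced time samples; the row encodes the
    quantised speed, with row 0 at the top corresponding to v_max.
    """
    image = [[0] * n for _ in range(n)]
    for j in range(n):
        # Pick the sample nearest to the j-th of n evenly spaced positions.
        idx = round(j * (len(window) - 1) / (n - 1)) if n > 1 else 0
        v = min(max(window[idx], 0.0), v_max)  # clamp to the valid speed range
        level = round(v / v_max * (n - 1))
        image[n - 1 - level][j] = 1
    return image


# A steadily accelerating toy window becomes a rising diagonal of 1s.
image = speed_to_image([0.0, 10.0, 20.0, 30.0], n=4, v_max=30.0)
```

Under this layout the shape of the speed curve becomes a spatial pattern, which is the kind of structure a two-dimensional convolution kernel can respond to.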

C. HYPER PARAMETERS
Table 2 gives the hyper parameters of the typical benchmark models and of the proposed CNN + KPCA model, where $k_1$ and $k_2$ represent the numbers of kernels in the first and second convolution layers, respectively. The parameters listed in Table 2 are hyper parameters (i.e., choices for the algorithm that we set rather than learn). We use hyperopt [25], a Python library for hyper parameter optimization, to select the best model settings automatically.

D. RESULTS ANALYSIS
We divide the dataset equally into training and test sets to train and test the CNN model separately. Generally, the larger the dataset, the stronger the generalization of the model. Although our dataset is relatively small, we have avoided overfitting by adding the KPCA layer to the architecture and have achieved state-of-the-art results. The accuracy reaches 100% on the training set and 97.40% on the test set, which outperforms the other models based on different machine learning methods. The simulation was implemented in TensorFlow, running on a laptop with an Intel Core i5 @ 2.3 GHz and 8 GB RAM. The training and testing classification results are illustrated in Fig. 6, where only a small number of samples in class 2 (UDDS) and class 3 (WVUSUB) are misidentified on the testing set. Knowing how the classifier performs on individual classes is important, as it helps to refine the system design. A receiver operating characteristic (ROC) curve is plotted from the confusion matrix [26] to assess the performance of the classifier on the individual classes. By computing the area under the ROC curve, denoted AUC, the quality of the classifier is comprehensively evaluated. Fig. 7 shows the ROC curves of the CNN + KPCA model on the testing set, where the AUCs of classes 1 and 4 are both 1, which means that the model performs outstandingly on those classes. The micro-average and macro-average ROC curves are calculated to evaluate the overall performance on the four classes.
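The metrics used in this subsection can be illustrated with a small sketch. It assumes that the classifier exposes per-class probabilities, computes each one-vs-rest AUC with the rank-based formulation instead of tracing the curve, and takes the macro-average as the mean of the per-class AUCs; the six toy predictions are invented for illustration and are not results from the paper.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes (labels start at 1)."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t - 1][p - 1] += 1
    return cm


def accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(sum(row) for row in cm)


def binary_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(y_true, probs, num_classes):
    """Per-class AUC, macro-average, and micro-average over pooled decisions."""
    per_class, pooled_labels, pooled_scores = [], [], []
    for c in range(1, num_classes + 1):
        labels = [1 if t == c else 0 for t in y_true]
        scores = [p[c - 1] for p in probs]
        per_class.append(binary_auc(labels, scores))
        pooled_labels += labels
        pooled_scores += scores
    macro = sum(per_class) / num_classes
    micro = binary_auc(pooled_labels, pooled_scores)
    return per_class, macro, micro


# Toy predictions: six samples, four classes, one class-2 sample misread as class 3.
y_true = [1, 2, 3, 4, 2, 3]
probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.60, 0.20, 0.10],
    [0.10, 0.30, 0.50, 0.10],
    [0.05, 0.05, 0.10, 0.80],
    [0.10, 0.40, 0.45, 0.05],
    [0.10, 0.20, 0.60, 0.10],
]
y_pred = [p.index(max(p)) + 1 for p in probs]
cm = confusion_matrix(y_true, y_pred, num_classes=4)
per_class, macro, micro = one_vs_rest_auc(y_true, probs, num_classes=4)
```

A per-class AUC of 1 means that every sample of that class receives a higher score for it than any sample of another class, even when the overall accuracy is below 100%.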
To visualize the effectiveness of the KPCA layer, the first two components of the CNN feature extractor are used to form scatter plots on both the training and testing sets, as shown in Fig. 8. In (a) and (c), the space is formed by the first two dimensions of the features originally extracted by the CNN on the training and testing sets, respectively. In (b) and (d), the space is formed by the first two components of the features projected by KPCA; the projection makes the features linearly separable, which helps to simplify the subsequent fully-connected layer and to improve the classification accuracy.

E. COMPARISON WITH OTHER CLASSIFIERS
To evaluate the necessity and effectiveness of the KPCA strategy in the CNN feature extractor, two control strategies, the standard CNN and the CNN combined with PCA, are evaluated on the same dataset.

1) TYPICAL CNN
A simplified LeNet-5 with only one fully-connected layer (128 hidden neurons) and a dropout strategy (0.5) was selected as the classifier for real time driving pattern recognition. The structure of the standard LeNet-5 was slightly adjusted to fit the dataset. The training and testing results on the regular dataset are illustrated in Fig. 9, where the accuracy rates are 100% and 79.78%, respectively. Although the classifier achieved a 100% correct rate on the training dataset, the testing accuracy fell far behind, which means that the classifier lacked the generalization ability to handle the data. The ROC curve on the test dataset is shown in Fig. 10, where the standard CNN classifier is easily confused between classes 2 and 3. In addition, the average accuracy rate on the test dataset is much lower than that of the proposed strategy.
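To see where the parameters of such a network concentrate, a rough count can be sketched as below. The layer sizes are hypothetical (a 32 x 32 input, 5 x 5 kernels, $k_1 = 6$, $k_2 = 16$), since the actual values belong to Table 2, which is not reproduced in the text; only the 128 hidden neurons, the four classes and the reduction to about ten features come from the paper.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Weights plus one shared bias per kernel."""
    return (kernel_size * kernel_size * in_channels + 1) * out_channels


def dense_params(inputs, outputs):
    """Full weight matrix plus one bias per output neuron."""
    return (inputs + 1) * outputs


def flatten_size(n, k2):
    """Flattened width after two 2 x 2 poolings with size-preserving convolutions."""
    return (n // 4) ** 2 * k2


# Hypothetical sizes: 32 x 32 input, 5 x 5 kernels, k1 = 6, k2 = 16.
n, m, k1, k2, hidden, classes = 32, 5, 6, 16, 128, 4
counts = {
    "conv1": conv_params(m, 1, k1),
    "conv2": conv_params(m, k1, k2),
    "dense_after_flatten": dense_params(flatten_size(n, k2), hidden),
    "dense_after_kpca": dense_params(10, hidden),
    "output": dense_params(hidden, classes),
}
```

Under these assumed sizes the dense layer fed by the flattened maps holds far more parameters than both convolution layers together, while the same layer fed by ten projected features is roughly two orders of magnitude smaller.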

2) CNN + PCA
The CNN and PCA based automated feature extractor was also evaluated for real time driving pattern recognition. The structure of the CNN part of the extractor was the same as in the proposed extractor. In the reduction part, PCA was applied in place of KPCA to avoid overfitting. The classification results are illustrated in Fig. 11 and Fig. 12. The training and testing accuracies are 100% and 94.34%, respectively, which shows that overfitting is overcome by applying the PCA strategy. The correct rate on the testing dataset is slightly lower than that of the proposed strategy because KPCA is better at extracting nonlinear features.
The two control strategies show that the standard CNN easily suffers from overfitting when the training dataset is not large enough. By slightly adjusting the structure of the CNN and extracting the effective neurons of the fully-connected layer, overfitting is effectively reduced. Table 3 and Fig. 13 summarize the metrics of the three models; there is a big gap between the training and testing accuracy of the standard CNN strategy due to its numerous parameters. In addition, the AUC of the standard CNN model is lower than that of the proposed strategies, which further proves the necessity and effectiveness of the KPCA in the CNN based extractor. To prove the superiority of the proposed classifier based on automated feature extraction over the classical classifiers based on manually designed features, a variety of traditional classifiers were trained and tested on the same dataset. Wang et al. [28] and Wei et al. [29] have analyzed the 12 motion features that can best distinguish the driving patterns, as shown in Table 4. The following three classical classifiers were evaluated with the 12 manually designed features.

3) K-NEAREST NEIGHBOR
K-Nearest Neighbor (KNN) is one of the most widely used classifiers. This analysis generates speed feature vectors with width = 60 s and step = 60 s. The features are then associated with a specific driving pattern by a KNN classifier. The classifier has the advantage that no training is required. However, its memory requirement and recognition time are demanding. On the regular test dataset the correct rate was 81.51% with k = 20, while on the training dataset the correct rate was 88.02%. The parameter k of KNN is a hyper parameter, which was chosen to make the model perform best. The training and testing KNN classification results are illustrated in Fig. 14. The ROC curves of the KNN model on the testing set are given in Fig. 15, which shows that the classifier has poor precision on classes 2 and 3.
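The KNN decision rule can be sketched in a few lines. The sketch assumes Euclidean distance and simple majority voting, and the two-dimensional feature vectors (a mean speed and a mean acceleration) are toy values rather than the 12 features of Table 4.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training vectors closest to `query`."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy feature vectors: [mean speed, mean acceleration] for classes 1 and 4.
train_x = [[5.0, 0.2], [6.0, 0.3], [7.0, 0.25], [60.0, 0.1], [65.0, 0.15], [70.0, 0.05]]
train_y = [1, 1, 1, 4, 4, 4]
slow = knn_predict(train_x, train_y, [8.0, 0.2], k=3)
fast = knn_predict(train_x, train_y, [62.0, 0.1], k=3)
```

Every prediction scans and sorts the whole training set, which reflects the memory and recognition-time cost noted above.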

4) MULTILAYER NN
Another classifier that we tested was a fully connected multilayer NN (MNN) with 3 layers. The weights of the hidden layer were obtained by training with back-propagation [29]. The 12 standard features were extracted from the speed distribution with width = 80 s and step = 80 s, and these feature vectors were then used to train the MNN. The accuracy rate on the regular test dataset was 85.16% with 24 hidden neurons and 84.91% with 20 hidden neurons, while on the training dataset the accuracy rate was 91.23%. The numbers of feature vectors and hidden neurons were chosen to make the model perform best. The training and testing MNN classification results are illustrated in Fig. 16. Fig. 17 shows that the classifier has poor precision on class 2. The MNN based on feature extraction achieves a better result than the KNN. However, the extracted features do not seem to contribute positively to the recognition task or to simplifying the structure of the classifier.
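Manual feature extraction of the kind used by these baseline classifiers can be sketched as follows. The four features computed here are illustrative stand-ins only; they are not claimed to match the 12 features of Table 4, and the sampling interval `dt` and the idle threshold are hypothetical parameters.

```python
def motion_features(window, dt=1.0, idle_threshold=0.5):
    """A few illustrative hand-crafted features of one speed window."""
    accel = [(b - a) / dt for a, b in zip(window, window[1:])]
    positive = [a for a in accel if a > 0.0]
    return {
        "mean_speed": sum(window) / len(window),
        "max_speed": max(window),
        "mean_positive_accel": sum(positive) / len(positive) if positive else 0.0,
        "idle_ratio": sum(1 for v in window if v < idle_threshold) / len(window),
    }


# Toy window: standstill, acceleration, cruise, then deceleration.
features = motion_features([0.0, 0.0, 2.0, 4.0, 4.0, 2.0])
```

Each window is reduced to a handful of summary statistics, so any pattern information not captured by the chosen statistics is lost before the classifier sees the data.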

5) KERNEL PCA BASED MULTILAYER NN
In [30], a preprocessing stage was constructed that computes the projection of the input pattern on the principal components and uses it as the extracted feature vector. To compute the principal components, the mean of the input components was first computed and subtracted from the training vectors. A kernel function was then chosen to map the resulting vectors into a high-dimensional space. The covariance matrix of the high-dimensional vectors was then computed and diagonalized by singular value decomposition. The selected principal components, represented by 5-dimensional feature vectors, were used as the inputs of a multilayer classifier with 9 hidden neurons. The 5 selected principal components contained 99% of the information of the original data, and the number of hidden neurons was chosen to make the model perform best. The accuracy on the test dataset was 91.93%, while on the training dataset the accuracy rate was 96.88%. The training and testing classification results of the kernel PCA (KPCA) based MNN (KPCAMNN) are illustrated in Fig. 18. The KPCAMNN achieves better overall performance than the former two classifiers, as shown in Fig. 19.
Compared with the traditional driving pattern recognition methods, the CNN + KPCA model achieves the state-of-the-art correct rate, reaching 100% recognition accuracy on the training set and 97.40% on the testing set, whereas the best testing result among the methods based on feature extraction algorithms is 91.93%, which makes this a substantial improvement. The accuracy and AUC of the classifiers mentioned above are listed in Table 5, from which we can conclude that the proposed framework outperforms the other methods on both accuracy and AUC.

F. REAL DRIVING PATTERN RECOGNITION
The bus line 335 from Cicheng to Panhuo in Ningbo, China, with 41 bus stops, 11 traffic lights and a 27.2 km journey crossing most of the main area of the city, was selected as the UDDS cycle to be recognized. The route map is shown in Fig. 20. The speed data were sampled every 10 s from 6:00 AM to 2:30 PM, excluding the bus idle time, as shown in Fig. 21. The classification results in Fig. 22 show that 95.81% of the samples were recognized as class 2 (i.e., the UDDS cycle), with only a small part misclassified as class 3 (WVUSUB) or class 1 (MBDC).

V. CONCLUSION
In this study, we have proposed a novel end-to-end approach to real time driving pattern recognition based on speed data. Our results show that the proposed CNN + KPCA model effectively overcomes the bottleneck of traditional driving pattern classifiers based on manually extracted features. In addition, the proposed model successfully avoids overfitting by adding the KPCA layer to the network architecture and by expanding the dataset, which is promising when the number of examples in the training set is not large enough. Our future goal is to apply this method to the energy management strategy of HEVs to achieve efficient system control and thereby save energy on vehicles.

FIGURE 1. The architecture of typical CNN classifier.

FIGURE 2. The structure of the CNN + KPCA feature extractor.

FIGURE 4. The speed-time sequences of four typical driving conditions.

FIGURE 8. The first two components of the extracted features by CNN with or without KPCA projection. Training set: (a) Original space (b) Projection by KPCA; Testing set: (c) Original space (d) Projection by KPCA.

FIGURE 10. ROC curve of the typical CNN model.

FIGURE 12. ROC curve of the CNN + PCA model.

FIGURE 15. ROC curve of the KNN model.

FIGURE 17. ROC curve of the MNN model.

F
. REAL DRIVING PATTERN RECOGNITION The bus line 335 from Cicheng to Panhuo in Ningbo, China, with 41 bus stops, 11 traffic lights and 27.2Km journey

FIGURE 20. The running route of bus 335 in Ningbo.

FIGURE 21. The speed samples of bus 335.

FIGURE 22. The results of real driving pattern recognition.

TABLE 1. Four typical driving conditions.

TABLE 2. The hyper parameters of typical benchmarks.

TABLE 3. Metrics of the three models.

FIGURE 13. Metrics of the three models.

TABLE 4. Manually designed motion features.