Driving Style-based Conditional Variational Autoencoder for Prediction of Ego Vehicle Trajectory

Trajectory prediction of the ego vehicle is essential for advanced driver assistance systems to function properly. By recognizing various driving styles and predicting trajectories that reflect them, the prediction performance is enhanced, and a personalized trajectory can be generated. Therefore, we propose to combine driving style recognition and trajectory prediction using only in-vehicle CAN-bus sensor data, so that the method can be applied to ordinary vehicles. The DeepConvLSTM network is utilized for driving style recognition, and a generative model is used for trajectory prediction. The classified driving style is added as a condition to the network. In addition, the past trajectory of the ego vehicle is estimated and utilized as an additional input for performance improvement. The performance of the proposed method is analyzed in terms of the root mean squared error (RMSE) and mean absolute error (MAE) and compared with the cases in which the driving style is not conditioned or the past trajectory is not given. The results demonstrate the effectiveness of the proposed method.


I. INTRODUCTION
Trajectory prediction of the ego vehicle is a key task for many advanced driver assistance systems (ADAS) that implement evasive maneuvers to avoid or mitigate oncoming hazards. However, as a wide variety of drivers exist, they have their own driving styles, which makes accurate prediction of a personalized trajectory challenging. Therefore, to predict the trajectory accurately, it is necessary to grasp the driving style and make a prediction that reflects it. Driving style recognition and trajectory prediction of ego vehicle have been widely investigated.
Several studies have been conducted to determine the driving styles of drivers. Recognition of the driving style has been formulated as a classification problem that can be solved by training neural networks. Streiffer et al. proposed Darnet to recognize whether the driver is distracted using the driver's facial camera and IMU sensor [1]. Lee et al. used thermal and infrared images to recognize the driving styles [2]. Dong et al. proposed ARNet to encode driving styles using a GPS sensor [3]. Constantinescu et al. investigated the modeling of the personal driving styles of various vehicle drivers using GPS data [4]. However, many vehicles are not equipped with cameras that capture the driver's face or with a GPS that provides an accurate vehicle pose. In contrast, Shahverdy et al. expressed controller area network (CAN)-bus signals as an image and used a convolutional neural network (CNN) to classify the driver's behavior [5]. In addition, the DeepConvLSTM network was developed to recognize human activity [6]. This model is capable of learning features from multi-sensor CAN-bus data together with their temporal information.
Several researchers have made considerable effort to predict vehicle trajectories [7]-[10]. Baumann et al. predicted the trajectory of the ego vehicle using a deep neural network, but lidar is used as an input to the network [7]. Feng et al. [8] proposed a trajectory prediction network using a conditional variational autoencoder (CVAE) approach. They improved the performance by adding a simple intention recognizer network to the previous network. However, they used Next Generation Simulation (NGSIM) data [11], which can be obtained using GPS, radar, or lidar. Malla et al. [9] proposed a trajectory prediction network that considers the surrounding obstacles, including the motion of pedestrians, but their method relies on video information. Sadeghian et al. [10] proposed path prediction with an attentive generative adversarial network (GAN) that uses images and information on surrounding vehicles.
Conversely, attempts have been made to estimate the driving style and predict the trajectory simultaneously. Xing et al. proposed a driving style-based trajectory prediction of a leading vehicle [12]. They used a Gaussian mixture model (GMM) to distinguish the driving styles in an unsupervised manner and used different long short-term memory (LSTM) networks to predict the trajectory based on the driving styles. However, only longitudinal motion was predicted, and NGSIM data were utilized. Liu et al. estimated the driving style using a dynamic Bayesian network and predicted the trajectory using a Gaussian process model [13], [14]. The limitation is that they used a naturalistic vehicle trajectory data set called highD [15] which is similar to NGSIM.
In this paper, we propose an integrated method of driving style recognition and trajectory prediction to produce a personalized trajectory using only in-vehicle CAN-bus data. First, three driving styles are classified: normal, aggressive, and distracted. Subsequently, trajectory prediction is performed based on the recognized driving style. The contributions of this study can be summarized as follows:
• Predict the driving style-based multi-modal trajectory of the ego vehicle using a deep generative model called CVAE.
• Utilize only the in-vehicle CAN-bus sensor data for inference.
• Estimate the past trajectory to feed the network for improvement of the prediction performance.
• Use Hardware-in-the-Loop simulation (HILS) for data acquisition of various driving styles.

II. APPROACH
In this section, the main algorithms for driving style recognition and trajectory prediction are explained. We also describe the estimation of the past trajectory, which is used to improve the prediction performance.

A. DRIVING STYLE RECOGNITION
The classifier network used in our work for driving style recognition is a DeepConvLSTM network with an architecture similar to that in [6]. To increase the performance of the network, a depth-wise separable convolution was used in the convolutional layer and an attention layer was added [16], [17]. Additionally, the CAN data were pre-processed by the sliding window method used in [16]. The CAN data consist of [brake, accel pedal, steering angle, steering angle rate, $v_x$, $a_x$, $a_y$, $\dot{\psi}$] with a sampling time of 0.1 s for 5 s, where the elements indicate the brake pedal, accelerator pedal, steering wheel angle, steering wheel angle rate, longitudinal velocity, longitudinal acceleration, lateral acceleration, and yaw rate, respectively.

1) CAN data processing
Before applying the sliding window method, each signal is normalized to its own scale:
$$\bar{X} = \frac{X - \mathrm{mean}(X)}{\mathrm{std}(X)}, \qquad X \in \mathbb{R}^{D_n \times T} \tag{1}$$
where mean and std indicate the mean value and standard deviation, respectively. $D_n$ denotes the number of data channels used, and $T$ denotes the number of time steps of the data. The normalized data ($\bar{X}$) is then cropped into windows of the window size, and the window is slid along the data by the step size. The windows are stacked channel-wise to transform the data into an image form. The details of this process are presented in Fig. 1. The processed data has the following structure: $W_x \times D_n \times W_n$ ($W_x$: window size, $D_n$: number of data channels, and $W_n$: number of windows).
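The normalization and windowing steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `normalize_signals` and `sliding_windows` are hypothetical, the population standard deviation is assumed, and the windows are assumed to start at the first sample of the buffer passed in.

```python
from statistics import mean, pstdev


def normalize_signals(data):
    """Z-score every signal (column) over time: (x - mean) / std.

    data is a list of T time steps, each a list of D_n signal values.
    Assumption: population std; a constant signal is mapped to zeros.
    """
    columns = list(zip(*data))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, (m, s) in zip(row, stats)]
        for row in data
    ]


def sliding_windows(data, window_size, step_size, num_windows):
    """Crop num_windows windows and stack them channel-wise.

    Returns a nested list indexed [t][d][w], i.e. W_x x D_n x W_n.
    Assumption: window w starts at sample w * step_size.
    """
    needed = window_size + (num_windows - 1) * step_size
    if len(data) < needed:
        raise ValueError(f"need at least {needed} time steps, got {len(data)}")
    num_signals = len(data[0])
    starts = [w * step_size for w in range(num_windows)]
    return [
        [[data[start + t][d] for start in starts] for d in range(num_signals)]
        for t in range(window_size)
    ]
```

With a window size of 30, a step size of 5, and 4 windows, the buffer must hold at least 30 + 3 × 5 = 45 time steps, and the result has shape 30 × D_n × 4.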

2) Network Architecture
Our network is mainly composed of a convolutional layer, a recurrent layer, and an attention layer. The convolutional layer extracts the features in the time series, and the recurrent layer learns the temporal information. The convolutional layer is modified using a depth-wise separable layer and a max-pooling layer [16]. Unlike in [16], we used batch normalization instead of dropout to accelerate network training. The modified layer makes the network lighter and makes feature learning more efficient. We used LSTM for the recurrent layer to avoid the long-term dependency problem. The attention layer captures the relationship between the features and is capable of learning the importance of each feature.
The overall network structure is shown in Fig. 2. First, the feature map is extracted from the convolutional layers. The extracted features are then passed through the LSTM layer, the attention layer, and the classifier layer, which is a fully-connected layer.

B. TRAJECTORY PREDICTION
The trajectory prediction method in our study is based on the CVAE structure, which includes an encoder-decoder network with conditional inputs. Two types of conditional inputs are used in the proposed method. One is the driving style explained in Section II-A. The other is an embedded vector that combines the CAN-bus data from the CarMaker HILS with the estimated past trajectory. The overall architecture is shown in Fig. 3. The true trajectory ξ, depicted with a dotted line, is used only during the training phase.
The CVAE consists of a generative model $p_\rho(\xi \mid \eta, c, z)$ and an inference model $q_\phi(z \mid \eta, c, \xi)$, and the latent variable $z$ is expressed as follows using the reparameterization trick [18]:
$$z = \mu_\phi(\eta, c, \xi) + \sigma_\phi(\eta, c, \xi) \odot \epsilon \tag{2}$$
where $\phi$ and $\rho$ are the parameters of the encoder and decoder networks, respectively, $\mu_\phi$ and $\sigma_\phi$ are the mean and standard deviation output by the encoder, and $\epsilon$ is sampled from a normal distribution $\mathcal{N}(0, I)$. To minimize the error between the predicted trajectory $\hat{\xi}$ and the ground truth $\xi$, the reconstruction loss is defined as the L2 loss. The entire CVAE network is trained by minimizing the loss function, defined as follows:
$$\mathcal{L} = \lVert \xi - \hat{\xi} \rVert_2^2 + W \, D_{KL}\big(q_\phi(z \mid \eta, c, \xi) \,\|\, \mathcal{N}(0, I)\big) \tag{3}$$
where $W$ is a hyperparameter that balances the two losses. The former part represents the reconstruction loss, and the latter part represents the KL divergence loss between the multivariate normal distribution and the output distribution from the encoder. In the test phase, the encoding of the true future trajectory is not available; thus, we directly sample $z$ from $\mathcal{N}(0, I)$, and only the decoder is used to obtain the predicted trajectory.
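The reparameterization trick and the two loss terms can be sketched in a few lines. This is only an illustration under assumptions that the text does not confirm: the encoder is assumed to output a mean and a log-variance per latent dimension, the L2 term is taken as a sum of squared errors, and the weight `w` is assumed to multiply the KL term. All function names are hypothetical.

```python
import math
import random


def reparameterize(mu, log_var, rng=random):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), sigma = exp(0.5 * log_var)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def cvae_loss(xi_true, xi_pred, mu, log_var, w):
    """Squared L2 reconstruction error plus w times the KL term (assumed form)."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(xi_true, xi_pred))
    return reconstruction + w * kl_to_standard_normal(mu, log_var)


def sample_prior(latent_dim, rng=random):
    """Test-time sampling: z is drawn directly from N(0, I)."""
    return [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
```

When the encoder output already matches the prior (zero mean, unit variance), the KL term vanishes and the loss reduces to the reconstruction error.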
Trajectories that reflect different driving styles can be generated by feeding different conditions. The driving style recognition network in Fig. 3 provides a probability output for each driving style. The style with the maximum probability is then converted into a one-hot vector and fed to the encoder and decoder of the proposed CVAE architecture. For example, if [1, 0, 0] is fed, the normal driving style is conditioned. Likewise, [0, 1, 0] conditions the aggressive style and [0, 0, 1] the distracted style. In the following section, we analyze the effect of this condition to determine whether different conditions result in different types of trajectories.
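The conversion from the recognition network's probability output to the one-hot condition can be illustrated as below. The names `STYLES` and `style_condition` are hypothetical; the index order follows the encoding stated in the text.

```python
# Index order as stated in the text: [1,0,0] normal, [0,1,0] aggressive, [0,0,1] distracted.
STYLES = ("normal", "aggressive", "distracted")


def style_condition(probabilities):
    """Return the one-hot vector of the most probable driving style."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return [1 if i == best else 0 for i in range(len(probabilities))]


condition = style_condition([0.2, 0.7, 0.1])
print(condition, STYLES[condition.index(1)])
```

Feeding a hand-picked one-hot vector instead of the recognized one is what allows a trajectory to be generated for each driving style.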

1) Past Trajectory Estimation
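The constant turn rate and acceleration (CTRA) model [19] was used to estimate the past trajectory. The CTRA model assumes that the turn rate and acceleration are constant. The state vector is expressed as follows:
$$\mathbf{x} = [x, \; y, \; \theta, \; v, \; a, \; \dot{\psi}]^{T}$$
where $x$ and $y$ indicate the position of the vehicle, $\theta$ is the heading angle of the vehicle, $v$ is the velocity, $a$ is the acceleration, and $\dot{\psi}$ is the yaw rate. The state of the next time step is expressed by
$$\mathbf{x}_{k+1} = \mathbf{x}_k + \int_{0}^{\Delta t} \begin{bmatrix} (v_k + a_k \tau) \cos(\theta_k + \dot{\psi}_k \tau) \\ (v_k + a_k \tau) \sin(\theta_k + \dot{\psi}_k \tau) \\ \dot{\psi}_k \\ a_k \\ 0 \\ 0 \end{bmatrix} d\tau$$
where the subscript $k$ is the time step, and $\Delta t$ is the prediction time interval. Denoting this transition by $f$ and the measurement function by $h$, the state-space equation considering the noise is expressed as follows:
$$\mathbf{x}_{k+1} = f(\mathbf{x}_k) + w_k, \qquad \mathbf{o}_k = h(\mathbf{x}_k) + r_k \tag{11}$$
where $\mathbf{o}_k$ is the measurement, and $w_k$ and $r_k$ respectively indicate the system noise and observation noise, both defined as Gaussian distributions.
The unscented Kalman filter (UKF) was utilized in our study [20] to handle the nonlinearity of the model. To consider the uncertainty in the nonlinear dynamic model, the unscented transform (UT) was used. A fixed number of sigma points were selected from the original distribution to estimate the transformed distribution. The UT process was used with the motion model to perform the trajectory estimation task by considering the uncertainty in (11).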
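As an illustration of the motion model, the sketch below implements one step of the standard closed-form CTRA transition for the state [x, y, θ, v, a, yaw rate]. It is not necessarily the exact form used in this work, and the name `ctra_step` and the threshold `eps` for switching to the straight-line case are assumptions. In a UKF, a function of this kind would be applied to each sigma point in the prediction step.

```python
import math


def ctra_step(state, dt, eps=1e-6):
    """Propagate (x, y, theta, v, a, omega) by dt with constant a and omega."""
    x, y, theta, v, a, omega = state
    theta_next = theta + omega * dt
    v_next = v + a * dt
    if abs(omega) < eps:
        # Straight-line limit: constant-acceleration motion along the heading.
        dist = v * dt + 0.5 * a * dt * dt
        x_next = x + dist * math.cos(theta)
        y_next = y + dist * math.sin(theta)
    else:
        # Closed-form integral of (v + a*t) * [cos, sin](theta + omega*t).
        x_next = x + (
            v_next * omega * math.sin(theta_next) + a * math.cos(theta_next)
            - v * omega * math.sin(theta) - a * math.cos(theta)
        ) / omega ** 2
        y_next = y + (
            -v_next * omega * math.cos(theta_next) + a * math.sin(theta_next)
            + v * omega * math.cos(theta) - a * math.sin(theta)
        ) / omega ** 2
    return (x_next, y_next, theta_next, v_next, a, omega)
```

Because the transition is the exact solution of the CTRA assumptions, ten steps of 0.01 s give the same state as one step of 0.1 s, which is a convenient sanity check.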
The past trajectory, initialized with $(x_1 = 0, y_1 = 0, \theta_1 = 0)$, is estimated over the history horizon as shown in Fig. 4a. The in-vehicle sensor data composed of the longitudinal velocity $v_x$, the longitudinal acceleration $a_x$, and the yaw rate $\dot{\psi}$ were used as the measurement in the UKF. The estimated trajectory was then transformed so that the last estimate, which corresponds to the current time step, becomes $(x_{T_{obs}} = 0, y_{T_{obs}} = 0, \theta_{T_{obs}} = 0)$ as shown in Fig. 4b. This is because the prediction is performed in the local coordinate system of the ego vehicle.
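The transformation into the local frame of the ego vehicle amounts to a translation followed by a rotation, as in the sketch below. The function name `to_local_frame` is hypothetical, and the poses are assumed to be given as (x, y, θ) tuples ordered in time.

```python
import math


def to_local_frame(poses):
    """Re-express (x, y, theta) poses so that the last pose becomes (0, 0, 0)."""
    x_ref, y_ref, theta_ref = poses[-1]
    cos_r, sin_r = math.cos(theta_ref), math.sin(theta_ref)
    local = []
    for x, y, theta in poses:
        dx, dy = x - x_ref, y - y_ref
        d_theta = theta - theta_ref
        local.append((
            cos_r * dx + sin_r * dy,    # longitudinal offset in the ego frame
            -sin_r * dx + cos_r * dy,   # lateral offset in the ego frame
            math.atan2(math.sin(d_theta), math.cos(d_theta)),  # wrapped to (-pi, pi]
        ))
    return local
```

After the transformation, past positions have negative longitudinal coordinates when the vehicle has been driving forward, and the predicted future trajectory starts at the origin.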

III. EXPERIMENTS
This section describes the dataset collected using the CarMaker HILS by IPG Automotive, shown in Fig. 5, and the implementation details of the driving style recognition network and the trajectory prediction network.

A. DATASET ACQUISITION
To examine the proposed method, we carried out multiple driving sessions on the CarMaker HILS for data acquisition. A highway scenario on a three-lane road was used, and the surrounding vehicles were randomly spawned. Straight roads and curved roads with different curvatures were designed for data collection. The collected data consisted of [brake, accel pedal, steering angle, steering angle rate, $v_x$, $a_x$, $a_y$, $\dot{\psi}$] with a sampling time of 0.1 s, and it is depicted as η − in Fig. 3.
Twelve drivers participated in the experiments. They were asked to drive in all types of driving styles, which consisted of normal, aggressive, and distracted.
A Normal driving style is defined as driving safely at a speed similar to the traffic flow, changing lanes smoothly, accelerating and decelerating slowly, and not overtaking the preceding vehicles often. An Aggressive driving style is defined as driving at a faster speed than the traffic flow, changing lanes abruptly and hazardously, accelerating and decelerating rapidly, and overtaking the preceding vehicles. Finally, to create a distracted driving situation, the drivers were asked to text messages on their mobile phones or watch videos while driving.
In Figs. 6 and 7, three different driving styles are visualized as examples based on arbitrarily selected CAN-bus data. The blue, yellow, and red dots represent the distributions of normal, distracted, and aggressive driving styles, respectively. As can be observed in the two figures, different distributions are shown based on the three driving styles. The distribution of the distracted driving style is similar to that of the normal driving style, but wider because unstable driving was performed by the drivers.

1) Driving Style Recognition Network
We pre-processed the CAN data using a window size of 30, a step size of 5, and a window number of 4. After the preprocessing step, the network input size was 30 × 9 × 4. The network is composed of two depth-wise separable convolutional layers with one max-pooling layer. We used the rectified linear unit (ReLU) activation function after each convolutional layer and batch normalization for regularization. In addition, two LSTM layers were used. The details of the network are listed in Table 1. The numbers 1) through 8) indicate the process represented in Fig. 2. In the training process, cross-entropy was used as the loss function. The optimization was performed using a standard Adam optimizer with a learning rate of 0.001.

2) Trajectory Prediction Network
As shown in Fig. 3, the proposed architecture consists of a driving style recognition network, an encoder, and a decoder. In the encoder, the concatenated vector of η, c, and ξ is fed into an LSTM layer with a 64-dimensional cell state. Then, the result passes through a fully-connected layer with 64 hidden units followed by an activation function and two fully-connected layers with eight hidden units, which is the dimension of the latent variable. In the decoder, the concatenated vector of η, c, and z first passes through an LSTM layer with a 64-dimensional cell state followed by a fully-connected layer with 64 hidden units. Then the output passes through an activation function and a fully-connected layer with 20 hidden units, which is the dimension of the predicted trajectory. ReLU is used for the activation function. The sampling time is 0.1 s, and we predicted the next 1 s using the information of the last 1 s. W in Eq. 3 was set to 0.001 in our study. The optimization was performed using an Adam optimizer with a learning rate of 0.001.
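The 20-dimensional decoder output is consistent with a 1 s horizon sampled at 0.1 s, that is, 10 steps with a longitudinal and a lateral coordinate each. The sketch below makes this bookkeeping explicit. The interleaved (x, y) layout of the flat vector is an assumption, since the text does not state how the output is ordered, and both function names are hypothetical.

```python
def horizon_steps(horizon_s, sampling_time_s):
    """Number of prediction steps covered by a horizon at a given sampling time."""
    return round(horizon_s / sampling_time_s)


def unflatten_trajectory(flat, num_steps):
    """Turn a flat decoder output into (x, y) pairs.

    Assumption: coordinates are interleaved as [x_1, y_1, x_2, y_2, ...].
    """
    if len(flat) != 2 * num_steps:
        raise ValueError(f"expected {2 * num_steps} values, got {len(flat)}")
    return [(flat[2 * i], flat[2 * i + 1]) for i in range(num_steps)]


steps = horizon_steps(1.0, 0.1)
trajectory = unflatten_trajectory([0.0] * 20, steps)
print(steps, len(trajectory))
```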

IV. RESULTS
In this section, first, the results of driving style recognition are explained, and then the results of trajectory prediction are described. Subsequently, the results are compared to other baselines, and the performance analysis is conducted.

A. DRIVING STYLE RECOGNITION
The performance of the network was tested using the test dataset obtained from the HILS. The accuracy of the network is calculated at all time steps using the same window size and window number as in training. The driving style with the maximum probability output is the predicted driving style of the network. If the predicted style matches the true driving style of the data, the prediction is evaluated as correct. Table 2 lists the network accuracy for predicting each driving style.
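The evaluation rule described above (the style with the maximum probability counts as the prediction, and accuracy is reported per driving style) can be sketched as follows. The function name and the integer label encoding (0 normal, 1 aggressive, 2 distracted) are assumptions made for illustration.

```python
def per_style_accuracy(true_styles, probabilities, num_styles=3):
    """Fraction of correct argmax predictions for each true driving style.

    true_styles: integer labels; probabilities: one probability list per sample.
    Returns None for a style that does not occur in true_styles.
    """
    correct = [0] * num_styles
    total = [0] * num_styles
    for label, probs in zip(true_styles, probabilities):
        predicted = max(range(len(probs)), key=probs.__getitem__)
        total[label] += 1
        correct[label] += int(predicted == label)
    return [c / n if n else None for c, n in zip(correct, total)]
```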

B. TRAJECTORY PREDICTION

1) Metric
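In our study, we used the root mean square error (RMSE) and mean absolute error (MAE) as the evaluation metrics, which are calculated for the longitudinal position as follows (and analogously for the lateral position with $\hat{y}_i$ and $y_i$):
$$\mathrm{RMSE}_x = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{x}_i - x_i)^2}, \qquad \mathrm{MAE}_x = \frac{1}{N} \sum_{i=1}^{N} \lvert \hat{x}_i - x_i \rvert$$
where $N$ is the total number of prediction steps, $\hat{x}_i$ and $\hat{y}_i$ are the longitudinal and lateral values of the predicted trajectory, respectively, and $x_i$ and $y_i$ are the longitudinal and lateral values of the true trajectory, respectively.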
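Both metrics are straightforward to compute per axis. The following minimal sketch applies them to one coordinate sequence at a time; the function names are illustrative, and the inputs are assumed to be equally long, non-empty sequences of predicted and true values.

```python
import math


def rmse(predicted, true):
    """Root mean square error over the prediction steps of one axis."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, true)) / len(true))


def mae(predicted, true):
    """Mean absolute error over the prediction steps of one axis."""
    return sum(abs(p - t) for p, t in zip(predicted, true)) / len(true)
```

Since RMSE squares the errors before averaging, it penalizes occasional large deviations more strongly than MAE, which is why the two are usually reported together.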

2) Compared Models
In the experiment, we compared eight different models in our work as follows. The vanilla GAN was used to predict the future trajectory. The true future trajectory was fed to an encoder in the training phase. The model structure is composed of an encoder, a generator, and a discriminator.
• Conditional Variational Autoencoder (CVAE without past trajectory): The CVAE model without the associated past trajectory is also used for comparison. The model structure is identical to Fig. 3 except for the past trajectory information.

3) Performance Evaluation
In this section, tables and figures are presented and explained to analyze the performance. Tables 3 and 4 show the longitudinal and lateral position errors in terms of the RMSE and MAE for the 1 s prediction, respectively. Fig. 8 shows the RMSE and MAE as graphs for clear comparison.
The first non-deep learning baseline, which uses the CV model with the UKF, has significant errors, especially in the lateral position. The other non-deep learning model, which uses CTRA, has good estimation results in the longitudinal position. However, despite the improvement compared with the CV model, it shows a large error in the lateral position. V-VAE showed improvement in the lateral position compared with the baseline models, but it performed poorly in the longitudinal position estimation. CVAE, which has no past trajectory as an additional input, has a larger error in the longitudinal direction compared with the baselines. However, its estimation performance in the lateral direction improved. In addition, it shows a dramatic improvement compared with the V-VAE in terms of the lateral position. CVAE without the driving style condition but with the past trajectory generally has better performance than the previous methods. Our proposed model improves on the above methods and achieves the best results. This result shows that when the estimated past trajectory is conditioned, the trajectory estimation performance is significantly improved. Furthermore, using the driving style condition improves the estimation performance and allows multi-modal prediction.
Fig. 9 shows the prediction results for the test dataset. The blue line represents the past trajectory, the green line represents the true future trajectory, and the red dotted line represents the prediction. Fig. 9a shows the result when the true driving style is normal, and each driving style is fed into the model as a condition. The results show that the prediction performance improves when the correct driving style is conditioned.
In addition, when the aggressive style is conditioned, the trajectory tends to extend further. In Fig. 9b and 9c, the trajectory is predicted accurately when the true driving style is conditioned. They also indicate that the model tends to produce a shorter trajectory under the normal condition. In Fig. 10, all the trajectory distributions for the three driving styles are depicted. The results show that our proposed method can generate multi-modal predictions by assigning the probability of each driving style obtained from the recognition network. The trajectory distribution of each mode was drawn along with an ellipse corresponding to 3σ. The prediction distribution with the true condition has a better prediction performance than the others. Additionally, the distribution of the normal condition (black) tends to lie in a shorter area than the others, and the distribution of the aggressive condition (red) tends to lie further. To quantitatively evaluate the performance of each algorithm in this specific test case, the prediction results in terms of total MAE are shown as follows:

only the in-vehicle CAN-bus data. The proposed model uses a DeepConvLSTM network with a sliding window approach for driving style recognition. Then, the classified driving style is assigned to the CVAE structure together with the estimated past trajectory. The proposed method is trained on the experimental data collected using HILS, and the evaluation results show that the proposed method outperforms the baseline methods. In addition, the trained model can produce interpretable multi-modal predictions corresponding to the driving styles. Experiments in more extensive driving scenarios that consider a wider variety of driving styles, and validation on a real-car dataset, remain for future study.