Vehicles trajectory prediction using Recurrent VAE network

This paper presents an analysis of the implementation and performance of a deep learning model based on recurrent layers and a Variational Auto-Encoder (VAE) architecture for the prediction of future local trajectories and maneuvers. The proposed method uses the encoder part of the VAE to represent the behaviour over time of the agents surrounding the vehicle, taking advantage of the fact that a VAE encodes similar situations or states close together in the latent space, and uses the generative properties of the VAE decoder to generate naturalistic driving trajectories. Furthermore, the variance of the predicted trajectory is estimated using the statistical properties of the VAE model: it increases if the input data is noisy or unrealistic and decreases if the model is certain about the prediction. The model is trained and evaluated with a public dataset. The results show that the proposed architecture outperforms state-of-the-art methods in trajectory prediction error and provides a variance estimate that depends on input quality.


I. INTRODUCTION
An accurate and reliable trajectory prediction module, capable of predicting the future paths of the surrounding vehicles, is a key component of the navigation algorithms of modern autonomous vehicles. Those algorithms, such as path planning or decision making, need the future behaviour of the surrounding vehicles as input to work properly. However, accurate trajectory prediction is still a challenging problem due to the uncertainty in the behavior of vehicles and the dynamic nature of the environment. This issue is aggravated in the last seconds of a long-term trajectory, which tends to be highly non-linear and unpredictable because it is much more strongly influenced by the intentions of the drivers, who may have independent goals or different driving styles [1]-[4].
Recent works on trajectory prediction benefit from the use of deep learning models, which are more robust and generalize better. In particular, models such as Generative Adversarial Networks (GANs) [5] or Variational Auto-Encoders (VAEs) [6] can be very useful for this purpose thanks to their generative properties. Above all, VAE models have become increasingly popular and have been used in a variety of works with excellent results, typically for noise reduction [7], dimensionality reduction [8], [9] and realistic data generation [10]-[12]. Trajectory forecasting can benefit from all of these VAE capabilities: the input data for the prediction can be noisy, the relevant features of the surroundings should be extracted into a lower-dimensional space, and a realistic trajectory must be generated.
This work proposes a deep learning model based on VAE and Long Short-Term Memory (LSTM) layers that not only predicts the future maneuver of the surrounding vehicles, but also generates the future path that each vehicle will follow. This approach studies in depth the latent space generated by the VAE to show that, on the one hand, close n-dimensional points correspond to similar situations and, on the other hand, distant points correspond to very different ones. It is also shown that all vehicle interactions and states over time are represented in this space. Furthermore, the generator of the VAE is used to generate naturalistic trajectories, which are similar to the ones of the training set, providing not only the detection of the lane change maneuver, but also how it will be performed (acceleration, speed or how fast the vehicle leaves and merges into the lane).
This work is focused on highway scenarios, as the dataset used provides highway data, but its flexibility makes it applicable to many different highway scenarios with different numbers of lanes or road geometries.
The proposed model is compared with state-of-the-art methods on the same dataset, outperforming them in trajectory error.
The contributions of this paper are:
• The VAE encoder is shown to generate a better latent representation that improves trajectory error.
• A VAE decoder with LSTM layers is used to produce naturalistic paths.
• The uncertainty of the trajectory is estimated.
• A theoretical analysis of the performance of VAE applied to trajectory prediction is performed.
The remainder of this paper is organized as follows. In section II, relevant works are discussed. Section III motivates the choice of dataset and describes the data preprocessing. Section IV describes the proposed architecture. Finally, section V presents and discusses the results and section VI concludes the paper.

II. RELATED WORK
According to [13], trajectory prediction algorithms can be classified into three main types: those based on a physics model, on a maneuver model or on an interaction model. Classic models rely on the definition of a physical model of the vehicle. This model can include only the geometry of the vehicle (kinematic model), or it can also consider the forces that affect its motion (dynamic model). This kind of model is commonly used for collision detection, as the time horizon is not very large. The main drawbacks of physics-based models are the high dependence on the initial conditions, usually estimated with noisy sensors, and the limited information about the environment (road geometry and surrounding vehicles). Some examples of this kind of model can be found in [14], [15].
In maneuver-based models, the vehicle is considered independent from the others on the road, and the model has to predict, from a collection of different possible trajectories, which one the vehicle is most likely to perform. Although this kind of model can consider the surrounding obstacles or the road geometry in order to discard part of the maneuver set, not considering the interactions with other vehicles can lead to undesired behaviour. A representative example is [16], where the predicted path is sampled from a set of real trajectories according to the current situation.
The interaction-aware models consider that the motion of a vehicle is influenced by the other vehicles in the scene. This kind of model relies on detecting and identifying the interactions precisely, but correctly modeling or considering those interactions can be challenging. Dynamic Bayesian Networks have been used in different works [17], [18] to solve this problem, but in recent years the high flexibility and performance of machine learning have led to an extensive use of these techniques for prediction or forecasting. The latest relevant works on the topic use deep neural networks to predict the future path, and differ mainly in the layers or input representation used. In [19], the authors use a 3D tensor that represents the potential field of the traffic scene at different time steps, whereas in [20] and [21] the interactions between vehicles are modeled using an attention layer or an interaction layer respectively. In [3], a convolutional social pooling layer is used to encode the past motion of the neighboring vehicles and an LSTM decoder generates the future trajectory (CS-LSTM).
Two main characteristics can be observed in the deep learning models reviewed: the use of LSTM layers, which improves their capacity for handling time-series data [3], [22]-[26], and an encoder-decoder architecture, also known as sequence-to-sequence for LSTM models, which is extensively used in many different deep learning models such as [27].
VAE follows the encoder-decoder architecture used in most of the reviewed works, but it is based on Bayesian statistics. This type of model stands out for its generative capacities and the ability to produce a regular latent space with good properties for generation. VAE has been used in different autonomous vehicle algorithms to generate a database of realistic trajectories: the TrajVAE work [11] uses a VAE-based model to generate realistic large-scale trajectories (from dozens of seconds to minutes) to create a dataset. However, the trajectories generated by this method are different from the ones studied in this paper, which depend on the surrounding vehicles and have a prediction horizon of up to five seconds, demanding higher precision.
The Multi-Vehicle Trajectory Generation (MTG) method [12] uses a VAE to generate trajectories of pairs of vehicles that drive along a road or through an intersection. This method succeeds in creating realistic trajectories, but it cannot be used to predict the trajectories of the surrounding vehicles, as it is oriented to generating reliable testing scenarios and does not include information about the surrounding vehicles.
Models based on the GAN architecture are also known for generating realistic data. This kind of model has been used to predict the future trajectory of a vehicle, proving its utility, as shown in [28]-[30]. However, a comparison between GAN and VAE models for trajectory generation performed in [11] and [12] shows that GANs can produce a higher-quality reconstruction but tend to lack full support over the data, unlike VAE, which is essential in an unpredictable environment like a road [31].
The work presented in [1] generates predictions of vehicle trajectories using a conditional VAE (cVAE) that takes as an additional input the intention predicted by a separate model that works as a classifier. This framework predicts the future trajectories of the vehicles and outperforms the compared methods. However, it relies heavily on the good performance of the intention recognition classifier, which makes the cVAE generate one of three types of maneuver (keep lane, merge into the left lane or merge into the right lane). In addition, the dataset used in that work does not provide classes for the maneuvers, so they need to be added manually, and the criteria for deciding whether a trajectory belongs to a class are not covered. In contrast, the proposed method does not depend on a classifier and the number of classes is not limited to three; instead, the model can learn any trajectory type.
The aim of this work is to apply a simple VAE model based on LSTM recurrent layers to generate a realistic future local trajectory. The proposed method considers vehicle interactions, as it includes time-distributed data of the surrounding vehicles within the data used to generate the path. These interactions are implicit in the model, i.e. it does not analyze interactions independently.

III. DATASET AND DATA PREPROCESSING
To train the model, a driving dataset that includes vehicle trajectories on a highway is required. One of the most widely used datasets on this topic is the one provided by the Next Generation Simulation (NGSIM) project [32], which includes several hours of driving data. However, as studied in [33], this dataset has some problems in the data which cannot be corrected through filtering, such as wrong trajectories that collide with nearby ones or unrealistically large magnitudes.
The recently released highway drone (highD) dataset [34] overcomes most of the mentioned problems of the NGSIM dataset, providing not only more accurate data, but also a larger collection of samples (the highD dataset provides a total driven distance 9 times higher than NGSIM).
In addition, most of the latest state-of-the-art works on trajectory prediction use the highD dataset for both training and evaluating the results [4], [19]-[21].
Thus, the highD dataset is used for this work.

A. PREPROCESSING AND NORMALIZATION
The highD dataset includes dynamic information such as position, speed and acceleration for every vehicle on the road at each timestamp. By combining and low-pass filtering that information, the surrounding vehicles can be extracted, defining eight zones around every vehicle in which another vehicle may be driving, as shown in Fig. 1. For every zone, a metric m_i is generated to represent the distance to the nearest vehicle in that zone. For all zones except zones 2 and 7, the real distance d_i is transformed into the metric m_i (the value that is input to the network) using d_max, the maximum distance considered (set experimentally to 200 m), so as to better represent the absence of a vehicle in that zone and to represent different distance values evenly. For zones 2 and 7, m_i is set to 1 if a vehicle or part of a vehicle is inside the zone and 0 otherwise. The rest of the values are normalized to fit a normal distribution, except for the position coordinates, which are handled as follows. For the longitudinal coordinate, the first value is subtracted from the whole sequence so that it starts at 0, and all the values are then min-max normalized. The lateral coordinate of every road is transformed to fit a common one in which the lane centers are at 0.25, 0.75 and 1.25 meters.
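The longitudinal normalization described above can be illustrated with a minimal sketch. The function name and the optional `x_min`/`x_max` bounds are hypothetical; the text does not state whether the min-max bounds are taken per sequence or over the whole dataset, so both options are left open here.

```python
import numpy as np

def normalize_longitudinal(x, x_min=None, x_max=None):
    """Shift a longitudinal sequence so it starts at 0, then min-max scale it.

    x_min / x_max are optional bounds (for example, dataset-wide values);
    when omitted, the bounds of the shifted sequence itself are used.
    This choice is an assumption, not something stated in the paper.
    """
    x = np.asarray(x, dtype=float)
    shifted = x - x[0]                       # sequence now starts at 0
    lo = shifted.min() if x_min is None else x_min
    hi = shifted.max() if x_max is None else x_max
    span = hi - lo
    if span == 0:                            # stationary vehicle: avoid division by zero
        return np.zeros_like(shifted)
    return (shifted - lo) / span

# Illustrative longitudinal positions in meters (made-up values).
seq = np.array([120.0, 125.0, 131.0, 140.0])
norm = normalize_longitudinal(seq)
```

Using dataset-wide bounds instead of per-sequence bounds would preserve the relative distance travelled between sequences, which may matter when the decoder has to reproduce speeds.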

B. MANEUVER LABELING
A maneuver label is generated for the dataset, classifying every trajectory into one of three categories: lane keeping, left lane change or right lane change. A trajectory is labeled as a lane change if at least one of its time steps is close in time to a lane change maneuver. This label is not used in training, as it would leak future data, but it is used to perform a deeper analysis of the proposed model in section IV-D.

C. DATA SPLIT
The dataset consists of 60 recording sequences. In order to properly evaluate the performance of the proposed model, the dataset is split into 45 sequences of training data and 15 of test data (75%-25%) to make sure that the performance is evaluated on never-seen sequences. In addition, 30% of the training data is used for validation and hyperparameter tuning.
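A split of this kind can be sketched as follows. The helper names, the random assignment of recordings and the seed are illustrative assumptions; the paper does not specify which recordings go to each subset, nor whether the validation fraction is taken over recordings or over individual samples.

```python
import random

def split_recordings(n_recordings=60, n_test=15, seed=0):
    """Split recording ids into train and test sets.

    Splitting by recording (rather than by sample) keeps every test
    sequence unseen during training.
    """
    ids = list(range(1, n_recordings + 1))
    random.Random(seed).shuffle(ids)
    return sorted(ids[n_test:]), sorted(ids[:n_test])

def split_validation(train_items, val_fraction=0.3, seed=0):
    """Hold out a fraction of the training items for validation and tuning."""
    items = list(train_items)
    random.Random(seed).shuffle(items)
    n_val = int(val_fraction * len(items))
    return items[n_val:], items[:n_val]

train_ids, test_ids = split_recordings()
# Here the validation split is applied to recording ids for brevity;
# applying it to individual samples would work the same way.
fit_ids, val_ids = split_validation(train_ids)
```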

IV. PROPOSED MODEL ARCHITECTURE
This section describes the proposed model architecture. First, the VAE models used are described, followed by the model used to predict the future trajectories and a description of how the full model is trained. Finally, the latent space generated by the encoder and the performance of the VAE itself are analyzed.

A. VAE
A VAE model [6] is very similar in shape to a standard Auto-Encoder (AE), which uses an encoder-decoder architecture to reconstruct the input data, but with some differences. Instead of mapping the input data into a single multi-dimensional point, the VAE encoder generates a Gaussian distribution over that point, defined by its mean and variance. The loss function includes an additional term to force the latent space to be similar to a normal distribution [31]. The temporal nature of the problem makes the use of recurrent neural networks convenient for temporal feature extraction and time-series generation. The proposed architecture for the VAE model uses LSTM layers, widely used for complex sequence learning problems [35], combined with fully-connected and time-distributed fully-connected layers.
The proposed model uses two VAE models, which are trained separately. The first model is trained with M features, corresponding to all the surrounding vehicles data (distances and speeds) for the T previous time stamps. For this model (VAE_surr), only the encoder part is used in the final model to generate the latent variables, but the full model is analyzed in section IV-D to show that all relevant features are represented in the latent space. The second model (VAE_xy) is trained only with longitudinal and lateral positions for T time stamps. The decoder part of this model is used to generate naturalistic trajectories in the full model. In addition, the encoder part generates latent variables of the previous trajectory of the vehicle, which are combined with those of the VAE_surr encoder to obtain a better latent description of the environment.
The loss function combines the Kullback-Leibler (KL) divergence loss D_KL, which measures how similar two probability distributions are and is computed from the predicted mean µ and variance σ², with the mean square error (MSE) between the reconstructed sequence ŷ_i and the ground truth y_i over T time steps. Both losses are combined in a weighted sum, where β is a weight that is treated as a hyperparameter and tuned experimentally: a low value increases the importance of a close reconstruction, whereas a high value favours a regular latent representation, as shown in [36].
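As an illustration of how such a weighted loss is commonly computed, the following sketch uses the standard closed-form KL divergence between a diagonal Gaussian and a standard normal prior. The exact expressions used in this work are not reproduced here, so the form of each term, the log-variance parameterization and the function names are assumptions.

```python
import numpy as np

def kl_divergence(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions.

    log_var = log(sigma^2); predicting the log-variance is a common
    numerical convenience and an assumption of this sketch.
    """
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

def mse(y_true, y_pred):
    """Mean square error over the T time steps of each sequence."""
    return np.mean((y_true - y_pred) ** 2, axis=-1)

def vae_loss(y_true, y_pred, mu, log_var, beta=1.0):
    """Weighted sum of reconstruction error and KL term, averaged over the batch.

    A low beta favours close reconstruction; a high beta favours a
    regular latent space.
    """
    return float(np.mean(mse(y_true, y_pred) + beta * kl_divergence(mu, log_var)))
```

With this form, a posterior equal to the prior (zero mean, unit variance) contributes no KL penalty, so the loss reduces to the reconstruction error alone.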

B. PREDICTION MODEL
The prediction model takes as input the encoded driving scenario for the last T time steps (L_{n,µ} and L_{n,σ}) generated by the two encoders of the model, where n is the index of the input latent feature, and predicts a latent representation of the future trajectory of the vehicle, M_{x,m} and M_{y,m}, where m is the index of the output latent feature.
The first step of this model is to sample the input latent data. The VAE encoders generate two values per dimension, corresponding to a Gaussian distribution: mean (L_{n,µ}) and variance (L_{n,σ}). The latent input variable is randomly sampled from that distribution in both the training and the prediction stages. Compared with using only the mean value and discarding the variance, this sampling procedure improves the robustness of the model and makes it possible to estimate the trajectory mean and variance in the prediction stage using the Monte Carlo sampling method.
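The sampling and Monte Carlo estimation steps can be sketched as below. `predict_fn` stands in for the prediction model followed by the decoder and is purely hypothetical, as are the toy predictor, the number of samples and the 9-dimensional latent vector (chosen only to mirror the 7 + 2 latent dimensions reported later in the paper).

```python
import numpy as np

def sample_latent(mu, var, rng):
    """Draw one latent sample z ~ N(mu, var) per dimension."""
    return mu + np.sqrt(var) * rng.standard_normal(mu.shape)

def monte_carlo_prediction(mu, var, predict_fn, n_samples=100, seed=0):
    """Estimate the trajectory mean and variance by repeated sampling.

    predict_fn maps one latent vector to one trajectory array; it is a
    placeholder for the trained prediction model plus decoder.
    """
    rng = np.random.default_rng(seed)
    samples = np.stack(
        [predict_fn(sample_latent(mu, var, rng)) for _ in range(n_samples)]
    )
    return samples.mean(axis=0), samples.var(axis=0)

def toy_predict(z):
    """Toy stand-in predictor: a 5-step (x, y) trajectory from two latents."""
    return np.outer(np.arange(1, 6), z[:2])

latent_mu = np.ones(9)
latent_var = np.full(9, 0.5)
traj_mean, traj_var = monte_carlo_prediction(latent_mu, latent_var, toy_predict)
```

In this toy setting the estimated variance grows along the horizon because the same latent perturbation is scaled by the time index, which mimics the intuition that later time steps are more uncertain.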
The sampled data are fed to a relatively simple model based only on Fully Connected Layers. Considering that the input data have been processed by the encoder and the output data will be processed by the decoder, the complexity of the model and layers does not need to be high.

C. FULL MODEL TRAINING
The full model is trained in two stages. First, the VAE models are trained: the VAE_surr model (in charge of encoding the information of the driving scene) is trained with surrounding vehicles data, whereas VAE_xy is trained with trajectories from the dataset. The aim of those models is to encode the input data into a lower-dimensional variable and to decode the predicted value into real trajectories. Thus, the training input and output of each model are the same, using the MSE as the training loss of the models.
After that, the full model is assembled to train the prediction model. To do this, the previously trained encoders from VAE_surr and VAE_xy and the decoder from VAE_xy (whose weights are set as non-trainable) are put together with the prediction model, as shown in Fig. 2. The input data to the assembled model are the current and past states X_{m,t}, where m can be either the feature index or x/y to denote the longitudinal or lateral trajectory respectively, and t is the time step. The loss is computed using the MSE between the reference trajectories and the ones generated by the decoder (Y_{m,t}). This loss is backpropagated through the decoder to the prediction model in order to update its weights.

D. LATENT ANALYSIS
The number of latent variables is a key parameter that needs to be chosen carefully. It needs to be sufficiently large to correctly represent all the input data, but also as low as possible so that only the relevant characteristics are extracted from the data. To choose this parameter, a principal component analysis (PCA) is performed, using the LAPACK implementation of the singular value decomposition as shown in [37]. First, a sufficiently large number of dimensions is chosen to train the VAE. After that, the results of the PCA are analyzed and, using the cumulative sum of the explained variance ratio, the number of dimensions chosen is the lowest that explains at least 95% of the variance of the data, which is 7 for the VAE_surr model and 2 for the VAE_xy model.
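The selection criterion can be sketched with a plain singular value decomposition. This is a minimal illustration using NumPy on synthetic data, not the tooling of [37]; the function name and the synthetic latent matrix are assumptions.

```python
import numpy as np

def n_components_for_variance(latents, threshold=0.95):
    """Smallest number of principal components whose cumulative explained
    variance ratio reaches the threshold.

    latents: array of shape (n_samples, n_latent_dims).
    """
    centered = latents - latents.mean(axis=0)
    # Singular values of the centered data; their squares are
    # proportional to the variance explained by each component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained_ratio = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained_ratio)
    # First index where the cumulative ratio reaches the threshold, plus one.
    return int(np.searchsorted(cumulative, threshold) + 1)

# Synthetic latent codes: two dominant directions and four near-constant ones.
rng = np.random.default_rng(0)
scales = np.array([10.0, 8.0, 0.1, 0.1, 0.1, 0.1])
latents = rng.standard_normal((500, 6)) * scales
n_dims = n_components_for_variance(latents)
```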
Furthermore, an in-depth study of the latent variables generated by the VAE encoder is performed. This study aims to verify two assumptions: the latent space correctly represents all the input data, and similar driving situations are close together. For the first assumption, the mean absolute error (MAE) of the data reconstructed by the decoder is analyzed, giving a quantitative result of 0.7m for the lateral trajectory and 0.04m for the longitudinal trajectory. Furthermore, a visual inspection of several samples of the data gives a good qualitative result, as shown in Fig. 3, where the reconstructed surrounding data are similar to the original data and capture the evolution of dynamic data (Fig. 3(a) and 3(d)) and the sharp changes in distance due to lane changes of other vehicles (Fig. 3(b) and 3(c)), meaning that the VAE model gathers all the relevant data in the latent variables.
For the second assumption, the latent variables of only the surrounding data are grouped according to the maneuver that is taking place or will take place soon in the input data. Fig. 4 compares the most representative 2D slices of the latent hyperspace for the VAE and a normal AE. The maneuvers encoded in the latent space of the VAE are clearly distinguishable and form different groups (Fig. 4(a)), whereas for the normal AE the different maneuvers are not distinguishable, as their corresponding latent variables are located in the same region (Fig. 4(b)). This analysis shows that it is possible to predict the next maneuver using only information about the surrounding vehicles, and that the VAE is able to distinguish and separate those situations.
In addition, Fig. 5 shows the lateral and longitudinal trajectory reconstruction of the VAE decoder for different combinations of two latent space variables. The VAE has learnt a smooth representation of all possible trajectories in the latent space. In this specific case, latent dimension 0 mostly represents the lateral road position, whereas latent dimension 1 denotes the maneuver or lateral displacement. This smooth representation of all the trajectories in the latent space is meant to improve the precision of the full model, as a small error in the future trajectory in the latent space results in a very similar trajectory. Furthermore, the low-dimensional representation of the state, where different maneuvers are grouped, makes it easier for the following layers to produce correct predictions and reduces the complexity required of the prediction model.

V. RESULTS
The proposed model has been evaluated using sequences from the database to assess its accuracy. This section describes the process of obtaining the metrics needed to determine the accuracy of the model, compares the results with state-of-the-art models and evaluates the results obtained. In addition, a second experiment is performed in which the input data quality is degraded by adding noise. The results show how a low-quality input affects both the accuracy and the covariance of the model.

A. EVALUATION METRICS
Similar works on trajectory prediction evaluate the performance of their approach using either the root mean square error (RMSE) or the MAE, computed separately for the longitudinal and lateral predictions, which gives a good intuition of the model performance. For this reason, these two metrics are calculated to evaluate the proposed model and to compare it with similar works.
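Both metrics can be computed per axis as in the following sketch. The array layout (samples, time steps, axis), with index 0 for the longitudinal and index 1 for the lateral coordinate, is an assumption made for illustration, and the trajectories are made-up values.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error over all samples and time steps."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error over all samples and time steps."""
    return float(np.mean(np.abs(y_true - y_pred)))

def per_axis_errors(y_true, y_pred):
    """Compute RMSE and MAE separately for each axis.

    Arrays are assumed to have shape (n_samples, n_steps, 2) with
    axis 0 = longitudinal and axis 1 = lateral.
    """
    names = ("longitudinal", "lateral")
    return {
        name: {
            "rmse": rmse(y_true[..., k], y_pred[..., k]),
            "mae": mae(y_true[..., k], y_pred[..., k]),
        }
        for k, name in enumerate(names)
    }

# One made-up trajectory with two time steps.
truth = np.zeros((1, 2, 2))
pred = np.array([[[3.0, 1.0], [-4.0, -1.0]]])
errors = per_axis_errors(truth, pred)
```

RMSE penalizes large deviations more strongly than MAE, so reporting both helps to distinguish a few large misses from a uniformly small error.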

C. RESULTS EVALUATION
The results have been obtained by evaluating the predictions of the proposed model, which are compared with the real trajectories collected in the database.
The inputs of the model are sequences that include all the pre-processed data that the model needs. The model has been re-configured and evaluated five times to generate predictions for different time horizons (from 1 second to 5 seconds). The obtained results are compared with the ground truth trajectories that the vehicles followed in each sequence, computing the metrics that evaluate the model.
The proposed method captures the environment situation better and generates realistic trajectories, so the main improvements are obtained for long time horizons, whereas for short time horizons the results show no improvement, as shown in Table 1. Fig. 6(a) gives an example of trajectory forecasting for every vehicle in the scene. The predicted trajectory is close to the ground truth in both the lateral and longitudinal axes. Some of the sampled trajectories for a vehicle can be far away from the ground truth, but the average of all the samples results in a close prediction. One of the main advantages of the method is the possibility of estimating the variance of the generated trajectory. The second experiment evaluates qualitatively how the MSE and the variance of the trajectory evolve for different amounts of noise in the input data. The results show that, for known situations (where little or no noise is added), the variance is low (Fig. 6(a)), but for very different ones (where the amount of noise added is higher), not only is the prediction error greater, but the variance also increases, as shown in Fig. 6(b).
In addition, the proposed method is able to run fast enough for real-time applications. Predictions and sampling can be performed in parallel, reducing the computational time in contrast to many interaction models. The data of all vehicles can be concatenated and repeated multiple times to sample with the Monte Carlo method in parallel using a GPU.
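The idea of repeating the data of all vehicles so that every Monte Carlo sample is evaluated in a single batched call can be sketched as follows. NumPy on the CPU is used here for illustration; `predict_batch_fn` is a hypothetical stand-in for the trained prediction model and decoder, and the shapes and weights are made up.

```python
import numpy as np

def batched_monte_carlo(mu, var, predict_batch_fn, n_samples=50, seed=0):
    """Evaluate all Monte Carlo samples of all vehicles in one batched call.

    mu, var: arrays of shape (n_vehicles, n_latent).
    predict_batch_fn: maps (batch, n_latent) to (batch, ...) outputs.
    """
    rng = np.random.default_rng(seed)
    n_vehicles = mu.shape[0]
    # Repeat each vehicle's row n_samples times, consecutively.
    mu_rep = np.repeat(mu, n_samples, axis=0)
    var_rep = np.repeat(var, n_samples, axis=0)
    z = mu_rep + np.sqrt(var_rep) * rng.standard_normal(mu_rep.shape)
    out = predict_batch_fn(z)                       # one forward pass
    # Regroup so that axis 1 indexes the samples of each vehicle.
    out = out.reshape(n_vehicles, n_samples, *out.shape[1:])
    return out.mean(axis=1), out.var(axis=1)

# Toy linear predictor: 9 latent values to 4 output values per vehicle.
weights = np.arange(36, dtype=float).reshape(9, 4) / 36.0

def toy_batch_predict(z):
    return z @ weights

scene_mu = np.ones((3, 9))
scene_var = np.full((3, 9), 0.2)
scene_mean, scene_std2 = batched_monte_carlo(scene_mu, scene_var, toy_batch_predict)
```

Because the loop over samples is replaced by a larger batch dimension, the same pattern maps directly onto a GPU framework, where one large forward pass is usually cheaper than many small ones.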

VI. CONCLUSIONS
In this work, a novel probabilistic trajectory prediction system, which makes use of an LSTM VAE network, has been evaluated.
The tests performed on the dataset prove the viability of the approach, and the results outperform state-of-the-art approaches to trajectory prediction. Furthermore, the proposal is able to provide the trajectory uncertainty by means of the LSTM VAE network.
Future work will focus on adapting the proposed architecture to different scenarios such as city roads or intersections, as the proposed method has been trained and tested only on highway roads, and on including raw data from vehicle sensors (images or LiDAR point clouds) to obtain a better description of the environment.
The source code used in this work is publicly available at https://github.com/midemig/traj_pred_vae.