DeepTrip: A Deep Learning Model for the Individual Next Trip Prediction With Arbitrary Prediction Times

The increasing availability of travel trajectory data allows for a better understanding of travel behavior. In the individual mobility analysis, the problem of next trip prediction assumes a central role and is beneficial for applications such as personalized services and mobility management. This paper addresses the next trip prediction problem with arbitrary prediction times (the time when the prediction is made). This problem has not been studied adequately in the literature and it is important for applications driven by system events, such as proactive travel recommendations under disruptions or crowding in transport systems. It predicts an individual’s next trips given their historical trip sequences and the prediction time. We formulate the next trip prediction problem as on-board and off-board predictions depending on an individual’s travel status (i.e. on-board/off-board). Using historical/real-time travel trajectories, a DeepTrip model is proposed based on a trip sequence-to-sequence deep learning structure coupled with an attention mechanism. A novel overlapped embedding method is proposed to represent continuous travel attributes capturing simultaneously the categorical and numerical feature information. We also develop a random-sampling training algorithm to learn the impact of the prediction time. The model is validated using trip data in urban rails. The results show that DeepTrip outperforms statistical-based models by more than 10% in terms of accuracy and other deep learning models by 2%-3%. The impact analysis shows that different representations are appropriate for the two prediction cases (on-board/off-board), and the prediction performance does not monotonically improve as the prediction time approaches the next trip.


Pengfei Zhang, Haris N. Koutsopoulos, and Zhenliang Ma
Index Terms: Next trip prediction, individual mobility, deep learning, metro systems.

I. INTRODUCTION
Individual mobility studies how humans move within a network or system [1]. Understanding individual mobility is essential and beneficial for many applications in areas such as urban planning [2], [3], personalized recommendations [4], and Intelligent Transportation Systems (ITS) [5]. Individual mobility prediction can be generally defined as follows: given a series of spatiotemporal records $S = [r_1, r_2, \ldots, r_n]$, referred to as individual mobility records hereafter, predict the next mobility record $r_{n+1}$. Depending on the prediction task, $r_i$ can be a timestamped location in GPS trajectory data (next location prediction), or a trip record with origin/destination times/locations (next trip prediction).
Depending on the application context, individual mobility prediction can be categorized into two classes: customer event-triggered and system event-triggered. In customer event-triggered mobility prediction, $r_{n+1}$ is predicted when mobility record $r_n$ is collected. For example, in personalized recommendations, once the latest mobility record ($r_n$) of an individual is observed, the next possible visiting location and time ($r_{n+1} = (l_{n+1}, t_{n+1})$) are predicted so that recommendations about the predicted location can be pushed in a timely manner [6]. Individual mobility prediction triggered by a system event may take place at any time, serving a certain system purpose. In this case, the time when the prediction is made (prediction time) is important.
Personalized information provision is increasingly becoming an important service of smart transportation systems. With the ability to predict users' mobility at arbitrary times, transportation authorities could target the affected users in certain situations and provide them with relevant information. For example, the model may predict the potentially affected passengers after an incident takes place in a metro system. Disruption information could then be pushed only to these users. Also, operators could identify potential travelers during the morning peak beforehand and provide them with incentive schemes [7] to reduce system congestion during peak hours. Compared to a traditional information system (e.g., distributing the information on a website or pushing the information to all users), prediction-based personalized information provision can improve information relevance and accessibility, as well as reduce the communication cost.
The customer event-triggered mobility prediction is defined as: given a sequence of an individual's historical mobility records $[r_1, r_2, \ldots, r_n]$, predict their next travel record $r_{n+1}$ immediately after $r_n$ is observed. In contrast, the system event-triggered mobility prediction is defined as: given a sequence of an individual's historical mobility records $[r_1, r_2, \ldots, r_n]$, predict the next travel record $r_{n+1}$ at time $t \ge t_n$. The customer event-triggered mobility prediction problem has been widely studied in areas such as recommendations for points of interest (e.g. dining/tour) [8] and next trips in public transport [9]. However, to the best of our knowledge, the system event-triggered individual mobility prediction problem, in which the prediction time is arbitrary, has not been addressed in the literature. The customer event-triggered mobility prediction problem is a special case of the system event-triggered prediction problem in which the prediction time is set to $t_n$.
The system event-triggered individual mobility prediction problem is challenging in the following aspects:
• It is challenging to model the prediction time so as to capture its impact on the prediction. Figure 1a shows the definition of the individual mobility prediction problem. Existing studies predict the mobility attributes given the individual's travel sequence. As discussed, including an arbitrary prediction time in the prediction is essential for online applications. We model the problem as predicting the next trip given both the travel sequence and the prediction time information. Figure 1b is an example of such a case. Assume this individual has two travel patterns for work-to-home trips, starting at 16:00 and 19:00, respectively. Given that the last trip (trip $n$) happens at 8:00 in the morning, predicting at $t_1 \le$ 16:00 has to consider that the next trip may take place at 16:00 or 19:00. However, predicting at $t_2 >$ 16:00 can utilize the additional information that the trip at 16:00 did not happen. Thus the next trip will probably be the 19:00 trip.
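The work-to-home example can be made concrete with a small sketch. It only illustrates why the prediction time carries information and is not part of DeepTrip: it assumes a hypothetical list of past start times (`history`) and simply renormalizes the empirical frequencies over the start times that are still possible at the prediction time.

```python
from collections import Counter

def next_start_distribution(history_minutes, t_pred):
    """Empirical distribution of the next start time, given that no trip
    has started up to t_pred (all times in minutes after midnight)."""
    counts = Counter(t for t in history_minutes if t > t_pred)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: c / total for t, c in sorted(counts.items())}

# Hypothetical history: six work-to-home trips at 16:00 and four at 19:00.
history = [16 * 60] * 6 + [19 * 60] * 4

before = next_start_distribution(history, t_pred=15 * 60)  # t1 <= 16:00
after = next_start_distribution(history, t_pred=17 * 60)   # t2 > 16:00
```

Before 16:00 both patterns remain candidates, while after 16:00 only the 19:00 pattern is left, which is the effect the prediction time is meant to capture.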
• The mobility sequence has varied types of data associated with it, including discrete (e.g. labeled metro stations) and continuous data (e.g. trip time). The continuous time data in existing studies is discretized into intervals [9], [10]. However, this approach suffers from interval length and boundary issues. For example, 8:59 AM and 9:01 AM are represented by two different classes if time is discretized into hourly intervals starting on the hour.
• Given information on the travel sequences of an individual, capturing the complex spatiotemporal and historical dependencies, as well as selecting the most relevant sequences (context-aware) for prediction is not straightforward.
The paper proposes a deep learning-based framework, DeepTrip, to model the system event-triggered individual mobility prediction problem and overcome limitations in the existing literature. The main contributions are:
• Development of a deep natural language processing (NLP) based model structure to predict, at arbitrary prediction times, individuals' next trips given their historical trip sequences. The model can also deal with arbitrary prediction horizons (when the next trip happens), not necessarily limited to the same day.
• Development of novel data representation methods for discrete and continuous attributes of trip sequences, using an overlapped embedding model for temporal data representation. The model captures both the categorical and numerical features of temporal information.
• Development of a random-sampling training algorithm coupled with a pairwise time-pointer mechanism to capture the impact of the arbitrary prediction time.
• Empirical analysis to validate the model using smart card trip data from a busy urban railway system, to compare it with state-of-the-art models, and to systematically explore the impacts of data representations and prediction time on model performance.
The remainder of the paper is organized as follows. Section II reviews relevant literature on individual mobility focusing on problem definition, data representation, and prediction models. Section III defines the problem, proposes the DeepTrip model, and details the prediction methodology. A case study using smart card data from an urban railway system is performed in Section IV. The model performance is validated by comparing it with state-of-the-art models. The impacts of the data representation and prediction time are investigated. Finally, Section V concludes the paper and discusses further studies and applications.

II. LITERATURE REVIEW
Various studies have shown that individual mobility, in general, can be predicted depending on mobility characteristics [11], [12], [13]. For example, Song et al. [11] used an entropy measurement and showed that 93% of human mobility can be potentially predicted in terms of the next location using mobile call and dial data. The rapid advancements in learning algorithms facilitate urban mobility prediction development. We review the literature from three perspectives: problem setting, data representation, and prediction models.

A. Problem Setting
The problem of individual mobility prediction has been studied under different contexts. Early studies focus on individual next location prediction. For example, Calabrese et al. [6] assume that human mobility behavior is periodic over time with a period $T$ and predict the location at time $t$ based on the sequence of locations at times $t - T$, $t - 2T$, etc. Many studies apply a Markov chain model and predict the next location based on transition probabilities of candidate locations [14], [15], [16]. Recently, some studies have addressed predicting both the next locations and times. The prediction is triggered by a customer event, i.e. the next travel record is predicted right after the most recent trip (location and time) is observed. For example, Gidófalvi and Dong [8] propose a continuous-time Markov model to predict individual departure times and destinations.
The aforementioned models focus on next location prediction; however, very limited attention has been paid to next trip prediction (origin, destination, and times). Zhao et al. [9] developed a mobility n-gram model to predict passengers' next trips in transit systems using AFC data. The approach divides the prediction into two sub-tasks: trip making prediction and trip attributes prediction, modeled using logistic regression and n-gram models, respectively.
Different from these problems, this paper formulates the individual mobility prediction problem with arbitrary prediction times (when the prediction is made) and the prediction horizon that may go beyond the end of the calendar day. Such a model can be useful for applications driven by system events, such as proactive travel recommendations under disruptions or crowding in transport systems. The model, for a given prediction time, predicts the individual next trips given the individual's historical trip sequences. The problem studied in this paper generalizes existing problems that assume a prediction time defined by the end of the last trip, and a prediction horizon constrained by the end of the day.

B. Data Representation
Mobility records contain the spatial-temporal travel attributes of trips, e.g. origin, destination, departure time, etc. Properly representing these attributes is critical for individual mobility prediction. The spatial travel attributes are usually represented using a grid-based system. For example, Calabrese et al. [6] divided the geographical space into labeled grids and allocated GPS records to the corresponding grids based on their coordinates. Feng et al. [10] aggregate GPS records with the proximity in space and time into one proxy record. Instead of directly using the location information, Gambs et al. [15] mined Point of Interest (POI) data to represent the corresponding travel locations. POI is found to be more informative than location information. They can capture the underlying travel activities and purpose, and thus improve the individual prediction performance.
Representing temporal information is generally challenging. Ideally, the temporal information representation includes both categorical and numerical features. The categorical features refer to the semantic relationship between time points regardless of the magnitude of the time difference, while the numerical features capture the chronological dependencies, particularly for discretization boundaries. Regarding temporal information representation, the simplest form used in the literature is based on the order of visiting different locations, regardless of the difference of these times [14], [15], [16]. For representations incorporating time attributes, two main approaches are commonly used. The first considers only the time of day but ignores the date. For example, Feng et al. [10] divided the day into 24 one-hour intervals. The drawbacks of such representation are: a) the proper interval width is hard to determine; b) the representation is sensitive to interval boundaries (two close time points on the two sides of the interval boundary are treated as very different). Several methods have been proposed to address these issues, including entropy-based and error-based discretization [17]. Although these methods could figure out a more reasonable division of the time domain, boundary and interval width issues still exist. The second approach uses timestamp records with both date and time information. The timestamp records are transformed into a sequence of ordered integers [8]. The limitation of both approaches is that the numerical feature of time is overrepresented, not well representing the temporal heterogeneity of mobility activities. For example, the same mobility activity may start/end with a time difference from day to day. Using the exact timestamp value may cause the model to treat the same activity differently, thus degrading the prediction performance.
We propose a novel overlapped embedding model to capture both categorical and numerical features of time attributes and automatically switch between them given the prediction context. The structure fits well with the nature of the studied mobility prediction problem.

C. Prediction Models
The individual mobility prediction models in the literature can be categorized as statistical and deep learning-based models.
The statistical models have been widely used and they model the mobility sequence dependencies using probabilistic methods, such as the Markov chain and its variations [15], [18], [19], decision trees [20], and natural language models [21]. For example, Gambs et al. [15] developed an n-Mobility Markov Chain (n-MMC) model to predict the next location based on the n previously visited locations. The n-MMC model exhibits a promising performance when n = 2. Increasing the number of look-back steps brings no significant improvement for n > 2. Mathew et al. [18] proposed a hybrid approach to predict the next location using GPS data by combining Hidden Markov Models (HMMs) and location clustering. The individual visited locations are clustered and then fed into the HMMs. Zhang et al. [19] developed a group-specific mobility modeling framework and predicted the next visiting location based on geo-tagged social media data. Monreale et al. [20] proposed a T-pattern tree model, a decision tree-based model, which learns from a previously extracted concise representation of mobility behaviors using GPS trajectory data. Hsieh et al. [21] developed a time-aware language model (T-gram) to predict the next location using check-in data.
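The idea behind an order-n Markov chain predictor can be sketched in a few lines. This is a generic illustration under simplifying assumptions (raw location labels, frequency counts, no clustering or smoothing), not the implementation of Gambs et al.; the function names and the toy visit sequence are hypothetical.

```python
from collections import Counter, defaultdict

def fit_markov(sequence, n=2):
    """Count transitions from each tuple of n previous locations to the next."""
    transitions = defaultdict(Counter)
    for i in range(n, len(sequence)):
        state = tuple(sequence[i - n:i])
        transitions[state][sequence[i]] += 1
    return transitions

def predict_next(transitions, recent, n=2):
    """Return the most frequent successor of the last n locations, or None."""
    state = tuple(recent[-n:])
    if state not in transitions:
        return None  # state never seen in training
    return transitions[state].most_common(1)[0][0]

# Toy visit sequence (hypothetical).
visits = ["home", "work", "gym", "home", "work", "gym", "home", "work", "cafe"]
model = fit_markov(visits, n=2)
prediction = predict_next(model, ["gym", "home", "work"], n=2)
```

Because the transition counts are fixed after fitting, such a model reflects only the static transition structure of the training data, which is the limitation discussed next.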
Despite their performance, the statistical-based prediction models can only capture the transition probability of mobility patterns in a static manner (calibrated or learned from the training data). They are limited in utilizing longitudinal dependencies, such as periodical regularity. Deep learning-based individual mobility prediction methods have emerged in recent years [10], [22], [23]. They not only capture the high-order spatial-temporal dependencies but also learn the longitudinal and periodic features, thus providing a better performance compared to statistical models. For example, Feng et al. [10] proposed the DeepMove model, a sequence-to-sequence model structure coupled with an attention mechanism filtering historical sequences, to predict an individual's next visiting location. The results show that DeepMove outperformed the MMC model by more than 10% in terms of accuracy. Rossi et al. [22] combined a Long Short-Term Memory (LSTM) network and an attention module to predict the next destination of taxis. Recently, newly emerged deep sequential models have been incorporated into deep learning-based mobility prediction models. Xue et al. [24] proposed a transformer-based model, MobTCast, to predict the traveler's next visiting location; Tao et al. [25] combined the transformer model with reinforcement learning to predict an individual's mobility over the long term. Other prediction scenarios have also been studied. Zhao et al. [26] studied casual trip prediction in the metro system, and Zhou et al. [27] focused on sparse trajectory prediction and trajectory classification.
The DeepTrip model proposed in this paper predicts the individual next trip attributes under the system event-triggered context with a deep learning-based framework.

III. METHODOLOGY

A. Problem Definition
Table I summarizes the notations used in the paper. Let the time domain be denoted by $T$ and be modeled as the ordered set of non-negative natural numbers $\mathbb{N}^+$. A trip $tr$ of an individual can be characterized by a tuple of the related attributes, $tr = (a_1, a_2, \ldots, a_p)$, where $a_i$ denotes the $i$th attribute. In this study, we use 4 attributes, namely start time $t_o$, origin $o$, destination $d$, and day of the week $w$, to formulate the attribute tuple $tr = (t_o, o, d, w)$. Note that extra attributes could also be added to the attribute tuple if available. Also, we do not include the end time $t_d$ of a trip in the tuple since $t_d$ is mainly dominated by the system and not the individual's mobility behavior; predicting $t_d$ is outside the scope of this study.
The trip sequence of an individual $u$ is defined as a chronological list of $u$'s last $n$ recorded trips, $S_{u,n} = [tr_1, tr_2, \ldots, tr_n] = [(t_{o,1}, o_1, d_1, w_1), \ldots, (t_{o,n}, o_n, d_n, w_n)]$, where $tr_n$ denotes the recorded trip at time instance $n$, i.e. the last observed trip of $u$. The trip start times are irregularly spaced but temporally ordered, with $t_{o,1} < t_{o,2} < \cdots < t_{o,n} \in T$.
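As a minimal sketch, the trip tuple and the ordering requirement on a trip sequence could be represented as follows. The field names mirror the notation above; the time unit (minutes from a reference point), the station labels, and the weekday convention are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Trip:
    t_o: int  # start time, in minutes from a reference point (assumed unit)
    o: str    # origin station
    d: str    # destination station
    w: int    # day of the week, 0 = Monday (assumed convention)

def is_valid_sequence(trips: List[Trip]) -> bool:
    """Check that start times are strictly increasing, as the definition requires."""
    return all(a.t_o < b.t_o for a, b in zip(trips, trips[1:]))

# Hypothetical sequence: two trips on Monday and one on Tuesday.
seq = [Trip(480, "A", "B", 0), Trip(1020, "B", "A", 0), Trip(1920, "A", "B", 1)]
```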
Based on the individual's travel status, the mobility prediction problem is divided into two sub-problems: off-board prediction and on-board prediction. The off-board prediction problem is informally defined as predicting each travel attribute of the following trip.
• Problem 1. Off-Board Prediction: Given an individual $u$, its trip sequence history $S_{u,n}$ up to time instance $n$, and prediction time $t_{pred}$ where $t_{d,n} < t_{pred} < t_{o,n+1}$, predict the origin $o_{n+1}$, destination $d_{n+1}$, start time $t_{o,n+1}$, and day of the week $w_{n+1}$ of the next trip.
Subsequently, the on-board prediction problem is informally defined as predicting the destination of the current trip (the prediction time is between the start time and completion time of a trip).
• Problem 2. On-Board Prediction: Given an individual $u$, its trip sequence history $S_{u,n}$, partial information of trip $n+1$, $(t_{o,n+1}, o_{n+1}, ?, w_{n+1})$, and prediction time $t_{pred}$ where $t_{o,n+1} < t_{pred} < t_{d,n+1}$, predict the destination $d_{n+1}$ of the current trip.
Figure 2 illustrates the off-board and on-board prediction problems. The prediction time $t_{pred}$ is the time when the prediction takes place. Off-board refers to the scenario in which the individual has already finished his/her last trip ($tr_n$) when the prediction is conducted, while on-board means that the prediction is made during the individual's next trip ($tr_{n+1}$). Note that the prediction objective and available information are different in these two problems.
Fig. 2. Illustration of the off-board and on-board prediction problems. Off-board prediction is conducted in the interval between the last trip $tr_n$ and the next trip $tr_{n+1}$; on-board prediction during a trip.

B. Model Framework
• Sequence pattern learning module. It learns the sequential dependencies of individual trip sequences using a sequence-to-sequence structure with a multi-head attention mechanism. It processes the historical and real-time trip sequences separately (i.e. $S^V_h$ and $S^V_c$), and projects them using the corresponding historical/real-time GRU networks. The multi-head attention layer looks up the historical trip sequence and finds the trip sequence $c_n$ most relevant to the next trip prediction based on the current trip sequence information $h_n$ and the prediction time point $t^p_\tau$. Finally, the selected historical trip sequences, real-time trip sequence, and the prediction time point are concatenated into a vector $[c_n, h_n, t^p_\tau]$, serving as input to the prediction module.
• Prediction module. The prediction module includes a fully-connected neural layer with a softmax activation function. It takes as inputs the projected trip sequence vectors of all individuals and outputs the predicted information of the off-board of an individual.
The proposed DeepTrip structure provides a unified framework for dealing with both on-board and off-board prediction problems, by formulating model inputs differently. For the off-board prediction, the input for an individual u is the trip

C. Travel Feature Extraction Module
Trip attributes are represented by different data types, i.e. discrete and continuous. $o$, $d$, and $w$ are categorical data (e.g. station number), while $t_o$ is continuous. Existing studies represent the time $t_o$ using hourly time intervals, e.g. 7:00-8:00 AM [10], [22]. Let the time instance $t_{o,i} \in T$, with $t^0_i \le t_{o,i} \le t^{24}_i$, be the starting time of trip $i$, which is between 0:00 and 24:00 of a day. The hourly interval representation of $t_{o,i}$ is:
$$\mathrm{hourly}(t_{o,i}) = \mathrm{int}\left(24 \cdot \frac{t_{o,i} - t^0_i}{t^{24}_i - t^0_i}\right)$$
where $\mathrm{hourly}(t_{o,i})$ is the hourly representation of the trip start time, and $\mathrm{int}(\cdot)$ the floor function. Although the hourly representation achieves good model performance, it risks losing important features of the temporal information. Theoretically, temporal information includes both categorical and numerical features. The categorical feature refers to the semantic relationship between time points regardless of the magnitude of the time difference.
For example, 7:30 AM is 1 hour away from both 8:30 AM and 6:30 AM. However, the mobility pattern at 8:30 AM is more similar to that at 7:30 AM than the pattern at 6:30 AM is, since 7:30 and 8:30 AM are in the morning peak while 6:30 AM is off-peak. The discrete representation can capture such categorical features well. However, it ignores the numerical feature of the temporal information and is limited in capturing chronological dependencies. For example, with the hourly interval representation, the interval difference between 8:59 AM and 9:01 AM is the same as that between 8:01 AM and 9:59 AM. However, conceptually the mobility pattern differences for these two cases could be substantial. The numerical feature is important for capturing temporal information, particularly for times around category boundaries. In the following, we develop four alternative strategies for representing trip attributes.
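The boundary issue can be checked numerically. The sketch below assumes times are given in minutes after midnight, so the hourly index is a floor division by 60; the helper names are hypothetical.

```python
def hourly(minutes_after_midnight):
    """Hourly interval index (0..23) of a time in minutes after midnight."""
    return minutes_after_midnight // 60

def to_minutes(hour, minute):
    return hour * 60 + minute

def interval_gap(pair):
    """Difference in hourly interval index between two time points."""
    return hourly(pair[1]) - hourly(pair[0])

def minute_gap(pair):
    """Actual difference between two time points, in minutes."""
    return pair[1] - pair[0]

close_pair = (to_minutes(8, 59), to_minutes(9, 1))  # 2 minutes apart
far_pair = (to_minutes(8, 1), to_minutes(9, 59))    # 118 minutes apart
```

Both pairs differ by exactly one hourly interval, although one pair is 2 minutes apart and the other 118 minutes apart.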
1) Embedding representation: Following the word2vec idea [28], the categorical data is first represented using a one-hot format and then transformed into a dense vector using a one-layer linear neural layer. Compared to the direct one-hot representation, the Euclidean distance between dense vectors captures the semantic similarity and thus is more informative. For each trip attribute that has a discrete form, an embedding module is built to represent the raw data.
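A minimal NumPy sketch of this embedding step follows. It assumes a bias-free linear layer with randomly initialized (untrained) weights and hypothetical sizes; in that case, multiplying a one-hot vector by the weight matrix is the same as selecting one row of the matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
num_stations, dim = 5, 3                   # hypothetical vocabulary and embedding sizes
W = rng.normal(size=(num_stations, dim))   # weights of the linear layer (no bias, untrained)

def one_hot(index, size):
    """One-hot vector with a single 1 at the given index."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

station = 2
dense = one_hot(station, num_stations) @ W  # dense vector for station 2
```

In practice the weights are learned jointly with the rest of the network, so that distances between dense vectors become meaningful.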
Continuous data can also be represented using the embedding function after discretization, e.g. hourly interval representation of t o . After discretization, the temporal domain is transformed into a finite number of points with each point being well-trained during the training phase. Therefore, the model with such a representation can fully capture the semantic relationship between points. However, this representation has two important drawbacks: a) the discretization interval width is arbitrary and hard to determine. A large interval width may lose feature variability within an interval while a small one may decrease the signal-to-noise ratio (i.e. same features are likely to be separated into different intervals due to noise); b) the data near the interval boundary is discretized into different categories although they are very close.
The following strategies are developed to represent continuous trip attributes, especially the temporal information:
2) Normalized representation: The normalized representation of $t_{o,i}$ is:
$$t'_{o,i} = \frac{t_{o,i} - t^0_i}{t^{24}_i - t^0_i}$$
The normalized representation is a non-floor version of the hourly interval representation, in which the time $t_o$ is transformed into a value between 0 and 1. The drawback of the normalized representation is that it only considers the numerical feature but lacks the categorical feature.
3) Projection representation: The representation based on the projection model deals with the limitation of the previous approach by transforming the continuous data into a dense vector representation that captures both numerical and categorical features without discretization. Figure 4 shows the projection model structure. It consists of two parts. First, time is transformed using the normalized representation. Second, a multi-layer fully-connected neural network with a Rectified Linear Unit (ReLU) activation function takes the normalized value as input and projects it into a multi-dimensional dense vector.
Theoretically, a neural network is a universal approximator [29], and thus capable of capturing both categorical and numerical features of the time information. The projection process in Figure 4 is mathematically written as:
$$out_1 = \mathrm{ReLU}(t' W_{1 \times K}), \qquad out_i = \mathrm{ReLU}(out_{i-1} W_{K \times K}), \quad i = 2, \ldots, N$$
where $t'$ is the normalized value of time $t$, $W_{1 \times K}$ and $W_{K \times K}$ the weight matrices of the stacked linear layers, and $out_i$ the output of the $i$th linear layer. $K$ is a hyper-parameter that controls the dimension of the output dense vector, and $N$ is the number of neural layers.
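A forward pass of such a projection network can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the layers have no bias terms, ReLU is applied after every layer, and the weights are random and untrained. The values of `K` and `N` are arbitrary.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def project_time(t_norm, weights):
    """Project a normalized time in [0, 1] into a K-dimensional dense vector."""
    out = np.array([[t_norm]])   # shape (1, 1)
    for W in weights:            # first W is 1 x K, the rest are K x K
        out = relu(out @ W)
    return out.ravel()

rng = np.random.default_rng(0)
K, N = 8, 3                      # assumed output dimension and number of layers
weights = [rng.normal(size=(1, K))] + [rng.normal(size=(K, K)) for _ in range(N - 1)]

vec = project_time(9 / 24, weights)  # 09:00 as a fraction of the day
```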
Although the projection model captures both the categorical and numerical features of time, it may over-represent numerical features. In the problem studied in this paper, the numerical feature of time is informative when the time points are close, indicating similar mobility patterns even if they belong to different categories (e.g. the above-mentioned 8:59 AM and 9:01 AM case). However, when the time points are further apart, the mobility patterns are mostly dominated by categorical features rather than the numerical difference. In addition, the projection model utilizes the original data, which may lead to complex optimization hyper-planes. Thus it is prone to fall into local optimum solutions [30].
4) Overlapped embedding representation: To address issues in the projection representation, we propose a novel overlapped embedding model to capture the categorical and numerical features of time and automatically switch between them in the training process, to the one that better fits the nature of the mobility prediction problem. A sliding window, with an interval width $l_w$ and a step $l_s$, is used to divide the day into $T$ overlapped time intervals $\{I_1, I_2, \ldots, I_T\}$, where $I_i$ is the $i$th interval. Then, the time point $t$ is represented as a vector $V$ with dimension $T$ using the following rule:
$$v_i = \begin{cases} 1, & t \in I_i \\ 0, & \text{otherwise} \end{cases}$$
where $v_i$ is the $i$th element of $V$. Figure 5 shows an example of the overlapped embedding representation. The sliding window width is 1 hour and the step is 5 minutes. Given the $T$ overlapped time intervals starting at 7:00 AM, the time point 7:07 AM is represented as
$$V = [1, 1, 0, \ldots, 0]$$
After encoding the continuous data using the overlapped representation, the encoded data is then fed into a linear layer (Figure 3) to transform it into a dense vector.
The overlapped representation has three main advantages:
• It captures the mechanism of how the time variables capture mobility similarities. If two time points are close, then most of the activated cells (i.e. cells with value 1) in the corresponding vectors overlap, indicating similar features. As the distance between the two time points gets larger, the number of overlapped cells decreases until it reaches 0 (disjoint vectors). The categorical feature gradually dominates the numerical feature, until only the categorical feature contributes to the prediction.
• It is flexible and has less information loss compared to the commonly used one-hot encoding method. As discussed before, $l_w$ in the one-hot encoding cannot be too small (which lowers the signal-to-noise ratio), yet information is lost if it is too large. In the overlapped representation, $l_w$ and $l_s$ are used to control the signal-to-noise ratio and information loss separately. Even a small value of $l_w$, e.g. 5 minutes, can be used without introducing much noise.
• It is more robust to noise compared to one-hot encoding, especially for a small interval length $l_w$. For a feature variable with observation noise, the one-hot encoding is prone to assign it to the wrong category if the interval $l_w$ is small. However, the overlapped representation is robust even with a small interval length (e.g. a 5-minute $l_w$ instead of 1 hour in Figure 4).
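The encoding rule and the overlap behavior described above can be sketched as follows. The sketch assumes half-open intervals $[s_i, s_i + l_w)$ without wrap-around at midnight and a hypothetical number of intervals; times are in minutes after midnight.

```python
import numpy as np

def overlapped_encode(t_min, start_min=7 * 60, width=60, step=5, num_intervals=36):
    """Binary vector with v_i = 1 if t lies in the i-th sliding interval.

    Interval i is [start + i*step, start + i*step + width), an assumed convention.
    """
    starts = start_min + step * np.arange(num_intervals)
    return ((starts <= t_min) & (t_min < starts + width)).astype(int)

v = overlapped_encode(7 * 60 + 7)    # 7:07 AM
a = overlapped_encode(8 * 60 + 59)   # 8:59 AM
b = overlapped_encode(9 * 60 + 1)    # 9:01 AM
c = overlapped_encode(10 * 60 + 30)  # 10:30 AM

shared_close = int((a & b).sum())    # activated cells shared by 8:59 and 9:01
shared_far = int((a & c).sum())      # activated cells shared by 8:59 and 10:30
```

With a 1-hour window and a 5-minute step, 8:59 AM and 9:01 AM share 11 of their 12 activated cells, whereas 8:59 AM and 10:30 AM share none, so the two vectors are disjoint.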
In summary, all the categorical data, e.g. the origin station, are represented by the embedding module. The continuous trip attributes are represented by one of the four models above. After the feature extraction, the original trip sequence is transformed into a sequence of dense vectors. We compare the performance of the four models in detail in the case study in Section IV.

D. Sequence Pattern Learning Module
The essence of next trip prediction is modeling the sequential pattern of an individual's trip sequence. Recurrent Neural Networks (RNN) are widely used to model sequential patterns of time series data. However, they suffer from the vanishing gradient problem for long sequences [31]. We use the Gated Recurrent Unit (GRU) [32] to model the individual's trip sequence. GRU is a variant of the Long Short-Term Memory (LSTM) network; it has fewer parameters than LSTM and thus converges faster [33].
In terms of the information to be used for prediction, both the real-time and historical trip patterns are important, since an individual's next trip is highly related to recent trips and the long-term dependency of trips captures the weekly/monthly travel regularity. Given the different roles real-time and historical travel patterns may play in prediction, an individual's whole trip sequence is divided into two sub-sequences: the real-time trip sequence $S_c = [tr_k, tr_{k+1}, \ldots, tr_n]$ and the historical trip sequence $S_h = [tr_1, tr_2, \ldots, tr_{k-1}]$. These sequences are processed differently. The whole real-time trip sequence $S_c$ is used for prediction. However, instead of directly using the whole historical trip sequence $S_h$, only the trips that are most relevant to the current trip pattern are identified and used (contributing to the next trip prediction).
To model the interaction between the historical and real-time trip sequences, we adopt a seq2seq model [34] coupled with multi-head attention and pairwise time-pointer mechanisms [35]. The seq2seq model uses an encoder-decoder structure consisting of two GRU networks. The encoder GRU (historical GRU network in Figure 3) takes $S_h^V$ as input and outputs the hidden state vector $h_i^h$ corresponding to each vector $v_i$ in $S_h^V$. The decoder GRU (real-time GRU network in Figure 3) takes $S_c^V$ as input. It not only outputs the hidden state vector step by step but also passes the hidden state vector to the encoder GRU at each step as a query to generate a context vector (i.e. a vector containing the relevant historical trip information) through a multi-head attention module.
1) Multi-Head Attention: The multi-head attention module consists of a set of single attention modules (Figure 6). The single attention module has been widely used in Natural Language Processing (NLP) applications, such as machine translation [36]. It generates a context vector $c_n$ as a weighted sum of all the hidden states encoded by the encoder GRU (Equation 7). The weights capture the similarity between each encoder hidden state vector $h_i^h$ and the last decoder hidden state vector $h_n^c$, i.e. the hidden state vector corresponding to trip $tr_n$.
We adopt the general similarity form described in [34]. The attention weights are estimated through a Feedforward Neural Network (FNN) followed by a softmax activation function:

$$w_i = \frac{\exp\left((h_i^h)^\top W_A h_n^c\right)}{\sum_{j=1}^{k-1} \exp\left((h_j^h)^\top W_A h_n^c\right)}$$

$$c_n = \sum_{i=1}^{k-1} w_i h_i^h \tag{7}$$

where $W_A$ is a linear projection layer and $(h_i^h)^\top W_A h_n^c$ represents the general similarity. $w_i$ is the attention weight of each historical hidden state vector $h_i^h$ with respect to $h_n^c$. A larger weight indicates a higher similarity. $k-1$ is the length of the historical trip sequence and $c_n$ is the context vector of the current time step.
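A single attention head with the general similarity form can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation; the function names and the toy inputs are hypothetical, and $W_A$ is passed as a list of rows.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def general_attention(enc_states, dec_state, W_A):
    """Return (weights, context) for one attention head.

    enc_states: the k-1 encoder hidden states h_i^h (each of dimension d_h)
    dec_state:  the last decoder hidden state h_n^c (dimension d_c)
    W_A:        projection matrix with d_h rows and d_c columns
    """
    # Project the decoder state: W_A h_n^c
    proj = [sum(w * x for w, x in zip(row, dec_state)) for row in W_A]
    # General similarity score of each historical state with the query
    scores = [sum(h_j * p_j for h_j, p_j in zip(h, proj)) for h in enc_states]
    weights = softmax(scores)
    # Context vector: weighted sum of the encoder hidden states
    dim = len(enc_states[0])
    context = [sum(w * h[j] for w, h in zip(weights, enc_states)) for j in range(dim)]
    return weights, context


# Toy example: two historical states, identity projection
weights, context = general_attention(
    enc_states=[[1.0, 0.0], [0.0, 1.0]],
    dec_state=[1.0, 0.0],
    W_A=[[1.0, 0.0], [0.0, 1.0]],
)
```

In the toy example, the first historical state is aligned with the decoder state and therefore receives the larger weight.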
The single attention module captures a limited semantic subspace of an individual's travel pattern (i.e. a single trip attribute), whereas a trip is characterized by multiple trip attributes. To capture the multi-aspect dependencies across trip attributes, we utilize a multi-head attention mechanism to model the joint interaction among different semantic subspaces of an individual's travel patterns. The multi-head attention module has $N$ parallel single attention modules (or heads). The final context vector $c'_n$ is generated as a weighted sum of the context vectors from all single attention modules:

$$c'_n = \sum_{m=1}^{N} r_m c_n^m$$

where the weights $r_m$ are obtained from the linear projection layer $W_q \in \mathbb{R}^{N \times d}$ applied to $h_{k-1}^h$, the last hidden state of the encoder GRU; $N$ is the number of attention heads; $c_n^m$ is the context vector generated by the $m$th head (attention module); and $r_m$ is the weight of $c_n^m$.

A critical issue of multi-head attention is redundancy: all heads may eventually capture similar aspects of travel patterns. To avoid this problem, a penalty term is added to the final loss function. The penalty term penalizes attention redundancy across heads and forces different heads to focus on different travel patterns. It is calculated in two steps. First, for each head $m$, the attention weight of each historical trip is averaged over the $M$ steps of the decoder GRU, where $M$ is the sequence length of the decoder GRU; this gives the average attention vector $\bar{w}^m$ of head $m$. Then, the penalty term is calculated from $\bar{W}\bar{W}^\top - I$ using the Frobenius norm $\lVert\cdot\rVert_F$, where the rows of $\bar{W}$ are the $N$ average attention vectors and $I$ is an identity matrix of size $N \times N$. This formulation penalizes the similarity between attention vectors from different heads and forces the Euclidean norm of each attention vector to be close to 1. Finally, the total loss function is the sum of $L_{pred}$ and the penalty term weighted by $\lambda$, where $L_{pred}$ is the loss function of the mobility prediction task and $\lambda \geq 0$ is a hyper-parameter that controls the weight of each loss.
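The two-step penalty calculation can be sketched as below. This is one possible reading under stated assumptions: the squared Frobenius norm is used (the exponent is an assumption), the total loss is taken as the prediction loss plus $\lambda$ times the penalty, and all function names are hypothetical.

```python
def average_attention(weights_per_step):
    """Step 1: average one head's attention weights over the M decoder steps.

    weights_per_step: M lists, each holding the k-1 attention weights
    produced by this head at one decoder step.
    """
    M = len(weights_per_step)
    length = len(weights_per_step[0])
    return [sum(step[i] for step in weights_per_step) / M for i in range(length)]


def redundancy_penalty(avg_vectors):
    """Step 2: squared Frobenius norm of (A A^T - I).

    avg_vectors: the N averaged attention vectors, one per head (rows of A).
    Off-diagonal terms penalize similarity between heads; diagonal terms
    push the Euclidean norm of each vector toward 1.
    """
    N = len(avg_vectors)
    total = 0.0
    for a in range(N):
        for b in range(N):
            gram = sum(x * y for x, y in zip(avg_vectors[a], avg_vectors[b]))
            target = 1.0 if a == b else 0.0
            total += (gram - target) ** 2
    return total


def total_loss(pred_loss, penalty, lam):
    """Assumed combination: prediction loss plus lambda-weighted penalty."""
    return pred_loss + lam * penalty


distinct_heads = redundancy_penalty([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
identical_heads = redundancy_penalty([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
```

Two heads that attend to different historical trips incur no penalty, while two heads that attend to the same trip are penalized through the off-diagonal entries.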
2) Pairwise Time-Pointer: For the next-trip prediction problem, the prediction time information (the time when the prediction is made) is important for narrowing down the solution space and thereby improving the accuracy and efficiency of the prediction (as illustrated in Figure 1). It is challenging to model the prediction time and incorporate it into the training process, since the prediction time is dynamic and random (i.e. the next trip prediction can be made at any time of day). Also, different prediction times provide different information for predicting the next trip given recent trips. We propose a pairwise time-pointer mechanism that enables the multi-head attention model to better select the relevant historical trips by making the best use of the prediction time. This mechanism also facilitates the model training process by systematically simulating the random prediction times encountered in practice.

Figure 7 illustrates the pairwise time-pointer definition. The timeline shows the sequence of trips for an individual. The red section indicates the duration of a trip and the green section the gap between consecutive trips. The pairwise time-pointer is composed of two sequences of time information: the trip gap time sequence and the randomized prediction time sequence.
The off-board time-pointer is $\{tp_g^1, tp_g^2\}$, where $tp_g^1 = [g_1, g_2, \ldots, g_n]$ is the sequence of gap times between consecutive trips and $g_i = t_{o,i} - t_{d,i-1}$. Each element of the gap time sequence $tp_g^1$ is normalized to $[0, 1]$ and concatenated with its corresponding trip vector $v_i$ to form a new vector $(v_i, g_i)$, which serves as the input to the GRU networks. The prediction time $tp_g^2$ is generated as $tp_g^2 = p \times (t_{o,n+1} - t_{d,n})$, where $p$ is a random value drawn from the uniform distribution $U(0, 1)$. It simulates a prediction made at a specific time point between $t_{d,n}$ and $t_{o,n+1}$. In the training process, the randomized prediction time $tp_g^2$ is concatenated to $h_n^c$ and fed into the attention module.
By using the time-pointer, each training sample instance (i.e. trip sequence) is treated as a prediction made at the given prediction time $tp_g^2$. By the time training converges, a large number of prediction time points and corresponding trip sequences have been simulated, so the model automatically learns how to make the best use of the prediction time in predicting the next trip.
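The construction of the off-board time-pointer can be sketched as follows. Several details are assumptions made for illustration: the first gap (which has no preceding trip) is set to 0, the gaps are normalized by dividing by the largest gap, times are given in minutes, and the function names are hypothetical.

```python
import random


def normalized_gaps(origin_times, dest_times):
    """Gap times g_i = t_{o,i} - t_{d,i-1}, scaled to [0, 1].

    Assumptions: the first trip has no predecessor, so its gap is set to 0;
    scaling divides by the largest gap in the sequence.
    """
    gaps = [0.0] + [
        origin_times[i] - dest_times[i - 1] for i in range(1, len(origin_times))
    ]
    max_gap = max(gaps) or 1.0  # avoid division by zero for degenerate input
    return [g / max_gap for g in gaps]


def sample_prediction_time(t_d_n, t_o_next, rng=random):
    """Randomized prediction time p * (t_{o,n+1} - t_{d,n}), p ~ U(0, 1).

    The result is the elapsed time since the end of trip n at which the
    simulated prediction is made.
    """
    p = rng.random()
    return p * (t_o_next - t_d_n)


# Three trips: tap-in and tap-out times in minutes
origin_times = [0, 100, 300]
dest_times = [30, 130, 330]
gaps = normalized_gaps(origin_times, dest_times)

# A new random prediction time is drawn each time the sample is visited
rng = random.Random(0)
offset = sample_prediction_time(t_d_n=330, t_o_next=530, rng=rng)
```

Drawing a fresh value of $p$ whenever a sample is used means that, over many epochs, the same trip sequence is seen together with many different prediction times.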
The on-board time-pointer $\{tp_\tau^1, tp_\tau^2\}$ in the sequence learning module behaves similarly to the off-board case. The only difference is that the on-board time-pointer captures the on-board trip attributes. Specifically, $tp_\tau^1$ is the sequence of trip duration times, and $tp_\tau^2 = p \times (t_{d,n+1} - t_{o,n+1})$ is randomly sampled to simulate the prediction time when an individual is on board his/her $(n+1)$th trip.

E. Prediction
The output vector of the real-time GRU module $h_n$, the context vector from the multi-head attention module $c_n$, and the prediction time $tp^2$ are concatenated into a new vector $(h_n, c_n, tp^2)$. The prediction module takes this vector as input and outputs the predicted next trip information. The prediction module is a neural network consisting of a stack of FNN layers. The cross-entropy loss is used as the loss of the prediction task, $L_{pred}$:

$$L_{pred} = -\log\left(\frac{e^{x[\mathrm{True}]}}{\sum_j e^{x[j]}}\right)$$

where $e^{x[\mathrm{True}]} / \sum_j e^{x[j]}$ denotes the output probability of the true class (e.g. the real origin station of the next trip) after softmax.
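For a single sample, the cross-entropy loss can be computed as in the following sketch. The function name is hypothetical, and the log-sum-exp rearrangement is a standard numerical-stability device rather than something stated in the text.

```python
import math


def cross_entropy(logits, true_index):
    """Negative log of the softmax probability assigned to the true class.

    logits:     raw output scores x[j] of the prediction module
    true_index: index of the true class (e.g. the real origin station)
    """
    m = max(logits)
    # log(sum_j exp(x[j])), computed stably by factoring out the maximum
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[true_index]


uniform_loss = cross_entropy([0.0, 0.0], true_index=0)
confident_loss = cross_entropy([5.0, 0.0], true_index=0)
```

With two equally scored classes the loss equals $\log 2$; raising the score of the true class lowers the loss.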

F. Training Algorithm
Three DeepTrip models are trained to predict $o_{n+1}$, $d_{n+1}$, and $t_{o,n+1}$ in the off-board prediction and one DeepTrip model to predict $d_{n+1}$ in the on-board prediction. The training algorithm for the on-board and off-board predictions is the same except for the input trip sequence representations (as mentioned in Section III-B). Algorithm 1 summarizes the off-board DeepTrip model training process. Note that in the on-board case, $tp_g^1$ and $tp_g^2$ in Algorithm 1 are replaced by $tp_\tau^1$ and $tp_\tau^2$, respectively.

Algorithm 1 Training Algorithm for DeepTrip (Off-Board)
Input: Set of trip sequences $U$
Output: Trained model for $a_i$ prediction
1: // constructing training set
2: $U^* \leftarrow \emptyset$
3: for each training sample $S$ in $U$ do
4:     calculate normalized $tp_g^1$;
...
14:     $S_c^V \leftarrow featureExtraction(S_c)$;
15:     for each vector $v_i$ in $S_h^V$ do
16:     ...

IV. CASE STUDY
We evaluate the proposed framework using automated fare collection (AFC) data from an urban metro system. The system currently consists of 11 railway lines, serving 91 heavy rail stations and 68 light rail stops. It serves over 5 million trips on an average weekday. For the urban heavy rail lines, trip transactions are recorded when passengers enter and exit the system, providing information about the tap-in and tap-out stations and the corresponding timestamps. We use individual trip data on the urban heavy rail (metro) lines from January 1 to March 31, 2018.
To validate the model performance, we select a random panel of 20,000 individuals who made at least 90 trips during the studied period (i.e. one trip per day on average). The trip attribute tuple includes the origin station $o \in [0, 90]$, the destination station $d \in [0, 90]$, the day of the week $w \in [0, 6]$, and the tap-in time $t_o \in [0, 23]$. These attributes are represented as categorical variables for comparison with the state-of-the-art models in the literature. Note that for DeepTrip, we use the proposed overlapped embedding to represent $t_o$. Each trip can be characterized by a trip attribute tuple $(o, d, t_o, w)$, and the individual's trip sequence is captured by these tuples in chronological order.
To increase the size of the training sample, we utilize a sliding window of width $n$ to generate the trip sequence sample instances from the individual's trip sequence. Specifically, given an individual's trip sequence $S_u = [tr_1, tr_2, \ldots, tr_N]$, each trip sequence sample $S_{sample}$ is generated as $S_{sample} = [tr_i, tr_{i+1}, \ldots, tr_{i+n-1}]$ with $i$ ranging from 1 to $N - n + 1$. Accordingly, an individual with $N$ trip records results in $N - n + 1$ trip sequence sample instances. For each sample instance, the prediction time information is generated based on the random-sampling method, i.e. by randomly choosing a time point between the last two trips.
These trip sequence sample instances are used as model input in the case study. We use a window width $n = 70$, as 95.2% of the individuals in the panel have fewer than 70 trips per month. The width of 70 adequately captures the periodic features of an individual's travel behavior (both weekly and monthly patterns). In total, 1,386,872 trip sequence sample instances are generated. Samples generated from 80% of the individuals are used for training and those from the remaining 20% for testing. For each sample, the travel attributes of the last trip (i.e. $o$, $d$, and $t_o$) are set as the prediction target.
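The sample generation described above can be sketched in a few lines. The function names are hypothetical, and trips are represented by placeholder integers instead of $(o, d, t_o, w)$ tuples to keep the example short.

```python
def sliding_window_samples(trips, n):
    """All windows [tr_i, ..., tr_{i+n-1}] of width n, in chronological order.

    An individual with N trips yields N - n + 1 samples (none if N < n).
    """
    return [trips[i:i + n] for i in range(len(trips) - n + 1)]


def split_input_target(sample):
    """The last trip of a sample is the prediction target; the rest is input."""
    return sample[:-1], sample[-1]


# Toy example with N = 5 trips and window width n = 3
trips = [1, 2, 3, 4, 5]
samples = sliding_window_samples(trips, n=3)
inputs, target = split_input_target(samples[0])
```

Because the training and test sets are split by individual, the windows of one individual would all fall on the same side of the split.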
The experiments are designed to validate the performance of the proposed DeepTrip model by comparing its results with those of state-of-the-art statistical and deep learning models, and to explore the impact of feature representation and prediction time information on the DeepTrip prediction performance. We use the fraction of correctly predicted trip attributes as the performance metric to evaluate the model prediction accuracy:

$$\text{Accuracy} = \frac{N_{right}}{N_{total}} \tag{15}$$

where $N_{right}$ represents the number of sample instances with the correct prediction, and $N_{total}$ denotes the total number of sample instances.

A. Model Validation and Performance Comparison
We build three DeepTrip models for $o$, $d$, and $t_o$ prediction, respectively. The model structures are the same; the only differences are the output and the training target.
We compare the proposed DeepTrip model with four baseline models in the literature: the Mobility Markov Chain model [15], the Mobility N-Gram model, DeepMove, and MobTCast. The Markov Chain and N-Gram models are classic statistical models, while DeepMove and MobTCast are deep learning-based models. The main differences between these models are summarized in Table II. The prediction target represents the model's prediction outputs, either the next trip (multi-attributes) or the location only. The temporal representation is the data representation method for the temporal information of trips. The prediction time represents the model's ability to capture the prediction time information and make predictions at arbitrary times.
Besides the evaluation on the whole test set (i.e. overall evaluation), each model is evaluated under three additional scenarios, which are defined based on the length of the gap time $g_{n+1}$ between the start time of the predicted trip and the end time of the previous trip.
• Short-term prediction: $g_{n+1} < 2$h;
• Medium-term prediction: 2h $\leq g_{n+1} <$ 24h;
• Long-term prediction: $g_{n+1} \geq$ 24h.
To make a consistent comparison across the different models, the following settings are used. The DeepMove hyper-parameter settings are listed in Table III. For a fair comparison with the DeepTrip model, we added the gap time information between consecutive trips $tp^1$ to the DeepMove model. For DeepTrip, the off-board model is used in this evaluation since its input trip tuple has the same formulation as the above models. The DeepTrip model uses the same hyper-parameter settings as DeepMove (Table III). The DeepTrip results in Table IV are the mean accuracy over prediction times ranging from 1% to 100% of the gap time $g_{n+1}$.

Table IV compares the prediction results of the next trip attributes by prediction gap time. Generally, all models perform better for medium-term prediction than for short- or long-term prediction. This could be attributed to travel regularity and training sample size. The medium-term scenario exhibits the best results since most commuting trips occur within 24 hours of the last trip; such trips are more regular and provide large samples for the model to learn from. Among those models, the deep learning-based models outperform the statistical-based methods by around 10% for predicting $o$ and $d$ and 5% for $t_o$. Surprisingly, the MobTCast model exhibits only a marginal performance improvement over the GRU-based DeepMove model in next trip prediction. This is different from the model comparison results for the GPS-based location prediction problem in [24]. It indicates that a large model (MobTCast) may not necessarily perform better than a small model (DeepMove).
There are two possible reasons: 1) the MobTCast model size (or capacity) is largely redundant for the studied problem, which makes the model prone to overfitting; 2) the studied problem and dataset have regular mobility patterns, which results in low model uncertainty and thus favors a small deep learning model. We provide further experimental analysis and literature evidence to support these arguments in the supplementary information file. The proposed DeepTrip model outperforms the DeepMove model, increasing the accuracy by around 2%. These results indicate that incorporating the prediction time information plays an important role in predicting individual mobility. Also, the improvement in the accuracy of predicting $t_o$ (around 3%) is higher than that of the other two trip attributes. This is because the prediction time information provides a hard constraint on the candidate tap-in times, given that the next tap-in event will happen later than the prediction time.

B. Ablation Experiment
To evaluate the effectiveness of each module contained in the proposed DeepTrip model, we conduct an ablation experiment with the following model variants:
• GRU: uses one single GRU.
• Seq2seq: uses the seq2seq structure, which contains an encoder GRU and a decoder GRU.
• Seq2seq+Attn: adds the attention module to Seq2seq.
• DeepTrip: incorporates the random-sampling training method into Seq2seq+Attn, which is the original form of DeepTrip.
Table V summarizes the ablation experiment results. In general, the DeepTrip model outperforms all its variants in predicting the next trip attributes. The GRU model yields the worst performance, as feeding the whole sequence into one single GRU makes it unable to discriminate between historical and recent travel patterns, which misleads the model in sequential pattern learning. The prediction performance slightly improves when the Seq2seq structure is used, which separates the historical and recent travel pattern information. The incorporation of the attention module boosts the performance (1.3% for $o$, 2.4% for $d$, and 2.9% for $t_o$) over Seq2seq. The reason is that the attention module offsets the GRU's poor ability to handle long sequences. The DeepTrip model achieves the best prediction performance by taking advantage of the Seq2seq+Attn model and using the prediction time information. Incorporating the prediction time information improves the model performance by 1-3% for each attribute compared to omitting it. Overall, the results highlight the importance of the attention mechanism and the prediction time information in the DeepTrip model.

C. Impact of Data Representation
The data representation (discrete or continuous) is important for prediction. We evaluate the impact of the different representation approaches (normalized, embedding, projection, and overlapped) on the DeepTrip model prediction accuracy. The first three approaches are commonly used in the literature, and the overlapped model is proposed to capture both the categorical and numerical features of the continuous variable (i.e. $t_o$). Table VI summarizes the results. The results show that the models using the prediction time information perform better than those without such information, particularly for predicting the next trip time $t_o$. Comparing the different representation models, the normalized representation performs the worst due to its poor ability to model the categorical feature of the temporal information. The overlapped representation performs the best in predicting $t_o$ and $d_{on}$. These two attributes are sensitive to the temporal information representation, as the model predicts exactly the time point $t_o$, and the on-board destination prediction $d_{on}$ is influenced by the departure time. For $t_o$ prediction, an inappropriate representation of time may assign two close time points to two different time intervals. For $d_{on}$ prediction, a slight difference in the departure time may result in different destinations. The proposed overlapped representation captures both the categorical and numerical features of the temporal information, which better fits the characteristics of the next trip prediction problem. However, the overlapped representation performs slightly worse in predicting $o$ and $d_{off}$ than the projection and embedding representations. This is partly because the off-board prediction of origin/destination stations is less sensitive to temporal information and thus hardly benefits from the overlapped representation. Instead, the performance may degrade because more parameters are introduced.

D. Impact of the Prediction Time Information
The contribution of incorporating prediction time information varies depending on the prediction context. For example, if the prediction is made just a few minutes after the end of the last trip, the prediction time adds little information and hence does not improve the accuracy of the next trip prediction. However, if the prediction time is far from the end of the last trip, it benefits the prediction by filtering out certain candidate trips.
We assess the impact of prediction time information on model performance as follows. First, we train the DeepTrip models for all trip attributes for the on-board and off-board problems. Then, we simulate the actual prediction for different values of $tp_g^2 = p \times (t_{o,n+1} - t_{d,n})$ and $tp_\tau^2 = p \times (t_{d,n+1} - t_{o,n+1})$, with $p$ ranging from 0% to 100%. In the off-board case, $p = 0\%$ means that the prediction time is equal to the end of the $n$th trip, while $p = 100\%$ means that the prediction time is equal to the start time of the next trip (i.e. the $(n+1)$th trip). Figure 8 shows the prediction accuracy for $o$, $d$, and $t_o$ predictions as a function of the prediction time. We also trained the corresponding DeepMove models as a benchmark for the analysis; their prediction accuracy is also shown in Figure 8. Figure 8 shows that the prediction accuracy of DeepTrip generally increases with $p$, i.e. with the time gap between the prediction time point and the end of the previous trip (off-board prediction) or the start of the current trip (on-board prediction). The DeepTrip model performs slightly worse than the DeepMove model at the beginning, when the prediction is made immediately after the last trip is finished (or after tap-in). A possible reason is that the prediction time hardly provides useful information for the next trip prediction when it is close to the end of the last trip. In such cases, from an algorithmic perspective, the DeepTrip model trained using the less important prediction time information is less efficient than the DeepMove model with no such information.
The DeepTrip model outperforms the DeepMove model when $p$ exceeds 20%, reaches its peak when $p$ is around 70-86%, and then starts to drop. The performance degradation for $p$ close to 100% can be attributed to the fact that the correct next trip may be filtered out if the prediction time is too close to $t_b^{n+1}$. Figure 9 provides an example to illustrate the potential impact of prediction time information. It shows the actual distribution of the start time of the next trip. When the prediction time is close to the last trip (i.e. $t_{pred}^1$), the DeepTrip algorithm can filter out impossible candidate trips without affecting the likelihood of the next trip. However, if the prediction time is $t_{pred}^2$, the next trip $trip_{n+1}$ would be filtered out given its small probability of occurring, which may lead to a false prediction.

V. CONCLUSION
This paper presents an individual next trip prediction framework. The deep learning-based structure facilitates capturing multi-dimensional mobility patterns. The proposed time-pointer mechanism and random-sampling training algorithm simulate the prediction time information during the training phase. Compared with both classical machine learning models and state-of-the-art deep learning models, the proposed framework improves performance. The prediction accuracy is around 81% for origin locations, 62% for destination locations, and 47% for trip start times, a 2-3% increase over the current state-of-the-art model. This indicates that incorporating prediction time information is essential for next trip prediction. We also propose variants of trip attribute representation to broaden the generalization ability of the framework. Four alternative strategies for extracting the temporal information from different representation forms are proposed and evaluated.
The case study based on real-world data provides valuable insights. The prediction accuracy generally increases as the prediction time approaches the departure time of the next trip, but not monotonically: our empirical results indicate that the best prediction performance is obtained when the prediction time is around 80% of the off-board/on-board time. The model can also predict next trips that happen across days, not only within the same day.
There are a number of promising directions for further research. Developing models that could capture the domain-specific information (e.g., the spatial similarity of activities) is an interesting direction. The proposed framework mainly focuses on frequent passengers since the advantage of the deep learning-based framework is based on mining complex long-term mobility features. Predicting infrequent users' mobility patterns is a challenging problem that may be worth exploring.