A Baselined Gated Attention Recurrent Network for Request Prediction in Ridesharing

Ridesharing has gained global popularity due to its convenience and cost efficiency for both drivers and passengers, and due to its strong potential to contribute to the implementation of the UN Sustainable Development Goals. As a result, recent years have witnessed an explosion of research interest in the RSODP (Origin-Destination Prediction for Ridesharing) problem, whose goal is to predict future ridesharing requests and provide vehicle schedules ahead of time. Most of the existing prediction models utilise Deep Learning. However, they fail to effectively consider both spatial and temporal dynamics. In this paper, the Baselined Gated Attention Recurrent Network (BGARN) is proposed, which uses graph convolution with multi-head gated attention to extract spatial features, a recurrent module to extract temporal features, and a baselined transferring layer to calculate the final results. The model is implemented with PyTorch and DGL (Deep Graph Library) and is experimentally evaluated using the New York Taxi Demand Dataset. The results show that BGARN outperforms all the compared existing models in terms of prediction accuracy.


I. INTRODUCTION
RIDESHARING is an increasingly popular service paradigm where passengers from different requests are grouped into shared rides in order to achieve certain objectives, such as reduced travel expenses for vehicle drivers and improved vehicle dispatching flexibility for the service platform. With the development of GIS services and online ride-hailing applications like Didi, Lyft and Uber, there is an abundance of road network data and passenger request data that underpin the development of dispatching algorithms [1].
For the past few years, a tremendous amount of research has been conducted on traffic forecasting [2] with the aim of tackling problems related to topics like transportation planning and environmental protection. With regard to ridesharing, one significant case is RSODP (Origin-Destination Prediction for Ridesharing, also referred to as OD prediction), which aims to foresee the pattern of future passenger requests for improved request-vehicle assignment. Figure 1 shows an example of how OD prediction aids vehicle schedule planning in real time. With the predicted request produced 20 seconds ahead of time, the vehicle is scheduled to wait at node 1 instead of immediately starting its delivery of passenger 1 to node 2. As a result, the travel cost of the vehicle is minimized. Another simple case is that an idle vehicle travels to node 1 at 6:00 p.m. because the prediction indicates that a considerable number of requests will be submitted there before long.
Predicting the OD requests not only makes it practical to provide vehicle schedules in advance, but also makes it possible to maintain passenger mobility. Understanding how people move around the city has potential benefits for other decision-making tasks such as passenger travel behavior analysis. Intuitively, a chauffeur/chauffeuse who offers customized driving services to a certain passenger needs to learn the travel behavior of the passenger (e.g., the passenger regularly requests a trip from node 1 to node 2 at 6:00 p.m.) in advance so that he/she can plan the optimal driving route with extra considerations including traffic congestion as well as the weather.
FIGURE 1. Suppose the requests are fed into the system every 20 seconds. In this example, each of the two latest request batches contains one request with the same OD, but a different passenger ID. Without prediction, the vehicle will immediately start to handle the request from passenger 1. When passenger 2 submits the other request in the next 20 seconds, the vehicle is already halfway from node 1 to node 2, so it needs to re-route to handle the request from passenger 2, which incurs extra travel cost. With prediction of the future request in the first 20 seconds, the vehicle will be arranged to simply wait 10 seconds for the predicted request.
Contributing to the effort to address the RSODP problem, this paper proposes a new model, referred to as BGARN, Baselined Gated Attention Recurrent Network. The model utilises graph convolution with multi-head gated attention to extract spatial features, a recurrent module to extract temporal features, and a baselined transferring layer to calculate the final results.
The contributions of the paper are as follows:
1) It introduces a new model, referred to as BGARN, for addressing the RSODP challenge.
2) It utilises multi-head gated attention to combine different perspectives describing the semantic relationship among geographical grids.
3) It proposes a hybrid approach, termed in this paper as tuning, which combines linear baseline results with non-linear neural network results to enhance the predictive capability of the model. Different combination approaches are tested and discussed.
4) It presents a detailed comparative experimental analysis with existing state-of-the-art models using the New York Yellow Taxi Trip dataset.
The remainder of the paper is organised as follows. Section II provides an outline of existing RSODP algorithms while section III presents an explicit definition of the RSODP problem. Section IV provides a comprehensive explanation of the proposed BGARN model. Section V presents a comparative quantitative experimental analysis of the proposed system. Section VI summarizes the paper and outlines future research directions.

II. RELATED WORK
A general ridesharing process includes a central system, a simulator and an optimizer, as illustrated in Figure 2. Such a system is responsible for passing the incoming requests from passengers to the optimizer and then dispatching the requests to the vehicles according to the returned schedule. The optimizer may utilise different ridesharing algorithms in order to provide the best request-vehicle assignment plan. The simulator is used to support algorithm selection since it evaluates the request handling metrics in the future.
The majority of existing OD prediction models utilize Deep Learning to capture the spatial as well as temporal features of the OD requests [4]-[14]. Although there have been attempts to develop statistical models such as [15], the prediction accuracy turns out to be unsatisfactory (lower than 50%). One possible explanation could be that the complex spatial-temporal dynamics of the passenger request stream on the city-wide road network cannot be captured by simply stacking Gaussian distributions with limited parameters.
The Deep Learning models proposed in the past few years tend to choose GNN (Graph Neural Network) over CNN (Convolutional Neural Network) to capture the non-Euclidean spatial dynamics, and a recurrent module such as LSTM (Long Short-Term Memory) to capture the temporal dynamics. Typically, the request data is preprocessed into a grid map in which grids represent the geographical regions. Such a grid map is also referred to as an OD Graph since each edge represents the number of requests from one grid to another. Aggregated with spatial features such as Semantic Neighborhood and haversine distance, the grids from the OD Graph form a new feature graph in which one directed edge represents the propagation tendency of the features from one grid to another. With the support of GNN, the feature propagation among the nodes is performed and the output grid matrix specifies the extracted request forwarding pattern. A time series of these grid matrices is then pushed into a recurrent module like LSTM to integrate temporal features such as tendency and periodicity (also referred to as trend and period, as explained in [16]). Finally, the output spatial-temporal features are translated back to a new OD Graph in order to predict the future request flow.
GCRN [5] does not utilize the spatial features. Instead, it processes the OD graph directly as a spatial feature, which severely increases the initial feature dimension. In order to resolve this issue, later proposed models like GEML [6] and Gallat [7] specify Semantic Neighborhood to examine the existence of requests from one grid to another and Geographical Neighborhood to examine the haversine distance between two grids. Gallat further splits Semantic Neighborhood into Forward and Backward Neighborhood to stress the importance of distinguishing the two roles of the grids as the origin or destination. Additionally, GEML and Gallat utilise pre-weights which take the contributions of different neighbors into account. This technique allows neighbors with more intensive request flow to contribute more to the feature extraction operation.
One of the deficiencies of applying a simple GNN is that it ignores the fact that the importance of feature propagation along different edges should be different (the edge weights are unequal), since the affinities among grids differ due to the spatial dynamics. The Attention mechanism [17], proposed in 2017, offers a way to address this issue, and the GAT (Graph Attention Network) model [18] subsequently incorporated attention into GNN, which turned out to be a feasible solution for leveraging the edge weights. One of the current state-of-the-art models, Gallat (Graph prediction with all attention) [7], utilized a pre-weighted GAT to distinguish the edge weights. In 2018, a gated GAT called GaAN (Gated Attention Network) [19] was proposed to handle the traffic speed forecasting problem. The model assigned an importance to each attention head [17] by using a novel concept referred to as gates [19], which provides a potential upgrade perspective for Gallat.
From the temporal perspective, [10] replaced LSTM with GRU (Gated Recurrent Unit) to reduce computation cost, while [8] used a Transformer from [17] to capture the long-term temporal dynamics. The Transformer parallelises the computations of time series by maintaining the semantic embeddings among the input sequence units. Gallat also refers to this design and modifies the temporal processing unit according to the self-attention concept extracted from the Transformer. Nevertheless, such a technique serves more as a supplemental spatial extraction, and runs the risk of losing information which can be inferred from the continuity of time.
Finally, the aforementioned structure contains three modules responsible for three different tasks, namely spatial feature extraction, temporal feature extraction and feature translation, the last of which produces the final prediction results. Such a complex model structure might easily suffer from gradient vanishing. Apart from using residual blocks and normalization, a simple baseline model can also be used to provide reference outputs. The baseline outputs provide a rough estimate which the deep learning results can improve upon. As an example, LSTNet [4] combines the prediction outputs generated by the deep neural network and those generated by a baseline AR (Auto-Regressive) model using addition.

A. TOWARDS A NEW MODEL
Based on the above discussion, it is evident that a good solution to the problem involves several aspects that existing systems do not all cover. For the preprocessing stage, the grids can be partitioned into hexagons rather than rectangles as suggested in [11], since hexagons have a smaller perimeter-to-area ratio as well as higher isotropy; for the spatial layer, multi-head gated attention can be introduced to investigate the spatial feature space at a finer granularity; for the temporal layer, as explained above, sticking to the conventional recurrent module design is a more viable option. Finally, as suggested by [4], baseline results from simple linear models may be used as a reference in the final prediction layer to enhance the predictive capability of the models. BGARN, introduced in this paper, aims to address these gaps. Table 1 summarises and contrasts the features supported and utilised by the four state-of-the-art models, LSTNet, GCRN, GEML and Gallat, and the proposed BGARN model.

III. PROBLEM DEFINITION
In this section, RSODP and important concepts are mathematically defined.
Definition III.1 (Time Slot). A time slot $t \in [1, T]$, $t \in \mathbb{N}^+$, represents the minimum time unit for handling the requests. Each time slot has a duration of $t_n$ hours. By separating the requests every $t_n$ hours, $T$ time slots of requests are generated in total. For RSODP, $t_n = 1$ is a reasonable duration that allows the request data to be fully processed and the model to predict the requests in the next hour.
Definition III.2 (Grid). A city map is partitioned into several grids according to the longitudes and latitudes such that they cover the region of the city without intersecting one another. As implemented in [6]-[8], [10], [12]-[14], the grids are partitioned into rectangular shapes for easier computation. In a real-world case, each grid is approximately 2.6 km × 2.6 km in size. Figure 3 shows an example of partitioning New York City¹ into a grid map. There is, though, another grid partition technique suggested in [11] that produces hexagonal grids to better describe the affinity between a grid and its geographical neighbors. This technique can be attempted in the future.

Definition III.3 (Origin-Destination Graph). An OD graph $G^t$ at time slot $t$ is a snapshot graph representing the origin-destination relationships among the grids. Each of the $n$ grids is considered as a node, and a directed edge from node $i$ to node $j$ represents the request flow from grid $i$ to grid $j$, with a total number of $g^t_{i,j}$ requests appearing. Again, Figure 3 shows an example of an edge in an OD graph. In this case, the number of requests starting from grid 219 to grid 167 is 26. Regardless of time slots, the geographical adjacency matrix $R$, in which each $r_{i,j} \in R$ denotes the haversine distance between the central coordinates of grid $i$ and grid $j$, remains unchanged. The sequence $\{G^1, \ldots, G^T\}$ represents all $T$ OD graphs in the input sequence.
Definition III.4 (Request). A request $d = (t_r, lat_o, lng_o, lat_d, lng_d) \in D$ stores the request time $t_r$ as well as the coordinates of the origin and destination as tuples of latitude and longitude, $(lat_o, lng_o)$ and $(lat_d, lng_d)$. It is preprocessed to construct the OD graph $G^t$.
Definition III.5 (RSODP: Origin-Destination Prediction for Ridesharing). Given a sequence of requests $D$ which is later transformed into a sequence of $T$ OD graphs $\{G^t\}_{t=1}^{T}$, the geographical adjacency matrix $R$ and basic grid map information, RSODP aims to predict $G^{T+1}$, the OD graph in the next time slot in the future.

IV. SYSTEM ARCHITECTURE
This section provides a detailed description of the proposed BGARN system. The overall structure of BGARN is depicted in Figure 4. The input raw data is the request stream $D$ and grid information specifying the boundaries of the city on the map, the grid size as well as the number of grids in the latitude and longitude directions. The preprocessing module transforms the requests into the OD graph sequence $\{G^t\}_{t=1}^{T}$, and generates the geographical adjacency matrix $R$ and a grid feature matrix $V^t \in \mathbb{R}^{n \times d_f}$ from the grid information. All these outputs are then passed into the spatial attention layer and the temporal recurrent layer sequentially. As a result, the spatial-temporal embeddings are computed to store the features of the affinities among grids. Eventually, the embeddings are passed through a transferring layer to translate the features into two required outputs: a demand vector $\hat{d}_{T+1} \in \mathbb{R}^{n}$ (as a subtask) storing the predicted number of outgoing requests from each grid, and the predicted OD graph $\hat{G}^{T+1} \in \mathbb{R}^{n \times n}$.
FIGURE 4. The architecture of BGARN. Raw data will go through four key components, namely the preprocessing module, spatial attention layer, temporal recurrent layer as well as the transferring layer. The model outputs a demand vector $\hat{d}_{T+1}$ and the predicted OD graph $\hat{G}^{T+1}$ for the next time slot in the future.
¹ The raw map without grids and labels is cropped from OpenStreetMap.

A. PREPROCESSING MODULE
There are basically three tasks to handle in the preprocessing module.
First, a grid map is generated using the grid information, including the boundaries of the region of interest (e.g., a city) and the specified grid size for reference. Starting with 0 from the top left, each grid is labeled with an ID from left to right and top down. Subsequently, the haversine distances among the grids are calculated to form the geographical adjacency matrix $R$.
Next, the OD graph sequence $\{G^t\}_{t=1}^{T}$ is generated. Each request is mapped to a specific time slot $t$ using the request time $t_r$. The coordinates of the origin and destination are mapped to two grids $i$ and $j$, and 1 is added to the corresponding $g^t_{i,j}$.
Finally, a grid feature matrix $V^t$ is created using the OD graph $G^t$ and the grid information. The features of a grid can vary across multiple dimensions. For example, the coordinates and ID of the grid represent its geographical properties. The day of week (e.g., Monday as 0 and Sunday as 6) and the hour of day represent the time-related information (BGARN also uses weekday/weekend and time period such as "afternoon" in the implementation). The out-degree and in-degree of the grid represent the semantic features. Additionally, there are several auxiliary features including the functionality of the grid (e.g., residential area or workplace), the weather at the time, etc. The feature vector $v^t_i \in \mathbb{R}^{d_f}$ of grid $i$ at time slot $t$ can thus be constructed by concatenating all the transformed features together.
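The first two preprocessing tasks can be sketched in a few lines of Python. This is only an illustrative sketch, not the BGARN implementation: the function names, the rectangular-cell arithmetic, the 0-based time slot index and the parameters `lat_max`, `lng_min`, `cell_lat`, `cell_lng` and `n_cols` are assumptions introduced here.

```python
import math
from collections import defaultdict

def haversine_km(lat1, lng1, lat2, lng2, radius_km=6371.0):
    """Great-circle distance between two (lat, lng) points in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def grid_id(lat, lng, lat_max, lng_min, cell_lat, cell_lng, n_cols):
    """Grid ID of a point: 0 at the top left, growing left to right, then top down."""
    row = int((lat_max - lat) / cell_lat)   # rows counted from the northern edge
    col = int((lng - lng_min) / cell_lng)   # columns counted from the western edge
    return row * n_cols + col

def build_od_graphs(requests, t0, slot_seconds, **grid):
    """Count requests per (time slot, origin grid, destination grid).

    Each request is a tuple (t_r, lat_o, lng_o, lat_d, lng_d), with t_r in seconds.
    Time slots are 0-based here (hypothetical choice); the paper indexes them from 1.
    """
    graphs = defaultdict(lambda: defaultdict(int))
    for t_r, lat_o, lng_o, lat_d, lng_d in requests:
        t = int((t_r - t0) // slot_seconds)
        i = grid_id(lat_o, lng_o, **grid)
        j = grid_id(lat_d, lng_d, **grid)
        graphs[t][(i, j)] += 1              # add 1 to g^t_{i,j}
    return graphs
```

The geographical adjacency matrix $R$ would then be filled by calling `haversine_km` on the central coordinates of every pair of grids.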

B. SPATIAL ATTENTION LAYER
The spatial attention layer, which was introduced by [7], intends to extract spatial features from the grids and form up the spatial affinities among them. With limited information of the grids provided, the model specifies three views of spatial affinity analysis: Forward Neighborhood, Backward Neighborhood as well as Geographical Neighborhood. For completeness and convenience of the reader and to enable better understanding of the proposed model, the formal definitions of the aforementioned concepts as described in the original papers [6], [7] are provided below (Equations 1-7).
Definition IV.1 (Forward Neighborhood). If there is at least one request from grid $i$ to grid $j$, then grid $j$ is a forward neighbor of grid $i$. The set of forward neighbors for grid $i$ at time slot $t$ can be defined as follows [7]:
$$\Psi^t_i = \{\, j \mid g^t_{i,j} > 0 \,\} \tag{1}$$
Definition IV.2 (Backward Neighborhood). Correspondingly, if there is at least one request from grid $j$ to grid $i$, then grid $j$ is a backward neighbor of grid $i$. The set of backward neighbors for grid $i$ at time slot $t$ can be defined as follows [7]:
$$\Phi^t_i = \{\, j \mid g^t_{j,i} > 0 \,\} \tag{2}$$
Intuitively, the OD neighborhood specification describes the tendency of people flowing from one grid to another. If there are frequent requests between two grids, then future requests might have a higher probability of taking place between the two as well. By aggregating the features of these neighbor grids together, the mobility pattern in the region can be effectively examined. The importance of considering the forward and backward neighborhoods separately, since they are distributed quite differently in time, is explained in [7], while [16] has shown that the propagation from a grid to its forward neighbors can be affected by that of its backward neighbors (e.g., continuous commuting to work).
Definition IV.3 (Geographical Neighborhood). If the haversine distance between grid $i$ and grid $j$ is within a specified threshold $L$, then the two grids are considered geographical neighbors of each other. The set of geographical neighbors for grid $i$ can be defined as follows [6], [7]:
$$\Theta_i = \{\, j \mid r_{i,j} \le L,\ j \ne i \,\} \tag{3}$$
Intuitively, if two grids are geographically close to each other, then there is a higher chance that they share the same functionality (e.g., two adjacent grids both cover a residential area, where people tend to move out to work in the morning). It should, though, be clear that a grid cannot be its own geographical neighbor, since such a relationship carries no information. This neighborhood is useful in clustering semantically similar grids regardless of the behavior of the request stream. When the requests between two grids are too few to provide meaningful information (i.e., the input data is sparse), the geographical neighborhood serves as a strong static relationship support. Specifically, the threshold $L$ usually ensures that adjacent grids are geographical neighbors. However, the value can be bigger so that the clustering effect becomes more flexible.
It is essential to notice that the neighborhood sets only provide relationships in low resolution. For example, suppose the number of requests from grid 2 to grid 1 is 26 and that from grid 3 to grid 1 is 105. In this case, grid 2 and grid 3 are both backward neighbors of grid 1, but their neighborhood strengths should clearly be unequal. The same concern applies to the forward neighborhood and the geographical neighborhood. Hence, it is crucial to add a pre-weighting factor for each neighborhood strength calculation. These factors, $a^{i,t}_j$, $b^{i,t}_j$ and $c^{i}_j$ for the forward, backward and geographical neighborhood respectively, are calculated as defined in [6], [7], with an extremely small value $\epsilon$ included merely to avoid division by 0. With the pre-weighting factors specified, the attention weights $\psi^t_{i,j}$, $\phi^t_{i,j}$ and $\theta^t_{i,j}$ are calculated using softmax functions [7].
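To make the neighborhood definitions and the idea of pre-weighting concrete, the following sketch builds the three neighbor sets from a dense OD matrix and a distance matrix, and then weights backward neighbors by their share of the incoming requests. The share-based weight is an assumption made here for illustration; the exact pre-weighting formulas are those of [6], [7], and all function names are hypothetical.

```python
def forward_neighbors(g, i):
    """Grids j that receive at least one request from grid i."""
    return {j for j, count in enumerate(g[i]) if count > 0}

def backward_neighbors(g, i):
    """Grids j that send at least one request to grid i."""
    return {j for j in range(len(g)) if g[j][i] > 0}

def geographical_neighbors(r, i, threshold):
    """Grids within the distance threshold of grid i, excluding grid i itself."""
    return {j for j, dist in enumerate(r[i]) if j != i and dist <= threshold}

def request_share(counts, eps=1e-9):
    """Assumed pre-weight: each neighbor's share of the total request count.

    eps only guards against division by zero when there are no requests.
    """
    total = sum(counts.values()) + eps
    return {j: c / total for j, c in counts.items()}

# The example from the text: 26 requests from grid 2 and 105 from grid 3 to grid 1.
g = [[0] * 4 for _ in range(4)]
g[2][1], g[3][1] = 26, 105
neighbors = backward_neighbors(g, 1)
weights = request_share({j: g[j][1] for j in neighbors})
```

Under this assumption, grid 3 contributes roughly four times as much as grid 2 to the backward aggregation for grid 1, instead of both counting equally.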
The attention mechanism is applied using the AttentionNet function defined in [7], in which $\mathrm{FC}^{\mu}_{a}$ denotes a fully-connected layer with LeakyReLU as the activation function $\mu$, and $W_a \in \mathbb{R}^{d_e \times d_f}$ denotes a shared learnable weight matrix that projects the feature vectors into the embedding space with dimension $d_e$. The fully-connected layer performs a weighted sum of the features from two vectors and provides a scalar as output. It defines a unique perspective to examine the relationship between two grids.
Finally, the weighted features of the neighbors are aggregated together. For grid $i$, the spatial embedding vector $m^t_i$ at time slot $t$ is constructed by concatenating the outputs from the three neighborhoods and the features of grid $i$ together (Equation 7) [7], where $W_s \in \mathbb{R}^{d_e \times d_f}$ denotes a shared learnable weight matrix that projects the feature vectors into the embedding space.
Gallat [7] applies the spatial feature extraction procedure described above. However, the design does not utilize the multi-head attention and head gates introduced in GAT [18] and GaAN [19]. Essentially, the three neighborhoods generate three graph structures, specifying whether the corresponding relationship exists between two grids. In BGARN, on the other hand, each neighborhood graph generates $K$ attention heads, thus constructing $K$ different perspectives (mainly indicated by the AttentionNet) to examine how two grids are related. As an example, suppose grid $j$ is a forward neighbor of grid $i$; the request flow might be mainly due to a gathering event at grid $j$, or due to the regular commuting on Friday night. The weights will be applied to the feature dimensions differently for different perspectives (i.e., attention heads). Furthermore, suppose commuting is more common than the gathering event; the attention head considering the commuting case should then be more important than that considering the gathering event. This is why the attention gates are utilized.
By extending the spatial attention layer from [7], BGARN further upgrades Equation 7 with $K$ gated attention heads per neighborhood, where $\omega^k_{i,\Psi^t_i}$, $\omega^k_{i,\Phi^t_i}$ and $\omega^k_{i,\Theta_i}$ denote the gates for the $k$th head, capturing features of affinity from grid $i$ to its forward neighbors $\Psi^t_i$, backward neighbors $\Phi^t_i$ as well as geographical neighbors $\Theta_i$, respectively. The heads are combined by an aggregation function that can be either an average operation or a sequential concatenation (average is used by default as it consumes less space). BGARN also adds one residual block for each attention output to avoid gradient vanishing when training such a deep neural network.
The gates are calculated using $\mathrm{FC}^{\sigma}_{g,\Psi^t_i}$, $\mathrm{FC}^{\sigma}_{g,\Phi^t_i}$ and $\mathrm{FC}^{\sigma}_{g,\Theta_i}$, which represent fully-connected layers with Sigmoid (to generate values between 0 and 1 as gates) as the activation function. They are responsible for mapping the processed vectors into the head space. $W_{g,\Psi^t_i}, W_{g,\Phi^t_i}, W_{g,\Theta_i} \in \mathbb{R}^{d_e \times d_f}$ are three learnable weight matrices that project the pre-weighted feature vectors into the embedding space with dimension $d_e$. $\max(\{v_1, \ldots, v_n\})$ produces the element-wise maximum over the embedded vectors. The gating function considers the importance of heads from two perspectives: max pooling and average pooling. After applying the gates, the heads are aggregated to form a further embedded vector in $\mathbb{R}^{d_e}$ (if the average scheme is applied). As a result, the final spatial embedding vector $m^t_i$ is in $\mathbb{R}^{4d_e}$. By stacking these vectors vertically, the spatial embedding matrix $M^t = [m^t_1, \ldots, m^t_n]^{\top} \in \mathbb{R}^{n \times 4d_e}$ is retrieved. For an OD graph sequence $\{G^t\}_{t=1}^{T}$, the Spatial Attention Layer will obtain the corresponding embedding matrices $\{M^t\}_{t=1}^{T}$. Figure 5 summarizes the key operations in the spatial attention layer.
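The effect of head gates under the average aggregation scheme can be illustrated with a very small sketch. It assumes the gate values have already been produced by the Sigmoid layers and that each head output is simply scaled by its gate before averaging; the function name and this exact form are assumptions, not the BGARN code.

```python
def gated_head_average(heads, gates):
    """Scale each of the K head outputs by its gate in (0, 1), then average over heads.

    heads: list of K equally sized vectors (one per attention head)
    gates: list of K scalars, assumed to come from a Sigmoid-activated layer
    """
    if len(heads) != len(gates):
        raise ValueError("one gate is required per attention head")
    k = len(heads)
    dim = len(heads[0])
    return [sum(g * h[d] for g, h in zip(gates, heads)) / k for d in range(dim)]

# Two heads; the second one (e.g., a rarer "gathering event" view) is down-weighted.
combined = gated_head_average([[1.0, 2.0], [3.0, 4.0]], [1.0, 0.5])
```

With averaging, the output keeps the dimension of a single head, which is why this scheme consumes less space than concatenating the $K$ heads.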

C. TEMPORAL RECURRENT LAYER
As suggested in [16], there are two main aspects of temporal features: tendency and periodicity, which introduce short-term and long-term time dependencies respectively.
Definition IV.4 (Tendency). The request flow pattern is easily affected by those from the past few time slots due to the continuity of time. The set of spatial embedding matrices to be considered for this concern is as follows [7], [16]:
$$S_t = \{\, M^{T+1-P}, M^{T+1-(P-1)}, \ldots, M^{T+1-1} \,\} \tag{10}$$
where $P$ specifies the total number of historical records considered.
Definition IV.5 (Periodicity). The request flow pattern appears to be similar to those from the same time slot in the past few days due to the daily mobility behavior. The set of spatial embedding matrices to be considered for this concern is as follows [7], [16]:
$$S_p = \{\, M^{T+1-lP}, M^{T+1-l(P-1)}, \ldots, M^{T+1-l} \,\} \tag{11}$$
where $l$ specifies the number of time slots per day and $P$ specifies the total number of historical records considered.
Note that RSODP intends to extract the temporal features for predicting the requests in time slot $T + 1$, so the subtractions in Equations 10 and 11 are based on $T + 1$. Besides, it is required that $P \le T / l$. Intuitively, as an example of tendency, if there is a considerable number of people moving from grid 1 (residential area) to grid 2 (workplace) to work at 8:00 a.m., the request flow pattern might persist in the near future, meaning that there are probably many people moving from grid 1 to grid 2 to work at 9:00 a.m. as well; as for periodicity, it is common that people leave home for work at nearly the same time on each workday. In light of this, it can be hypothesized that the request flow pattern tomorrow at the same time is rather likely to be similar to that of today.
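The time slots selected by tendency and periodicity can be enumerated directly. The sketch below assumes 1-based time slots and the index sets $T+1-k$ (tendency) and $T+1-l \cdot k$ (periodicity) for $k = P, \ldots, 1$; the function names are hypothetical.

```python
def tendency_slots(T, P):
    """The P most recent time slots before T + 1, oldest first."""
    return [T + 1 - k for k in range(P, 0, -1)]

def periodicity_slots(T, P, l):
    """The same slot of day on each of the P previous days, oldest first.

    l is the number of time slots per day; P * l must not exceed T,
    otherwise the oldest index would fall before time slot 1.
    """
    if P * l > T:
        raise ValueError("P must satisfy P <= T / l")
    return [T + 1 - l * k for k in range(P, 0, -1)]

# One week of hourly slots (T = 168, l = 24) with P = 7 historical records.
recent = tendency_slots(168, 7)
daily = periodicity_slots(168, 7, 24)
```

In this example, `recent` covers the last seven hours, while `daily` starts at time slot 1, which is exactly the boundary case $P = T / l$.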
To obtain a combined view of both tendency and periodicity, BGARN also considers the prior and posterior time slots of the periodic time sequence $S_p$, as suggested in [7]. Gallat [7] uses Scaled Dot-Product Attention in its Temporal Attention Layer design. By multiplying query features with spatial embeddings (keys), followed by a row-wise softmax function, the dispatching pattern of requests from one grid to the others can be examined. Nevertheless, such a technique serves more as a supplemental spatial extraction, since it operates on the grids instead of the timeline. The actual temporal feature extraction in Gallat is a simple summation. By comparison, BGARN feeds the spatial embeddings sequentially into a recurrent module such as LSTM, and then averages the results over all temporal dimensions.
Basically, each time sequence $S_x \in \{S_t, S_p, S_{tp^-}, S_{tp^+}\}$ is forwarded to a recurrent network (e.g., LSTM) to generate the temporal embedding matrix $M_{S_x}$ (Equation 13). From Equation 13, four feature embeddings are retrieved from different temporal dimensions. By aggregating them (average by default), the Temporal Recurrent Layer retrieves the spatial-temporal feature embedding matrix $M_T \in \mathbb{R}^{n \times 4d_e}$ as given by Equation 14, in which BN specifies the Batch Normalization operation. Figure 6 summarizes the key operations in the temporal recurrent layer.

D. TRANSFERRING LAYER
The Transferring Layer utilizes the spatial-temporal embedding matrix $M_T$ for two prediction tasks: the demand task, which predicts the number of outgoing requests $\hat{d}_{T+1}$ starting from each grid, and the OD task, which predicts the OD graph $\hat{G}^{T+1}$ in the next time slot. Compared to the OD task, the demand task predicts only the origin of the request flow, thus largely reducing the complexity (from $n^2$ predictions to $n$). Therefore, it is helpful to set the demand task as a subtask.
BGARN utilises simple baseline models, like HA (Historical Average) and AR (Auto-Regressive) described in section V-B, to provide a rough estimate upon which the deep learning model can improve its predictions. By performing an operation (termed in this paper as tuning) which combines the baseline results (linearity) and the prediction results inferred from $M_T$ (non-linearity), the Transferring Layer obtains the final prediction outputs. In the equations for the two tasks, $\hat{d}^{\mathrm{ref}}_{T+1}$ and $\hat{g}^{\mathrm{ref}}_{i,j}$ are the baseline outputs and $\mathrm{Aggr}(a, b)$ specifies the tuning approach, which can be sum, weighted sum or multiplication (multiplication by default). Basically, for weighted sum, both the deep learning results and the baseline results represent the request flow. The deep learning results for sum and multiplication, however, have a different meaning. In these two cases, they serve as a tuning factor of the baseline outputs, which might be more appropriate since normalization normally scales the intermediate values to around 1.0.
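The three tuning approaches can be sketched as an element-wise combination of two vectors. This is a hedged illustration only: the function name `tune`, the mode labels and the convex form of the weighted sum (`weight` and `1 - weight`) are assumptions, since the text only states that scaling weights are specified.

```python
def tune(deep, baseline, mode="mul", weight=0.5):
    """Combine deep-learning outputs with baseline outputs element-wise.

    mode "mul":  the deep output acts as a multiplicative tuning factor (default)
    mode "sum":  the deep output acts as an additive shift of the baseline
    mode "wsum": both outputs are request-flow estimates, mixed by `weight` (assumed form)
    """
    if mode == "mul":
        return [a * b for a, b in zip(deep, baseline)]
    if mode == "sum":
        return [a + b for a, b in zip(deep, baseline)]
    if mode == "wsum":
        return [weight * a + (1.0 - weight) * b for a, b in zip(deep, baseline)]
    raise ValueError(f"unknown tuning mode: {mode}")

# A baseline demand of 10 and 20 requests, tuned up by 50% and down by 50%.
tuned = tune([1.5, 0.5], [10.0, 20.0])
```

With multiplication, a deep output near 1.0 leaves the baseline almost unchanged, which matches the observation that normalized intermediate values tend to sit around 1.0.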

V. EXPERIMENTAL EVALUATION AND RESULTS
In this section, BGARN is evaluated to examine whether it performs the predictions on demands and request graphs as effectively as expected. Table 2 summarizes the dataset. The New York Yellow Taxi Trip data² is selected to conduct the experiments. A portion of the data from Jan 1st, 2016 to Mar 31st, 2016 (a 3-month time span) is collected and preprocessed. For a vehicle with a speed of 30 km/h, it takes around 5 minutes to travel through a grid, which is often considered to be an acceptable waiting time for the passengers. In light of this, New York City is partitioned into 361 grids, each of size 2.49 km × 2.52 km. The requests are split at 1-hour granularity since the human mobility pattern is often summarized in hours. For example, it is common to mention the phrase "rush hours", which specifies the hours in a day when traffic is the heaviest. Besides, 1 hour should be enough for the prediction model to perform one round of gradient descent and then for the optimizer to dispatch vehicle-request assignments.
² Data URL: https://www.kaggle.com/vishnurapps/newyork-taxi-demand

B. BASELINES, OTHER MODELS & VARIANTS
The evaluation metrics of BGARN are compared with those of the following baseline models:
• HA: Historical Average is the most basic baseline method, which computes the average of the historical demands from the previous time slots. Three versions of HA are tested using different temporal feature settings, with HA+ using all four as described in section IV-C, HAt using only tendency and HAp using only periodicity.
• AR: Auto-Regressive model. This paper uses a simple feed-forward network which calculates a weighted sum of the historical data.
In addition, BGARN is contrasted with four existing state-of-the-art models: LSTNet [4], GCRN [5], GEML [6] and Gallat [7].
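As a minimal sketch of the HA baseline, the following function averages a list of historical demand vectors element-wise. Which historical slots are passed in (tendency only, periodicity only, or all four temporal settings) is left to the caller; the function name is hypothetical.

```python
def historical_average(history):
    """Element-wise mean of equally sized historical demand vectors (one per time slot)."""
    if not history:
        raise ValueError("at least one historical record is required")
    n = len(history)
    return [sum(values) / n for values in zip(*history)]

# Demand of two grids over the last two time slots.
reference = historical_average([[2.0, 4.0], [4.0, 8.0]])
```

The same function applies to OD graphs if each graph is flattened into a vector of edge counts.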
The default aggregation scheme of BGARN is set to average in both the spatial and the temporal layer (since it consumes less memory). The default tuning approach is multiplication with the baseline results. Further, this paper specifies four variants of BGARN:
• BGARN-NoTune: inherits the design of the transferring layer in Gallat, meaning there is no tuning with the baseline results.
• BGARN-Concat: uses concatenation as the aggregation scheme in both the spatial and the temporal layer.
• BGARN-WSum: uses another tuning approach, weighted sum, which adds the results from the attention layers and those from the baseline algorithm together with scaling weights specified.
• BGARN-Shift: uses another tuning approach, shifting (sum), which adds the results from the attention layers and those from the baseline algorithm together.

C. PARAMETER SETTINGS
Smooth L1 Loss is used to calculate the overall loss, as suggested in [7], with two hyper-parameters η_d and η_o balancing the importance of each task. Smooth L1 Loss is used rather than Mean Square Error Loss because it is more robust to outliers than L2 Loss, so the gradients do not change drastically when encountering abnormal data input, as explained in [20]. For metrics evaluation, this paper adopts three classic functions, RMSE (Root Mean Square Error), MAPE (Mean Absolute Percentage Error) and MAE (Mean Absolute Error), which are widely used in regression tasks. In these metrics, z specifies the number of batches, and ŷ and y specify the predicted results and ground-truth values respectively.
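For reference, the loss and metrics described above can be written out as below. This is a reconstruction under assumptions, not a copy of the paper's equations: the overall loss is assumed to be a weighted sum of one Smooth L1 term per task, d and G are assumed to denote the demand and OD request matrices, N (the number of compared elements) is notation introduced here, and the Smooth L1 form shown uses a transition point of 1, which is the default in PyTorch.

```latex
\begin{align}
\mathcal{L} &= \eta_d \,\mathrm{SmoothL1}(\hat{d}, d) + \eta_o \,\mathrm{SmoothL1}(\hat{G}, G) \\
\mathrm{SmoothL1}(\hat{y}, y) &= \frac{1}{N} \sum_{i=1}^{N}
  \begin{cases}
    \tfrac{1}{2} (\hat{y}_i - y_i)^2 & \text{if } |\hat{y}_i - y_i| < 1 \\
    |\hat{y}_i - y_i| - \tfrac{1}{2} & \text{otherwise}
  \end{cases} \\
\mathrm{RMSE} &= \sqrt{\frac{1}{z} \sum_{i=1}^{z} (\hat{y}_i - y_i)^2} \\
\mathrm{MAPE} &= \frac{1}{z} \sum_{i=1}^{z} \left| \frac{\hat{y}_i - y_i}{y_i} \right| \\
\mathrm{MAE} &= \frac{1}{z} \sum_{i=1}^{z} \left| \hat{y}_i - y_i \right|
\end{align}
```

The quadratic branch keeps gradients small near zero error, while the linear branch bounds the gradient magnitude for large errors, which is the robustness to outliers referred to in the text.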
For the experiments, Adam is used as the optimizer, and all the models are implemented using PyTorch [21] and the Deep Graph Library [22] and trained on a Tesla P100 PCIe GPU. The training epochs, batch size, gradient clipping norm, hidden dimension, number of attention heads, number of historical records P, and task importances η_d and η_o are set to 200, 32, 10.0, 16, 3, 7, 0.8 and 0.2 respectively.

D. RESULTS
As mentioned in [7], the model should focus more on the regions where a higher number of requests is generated. Therefore, three thresholds (0, 3 and 5) are specified to filter out outputs below these thresholds, and the metrics are calculated only on the remaining outputs (e.g., MAE-5 specifies the Mean Absolute Error with a threshold of 5).
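The thresholded evaluation can be sketched as follows. The text does not state whether the threshold is applied to ground-truth or predicted values, nor whether the comparison is strict; this sketch assumes entries are kept when the ground-truth value is at least the threshold, and it skips zero ground-truth values when computing MAPE to avoid division by zero. The function name and return format are hypothetical.

```python
import math


def thresholded_metrics(y_pred, y_true, threshold):
    """Compute MAE, RMSE and MAPE over entries whose ground truth >= threshold."""
    # Assumption: filtering is done on the ground-truth values.
    pairs = [(p, t) for p, t in zip(y_pred, y_true) if t >= threshold]
    if not pairs:
        return None
    n = len(pairs)
    mae = sum(abs(p - t) for p, t in pairs) / n
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in pairs) / n)
    # Assumption: entries with zero ground truth are excluded from MAPE only.
    nonzero = [(p, t) for p, t in pairs if t != 0]
    if nonzero:
        mape = sum(abs(p - t) / abs(t) for p, t in nonzero) / len(nonzero)
    else:
        mape = float("nan")
    return {"mae": mae, "rmse": rmse, "mape": mape}
```

For instance, `thresholded_metrics(pred, true, 5)` would correspond to the MAE-5, RMSE-5 and MAPE-5 columns under these assumptions.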
The results are shown in Tables 3, 4, 5 and 6. Table 3 suggests that passing one sample (one hour of request data) through the model is quite time-efficient. Furthermore, the training time of BGARN does not increase linearly with the number of attention heads (Gallat can be considered as using only one attention head).
From Tables 4, 5 and 6, it can be inferred that the RMSE, MAPE and MAE results of BGARN are the best among all models for both the Demand task and the OD task. In general, the error metrics for the OD task are significantly lower than those for the Demand task. This is because the request graph is rather sparse and the error value for each slot in the results is diluted.
The results of the baseline models are surprisingly good. This can be attributed to the strong periodicity of the utilized dataset, as indicated by the results from HAp. Simply computing a weighted average of the historical values (AR) already seems to provide a strong prediction output. Nevertheless, BGARN still manages to provide more accurate weights while some other models fail to. One of the most important reasons is that BGARN takes the outputs from the baseline models as a basis and then improves them with refined feature extraction. By column-wise comparison, MAPE values generally drop with increasing thresholds. This might be due to the larger denominators when calculating percentages for grids with more requests. On the other hand, RMSE and MAE generally increase, which indicates the complexity of predicting the request flow at locations with heavy traffic.

E. COMPARATIVE ANALYSIS
Generally, BGARN and its variants appear to perform better than Gallat, which indicates that the GaAN design in the spatial layer of BGARN is able to capture better feature representations than the GAT used in Gallat. It is also interesting to note that Gallat and BGARN-NoTune, which are both designed without tuning in the transferring layer, perform far worse than the others. It therefore seems rather helpful to combine the baseline results with the model. It is also noticeable that LSTNet performs better than GCRN, GEML and Gallat. This might be because LSTNet processes the request matrices (d and G) directly, while GEML and Gallat do not. Instead, GEML and Gallat use the request matrices to generate features and different graph views. Although GCRN utilizes the request matrices in the spatial feature extraction module, it considers the historical records along tendency, which leads to worse results on a highly periodic dataset. GEML, on the other hand, uses the historical records along periodicity and thus obtains better results than GCRN. It is worth noting that GEML, though it uses a combined request view compared to Gallat, still manages to outperform Gallat and BGARN-NoTune, indicating that a simple GRU or LSTM might be better than using Scaled Dot-Product Attention.

F. ANALYSIS OF VARIANTS
BGARN-NoTune performs far worse than BGARN, indicating that the baseline results are rather helpful in providing more accurate predictions. BGARN-WSum and BGARN-Shift, however, perform even worse than BGARN-NoTune in the Demand task. The reason might be that the outputs of the deep learning model before tuning are actually small (around 1.0). In this case, tuning by scaling should be the best scheme of all. In the Demand task, the values are much larger, so the shifting scheme performs much worse than it does in the OD task. This also helps explain why Gallat (with no tuning) does not perform well, as it uses a sigmoid activation in the Transferring Layer, which stagnates the training process when outputting 1. Finally, BGARN-Concat, which uses concatenation as the aggregation scheme in the spatial and temporal layers, performs slightly worse than the average version on the Demand task, while on the OD task it performs better. The results again show that OD prediction is a far more complicated task than predicting only the origin. Furthermore, BGARN-Concat is found to provide more stable results across extensive repeated experiments, which makes sense since it preserves all the features without reduction. Nevertheless, direct concatenation consumes much more space and takes significantly longer to train, so it might not be suitable for use in real-world cases.
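The remark about the sigmoid activation stagnating training can be related to the standard expression for its derivative. The identity below is a general property of the sigmoid function; whether it fully accounts for Gallat's behaviour in these experiments is the hypothesis of the text, not something shown here.

```latex
\begin{align}
\sigma(x) &= \frac{1}{1 + e^{-x}} \\
\frac{d\sigma}{dx} &= \sigma(x)\,\bigl(1 - \sigma(x)\bigr) \\
\lim_{\sigma(x) \to 1} \frac{d\sigma}{dx} &= 0
\end{align}
```

When the activation saturates at an output close to 1, the factor 1 − σ(x) is close to 0, so the gradient passed back through the layer nearly vanishes and the weights before it receive almost no update.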

VI. CONCLUSIONS
This paper has revisited the concept of Ridesharing and has proposed a new model, BGARN, for addressing the Origin-Destination Prediction for Ridesharing (RSODP) challenge.
The novel features of BGARN are the utilisation of multi-head gated attention and a tuning approach which combines linear baseline results with non-linear deep learning results.
The utilization of multi-head gated attention provides an integrated view of different request flow relationship measurements among grids. This enables the capturing of multiple perspectives as well as their corresponding importance, thus supporting a more holistic analysis of the request flow and delivering more accurate predictions. In addition, the proposed tuning approach significantly enhances the prediction capability of the model. This approach is generic and can be applied to any regression task.
The experimental results obtained on the New York Yellow Taxi Trip dataset confirm that BGARN outperforms all the existing state-of-the-art models in terms of prediction accuracy.
In the future, the model will be further extended by applying hexagon-based grid partitioning in the Preprocessing Module. It will also be tested on larger datasets (such as request streams in Beijing and Shanghai) with more complicated request dynamics. Table 7 lists and explains the notations used in the paper.