A Deep Learning Based Multi-Block Hybrid Model for Bike-Sharing Supply-Demand Prediction



I. INTRODUCTION
With the development of the sharing economy, the station-free sharing bike has emerged as a novel, zero-emission mode of short-distance commuting for urban residents. As of 2018, the number of station-free sharing bike users in China had reached 235 million. Compared with the traditional sharing bike with docking stations, this new type of sharing bike can be parked in any designated area and can be unlocked by anyone who scans the Quick Response (QR) code on the bike. However, convenience and freedom also bring limitations. An imbalanced distribution of bikes in a bike-sharing system (BSS) is more likely to occur because demand fluctuates in space and time. Some areas with excessive parking accumulate a large amount of ''invalid demand'', while others with short supply cannot meet users' needs, which increases the operating costs of rebalancing.
At present, management departments mainly use large trucks to transport sharing bikes and update the whole system continuously. However, this solution is time-consuming and laborious. Moreover, it has significant space-time delays. To solve this problem, the best way is to build a prediction model based on big data and machine learning, an approach that has been verified by many existing studies in intelligent transportation systems (ITS) [1]-[5].
The associate editor coordinating the review of this manuscript and approving it for publication was Jenny Mahoney.
In this paper, we study the bike-sharing supply-demand prediction problem, which is to predict the number of bikes rented and returned in a region in the future by using historical multi-source datasets. In recent years, urban computing has been widely studied in the field of ITS [6]-[7]. Meanwhile, much work has been done on urban traffic prediction for bike-sharing systems. First, how to analyze the correlation between multi-source data and prediction results is one challenging issue for accurate prediction. Some researchers have studied the factors that influence bike-sharing systems. For instance, Gebhart and Noland [8] analyzed the influence of weather factors on the use of sharing bikes. Bachand-Marleau et al. [9] investigated social economy and spatial factors, and further analyzed their influence on usage frequency. Besides, univariate and multivariate regression algorithms were used by Ashqar et al. [10] to model available bikes at each station and at the spatially correlated stations of each region, respectively. While these studies failed to capture complex spatial-temporal features, they did clarify the importance of external factors in prediction models. Furthermore, owing to the complexity of the transportation system itself, the data received are subject to constant changes in time and space. How to explore the fluctuation of spatial-temporal data and model the complex nonlinear relationship is another challenging issue. Fortunately, recent advances in deep learning have motivated much research on methods for traffic prediction [11]. Studies of traffic forecasting models fall into three main directions: spatial, temporal, and spatial-temporal feature capturing. On the one hand, the convolutional neural network (CNN) has shown a powerful ability in image and video recognition [12].
Inspired by this, some researchers treated traffic information as an image and captured its spatial features by using CNN. Furthermore, Lin et al. [13] proposed a novel Graph Convolutional Neural Network (GCNN) that could learn hidden heterogeneous pairwise correlations between stations to predict station-level hourly demand. On the other hand, the recurrent neural network (RNN) has been widely used to handle time series data. For example, a sequential learning task was handled well in [14]. Further, Ma et al. [15] used the long short-term memory neural network (LSTM), which addresses the back-propagated error decay of RNN, for traffic speed prediction. Later, many researchers improved model structures by combining one neural network with another to achieve higher efficiency. For instance, in order to forecast passenger demand on an on-demand ride service platform, a fusion convolutional long short-term memory network (FCL-Net) was proposed in [16] to capture spatial dependencies, temporal dependencies, and exogenous dependencies simultaneously. Afterwards, Yao et al. [17] presented the Deep Multi-View Spatial-Temporal Network (DMVST-Net) framework, which connects CNN and LSTM and can model both spatial and temporal relations.
The above-mentioned studies provide valuable insights for traffic demand prediction. However, prediction models for bike-sharing system demand are not yet mature: most are limited to a single model or rarely process exogenous data according to their categories. We here propose a novel multi-block hybrid (MBH) model that considers both spatial-temporal features and external context dependencies simultaneously in the supply-demand prediction of BSS. The major contributions of this study are summarized as follows:
1) The novel MBH model characterizes the spatial, temporal, and spatial-temporal properties of multi-source data, captures the three types of features by using CNN, GRU-Net, and ConvGRU-Net respectively, and coordinates them in a multi-block structure by feature-level fusion.
2) Considering that the supply and demand of bike-sharing vary dynamically across regions and time periods, we conducted visual analysis to understand the trip patterns of bike-sharing in Shanghai and utilized a map meshing method to redefine the traffic analysis zone (TAZ) so that it conforms to the distribution characteristics of bike-sharing.
3) As to external factors, both meteorological and geographical data were taken into account in our model. More specifically, we proposed a novel quantification method for point of interest (POI) data, which uses an area of interest (AOI) grading method to represent the overlay distribution effect of six types of POI. We also conducted regression analysis to explore the influence of temperature, wind speed, humidity, and other factors on the prediction results.
4) We carried out visual analysis of the prediction results for the first time and put forward corresponding feasibility suggestions, which verified the practical significance of this study.
The rest of this paper is organized as follows. Section II presents a comprehensive overview of related works. Section III describes the preliminary preparation of this study, including preprocessing and analysis of multi-source data. Section IV presents the architecture of multi-block hybrid model and the specific neural network for each block. We conduct experiments and discuss the results in Section V, and apply the prediction results in Section VI. Conclusions are drawn and future works are indicated in Section VII.

II. RELATED WORK
During the past decades, considerable efforts have been devoted to forecasting traffic-related data, such as vehicle speed, traffic volume, taxi service, and the emerging bike-sharing supply or demand. The main research tasks for these different types of traffic prediction are the same and can be categorized into two aspects: data processing and prediction model. In this section, we discuss the related work on these two aspects.

A. DATA PROCESSING
Here, data processing covers external factor analysis and data preprocessing.
Numerous studies have been conducted in recent years to understand the factors contributing to the demand for sharing bikes. Specifically, bike stations [10], the complex topological dependencies of road conditions [18], and POIs [19] were analyzed as spatial factors. As to temporal factors, Ashqar et al. [20] developed a method to quantify the effect of weather conditions on the prediction of bike station counts. Wu et al. [21] argued that weather may account for abnormal traffic drops. Other external context factors such as social economy [9], traffic events [22], and gender-related cycling behaviors [23] were also considered to analyze the hidden correlations. Moreover, as the basis of research work, data preprocessing appears in almost all of the literature. Representatively, Zheng et al. [6] gave a detailed presentation of urban data and described typical techniques used in urban computing.
VOLUME 8, 2020
While the above studies show that prediction can be improved by various factors, they still lack methods for quantifying nonlinear spatial-temporal correlations. In our study, we conducted regression analysis of meteorological factors and quantified POI by AOI grading to reveal the implicit relationship between external factors and sharing bike usage. Further, the influencing factors were modeled separately according to the spatial-temporal type of the data.

B. PREDICTION MODEL
Recent advances in deep learning have shown promising results in modeling complex nonlinear relationships. The current applications for traffic prediction in ITS mainly involve three directions: image processing [24]-[26], time series data processing [27]-[30], and multi-source fusion modeling [31]-[37].
Since the early 2000s, CNN has achieved success in object detection, image recognition, and image classification [24]. Computer vision has also expanded into transportation. Ma et al. [25] applied CNN to learn traffic as images and predicted traffic speed with high accuracy. Further, Zhang et al. [26] proposed a convolution-based residual network to capture features from images of traffic flow. The traditional methods for dealing with time series data include ARIMA [38], RF [39], and SVM [40]. Recent studies have further explored the utility of advanced deep learning models. The first widely used model, RNN, was primarily utilized in natural language processing, as studied in [28]. Further, to address the problem of back-propagated error decay of RNN, Hochreiter and Schmidhuber [29] popularized the application of LSTM. Xu et al. [30] used LSTM to develop a dynamic demand forecasting model for bike-sharing. A simpler model, GRU, used in [32] achieved faster computation than the others. However, these studies performed purely spatial or purely temporal modeling, and none of them considered both aspects simultaneously. Therefore, the CNN and GRU-Net in our model were only used to capture geographical and meteorological features, respectively.
Recently, several studies have tried to improve prediction performance through aggregated models. For instance, Yang et al. [31] combined a forest model and a mobility model to evaluate sharing bike users' demand. Shi et al. [32] used Trajectory GRU to handle spatial and temporal dependency. Wang et al. [34] presented a hybrid model which connected a layerwise structure and a Markov transition matrix to forecast traffic flow. These studies provided a valuable reference for spatial-temporal modeling. In this paper, we proposed a ConvGRU-Net that could capture spatial-temporal features and improve computational efficiency.
In summary, the advantage of our proposed method over the existing literature is that we considered both spatial relations and temporal dependencies in a novel multi-block model, in which each block was built for a different data type.

III. PRELIMINARY

A. DATA SOURCES
This study uses multi-source data collected from Shanghai, which include the following three types: bike-sharing GPS data, geographic data, and meteorological data.
Each trip record contains the ID number, longitude, latitude, timestamp, and lock status of a sharing bike. Users scan the QR code to use a bike. When the bike is unlocked, the lock status is marked as 0, and when it is locked, it is marked as 1. We removed redundant data and deleted invalid information through Python programming. Further, we divided the GPS data of Shanghai into 15 min, 30 min, 45 min, and 60 min intervals respectively to build four datasets. This allows more comparative experiments with a limited database and an analysis of the impact of the time interval, which is described specifically in Section V.
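The four interval-based datasets can be built by flooring each record's timestamp to the start of its interval. The following is a minimal sketch, assuming each cleaned record is a tuple `(bike_id, longitude, latitude, timestamp, lock_status)` and that intervals are aligned to midnight; the function names, the alignment rule, and the sample records are hypothetical and are not taken from the original pipeline.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def floor_to_interval(ts: datetime, minutes: int) -> datetime:
    """Round a timestamp down to the start of its `minutes`-long interval.

    Intervals are aligned to midnight of the same day (an assumption).
    """
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = int((ts - midnight).total_seconds() // 60)
    return midnight + timedelta(minutes=(elapsed // minutes) * minutes)


def build_dataset(records, minutes):
    """Group records by interval start; returns {interval_start: [records]}."""
    bins = defaultdict(list)
    for rec in records:
        bins[floor_to_interval(rec[3], minutes)].append(rec)
    return dict(bins)


# Made-up sample records: (bike_id, lon, lat, timestamp, lock_status)
records = [
    ("b1", 121.47, 31.23, datetime(2018, 8, 1, 8, 7), 0),
    ("b1", 121.48, 31.24, datetime(2018, 8, 1, 8, 22), 1),
    ("b2", 121.45, 31.22, datetime(2018, 8, 1, 8, 50), 0),
]

# One dataset per interval length used in the study.
datasets = {m: build_dataset(records, m) for m in (15, 30, 45, 60)}
```

Keeping the interval length as a parameter means the same cleaned records feed all four datasets, so differences between them come only from the sampling period.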
The geographic data in this study include administrative division, road network, and land use information. ArcGIS shapefiles depicting the road network attributes and municipal districts were obtained from the transportation system planning document. More specifically, considering the population density and bike-sharing usage of each district in Shanghai, we selected the area shown in the black rectangular region in Fig. 1(a) for subsequent research. Then, we collected POI data in the selected region with a crawler developed in Python 3.5.
In order to increase the prediction performance of our model, the meteorological data were collected from the Weather Underground website, which provides historical weather observations for every half hour of a day. The specific information we selected includes weather condition, temperature, humidity, and wind speed.

B. PREPROCESSING
In this section, we further processed multi-source data by different methods in order to achieve accurate prediction.

1) TAZ DIVISION
Many regionalization methods exist, differing in semantics and granularity, such as partitioning space by grid [26] or by TAZ (Traffic Analysis Zone) [30]. However, the object discussed in this paper is the sharing bike, which mainly solves the last-kilometer problem in urban transportation. Based on this, we applied the map meshing method to redefine the TAZ so that it matches the characteristics of sharing bikes. As shown in Fig. 1(b), we divided the selected region into 63 × 60 grids by using the Create_Fishnet Tool of ArcMap 10.2. The length and width of each grid correspond to approximately 1.021 kilometers and 1.016 kilometers on the actual map. Therefore, each grid, with an average area of about 1 square kilometer, is in line with the bike-sharing travel scope and can effectively reflect urban bike-sharing conditions.
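Once the fishnet is fixed, each GPS point can be assigned to a grid cell by simple arithmetic. The sketch below assumes a regular grid over a rectangular bounding box; the bounding-box coordinates are placeholders (the paper does not list them), and treating 63 as the number of rows and 60 as the number of columns is also an assumption.

```python
from typing import Optional, Tuple

# Hypothetical bounding box of the selected region (degrees); the paper
# does not list the coordinates, so these values are placeholders.
LON_MIN, LON_MAX = 121.10, 121.75
LAT_MIN, LAT_MAX = 30.95, 31.52
N_ROWS, N_COLS = 63, 60  # assumed orientation: 63 rows by 60 columns


def to_grid_cell(lon: float, lat: float) -> Optional[Tuple[int, int]]:
    """Return the (row, col) of the cell containing the point, or None
    if the point lies outside the selected region."""
    if not (LON_MIN <= lon < LON_MAX and LAT_MIN <= lat < LAT_MAX):
        return None
    col = int((lon - LON_MIN) / (LON_MAX - LON_MIN) * N_COLS)
    row = int((lat - LAT_MIN) / (LAT_MAX - LAT_MIN) * N_ROWS)
    # Clamp to guard against floating-point rounding at the upper edge.
    return min(row, N_ROWS - 1), min(col, N_COLS - 1)
```

Points outside the rectangle return `None`, so records outside the study area can be dropped before aggregation.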

2) AOI GRADING
Urban POIs often directly affect residents' trip patterns, and each has its specific impact area and focus group. We selected six types of POI data and created buffers to represent each impact area through Proximity_Buffer Tool. The details of that are shown in Table 1. However, the six types of impact areas shown in Fig. 2 have distinct overlaps due to the different distribution density of POI, especially in the central area. We found that different land uses influence each other, and that the influence is regional, not a single point. The impact of land use on bike-sharing demand could not be quantified by using POI alone.
Hence, we introduced the concept of Area of Interest (AOI), which replaces point effects with region effects to represent the degree of influence of geographic information data on a specific region. The AOI was utilized to express the effect of joint influence. More specifically, we used the Overlay_Union Tool to take the intersections of the six types of buffers, and then used the Symbology option in Layer Properties to set the AOI grading distribution result shown in Fig. 1(b), which directly reflects the POI density and gives the AOI grade of each TAZ.
In this part of the study, three details should be noted. First, urban POIs with low public awareness, which have little impact on residents' trip patterns, were not discussed; only POIs with important living functions and high public influence were chosen and divided into six types. Second, to ensure accuracy, we used the Geodesic Buffer algorithm in ArcMap to create a circular buffer for each POI, and we set the buffer radius according to the actual impact area of bike-sharing near different POI types in Shanghai. Third, the higher the AOI grade is, the denser the POI distribution is, and the greater the usage of bike-sharing is. The AOI grade of each TAZ was defined as the maximum of all AOI grades contained in the grid.
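The last rule, taking the maximum AOI grade found inside each grid, can be sketched as follows. The input format is an assumption: each overlay piece is represented as a pair `(cell, grade)`, where `cell` is the `(row, col)` of the TAZ that contains it. How the grades themselves are assigned in ArcMap is not reproduced here, and the sample values are made up.

```python
from typing import Dict, Iterable, Tuple

Cell = Tuple[int, int]


def taz_aoi_grades(pieces: Iterable[Tuple[Cell, int]]) -> Dict[Cell, int]:
    """Map each TAZ cell to the maximum AOI grade of the pieces inside it."""
    grades: Dict[Cell, int] = {}
    for cell, grade in pieces:
        grades[cell] = max(grade, grades.get(cell, grade))
    return grades


# Made-up overlay pieces: ((row, col), AOI grade)
pieces = [((0, 0), 1), ((0, 0), 3), ((0, 1), 2), ((0, 0), 2)]
grades = taz_aoi_grades(pieces)

# Cells without any AOI piece are absent from the result; treating them
# as grade 0 is an assumption made by the caller.
grade_of_empty_cell = grades.get((5, 5), 0)
```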

3) SPATIAL-TEMPORAL DATA VISUALIZATION
After TAZ division and AOI grading, we found through visual analysis [41] that urban bike-sharing trips show an obvious tidal phenomenon over short periods and periodic changes over long periods. To provide some intuition, we present an example of bike-sharing demand for a working day in Fig. 3. The morning peak and evening peak appear at 8:00 and 18:00 respectively, which coincides with the peak times of urban travel. Moreover, the spatial distribution is similar to the AOI grade distribution in Fig. 1(b), and the density decreases from the central area to the suburbs. Although there is a serious imbalance between districts, the supply and demand of bike-sharing in nearby grids may affect each other.

4) EXTERNAL FACTORS ANALYSIS
External factors, such as weather, land use patterns, and events, often have complex effects on traffic conditions. Compared with vehicles, open-air bikes are more susceptible to weather conditions. Fig. 4 and Table 2 illustrate the impact of adverse weather conditions on bike-sharing trips. Although the temporal dependency follows daily and weekly patterns, there are obvious non-periodic fluctuations in the corresponding time intervals. Taking mark 2 in Fig. 4 as an example, even on a weekday, the morning peak trip value reached its lowest level due to rain, and the trip value did not return to normal until the weather improved.
In addition, we did a correlation analysis on other meteorological factors by using Seaborn which is a Python data visualization library based on Matplotlib. Fig. 5 shows the linear regression analysis results of temperature, wind speed, and humidity with demand. Although the influence of three meteorological factors on bike-sharing demand is relatively scattered, it can be clearly seen that temperature and wind speed are positively correlated with demand, while humidity is negatively correlated with demand.
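The sign of each relationship can be checked with an ordinary least-squares line and the Pearson correlation coefficient. The sketch below computes both by hand rather than through Seaborn; the humidity and demand values are made up for illustration and are not the study's data.

```python
from math import sqrt


def linear_fit(x, y):
    """Return (slope, intercept, pearson_r) of the least-squares line y ~ x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Made-up observations: relative humidity (%) and trips per interval.
humidity = [40, 50, 60, 70, 80]
demand = [520, 480, 450, 400, 350]

slope, intercept, r = linear_fit(humidity, demand)
# A negative slope and a negative r indicate that demand falls as
# humidity rises in this toy sample.
```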
Weather conditions not indicated are normal in all time intervals.
Therefore, the effect of external factors in prediction model should not be neglected. To enhance prediction performance of our proposed model, meteorological data (temperature, humidity, wind speed, weather condition), metadata (hour, DayOfWeek), and AOI grade were added.

IV. METHODOLOGY

A. FORMULATION OF PROBLEM
In order to rebalance bike-sharing, we need to accurately predict the supply and demand within each grid through mathematical modeling. Thus, we fixed some notation and defined the sharing bike supply-demand prediction problem. The gridded region in Fig. 1(b) was defined as a spatial matrix $L$ of size $M \times N$, where a grid cell $l_{m \times n}$ denotes a prediction region. For simplicity, we use $t$ to denote the time intervals mentioned below.
Definition 3 (Supply and Demand): Let $C$ be a set of historical records. The supply and demand at time interval $t$ in a grid cell $l_{m \times n}$ were defined respectively as:
Therefore, the supply-demand gap can be defined as:

Definition 4 (External Factors):
In this study, the meteorology information and AOI grade were taken as external factors of our model. In terms of meteorology, we neglected the influence of geographical features, only considering the time series, and took the value every hour. On the contrary, POI was only related to geographical location. Thus, we defined a combined feature matrix.
and $H_t$ stand for the weather condition, temperature, wind speed, and humidity in period $t$ respectively. The AOI grade of $l_{m \times n}$ was defined as $A_{m,n}$.

Definition 5 (Prediction Problem):
The supply-demand prediction aims to predict the supply-demand gap at time interval $t$, given a set of historical trip records $Tr = \{Tr_1, Tr_2, \cdots, Tr_k\}$ up to time interval $t-1$.
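The quantities in the definitions above can be computed by counting lock events per grid cell and time interval. The following is a sketch under stated assumptions: an unlock event (lock status 0) is counted as one unit of demand, a lock event (lock status 1) as one unit of supply, and the gap is taken as supply minus demand. The mapping and the sign convention are assumptions, as are the function names and the sample events.

```python
from collections import Counter

UNLOCK, LOCK = 0, 1  # lock status codes from the data description


def count_supply_demand(events):
    """Count supply and demand per (t, row, col).

    `events` is an iterable of (t, row, col, lock_status).
    Assumption: a lock event is a returned bike (supply) and an unlock
    event is a rented bike (demand).
    """
    supply, demand = Counter(), Counter()
    for t, row, col, status in events:
        key = (t, row, col)
        if status == LOCK:
            supply[key] += 1
        elif status == UNLOCK:
            demand[key] += 1
    return supply, demand


def supply_demand_gap(supply, demand, key):
    """Assumed sign convention: supply minus demand."""
    return supply[key] - demand[key]


# Made-up events: (interval index, row, col, lock_status)
events = [(0, 3, 4, UNLOCK), (0, 3, 4, UNLOCK), (0, 3, 4, LOCK), (1, 3, 4, LOCK)]
supply, demand = count_supply_demand(events)
```

Because `Counter` returns 0 for missing keys, cells with no activity in an interval get a gap of 0 without special handling.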

B. THE STRUCTURE OF MBH MODEL
We constructed a multi-block hybrid (MBH) model for predicting supply-demand of bike-sharing, which involves three deep learning networks: the convolutional neural network (CNN), the gated recurrent neural network (GRU-Net), and the convolutional gated recurrent neural network (ConvGRU-Net). As shown in Fig. 6, each neural network is applied to one block, which is described as follows.

1) BLOCK1: SPATIAL MODELING WITH CNN
A city usually contains many districts whose population density and economic development level vary greatly. This determines different land use patterns and ultimately leads to uneven spatial distribution of bike-sharing. In this study, we conducted spatial modeling analysis for land use pattern. This type of variable is only spatially varied but temporally static during our study period. Thus, we built a CNN combination network to capture spatial feature.
The input is the feature map shown in Fig. 1(b), in which each grid cell takes the value of its AOI grade. Fig. 7 shows the CNN structure used in this block. The extraction of spatial dependencies is performed mainly by the convolutional layers and pooling layers. We fed $A_{m,n}$, described in Definition 4, to the first convolutional layer; the $k$-th output can then be calculated by:
$$A_{m,n}^{k} = f\left(W_k * A_{m,n}^{k-1} + b_k\right)$$
where $A_{m,n}^{0} = A_{m,n}$, $*$ denotes the convolution operator, $f$ is an activation function such as the ReLU (Rectified Linear Unit) $f(x) = \max(0, x)$ used in this study, and $W_k$, $b_k$ are the parameters of the $k$-th layer. In order to capture more spatial features, multiple kernels of the same size are set and scanned in the convolutional layer simultaneously. Then, a pooling layer is connected to the convolutional layer to reduce the spatial size of the input feature maps and improve the robustness of the extracted features. Max pooling is a commonly used pooling method which takes the maximum value within a sliding region of a given filter size [24]. Each pooling layer reduces the dimension of the feature maps output by the previous convolutional layer, and a vector of fixed length is obtained as the input of the next fully-connected layer. Finally, the captured spatial features of land use are concatenated into a feature vector as the output.
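To make the convolution, ReLU, and max pooling steps concrete, the sketch below applies one kernel to a small AOI grade map in plain Python. It is an illustration only: the 5 × 5 map and the averaging kernel are made up, the convolution is the unflipped ("valid") form used by deep learning frameworks, and the actual layer sizes of the CNN in Fig. 7 are not reproduced.

```python
def conv2d_relu(image, kernel, bias=0.0):
    """Valid 2-D convolution (no kernel flip) followed by ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = bias
            for a in range(kh):
                for b in range(kw):
                    s += kernel[a][b] * image[i + a][j + b]
            row.append(max(0.0, s))  # ReLU: f(x) = max(0, x)
        out.append(row)
    return out


def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling with a size x size window."""
    return [
        [max(fmap[i + a][j + b] for a in range(size) for b in range(size))
         for j in range(0, len(fmap[0]) - size + 1, size)]
        for i in range(0, len(fmap) - size + 1, size)
    ]


# Made-up AOI grade map (grades are higher toward the centre).
aoi = [
    [0, 1, 1, 0, 0],
    [1, 2, 3, 1, 0],
    [1, 3, 4, 2, 1],
    [0, 1, 2, 1, 0],
    [0, 0, 1, 0, 0],
]
kernel = [[0.25, 0.25], [0.25, 0.25]]  # hypothetical 2 x 2 averaging kernel

feature_map = conv2d_relu(aoi, kernel)  # 4 x 4 feature map
pooled = max_pool2d(feature_map)        # 2 x 2 after pooling
flat = [v for row in pooled for v in row]  # vector for a dense layer
```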

2) BLOCK2: TEMPORAL MODELING WITH GRU-NET
In this block, we used the gated recurrent unit (GRU), a variant of the recurrent neural network (RNN) that addresses the exploding and vanishing gradient issue, to capture the temporal sequential dependency of meteorological data. The GRU is similar to the long short-term memory neural network (LSTM) in that it regulates information flow through the sequence chain by a gate structure. However, it has a simpler structure than the standard LSTM while delivering competitive performance. Previous work on short-term traffic speed prediction has also demonstrated the superiority of the GRU-based model [33].
The typical structure of GRU is shown in Fig. 8, which only has two gates: reset gate $R_t$ and update gate $Z_t$. The former captures short-term dependencies in time series, and the latter captures long-term dependencies. The inputs of both gates are the current time step input $E_t$, described in Definition 4, and the hidden state $H_{t-1}$ of the previous time step; their outputs are calculated by a fully-connected layer whose activation function is the sigmoid function. The core idea is defined by (5) and (6), where the sigmoid function restricts each element of $R_t$ and $Z_t$ to the range [0, 1]. Equation (7) calculates the candidate hidden state $\tilde{H}_t$, in which the reset gate controls the flow of $H_{t-1}$ by element-wise multiplication. Finally, the update gate $Z_t$ is combined with the hidden state $H_{t-1}$ and the candidate hidden state $\tilde{H}_t$ to obtain the hidden state at time $t$.
where $W_{\cdot}$ denotes a weight matrix and $b_{\cdot}$ indicates a bias. The operator $\odot$ denotes pointwise multiplication, and $\sigma$ is the sigmoid function. The output of each GRU layer is the hidden state at each time step.
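The gate computations described above can be written out as a single GRU time step. The sketch below follows one common textbook formulation and is an assumption about the exact form: each weight matrix acts on the concatenation of the previous hidden state and the current input, and the new hidden state is `Z_t * H_{t-1} + (1 - Z_t) * H~_t` (some references swap the roles of `Z_t` and `1 - Z_t`). The parameter names, sizes, and input values are hypothetical.

```python
from math import exp, tanh


def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))


def affine(W, v, b):
    """Compute W v + b for a weight matrix W (list of rows) and vectors v, b."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def gru_step(e_t, h_prev, params):
    """One GRU step: returns the hidden state at time t."""
    concat = h_prev + e_t  # [H_{t-1}, E_t]
    r_t = [sigmoid(a) for a in affine(params["W_r"], concat, params["b_r"])]
    z_t = [sigmoid(a) for a in affine(params["W_z"], concat, params["b_z"])]
    # The reset gate scales H_{t-1} element-wise before the candidate state.
    gated = [r * h for r, h in zip(r_t, h_prev)] + e_t
    h_cand = [tanh(a) for a in affine(params["W_h"], gated, params["b_h"])]
    # Assumed convention: the update gate keeps a share of the old state.
    return [z * h + (1.0 - z) * c for z, h, c in zip(z_t, h_prev, h_cand)]


def zero_params(hidden, inputs):
    """All-zero parameters, used here only to make the example checkable."""
    width = hidden + inputs

    def zeros():
        return [[0.0] * width for _ in range(hidden)]

    return {"W_r": zeros(), "W_z": zeros(), "W_h": zeros(),
            "b_r": [0.0] * hidden, "b_z": [0.0] * hidden, "b_h": [0.0] * hidden}


# Hypothetical usage: 4 normalized meteorological inputs, 2 hidden units.
params = zero_params(hidden=2, inputs=4)
h = [0.4, -0.2]
for e_t in ([1.0, 0.5, 0.3, 0.6], [0.0, 0.6, 0.2, 0.7]):
    h = gru_step(e_t, h, params)
```

With all-zero parameters both gates equal 0.5 and the candidate state is 0, so each step halves the hidden state; in a trained network the gates depend on the inputs.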

3) BLOCK3: SPATIAL-TEMPORAL MODELING WITH CONVGRU-NET
In bike-sharing supply-demand prediction, the usage of bikes varies both temporally and spatially. This makes it impossible to extract temporal and spatial dependencies simultaneously by using a single type of neural network. To address this issue, we used a network named ConvGRU-Net, which combines a CNN layer and a GRU-Net layer to deal with spatial and temporal dependencies. Specifically, as shown in Fig. 9, the input of the GRU-Net layer is the output of the CNN layer. Note that the structure of the neural networks used here is similar to that described in block 1 and block 2.
The inputs of the first spatial layer are image-like matrices expressed as a tensor $Y_t \in \mathbb{R}^{2 \times M \times N}$, which has both supply and demand channels (i.e., Definition 3), for each time interval $t$. The same CNN as in block 1 was used here. After feeding $Y_t^{m,n,0}$ into $k$ convolutional layers, the output in time interval $t$ is $Y_t^{m,n,k}$, which forms the input of the next temporal layer in our proposed network. Further, we stacked multiple ConvGRU layers to better capture the spatiotemporal features of bike-sharing usage data in this study.

4) BLOCK FUSION AND LOSS FUNCTION
We then fused the outputs of the three blocks. The spatial output extracted from CNN, the temporal output extracted from GRU-Net, and the spatial-temporal output extracted from ConvGRU-Net were concatenated into a dense vector in the feature fusion layer. Using the parametric-matrix-based fusion method proposed in [18], the prediction output was finally calculated through a fully-connected layer. The fusion formula is as follows:
where $X_t^{\cdot}$ is the output of each block at time step $t$, $W_{b\cdot}$ denotes the weight matrix of each block, and $b_f$ indicates the bias of the fusion layer.
Since the proposed model is end-to-end, once we obtain the prediction output, we can optimize the model by minimizing a loss function based on the mean squared error between the real value and the predicted value in each grid cell $l_{m \times n}$. Furthermore, we note that the output $\hat{Y}_t$ includes both the supply forecast $\hat{y}_t^s$ and the demand forecast $\hat{y}_t^d$. The loss function is given as:
where $\gamma$ is a parameter to balance the effect of supply and demand. Both the real values $y_t^s$, $y_t^d$ and the predicted values $\hat{y}_t^s$, $\hat{y}_t^d$ are calculated over the whole region $L$ of size $M \times N$.
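A loss of the kind described, a mean squared error over all grid cells with a parameter balancing supply and demand, can be sketched as below. The exact weighting is an assumption: here the supply error is added to γ times the demand error, which is only one plausible reading. The inputs are flattened lists of per-cell values, and all names and numbers are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean squared error over flattened grid values."""
    n = len(y_true)
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n


def supply_demand_loss(ys, ys_hat, yd, yd_hat, gamma=1.0):
    """Assumed form: MSE(supply) + gamma * MSE(demand)."""
    return mse(ys, ys_hat) + gamma * mse(yd, yd_hat)


# Made-up values for a region flattened to two cells.
loss = supply_demand_loss(
    ys=[1.0, 2.0], ys_hat=[1.0, 4.0],
    yd=[0.0, 0.0], yd_hat=[1.0, 1.0],
    gamma=0.5,
)
```

Setting `gamma` above 1 would make the optimizer favor demand accuracy, and below 1 supply accuracy, under this assumed form.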
For the time series data, we divided the data of the first 12 days into a training set and a validation set at a ratio of 8:2, and took the data of the remaining 2 days as the test set to verify the validity of our model. To improve the generalization ability of the model, we trained the parameters several times.
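The split described above can be sketched as follows. The source does not say whether the 8:2 division is chronological or random; the sketch assumes a chronological cut, which avoids using later observations to validate earlier ones. The function name, the 14-day sample, and the 48 intervals per day (30 min sampling) are illustrative.

```python
def split_dataset(samples, day_index, train_val_days=12, train_ratio=0.8):
    """Split time-ordered samples into train, validation, and test sets.

    `day_index[i]` is the 0-based day of `samples[i]`. The first
    `train_val_days` days are cut chronologically at `train_ratio`
    (an assumption); the remaining days form the test set.
    """
    head = [s for s, d in zip(samples, day_index) if d < train_val_days]
    test = [s for s, d in zip(samples, day_index) if d >= train_val_days]
    cut = int(len(head) * train_ratio)
    return head[:cut], head[cut:], test


# Illustrative data: 14 days sampled every 30 min (48 intervals per day).
intervals_per_day = 48
samples = list(range(14 * intervals_per_day))
day_index = [i // intervals_per_day for i in samples]

train, val, test = split_dataset(samples, day_index)
```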

2) BASELINES AND EVALUATION METRIC
In order to verify prediction performance of our proposed model, we compared MBH model with the single time series prediction models, the aggregated models, and two variant models of MBH, which are described as follows.
• RNN: A deep learning model which only considers features of the temporal dimension. Following the practice in [15], we selected the Adam optimization algorithm with learning rate lr = 0.01 and hyperparameters $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon$ = 10e−8 to train the model. The optimized model contained one input layer, three hidden layers with 64 hidden units in each layer, and one output layer.
• LSTM: It is an extension of RNN by introducing three ''gates'' to control the flow of data. The hidden layer is one LSTM layer with memory blocks and the other settings are the same as RNN.
• GRU: Compared with LSTM, GRU removes the cell state and uses the hidden state for information transmission. It contains only an update gate and a reset gate, so it can capture dependencies across time-step distances in a time series with a simpler model structure than LSTM. The main parameters are the same as those of LSTM.
• XGBoost [43]: XGBoost is a powerful boosting tree based method, which performs well in various competitions. The following settings were used in the experiment: the number of trees is 50, the maximum depth is 4, and the learning rate is 0.002.
• ConvLSTM [32]: By extending the fully connected LSTM to have convolutional structures in both the input-to-state and state-to-state transitions, the convolutional LSTM is used to build an end-to-end trainable model for prediction. We set the input-to-state and state-to-state kernel sizes to 5 × 5, and set 64 hidden states in each of the three hidden layers.
• ConvGRU: The structure of ConvGRU is similar to that of ConvLSTM; its inputs are the outputs calculated by the convolution operator.
• ST-ResNet [26]: An end-to-end structure named deep spatial-temporal residual networks can forecast crowd inflow and outflow in every region of a city.
• DMVST-Net [17]: Deep Multi-View Spatial-Temporal Network is proposed to predict taxi demand. We compared it with our model by feeding the datasets of this study.
• MBH-1 (without block 1): It is a variant of our proposed MBH model, in which we chose to ignore the influence of land use patterns.
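• MBH-2 (without block 2): It is another variant of our proposed MBH model, in which we did not incorporate meteorological factors into the model.
The experimental results of all models were compared and analyzed by three classical evaluation metrics: mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean square error (RMSE). The specific calculations are as follows:
$$\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|y_i - \hat{y}_i\right|$$
$$\mathrm{MAPE} = \frac{1}{m}\sum_{i=1}^{m}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \times 100\%$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2}$$
where $m$ is the number of test samples, and $y_i$ and $\hat{y}_i$ are the real and predicted values, respectively.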
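The three metrics can be computed directly from paired lists of real and predicted values. The sketch below uses the standard definitions; MAPE is returned as a percentage and is undefined when a real value is zero, so how zero-demand cells are handled is left to the caller (the source does not say). The sample values are made up.

```python
from math import sqrt


def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def mape(y, y_hat):
    """Mean absolute percentage error, in percent (real values must be nonzero)."""
    return 100.0 * sum(abs((a - b) / a) for a, b in zip(y, y_hat)) / len(y)


def rmse(y, y_hat):
    """Root mean square error."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


# Made-up real and predicted trip counts.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]

scores = {
    "MAE": mae(y_true, y_pred),
    "MAPE": mape(y_true, y_pred),
    "RMSE": rmse(y_true, y_pred),
}
```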

3) EXPERIMENTAL BASE
The modeling process was carried out under an open source Python distribution, Anaconda 1.9.7, in which we used Sklearn to train XGBoost, RNN, LSTM and GRU by feature engineering and applied Keras with the TensorFlow back end to train the other complex models by representation learning [44], [45]. All training and testing were performed on a server with a CPU (Intel(R) Core(TM) i9-9900KF CPU @3.60GHz), 32-GB RAM, and a GPU (NVIDIA Quadro P4000 with 8G memory). Specifically, to make a fair comparison, the input data of different types were normalized to the range [0, 1] through Min-Max standardization. Afterwards, we denormalized the forecasting results for evaluation. The parameters of each model were set by referring to earlier work in the relevant literature. Further, the learning rate of the three blocks in MBH was also set to 0.002, and each training process was stopped once a suitable number of epochs had been reached.
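Min-Max normalization and the matching denormalization step can be sketched as follows. The bounds are computed once and reused to map predictions back to the original scale; fitting them on the training data only is a common practice and an assumption here, as are the function names and sample values.

```python
def min_max_fit(values):
    """Return the (minimum, maximum) used for scaling."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("Min-Max scaling is undefined for constant data")
    return lo, hi


def normalize(values, lo, hi):
    """Scale values to the range [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]


def denormalize(scaled, lo, hi):
    """Map scaled values back to the original range."""
    return [s * (hi - lo) + lo for s in scaled]


# Made-up demand values.
demand = [10.0, 20.0, 40.0]
lo, hi = min_max_fit(demand)
scaled = normalize(demand, lo, hi)
restored = denormalize(scaled, lo, hi)
```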

B. MODEL COMPARISON
In this section, we compared the proposed MBH model with eight benchmark algorithms, including three traditional time series prediction models (i.e., RNN, LSTM, and GRU), an ensemble learning model based on regression trees (i.e., XGBoost), two widely used spatial-temporal models (i.e., ConvLSTM and ConvGRU), and two recent aggregated models (i.e., ST-ResNet and DMVST-Net). All eight benchmark models were trained on the same training samples as MBH. Fig. 10 shows the MAPE of the supply and demand forecasts of the different models. Comparing the forecast accuracy for the four time intervals, we can see that the prediction accuracy on the dataset divided by 30 min is significantly higher than on those divided by 15 min, 45 min, and 60 min.

1) EFFECT OF TIME INTERVAL
A previous study [30] on bike-sharing prediction has also found that the best prediction was made at 30 min, but it indicated that the prediction performance would increase with time interval.
As shown in Fig. 10, we extended the time interval to one hour and found that the prediction accuracy also declines when the time interval is too long. The reason is that too short a sampling period introduces more data noise, while too long a sampling period ignores part of the fluctuation characteristics. Therefore, we believe that a sampling interval of 30 min is the best, and that an interval that is too long or too short fails to reflect the patterns of bike-sharing trips. The subsequent comparative experiments were all analyzed with ''Dataset-30'' as input.

2) FORECASTING PERFORMANCE COMPARISON
Tables 4 and 5 show the supply and demand forecasting results of the benchmark models on the test samples. For illustrative purposes, we divided the benchmark models into two groups and took one of the supply and demand prediction results as an example. Fig. 11 further compares the prediction performance of the eight benchmark models on the test data samples. The following results can be concluded from the comparisons.
a. It is obvious that the forecasting results of supply and demand are similar under the four datasets, and the MBH model produces higher prediction accuracy than other methods. Especially the highest accuracy of MBH reaches 91.51%. We calculated the mean value of supply and demand MAPE, which demonstrate that MBH is relatively 16.78% better than GRU, 14.00% better than XGBoost, 13.52% better than ConvGRU, and 3.30% better than DMVST-Net.
b. By combining CNN to capture spatial features, ConvGRU and ConvLSTM yield better performance than the traditional time series prediction models. Additionally, owing to its simpler structure, the GRU algorithm has an advantage over LSTM when handling a relatively small data volume, as in this study.
c. Compared with the aggregated models, XGBoost performs poorly when the data change abruptly. The model fails to capture abrupt fluctuations because exogenous variables are fed into the regression tree model as inputs through feature engineering, so their spatial-temporal dependencies are not well explained.
d. Although both ST-ResNet and DMVST-Net use spatial-temporal information, they perform worse than MBH because they do not analyze and model exogenous variables individually by category, which also demonstrates the importance of the spatial and temporal blocks in our model.
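The comparisons above rely on MAPE, RMSE, and MAE, and on relative differences between models. The sketch below uses the standard textbook definitions of these metrics. Two points are assumptions rather than statements from this paper: zero actual values are skipped when computing MAPE, and the relative improvement is taken as the relative reduction in error with respect to a baseline. All function names and numbers are illustrative.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction.

    Assumption: samples whose actual value is zero are skipped,
    because the percentage error is undefined for them.
    """
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    return sum(abs(t - p) / abs(t) for t, p in pairs) / len(pairs)


def relative_improvement(baseline_error, model_error):
    """Relative error reduction of a model with respect to a baseline."""
    return (baseline_error - model_error) / baseline_error


# Illustrative values only, not results from this study.
actual = [10, 20, 40]
predicted = [12, 18, 40]

print("MAE :", mae(actual, predicted))
print("RMSE:", rmse(actual, predicted))
print("MAPE:", mape(actual, predicted))
print("Relative improvement:", relative_improvement(0.2, 0.1))
```

Under this reading, a baseline MAPE of 0.2 and a model MAPE of 0.1 correspond to a relative improvement of 50%.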

C. MODEL INTERPRETATION
To further investigate the effect of modeling the spatial and temporal blocks, we compare the proposed MBH model with two variants: MBH-1 (without block 1) and MBH-2 (without block 2), which are defined above. Table 6 shows the prediction results of the two variants on Dataset-30. MBH-2 has lower prediction performance than MBH-1 in terms of all evaluation metrics. This indicates that meteorological factors have a greater impact on prediction accuracy than land use patterns. This is because the input of the spatial-temporal block already implies the influence of some spatial factors. Fig. 12 indicates that the prediction performance of the two variants is lower than that of the original model, which further supports the necessity of incorporating external factors into the prediction model. Moreover, we visualize the forecasting curves to better evaluate the proposed model. As shown in Fig. 13, MBH generates accurate predictions at both peak and trough hours of the day. In addition, the variant models perform poorly when the data fluctuate greatly or in the early morning from 0:00 to 4:00. This essentially means that exogenous variables can increase the robustness of MBH.

VI. APPLICATION OF PREDICTION RESULTS
The ultimate goal of the prediction model we have established is to obtain the supply-demand gap of urban bike-sharing in time for distribution and rebalancing. Therefore, the supply-demand gap for the following day was calculated from the predictive outputs of the MBH model by (3) in Definition 3.
The results shown in Fig. 14 illustrate that 4:30-8:00 and 12:30-18:30 are periods dominated by demand, especially 4:30-8:00, when demand continues to grow. A large amount of supply, however, is provided at 8:30-12:00 and after 19:00. To some extent, this reflects the tidal pattern of Shanghai residents' trips and the imbalance between supply and demand of bike-sharing. The developed forecast model can be used to analyze the spatial-temporal fluctuation in demand for urban bike-sharing in advance and to improve the operational efficiency of the system. A more practical application is to provide useful information for rebalancing.
For illustrative purposes, we extracted the AOI grade heat map shown in the first map on the upper left in Fig. 13 and the TAZ distribution map of the central area from Fig. 1. Edge merging was performed, and the resulting 274 TAZs were labeled individually. The supply-demand gap of each TAZ was predicted by the MBH model and visualized with ArcGIS 10.2. The spatial-temporal distributions of the supply-demand gap over the 274 TAZs are shown below.
According to residents' commuting habits, a day was divided into 5 time periods, as shown in Fig. 15. Referring to the AOI grade heat map, demand mode appears in the higher-grade areas at 0:00-6:00 and 20:00-24:00 while other areas present supply mode, but the pattern is exactly the opposite at 6:00-10:00 and 10:00-15:00. At 15:00-20:00 in particular, the supply-demand gap reaches its highest value of the day. The bike-sharing distribution is uneven, and supply and demand show an obvious imbalance. Based on this, we can replenish or transfer shared bikes at different times and locations according to the prediction results, and thus schedule and rebalance different regions in advance.
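The step from predicted supply and demand to a per-zone rebalancing signal can be sketched as follows. This is a minimal illustration under an assumed sign convention, namely gap = predicted returns (supply) minus predicted rentals (demand); the exact form of (3) in Definition 3 should be taken from that definition. The zone identifiers, the function names, and the numbers are hypothetical.

```python
def supply_demand_gap(returned, rented):
    """Per-zone gap under the assumed convention: returns minus rentals."""
    return {zone: returned[zone] - rented[zone] for zone in returned}


def classify_zones(gap, tolerance=0):
    """Label each zone by the sign of its gap.

    'supply'   : more bikes returned than rented (surplus to move out)
    'demand'   : more bikes rented than returned (shortage to refill)
    'balanced' : absolute gap within `tolerance`
    """
    labels = {}
    for zone, value in gap.items():
        if value > tolerance:
            labels[zone] = "supply"
        elif value < -tolerance:
            labels[zone] = "demand"
        else:
            labels[zone] = "balanced"
    return labels


# Hypothetical predictions for three zones in one time period.
predicted_returns = {"TAZ-1": 30, "TAZ-2": 12, "TAZ-3": 20}
predicted_rentals = {"TAZ-1": 18, "TAZ-2": 25, "TAZ-3": 20}

gap = supply_demand_gap(predicted_returns, predicted_rentals)
print(gap)
print(classify_zones(gap))
```

A dispatcher could, for example, move bikes from zones labeled 'supply' to zones labeled 'demand' before the corresponding time period begins.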

VII. CONCLUSION AND FUTURE WORK
In this study, we formulated the supply-demand forecasting of BSS as a spatial-temporal prediction problem, and proposed a multi-block hybrid model that captures the spatial-temporal characteristics of multi-source data. Specifically, we integrated CNN and GRU-Net into the structure to express the effect of exogenous variables on the spatial and temporal dimensions, respectively, and aggregated ConvGRU-Net to learn the spatial-temporal dependencies in the usage of bike-sharing. To evaluate the effectiveness of the MBH model, we compared it with ten baselines, including RNN, LSTM, GRU, ConvLSTM, ConvGRU, XGBoost, ST-ResNet, DMVST-Net and two variants, based on four datasets divided by 15, 30, 45 and 60 min. The comparison results show that 30 min is the best time interval for the supply-demand forecast of bike-sharing. The proposed model achieves more efficient and accurate prediction than the other benchmark models in terms of MAPE, RMSE and MAE. We further investigated the effect of modeling the spatial and temporal blocks. The evidence indicates that the modeling of exogenous variables is essential and that meteorological factors have a greater impact on the bike-sharing usage forecast. Moreover, the application of the prediction results demonstrates that the MBH model can be used to forecast the supply-demand gap, which provides useful information for rebalancing in BSS.
In the future, we will focus on the applicability of the proposed model to other spatiotemporal traffic prediction tasks. In addition, more data sources, such as mobile phone signaling data and social network data, could be added to our model to better depict residents' trip patterns and achieve higher prediction accuracy.