A Big Data Analysis on Urban Mobility: Case of Bangkok

Designing an efficient on-demand mobility service requires comprehensive knowledge on the statistical characteristics of trips. In other words, it is critical to know how long passengers typically spend on a trip and how far they usually travel. Likewise, it is important to learn how much time a driver spends searching for passengers. This study presents a statistical analysis of taxi trips in Bangkok based on real traces of 5,853 taxis over the period of three months. Significant insights on trip volume, trip time, trip distance, and origin-destination distance are derived. In addition, the probability distributions of trip time, trip distance, and origin-destination distance are also characterized based on two goodness-of-fit tests. To our knowledge, this characterization is done for the first time for Bangkok taxi trips. It is shown that a lognormal distribution can best describe the empirical trip time distribution. On the other hand, a Weibull distribution can best describe the empirical trip distance distribution and the empirical origin-destination distance distribution. These distributions are essential to traffic simulation. Finally, the efficiency of the Bangkok taxi system is also quantified both at the system level and at the agent level.


I. INTRODUCTION
Understanding how people move in a city is critical to designing an effective transportation system. In-depth statistical knowledge on trip characteristics is required to build an efficient on-demand mobility service. In other words, it is necessary to know when people need a service, how long they typically spend on a trip, how far they usually travel, and how much time a driver typically spends searching for a passenger.
In this study, extending from our preliminary work [1], we perform statistical analysis on approximately eight million taxi trips in Bangkok, Thailand. Real traces of 5,853 taxis that operate in Bangkok and its nearby area during the period of three months from October 2019 to December 2019 are investigated. This allows us to thoroughly examine the spatio-temporal mobility pattern of Bangkok passengers and obtain significant insights into the characteristics of taxi trips in Bangkok. Moreover, it also allows us to evaluate the efficiency of the current Bangkok taxi system both at the system level and at the agent level.
In addition to understanding the characteristics of the current Bangkok taxi system, the statistical results presented in this study are beneficial to designing future mobility services such as on-demand self-driving taxis. Obviously, a service provider needs to understand the nature of taxi trips to devise an appropriate operation strategy. Moreover, the statistical results presented in this study are essential to traffic simulation. Indeed, a traffic simulator needs to know when and where to schedule the next taxi pickup realistically. It also needs to specify how long the trip should be and how far the destination should be from the origin. These require the statistical distributions derived from the empirical data. Finally, the efficiency of the current Bangkok taxi system evaluated in this study can be used as a benchmark for future mobility services. Basically, any future mobility service should be designed to achieve a higher efficiency level than that of the current system.
Although a few studies investigate the Bangkok taxis [2]-[5], none of them has characterized the statistical distributions of trip time, trip distance, and origin-destination (O-D) distance. In this paper, we characterize these distributions for the Bangkok taxi trips. Moreover, the data considered in this study are also more recent than those used in the existing studies. Unlike most existing works in the literature, which typically identify these distributions based on visual curve-fitting [6]-[12], we methodically characterize them based on formal statistical tests.
The statistical analysis presented in this study covers many aspects of the taxi trips in Bangkok. The contributions of this work can be summarized as follows.
• We characterize the statistical distribution of trip time. It is shown that a lognormal distribution can best describe the trip time distribution. This applies to both the trips with passengers and those without passengers.
• We characterize the statistical distribution of trip distance. It is shown that a Weibull distribution can best describe the trip distance distribution. This applies to both the trips with passengers and those without passengers.
• We characterize the statistical distribution of O-D distance. It is shown that a Weibull distribution can best describe the O-D distance distribution. This applies to both the trips with passengers and those without passengers.
• At the system level, we quantify the overall efficiency of the Bangkok taxi system based on three months worth of positioning data. We demonstrate that the temporal efficiency of the Bangkok taxi system is only around 46.67%, and the spatial efficiency is only about 37.06%. This indicates that the efficiency of the Bangkok taxi system is poor.
• At the agent level, we quantify the performance of each taxi in terms of revenue-gain-rate. In addition, we also quantify the efficiency of individual taxis across different groups of earnings. We demonstrate that taxis in the high-earning group consistently have higher efficiency than those in the low-earning group. This suggests that taxis in the high-earning group do not simply earn more by working longer, but they are more efficient in driving and passenger searching.

The rest of this paper is organized as follows. In Section II, we briefly discuss related studies that rely on positioning data of taxis. Data and features extraction are discussed in Section III. The statistical analysis is presented in Section IV. Finally, we conclude this paper in Section V.

II. RELATED WORK
A number of studies rely on positioning data of taxis. They can generally be divided into two main categories: 1) data analytics and 2) prediction. We briefly discuss works in these categories here.

1) Data Analytics
In data analytics, the taxi positioning data are typically used to investigate passengers' mobility and drivers' behaviors. Studies on passengers' mobility usually involve analyzing the spatio-temporal characteristics of taxi trips such as trip time, trip distance, and O-D distance. In fact, many studies attempt to characterize the statistical distributions of these quantities [6]-[14]. However, only a visual observation based on curve-fitting is typically used in these studies to determine which statistical distribution can best fit the empirical data. A methodical verification by a statistical test is usually ignored.
Based on visual curve-fitting, the trip time distribution of taxi trips in Lisbon, Portugal is characterized with an exponential distribution [6]. In contrast, the trip time distribution of taxis in Xiamen, China, is characterized with a lognormal distribution in [7]. It is also reported that the trip time distribution of taxis in Shanghai, China, follows a lognormal distribution [8]. These characterizations are based on visual curve-fitting. A formal statistical test is not performed.
A variety of statistical distributions is used to characterize the trip distance distribution. A lognormal distribution is used to model the trip distance distribution of taxis in Harbin, China [9]. On the other hand, a gamma distribution is used to model the trip distance distribution of taxis in Lisbon, Portugal [6]. The gamma distribution is also used to model the trip distance distribution of taxis in Changsha, China [10]. In [11], the trip distance distribution of taxis in New York is characterized by a power-law distribution. Nonetheless, a statistical test is not employed in these studies. In [13], the statistical distribution of taxi trip distance in Chicago is investigated. A Kolmogorov-Smirnov (K-S) test is used to evaluate the goodness-of-fit. It is reported that the lognormal distribution can best describe the trip distance distribution.
The O-D distance is usually characterized with a power-law distribution. In [12], the authors investigate the taxi data in Beijing, China. The O-D distances of taxi trips that are smaller than 30 miles are modeled with a power-law distribution, whereas the O-D distances of the larger trips are modeled with an exponential distribution. However, a goodness-of-fit test is not performed. In [14], the O-D distances of taxi trips in Shanghai, China, are characterized by a power-law distribution. A K-S test is used to evaluate the goodness-of-fit.
In addition to the characteristics of passengers' mobility, many data-analytics works focus on taxi drivers' behaviors. Typically, these studies involve analyzing passenger search strategies, identifying factors affecting cruising patterns, and examining route selection behaviors. In [15], the authors study the passenger search strategies of taxis in Shenzhen, China. A model based on Markov decision process (MDP) is proposed to imitate the passenger search strategies of the taxi drivers. In [16], three types of passenger search strategies of taxis in Hangzhou, China, are investigated. It is shown that local passenger hunting is usually more effective than local waiting and distant passenger hunting. In [17], factors affecting the cruising patterns of taxis in Shenzhen, China, are investigated. It is concluded that external factors (e.g., land use distribution, traffic condition, and local road network) have a stronger influence on each driver's cruising pattern than internal factors (e.g., driver's knowledge based on his past experience). Finally, in [18], the authors use the taxi data in Beijing, China, to analyze how a driver chooses a route to deliver a passenger. It is shown that most drivers usually choose a satisfactory route which may not necessarily be the optimal route (e.g., routes with shortest distance and shortest time).
It is interesting to note that Bangkok has not yet been considered in most existing data-analytics studies. To our knowledge, only a few data-analytics studies have used positioning records of taxis in Bangkok [2], [3]. However, none of them has analyzed the statistical distributions of trip time, trip distance, and trip O-D distance. In addition, none of them has quantified the efficiency of the Bangkok taxi system. In [2], the impact of the new taxi fare rate on drivers and passengers is studied. The taxi service zones are analyzed and grouped based on the positioning records in [3].

2) Prediction
Most studies in this category focus on taxi demand prediction. An accurate demand prediction allows taxi drivers to cruise to the locations where they are more likely to find passengers. This also helps reduce the waiting time of a passenger. A variety of solutions is considered in the literature, including probabilistic models, time-series models, and machine learning models. In [19], the authors use the taxi data in San Francisco to build a probabilistic route recommendation model which aims at minimizing the mileage without passengers. In [20], the authors use the taxi data in Nanjing, China, to build a model for predicting the road cluster where the passengers would most likely be. The model is based on an extreme learning machine (ELM). It recommends the top-k road clusters in which a taxi is most likely to find passengers. In [21], the authors use the taxi data in Shanghai, China, to build a recommendation model that suggests where a taxi should look for passengers. The model is based on a deep convolutional neural network. In [22], the taxi data in Chengdu, China, are used to build a demand prediction model. Similarly, the taxi data in Porto, Portugal, Stockholm, Sweden, and Shanghai, China, are also used to build a demand prediction model in [23]. The model is based on an ensemble of autoregressive time-series models. A few studies use Bangkok taxi data to build a demand prediction model. In [4], the authors propose a demand prediction model based on recurrent neural networks and XGBoost. In [5], the authors use k-means clustering to recommend a spot where a driver can wait for passengers.
The main focus of this paper is on data analytics rather than prediction. We analyze the spatio-temporal characteristics of taxi trips in Bangkok. Particularly, we perform a formal statistical test to characterize the distributions of trip time, trip distance, and O-D distance. This has not been done in the existing studies for Bangkok taxis.

III. DATA AND FEATURES EXTRACTION
In this section, we describe the data used in this study and the features extracted from these data.

A. DATA
The dataset used in this study is provided by courtesy of the Thai Intelligent Traffic Information Center (iTIC). It contains the global positioning system (GPS) records of Bangkok taxis during the period of three months from October 2019 to December 2019. The position of each taxi was collected roughly every one to three minutes. Essentially, the data have nine features as shown in Table 1. In addition to the primary positioning data such as latitude, longitude, and timestamp, the "for-hire" light status and the vehicle engine status are also provided. These features are beneficial for differentiating between trips with passengers and those without passengers. Basically, when the for-hire light status is '1', it implies that the taxi is vacant. On the other hand, when the for-hire light status is '0', it implies that the taxi is busy carrying a passenger.

TABLE 1 (partial). Features include the status of the "for hire" light and the status of the vehicle engine (1 = active, 0 = inactive).

FIGURE 1. The study area is shown inside the rectangular bounding box. It covers Bangkok and parts of its nearby areas. (This map is created from the open data provided by OpenStreetMap [24] under the Open Database Licence [25].)
In this study, we focus only on trips that occur in Bangkok and some parts of its nearby area. The study area is defined as the rectangular region shown in Fig. 1. The geolocations of the four corners of the study area are given in Table 2. We concentrate only on the period when each taxi is active (i.e., when the engine is running). Thus, all the records with inactive engine status are filtered out. In addition, all the records with invalid GPS status are filtered out. Finally, any records with geolocations outside of the study area are excluded. After this initial filtering process, we are left with the valid GPS records of the active taxis inside the study area.
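As an illustration, the three filtering rules above can be sketched in Python as follows. This is a minimal sketch, not the pipeline used in the study: the record layout, the field names, and in particular the bounding-box coordinates are hypothetical placeholders (the actual corners are those listed in Table 2).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpsRecord:
    """Hypothetical record layout; the real dataset has nine features (Table 1)."""
    taxi_id: str
    timestamp: float   # seconds since epoch
    lat: float
    lon: float
    gps_valid: bool
    engine_on: bool
    for_hire: int      # 1 = vacant, 0 = busy


# Placeholder bounding box (assumption); substitute the corners from Table 2.
LAT_MIN, LAT_MAX = 13.5, 14.1
LON_MIN, LON_MAX = 100.3, 100.9


def keep_record(r: GpsRecord) -> bool:
    """Keep a record only if the engine is on, the GPS fix is valid,
    and the position lies inside the study area."""
    inside = LAT_MIN <= r.lat <= LAT_MAX and LON_MIN <= r.lon <= LON_MAX
    return r.engine_on and r.gps_valid and inside


def initial_filter(records):
    """Apply the three filtering rules to a list of records."""
    return [r for r in records if keep_record(r)]
```

A record is dropped as soon as any one of the three conditions fails, so the order in which the rules are applied does not affect the result.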

B. FEATURES EXTRACTION
After the initial filtering process described in Section III-A, the following features are extracted.

1) Trip:
One of the most significant features to extract from the GPS records is the trip. A taxi trip can be viewed as a time series of consecutive GPS records. There are two types of trips, referred to as busy trips and vacant trips. A busy trip, or a trip with passengers, is defined as a time series of consecutive GPS records with inactive for-hire status. In other words, this is a sequence of consecutive GPS records with '0' in the for-hire status. In contrast, a vacant trip, or a trip without passengers, is defined as a time series of consecutive GPS records with active for-hire status. In other words, this is a sequence of consecutive GPS records with '1' in the for-hire status.

2) Pickup and Drop-off Locations:
For a busy trip, we can estimate the pickup and drop-off locations. The beginning point of a busy trip is assumed to be the pickup location, and the endpoint of the trip is assumed to be the drop-off location. For example, suppose that a busy trip consists of K consecutive GPS points, namely $P_1, P_2, \ldots, P_K$. Then the estimated pickup location is $P_1$, and the estimated drop-off location is $P_K$.

3) Trip Duration:
Trip duration is the difference between the timestamps at the start and at the end of the trip. Suppose that a trip T consists of K consecutive GPS points, namely $P_1, P_2, \ldots, P_K$. Then the duration of trip T is the difference between the timestamp of $P_1$ and the timestamp of $P_K$.

4) Trip Distance:
Trip distance is the sum of the distances between each pair of consecutive GPS points in the trip. Suppose that a trip T consists of K consecutive GPS points, namely $P_1, P_2, \ldots, P_K$. Let $D_{i,j}$ be the geographical distance between $P_i$ and $P_j$. Then the total distance of trip T is calculated as

$$\sum_{i=1}^{K-1} D_{i,i+1}.$$

Note that the trip distance is not the Euclidean distance between the starting point and the endpoint of the trip.

5) Origin-Destination Distance:
The origin-destination (O-D) distance is defined as the Euclidean distance between the starting point and the endpoint of the trip. If a trip T consists of K consecutive GPS points, namely $P_1, P_2, \ldots, P_K$, then the O-D distance of trip T is the Euclidean distance between $P_1$ and $P_K$.

6) Trip Fare:
A fare, in Thai Baht (THB), can also be estimated for each busy trip. Trip fare is calculated according to the Thai taxi fare rate imposed by the Department of Transportation. The total fare is a function of trip distance and congestion time. The distance-based fare rate is given in Table 3. Basically, the fare rate increases as the distance increases. The congestion time-based fare rate is 2 THB/minute for every minute that the taxi cannot travel faster than 6 km/h.

Before further analyzing the extracted trip data, an additional cleanup is done to filter out possibly erroneous data. First, trips with abnormally long or abnormally short durations are excluded. In this study, trips lasting four hours or longer and busy trips lasting one minute or less are considered abnormal. Second, trips with unusually large or unusually small distances are filtered out. The maximum driving distance from one corner of the study area to its diagonally opposite corner is around 120 km. Thus, trips longer than 120 km are considered abnormal. In addition, any busy trip with a distance smaller than 0.1 km is also regarded as unusual.
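The feature definitions and cleanup rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: records are assumed to be hypothetical tuples `(timestamp_s, lat, lon, for_hire)` of a single taxi sorted by time, and the great-circle (haversine) formula is used as a stand-in both for the geographical distance between consecutive points and for the O-D distance. The fare is omitted because the distance-based rates of Table 3 are not reproduced here.

```python
import math
from itertools import groupby

EARTH_RADIUS_KM = 6371.0


def haversine_km(p, q):
    """Great-circle distance between two records (assumed distance measure)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[1], p[2], q[1], q[2]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def split_trips(records):
    """Group consecutive records with the same for-hire status into trips."""
    trips = []
    for status, group in groupby(records, key=lambda r: r[3]):
        kind = "vacant" if status == 1 else "busy"
        trips.append((kind, list(group)))
    return trips


def duration_min(points):
    """Difference between the last and first timestamps, in minutes."""
    return (points[-1][0] - points[0][0]) / 60.0


def distance_km(points):
    """Sum of distances between consecutive GPS points."""
    return sum(haversine_km(a, b) for a, b in zip(points, points[1:]))


def od_distance_km(points):
    """Distance between the first and last GPS points."""
    return haversine_km(points[0], points[-1])


def is_plausible(kind, points):
    """Cleanup rules: drop trips of >= 4 h or > 120 km, and busy trips
    of <= 1 min or < 0.1 km."""
    t, d = duration_min(points), distance_km(points)
    if t >= 240 or d > 120:
        return False
    if kind == "busy" and (t <= 1 or d < 0.1):
        return False
    return True
```

By the triangle inequality, `distance_km` is never smaller than `od_distance_km` for the same trip, which gives a simple sanity check on extracted trips.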

IV. ANALYSIS
In this section, we present the analysis on the Bangkok taxi trips.

A. TRIP VOLUME
Trip volume reflects mobility demand. It helps us understand the transportation need in different periods of time. The total number of trips extracted from the dataset is shown in Table 4. There are approximately eight million trips in this analysis. Around 3.7 million of them are busy trips, and 4.3 million of them are vacant trips. The average daily trip volume on different days of the week is shown in Fig. 2. It can be observed that the average number of trips is approximately the same on each day of the week. On average, there are slightly over 40,000 trips each day. In addition, it can be observed that the average number of trips on the weekdays and that on the weekends are not significantly different.
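One way to compute an average daily trip volume per day of the week is sketched below. This is an assumption-laden illustration: trips are represented only by hypothetical start timestamps (seconds since epoch), local time is taken as UTC+7, and dates with no trips at all are not counted in the average.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta, timezone

BANGKOK_TZ = timezone(timedelta(hours=7))  # Indochina Time, no daylight saving
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def average_daily_volume(trip_start_times):
    """Count trips per calendar date, then average the counts by weekday."""
    per_date = Counter(
        datetime.fromtimestamp(t, BANGKOK_TZ).date() for t in trip_start_times
    )
    by_weekday = defaultdict(list)
    for date, count in per_date.items():
        by_weekday[DAY_NAMES[date.weekday()]].append(count)
    return {day: sum(counts) / len(counts) for day, counts in by_weekday.items()}
```

The same grouping applied to `(weekday, hour)` pairs instead of weekdays would give an hourly profile.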
The average hourly trip volume in each hour of the day on different days of the week is shown in Fig. 3. It can be observed that the average hourly trip volume on all days of the week exhibits a common periodic pattern. On a given day, the average number of trips is small during the early morning hours. Among these hours, the lowest average number of trips is between 2 am and 3 am. The average number of trips rises in the morning rush hours and stays above 2,000 trips during the daytime. Finally, the average number of trips declines during the evening hours. However, in these evening hours (i.e., from 7 pm to 11 pm), the average number of trips on Saturday is noticeably larger than that on the other days.

The mean durations of the busy trips and the vacant trips are approximately the same. Specifically, the mean duration of the busy trips is 24.14 minutes while the mean duration of the vacant trips is 24.02 minutes. However, the median duration of the vacant trips is slightly larger than that of the busy trips. Specifically, the median duration of the busy trips is 14.70 minutes whereas the median duration of the vacant trips is 15.72 minutes.
The natural question to ask next is which type of probability distribution the empirical trip time distributions follow. In other words, what kind of probability distribution can be used to describe the empirical trip time distributions shown in Fig. 4? A common statistical method to test whether a hypothesized distribution can characterize an empirical distribution is the Kolmogorov-Smirnov test (K-S test) [26]. In the K-S test, the empirical distribution is initially assumed to follow a hypothesized distribution. Then, it is examined whether the hypothesis can be rejected. More formally, let $F(x)$ be the hypothesized distribution, and let $G(x)$ be the empirical distribution. The null hypothesis and the alternative hypothesis for the K-S test are given as

$$H_0: G(x) = F(x) \ \text{for all } x, \qquad H_1: G(x) \neq F(x) \ \text{for some } x.$$

The null hypothesis will be rejected if the test result shows that the difference between the empirical distribution and the hypothesized distribution is statistically significant. Readers are referred to [26] for more detail on the K-S test.
Based on the observation that the empirical distributions are right-skewed, the following three types of distributions are selected as the hypothesized distributions in this study.
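1) Lognormal Distribution: A lognormal distribution is a two-parameter probability model. Its cumulative distribution function (CDF) can be expressed as

$$F(x; \mu, \sigma) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{\ln x - \mu}{\sigma\sqrt{2}}\right)\right], \quad x > 0,$$

where $\operatorname{erf}(y) \triangleq \frac{2}{\sqrt{\pi}} \int_0^y e^{-t^2}\,dt$ is the standard error function, $\mu \in \mathbb{R}$, and $\sigma^2 > 0$.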
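2) Gamma Distribution: A gamma distribution is a two-parameter probability model. Its CDF can be expressed as

$$F(x; \alpha, \beta) = \frac{\gamma(\alpha, \beta x)}{\Gamma(\alpha, 0)}, \quad x > 0,$$

where $\alpha > 0$ is the shape parameter, $\beta > 0$ is the rate parameter, $\Gamma(s, x) \triangleq \int_x^{\infty} t^{s-1} e^{-t}\,dt$ is the upper incomplete gamma function, and $\gamma(s, x) \triangleq \int_0^x t^{s-1} e^{-t}\,dt$ is the lower incomplete gamma function. Here $\Gamma(\alpha, 0)$ equals the ordinary gamma function $\Gamma(\alpha)$.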
3) Weibull Distribution: A Weibull distribution is a two-parameter probability model. Its CDF can be expressed as

$$F(x; \lambda, k) = 1 - e^{-(x/\lambda)^k}, \quad x \geq 0,$$

where $\lambda > 0$ is the scale parameter and $k > 0$ is the shape parameter.

The following procedure is carried out to check whether these hypothesized distributions can be used to model the empirical trip time distributions. First, we create one thousand test sets, where each of these test sets contains two hundred random samples drawn from the empirical trip time data. Second, for each test set, we estimate the parameters of each hypothesized distribution that best fit the empirical samples based on the maximum likelihood estimation. In other words, the parameters µ and σ of the lognormal distribution, the parameters α and β of the gamma distribution, and the parameters λ and k of the Weibull distribution are estimated from the empirical samples in each test set. Finally, the K-S test is performed for each test set and each hypothesized distribution to check if the null hypothesis can be rejected. In this study, the null hypothesis will be rejected at the 5% significance level.

In addition, in order to cross-check the results obtained from the K-S test, the Anderson-Darling (A-D) goodness-of-fit test is also performed. Similar to the K-S test, the A-D test evaluates if a set of random samples could have come from a hypothesized distribution. Readers are referred to [27] for more detail on the A-D test. The same experimental procedure as described in the previous paragraph is repeated for the A-D test.
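The resampling procedure can be sketched for the lognormal case as follows. This is a simplified illustration, not the code used in the study. It relies on several assumptions: test sets are drawn without replacement, only the lognormal hypothesis is covered (its maximum likelihood estimates have a closed form), and the decision uses the asymptotic 5% critical value 1.358/√n of the K-S statistic. Because the parameters are estimated from the same samples being tested, this critical value is conservative, so pass rates from the sketch should be read as indicative only.

```python
import math
import random


def fit_lognormal(samples):
    """Closed-form MLE: mean and (biased) std of the log-samples."""
    logs = [math.log(x) for x in samples]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


def lognormal_cdf(x, mu, sigma):
    """Lognormal CDF written with the error function."""
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))


def ks_statistic(samples, cdf):
    """Largest gap between the empirical CDF and the hypothesized CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def pass_rate(data, n_sets=1000, set_size=200, seed=0):
    """Percentage of test sets for which the null hypothesis is not rejected."""
    rng = random.Random(seed)
    critical = 1.358 / math.sqrt(set_size)  # asymptotic 5% critical value
    passed = 0
    for _ in range(n_sets):
        test_set = rng.sample(data, set_size)
        mu, sigma = fit_lognormal(test_set)
        d = ks_statistic(test_set, lambda x: lognormal_cdf(x, mu, sigma))
        passed += d <= critical
    return 100.0 * passed / n_sets
```

The gamma and Weibull hypotheses would follow the same loop with their own CDFs, but their maximum likelihood estimates require a numerical solver and are left out of this sketch.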
The percentage of the test sets that pass the K-S test (i.e., those for which the null hypothesis cannot be rejected) for each type of hypothesized distributions is shown in Table 5. It can be observed that the lognormal distribution can best describe the busy trip time distribution. In fact, it can describe 98.6% of the test sets. This literally means that, out of 1,000 test sets, 986 of them can be characterized by the lognormal distribution. The Weibull distribution can also model the busy trip time distribution very well. It can describe 97.9% of the test sets. The gamma distribution, however, is not as effective in modeling the busy trip time distribution. It can only describe 86.3% of the test sets. In the case of the vacant trip time distribution, the only distribution that can best describe it is the lognormal distribution. The lognormal distribution can model 98.5% of the test sets while the other two distributions are much less effective.
A similar trend is also observed with the A-D test results shown in Table 5. In the case of the busy trips, the lognormal distribution and the Weibull distribution can model the trip time distribution exceptionally well. In fact, 99.9% of the test sets pass the A-D test when the hypothesized distribution is lognormal. In the case of the vacant trips, however, the only distribution that can best describe the trip time distribution is the lognormal distribution. The A-D test results are coherent with the K-S test results.
Next, we investigate if these hypothesized distributions are able to model the trip time distribution in each hour of the day. In addition, we also investigate if there is any difference between weekday trips and weekend trips. The same experimental procedure as described earlier is repeated for each hour of the day. The percentage of the test sets that pass the K-S test in each hour of the day is shown in [...]

[...] a taxi needs to drive to search for a passenger. The probability density of the trip distance is shown in Fig. 7. Both the busy trip distance and the vacant trip distance are compared. Similar to the trip time distribution, both the busy trip distance distribution and the vacant trip distance distribution are right-skewed. This means that short-distance trips occur more frequently than long-distance trips. The mean distance of the busy trips is 6.43 km while the mean distance of the vacant trips is 9.55 km. This indicates that the distance incurred in searching for passengers is typically larger than the distance incurred in delivering them. The median distance of the busy trips is 3.88 km while the median distance of the vacant trips is 6.15 km. These numbers suggest that the current taxi system is inefficient. The taxis typically need to drive farther to search for a passenger than to deliver one.

Next, we characterize the empirical trip distance distributions. The same statistical procedure as described in Section IV-B is also performed on the empirical trip distance data. The percentage of the test sets that pass the K-S test for each type of hypothesized distribution is shown in Table 6. It can be observed that the Weibull distribution and the gamma distribution can model the busy trip distance distribution remarkably well. The Weibull distribution can describe 99.7% of the test sets while the gamma distribution can describe 98.9% of the test sets. The lognormal distribution is fair in modeling the busy trip distance distribution. It can describe 91.4% of the test sets. In the case of the vacant trip distance distribution, the probability distribution that can best describe it is the Weibull distribution. The Weibull distribution can model 95.7% of the test sets whereas the other two types of distribution are not as effective. In particular, the lognormal distribution performs poorly in modeling the vacant trip distance distribution.
The percentage of the test sets that pass the A-D test for each type of hypothesized distributions is also shown in Table 6. A similar trend as noted from the K-S test results can also be observed. In the case of the busy trips, the Weibull distribution and the gamma distribution can describe the trip distance distribution extremely well. In the case of the vacant trips, the trip distance can best be described by the Weibull distribution.
Next, we investigate if these hypothesized distributions are able to model the trip distance distribution in each hour of the day. In addition, we also investigate if there is any difference between weekday trips and weekend trips. The percentage of the test sets that pass the K-S test in each hour of the day is shown in Fig. 8. In the case of the busy trips, the Weibull distribution and the gamma distribution can model the trip distance distribution exceptionally well in all hours of the day, both on the weekday and on the weekend. In the case of the vacant trips, the Weibull distribution can model the trip distance distribution effectively in most hours of the day. It is less effective in a few early morning hours. In contrast, the lognormal distribution is not effective in modeling the vacant trip distance distribution at all. Similar behaviors are also observed with the A-D test results shown in Fig. 9.
In summary, the Weibull distribution is the only distribution that can model both the busy trip distance distribution and the vacant trip distance distribution most effectively. Both the K-S test results and the A-D test results confirm this.

D. ORIGIN-DESTINATION DISTANCE
Origin-Destination distance is the Euclidean distance between the starting point and the endpoint of a trip. The statistical knowledge on the O-D distance distribution is critical to creating a realistic O-D pair in a traffic simulation. The probability density of the O-D distance is shown in [...]

Next, we characterize the empirical O-D distance distributions. The same statistical procedure as described in Section IV-B is also performed on the empirical O-D distance data. The percentage of the test sets that pass the K-S test for each type of hypothesized distribution is shown in Table 7.
In the case of the busy trips, all three types of hypothesized distributions are able to model the O-D distance distribution extremely well although the Weibull distribution is the best. The Weibull distribution is able to model 99.7% of the test sets. In the case of the vacant trips, in contrast, only the Weibull distribution can model the O-D distance distribution effectively. The Weibull distribution can model 97.6% of the test sets.
The percentage of the test sets that pass the A-D test for each type of hypothesized distributions is also shown in Table 7. A similar trend as noted from the K-S test results can also be observed. In the case of the busy trips, all three types of hypothesized distribution can model the trip O-D distance distribution extremely well. In the case of the vacant trips, the O-D distance can best be described by the Weibull distribution.
The percentage of the test sets that pass the K-S test in each hour of the day is shown in Fig. 11. In the case of the busy trips, all three types of hypothesized distributions can model the O-D distance distribution exceptionally well in all hours of the day, both on the weekday and on the weekend. In the case of the vacant trips, the Weibull distribution can model the O-D distance distribution effectively in most hours of the day. It is less effective in a few early morning hours. In contrast, the lognormal distribution is not effective in modeling the vacant trip O-D distance distribution at all. Similar behaviors are also observed with the A-D test results shown in Fig. 12.
In summary, the Weibull distribution is the best distribution that can describe both the O-D distance distribution of the busy trips and the O-D distance distribution of the vacant trips effectively. Both the K-S test results and the A-D test results confirm this.
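To illustrate how a fitted Weibull distribution could feed a traffic simulation, the sketch below draws O-D distances by inverting the Weibull CDF $F(x) = 1 - e^{-(x/\lambda)^k}$. The scale and shape values in the usage example are hypothetical placeholders, since fitted parameter values are not listed in this section.

```python
import math
import random


def sample_weibull(scale, shape, rng):
    """Inverse-transform sampling: solve u = 1 - exp(-(x/scale)**shape) for x."""
    u = rng.random()  # uniform on [0, 1), so 1 - u lies in (0, 1]
    return scale * (-math.log(1.0 - u)) ** (1.0 / shape)


def weibull_mean(scale, shape):
    """Theoretical mean, scale * Gamma(1 + 1/shape), for a sanity check."""
    return scale * math.gamma(1.0 + 1.0 / shape)


# Usage with placeholder parameters (assumptions, not fitted values).
rng = random.Random(42)
od_distances_km = [sample_weibull(5.0, 1.2, rng) for _ in range(20000)]
empirical_mean = sum(od_distances_km) / len(od_distances_km)
```

Comparing `empirical_mean` with `weibull_mean(5.0, 1.2)` is a quick check that the sampler reproduces the intended distribution. Python's standard library also offers `random.weibullvariate(alpha, beta)`, where `alpha` is the scale and `beta` is the shape.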

E. PICKUP AND DROP-OFF LOCATIONS
Pickup and drop-off locations can help us visualize where people travel. A density map of the pickup locations is shown in Fig. 13. The density is normalized so that it ranges between 0 and 1, where 1 represents the highest density. Both the daytime (i.e., 6 am to 6 pm) locations and the night time (i.e., 6 pm to 6 am) locations are illustrated. Generally, the density of the pickups concentrates around the city center. The density decreases as the distance from the center increases. During the daytime, there are four distinct bright spots on the density map. These spots are the city center area, the Chatuchak area, the Don Mueang International Airport (DMK), and the Bangkok International Airport (BKK). The city center area is where most business activities take place. Thus, a high intensity of pickups is expected. The Chatuchak area is where many transportation terminals (e.g., bus, subway, and train) are located, so this is probably the main reason why a lot of pickups are observed in this area. During the night time, the density map generally looks similar to that of the daytime. However, the density of the pickups during the night time is highest around the Ploen Chit area where many of the shopping malls and hotels are located.
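A normalized density of this kind can be sketched by binning locations into a regular grid and dividing each cell count by the largest count. This is only one possible construction: whether the maps in this study use grid binning or a kernel density estimate is not stated, and the cell size below is an arbitrary assumption.

```python
import math
from collections import Counter


def density_grid(points, cell_deg=0.01):
    """Bin (lat, lon) points into square cells of `cell_deg` degrees and
    scale the counts so that the densest cell has value 1."""
    counts = Counter(
        (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        for lat, lon in points
    )
    if not counts:
        return {}
    peak = max(counts.values())
    return {cell: count / peak for cell, count in counts.items()}
```

Separate grids for daytime and night time pickups would be obtained by splitting the points on the pickup hour before calling the function.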

VOLUME x, 2022
A density map of the drop-off locations is shown in Fig. 14. Similar to what is observed in the case of the pickups, the drop-offs generally concentrate around the city center, and the density decreases as the distance from the center increases. During the daytime, there are two distinct bright spots. They are the city center area and the Chatuchak area. During the night time, the density of the drop-offs is highest around the Ploen Chit area, which is similar to what is observed in the case of the pickups at night.

F. SYSTEM EFFICIENCY
In addition to the statistical characteristics of the taxi trips discussed in the previous sections, we evaluate the efficiency of the Bangkok taxi system in this section. In particular, the system efficiency is measured by the following two metrics.
1) Temporal efficiency: Temporal efficiency is defined as the ratio between the total busy time contributed by all taxis and the total operating time of all taxis. It measures the proportion of time that the system is busy. Formally, the temporal efficiency can be expressed as

$$\text{temporal efficiency} = \frac{\sum_{i=1}^{N} \tau_i^{B}}{\sum_{i=1}^{N} \left(\tau_i^{B} + \tau_i^{V}\right)}$$

where $\tau_i^{B}$ is the amount of time that taxi $i$ is busy, $\tau_i^{V}$ is the amount of time that taxi $i$ is vacant, and $N$ is the total number of taxis in the system.

2) Spatial efficiency: Spatial efficiency is defined as the ratio between the total busy distance contributed by all taxis and the total operating distance of all taxis. Formally, the spatial efficiency can be expressed as

$$\text{spatial efficiency} = \frac{\sum_{i=1}^{N} d_i^{B}}{\sum_{i=1}^{N} \left(d_i^{B} + d_i^{V}\right)}$$

where $d_i^{B}$ is the distance that taxi $i$ travels while busy, $d_i^{V}$ is the distance that taxi $i$ travels while vacant, and $N$ is the total number of taxis in the system.

The efficiency of the Bangkok taxi system is shown in Table 8. It is clear that the Bangkok taxi system is highly inefficient. The temporal efficiency is only around 46.76%, which implies that the taxis spend more time searching for passengers than actually carrying them. The spatial efficiency is worse. The total distance traveled by the taxis while carrying passengers is only 37.06% of the total operating distance. The temporal efficiency and the spatial efficiency reported here are critical to the future development of on-demand mobility services. They are the benchmark that future services need to take into consideration. Any future service should be designed to achieve a higher efficiency level.
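As a minimal sketch, both metrics can be computed from per-taxi totals as follows. Each metric is the fleet-wide busy total divided by the fleet-wide operating total (busy plus vacant). The record layout, the names, and the numbers are hypothetical and are not taken from the dataset.

```python
from dataclasses import dataclass

@dataclass
class TaxiTotals:
    """Per-taxi totals over the study period (hypothetical record layout)."""
    busy_hours: float
    vacant_hours: float
    busy_km: float
    vacant_km: float

def temporal_efficiency(fleet):
    """Total busy time divided by total operating (busy + vacant) time."""
    busy = sum(t.busy_hours for t in fleet)
    operating = sum(t.busy_hours + t.vacant_hours for t in fleet)
    return busy / operating

def spatial_efficiency(fleet):
    """Total busy distance divided by total operating (busy + vacant) distance."""
    busy = sum(t.busy_km for t in fleet)
    operating = sum(t.busy_km + t.vacant_km for t in fleet)
    return busy / operating

# Illustrative numbers only.
fleet = [
    TaxiTotals(busy_hours=4.0, vacant_hours=6.0, busy_km=60.0, vacant_km=140.0),
    TaxiTotals(busy_hours=5.0, vacant_hours=5.0, busy_km=80.0, vacant_km=120.0),
    TaxiTotals(busy_hours=3.0, vacant_hours=3.0, busy_km=40.0, vacant_km=60.0),
]

print(f"Temporal efficiency: {temporal_efficiency(fleet):.2%}")
print(f"Spatial efficiency:  {spatial_efficiency(fleet):.2%}")
```

Because the sums are taken over the whole fleet before dividing, taxis that operate longer carry more weight. This differs from averaging the per-taxi ratios, which would weight every taxi equally.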

G. AGENT EFFICIENCY
In this section, we evaluate the efficiency of each individual taxi. In this analysis, a taxi is viewed as a revenue-generating agent in the system. Since a taxi could be driven by more than one driver, the efficiency computed in this analysis is not intended to reflect the performance of an individual driver. Rather, it reflects how well each individual taxi (i.e., machine) contributes as a revenue-generating asset in the system. The efficiency of an individual taxi is measured in terms of revenue-gain-rate. The revenue-gain-rate is defined as the amount of revenue (i.e., fare) generated by a taxi over its operating time. It is measured in THB/hour, and it can be expressed as

$$\text{revenue-gain-rate}_i = \frac{r_i}{\tau_i^{B} + \tau_i^{V}}$$

where $r_i$ is the revenue generated by taxi $i$, $\tau_i^{B}$ is the amount of time that taxi $i$ is busy, and $\tau_i^{V}$ is the amount of time that taxi $i$ is vacant.

Since we are assessing how much revenue each taxi gains, it is logical to consider only the taxis that operate regularly. Therefore, to evaluate the revenue-gain-rate, we only consider the taxis that make at least 100 busy trips during the three-month study period. This is roughly equivalent to making at least one busy trip a day, on average. The probability density of the revenue-gain-rate is shown in Fig. 15. The distribution looks symmetrical. The mean revenue-gain-rate is 102.45 THB/hour, and the median revenue-gain-rate is 103.29 THB/hour. The minimum wage in Thailand is 300 THB/day, or around 37.5 THB/hour for an eight-hour working day. Thus, the mean revenue-gain-rate is approximately 2.7 times the minimum wage rate.
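The revenue-gain-rate computation and the 100-trip filter can be sketched as follows. The per-taxi records are hypothetical; only the minimum wage of 300 THB/day and the reported mean of 102.45 THB/hour come from the text, and the eight-hour working day is the assumption implied by the 37.5 THB/hour figure.

```python
from statistics import mean, median

MIN_BUSY_TRIPS = 100  # regular-operation threshold used in the text

def revenue_gain_rate(revenue_thb, busy_hours, vacant_hours):
    """Revenue divided by operating (busy + vacant) time, in THB/hour."""
    return revenue_thb / (busy_hours + vacant_hours)

# (revenue in THB, busy hours, vacant hours, busy trips); illustrative values only.
taxis = [
    (60000.0, 250.0, 350.0, 900),
    (49500.0, 200.0, 250.0, 700),
    (27000.0, 150.0, 150.0, 400),
    (2000.0, 5.0, 5.0, 20),  # excluded: fewer than 100 busy trips
]

rates = [
    revenue_gain_rate(revenue, busy, vacant)
    for revenue, busy, vacant, trips in taxis
    if trips >= MIN_BUSY_TRIPS
]
print(f"Mean:   {mean(rates):.2f} THB/hour")
print(f"Median: {median(rates):.2f} THB/hour")

# Comparison with the minimum wage, assuming an eight-hour working day.
hourly_minimum_wage = 300.0 / 8.0
reported_mean_rate = 102.45  # mean revenue-gain-rate reported in the text
print(f"Ratio to minimum wage: {reported_mean_rate / hourly_minimum_wage:.2f}")
```

The last line reproduces the ratio of about 2.7 quoted in the text.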
Next, we divide the taxis into four quartiles based on the total revenue they generate during the three-month study period. The first quartile represents the bottom 25% of the taxis with the lowest revenue whereas the fourth quartile represents the top 25% of the taxis with the highest revenue. The revenue-gain-rate of the taxis that belong in each quartile is shown in Fig. 16. It can be observed that the taxis in the fourth quartile generally have the highest revenue-gain-rate. This suggests that the taxis in this group do not earn more revenue by simply operating longer. Rather, these taxis are more efficient in searching for passengers and generating revenue.
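The grouping by total revenue can be sketched as below. Taxis are ranked by total revenue, split into four equally sized groups, and the mean revenue-gain-rate of each group is compared. The data and the helper name are hypothetical, and a rank-based split is assumed since the exact grouping procedure is not detailed.

```python
from statistics import mean

def revenue_groups(taxis):
    """Rank taxis by total revenue and split them into four equally sized groups."""
    ranked = sorted(taxis, key=lambda t: t[0])
    n = len(ranked)
    groups = [[], [], [], []]
    for rank, taxi in enumerate(ranked):
        groups[rank * 4 // n].append(taxi)
    return groups

# (total revenue in THB, operating hours); illustrative values only.
taxis = [
    (60000.0, 500.0), (18000.0, 225.0), (42000.0, 400.0), (30000.0, 300.0),
    (66000.0, 550.0), (20000.0, 250.0), (44100.0, 420.0), (33000.0, 330.0),
]

group_rates = [
    mean(revenue / hours for revenue, hours in group)
    for group in revenue_groups(taxis)
]
for number, rate in enumerate(group_rates, start=1):
    print(f"Group {number}: {rate:.1f} THB/hour")
```

If higher revenue came only from longer operating hours, the mean rate would be roughly flat across the groups. A rate that rises from the first group to the fourth, as in this made-up data, instead points to more efficient operation.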

V. CONCLUSIONS
In this study, we perform statistical analysis on the Bangkok taxi trips, which are extracted from the real GPS records of taxis in the period between October 2019 and December 2019. We conclude this study with the following insights.
• In terms of trip volume, the average number of trips on each day of the week is approximately the same. In addition, there is no significant difference between the average number of trips on the weekday and that on the weekend.
• In terms of trip time, Bangkok passengers spend, on average, around 24.14 minutes on each taxi trip. Similarly, the taxis spend, on average, around 24.02 minutes searching for a passenger. Based on the K-S test and the A-D test, the probability distribution that can best describe both the busy trip time distribution and the vacant trip time distribution is the lognormal distribution. This agrees with [7] and [8], where the busy trip time distribution is characterized with a lognormal distribution based on visual curve-fitting.
• In terms of trip distance, the driving distance between the pickup location and the drop-off location is around 6.43 km, on average. However, the average distance that a taxi drives to search for passengers is around 9.55 km. Based on the K-S test and the A-D test, the probability distribution that can best describe both the busy trip distance distribution and the vacant trip distance distribution is the Weibull distribution. This is in sharp contrast to what is reported in the existing works, which mostly rely on visual curve-fitting. The busy trip distance distribution is characterized with a lognormal distribution in [9], a gamma distribution in [6] and [10], and a power-law distribution in [11].
• In terms of O-D distance, the average Euclidean distance between the starting point and the endpoint of the busy trip is around 4 km. The average Euclidean distance between the starting point and the endpoint of the vacant trip is around 6.36 km. Based on the K-S test and the A-D test, the probability distribution that can best describe both the busy trip O-D distance distribution and the vacant trip O-D distance distribution is the Weibull distribution. This is in sharp contrast to what is reported in the existing works. The O-D distance distribution of the busy trips is characterized with a power-law distribution in [14] and a mixture of a power-law distribution and an exponential distribution in [12].
• The efficiency of the Bangkok taxi system is poor. The temporal efficiency is around 46.76% while the spatial efficiency is around 37.06%. This clearly suggests that the taxis spend more than half of their operating time without passengers aboard. The efficiency of the Bangkok taxi system is not reported in the existing works.
• In terms of revenue-gain-rate, on average, each taxi generates around 102.45 THB/hour. It is also shown that taxis that belong in the higher-earning group have a higher revenue-gain-rate. This suggests that high-earning taxis do not gain higher revenue by simply working longer. In fact, they are more efficient in their operation (e.g., searching for passengers and generating revenue).

KAVEPOL KHUNSRI is currently a student in an integrated bachelor's and master's degree program in information technology at the Faculty of Information Technology, King Mongkut's Institute of Technology Ladkrabang, Bangkok, Thailand. His research interests are big data analytics and machine learning. In 2021, he won the best paper award from the 18th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications, and Information Technology (ECTI-CON 2021).