Identifying Flow Clusters Based on Density Domain Decomposition

Flow clustering is one of the most important data mining methods for analyzing origin-destination (OD) flow data, and it may reveal the underlying mechanisms responsible for the spatial distributions and temporal dynamics of geographical phenomena. Existing flow clustering approaches mainly extend traditional point clustering methods by redefining basic concepts or spatial association indicators for flows and then applying classic clustering processes, such as aggregating, collecting or searching. However, current techniques still suffer from two main problems: poor identification accuracy and complicated parameter selection processes. To resolve these problems, a new clustering method is proposed in this study for arbitrarily shaped flow clusters based on the density domain decomposition of flows. Simulation experiments show that our method outperforms the three most commonly used methods in terms of the overall identification rate and almost all F1 measures, and it does not require any manual adjustments during the parameter selection process. Finally, a case study is conducted on taxi trip data from Beijing. Several flow clusters are identified that represent different types of residents’ travel behaviors, including daily commuting, return travel, tourism and behaviors on special days.


I. INTRODUCTION
The movement of a geographical object between two locations (e.g., the daily commute from dwelling to workplace [1], immigration between states [2] or delivery services in a city [3]) can be represented as an origin-destination (OD) flow [4], [5]. Such flows, which include human, commodity, information, capital and relationship flows, can be thought of as interactions or relationships between two georeferenced places and may reflect the underlying mechanisms responsible for the spatial distributions and temporal dynamics of geographical phenomena [6]-[8].
In recent years, with the emergence of different kinds of mobile devices, large amounts of OD flow data have become available, including mobile phone signal data [9], GPS trajectories [10], footprints [11], WiFi positioning trace data [12], logistics records [13], and transaction records [14]. Therefore, many spatial analysis and geodata mining methods, such as abnormal flow detection [15], flow cluster identification [6], [16], [17], and flow estimation or prediction techniques [18]-[20], have been developed to discover patterns in these OD flow data. Among these methods, flow clustering is commonly used to discover the distribution characteristics of flows. In its early years, it was mainly used to evaluate the global roles of cities in transportation [21] or to improve our understanding of the geographical patterns in residents' mobility [22], [23]. In recent years, however, this method has been extended to detect malaria hotspots in the epidemiology field [24], to analyze saving propensities and wealth distributions [25], to understand the interdisciplinary nature of knowledge absorption [26], and to address problems in many other fields [27], [28].
The associate editor coordinating the review of this manuscript and approving it for publication was Corrado Mencar.
The main principle behind flow clustering algorithms is to extend traditional point clustering algorithms by redefining some basic concepts (such as distance, density and reachability) or some spatial association indicators. On this basis, current flow clustering methods can be grouped into three categories: hierarchical clustering, density-based clustering and statistics-based clustering. In hierarchical clustering methods for flow data, the distance between OD flows is defined according to the OD locations [29], [30] and, sometimes, the attributes of the flows [6], [31]; an agglomerative or divisive strategy is then used to organize the flows into a hierarchy [32], [33]. These methods can identify flow clusters at different spatial scales, and they are usually proposed to solve problems associated with flow cluster identification, generalization and visualization [34]-[37]. In density-based clustering methods for flow data, the main aim is to find high-density subsets of flow data [38]-[40]; accordingly, the local density of an OD flow is defined based on the quantity or reachability of each flow. Then, a traditional algorithm such as DBSCAN [41] or OPTICS [42] can be extended to identify flow clusters based on a density-connected process. Moreover, these methods are insensitive to outlier flows because not all flow data need to be clustered, so noise can be eliminated [43]. Finally, in statistics-based clustering methods, the definitions of spatial association indicators or other statistical measures, such as Moran's I [29], [44], Getis-Ord's G [8], Ripley's K-function [6], and the log-likelihood ratio [30], can be extended to describe the local aggregative characteristics of flow subsets. The objective of these methods is usually to find the subset of flows with the optimal values of the statistical measures. These methods can usually create a significant description of spatial homogeneity, thereby providing a standard measure for comparing flow clusters [45]-[47].
Although many flow clustering methods have been proposed in recent years, some problems remain unsolved. First and foremost is the identification rate for arbitrarily shaped flow clusters. Most of the above methods are very useful at identifying certain types of flow clusters. However, when encountering some irregularly shaped flow clusters, these methods may not be as effective [30]. Second, in most cases, the parameters need to be manually selected, which is often difficult. Determination of an unknown number of flow clusters is always a troublesome problem, and some other scale parameters, such as distance threshold and neighborhood range, are hard to set in clustering algorithms.
To solve these problems, in this study, we propose a clustering method for arbitrarily shaped flow clusters based on the density domain decomposition of flows. In our method, a flow dataset is assumed to be composed of clusters with high-density flows and noise with low-density flows. A mixed probability density model of k-th nearest neighbor distances is used to separate clusters and noise. The model parameters can be estimated through an EM algorithm, and thus, flow clusters can be determined. This method is believed to automatically identify arbitrarily shaped flow clusters with high accuracy.
The remainder of this paper is arranged as follows. Section II introduces some basic concepts about flows. Section III describes the proposed method in detail. Section IV presents the simulation experiment and compares our method with popular methods presented in previous studies. Section V presents a case study involving taxi trip data in Beijing. Finally, Section VI provides the conclusions and future work.

II. BASIC CONCEPTS ABOUT FLOWS
Before we describe the details of our method, several basic concepts about OD flows must be introduced.
An OD flow can be expressed as $f = (O, D)$, where $O = (x_O, y_O)$ and $D = (x_D, y_D)$ denote the coordinates of the origin point and destination point, respectively. Therefore, the flow space is a metric space that is expressed as the Cartesian product of two 2-D planes ($\mathbb{R}^2 \times \mathbb{R}^2$), and each flow can be seen as a 4-D point in this flow space.

Definition 2 (Flow Distance):
The flow distance $\rho$ is the fundamental measurement in the flow space. Two types of distance measurements in flow space are defined as follows.
Chebyshev distance:
$$\rho(f_i, f_j) = \max(\rho_O, \rho_D)$$
Manhattan distance:
$$\rho(f_i, f_j) = \rho_O + \rho_D$$
where $\rho_O$ (or $\rho_D$) represents the distance between the origin points (or destination points) of flows $f_i$ and $f_j$.
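As a minimal sketch of Definition 2, the two flow distances can be computed from the endpoint distances. The sketch assumes that $\rho_O$ and $\rho_D$ are Euclidean distances, that the Chebyshev form takes their maximum and that the Manhattan form takes their sum (a reading that agrees with the neighborhood volumes quoted in Definition 3). The tuple layout and the function names are illustrative choices, not part of the original method.

```python
import math
from typing import Tuple

# Illustrative layout (an assumption): a flow is ((x_O, y_O), (x_D, y_D)).
Point = Tuple[float, float]
Flow = Tuple[Point, Point]


def endpoint_distances(f: Flow, g: Flow) -> Tuple[float, float]:
    """Euclidean distance between the origins and between the destinations."""
    rho_o = math.dist(f[0], g[0])
    rho_d = math.dist(f[1], g[1])
    return rho_o, rho_d


def chebyshev_flow_distance(f: Flow, g: Flow) -> float:
    """Larger of the origin distance and the destination distance."""
    return max(endpoint_distances(f, g))


def manhattan_flow_distance(f: Flow, g: Flow) -> float:
    """Sum of the origin distance and the destination distance."""
    return sum(endpoint_distances(f, g))


f = ((0.0, 0.0), (10.0, 0.0))
g = ((3.0, 4.0), (10.0, 1.0))
# Origins are 5 apart and destinations are 1 apart.
print(chebyshev_flow_distance(f, g), manhattan_flow_distance(f, g))
```

Under the Chebyshev form, two flows are close only if both their origins and their destinations are close, whereas the Manhattan form lets a small origin distance compensate for a larger destination distance.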

Definition 3 (ε-Neighborhood of Flow):
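The ε-neighborhood of a flow $f$, denoted $N_\varepsilon(f)$, is the set of flows whose flow distance to $f$ does not exceed $\varepsilon$: $N_\varepsilon(f) = \{f_i \mid \rho(f, f_i) \le \varepsilon\}$. This ε-neighborhood can be expressed as a flow sphere with center $f$ and radius $\varepsilon$ in the flow space. The volume of this sphere, denoted $V_{N_\varepsilon(f)}$, can be calculated as follows:
$$V_{N_\varepsilon(f)} = \iiiint_{\rho \le \varepsilon} \rho_O \, \rho_D \, d\rho_O \, d\theta_O \, d\rho_D \, d\theta_D$$
where $(\rho_O, \theta_O)$ and $(\rho_D, \theta_D)$ are the polar coordinates of the 2-D origin plane and the 2-D destination plane, respectively. Based on the above two definitions of the flow distance, the volumes are $\pi^2 \varepsilon^4$ for the Chebyshev distance and $\pi^2 \varepsilon^4 / 6$ for the Manhattan distance. Figure 1 shows the ε-neighborhood of a flow based on both types of distances.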
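The two volumes can be checked by direct integration. The short derivation below is a sketch under the assumption that the Chebyshev neighborhood is the region $\max(\rho_O, \rho_D) \le \varepsilon$ and the Manhattan neighborhood is the region $\rho_O + \rho_D \le \varepsilon$, with $\rho_O$ and $\rho_D$ taken as Euclidean radii in the origin and destination planes.

```latex
\begin{align*}
V_{\text{Chebyshev}}
  &= \left(\int_0^{2\pi}\!\!\int_0^{\varepsilon} \rho_O \, d\rho_O \, d\theta_O\right)
     \left(\int_0^{2\pi}\!\!\int_0^{\varepsilon} \rho_D \, d\rho_D \, d\theta_D\right)
   = (\pi \varepsilon^2)^2 = \pi^2 \varepsilon^4, \\[4pt]
V_{\text{Manhattan}}
  &= 4\pi^2 \int_0^{\varepsilon} \rho_O
     \left(\int_0^{\varepsilon - \rho_O} \rho_D \, d\rho_D\right) d\rho_O
   = 2\pi^2 \int_0^{\varepsilon} \rho_O \, (\varepsilon - \rho_O)^2 \, d\rho_O \\
  &= 2\pi^2 \cdot \frac{\varepsilon^4}{12}
   = \frac{\pi^2 \varepsilon^4}{6}.
\end{align*}
```

In both cases the volume grows with the fourth power of the radius, which is why the k-th nearest flow distance enters the later density model through $\varepsilon^4$.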

Definition 4 (Flow Density):
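The flow density, denoted $\lambda$, is the number of flows per unit volume. For a flow zone (a subset of the flow space) $z \subset \mathbb{R}^2 \times \mathbb{R}^2$, the flow density $\lambda(z)$ can be calculated as $\lambda(z) = n_z / V_z$, where $n_z = |\{f \mid f \in z\}|$ is the number of flows in $z$ and $V_z$ is the volume of $z$. For a single flow, the local flow density $\lambda(f)$ can be calculated as follows:
$$\lambda(f) = \frac{n_{N_\varepsilon(f)}}{V_{N_\varepsilon(f)}}$$
where $n_{N_\varepsilon(f)}$ is the number of flows in the ε-neighborhood of $f$ and $V_{N_\varepsilon(f)}$ is the volume of that neighborhood.
VOLUME 8, 2020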
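The local flow density can be illustrated with a small sketch. It assumes that the density of a single flow is the number of other flows inside its ε-neighborhood divided by the neighborhood volume ($\pi^2 \varepsilon^4$ for the Chebyshev distance, $\pi^2 \varepsilon^4 / 6$ for the Manhattan distance). Excluding the flow itself from the count, the `metric` argument and all function names are assumptions made for illustration.

```python
import math


def flow_distance(f, g, metric="chebyshev"):
    """Flow distance built from Euclidean origin and destination distances."""
    rho_o = math.dist(f[0], g[0])
    rho_d = math.dist(f[1], g[1])
    return max(rho_o, rho_d) if metric == "chebyshev" else rho_o + rho_d


def neighborhood_volume(eps, metric="chebyshev"):
    """Volume of the eps-neighborhood in the 4-D flow space."""
    volume = math.pi ** 2 * eps ** 4
    return volume if metric == "chebyshev" else volume / 6.0


def local_flow_density(i, flows, eps, metric="chebyshev"):
    """Flows per unit volume around flows[i] (the flow itself is not counted)."""
    count = sum(
        1
        for j, g in enumerate(flows)
        if j != i and flow_distance(flows[i], g, metric) <= eps
    )
    return count / neighborhood_volume(eps, metric)


flows = [
    ((0.0, 0.0), (5.0, 5.0)),
    ((0.1, 0.0), (5.0, 5.1)),
    ((0.0, 0.2), (5.2, 5.0)),
    ((9.0, 9.0), (1.0, 1.0)),  # isolated flow
]
print(local_flow_density(0, flows, eps=0.5))
print(local_flow_density(3, flows, eps=0.5))
```

The brute-force neighbor count is quadratic in the number of flows; a spatial index would be needed for datasets of realistic size.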

III. METHOD
Based on the concepts defined above, a density domain decomposition model of flows is proposed as an extension of the point process decomposition model [48] to identify flow clusters. This method can be divided into four steps, as shown in Figure 2. First, we determine whether the flow set is homogeneous by using several quantitative indices proposed in our previous work (such as NLH* and A-w) [47]. If the flow set is homogeneous, it cannot be decomposed and the procedure stops; otherwise, we proceed to the second step. Second, a mixed probability density function (pdf) of the k-th nearest distances of flows is generated to describe the density domain model of the flow set. Third, the parameters of this pdf are estimated by an expectation-maximization (EM) algorithm, and all the flows are decomposed into two components with different densities that correspond to either dense flows or sparse flows. In the final step, the sparse flows are treated as noise, while the dense flows are grouped into flow clusters based on the density-connected clustering concept [36]. Since our previous work [47] provides the details of the first step, we introduce only the remaining three steps.

A. MIXED PDF OF THE K-TH DISTANCES OF FLOWS
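For one homogeneous flow set, where the flow density is $\lambda(f) \equiv \lambda$, the probability distribution of the k-th nearest flow distance, $F_{D_k}$, can be obtained by summing the probabilities that the $D_k$-neighborhood contains 0, 1, 2, ..., k-1 flows:
$$F_{D_k}(\varepsilon) = P(D_k \le \varepsilon) = 1 - \sum_{m=0}^{k-1} P(n_{N_\varepsilon} = m)$$
where $k$ is the ordinal number of nearest neighbors and $n_{N_\varepsilon}$ is the number of flows in the ε-neighborhood. In this case, $n_{N_\varepsilon}$ follows a Poisson distribution, and $P(n_{N_\varepsilon} = k)$ is expressed as
$$P(n_{N_\varepsilon} = k) = \frac{e^{-\lambda V_{N_\varepsilon}} (\lambda V_{N_\varepsilon})^{k}}{k!}$$
Therefore, the mixed pdf of the k-th nearest flow distances of two homogeneous flow sets with different densities, for example, $\lambda_1$ and $\lambda_2$ ($\lambda_1 > \lambda_2$), can be expressed as follows:
$$D_k \sim p \, f_{D_k}(\varepsilon; k, \lambda_1) + (1 - p) \, f_{D_k}(\varepsilon; k, \lambda_2) \tag{7}$$
where $f_{D_k}(\varepsilon; k, \lambda) = dF_{D_k}(\varepsilon)/d\varepsilon$ is the pdf of the k-th nearest flow distance in a homogeneous flow set with density $\lambda$, $p$ is the proportion of flows that belong to flow clusters, $\lambda_1$ is the density of the flow clusters and $\lambda_2$ is the density of the noise. Since a flow set can be seen as a mixture of flow clusters and noise, equation (7) can be used to describe the density domain model of flows.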
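To make the density domain model concrete, the following sketch evaluates the pdf of the k-th nearest flow distance and the two-component mixture. It assumes a neighborhood volume of the form $c \, \varepsilon^4$ (with $c = \pi^2$ for the Chebyshev distance), so that differentiating the Poisson-based distribution function gives $4 (\lambda c)^k \varepsilon^{4k-1} e^{-\lambda c \varepsilon^4} / (k-1)!$. The function names and the default value of `c` are illustrative assumptions.

```python
import math


def kth_distance_pdf(d, k, lam, c=math.pi ** 2):
    """pdf of the k-th nearest flow distance for a homogeneous density lam.

    Assumes V(d) = c * d**4; evaluated in log space for numerical stability.
    """
    if d <= 0.0:
        return 0.0
    log_pdf = (
        math.log(4.0)
        + k * math.log(lam * c)
        + (4 * k - 1) * math.log(d)
        - lam * c * d ** 4
        - math.lgamma(k)  # lgamma(k) == log((k - 1)!)
    )
    return math.exp(log_pdf)


def mixed_pdf(d, k, lam1, lam2, p, c=math.pi ** 2):
    """Mixture of a dense component (lam1) and a sparse component (lam2)."""
    return p * kth_distance_pdf(d, k, lam1, c) + (1.0 - p) * kth_distance_pdf(d, k, lam2, c)


def integrate(func, upper, steps=20000):
    """Simple Riemann sum on (0, upper), used only as a sanity check."""
    h = upper / steps
    return sum(func(i * h) for i in range(1, steps)) * h


# Both the single-component pdf and the mixture should integrate to about 1.
print(integrate(lambda d: kth_distance_pdf(d, 3, 1.0), 3.0))
print(integrate(lambda d: mixed_pdf(d, 3, 100.0, 1.0, 0.4), 3.0))
```

When the two densities differ strongly, the mixture has two well-separated modes, which is the property the decomposition step relies on.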

B. PARAMETER EVALUATION FOR DECOMPOSING FLOWS OF DIFFERENT DENSITIES
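Once the density domain model of flows is generated, the next step is to estimate the parameters and decompose the flows into components of different densities. In this process, an EM algorithm [49], [50] is applied to estimate the parameters ($\lambda_1$, $\lambda_2$, $p$) based on the histogram of the observed k-th nearest flow distances (Fig. 3). The algorithm can be summarized as follows.
E-Step:
$$\delta_i^{t+1} = \frac{p^{t} f_{D_k}(d_i; k, \lambda_1^{t})}{p^{t} f_{D_k}(d_i; k, \lambda_1^{t}) + (1 - p^{t}) f_{D_k}(d_i; k, \lambda_2^{t})}$$
M-Step:
$$\lambda_1^{t+1} = \frac{k \sum_{i=1}^{n} \delta_i^{t+1}}{\sum_{i=1}^{n} \delta_i^{t+1} V_{N_{d_i}}}, \qquad \lambda_2^{t+1} = \frac{k \sum_{i=1}^{n} (1 - \delta_i^{t+1})}{\sum_{i=1}^{n} (1 - \delta_i^{t+1}) V_{N_{d_i}}}, \qquad p^{t+1} = \frac{1}{n} \sum_{i=1}^{n} \delta_i^{t+1}$$
where $d_i$ is the observed k-th nearest flow distance of flow $f_i$, $f_{D_k}(d; k, \lambda)$ is the pdf of the k-th nearest flow distance for a homogeneous flow set with density $\lambda$, $V_{N_{d_i}}$ is the volume of the neighborhood with radius $d_i$, $n$ is the number of flows and $t$ is the iteration index. $\delta_i^{t+1}$ is the probability that flow $f_i$ is a dense flow. If $\delta_i^{t+1} \ge 0.5$, flow $f_i$ is marked as a dense flow; otherwise, it is marked as a sparse flow. In this step, parameters $\lambda_1$, $\lambda_2$ and $p$ are estimated by the iteration process, and parameter $k$ is selected according to the fitting accuracy of $D_k(\varepsilon)$. Therefore, the density domain model of flows can be determined, and the flows can be decomposed into a dense part and a sparse part.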
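The iteration can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes a neighborhood volume $c \, d^4$ with $c = \pi^2$, uses the closed-form weighted maximum-likelihood update $\lambda = k \sum w_i / \sum w_i V_i$ in the M-step, and initializes the densities from the lower and upper quartiles of the observed distances. The function names, the initialization and the simulated data are all assumptions.

```python
import math
import random


def responsibility(d, k, lam1, lam2, p, c):
    """E-step: probability that a flow with k-th distance d is a dense flow."""
    # log of [(1-p) f(d; lam2)] / [p f(d; lam1)]; factors shared by both pdfs cancel.
    log_ratio = (
        math.log((1.0 - p) / p)
        + k * math.log(lam2 / lam1)
        + (lam1 - lam2) * c * d ** 4
    )
    if log_ratio > 700.0:  # exp() would overflow; the flow is clearly sparse
        return 0.0
    return 1.0 / (1.0 + math.exp(log_ratio))


def fit_em(dists, k, c=math.pi ** 2, max_iter=500, tol=1e-9):
    """Estimate (lam1, lam2, p); no guard against degenerate components."""
    n = len(dists)
    vols = [c * d ** 4 for d in dists]
    ordered = sorted(dists)
    lam1 = k / (c * ordered[n // 4] ** 4)        # assumed start: lower quartile
    lam2 = k / (c * ordered[(3 * n) // 4] ** 4)  # assumed start: upper quartile
    p = 0.5
    delta = []
    for _ in range(max_iter):
        delta = [responsibility(d, k, lam1, lam2, p, c) for d in dists]
        s1 = sum(delta)
        # M-step: weighted maximum-likelihood updates.
        new_lam1 = k * s1 / sum(w * v for w, v in zip(delta, vols))
        new_lam2 = k * (n - s1) / sum((1.0 - w) * v for w, v in zip(delta, vols))
        new_p = min(max(s1 / n, 1e-9), 1.0 - 1e-9)
        change = max(abs(new_lam1 - lam1) / lam1, abs(new_lam2 - lam2) / lam2, abs(new_p - p))
        lam1, lam2, p = new_lam1, new_lam2, new_p
        if change < tol:
            break
    return lam1, lam2, p, delta


def simulate_kth_distances(n, k, lam, c=math.pi ** 2, rng=random):
    """Under the Poisson model, lam * c * D_k**4 follows a Gamma(k, 1) law."""
    return [(rng.gammavariate(k, 1.0) / (lam * c)) ** 0.25 for _ in range(n)]


rng = random.Random(0)
k = 5
dists = simulate_kth_distances(800, k, 100.0, rng=rng) + simulate_kth_distances(1200, k, 1.0, rng=rng)
lam1, lam2, p, delta = fit_em(dists, k)
dense = [w >= 0.5 for w in delta]  # flows marked as dense
print(round(lam1, 2), round(lam2, 3), round(p, 3), sum(dense))
```

With a 100-fold density contrast, as in the simulation experiments, the two components barely overlap, so the estimates settle after a few iterations. Closer densities would make the result more sensitive to the starting values.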

C. IDENTIFYING FLOW CLUSTERS BASED ON DENSITY-CONNECTED CLUSTERING
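After the decomposition process, we can filter out noise (sparse flows) from the flow set and obtain candidate features (dense flows) that can be collected into flow clusters. In this process, the classic DBSCAN algorithm can be adapted to identify flow clusters by modifying the density-connected concept for flows [41]. Here, flow $f_p$ is defined as density-connected to flow $f_q$ with respect to (wrt) $Eps$ and $MinPts$ if there is a chain of flows $f_{p_1}, f_{p_2}, \ldots, f_{p_n}$, with $p_1 = q$ and $p_n = p$, such that $|N_{Eps}(f_{p_i})| \ge MinPts$ $(i = 2, 3, \ldots, n-1)$ and $f_{p_{i-1}} \in N_{Eps}(f_{p_i})$ $(i = 2, 3, \ldots, n)$. On this basis, a flow cluster can be identified as a set in which all features are density-connected to each other. Parameter $MinPts$ is equal to $k$, and parameter $Eps$ can be estimated by the following formula:
$$p \, f_{D_k}(Eps; k, \lambda_1) = (1 - p) \, f_{D_k}(Eps; k, \lambda_2)$$
Then:
$$Eps = \left[ \frac{\ln\frac{p}{1-p} + k \ln\frac{\lambda_1}{\lambda_2}}{c \, (\lambda_1 - \lambda_2)} \right]^{1/4}$$
where $f_{D_k}(\cdot; k, \lambda)$ is the pdf of the k-th nearest flow distance for density $\lambda$ and $c = V_{N_\varepsilon}/\varepsilon^4$ is the volume coefficient of the flow sphere ($\pi^2$ for the Chebyshev distance and $\pi^2/6$ for the Manhattan distance).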
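The density-connected grouping can be sketched as a small DBSCAN-style routine over the dense flows. Several points are assumptions made for illustration: the Chebyshev flow distance with Euclidean endpoint distances, a neighbor count that excludes the flow itself, and an `estimate_eps` helper that places Eps at the distance where the two weighted mixture components are equal (the same point at which the EM membership probability equals 0.5). None of the names below come from the original paper.

```python
import math
from collections import deque


def flow_distance(f, g):
    """Chebyshev flow distance: max of origin and destination distances."""
    return max(math.dist(f[0], g[0]), math.dist(f[1], g[1]))


def estimate_eps(k, lam1, lam2, p, c=math.pi ** 2):
    """Distance where p*f(d; lam1) equals (1-p)*f(d; lam2) (assumed criterion)."""
    numerator = math.log(p / (1.0 - p)) + k * math.log(lam1 / lam2)
    return (numerator / (c * (lam1 - lam2))) ** 0.25


def cluster_dense_flows(flows, eps, min_pts):
    """Label flows with cluster ids 0, 1, ...; -1 marks unclustered flows."""
    n = len(flows)
    neighbors = [
        [j for j in range(n) if j != i and flow_distance(flows[i], flows[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None = not visited yet
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # not a core flow; may later become a border flow
            continue
        labels[i] = cluster_id
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster_id  # border flow reached from a core flow
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbors[j]) >= min_pts:  # core flow: keep expanding
                queue.extend(neighbors[j])
        cluster_id += 1
    return labels


group_a = [((0.1 * i, 0.0), (10.0, 10.0 + 0.1 * i)) for i in range(4)]
group_b = [((50.0 + 0.1 * i, 50.0), (20.0, 20.0 + 0.1 * i)) for i in range(4)]
outlier = [((100.0, 100.0), (0.0, 0.0))]
labels = cluster_dense_flows(group_a + group_b + outlier, eps=1.0, min_pts=3)
print(labels)
print(estimate_eps(5, 100.0, 1.0, 0.4))
```

Because clusters grow through chains of neighboring core flows, the routine places no constraint on cluster shape, which is what allows elongated or curved flow clusters to be recovered.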

IV. SIMULATION EXPERIMENTS
In this section, we design and analyze a Monte Carlo simulation experiment with 100 sets of simulated flow data to validate our method. Each dataset is composed of noise and three clusters with different shapes (i.e., bar-strip, "S"-"C" and "O"-"+"), coded as C-1, C-2 and C-3, respectively. The density of each flow cluster is much higher than that of the noise (100 times greater). An example of the simulated dataset is displayed in Figure 4. In this experiment, each dataset is processed by our method, and a significance test is designed to validate all the identified flow clusters. In this test, the A-w statistic of flow clusters [47] is used to determine whether the density of the flow clusters differs from that of the noise.
These simulated datasets are also evaluated by three other flow clustering methods for comparison: a hierarchical clustering method, a density-based clustering method and a spatial statistics-based clustering method. For the hierarchical clustering method, we use the algorithm proposed by Guo in 2014 [7]. This algorithm iteratively merges flows to form a hierarchy of flow clusters and is believed to be effective at aggregating spatial flows and simplifying flow sets into groups [37]. Its only parameter, k, is set to ensure that, on average, each flow has five flow neighbors. For the density-based clustering method, we apply the classic trajectory clustering technique proposed in 2006 [39]. This method takes each flow as a trajectory with only two points and adopts an improved OPTICS algorithm to identify the flow clustering structure. Parameter MinPts is set empirically (see TABLE I), and parameter Eps is set as the average k-th distance of flows. For the statistics-based clustering method, we use a local version of the L-function (the L-function is the normalization of Ripley's K-function) to identify flow clusters within the simulated dataset. This algorithm first identifies the aggregation scale parameter (maxL) based on the global L-function and then calculates the local L values using this scale. Finally, flows with the top 1% of local L values are merged as the dominant cluster. This approach has been shown to be useful for detecting spatial clustering patterns in flow data [6], [31]. These experimental settings are listed in TABLE I.
Since all methods can successfully identify significant flow clusters, we analyze the average identification rates shown in TABLE II for comparison. From these results, we can see that our method outperforms the other methods in terms of the overall identification rate and almost all F1 measures. The other three methods have good recall rates, but, in general, their precisions are unsatisfactory (less than 90%). It is worth noting that the precision of our method is much higher than those of the other methods, especially for the "S"-"C"-shaped flow cluster, and the shape of the identified flow cluster is reconstructed very well (Fig. 5). When identifying irregularly shaped flow clusters, higher precision means that the original shape of the flow cluster is better maintained. Thus, despite a slightly lower recall rate, we believe that our method is generally superior at identifying irregularly shaped flow clusters.
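The precision, recall and F1 measures used in the comparison can be computed per cluster as in the sketch below. It assumes that each simulated cluster and each identified cluster is available as a collection of flow identifiers; the function name and the toy identifiers are illustrative and do not reproduce any value from the experiments.

```python
def precision_recall_f1(true_members, found_members):
    """Compare one simulated cluster with the cluster identified for it."""
    true_set, found_set = set(true_members), set(found_members)
    hits = len(true_set & found_set)  # flows correctly assigned to the cluster
    precision = hits / len(found_set) if found_set else 0.0
    recall = hits / len(true_set) if true_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy case: 100 true flows; the method returns 110 flows, 90 of them correct.
true_cluster = range(0, 100)
found_cluster = range(10, 120)
precision, recall, f1 = precision_recall_f1(true_cluster, found_cluster)
print(precision, recall, f1)
```

A method that absorbs surrounding noise into a cluster keeps a high recall but loses precision, which matches the pattern reported for the three comparison methods.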

V. CASE STUDY
A. DATA DESCRIPTION
We apply our flow clustering method to taxi trip data in Beijing to identify different flow cluster patterns of daily traffic. Our dataset contains records of GPS trajectories from more than 25,000 taxis (more than 1/3 of all taxis in Beijing). Each record is described by five fields: <taxi ID, current time, longitude, latitude, status>. Thus, the OD flows of each taxi can be extracted according to changes in status. Here, we choose six region pairs as study areas within Beijing during different periods to discover different flow patterns of residents' travel behaviors in Beijing (Figure 6). The first two region pairs, A and B, are two main commuter flows with short distances distributed between the east and west districts of Beijing, respectively. The next two region pairs, C and D, are return flows from transportation hubs, namely an airport (C) and Beijing South Railway Station (D). The last two region pairs, E and F, are traffic flows on two special days, namely National Day (E) and Qingming Festival (F). TABLE III shows the study areas and descriptions of the data used in our case study.
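The extraction of OD flows from status changes can be sketched as follows. The encoding is an assumption: status 1 is taken to mean that the taxi is occupied and status 0 that it is vacant, the origin is the first occupied record after a vacant one, and the destination is the first vacant record after an occupied one. The record layout follows the five fields listed above; the function name and sample coordinates are hypothetical.

```python
def extract_od_flows(records):
    """Turn (taxi_id, time, lon, lat, status) records into OD flows.

    Assumes status 1 = occupied and 0 = vacant; trips already under way at a
    taxi's first record are skipped because their origin is unknown.
    """
    flows = []
    last_status = {}  # taxi_id -> status of the previous record
    pickups = {}      # taxi_id -> (lon, lat) of the current trip's origin
    for taxi_id, _time, lon, lat, status in sorted(records, key=lambda r: (r[0], r[1])):
        previous = last_status.get(taxi_id)
        if previous == 0 and status == 1:
            pickups[taxi_id] = (lon, lat)  # passenger picked up
        elif previous == 1 and status == 0 and taxi_id in pickups:
            flows.append((pickups.pop(taxi_id), (lon, lat)))  # passenger dropped off
        last_status[taxi_id] = status
    return flows


records = [
    ("A", 1, 116.30, 39.90, 0),
    ("A", 2, 116.31, 39.91, 1),
    ("A", 3, 116.35, 39.95, 1),
    ("A", 4, 116.40, 39.99, 0),
    ("B", 1, 116.20, 39.80, 1),  # already occupied at first record: skipped
    ("B", 2, 116.25, 39.85, 0),
]
print(extract_od_flows(records))
```

Each extracted flow keeps only the two endpoints, which is all the clustering method needs; the intermediate trajectory points are discarded.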

B. FLOW CLUSTER RESULTS
The flow cluster results are shown in Figure 7. In the morning peak commuter flows of Wangjing, flow clusters from three different residential communities to the Taiyanggong subway station are identified (Fig. 7a). The origins of these flow clusters include school district communities along the Wangjing North Road (Wangxin Garden and Shangjing New Route, blue flow clusters in Fig. 7a), residential communities in Wangjingxiyuayan (yellow flow clusters in Fig. 7a) and business-living buildings in the Huajiadi community. The Taiyanggong subway station (subway line 10) is commonly used by residents in the Wangjing area. For commuter flows from the university area to Zhongguancun, only one flow cluster is identified, from Wudaokou to Zhongguancun (Fig. 7b). The origins of this flow cluster are mainly distributed throughout the Dongsheng Science and Technology Park and Wudaokou commercial district, whereas the destinations are relatively dispersed, mainly distributed in the commercial center and residential communities in Zhongguancun. These results reflect the daily commuting behaviors in northwestern and northeastern Beijing. Figures 7c and 7d show the return flows from two transportation hubs at night on weekdays. Two flow clusters are identified from the airport to areas outside the East 3rd Ring Road (Fig. 7c). The destinations are distributed in the residential communities near the Sihui Bridge (red arrows in Fig. 7c) and Chaoyang Joy City, including the Yuanyangguoji community, Ciyunli community and Pearl Rome Jiayuan community. For the return flows from Beijing South Railway Station, three flow clusters are identified; the destinations of these flow clusters are mainly distributed in Beijing West Railway Station, the communities around the Guangqumen Bridge and the Panjiayuan community. From these results, we can see that the return flows from the airport are more concentrated than the flows from a railway transportation hub.
Figures 7e and 7f show the flow clusters on two special days. Flows from the 3rd Ring Road to the 2nd Ring Road are composed of three flow clusters in Figure 7e. The red flow cluster and yellow flow cluster are morning tourist flows from residential communities along the East and West 3rd Ring Roads to Tiananmen, respectively, whereas the blue flow cluster may represent travel flows from the Sanyuan Bridge to Beijing Railway Station. The flows in Figure 7f mainly represent trips to Babaoshan for those seeking to sweep graves during the Qingming Festival. The origins of these flow clusters are mainly distributed in several communities near Wukesong and Beijing West Railway Station, including the Mingrijiayuan, Jiujiefang, and Muxidinanli communities. These results reflect purposeful travel behaviors on special days.

VI. CONCLUSION AND FUTURE WORK
In this study, we propose a method for flow clustering based on the density domain decomposition of flows. This method identifies arbitrarily shaped flow clusters with high accuracy and does not require manual parameter adjustment in the clustering process. Simulation experiments show that our method outperforms three commonly used methods in terms of the overall identification rate and almost all F1 measures. The proposed method is applied to taxi trip data in Beijing as a case study, and it can also be easily extended to other OD flow data, such as resident travel paths, migration data and logistics data, which may help to provide information for urban management and regional planning.
However, this method has some limitations. First, the parameter estimation process is very time consuming because we must traverse all possible parameters and then choose the optimal one for decomposing the flow set. Second, in our method, the flow set is assumed to be composed of two components, features and noise. Thus, some OD flows with different densities may not be separated because they are all considered features. Future research will focus on the analysis of superposed flow sets, which may contain more than two homogeneous flow sets with different densities, and on improving the calculation efficiency.

TAO PEI received the Ph.D. degree from the China University of Geosciences, in 1998. He is currently a Professor with the State Key Laboratory of Resources and Environmental Information System, Institute of Geographical Sciences and Natural Resources Research. His research interests include spatial big data mining and geostatistics.

HUA SHU received the B.S. degree from Shaanxi Normal University, in 2013. He is currently pursuing the Ph.D. degree with the University of Chinese Academy of Sciences. His research interests include spatiotemporal big data mining, mobile computing, and geographic information science.