ATR: Automatic Trajectory Repairing With Movement Tendencies

GPS trajectories are often embedded with errors due to weather or environmental variables. Existing trajectory repairing methods employ Kalman filters or sequential data cleaning methods. The Kalman filter and its variants change all observed measurements, even though most measurements are originally accurate. Sequential data cleaning methods are mainly applied to one-dimensional data sequences; when encountering multi-dimensional trajectories, their performance is compromised because the features of multi-dimensional trajectories are not fully utilized. To address these issues, we propose to repair GPS trajectories with three movement tendencies: speed change tendency, travel distance tendency and repair distance tendency. We formalize tendency-based trajectory repairing and propose an exact solution to find the repair that minimizes the movement tendency score. We then propose high quality candidate selection and dynamic error range estimation to improve the efficiency and effectiveness of the exact solution. Experiments on three data sets demonstrate the superiority of our proposal.


I. INTRODUCTION
With the proliferation of mobile devices, massive trajectory data are generated to record GPS positions. Such trajectory data have been widely used by many applications, such as urban planning, intelligent transportation and mobility pattern analytics [21]. One fundamental requirement to enable these applications is the high quality of GPS positions. Though the accuracy of GPS positions has reached less than five meters in many areas, positions located around high buildings or inside tunnels suffer from high spatial errors [3]. How to ensure data quality when trajectory data contain errors is the main focus of this paper.
In the literature, the classic Kalman filter [1] and its variants such as EKF [7] and UKF [14] have been used to clean outliers. However, they could falsely modify truly correct positions. Map-matching techniques [12], [17] project GPS points onto road networks, but are unable to work with the trajectories of free-space moving objects. Recent works such as [15], [19] rely on meaningful speed constraints: they first identify the violations of such constraints, next repair the outlier points, and finally generate a cleaned sequence (which hence satisfies the constraints). In particular, the work SDC [19] aims to repair GPS trajectories by maximizing the likelihood of consecutive speed changes within the constraint of a certain total repair distance (a.k.a. the budget). Unfortunately, such repair-based approaches suffer from the two following issues. 1) Ineffectiveness: it is hard to tune the input budget parameter; either a greater or a smaller budget leads to rather different repair results. We will soon illustrate this issue by an example. 2) Inefficiency: these works simply set a fixed error radius to select repair candidates. Typically, a rather high error radius makes sure that no candidates are falsely missed. However, such a radius could select many unnecessary candidates and lead to much higher computation overhead.
We now give the following example to illustrate the ineffectiveness issue above.
Example 1: Given an observation trajectory T = {p_1, p_2, ..., p_7} in Figure 1, three dirty points p_2, p_4 and p_5 appear at time points 3, 11 and 14, respectively. We plot the trajectory T in black and the corresponding ground truth points in yellow. In addition, we give the proposed repair (namely ATR) of T as a blue dashed line, and the state-of-the-art approach SDC (in green and purple) with two repair budgets, 1 and 3 (i.e., SDC-1 and SDC-3). For the five trajectories above (i.e., the observation, ground truth, SDC-1, SDC-3 and ATR), we compute the overall speed change SC and travel distance TD by the following equations, where v_i = Dist(p_{i-1}, p_i)/(t_i - t_{i-1}): SC = Σ_{i∈[2,n-1]} |v_{i+1} - v_i| and TD = Σ_{i∈[2,n]} Dist(p_{i-1}, p_i). From Table 1, we have the following findings. 1) The speed change and travel distance of the ground truth trajectory are the smallest among all five trajectories. This makes sense because mobile users usually tend to move on short paths with trivial speed change. 2) For the observation trajectory, due to noise and errors, the associated speed change and travel distance are much higher than those of the ground truth trajectory. 3) Given the various budgets 1 and 3, the two SDC approaches lead to different speed changes and travel distances. For example, with a smaller budget 1, SDC-1 leads to a smaller travel distance, and vice versa. Thus, the SDC approach essentially involves a trade-off between the travel distance and the speed change, and how to tune a reasonable budget is non-trivial. 4) Finally, our approach ATR repairs the observation trajectory with the two values much closer to the ground truth.
Besides the speed change and travel distance features, the repair of an observation trajectory T should not significantly modify the observation points, which is consistent with the minimum change principle (MCP) in the data repair community [2], [19]. The repair distance of a repaired trajectory T' of T is computed as Σ_{i∈[1,n]} Dist(p_i, p'_i), with p_i ∈ T and p'_i ∈ T'. Thus, the repair distance should be as small as possible.
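To make the measures in Example 1 concrete, the sketch below computes the overall speed change SC and travel distance TD of a trajectory. It assumes a local planar frame with Euclidean coordinates; the paper works with GPS longitude/latitude, so a projection would be needed in practice.

```python
import math

def dist(p, q):
    """Euclidean distance between two points (x, y)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def speed_change_and_travel(traj, ts):
    """Overall speed change SC and travel distance TD, where the speed
    v_i = Dist(p_{i-1}, p_i) / (t_i - t_{i-1})."""
    td = sum(dist(traj[i - 1], traj[i]) for i in range(1, len(traj)))
    sc = sum(
        abs(dist(traj[i], traj[i + 1]) / (ts[i + 1] - ts[i])
            - dist(traj[i - 1], traj[i]) / (ts[i] - ts[i - 1]))
        for i in range(1, len(traj) - 1))
    return sc, td

# A uniform straight-line trajectory has zero speed change.
sc, td = speed_change_and_travel([(0, 0), (1, 0), (2, 0)], [0, 1, 2])
print(sc, td)  # 0.0 2.0
```

Consistent with finding 1) of Example 1, a trajectory that deviates from a short, uniform path yields larger SC and TD values.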

A. PROPOSAL
To address these issues, we propose an automatic trajectory repairing (ATR) approach which can be applied to free-space moving objects. Unlike previous works such as SDC, which require non-trivial effort to tune input parameters such as budgets, we propose to repair observation trajectories so as to minimize the speed change, travel distance and repair distance. We define an overall score function to unify the three items, and repair the observation trajectories with the purpose of minimizing the overall score.
Next, in terms of the inefficiency issue, instead of setting a fixed error radius to select candidates, we exploit a well-trained regression model to adaptively tune an error radius for every GPS point. In this way, we greatly reduce the number of unnecessary candidates for much higher efficiency.

B. CONTRIBUTION
To address challenges above, our major contributions in this paper are summarized as follows.
• We design an overall movement score to unify three tendencies including speed change, travel distance and repair distance. When given an observation trajectory, we study the problem of finding a repair trajectory to minimize the overall movement score. Our approach to solve this problem thus can effectively overcome the non-trivial effort of tuning parameters.
• We develop efficient algorithms to solve the studied problem: a dynamic programming-based pseudo-polynomial algorithm and its improvement, together with a regression model that tunes an error radius for each GPS point to avoid selecting too many candidates, leading to higher computation efficiency.
• We validate our algorithms on three data sets to study the trade-off between repair accuracy and computation efficiency. The results demonstrate that our proposal outperforms both classic filtering methods and state-of-the-art repair approaches.
The rest of the paper is organized as follows. Section II first defines the movement tendencies and the repair problem. Section III then gives the exact solution of the repair problem and an improvement. Section IV next presents a regression model to tune the error radius. After that, Section V reports evaluation results and Section VI reviews related work. Section VII finally concludes this paper.

II. PROBLEM SETTING
A. PROBLEM STATEMENT
Consider a trajectory T = {p_1, p_2, ..., p_n} consisting of n GPS points. For brevity, we denote by p_{i...j} the sub-trajectory p_i, p_{i+1}, ..., p_j with 1 ≤ i ≤ j ≤ n. Each point p_i = (x_i, y_i) contains the GPS longitude and latitude coordinates. In addition, the point p_i has a time stamp t_i and an error radius θ_i. The error radius means that the distance between p_i and its true position p^t_i is not greater than the individual radius θ_i. The literature [19] usually sets a fixed radius for all points; for example, simply using the maximal radius θ_max among all points avoids falsely missed candidates, but at the cost of selecting too many candidates. To repair a dirty point p_i ∈ T, we first need to select the candidates used to repair p_i. Such candidates are those points whose distance to p_i is within the error radius θ_i. As shown in Figure 2, we intuitively consider that those candidates are within a circle with p_i as the center and θ_i as the radius. To generate a finite candidate set C_i for p_i, we use a granularity parameter ε and divide the circle into grid cells, where the cell width is ε. In this way, we take the lattice points within the circle as the candidate set C_i. The formal definition of the candidate set C_i of p_i is as follows.
From the previous work [9], we have the following lemma to give the size of a candidate set C i as follows.
Lemma 1: The size of a candidate set C_i is |C_i| = π(θ_i/ε)² + E(θ_i/ε). In the lemma above, the term E(θ_i/ε) has a relatively small absolute value, and we have |E(θ_i/ε)| ≤ 2√2 π θ_i/ε. Thus, the size of a candidate set can be approximated by the area of the circle, i.e., π(θ_i/ε)². Now, for each point p_i ∈ T and one candidate p'_i ∈ C_i with 1 ≤ i ≤ n, we define a movement tendency score M_i to measure the goodness of this candidate p'_i for repairing p_i.
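As an illustration of Lemma 1, the following minimal sketch enumerates the lattice candidates inside the circle. Local planar coordinates are assumed, and a small tolerance keeps boundary points despite floating-point rounding.

```python
def candidate_set(px, py, theta, eps):
    """Lattice points of a grid with cell width eps that lie inside the
    circle of radius theta centred at (px, py)."""
    m = int(theta / eps) + 1
    cands = []
    for a in range(-m, m + 1):
        for b in range(-m, m + 1):
            # tolerance keeps boundary points despite rounding errors
            if (a * eps) ** 2 + (b * eps) ** 2 <= theta ** 2 + 1e-9:
                cands.append((px + a * eps, py + b * eps))
    return cands

# theta = 2, eps = 0.2: the setting later discussed in Section III-B
c = candidate_set(0.0, 0.0, 2.0, 0.2)
print(len(c))  # 317, close to pi * (theta/eps)^2 ~ 314
```

The count 317 matches the |C_2| value reported in Section III-B and lies within the error bound |E(θ/ε)| of the circle-area approximation.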
In the equation above, we introduce the three neighbouring candidates p'_{i-1}, p'_i and p'_{i+1} to compute the score M_i. In this way, we can compute three tendency scores with respect to the repair distance (D), travel distance (L) and speed change (V). The details of the three tendency scores will be given in Section II-B.
Until now, for each point p i within an observation trajectory T , we have an associated candidate set C i . Given the candidate sets for all points within T , we define the following problem to repair T .
Problem 1: Given an observation trajectory T consisting of n points p_i with 1 ≤ i ≤ n, we need to find a repaired trajectory T' with n candidate points p'_i corresponding to the points p_i ∈ T, such that the overall movement tendency score M(T') is minimized.

B. COMPUTATION OF MOVEMENT TENDENCIES
In this subsection, we give the details of the three movement tendencies w.r.t. repair distance (D), travel distance (L) and speed change (V). After that, in the next subsection, we normalize these tendencies.

1) REPAIR DISTANCE TENDENCY
Given an observation point p_i and its candidate p'_i, we compute the associated repair distance between p_i and p'_i, i.e., Dist(p_i, p'_i). For an observation trajectory T and a candidate repairing trajectory T', we then compute the overall repair distance by Dist(T, T') = Σ_{i=1}^{n} Dist(p_i, p'_i). Following the minimum change principle (MCP) [2], we expect that the overall repair distance Dist(T, T') should be small; otherwise, the repaired trajectory T' could distort the observation T. The rationale behind MCP is as follows: though observation data may contain noise, such observations more or less indicate the true data. Thus, it is meaningful to minimize the overall repair distance Dist(T, T').

2) TRAVEL DISTANCE TENDENCY
Differing from the repair distance above, we now consider the candidates of a point p_i and of its predecessor p_{i-1}, denoted by p'_i and p'_{i-1}, and compute the travel distance Dist(p'_{i-1}, p'_i). After that, given the trajectories T and T', we compute the overall travel distance Dist(T') = Σ_{i=2}^{n} Dist(p'_{i-1}, p'_i). We expect that the selected candidates p'_i and p'_{i-1} should lead to a small travel distance Dist(p'_{i-1}, p'_i) and thus a small overall travel distance Dist(T'). This makes sense because people usually choose short travel paths for energy and time saving. Note that we could instead compute the travel distance by point p_i and its successor p_{i+1}, rather than its predecessor p_{i-1}, without degrading the proposed approach.

3) SPEED CHANGE TENDENCY
Let us consider the point p_i, its predecessor p_{i-1} and successor p_{i+1}, and the corresponding candidates p'_i, p'_{i-1} and p'_{i+1}. To compute the speed change at p'_i, we first denote the two speeds v_i = Dist(p'_{i-1}, p'_i)/(t_i - t_{i-1}) and v_{i+1} = Dist(p'_i, p'_{i+1})/(t_{i+1} - t_i), and then take the speed change |v_{i+1} - v_i|. It is not hard to find that the change of speeds at neighbouring points should not be significant in order to ensure the safe driving of moving objects.
We give the following example to compute the three tendencies.
Example 3: In this example, we use the point p_2 in Figure 2 to show how the three movement tendencies above help trajectory repair. For simplicity, we assume the error radii θ_1 = θ_3 = 0, so that no repair is needed for the points p_1 and p_3 and we only need to consider the repair of p_2, given the candidate set C_2 = {p_{2,1}, p_{2,2}, ..., p_{2,13}} (see Example 2). Among the candidates in C_2, we want to find the best one to repair p_2. To this end, we take into account the three movement tendencies as follows.
Firstly, depending upon the repair distance tendency, we have a sorted list of candidates in ascending order of the repair distance: p_{2,7} → p_{2,3} → p_{2,6} → p_{2,8} → p_{2,11} → ... → p_{2,1}. In terms of the repair distance, the candidate p_{2,7} is the best option due to the repair distance Dist(p_{2,7}, p_2) = 0. Secondly, if we compute the travel distance, we similarly have a sorted list of candidates in ascending order: p_{2,11} → p_{2,12} → p_{2,8} → p_{2,13} → ... → p_{2,1}. Finally, the sorted list of candidates in ascending order of the speed change is p_{2,10} → p_{2,13} → p_{2,11} → ... → p_{2,9}. Taking the three sorted lists together, we compute the overall movement tendency score M of each candidate, and the candidate p_{2,11} is selected as the best one to repair p_2.
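The three raw tendency scores for one candidate can be sketched as follows, given the neighbouring candidates and time stamps. This is an unnormalized sketch; the paper further normalizes the terms and unifies them into M_i, so the implicit equal weighting here is an assumption.

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def tendency_scores(p_obs, c_prev, c_cur, c_next, t_prev, t_cur, t_next):
    """Raw repair distance (D), travel distance (L) and speed change (V)
    scores for candidate c_cur repairing observation p_obs."""
    d = dist(p_obs, c_cur)                        # repair distance tendency
    l = dist(c_prev, c_cur)                       # travel distance tendency
    v_in = dist(c_prev, c_cur) / (t_cur - t_prev)
    v_out = dist(c_cur, c_next) / (t_next - t_cur)
    v = abs(v_out - v_in)                         # speed change tendency
    return d, l, v

# Moving a spiked observation (1, 1) back onto the straight line yields
# zero speed change and a short travel path, at repair distance 1.
d, l, v = tendency_scores((1, 1), (0, 0), (1, 0), (2, 0), 0, 1, 2)
print(d, l, v)  # 1.0 1.0 0.0
```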

C. NORMALIZATION OF MOVEMENT TENDENCIES
In this section, we give the details of normalizing the three tendencies mentioned above. Since the normalized travel and repair distances require subitems that appear within the normalized speed change, we first introduce the normalized speed change, then the normalized travel distance and finally the normalized repair distance.
Firstly, the speed change tendency of point p i depends upon p i 's predecessor p i−1 and successor p i+1 . We thus leverage the candidates of p i−1 and p i+1 together to normalize the speed change tendency.
Note that p_1 has no predecessor and p_n has no successor. We thus add a head p_0 = p_1 and a tail p_{n+1} = p_n to trajectory T, and meanwhile set their time stamps and error radii by t_0 = t_1 - 1, t_{n+1} = t_n + 1 and θ_0 = θ_{n+1} = 0. Thus, we have the candidate sets C_0 = {p_0} and C_{n+1} = {p_{n+1}}. Secondly, since the travel distance tendency only takes into account the point p_i and its predecessor p_{i-1}, we have the normalized tendency as below.
In the equation above, the subitem 1/|C_{i+1}|, consistent with one subitem in Equation 3, makes sure that the tendencies in the two equations above are within the same scale.
Finally, in terms of the repair distance tendency, we again add two subitems with respect to the number of candidates of p_{i-1} and p_{i+1} into the following equation.

III. EXACT SOLUTION
In this section, we first give an exact solution to solve Problem 1 by a pseudo-polynomial algorithm. This algorithm leverages the dynamic programming technique to repair the observation trajectory T by a candidate one T with the purpose to minimize the overall movement tendency score. Next, to improve the repair efficiency, we propose to reduce the amount of used candidates.

A. DYNAMIC PROGRAMMING
We show the basic idea of this exact algorithm as follows.
Since the speed change requires a triple of neighbouring candidates p'_{i-1}, p'_i, p'_{i+1}, we consider all possible positions p'_{i-2}, p'_{i-1}, p'_i in each recurrence of the dynamic programming. Next, consider a repair of the sub-trajectory p_{1...i}: we denote by F(i, p'_{i-1}, p'_i) the minimum overall movement tendency score M(p_{1...i-1}) among all repairs of p_{1...i} ending with the candidates p'_{i-1} and p'_i. Since the successor candidate p'_{i+1} is still unavailable at step i, the score of p_i itself is deferred to the next step, leading to the recurrence F(i, p'_{i-1}, p'_i) = min_{p'_{i-2} ∈ C_{i-2}} { F(i-1, p'_{i-2}, p'_{i-1}) + M_{i-1} }. Algorithm 1 gives the pseudocode of the dynamic programming-based approach (input: trajectory T, error radius θ, granularity ε; output: the minimum score of the optimal repair), which first adds the head p_0 and the tail p_{n+1} to T. By this algorithm, we find F(n + 1, p'_n, p'_{n+1}) to be the minimum score among all possible candidates p'_n and p'_{n+1}, and by backtracking the recorded choices (the arrays F and trace) we recover the optimal repair, where |C_max| is the maximum size of the candidate sets, i.e., |C_max| = max_{i=1}^{n} |C_i|.
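The recurrence above can be sketched as follows. The state keeps the last two candidate choices (p'_{i-1}, p'_i); `m_score` is a simplified, unnormalized stand-in for the paper's M_i (the real score is normalized as in Section II-C), and only the minimum score is returned, omitting the trace/backtracking step.

```python
import math
from itertools import product

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def m_score(obs, a, b, c, dt1, dt2):
    # Simplified, unnormalized tendency score for the triple (a, b, c)
    # repairing observation obs: repair distance + travel distance
    # + speed change (the paper normalizes each term).
    return (dist(obs, b) + dist(a, b)
            + abs(dist(b, c) / dt2 - dist(a, b) / dt1))

def dp_repair_score(obs, ts, cands):
    """Minimum total score over all candidate sequences, keeping the
    last two choices (p'_{i-1}, p'_i) as the dynamic-programming state."""
    n = len(obs)
    # F[(j, k)]: best prefix score ending with candidate j at i-1, k at i
    F = {(j, k): 0.0
         for j in range(len(cands[0])) for k in range(len(cands[1]))}
    for i in range(1, n - 1):
        G = {}
        for k in range(len(cands[i])):
            for l in range(len(cands[i + 1])):
                G[(k, l)] = min(
                    F[(j, k)] + m_score(obs[i], cands[i - 1][j],
                                        cands[i][k], cands[i + 1][l],
                                        ts[i] - ts[i - 1], ts[i + 1] - ts[i])
                    for j in range(len(cands[i - 1])))
        F = G
    return min(F.values())
```

On tiny inputs the result can be cross-checked against a brute-force minimum over all candidate combinations via `itertools.product`, which confirms the second-order recurrence.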

B. IMPROVEMENT
The time complexity of the exact solution (Algorithm 1) is determined by the size of the candidate sets |C_i|. For example, in Figure 2, with the granularity parameter ε = 0.2 and error radius θ_2 = 2, we have |C_2| = 317 candidates for point p_2. Moreover, the computation of the movement tendency score depends upon the combinations of such candidates, which could degrade the repair efficiency.
In this section, we propose a heuristic technique to select partial candidates from the candidate set C_i for more efficient repair.
The basic idea is as follows. Among all candidates in the set C_i, some are much more qualified to correct a point p_i than the others. We thus focus on the highly qualified candidates and prune the remaining ones. To find such candidates p'_i, we expect that the overall tendency score of p'_i should not be higher (i.e., worse) than the overall score of the observation p_i. Thus, among the three tendency scores (w.r.t. D, L and V) of p'_i, we consider those candidates p'_i which lead to at least one of the three tendency scores smaller (i.e., better) than the corresponding score of p_i. Based on this intuition, we estimate the quality scores of a candidate p_{i,j} of p_i as below.
Thus, by referring to the scores of p i , we formally select the high quality candidates H i from C i as follows.
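One plausible reading of this pruning rule is sketched below with unnormalized scores. Since the observation's own repair distance is zero by definition, a candidate qualifies here when its travel distance or speed change beats the observation's own value (and the observation itself is always kept); the paper's exact, normalized quality scores may differ.

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def high_quality(p_prev, p_obs, p_next, cands, dt1, dt2):
    """Keep candidates with at least one tendency score better than the
    observation's own (sketch of the Section III-B pruning rule)."""
    l_obs = dist(p_prev, p_obs)
    v_obs = abs(dist(p_obs, p_next) / dt2 - dist(p_prev, p_obs) / dt1)
    kept = []
    for c in cands:
        l = dist(p_prev, c)
        v = abs(dist(c, p_next) / dt2 - dist(p_prev, c) / dt1)
        if l < l_obs or v < v_obs or c == p_obs:
            kept.append(c)
    return kept

# The on-path candidate (1, 0) survives; the farther spike (1, 10),
# worse in both L and V, is pruned.
h = high_quality((0, 0), (1, 5), (2, 0), [(1, 0), (1, 10), (1, 5)], 1, 1)
print(h)  # [(1, 0), (1, 5)]
```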

IV. ERROR RADIUS ESTIMATION
Recall that a fixed error radius θ_max may be used to select candidates for every point within a trajectory; in particular, a rather large θ_max is usually used to avoid the false negative issue. It is not hard to find that using a large radius θ_max could lead to too many unnecessary candidates for each point. As a result, the repair algorithm has to prune those unnecessary candidates, leading to significant performance degradation. To overcome this issue, we propose to tune an adaptive error radius by a regression model, which adaptively estimates an error radius θ_i for each point p_i.
The general idea of the regression model is as follows. The regression model maintains a mapping function from the movement features of each point to its error radius. To train the model, we need a number of labeled points with movement features and associated error radii. To collect such labeled points, we note that the Android API in today's smartphones can conveniently sample GPS coordinates and corresponding error radii. Once the labeled points are collected to train a regression model, in the prediction phase we apply the regression model to estimate the error radius of an input testing point which contains only movement features.
To train the regression model above, we focus on the detail to extract meaningful features for every point p i . Recall that the movement features such as repair distance, travel distance, and speed change require a triple of points. To this end, we define a parameter, i.e., the half window size ω, to select the triples of points from the window of 2 × ω points. Based on each of the triples, we then compute the corresponding movement features.
We take Figure 3 with ω = 2 to illustrate the feature extraction for point p_i. Among the window of 2 × ω = 4 points p_{i-2}, p_{i-1}, p_{i+1}, p_{i+2}, we take into account a triple of points (p_j, p_i, p_k) with i - 2 ≤ j < i < k ≤ i + 2 to compute movement features for p_i. As a result, we have four available triples (p_{i-2}, p_i, p_{i+1}), (p_{i-2}, p_i, p_{i+2}), (p_{i-1}, p_i, p_{i+1}) and (p_{i-1}, p_i, p_{i+2}). From this example, to form a triple of points, we select the first point from the first half window of ω points and the third one from the second half window of ω points, while the second one is fixed to be p_i. As a result, given the half window size ω, the total number of triples is equal to ω². In our experiment, we set ω = 3 by default and thus typically select 9 triples for each point p_i.
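The triple selection above can be sketched as follows, assuming 0-based indices clipped to the trajectory bounds (so interior points yield exactly ω² triples).

```python
def triples(i, omega, n):
    """All index triples (j, i, k) with i-omega <= j < i < k <= i+omega,
    clipped to the valid index range [0, n)."""
    return [(j, i, k)
            for j in range(max(0, i - omega), i)
            for k in range(i + 1, min(n - 1, i + omega) + 1)]

print(len(triples(10, 3, 100)))  # 9 triples for an interior point, i.e. omega^2
```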
In terms of a certain triple (p_j, p_i, p_k), we extract 6 movement features in total. The first three features are the distance, speed and turning angle. As shown in Figure 4(a), the point p_i = (x_i, y_i) at timestamp t_i has a predecessor p_j and a successor p_k (j < i < k), from which we compute the distance, speed and turning angle features of p_i. In terms of the distance difference, we note that moving objects are likely to move on shortest paths. Given the triple (p_j, p_i, p_k), we compute the distance difference between the path p_j → p_i → p_k and the path p_j → p_k by f^{j,i,k}_L = |Dist(p_j, p_i) + Dist(p_i, p_k) - Dist(p_j, p_k)|. As for the repair distance feature f^{j,i,k}_D, we use p_j and p_k to infer a repair position p̂_i of p_i by linear interpolation on the segment p_j → p_k, i.e., p̂_i = p_j + (t_i - t_j)/(t_k - t_j) · (p_k - p_j). We heuristically use p̂_i as the repair of p_i, since it lies on the short path with no speed change, resulting in both minimum speed change tendency and minimum travel path tendency. Thus the repair distance feature is f^{j,i,k}_D = Dist(p̂_i, p_i).
We note that the regression model could estimate an error radius that is smaller than the ground truth, so the best repair candidate may be falsely missed. To tackle this issue, we follow the previous work [16] to estimate the variance of bagged predictors and random forest predictors based on the jackknife [4] and the infinitesimal jackknife (IJ) [5]. The work [16] proves that the jackknife rule has an upward bias, while the IJ rule has a downward bias; by taking the arithmetic mean of the two variance estimations, it merges the two rules to achieve a nearly unbiased variance estimation. In this way, we compute the error bars of the predicted radii. Then, to alleviate the candidate-missing problem, we add the error bar to each predicted radius.
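The distance-difference and interpolation-based repair-distance features can be sketched as follows. Planar coordinates are assumed, and the turning-angle feature is omitted here.

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def triple_features(p_j, p_i, p_k, t_j, t_i, t_k):
    """Distance-difference feature f_L and repair-distance feature f_D
    for the triple (p_j, p_i, p_k)."""
    f_l = abs(dist(p_j, p_i) + dist(p_i, p_k) - dist(p_j, p_k))
    # linear interpolation of p_i's position on the segment p_j -> p_k
    ratio = (t_i - t_j) / (t_k - t_j)
    p_hat = (p_j[0] + ratio * (p_k[0] - p_j[0]),
             p_j[1] + ratio * (p_k[1] - p_j[1]))
    f_d = dist(p_hat, p_i)
    return f_l, f_d

# A point sitting exactly on the straight, uniform path yields zero features.
f_l, f_d = triple_features((0, 0), (1, 0), (2, 0), 0, 1, 2)
print(f_l, f_d)  # 0.0 0.0
```

Both features grow as p_i deviates from the short, constant-speed path between p_j and p_k, which is why points with high spatial error tend to receive larger predicted radii.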

V. EVALUATION
A. COUNTERPARTS
In this section, we evaluate the performance of our proposed approaches, including 1) DP: the dynamic programming-based algorithm in Section III-A, 2) DP + H: the improved DP approach by selecting high quality candidates, and 3) DP + HRv: the enhancement of DP + H by a regression model that tunes the error radius of each point with variance estimation. We compare these approaches with the state-of-the-art SDC [19] and the classical Kalman filter (KF) [1] with its two variants, the extended Kalman filter (EKF) [7] and the unscented Kalman filter (UKF) [14]. Since SDC requires a repair budget parameter, we increment the repair budget until the likelihood of SDC remains stable. We do not repeat the evaluation of other approaches such as the smoothing-based EWMA or the constraint-based method, mainly because such approaches suffer from worse performance than SDC (see [15], [19]).

B. DATASETS
We use three trajectory datasets.
• Ts: We collect a GPS trajectory by smartphones while mobile users walk on the road network of a university campus, at a sampling rate of one GPS position per second. The ground truth positions of the trajectory are marked manually as follows. We first project the collected trajectory onto the digital map of the campus, visualize the projected trajectory, and carefully label each GPS point. In this way, we manually identify 235 points with high spatial errors among the 2710 collected GPS points.
• Tt: We use a real dataset provided by one of the largest telecommunication (Telco) operators in China. This dataset contains 5367 measurement report (MR) samples and corresponding GPS positions. Each MR sample measures the signal strength between a mobile device and up to seven nearby cellular towers. We apply a Telco localization algorithm [20] to recover the trajectories of outdoor positions of mobile devices from the MR samples, and the corresponding GPS positions of such mobile devices then serve as the ground truth of the recovered trajectories. In terms of the localization error, computed as the distance between a recovered position and the ground truth, we achieve a median error of 20.4 meters.
• Tr: We randomly generate synthetic noise (with various noise radii and noise rates) on GPS points within a real GPS trajectory of 18610 GPS points, and the original trajectory is treated as ground truth. We use two parameters, i.e., the noise rate N_r and the noise radius N_a, to generate the noise points. Here, the noise rate N_r indicates the percentage of GPS points that are selected to generate random noise. Given a selected point, we randomly generate a noise point such that the distance between the noise point and the original one is smaller than the noise radius N_a. For example, by setting N_r = 20% and N_a = 40 meters, we first randomly select 20% of the trajectory points, and then generate noise points within 40 meters of the selected points. In this way, depending upon the parameters N_r and N_a, we generate various synthetic trajectories.
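The Tr generation described above can be sketched as follows. A local planar frame in meters is assumed, and the `seed` argument is added only for reproducibility; it is not part of the paper's setup.

```python
import math
import random

def add_noise(traj, noise_rate, noise_radius, seed=0):
    """Perturb each point with probability noise_rate, moving it to a
    uniformly random position within distance noise_radius."""
    rng = random.Random(seed)
    noisy = []
    for (x, y) in traj:
        if rng.random() < noise_rate:
            r = noise_radius * math.sqrt(rng.random())  # uniform over the disk
            ang = rng.uniform(0.0, 2.0 * math.pi)
            noisy.append((x + r * math.cos(ang), y + r * math.sin(ang)))
        else:
            noisy.append((x, y))
    return noisy

clean = [(float(i), 0.0) for i in range(1000)]
noisy = add_noise(clean, noise_rate=0.2, noise_radius=40.0)
moved = sum(1 for p, q in zip(clean, noisy) if p != q)
print(moved)  # roughly 200 of the 1000 points are perturbed
```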

C. METRICS
Besides the running time of the repair algorithms, we also evaluate the repair effectiveness by RMS errors [10]. Let T be an input observation trajectory, T^t be the ground truth of T, and T' be the repaired trajectory. We then compute the RMS error by RMS(T', T^t) = sqrt( (1/n) Σ_{i=1}^{n} Dist(p'_i, p^t_i)² ).
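The RMS metric can be sketched as follows (planar coordinates assumed):

```python
import math

def rms_error(repaired, truth):
    """Root-mean-square distance between repaired and ground-truth points."""
    assert len(repaired) == len(truth)
    sq = [(p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
          for p, q in zip(repaired, truth)]
    return math.sqrt(sum(sq) / len(sq))

print(rms_error([(0, 0), (3, 4)], [(0, 0), (0, 0)]))  # sqrt(25/2) ~ 3.5355
```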

D. PERFORMANCE ON TWO REAL TRAJECTORIES
We evaluate all repair approaches on the two real trajectories Ts and Tt. For Ts, we set θ_max = 16 meters and ε = 4 meters, and for Tt, we set θ_max = 80 meters and ε = 20 meters. In this way, we have 40 candidates for each point in both datasets Ts and Tt. From Table 4, we find that the RMS errors of the two observation trajectories Ts and Tt (denoted as Observe) are 1.82 and 90.11, respectively. In terms of all listed repair approaches, we have the following findings.
First of all, the RMS errors of KF and its two variants are rather high, mainly because these approaches could modify all points, no matter whether the points are erroneous or not. As online approaches, the KF-based methods are fast, leading to the least running time. Among the three approaches, KF simply adopts a linear model, while EKF and UKF adopt non-linear models, leading to lower RMS errors than KF.
Secondly, the state-of-the-art approach SDC only slightly repairs the input trajectories. Specifically, 1) SDC transforms the 2-dimensional GPS trajectory {(x_1, y_1), ..., (x_n, y_n)} into two 1-dimensional sequences {x_1, x_2, ..., x_n} and {y_1, y_2, ..., y_n}; the two 1-dimensional sequences are then repaired separately. 2) Among the three movement tendencies studied in this paper, SDC only adopts the speed change tendency. As a result, it is not hard to find that SDC suffers from higher RMS errors than all variants of our work. Moreover, SDC runs much slower than the three variants of our work, except the baseline approach DP.
Thirdly, among the three variants, DP + HRv achieves both the lowest RMS error and the fastest running time. The high effectiveness and efficiency of DP + HRv are mainly because it adaptively tunes the error radius for each point and greatly saves overhead by processing a smaller number of candidates. In addition, compared with the baseline approach DP, the improvement DP + H significantly reduces the number of candidates and leads to much faster running time.
Finally, for the two datasets Ts and Tt, all repair approaches lead to higher RMS errors and slower running time on Tt than on Ts. This is mainly because the Telco trajectory Tt suffers from much higher median errors and contains more data samples.
In summary, the filter-based approaches, including KF and its two variants, suffer from the highest RMS errors but offer the fastest running time. The state-of-the-art approach SDC leads to much lower RMS errors than the filter-based approaches, but at the cost of significantly higher running time. Our approaches balance the trade-off between RMS errors and running time, achieving much lower RMS errors than the filter-based approaches and faster running time than SDC.
In the rest of this section, since our goal is to repair input trajectories with errors, we mainly focus on the comparison between our approaches and SDC.

E. PERFORMANCE ON THE SYNTHETIC TRAJECTORY
In this section, by varying the parameters to generate the synthetic trajectory Tr, we measure the performance of SDC and three of our approaches DP, DP + H and DP + HRv. Figure 5 reports the performance of all methods on the dataset Tr. First, in Figure 5(a-b), the growth of the noise ratio N r and noise radius N a leads to greater RMS errors for all approaches. In terms of a certain value of N r and N a , our approaches consistently outperform SDC.
In Figure 5(c-d), among the four approaches, the running time of DP and SDC remains relatively stable, mainly because N_r and N_a have a trivial effect on the time complexity of these methods. Instead, both DP + H and DP + HRv are rather sensitive to N_r and N_a, because they need to prune more candidates when given greater N_r and N_a.

F. EFFECT OF MOVEMENT TENDENCIES
Since our approaches adopt three movement tendencies, including the repair distance (D), travel distance (L) and speed change (V) tendencies, we study the performance of various combinations of the three tendencies in Figure 6(a). Though we use DP + HRv for illustration in this figure, the variants such as DP and DP + H are also applicable. Among the three combinations using an individual tendency, we find that the travel distance tendency (L) leads to the least RMS error and the repair distance tendency (D) the greatest. For the three combinations of two tendencies, we again find that the travel distance tendency (L) consistently makes the major contribution. Finally, we see that the combination of all three tendencies (DLV) leads to the best performance.

G. EFFECT OF REGRESSION MODEL
To evaluate the performance of the regression model for tuning error radii, we implement various regression models on the smartphone trajectory Ts. In Figure 6(b), we vary the parameter ω (a.k.a. the half window size) from 1 to 8. Among the five regression models, the Random Forest (RF) leads to the least RMS errors at ω = 3. A comparatively smaller ω misses the context information of neighbouring points, while a comparatively greater ω adds more irrelevant features into the regression model. Thus, either greater or smaller values of ω degrade the performance. In addition, we note that the RMS error of SVM grows with a greater ω, mainly because a greater ω leads to more features and it is harder for SVM to separate the entire feature space by a hyperplane.
In addition, as shown in Figure 6(c), we report the importance of the features used by the Random Forest (RF)-based regression model. Recall that each triple of three points (p_{i-j}, p_i, p_{i+k}) with 1 ≤ j, k ≤ ω leads to 6 features which are then used by the regression model. In this figure, the x-axis represents the pairs (j, k). For example, the pair (2, 3) represents the triple (p_{i-2}, p_i, p_{i+3}) illustrated in Section IV. From Figure 6(c), we find that 1) the features with respect to the pairs (1, 1), (2, 1) and (3, 1) are more important than others, 2) among the 6 features of each pair, the three features f_L, f_V and f_D are more important than the other three (i.e., the distance, speed and turning angle f_A), and 3) the points far away from p_i, e.g., the pair (3, 3), have a trivial effect on the error radius estimation of p_i.

H. EFFECT OF TWO PARAMETERS ε AND θ_max
Finally, we study the effects of the two parameters ε and θ_max on the smartphone trajectory Ts. As shown in Figure 7(a), when the granularity parameter ε becomes smaller, we have more candidates (see Figure 2) and smaller RMS errors for all three variants of our approach. Again, in Figure 7(b), a greater θ_max indicates more selected candidates and also leads to smaller RMS errors. As shown in Figure 7(c-d), more candidates, caused by a smaller ε and a greater θ_max, lead to longer running times.

VI. RELATED WORK
A. FILTER-BASED REPAIR
The Kalman filter (KF) [1] has been widely used in the navigation and control of vehicles. It repairs a sequence of measurement data corrupted by noise and other errors. KF works in a two-stage process: prediction and update. In the prediction stage, KF estimates the current state variables based on the previously updated state estimate and its uncertainty. These estimates are then updated using a weighted average when the next measurement is observed. The two stages are computed recursively over the sequentially generated observations in real time. In this way, KF gives the optimal linear estimate for linear system models.
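The two-stage process can be sketched in one dimension as follows; the constant state model and the noise values q and r are illustrative assumptions of this textbook sketch, not a trajectory-specific formulation.

```python
def kalman_1d(measurements, q=1e-3, r=0.5, x0=0.0, p0=1.0):
    """Minimal 1-D Kalman filter with a constant state model.

    q: process noise, r: measurement noise, x0/p0: initial state
    estimate and uncertainty (all illustrative values).
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Prediction stage: state unchanged, uncertainty grows by q.
        p = p + q
        # Update stage: blend prediction and measurement via the gain.
        k = p / (p + r)          # Kalman gain in [0, 1)
        x = x + k * (z - x)      # weighted average of prediction and z
        p = (1 - k) * p          # uncertainty shrinks after the update
        estimates.append(x)
    return estimates
```

Note that every output value is a blend of prediction and measurement, which illustrates why KF-style methods modify even originally accurate observations.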
Unfortunately, most systems are nonlinear, and two variants of KF have been proposed to address the non-linearity issue: the extended Kalman filter (EKF) [7] and the unscented Kalman filter (UKF) [14]. The difference between EKF and UKF lies in how they handle non-linear equations. EKF uses the Jacobian matrix [8] to linearly approximate non-linear functions, while UKF propagates a set of representative points drawn from a Gaussian distribution to construct a better approximation of non-linear functions.
Compared with our proposed approaches, KF-based methods modify every observed measurement, and each estimate is computed from only the previous estimate and its uncertainty. Our methods can keep originally accurate measurements unchanged. Moreover, our solution DP globally considers the entire trajectory when repairing the observed measurements. Our proposal can thus obtain more accurate repair results than KF-based methods.

B. DATA REPAIR
Two previous works [6], [15] repair sequential data by solving a constraint-based optimization problem. Sequential dependencies [6] constrain the range of value changes between two consecutive data points. Taking advantage of the (widely existing) associated timestamps, [15] states that the repaired sequence should satisfy a constraint on the speed of data changes. With fixed time intervals between timestamps, sequential dependencies can also be interpreted as a special case of speed constraints.
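The violation-detection step of such speed-constraint cleaning can be sketched as follows; this is an illustrative one-dimensional version, not the exact algorithm of [6] or [15], and flagging the later point of each violating pair is an assumption of the sketch.

```python
def speed_violations(positions, timestamps, s_max):
    """Return indices of points whose implied speed from the previous
    point exceeds the speed constraint s_max."""
    bad = []
    for i in range(len(positions) - 1):
        dt = timestamps[i + 1] - timestamps[i]
        if dt <= 0:
            continue  # skip degenerate or duplicated timestamps
        speed = abs(positions[i + 1] - positions[i]) / dt
        if speed > s_max:
            bad.append(i + 1)  # flag the later point of the pair
    return bad
```

A repair step would then move each flagged value back inside the feasible range implied by s_max and its neighbours.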
Both of the above methods can only detect and repair violations of the specified constraints and do not consider any likelihood aspects among data points. As reported in SDC [19], statistical-based repair can outperform constraint-based methods. It is particularly meaningful for small-scale errors, which are common in the GPS trajectories studied in this paper.

C. STATISTICAL-BASED REPAIR
Statistical-based repair techniques have been applied on relational data [11], [13], [18] and sequential data [19]. [18] identifies the attributes with dirty values as flexible attributes and the attributes with correct values as reliable attributes. The relations between reliable and flexible attributes are then modeled and applied to repair the values of the flexible attributes. For each repair, the replacement with the maximum likelihood w.r.t. the data distribution in the relation is chosen.
[11] constructs a more complex dependency network to model the relationships among attributes and then iteratively applies repairs. By observing the change of the distributions before and after each operation, the repair process terminates once such change is sufficiently small.
Unfortunately, relational data is quite different from our GPS trajectory data, so neither of these two approaches can be directly applied to our data sets. For example, [19] repairs sequential data by using the likelihood of speed changes in the observation sequence. However, it only considers one-dimensional sequences, not two-dimensional trajectories. Given a GPS trajectory, [19] has to divide it into two one-dimensional sequences on latitude and longitude and process the repairs individually. Besides, [19] relies on the proper setting of the repair budget parameter, which is nontrivial to tune on a new data sequence.

VII. CONCLUSION
In this paper, we propose to repair GPS trajectories with the objective of minimizing the overall movement tendency score, which involves three specific movement tendencies (i.e., speed change, travel distance and repair distance). Based on the candidates of each GPS point, we propose a dynamic programming (DP)-based algorithm that obtains an exact solution in pseudo-polynomial cubic time. Next, by selecting a small number of high quality candidates, we improve the efficiency of the DP algorithm. After that, with the help of a regression model, we adaptively estimate the error radius of each point to further reduce the number of candidates per node. Experimental studies on three trajectory data sets validate the effectiveness and efficiency of our algorithms.