QDetect: Time Series Querying Based Road Anomaly Detection

Road anomaly detection has attracted increasing attention in recent years due to its significant role in the public transportation of modern cities. A few methods have been proposed to detect road anomalies with inertial sensors (e.g., accelerometer and gyroscope); these methods usually apply classification techniques to time- and frequency-domain features extracted from inertial sensor data. However, existing methods are time-consuming since they operate on the whole data set. In addition, few of them pay attention to the self-similarity of the data produced when a vehicle passes over a road anomaly. In this paper, we propose QDetect, a road anomaly detection system with low data dependency based on querying and re-comparison. Specifically, QDetect consists of two phases: 1) a query filter, designed to roughly extract road anomaly segments by matching existing labelled anomalies; and 2) re-comparison on suspicious anomalies to identify their anomaly types. We have conducted comprehensive experiments on two real-world data sets, and the results show that our method outperforms several existing methods in both detection performance and running time. We hope this work lays the groundwork for real-time road anomaly detection in subsequent work.


I. INTRODUCTION
Road anomalies (such as potholes, metal bumps or speed bumps) are permanent obstacles created by continuous use, weather conditions or traffic planning decisions, and they can lead to serious traffic accidents. For example, between 2000 and 2011 there were 2 million traffic accidents in Canada, of which 33% were related to road conditions or bad weather. In 2015, about 50,000 British drivers were involved in traffic accidents caused by road anomalies, and potholes caused a car accident every 11 minutes. As a result, governments spend huge amounts of manpower and resources on road maintenance. The British government announced that it spent $1.2 billion on road maintenance in 2017. In 2014, the city of Toronto, Canada spent a total of $6 million on road repairs. Therefore, detecting road anomalies in an efficient and simple way helps reduce the expense and improve the efficiency of road repairs.
Previous detection methods such as vision-based or laser-based techniques are time-consuming and labor-intensive [1]. With the development of mobile devices and sensor technologies [2], [3], it is convenient to collect anomaly data using mobile phones, tablets or other devices equipped with sensors. When a vehicle passes over an anomaly, the data captured by the accelerometer show a pattern that depends on the type of anomaly. The problem of detecting road anomalies can thus be converted into finding possible anomalies in sensor data.
Several studies [4]-[6] utilized threshold-based techniques to detect road anomalies such as bumps and potholes. Although these methods can detect whether there is an anomaly, they cannot distinguish the type of anomaly, and their detection accuracy is quite low. Recently, a few studies first extracted different kinds of features [6]-[11] and then applied classification models [8], [10]-[12] such as Support Vector Machine, k-means or decision trees. These methods suffer from several limitations: 1) they rarely filter out the impact of normal roads on model training; 2) as Figure 1 shows, the data of two anomalies in the same category may be similar in shape and in the time domain, but none of these methods take this into consideration; and 3) they often use a fixed window length, which may slice an anomaly's data across windows.
On the other hand, these methods pay little attention to how detection is bootstrapped on real-world roads. Most of them [5], [6], [8] start from a large data set collected and labeled over months. If a government wants to start detecting anomalies, this data collection step cannot be avoided, and it is neither economical nor efficient to postpone the detection project until enough data are labeled. If a method performs well on a small data set, the entire detection project runs much more smoothly; hence a less data-dependent method is needed to identify anomalies effectively at the beginning of deployment.
Taking these issues into consideration, this paper focuses on: (1) building a more realistic and meaningful model that relies less on the size of the data set; (2) achieving better anomaly detection performance than methods proposed in other papers; and (3) enabling a less time-consuming road anomaly detection model regardless of the size of the data set.
Our model detects three kinds of anomalies: potholes, speed bumps and metal bumps. In fact, only potholes need to be reported to the government, but detecting and identifying all three types is still necessary. For drivers, the system only needs to report whether there is an anomaly, without determining its type; for road maintenance personnel, however, judging whether an anomaly is a pothole or normal road equipment (a speed bump or metal bump) is helpful for their work.
We utilize existing data sets to validate the effectiveness and efficiency of our method in different situations. We hope our work can serve as the basis for subsequent road inspection tasks such as building a road anomaly map of a city.
The contributions of this work are two-fold:
• We propose a query-based detection algorithm that extracts the top-k reliable anomalies together with the remaining suspicious anomaly windows, and a re-comparison classifier that determines the specific anomaly types. The former locates positions where anomalies might exist, and the latter reduces time consumption and improves performance.
• Our model is less data-dependent in that it uses only a small part of the training set. We compare our method with several existing methods; the proposed method achieves the best F1 score with the second lowest time consumption among all methods compared.
The organization of the article is as follows. Section II reviews related work from the past few years. Section III gives the main ideas and implementation of our method. Section IV presents the experimental results for multi-class anomaly classification and time performance. Finally, Section V concludes our work and discusses possible research directions.

II. RELATED WORK
Early studies mainly focused on threshold-based detection techniques to identify anomalies. For instance, the system in [4] used a 310 Hz sampling rate to acquire acceleration information from a smartphone. Eriksson et al. [5] used a threshold-based filter to detect road anomalies with acceleration and GPS data in the Boston area. The system proposed by Mednis et al. [6] used four heuristic threshold methods to detect road anomalies: Z-THRESH, G-ZERO, Z-DIFF and STDEV. However, threshold-based detection methods can only detect whether there is an anomaly; they cannot identify its type.
In recent years, many researchers have found that extracting features from both the frequency domain and the time domain helps solve the problem more accurately. For example, Perttunen et al. [7] used the fast Fourier transform to acquire the energy of each frequency band of the acceleration data and extracted a 95-dimensional feature vector for detection. Seraj et al. [8] decomposed acceleration data with the wavelet transform to extract feature vectors. Carlos et al. [9] showed that better detection could be obtained with features based on standard deviation scores. Alessandroni et al. [13] studied the relationship between road roughness and vehicle speed with their own model. Some researchers [14] used 288 features per window to calculate Euclidean distances. Madli et al. [15] established a rough mapping between the height or depth of road anomalies and their types. Manan and Ghani [16] built a shape comparison model between road anomalies, and other researchers [17]-[19] built feature-based comparison models between road anomalies. El-Wakeel et al. [10] used wavelet denoising to improve the quality of low-cost MEMS sensing data and detected roadway surface disruptions based on time- and frequency-domain features extracted from acceleration data. González et al. [11] used a bag-of-words model to extract features from the acceleration data. Some researchers [20] used features such as the final stop duration or the number of stops of the car to characterize the road condition. Fox et al. [21], [22] corrected the acceleration data by taking the road angle into consideration, while Xue et al. [23] proposed equations that use acceleration data to recover higher-dimensional information about potholes. Some researchers [24] used a Gaussian model to predict the depth of potholes.
Others [25] employed an undersampled oscillation system to estimate the road surface. Some researchers [17] proposed a time-series-based method for this problem, and one existing work [26] used edge computing to detect anomalies.
One drawback of these studies is that they overlook the quantity imbalance between normal road and anomalies. According to data collected by several studies [27], [28], the ratio between normal road and anomalies ranges from about 95:5 to 97:3. As a result, sliding-window techniques [27] generate a large number of windows representing normal road, which makes SVM training slow and inaccurate. Some methods [28] used an embedded platform to filter out normal-road data, but this is complicated in design and cannot be applied to all mobile devices.
Another drawback is that existing works extract features to reflect the most important characteristics of a window, but none of them exploits the fact that the data collected when a vehicle drives over an anomaly are similar not only in features but also in shape itself. Table 1 provides a summary of the related works and their characteristics.

III. OVERVIEW OF THE PROPOSED DESIGN
A. DESIGN FRAMEWORK
Our proposed model produces the detection and judgement results for a given piece of data in three phases: 1) query region selection and distance calculation; 2) distinguishing reliable regions from uncertain regions; and 3) region extension and determining the types of uncertain regions. The framework is shown in Figure 2.
We design this model based on the following assumption: when a vehicle drives over an anomaly, the change in acceleration data is similar across occurrences and limited to a few types. The differences between anomalies of the same type mainly lie in the maximum of the acceleration data, the duration and the transition process. These differences are determined by the velocity, the suspension system of the vehicle or the position of the device, but the main shape of the acceleration data remains within a limited set of patterns. This assumption is supported by the data we collected and by the data sets used in other papers.
According to this assumption, we use the time series querying idea to roughly find anomaly regions by calculating a querying distance. In this phase, the querying algorithm might mistakenly select some normal regions, while other regions closely match the training data with small distances. On the other hand, we query the entire testing set with three types of anomalies, which leads to a multi-class query result handling problem. To solve this problem and to determine which regions have a high probability of being anomalies, the second phase is applied. There may still be uncertain regions that are in fact anomalies: they are similar in shape but different in duration. We therefore extend the anomaly windows of the training set and the anomaly windows determined in the previous phase, and compare them with these regions again to decide whether the uncertain regions are anomalies. These three phases are described in Algorithm 1. Figure 3 shows an example of the region division and the meaning of the parameters. Algorithm 2 shows how to calculate the distance between two regions. This step is based on the query algorithm of [30]: we reduce the problem of comparing two anomalies to comparing two pieces of time series data, and this method helps calculate the similarity between them. In [31] and [32], distance measures that dynamically weigh the differences between time series were proposed. We observed the following data features for comparing anomaly regions and propose a new distance measure based on these observations:

B. QUERY REGION SELECTION AND DISTANCE CALCULATION
(1) Preservation of visually salient features such as peaks, troughs and slopes: the more pronounced the feature of interest (the longer the slope, the deeper or wider the trough or peak), the more likely the region is to be an obvious anomaly.
(2) Non-uniform global scaling: consider a region consisting of an upward-sloping straight line at a 45-degree angle. Any upward-sloping line segment within the data is an acceptable match to this region if the time and amplitude scales are modified accordingly.
(3) Local distortions: certain features within a region may be exaggerated, such as the width or depth of a peak or trough, or the relative difference between the heights of smaller and larger peaks in a pattern.

Algorithm 1 Overall Detection Pipeline (as recovered from the source)
 3: if the average z-axis value in the detection window, Avg = (z_1 + z_2 + ... + z_L) / L, exceeds the threshold then
 4:   while moving from the beginning of the detection window do
 5:     if the curve monotonicity changes then
 6:       add this point to the region edge point set E
 7:     end if
 8:   end while
 9: end if
10: move the detection window forward by length L
11: end while
12: merge continuous points in E into one region and calculate the number of segments; add this region to the candidate region set C
13: for each q_i in Q and each c_j in C do
14:   calculate the distance between q_i and c_j and add it to the distance matrix DistQuery[i][j] (Section III-B)
15: end for
16: label the top-k reliable regions in C according to DistQuery and add them to Q (Section III-C)
17: for each q_i in Q and each c_j in C do
18:   extend the shorter of q_i and c_j, calculate the distance and add it to the distance matrix DistExtend (Section III-D)
19: end for
20: label the uncertain regions in C according to DistExtend (Section III-D)

Thus, we divide the region into several segments by curve monotonicity and calculate all these features separately, covering both local distortions within one segment and global scaling. Once the numbers of segments match, we calculate the distance as Figure 3(a) shows.
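The window scan and monotonicity-based segmentation of Algorithm 1 can be sketched as follows. This is a minimal sketch: the window length and threshold are tuning parameters, and the function names are ours, not from the paper.

```python
import numpy as np

def monotonicity_breakpoints(window):
    """Indices where the curve switches between rising and falling."""
    signs = np.sign(np.diff(window))
    return [i + 1 for i in range(len(signs) - 1)
            if signs[i] != 0 and signs[i + 1] != 0 and signs[i] != signs[i + 1]]

def candidate_regions(z, win_len, threshold):
    """Scan z-axis data window by window (Algorithm 1, lines 3-12):
    if a window's average exceeds the threshold, collect its
    monotonicity-change points and merge them into a candidate region."""
    regions = []
    for start in range(0, len(z) - win_len + 1, win_len):
        window = z[start:start + win_len]
        if np.mean(window) > threshold:          # Avg = (z_1 + ... + z_L) / L
            edges = [start + e for e in monotonicity_breakpoints(window)]
            if edges:
                # (region start, region end, number of monotonicity changes)
                regions.append((edges[0], edges[-1], len(edges)))
    return regions
```

A window of flat normal road yields no candidate region; a window containing a bump produces a region bounded by its first and last monotonicity changes.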
Taking into consideration that sketches often do not respect aspect ratio, we non-uniformly globally rescale the shorter and smaller region to fit the longer and taller one (Figure 3(b)). Two coefficients are calculated as in Equations 1 and 2 (these equations regard the query region as the smaller one, as do the following equations).
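The global rescaling step can be sketched as follows. The exact coefficients of Equations 1 and 2 are not reproduced in this excerpt, so we assume one plausible reading: resample the query to the candidate's length (time coefficient) and match peak-to-peak amplitude ranges (amplitude coefficient).

```python
import numpy as np

def rescale_query(query, candidate):
    """Non-uniform global rescaling: stretch the query region in time and
    amplitude so that it spans the candidate region (the query is assumed
    to be the shorter/smaller one, as in the paper)."""
    # time-scale coefficient: resample the query to the candidate's length
    t_new = np.linspace(0, len(query) - 1, num=len(candidate))
    stretched = np.interp(t_new, np.arange(len(query)), query)
    # amplitude coefficient: match peak-to-peak ranges
    q_range = np.ptp(stretched)
    if q_range == 0:
        q_range = 1.0  # guard against a flat query
    return (stretched - stretched.min()) / q_range * np.ptp(candidate) + candidate.min()
```

After this step both regions share the same length and amplitude span, so the remaining differences are purely local distortion and shape.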
After global rescaling, we compute the distance between the query region and the candidate data region as a linear combination of two errors: local distortion errors and shape errors. The distortion error is the amount of local rescaling required for each smaller, shorter segment to better fit its corresponding segment (Figure 3(c)). The distortion error accounts for our tendency to exaggerate features within regions, and its overall influence on the distance measure is minimized. The distortion error can be calculated by Equations 3, 4 and 5.
The shape error is the difference in shape, measured by the Manhattan distance, between a data segment and a query segment after locally distorting the query segment (Figure 3(d)). The shape error can be calculated by Equation 6.
To combine these two differences across regions, we use a single distance that adds them up, which can be calculated by Equation 8.

Algorithm 3 Filtering Reliable Regions (core steps, as recovered from the source)
 4: use the smallest distance among the top-10 smallest values whose type equals RegionType; mark it as RegionDist
 5: if RegionDist < TA then
 6:   mark the region as a reliable region of type RegionType
 7: else if RegionDist > TN then
 8:   mark the region as a reliable normal region
 9: else
10:   mark the region as an uncertain region
11: end if
12: end for
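A hedged sketch of the combined per-segment distance follows. Since Equations 3 to 8 are not reproduced in this excerpt, the normalizations and the linear weight `alpha` below are our assumptions, not the paper's exact formulas.

```python
import numpy as np

def segment_distance(query_seg, cand_seg, alpha=0.5):
    """Distance between two segments of the same monotonicity:
    a local-distortion error plus a Manhattan shape error, combined
    linearly (alpha is an assumed weighting, not from the paper)."""
    # distortion error: how much the shorter segment must be locally
    # rescaled in length to fit its counterpart
    distortion = abs(len(query_seg) - len(cand_seg)) / max(len(query_seg), len(cand_seg))
    # stretch the query segment onto the candidate's time axis
    t = np.linspace(0, len(query_seg) - 1, num=len(cand_seg))
    stretched = np.interp(t, np.arange(len(query_seg)), query_seg)
    # shape error: per-sample Manhattan distance after local distortion
    shape = np.sum(np.abs(stretched - cand_seg)) / len(cand_seg)
    return alpha * distortion + (1 - alpha) * shape
```

Two identical segments yield distance 0; a segment matched against a stretched copy of itself incurs only the distortion term.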
C. DISTINGUISHING RELIABLE REGIONS AND UNCERTAIN REGIONS
Algorithm 3 shows how to filter out reliable regions. After the previous phase, we have computed the distance matrix DistQuery. Some candidate regions are similar in shape to query regions of the same anomaly type, which yields the smallest distances among all results. However, not all candidate regions can be directly assigned the anomaly type of the smallest distance value. Therefore, the algorithm first sorts out the top-10 smallest values and votes for the most similar type. We set a threshold to select the most reliable regions, which then serve as query regions in the next round.
To also filter out normal road regions, we set a sufficiently high threshold: if none of the query regions is similar to a candidate region, that region is reliably labelled as normal road.
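The top-10 vote and the two thresholds TA and TN of Algorithm 3 can be sketched as follows. The dictionary layout and the parameter `k` are our assumptions; TA and TN are the paper's reliable-anomaly and reliable-normal thresholds, whose values are tuning parameters.

```python
from collections import Counter

def label_regions(dist_query, TA, TN, k=10):
    """dist_query maps a candidate region id to a list of
    (distance, anomaly_type) pairs against all query regions."""
    labels = {}
    for region, dists in dist_query.items():
        top = sorted(dists)[:k]                        # k smallest distances
        vote = Counter(t for _, t in top).most_common(1)[0][0]
        best = min(d for d, t in top if t == vote)     # best distance of voted type
        if best < TA:
            labels[region] = ("reliable", vote)        # confident anomaly
        elif best > TN:
            labels[region] = ("normal", None)          # no query region is close
        else:
            labels[region] = ("uncertain", None)       # passed to the next phase
    return labels
```

Regions labelled "reliable" would then be appended to the query set for the next round, as the text describes.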

D. REGION EXTENSION AND DETERMINING TYPES OF UNCERTAIN REGIONS
Algorithm 4 shows how to determine the type of uncertain regions. After the previous phase, we have identified a large number of reliable regions. We have also summarized the basic patterns of the three anomalies, as Figure 4 shows. When encountering a pothole, the vehicle first drops and is then supported by the ground, so the acceleration data drop and then shake with vibration. When encountering a metal bump, the situation and the change of the acceleration data are reversed. Most speed bumps are long and leave little time to avoid them, so the acceleration data gain another shake when the rear wheels meet the speed bump.
Motivated by these basic patterns, we locate the two most important change points, namely the local maximum and minimum, and regard the region as three parts, as Figure 5 shows. For each part, we observe (supported by the data) that the longest region is no more than three times as long as the shortest one. Regions are divided into three parts; for example, part 1 in region 2 should be extended to the same length as part 1 in region 1. The red line indicates that part 1 in region 2 is shorter, so we apply linear interpolation to make up the difference in length. Because the previous step has already filtered out normal road, we use the smallest distance value to determine the anomaly type.
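The three-part split and linear-interpolation extension can be sketched as follows. The choice of the global maximum and minimum as the two change points follows Figure 5; the shared boundary samples between parts are an implementation convenience of this sketch, not a detail from the paper.

```python
import numpy as np

def extend_region(region, target_lengths):
    """Split a region at its extreme points into three parts and linearly
    interpolate each part up to the matching part's length in the longer
    region (target_lengths), as in the re-comparison phase."""
    # the two most salient change points: the global max and min
    i_max, i_min = int(np.argmax(region)), int(np.argmin(region))
    lo, hi = sorted((i_max, i_min))
    parts = [region[:lo + 1], region[lo:hi + 1], region[hi:]]
    extended = []
    for part, L in zip(parts, target_lengths):
        # only the shorter part is stretched; longer parts are kept as-is
        t = np.linspace(0, len(part) - 1, num=max(L, len(part)))
        extended.append(np.interp(t, np.arange(len(part)), part))
    return np.concatenate(extended)
```

After extension, the two regions have comparable part lengths, so the distance calculation of Section III-B can be reused directly.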

IV. EXPERIMENTAL EVALUATION
In this section, we present the results of a series of experiments conducted to evaluate the performance of the proposed anomaly detection model. First, we describe the experimental settings, including the data sets, compared methods and evaluation metrics. Then, we report and discuss the experimental results.

A. EXPERIMENTAL SETTINGS 1) DATA SETS
We utilize two data sets to validate the performance of our model and other existing methods.
Data set 1: This data set is based on the anomaly data provided by [27]. These anomalies were collected while driving in urban settings, both on residential streets and high-speed avenues, under standard driving conditions. In order to expand the scale of the problem to more realistic scenarios, we use all the anomaly data and normal road data and randomly generate a road that contains every anomaly included in the data set. To simulate a real road, we insert anomalies into normal road in a realistic proportion according to the ratio of anomalies to normal road in existing data sets [11], [29], [33]. The ratio of the length of the anomalies to the normal road is set to 1:50.

Algorithm 4 Extension and Recalculation (as recovered from the source)
Input: query set Q and uncertain region set S
Output: regions with anomaly types
 1: for each q_i in Q and each c_j in S do
 2:   find the maximum and minimum points pmax_i, pmin_i, pmax_j, pmin_j in regions q_i and c_j, and separate each region into three parts
 3:   for each part do
 4:     extend the shorter one by linear interpolation
 5: …
10:   mark the region as this type of anomaly
11: end for
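The synthetic road construction at a 1:50 anomaly-to-normal length ratio can be sketched as follows. The random sampling of normal stretches from a pool of real normal-road data is our assumption about how the road is assembled.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_road(anomalies, normal_pool, ratio=50):
    """Concatenate anomaly windows into a synthetic road, padding each with
    a stretch of normal-road data ratio times its length (the paper sets the
    anomaly-to-normal length ratio to 1:50)."""
    road, labels = [], []
    for win, label in anomalies:
        gap = len(win) * ratio                 # normal stretch before each anomaly
        start = rng.integers(0, max(1, len(normal_pool) - gap))
        road.append(normal_pool[start:start + gap])
        labels.append(("normal", gap))
        road.append(win)
        labels.append((label, len(win)))
    return np.concatenate(road), labels
```

The returned label list records the type and length of each inserted stretch, which is what the overlap-based evaluation later needs as ground truth.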
Data set 2: Inspired by the paper [34], this data set is simulated and collected with CarSim. We use the CarSim program to simulate vehicles driving over potholes, metal bumps and speed bumps: 4,000 potholes, 2,000 metal bumps and 2,000 speed bumps in total, yielding a large amount of accelerometer data.
The sizes of the two data sets are shown in Table 2.

2) COMPARATIVE METHODS
We compare the proposed detection model with the following five methods, all of which are popular works from the last three years.
• SVM-WW: the method [27] was originally proposed to identify different anomalies such as potholes, metal bumps and speed bumps based on z-axis acceleration data. Its features are mainly statistical indicators such as the mean and standard deviation. The authors also proposed a confidence score feature to evaluate whether the statistical features are trustworthy.
• CPD-SFS: the method [22] was originally proposed to detect different lanes and to identify anomalies in each lane. The authors selected the most effective features with a greedy forward selection algorithm and created more complicated features, such as the average of the absolute value of the product of the z-axis acceleration and the velocity.
• SVM-MDDP: this method [10] was originally proposed to identify various anomalies from sensor data via multi-domain processing. The authors first de-noised the data, then extracted not only statistical features as in other work but also time- and frequency-domain features, eventually reaching a feature dimensionality of more than 70.
• MLM-WB: the method [11] was originally proposed to represent time series data with simplified representations called word bags. Based on these representations, machine learning methods reached a higher accuracy.
• RoDS-ACGAN: the method [35] processes road anomaly data with a novel ACGAN-based approach and builds a novel system to collect data and produce results.
We compare these methods with our proposed method, Query Enabling with Comparison Over Shape Metrics (QDetect). The results are shown in the following sections.

3) EVALUATION METRIC
The main indicators for our evaluation are the F1-score and time consumption. The F1-score is calculated as F1 = 2 * Precision * Recall / (Precision + Recall), where Precision = TP / (TP + FP) and Recall = TP / (TP + FN). In order to explain the problem more clearly and make our model closer to practical applications, we count a prediction as a TP when it overlaps the real anomaly [27], as Figure 6 shows. The ground truth indicates the anomaly in the real world. Predicted line 1 is generated by one method and predicted line 2 by another. Predicted line 1 and the ground truth overlap between time steps 10 and 15, so the prediction is counted as a TP. Predicted line 2 does not overlap the ground truth at any position, but a prediction was made, so it is an FP. In a real prediction scenario, we want to know whether there is an anomaly on the road and its approximate location; predicting the exact length of the anomaly is usually unnecessary, since the road maintenance department does not need that information. As a result, this definition of TP is more reasonable. FP and FN are defined analogously.
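The overlap-based TP definition and the resulting F1 computation can be sketched as follows. Here regions are (start, end, type) triples; requiring matching types is our reading of the multi-class setting.

```python
def overlaps(a, b):
    """True if half-open intervals a = (start, end) and b overlap at all."""
    return a[0] < b[1] and b[0] < a[1]

def f1_score(predictions, ground_truth):
    """Overlap-based evaluation: a prediction is a TP if it overlaps any
    ground-truth anomaly of the same type, otherwise an FP; an unmatched
    ground-truth anomaly is an FN."""
    tp = sum(1 for p in predictions
             if any(overlaps(p[:2], g[:2]) and p[2] == g[2] for g in ground_truth))
    fp = len(predictions) - tp
    fn = sum(1 for g in ground_truth
             if not any(overlaps(p[:2], g[:2]) and p[2] == g[2] for p in predictions))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

For the Figure 6 example, a prediction overlapping the ground truth anywhere counts fully, while a disjoint prediction is penalized as an FP.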

4) IDENTIFICATION PERFORMANCE FOR ALL METHODS
For a comprehensive comparison, we vary the validation part (for machine-learning-based methods, part of the data set is used for training and the rest for testing; for our method, part of the data set is used as the comparison set and the rest as the query set) from 50% of the data set down to 10%, in order to test the change in model performance under different scales of validation sets. Accordingly, the training part grows from 50% to 90% in increments of 10%. In data set 1 we use 50% of the training set, while in data set 2 we use only 10% of the training set.
The results on data set 1 are illustrated in Figure 7. This data set includes a reasonable ratio of potholes, speed bumps and metal bumps. The number of anomalies is large enough to lead to a stable result.
From the three figures, we can observe: 1) The other methods perform reasonably in pothole identification, reaching F1 scores above 0.4; the best of them, SVM-WW, exceeds 0.62 in some training partitions, and SVM-MDDP also exceeds 0.6 in some partitions. Compared to these methods, our method always exceeds 0.7, exceeds 0.8 when the training partition is above 70%, and achieves an advantage of at least 0.15 in F1 score. 2) In speed bump identification, the F1 score of CPD-SFS grows substantially from 0.5 to over 0.6 as the training set grows. SVM-WW and MLM-WB perform similarly, increasing to nearly 0.7, while SVM-MDDP always stays above 0.7 and remains stable. Compared to the best method, SVM-MDDP, our method performs slightly lower with 50% of the data for training, but it increases steadily and has up to a 10% advantage when 90% of the set is used for training. 3) In metal bump identification, our method still performs better than the others. CPD-SFS and SVM-MDDP both exceed 0.6 when the training part is larger than 70% of the total set, and SVM-MDDP exceeds 0.7 with a 90% training partition. Compared to the best method, our method is at least 8% better with an 80% training set and up to 18% better with a 70% training set. This data set is also used by SVM-WW; in that method, the F1 score almost reaches 0.6 for speed bumps and 0.4 for metal bump identification. The paper indicates that it does not fit multi-class identification well, because its statistical features such as the standard deviation, mean or covariance may be similar for all anomalies; as a result, this method cannot distinguish different anomalies from each other. CPD-SFS takes more complex features into consideration, such as the product of the velocity and the z-axis acceleration; this complexity helps it distinguish metal bumps and potholes from speed bumps.
However, metal bumps and speed bumps remain similar: the vehicle first rises in both cases, and the z-axis data change in a similar way, so this method requires sufficient samples to train the model and improve performance. MLM-WB uses short patterns to represent the fundamental segments of an anomaly and slices the anomaly into a series of patterns that can be compared and sorted by regular expressions. Its advantage is stability: it does not need a large amount of training data, and its pothole and metal bump identification performance stays almost the same despite the change of training set partition. However, if a pattern is not defined correctly to represent the anomaly, similar anomalies will always remain unrecognizable. SVM-MDDP performs better than the other methods because: 1) it utilizes more than 70 feature dimensions, covering more information; and 2) its features are more complex, covering the statistical, time and frequency domains. Our method fully utilizes the advantage of the time series query method, starting from a small training set and quickly filtering out a large number of reliable regions. After that, a reliable result is produced by the COTE algorithm, avoiding the weakness of existing machine learning methods that can hardly distinguish different anomalies.
The results on data set 2 are illustrated in Figure 8. The former data set only contains about 10 to 20 anomalies of each type in a 10% partition, so the F1 score changes greatly when the training set is 80% or 90% and a single anomaly is misclassified. In data set 2, however, a 10% partition contains 100 to 200 anomalies of each type, so the influence of one misclassification decreases considerably. Also, this data set is collected by simulation, which means the anomaly data are smoother and more regular. As a result, the performance of all methods improves compared to data set 1.
From the three figures, we can observe: 1) The other methods perform better in pothole identification, reaching F1 scores above 0.6; the best methods, CPD-SFS and SVM-MDDP, exceed 0.8 with training partitions above 70%. Compared to these methods, our method exceeds 0.85 when the training partition is above 70% and achieves an advantage of at least 0.03 in F1 score; with 90% of the data as the training set, our method is 11% better than the best of the other methods. 2) In metal bump identification, the result of our method and the best of the other methods are very close; our method has a 0.05 advantage in F1 score over the best method when the training partition is higher than 70%. 3) In speed bump identification, CPD-SFS still performs best among the other methods and always exceeds 0.8. Compared to this method, our method performs at least slightly better with a 60% training set and has up to a 7% advantage with a 90% training set.
This data set is generated by simulation software, so its noise interference is much smaller than in the first data set. In this situation, methods that include mechanisms to eliminate noise, such as SVM-WW with features encoding the confidence of other features, lose the benefit of those features to some extent. Others, such as CPD-SFS and SVM-MDDP, which are dedicated to processing the data in the numerical, time or frequency domain, attain high F1 scores. On the other hand, the performance of these two methods on noisier data sets is not satisfactory, so from this perspective it might not be suitable to encode noise adjustment in the features. Also, CPD-SFS spends a large amount of time on pre-processing and pre-calculation, so its high identification performance comes with high pre-processing time consumption; SVM-MDDP instead uses complex features in model training, so its high identification performance comes with high model-training time consumption. SVM-WW still performs stably in metal bump identification; we suspect that the different patterns in the training set cause the rise in its pothole identification performance from the 70% to the 80% training partition. Our method uses only 10% of the training set this time, because it can filter out and absorb enough reliable regions. Moreover, the more standard the data are, the smaller the shape error: more reliable regions are filtered out, and uncertain regions have more samples to compare with, leading to good identification results.

5) TIME CONSUMPTION FOR ALL METHODS
a: HARDWARE ENVIRONMENT
The programs run on a Windows PC with an Intel i7-7700K CPU and 16 GB of DDR4 RAM. Following [27], we use the Python package scikit-learn to implement our program; all the baseline methods, such as random forest and support vector machine, are implemented with the same package. We did not use a GPU to accelerate the computation or the model training. The Python programs run on Python 3.5.

b: TIME CONSUMPTION ANALYSIS
The time consumption results are illustrated in Figures 9 and 10. We use 70% of the data set for training and increase the size of the validation set from 5% to 30% in steps of 5%. To obtain a more accurate result, we run the prediction 5 times with a fixed model. The time consumption includes the pre-processing time (transforming data into feature windows) and the prediction time. From the two figures, we can observe: 1) Time consumption grows almost linearly with the size of the validation part, and the relative order of the methods remains almost unchanged; thus the ordering of the methods by time consumption stays the same regardless of the data set size. 2) MLM-WB spends the least time, while SVM-MDDP spends the most. SVM-MDDP is slow because it calculates a large number of time-domain and frequency-domain features; although it performs well in the previous analysis, it consumes too many resources to reach that performance. 3) Our method spends a medium amount of time. In data set 1, compared to the fastest method, our method takes 20% extra time, but the performance of the fastest method is far from ideal; compared to the best-performing of the other methods, SVM-MDDP, our method takes only 1/8 of the time while performing better. In data set 2, SVM-MDDP performs well, but its time consumption grows rapidly; compared to the two methods with close performance, our method takes only 1/3 to 1/4 of the time for calculation and prediction. Although we need 20% more time than the fastest method, we achieve more than 40% better performance.
In conclusion, we either spend acceptably more time to reach a much better performance, or spend much less time to reach a similar performance compared to the existing methods.

V. CONCLUSIONS
Smartphones are becoming capable data collection devices, making it increasingly feasible to analyze road conditions with their data. In this work we propose a model to detect road anomalies based on a series of acceleration data. We first apply a time series query method to identify reliable regions and uncertain regions; after that, we add the reliable regions to our query set in order to determine the uncertain regions. We test our method on two data sets using the F1 score as the metric and compare it with several existing methods. Our method shows a clear advantage over all the other methods on both data sets, indicating that the model is usable in most situations. Furthermore, since we only utilize part of the training set, our method is easier to transfer from one system to another with less data recollection.
On the other hand, we make efforts to reduce time consumption: compared to the best existing method, we use only 1/8 of the time while reaching better performance.
In future work, we would like to deploy this system on a large number of taxis and generate a map of all the anomalies in a city via crowd sensing. This requires judging and selecting from a large amount of data and will help the government repair roads in a simpler way.