Time Series Data Cleaning: A Survey

Errors are prevalent in time series data, particularly in the industrial field. Data with errors often cannot be stored in the database, which results in the loss of data assets. At present, besides keeping the original erroneous data, discarding it, or checking it manually, time series containing errors can also be cleaned automatically with the cleaning algorithms widely used in databases. This survey provides a classification of time series data cleaning techniques and comprehensively reviews the state-of-the-art methods of each type. In addition, we summarize data cleaning tools, systems, and evaluation criteria from research and industry. Finally, we highlight possible directions for time series data cleaning.


I. INTRODUCTION
Time series data can be defined [1] as a sequence of random variables, $x_1, x_2, \ldots, x_n$, where the random variable $x_1$ denotes the value taken by the series at the first time point, $x_2$ denotes the value at the second time point, and $x_n$ denotes the value at the $n$-th time point. Time series have been widely used in many fields [2]-[4], such as financial economy, meteorology and hydrology, signal processing, and industrial manufacturing. Time series data are especially important in industry, where all kinds of sensor devices capture data from the industrial environment without interruption. Because the data of sensor devices are often unreliable [5], time series data are often large and dirty. In the financial field, the most important application of time series data is to predict future commodity (stock) price movements. However, time series errors are also prevalent in the financial field: even data sets that are considered quite accurate still contain erroneous data. For instance, the correct rate of stock information on Yahoo Finance is 93%. The costs and risks of errors, conflicts, and inconsistencies in time series have drawn widespread attention from businesses and government agencies. Recent work studies the data quality issues in time series data, which pose unique challenges due to the presence of autocorrelations, trends, seasonality, and gaps [6]. According to Shilakes et al. [7], the market for data quality grows at about 17% per year, which is much higher than the 7% annual growth rate of the IT industry. For instance, approximately 30% to 80% of the time and cost of data warehousing project development is spent on data cleaning. Time series errors can be either timestamp errors [8] or observed value errors. In this survey, we focus on the existing methods for dealing with observed value errors; accordingly, the time series errors mentioned in the following are observation errors. Two types of processing methods are commonly used in industry when dealing with time series data errors:
(1) Discarding erroneous data. First, an anomaly detection algorithm is applied to the time series, and then the detected abnormal data are discarded.
(2) Cleaning data. Data cleaning is divided into manual cleaning and automatic cleaning. Manual cleaning undoubtedly has a high accuracy rate, but it is difficult to apply in practice because it takes considerable time and effort.
The existing surveys of data cleaning [9], [10] mainly summarize the methods for dealing with missing data [11], data inconsistency, data integration, and erroneous data in databases. Karkouch et al. [12] review the generation of sensor data, the causes of data quality problems, and the techniques for improving data quality. However, Karkouch et al. [12] do not provide a detailed overview of the state of the art in erroneous data cleaning. Therefore, we review the state of the art in cleaning erroneous values in time series data, which may serve as a tutorial for others.

A. Problem Statement
In this study, inspired by related research [13]-[15] on the classification of time series error types, we summarize the common errors in time series into three categories, namely single-point big errors, single-point small errors, and continuous errors. This article takes the price of a stock over 30 consecutive trading days as an example, and Figure 1 illustrates the characteristics of these three types of errors. The red line in the figure indicates the true price of the stock over the 30 consecutive trading days, and the blue line indicates the price of the stock crawled from a website. For various reasons, the observed value may differ from the true value. On the four consecutive trading days 8-11, the observed values are all 0, while the true values are 1.3, 1.2, 1.1 and 1.15, respectively. On the 20th trading day, the observed value is 2.4 and the true value is 1.3. On the 25th trading day, the observed value is 1 and the true value is 1.3.
(1) Continuous errors. Continuous errors are errors that occur at several consecutive time points in the time series. Continuous errors can be further subdivided into several types [13], which are not detailed here. The observed values from the 8th to the 11th trading days in Figure 1 are all 0, that is, continuous errors occur here. Continuous errors are common in real life. For instance, when someone is holding a smartphone and walking on the road, nearby tall buildings may have a lasting impact on the collected GPS information. System errors can also cause continuous errors.
(2) Single-point big error. A single-point error is an error that occurs discontinuously in a time series, affecting only isolated data points. A big error means that the observed value of the data point is far from the true value. Notably, the size of the error is relative and closely related to the real situation of the data set. On the 20th trading day in Figure 1, the observed value differs from the true value by 1.1. Compared with the 25th trading day, when the observed value differs from the true value by 0.3, the error of this data point is large, so it is a single-point big error. Single-point big errors are also very common in daily life. For instance, the motor vehicle oil level recorded by cursors may show a single-point big error when the vehicle bumps on the road.
(3) Single-point small error. As with a single-point big error, the errors do not occur consecutively but only on isolated data points. When the observed value of the data point differs from the true value by a small distance, as on the 25th trading day in Figure 1, it is a single-point small error. As stated in [16], the rationale behind single-point small errors is that people or systems always try to minimize possible errors. For instance, people may make only a few small omissions when copying files.
(4) Translational error. In Figure 2, the x axis represents time and the y axis represents the value at the corresponding time; the red line represents the true values, and the blue line represents the erroneous values after the translation. Fewer solutions exist for this type of error than for the types mentioned above.
Ignoring time series errors often has unpredictable consequences for applications such as query analysis. Therefore, time series cleaning algorithms are very important for mining the potential value of data. This paper reviews the cleaning algorithms and anomaly detection algorithms for time series data. By summarizing the existing methods, we provide a reference for scholars interested in time series data cleaning and, on this basis, discuss the possible challenges and future work in this area.

B. Problem Challenge
For the problem of time series data cleaning, the survey identifies the following four difficulties:
(1) The amount of data is large and the error rate is high. The main source of time series data is sensor acquisition. Especially in the industrial field, sensors distributed throughout a machine constantly monitor its operation in real time. These sensors often collect data at intervals of seconds, so the amount of data collected is quite large. For instance, the sensor collection interval of the equipment of a wind power company is 7 seconds, each machine has more than 2000 working status data, and more than 30 million pieces of data are collected every day, so the working status data of one day could exceed 60 billion. However, the data collected by sensors are often not accurate enough, in some cases because the observed physical quantity is difficult to measure accurately. For instance, in a steel mill, the surface temperature of the continuous casting slab cannot be measured accurately because of environmental disturbances, or the measurement may be distorted by the power of the sensor itself.
(2) The reasons for time series data errors are complicated. People always try to avoid generating erroneous time series data, yet various kinds of time series errors still occur. Besides the observation errors mentioned above, Karkouch et al. [12] explain in detail the IoT data errors generated by various complex environments. IoT data are a common and widespread kind of time series data in the real world. The complex causes of time series errors are also a challenge in cleaning and analyzing such data, and one that differs from traditional relational data.
(3) Time series data are continuously generated and stored. The biggest difference between time series data and relational data is that the time series is continuous. Therefore, it is important that a cleaning algorithm for time series data supports online (real-time) operation. An online anomaly detection or cleaning algorithm can monitor the physical quantity in real time, detect a problem, and then promptly raise an alarm or perform a reasonable cleaning. Consequently, a time series cleaning algorithm is required not only to support online or streaming computation but also to have good throughput.
(4) Minimum modification principle [17]-[19]. Time series data often contain many errors. Most of the widely used time series cleaning methods rely on smoothing filters. Such methods may change the original data too much and lose the information contained in the original data. Data cleaning needs to avoid changing the original correct data. It should follow the principle of minimum modification, that is, the smaller the change, the better.

C. Organization
Different algorithms tackle these challenges in different ways; they usually fall into smoothing-based methods, constraint-based methods, and statistics-based methods, as shown in Table I. In addition, some time series anomaly detection algorithms can also be used effectively to clean data. The remainder of this paper is organized as follows. The aforesaid four types of algorithms are discussed in Sections II to V, respectively. In Section VI, we introduce existing time series cleaning tools, systems, and evaluation criteria. Finally, we summarize this paper in Section VII and discuss possible future directions.

II. SMOOTHING BASED CLEANING ALGORITHM
Smoothing techniques are often used to eliminate data noise, especially numerical data noise. Low-pass filtering, which attenuates the high-frequency components of the data, is a simple example. The time overhead of this type of technique is small. However, it may modify the original data considerably, which distorts the data and makes the analysis results uncertain, so it has few applications in time series cleaning. Research on smoothing techniques mainly focuses on algorithms such as the Moving Average (MA) [20], the Autoregressive (AR) model [4], [21], [22], and the Kalman filter model [25]-[27]. This section therefore mainly introduces these three techniques and their extensions.

A. Moving average
The moving average (MA) family of algorithms [20] is widely used for smoothing and predicting time series. The simple moving average (SMA) algorithm calculates the average of the $N$ most recently observed time series values and uses it to predict the value at time $t$. A simple definition is given in equation (1).
In the weighted moving average (WMA) algorithm, data points at different relative positions in the window have different weights; the general definition is given in equation (3). In equation (3), $\omega_i$ represents the weight of the influence of the data point at position $i$ on the data point at position $t$, and the other symbols are defined as above. A simple strategy is that the farther apart two data points are, the smaller their mutual influence. For instance, a natural idea is to use the reciprocal of the distance between two data points as the weight of their mutual influence. Similarly, the weight of each data point in the exponentially weighted moving average (EWMA) algorithm [32] decreases exponentially with increasing distance; EWMA is mainly used for non-stationary time series [86], [87].
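The three variants can be compared side by side in a minimal Python sketch. It is an illustration under stated assumptions rather than the definition used in [20] or [32]: the window is assumed to be the trailing window of the last n values including the current one, the reciprocal-distance weight is written as 1/(d+1) so that the current point (distance d = 0) has a finite weight, and alpha is a hypothetical smoothing factor.

```python
from typing import List, Sequence


def sma(values: Sequence[float], n: int) -> List[float]:
    """Simple moving average over a trailing window of at most n values."""
    out = []
    for t in range(len(values)):
        window = values[max(0, t - n + 1): t + 1]  # shorter near the start
        out.append(sum(window) / len(window))
    return out


def wma(values: Sequence[float], n: int) -> List[float]:
    """Weighted moving average; a point at distance d from t has weight 1/(d+1).

    The 1/(d+1) form is an assumption made to avoid dividing by zero at d = 0.
    """
    out = []
    for t in range(len(values)):
        start = max(0, t - n + 1)
        idx = range(start, t + 1)
        weights = [1.0 / (t - i + 1) for i in idx]
        total = sum(w * values[i] for w, i in zip(weights, idx))
        out.append(total / sum(weights))
    return out


def ewma(values: Sequence[float], alpha: float) -> List[float]:
    """Exponentially weighted moving average with smoothing factor alpha."""
    out = [values[0]]
    for x in values[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out
```

All three functions return a series of the same length as the input, so a smoothed value can be compared point by point with the observation it replaces. A larger n or a smaller alpha smooths more aggressively and therefore also alters more of the originally correct points.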
To meet the need for rapid response in sensor data cleaning, Zhuang et al. [31] propose an intelligent weighted moving average algorithm, which calculates weighted moving averages from confidence data collected from sensors. Zhang et al. [88] propose a method based on multi-threshold control and approximate positive transform to clean probe vehicle data, and fill in the lost data with the weighted average method and the exponential smoothing method. Qu et al. [89] first use cluster-based methods for anomaly detection and then use exponentially weighted averaging for data repair, in order to clean power data in a distributed environment.

B. Autoregressive
The Autoregressive (AR) model is a process that uses itself as the regression variable: it describes the random variable at time $t$ by a linear regression on the previous $k$ random variables. The definition of the AR model [21], [90] is given in equation (4).
In equation (4), $\hat{x}_t$ is the predicted value of $x_t$, $x_t$ represents the true value at time $t$, $k$ is the order, $\mu$ is the mean value of the process, $\epsilon_t$ is white noise, $\omega_i$ are the parameters of the model, and $a$ is a constant.
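To make the role of the AR prediction in cleaning concrete, the following Python sketch predicts each point from its k predecessors and replaces points that deviate too far from the prediction. It assumes the common form in which the prediction is a constant plus a weighted sum of the previous k values, with the noise term omitted. The weights, the constant a, and the threshold are hypothetical inputs here; in practice the weights would be estimated from data, which this sketch does not do.

```python
from typing import List, Sequence


def ar_predict(x: Sequence[float], t: int, weights: Sequence[float],
               a: float = 0.0) -> float:
    """One-step AR(k) prediction of x[t]; weights[i] multiplies x[t - 1 - i]."""
    return a + sum(w * x[t - 1 - i] for i, w in enumerate(weights))


def ar_clean(x: Sequence[float], weights: Sequence[float], a: float = 0.0,
             threshold: float = 1.0) -> List[float]:
    """Replace x[t] by its AR prediction when the residual exceeds threshold.

    Predictions use the already repaired prefix, so one large error does not
    contaminate the prediction of the following point.
    """
    k = len(weights)
    cleaned = list(x)
    for t in range(k, len(cleaned)):
        pred = ar_predict(cleaned, t, weights, a)
        if abs(cleaned[t] - pred) > threshold:
            cleaned[t] = pred
    return cleaned
```

For example, ar_clean([1.0, 1.0, 1.0, 5.0, 1.0], weights=[1.0], threshold=2.0) repairs the spike at position 3 and leaves the other points unchanged. The first k points are never modified because no prediction is available for them.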
Park et al. [91] use labeled data $y$ to propose an autoregressive with exogenous input (ARX) model based on the AR model, as shown in equation (5). In equation (5), $y_t$ is the possible repair of $x_t$, and the other symbols are the same as in the AR model above. Alengrin et al. [24] propose the Autoregressive Moving Average (ARMA) model, which is composed of the AR model and the MA model. In addition, the Gaussian Autoregressive Moving Average model is defined as shown in equation (6) [13].
In equation (6), $\Phi(B)$ and $\Theta(B)$ are polynomials in $B$ of degrees $p$ and $q$, respectively, $\theta_0$ is a constant, $B$ is the backshift operator such that $BZ_t = Z_{t-1}$, and $x_t$ is a sequence of independent Gaussian variates with mean $\mu = 0$ and variance $\sigma_x^2$. Box et al. [92] propose a more complex Autoregressive Integrated Moving Average (ARIMA) model based on the ARMA model, which is not described in detail here. Akouemo et al. [93] propose a method that combines the ARX and Artificial Neural Network (ANN) models for cleaning time series; it performs a hypothesis test to detect anomalies at the extrema of the residuals and repairs anomalous data points by using the ARX and ANN models. Dilling et al. [23] clean high-frequency velocity profile data with the ARMA model, and Chen et al. [94] use the ARIMA model to clean wind power data.

C. Kalman filter model
Kalman [25] proposes the Kalman filter theory, which can deal with time-varying systems, non-stationary signals, and multi-dimensional signals. The Kalman filter incorporates errors (prediction and measurement errors) into the calculation; these errors exist independently and are not affected by the measured data. The Kalman model involves probability, random variables, the Gaussian distribution, the state-space model, and so on. Because the Kalman model covers so much material, no detailed description is given here, only a simple definition. First, we introduce a discrete control process system, which can be described by a linear stochastic difference equation as shown in equation (7).
The measured values of the system are expressed as shown in equation (8).
In equations (7) and (8), $x_t$ is the system state value at time $t$, and $v_t$ is the control variable value for the system at time $t$. $m$ and $n$ are system parameters; for multi-model systems, they are matrices. $y_t$ is the measured value at time $t$, and $r$ is the parameter of the measurement system; for multi-measurement systems, $r$ is a matrix. $p(k)$ and $q(k)$ represent the process noise and the measurement noise, respectively, and both are assumed to be white Gaussian noise.
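The predict and update cycle behind these definitions can be sketched for the scalar case. The sketch below assumes the standard linear form, in which the state evolves as m times the previous state plus n times the control plus process noise, and the measurement is r times the state plus measurement noise. The noise variances var_p and var_q, the initial state x0, and the initial error variance P0 are hypothetical parameters that would have to be chosen or estimated for a real system.

```python
from typing import List, Optional, Sequence


def kalman_1d(ys: Sequence[float], m: float = 1.0, n: float = 0.0,
              r: float = 1.0, var_p: float = 1e-2, var_q: float = 1.0,
              x0: float = 0.0, P0: float = 1.0,
              controls: Optional[Sequence[float]] = None) -> List[float]:
    """Scalar Kalman filter; returns the filtered state estimate at each step."""
    x, P = x0, P0
    estimates = []
    for t, y in enumerate(ys):
        v = controls[t] if controls is not None else 0.0
        # Predict: propagate the state and its error variance through the model.
        x_pred = m * x + n * v
        P_pred = m * P * m + var_p
        # Update: blend prediction and measurement using the Kalman gain.
        K = P_pred * r / (r * P_pred * r + var_q)
        x = x_pred + K * (y - r * x_pred)
        P = (1.0 - K * r) * P_pred
        estimates.append(x)
    return estimates
```

A small var_p relative to var_q makes the filter trust the model more than the measurements, which smooths more strongly. This is the same trade-off noted for other smoothing methods: stronger smoothing suppresses errors but also moves correct observations.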
The extended Kalman filter is the most widely used estimator for recursive nonlinear systems because it simply linearizes the nonlinear system models. However, the extended Kalman filter has two drawbacks: linearization can produce unstable filters, and the Jacobian matrices are hard to derive. Therefore, Ma et al. [95] present a new method of predicting the Mackey-Glass equation based on the unscented Kalman filter to solve these problems. In the field of signal processing, there are many works [28]-[30] based on Kalman filtering, but these techniques have not been widely used in time series cleaning. Gardner et al. [32] propose a new model, based on the Kernel Kalman Filter, to perform various kinds of nonlinear time series processing. Zhuang et al. [31] use the Kalman filter model to predict sensor data and smooth it with WMA.

D. Trajectory Simplification
The main purpose of trajectory simplification is to reduce the original trajectory from $N$ trajectory data points to $M$ trajectory data points, $M < N$, while retaining important location or topological features. The compression algorithm needs to ensure that the sequence of $M$ points has a minimum compression error (the M-$\epsilon$ problem); examples are MRPA [37] and Error-Search [38]. The input of the Error-Search algorithm is the target compression ratio $\lambda$ and the original trajectory $T$. It returns the compression that satisfies the given compression ratio $\lambda$ with the smallest DAD error. The main steps of the Error-Search algorithm are as follows: (1) Build the search space of $\epsilon$ from the original trajectory $T$; the search space is created in a more efficient way based on the directions of $T$. (2) Error checking: the main task is to determine whether there is a simplified trajectory $T'$ that satisfies $\mathrm{DAD}(T') < \epsilon$. Error-Search is designed to retain the direction information of the trajectory. The time complexity of the algorithm is $O(N^2 \log N)$, and the space complexity is $O(N^2)$. Error-Search is an accurate trajectory

E. Summary and Discussion
As shown in Table II, there are many methods based on smoothing, such as the state-space model [26], [35], [36] and interpolation [33], [34]. The state-space model assumes that the system's change over time can be determined by an unobservable vector sequence, and that the relationship between the time series and the observable sequence can be determined by the state-space model. By establishing state equations and observation equations, the state-space model provides a framework that fully describes the temporal characteristics of dynamic systems. To make this kind of smoothing algorithm more effective, many studies [96]-[98] have proposed techniques to estimate the parameters of the above methods. Most smoothing techniques have a small time overhead when cleaning time series, but they easily change data points that were originally correct, which greatly affects the accuracy of cleaning. Altering correct data in this way can distort the results of the analysis and make them uncertain.

III. CONSTRAINT BASED CLEANING ALGORITHM
In this section, we introduce several typical algorithms, which include order dependencies (ODs) [39], sequential dependencies (SDs) [44] and speed constraints [45], for repairing time series errors.

A. Order Dependencies
In relational databases, Order Dependencies (ODs) are simple and effective constraints that have been widely studied [39]-[42]. We find that ODs are also suitable for solving some time series data cleaning problems. The specific explanation is as follows: let $x(t) = x_1, x_2, \ldots, x_t$ be a time series; ODs can be expressed with the operators $<, \le, >, \ge$. For instance, for the number of miles traveled by a car, $x(t)$, the mileage should increase over time. Equation (9) states that $N$ is strictly increasing with $M$, as in equation (10).
ODs and DCs can also be used as integrity constraints for error detection and data repairing in databases. Wijsen [99], [100] extends ODs with a time dimension for temporal databases. Let $I = \{I_1, I_2, I_3, \ldots\}$ be a temporal relation, which can be viewed as a time series of conventional "snapshot" relations, all over the same set of attributes. A trend dependency (TD) allows attributes with linearly ordered domains to be compared over time by using any operator of $\{<, =, >, \le, \ge, \ne\}$. Consider a constraint specified over $(I_i, I_{i+1})$ in $I$. For each time point $i$, it requires comparing employee records at time $i$ with records at the next time $i + 1$, such that the salaries of employees never decrease. Lopatenko et al. [43] propose a numerical data cleaning method that uses Denial Constraints (DCs) as constraints, whose principle is similar.
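For the mileage example, checking an order dependency amounts to scanning consecutive points for a decrease. The following Python sketch is a minimal illustration: the function name and the strict flag are hypothetical, and the sketch only detects violations and does not repair them.

```python
from typing import List, Sequence


def od_violations(values: Sequence[float], strict: bool = False) -> List[int]:
    """Indices t where x[t-1] <= x[t] (or x[t-1] < x[t] if strict) is broken."""
    bad = []
    for t in range(1, len(values)):
        decreased = values[t] < values[t - 1]
        stalled = strict and values[t] == values[t - 1]
        if decreased or stalled:
            bad.append(t)
    return bad
```

For the readings [100, 120, 110, 130], the function reports index 2. Note that the check only says that the pair (120, 110) is inconsistent. Deciding which of the two readings is wrong requires additional information, such as the minimum modification principle.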

B. Sequential Dependencies
The sequential dependency algorithm proposed by Golab et al. [44] focuses on the difference in values between two consecutive data points in a time series. Golab et al. [44] define the CSD Tableau Discovery Problem as follows: given a relation instance and an embedded SD $M \rightarrow_g N$, find a tableau $t_r$ of minimum size such that the CSD $(M \rightarrow_g N, t_r)$ has confidence of at least a given threshold. An example of a CSD is (hour $\rightarrow_{(0,\infty)}$ height, [1961.01.01 00:00-2016.01.01 00:00]).
Generally, a sequential dependency (SD) is in the form of $M \rightarrow_g N$ (equation (11)). In equation (11), $M \subseteq R$ are ordered attributes, $N \subseteq R$ can be measured by certain distance metrics, and $g$ is an interval. It states that when tuples are sorted on $M$, the distance between the $N$-values of any two consecutive tuples is within the interval $g$. Fischer et al. [101] propose the concept of streaming mode to represent the structure and semantic constraints of data streams. The concept contains a variety of semantic information, including not only numeric values but also the order between attributes. The sequential dependency algorithm can be used not only for traditional relational database cleaning but also for time series cleaning. In fact, many dependency-based cleaning algorithms designed for relational databases are not suitable for time series data cleaning, such as Functional Dependencies (FDs) [102], Conditional Functional Dependencies (CFDs) [103], [104], Differential Dependencies (DDs) [105], [106], and Comparable Dependencies (CDs) [107], [108]. The sequential dependency is one of the few dependency-based algorithms that can be used for time series cleaning.
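A sequential dependency can be checked by sorting on the ordered attribute and testing each consecutive gap against the interval. The Python sketch below assumes a single numeric attribute on each side, a closed interval [g_low, g_high], and the signed difference as the distance. These are simplifications chosen for illustration and are not part of the definition in [44].

```python
from typing import List, Sequence, Tuple


def sd_violations(rows: Sequence[Tuple[float, float]], g_low: float,
                  g_high: float) -> List[Tuple[float, float]]:
    """Return (m_prev, m_next) for consecutive tuples whose N-gap leaves g.

    rows holds (m, n) pairs; they are sorted on m before checking.
    """
    ordered = sorted(rows)
    bad = []
    for (m0, n0), (m1, n1) in zip(ordered, ordered[1:]):
        gap = n1 - n0
        if not (g_low <= gap <= g_high):
            bad.append((m0, m1))
    return bad
```

With rows [(1, 10), (2, 12), (3, 30), (4, 31)] and the interval [0, 5], only the step from m = 2 to m = 3 is reported, because its gap of 18 exceeds the upper bound.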

C. Speed Constraints
To clean time series data, the speed constraint-based method [45] considers the restrictions on the speed of value changes in a given interval. Such restrictions often follow from common sense, e.g., the maximum flying speed of a bird, the temperatures in a day, or car mileage. A speed constraint $S = (S_{min}, S_{max})$ with time window $T$ is a pair of a minimum speed $S_{min}$ and a maximum speed $S_{max}$ over the time series $x = x_1, x_2, \ldots, x_t$, where each $x_i$ is the value of the $i$-th data point, with timestamp $i$.
For instance, consider the time series in Figure 3. To remedy the violations (denoted by red lines), a repair on $x_5$ can be performed, i.e., $x'_5 = 190$, which is represented by the blue "*" symbol in Figure 3. As illustrated in Figure 3, the repaired sequence satisfies both the maximum and minimum speed constraints.
Generally, a speed constraint is in the form of equation (12).
If time series data $x$ satisfies the speed constraint $S$, then for any $x_i$, $x_j$ in a time window $T$, it holds that $S_{min} < \frac{x_j - x_i}{j - i} < S_{max}$. In practical applications, speed constraints are often valid only for a specific period of time. For instance, when considering the fastest speed of a car, the reference time period is often in hours, and two data points in different years are not compared. The values of the speed constraint $S$ may be positive (constraining the growth rate of the value) or negative (constraining the rate of decline of the value). Speed constraints are less effective when dealing with small errors, so Wei Yin et al. [46] further study variance constraints, which use a variance threshold $V$ to measure the degree of dispersion of the time series in a given window $W$.
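The speed condition can be checked directly by enumerating pairs of points inside the window. The Python sketch below assumes that the timestamp of the i-th point is its index and that the window is given as a maximum index distance. It only reports violating pairs and does not attempt the repair step of [45].

```python
from typing import List, Sequence, Tuple


def speed_violations(x: Sequence[float], s_min: float, s_max: float,
                     window: int) -> List[Tuple[int, int]]:
    """Pairs (i, j) with 0 < j - i <= window whose speed is not in (s_min, s_max)."""
    bad = []
    last = len(x) - 1
    for i in range(len(x)):
        for j in range(i + 1, min(i + window, last) + 1):
            speed = (x[j] - x[i]) / (j - i)
            if not (s_min < speed < s_max):
                bad.append((i, j))
    return bad
```

For x = [10, 11, 12, 30, 14] with bounds (-2, 2) and a window of 1, the pairs (2, 3) and (3, 4) are reported. Both involve the point at index 3, which suggests that changing this single point would resolve all violations, in line with the minimum modification principle.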

D. Temporal Dependence
Discovering and exploiting temporal dependence is highly desired in many applications. For example, in health care, the analysis of temporal dependence can be of value throughout the entire disease treatment process, from disease prevention to treatment. A temporal dependence contains two components: the causative behavior and the dependent behavior. The user's future behavior is affected by the causative behavior, and the causative behavior causes the dependent behavior. As another example, in e-commerce networks, accurate analysis of users' time series activity records is of significant importance for advertising, marketing, and psychological studies. Qingchao Ca et al. [49] propose recurrent cohort analysis to group users into cohorts, and Dehua Cheng et al. [50] improve the performance of temporal dependence recovery by reversing the time stamps of the original time series and combining both time series.

E. Summary and Discussion
In the field of relational databases, there are many cleaning algorithms based on integrity constraints. Because they follow a strict equality relationship, they are difficult to apply to time series, in which the observed values are mostly numerical. A few methods, which we summarize in Table (V), can be used for time series data cleaning. For instance, ODs and SDs can solve problems in some scenarios, such as the requirement that the number of miles traveled by a car is non-decreasing. Speed-based constraints can further be used to process data such as GPS and stock prices, but a reasonable constraint can be given only with relevant domain knowledge. Therefore, the constraint-based cleaning algorithm needs to be further improved to have better robustness. Similarity rules [47] can capture a more general form of constraints on similarities between data values [109]. Moreover, learning individual models [48] could help in repairing missing data in different scenarios. One possible future direction is to use anomaly detection methods to detect anomalies first, and then treat the outliers as missing values to be repaired. We will discuss anomaly detection in Section V.

Reference        Method
[39]-[42]        ODs
[99], [100]      Extend ODs
[43]             DCs
[44]             SDs
[102]            FDs
[103], [104]     CFDs
[45]             Speed Constraints
[46]             Variance Constraints
[47]             Similarity Rule Constraints
[48]             Learning Individual Models

IV. STATISTICS BASED CLEANING ALGORITHM
Statistics-based cleaning algorithms occupy an important position in the field of data cleaning. Such algorithms use models learned from data to clean the data. The statistics-based approach involves a lot of statistical knowledge, but this article focuses on statistics-based data cleaning methods, so we do not cover the underlying statistics in detail.

A. Maximum likelihood
The intuitive idea of the maximum likelihood principle is as follows. Suppose a random test has several possible outcomes $x_1, x_2, \ldots, x_t$. If the result $x_i$ occurs in one test, it is generally considered that the test conditions are favorable for $x_i$, or that $x_i$ has the highest probability of occurrence.
Notation: assume that a given time series $x(t)$ follows a probability distribution $d$ whose probability mass function (discrete distribution) is $F_d$. Consider a distribution parameter $\theta$ and a sample $x_1, x_2, \ldots, x_n$ drawn from this distribution; then $F_d$ is used to calculate the probability of the sample [51], as shown in equation (13).
Ziekow et al. [52] use the maximum likelihood technique to clean Radio Frequency Identification (RFID) data. Wang et al. [53] propose the first maximum likelihood solution to address the challenge of truth discovery from noisy social sensing data. Yakout et al. [54] propose a new data repairing approach based on maximizing the likelihood of the replacement data under the given data distribution, which can be modeled using statistical machine learning techniques; however, this technique is used to repair data in databases. For repairing time series data errors, Zhang et al. [55] propose a better solution based on maximum likelihood, which approaches the problem from the perspective of probability. Based on the probability distribution of the speed changes between adjacent data points in the time series, the time series cleaning problem can be converted into finding the cleaned time series whose speed changes have the greatest likelihood.
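The idea of scoring a candidate repair by the likelihood of its speed changes can be illustrated with a small Python sketch. It is not the method of [55]: it assumes, purely for illustration, that speed changes are independent and Gaussian, fits the mean and variance by maximum likelihood on a hypothetical clean reference series, and then compares the log-likelihood of two candidate series.

```python
import math
from typing import List, Sequence, Tuple


def speeds(x: Sequence[float]) -> List[float]:
    """Differences between adjacent points (unit time steps assumed)."""
    return [b - a for a, b in zip(x, x[1:])]


def fit_gaussian_mle(samples: Sequence[float]) -> Tuple[float, float]:
    """Maximum likelihood estimates of the mean and variance (1/n, not 1/(n-1))."""
    mu = sum(samples) / len(samples)
    var = sum((s - mu) ** 2 for s in samples) / len(samples)
    return mu, var


def log_likelihood(samples: Sequence[float], mu: float, var: float) -> float:
    """Gaussian log-likelihood of independent samples; requires var > 0."""
    return sum(-0.5 * math.log(2 * math.pi * var) - (s - mu) ** 2 / (2 * var)
               for s in samples)


# Hypothetical data: a clean reference, a dirty observation, and one repair.
reference = [1.0, 1.2, 1.3, 1.5, 1.6]
dirty = [1.0, 1.2, 2.4, 1.5]
repaired = [1.0, 1.2, 1.3, 1.5]

mu, var = fit_gaussian_mle(speeds(reference))
ll_dirty = log_likelihood(speeds(dirty), mu, var)
ll_repaired = log_likelihood(speeds(repaired), mu, var)
```

Under the fitted model, the repaired series receives a higher log-likelihood than the dirty one, so a likelihood-based cleaner would prefer it. A full method would also have to search over candidate repairs and bound how far a repair may move from the observations.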

B. Markov model
A Markov process is a stochastic process in which the transition of each state depends only on the previous $n$ states. Such a process is called an $n$-order model, where $n$ is the number of states that affect the transition. The simplest Markov process is the first-order process, in which the transition of each state depends only on the state before it. Markov processes whose time and state are both discrete are called Markov chains, abbreviated as $X_n = X(n), n = 0, 1, 2, \ldots$. The Markov chain [58] is a sequence of random variables $X_1, X_2, X_3, \ldots$. The range of these variables, that is, the set of all their possible values, is called the "state space", and the value of $X_n$ is the state at time $n$.
The Markov Model [59], [60] is a statistical model based on the Markov chain, which is widely used in speech recognition, automatic part-of-speech annotation, phonetic conversion, probabilistic grammar, and other natural language processing applications. To find patterns that change over time, the Markov model attempts to build a process model that can generate those patterns. [59] and [60] use specific time steps and states, and make Markov assumptions. Under these assumptions, a system that can generate such patterns is a Markov process. A Markov process consists of an initial vector and a state transition matrix. One thing to note about this assumption is that the state transition probabilities do not change over time.
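Since a first-order Markov process is described by its state transition matrix, the matrix can be estimated from an observed state sequence by counting transitions. The following Python sketch shows this counting estimate. The nested-dictionary representation and the function name are choices made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence


def transition_matrix(states: Sequence[Hashable]) -> Dict[Hashable, Dict[Hashable, float]]:
    """Estimate first-order transition probabilities by counting transitions."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(states, states[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for prev, row in counts.items()
    }
```

For the sequence A, A, B, A, B, the estimate gives a probability of 2/3 for the transition from A to B and of 1 for the transition from B to A. Transitions that never occur in the data are simply absent from the result, so a real application would usually add smoothing.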
The Hidden Markov Model (HMM) [61], [62] is a statistical model based on the Markov Model, which is used to describe a Markov process with hidden unknown parameters. The difficulty is to determine the hidden parameters of the process from the observable parameters, and then to use these parameters for further analysis, such as the prediction of time series data. For instance, after rolling dice 10 times, we might get a string of numbers such as 1, 4, 5, 3, 3, 1, 6, 2, 4, 5, as shown in Figure 4. This string of numbers is called the visible state chain. In an HMM, however, there is not only a visible state chain but also a hidden state chain. In this example, the hidden state chain might be: D5, D3, D2, D3, D4, D6, D1, D5, D1, D2.
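Recovering the most likely hidden state chain from a visible state chain is typically done with the Viterbi algorithm, which the following Python sketch implements for small discrete models. The three dice used in the usage note (a 4-sided, a 6-sided, and an 8-sided die chosen uniformly at each roll) are a hypothetical setup for illustration and are not the dice of Figure 4.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def viterbi(obs: Sequence[Hashable], states: Sequence[Hashable],
            start_p: Dict, trans_p: Dict, emit_p: Dict) -> Tuple[float, List]:
    """Return (probability, path) of the most likely hidden state sequence."""
    # best[s] holds the probability and path of the best sequence ending in s.
    best = {s: (start_p[s] * emit_p[s].get(obs[0], 0.0), [s]) for s in states}
    for o in obs[1:]:
        new_best = {}
        for s in states:
            # Choose the predecessor r that maximizes the path probability into s.
            prob, path = max(
                ((best[r][0] * trans_p[r][s], best[r][1]) for r in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit_p[s].get(o, 0.0), path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


# Hypothetical dice: D4, D6, D8 with uniform start, transitions, and faces.
dice = {"D4": 4, "D6": 6, "D8": 8}
states = list(dice)
start_p = {s: 1 / 3 for s in states}
trans_p = {r: {s: 1 / 3 for s in states} for r in states}
emit_p = {s: {face: 1 / k for face in range(1, k + 1)} for s, k in dice.items()}

prob, path = viterbi([1, 6, 8], states, start_p, trans_p, emit_p)
```

For the observations 1, 6, 8 the decoded path is D4, D6, D8: a roll of 8 can only come from D8, a roll of 6 is more likely from D6 than from D8, and a roll of 1 is most likely from the die with the fewest faces. For long sequences, the products of probabilities should be replaced by sums of logarithms to avoid numerical underflow.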
Gupta et al. [63] use HMM to predict the price of stocks. Baba et al. [64] propose a data cleaning method based on the HMM, which is used to clean RFID data related to geographic location information. In multi-dimensional time series cleaning, HMM has more room for application than single-dimensional cleaning algorithms because of the correlation between the dimensions.

C. Binomial Sampling
Shawn R. Jeffery et al. [67] propose an adaptive method for cleaning RFID data, which exploits techniques based on sampling and smoothing theory to improve the quality of RFID data.Tag transition detection: tag transition detection refers to (1) Interrogation cycles: the reader's question-and-answer process for the tag is the basic unit of the reader's detection tag.
Based on the above concepts, we give the following definitions: $W_i$ is the smoothing window of tag $i$ and is composed of $\omega_i$ epochs, $S_i$ is the subset of epochs in $W_i$ in which tag $i$ is actually detected, $Count_t$ indicates the number of interrogation cycles in epoch $t$, and $R_t$ is the number of responses of tag $i$ in epoch $t$. For a given time window, suppose the probability that tag $i$ is read in each epoch is $p_i = R_t / Count_t$. The Statistical Smoothing for Unreliable RFID Data (SMURF) [67] algorithm treats each epoch's reading of the tag as a Bernoulli trial with probability $p_i$. Therefore, the number of epochs $|S_i|$ in which tag $i$ is read follows the binomial distribution $B(\omega_i, p_i)$. $p_{i,avg}$ is the average read rate in $S_i$.
Using the model based on Bernoulli trials to observe tag $i$, the probability that the tag is not read in any of the $\omega_i$ epochs is $(1 - p_{i,avg})^{\omega_i}$. To ensure that the dynamic nature of the tag is captured, the size of the sliding window $W_i$ needs to satisfy equation (14).
The SMURF algorithm first sets the initial window size to 1, and then dynamically adjusts the window length based on the actual readings. If the current window meets the integrity (completeness) requirement [67], the SMURF algorithm checks the status of the tag. When the detection result indicates that the tag status has changed, SMURF halves the current window length to react to the tag's transition. If the calculated window size that satisfies the integrity constraint is greater than the current window size, the algorithm additively increases the current window size by 2 and outputs the data points in the current window. If the tag is detected not to have moved, the algorithm outputs the midpoint of the current window as the output point, and then slides forward by one epoch for the next round of processing.
The SMURF algorithm is widely used to clean RFID data, and many studies [65], [66] improve it. Leema et al. [65] study the effect of tag movement speed on data cleaning results, and Xu et al. [66] consider the impact of data redundancy on setting up sliding windows.
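The window adaptation described above can be sketched as follows. This is an illustration under stated assumptions, not the exact formulation of [67]: we assume the integrity condition takes the form $\omega_i \ge \lceil \ln(1/\delta) / p_{i,avg} \rceil$, which follows from requiring $(1 - p_{i,avg})^{\omega_i} \le \delta$, and that a transition is flagged when the number of observed reads deviates from the binomial mean by more than two standard deviations. The confidence parameter `delta` and the function names are hypothetical, and `p_avg` is assumed to lie strictly between 0 and 1.

```python
import math


def required_window(p_avg, delta=0.05):
    """Smallest window (in epochs) such that a present tag is missed
    in every epoch with probability at most delta."""
    return math.ceil(math.log(1.0 / delta) / p_avg)


def transition_detected(reads, window, p_avg):
    """Flag a transition when the observed number of reads is more than
    two standard deviations away from the binomial mean window * p_avg."""
    expected = window * p_avg
    std = math.sqrt(window * p_avg * (1.0 - p_avg))
    return abs(reads - expected) > 2.0 * std


def adapt_window(window, reads, p_avg, delta=0.05):
    """One adaptation step: grow additively by 2 if the window is too small,
    halve it if a transition is detected, otherwise keep it unchanged."""
    w_star = required_window(p_avg, delta)
    if w_star > window:
        return max(min(window + 2, w_star), 1)
    if transition_detected(reads, window, p_avg):
        return max(window // 2, 1)
    return window


print(required_window(0.5))        # window needed for a 50% read rate
print(adapt_window(1, 1, 0.5))     # initial window of 1 grows additively
print(adapt_window(10, 0, 0.5))    # no reads in a large window: halve it
```

The additive increase and multiplicative decrease mirror the trade-off in the text: a large window smooths over missed readings, while a small window reacts quickly when the tag leaves.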

D. Spatio-Temporal Probabilistic Model
M. Milani et al. [68] propose the Spatio-Temporal Probabilistic Model (STPM), which learns detailed data patterns from historical data and then uses them to clean the current data. STPM not only gives joint probability distributions over updates to the data set at different times, but also distinguishes associations between updates from associations between values. STPM is based on Dynamic Probabilistic Relational Models (DPRMs), so we first describe DPRMs. A DPRM is a graphical model used to represent the relationships in dynamic data sets. It models the dependency relationships between attributes, generally uses conditional probability distributions to calculate the probability of each attribute value given the values of its parent nodes, and thereby forms a chain of relationships. For instance, when we need to estimate the data at time $T$, we can only use the data before time $T$ for inference; namely, the current state depends only on the previous state, which is similar to the Markov Model. STPM extends DPRMs to model update patterns between data at different times, and captures spatial and temporal update patterns by modeling update events to identify possible update relationships, which are finally used to detect and repair data.

E. Others
Firstly, we summarize the methods described above in Table VI. The Bayesian prediction model is a technique based on Bayesian statistics. It utilizes model information, data information, and prior information, so its prediction performance is good; therefore, this model is widely used, including in the field of time series data cleaning. Wang et al. [56] establish a cost model for Bayesian analysis, which is used to analyze errors in the data. Bergman et al. [70] consider user participation and use the user's feedback on query results to clean the data. Mayfield et al. [72] propose a more complex relational dependency network (RDN) [71] model to capture the probabilistic relationships between attributes. The difference between RDNs and traditional relational dependency models (such as Bayesian networks [57]) is that RDNs can contain cycles. The method iteratively cleans the data set and observes the change in the probability distribution of the data set before and after each cleaning pass. When the probability distribution of the data set converges, the cleaning process terminates. Zhou et al. [110] propose a technique for accelerating the learning of Gaussian models using GPUs. The authors argue that, when the data volume is very large, it is not necessary to use all the data to learn the model. They also provide a method for automatic tuning. In order to clean and repair fuel level data, Tian et al. [111] propose a modified Gaussian mixture model (GMM) based on the synchronous iteration method, which uses the particle swarm optimization algorithm and the steepest descent algorithm to optimize the parameters of the GMM and uses a linear interpolation-based algorithm to correct data errors. Shumway et al. [69] use the EM [112] algorithm combined with the state space model [27], [35] to predict and smooth the time series.

TABLE VI
Method | References
Maximum Likelihood | [53]-[55], [78]
Markov Model | [58]-[60]
HMM | [61]-[64]
SMURF | [65]-[67]
STPM | [68]
Bayesian | [56], [57]
RDN | [70]-[72]
GMM | [111]
EM | [69]

V. TIME SERIES ANOMALY DETECTION
Gupta et al. [14] survey anomaly detection methods for time series data: for a given time series, there may be two types of outliers, namely single-point anomalies and subsequence anomalies (continuous anomalies). In this section, we first discuss the detection methods for abnormal points and abnormal sequences, next introduce the application of the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm in data cleaning, and then review anomaly detection methods related to machine learning.

A. Abnormal point detection
For single-point anomalies, the most common idea is to use predictive models for detection. That is, the value predicted by the established model is compared with the observed value for each data point, and if the difference between the two values is greater than a certain threshold, the observed value is considered to be abnormal. Specifically, Basu et al. [113] select all data points with timestamps from $t-k$ to $t+k$, with timestamp $t$ as the center point, and take the median of these data points as the predicted value of the data point with timestamp $t$. Hill et al. [21] first cluster the data points and take the average of the cluster as the predicted value of the point. The AR model and the ARX model are widely used for anomaly detection in various fields, such as economics and social surveys [3], [4]. The ARX model takes advantage of manually labeled information, so it is more accurate than the AR model when cleaning data. The ARIMA model [114] represents a type of time series model consisting of the AR and MA models mentioned above, which can be used for data cleaning of non-stationary time series. Kontaki et al. [79] propose continuous monitoring of distance-based outliers over data streams. One of the most widely used definitions is the one based on distance, as shown in Figure 5: an object $p$ is marked as an outlier if there are fewer than $k$ objects within a given distance of $p$. Here $k = 4$, $q$ is a normal point and $p$ is an abnormal point.

Fig. 5. An Example of Abnormal Point Detection
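Two of the ideas above can be sketched in a few lines: median-based prediction over the window from $t-k$ to $t+k$, and the distance-based outlier definition. The code is a simplified illustration for one-dimensional data, not the implementation of [113] or [79]; the function names, the threshold and the radius are hypothetical parameters.

```python
from statistics import median


def median_window_anomalies(values, k, threshold):
    """Return indices t where |x_t - median(x_{t-k..t+k})| exceeds threshold."""
    flagged = []
    for t in range(len(values)):
        window = values[max(0, t - k): t + k + 1]   # truncated at the borders
        if abs(values[t] - median(window)) > threshold:
            flagged.append(t)
    return flagged


def distance_based_outliers(values, radius, k):
    """Return indices of points with fewer than k other points within radius."""
    flagged = []
    for i, v in enumerate(values):
        neighbours = sum(1 for j, u in enumerate(values)
                         if j != i and abs(u - v) <= radius)
        if neighbours < k:
            flagged.append(i)
    return flagged


series = [10, 11, 10, 50, 11, 10, 12]
print(median_window_anomalies(series, k=2, threshold=5))
print(distance_based_outliers(series, radius=3, k=4))
```

Both functions flag the spike at index 3 in this toy series. The median predictor is robust to the spike itself, which is why the median rather than the mean is taken as the predicted value.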

B. Abnormal sequence detection
Different studies define subsequence anomalies differently. Keogh et al. [75] define a subsequence anomaly as the subsequence that has the largest distance to its nearest non-overlapping match. With this definition, the simplest calculation method is to compute the distance between each subsequence of length $n$ and all other subsequences. Of course, the time complexity of this calculation method is very high. In later studies, Keogh et al. [115] propose a heuristic algorithm that reorders candidate subsequences, and Wei et al. [116] propose an acceleration algorithm using locality-sensitive hashing. The Euclidean distance is usually used as the distance, and Keogh [117] further proposes using a compression-based similarity measure as the distance function.
Another family of methods is window-based. As shown in Figure 6, the data is divided into multiple subsequences (windows) that overlap each other. First, the anomaly score of each window is calculated, and then the anomaly score (AS) of the whole test sequence is calculated from the anomaly scores of the windows. Window-based techniques can locate anomalies better than methods that directly output the entire time series as an outlier. There are two main types of methods based on this technique. One is to maintain a normal database [118], [119] and then compare the test sequence with the sequences in the normal database to determine whether it is abnormal; the other is to build an anomalous database [76], [77] and then compare the test sequence with the sequences in that database to detect whether it is anomalous. The authors of [120] observe that the length of the error is unknown, and they use grammar induction to aid anomaly detection without any prior knowledge.
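The simplest calculation method mentioned above, comparing every subsequence of length $n$ with every non-overlapping subsequence, can be sketched as follows. This brute-force version is only meant to make the definition concrete; it does not include the reordering heuristic of [115] or the hashing of [116], and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_discord(series, n):
    """Return (start, distance) of the length-n subsequence whose nearest
    non-overlapping match is farthest away."""
    starts = range(len(series) - n + 1)
    best_start, best_dist = None, -1.0
    for i in starts:
        nearest = math.inf
        for j in starts:
            if abs(i - j) >= n:          # skip overlapping (trivial) matches
                nearest = min(nearest,
                              euclidean(series[i:i + n], series[j:j + n]))
        if nearest != math.inf and nearest > best_dist:
            best_start, best_dist = i, nearest
    return best_start, best_dist


series = [0, 1, 0, 1, 0, 1, 0, 9, 0, 1, 0, 1, 0, 1]
print(find_discord(series, n=2))
```

With $m$ subsequences the nested loops perform on the order of $m^2$ distance computations, which illustrates why the time complexity of this method is very high and why the later acceleration techniques were proposed.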

C. Density-Based Spatial Clustering of Applications with Noise
The DBSCAN [121] algorithm is a clustering method based on the density-reachability relationship. It divides regions with sufficient density into clusters, finds clusters of arbitrary shape in spatial databases with noise, and defines a cluster as the largest set of density-connected points. A preset density threshold serves as the basis for forming clusters; that is, a region is considered a cluster when the threshold is satisfied.
The principle of the DBSCAN algorithm is as follows: (1) DBSCAN searches for clusters by checking the Eps neighborhood of each point in the data set. If the Eps neighborhood of point $p$ contains at least MinPts points, a cluster is created with $p$ as a core object; (2) Then, DBSCAN iteratively aggregates objects that are directly density-reachable from these core objects. This process may involve merging some density-reachable clusters; (3) When no new points can be added to any cluster, the process ends.
Here MinPts is the minimum number of neighbor points required within the neighborhood for a given point to become a core object, and Eps is the neighborhood radius. For instance, with Eps = 0.5 and MinPts = 3, the clustering result for a given data set is as shown in Figure 7. Some noise points can be repaired and clustered into the classes adjacent to them. Recent research [122] has shown that, after repairing erroneous data, the accuracy of clustering on spatial data can be improved; the authors also perform cleaning experiments on GPS data based on DBSCAN. However, this method cannot handle continuous errors and needs further improvement.
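The three steps of DBSCAN can be sketched in plain Python as follows. This is a minimal, quadratic-time illustration rather than an optimized implementation; it assumes that a point counts itself among its Eps neighbors, and the sample points are hypothetical. The parameters Eps = 0.5 and MinPts = 3 match the example values in the text.

```python
import math

NOISE = -1


def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (including idx itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE            # may later be claimed as a border point
            continue
        cluster += 1                     # step (1): i is a core object
        labels[i] = cluster
        queue = list(neighbours)
        while queue:                     # step (2): expand through core objects
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)
    return labels                        # step (3): no more points to add


points = [(0, 0), (0, 0.3), (0.3, 0), (0.3, 0.3),
          (5, 5), (5, 5.3), (5.3, 5),
          (9, 9)]
print(dbscan(points, eps=0.5, min_pts=3))
```

In this toy data set the first four points form one cluster, the next three form a second cluster, and the isolated point (9, 9) is labeled as noise, which is the kind of point a cleaning method would subsequently try to repair.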

D. Generative Adversarial Networks
With the rapid development of machine learning technology, more and more problems are solved using machine learning. Dan Li et al. [80] use Generative Adversarial Networks (GANs) to effectively detect anomalies in time series data. A GAN trains two models at the same time: a generative model that captures the data distribution, and a discriminative model that discriminates whether the data are real data or pseudo (generated) data, as shown in Figure 8.
Given a random variable with a uniform probability distribution as input, we want the probability distribution of the generated output to be the "dog probability distribution". The philosophy of Generative Matching Networks (GMNs), whose idea is to train the generative network by directly comparing the generated distribution with the true distribution, is to optimize the network by repeating the following steps: (1) Generate some uniformly distributed inputs; (2) Pass these inputs through the network and collect the generated outputs; (3) Compare the true "dog probability distribution" with the generated "dog probability distribution" based on the available samples (e.g., calculate the MMD distance between the real dog image samples and the generated image samples); (4) Use backpropagation and gradient descent to propagate the errors and update the weights. The purpose of this process is to minimize the loss of the generative model and the discriminator.
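Step (3), comparing the real and generated distributions from samples, can be illustrated with a sample-based estimate of the squared Maximum Mean Discrepancy (MMD). The sketch below uses the standard biased estimator with a Gaussian kernel on one-dimensional samples; the kernel choice, the `bandwidth` parameter and the toy samples are assumptions made for illustration, and real image samples would require a kernel on vectors.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared(real, generated, bandwidth=1.0):
    """Biased estimate of squared MMD between two 1-D samples:
    mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(x, y, bandwidth) for x in a for y in b)
        return total / (len(a) * len(b))

    return (mean_kernel(real, real)
            + mean_kernel(generated, generated)
            - 2.0 * mean_kernel(real, generated))


real = [0.0, 0.1, -0.1]
close = [0.05, 0.0, -0.05]     # generated samples near the real ones
far = [5.0, 5.1, 4.9]          # generated samples far from the real ones
print(mmd_squared(real, close) < mmd_squared(real, far))
```

A generator trained as in step (4) would adjust its weights so that this quantity decreases, since the estimate is close to zero only when the two samples are distributed similarly.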
Dan Li et al. [80] use GANs to detect anomalies in time series, and a natural idea is to use GANs to repair missing values in time series data. More machine learning algorithms may likewise prove applicable to cleaning erroneous values in time series. A simple idea is to treat the detected anomalous data as missing data and then repair it. Y. Sun et al. [81] first analyze the similarity between parking space data and parking data, and then use Recurrent GANs to generate parking data as repair data, which provides a new idea for solving the problem of time series data repair. C. Fang et al. [82] propose FuelNet, which is based on Convolutional Neural Networks (CNNs) and GANs. FuelNet is used to repair inconsistent fuel consumption rates and impute incomplete ones over time.

E. Long Short-Term Memory
Since the Recurrent Neural Network (RNN) suffers from the vanishing gradient problem, it has difficulty processing long sequences. F. A. Gers et al. [123] improve the RNN and obtain a special case of the RNN, Long Short-Term Memory (LSTM), which can avoid the vanishing gradients of the regular RNN. LSTM has been widely used in industry, and [83]-[85] use LSTM to perform anomaly detection on time series data. In Figure 9, the left picture is a simple RNN structure diagram and the right picture is a simple LSTM structure diagram, where the given function is as shown in equation (15).
In equation (15), $x_t$ is the input in the current state, $h_{t-1}$ (hidden state) is the input received from the previous node, $y_t$ is the output in the current state and $h_t$ is the output passed to the next node. As can be seen from Figure 9, the output $h_t$ depends on the values of $x_t$ and $h_{t-1}$. $y_t$ is often fed into a linear layer (mainly for dimension mapping), and softmax is then used to classify the required data. As shown in Figure 9, the RNN has only one transmitted state $h_t$, whereas the LSTM also has a second transmitted state $c_t$ (cell state). There are three main stages within an LSTM: (1) Forgetting phase. The forgetting phase selectively forgets the input passed in from the previous node. A simple principle is: forget the unimportant and remember the important. More specifically, $z_f$ is calculated as the forget gate, which controls the previous state $c_{t-1}$ and decides whether to retain the data or forget it.
(2) Selective memory phase. At this stage, the input is selectively memorized. This phase mainly memorizes the input $x_t$: the more important the data, the more of it is retained.
(3) Output phase. This phase determines which outputs will be treated as the current state. Similar to the normal RNN, the output $y_t$ is often obtained by transforming $h_t$.
Filonov et al. [83] and Malhotra et al. [124] train recurrent neural networks on network time series data. The recurrent neural network learns what normal, expected network activity looks like. When unfamiliar activity from the network is presented to the trained network, it can distinguish whether the activity is expected or an intrusion.
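The three phases can be made concrete with a single step of a scalar LSTM cell. The sketch follows the standard LSTM formulation and is not claimed to be identical to equation (15); apart from $z_f$, $x_t$, $h_{t-1}$, $c_{t-1}$, $h_t$ and $c_t$, the symbol names (`z_i`, `z`, `z_o`), the `params` layout and the weights are hypothetical.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def lstm_step(x_t, h_prev, c_prev, params):
    """One step of a scalar LSTM cell.
    params maps a gate name to a tuple (w_x, w_h, bias)."""
    def pre_activation(gate):
        w_x, w_h, bias = params[gate]
        return w_x * x_t + w_h * h_prev + bias

    z_f = sigmoid(pre_activation("forget"))      # (1) how much of c_{t-1} to keep
    z_i = sigmoid(pre_activation("input"))       # (2) how much new input to store
    z = math.tanh(pre_activation("candidate"))   # candidate content from the input
    z_o = sigmoid(pre_activation("output"))      # (3) how much of the cell to expose

    c_t = z_f * c_prev + z_i * z                 # updated cell state
    h_t = z_o * math.tanh(c_t)                   # hidden state passed to next node
    return h_t, c_t


# Hypothetical weights: a strongly open forget gate and a closed input gate,
# so the cell state is carried over almost unchanged.
params = {"forget": (0.0, 0.0, 50.0),
          "input": (0.0, 0.0, -50.0),
          "candidate": (1.0, 0.0, 0.0),
          "output": (0.0, 0.0, 0.0)}
print(lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, params=params))
```

Because the cell state $c_t$ is updated additively and gated by $z_f$, information can be preserved over many steps, which is the mechanism by which the LSTM mitigates the vanishing gradients of the regular RNN.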

F. Summary and Discussion
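In addition to the methods described above, we also summarize some common methods in Table VII. As shown in Table VII, Xing et al. [125] show that the cleaned sequence can improve the accuracy of time series classification. Diao et al. [73] design a LOF [126] based online anomaly detection and cleaning algorithm. Zhang et al. propose an iterative minimum cleaning algorithm based on the timing correlation of error time series in continuous errors, which keeps the principle of minimum modification in data cleaning. The algorithm is effective in cleaning continuous errors in time series data. Qu et al. [89] first use cluster-based methods for anomaly detection and then use exponentially weighted averaging for data repair, which is used to clean power data in a distributed environment. R. Corizzo et al. [74] detect anomalous geographic data with a distance-based method, and then use a gradient-boosted tree (GBT) to repair the anomalous data. Charu C et al. [78] propose a solution for distributed storage and querying of large-scale streaming sensor data, and examine the problem of historical storage and diagnosis of massive numbers of simultaneous streams. We can conclude that anomaly detection algorithms play an important role in time series data cleaning. It is also becoming more and more important to design anomaly detection algorithms for time series repair, and we discuss future directions in Section VII.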

VI. TOOLS AND EVALUATION CRITERIA
In this section, we first give an overview of tools to clean time series and then summarize evaluation criteria related to time series cleaning methods.

A. Tools
There are many tools or systems for data cleaning, but they are not effective on time series cleaning problems. In Table VIII we list some tools that might be used for time series cleaning; several of them [128]-[130], [132] were originally designed to solve traditional database cleaning problems. Ding et al. [131] present Cleanits, an industrial time series cleaning system that implements an integrated cleaning strategy for detecting and repairing errors in industrial time series. Cleanits provides a user-friendly interface so that users can visualize results and logs for every cleaning process. Besides, the algorithm design of Cleanits also considers the characteristics of industrial time series and domain knowledge. ASAP, proposed by Rong et al. [133], violates the principle of minimum modification and distorts the data, so it is not suitable for wide use. EDCleaner, proposed by J. Wang et al. [134], is designed for social network data; detection and cleaning are performed through the characteristics of statistical data fields. Y. Huang et al. [135] propose PACAS, a framework for data cleaning between service providers and customers. R. Huang et al. [136] present TsOutlier, a new framework for detecting outliers with explanations over IoT data. TsOutlier uses multiple algorithms to detect anomalies in time series data, and supports both batch and streaming processing. EGADS [137], which is designed for automatic anomaly detection on large-scale time series data, offers two classes of algorithms for detecting outliers: plug-in methods and decomposition-based methods. There is not much research on time series cleaning tools or systems, and we discuss this further in the future directions in Section VII.

TABLE VIII
Method | Type | Detail
PIClean [128] | Based on statistics | Produce probabilistic errors and probabilistic fixes using low-rank approximation, which implicitly discovers and uses relationships between columns of a dataset for cleaning.
HoloClean [129] | Based on statistics | Learn the probability model and then select the final data cleaning plan based on the probability distribution.
ActiveClean [130] | Based on statistics | Allow for progressive and iterative cleaning in statistical modeling problems while preserving convergence guarantees.
Cleanits [131] | Anomaly detection | Develop reliable data cleaning algorithms by considering features of both industrial time series and domain knowledge.
MLClean [132] | Anomaly detection | Combine data cleaning technology and machine learning methods to generate unbiased cleaned data, which is used to train accurate models.
ASAP [133] | Smoothing based | Develop a new analytics operator called ASAP that automatically smooths streaming time series by adaptively optimizing the trade-off between noise reduction and trend retention.
EDCleaner [134] | Based on statistics | For social network data, detection and cleaning are performed through the characteristics of statistical data fields.
PACAS [135] | Based on statistics | Design a framework for data cleaning between service providers and customers.
TsOutlier [136] | Anomaly detection | Use multiple algorithms to detect anomalies in time series data, and support both batch and streaming processing.
EGADS [137] | Anomaly detection | Offer two classes of algorithms for detecting outliers, plug-in methods and decomposition-based methods; designed for automatic anomaly detection on large-scale time series data.

B. Evaluation criteria
The Root Mean Square (RMS) error [67] is used to evaluate the effectiveness of the cleaning algorithm. Let $x^{truth}$ denote the sequence consisting of the true values of the time series, $x^{noise}$ denote the sequence consisting of the observations after errors are added, and $x^{repair}$ denote the sequence consisting of the repaired values after cleaning. The RMS error [67] is represented as shown in equation (16):
$$\Delta(x^{truth}, x^{repair}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i^{truth} - x_i^{repair}\right)^2} \qquad (16)$$
Equation (16) measures the distance between the true values and the cleaned values. The smaller the RMS error, the better the cleaning effect.
Other criteria include the error distance between incorrect data and correct data, and the repair distance between erroneous data and cleaned results (as shown in equation (17), referring to the minimum modification principle in data repairing).
Dasu et al. [138] propose a statistical distortion method to evaluate the quality of cleaning methods. The proposed method directly observes the numerical distribution in the data set and evaluates the quality according to the variation of the distribution caused by different cleaning methods.
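The evaluation criteria can be computed as in the following sketch. The RMS error follows its usual definition; for the error distance and the repair distance we assume a sum of absolute differences, which is one plausible reading of equation (17) rather than a confirmed reproduction of it. The function names and the toy sequences are hypothetical.

```python
import math


def rms_error(truth, repaired):
    """Root mean square error between true values and repaired values."""
    squared = sum((t - r) ** 2 for t, r in zip(truth, repaired))
    return math.sqrt(squared / len(truth))


def error_distance(truth, observed):
    """Assumed L1 distance between true values and erroneous observations."""
    return sum(abs(t - o) for t, o in zip(truth, observed))


def repair_distance(observed, repaired):
    """Assumed L1 distance between erroneous observations and repaired values;
    a small value respects the minimum modification principle."""
    return sum(abs(o - r) for o, r in zip(observed, repaired))


truth = [1, 2, 3, 4]
observed = [1, 2, 9, 4]      # one injected error at index 2
repaired = [1, 2, 4, 4]      # cleaning result

print(rms_error(truth, repaired))
print(error_distance(truth, observed))
print(repair_distance(observed, repaired))
```

A good repair has a small RMS error with respect to the truth while keeping the repair distance no larger than necessary, so the criteria are typically reported together.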

VII. CONCLUSION AND FUTURE DIRECTIONS
In this paper, we review four types of time series cleaning algorithms, cleaning tools or systems, and related research on evaluation criteria. Next, we summarize the full text in Section VII-A and list some suggestions for future directions in Section VII-B.

A. Conclusion
With the development of technology, people have gradually realized the value contained in data. Because companies want to derive valuable knowledge from these data, data analysis has played an increasingly important role in finance, healthcare, natural sciences, and industry. Time series data, as an important data type, is widely found in industrial manufacturing. For instance, a wind power enterprise analyzes data from sensors located throughout the wind turbine to determine whether the fan is in a normal state; transport companies also want to optimize vehicle fleet travel by analyzing vehicle GPS information. However, due to external environmental interference, sensor accuracy, and other issues, time series data often contain many errors that can interfere with subsequent data analysis and cause unpredictable effects.

B. Future Directions
As mentioned above, data is an intangible asset, and advanced technology helps to fully exploit the potential value of data. Time series data cleaning methods therefore provide very important technical support for discovering this value when processing erroneous time series data. Next, we list some suggestions for future directions based on [15].
An illustrated handbook of time series error types. At present, data scientists have analyzed the errors in traditional relational databases in great detail. However, much work remains to be done on the analysis of time series error types. For instance, this paper roughly divides time series errors into three types, namely single-point big errors, single-point small errors and continuous errors. In fact, continuous errors also include many fine-grained error types, such as additive errors, innovational errors [13] or missing errors [139]. It is very important to analyze these error types systematically and compile an illustrated handbook of time series error types. Clear error types help to develop targeted cleaning algorithms that address the "GIGO (garbage in, garbage out)" problem in the current field.
The design of time series data cleaning algorithms. Each section of this paper reviews some time series cleaning algorithms, but further optimizations are possible. The existing methods are mostly designed for single-dimensional time series; even for GPS data, which carries two-dimensional information, each dimension is cleaned separately [45], [55], [127]. To further improve the practicability of the algorithms, it is imperative to consider the cleaning of multi-dimensional time series. Besides, with the development of machine learning technology, more machine learning techniques should be considered for data cleaning algorithms, which may lead to better cleaning results because of the mathematical support behind them.
The implementation of time series cleaning tool.At present, the mainstream data cleaning tools in the industry are still aimed at relational databases, and these tools are not ideal for processing time series data.As time series data cleaning problems become more serious, how to use the fast-developing distributed technology, high performance computing technology and stream processing technology to implement time series cleaning tools (including research tools and commercial tools) and apply them to real-world scenarios such as industry is also the key work of the next stage.
The algorithm design of time series anomaly detection.In real-world scenarios, efficient anomaly detection algorithms play an irreplaceable role in time series repair.It is difficult to judge the difference between the error value and the true value, so it is necessary to specifically design a time series anomaly detection algorithm that can be applied to an industrial scene.It is worth noting that more research is needed on how to perform anomaly detection, cleaning, and analysis in the case of weak domain knowledge or less labeled data.
Design of data cleaning algorithms for specific application scenarios. With the application of various technologies in industry, the application scenarios are becoming more and more clearly defined. The requirements for cleaning algorithms have different focuses in different application scenarios. For instance, rather than cleaning errors, another application is to directly answer queries over the possible repairs [140]. Moreover, cleaning qualitative events [141], [142], rather than quantitative values, is also in high demand. Finally, the data stored in Blockchain networks [101] are generally structured data; with the development of Blockchain technology, the design of data cleaning algorithms for Blockchain networks is also particularly important.

Fig. 8. A Simple Flow Chart of Generative Adversarial Networks

Fig. 9. Simple RNN Structure and Simple LSTM Structure

based online anomaly detection and cleaning algorithm. Zhang et al. propose an iterative minimum cleaning algorithm based on the timing correlation of error time series in continuous errors and keep the principle of minimum modification in data cleaning. The algorithm is effective in cleaning continuous errors in time series data. Qu et al. [89] first use cluster-based methods for anomaly detection and then use exponentially weighted averaging for data repair, which is used to clean power data in a distributed

Formal representation is as follows: $t_1 < t_2$ then $x_{t_1} \le x_{t_2}$, where $x(t)$ is mileage and $t$ is timestamp. For instance, consider an example relation instance in Table III. The tuples are sorted on attribute sequence number, which identifies sea level that rapidly increase from hour to hour. Generally, ODs in the form of equation (

TABLE VII. SUMMARY OF DETECTION

environment. R. Corizzo et al.

TABLE VIII. SOME EXAMPLES OF TOOLS OR SYSTEMS