An Adaptive Outlier Detection and Processing Approach Towards Time Series Sensor Data

The intelligent environment monitoring network, as the foundation of ecosystem research, has developed rapidly with the ever-growing Internet of Things (IoT). IoT-networked sensors deployed to monitor ecosystems generate copious sensor data characterized by nonstationarity and nonlinearity, so outlier detection remains a source of concern. Most outlier detection models involve hypothesis tests based on setting outlier threshold values. However, signal decomposition describes the stationary and nonstationary relationships in sensor data. Therefore, this paper proposes a three-level hybrid model based on the median filter (MF), empirical mode decomposition (EMD), classification and regression tree (CART), autoregression (AR) and exponentially weighted moving average (EWMA) methods, called MF-EMD-CART-AR-EWMA, to detect outliers in sensor data. The first-level performance is compared to that of the Butterworth filter, FIR filter, moving average filter, wavelet filter and Wiener filter. The second-level prediction performance is compared to that of the support vector regression (SVR), K-nearest neighbor (KNN), CART, ensemble EMD with CART and AR (EEMD-CART-AR) and complementary ensemble EMD with CART and AR (CEEMD-CART-AR) methods. Finally, EWMA is compared to the cumulative sum (CUSUM) and Shewhart control charts. The proposed hybrid model was evaluated with a real dataset from the hydrometeorological observation network in the Heihe River Basin, yielding experimental results with better generalization ability and higher accuracy than the compared models, and providing effective detection of minor outliers in predicted values. This paper provides valuable insight and a promising reference for outlier detection involving sensor data and presents a new perspective for detecting outliers.


I. INTRODUCTION
The intelligent environment monitoring network consists of numerous sensor devices that form a ubiquitous, reliable and distributed Internet of Things (IoT) network for sensing and communicating, which is gradually driving the evolution of ecosystem research, and massive amounts of time series sensor data have been collected [1]-[3]. According to a published report, the total amount of global Earth monitoring data is increasing exponentially each year, and the IDC report predicts that global data may reach approximately 163 ZB (Sen and Jayawardena [4]). Outlier detection in bulk sensor-collected data has been a matter of great concern and a major challenge. In particular, devices deployed in high-altitude and harsh regions often generate spatiotemporal variations in networked sensor data [5], [6]. In addition, the sensed data are largely affected by the environment of the underlying surface of the atmosphere in cold and arid regions, where high uncertainty can be caused by local climate change with non-stationary and nonlinear characteristics [7].
The associate editor coordinating the review of this manuscript and approving it for publication was Heng Wang.
Data outliers have posed a considerable challenge for scientific research. It is therefore of practical importance to develop a suitable outlier detection approach for sensor data.
A sensor data outlier is defined as an observed value that is far from the others. Outlier detection focuses on the process of discovering data deviations [8]-[10]. In fact, outlier detection and processing play vital roles in identifying abnormal patterns and have been applied in many different fields, such as process control [11], environmental monitoring [12] and traffic monitoring [13]. Many existing detection methods, based on hypothesis tests that set threshold values for outliers, identify outliers by uniformly inspecting the main characteristics of a set of objects [14]; these include distance-based methods [15], K-nearest neighbor (KNN) methods and prediction-based methods [16]. The autoregressive (AR), autoregressive moving average (ARMA) and autoregressive integrated moving average (ARIMA) models, which are based on statistics, have been used to detect outliers in complicated multivariate sensor data involving single-variable time series [17]-[19].
Similarly, the physical, statistical, and machine learning models that have been developed to detect outliers are not sufficiently capable of analyzing non-stationary data [20]-[23]. Signal decomposition is a processing method that describes stationary and non-stationary relationships. This approach decomposes non-stationary sensor data into stationary data and retains the structure of the raw data. Therefore, to solve the problems noted above, some signal processing methods, such as empirical mode decomposition (EMD), complementary ensemble EMD (CEEMD), ensemble EMD (EEMD), variational mode decomposition (VMD) and wavelet transform (WT), have been widely applied to recursively decompose data into different intrinsic modes and improve the effectiveness of outlier detection [24]-[26]. To a large extent, however, signal processing methods alone have a limited capacity to improve the performance and accuracy of a detection model. Therefore, researchers have extended these methods; for instance, a hybrid model based on EMD and AR, aimed at transforming data from the time domain to the frequency domain, was successfully applied for outlier detection to assess the construction and precisely track the frequency of signals [27]. WT provides a high temporal resolution in the high-frequency range of a time series signal. However, WT-based outlier detection has shortcomings in analyses of big data, and WT is time consuming compared to existing models [28]. Similarly, researchers have extended the applications of EMD to process non-stationary sensor data due to its prominent advantages [29].
Although single, hybrid and combined methods have achieved some success, the existing approaches have not achieved exceptional performance. Considering the above shortcomings, a high-level outlier detection model called the MF-EMD-CART-AR-EWMA model is presented in this paper. In this model, a three-level ensemble method is leveraged, where the MF is used as the preprocessor to preliminarily screen a data series that contains outliers, such as large sudden changes. EMD is chosen due to its flexibility in processing nonstationary data. CART and AR are employed as the base learners for the prediction task, and the EWMA control chart is then used to detect outliers. The proposed outlier detection model is designed with a black-box scenario in mind. We define outliers as deviations or significant changes in time-series sensor data. Specifically, values that fall outside the upper (UCL) and lower (LCL) control limits of the EWMA chart are flagged for further investigation and replaced with the predicted value.
The ultimate objective of the proposed approach is to provide a highly accurate and robust outlier detection model to overcome the challenges of large-scale sensor data. The model proposed in this paper aims not only to detect outliers but also to process them so that an improved dataset is obtained. To investigate and evaluate its performance, the proposed method was thoroughly evaluated and benchmarked on real sensor data from the hydrometeorological observation network in the Heihe River Basin.
The primary contributions of the proposed model are summarized as follows.
(a) One-step-ahead preprocessing for identifiable outliers.
Preprocessing is the first level of the proposed model for original data series analysis. In this step, the original data with obvious outliers, such as sudden extremes, are processed. We aim to address various real-world sensor data outlier challenges using the MF, thereby eliminating these patterns before the outlier analysis and modeling steps.
(b) Developing the EMD-CART-AR approach for second-level prediction.
EMD is used to decompose the preprocessed data into new and stationary intrinsic mode functions (IMFs) with different features, and the CART and AR models are employed considering the characteristic scales of decomposed subsequences, which can promote the accuracy of the prediction model.
(c) Using an EWMA control chart to detect outliers in predicted data.
EWMA is introduced as the last model level, built on the first two levels, to identify the minor outliers in the predicted data. Taking advantage of the control parameters, the entire iterative process of the model can be effectively regulated.
(d) Applying comprehensive statistical indicators to evaluate the performance of the proposed model.
The proposed approach is applied to real-world data sets, and the results are evaluated with statistical indicators. The test includes four sets of data from the hydrometeorological observation network dataset. The results are also compared to those of other models, including SVR, KNN, CART, CEEMD-CART-AR, and EEMD-CART-AR, to assess the preprocessing and prediction performance of the proposed model.
The paper is organized as follows. In Section 2, the framework, main implementation steps and employed methodology are given. In Section 3, the data description and analysis are presented. In Section 4, the evaluation criteria used in this paper and the experimental results and discussion are introduced. Finally, a brief conclusion is given.

II. IMPLEMENTATION METHOD AND SCHEMATICS
A. SCHEMATICS OF THE THREE-LEVEL HYBRID MODEL
In this section, the adaptive outlier detection modeling approach is established for outlier detection in real-world data sets, and the schematics of the three-level hybrid model are shown in Fig. 1. The three levels are placed at different positions and have specific functions. The preprocessing level is the first level, and it preprocesses the original data that may be influenced by obvious outliers, such as large or small sudden variational patterns. The EMD-CART-AR level, as the second level located between the preprocessing and outlier detection levels, is a predictive model that provides input data for outlier detection. The final level, the EWMA detector, identifies possible minor outliers in the predictive output and is used to adjust the iterative procedure of the model.
The main steps of the model are as follows.
Step one: Conduct a preliminary test on the raw data and then preprocess the data with the MF. The preprocessed data are recorded as $Y(t)$.
Step two: Decompose the preprocessed data $Y(t)$ into $X(t)$ and $r(t)$ with EMD, and record the result as $Y(t) = X(t) + r(t)$, $t = 1, 2, \ldots, n$, where $X(t)$ is the high-frequency term, $r(t)$ is the trend term, and $n$ is the sample size.
Step three: Predict $X(t)$ and $r(t)$ with the CART and AR models, respectively. The high-frequency term $X(t)$ is predicted with the CART model, and the result is recorded as $\hat{x}(t)$. The trend term $r(t)$ is predicted with the AR model, and the result is recorded as $\hat{r}(t)$, $t = 1, 2, \ldots, n$, where $n$ is the sample size. The final predicted value is denoted as $\hat{y}(t) = \hat{x}(t) + \hat{r}(t)$.
Step four: Compare the real and predicted values, and calculate the residual sequence, namely ε = y(t) − ŷ(t).
Step five: Detect the outliers with the EWMA control chart, which is also used to control the entire iterative process of the model.
Last step: Process the outlier data with the proposed model and obtain clean data through the iteration and reconstruction of the proposed model.

B. METHODS
1) MEDIAN FILTER (MF)
The MF is an algorithm based on statistical theory that suppresses noise in nonlinear signal processing [30]. The basic principle of this algorithm is to replace the value of a point in a sequence with the median of the points in its neighborhood, which eliminates isolated noisy points. Suppose the data series $X(m) = \{x_1, x_2, \ldots, x_m\}$ is a signal, where $m$ is the size of the series. The time window length of the MF is defined as $n$. The process for the $j$th point is to take the $n$ samples centered on the $j$th point as the input values, reorder them by size, and generate a new sorted data sequence. The median value $X_j$ of this sequence is selected as the output of the filter. $n$ is typically an odd number; if $n$ is even, the output value is the mean of the two sample values at the middle positions.
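The windowed median described above can be illustrated with a minimal sketch. The function name `median_filter`, the default window length and the edge handling (the window is simply truncated at the ends of the series) are assumptions made for illustration, not details taken from the paper.

```python
from statistics import median


def median_filter(x, n=3):
    """Replace each sample with the median of the n samples centred on it.

    Assumptions: n is odd; near the ends the window is truncated, so it may
    hold an even number of samples, in which case statistics.median returns
    the mean of the two middle values.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("window length n must be a positive odd number")
    half = n // 2
    out = []
    for j in range(len(x)):
        lo = max(0, j - half)
        hi = min(len(x), j + half + 1)
        out.append(median(x[lo:hi]))
    return out


# An isolated spike (100) is replaced by a neighbouring value.
print(median_filter([1, 2, 100, 3, 4], n=3))
```

A longer window removes wider disturbances but also flattens genuine short-lived changes, which is why the window length has to be tuned to the data.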

2) EMPIRICAL MODE DECOMPOSITION (EMD)
EMD, proposed by Huang et al., is a signal processing method for decomposing a signal into IMFs [31], [32]. The algorithm smooths a signal and then decomposes the non-stationary signal into stationary series with different characteristic scales, each called an IMF [33]. An IMF must satisfy two conditions. First, over the whole data series, the number of extreme points and the number of zero crossings must be equal or differ by at most 1. Second, the data series must be locally symmetric about the time axis; namely, the local mean is zero at any time point.
The main processing steps in the model are as follows.
Step one: Find all the maximum and minimum points in X(t) (the original signal), and fit the upper and lower envelope curves with the cubic spline interpolation method.
Step two: Find the mean m(t) of the upper envelope and the lower envelope.
Step three: Subtract the mean m(t) from the original series to obtain the new series c(t), namely, c(t) = X(t) − m(t).
Step four: Determine whether c(t) meets the IMF conditions: if the conditions are met, separate c(t) and obtain the remainder r(t), namely, r(t) = X(t) − c(t); if the conditions are not met, take c(t) as the new signal, and repeat Step one to Step three until the conditions are met.
Step five: Take the obtained r(t) as the new original series and repeat Step one to Step four. Finally, obtain a finite number of IMF components and a trend component.
After the process above is implemented, the signal with random non-stationarity is decomposed into several stationary IMF components and a trend component, as shown in Eq. (1).
$$x(t) = \sum_{i=1}^{n} c_i(t) + r(t) \quad (1)$$
In Eq. (1), $c_i(t)$ refers to the $i$th IMF component, representing the signal components with different characteristic scales in the original signal $x(t)$, and $r(t)$ refers to the trend component, reflecting the trend of the original signal $x(t)$. Therefore, the signal $x(t)$ can be decomposed into $n$ stationary components (IMFs) with different characteristic scales and a trend term.
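The sifting procedure can be sketched in a few lines of dependency-free code. This is a simplified illustration, not the paper's implementation: the envelopes are piecewise linear instead of cubic splines, a fixed number of sifting passes replaces the check of the IMF conditions, and all names (`local_extrema`, `envelope`, `sift_once`, `emd`, `max_imfs`, `n_sift`) are hypothetical.

```python
def local_extrema(x):
    """Indices of strict interior local maxima and minima."""
    maxima, minima = [], []
    for i in range(1, len(x) - 1):
        if x[i] > x[i - 1] and x[i] > x[i + 1]:
            maxima.append(i)
        elif x[i] < x[i - 1] and x[i] < x[i + 1]:
            minima.append(i)
    return maxima, minima


def envelope(x, idx):
    """Piecewise-linear envelope through x at idx (assumption: the paper
    uses cubic splines); both ends are pinned to the signal itself."""
    knots = [0] + idx + [len(x) - 1]
    env = [0.0] * len(x)
    for a, b in zip(knots, knots[1:]):
        for t in range(a, b + 1):
            w = (t - a) / (b - a)
            env[t] = (1 - w) * x[a] + w * x[b]
    return env


def sift_once(x):
    """Steps one to three: c(t) = x(t) - m(t), m = mean of the envelopes."""
    maxima, minima = local_extrema(x)
    upper, lower = envelope(x, maxima), envelope(x, minima)
    return [xi - (u + l) / 2 for xi, u, l in zip(x, upper, lower)]


def emd(x, max_imfs=8, n_sift=10):
    """Return (imfs, residue). A fixed number of sifting passes stands in
    for the IMF-condition check of Step four (an assumption)."""
    imfs, residue = [], list(x)
    for _ in range(max_imfs):
        maxima, minima = local_extrema(residue)
        if len(maxima) + len(minima) < 3:  # too few extrema: only trend left
            break
        c = residue
        for _ in range(n_sift):
            c = sift_once(c)
        imfs.append(c)
        residue = [r - ci for r, ci in zip(residue, c)]
    return imfs, residue
```

By construction, summing the returned IMFs and the residue reproduces the input series, which mirrors the additive form of Eq. (1).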

3) CLASSIFICATION AND REGRESSION TREE (CART)
As a typical classification algorithm, the CART method is a supervised non-parametric classification method that creates a binary tree based on a simple model and easily implemented extraction rules to obtain predictions [34]. The CART algorithm has been widely applied in classification and prediction tasks [35]. The attribute for the root node is first found according to the Gini index, and a tree is created from the top to the bottom in a recursive manner until every sample subset established after division is pure. The leaf nodes of the decision tree represent the categories of information associated with the sample, and each path along a branch from the root node to a leaf node represents a rule. A complete binary tree refers to a rule set. Essentially, the decision tree classifies data with a series of rules. The main decision trees are binary branched trees and multibranch trees, and the former is used in this research because of its search flexibility.
The following concepts were used to construct the CART. For all the sample data, a tree with many levels and leaf nodes is created to fully reflect the relations among the data (at this moment, the data relations reflected by the tree are often influenced by overtraining). By pruning the tree, a series of subtrees is created, from which the trees of appropriate size are selected to classify the data.
The main process of the model is as follows.
Step one: Input the training dataset D.
Step two: Output the CART f (x).
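In the input space of the training dataset, divide every region into two subregions recursively and determine the output value of each subregion to create the corresponding binary decision tree.
1) Choose the optimal segmentation variable $j$ and segmentation point $s$ by solving Eq. (2):
$$\min_{j,s}\left[\min_{c_1}\sum_{x_i\in R_1(j,s)}(y_i-c_1)^2+\min_{c_2}\sum_{x_i\in R_2(j,s)}(y_i-c_2)^2\right] \quad (2)$$
Traverse the variables $j$ and, for each fixed $j$, scan the segmentation points $s$; then, obtain the pair $(j, s)$ that minimizes Eq. (2).
2) Divide the region with the chosen $(j, s)$, and determine the corresponding output values, as shown in Eq. (3),
$$R_1(j,s)=\{x\mid x^{(j)}\le s\},\quad R_2(j,s)=\{x\mid x^{(j)}>s\},\quad \hat{c}_m=\frac{1}{N_m}\sum_{x_i\in R_m(j,s)}y_i,\quad m=1,2 \quad (3)$$
where $N_m$ is the number of samples in $R_m$.
3) Repeat steps 1) and 2) on the two subregions until the stopping condition is met.
4) Divide the input space into $M$ regions $R_1, R_2, \ldots, R_M$ and generate the decision tree, as shown in Eq. (4).
$$f(x)=\sum_{m=1}^{M}\hat{c}_m I(x\in R_m) \quad (4)$$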
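The split search and recursive partitioning can be illustrated for a single input variable with squared-error loss. This is a sketch under stated assumptions: candidate split points are midpoints between consecutive distinct values, the stopping rule is a maximum depth plus a minimum node size, and the names `best_split`, `build_tree`, `predict`, `max_depth` and `min_size` are hypothetical rather than taken from the paper.

```python
def best_split(xs, ys):
    """Scan split points s for one variable and return (s, c1, c2, loss)
    minimising the summed squared error of the two subregions.
    Returns None when all x values are identical."""
    pairs = sorted(zip(xs, ys))
    best = None
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold separates equal x values
        s = (pairs[k - 1][0] + pairs[k][0]) / 2  # midpoint candidate (assumption)
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        c1 = sum(left) / len(left)    # output value of region R1
        c2 = sum(right) / len(right)  # output value of region R2
        loss = sum((y - c1) ** 2 for y in left) + sum((y - c2) ** 2 for y in right)
        if best is None or loss < best[3]:
            best = (s, c1, c2, loss)
    return best


def build_tree(xs, ys, depth=0, max_depth=3, min_size=2):
    """Recursively partition the data; a leaf stores the mean output."""
    split = None
    if depth < max_depth and len(ys) >= 2 * min_size:
        split = best_split(xs, ys)
    if split is None:
        return sum(ys) / len(ys)
    s = split[0]
    left = [(x, y) for x, y in zip(xs, ys) if x <= s]
    right = [(x, y) for x, y in zip(xs, ys) if x > s]
    return (
        s,
        build_tree([x for x, _ in left], [y for _, y in left], depth + 1, max_depth, min_size),
        build_tree([x for x, _ in right], [y for _, y in right], depth + 1, max_depth, min_size),
    )


def predict(tree, x):
    """Follow the split rules from the root down to a leaf value."""
    while isinstance(tree, tuple):
        s, left, right = tree
        tree = left if x <= s else right
    return tree
```

For time series prediction, the inputs would typically be lagged values of the series, in which case the scan is repeated over every lag variable $j$ and the best pair $(j, s)$ is kept.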
The EWMA control chart is introduced in this work as a prediction-based detector; it is more robust in detecting minor outliers than traditional control charts, e.g., the Shewhart and cumulative sum (CUSUM) control charts. The EWMA chart, proposed by Roberts in 1959, assigns the maximum weight to the most recent observed value [36]. Due to its flexibility and reliability in monitoring small shifts in parameters, this control chart has been widely applied [37]. The Shewhart control chart tends to miss minor outliers among slight fluctuations. The CUSUM control chart performs better than the Shewhart control chart in detecting slight fluctuations. However, for CUSUM, two adjacent statistics are strongly correlated because they differ by only one sample, and when the mean and variance of the sample cannot be accurately estimated, the analysis is weakened. In contrast, the EWMA control chart is flexible and has a strong detection ability for small fluctuations and gradual drifts. Compared with traditional outlier detection methods, the EWMA control chart provides excellent performance in identifying small fluctuations and slow shift processes; therefore, it is highly suitable for prediction-based outlier detection [38]. In particular, the outlier detection was designed to be as robust as possible and to allow accurate detection in time-series sensor data [39]. Therefore, we propose an adaptive outlier detector tightly coupled to the prediction-based estimator to detect minor outliers and close the detection iterations. The EWMA control chart employed in the MF-EMD-CART-AR model detects possible minor outliers in the prediction process and is also used to regulate the model iterations.
The EWMA control chart can be expressed as shown in Eq. (5),
$$Z_i = \lambda X_i + (1-\lambda) Z_{i-1} \quad (5)$$
where $\lambda$ is a constant constrained by $0 < \lambda \le 1$ and $X_1, X_2, \ldots, X_n$ compose a sample of observed values. The target value of the process is usually taken as the initial value, $Z_0 = \mu$. Alternatively, the mean of the initial data can serve as the initial value, namely, $Z_0 = \bar{X}$. If the observed values $X_i$ are independent random variables with the same variance $\sigma^2$, then the variance of $Z_i$ is as shown in Eq. (6).
$$\sigma_{Z_i}^2 = \sigma^2\,\frac{\lambda}{2-\lambda}\left[1-(1-\lambda)^{2i}\right] \quad (6)$$
Therefore, the EWMA control chart is constructed with a monitoring index based on the $Z_i$ statistics, and the upper and lower control limits are shown in Eq. (7),
$$UCL = \mu + L\sigma\sqrt{\frac{\lambda}{2-\lambda}\left[1-(1-\lambda)^{2i}\right]},\qquad LCL = \mu - L\sigma\sqrt{\frac{\lambda}{2-\lambda}\left[1-(1-\lambda)^{2i}\right]} \quad (7)$$
where $L$ refers to the regulatory factor selected to ensure that the expected $ARL_0$ can be achieved. As $i$ increases, the control limits converge to $\mu \pm L\sigma\sqrt{\lambda/(2-\lambda)}$. The process parameters of the EWMA control chart are $L$ and $\lambda$, and the ARL properties of the chart have accordingly been studied in detail for different design parameters. Generally, when $0.05 \le \lambda \le 0.25$ [37], the EWMA control chart provides excellent detection performance. According to practical experience, the value of $\lambda$ is generally kept relatively small to make the control chart flexible and effective.
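The EWMA statistic and its time-varying control limits can be computed with a short sketch. The function name `ewma_chart` and the default values `lam=0.2` and `L=3.0` are illustrative assumptions; the values actually used in the paper are listed in its Table 1.

```python
import math


def ewma_chart(x, mu, sigma, lam=0.2, L=3.0):
    """Return one (z, lcl, ucl, out_of_control) tuple per observation.

    z follows Z_i = lam * X_i + (1 - lam) * Z_{i-1} with Z_0 = mu, and the
    limits are mu -/+ L * sigma * sqrt(lam / (2 - lam) * (1 - (1 - lam)^(2i))).
    mu and sigma are assumed to be known or estimated from in-control data.
    """
    z = mu
    rows = []
    for i, xi in enumerate(x, start=1):
        z = lam * xi + (1 - lam) * z
        half_width = L * sigma * math.sqrt(
            lam / (2 - lam) * (1 - (1 - lam) ** (2 * i))
        )
        rows.append((z, mu - half_width, mu + half_width, abs(z - mu) > half_width))
    return rows


# Five in-control points followed by a sustained shift of five sigma.
for row in ewma_chart([0.0] * 5 + [5.0] * 5, mu=0.0, sigma=1.0):
    print(row)
```

In the setting of this paper, `x` would be the residual sequence between the real and predicted values, so a flagged point marks a residual that has drifted outside the control limits. Because the statistic carries memory of earlier samples, it can stay outside the limits for several points after a large deviation.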

III. EXPERIMENT AND ANALYSIS
A. DATASETS
In this section, the sensor data from the hydrometeorological observation network in the Heihe River Basin, an endorheic basin located in the arid and semiarid regions of Northwest China [40]-[42], are used to verify the accuracy and robustness of the proposed model. The hydrometeorological observation network currently transmits approximately 200,000 recorded values per day collected from sensor devices, such as temperature and humidity sensors, wind speed and direction sensors and soil moisture sensors. Moreover, changing seasonal factors result in non-stationary and nonlinear characteristics in the sensor data. Therefore, we employ four sets of data from different sites and with different collection times and sample sizes. These datasets are independently used to evaluate the proposed approach.
To evaluate the generality of the proposed prediction model, for each experimental case, we evaluated two sample sizes: 7-day temperature and humidity sensor data samples (1008 data points) from the Daman superstation and 10-day data samples (1440 data points) from the Arou superstation, obtained for different time periods. Then, 80% of the data are randomly selected for training the EMD-CART-AR model. The remainder of the data are used as test sets to evaluate the performance of the proposed model. The locations of the Daman and Arou superstations can be seen in Fig. 2.

1) DAMAN SUPERSTATION DATASET
Daman superstation (altitude: 1556 m; 100.3722°E, 38.8555°N) is located in the Dagan Irrigation District of Wuxing Village, Xiaoman Town, Zhangye City, Gansu Province, China, and consists of a meteorological element gradient observation system, an eddy-covariance system, 2 large-aperture scintillometers, a lysimeter, a cosmic-ray soil moisture observation system and nine soil moisture wireless sensor network nodes. The temperature and humidity data from the Daman superstation dataset were selected from May 5 to 11, 2018, and from December 31, 2016, to January 6, 2017.
The recurrence plot (RP) is an important method for analyzing the periodicity, chaos and nonstationarity of time-series data. Specifically, an RP depicts black and white points on a square time plane, where a black point indicates that the states at the corresponding times on the horizontal and vertical axes recur, while a white point indicates that no recurrence occurs [43]. Therefore, the RP can be used to analyze the nonstationary and nonlinear characteristics of time-series data. For a stationary time series, the corresponding RP is uniformly distributed, whereas the RP of a nonstationary time series is nonuniformly distributed. The RP of the temperature and humidity sensor data of the Daman superstation dataset is given in Fig. 3. It can be seen in Fig. 3-a that the RP has large white or blue areas, which indicate that the time-series data undergo a large mutation during this period and that the data are relatively stable for a period of time before and after the sudden change. In Fig. 3-b, the nonuniform characteristics of the data are relatively weak compared with Fig. 3-a.
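A recurrence matrix of the kind plotted in an RP can be computed directly from a series. The sketch below assumes scalar states without time-delay embedding and a fixed distance threshold `eps`; both choices, as well as the names `recurrence_matrix` and `recurrence_rate`, are illustrative and are not taken from the paper.

```python
def recurrence_matrix(x, eps):
    """R[i][j] = 1 when the states at times i and j are within eps
    (a black point in the RP), otherwise 0 (a white point)."""
    n = len(x)
    return [[1 if abs(x[i] - x[j]) <= eps else 0 for j in range(n)] for i in range(n)]


def recurrence_rate(matrix):
    """Fraction of recurrent points, a simple summary of the plot density."""
    n = len(matrix)
    return sum(sum(row) for row in matrix) / (n * n)


# A level shift between the 3rd and 4th samples produces two dense diagonal
# blocks and empty off-diagonal blocks, i.e. a nonuniform plot.
R = recurrence_matrix([0.0, 0.1, 0.0, 5.0, 5.1, 5.0], eps=0.2)
for row in R:
    print(row)
```

A sudden change in the series separates the matrix into blocks in this way, which is the kind of nonuniform pattern the text associates with nonstationary data.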

2) AROU SUPERSTATION DATASET
Arou superstation (altitude: 3033 m; 100.4572°E, 38.0384°N) is located in Arou Village, Qilian County, Qinghai Province, China (Che et al., 2019), and consists of a meteorological element gradient observation system, an eddy-covariance system, 2 large-aperture scintillometers, a weighing-type rain gauge, a vegetation phenology observation system, a cosmic-ray soil moisture observation system and 16 soil moisture wireless sensor network nodes. Due to the high altitude of the location, low average annual temperature and poor observation conditions at Arou, outliers are common in the sensor data collected from Arou superstation. To further verify the robustness and applicability of the model, an experiment was conducted on the temperature and humidity data collected from Arou, and the samples were selected from November 1 to 10, 2017, and from September 6 to 16, 2017.
Similarly, Fig. 4 shows the RPs of temperature and humidity at the Arou superstation. According to the figure, the nonstationarity of the temperature data is obvious, and the humidity data are weakly nonstationary. The nonuniform distributions of the temperature and humidity data RPs further reflect the nonstationary characteristics of the sensor data. In general, the RP analysis suggests that the experimental data have obvious nonstationary characteristics.
The MF is used first to preprocess the obviously visible outliers in the raw data, and the EMD-CART-AR model is then employed for prediction. Finally, the EWMA method is used to detect the outliers. The detailed data outlier detection results are presented in the next section.

1) PARAMETER SETTING
In each experiment, all the data are first preprocessed with the MF. The filter window length of the MF needs to be adjusted according to the characteristics of the data. Here, we chose filter windows with different lengths to assess the performance of data preprocessing [44]. Moreover, the scheme for partitioning the time-series data into a sequence of temporal windows is shown in Fig. 5. It can be seen that points 1 to k-1 of the first subset can be chosen to train the model, and points 2 to k of the same subset are selected for prediction with the trained model. After several adaptive iterations, the model mitigates interference and noise effects and becomes sufficiently stable. The parameters of EMD are obtained by employing the stopping criteria [45]. Grid searches are adopted to optimize the CART parameters and provide maximum prediction accuracy [46]. The parameters of the AR model are defined according to the autocorrelation coefficient and partial autocorrelation coefficient of the sample data. The (λ, L) values of EWMA are chosen based on a confidence level of 99.73% [37]. For all the methods, detailed parameter settings are described in Table 1.

2) MF PREPROCESSING RESULTS
This section presents the preprocessing procedure, the first level of the outlier detection model, which aims to preliminarily screen a data series that contains outliers, such as large sudden changes. To address various real-world data outlier challenges, these outliers should be eliminated before outlier analysis and modeling. In this context, the MF is used to preprocess the visible outliers in $Y_{DamT}(t)$, $Y_{DamH}(t)$, $Y_{ArouT}(t)$ and $Y_{ArouH}(t)$, where these data series are selected from the temperature and humidity datasets from the Daman and Arou superstations. To highlight the advantages of the MF in processing non-stationary data, several outliers are randomly added to the historical temperature and humidity data. In practical applications, unprocessed historical data can also be assessed by outlier detection models.
The results of the data preprocessed by the MF are shown in Fig. 6 and Fig. 7, in which the red curve refers to the preprocessed data, the blue curve refers to the raw data and the hollow circles are outliers. The obviously discernible outliers that are too high or too low are processed, and elsewhere the red curve almost coincides with the blue curve. The results confirm that the scheme used in this paper yields high accuracy. This finding suggests that the MF is suitable for the outlier processing of sensor data, with the capability for fusing, denoising and smoothing to a certain extent. Notably, the MF is a nonlinear smoothing technique with a selection adjustment scheme based on a filter window, and the value of each data point is set to the median of all data points in a certain neighborhood window of that data point. As a result, an outlier value in a data series is replaced by the median value of its neighborhood window.

3) EMD DECOMPOSITION RESULTS
In this section, the temperature and humidity sensor data are regarded as time series signals, and the EMD method is introduced to decompose the preprocessed data series, i.e., $Y_{DamT}(t)$, $Y_{DamH}(t)$, $Y_{ArouT}(t)$ and $Y_{ArouH}(t)$. Fig. 8 shows that $Y_{DamT}(t)$ decomposed by EMD comprises 6 IMF components $IMF_i$ ($i = 1, 2, \ldots, 6$) and a trend term $r_{DamT}(t)$. To obtain relatively stationary original data and a locally stationary trend, the $IMF_i$ are reconstructed as $X_{DamT}(t) = \sum_{i=1}^{6} IMF_i$. $X_{DamT}(t)$ displays an undulation trend similar to that of $Y_{DamT}(t)$. Similarly, the high-frequency term $X_{DamH}(t) = \sum_{i=1}^{7} IMF_i$ and the trend term $r_{DamH}(t)$ are obtained by reconstructing the humidity data decomposed with the EMD method. $Y_{ArouT}(t)$ and $Y_{ArouH}(t)$ are also decomposed by the EMD model. $Y_{ArouT}(t)$ is decomposed into 10 components, including 9 IMFs, namely, $IMF_i$ ($i = 1, 2, \ldots, 9$), and one trend term $r_{ArouT}(t)$. Similarly, $Y_{ArouH}(t)$ is decomposed into 10 components, including 9 IMFs, namely, $IMF_i$ ($i = 1, 2, \ldots, 9$), and one trend term $r_{ArouH}(t)$, which are given in Fig. 9. To obtain relatively stable original data and a locally stationary trend, the IMFs are reconstructed: $X_{ArouT}(t) = \sum_{i=1}^{7} IMF_i$ and the corresponding IMF sum $X_{ArouH}(t)$ are recorded as the high-frequency terms, and $r_{ArouT}(t)$ and $r_{ArouH}(t)$ are recorded as the trend terms.
The basic concept of employing EMD for predictions involves decomposing sequence data into IMF components and trend terms.The separated trend terms at different scales can reduce the complexity of the time series, and the divided IMF components are able to maintain the unique physical meaning and stationarity of the data [47].Thus, EMD is able to improve the prediction accuracy in specific time horizons based on this approach.
Processing small-sample time series data with the CART model is effective. Therefore, the CART model is used to predict the high-frequency terms $X_{DamT}(t)$, $X_{DamH}(t)$, $X_{ArouT}(t)$ and $X_{ArouH}(t)$. As the most common analysis model for time series, the AR model, which is characterized by simplicity and high accuracy, is well suited to predictions involving locally stationary data, such as $r_{DamT}(t)$, $r_{DamH}(t)$, $r_{ArouT}(t)$ and $r_{ArouH}(t)$. The detailed data processing results are presented in the next section.
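As an illustration of how an AR model can forecast a trend term, the sketch below fits a first-order model $r_t = c + \varphi r_{t-1}$ by ordinary least squares. The order of one is an assumption made for brevity (the paper selects the AR parameters from the autocorrelation and partial correlation coefficients), and the names `fit_ar1` and `predict_next` are hypothetical.

```python
def fit_ar1(r):
    """Least-squares fit of r[t] = c + phi * r[t-1]; returns (c, phi).

    Assumes the series has at least three points and is not constant.
    """
    x, y = r[:-1], r[1:]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    phi = sxy / sxx
    c = mean_y - phi * mean_x
    return c, phi


def predict_next(r, c, phi):
    """One-step-ahead forecast from the last observed value."""
    return c + phi * r[-1]


# A noise-free series generated by r[t] = 1 + 0.5 * r[t-1].
trend = [0.0]
for _ in range(9):
    trend.append(1.0 + 0.5 * trend[-1])
c, phi = fit_ar1(trend)
print(c, phi, predict_next(trend, c, phi))
```

In the hybrid model, a forecast of this kind for the trend term would be added to the CART forecast of the high-frequency term to form the final predicted value.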

4) EMD-CART-AR PREDICTION RESULTS
In the CART prediction model, the curve smoothness and error degree are taken into consideration for the prediction of the nonlinear data series [48]. Therefore, the high-frequency terms $X_{DamT}(t)$, $X_{DamH}(t)$, $X_{ArouT}(t)$ and $X_{ArouH}(t)$ are predicted by the CART model. The AR model is used for time series prediction and analysis of the trend terms. The experimental results indicate that the EMD-CART-AR hybrid model proposed in this paper reduces the prediction error effectively and demonstrates excellent prediction ability in processing non-stationary time series.

5) EWMA OUTLIER DETECTION AND PROCESSING RESULTS
As noted earlier, the EWMA control chart is not affected by the mean value of a dataset and is widely used in the processing of time series data; additionally, the random error is assumed to conform to a normal distribution with a given mean and variance $\delta^2$ [50]. The robust EWMA control chart employed in this section detects outliers and identifies minor errors in the residual series. The proposed detection model architecture and EWMA approach considered in this paper aim not only to keep the error within a reasonable range but also to effectively regulate the entire iterative process of the model and achieve continuous detection and processing. At the outlier detection and processing stage, the UCL and LCL of the EWMA control chart for the four groups of experimental test data are calculated at the confidence level of 99.73% (3δ), and the (λ, L) values of EWMA are presented in Table 1.
The detection results for the Daman and Arou superstation data set are presented in Fig. 12 and 13.Fig. 12-a shows that the upper and lower limits of the EWMA control chart are approximately 0.6125 and −0.6125, respectively.According to the figure, the residual error obtained from the predicted and real values is within the upper and lower limits.As a result, the error range between the predicted and real values are 0.5.Similarly, in Fig. 12-b, the upper and lower limits of the EWMA control chart are 1.6 and −1.4,respectively, and the error range of the humidity data is 2.5.
The results for the Arou superstation data are shown in Fig. 13. The upper and lower limits of temperature are approximately 1 and −1, respectively, in Fig. 13-a, and the error is almost zero. As shown in Fig. 13-b, the upper and lower limits of the humidity data are approximately 5 and −5, respectively, and several outliers are clearly marked, but these values are not shown in Fig. 9-b. For instance, from 0 to 125, 3 obvious abnormal points between the predicted and real values are present, especially the 121st point, with an error that reaches 9. This point was not marked in Fig. 12-b but was detected by the EWMA model.
These results suggest that the introduced approach, as an outlier detector, is effective in detecting outliers in time series and predicted values. The proposed model also targets the processing of outliers. Specifically, obvious outliers can be preprocessed by the first level of the proposed three-level adaptive detection system. Additionally, the conditions of the actual values can be analyzed preliminarily based on the confidence level; detected outliers that fall outside the UCL and LCL control limits of the EWMA control chart are replaced with the predicted values and retained for further analysis. Moreover, outliers may still be triggered by systematic noise and sensor faults. As a result, the preprocessor and detector, i.e., the first level, which preprocesses obvious outliers, and the EWMA control chart, which detects minor outliers, can also serve as an alarm in some special applications.
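The EWMA detection and replacement step described above can be sketched as follows. This is a minimal illustration using the textbook EWMA recursion with time-varying control limits, not the authors' implementation; the function names, the residual standard deviation `sigma`, and the values λ = 0.2 and L = 3 are placeholders (the paper's (λ, L) values are given in Table 1 and are not reproduced here).

```python
import math
from typing import List, Tuple


def ewma_chart(residuals: List[float], lam: float, L: float, sigma: float,
               target: float = 0.0) -> Tuple[List[float], List[float], List[float]]:
    """Return the EWMA statistic and its time-varying (LCL, UCL) per point."""
    z, lcl, ucl = [], [], []
    prev = target  # z_0 is initialised at the target (in-control) mean
    for t, e in enumerate(residuals, start=1):
        prev = lam * e + (1.0 - lam) * prev
        half_width = L * sigma * math.sqrt(
            lam / (2.0 - lam) * (1.0 - (1.0 - lam) ** (2 * t)))
        z.append(prev)
        lcl.append(target - half_width)
        ucl.append(target + half_width)
    return z, lcl, ucl


def detect_and_replace(actual: List[float], predicted: List[float],
                       lam: float = 0.2, L: float = 3.0,
                       sigma: float = 1.0) -> Tuple[List[bool], List[float]]:
    """Flag points whose EWMA statistic leaves the limits; replace them by the prediction."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    z, lcl, ucl = ewma_chart(residuals, lam, L, sigma)
    flags = [not (lo <= zi <= hi) for zi, lo, hi in zip(z, lcl, ucl)]
    cleaned = [p if f else a for a, p, f in zip(actual, predicted, flags)]
    return flags, cleaned


# Hypothetical readings: the fourth value deviates strongly from its prediction.
actual = [20.1, 19.9, 20.0, 29.0, 20.1]
predicted = [20.0, 20.0, 20.0, 20.0, 20.0]
flags, cleaned = detect_and_replace(actual, predicted, lam=0.2, L=3.0, sigma=0.5)
```

Because the EWMA statistic carries memory of earlier residuals, a large deviation can keep the statistic outside the limits for a few subsequent points; how such follow-on flags are handled is left open in this sketch.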

IV. DISCUSSION
This section discusses and analyzes the performance of the MF-EMD-CART-AR-EWMA model proposed in this paper, which involves three levels, as shown in Fig. 1. The MF, which is a single model, is used to preprocess outlier data in the first level. The second level includes a hybrid model, the EMD-CART-AR model, which is used to establish a prediction model. The last level identifies outliers based on detection with the EWMA control chart. The first two levels mainly improve the accuracy and robustness of the proposed model, and the third provides effective outlier detection and iterative control for regulating the allowable error range by adjusting the parameters of the EWMA control chart. Therefore, the performance of each of the three levels of the prototype model proposed in this paper is analyzed and evaluated.
To clearly illustrate the performance, stability and robustness of the MF-EMD-CART-AR-EWMA model, the temperature and humidity data collected at the Daman and Arou superstations were chosen. A comparison of the MF and other signal processing methods, e.g., WT, the Butterworth filter,

A. EVALUATION METHODOLOGY
To evaluate the preprocessing and prediction ability of the model, three statistical indicators, namely, the root mean square error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE), were used [51]. The preprocessing and prediction accuracies reflect the consistency between the processed results and actual values, and these accuracies are usually reflected by error indicators. Therefore, the larger the error is, the lower the accuracy. The error is defined as ε = y(t) − ŷ(t), where y(t) is the actual value and ŷ(t) is the preprocessed or predicted value. When ε > 0, ŷ(t) underestimates the actual value; conversely, when ε < 0, ŷ(t) overestimates it; in both cases, the smaller |ε| is, the higher the accuracy. The metrics are shown in Eq. (8), Eq. (9), and Eq. (10). Performance comparisons based on the MF and other filter methods for the raw temperature and humidity data from the Daman and Arou superstations are given in Fig. 14 and Fig. 15.
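For reference, the three indicators named in the evaluation methodology can be computed as in the sketch below, which uses their standard definitions. The function names and sample values are illustrative only and are not taken from the paper's tables.

```python
import math
from typing import Sequence


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent; undefined if any actual value is 0."""
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical values chosen so the results are easy to verify by hand.
y_true = [10.0, 20.0, 40.0]
y_pred = [11.0, 18.0, 40.0]
m_mae = mae(y_true, y_pred)    # (1 + 2 + 0) / 3
m_rmse = rmse(y_true, y_pred)  # sqrt((1 + 4 + 0) / 3)
m_mape = mape(y_true, y_pred)  # 100 * (0.1 + 0.1 + 0) / 3
```

MAPE divides by the actual value, so it becomes unstable for readings close to zero, which can occur with temperature data.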
The results of the MF are compared to those of the Butterworth filter, the FIR filter, the moving average filter, WT, and the Wiener filter. Based on the preprocessing scheme used in this paper, the preprocessed data are close to the real data because the MF is a nonlinear smoothing technique that sets the value of a given data point to the median of all data values in the corresponding neighborhood window. As a result, some obvious outlier points are processed. The results suggest that the MF outperforms the other methods for both specific points and the whole dataset in terms of processing the series outliers. The statistical evaluation criteria for the filter methods, namely the MAE, RMSE, and MAPE, are shown in Tables 2 and 3. Similarly, the MF performs better than the other filters in processing the data outliers. The MAE, RMSE and MAPE of the data processed by the MF are smaller than those for the Butterworth filter, FIR filter, moving average filter and Wiener filter for both the temperature and humidity datasets from the Daman and Arou superstations. For the Daman superstation test set, the MF yields the maximum observed improvement over the FIR filtering results based on the temperature data, with improvements in the MAE, RMSE and MAPE of approximately 96.7%, 99.5% and 96.9%, respectively. Additionally, the observed improvements in the MAE, RMSE and MAPE were approximately 97%, 99.7% and 97.4%, respectively, for the humidity data.
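The neighborhood-median rule that defines the MF can be sketched in a few lines. The window length of 3, the truncation of the window at the series boundaries, and the name `median_filter` are assumptions for illustration; the paper's window setting is not stated in this section.

```python
from statistics import median
from typing import List


def median_filter(series: List[float], window: int = 3) -> List[float]:
    """Replace each point by the median of its centred neighbourhood window."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(series)
    out = []
    for i in range(n):
        lo = max(0, i - half)          # window is truncated at the left edge
        hi = min(n, i + half + 1)      # and at the right edge
        out.append(median(series[lo:hi]))
    return out


# Hypothetical temperature readings with one obvious spike at index 2.
raw = [20.0, 20.2, 35.0, 20.1, 19.9]
smoothed = median_filter(raw, window=3)
```

An isolated spike never becomes the median of a three-point window, so it is removed without blurring its neighbours, whereas a moving average would spread it over adjacent points.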
For the Arou superstation test set, compared with the Butterworth filter, the MF yielded a 93.3% improvement in MAE, a 99.6% improvement in RMSE and a 95.8% improvement in MAPE for the temperature data. Similarly, improvements of approximately 93.4% in MAE, 99.6% in RMSE and 93.3% in MAPE were obtained for the humidity data.
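The percentage improvements quoted in this section are presumably relative reductions in an error metric with respect to a baseline method. The paper does not spell out the formula, so the helper below is an assumption, and its inputs are made-up numbers rather than values from Tables 2 and 3.

```python
def improvement_pct(baseline_error: float, proposed_error: float) -> float:
    """Relative error reduction of the proposed method over a baseline, in percent."""
    if baseline_error <= 0:
        raise ValueError("baseline error must be positive")
    return 100.0 * (baseline_error - proposed_error) / baseline_error


# Hypothetical example: a baseline MAE of 2.0 reduced to 0.5.
example_improvement = improvement_pct(2.0, 0.5)
```

Under this definition, an improvement close to 100% means the proposed method's error is a small fraction of the baseline's error.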
Outliers influence the estimation of the parameters of the prediction model; therefore, to effectively detect data outliers, data preprocessing is emphasized in this paper to improve the accuracy of the prediction model. According to Tables 2 and 3, the MF has clear advantages in processing obvious outliers compared to the other methods assessed and displays stronger generalization ability and robustness.

C. RESULTS AND DISCUSSION OF THE PREDICTION MODEL
For the Daman superstation test sets, several models, e.g., MF-SVR, MF-CART, MF-EEMD-CART-AR, and MF-CEEMD-CART-AR, were assessed to evaluate the performance of the proposed scheme, the MF-EMD-CART-AR hybrid model [52]. The results predicted by the employed models and the original data are presented in Fig. 16. Notably, compared with the single models, such as MF-SVR and MF-CART, the hybrid models, such as MF-EMD-CART-AR, MF-EEMD-CART-AR, and MF-CEEMD-CART-AR, yield higher accuracy and better performance. For example, compared with MF-SVR, MF-CEEMD-CART-AR yielded
Similarly, a performance comparison of the EMD-CART-AR model and the single models (e.g., MF-SVR and MF-CART) for both specific points and the entire dataset is given in Fig. 16. In the figure, the blue curve shows the MF-EMD-CART-AR predictions, and the black curve shows the original data. The two curves largely coincide. In addition, to highlight the superiority of the EMD method, hybrid models based on the CEEMD and EEMD methods were constructed for comparison. As shown in Fig. 16, the MF-EMD method provides better prediction performance than MF-EEMD and MF-CEEMD. The results confirm that, to some extent, the EMD approach introduced in this paper performs better than EEMD and CEEMD in processing non-stationary data. Similarly, Table 4 shows that the MF-EMD-CART-AR model outperforms the MF-EEMD-CART-AR and MF-CEEMD-CART-AR models in terms of prediction ability. For example, compared with MF-CEEMD-CART-AR, MF-EMD-CART-AR yields a 43.7% improvement in MAE, a 42.6% improvement in RMSE and a 46.9% improvement in MAPE for the temperature data. Similarly, a comparison of MF-EMD-CART-AR and MF-CEEMD-CART-AR highlights improvements of 64.2%, 39.1%, and 69.2% in the MAE, RMSE, and MAPE, respectively, for the humidity data. According to Table 2, the results confirm that the developed model performs better than MF-EEMD-CART-AR and MF-CEEMD-CART-AR. This result suggests that the EMD method performs better in processing non-stationary data than do EEMD and CEEMD because it considers the dynamic behavior of sensor data, which has obvious physical meaning.
For the Arou superstation test set, as shown in Fig. 18, the applicability, generality and superiority of the MF-EMD-CART-AR model were further verified. Three evaluation criteria, the MAE, RMSE and MAPE, were used to compare the proposed model and other models. For the Arou superstation test set, the MF-KNN and MF-EMD-CART-AR models were compared, and the proposed model yielded a 68.2% improvement in MAE, an 83.9% average improvement in RMSE and a 97.6% improvement in MAPE for the temperature data, as well as 60.9%, 45.1% and 59.6% improvements in MAE, RMSE and MAPE, respectively, for the humidity data. The results using different prediction models are shown in Table 5. The findings show that the MF-EMD-CART-AR model proposed in this paper outperforms all others based on all three evaluation criteria.
The error measures for the MF-SVR, MF-KNN, MF-CART, MF-CEEMD-CART-AR, MF-EEMD-CART-AR and MF-EMD-CART-AR models based on data from the Daman and Arou superstations are shown in Fig. 17 and Fig. 18. Notably, the largest improvement was obtained by MF-EMD-CART-AR. The two figures also show that the errors associated with the MF-CEEMD-CART-AR and MF-EEMD-CART-AR results are lower than the errors for the single model results, such as MF-SVR, MF-KNN and MF-CART. The largest differences between the MF-CEEMD-CART-AR and MF-EMD-CART-AR MAE, RMSE and MAPE values were 43.7%, 36.1%, and 46.9%, respectively. Fig. 18 and Fig. 19 also show that the majority of the improvement in the overall error is due to the preprocessing and signal decomposition methods. The performance, which was evaluated based on the three criteria, confirms that the accuracy of the MF-EMD-CART-AR model is higher than that of the other models.
The prediction accuracy and statistical interpretation performance can be summarized as follows: a. the hybrid model can effectively provide predictions based on sensor data; b. the combination of the CART and AR models enhances the performance of the hybrid model; c. the comparison of the MF-EMD-CART-AR model and other models indicates that the proposed model displays superior performance; d. the comparison of the four sets of temperature and humidity experiments with different sampling times and sample numbers indicates that the MF-EMD-CART-AR model has good generalization ability; and e. as shown in Table 4 and Table 5, the model is accurate, broadly applicable, robust and effective. In summary, the MF-EMD-CART-AR model provides an effective method for outlier detection based on predictions for sensor data.

D. RESULTS AND DISCUSSION OF THE OUTLIER DETECTION MODEL
In this section, we use residual sequences of the real and predicted values of temperature and humidity taken from the Daman and Arou superstations to evaluate the ability of the detector. At the outlier detection stage, the UCL and LCL control limits of the employed methods are computed based on the confidence level of 99.73% (3δ). Tables 6-7 show the control limits of these methods. Fig. 20 shows the performance of the EWMA, CUSUM and Shewhart control charts based on prediction for detecting outliers, using grouped violin plots for all experimental test sets. Each bar is a sideways plot of the distribution of each DR or FR across each group of test sets.
From Table 6, it can be seen that the UCL and LCL control limits of MF-EMD-CART-AR-EWMA are (-0.4212, 0.3973) for temperature data and (-1.2405, 1.5086) for humidity data. This shows that the MF-EMD-CART-AR-EWMA method has the narrowest control limits compared to the others (e.g., MF-EMD-SVR-EWMA, MF-EEMD-CART-AR-EWMA, MF-CEEMD-CART-AR-EWMA, MF-SVR and MF-CART). Note that the three control charts have almost the same control limits but use different strategies for detecting outliers. Likewise, it can be seen in Table 7 that MF-EMD-CART-AR-EWMA has the narrowest control limits, (-0.6606, 0.6810) and (-3.4223, 3.3669), for temperature and humidity data, respectively. An important issue in a detection model is the accuracy of the prediction method: it changes the control limits of the EWMA, CUSUM and Shewhart control charts and thus affects the final detection results of the detector.
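To illustrate how prediction accuracy drives the control limits, the sketch below derives Shewhart-style limits (center ± L·σ) from two hypothetical residual series. The function name `shewhart_limits`, the use of the sample standard deviation, and the residual values are assumptions for illustration; the limits in Tables 6-7 may have been computed differently.

```python
from statistics import mean, stdev
from typing import List, Tuple


def shewhart_limits(residuals: List[float], L: float = 3.0) -> Tuple[float, float]:
    """Return (LCL, UCL) as the residual mean plus/minus L sample standard deviations."""
    center = mean(residuals)
    spread = stdev(residuals)
    return center - L * spread, center + L * spread


# Hypothetical residuals from an accurate and from a coarse predictor.
residuals_accurate = [0.1, -0.1, 0.05, -0.05, 0.0]
residuals_coarse = [1.0, -1.0, 0.5, -0.5, 0.0]

lcl_a, ucl_a = shewhart_limits(residuals_accurate)
lcl_c, ucl_c = shewhart_limits(residuals_coarse)
```

The more accurate predictor leaves smaller residuals and therefore narrower limits, so smaller deviations become detectable.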
For the Daman and Arou superstation test sets, MF-SVR, MF-CART, MF-EEMD-CART-AR, and MF-CEEMD-CART-AR are combined with the EWMA control chart to evaluate the performance of the proposed scheme, the MF-EMD-CART-AR-EWMA model. The results of the different detection schemes are presented in Tables 8-11. The findings presented in the tables are evaluated by the detection ratio (DR) and the fail-detection rate (FR). The DR is defined as the ratio of the number of points at which an outlier is detected to the total number of test points. The FR is the ratio of the number of points at which an outlier failed to be detected. The results confirm that the detection scheme introduced in this work achieves comparable performance with MF-SVR-EWMA, MF-CART-EWMA, MF-EEMD-CART-AR-EWMA, and MF-CEEMD-CART-AR-EWMA across all dataset groups. This is because the proposed MF-EMD-CART-AR has superior performance compared to MF-SVR, MF-CART, MF-EEMD-CART-AR, and MF-CEEMD-CART-AR. As a result, MF-EMD-CART-AR-EWMA achieves good accuracy, thereby reducing failed detections in the outlier detection model.
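The two detection measures can be sketched as below. The DR follows the definition in the text; for the FR, the denominator (total number of test points) and the availability of ground-truth outlier labels are assumptions, since the text does not state them. The names and data are hypothetical.

```python
from typing import List, Tuple


def detection_rates(flags: List[bool], truth: List[bool]) -> Tuple[float, float]:
    """Return (DR, FR) for detector flags against assumed ground-truth labels."""
    n = len(flags)
    if n == 0 or n != len(truth):
        raise ValueError("flags and truth must be non-empty and of equal length")
    detected = sum(1 for f in flags if f)                        # points flagged as outliers
    missed = sum(1 for f, t in zip(flags, truth) if t and not f)  # true outliers not flagged
    return detected / n, missed / n


# Hypothetical test set of 10 points: two flags raised, one true outlier missed.
flags = [False, True, False, False, True, False, False, False, False, False]
truth = [False, True, False, True, False, False, False, False, False, False]
dr, fr = detection_rates(flags, truth)
```

Note that, under this reading, a high DR is not necessarily good: a chart that flags many normal points also raises the DR, which is why the two measures are reported together.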
Similarly, to assess the functionality and performance of the proposed detection method, contrast tests were performed with the CUSUM control chart and the Shewhart control chart. A performance comparison of DR and FR is given in Tables 8-11. Note that none of the methods can be said to be consistently superior across the four groups of test sets. For example, compared with the Shewhart control chart, EWMA has almost the same DR in all datasets. Meanwhile, MF-EMD-CART-AR-EWMA and MF-EMD-CART-AR-Shewhart show superior performance in general, with comparably low DR and FR compared to MF-EMD-CART-AR-CUSUM. MF-EMD-CART-AR-CUSUM has high DR and FR due to its detection strategy. Therefore, the CUSUM control chart is not particularly suited for time-series sensor data. Our adaptive methods offer a great improvement in detection rate compared to MF-EMD-CART-AR-CUSUM. Additionally, both the EWMA and Shewhart control charts are process control strategies for monitoring outliers; however, whereas the Shewhart control chart assumes that observations obey a Gaussian distribution, the EWMA control chart is robust against this assumption and particularly suited for time-series data. Thus, we employed the EWMA control chart as the outlier detection method to achieve an adaptive outlier detection approach for time-series sensor data.
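One way to see why a cumulative strategy can flag many points is the standard tabular CUSUM, sketched below. The reference value `k`, the decision interval `h`, the absence of a reset after a signal, and the residual values are all assumptions; the CUSUM settings used in the paper's experiments are not given in this section.

```python
from typing import List


def cusum_flags(residuals: List[float], target: float = 0.0,
                k: float = 0.5, h: float = 4.0) -> List[bool]:
    """Two-sided tabular CUSUM; a point is flagged while either sum exceeds h."""
    c_pos = 0.0  # accumulates deviations above target + k
    c_neg = 0.0  # accumulates deviations below target - k
    flags = []
    for e in residuals:
        c_pos = max(0.0, c_pos + (e - target) - k)
        c_neg = max(0.0, c_neg - (e - target) - k)
        flags.append(c_pos > h or c_neg > h)
    return flags


# Hypothetical residuals with a single spike at index 3.
residuals = [0.0, 0.1, -0.1, 6.0, 0.2, 0.1, 0.0]
cusum_result = cusum_flags(residuals, k=0.5, h=4.0)
```

In this toy series the single spike keeps the upper cumulative sum above `h` for the remaining points, so four points are flagged although only one is anomalous; this accumulation is one plausible reading of the high DR attributed to CUSUM above.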

V. CONCLUSION
In this paper, the proposed three-level hybrid model, which integrates preprocessing, prediction and outlier detection tasks, achieves excellent performance in outlier detection for non-stationary and nonlinear data collected by the networked sensors of environment monitoring networks. To address the sensitivity of the prediction model with respect to outliers, preliminary screening based on the MF method is conducted as the first level of the model, and this approach significantly outperforms five other methods in preprocessing obvious outliers. EMD can decompose non-stationary data into stationary data series, and the prediction model simultaneously considers the accuracy and robustness of the prediction result. In this context, the EMD-CART-AR prediction model is proposed as the second level of the model, and it outperforms other models in predictions based on sensor data. For instance, compared with a single model, e.g., MF-SVR, the maximum observed improvements for temperature data from Daman superstation obtained by applying MF-EMD-CART-AR are approximately 67.1% for MAE, 61.1% for RMSE and 65.3% for MAPE, and compared with hybrid models, e.g., MF-CEEMD-CART-AR, the improvements in the MAE, RMSE, and MAPE are 43.7%, 36.1%, and 46.9% for the humidity data, respectively. Then, an EWMA control chart, as the last level in the model, is formulated to detect minor deviations in the data. This approach is especially suitable for outlier detection in predicted values. Together, these levels form a three-level hybrid model that identifies and treats outliers in environmental monitoring data.
We evaluate the performance of the proposed approach with four data series from a real-world sensor data set of the hydrometeorological observation network in the Heihe River Basin. The experimental results suggest that the preprocessing and prediction methods proposed in this paper achieve a better generalization ability and higher accuracy levels than other models in dealing with non-stationary and nonlinear sensor data. Moreover, the detection method displays outstanding effectiveness in terms of minor outlier detection. This research provides a new perspective for outlier detection and improvements to environmental monitoring data. However, this research evaluates only temperature and humidity data, including humidity data with weak non-stationary characteristics. In future work, the proposed method will be further expanded and optimized to detect outliers in different sensor data.

FIGURE 1. Schematics of the three-level hybrid model.

FIGURE 2. Locations of the Daman and Arou superstations.
According to the figure, Fig. 3-a and Fig. 3-b exhibit a significant difference. It can

FIGURE 3. Recurrence plots of sensor data from Daman superstation: a. temperature data and b. humidity data.

FIGURE 4. Recurrence plots of sensor data from Arou superstation: a. temperature data and b. humidity data.

FIGURE 5. The time-window scheme of training dataset and testing dataset selection.

FIGURE 6. Results for the temperature and humidity data from Daman superstation preprocessed by the MF.

FIGURE 7. Results for the temperature and humidity data from Arou superstation preprocessed by the MF.

FIGURE 8. Results for temperature data from Daman superstation decomposed by EMD.

FIGURE 9. Results for humidity data from Arou superstation decomposed by EMD.

FIGURE 10. Results for the temperature and humidity data from Daman superstation predicted by EMD-CART-AR.

FIGURE 11. Results for the temperature and humidity data from Arou superstation predicted by EMD-CART-AR.

FIGURE 12. Results obtained by the EWMA control chart for Daman weather station data.

FIGURE 13. Result obtained by the EWMA control chart for Arou superstation data.
and others, was performed to demonstrate the advantages and performance of the MF in processing sensor data outliers. In addition, to evaluate the prediction ability of the EMD-CART-AR model, comparisons were made between this model and the SVR, KNN, CART, CEEMD-CART-AR and EEMD-CART-AR models. Finally, we compared the performance of the EWMA control chart with that of the CUSUM and Shewhart control charts in terms of outlier detection. Figs. 14-20 illustrate the performance of the preprocessing model, prediction model and outlier detection model, and the results are presented in Tables 2-9.

FIGURE 14. Comparison of the outliers in the raw data and outliers preprocessed by the MF, the Butterworth filter, the FIR filter, the moving average filter, wavelet transform, and the Wiener filter for temperature and humidity datasets from Daman superstation.

FIGURE 15. Comparison of the outliers in the raw data and outliers preprocessed by the MF, the Butterworth filter, the FIR filter, the moving average filter, wavelet transform, and the Wiener filter for temperature and humidity datasets from Arou superstation.

FIGURE 16. Comparison of raw data and values predicted with the MF-SVR, MF-CART, MF-CEEMD-CART-AR, MF-EEMD-CART-AR, and MF-EMD-CART-AR models for temperature and humidity data from Daman superstation.

FIGURE 17. Comparison of the raw data and values predicted by the MF-SVR, MF-KNN, MF-CART, MF-CEEMD-CART-AR, MF-EEMD-CART-AR, and MF-EMD-CART-AR models for temperature and humidity data from Arou superstation.

FIGURE 18. Comparison of the MAE, RMSE and MAPE for the MF-SVR, MF-CART, MF-CEEMD-CART-AR, MF-EEMD-CART-AR, and MF-EMD-CART-AR models based on temperature and humidity data from Daman superstation.

FIGURE 20. Comparison of the EWMA, CUSUM and Shewhart control charts integrated with the MF-SVR, MF-CART, MF-CEEMD-CART-AR, MF-EEMD-CART-AR, and MF-EMD-CART-AR models based on temperature and humidity data from the Daman and Arou superstations.
41.5%, 39.1% and 34.7% improvements in the MAE, RMSE and MAPE, respectively, for the temperature data. Similarly, compared with MF-SVR, MF-CEEMD-CART-AR yielded improvements of 48.4%, 51.4% and 52.3% for the MAE, RMSE and MAPE, respectively, based on the humidity data. The MF-EMD-CART-AR model compared with MF-SVR yielded the maximum observed improvement for the temperature data, including approximately 67.1% for the MAE, 61.1% for the RMSE and 65.3% for the MAPE. Additionally, observed improvements were approximately 82.5%, 70.4%, and 85.3% for the MAE, RMSE and MAPE, respectively, based on the humidity data. Notably, the hybrid models include signal decomposition methods, which decompose non-stationary series into relatively stationary series with different characteristics to improve the accuracy of predictions. The results of this experiment demonstrate that the characteristics of stationary and non-stationary data have a considerable influence on the prediction accuracy.

TABLE 1. Experimental parameters of all methods.

TABLE 2. Preprocessing results for temperature and humidity data at Daman superstation.

TABLE 3. Preprocessing results for temperature and humidity data at Arou superstation.

TABLE 4. Prediction results for temperature and humidity data at Daman superstation.

TABLE 5. Prediction results for temperature and humidity data at Arou superstation.

TABLE 6. Control limits of the EWMA, CUSUM and Shewhart control charts on temperature and humidity data from Daman superstation.

TABLE 7. Control limits of the EWMA, CUSUM and Shewhart control charts on temperature and humidity data from Arou superstation.

TABLE 8. Performance of the integrated detection model on temperature data from Daman superstation.

TABLE 9. Performance of the integrated detection model on humidity data from Daman superstation.

TABLE 10. Performance of the integrated detection model on temperature data from Arou superstation.

TABLE 11. Performance of the integrated detection model on humidity data from Arou superstation.