An Adaptive Kalman Filtering Approach to Sensing and Predicting Air Quality Index Values

In recent years, the Air Quality Index (AQI) has been widely used to describe the severity of haze and other air pollution, yet it suffers from inefficiency and limited compatibility in real-time sensing and prediction. In this paper, an Auto-Regressive (AR) prediction model based on sensed AQI values is proposed, in which an adaptive Kalman Filtering (KF) approach is fitted to achieve efficient prediction of the AQI values. Monthly AQI values were collected from January 2018 to March 2019 using a WSN-based network, whereas daily AQI values were collected from October 1, 2018 to March 31, 2019. These data were used to build and evaluate the prediction model. According to the results, the predicted values show high accuracy compared with the actual sensed values. In addition, the experimental results show that prediction on monthly AQI values is more accurate than prediction on daily values. Therefore, the hybrid AR-KF model is accurate and effective in predicting haze weather, which has practical significance and potential value.


I. INTRODUCTION
In recent years, haze pollution has raised great concern in societies and scientific communities worldwide, because it degrades the human living environment and may even impede social progress and economic development. The main causes of haze pollution are exhaust gas from industrial production, smoke and dust from coal combustion, vehicle exhaust, and dust from construction sites [1], [2]. Data retrieval methods range from historical monitoring results to WSN-based collection [3] and its optimization [4], [5]. When such emissions are present, the relative humidity is high, and the air flow is relatively slow, the air easily saturates and condenses to form haze through the cooling of atmospheric radiation. Haze affects people's lives in many respects. First, it directly harms public health; for example, it increases the incidence of respiratory, cardiovascular, and cerebrovascular diseases. Meanwhile, it can cause large direct and indirect losses to the social economy. In severe haze, public and private transportation can be affected by reduced visibility. Therefore, measuring, monitoring, and predicting air quality and haze pollution are critical to eventually reducing haze risks in practice.
The associate editor coordinating the review of this manuscript and approving it for publication was Tie Qiu.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
Although the cause of air pollution is complex and stochastic, this does not mean that air pollution cannot be predicted. In recent years, many researchers have paid close attention to the analysis and prediction of the Air Quality Index (AQI). Such work falls into two groups: deterministic approaches and statistical approaches. Deterministic approaches focus on the physical theory of atmospheric and meteorological processes and rely on high-volume historical data, so diffusion models of atmospheric pollution are generally built with specific mathematical approaches. Chen et al. [6] simulated PM2.5 formation and emission based on the Community Multi-scale Air Quality (CMAQ) model. Saide et al. [7] proposed a WRF-Chem model with optimal parameters, on the basis of which a forecasting system was developed to describe air quality and meteorological measurements. However, these models usually require a large amount of historical meteorological data, which is difficult for researchers to obtain in practice. Moreover, limited knowledge of pollutant evolution processes and limited experience in parameter selection affect forecasting accuracy. On the other hand, statistical approaches to prediction have been widely adopted recently due to their flexibility and simplicity. Commonly used statistical models include AutoRegressive Integrated Moving Average (ARIMA), Grey Models (GM), Support Vector Regression (SVR), Multiple Linear Regression (MLR), Artificial Neural Networks (ANN), and hybrid models. For instance, Yang et al. [8] analyzed the formation causes of haze with time series methods and then constructed a vector autoregressive model to predict the daily haze increment.
The results showed good stability in short-term prediction. Carbajal et al. [9] introduced a fuzzy system to classify parameters, and then proposed an autoregressive model to predict the AQI based on that system. Combarro [10] employed the SVR method to determine the elements that had the greatest impact on air quality in the city of Oviedo, Spain. Wu and Zhao [11] predicted the annual average concentration of PM2.5 in three different regions of China in 2020 based on the fractional order accumulation grey model FGM(1,1). The results showed that its forecasting performance was better than that of the traditional grey model. Challoner et al. [12] used two different models to predict air quality indices. A personal exposure activity location model was used to predict the outdoor air quality of a specific building, while an artificial neural network model was used to predict the indoor air quality. The two models were combined to fit the relationship between the indoor and outdoor air of the building. Liu et al. [13] proposed a seq2seq model to predict air quality from historical air quality data and introduced n-step recurrent prediction to address error problems. Bai et al. [14] established a W-BPNN model by combining a wavelet technique with a backpropagation neural network (BPNN) to predict daily air pollutant concentrations. The results showed that the prediction accuracy of the hybrid model was better than that of the BPNN model. Wu et al. [15], [16] proposed an optimal hybrid model, which combined secondary decomposition, neural networks, and optimization algorithms to predict the air quality index. All of the solutions above focused on AQI prediction and said little about their data sources, so how to retrieve real-time data via a WSN-based network remains a challenge. Moreover, due to the uncertainty and diffusion of air pollution, some individual statistical models tend to introduce biases into air quality prediction.
Meanwhile, hybrid models obtained better forecasting results to some extent. Furthermore, a Kalman Filter (KF) approach can strengthen the ability of dealing with stochastic uncertainty combined with its state-space equation. Therefore, a hybrid model applying the KF approach to a statistical model is proposed in this study.
The remainder of this paper is organized as follows. Section II presents related works using Kalman filtering models. Section III describes the data collection, processing, and modeling processes proposed in this paper. The experimental results of the Kalman filtering approach are presented in Section IV. Section V discusses the Kalman filtering method for prediction. Finally, Section VI concludes the paper.

II. RELATED WORKS
In 1960, Kalman [17] introduced a state-space model into the filtering method and derived a set of recursive estimation equations known as the Kalman filter. With the popularization and improvement of the Kalman filtering model, it has been widely applied in different fields, such as hydrology, physics, mechanical control, and economics.
A Seasonal Autoregressive Integrated Moving Average (SARIMA) model with a Generalized Autoregressive Conditional Heteroscedasticity (GARCH) approach was proposed to predict traffic flow [18], and an adaptive Kalman filter was used to implement the proposed model and improve its forecasting performance. Hua et al. [19] used a Weather Research and Forecast (WRF) model to compare the observed wind speed with the predicted wind speed, and then revised the predicted wind speed on the basis of Kalman filter theory to reduce systematic and random errors, which improved the forecasting accuracy. An Unscented Kalman Filter (UKF) approach with support vector regression (SVR) was adopted for short-term prediction of wind speed; compared with four different models, the hybrid UKF-SVR model achieved better forecasting performance [20]. Lai et al. [21] proposed a Kalman filtering algorithm to predict six different air pollutants and compared it with common forecasting models. The results demonstrated that the KF model could obtain the optimal prediction results. Galanis et al. [22] improved the prediction performance of regional weather by applying a nonlinear function to the classical Kalman filter algorithm. Chaabene and Ammar [23] proposed an autoregressive moving average model for medium-term forecasting based on a Kalman filter, yielding higher accuracy compared with short-term weather forecasting. Kumar [24] introduced a Kalman filtering technique (KFT) to predict traffic flow with limited input data. The results demonstrated the suitability of the presented method when data are scarce. Xing et al. [25] proposed a temperature model to construct a state of charge (SOC) estimation method, in which an Unscented Kalman Filter was used to deal with various uncertainties, such as environment variation, intercellular variation, and modeling inaccuracy, by adjusting the model parameters at each sampling step.
Mastali et al. [26] employed an Extended Kalman Filter (EKF) to predict the state of batteries, and introduced a dual concept into the EKF model to improve the forecasting accuracy. According to the results, the filters kept the maximum errors small, indicating the validity of the Kalman filter. Soubdhan et al. [27] proposed the framework of a linear dynamic Kalman filter for predicting solar and photovoltaic production, including probabilistic initialization, expectation maximization (EM), and auto-regressive (AR) models. Two common Kalman filtering methods, EKF and UKF, have been employed to fuse pseudo-range, ranging, and location information for indoor localization, and the experimental results showed that the positioning performance of the nodes improved markedly with the Kalman filter approach [28]. Rigatos and Siano [29] used a nonlinear Kalman filter to predict the default probability of financial companies and estimated the default risk by predicting the ratio of option to asset values.
To evaluate the level of air pollution, the AQI has been chosen as an effective index of the comprehensive level of air quality, to which the Chinese environment minister has paid great attention in recent years. Moreover, the Kalman filter approach has gradually been applied to the fields of economy and finance, but is seldom used in the field of meteorology. In this paper, AQI data are analyzed, evaluated, and predicted using the Kalman filter approach in order to accurately predict the air quality in the near future.

III. DATA PROCESSING AND MODEL SELECTION
In this section, three time-series data processing methods are first analyzed and compared, i.e., Artificial Neural Networks (ANN), Wavelet Transform (WT), and Kalman Filter (KF). The ANN method [30]-[33] has been widely applied in image and voice processing, e.g., pattern detection and recognition. It has strong generalization and fault-tolerance properties for describing nonlinear systems. However, in the presence of a large amount of system noise, an ANN model can fall into a local minimum, resulting in serious prediction errors. The WT method [34]-[36] is a powerful tool for non-stationary signal processing. Because signal characteristics vary across scales, it is difficult to derive wavelet functions from a single basis function that properly approximate local signals at all scales. Therefore, reconstructed signals can lose original time-domain information during de-noising. The Kalman filter method [37], [38] updates and processes real-time data through an accurate mathematical model, which is convenient to program and allows efficient prediction. The state space model of the Kalman filter estimates the state at the current time step from the estimates of previous time steps and the observation at the current time step, so the state estimation can achieve high accuracy. Owing to its strong ability to handle stochastic uncertainty, it is suitable for linear discrete finite-dimensional systems.
Considering the stationary characteristics of an AQI dataset, an autoregressive linear prediction model is first established in the following part of this section. Then, based on the recurrence relationship between successive terms of the AR model, the AQI data can be corrected by a Kalman filter approach. The historical daily AQI values collected from January 2014 to September 2018 via a WSN-based network were used as training data, and 182 daily values from October 2018 to March 2019 were used as test data to evaluate the prediction performance of the model. To compare the forecasting performance at different time scales, i.e., daily data and monthly mean data, 48 monthly AQI values from January 2014 to December 2017 were similarly employed as training data to predict near-future monthly mean values, and the monthly AQI values from January 2018 to March 2019 were correspondingly used as test data.

A. BRIEF INTRODUCTION TO THE AUTOREGRESSIVE MODEL
The autoregressive (AR) model [39] is a linear model that uses a linear combination of previous random variables to describe the current random variable. As a popular linear regression model, it is used to fit stationary time series and has been applied to prediction in economics, informatics, and natural phenomena in recent years.
Let a time series be $x(1), x(2), \cdots, x(t)$. The predicted value of the series at time $t+1$ has the following structure:
$$x(t+1) = \phi_1 x(t) + \phi_2 x(t-1) + \cdots + \phi_p x(t-p+1) + \varepsilon(t) \tag{1}$$
where $1 \le p \le t$. The model is called a p-order AR model, denoted as AR($p$), where $\phi_1, \ldots, \phi_p$ are the model parameters, $p$ is the highest order of the model, and $\varepsilon(t)$ is a zero-mean white noise random disturbance sequence with variance $\sigma_\varepsilon^2$. The current random disturbance term is independent of the past sequence values. Generally, the model can be simplified as:
$$x(t+1) = \sum_{i=1}^{p} \phi_i \, x(t-i+1) + \varepsilon(t) \tag{2}$$
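As a minimal sketch of the AR(p) structure described above, the one-step forecast is the weighted sum of the p most recent values, with the zero-mean disturbance replaced by its expectation. The function name `ar_predict`, the coefficients, and the AQI readings below are hypothetical and chosen only for illustration.

```python
def ar_predict(history, phi):
    """One-step AR(p) forecast: sum of phi_i * x(t - i + 1) for i = 1..p.

    The white-noise term is replaced by its mean (zero).
    """
    p = len(phi)
    if len(history) < p:
        raise ValueError("need at least p past values")
    # Most recent value first: x(t), x(t-1), ..., x(t-p+1)
    recent = history[-p:][::-1]
    return sum(c * x for c, x in zip(phi, recent))


# Hypothetical coefficients and readings, not values from the paper.
example_phi = [0.5, 0.3, 0.1]
example_aqi = [80.0, 90.0, 100.0]
example_forecast = ar_predict(example_aqi, example_phi)
```

With these illustrative inputs the forecast is 0.5·100 + 0.3·90 + 0.1·80.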

B. BRIEF INTRODUCTION TO THE KALMAN FILTER
The Kalman Filter (KF) model is introduced in this paper to improve the accuracy of AQI forecasting because of its strong capacity for handling stochastic uncertainty. The Kalman filter was first proposed in 1960 as a statistical approach. It has since been applied in different fields, especially in meteorological applications, because of its good prediction performance. For many problems, the Kalman filter efficiently computes the optimal parameter estimates in the minimum mean square error (MSE) sense.
For general linear stochastic systems without an input parameter, the state space representation is given by:
$$X_{t+1} = A X_t + W_t \tag{3}$$
$$Z_{t+1} = B X_{t+1} + V_{t+1} \tag{4}$$
where $t \ge 2$. Equation (3) is the state equation of the system. Equation (4) is the measurement equation of the system. $X_{t+1}$ is an n-dimensional state vector at time $t+1$. $A$ is the state transition matrix; $W_t$ is the p-dimensional process noise vector of the system. $Z_{t+1}$ is an m-dimensional observation vector at time $t+1$. $B$ is the predicted output transfer matrix. $V_t$ is a q-dimensional observation noise vector. Let $W_t$ and $V_t$ be white noises, which are independent of each other and obey normal distributions. $Q_t$ represents the covariance matrix of the process noise vector and $R_t$ denotes the covariance matrix of the observation noise vector. In this paper, the AR model is introduced into the state equation of the Kalman filter to simplify the corresponding processes.
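Let $x_i(t) = x(t-i+1)$ for $i = 1, \ldots, p$, so that $x_1(t+1) = x(t+1)$; thus the AR model can be expressed as:
$$x_1(t+1) = \phi_1 x_1(t) + \phi_2 x_2(t) + \cdots + \phi_p x_p(t) + \varepsilon(t) \tag{5}$$
According to (5) and the definition of $x_i(t)$, the remaining components satisfy
$$x_i(t+1) = x_{i-1}(t), \quad i = 2, \ldots, p \tag{6}$$
Therefore, the vector form of the state equation of the AR-KF model can be written as follows:
$$X_{t+1} = A X_t + W_t, \quad X_t = \begin{bmatrix} x_1(t) \\ x_2(t) \\ \vdots \\ x_p(t) \end{bmatrix}, \quad A = \begin{bmatrix} \phi_1 & \phi_2 & \cdots & \phi_{p-1} & \phi_p \\ 1 & 0 & \cdots & 0 & 0 \\ \vdots & \ddots & & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}, \quad W_t = \begin{bmatrix} \varepsilon(t) \\ 0 \\ \vdots \\ 0 \end{bmatrix} \tag{7}$$
According to (2) to (7), the observation equation and the recursion based on the Kalman filter can be obtained as follows:
$$Z_t = B X_t + V_t, \quad B = [1, 0, \ldots, 0]$$
$$\hat{X}_{t|t-1} = A \hat{X}_{t-1|t-1}$$
$$P_{t|t-1} = A P_{t-1|t-1} A^{T} + Q_{t-1}$$
$$G_t = P_{t|t-1} B^{T} \left( B P_{t|t-1} B^{T} + R_t \right)^{-1}$$
$$\hat{X}_{t|t} = \hat{X}_{t|t-1} + G_t \left( Z_t - B \hat{X}_{t|t-1} \right)$$
$$P_{t|t} = \left( I - G_t B \right) P_{t|t-1}$$
where $\hat{X}_{t|t-1}$ is the state estimate at time $t$ given the information available at time $t-1$, and $\hat{X}_{t|t}$ is the optimal state estimate at time $t$ after correcting $\hat{X}_{t|t-1}$ with the observation. $Z_t$ represents the observation vector, $G_t$ is the Kalman gain, and $P$ is the error covariance matrix. This algorithm is implemented according to Table 1.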
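The combination of an AR(p) state equation with a Kalman predict and update cycle can be sketched as follows. This is an illustrative, non-adaptive version under stated assumptions: the state is the vector of the p most recent values in companion form, the observation is the first state component, and the noise variances `q` and `r` are fixed, hypothetical tuning values. The paper's adaptive scheme and its Table 1 are not reproduced here, and all function names are invented for the example.

```python
def companion(phi):
    """State transition matrix A of an AR(p) model in companion form."""
    p = len(phi)
    A = [[0.0] * p for _ in range(p)]
    A[0] = [float(c) for c in phi]   # first row holds phi_1..phi_p
    for i in range(1, p):
        A[i][i - 1] = 1.0            # shift rows: x_i(t+1) = x_{i-1}(t)
    return A


def matmul(X, Y):
    """Matrix product for small matrices stored as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def ar_kf_predict(observations, phi, q, r):
    """Return one-step-ahead predictions for observations[1:].

    q: process noise variance (assumed constant)
    r: observation noise variance (assumed constant)
    """
    p = len(phi)
    A = companion(phi)
    At = [list(col) for col in zip(*A)]
    # Assumed initialisation: repeat the first observation, identity covariance.
    x = [float(observations[0])] * p
    P = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    predictions = []
    for z in observations[1:]:
        # Prediction step: propagate state and covariance through A.
        x = [sum(A[i][k] * x[k] for k in range(p)) for i in range(p)]
        P = matmul(matmul(A, P), At)
        P[0][0] += q                 # noise enters only the first component
        predictions.append(x[0])     # B = [1, 0, ..., 0] picks the first entry
        # Update step: scalar innovation variance, gain, and correction.
        s = P[0][0] + r
        gain = [P[i][0] / s for i in range(p)]
        innovation = z - x[0]
        x = [x[i] + gain[i] * innovation for i in range(p)]
        P = [[P[i][j] - gain[i] * P[0][j] for j in range(p)] for i in range(p)]
    return predictions
```

Because the observation is scalar, the gain needs only a division rather than a matrix inverse. A larger `r` relative to `q` makes the filter trust the AR prediction more than the sensed value.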

IV. EMPIRICAL ANALYSIS
According to the Ambient Air Quality Index (AQI) Technical Regulations (HJ 633-2012), the AQI can be classified into six grades. This paper takes the historical AQI data as the reference data of haze concentration and combines the Kalman filter approach with the autoregressive (AR) model to predict AQI values.
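The six-grade classification can be illustrated with a small lookup. The breakpoints below (50, 100, 150, 200, 300) are those of HJ 633-2012; the English grade names are a common translation and are an assumption here, as is the function name `aqi_grade`.

```python
import bisect

# Upper bounds of the first five grades; values above 300 fall in the sixth.
_UPPER_BOUNDS = [50, 100, 150, 200, 300]
# English names are an assumed translation of the grade labels.
_GRADES = [
    "Excellent",
    "Good",
    "Lightly polluted",
    "Moderately polluted",
    "Heavily polluted",
    "Severely polluted",
]


def aqi_grade(aqi):
    """Map a non-negative AQI value to one of the six grades."""
    if aqi < 0:
        raise ValueError("AQI must be non-negative")
    # bisect_left keeps each upper bound inside its own grade (e.g. 50 -> first).
    return _GRADES[bisect.bisect_left(_UPPER_BOUNDS, aqi)]
```

For example, `aqi_grade(75)` falls in the second grade, which matches the "Good" level referred to later in the monthly analysis.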

A. DATA SOURCE AND PROCESSING
In this paper, 1826 historical daily AQI values in Nanjing from January 1, 2014 to December 31, 2018 have been employed. The data can be obtained from the website of Weather Post-report (http://www.tianqihoubao.com/).
The AQI data for establishing the model are from January 1, 2014 to December 31, 2018. The time series of the AQI data in Nanjing is shown in Figure 1. Fig. 1 shows that the fluctuation of the AQI data in Nanjing is relatively stable, with AQI values ranging from 40 to 300. According to the applicability conditions of the AR model, stationary time-series data are needed to train the proposed model. To test the stationarity of the time series formally, the Augmented Dickey-Fuller (ADF) unit root test has been used. If there is no unit root in the time series, the data are stationary; otherwise, the series is non-stationary. The p value of the test is $p = 1.1157 \times 10^{-11}$, which is less than 0.05, i.e., there is no unit root in the sequence. Therefore, the daily AQI data are stationary and can be used as input data for the AR model.
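The idea behind the unit root test can be illustrated with a simplified Dickey-Fuller regression. This sketch is not the ADF procedure used above: it omits the lagged difference terms and does not compute a p value. It regresses the first difference on the lagged level with a constant, and compares the t statistic of the slope with the asymptotic 5% critical value for the constant-only case (about -2.86). The function name and the synthetic series are assumptions for illustration.

```python
import math
import random

# Asymptotic 5% critical value of the Dickey-Fuller test with a constant term.
DF_CRITICAL_5PCT = -2.86


def dickey_fuller_stat(series):
    """t statistic of gamma in  dx_t = alpha + gamma * x_{t-1} + e_t."""
    x_lag = series[:-1]
    dx = [b - a for a, b in zip(series[:-1], series[1:])]
    n = len(dx)
    mean_x = sum(x_lag) / n
    mean_d = sum(dx) / n
    sxx = sum((x - mean_x) ** 2 for x in x_lag)
    sxy = sum((x - mean_x) * (d - mean_d) for x, d in zip(x_lag, dx))
    if sxx == 0:
        raise ValueError("series is constant")
    gamma = sxy / sxx
    alpha = mean_d - gamma * mean_x
    sse = sum((d - alpha - gamma * x) ** 2 for x, d in zip(x_lag, dx))
    se = math.sqrt(sse / (n - 2) / sxx)
    if se == 0:
        raise ValueError("degenerate regression with zero residual variance")
    return gamma / se


# Synthetic stationary series standing in for daily AQI values.
random.seed(0)
synthetic = [random.gauss(100.0, 20.0) for _ in range(500)]
rejects_unit_root = dickey_fuller_stat(synthetic) < DF_CRITICAL_5PCT
```

A statistic below the critical value rejects the unit root hypothesis, which corresponds to the stationary case described above. For real data, a full ADF implementation from a statistics library would be the appropriate tool.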
Then, the order of the AR model needs to be determined. Because judging the tailing and truncation of the autocorrelation and partial autocorrelation functions depends on subjective operations, a Bayesian Information Criterion (BIC) approach is adopted to make this step more objective: the order $p$ of the model is selected by setting upper and lower bounds and then traversing the candidate orders one by one. Define
$$\mathrm{BIC} = -2 \ln \hat{L} + p \ln N$$
where $N$ represents the number of samples, $p$ denotes the order of the model, and $\hat{L}$ represents the likelihood function. The value of $p$ at which the BIC reaches its minimum is chosen as the order of the optimal AR model under the criterion. According to programming experiments, the parameter has been optimally set to $p = 3$. Therefore, the AR(3) model can be established as follows:
$$x(t+1) = \phi_1 x(t) + \phi_2 x(t-1) + \phi_3 x(t-2) + \varepsilon(t)$$
Next, the parameters of the AR model are estimated by the least squares method through programming experiments, as shown in Table 3. The third-order autoregressive model has then been established using the 1734 daily AQI values of the training period from January 2014 to September 2018.
As can be seen in Figure 2, the AQI values in December and January of each year from 2014 to 2018 are higher than those of other months. The AQI values of Nanjing are high every winter, and this trend is consistent with the regularly occurring haze in daily life: haze pollution is more serious in winter than at other times, while the haze concentration decreases to some extent in summer, indicating that the model is reasonable. In recent years, except for the sudden increase of the AQI value in the winter of 2018, the winter peak values in Nanjing show a decreasing trend. This phenomenon is likely related to the air pollution control measures taken by the Nanjing Municipal Government in recent years.
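The order selection and parameter estimation steps described above can be sketched as follows, assuming Gaussian residuals so that -2 ln L equals N ln(SSE/N) up to a constant. The model has no intercept, following the AR form used in this paper. The function names and the bound `max_p` are illustrative assumptions, not the authors' implementation.

```python
import math


def solve(M, b):
    """Solve M a = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    a = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][k] * a[k] for k in range(i + 1, n))
        a[i] = (aug[i][n] - tail) / aug[i][i]
    return a


def fit_ar(series, p, start):
    """Least-squares AR(p) fit on targets series[start:]; requires start >= p.

    Returns (phi, sse) where sse is the residual sum of squares.
    """
    rows = [[series[t - i] for i in range(1, p + 1)]
            for t in range(start, len(series))]
    y = series[start:]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(p)]
    phi = solve(xtx, xty)
    sse = sum((v - sum(c * x for c, x in zip(phi, r))) ** 2
              for r, v in zip(rows, y))
    return phi, sse


def select_order(series, max_p):
    """Return (p, phi) minimising N ln(SSE / N) + p ln N over p = 1..max_p."""
    n = len(series) - max_p          # same targets for every candidate order
    best = None
    for p in range(1, max_p + 1):
        phi, sse = fit_ar(series, p, max_p)
        bic = n * math.log(sse / n) + p * math.log(n)
        if best is None or bic < best[0]:
            best = (bic, p, phi)
    return best[1], best[2]
```

Fitting every candidate on the same targets keeps the BIC values comparable across orders, which is why `start` is fixed at `max_p` inside `select_order`.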
Since the fluctuation trends of the simulated and real curves of the daily AQI values are visually consistent (Figure 2), 182 daily values of Nanjing from October 1, 2018 to March 31, 2019 have been used as test samples to further analyze the prediction ability of the model. The simulation result is shown in Figure 3. The test results show that the curve of the values predicted by the KF-AR model is consistent with the curve of the real values (Figure 3), and both curves follow a downward trend. Compared with the AR model, the errors of the hybrid model are mainly concentrated at extreme values, which are often caused by abnormal factors such as temperature inversion, rain, wind, coal burning, and automobile exhaust. Specifically, the AR curve is overly smooth at the beginning compared with the KF-AR curve, while the KF-AR model fits well over the entire period; the AR model therefore struggles to fit the whole period well. Comparing the real and predicted values from October 1, 2018 to March 31, 2019, the root mean square error (RMSE) of the KF-AR model is 26.27, while that of the AR model is 27.07, which is higher than that of the KF-AR model.
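For reference, the root mean square error used to compare the two models can be computed as in the sketch below. The function name `rmse` is illustrative, and no values from the paper's experiments are reproduced.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))
```

A lower value indicates predictions closer to the sensed AQI values, so the model with the smaller RMSE on the same test period is the more accurate one.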

B. MODELING PROCESS AND RESULT ANALYSIS OF AQI MONTHLY MEAN SEQUENCE
Based on the previous analysis, the daily data are aggregated into a monthly AQI mean value series. The monthly mean values of the 48 months from January 2014 to December 2017 were used for model training, whereas the monthly mean values of the 15 months from January 2018 to March 2019 were used for testing. Based on the BIC criterion described above, the order of the AR model is set as p = 4. A fourth-order autoregressive model has been established for the 48 monthly AQI mean values from January 2014 to December 2017. According to (2), the fitted model is written with its coefficients retained to four significant digits.
As indicated in Figure 4, the monthly mean value series peaks around January of each year, which shows the same trend as the daily values. The curve of real values has fluctuated since 2014, but generally shows a slow downward trend, and the air quality level is "Good". The main reason is that the government has strengthened pollution control and reduced emissions of air pollutants in recent years.
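The aggregation of daily values into a monthly mean series, described at the start of this subsection, can be sketched as follows. The input format (pairs of date and AQI value) and the function name `monthly_means` are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date


def monthly_means(daily):
    """Aggregate (date, aqi) pairs into chronological ((year, month), mean) pairs."""
    buckets = defaultdict(list)
    for day, value in daily:
        buckets[(day.year, day.month)].append(value)
    # Sorting the (year, month) keys yields chronological order.
    return [(key, sum(vals) / len(vals)) for key, vals in sorted(buckets.items())]


# Hypothetical readings, not data from the paper.
sample = [
    (date(2018, 1, 1), 100.0),
    (date(2018, 1, 2), 80.0),
    (date(2018, 2, 1), 60.0),
]
sample_means = monthly_means(sample)
```

Each month contributes one value regardless of its number of days, so the 48 training months and 15 test months map directly to 48 and 15 series entries.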
Then, 15 monthly mean value data of AQI from January 2018 to March 2019 in Nanjing were used as test samples to continue fitting the series data with the model. The prediction ability of the model for the monthly mean series was further discussed according to the test results, as shown in Figure 5.
According to Fig. 5, although there are deviations in certain months, the trend of the real monthly mean AQI curve is consistent with the general trend of the predicted curve, and more so than for the AR model, which implies that the prediction effect is favorable. Specifically, in periods when the AQI fluctuates greatly, the prediction performance of the hybrid model exceeds that of the AR model. Also, in recent months, the prediction results show that the air quality grade is good. At the same time, the educational level of society is increasing year by year, so environmental protection awareness is gradually being enhanced, and the air quality is expected to improve gradually in the future. According to the difference between the real values and the test values, the root mean square errors of the KF-AR model and the AR model are 10.37 and 11.84, respectively, both of which are lower than those obtained on the daily data. Moreover, the forecasting error of the KF-AR model is lower than that of the AR model. Table 4 gives the results of all experimental simulations, which are consistent with the previous analysis. These results show that the hybrid model outperforms the AR model in forecasting. In addition, prediction on monthly mean data is more accurate than prediction on daily data.

V. DISCUSSION
In this paper, an adaptive Kalman filtering model is introduced to improve the prediction accuracy of air quality based on the autoregressive model. The proposed method can provide accurate data support for haze prevention and control. The results show that the Kalman filter based on the AR model fits the AQI series data retrieved in Nanjing better than the individual AR model. Furthermore, this method can be extended to the prediction of other air quality indicators, such as PM2.5, PM10, and SO2. It can also be combined with other prediction models, such as Support Vector Machine (SVM) and Artificial Neural Network (ANN) models, to realize hybrid prediction of the AQI in the future.

VI. CONCLUSION
An adaptive Kalman filter approach based on the AR model has been proposed and evaluated by training, testing, and predicting on the daily and monthly AQI series data in Nanjing. The paper finds that the model is more effective in predicting the monthly AQI series data collected via a WSN in Nanjing than in predicting the daily AQI series data. Moreover, the forecasting performance of the hybrid KF-AR model is better than that of the individual AR model. The following conclusions can therefore be drawn: (1) The prediction method based on the Kalman filter in this paper can predict the monthly mean value of the AQI, which shows that the method has good prediction ability for the AQI and practical significance in the field of haze prediction.
(2) According to the training and test results, the AQI is decreasing in recent years. This not only means that haze prevention and control in Nanjing has achieved obvious effects but reflects the gradual improvement of people's environmental and ecological protection awareness as well.
(3) By observing the prediction curve of the AQI monthly mean series data, it is found that there is a certain delay error in the time. Therefore, the Kalman filtering model can be further improved on the AQI prediction in the future, by considering the integration of other regression methods to improve the Kalman filter and correct the delay error of the predicted values.
[39] Y. Liu
XIAODONG LIU (Senior Member, IEEE) received the Ph.D. degree in computer science from De Montfort University. He joined Napier in 1999. He was the Director of the Centre for Information and Software Systems. He is a Reader and is currently leading the Software Systems Research Group, IIDI, Edinburgh Napier University. He is an active researcher in software engineering with an internationally excellent reputation and leading expertise in context-aware adaptive services, service evolution, mobile clouds, pervasive computing, software reuse, and green software engineering.