Causal Identification Based on Compressive Sensing of Air Pollutants Using Urban Big Data

This study addresses the causal identification of air pollutants from surrounding cities affecting Beijing’s air quality. A novel compressive sensing causality analysis (CS-Causality) method, which combines Granger causality analysis (GCA) and maximum correntropy criterion (MCC), is presented for efficient identification of the air pollutant causality between Beijing and surrounding cities. Firstly, taking the spatiotemporal correlation into consideration, the original data is mapped into low-dimensional space. Valid information is then obtained based on compressive sensing (CS), which can greatly reduce the dimensions of the data, thus decreasing the amount of data analysis required. Secondly, to analyze the causal relations, GCA, represented by the prediction from one time series to another, is extended to rule out “Non-Granger” causes of air pollutants in Beijing originating from its surrounding cities. Thirdly, the greatest impact on Beijing’s air quality is confirmed based on MCC. Finally, the accuracy of these results is verified using the transfer entropy.


I. INTRODUCTION
In recent years, air quality has attracted widespread concern due to its rapid deterioration. Air pollution not only directly affects human health but can also seriously threaten human life. It is estimated to cause 3.7 million deaths per year and has contributed increasingly to the global burden of disease, which has caused wide public concern. In particular, the air quality in Beijing and its surrounding cities has received widespread attention from relevant departments and research institutions. According to authoritative reports, the concentrations of PM2.5, PM10, SO2, NO2, and CO in the air are important indicators of the severity of air pollution.
Because of the harm that air pollutants cause to the human body, controlling air pollution is critical, especially reducing the concentration of pollutants in the air. The primary task is to find the source of the pollutants and cut off their diffusion, so as to solve the air pollution problem fundamentally. Taking the relationship between air quality in Beijing and its surrounding cities as an example, it can be seen from FIGURE 1 that there is a certain correlation between them. However, the huge volume of pollutant data contains a lot of false data and duplicate data, which makes the causality analysis result inaccurate. Therefore, the severity of air pollution has not been significantly alleviated.
The associate editor coordinating the review of this manuscript and approving it for publication was Zhe Xiao.

A. RELATED WORKS
To address air pollution problems, data-driven analysis and causality modeling analysis have become popular tools to predict and analyze the relationship between meteorological data.
1) Data-driven air quality analysis. Several researchers [1]-[6] proposed approaches to forecast and estimate air quality by analyzing and processing the correlations and patterns found in heterogeneous big data. Shang et al. [7] estimated gas consumption and pollutant emissions based on GPS data. Zheng et al. [8] designed a linear regression-based temporal predictor to model the local air quality factors considering current meteorological data. Simona et al. [9] proposed a mobile air pollution monitoring framework coupled with a data-driven modelling approach for predicting the air quality inside urban areas, at human breathing level. A static monitoring protocol was proposed for comparing the performance of two different mobile sensing units. The proposed experimentation protocols showed not only a significantly higher impact of NO2 concentrations, but also poor noise levels registered. The data-driven investigation revealed that data generated by the mobile sensing units when used outdoors can be accurately used to predict future NO2 levels in urban areas.
2) Causality modelling for time series. Identifying the source of pollutants has become a critical problem. Recently, due to the availability of air quality big data collected by wireless sensors deployed in different regions, it has become possible to analyze the causalities of pollutants among surrounding cities. To address the problem, Zhu et al. [10] proposed a pattern-aided graphical causality analysis approach, based on pattern mining and Bayesian learning, to identify the spatiotemporal causal pathways for air pollutants. Ebert-Uphoff and Deng [11] investigated the causal discovery problem, setting up spatiotemporal problems using a time series and handling temporal and spatial boundaries. Yang et al. [12] approximated the Granger causality of multivariate time series in a unified framework. To predict haze pollution, Peng et al. [13] proposed a PS-FCM (Primary Sub-Fuzzy Cognitive Maps) model to reveal causality in the formation of haze. The causality, based on time series data of haze pollution with PS-FCM, was explored by considering the formation of haze as a process that evolves with time. Thus, a multidimensional time series data mining method based on the PS-FCM was developed to investigate the formation of haze.

B. MOTIVATION AND MAJOR CONTRIBUTIONS
Although these methods succeeded in improving the performance of data analysis, identifying the source of pollutants from air quality big data remains a serious challenge for the following reasons:
1) Large amounts of air quality data make causality analysis very difficult. Tens of thousands of sensors deployed in different areas sample large amounts of data every day, and discovering causality in such a large amount of air quality data is very difficult.
2) A large amount of noisy data can lead to inaccuracy in the causality analysis. There are many repetitive or false data points in the original air quality data. These data points are not only useless, but also directly affect the accuracy of the causality analysis.
With such a large amount of partly repetitive data, an efficient data compression strategy is essential to analyze causality efficiently. Data compression algorithms can be divided into two types: lossless and lossy compression. Lossless compression guarantees the integrity of reconstruction at the cost of relatively poor compression ratios. Among lossy methods, compressive sensing (CS) [14], [15] is best known for its performance advantages. The most prominent feature of CS is that redundant temporal-spatial domain information can be reduced for each sensor independently. This makes CS well suited to compressing data before causal analysis of big data.
Inspired by the robustness and efficiency of the recently proposed pg-Causality, we bring further contributions to the analysis of pollutant causality between Beijing and its surrounding cities based on Granger causality analysis (GCA) and compressive sensing (CS). The spatiotemporal causalities should be reflected in the following two aspects:
1) The spatial correlation, which denotes the propagation and diffusion of multiple pollutants in space.
2) The temporal correlation, which indicates that causality changes at different time lags.
Based on these aspects, we carried out the following research. Firstly, the original air quality data from the different cities (the sparse time series) were each mapped into low-dimensional space to obtain the respective low-dimensional vectors. The compression process effectively removed repetitive or false data and reduced computational complexity. Secondly, the causalities between Beijing and its surrounding cities were determined using the Granger causality index. Thirdly, using the maximum correntropy criterion (MCC), the surrounding cities that had the greatest impact on Beijing's air quality were determined.

A. ORIGINAL DATA TEST
The spatiotemporal air quality data in Beijing and its surrounding cities were selected from November 1st to November 26th in 2019. The synchronous correlations in PM2.5, PM10, SO2, CO, and NO2 are shown in FIGURE 2. The horizontal axes show the time range (November 1, 2019 to November 26, 2019). As can be seen in Fig. 2, it is difficult to predict which urban air pollutants have a significant impact on Beijing's air quality based on the original data. Therefore, we will utilize Granger causality with compressive sensing to further explore the impact of pollutants in other cities on Beijing's air quality in the next section.

B. DATA PROCESSING
It is well known that the original collected data will be sparse and repetitive [16], [17]. In order to improve the effectiveness of air quality data, special processing is carried out on the original data to remove invalid data, so as to provide effective information for the data collection and analysis.
Let $U = \{U_1, U_2, U_3, U_4, U_5\}$ be the set of air quality data from the surrounding cities, in which $U_1, U_2, U_3, U_4, U_5$ denote pollutants in Zhangjiakou, Chengde, Tianjin, Baoding, and Tangshan, respectively. The set $V$ represents pollutants in Beijing. Compressive sensing is applied to reduce the amount of required data analysis and processing. The processing consists of four steps, as illustrated in FIGURE 3.
Firstly, sparse representation of the original data. Generally, the original data are not sparse, so they need to be transformed into sparse representations on a sparse basis, as formulated in (1)-(2). There, the sparse basis is a matrix in $\mathbb{R}^{N \times N}$, the sparse coefficients of $U_\tau$ (denoted $Z_\tau$) and of $V$ are matrices in $\mathbb{R}^{N \times T}$, and $U_\tau$, $V$ are the time series of original data, where $N$ is the size of the time series.
Secondly, spatial projections. In this matrix operation, the original data are projected to an $M$-dimensional space through $M$ iterations to eliminate spatial correlations, as expressed in (3)-(4). This step achieves data compression in the spatial domain. The sensing matrix used for the projection satisfies the restricted isometry property (RIP).
Thirdly, temporal projections. Because the data are also sparse in the time domain, further compression can reduce the amount of data for causal analysis. The process is described in (5), which uses a temporal compressive basis.
Finally, data reconstruction. After causal analysis, the original data are recovered from the sparse coefficients using $l_1$-norm minimization [17].
Without loss of generality, in the subsequent compression process, the elements of the sensing matrix are drawn from a Gaussian distribution, the sparse basis is calculated by applying the inverse discrete cosine transform to the columns of the identity matrix, and the elements of the temporal compressive basis are all set to 1.
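The compression and reconstruction steps can be illustrated with a minimal sketch. It assumes the common CS model in which a signal equals the sparse basis times a sparse coefficient vector and the measurement equals a Gaussian sensing matrix times the signal; the names `psi`, `phi`, `l1_reconstruct`, and the synthetic three-coefficient signal are illustrative assumptions, not the authors' code or data. The $l_1$ step is written here as a basis-pursuit linear program, which is one standard way to carry out $l_1$-norm minimization.

```python
import numpy as np
from scipy.fft import idct
from scipy.optimize import linprog


def dct_sparse_basis(n):
    """Sparse basis: inverse DCT applied to the columns of the identity matrix."""
    return idct(np.eye(n), axis=0, norm="ortho")


def gaussian_sensing_matrix(m, n, rng):
    """Sensing matrix with i.i.d. Gaussian entries (scaled by 1/sqrt(m))."""
    return rng.standard_normal((m, n)) / np.sqrt(m)


def l1_reconstruct(a, y):
    """Basis pursuit: minimize ||z||_1 subject to a @ z = y.

    Written as a linear program with z = u - v and u, v >= 0.
    """
    n = a.shape[1]
    result = linprog(
        c=np.ones(2 * n),
        A_eq=np.hstack([a, -a]),
        b_eq=y,
        bounds=(0, None),
        method="highs",
    )
    if not result.success:
        raise RuntimeError(result.message)
    return result.x[:n] - result.x[n:]


rng = np.random.default_rng(0)
n, m = 40, 26                      # m / n = 0.65, the ratio used in the paper
psi = dct_sparse_basis(n)

# Synthetic signal that is exactly 3-sparse in the DCT basis (assumption).
z_true = np.zeros(n)
z_true[[1, 5, 12]] = [3.0, -2.0, 1.5]
signal = psi @ z_true

phi = gaussian_sensing_matrix(m, n, rng)
measurement = phi @ signal          # spatial projection to m dimensions

z_hat = l1_reconstruct(phi @ psi, measurement)
signal_hat = psi @ z_hat
rel_err = np.linalg.norm(signal_hat - signal) / np.linalg.norm(signal)
```

Real pollutant series are only approximately sparse, so the reconstruction error on measured data would not be expected to vanish as it does for this idealized signal.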

C. GRANGER CAUSALITY ANALYSIS
To detect and measure which cities have an impact on Beijing's air pollution, the causalities between Beijing and its surrounding cities are analyzed. Popular recent causality models include graphical causality [18], [19], unit-level causality [20], and predictive causality [21], [22]. Granger causality [23], a form of predictive causality, is applied to test the impact of the surrounding cities on Beijing's air pollution.
In models (7) and (8), $e^{\tau}_{d,1}(t)$, $e^{\tau}_{d,2}(t)$ are the errors at $t$, $p_i$ $(i = 1, 2, 3, 4)$ is the number of timestamps, and $\alpha_{d,i}$, $\beta_{d,i}$ $(d = 1, 2, \cdots, M)$ are the corresponding weights for the time series $X_\tau$, $Y$. When the models represented by equations (7) and (8) satisfy the null hypothesis, they degrade to autoregressive (AR) models, which indicates that there is no causality between $X_\tau$ and $Y$. The AR models for the time series in Granger causality are given by (9) and (10), where $e^{\tau}_{d,3}(t)$, $e_{d,4}(t)$ are the errors at $t$; $p_5$, $p_6$ are the numbers of timestamps; and $\alpha^{3}_{d,i}$, $\beta^{3}_{d,i}$ are the corresponding weights for the time series $X_\tau$, $Y$. In the case of (9) and (10), the time series are caused by their own histories, not by others. In [24], the independence of $X_\tau$ and $Y$ is defined as follows: $\mathrm{Var}(e^{\tau}_1(t)) = \mathrm{Var}(e^{\tau}_3(t))$ and $\mathrm{Var}(e^{\tau}_2(t)) = \mathrm{Var}(e_4(t))$, where $\mathrm{Var}(e^{\tau}_i(t))$ $(i = 1, 2, 3, 4)$ denotes the variance of the corresponding error. Otherwise, we need to distinguish causality between $X_\tau$ and $Y$. The Granger causality index from $X_\tau$ to $Y$ can be defined as follows:

$$F_{X_\tau \to Y} = \ln \frac{\sum_{t=1}^{T} \mathrm{Var}(e_4(t))}{\sum_{t=1}^{T} \mathrm{Var}(e^{\tau}_2(t))} \tag{11}$$

If $\sum_{t=1}^{T} \mathrm{Var}(e_4(t)) > \sum_{t=1}^{T} \mathrm{Var}(e^{\tau}_2(t))$, then $F_{X_\tau \to Y} > 0$ and $X_\tau$ is the cause of $Y$. In other words, $F_{X_\tau \to Y} > 0$ is a necessary and sufficient condition for $X_\tau$ to be the cause of $Y$. If $F_{X_\tau \to Y} = 0$, there is no causality from $X_\tau$ to $Y$. Similarly, the Granger causality index from $Y$ to $X_\tau$ is given by:

$$F_{Y \to X_\tau} = \ln \frac{\sum_{t=1}^{T} \mathrm{Var}(e^{\tau}_3(t))}{\sum_{t=1}^{T} \mathrm{Var}(e^{\tau}_1(t))} \tag{12}$$

When $F_{X_\tau \to Y} > F_{Y \to X_\tau}$, the effect of $X_\tau$ on $Y$ is significantly higher than the effect of $Y$ on $X_\tau$. For the theoretical analysis, the original data sets $U_\tau$ $(\tau = 1, 2, 3, 4, 5)$ and $V$ were first sampled periodically. The invalid data were then eliminated based on CS, and the low-dimensional time series $X_\tau$ $(\tau = 1, 2, 3, 4, 5)$ and $Y$ were obtained. $F_{X_\tau \to Y}$ and $F_{Y \to X_\tau}$ $(\tau = 1, 2, 3, 4, 5)$ were calculated and compared using (11)-(12). Thus, it could be determined which $X_\tau$ (pollutant index of the surrounding cities) was the cause of $Y$ (pollutant index of Beijing).
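The comparison of the two indices can be sketched as follows. This is a minimal illustration, assuming that both regressions are fitted by ordinary least squares with an intercept, that a single lag order `p` is used for every model (the paper allows separate orders $p_i$), and that the index is the log-ratio of residual variances, the usual form of Granger's index. The function names and the synthetic series are hypothetical.

```python
import numpy as np


def lagged(series, p):
    """Design columns series[t-1], ..., series[t-p] for t = p, ..., T-1."""
    total = len(series)
    return np.column_stack([series[p - i:total - i] for i in range(1, p + 1)])


def residual_variance(target, regressors):
    """Variance of the OLS residuals of target on regressors plus an intercept."""
    design = np.column_stack([np.ones(len(target)), regressors])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return np.var(target - design @ coef)


def granger_index(cause, effect, p):
    """ln( Var(AR residual of effect) / Var(residual with lags of cause added) )."""
    target = effect[p:]
    own_history = lagged(effect, p)
    full_history = np.column_stack([own_history, lagged(cause, p)])
    restricted = residual_variance(target, own_history)
    unrestricted = residual_variance(target, full_history)
    return np.log(restricted / unrestricted)


# Synthetic example (assumption): x drives y with a delay of two steps.
rng = np.random.default_rng(1)
steps = 500
x = np.zeros(steps)
y = np.zeros(steps)
for t in range(2, steps):
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()
    y[t] = 0.8 * x[t - 2] + 0.1 * rng.standard_normal()

f_x_to_y = granger_index(x, y, p=3)
f_y_to_x = granger_index(y, x, p=3)
```

Because the restricted model is nested in the unrestricted one, the in-sample index is never negative, so in practice a threshold or significance test is needed to separate a genuinely positive index from estimation noise.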

D. CAUSALITY OBSERVATION
To detect and measure which surrounding cities could be the cause of air pollution in Beijing, a number of assumptions had to be made about the external environment.
1) Uncontrollable factors, such as wind speed, wind direction, temperature, humidity, pressure, and other natural conditions, were not considered. Uncontrollable factors vary from place to place. If they were all taken into account, it would inevitably increase the difficulty of the causality analysis and affect the accuracy of analysis results.
2) The compression error of original data was allowed. The mapping of the original data from high-dimensional space to low-dimensional space is a lossy process, as is the reconstruction of the data from low-dimensional space to high-dimensional space. Typically, the error ratio of the compression process is within 5%.
We utilized the compressed data associated with the five major pollutants to analyze the relationship between the air quality in Beijing and its surrounding cities. From Table 1, the Granger causality conclusions can be drawn as follows:
1) PM2.5 in Baoding, Tianjin, Tangshan, and Zhangjiakou were the cause of PM2.5 in Beijing when the time lag was 2 or 3.
2) PM10 in Baoding, Chengde, and Tangshan were the cause of PM10 in Beijing when the time lag was 2 or 3.
3) SO2 in Baoding, Chengde, Tianjin, Tangshan, and Zhangjiakou were the cause of SO2 in Beijing when the time lag was 2 or 3.
4) CO in Chengde, Tianjin, Tangshan, and Zhangjiakou were the cause of CO in Beijing when the time lag was 3 or 4.
5) NO2 in Baoding, Tianjin, Tangshan, and Zhangjiakou were the cause of NO2 in Beijing when the time lag was 2 or 3.
Regardless of wind direction, wind speed, and other factors, air pollutants arrive in Beijing after 2-3 days of diffusion and have an impact on the air quality in Beijing. From Table 1, the kinds of pollutants in specific cities that have an impact on Beijing's air quality can also be identified, but the degrees of impact of the pollutants from these cities on Beijing's air pollution cannot be determined. In the next section, we will utilize MCC to estimate the degrees of impact of air pollutants around Beijing.

III. GCA BASED ON MCC
There are two traditional measures of similarity between two random variables: minimum error entropy (MEE) and MCC. In this section, MCC is used to determine which surrounding city has the greatest impact on Beijing's air pollution. Firstly, MCC is applied to identify the weights for the time series and guarantee the minimum error. Secondly, the $X_\tau$ that is the predominant cause of $Y$ is determined through minimum error entropy. If there is causality between $X_\tau$ and $Y$, the correntropy is defined as the expectation $E(\cdot)$ of a shift-invariant Mercer kernel $\kappa(\cdot)$ under the joint probability density function $f_{X_\tau Y}(x^{\tau}_{d,t}, y_{d,t})$. In this study, the Gaussian kernel was adopted, where $\sigma_d^2$ denotes the variance, which is constant in the temporal space, $W_d = [w_1, w_2, \cdots, w_m]^T$ is the weight vector, and $X_d(t)$ is the associated input vector. The MCC algorithm can be derived through a stochastic gradient, which gives the weight update (17). Here $W_d(i)$ is the weight vector at iteration $i$, $\eta_d > 0$ denotes the step size, $e_d(i)$ is the prediction error of $e^{\tau}_{d,2}(i)$ at iteration $i$, and $f(e_d(i))$ is a function of the error $e_d(i)$, defined in (18). Substituting (18) into (17), the gradient-based update equation that maximizes the correntropy, which corresponds to the minimum error entropy, can easily be derived; this rewritten form of (17) is (19). To facilitate further analysis, the weight error vector at iteration $i - 1$ is denoted $\tilde{W}_d(i - 1)$ and is defined relative to $W_d^0$, the desired weight vector, which is not known beforehand. Clearly, $W_d(i)$ is closest to $W_d^0$ when the norm of $\tilde{W}_d(i)$ is a minimum.
VOLUME 8, 2020
Based on equation (16), the a priori error $e_a(i)$ can be defined, and $e_a(i)$ and $e_d(i)$ are directly related to each other. Applying the energy conservation relation [25] then leads to an energy relation for the weight error. By the properties of this function, if $W_d^0$ satisfies the condition
$$\eta_d E\left[\|X_d(i)\|^2 f^2(e_d(i))\right] = E\left[e_a(i) f(e_d(i))\right],$$
the minimum of equation (23) can be solved for. We can then easily obtain the optimum of the weight $W_d$, which is the degree of impact of $X_\tau$ on $Y$. This process is repeated to calculate the degree of impact of $X_\tau$ $(\tau = 1, 2, 3, 4, 5)$ on $Y$. $X_{\tau_0}$ has the greatest impact on $Y$ if the expectation of the norm of the weight error $\tilde{W}_d(i)$ $(d = 1, 2, \cdots, M)$ is a minimum between $X_{\tau_0}$ and $Y$. Owing to the introduction of CS, large volumes of data can be handled, such as the five major pollutants for Beijing and its surrounding cities in 2019. The compression ratio $M/N = 0.65$ was applied. The degrees of impact of the surrounding cities are shown in Table 2 (surrounding cities without Granger causality were excluded).
As shown in TABLE 2, for PM2.5 and SO2, Tangshan has the minimal value, that is, the smallest expected weight-error norm. Therefore, Tangshan has the greatest impact on Beijing's air quality for these pollutants. Similarly, Baoding has the greatest impact on Beijing in PM10 and NO2, and Tianjin has the greatest impact on Beijing in CO.
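A stochastic-gradient MCC weight update can be sketched as below. The sketch assumes the commonly used form $W(i) = W(i-1) + \eta \exp(-e^2(i) / 2\sigma^2)\, e(i)\, X(i)$ with a Gaussian kernel; whether this matches the paper's equations (17)-(19) term by term cannot be confirmed from the text. The function name, the synthetic data, and the step size of 0.05 (chosen for stability on this toy problem, whereas the paper's experiments use $\eta_d = 1$) are all assumptions.

```python
import numpy as np


def mcc_weights(inputs, desired, step=0.05, sigma=1.0):
    """Stochastic-gradient MCC: w <- w + step * exp(-e^2 / (2 sigma^2)) * e * x."""
    weights = np.zeros(inputs.shape[1])
    for x_i, d_i in zip(inputs, desired):
        error = d_i - weights @ x_i
        # Gaussian kernel gain: close to 1 for small errors, near 0 for outliers.
        kernel_gain = np.exp(-error**2 / (2.0 * sigma**2))
        weights = weights + step * kernel_gain * error * x_i
    return weights


rng = np.random.default_rng(0)
true_weights = np.array([0.6, -0.3, 0.1])       # hypothetical desired weights
inputs = rng.standard_normal((5000, 3))

# Small Gaussian noise plus a few large outliers (about 2% of samples).
noise = 0.05 * rng.standard_normal(5000)
noise[rng.random(5000) < 0.02] += 10.0
desired = inputs @ true_weights + noise

estimated = mcc_weights(inputs, desired)
weight_error_norm = np.linalg.norm(true_weights - estimated)
```

The kernel gain is what distinguishes this update from plain least-mean-squares: samples with very large errors contribute almost nothing, which is why the criterion is often chosen for data containing false or corrupted readings.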

IV. PERFORMANCE ANALYSIS
A. MEAN-SQUARE STABILITY
The mean-square stability is an important index for GCA, which has been extensively studied in the literature [26]. In order to evaluate the stability of the weight error $\tilde{W}_d(i)$, it was assumed that the prediction error $\{e_d(i)\}$ was zero-mean Gaussian distributed, with variance $\sigma_d^2$, and independent of $\{X_d(i)\}$. Equations (18) and (19) are now used to analyze the mean-square stability of the weight update (19), starting from (23). Assuming that the weight update (19) is stable, a steady-state form of (23) is obtained. Considering the assumption and equation (22), and substituting (25) and (26) into (24), the condition (31) on the step size follows. Note that if the step size $\eta_d$ satisfies condition (31), $E[\|\tilde{W}(i)\|^2]$ will be decreasing, and hence the update is stable.

B. RESULT VALIDATION
Transfer entropy (TE) is a very useful tool, in terms of stability and accuracy, for quantifying directional causality in both linear and nonlinear relationships. To verify the accuracy of the results based on MCC, TE was applied to explore the air pollutant causality between the surrounding cities and Beijing. $TE_{X_\tau \to Y}$ is defined from conditional entropies as in (32), where $\mathbf{y}_{d,t} = [y_{d,t}, y_{d,t-1}, \cdots, y_{d,t-m+1}]$ is an $m$-dimensional vector, the corresponding history vector of $X_\tau$ is likewise an $m$-dimensional vector, and $P(y_{d,t+u} \mid \mathbf{y}_{d,t})$ is the entropy of the process $Y$ conditioned on its past. According to the properties of conditional probabilities, equations (33) and (34) are substituted into equation (32) to give (35). According to equation (35), the impact of $X_\tau$ on $Y$ can be calculated. $TE_{X_\tau \to Y}$ $(\tau = 1, 2, 3, 4, 5)$ is then calculated, and the $X_\tau$ with the greatest impact on $Y$ is the one with the maximum $TE_{X_\tau \to Y}$. By calculation, Tangshan has the greatest impact on Beijing's air quality in terms of PM2.5 and SO2, Baoding has the greatest impact on Beijing in terms of PM10 and NO2, and Tianjin has the greatest impact on Beijing in terms of CO. This result is consistent with the result of MCC (TABLE 2).
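A simple plug-in estimate of transfer entropy can be sketched as follows. It assumes history length $m = 1$ and prediction horizon $u = 1$, and it discretizes each series into three quantile bins; these choices, the function names, and the synthetic data are illustrative assumptions, and the paper does not state which estimator was used.

```python
from collections import Counter

import numpy as np


def quantile_bins(series, n_bins):
    """Map each value to a bin index 0..n_bins-1 using empirical quantiles."""
    edges = np.quantile(series, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(series, edges).tolist()


def transfer_entropy(source, target, n_bins=3):
    """Plug-in estimate of TE(source -> target) in nats, with m = 1 and u = 1."""
    s = quantile_bins(source, n_bins)
    t = quantile_bins(target, n_bins)
    future, past, driver = t[1:], t[:-1], s[:-1]

    triples = Counter(zip(future, past, driver))   # (y_{t+1}, y_t, x_t)
    past_driver = Counter(zip(past, driver))       # (y_t, x_t)
    future_past = Counter(zip(future, past))       # (y_{t+1}, y_t)
    past_only = Counter(past)                      # y_t
    n = len(future)

    te = 0.0
    for (y_next, y_now, x_now), count in triples.items():
        p_given_both = count / past_driver[(y_now, x_now)]
        p_given_own = future_past[(y_next, y_now)] / past_only[y_now]
        te += (count / n) * np.log(p_given_both / p_given_own)
    return te


# Synthetic example (assumption): x drives y with a one-step delay.
rng = np.random.default_rng(2)
x = rng.standard_normal(2000)
y = np.zeros(2000)
y[1:] = 0.9 * x[:-1] + 0.1 * rng.standard_normal(1999)

te_x_to_y = transfer_entropy(x, y)
te_y_to_x = transfer_entropy(y, x)
```

With short series such as 26 daily values, a binned estimator like this one is strongly biased, so the number of bins and the history length would need to be chosen with care.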

V. EXPERIMENTS
To verify the effectiveness of our algorithm for CS and GCA, experiments were conducted using Python 3.7, and the data were obtained from the Online Monitoring and Analysis Platform for Air Quality in China [27]. A comparative analysis of a large amount of data showed that the data sampled every 4 hours are concentrated in distribution. For the convenience of data compression, the elements of the temporal compressive basis are all 1; that is, the average value of the six groups of data in each day is taken as the causal analysis data. Therefore, only the spatial compression ratio is considered.
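The temporal step described here reduces to a daily mean. A minimal sketch, assuming six readings per day stored in chronological order; the function name and the sample values are hypothetical:

```python
import numpy as np

SAMPLES_PER_DAY = 6  # one reading every 4 hours


def daily_average(readings):
    """Apply an all-ones temporal basis to each day and divide by the sample count."""
    readings = np.asarray(readings, dtype=float)
    n_days = len(readings) // SAMPLES_PER_DAY
    trimmed = readings[: n_days * SAMPLES_PER_DAY]   # drop an incomplete last day
    temporal_basis = np.ones(SAMPLES_PER_DAY)
    per_day = trimmed.reshape(n_days, SAMPLES_PER_DAY)
    return per_day @ temporal_basis / SAMPLES_PER_DAY


two_days = [10, 20, 30, 40, 50, 60, 5, 5, 5, 5, 5, 5]
daily = daily_average(two_days)
```

Multiplying by the all-ones vector sums the six readings of each day, so dividing by six gives the daily mean used as input to the causal analysis.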
When the data are compressed in the temporal space, the average value of the six samples taken in a day is used, which is equivalent to adopting only one data point per day. Therefore, the step size $\eta_d = 1$. Since our main goal is causal analysis rather than data compression, we only use a traditional sensing matrix and sparse basis to achieve preliminary data processing. To facilitate the experiment, $\sigma_d^2 = 1$. To implement the analysis, the following steps had to be completed:
1) Original data from the open official website was sampled.
2) The original data was compressed based on CS.
3) The causality of pollutants between Beijing and its surrounding cities was analyzed based on GCA.
4) The pollutants in each city which had the greatest impact on Beijing's air quality were determined based on MCC.
To achieve these processes, the related problems had to be addressed from two perspectives.
1) Because the data set to be processed was large, coupled with the fact that CS is a lossy compression process, the data compression accuracy varied with the compression ratio.
2) The efficiency of data processing and analysis was greatly affected by the compression ratio, causing a tradeoff between error ratio, efficiency, and compression ratio.
From Fig. 4a, it can be seen that the error ratio decreased as the compression ratio increased. When the compression ratio increased to 65%, the error ratio dropped to almost 5%. Similarly, the efficiency decreased as the compression ratio increased, as shown in Fig. 4b. When the compression ratio increased to 65%, the efficiency was still 90%. As the compression ratio continued to increase, the efficiency dropped sharply. Therefore, a tradeoff point was reached when the compression ratio reached 65%.
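The dependence of the error ratio on the compression ratio can be explored with a small sweep. The sketch below assumes the error ratio is the relative $l_2$ reconstruction error, uses synthetic signals that are 6-sparse in a DCT basis, and reconstructs by basis pursuit; none of these choices is confirmed by the paper, and the resulting numbers are not meant to reproduce Fig. 4a.

```python
import numpy as np
from scipy.fft import idct
from scipy.optimize import linprog


def error_ratio(n, m, sparsity, rng):
    """Relative l2 reconstruction error for one random sparse signal."""
    psi = idct(np.eye(n), axis=0, norm="ortho")          # DCT sparse basis
    z = np.zeros(n)
    support = rng.choice(n, size=sparsity, replace=False)
    z[support] = rng.standard_normal(sparsity)
    signal = psi @ z

    phi = rng.standard_normal((m, n)) / np.sqrt(m)       # Gaussian sensing matrix
    a = phi @ psi
    # Basis pursuit: minimize ||z||_1 subject to a @ z = phi @ signal.
    result = linprog(
        c=np.ones(2 * n),
        A_eq=np.hstack([a, -a]),
        b_eq=phi @ signal,
        bounds=(0, None),
        method="highs",
    )
    z_hat = result.x[:n] - result.x[n:]
    return np.linalg.norm(psi @ z_hat - signal) / np.linalg.norm(signal)


rng = np.random.default_rng(0)
n = 40
ratios = [0.2, 0.35, 0.5, 0.65, 0.8, 0.9]
errors = {
    ratio: float(np.mean([error_ratio(n, int(round(ratio * n)), 6, rng)
                          for _ in range(5)]))
    for ratio in ratios
}

for ratio in ratios:
    print(f"M/N = {ratio:.2f}: mean error ratio = {errors[ratio]:.4f}")
```

A sweep like this only addresses accuracy; measuring the efficiency side of the tradeoff would additionally require timing the downstream causal analysis at each ratio.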

VI. CONCLUSION
In this paper, we proposed a novel method of causal identification based on compressive sensing of air pollutants using urban big data. The compression model was developed to compress spatiotemporal correlation data representing the amount of air pollutants in Beijing and its surrounding cities. By extending the existing Granger causality theory, the spatiotemporal Granger causality analysis model was adopted to identify which cities had an impact on the air quality in Beijing. Specifically, the degrees of impact of the surrounding cities were determined by applying the MCC. To verify the effectiveness of the algorithm, the mean-square stability of GCA was confirmed, and the accuracy of the degrees of impact obtained with MCC was verified using transfer entropy. By conducting experiments on real meteorological big sensing data, it was demonstrated that the proposed algorithm significantly improved the performance of data causality analysis within the allowed error ratio.
In fact, we have discussed the linear causality of air pollutants between Beijing and surrounding cities. Next, we will continue to study the nonlinear causality between them, which is of great significance for the study of air pollution.
MINGWEI LI received the M.S. degree in operations research and control theory and the Ph.D. degree in control theory and control engineering from Northeastern University, Shenyang, China, in 2008 and 2014, respectively. Since 2017, she has been an Assistant Professor with Northeastern University at Qinhuangdao. Her research interests include data analysis, big data processing, and compressive sensing.
JINPENG LI received the B.S. degree in information and computing sciences from Shanxi Agricultural University, China, in 2018. He is currently pursuing the master's degree with Northeastern University, China. His research interests include applied mathematics, artificial intelligence, machine learning, data mining, and programming languages.
SHUANGNING WAN is currently pursuing the bachelor's degree in learning applied statistics with Northeastern University at Qinhuangdao (NEUQ). His current research interests include data mining, data handling, and programming languages.
HAO CHEN is currently pursuing the bachelor's degree in information and computing science with Northeastern University at Qinhuangdao (NEUQ). His current research interests include machine learning, web development, and programming languages.
CHAO LIU received the M.S. degree in fundamental mathematics and the Ph.D. degree in control theory and control engineering from Northeastern University, Shenyang, China, in 2006 and 2009, respectively. Since 2011, he has been an Assistant Professor with Northeastern University at Qinhuangdao. His research interests include modeling and dynamical analysis of the stochastic biological and infectious disease systems.