Introduction
The North American Great Lakes (Lakes Superior, Michigan, Huron, Erie, and Ontario) have been subjected to elevated pollution due to intensive human activities [1], [2]. Therefore, the United States Environmental Protection Agency (U.S. EPA) implemented the Integrated Atmospheric Deposition Network (IADN) program to monitor air pollution over the Great Lakes. The program applies Internet of Things technology to environmental monitoring: it collects data widely and uses intelligent techniques such as data mining to screen and refine the collected data, so as to provide researchers and decision makers with safe, reliable, and effective data. The IADN team has been measuring Persistent Organic Pollutants (POPs) in the Great Lakes atmosphere and precipitation since 1990. Venier et al. reported that the residues of POPs in the Great Lakes air would not be removed promptly and that they mainly arose from human activities [3]. Although the concentrations of various contaminants in the IADN samples have declined since the implementation of the Stockholm Convention in 2004, the decreasing trends in atmospheric levels of PCB-11 were not significant [4], [5].
Previous studies usually analyzed the IADN results with traditional statistical approaches and did not attempt prediction. A. Salamova et al. presented their measurements of several halogenated and non-halogenated organophosphate esters (OPs) in particle samples collected as part of the Integrated Atmospheric Deposition Network (IADN), and reported some statistical analysis of the data [6]. R.A. Hites separated the analytical error from the sampling error for the target compounds by using surrogate (recovery) standards [7]. The precision of atmospheric concentration measurements of POPs was discussed by D.C. Lehman et al. [8]. In the present study, we revisited these data and analyzed them using machine learning combined with data-driven research methods, in which the model is adjusted until it fits the data. Specifically, the input factors of the time series model, such as the model orders and the differencing orders, are tuned so that the model fits the data and yields better predictions. In this way the temporal and spatial trends of POPs were analyzed more accurately, and predictions were made about future concentrations.
Machine learning, a branch of artificial intelligence, has received mounting attention from numerous research fields, including environmental, epidemiological, and pharmaceutical studies, among others [9]. Its algorithms include supervised learning, unsupervised learning, and reinforcement learning. These models can efficiently visualize complex data through multiple techniques (e.g., data sorting and dimensionality reduction), enabling researchers to extract valuable information from large datasets [10]. Additionally, machine learning makes it possible to use existing data to perform predictions scientifically, which can be of great commercial and social value [11].
Machine learning is applied across a wide range of research. Intelligent model design for complex systems has become a key issue for organizational responsiveness to uncertainty. J. Li and N. Xiong et al. provided a novel framework and approach to design cluster supply chains without across-chain horizontal cooperation [12]. M.M. Hassan and H. Liao et al. applied machine learning to the IIoT (Industrial Internet of Things) environment and achieved good results [13], [14]. By importing a heuristic algorithm, F. Long et al. improved the virtual topology strategy to satisfy the requirements of users [15], [16]. Y. Yang et al. proposed a decentralized flocking algorithm to achieve collision avoidance [17]. Z. Zhou applied a machine learning algorithm to the IoHT (Internet of Health Things), proposed a new scheme, and verified its effectiveness and reliability [18]. An increasing number of environmental researchers have started incorporating machine learning to evaluate their data. For example, Knoll et al. and Nourani et al. used machine learning techniques to predict groundwater levels and nitrate concentrations, and compared the prediction performance of various models [19], [20]. Machine learning algorithms and models have great practicability [21], [22]. To estimate the potential threats posed by groundwater to public health, a previous study integrated Neural Networks (NN) and Support Vector Machines (SVM) into a Geographic Information System (GIS) to identify contaminated wells, and used logistic regression and feature selection methods to prioritize variables [23], [24]. Machine learning was also successfully adopted to predict the dissociation energy of carbon-fluorine bonds [25], as well as the biological activity of per- and polyfluoroalkyl substances [26]. In particular, time series models have been studied and applied very extensively. T. Shen et al. proposed the 3D Augmented Convolutional Network (3DACN) to extract time series information and solve the problem of seriously imbalanced data [27]. Based on machine learning, Y. Zhang and J.C. Sun et al. put forward improved time series analysis methods, which allowed the information in a time series to be extracted by analyzing the associated complex network [28]. R.J. Zhou proposed Flexible Multi-Scale Entropy (FMSE) to increase the reliability and stability of measuring time series complexity [29]. These studies showed the great advantage of employing machine learning.
The rest of this article is organized as follows. Section II describes the sources of the research data and the research methods. Section III presents detailed data analysis, visualization of the experimental outcomes, and the evaluation of the models. Section IV gives environment-related conclusions based on the results obtained in the previous two sections. Section V summarizes the findings of this study and outlines prospects for future research.
Materials and Methods
A. Research Area
The Great Lakes, located in east-central North America between the United States and Canada, form the largest group of freshwater lakes in the world; the total area of these five lakes is 245,660 square kilometers [30]. The IADN is a long-term atmospheric monitoring program run by the U.S. EPA's Great Lakes National Program Office. Under this program, Indiana University has been measuring Polychlorinated Biphenyls (PCBs), Pesticides (PESTs), Polycyclic Aromatic Hydrocarbons (PAHs), and flame retardants in the Great Lakes atmosphere and precipitation since 1990. The sampling sites were Brule River (BR), Chicago (CHIC), Cleveland (CLEV), Eagle Harbor (EH), Point Petre (PP), Sleeping Bear Dunes (SBD), and Sturgeon Point (STP) (see Figure 1) [31].
Based on Internet of Things technology, a set of atmospheric samples (both vapor and particle phases) was collected for 24 hours every 12 days with a bulk-active air sampler at each site (except at PP, where the sampling frequency was every 24 or 36 days). The sampling periods were 1996–2002 at BR, 1996–2016 at CHIC, 2003–2016 at CLEV, 1990–2016 at EH, 1998–2016 at PP, 1991–2016 at SBD, and 1991–2016 at STP. A detailed list of targeted contaminants can be found at the IADN Data Viz website, from which the data discussed in this article were also downloaded [29].
B. Introduction to Modeling Workflow
In this study, we analyzed the total concentrations of PCBs, PESTs, and PAHs (
C. Data Pre-Processing
Raw monitoring datasets commonly contain missing values and outliers for various reasons, so data preprocessing is required prior to modeling. Inspection of the original dataset confirmed a number of both. Generally speaking, there are three main ways to handle missing data: 1) using the attribute that contains the missing values directly, without any processing; 2) deleting an attribute that contains a large number of missing values; and 3) replacing the missing values. In this article, we adopted the second and third approaches. As the majority of
Trend charts of POPs concentrations in the Great Lakes atmosphere after the missing data processing.
As can be seen in Figure 3 (a), the trend graphs of
Then, box plots generated in Python were used to identify outliers. The structure of a box plot is shown in Figure 4. After calculating the first quartile (Q1, at the 25% position), the median, and the third quartile (Q3, at the 75% position), we defined the inter-quartile range (IQR) as (Q3-Q1), and considered the values outside the range between (
In Figure 5 (a), the sampling period at EH runs from 1990 to 2016, which is the longest among the seven sampling sites. At the EH site, the number of abnormal data points per year is within the acceptable range, amounting to only a handful. In Figure 5 (b), the sampling period at STP runs from 1991 to 2016. Compared with the other sampling sites, the volume of data collected at STP was also larger. The pattern of abnormal data points was similar to that at the EH site.
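The box-plot rule used for outlier screening can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the multiplier `k = 1.5` is the conventional box-plot choice and is an assumption here, the function name `iqr_outliers` is hypothetical, and the sample values are made up rather than taken from IADN.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) for the box-plot rule.

    k = 1.5 is the conventional multiplier (an assumption, not taken
    from the article).
    """
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


# Hypothetical concentrations in arbitrary units (not IADN data).
sample = [12.0, 14.5, 13.2, 15.1, 12.8, 13.9, 14.2, 95.0]
lower, upper, outliers = iqr_outliers(sample)
print(outliers)
```

Applying such a function per site and per year would give yearly outlier counts of the kind summarized in Figure 5.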
D. Pre-Analysis of Experimental Data
The annual median concentrations (i.e., experimental data; after data preprocessing) of
Taking the average of the above experimental data gives the spatial distribution of the three kinds of POPs at the seven sampling sites, which is shown in Figure 6 and provides a clear picture of how PCBs, PESTs, and PAHs are distributed. Because PESTs data are entirely missing for the BR site, BR is not included in Figure 6 (b). The pattern shown in this figure is consistent with the data analysis result in Table 1.
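As an illustration of this aggregation step, the sketch below computes annual medians per site, averages them, and expresses each site's average as a share of the total, which is the kind of quantity a pie chart displays. The function names and the numbers are hypothetical, and the exact averaging used for Figure 6 may differ.

```python
from collections import defaultdict
from statistics import mean, median


def annual_medians(samples):
    """samples: iterable of (year, concentration); returns {year: median}."""
    by_year = defaultdict(list)
    for year, conc in samples:
        by_year[year].append(conc)
    return {year: median(vals) for year, vals in sorted(by_year.items())}


def site_shares(site_samples):
    """Average the annual medians per site and normalize to fractions."""
    site_means = {
        site: mean(list(annual_medians(rows).values()))
        for site, rows in site_samples.items()
    }
    total = sum(site_means.values())
    return {site: m / total for site, m in site_means.items()}


# Hypothetical concentrations for two sites (not IADN data).
data = {
    "EH": [(1990, 2.0), (1990, 4.0), (1991, 3.0)],
    "STP": [(1991, 8.0), (1991, 10.0), (1992, 9.0)],
}
shares = site_shares(data)
print(shares)
```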
Pie charts of three kinds of POPs spatial distribution proportion at seven sampling points.
E. Model Introduction
To predict the future trends of POPs' concentrations based on the existing IADN data, we used the Time Series Prediction Method (TSPM), a machine learning approach. The autoregressive moving average (ARMA) model is one such method [32]. ARMA, the most commonly used model for fitting a stationary sequence, can be subdivided into three categories: the AR (autoregressive) model, the MA (moving average) model, and the ARMA (autoregressive moving average) model. When a time series is not itself stationary but its increment, that is, its difference, is stable near zero, the differenced series can be regarded as a stationary sequence. In practice, most non-stationary sequences become stationary after one or more rounds of differencing [33].
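The effect of differencing can be seen on a small example. The sketch below, with a hypothetical helper `difference` and a made-up series, applies first differencing repeatedly; a series with a quadratic trend becomes constant after two rounds, which is the sense in which differencing removes non-stationary trends.

```python
def difference(series, order=1):
    """Apply first differencing (y_t - y_{t-1}) `order` times."""
    out = list(series)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


# Hypothetical series with a quadratic trend: 0, 1, 4, 9, 16, 25.
trend = [t * t for t in range(6)]
d1 = difference(trend, order=1)  # still trending upward
d2 = difference(trend, order=2)  # constant, so no trend remains
print(d1, d2)
```

Each round of differencing shortens the series by one observation, so the differencing order should be kept as low as the data allow.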
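The data sequence formed by the predicted index over time is regarded as a random sequence, and the dependence among this group of random variables reflects the continuity of the original data in time. The index is affected on the one hand by external influencing factors and on the other hand by its own pattern of change. Assume that the influencing factors are $x_1, x_2, \ldots, x_k$. Regression analysis then gives
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + e,$$
where $Y$ is the observed value of the predicted index and $e$ is the error. Because the predicted index $Y_t$ is also affected by its own past values, it follows
$$Y_t = \beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \cdots + \beta_p Y_{t-p} + e_t.$$
The error term is dependent across periods, which is expressed by the following formula,
$$e_t = \alpha_0 + \alpha_1 e_{t-1} + \alpha_2 e_{t-2} + \cdots + \alpha_q e_{t-q} + \mu_t.$$
Thus, the ARMA model expression can be obtained:
$$Y_t = \beta_0 + \beta_1 Y_{t-1} + \cdots + \beta_p Y_{t-p} + \alpha_0 + \alpha_1 e_{t-1} + \cdots + \alpha_q e_{t-q} + \mu_t.$$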
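To make the recursion concrete, the following sketch computes in-sample residuals and a one-step-ahead forecast for an ARMA(p, q) model in its standard form, with a constant `c`, autoregressive coefficients `phi`, and moving-average coefficients `theta`. These names, the coefficient values, and the convention of treating pre-sample terms as zero are all assumptions for illustration; in practice the coefficients would be estimated from the data rather than supplied by hand.

```python
def arma_forecast(y, c, phi, theta):
    """One-step-ahead ARMA(p, q) forecast with the standard recursion.

    y_hat[t] = c + sum_i phi[i] * y[t-1-i] + sum_j theta[j] * eps[t-1-j]
    eps[t]   = y[t] - y_hat[t]
    Terms that would need values before the start of the series are
    dropped (a simplifying assumption).
    Returns (in-sample residuals, forecast for the next time step).
    """
    eps = []

    def predict(t):
        ar = sum(p * y[t - 1 - i] for i, p in enumerate(phi) if t - 1 - i >= 0)
        ma = sum(q * eps[t - 1 - j] for j, q in enumerate(theta) if t - 1 - j >= 0)
        return c + ar + ma

    for t in range(len(y)):
        eps.append(y[t] - predict(t))
    return eps, predict(len(y))


# Hypothetical series and coefficients (not fitted to IADN data).
series = [1.0, 2.0, 3.0]
residuals, next_value = arma_forecast(series, c=0.0, phi=[0.5], theta=[0.5])
print(residuals, next_value)
```

The residuals returned here are the quantities that the feasibility checks (normality and autocorrelation) operate on.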
The ARMA model building algorithm is seen in Figure 7 (
Results and Discussion
A. Data Trend Overviewing
The temporal trends of median
Broken line diagrams for median
B. Analysis of Visualized Result
In this study, by repeatedly optimizing the parameters to better fit the data, a data-driven intelligent environmental model based on the ARMA algorithm was constructed to predict POPs concentrations in the next few years. EH and STP have the largest sample sizes; in addition, STP represents a rural site and EH a remote site, which makes the sampled data rich and representative. We therefore used their data to predict the
ARMA model prediction of median
According to Figure 9 (a), in the EH atmosphere, there will be a slight increase in
C. Model Evaluation
We evaluated the constructed ARMA model from two aspects: feasibility analysis and sensitivity analysis.
1) Feasibility Analysis
The feasibility analysis of the ARMA model was conducted from two aspects: 1) testing whether the residuals were normally distributed using the QQ (Quantile-Quantile) plot; and 2) assessing the autocorrelation of the residuals using the Durbin-Watson (D-W) statistic (a D-W value close to 0 or 4 indicates autocorrelation, whereas a value close to 2 indicates no autocorrelation). Figure 10 shows the QQ plots for the ARMA modelling of median
QQ plots of ARMA prediction model for median
Figure 10 (a) is the QQ plot of the ARMA model for the median
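Both residual checks can be sketched with the Python standard library. The Durbin-Watson statistic below follows its usual definition (sum of squared successive differences divided by the sum of squared residuals), and the QQ points pair standard normal quantiles with standardized, sorted residuals. The function names, the plotting positions `(i - 0.5) / n`, and the residual values are assumptions for illustration only.

```python
from statistics import NormalDist, mean, pstdev


def durbin_watson(residuals):
    """D-W statistic: near 2 means no first-order autocorrelation."""
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e ** 2 for e in residuals)
    return num / den


def qq_points(residuals):
    """Pairs of (theoretical normal quantile, standardized residual)."""
    n = len(residuals)
    mu, sigma = mean(residuals), pstdev(residuals)
    ordered = sorted((e - mu) / sigma for e in residuals)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n are one common convention.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Hypothetical residuals: a strictly alternating pattern, i.e. strong
# negative autocorrelation, so D-W moves toward 4.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0]
dw_alt = durbin_watson(alternating)
points = qq_points(alternating)
print(dw_alt)
```

If the residuals are close to normal, the QQ points fall near the line y = x; systematic curvature away from that line suggests skewness or heavy tails.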
2) Sensitivity Analysis
Sensitivity Analysis (SA) investigates how the variation in the output of a numerical model can be attributed to variations of its input factors [34]. Figure 11 shows input factors and output definition for the SA of the model. As we can see from the figure, the main input factors included processed time series, model parameters, differential orders and model fixed orders. These input factors all have certain influence on the predicted value of the model.
Based on the model type, we used correlation methods to perform the sensitivity analysis of this model. We can define the sensitivity of the model as follows:![]()
![]()
Environmental Significance
PCBs have been widely used as insulation oil, heat carriers, and lubricating oil. They can also be used as additives in many industrial products (such as various resins, rubber, binders, coatings, carbon paper, ceramic glazes, fire retardants, pesticide extenders, and dye dispersants) [37]. PCBs are carcinogens that tend to accumulate in fatty tissue; they can cause diseases of the brain, skin, and internal organs, and they also affect the nervous, reproductive, and immune systems [38]. Thus, they were added to the Stockholm Convention and have been banned in most countries (including the United States and Canada) for decades, which is consistent with the downward trend we found and predicted.
PESTs can also accumulate in tissues such as the heart, liver, and kidney, and enter human and animal bodies through the food chain [39]. The pesticides that accumulate in the body can also be excreted through the mother's milk, or passed into eggs, ultimately affecting the offspring [40]. Therefore, countries strictly control the residues of PESTs in food. For example, Germany, the United States, Japan, and many other countries do not allow cyclopentadienyl PESTs to be detected in food. In the 1960s, China began to ban the use of DDT and 666 on tobacco, vegetables, tea, and other crops [41]. Though the
PAHs are ubiquitous in the environment. They mainly come from the burning of coal and oil, but also from garbage incineration and forest fires. Their production volumes are closely related to the combustion equipment and the combustion temperature [42]. PAHs are found in the exhaust of diesel and gasoline engines, as well as in the waste gas and water from refineries, coal tar processing plants, and asphalt processing plants [43]. The decreasing trend of
These three classes of POPs have not only posed risks to the environment but also adversely affected human health. Therefore, they should be used with caution. Our results illustrate that the
In the IoT, local networks, the Internet, and other communication technologies connect sensors, controllers, machines, people, and objects in new ways, forming people-to-object and object-to-object links that enable information-based remote management, control, and intelligent networking [47]. IoT is widely used in environmental protection and plays a very important role in making it intelligent. The research in this study was carried out in an IoT environment. First, environment-related data are collected using IoT technology. Second, the required data are obtained from the Internet. Third, machine learning and data analysis technologies are applied to the data. Finally, the research results can be uploaded and shared.
Conclusion and Future Work
We analyzed
Future research in this field can proceed in the following three directions. The first is to study the relationship between the particle-phase concentration percentage of each persistent organic pollutant and temperature by constructing an appropriate regression model. The IADN collected concentrations of persistent organic pollutants in both the vapor and particle phases. Physically, matter tends to move from the solid phase to the gas phase as the temperature rises. Calculating the persistent organic pollutant particle phase percentage, that is P/(
ACKNOWLEDGMENT
The data for this work were downloaded from Indiana University’s IADN Data Viz website. The authors thank Dr. Yan Wu at Indiana University for useful discussions.