CCBY - IEEE is not the copyright holder of this material. Please follow the instructions via https://creativecommons.org/licenses/by/4.0/ to obtain full-text articles and stipulations in the API documentation.
SECTION I.

Introduction

The North American Great Lakes (Lakes Superior, Michigan, Huron, Erie, and Ontario) have been subjected to elevated pollution due to intensive human activities [1], [2]. Therefore, the United States Environmental Protection Agency (U.S. EPA) implemented the Integrated Atmospheric Deposition Network (IADN) program to monitor air pollution over the Great Lakes. The program applies Internet of Things technology to environmental monitoring: it collects data widely and uses intelligent techniques such as data mining to screen and refine the collected data, so as to provide researchers and decision makers with safe, reliable, and effective data. The IADN team has been measuring Persistent Organic Pollutants (POPs) in the Great Lakes atmosphere and precipitation since 1990. Venier et al. reported that the residues of POPs in the Great Lakes air would not be removed promptly and that they mainly arose from human activities [3]. Although the concentrations of various contaminants have declined in the IADN samples since the implementation of the Stockholm Convention in 2004, the decreasing trends in atmospheric levels of PCB-11 were not significant [4], [5].

Previous studies usually analyzed the IADN results with traditional statistical approaches and did not attempt prediction. A. Salamova et al. presented their measurements of several halogenated and non-halogenated OPs (organophosphate esters) in particle samples collected as part of the Integrated Atmospheric Deposition Network (IADN) and gave some statistical analysis results for the data [6]. R.A. Hites separated the analytical error from the sampling error for the target compounds by using surrogate (recovery) standards [7]. The precision of atmospheric concentration measurements of POPs was discussed by D.C. Lehman et al. [8]. In the present study, we revisited these data and analyzed them with machine learning combined with data-driven research methods, in which the model is adjusted until it fits the data. In this article, the input factors of the time series model, such as the model orders and the differencing order, are adjusted so that the model fits the data and achieves a better prediction effect. The temporal and spatial trends of POPs were analyzed more accurately, and some predictions were made about future concentrations.

Machine learning, a branch of artificial intelligence, has received mounting attention from numerous research fields, including environmental, epidemiological, and pharmaceutical studies, among others [9]. Its algorithms include supervised learning, unsupervised learning, and reinforcement learning. These models can efficiently visualize complex data using multiple techniques (e.g., data sorting and dimensionality reduction), enabling researchers to extract valuable information from large datasets [10]. Additionally, machine learning facilitates the use of existing data to make predictions scientifically, which can be of great commercial and social value [11].

Machine learning has been applied across a wide range of research. The intelligent model design of complex systems has become a key issue for an organization's responsiveness to uncertainties. J. Li and N. Xiong et al. provided a novel framework and approach to design a cluster supply chain without across-chain horizontal cooperation [12]. M.M. Hassan and H. Liao et al. applied machine learning to the IIoT (Industrial Internet of Things) environment and achieved good results [13], [14]. By introducing a heuristic algorithm, F. Long et al. improved the virtual topology strategy to satisfy the requirements of users [15], [16]. Y. Yang et al. proposed a decentralized flocking algorithm to achieve collision avoidance [17]. Z. Zhou applied a machine learning algorithm to the IoHT (Internet of Health Things), proposed a new scheme, and verified its effectiveness and reliability [18]. An increasing number of environmental researchers have started incorporating machine learning to evaluate their data. For example, Knoll et al. and Nourani et al. used machine learning techniques to predict groundwater levels and nitrate concentrations, and compared the prediction performance of various models [19], [20]. Machine learning algorithms and models have great practicability [21], [22]. To estimate the potential threats posed by groundwater to public health, a previous study integrated Neural Networks (NN) and Support Vector Machines (SVM) into a Geographic Information System (GIS) to identify contaminated wells, and used logistic regression and feature selection methods to prioritize variables [23], [24]. Machine learning was also successfully adopted to predict the dissociation energy of carbon-fluorine bonds [25], as well as the biological activity of per- and polyfluoroalkyl substances [26]. In particular, time series models have been studied and applied extensively. T. Shen et al. proposed the 3D Augmented Convolutional Network (3DACN) to extract time series information and address the problem of seriously imbalanced data [27]. Based on machine learning, Y. Zhang and J.C. Sun et al. put forward improved time series analysis methods, which allowed the information in a time series to be extracted by analyzing the associated complex network [28]. R.J. Zhou proposed Flexible Multi-Scale Entropy (FMSE) to increase the reliability and stability of measuring time series complexity [29]. These studies showed the great advantage of employing machine learning.

The rest of this article is organized as follows. Section II describes the sources of the research data and the research methods. Section III presents the detailed data analysis, visualizes the experimental outcomes, and evaluates the models. Section IV gives environment-related conclusions based on the results obtained in the previous two sections. Section V summarizes the findings of this study and outlines prospects for future research.

SECTION II.

Materials and Methods

A. Research Area

The Great Lakes, located in east-central North America between the United States and Canada, form the largest group of freshwater lakes in the world; the total area of the five lakes is 245,660 square kilometers [30]. The IADN is a long-term atmospheric monitoring program run by the Great Lakes National Program Office of the U.S. EPA. Under this program, Indiana University has been measuring Polychlorinated Biphenyls (PCBs), Pesticides (PESTs), Polycyclic Aromatic Hydrocarbons (PAHs), and flame retardants in the Great Lakes atmosphere and precipitation since 1990. The sampling sites were Brule River (BR), Chicago (CHIC), Cleveland (CLEV), Eagle Harbor (EH), Point Petre (PP), Sleeping Bear Dunes (SBD), and Sturgeon Point (STP) (see Figure 1) [31].

FIGURE 1. - Map of the IADN sampling sites around the Great Lakes.

Based on Internet of Things technology, a set of atmospheric samples (both vapor and particle phases) was collected every 12 days with a bulk-active air sampler for 24 hours at each site (except at PP, where the sampling frequency was every 24 or 36 days). The sampling periods were 1996–2002 at BR, 1996–2016 at CHIC, 2003–2016 at CLEV, 1990–2016 at EH, 1998–2016 at PP, 1991–2016 at SBD, and 1991–2016 at STP. A detailed list of the targeted contaminants can be found on the IADN Data Viz website. The data discussed in this article were also downloaded from that website [29].

B. Introduction to Modeling Workflow

In this study, we analyzed the total concentrations of PCBs, PESTs, and PAHs ($\sum $ PCBs, $\sum $ PESTs, and $\sum $ PAHs) in the IADN vapor phase samples for their spatiotemporal trends and made predictions by building a suitable model. The modeling workflow is shown in Figure 2. The entire modeling process consists of three stages. Stage I is preparation before modeling, which mainly includes mission understanding, data importing, data understanding, and data preparation. Data preparation refers to the series of operations that turn the original data into experimental data. Stage II is modeling. Its main steps include model type selection and hyperparameter setting, model training and statistics viewing, evaluation of goodness of fit, and discussion of the modeling assumptions. In addition, the model needs to be optimized and reselected if the obtained results do not meet the requirements. Stage III is the application of the model, in which the function corresponding to the model type is implemented. The model constructed in this article realizes the prediction function. The details are given in Figure 2.

FIGURE 2. - Flow chart of modeling.

C. Data Pre-Processing

The original dataset contains missing values and outliers for various reasons, so data preprocessing is required prior to modeling. After investigating the original dataset, we detected a number of missing values and outliers. Generally speaking, there are three main methods for handling missing data: 1) using the attribute that contains the missing values directly without any processing; 2) deleting the attribute that contains a large number of missing values; and 3) replacing the missing values. In this article, we adopted the second and third approaches. As the majority of the $\sum $ PESTs data were missing for the BR sampling site, we deleted the corresponding attribute directly, while for the remaining attributes, the missing values were substituted with the corresponding group mean. Taking POPs at the EH and STP sampling sites as examples, the data distributions after the missing value processing are shown in Figure 3.
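
The group-mean replacement described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the grouping variable (here the sampling year), the field names `year` and `pcb`, and the sample values are all assumptions made for the example, since the article does not state how the groups were defined.

```python
from statistics import mean


def fill_with_group_mean(records, group_key, value_key):
    """Return a copy of `records` in which missing values (None) are
    replaced by the mean of the non-missing values in the same group."""
    # Collect the observed (non-missing) values of each group.
    observed = {}
    for row in records:
        if row[value_key] is not None:
            observed.setdefault(row[group_key], []).append(row[value_key])
    group_means = {group: mean(values) for group, values in observed.items()}

    filled = []
    for row in records:
        new_row = dict(row)  # do not modify the caller's data
        if new_row[value_key] is None:
            # Remains None when the whole group is missing.
            new_row[value_key] = group_means.get(new_row[group_key])
        filled.append(new_row)
    return filled


# Hypothetical records: grouping by year is an assumption of this sketch.
samples = [
    {"year": 2000, "pcb": 10.0},
    {"year": 2000, "pcb": None},
    {"year": 2000, "pcb": 14.0},
    {"year": 2001, "pcb": 8.0},
    {"year": 2001, "pcb": None},
]
filled = fill_with_group_mean(samples, "year", "pcb")
```

A group whose values are all missing is left unfilled here, which corresponds to the case where the article deletes the attribute instead of imputing it.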

FIGURE 3. - Trend charts of POPs concentrations in the Great Lakes atmosphere after the missing data processing.

As can be seen in Figure 3 (a), the trend graphs of $\sum $ PCBs, $\sum $ PESTs, and $\sum $ PAHs at the EH sampling site are continuous, indicating that the missing values have been filled in. In addition, we found some anomalous data points in the trend charts. The values of these data points, which needed to be removed in the next step, were quite different from those of most data points. A similar situation can be seen in Figure 3 (b).

Then, box plots drawn in Python were used to identify outliers. The structure of a box plot is shown in Figure 4. After calculating the first quartile (Q1, at the 25% position), the median, and the third quartile (Q3, at the 75% position), we defined the inter-quartile range (IQR) as $Q3-Q1$ and considered the values outside the range between the lower limit (${Q1-1.5\times IQR}$) and the upper limit (${Q3+1.5\times IQR}$) as outliers. Again taking POPs at the EH and STP sampling sites as examples, the box plots for POPs levels in the Great Lakes atmospheric samples are shown in Figure 5, where the outliers clearly stand out; they are excluded from the subsequent elucidation of the POPs’ environmental behaviors.
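
The IQR rule can be written directly with Python's standard library. The sketch below is illustrative only: the concentration values are made up, and the quartile interpolation (`method="inclusive"`) is an assumption, since plotting libraries differ slightly in how they compute Q1 and Q3.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that lie within the IQR limits."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]


# Hypothetical concentrations (pg/m3) with one obviously anomalous sample.
concentrations = [12.0, 15.0, 14.0, 13.0, 16.0, 15.0, 14.0, 95.0]
lower, upper = iqr_bounds(concentrations)
cleaned = remove_outliers(concentrations)
```

For this example, Q1 = 13.75 and Q3 = 15.25, so the limits are 11.5 and 17.5 and only the value 95.0 is discarded.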

FIGURE 4. - Structure of box plot.

FIGURE 5. - Box plots of POPs concentrations in the Great Lakes atmosphere.

In Figure 5 (a), the sampling period of EH is from 1990 to 2016, which is the longest among the seven sampling sites. At the EH sampling site, the number of abnormal data points per year is small and within the acceptable range. In Figure 5 (b), the sampling period of STP is from 1991 to 2016. Compared with the other sampling sites, the volume of data collected at STP was also large. The situation regarding abnormal data points was similar to that at the EH sampling site.

D. Pre-Analysis of Experimental Data

The annual median concentrations (i.e., the experimental data after preprocessing) of $\sum $ PCBs, $\sum $ PESTs, and $\sum $ PAHs for the individual IADN sites are presented in Table 1. The sample sizes for BR, CHIC, CLEV, EH, PP, SBD, and STP were 172, 526, 336, 670, 234, 628, and 634, respectively. $\sum $ PCBs were mainly distributed at CHIC, followed by CLEV, STP, and PP, while the median $\sum $ PCBs in the SBD and EH samples were only two-digit values in most years. The spatial distribution patterns of $\sum $ PESTs at CHIC, CLEV, STP, PP, SBD, and EH were not obvious in the early years. However, the median $\sum $ PCBs at STP, PP, SBD, and EH have dropped to two digits recently, whereas those at both CHIC and CLEV have remained at three digits. The median $\sum $ PAHs were much higher than those of $\sum $ PCBs and $\sum $ PESTs, and the highest levels were observed at CHIC and CLEV, followed by STP, PP, SBD, and EH. The BR data were not discussed because of the small sample size and the missing $\sum $ PESTs data.

TABLE 1 Annual Median Concentrations of POPs in the Great Lakes Atmosphere (After Data Preprocessing; pg/m$^{3}$)

Figure 6 shows the spatial distribution of the three kinds of POPs at the seven sampling sites, obtained by averaging the above experimental data; it gives a clear picture of the spatial distribution of PCBs, PESTs, and PAHs. Since the $\sum $ PESTs data are completely missing for the BR sampling site, BR is not considered in Figure 6 (b). The situation shown in this figure is consistent with the data analysis result in Table 1.

FIGURE 6. - Pie charts of three kinds of POPs spatial distribution proportion at seven sampling points.

E. Model Introduction

To predict the future trends of POPs’ concentrations based on the existing IADN data, the Time Series Prediction Method (TSPM), a machine learning algorithm, was utilized. The autoregressive moving average (ARMA) model is one kind of TSPM [32]. ARMA, the most commonly used model for fitting a stationary sequence, can be subdivided into three categories: the AR (autoregressive) model, the MA (moving average) model, and the ARMA (autoregressive moving average) model. When a time series is not itself stationary but its increment, that is, its difference, is stable near zero, the differenced series can be regarded as a stationary sequence. In practical problems, most of the non-stationary sequences encountered become stationary time series after one or more rounds of differencing [33].
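
As a small illustration of the differencing step, the following sketch applies first- and second-order differences to made-up series; the helper name `difference` and the numbers are assumptions for the example only.

```python
def difference(series, order=1):
    """Apply `order` rounds of first differencing: y[t] - y[t-1]."""
    if order < 0 or order >= len(series):
        raise ValueError("order must be between 0 and len(series) - 1")
    result = list(series)
    for _ in range(order):
        result = [later - earlier for earlier, later in zip(result, result[1:])]
    return result


# A steadily declining (hence non-stationary) hypothetical series.
declining = [100.0, 90.0, 80.0, 70.0, 60.0]
first_diff = difference(declining)

# A quadratic series needs two rounds before it becomes constant.
quadratic = [1, 4, 9, 16, 25]
second_diff = difference(quadratic, order=2)
```

The linear trend becomes a constant sequence after one difference, whereas the quadratic series needs two, which is why the differencing order is treated as an input factor of the model.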

The data sequence formed by the prediction index over time is regarded as a random sequence, and the dependence among this group of random variables reflects the continuity of the original data in time. On the one hand, the predicted object is affected by influencing factors; on the other hand, it follows its own rule of change. Assuming that the influencing factors are $X_{1}$, $X_{2}$, $\ldots$, $X_{k}$ and that $\beta_{i}$ ($i = 0, 1, 2, \ldots$) are coefficients, regression analysis gives\begin{equation*} Y=\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\ldots+\beta_{k}X_{k}+e,\tag{1}\end{equation*}where $Y$ is the observed value of the predicted object and $e$ is the error. As the predicted object, $Y_{t}$ is affected by its own changes, and its law can be reflected by the following formula, in which the lagged terms $X_{t-1}, \ldots, X_{t-p}$ represent the earlier values of the series itself:\begin{equation*} Y_{t}=\beta_{0}+\beta_{1}X_{t-1}+\beta_{2}X_{t-2}+\ldots+\beta_{p}X_{t-p}+e_{t}.\tag{2}\end{equation*}

The error terms in different periods are dependent on one another, which is expressed by the following formula, where $\alpha_{i}$ ($i = 0, 1, 2, \ldots$) are coefficients and $\mu_{t}$ is the random term used to test the co-integration relationship:\begin{equation*} e_{t}=\alpha_{0}+\alpha_{1}e_{t-1}+\alpha_{2}e_{t-2}+\ldots+\alpha_{q}e_{t-q}+\mu_{t}.\tag{3}\end{equation*}

Thus, the ARMA model expression can be obtained:\begin{align*} Y_{t}={}&\beta_{0}+\beta_{1}X_{t-1}+\ldots+\beta_{p}X_{t-p}+\alpha_{0}+\alpha_{1}e_{t-1} \\ &+\ldots+\alpha_{q}e_{t-q}+\mu_{t}.\tag{4}\end{align*}
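
To make expression (4) concrete, the sketch below evaluates its right-hand side for one time step. It assumes the coefficients are already known and sets the random term $\mu_{t}$ to its expected value of zero; the function name, argument layout, and numbers are hypothetical and chosen only for illustration.

```python
def arma_one_step(history, errors, beta, alpha):
    """One-step prediction following expression (4), with mu_t taken as 0.

    history: past values of the series, most recent last
    errors:  past error terms e, most recent last
    beta:    [beta_0, beta_1, ..., beta_p]   (autoregressive part)
    alpha:   [alpha_0, alpha_1, ..., alpha_q] (moving average part)
    """
    p = len(beta) - 1
    q = len(alpha) - 1
    if len(history) < p or len(errors) < q:
        raise ValueError("not enough past values for the requested orders")
    # beta_0 + beta_1 * X[t-1] + ... + beta_p * X[t-p]
    ar_part = beta[0] + sum(beta[i] * history[-i] for i in range(1, p + 1))
    # alpha_0 + alpha_1 * e[t-1] + ... + alpha_q * e[t-q]
    ma_part = alpha[0] + sum(alpha[j] * errors[-j] for j in range(1, q + 1))
    return ar_part + ma_part


# Hypothetical ARMA(2, 1) coefficients and past observations.
prediction = arma_one_step(
    history=[2.0, 4.0],     # X[t-2] = 2.0, X[t-1] = 4.0
    errors=[0.5],           # e[t-1] = 0.5
    beta=[1.0, 0.5, 0.25],  # beta_0, beta_1, beta_2
    alpha=[0.0, 2.0],       # alpha_0, alpha_1
)
```

Here the autoregressive part is 1.0 + 0.5 × 4.0 + 0.25 × 2.0 = 3.5 and the moving average part is 2.0 × 0.5 = 1.0, giving a prediction of 4.5.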

The ARMA model building algorithm is shown in Figure 7 ($p$ refers to the autoregression order, and $q$ refers to the moving average order). First, the processed time series is input. Then, its stationarity is judged. If the time series is stationary, the algorithm moves to the next step; otherwise, a differencing operation is applied to make the time series stationary. Next, the orders of the model are determined. Finally, the parameters are estimated and the model performance is evaluated. If the performance of the model is good, the model is adopted for prediction; otherwise, the orders of the model need to be determined again.
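
The control flow of this procedure can be outlined as follows. This is only a skeleton under stated assumptions: the stationarity test, the fitting routine, and the acceptance check are passed in as caller-supplied functions (`is_stationary`, `fit`, `is_acceptable` are hypothetical names), because the article does not specify which implementations were used. The toy callables at the end exist solely to exercise the loop.

```python
def build_arma(series, is_stationary, fit, is_acceptable,
               candidate_orders, max_diff=2):
    """Difference until stationary, then try (p, q) orders until one is accepted.

    Returns a dict with the differencing order d, the chosen p and q, and
    the fitted model, or None if no candidate order is acceptable.
    """
    work = list(series)
    d = 0
    # Step 1: judge stationarity and difference when necessary.
    while not is_stationary(work):
        if d == max_diff:
            raise ValueError("series is still non-stationary after max_diff differences")
        work = [later - earlier for earlier, later in zip(work, work[1:])]
        d += 1
    # Steps 2 and 3: determine the orders, estimate, and evaluate.
    for p, q in candidate_orders:
        model = fit(work, p, q)
        if is_acceptable(model, work):
            return {"d": d, "p": p, "q": q, "model": model}
    return None


# Toy stand-ins (assumptions, not real statistical procedures).
result = build_arma(
    series=[1.0, 2.0, 3.0, 4.0, 5.0],
    is_stationary=lambda s: max(s) - min(s) < 1e-9,  # "constant" counts as stationary
    fit=lambda s, p, q: (p, q),                      # the "model" is just its orders
    is_acceptable=lambda model, s: model == (2, 1),  # pretend only (2, 1) fits well
    candidate_orders=[(1, 0), (1, 1), (2, 1)],
)
```

In practice the three callables would be replaced by a real stationarity test, an ARMA estimation routine, and a performance criterion such as the RMSE comparison used later in this article.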

FIGURE 7. - The algorithm of ARMA model building.

SECTION III.

Results and Discussion

A. Data Trend Overviewing

The temporal trends of the median $\sum $ PCBs, $\sum $ PESTs, and $\sum $ PAHs at the seven IADN sampling sites are shown in Figure 8. At the BR sampling site, the $\sum $ PAHs were higher than the $\sum $ PCBs. While the $\sum $ PCBs leveled off at relatively low levels, the $\sum $ PAHs tended to decrease. Similar trends were observed for CHIC, CLEV, EH, PP, SBD, and STP, where the $\sum $ PAHs were significantly greater than the $\sum $ PCBs, while the $\sum $ PESTs were the lowest. Additionally, although the median concentrations of these POPs fluctuated, the overall trends were all downward.

FIGURE 8. - Broken line diagrams for median $\sum $ PCBs, $\sum $ PESTs, and $\sum $ PAHs in the Great Lakes atmosphere.

B. Analysis of Visualized Result

In this study, a data-driven intelligent environmental model based on the ARMA algorithm was constructed to predict POPs concentrations in the next few years, with its parameters optimized repeatedly to better fit the data. EH and STP have the largest sample sizes; in addition, STP represents a rural site and EH represents a remote site, which makes the sampled data rich and representative. We therefore used their data to predict the $\sum $ PCBs, $\sum $ PESTs, and $\sum $ PAHs in the following 4–5 years (see Figure 9).

FIGURE 9. - ARMA model prediction of median $\sum $ PCBs, $\sum $ PESTs, and $\sum $ PAHs in the EH (a) and STP (b) atmosphere.

According to Figure 9 (a), in the EH atmosphere, there will be a slight increase in $\sum $ PCBs, but such fluctuations will not be considerable in general, and the overall temporal trend will remain declining. The $\sum $ PESTs will continue dropping for a few years after 2016. The $\sum $ PAHs will fluctuate to some extent, showing a zigzag pattern. Regarding the STP atmosphere (see Figure 9 (b)), its $\sum $ PCBs will decline with fluctuations. The $\sum $ PESTs will rise right after 2016 and then level off, and the $\sum $ PAHs will fluctuate slightly. The concentrations of POPs at the EH and STP sampling sites fluctuate slightly, but on the whole they show a downward trend.

C. Model Evaluation

The constructed ARMA model is evaluated from the following two aspects: feasibility analysis and sensitivity analysis.

1) Feasibility Analysis

The feasibility analysis of the ARMA model was conducted from two aspects: 1) testing whether the residuals were normally distributed using the QQ (Quantile-Quantile) plot; and 2) assessing the autocorrelation of the residuals in terms of the D-W (Durbin-Watson) statistic (when the value of D-W is close to 0 or 4, there is autocorrelation; when it is close to 2, there is no autocorrelation). Figure 10 shows the QQ plots for the ARMA modeling of the median $\sum $ PCBs, $\sum $ PESTs, and $\sum $ PAHs in the EH and STP samples.

FIGURE 10. - QQ plots of ARMA prediction model for median $\sum $ PCBs, $\sum $ PESTs, and $\sum $ PAHs in the EH (a) and STP (b) atmosphere.

Figure 10 (a) is the QQ plot of the ARMA model for the median $\sum $ PCBs, $\sum $ PESTs, and $\sum $ PAHs in the EH atmosphere. Virtually all the data points lay along a straight line and were evenly distributed on both sides of it, indicating that the residuals satisfactorily followed the normal distribution. The corresponding D-W values for $\sum $ PCBs, $\sum $ PESTs, and $\sum $ PAHs were 1.85, 1.47, and 1.97, respectively, all of which were close to 2, suggesting insignificant autocorrelation. Similar residual distribution patterns and D-W values were also found for the STP samples (see Figure 10 (b)). Therefore, our model prediction should be robust and reliable.
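
For reference, the Durbin-Watson statistic is the sum of squared successive differences of the residuals divided by the sum of squared residuals. The sketch below computes it for three made-up residual sequences; it is not the authors' code, and the residual values are purely illustrative.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum((e[t] - e[t-1])^2) / sum(e[t]^2)."""
    if len(residuals) < 2:
        raise ValueError("at least two residuals are required")
    numerator = sum((later - earlier) ** 2
                    for earlier, later in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residual sequences.
dw_positive = durbin_watson([1.0, 1.0, 1.0, 1.0])    # identical residuals
dw_negative = durbin_watson([1.0, -1.0, 1.0, -1.0])  # alternating signs
dw_neutral = durbin_watson([1.0, -1.0, -1.0, 1.0])   # no clear pattern
```

The three sequences give 0.0, 3.0, and 2.0, respectively, matching the reading used above: values near 0 or 4 signal autocorrelation, whereas values near 2 signal none.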

2) Sensitivity Analysis

Sensitivity Analysis (SA) investigates how the variation in the output of a numerical model can be attributed to variations of its input factors [34]. Figure 11 shows the input factors and output definition for the SA of the model. As can be seen from the figure, the main input factors include the processed time series, the model parameters, the differencing orders, and the model orders. All of these input factors influence the predicted value of the model.

FIGURE 11. - Input factors and output definition for the SA of ARMA model.

Based on the model type, we used correlation methods to perform the sensitivity analysis of this model. We define the sensitivity of the model as follows:\begin{equation*} S_{A}=\mathrm{correlation}\left(X_{i},Y\right),\tag{5}\end{equation*}where $X_{i}$ represents the input factors and $Y$ represents the output. The sensitivity of the model can be analyzed by comparing the RMSE (root mean square error), the error between the predicted data and the real data. Its mathematical expression is shown in (6):\begin{equation*} RMSE=\sqrt{\frac{\sum\nolimits_{i=1}^{n}\left(Y_{predict,i}-Y_{real,i}\right)^{2}}{n}},\tag{6}\end{equation*}where $Y_{predict,i}$ is the predicted value of the model, $Y_{real,i}$ is the true value, and $n$ is the number of data points. RMSE reflects well how closely the model fits the data: within a certain range, the smaller the RMSE, the better the fit [35], [36]. The two most important parameters of this model are $p$ and $q$, where $p$ refers to the autoregression order and $q$ refers to the moving average order. Together they constitute the order of the model (the main input factors) and can be adjusted to better fit the data. By changing the values of $p$ and $q$, calculating the corresponding RMSE values, and comparing them, the sensitivity of the model can be analyzed. We changed the values of $p$ and $q$ many times and found that the RMSE value changed considerably (see Table 2). In Table 2, the RMSE in the best-fitting case is highlighted, and the corresponding $p$ and $q$ values were selected to build the model. Further analysis of the data in Table 2 showed that the RMSE values changed to a large extent with the input factors $p$ and $q$, which means that changes in the input factors have a significant impact on the output of this model; that is, the model is highly sensitive.
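
The RMSE comparison across candidate orders can be sketched as follows. The RMSE values in the dictionary are hypothetical placeholders, not the values reported in Table 2, and the helper names `rmse` and `best_order` are introduced here only for illustration.

```python
from math import sqrt


def rmse(predicted, actual):
    """Root mean square error between predicted and real values, as in (6)."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = ((p - a) ** 2 for p, a in zip(predicted, actual))
    return sqrt(sum(squared_errors) / len(actual))


def best_order(rmse_by_order):
    """Return the (p, q) pair with the smallest RMSE."""
    return min(rmse_by_order, key=rmse_by_order.get)


# Two predictions that are each off by 2.0 give an RMSE of 2.0.
example_rmse = rmse([2.0, 2.0], [0.0, 0.0])

# Hypothetical RMSE values for a few (p, q) combinations.
rmse_by_order = {(1, 0): 5.2, (1, 1): 4.4, (2, 1): 3.1, (2, 2): 4.0}
chosen = best_order(rmse_by_order)
spread = max(rmse_by_order.values()) - min(rmse_by_order.values())
```

A large spread between the best and worst RMSE across orders is what the sensitivity argument relies on: the more the error changes with $p$ and $q$, the more sensitive the model is to these input factors.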

TABLE 2 Comparison of RMSE Values Corresponding to Different $p$ and $q$ Values

SECTION IV.

Environmental Significance

PCBs have been widely used as insulation oil, heat carriers, and lubricating oil. They can also be used as additives in many industrial products (such as various resins, rubber, binders, coatings, carbon paper, ceramic glaze, fire retardants, pesticide extenders, and dye dispersants) [37]. PCBs are carcinogens that tend to accumulate in fatty tissue; they can cause diseases of the brain, skin, and internal organs, and they also affect the nervous, reproductive, and immune systems [38]. Thus, they were added to the Stockholm Convention and have been banned in most countries (including the United States and Canada) for decades, which is consistent with the downward trend we found and predicted.

PESTs can also accumulate in tissues such as the heart, liver, and kidney, and enter human and animal bodies through the food chain [39]. The pesticides that accumulate in the body can also be excreted through the mother’s milk or passed into eggs, ultimately affecting the offspring [40]. Therefore, countries strictly control the residues of PESTs in food. For example, Germany, the United States, Japan, and many other countries do not allow cyclopentadienyl PESTs to be detected in food. In the 1960s, China began to ban the use of DDT and 666 (hexachlorocyclohexane) on tobacco, vegetables, tea, and other crops [41]. Although the $\sum $ PESTs decreased in the Great Lakes atmosphere, the IADN team measured legacy PESTs only. The environmental concentrations of emerging pesticides are expected to increase because traditional PESTs are being phased out and many alternatives are being introduced to meet market demand.

PAHs are ubiquitous in the environment and mainly come from the burning of coal and oil, but also from garbage incineration and forest fires. Their production volumes are closely related to the combustion equipment and the combustion temperature [42]. PAHs are found in the exhaust of diesel and gasoline engines, as well as in the waste gas and water from refineries, coal tar processing plants, and asphalt processing plants [43]. The decreasing trend of $\sum $ PAHs in the Great Lakes atmosphere indicates that improved energy efficiency and vehicle emission control in North America have reduced PAH formation in recent years [44], [45].

These three classes of POPs not only have posed risks to the environment but also have adversely affected human health. Therefore, they should be used with caution. Our results illustrate that the $\sum $ PCBs, $\sum $ PESTs, and $\sum $ PAHs have been decreasing on the whole, consistent with the fact that many countries have released policies to limit their use and emission [46] due to their recognized toxic potentials.

By using local networks, the Internet, and other communication technologies, the IoT connects sensors, controllers, machines, people, and objects in new ways to form people-to-object and object-to-object links, so as to realize information-based remote management, control, and intelligent networking [47]. The IoT is widely used in environmental protection and plays a very important role. The research on this topic was carried out in an Internet of Things environment. First, environment-related data are collected based on Internet of Things technology. Second, the required data are obtained from the Internet. Third, machine learning and data analysis technologies are used for the research. Finally, the research results can be uploaded and shared. The IoT has thus played a great role in intelligent environmental protection.

SECTION V.

Conclusion and Future Work

We analyzed $\sum $ PCBs, $\sum $ PAHs, and $\sum $ PESTs in the Great Lakes atmospheric samples (vapor phase) collected from seven sampling sites by the IADN team at Indiana University. By constructing a data-driven intelligent environmental model, $\sum $ PCBs, $\sum $ PESTs, and $\sum $ PAHs in the EH and STP samples were predicted for the following 4–5 years. We also presented the modeling workflow in detail and used the Python language and tools to visualize our results. The results showed that the concentrations would continue declining with slight fluctuations and that the model is feasible and highly sensitive. In addition, we pointed out the important role of the IoT in smart environmental protection.

Future research in this field can be carried out from the following three aspects. First, the relationship between the particle-phase concentration percentage of each persistent organic pollutant and temperature can be studied by constructing an appropriate regression model. The IADN collected concentrations of persistent organic pollutants in both the vapor and particle phases. Physically, matter tends to change from the solid to the gas phase as the temperature rises. The particle-phase percentage of a persistent organic pollutant, that is, P/($\text{P}+\text{V}$) (where P and V are the particle- and vapor-phase concentrations), can be calculated and paired with the corresponding temperature. Taking temperature as the independent variable and the particle-phase percentage as the dependent variable, an appropriate regression model can then be established, so that the effect of temperature on the phase change of POPs can be clearly understood. Second, the composition of POPs in the Great Lakes can be studied by constructing an appropriate classification model. POPs such as Polychlorinated Biphenyls (PCBs), Pesticides (PESTs), and Polycyclic Aromatic Hydrocarbons (PAHs) are composed of a variety of substances with different molecular weights. The IADN website also records the concentrations of these substances. The proportion of each substance's concentration can be calculated, and the main and secondary components of POPs can then be determined by constructing a classification model. Finally, an IoT system for environmental protection can be built to realize real-time data collection and the uploading and sharing of research results, so as to digitize environmental protection and make environmental management more scientific and efficient.

ACKNOWLEDGMENT

The data for this work were downloaded from Indiana University’s IADN Data Viz website. The authors thank Dr. Yan Wu at Indiana University for useful discussions.