An Effective Machine Learning Scheme to Analyze and Predict the Concentration of Persistent Pollutants in the Great Lakes

Persistent organic pollutants (POPs) are highly toxic and difficult to degrade in natural ecosystems, which has a severe negative impact on the environment. Quantifying changes in the concentrations of persistent organic pollutants in the Great Lakes is challenging work. Machine learning (ML) methods are potent predictors that have recently achieved impressive performance on time series tasks. The ARIMA model, linear regression methods, the XGBoost algorithm, and Long Short-Term Memory (LSTM) are commonly used for estimating time-series changes. Traditionally, Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) have been the standard criteria to measure the error between actual and predicted values; however, Euclidean distance (ED) cannot effectively measure the similarity between two time series. We propose an alternative criterion called Penalty Dynamic Time Warping (Penalty-DTW), based on Dynamic Time Warping (DTW), which can accurately measure the difference between actual and predicted values. We study the benefits of Penalty-DTW vs. ED under the above ML algorithms. Further, considering the uncertainty of machine learning algorithms, we propose combining LSTM and deep ensemble methods to quantify algorithm uncertainty and make confident predictions. By comparing pollutant concentration predictions, we find that the improved LSTM model outperformed the other models in predictive power. The prediction results show that the concentration of pollutants has followed a stable downward trend in recent years. We also found that pollutant concentration correlates with the seasons, which provides positive guidance for environmental pollution control in the Great Lakes.


I. INTRODUCTION
The Great Lakes of North America are a series of large interconnected freshwater lakes, generally on or near the Canada-United States border: lakes Superior, Michigan, Huron, Erie, and Ontario. Over recent decades, with economic development around the Great Lakes, the increase in human activities and other factors has directly led to environmental pollution of the Great Lakes [1]. Eriksen et al. [2] and Mason et al. [3] confirmed the existence of microplastics in the Great Lakes' surface water. Baldwin et al. [4] found the presence of plastics in the Great Lakes' water. These microplastics contain many chemicals, among them persistent organic pollutants (POPs).
POPs are difficult to degrade in the environment. They have the characteristics of bio-accumulation, persistence, long-distance transport, and high toxicity. POPs are persistent in the environment, having long half-lives in soil, sediment, air, or biota [5].
A variety of methods have been developed for predicting pollutant concentrations. Time series models have been used to predict future changes in pollutants whose concentrations occur in cyclic or repeating patterns. A statistical and mathematical analysis method such as the Auto-Regressive Integrated Moving Average (ARIMA) model [6] is fitted to time series data either to understand the data better or to predict future points in the series. Shamshad Ahmad et al. analyzed water quality based on the multiplicative ARIMA model, which has both nonseasonal and seasonal components [7]. L. Y. Siew et al. presented the ARIMA model and an improved ARIMA model called ARFIMA for forecasting pollution [8]. Taking advantage of its strictly statistical approach, the ARIMA method requires only a time series's preliminary data to generate the forecast. However, the ARIMA model has apparent disadvantages: first, it requires the time series data to be stationary, or stationary after differencing; second, the ARIMA model always assumes linear relationships between independent and dependent variables, so it can capture only linear relations, not nonlinear ones.
Machine learning (ML) has progressed dramatically over the past two decades, from laboratory curiosity to a practical technology in widespread commercial use [9]. ML algorithms also have a wide range of time series analysis applications, especially in pollutant concentration prediction. Marta Venier and Ronald A. Hites presented multiple linear regression (LR) methods to address the rate at which atmospheric concentrations of a specific chemical are decreasing around the Great Lakes region [10]. Marla C. Maniquiz et al. used the same multiple LR methods to develop estimation models of pollutant loads [11]. Compared with the ARIMA model, the LR model is simpler and faster, which has led to widespread deployment in practice. The eXtreme Gradient Boosting (XGBoost) algorithm, proposed by Chen and Guestrin [12], has achieved state-of-the-art performance on a wide variety of machine learning tasks, including regression and time series. To control air pollution effectively, Jun Ma et al. propose a methodology framework combining XGBoost and Grid Importance Rank (GIR) [13]. Comparing with SVR, RF, and GBDT, the authors find that XGBoost has the best performance. In [14], Random Forest (RF), XGBoost, and Deep Neural Network (DNN) machine learning methods are used to investigate PM2.5 concentration prediction. A grid search on hyperparameters with 10-fold cross-validation was carried out to find the best model; the authors find that the XGBoost technique demonstrated the highest performance with an acceptable training time. In addition to the above conventional ML algorithms, W. Qiao et al. designed a hybrid forecast model for hourly gas consumption based on an improved whale optimization algorithm and a relevance vector machine (an improved support vector machine). The results show that the convergence accuracy and convergence speed of the new algorithm are higher than those of other algorithms [15].
Due to the improvement of computing power, deep learning algorithms have achieved success in many fields, including speech [16], natural language [17], and vision [18]. Deep learning algorithms achieve accuracy far beyond that of classical ML methods and have high adaptability. These positive aspects make AI methods an applicable technique in different engineering issues [19]. Deep learning methods can identify structure and patterns in data, such as non-linearity and complexity, in time series forecasting [20]. In particular, Long Short-Term Memory (LSTM) has achieved great success in time series forecasting. Yuanyuan Wang et al. proposed a new water quality prediction method based on LSTM for water quality in Taihu Lake [21]. The authors trained the best model through a series of simulations and parameter selections; by comparing with a BP neural network, they concluded that LSTM has the best performance in predicting water quality. In [22], the authors chose mean squared error (MSE) as the LSTM loss function, adopted adaptive moment estimation (Adam) as the LSTM optimization algorithm, and set the number of iterations to 100. The results indicate that the predicted values of the LSTM model and the actual values were in good agreement. Weibiao Qiao and Zhe Yang proposed a novel hybrid model called WT-SAE-LSTM, a combination of wavelet transform (WT), stacked autoencoder (SAE), and LSTM, to forecast electricity price time series [24]. The results showed that the designed model outperforms other AI methods. Although the LSTM method has recently achieved impressive performance in time series forecasting, LSTM is poor at quantifying predictive uncertainty and tends to produce overconfident predictions. Overconfident incorrect predictions can be harmful or offensive [25]. Hence, uncertainty quantification is crucial in time series forecasting.
The above research has been widely applied in time series forecasting and has achieved good performance. However, a key problem remains in comparing the predicted and actual values of a time series. When comparing the accuracy of algorithms applied to time series data, RMSE has been used as a standard statistical metric to measure model performance. However, there is growing evidence that RMSE is a poor distance measure in the time series domain [30], [31]. What we need is a dynamic and elastic method to evaluate time series similarity. The dynamic time warping (DTW) algorithm can make up for the shortcomings of Euclidean distance and calculate time series similarity dynamically. This method allows non-linear alignments between two time series to accommodate sequences that are similar but locally out of phase [32]. DTW is widely used in various domains, including word image matching [33], voice recognition [34], and human action recognition [35].
This study aimed to find and develop a model to analyze and predict the concentration of POPs in the Great Lakes. The main challenges are twofold: first, the model should be flexible and robust enough to handle the time series; second, we must develop better criteria to choose the best model. To address these challenges, we used machine learning algorithms, mainly including Linear Regression (LR) models, the XGBoost model, and an ensemble LSTM model, to analyze the series of pollutant concentrations and make predictions. We then proposed a new similarity comparison measure called Penalty-DTW, combined with Root Mean Square Error (RMSE), to ensure the validity of the models.
The main contributions of this work are as follows: 1) We combine the LSTM algorithm and deep ensemble methods to quantify the algorithm's uncertainty and make a confident prediction. The difference between the proposed model and the existing LSTM model is that it improves prediction accuracy and measures prediction uncertainty. 2) As a supplement to RMSE, we propose a new similarity comparison measure, Penalty-DTW, to compare the performance of the LR, ARIMA, and LSTM algorithms. Compared with the RMSE metric, Penalty-DTW is more suitable as a metric for time series data. Fig.1 shows the complete workflow of this work.

FIGURE 1. Design scheme for the concentration prediction of persistent organic pollutants in the Great Lakes: a complete workflow using machine learning algorithms and the proposed ensemble LSTM model combined with the Penalty-DTW metric.
The rest of this paper is organized as follows: the second section introduces previous research efforts and shortcomings on model uncertainty, and the criteria and problems this article tries to solve. In the third section, we elaborate on the model methods and definitions in detail, including the study area, data preparation, and methods. The fourth section describes the experimental process of this study in detail. The fifth section compares the methods used in this study under different criteria. The sixth section summarizes this study's characteristics and the work to be carried out in the future.

A. UNCERTAINTY OF DEEP LEARNING
The uncertainty measurement of deep learning algorithms was ignored for a long time; people cared only about the deviation between a deep learning algorithm's predicted value and the actual value. Over the past 20 years, with the rapid development of deep learning, more and more deep learning algorithms have been deployed in real application scenarios, so research on model uncertainty is necessary to make deep learning algorithms perform well both on data within the training distribution and on data outside it. In the initial classification task, a softmax function is added to the last layer of a deep network to obtain probabilities, but these probabilities cannot represent the model's uncertainty. The Bayesian neural network, first proposed by MacKay et al. [26], can produce a confidence-aware output: it places a prior distribution on the weights of the neural network and then derives the distribution of a parametric function family. Although the Bayesian neural network is easy to understand, it is difficult to perform inference with in practical applications, so many approximate inference methods have been proposed. In early work on Bayesian neural networks, variational inference was used as an approximate inference method, casting the marginalization needed in Bayesian inference as an optimization problem [27]. Variational inference simplifies the Bayesian neural network computation and improves its reliability. The focus of the Bayesian neural network is to find a good approximation of the posterior distribution: the predicted value and interval are calculated as expectations over the posterior, so accurate prediction depends on an accurate approximation of the intractable posterior. Markov chain Monte Carlo (MCMC) methods exhibit random-walk behavior: the basic idea is that a Markov chain generates samples from the posterior distribution, and Monte Carlo integration is carried out on the (effective) samples once the Markov chain reaches its stationary distribution. Due to the complexity and high dimensionality of the BNN posterior, this random-walk behavior makes these methods unsuitable for inference in any reasonable time. To solve the problems of MCMC in BNNs, Yarin Gal et al. proposed a new Monte Carlo method, MC dropout [28]. MC dropout is interpreted as a Bayesian approximation of a Gaussian process, which avoids the high computational cost of exact Bayesian inference. The MC dropout method does not need to modify the existing network model; it only requires dropout layers in the neural network. During training, MC dropout behaves like ordinary dropout, but at test time, the dropout layers must remain active during the forward pass.
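To illustrate this last point, the following is a minimal MC dropout sketch in Keras; it is our sketch, not the implementation of [28], and the layer sizes are placeholders. Calling the model with training=True keeps dropout active at inference, and the spread of repeated stochastic forward passes approximates the predictive uncertainty.

```python
import numpy as np
import tensorflow as tf

# A small regression network with dropout layers (placeholder sizes).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1),
])

def mc_dropout_predict(model, x, n_samples=100):
    """Run n_samples stochastic forward passes with dropout enabled
    (training=True) and return the predictive mean and std."""
    preds = np.stack([model(x, training=True).numpy() for _ in range(n_samples)])
    return preds.mean(axis=0), preds.std(axis=0)

x = np.random.rand(5, 10).astype("float32")   # dummy inputs
mean, std = mc_dropout_predict(model, x)      # std approximates the uncertainty
```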
In recent years, non-Bayesian methods have also gradually developed. The core idea of the non-Bayesian approach is to train multiple probabilistic neural networks and construct the best network model through resampling and ensemble learning. Lakshminarayanan et al. evaluated the uncertainty of deep learning on the ImageNet dataset using an ensemble learning method [29]. Compared with standard Bayesian methods, their implementation is simple and suitable for distributed computing and large-scale deep networks.
To predict the concentration of persistent organic pollutants in the Great Lakes more accurately, this paper combines the LSTM algorithm with the non-Bayesian ensemble learning method to estimate model uncertainty in the deep network. The improved LSTM can estimate the uncertainty of the predicted value and produce more accurate predictions.

III. SYSTEM MODELS AND DEFINITIONS

A. STUDY AREA AND DATA SETS
The study area is Eagle Harbor (47.46306° N, 88.14972° W), located on the north shore of the Keweenaw Peninsula. The harbor is irregularly shaped, about 4,900 feet long and 1,100 feet wide. The sampling site is located 1 km east of Eagle Harbor, about 100 m from the lake.

B. DATA PREPARATION AND SPLIT
The Integrated Atmospheric Deposition Network (IADN) is a long-term monitoring program run by the U.S. Environmental Protection Agency's Great Lakes National Program Office. It has measured the concentrations of persistent toxic chemicals in Great Lakes air and precipitation since the 1990s [39]. The sampling frequency for pollutant concentration is once every two weeks. Since that time, over a million measurements of the concentrations of PCBs, pesticides, PAHs, flame retardants, and trace metals have been made. This study mainly analyzes the total concentration of polychlorinated biphenyls (PCBs), a persistent organic pollutant, in Eagle Harbor. The features we analyze mainly include Phase, Site, Date, Unit, and SuitePCBs.
The standard k-fold cross-validation method is widely used for ML algorithms. It can evaluate the stability of ML models and optimize model parameters to make more accurate predictions. However, it does not work well on time series data because it ignores the inherent temporal component. In this study, TimeSeriesSplit from the scikit-learn package is used. It provides train/test indices to split time series data samples observed at fixed time intervals into train/test sets. In each split, the test indices must be higher than before, so shuffling in the cross-validator is inappropriate. This cross-validation object is a variation of KFold: in the kth split, it returns the first k folds as the train set and the (k+1)th fold as the test set [40]. In this experiment, we set the number of time series splits to 5.
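A minimal sketch of how TimeSeriesSplit can be used with 5 splits (the array here is a placeholder for the PCB concentration samples):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(30).reshape(-1, 1)  # placeholder time-ordered samples
tscv = TimeSeriesSplit(n_splits=5)

# Each split trains on all earlier folds and tests on the next fold,
# so the test indices always come after the train indices.
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"fold {fold}: train {train_idx[0]}..{train_idx[-1]}, "
          f"test {test_idx[0]}..{test_idx[-1]}")
```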
DTW is a method for measuring the similarity of time series. Unlike traditional Euclidean distance measurement, the algorithm solves the template-matching problem for time series of different lengths. It is based on dynamic programming; to some extent, this is similar to solving the longest common substring (LCS) problem.
We now assume that there are two time series P(0 . . . M) and Q(0 . . . N). We construct a two-dimensional array dp[i][j], which represents the squared similarity distance between the series up to points i and j:

dp[i][j] = (P_i − Q_j)^2 + min(dp[i−1][j], dp[i][j−1], dp[i−1][j−1])    (1)

The array dp solves the problem that the two sequences' lengths must be equal. Referring to (1), once we have computed the final cell dp[M−1][N−1] (with M and N the lengths of P and Q), we obtain the similarity of the time series P(0 . . . M) and Q(0 . . . N).
Although DTW solves the problem of unequal sequence lengths, another issue must be considered: when one time series is a translated copy of another, the two series are, in practice, equal; yet when we calculate their similarity with DTW, we find that they are not judged equal, and moreover the computed similarity can vary widely.
To solve this problem, we propose some improved strategies for the DTW algorithm. We first calculate a penalty coefficient α; the product of the penalty coefficient and the DTW result is the final similarity we need.
Based on dynamic programming, we first work out the longest common substring of the two time series. Since the time series P and Q are both numerical, we set a tolerance of the maximum standard deviation: within this tolerance, two values are regarded as equal and are added to the common substring. When we obtain the common substring lengths com_len_i, the penalty coefficient α is calculated as follows:

α = 1 − (Σ_i com_len_i × com_len_i) / (lenP × lenQ)    (2)

where lenP is the length of time series P and lenQ is the length of Q. It can be seen from (2) that the longer the common substrings of the two time series are, the smaller the penalty coefficient α is, and the smaller the final similarity distance will be. The algorithm logic is shown in Algorithm 2; a minimal Python sketch is also given at the end of this subsection.

2) ARIMA
In statistics and econometrics, particularly in time series analysis, the ARIMA model generalizes the Autoregressive Moving Average (ARMA) model. Both models are fitted to time series data. ARIMA models are applied in cases where the data show evidence of nonstationarity, where an initial differencing step (corresponding to the ''integrated" part of the model) can be applied one or more times to eliminate the nonstationarity [6]. The core idea of the ARIMA model is to combine the AR model and the MA model. The ARIMA model is denoted ARIMA(p, d, q), where p, d, and q are nonnegative integers: p is the number of lags of the time series used in the prediction model, also known as the Auto-Regressive (AR) term; d is the degree of differencing (the number of times the data have had past values subtracted); and q is the number of lags of the prediction errors used in the prediction model, known as the Moving Average (MA) term.
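The following is the promised minimal Python sketch of the Penalty-DTW computation. It reflects our reading of (1) and (2), not the paper's Algorithm 2: the default tolerance, the use of the single longest common substring, and the exact "1 −" form of the penalty coefficient are assumptions.

```python
import numpy as np

def dtw_distance(p, q):
    """Standard DTW per (1): dp[i][j] holds the squared alignment cost."""
    m, n = len(p), len(q)
    dp = np.full((m, n), np.inf)
    dp[0, 0] = (p[0] - q[0]) ** 2
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            prev = min(dp[i - 1, j] if i > 0 else np.inf,
                       dp[i, j - 1] if j > 0 else np.inf,
                       dp[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            dp[i, j] = (p[i] - q[j]) ** 2 + prev
    return np.sqrt(dp[m - 1, n - 1])

def longest_common_run(p, q, tol):
    """Longest common substring of two numeric series; values within
    tol of each other count as equal (classic substring DP)."""
    m, n = len(p), len(q)
    dp = np.zeros((m + 1, n + 1), dtype=int)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if abs(p[i - 1] - q[j - 1]) <= tol:
                dp[i, j] = dp[i - 1, j - 1] + 1
    return int(dp.max())

def penalty_dtw(p, q, tol=0.1):
    """Penalty-DTW: DTW distance scaled by the penalty coefficient of (2)."""
    com_len = longest_common_run(p, q, tol)
    alpha = 1.0 - (com_len * com_len) / (len(p) * len(q))  # assumed form of (2)
    return alpha * dtw_distance(p, q)

# Toy usage: a translated copy yields a long common run, hence a small alpha
# and a strongly reduced distance (exact values depend on tol).
s1 = np.array([0, 2, 1, 1, 1, 2, 0, 1, 1, 2, 0, 1, 1, 2, 0, 1, 1, 2, 0, 1], float)
s2 = np.array([0, 1, 0, 2, 1, 1, 2, 0, 1, 1, 2, 0, 1, 1, 2, 0, 1, 1, 2, 0], float)
print(dtw_distance(s1, s2), penalty_dtw(s1, s2))
```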

3) LINEAR REGRESSION
LR is a linear approach to modeling the relationship between a scalar response and one or more explanatory variables. It was the first type of regression analysis to be studied rigorously and used extensively in practical applications [41]. In LR, the relationships are modeled using linear predictor functions whose unknown model parameters are estimated from the data. Such models are called linear models [42]. Linear regression models are often fitted using the least-squares approach. However, they may also be fitted in other ways, such as by minimizing the ''lack of fit" in some other norm (as with least absolute deviations regression), or by minimizing a penalized version of the least-squares cost function, as in Ridge Regression (ℓ2-norm penalty) and Lasso Regression (ℓ1-norm penalty).
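For reference, a standard way to write the three objectives mentioned here (ordinary least squares, then the ℓ2- and ℓ1-penalized versions; λ ≥ 0 is the regularization strength and is not a value from this paper):

```latex
% Ordinary least squares, Ridge (L2 penalty), and Lasso (L1 penalty)
\min_{\beta} \; \lVert y - X\beta \rVert_2^2
\qquad
\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2
\qquad
\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1
```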

4) XGBoost
The eXtreme Gradient Boosting (XGBoost) algorithm, proposed by Chen and Guestrin [12], is a scalable tree boosting system widely used by data scientists that provides state-of-the-art results on many problems. The core idea of XGBoost is to add trees iteratively, growing each tree by repeatedly splitting on features; each newly added tree learns a new function that fits the residual of the current ensemble. XGBoost can accurately handle classification and regression problems, and it also performs well on time series prediction.

5) LONG SHORT-TERM MEMORY
The techniques of ANNs are well established as one of the most robust mathematically based solutions for providing reliable connections between inputs and output(s) [43]. Long Short-Term Memory (LSTM) was proposed by Hochreiter and Schmidhuber in 1997 and more recently improved and promoted by Alex Graves [44]. It is well suited to classifying, processing, and making predictions based on time series data. LSTM is a variant of the recurrent neural network: it adds an input gate, an output gate, and a forget gate to the recurrent neural network [45], which helps deal with vanishing or exploding gradients. Vanilla LSTM, a variation of the original LSTM, has become the state-of-the-art model for ML problems in recent years [46]. By utilizing vanilla LSTM, a model can make full use of long short-term memory to obtain precise accuracy. Vanilla LSTM has achieved great predictive success in natural language processing and speech recognition domains.

IV. PROPOSED MODELS

A. ARIMA MODEL
When building an ARIMA model for time series analysis, we first need to detect the stationarity of the time series. If the series is not stationary, we carry out a differencing process. Generally speaking, the first difference of a time series can achieve stationarity, but sometimes a second difference is needed. We then identify the model, that is, determine which of the AR(p), MA(q), or ARIMA(p, d, q) models applies, and estimate the parameters p and q. When parameter estimation is complete, we must check whether the residual series is white noise; only when the residual series is white noise has the model passed the diagnostic check.
To test the stationarity of the time series, we use the Dickey-Fuller test [48] to detect whether the time series is stationary. In parameter estimation, we mainly use the BIC criterion to select among a finite set of models; the lower the BIC, the better. To test for white noise, we use the Lagrange Multiplier (LM) test for residual correlation. If the p-value of the F statistic is greater than the significance level α, the null hypothesis is accepted, the residual sequence is considered a white noise sequence, and the model passes the test.
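A minimal sketch of this pipeline with statsmodels; the file name and search ranges are placeholders, not the paper's exact settings:

```python
import itertools
import pandas as pd
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical input: a single-column CSV of the concentration series.
series = pd.read_csv("pcb_concentration.csv", index_col=0).squeeze()

# 1) Stationarity: Dickey-Fuller test; difference once if nonstationary.
stat, pvalue = adfuller(series)[:2]
d = 0 if pvalue < 0.05 else 1

# 2) Order selection: grid-search (p, q) and keep the lowest BIC.
best_order, best_bic = None, float("inf")
for p, q in itertools.product(range(7), range(7)):
    try:
        fit = ARIMA(series, order=(p, d, q)).fit()
        if fit.bic < best_bic:
            best_order, best_bic = (p, d, q), fit.bic
    except Exception:
        continue  # some orders fail to converge

model = ARIMA(series, order=best_order).fit()
print(best_order, model.summary())
```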

B. LINEAR REGRESSION AND XGBoost MODEL
The LR and XGBoost models need multi-dimensional features for training, but we have only a one-dimensional series, so we must extract features from it. We mainly use two methods. The first is to shift the series n steps back, producing a feature column in which the current value of the time series is aligned with its value at time t − n. The second is to derive time characteristics of the current timestamp, mainly including the month, week, and quarter, and whether the current time is summer (the concentration of the pollutant in summer may be higher than in other months).
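A minimal pandas sketch of both feature-extraction methods; the lag range and column names are illustrative:

```python
import pandas as pd

def make_features(series: pd.Series, lags=range(2, 27)) -> pd.DataFrame:
    """Build lag and calendar features from a datetime-indexed series."""
    df = pd.DataFrame({"y": series})
    for n in lags:                       # first method: shifted copies of the series
        df[f"lag_{n}"] = series.shift(n)
    idx = df.index
    df["month"] = idx.month              # second method: calendar features
    df["quarter"] = idx.quarter
    df["weekofyear"] = idx.isocalendar().week.astype(int)
    df["is_summer"] = idx.month.isin([6, 7, 8]).astype(int)
    return df.dropna()                   # drop rows made incomplete by shifting
```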
For LR, we used the Lasso Regression and Ridge Regression models. Both add constraints to the loss function to mitigate overfitting. Ridge regression addresses the problems of ordinary least squares by imposing a penalty on the size of the coefficients, while Lasso regression is a linear model that estimates sparse coefficients.
For XGBoost, the most notable task is parameter tuning. We mainly tune three types of parameters: general parameters, booster parameters, and learning task parameters. General parameters are usually set automatically by XGBoost. Booster parameters define the characteristics of the XGBoost trees, mainly including max_depth, min_child_weight, subsample, and others. Learning task parameters define the optimization objective and the metric to be calculated at each step, mainly objective and eval_metric; we set the objective to linear regression and eval_metric = rmse. The general approach is as follows, with a sketch of the tuning loop given after this paragraph. Step 1: choose a relatively high learning rate to determine the optimum number of trees for this learning rate; we initialize learning_rate = 1. Step 2: tune the tree-specific booster parameters (max_depth, min_child_weight, subsample, etc.). Step 3: tune the regularization parameters (lambda, alpha), which can help reduce model complexity and enhance performance. Step 4: lower the learning rate and decide the optimal parameters. In Step 2, we combine GridSearch with 5-fold cross-validation: the whole dataset is divided into five parts; each time GridSearch is used, one part is taken as the test set and the other four parts are used as the training set, and the performance of the model on the test set is calculated. Early stopping is also used so that training stops when the loss has not improved for several rounds; we set early_stopping_rounds = 10.
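A minimal sketch of this tuning loop; the parameter grids and data are placeholders rather than the paper's exact configuration, and note that the early_stopping_rounds constructor argument requires a recent xgboost (older versions pass it to fit() instead):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))          # placeholder feature matrix
y = rng.normal(size=200)                # placeholder target
X_train, X_val = X[:160], X[160:]
y_train, y_val = y[:160], y[160:]

# Step 2: grid-search the booster parameters with 5-fold cross-validation.
param_grid = {
    "max_depth": [3, 5, 7],
    "min_child_weight": [1, 3, 5],
    "subsample": [0.8, 1.0],
}
search = GridSearchCV(
    XGBRegressor(objective="reg:squarederror", eval_metric="rmse",
                 learning_rate=1.0, n_estimators=100),
    param_grid, cv=5, scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

# Steps 3-4: refit with regularization and a lower learning rate, stopping
# when the validation loss has not improved for 10 rounds.
final = XGBRegressor(**search.best_params_, objective="reg:squarederror",
                     learning_rate=0.1, n_estimators=1000,
                     reg_lambda=1.0, reg_alpha=0.0,
                     early_stopping_rounds=10)
final.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(search.best_params_, final.best_iteration)
```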

C. ENSEMBLE LSTM MODEL
Before training the ensemble LSTM model, we carried out a three-step data transformation. First, we converted our single column of data into a two-column dataset: the first column containing this day's (t) pollutant concentration and the second column containing the next day's (t+1) pollutant concentration, which is to be predicted. Because a stationary time series is easier to model and can lead to more accurate prediction, we then transformed the time series into a stationary one, in the same way as for the ARIMA model. Finally, we scaled the time series values: the activation function adopted by the LSTM model is the hyperbolic tangent, whose value range is −1 to 1, so we transform the dataset to the range −1 to 1, i.e., x' = 2(x − x_min)/(x_max − x_min) − 1. After completing the data transformation, we developed the ensemble LSTM model based on the Vanilla LSTM model. Different from the standard Vanilla LSTM model, our model has two output gates: one outputs the predictive value, and the other evaluates the predictive uncertainty. We use the variance of the ensemble LSTM predictions as an approximate measurement of uncertainty. The ensemble model algorithm is as follows, with a minimal sketch given after this paragraph.
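The sketch below shows only the ensemble wrapper as we read it: train several independently initialized LSTMs and use the mean of their predictions as the forecast and the variance as the uncertainty. The two-output-gate modification is not reproduced here, and the layer sizes, member count, and training settings are placeholders.

```python
import numpy as np
import tensorflow as tf

def build_lstm(n_units=32, n_lags=1):
    """One ensemble member: an LSTM layer followed by a Dense output."""
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(n_units, activation="tanh", input_shape=(n_lags, 1)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
    return model

def ensemble_predict(X_train, y_train, X_test, n_members=5, epochs=100):
    """Train n_members models from different random initializations and
    return the ensemble mean (prediction) and variance (uncertainty)."""
    preds = []
    for _ in range(n_members):
        m = build_lstm()
        m.fit(X_train, y_train, epochs=epochs, verbose=0)
        preds.append(m.predict(X_test, verbose=0))
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.var(axis=0)
```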

A. PENALTY-DTW METRIC

1) COMPARISON OF EUCLIDEAN DISTANCE AND DTW
Time series clustering is a typical way to verify a similarity measure between different time series. We performed a simple k-means clustering experiment on the cylinder-bell-funnel dataset. The dataset is a deceptively simple-looking three-class problem, and all series are of length 128.
Three variants of the algorithm are compared: standard Euclidean k-means, k-means based on DTW, and k-means based on Penalty-DTW. In Fig.2, each row corresponds to the result of a different clustering, and within a row, each subfigure corresponds to one cluster, showing the set of training time series assigned to that cluster (in black) and the barycenter of the cluster (in red). The first row shows k-means clustering based on standard Euclidean distance; one problem is that Euclidean distance assumes the ith point in one sequence is aligned with the ith point in the other. The second row shows a k-means clustering that uses DTW as its metric, which allows a more intuitive distance measure to be calculated. In this example, we preprocessed the input time series so that each series has zero mean and unit variance.
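A minimal sketch of the Euclidean and DTW variants using tslearn; randomly generated data stands in for the cylinder-bell-funnel set, and the Penalty-DTW variant would substitute the custom metric sketched earlier:

```python
import numpy as np
from tslearn.clustering import TimeSeriesKMeans
from tslearn.preprocessing import TimeSeriesScalerMeanVariance

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 128, 1))                    # placeholder for CBF series
X = TimeSeriesScalerMeanVariance().fit_transform(X)  # zero mean, unit variance

km_euc = TimeSeriesKMeans(n_clusters=3, metric="euclidean").fit(X)
km_dtw = TimeSeriesKMeans(n_clusters=3, metric="dtw").fit(X)
print(km_euc.labels_, km_dtw.labels_)
```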

2) COMPARISON OF DTW AND PENALTY-DTW ON TOY DATA
To clearly demonstrate the effectiveness of the Penalty-DTW algorithm, we chose S1, S2, and S3 as time series of length 20, in which S2 is a translated copy of S1 and S3 is another randomly generated sequence: s1 = [0, 2, 1, 1, 1, 2, 0, 1, 1, 2, 0, 1, 1, 2, 0, 1, 1, 2, 0, 1], s2 = [0, 1, 0, 2, 1, 1, 2, 0, 1, 1, 2, 0, 1, 1, 2, 0, 1, 1, 2, 0], s3 = [0.1, 1.9, 0.9, 1.1, 1, 1.9, 0.1, 0.8, 1.8, 0, 0.8, 0.9, 0.2, 2.4, 0.5, 0.4, 0.9, 1.5, 0, 1]. Fig.3 illustrates the DTW and Penalty-DTW computations between the time series and plots the optimal alignment path. Each image shows the cost matrix, that is, the squared Euclidean distance for each pair of time points between the two series, which are drawn at the left and the top of the matrix. The optimal path, the one that minimizes the total cost from the first time point to the last, is drawn in red, and the upper-left corner of each image shows the resulting distance. Fig.3(a), (b) show the results of the unimproved DTW algorithm: the distance between S1 and S2 is 2.236, whereas between S1 and S3 it is 1.3077, so the distance between S1 and S2 is larger than that between S1 and S3, which is inconsistent with the actual situation. Fig.3(c), (d) show the results of the improved algorithm: the distance between S1 and S2 is now smaller than that between S1 and S3; that is to say, the improved Penalty-DTW algorithm works better.

Fig.4 shows the time series of PCB pollutant concentration since 2000. It can be seen from the figure that the overall PCB concentration is decreasing year by year, and there may be a seasonal trend: in some months of the year the concentration of PCBs is on the high side, whereas in others it is on the low side. To ensure the stationarity of the data, a first difference was applied to the nonstationary data; the results are shown in Fig.4. After the differencing operation, the series is significantly more stationary than the raw data.

B. ARIMA RESULTS
We used the Dickey-Fuller test to check whether the data are stationary. The results are shown in Table.1. From the results of the Dickey-Fuller test, the value of the test statistic is −1.144889e+01, far less than the 1% critical value of −3.439327e+00; that is to say, we can reject the null hypothesis and ensure, with 99% confidence, the stationarity of the time series.
The critical point of ARIMA model construction lies in selecting the p, q, and d parameters. To select these parameters more accurately, we adopted the BIC order criterion, taking a grid search over p and q, and observed that the BIC is lowest when p is 2 and q is 6. That is to say, the order of the autoregressive term is 2, the order of differencing is 1, and the order of the moving average term is 6. Before model fitting, we also need to test the model residuals. Fig.5 is the Q-Q plot used to check whether the residuals follow a normal distribution; we see that they are approximately normally distributed. The Durbin-Watson (D-W) method was then used to test the residuals' autocorrelation, and the D-W test result is 1.986. Since a D-W value close to 2 indicates no autocorrelation, the model works well. Fig.5 shows the ARIMA model fitting results: the RMSE value is 60.84, so the ARIMA model does not give good prediction results.

C. LINEAR REGRESSION RESULTS
We shifted the series n steps back to obtain feature columns in which the current value of the time series is aligned with its value at time t − n. In this experiment, we chose values of n in the range [2, 26]. We then added weekday, quarter, month, year, dayofyear, weekofyear, is_weekend, is_summer, is_winter, and other features to the dataset. The dataset is split into 5 train-test folds.
To give our model better generalization ability, we applied regularization. Fig.7(a) and Fig.7(b) respectively show the performance of Lasso Regression and Ridge Regression on the test set. The RMSE of Ridge Regression is 13.78, and the RMSE of Lasso Regression is 12.55; both models are better than the regression model without regularization. Fig.7(c) shows the performance of the XGBoost model on the test set. The effect is not apparent when predicting a sharp increase or decrease in PCB concentration. However, compared with the ARIMA model, XGBoost is outstanding at predicting the more stable data. The RMSE value is reduced from the ARIMA model's 60.84 to the XGBoost model's 16.49, which significantly improves the performance.

E. ENSEMBLE LSTM RESULTS
Our goal is to achieve state-of-the-art prediction performance with the LSTM model while also providing the predictive uncertainty.

FIGURE 9. Results on part of the test datasets using ensemble LSTM. Red dots represent prediction results; the blue shading represents the standard deviation of the prediction results, that is, the uncertainty.
Datasets are split into 5 train-test folds. For the LSTM, we used a vanilla LSTM with 2 hidden layers and tanh nonlinearities, with batch normalization: one layer is the LSTM layer with a varying number of units, and the other is a Dense layer. We treat the number of LSTM layer units as a variable in order to seek the optimal network architecture. We trained with the Adam optimizer with a constant learning rate of 0.001 and weight decay of 10^−5. For MC-dropout, we added dropout after each layer. We trained for 100 epochs. When measuring predictive uncertainty, the commonly used approach in this experiment is to train an ensemble model, obtain predictions, and then use the empirical variance of the predictions as the approximate uncertainty.
The results are shown in Fig.8. We observe that increasing the number of LSTM units significantly improves performance in terms of both the RMSE score and the Penalty-DTW score. Meanwhile, under both scores, ensemble LSTM performs better than LSTM, and LSTM in turn performs much better than MC-dropout on both metrics. Fig.7(d) shows the performance of the LSTM model on the test set. We observe that LSTM can predict the static data well and achieves better performance on some peak data. The RMSE of the LSTM model on the test set is 5.16, which dramatically improves on the performance of XGBoost.
To measure the uncertainty of the prediction results, we experimented with the ensemble LSTM model trained above. The performance of the ensemble LSTM model on the test sets is shown in Fig.9. The larger the area of the blue shading, the higher the uncertainty of the prediction. We observe that the shaded area of the prediction results is within an acceptable range, which indicates that the uncertainty of the prediction results is small and the reliability of the predictions is high.
To assess the model's robustness to known and unknown data, we used the trained model and selected two groups of input data to evaluate predictive uncertainty: one from our training data and the other generated randomly. From Fig.10, we observe that the standard deviation on the training data is much smaller than the standard deviation on the random data. This result shows that the ensemble LSTM model can express higher uncertainty for unknown input data. Fig.11 shows the similarity between predictions and actual values for the four models. In the training step, the goodness-of-fit illustrates how well the models fit the training dataset. The prediction and generalization abilities of a model cannot be evaluated using its goodness-of-fit, because goodness-of-fit is measured on the same data used to calibrate the model [47]. The test step properly accounts for the prediction ability of the model. Under the Root Mean Square Error criterion, we conclude that the LSTM model and the LR model perform better on this dataset than the other models. To verify the similarity between the predicted and actual time series more accurately, we adopt the improved DTW-based method. The results show that the ensemble LSTM model has the best prediction ability.

G. MODEL PREDICTION
The predicted results of the model are shown in Fig.12. It can be seen from the figure that the overall concentration of PCBs will gradually decrease over the next two years. Within a year, the concentration from July to October will be higher than in other months, and the concentration from December to March will be lower than in other months.

VI. CONCLUSION AND FUTURE WORK
This study proposed a verification method based on an improved dynamic time warping algorithm (Penalty-DTW). We showed that the algorithm is superior to the traditional Root Mean Square Error in the similarity comparison of time series. Meanwhile, we combined the LSTM model and ensemble methods to measure uncertainty and made accurate and confident predictions. Through the similarity analysis, we found that the ensemble LSTM model and the LR model perform well in predicting PCB concentration in Eagle Harbor. From the results, the concentration of PCBs has followed a stable trend in recent years and is decreasing year by year. However, there is a considerable correlation between PCB concentration and season: the concentration of PCBs increases from July to October, reaching its peak around July, and decreases from December to March, which provides an excellent reference for environmental pollution control. Further study is required to explore this time series model's application to other regions and pollutants.
Future research in this field can be carried out in two directions. First, we will continue to study the uncertainty in pollutant concentration prediction, using different evaluation criteria and proposing different methods to measure the uncertainty of prediction results; in particular, we can try to combine the MC-dropout method and the deep ensemble method to consider uncertainty comprehensively. Second, this study of pollutants in the Great Lakes is limited to certain pollutants in one region; in following studies, we can expand to other pollutants and other regions.