Modeling of Wastewater Treatment Processes Using Dynamic Bayesian Networks Based on Fuzzy PLS

The complicated characteristics of wastewater treatment plants (WWTPs) significantly hinder the monitoring of industrial processes, and thus much attention has been paid to process modeling and prediction. A fuzzy partial least squares-based dynamic Bayesian networks (FPLS-DBN) is proposed to improve the modeling ability in WWTPs. To adapt the nonlinear process data, fuzzy partial least squares (FPLS) is introduced by using a fuzzy system to extract nonlinear features from process data. In addition, a dynamic extension is included by embedding augmented matrices into Bayesian networks to fit the uncertainty and time-varying characteristics. Regarding the quality indices for effluent suspended solid in the WWTP, the root mean square error of the FPLS-DBN model is decreased by 28.63% and 69.47%, respectively, in comparison with that for partial least squares and Bayesian networks. The results demonstrate the superiority of FPLS-DBN in modeling performance for an actual industrial WWTP application.


I. INTRODUCTION
In recent years, more attention is focused on wastewater treatment processes (WWTPs) and stricter requirements for wastewater discharge standards have been prompted to ensure the accurate monitoring of key quality indices. The effluent quality indices are often measured by online instruments or off-line laboratory analysis. However, these methods are difficult to measure the quality indices for real-time monitoring owing to continuous maintenance, insufficient accuracy, measurement delay, and aged deterioration [1], [2]. To improve the monitoring in WWTPs, it is necessary to structure an accurate process model, especially the real-time monitoring for the key process effluent indices [2].
To quickly and accurately obtain key variables, data-driven models have been widely concerned in many engineering fields, which could extract valid information from easy-tomeasure auxiliary variables to predict difficult-to-measure The associate editor coordinating the review of this manuscript and approving it for publication was Xiao-Sheng Si . key variables online [3]- [5]. Therefore, auxiliary variables based on limited data are more critical for modeling. Too many inputs, however, may lead to overfitting and inefficiency, and thus it is necessary to provide auxiliary variables that have lower dimensions for modeling.
As an essential part of multivariate statistical methods, the latent variable methods (LVMs) are noticed in many fields in recent years [6]. Common LVMs include principal component analysis (PCA) and partial least squares (PLS). The PLS differs from PCA in that PLS is not only searching for the direction of the largest variance in X, but also finding the direction of the largest covariance in X and Y space. Therefore, PLS can better describe the relationship between input and output variables.
Although conventional PLS latent variable methods have been widely used in modeling, some of the industrial processes often contain the variable with nonlinear features which could not be extracted easily, leading to the linear model may not perform well when facing complex industrial data. To solve the aforementioned problems, it is therefore VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ necessary to try to improve the model performance by nonlinear extensions of PLS. The quadratic PLS (QPLS) was developed by Wold et al. [7] who applied quadratic polynomial to PLS. In addition, the fuzzy PLS (FPLS) model, replacing the linear relation between input and output with a fuzzy system, was proposed and its regression performance was better than the PLS and QPLS [8]. Another method of robust adaptive partial least squares was applied to process monitoring of a complex industrial wastewater treatment process [9]. Also, the kernel method was used to structure the kernel-based PLS model to effectively capture the nonlinear relationship of process variables [10]. And the least squares vector machine was also applied to build least squares vector machine-PLS (LSSVM-PLS) [11]. Recently, the dynamic nonlinear PLS was developed using Gaussian process regression and achieved better performance compared to PLS, QPLS, and LSSVM-PLS [12].
In practice, the time-varying characteristic often exists among the process data collected in the form of a series, which leads to great difficulties for static process modeling [13], [14]. To overcome the problems of time-varying due to the complexity of processes, dynamic modeling techniques including recursive algorithms [15], time differences [16], augmented matrices [12], and auto-regressive [14] are preferred choices. In recent years, the dynamic Bayesian networks was structured by recursive algorithms to predict the bridge condition [17]. Dong and Qin [14] built the dynamicinner principal component analysis by maximizing autocovariance. Among these dynamic modeling techniques, the augmented matrix method as a simple but effective method is widely implemented in the dynamic model. Based on conventional modeling methods, the augmented matrix is embedded into PCA and ICA, respectively to structure the dynamic model for process fault detection [13], [18]. Recently, Liu, et al. [19] developed a dynamic concurrent kernel canonical correlation analysis method for fault diagnosis of the continuous annealing process. Besides, the augmented matrices technique based on the concurrent partial least squares was applied to structure concurrent PLS for the metro air quality monitoring [20].
Apart from the nonlinearity and time-varying characteristics, the industrial process also has some problems such as random noises, different treatment conditions, and the complicated mechanism of the biological reaction process, which often leads to uncertainty in WWTPs [21]. As a method with a full description of relationships among variables, Bayesian networks (BN) is widely used owing to the efficient interpretation of uncertainty in processes [22]- [24]. The applications of the BN model have been developed in works by [25]- [27]. As mentioned above, uncertainty and time-variability can be interpreted simultaneously by using the BN model based on dynamic technique.
In this work, the modeling performance can be further improved with the combination of dynamic probability and LVMs method. Firstly, FPLS is proposed to extract the nonlinear characteristic from the lower dimensions of latent variables. Then, the augmented matrices technique is embedded into the BN-based FPLS to obtain an FPLS-dynamic Bayesian networks (FPLS-DBN) model. The goal of this model is to achieve a better prediction performance when facing uncertainty, nonlinearity, and time-varying characteristics of the process. The structure of this paper is as follows. In Section II, we provide an introduction to the FPLS-DBN method in detail. In Section III, the wastewater treatment process and practical industrial datasets from WWTPs are introduced and the proposed model is applied to WWTPs to evaluate the modeling performance of FPLS-DBN. To make a fair comparison, the models of corresponding counterparts are conducted for process modeling. Also, the results are analyzed and discussed. Finally, the main conclusions are presented in Section IV.

II. MATERIALS AND METHODS
In this section, conventional methods are introduced and the details of FPLS-DBN are presented. Based on the PLS model, the FPLS model is structured by introducing the Takagi-Sugeno-Kang (TSK) and fuzzy c-means (FCM) methods. Also, the DBN model makes it possible to use both probabilistic methods and dynamic techniques. Then, the proposed method FPLS-DBN can be obtained with the combination of dynamic probability model and nonlinear method.

A. PARTIAL LEAST SQUARES
The partial least squares aims at searching for suitable latent variables that can exploit variance structures of variables. The input variables X and the output variables Y are projected by a PLS model, which takes the maximization of covariance between X and Y, extracting the orthogonal feature vectors in the space of X and Y. Therefore, PLS method is widely used in process modeling and dimensionality reduction.

B. FUZZY PARTIAL LEAST SQUARES
The PLS can better extract effective information when solving linear problems, however, it has some limitations for complex nonlinear processes. Therefore, the fuzzy rule and clustering method are used to decompose the complex non-linear system into multiple local linear models, which can effectively overcome the nonlinear problems [8].

1) FUZZY c-MEANS
Clustering is one of the most important methods for data analysis and modeling. Among the clustering algorithms, fuzzy c-means, as a fuzzy clustering method based on the objective function, is the most widely method used in data modeling.
The FCM algorithm can be formulated as follows: (1) Select the number of clusters L and the fuzzy index m (a larger m indicates a fuzzy membership relationship, and vice versa); (2) Calculate the membership: is the cluster center.
(3) Calculate the cluster center: (4) Structure the objective function of fuzzy clustering J G : The FCM method is the process of minimizing J G , using the objective function to optimize the clustering problem.

2) TSK FUZZY INFERENCE SYSTEM
The fuzzy system based on fuzzy reasoning is a method of analysis, reasoning and decision-making process, whose fuzzy language would be transformed into fuzzy rules. Therefore, a mathematical model corresponding to the process can be structured because of its better interpretability and learning ability. Classical fuzzy systems can be divided into three types [28]: Takagi-Sugeno-Kang (TSK), Mamdani-Larsen (ML), and Generalized fuzzy (GF). As an essential part of fuzzy systems with more effectiveness and flexibility, TSK is specifically suitable for dealing with complex nonlinear systems [29]. The IF-THEN rule of TSK whose form is defined as: where L is the number of fuzzy rules; x and y are input variables and local output variablend y are input variabs, The overall output of the TSK model is: where, τ i is the firing strength of i-th fuzzy rule, which is computed by The Gaussian membership function is defined as: where, c ir is the clustering center of Gaussian membership function, σ i is the center width of the membership function.

C. DYNAMIC BAYESIAN NETWORKS
The Bayesian networks is a probabilistic model based on a directed acyclic graph (DAG), illustrating the relationship among process variables and each variable represents a node in the networks. The details of the Bayesian networks model can be found in the literature [30]. The target of Bayesian networks is to fit the nonlinear and uncertain characteristics in complex modeling. Assume that X = [x 1 , x 2 , · · · , x m ], the Bayesian formula could be given by: where π (ξ ) is the prior distribution of ξ , The Bayesian networks model as a DAG model fully describes the relationship between different variables. However, there still may be limitations for the BN model with time-varying characteristics of the data. Therefore, an augmented matrix technique is embedded in the BN to fit the dynamic characteristics. The construction of the dynamic model is as follows: The original input matrix X (the parent nodes in the Bayesian network structure) is defined as: Based on the input matrix X, the augmented matrix X i is obtained as: The dynamic technique more clearly presents the structure of data with time-varying by expanding the matrix and introducing a time lag coefficient. Therefore, dynamic characteristics are also more easily extracted in modeling. The DBN model obtains a further development of the dynamic probability model, which improves the ability of the DBN to interpret the data from dynamic processes.

D. STRUCTURE OF FPLS-DBN MODEL
Based on the PLS model, the FPLS model is structured by introducing the TSK and FCM methods to preferably extract the nonlinear characteristics of process data. Moreover, the DBN model makes it possible to use both probabilistic methods and dynamic techniques, which could solve the problems of uncertainty and dynamic characteristics. Then, the latent variables (LVs) of FPLS update the original inputs used for the DBN model. Through combining the FPLS with the DBN, the FPLS-DBN is more conducive to practical applications and achieves a further development of dynamic probability models based on LVMs. As shown in Fig. 1, the FPLS-DBN model is structured according to the research above to provide an accurate modeling method for WWTPs.
The PLS is used to decompose the input X and output Y as follows: where T and U are score matrices for X and Y; P and Q are loading matrices, and E and F are residual matrices, respectively; and t and u are score vectors for X and Y, and p and q are loading matrices, respectively. Calculate eigenvectors t h and u h for each factor h: Obtain the clustering center of Gaussian membership function: where c i (i = 1, 2, · · · , L) is the cluster center, µ ij is membership grade; the fuzzy index m is taken as 2 in this study. The dataset is clustered into L categories, and sub-models are built for each category. Define the input variable as The TSK fuzzy function is defined as: where G i is the normalized firing strength.
The normalized firing strength G i and Gaussian-type firing strength for i-th rule τ i are calculated as: where σ i is the width of the membership function; i = 1, 2, · · · , L. The width of the membership function is calculated using the nearest neighbor rule: where c i and c l are the n closest cluster centers, respectively, and n is often taken as 2; l = 1, 2, · · · , n. The overall output of TSK model is calculated by: Minimize the objective function J G : Calculate the loading vectors of X and Y: Calculate the residual matrices E h and F h : Let h = h + 1 and perform the iterative calculation until the valid information is extracted from residual matrices E h and F h , and then the calculation is terminated.
The number of latent variables in the FPLS model is selected according to the cumulative variance, and a dynamic Bayesian networks is structured by augmented matrices techniques. Then the score matrices of FPLS update the original inputs for the dynamic probability method of DBN.
Expand the scoring matrix X T and introduce the time lag d to obtain the matrixX T : where x(t) is sample data; d is the time lag. UseX T as nodes for BN, and calculate the conditional density of the sample x 1 , x 2 , x 3 , · · · , x m for a random variable ξ ; then the prior distribution π (ξ ) and conditional density P(x 1 , x 2 , x 3 · · · , x m |ξ ) are used to calculate the posterior probability P(ξ |x 1 , x 2 , x 3 · · · , x m ) by the Bayes formula, which is defined as follows: The testing set variable ξ is inferred from the posterior probability density. Besides, root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination (R 2 ) are adopted to evaluate model performance. The statistical indices are defined by: where y i is real value,ŷ i is the predicted value,ȳ is the mean value, and N is the number of samples.

III. RESULTS AND DISCUSSION
Given the fact that the discharge of hazardous pollutants from WWTPs poses a significant challenge for the environment, it is necessary to structure an accurate and reliable model for process monitoring. In this section, the data for testing the modeling performance were collected from WWTPs.
To illustrate the performance of the FPLS-DBN, the key quality index SS is used for prediction. Besides, five counterparts, consisting of PLS, BN, PLS-BN, PLS-DBN, and FPLS-BN are considered for comparison.

A. WWTP DATA
The process data were collected from Daewoo nutrient removal wastewater treatment process from March 9, 2007, to February 29, 2008 [31]. As shown in Fig. 2, the wastewater treatment process mainly includes biological reactors, primary and secondary clarifiers, concentration tank, and dewatering system. The bioreactors include denitrification, anaerobic, anoxic, and oxic reactions. As shown in Fig. 3,  the influent variables include flow rate (Q), suspended solid (SS in ), biochemical oxygen demand (BOD in ), chemical oxygen demand (COD in ), total nitrogen (TN in ), and total phosphorus (TP in ). The effluent variable is suspended solid (SS eff ). The data consist of 358 groups, of which the first 238 groups were used as the training set and the last 120 groups were used as the testing set. Table 1 provides a statistical description of WWTP data.

B. RESULTS AND DISCUSSION
The PLS model as a linear method uses linear inner relation in latent space, whose internal relationship model of the latent variable is shown in Fig. 4 (t(1) and u(1) are input and output score vectors of the latent variable, respectively, and regression models are represented by the solid lines). As shown in Tables 2 to 4, ''This LV'' represents the vari- ance (%), ''Total'' represents the cumulative variance (%), and the LVs is selected by cumulative variance. According  to Table 2, there is no obvious increase in the cumulative variance of output after the second LVs is reached, and thus we selected second LVs for PLS modeling. The scattered points of the latent variable, as shown in Fig. 4, indicate that the process data has strong nonlinearity, it is therefore difficult to effectively extract data information. More specifically, in terms of Table 2, the variance by the second LVs is 1.88% for the output data, whereas the variance by the third LVs is just 0.53%. It follows that PLS modeling has large limitations and it is difficult to structure an accurate model for nonlinear processes.
To overcome the shortage of PLS in modeling, the FPLS model is structured based on the TSK and FCM methods to provide a nonlinear model. The fuzzy rules are critical to the model performance of the FPLS model. The different fuzzy rules thus are used for modeling to explore the impact of fuzzy rules on different models. The FPLS regression model of the latent variable is shown in Fig. 5, which is composed of two subplots. The upper score plot consists of t(1) and u(1) which are the input and output score vectors of the latent variable, and the lower plot is composed of u(1) and firing strength (the dashed lines show the normalized firing strength G i ,and solid lines represent the firing strength τ i ). In addition, Figs. 5(a) -(d) correspond to the score plot with fuzzy rules from 2 to 5, respectively. As shown in Table 3, the variance captured by different FPLS model is represented. And according to the cumulative variance of models, the LVs of FPLS_1 is 2, and the LVs of FPLS_2 to FPLS_4 is 3 in Table 3.
It can be seen from Fig. 5 that obvious nonlinearity is shown in industrial data, and regression lines with different fuzzy rules have different tracking for score points. In Fig. 5  (a), when the fuzzy rule is taken 2, the regression line of FPLS_1 can better fit the score points compared with the internal regression line of PLS (shown in Fig. 4). Similarly, as the number of fuzzy rules increases, the tracking for score points by regression lines becomes more accurate and complicated. Besides, it is concluded from Table 3 that the cumulative variance of the FPLS model is significantly higher than PLS, and the cumulative variance of FPLS is also raised with the increase of fuzzy rules.
However, it is worth noting that in the FPLS_3 model, the variance of the third LVs (1.67%) is lower than the third LVs of the FPLS_2 model (2.15%). It can be understood that effective information extraction is sufficient when the fuzzy rule is 4. Although more variance can be extracted in the FPLS_4 model, there may be redundancy in the extracted information, which is not conducive to modeling.
As shown in Fig. 6, the value of RMSE from different models is given (dashed line and solid line represent the static models and dynamic models, respectively). The fuzzy rule on the abscissa is 1 representing the PLS-BN or PLS-DBN models, and 2-5 representing the FPLS-BN or FPLS-DBN models with fuzzy rules of 2, 3, 4, and 5, respectively.
According to Fig. 6, the models based on the fuzzy method (FPLS-BN and FPLS-DBN) are better than the models based on the linear method (PLS-BN and PLS-DBN). Moreover, FPLS-BN and FPLS-DBN models achieve optimal model performance with a fuzzy rule of 4 (the number of the fuzzy rule is 3). However, there is a significant difference in RMSE value between FPLS-BN and FPLS-DBN. The reason is that time-varying characteristic usually exists in actual industrial processes, which greatly hinders modeling performance.
Considering the fact that the FPLS-BN has not taken the time-varying characteristic into consideration, it is necessary to try to improve the model performance by using the dynamic technique. Therefore, the FPLS-DBN shows an obvious improvement of model accuracy after the dynamic technique is applied to the static model. The prediction results of RMSE are shown in Fig. 6. In terms of FPLS-DBN model, the RMSE value of FPLS-DBN (RMSE = 0.7175) is decreased by 56.47% compared with FPLS-BN.
According to the results, the proposed dynamic model can obtain a better model performance when the fuzzy rule is 4. Moreover, the number of LVs for fuzzy rules may also lead to different (even worse) model performance. To explore the effectiveness of models with different number of fuzzy rules (the fuzzy rule is taken as 4), a different number of fuzzy rules are used to obtain the FPLS-DBN and then the   Table 4 (the LVs of FPLS_5 to FPLS_9 are 2, 3, 4, 5, and 5, respectively). Also, the cumulative variance and the number of LVs for FPLS models are shown in Fig. 7 (the bar graph displays the cumulative variance (%), and the complexity of the model is assessed through the number of LVs which is represented by   the line chart for FPLS models) and the prediction results of FPLS-DBN are shown in Table 5.
As shown in Table 4, there is no significant increase in terms of the cumulative variance after the second LVs for FPLS_5. Compared with the other FPLS models with the second LVs (FPLS_6 to FPLS_9), the cumulative variance values of FPLS_5 (17.22% for output data) seem to be similar to those of other FPLS models. However, there are some differences in variance between FPLS models with third LVs. The cumulative variance of FPLS_5 is lower than other FPLS models (FPLS_6 to FPLS_9) which have shown a better ability to extract the data information.
It is worth noting that a greater amount of fuzzy rules has a higher cumulative variance for the model, but inevitably leads to higher dimensions of latent variables, which is not conducive to modeling. The reason is that the increase in the number of LVs leads to more complex models, which may reduce model accuracy and raise the costs of modeling. This explains why more fuzzy rules result in the worse performance of the FPLS-DBN models. Table 5 provides the RMSE value of FPLS-DBN with different numbers of fuzzy rules, which clearly shows that the FPLS-DBN achieves a better model performance when the number of fuzzy rules is 3. The main reason for this result is that the FPLS-DBN model can effectively extract nonlinear data information by the appropriate fuzzy parameters (the fuzzy rule is 4, and the number of fuzzy rules is 3), and avoid the decrease in model accuracy caused by high data dimensions. Moreover, the dynamic probability method and Bayesian networks is combined with the FPLS latent variable model, leading to an improvement in the model performance. To evaluate the effectiveness of the proposed method, several conventional methods are used to compare modeling performance. For the detailed performance of different models, the prediction results are shown in Table 6 and the detailed prediction variations are shown in Fig. 8. In terms of the conventional models, PLS shows a better model performance than BN. This is probably because the BN model requires a significant amount of data in modeling. Besides, part of the issue is that the BN model may not meet the complexity of industrial processes. After using the latent-variables method of PLS, the PLS-BN shows an obvious improvement than BN. Nevertheless, the prediction results of PLS-BN are even worse compared with PLS. The reason is that in combination with the probabilistic method, process variables are more fully described by the model. When used in this way, however, the problem of data nonlinearity will become more complex, which is not conducive to modeling. For the nonlinear characteristic of process data, a fuzzy-based method is used to extract the nonlinear information. In terms of the fuzzy-based method, the RMSE value of FPLS-BN is decreased by 6.98% compared with PLS-BN.
Considering the fact that the static model unavoidably has some limitations in confronting the dynamic process. It, therefore, is necessary to fit the time-varying characteristic by dynamic technique. Compared with the static methods of BN, PLS, PLS-BN, and FPLS-BN, the RMSE value of FPLS-DBN is decreased by 69.47%, 28.63%, 59.50%, and 56.47%, respectively, and the MAE value of FPLS-DBN is decreased by 45.19%, 30.24%, 36.36%, and 33.88%, respectively. In addition, compared with the prediction results of PLS-DBN, the RMSE of FPLS-DBN is decreased by 3.52% and MAE is reduced by 5.21%, which is mainly due to the improvement of nonlinear modeling capacity. In recent years, Liu et al. [27] developed a locally weighted regression method based on BN (LW-BN) to predict butane content, and its RMSE is reduced by 21.35% compared with LW-PLS. Liu et al. [12] applied the dynamic Gaussian process regression based partial least squares method to predict the effluent ammonia concentration in BSM1 and the results showed that the RMSE of D-GPR-PLS is reduced by 16.80% compared to PLS method. The RMSE value of the proposed method FPLS-DBN in this study is decreased by 28.63% compared with the PLS, which illustrates that the fuzzy LVMs based on the dynamic probability model is beneficial to improve the modeling accuracy in industrial applications.

IV. CONCLUSIONS
In this paper, a dynamic probability model based on the nonlinear method denoted FPLS-DBN has been developed for industrial process modeling, which aims to improve the prediction accuracy of the key quality indices in WWTPs. The proposed method is structured based on the FPLS latent model framework by TSK and FCM fuzzy techniques, which effectively extracts the non-linear features and significantly decreases the data dimension. Then, the dynamic method of augmented matrices is embedded into the probabilistic method of Bayesian networks. Based on the case study of WWTPs, it was shown that the proposed FPLS-DBN model improves the accuracy of the model and effectively overcomes the nonlinearity, uncertainty, and timevariability in the complex wastewater treatment processes. The results show that the RMSE of FPLS-DBN is 0.7175, its RMSE is reduced by 28.63% and 69.47%, respectively, compared with the PLS and BN models, which effectively improves the prediction accuracy of the conventional model and enhances the modeling ability of wastewater treatment processes.
The modeling method from this study can be used to structure the process model online, providing an accurate and reliable modeling method. Besides, the simple but effective dynamic technique and nonlinear method are beneficial to practical application. However, there is still much space for improvement in terms of accuracy. Future studies will pay more attention to maximize modeling accuracy. MINGZHI HUANG received the Ph.D. degree from the South China University of Technology, Guangzhou, China, in 2010. He has been a Professor with South China Normal University, China, since 2017. He is the author of more than 100 journal articles. His research interests include process modeling, process monitoring, fault detection and diagnosis, and wastewater treatment.