Reliable Fault Detection and Diagnosis of Large-Scale Nonlinear Uncertain Systems Using Interval Reduced Kernel PLS

Kernel partial least squares (KPLS) models are widely used as nonlinear data-driven methods for fault detection (FD) in industrial processes. However, KPLS models suffer degraded performance over long operation periods due to changes in process parameters and to the errors and uncertainties associated with measurements. Therefore, in this paper, two different interval reduced KPLS (IRKPLS) models are developed for monitoring large-scale nonlinear uncertain systems. The proposed IRKPLS models are interval versions of the classical KPLS model. Both rely on the Euclidean distance between interval-valued observations as a dissimilarity metric to keep only the most relevant and informative samples. The first proposed IRKPLS technique uses the centers and ranges of the intervals to estimate the interval model, while the second is based on the upper and lower bounds of the intervals. The obtained models are used to evaluate the monitored interval residuals, which are fed to a generalized likelihood ratio test (GLRT) chart to detect faults. In addition to accounting for the uncertainties in the input-output data, the new IRKPLS-based GLRT techniques aim to decrease the execution time while preserving the fault detection performance. The developed IRKPLS-based GLRT approaches are evaluated on various faults of the well-known Tennessee Eastman (TE) process, in terms of missed detection rate, false alarm rate, and execution time. The obtained results demonstrate the efficiency of the proposed approaches compared with the classical interval KPLS.


I. INTRODUCTION
Fault detection (FD) has become increasingly important to improve product quality, ensure process safety, and decrease energy and material consumption [1]. As large amounts of process data are gathered in historical databases in modern industrial processes, data-driven FD techniques have been widely discussed by researchers [2]-[9]. Partial least squares (PLS) regression is the most used data-driven technique for modeling and FD of industrial processes, and it has proved excellent performance [10], [11]. The main idea of PLS is to model the correlation between the process and quality variables [12]. By establishing a model from normal operating data, PLS can be applied for prediction and FD purposes. However, the PLS technique assumes that the process data are linear, which makes it inappropriate for nonlinear industrial processes. To address this issue, a nonlinear version of the PLS model, so-called kernel PLS (KPLS), has been developed in the literature [13]-[15]. The KPLS technique maps the nonlinear input data into a high-dimensional space in which a linear PLS model is applied. According to Cover's theorem [14], a nonlinear relationship among variables in the original space is most probably linear after a high-dimensional nonlinear mapping. KPLS can efficiently determine the regression coefficients in the feature space using nonlinear kernel functions and improve the prediction performance. It has been successfully implemented for modeling and FD of many industrial processes [2], [16]-[18].

(The associate editor coordinating the review of this manuscript and approving it for publication was Heng Wang. VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/)
Generally, the quality of the available measurements is essential for the reliability of the identified model as well as for the FD performance. Using the KPLS technique, the measured process data are described by a single-valued representation, which is a result of simplification during the data mining procedure. The need for interval-valued data can arise from the imprecision of measurement devices, from process uncertainties, or from the fluctuations of data collected over a specific interval of time.
In actual practice, taking into account the minimum and maximum of the collected sample values gives a more complete overview of the measured phenomenon than taking into account only the average values. Thus, for more accuracy in the representation of real data, the uncertainty of the data may be captured by adopting an interval representation [19]-[21]. In the literature, several linear techniques for interval-valued data have been developed [21]-[27]. However, up to now there have been no studies on the monitoring of nonlinear uncertain systems using interval kernel PLS.
The interval KPLS (IKPLS), or KPLS-based interval-valued data method, is an extension of the single-valued KPLS model in the presence of imprecision, variation, and data confidence intervals. For example, the IKPLS UL model builds two KPLS models on the lower and upper (UL) bounds of the interval values [21]. Another interval technique, the so-called bivariate center and range (CR) KPLS (IKPLS CR), creates a KPLS model on new numerical input and output matrices constructed by the concatenation of the center and range matrices [21]. However, the major limitation of KPLS-based fault detection techniques is their computational complexity. Using a KPLS model for FD leads to a high computational cost when the training data set is large, since the recorded and stored data are used for both modeling and monitoring. This is mainly explained by the fact that kernel methods depend on all collected observations of the process variables. To reduce the computational complexity of the IKPLS models, two interval reduced KPLS (IRKPLS) models are proposed in the current work. The objective is to enhance data-driven monitoring methods, especially in experimental industrial applications where imperfect measurements significantly deteriorate the detection performance. The proposed IRKPLS schemes are robust to data outliers and model incompleteness, and improve both fault detection robustness and sensitivity while maintaining a satisfactory and stable performance over long periods of process operation. Both IRKPLS methods (IRKPLS CR and IRKPLS UL) use the Euclidean distance between interval-valued samples as a dissimilarity metric, discarding redundant observations and keeping only one informative sample among those separated by small distances.
In addition to their abilities to efficiently monitor uncertain processes, the IRKPLS models are able to reduce the execution time and memory space.
The two proposed IRKPLS CR and IRKPLS UL models are used to generate the monitored interval residuals, which are fed to a generalized likelihood ratio test (GLRT) chart for detecting faults. The GLRT is among the most used methods for the detection of industrial faults [28], [29].
Thus, in the proposed IRKPLS-based GLRT approaches, the IRKPLS methods are used to identify the interval-valued models and the GLRT chart is applied for fault detection.
The detection efficiency of the proposed IRKPLS-based GLRT approaches is evaluated using the Tennessee Eastman (TE) process. The false alarm rate (FAR), missed detection rate (MDR), and execution time (ET) are used as metrics to evaluate the detection and monitoring abilities.
The remainder of this paper is structured as follows. In the Preliminaries section, PLS and kernel PLS are briefly introduced. The developed IRKPLS models are presented in Section III. Fault detection using the GLRT is detailed in Section IV. In Section V, the application of the developed IRKPLS-based GLRT approaches to the Tennessee Eastman (TE) process is presented. The conclusions are given in Section VI.

II. PRELIMINARIES

A. PARTIAL LEAST SQUARES
Partial least squares (PLS) is a multivariate statistical analysis method that models the relation between input variables (regressors) and output variables (responses). The PLS decomposition of the input $X \in \mathbb{R}^{N \times m}$ and the response $Y \in \mathbb{R}^{N \times p}$ has the following form [30], [31]:

$$X = TP^{T} + E, \qquad Y = UQ^{T} + F,$$

where $T = [t_1, t_2, \cdots, t_\ell] \in \mathbb{R}^{N \times \ell}$ and $U = [u_1, u_2, \cdots, u_\ell] \in \mathbb{R}^{N \times \ell}$ represent the input and output score matrices, respectively, $P = [p_1, p_2, \cdots, p_\ell] \in \mathbb{R}^{m \times \ell}$ and $Q = [q_1, q_2, \cdots, q_\ell] \in \mathbb{R}^{p \times \ell}$ denote the corresponding loading matrices, $\ell$ is the number of latent variables, and $E \in \mathbb{R}^{N \times m}$ and $F \in \mathbb{R}^{N \times p}$ are the residual matrices.
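As a rough illustration of this decomposition, the following sketch extracts the PLS score and loading matrices with a NIPALS-style iteration (a minimal numpy implementation, not the paper's code; the eps guards, fixed iteration count, and toy data are illustrative choices):

```python
import numpy as np

def pls_nipals(X, Y, n_components, eps=1e-12):
    """NIPALS-style PLS: extract score vectors t_i = X c_i with maximal
    covariance with the output scores u_i = Y q_i, deflating X and Y
    after each extracted component."""
    Xd, Yd = X.copy(), Y.copy()
    T, P = [], []
    for _ in range(n_components):
        u = Yd[:, [0]]
        for _ in range(100):
            c = Xd.T @ u
            c = c / (np.linalg.norm(c) + eps)   # weight vector, ||c|| = 1
            t = Xd @ c                          # input score vector
            q = Yd.T @ t
            q = q / (np.linalg.norm(q) + eps)
            u = Yd @ q                          # output score vector
        p = Xd.T @ t / (t.T @ t + eps)          # input loading vector
        Xd = Xd - t @ p.T                       # deflate the input block
        Yd = Yd - t @ (t.T @ Yd) / (t.T @ t + eps)  # deflate the output block
        T.append(t)
        P.append(p)
    return np.hstack(T), np.hstack(P)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = X @ rng.normal(size=(3, 2))          # linear responses
T, P = pls_nipals(X, Y, n_components=3)  # with all 3 components, X = T P^T
```

With as many components as input variables, the deflations fully explain the input block, so `X` is recovered as `T @ P.T`.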
The score vector $t_i$, the $i$-th column of the matrix $T$, can be determined as

$$t_i = X c_i,$$

where $c_i$ is the weight vector obtained by solving the following optimization problem [32]:

$$c_i = \arg\max_{\|c\|=1} \operatorname{cov}(Xc, u_i).$$

Likewise, the score vector $u_i$ can be estimated as $u_i = Y q_i$. The PLS regression model can be represented using a regression coefficient matrix $B$ and a residual matrix $R$ as

$$Y = XB + R.$$

PLS is a linear regression approach which assumes that the relations among process variables are linear. In many applications, this assumption is not realistic. To solve this issue, a nonlinear regression technique, so-called kernel PLS (KPLS), was developed [33]-[35]. KPLS is among the most widely used nonlinear input-output statistical methods. It maps the input data into a high-dimensional feature space and then applies linear PLS in that space. Consider an input data matrix $X = [x_1, x_2, \cdots, x_N]^T \in \mathbb{R}^{N \times m}$ with $N$ samples and $m$ process variables ($x_i \in \mathbb{R}^m$), and an output matrix $Y = [y_1, y_2, \cdots, y_N]^T \in \mathbb{R}^{N \times q}$ regrouping $N$ observations of $q$ quality variables ($y_i \in \mathbb{R}^q$). We assume that the original data $x_i \in \mathbb{R}^m$ are mapped into the high-dimensional feature space by a nonlinear projection function $\phi$, so that $x_i \mapsto \phi(x_i) \in \mathcal{F}$. Note that the dimensionality $p$ of the feature space $\mathcal{F}$ can be arbitrarily large or even infinite. The mapped input data matrix is given by $\Phi = [\phi(x_1), \phi(x_2), \cdots, \phi(x_N)]^T$, and its PLS decomposition in the feature space is

$$\Phi = TP^{T} + \Phi_r, \qquad Y = UQ^{T} + Y_r,$$

where $T \in \mathbb{R}^{N \times A}$ and $U \in \mathbb{R}^{N \times A}$ represent the input and output score matrices, respectively, $P$ and $Q$ denote the loading matrices of $\Phi$ and $Y$, and $\Phi_r$ and $Y_r$ are the residuals. Here $A$ is the number of latent variables in the high-dimensional space, selected by the cumulative percent variance (CPV) criterion [36]. In KPLS, using the kernel function $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$, we can avoid carrying out the explicit nonlinear mapping and determine dot products in the space $\mathcal{F}$ directly. The radial basis kernel is the most used kernel function; it is defined as

$$k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{c}\right),$$

where $c$ is the kernel width. The KPLS algorithm is outlined in Algorithm 1.
6. Iterate steps (3) to (5) until the convergence of $t_i$.
7. Deflate the residuals.
8. Retain the data in the matrices: $T \leftarrow [T, t_i]$, $U \leftarrow [U, u_i]$.
9. Set $i = i + 1$ and return to step (2). Stop when $i > A$, with $A$ being the number of extracted latent variables.
After selecting the desired kernel latent variables, the regression coefficient matrix between $\Phi$ and $Y$ has the following form [37], [38]:

$$B = \Phi^{T} U (T^{T} K U)^{-1} T^{T} Y,$$

where $K = \Phi \Phi^{T}$ is the kernel (Gram) matrix. The predicted outputs on the training samples are given by

$$\hat{Y} = \Phi B = K U (T^{T} K U)^{-1} T^{T} Y.$$

The output prediction for a testing sample $x^{*}$ with the trained model is written as

$$\hat{y}(x^{*}) = k(x^{*})^{T} U (T^{T} K U)^{-1} T^{T} Y,$$

where $k(x^{*}) = [k(x^{*}, x_1), \cdots, k(x^{*}, x_N)]^{T}$ is the kernel vector.
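A compact numerical sketch of these KPLS formulas follows: a NIPALS-style iteration on a centered Gram matrix, then prediction via $\hat{Y} = K U (T^{T} K U)^{-1} T^{T} Y$. The radial basis width, the centering convention, and the toy data are illustrative assumptions, not choices from the paper:

```python
import numpy as np

def rbf_gram(A, B, width):
    # Gram matrix of the radial basis kernel k(a, b) = exp(-||a - b||^2 / width)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / width)

def kpls_fit(K, Y, n_components):
    """Kernel PLS on a centered Gram matrix K (N x N): extract the score
    matrices T and U, deflating K and Y after each component."""
    N = K.shape[0]
    Kd, Yd = K.copy(), Y.copy()
    T, U = [], []
    for _ in range(n_components):
        u = Yd[:, [0]]
        for _ in range(100):
            t = Kd @ u
            t = t / np.linalg.norm(t)
            u = Yd @ (Yd.T @ t)
            u = u / np.linalg.norm(u)
        t = Kd @ u
        t = t / np.linalg.norm(t)
        T.append(t)
        U.append(u)
        D = np.eye(N) - t @ t.T
        Kd = D @ Kd @ D                   # deflate the kernel matrix
        Yd = Yd - t @ (t.T @ Yd)          # deflate the outputs
    return np.hstack(T), np.hstack(U)

def kpls_predict(K_new, K, Y, T, U):
    # Y_hat = K_new U (T^T K U)^{-1} T^T Y ; use K_new = K for training data
    coef = np.linalg.lstsq(T.T @ K @ U, T.T @ Y, rcond=None)[0]
    return K_new @ U @ coef

# toy nonlinear example
X = np.linspace(0.0, 3.0, 30)[:, None]
Y = np.sin(2.0 * X)
N = len(X)
C = np.eye(N) - np.ones((N, N)) / N       # centering in feature space
Kc = C @ rbf_gram(X, X, width=1.0) @ C
Yc = Y - Y.mean(axis=0)
T, U = kpls_fit(Kc, Yc, n_components=5)
Y_hat = kpls_predict(Kc, Kc, Yc, T, U) + Y.mean(axis=0)
```

The training fit should track the nonlinear target far better than a constant predictor, which is the point of moving PLS into the kernel-induced feature space.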

III. INTERVAL REDUCED KERNEL PLS METHOD
This section presents the interval kernel PLS (IKPLS) methods. Interval-valued data provide a representation that takes into account the available sensor information together with its uncertainty and variability. The dimension of the IKPLS model depends on the number of samples in the training data set; hence, reducing the number of samples reduces the structure of the IKPLS model.
In this paper, therefore, two novel interval reduced kernel PLS (IRKPLS) models are developed to handle large data sets and reduce the computational complexity of the IKPLS models. In the proposed IRKPLS models, first, the Euclidean distance between samples is used as a dissimilarity metric to keep only the most relevant and informative observations in the input space. Then, the new reduced data set is used to build the IRKPLS models. In the following, the two proposed IRKPLS models based on the Euclidean distance (IRKPLS UL and IRKPLS CR) are presented.
The new interval input matrix $[X]$ is given by

$$[X] = \begin{bmatrix} [x_{1}(1)] & \cdots & [x_{m}(1)] \\ \vdots & \ddots & \vdots \\ [x_{1}(N)] & \cdots & [x_{m}(N)] \end{bmatrix},$$

where each entry $[x_{j}(i)] = [\underline{x}_{j}(i), \overline{x}_{j}(i)]$ is an interval with lower bound $\underline{x}_{j}(i)$ and upper bound $\overline{x}_{j}(i)$. Similarly, the interval-valued output matrix $[Y]$ is expressed as

$$[Y] = \begin{bmatrix} [y_{1}(1)] & \cdots & [y_{q}(1)] \\ \vdots & \ddots & \vdots \\ [y_{1}(N)] & \cdots & [y_{q}(N)] \end{bmatrix},$$

with $[y_{j}(i)] = [\underline{y}_{j}(i), \overline{y}_{j}(i)]$.
The Euclidean distance between observations is used as the dissimilarity metric, adopted to extract the most relevant subset from all available interval-valued samples. To do so, the measures of dissimilarity between all pairs of interval-valued observations are collected in a dissimilarity matrix $D = (d_{ij})$, where $d_{ij}$ is the Euclidean distance between the rows $[X_i]$ and $[X_j]$ of the data matrix $[X]$:

$$d_{ij} = \sqrt{\sum_{k=1}^{m} \left[ \left( \underline{x}_{k}(i) - \underline{x}_{k}(j) \right)^{2} + \left( \overline{x}_{k}(i) - \overline{x}_{k}(j) \right)^{2} \right]}.$$

Hence, the dissimilarity matrix $D$ is symmetric and its diagonal elements are null. Then, all redundant observations are eliminated from the input data matrix based on a chosen dissimilarity distance threshold, which plays an important role in reducing the number of observations.
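The reduction step can be sketched as follows: the pairwise interval distances are computed from the bound matrices, and one representative is kept per group of near-duplicate observations. The greedy scan and the threshold value are illustrative choices, not the paper's exact selection rule:

```python
import numpy as np

def interval_distance_matrix(X_lo, X_hi):
    """Pairwise Euclidean distance between interval-valued rows, using both
    the lower-bound and upper-bound coordinates of each observation."""
    Z = np.hstack([X_lo, X_hi])
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.sqrt(np.maximum(d2, 0.0))

def select_relevant(X_lo, X_hi, threshold):
    """Keep an observation only if it is farther than `threshold` from every
    observation already kept, discarding redundant (near-duplicate) rows."""
    D = interval_distance_matrix(X_lo, X_hi)
    kept = []
    for i in range(D.shape[0]):
        if all(D[i, j] > threshold for j in kept):
            kept.append(i)
    return np.array(kept)

# small example: rows 0 and 2 are identical intervals, so row 2 is discarded
X_lo = np.array([[0.9, 1.9], [4.0, 5.0], [0.9, 1.9]])
X_hi = np.array([[1.1, 2.1], [4.4, 5.6], [1.1, 2.1]])
kept = select_relevant(X_lo, X_hi, threshold=0.5)
```

The retained indices define the reduced matrices used to train the IRKPLS models.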
The resulting reduced interval matrix of inputs can be defined as

$$[X_r] = \begin{bmatrix} [x_{1}(1)] & \cdots & [x_{m}(1)] \\ \vdots & \ddots & \vdots \\ [x_{1}(N_r)] & \cdots & [x_{m}(N_r)] \end{bmatrix} \in \mathbb{R}^{N_r \times m},$$

where $N_r$ represents the number of reduced samples.
The corresponding reduced interval matrix of outputs, $[Y_r] \in \mathbb{R}^{N_r \times q}$, is obtained in the same way. Once the new reduced input and output matrices are evaluated, the interval-valued data matrices are transformed into numerical data matrices, and the IKPLS model is applied to the computed matrices.

B. IRKPLS BASED ON INTERVAL UPPER AND LOWER BOUNDS (IRKPLS UL )
The first proposed approach is the IRKPLS based on the interval upper and lower bounds, IRKPLS UL. It consists of building two KPLS models using the reduced input and output data matrices formed, respectively, by the lower and upper bounds of the interval measurements. Let $\underline{x}(i) = [\underline{x}_1(i), \cdots, \underline{x}_m(i)]$, $i \in \{1, \ldots, N_r\}$, be the lower interval input vector with lower bounds $\underline{x}_j(i)$, and let $\underline{y}(i) = [\underline{y}_1(i), \cdots, \underline{y}_q(i)]$ be the lower bounds of the interval output vector.
The reduced kernel matrix for the lower-bound data is expressed as

$$\underline{K}(i,j) = k(\underline{x}(i), \underline{x}(j)), \quad i, j = 1, \ldots, N_r.$$

In the IRKPLS technique, the regression coefficient matrix $\underline{B}$ for the lower bounds of the interval-valued data can be written as

$$\underline{B} = \underline{\Phi}^{T} \underline{U} (\underline{T}^{T} \underline{K}\, \underline{U})^{-1} \underline{T}^{T} \underline{Y}.$$

Therefore, the output prediction on the reduced training samples for the lower bounds of the interval-valued data is

$$\hat{\underline{Y}} = \underline{K}\, \underline{U} (\underline{T}^{T} \underline{K}\, \underline{U})^{-1} \underline{T}^{T} \underline{Y}.$$

The lower-bound output prediction on a test sample $x^{*}$ can be made as

$$\hat{\underline{y}}(x^{*}) = \underline{k}(x^{*})^{T} \underline{U} (\underline{T}^{T} \underline{K}\, \underline{U})^{-1} \underline{T}^{T} \underline{Y}. \quad (22)$$

Similarly, define $\overline{x}(i) = [\overline{x}_1(i), \cdots, \overline{x}_m(i)]$ as the upper bound of the interval input vector, with upper bounds $\overline{x}_j(i)$. The upper bound of the interval output vector can be represented in the form $\overline{y}(i) = [\overline{y}_1(i), \cdots, \overline{y}_q(i)]$.
In this case, the second reduced kernel matrix $\overline{K}$ for the upper bounds of the interval-valued data is given by

$$\overline{K}(i,j) = k(\overline{x}(i), \overline{x}(j)), \quad i, j = 1, \ldots, N_r,$$

and the regression coefficient matrix $\overline{B}$ for the upper bounds in the IRKPLS technique has the following form:

$$\overline{B} = \overline{\Phi}^{T} \overline{U} (\overline{T}^{T} \overline{K}\, \overline{U})^{-1} \overline{T}^{T} \overline{Y}.$$

The prediction of the upper-bound output variables on the reduced training samples is determined by

$$\hat{\overline{Y}} = \overline{K}\, \overline{U} (\overline{T}^{T} \overline{K}\, \overline{U})^{-1} \overline{T}^{T} \overline{Y}.$$

As a result, the prediction on the test subset for the upper bounds of the interval-valued data is written as

$$\hat{\overline{y}}(x^{*}) = \overline{k}(x^{*})^{T} \overline{U} (\overline{T}^{T} \overline{K}\, \overline{U})^{-1} \overline{T}^{T} \overline{Y}. \quad (27)$$

The main steps of the IRKPLS UL algorithm are outlined in Algorithm 2.

Algorithm 2 IRKPLS UL Algorithm
Training step:
Inputs: $N \times m$ input data matrices $\underline{X}$, $\overline{X}$ and $N \times q$ output data matrices $\underline{Y}$, $\overline{Y}$.
1. Calculate the elements of the dissimilarity matrix $D$.
2. Pick out the smallest distances from $D$ and extract the new reduced input data matrices $\underline{X} \in \mathbb{R}^{N_r \times m}$, $\overline{X} \in \mathbb{R}^{N_r \times m}$ and output data matrices $\underline{Y} \in \mathbb{R}^{N_r \times q}$, $\overline{Y} \in \mathbb{R}^{N_r \times q}$.
3. Build the two KPLS models on the lower-bound and upper-bound matrices, respectively.

C. IRKPLS BASED ON INTERVAL CENTERS AND RANGES (IRKPLS CR)
The proposed IRKPLS CR model first transforms the input $[X]$ and output $[Y]$ matrices into numerical matrices based on the interval centers and ranges. Then, it decreases the dimensions of the original data matrices using the Euclidean distance between observations as the dissimilarity metric. After that, it uses both the center and the range variables to construct a reduced regression model. In the proposed IRKPLS CR method, new data matrices are built by the concatenation of the center and range data matrices, and a KPLS model is applied to these new reduced data matrices.
The new reduced input $X_{CR}$ and output $Y_{CR}$ data matrices are defined as

$$X_{CR} = \begin{bmatrix} X^{c} \\ X^{r} \end{bmatrix}, \qquad Y_{CR} = \begin{bmatrix} Y^{c} \\ Y^{r} \end{bmatrix},$$

where the input center $X^{c}$ and range $X^{r}$ matrices, and the output center $Y^{c}$ and range $Y^{r}$ matrices, are given by

$$X^{c} = \frac{\overline{X} + \underline{X}}{2}, \quad X^{r} = \frac{\overline{X} - \underline{X}}{2}, \quad Y^{c} = \frac{\overline{Y} + \underline{Y}}{2}, \quad Y^{r} = \frac{\overline{Y} - \underline{Y}}{2}.$$

Then, the reduced input data matrix is mapped to the feature space and the reduced kernel matrix $K_{CR}$ based on the center and range approach is defined as

$$K_{CR}(i,j) = k(x_{CR}(i), x_{CR}(j)).$$

The prediction of the output variables based on the center and range method can be expressed as

$$\hat{Y}_{CR} = K_{CR} U_{CR} (T_{CR}^{T} K_{CR} U_{CR})^{-1} T_{CR}^{T} Y_{CR},$$

where the new regression coefficient matrix $B_{CR}$ is given by

$$B_{CR} = \Phi_{CR}^{T} U_{CR} (T_{CR}^{T} K_{CR} U_{CR})^{-1} T_{CR}^{T} Y_{CR}.$$

For a new observation $x^{*}_{CR}$, using the reduced data set, the interval output estimation $\hat{y}_{CR}$ based on the IRKPLS CR model is represented by

$$\hat{y}_{CR}(x^{*}_{CR}) = k_{CR}(x^{*}_{CR})^{T} U_{CR} (T_{CR}^{T} K_{CR} U_{CR})^{-1} T_{CR}^{T} Y_{CR}. \quad (34)$$

The IRKPLS CR (center and range) algorithm is presented in Algorithm 3.
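The center-and-range transformation can be sketched as follows, starting from the reduced bound matrices. The vertical stacking of the center and half-range blocks is one plausible reading of the concatenation described above, and the toy values are illustrative:

```python
import numpy as np

def cr_matrices(X_lo, X_hi):
    """IRKPLS CR transformation: compute interval centers and half-ranges
    and concatenate them into a single numerical matrix for one KPLS model.
    (IRKPLS UL, by contrast, simply uses X_lo and X_hi as two separate
    numerical matrices, one per KPLS model.)"""
    Xc = (X_lo + X_hi) / 2.0     # interval centers
    Xr = (X_hi - X_lo) / 2.0     # interval half-ranges
    return np.vstack([Xc, Xr])

X_lo = np.array([[1.0, 3.0], [2.0, 5.0]])
X_hi = np.array([[2.0, 4.0], [4.0, 7.0]])
X_cr = cr_matrices(X_lo, X_hi)
# centers: [[1.5, 3.5], [3.0, 6.0]]; half-ranges: [[0.5, 0.5], [1.0, 1.0]]
```

Since center minus half-range recovers the lower bound (and center plus half-range the upper bound), the transformation loses no interval information.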

IV. FAULT DETECTION BASED ON IRKPLS
Once the IRKPLS models are built, the monitored interval residuals are computed and used for process monitoring purposes. Different detection indices have been developed in the literature to monitor the behavior of a process [16], [28], [34], [36], [39]-[43]. Among the fault detection charts, the GLRT shows satisfactory rapidity in detecting both large and small faults in many industrial systems. The GLRT statistic is based on a hypothesis testing scheme [36], [44]-[47]. Let $E \in \mathbb{R}^{N_r}$ be a measured vector that follows one of the two Gaussian distributions $\mathcal{N}(0, \sigma^2 I_{N_r})$ or $\mathcal{N}(\theta \neq 0, \sigma^2 I_{N_r})$, where $\theta$ is the mean vector and $\sigma^2$ is the known variance. The hypothesis testing problem is given by

$$H_0: E \sim \mathcal{N}(0, \sigma^2 I_{N_r}) \quad \text{versus} \quad H_1: E \sim \mathcal{N}(\theta, \sigma^2 I_{N_r}), \; \theta \neq 0.$$

Using the GLRT approach, the unknown parameter $\theta$ is replaced by its maximum likelihood estimate. The GLRT decision function $G(E)$ is computed as

$$G(E) = 2 \ln \frac{\sup_{\theta} f_{\theta}(E)}{f_{0}(E)} = \frac{1}{\sigma^2} \|E\|^2,$$

where $f_{\theta}(E)$ is the probability density function, given by

$$f_{\theta}(E) = \frac{1}{(2\pi\sigma^2)^{N_r/2}} \exp\left(-\frac{\|E - \theta\|^2}{2\sigma^2}\right).$$

The multivariate GLRT chart is determined for the upper and lower bounds of the interval-valued data in the feature space as

$$\underline{G} = \frac{1}{\underline{\sigma}^2} \|\underline{E}\|^2, \qquad \overline{G} = \frac{1}{\overline{\sigma}^2} \|\overline{E}\|^2,$$

where $\underline{\sigma}^2$ and $\overline{\sigma}^2$ are the variances of the residuals $\underline{E}$ and $\overline{E}$, respectively. According to equations (22) and (27), the residuals based on the lower and upper bounds of the interval-valued data are defined as

$$\underline{E} = \underline{Y} - \hat{\underline{Y}}, \qquad \overline{E} = \overline{Y} - \hat{\overline{Y}}.$$

A fault is detected using the GLRT chart if both the upper and lower indices, $\overline{G}$ and $\underline{G}$, exceed their respective thresholds, $\overline{G}_{\alpha}$ and $\underline{G}_{\alpha}$. The control limits $\underline{G}_{\alpha}$ and $\overline{G}_{\alpha}$ are computed as

$$\underline{G}_{\alpha} = \underline{g}\, \chi^2_{\underline{h}, \alpha}, \qquad \overline{G}_{\alpha} = \overline{g}\, \chi^2_{\overline{h}, \alpha},$$

with

$$\underline{g} = \frac{\underline{b}}{2\underline{a}}, \quad \underline{h} = \frac{2\underline{a}^2}{\underline{b}}, \qquad \overline{g} = \frac{\overline{b}}{2\overline{a}}, \quad \overline{h} = \frac{2\overline{a}^2}{\overline{b}},$$

where $\underline{a}$ and $\underline{b}$ are the mean and variance of the $\underline{G}$ index, and $\overline{a}$ and $\overline{b}$ are the mean and variance of the $\overline{G}$ index. The IRKPLS UL-based GLRT fault detection strategy is shown in Figure 1. From the IRKPLS CR model and using equation (34), the residual is given by

$$E_{CR} = Y_{CR} - \hat{Y}_{CR}.$$

The multivariate fault detection GLRT chart $G_{CR}(x^{*}_{CR})$ of a new sample $x^{*}_{CR}$ is computed in the feature space as

$$G_{CR}(x^{*}_{CR}) = \frac{1}{\sigma_{CR}^2} \|E_{CR}\|^2.$$

Let $G_{CR,\alpha}$ be the corresponding control limit. A fault is detected if $G_{CR}(x^{*}_{CR}) > G_{CR,\alpha}$. The IRKPLS-based GLRT statistic algorithm is illustrated schematically in Figure 2.
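The detection logic above can be sketched numerically. To keep the sketch dependency-free, the control limit is taken here as the empirical (1 − α) quantile of the statistics over healthy residual windows rather than the χ²-based limit; the window length, variance, and fault size are illustrative assumptions:

```python
import numpy as np

def glrt_statistic(E, sigma2):
    # G(E) = ||E||^2 / sigma^2 : GLR statistic for a mean shift in
    # Gaussian noise with known variance sigma^2
    E = np.asarray(E, dtype=float)
    return float(E @ E) / sigma2

def empirical_limit(healthy_windows, sigma2, alpha=0.05):
    # control limit G_alpha as the (1 - alpha) quantile of the GLRT
    # statistic computed over healthy residual windows
    stats = [glrt_statistic(e, sigma2) for e in healthy_windows]
    return float(np.quantile(stats, 1.0 - alpha))

rng = np.random.default_rng(1)
sigma2 = 1.0
healthy = [rng.normal(0.0, 1.0, size=20) for _ in range(200)]
G_alpha = empirical_limit(healthy, sigma2, alpha=0.05)

faulty = rng.normal(5.0, 1.0, size=20)   # residual window with a mean shift
fault_flag = glrt_statistic(faulty, sigma2) > G_alpha
```

A healthy window yields a statistic near the mean of a χ² variable with 20 degrees of freedom, while the shifted window pushes the statistic far above the limit, triggering a detection.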

V. CASE STUDY USING THE TENNESSEE EASTMAN (TE) PROCESS
To evaluate the detection efficiency of the proposed techniques, the industrial benchmark of the Tennessee Eastman (TE) process is used [48]-[51]. The monitoring performance of both schemes is appraised based on the false alarm rate (FAR), which is the percentage of false alarms in a healthy set, the missed detection rate (MDR), which represents the percentage of undetected samples in a faulty set, and the execution time (ET).

A. TE PROCESS DESCRIPTION
The TE process has been widely used by the process monitoring community as a source of data for comparing various methods [48], [52]-[55]. The TE process involves eight components: G and H are the main products, A, C, D, and E are reactants (gaseous raw materials), F is a by-product, and B is an inert gas. Irreversible exothermic chemical reactions take place in the reactor. The TE process includes five principal units: a reactor, a compressor, a condenser, a stripper, and a vapor/liquid separator (Figure 3). More information about the process, including the reactions, is broadly detailed in the literature [56], [57].

B. FAULT DETECTION RESULTS
In this study, a training dataset of 1024 samples is collected from the TE process under normal operating conditions, while 21 different testing datasets, each containing 1024 samples, are generated at a sampling rate of 3 minutes over the 21 benchmark faults. Each sample of each variable is associated with an uncertainty bound of δ = 10% to generate an interval-valued sample for the training and testing datasets. Figure 4 illustrates the time evolution of the interval-valued variable XMEAS(1) during the healthy and faulty statuses. Two conventional and two developed IKPLS models are built using the CPV criterion with 95% as the explained variance threshold. The two conventional IKPLS models, referred to as IKPLS CR and IKPLS UL, are structured with 13 and 12 retained kernel latent variables, respectively. The two IRKPLS CR and IRKPLS UL models are identified with only 585 samples, extracted thanks to the Euclidean distance similarity measure, from which 12 kernel latent variables are maintained for each of them. The nonlinear uncertain process monitoring is achieved using the different obtained IKPLS models, from which the fault detection GLRT index is assessed for comparison. Its threshold is established with α = 5% as the significance level.
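The interval construction used in the experiments can be sketched as follows; the ±δ relative bound mirrors the δ = 10% setting described above, and the toy values are illustrative:

```python
import numpy as np

def to_intervals(X, delta=0.10):
    """Turn point measurements into interval-valued samples by attaching
    a +/- delta relative uncertainty bound to each value."""
    half = delta * np.abs(X)
    return X - half, X + half

X = np.array([[10.0, -2.0], [4.0, 0.5]])
X_lo, X_hi = to_intervals(X, delta=0.10)
# e.g. the measurement 10.0 becomes the interval [9.0, 11.0]
```

Each point measurement then lies at the center of its interval, with a width proportional to its magnitude.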
The obtained FAR, MDR, and ET values contributed by the IKPLS CR, IKPLS UL, IRKPLS CR, and IRKPLS UL techniques based on the GLRT statistic are summarized in Table 1, where the best values are shown in bold and the underlined value represents the best among all techniques for each performance metric. It can be clearly seen that the FARs contributed by the IRKPLS CR-based GLRT method are, for almost all faults, slightly improved compared to those resulting from the IKPLS CR-based GLRT technique. The IKPLS UL-based GLRT, in contrast, yields a fairly acceptable practical FAR and ensures higher robustness compared to all algorithms. The two developed IRKPLS models provide better results than the conventional ones in terms of MDR, which is successfully improved for most faults, ensuring reasonably correct detection. Furthermore, they also give an important ET enhancement in all testing cases: more than 57% of the ET is gained by the IRKPLS CR- and IRKPLS UL-based GLRT techniques, and, as a result, memory requirements are satisfactorily reduced to 55%.

VI. CONCLUSIONS
In this work, we considered the problem of fault detection in large-scale nonlinear uncertain systems using interval input-output reduced kernel PLS (IRKPLS). The IRKPLS model is an interval version of the conventional KPLS model, developed to take into account the errors and imprecision of the measurement devices as well as the uncertainties in the system. In addition to its ability to efficiently monitor these processes, the IRKPLS model significantly reduces the execution time and memory space. The developed IRKPLS uses the Euclidean distance between interval-valued samples as a dissimilarity measure to extract the most relevant and informative samples. The monitored interval residuals, evaluated using the IRKPLS models, are then introduced to the generalized likelihood ratio test (GLRT) as input for fault detection purposes. The detection performance and effectiveness of the developed IRKPLS-based GLRT techniques were assessed through different kinds of faults of the well-known Tennessee Eastman (TE) process in terms of missed detection rate, false alarm rate, and execution time. The proposed IRKPLS-based GLRT methods remarkably reduced the execution time, and consequently the memory requirements, while ensuring the required fault detection performance.

His work focuses on statistical signal processing. He is the author of more than 150 refereed journal and conference publications and book chapters, and has worked on several projects as a Lead Principal Investigator (LPI) and a Principal Investigator (PI). He served as a technical committee member of several international journals and conferences.
ABDELMALEK KOUADRI is currently a Professor of electrical engineering with the Institute of Electrical and Electronics Engineering, University M'Hamed Bougara of Boumerdès, Algeria. He has more than ten years of combined academic and industrial experience. He has published more than 50 refereed journal and conference publications and book chapters. He has served as a technical committee member of several international journals and conferences. His research interests are in the area of systems engineering and control, with emphasis on process modeling, monitoring, and estimation.

HAZEM NOUNOU (Senior Member, IEEE) is currently a Professor of electrical and computer engineering with Texas A&M University at Qatar. He has more than 19 years of academic and industrial experience. He has significant experience in research on control systems, data-based control, system identification and estimation, fault detection, and systems biology. He has been awarded several National Priorities Research Program (NPRP) research projects in these areas. He has published more than 200 refereed journal and conference papers and book chapters. He has successfully served as the lead PI and a PI on five QNRF projects, some of which were in collaboration with other PIs. He has served as an Associate Editor and on the technical committees of several international journals and conferences.

MOHAMED NOUNOU (Senior Member, IEEE)
is currently a Professor of chemical engineering with Texas A&M University at Qatar. He has more than 19 years of combined academic and industrial experience. He has published more than 200 refereed journal and conference publications and book chapters. His research interests are in the area of systems engineering and control, with emphasis on process modeling, monitoring, and estimation. He has successfully served as the lead PI and a PI on several QNRF projects (six NPRP projects and three UREP projects). He is a Senior Member of the AIChE (American Institute of Chemical Engineers).
HASSANI MESSAOUD prepared his Ph.D. thesis at the University of Nice-Sophia Antipolis, France, in 1993, and his habilitation thesis at the School of Engineers, Tunis, in June 2001. He is currently a Professor and the Head of the Research Lab LARATSI (Code: LR13ES13), School of Engineers, Monastir, Tunisia. His main specialization is process identification and control as well as signal and image processing. He has published more than 200 refereed journal and conference publications and book chapters. He has served as a technical committee member of several international journals and conferences.