Fault Classification in Dynamic Processes Using Multiclass Relevance Vector Machine and Slow Feature Analysis

This paper proposes a modified relevance vector machine with slow feature analysis for fault classification in industrial processes. Traditional support vector machine classification does not work well when there are insufficient training samples. A relevance vector machine, which is a Bayesian learning-based probabilistic sparse model, is developed to determine the probabilistic prediction and sparse solutions for the fault category. This approach has the benefits of good generalization ability and robustness to small training samples. To maximize the dynamic separability between classes and reduce the computational complexity, slow feature analysis is used to extract the inner dynamic features and reduce the dimension. Experiments comparing the proposed method with relevance vector machine and support vector machine classification are performed using the Tennessee Eastman process. Over all faults, the relevance vector machine achieves a classification rate of 39%, while the proposed algorithm achieves an overall classification rate of 76.1%, which demonstrates the efficiency and advantages of the proposed method.


I. INTRODUCTION
With the increase in industrial plant complexity, extracting useful information from process data has become more difficult, especially when product quality and process safety are critical parameters in the manufacturing process. Data-driven process monitoring techniques have been applied in modern industry to ensure the safety and stability of processes [1]–[8]. In recent years, many multivariate statistical process monitoring algorithms have been proposed for large-scale industrial processes. He et al. proposed a transition identification and process monitoring method for multimode processes using a dynamic mutual information similarity analysis [9]. A supervised non-Gaussian latent structure was introduced by He et al. to model the relationship between predictor and quality variables [10]. He and Zeng proposed double-layer distributed process monitoring based on hierarchical multiblock decomposition for large-scale distributed processes [11]. However, multivariable statistical methods are not designed for fault classification. Fault detection identifies whether the process is under normal conditions, but cannot provide any information about the fault type. The main idea of fault classification is to build a classifier on the basis of some known fault categories. Fault classification thus provides the connection between a detected fault occurrence and the known faults, and recognizing the type of fault is an important step in process monitoring.

A. FAULT CLASSIFICATION METHODS
Fisher discriminant analysis (FDA) has been applied for fault classification [12]. However, Chiang et al., who investigated the advantages of FDA and support vector machine (SVM), showed that nonlinear SVM outperforms FDA [13]. Jing and Hou studied the multiclass classification problem of SVM and principal component analysis (PCA) [14], while Gao and Hou improved the multiclass SVM using a PCA approach and applied it to the Tennessee Eastman (TE) process [15]. An SVM classifier is the most commonly used classification technique. However, SVM has the following three disadvantages [16], [17]. First, the number of support vectors grows as the size of the training set increases, which causes the computational complexity to rapidly increase. Second, the output of SVM does not use a probabilistic approach for estimating the conditional distribution, which implies that the prediction cannot capture the uncertainty. This means that the accuracy of SVM is sensitive to the training set, so the SVM training set should include a variety of training samples. Third, the classification performance of SVM is sensitive to its parameters. Therefore, a probabilistic and sparse classification algorithm needs to be developed.
Relevance vector machine (RVM) [16], [17] is a classification method that does not have any of the above limitations. Given its probabilistic Bayesian framework, RVM can be applied in cases where there are limited training samples. However, no research exists on the application of RVMs to the data-driven fault classification of chemical processes. Considering the sensitivity of the RVM prediction to the input dimension of the training set in industrial data, a sound method should be used to transform the training set into a lower-dimensional feature space, which maximizes the separability between classes.

B. DYNAMIC DIMENSIONAL REDUCTION METHODS
PCA, a widely used multivariate statistical algorithm, can be used to reduce the dimension of the data. However, PCA is a static dimension reduction method, and it is sometimes necessary to account for the time dependence in dynamic systems. In recent years, slow feature analysis (SFA) [18] has become a popular method for extracting dynamic features by learning the temporal features that vary slowly. Shang et al. pointed out that the time dynamics in the extracted slow features (SFs) are an indicator of process changes [19]. Modified SFA methods have also been proposed [20], [21]. SFA can extract dynamic representations of the original data at different levels of dynamics. The above studies have shown that SFA can readily extract dynamic features. However, in previous studies, the SFA algorithm was only used for fault detection, which provides no information about the fault type. In fact, the SFA algorithm has not previously been applied to fault classification.

C. MOTIVATION FOR THIS PAPER
In modern chemical processes, sufficient training samples for fault classifiers are usually not available, which degrades the performance of traditional classifiers. RVM has the advantages of sparsity and probabilistic prediction, and it can be applied when training samples are insufficient. Furthermore, many chemical processes are high-dimensional and have dynamic characteristics. Such processes therefore require a dynamic dimension reduction method to transform the training set into a lower-dimensional feature space. Since the SFA algorithm reveals the inner dynamic characteristics of the faulty samples, it indirectly increases the classification performance of RVM. Therefore, it is natural to combine SFA and RVM.
The contributions of this paper are the development of a robust, sparse RVM fault classification method that can be used when there are insufficient training samples and an SFA dimension reduction method that when combined with the RVM classifier can extract the distinct features, and hence, enhance the classification results. This paper proposes a modified multiclass RVM with SFA fault classification strategy for use in industrial processes. First, an SFA dynamic feature extraction model is built on the benchmark data. Then the important features are selected in order to reduce the dimension. After reducing the dimension, the high-dimensional training set is transformed into a low-dimensional dynamic feature space, which decreases the computational complexity of the RVM algorithm. Afterwards, some typical fault samples are collected and used for RVM training. RVM classifies the type of fault, providing a connection between the detected faulty samples and the known faults.

NOMENCLATURE
I: Identity matrix
s(t): Slow feature vector
S_k: Selected slow feature matrix
s_test: Slow features of the test sample
t_i: The label for sample i
X: Data matrix
x_test: Testing sample
w: Weight vector
α: Classification parameters
M: The number of classes
S: Slow feature matrix
S_train: Training set for RVM
ṡ: First derivative of s
x(t): Input sample
X_train: Training set
W: Weight matrix
W_k: Selected weight matrix
⟨·⟩_t: Expectation over time

II. THEORETICAL BASIS
In this section, the SFA, SVM and RVM algorithms are reviewed. As well, the training results of the SVM and RVM algorithms are compared using a numerical example.

A. SLOW FEATURE ANALYSIS
SFA [22]–[24] extracts slowly varying features. The input signal is expressed as x(t) = [x_1(t), x_2(t), ..., x_m(t)]^T. The optimization objective of SFA is to find a transformation function g(x) such that the feature s(t) = g(x(t)) varies as slowly as possible [22], [23]. The function g(x) = [g_1(x), g_2(x), ..., g_m(x)]^T is the transformation function, while s(t) = [s_1(t), s_2(t), ..., s_m(t)]^T is the slow feature. The optimization problem for SFA is

  min ⟨ṡ_i²⟩_t,  (1)

with the constraints

  ⟨s_i⟩_t = 0,  (2)
  ⟨s_i²⟩_t = 1,  (3)
  ⟨s_i s_j⟩_t = 0, ∀ j < i,  (4)

which enforce zero mean, unit variance, and decorrelation, respectively. In the linear case, the SFA algorithm is formalized as

  s(t) = W x(t),  (5)

where W is the weight matrix, expressed as W = [w_1, w_2, ..., w_m]^T. The whitening step is carried out by the singular value decomposition (SVD) of R = ⟨x(t)x(t)^T⟩_t, which is given as

  R = U Ω U^T.  (6)

The whitening matrix is Q = Ω^{−1/2} U^T. Then, the whitening step is written as

  z(t) = Q x(t).  (7)

Based on (5) and (7), the SF vector is given as

  s(t) = W Q^{−1} z(t) = P z(t),  (8)

where P = W Q^{−1}. Clearly, ⟨zz^T⟩_t = Q⟨xx^T⟩_t Q^T = I and ⟨z⟩_t = 0. Constraints (3) and (4) imply that

  ⟨ss^T⟩_t = P⟨zz^T⟩_t P^T = PP^T = I,  (9)

i.e., P is an orthogonal matrix. The optimization problem for SFA is therefore to find an orthogonal matrix P that minimizes ⟨ṡ_i²⟩_t, which is given as

  min_{p_i} ⟨ṡ_i²⟩_t = p_i ⟨żż^T⟩_t p_i^T.  (10)

The optimal solution is obtained by taking the SVD of the covariance matrix ⟨żż^T⟩_t, that is,

  ⟨żż^T⟩_t = P^T Λ P,  (11)

where P is the eigenvector matrix and Λ = diag(λ_1, λ_2, ..., λ_m) is the eigenvalue matrix, with the eigenvalues λ_i placed in ascending order. Thus,

  W = P Q.  (12)

B. SUPPORT VECTOR MACHINE
The main idea of the SVM classifier is to find support vectors that define the bounding hyperplanes, so as to maximize the margin between the two planes. Let the training set be (x_i, t_i) for i = 1, 2, ..., n and t_i ∈ {−1, 1}, where x_i is the ith input sample and t_i is the label corresponding to sample x_i. SVM requires solving the optimization problem

  min_{w,b,ξ} (1/2)‖w‖² + C Σ_{i=1}^n ξ_i
  s.t. t_i(w^T x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, ..., n,  (13)

where the slack variable ξ_i represents the violation for training sample x_i. A penalty parameter C is introduced to control the total violation while maximizing the margin. This introduces a trade-off between maximizing the margin and minimizing the violation.
As well, a nonlinear transformation z = (x) is used to project the training data onto a highdimensional linear space. Based on functional theory, a kernel function should satisfy the Mercer condition.
The Lagrange dual problem is

  max_α e^T α − (1/2) α^T Z Z^T α
  s.t. 0 ≤ α_i ≤ C, i = 1, ..., n, Σ_{i=1}^n t_i α_i = 0,  (14)

where Z = [t_1 z_1, ..., t_n z_n]^T, so that (ZZ^T)_{ij} = t_i t_j K(x_i, x_j), and e is a unit column vector, that is, a column vector in which each entry is 1. An input sample x_i for which α_i ≠ 0 is called a support vector (SV).
The discriminant function f(x) for a test sample x can be obtained from

  f(x) = sign( Σ_{i=1}^n α_i t_i K(x_i, x) + b ).  (15)

C. RELEVANCE VECTOR MACHINE
RVM seeks to maximize sparsity by discarding the large number of extremely small weights, and thus the training samples that have no effect on the classification function. Originally, RVM [16], [17], [25], [26] was derived for binary classification. RVM is designed to predict the posterior probability of membership for the input x. The training set for RVM is defined as (x_i, t_i) for i = 1, 2, ..., n, with t_i ∈ {0, 1}. The learning function is

  y(x; w) = Σ_{i=1}^n w_i K(x, x_i) + w_0,  (16)

where K(x, x_i) is a kernel transformation and w_i is the weight. Generally, a Gaussian kernel function is used. Bayes' rule gives the posterior probability of w, that is,

  p(w|t, α) = p(t|w) p(w|α) / p(t|α),  (17)

where p(t|w) is the likelihood, p(w|α) is the conditional prior of the weights given the hyperparameters α = [α_0, α_1, ..., α_n]^T, and p(t|α) is the evidence. The logistic sigmoid function σ(y) = 1/(1 + e^{−y}) and the Bernoulli distribution are used to generalize the linear model to classification, so the likelihood in (17) is written as

  p(t|w) = Π_{i=1}^n σ{y(x_i; w)}^{t_i} [1 − σ{y(x_i; w)}]^{1−t_i}.  (18)

In (17), p(w|α) is a zero-mean Gaussian prior in which each weight w_i is governed by its own precision hyperparameter α_i:

  p(w|α) = Π_{i=0}^n N(w_i | 0, α_i^{−1}).  (19)

The approximation procedure is based on Laplace's method [27], which consists of the following three steps:
1. Given a fixed value of α, find the maximum posterior weights w_MP, which give the location of the mode of the posterior distribution. Since p(w|t, α) ∝ p(t|w)p(w|α), the weights w_MP are obtained by maximizing the penalized logistic log-likelihood function

  log{p(t|w)p(w|α)} = Σ_{i=1}^n [t_i log y_i + (1 − t_i) log(1 − y_i)] − (1/2) w^T A w,  (20)

where y_i = σ{y(x_i; w)} and A = diag(α_0, α_1, ..., α_n).
2. Laplace's method is used to find a quadratic approximation of the log-posterior around its mode. Equation (20) is maximized using the second-order Newton optimization method, for which the Hessian is

  ∇_w ∇_w log{p(t|w)p(w|α)} = −(Φ^T B Φ + A),  (21)

where Φ is the design matrix with entries Φ_ij = K(x_i, x_j) and a first column of ones, and B = diag(β_1, ..., β_n) with β_i = y_i(1 − y_i). Negating and inverting (21) at w_MP gives the posterior covariance Σ = (Φ^T B Φ + A)^{−1}.
3. The hyperparameters are re-estimated as

  α_i^new = γ_i / w_{MP,i}², γ_i = 1 − α_i Σ_ii,  (22)

where Σ_ii is the ith diagonal entry of Σ. Steps 1–3 are repeated until convergence. This optimization procedure forces most of the α_i to infinity, which implies that, based on (19), the corresponding weights w_i are discarded.
The remaining vectors, called relevance vectors (RVs), give a sparse solution. The discriminant function f(x) can be calculated as

  f(x) = Σ_{i=1}^r w_i K(x, x_i) + w_0,  (23)

where r is the number of RVs, x_i is an RV, and w_i is the corresponding remaining weight. Fault classification usually involves more than two classes, whereas RVM and SVM are intrinsically binary classifiers. To convert binary classifiers into multiclass classifiers, the general strategy is to combine an ensemble of binary classifiers according to some decision rule. In this study, the one-versus-one strategy is used [17]. In the one-versus-one approach, an ensemble of models is designed, in which each training set is made up of two of the classes. Therefore, the total number of classifiers is M(M − 1)/2. When testing an unknown sample, all the discriminant functions are calculated and a voting mechanism counts the score of each category. The category that obtains the most votes determines the category of the test sample.

D. COMPARISON OF SVM AND RVM USING A NUMERICAL EXAMPLE
Consider a numerical example with the measurements x = [x_1, x_2], where x_1 is uniformly distributed over the interval [1, 4], ν is random Gaussian noise with zero mean and variance 0.25, a is an unmeasurable variable with a uniform distribution over the interval [1, 3], and y ∈ {−1, 1} is the label for sample x.
The training set consists of 100 samples, of which the first 50 are labelled 1 and the last 50 are labelled −1. Figure 1 shows the results of training the SVM and RVM classifiers. The subfigure on the left shows the results for the SVM classifier, while the subfigure on the right shows the same for the RVM classifier. The circled samples represent the support (respectively, relevance) vectors. Comparing the two subfigures, it can be seen that SVM requires 40 support vectors, while RVM requires only 4 relevance vectors. Thus, RVM exhibits much higher sparsity.

III. SLOW FEATURE ANALYSIS AND MULTICLASS RELEVANCE VECTOR MACHINE-BASED FAULT CLASSIFICATION METHOD
Since the number of SFs extracted using the SFA algorithm is always equal to the number of original variables, dimension reduction is required to select an appropriate number of SFs. The L_2-norm method can be used for SF selection. The main idea of the L_2-norm method is that a row of W with a large L_2-norm is assumed to capture more process information. The squared coefficient can be computed using

  ‖w_i‖² = Σ_{j=1}^m W_ij²,  (24)

where W_ij is the (i, j)-entry of the weight matrix W.
As the training set is chosen randomly from the industrial data set, the temporal correlation between consecutive samples may be broken. To extract the inner dynamic features, benchmark data are therefore collected. The normalized benchmark dataset is written as X. The dynamic feature extraction from the dataset X can be expressed as

  S_k = W_k X,  (25)

where S_k is the selected slow feature matrix and W_k is the selected weight matrix.
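As an illustration of building the SFA feature-extraction model, here is a minimal linear-SFA sketch in NumPy; the toy signals and variable names are ours, not from the paper:

```python
import numpy as np

def sfa_fit(X):
    """Minimal linear SFA. Rows of X are time-ordered samples x(t).
    Returns the slow features S (slowest first) and W such that s = W x."""
    X = X - X.mean(axis=0)                   # enforce the zero-mean constraint
    # Whitening: eigendecompose R = <x x^T>, then z = Q x with Q = Omega^{-1/2} U^T
    omega, U = np.linalg.eigh(np.cov(X.T, bias=True))
    Q = np.diag(omega ** -0.5) @ U.T
    Z = X @ Q.T                              # whitened signal, <z z^T> = I
    # Minimize <sdot_i^2>: eigendecompose the covariance of the differenced signal
    Zdot = np.diff(Z, axis=0)
    lam, P = np.linalg.eigh(np.cov(Zdot.T, bias=True))   # eigenvalues ascending
    W = P.T @ Q                              # overall weight matrix, W = P Q
    return X @ W.T, W

# Toy demonstration: a slow and a fast sinusoid, linearly mixed
tt = np.linspace(0, 4 * np.pi, 500)
sources = np.vstack([np.sin(tt), np.sin(11 * tt)])
X = (np.array([[1.0, 0.5], [0.4, 1.0]]) @ sources).T
S, W = sfa_fit(X)                            # S[:, 0] should recover the slow sine
```

Because `np.linalg.eigh` returns eigenvalues in ascending order, the first column of S is the slowest feature, which here recovers the slow sinusoid (up to sign) from the mixture.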
The normalization of both the training set and the testing set is based on the normalization of the benchmark data. Assume that the training set is expressed as X_train. Dynamic features are extracted to avoid excessive computational complexity during RVM training. The dimension reduction operation is carried out before RVM training, so that

  S_train = W_k X_train.  (26)

Similarly, the testing sample x_test can be written as

  s_test = W_k x_test.  (27)

The training set for each binary RVM model is drawn from S_train using the samples of the corresponding pair of classes. The assignment of a sample to a class is based on the total votes obtained by each class, following the maximization

  c* = arg max_{1≤k≤M} V_k,  (28)

where V_k is the number of votes received by class k. The multiclass RVM approach is summarized in Algorithm 1.
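The projections S_train = W_k X_train and s_test = W_k x_test are plain matrix products. A shape check, using the 33 variables and 21 selected SFs of the TE study and placeholder random data:

```python
import numpy as np

rng = np.random.default_rng(1)
m, k, n = 33, 21, 100                    # variables, selected SFs, training samples
W_k = rng.standard_normal((k, m))        # selected weight matrix from SFA
X_train = rng.standard_normal((m, n))    # normalized training samples as columns
x_test = rng.standard_normal((m, 1))     # one normalized test sample

S_train = W_k @ X_train                  # low-dimensional dynamic training features
s_test = W_k @ x_test                    # test sample in the same feature space
```

The RVM models are then trained on the 21-dimensional columns of S_train instead of the original 33-dimensional samples.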

Algorithm 1 The Multiclass RVM Approach
Step 1: Feature Extraction. Project the test sample into the slow feature space: s_test = W_k x_test.
Step 2: Binary Classification. Evaluate the discriminant function f_ij(s_test) of each of the M(M − 1)/2 binary RVM classifiers and add one vote to the winning class of each pair.
Step 3: Decision Making. Define the class assignment for x using (28). Stop.
For the testing vector s test , the discriminant function set f ij (x) is used to determine the fault category for each binary RVM classifier. Then, a voting mechanism is used to count the score of each fault category. The test vector belongs to the fault category that obtains the most votes. Figure 2 shows the flowchart of the SFA-RVM algorithm.
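The voting mechanism can be sketched as follows; the stub discriminants stand in for trained binary RVM models:

```python
def ovo_vote(s_test, classifiers, n_classes):
    """One-versus-one voting: `classifiers` maps a class pair (i, j) to a
    discriminant function returning > 0 for class i and <= 0 for class j."""
    votes = [0] * n_classes
    for (i, j), f in classifiers.items():
        votes[i if f(s_test) > 0 else j] += 1
    return votes.index(max(votes))       # class with the most votes

# Stub discriminants for a 3-class problem: M(M - 1)/2 = 3 binary models
classifiers = {
    (0, 1): lambda s: +1.0,   # votes for class 0
    (0, 2): lambda s: +1.0,   # votes for class 0
    (1, 2): lambda s: -1.0,   # votes for class 2
}
label = ovo_vote(None, classifiers, n_classes=3)
```

With the stub votes above, class 0 collects two votes against one for class 2, so the sample is assigned to class 0.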

IV. SIMULATIONS AND EXPERIMENTS USING THE TE PROCESS

A. BACKGROUND ABOUT THE TE PROCESS
The TE process was introduced by Downs and Vogel [28]. The control system for the TE process is shown in Figure 3 [29].

B. DIMENSION REDUCTION USING SFA
The dimension reduction is based on the L_2-norm method. The weight vectors are re-ordered in descending order of their L_2-norms. The number of slow features k is determined by

  Σ_{i=1}^k ‖w_i‖² / Σ_{i=1}^m ‖w_i‖² ≥ θ,  (29)

where θ is the selection threshold. The number of SFs depends on the data for the current application. After testing a number of threshold values, the most important SFs are retained with θ = 0.9 for the data used. For the TE process, the number of slow features is 21 according to (29). Figure 4 shows a visualization of the first, second, and third dimensions of the training set, with the different categories of fault samples separated by color. Since Faults 3, 9, and 15 are commonly considered hard to detect, the visualization of the training set does not include them. The training samples are gathered in a small area and are hardly visually separable. The TE process contains 33 variables that are combinations of latent factors; this high dimensionality makes the classification model complex. Figures 5 and 6 show visualizations of the first six dimensions of the SFA dimensionally reduced dataset. Compared with Figure 4, some of the faulty samples in the SFA dimensionally reduced dataset are visually separated from the other fault categories. The extracted dynamic features maximize the separation between classes, which highlights the inner dynamic characteristics of the faulty samples and indirectly increases the classification performance of RVM.
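The SF-count selection can be sketched as below; the cumulative-ratio form of the threshold rule is our interpretation of (29):

```python
import numpy as np

def select_k(W, theta):
    """Smallest k whose top-k rows of W (ranked by squared L2-norm) capture
    at least a fraction theta of the total squared norm."""
    sq = np.sort((W ** 2).sum(axis=1))[::-1]   # descending squared row norms
    ratio = np.cumsum(sq) / sq.sum()
    return int(np.searchsorted(ratio, theta) + 1)

# Made-up weight matrix whose rows have squared norms 5, 2.5, 1.5, 1
# (cumulative ratios 0.5, 0.75, 0.9, 1.0)
W = np.diag([5 ** 0.5, 2.5 ** 0.5, 1.5 ** 0.5, 1.0])
k = select_k(W, theta=0.8)
```

Raising θ keeps more SFs: with the toy matrix above, θ = 0.8 keeps three rows, while θ = 0.99 keeps all four.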

C. SIMULATION RESULTS
To test the online fault classification performance of the proposed method, regular RVM and SVM are used for comparison. The optimal parameters of the SVM are obtained using cross-validation and a grid search. Table 1 gives the correct fault classification rate (CFCR) and the precision for each fault for RVM, SVM, and SFA-RVM, while Table 2 shows the confusion matrix of the classification results using the RVM, SVM, and SFA-RVM methods for each fault type. Each row gives the fault classification results for a particular fault. The bold numbers on the diagonal show the number of samples that are properly classified. The CFCR is calculated using

  CFCR = (Number of samples correctly classified) / (Total number of testing samples).  (30)

The precision for each fault is calculated by

  Precision = (Number of samples correctly classified) / (Total number of samples classified as this fault).  (31)

From Tables 1 and 2, it can be seen that all three methods have very high correct fault classification rates for Faults 1, 2, and 7. The fault magnitudes for Faults 1, 2, and 7 are large, and their distinct fault features make these faults easy to classify. However, for all methods, the correct fault classification rates for Faults 3, 9, and 15 are very poor. This can be attributed to the fact that these three faults have little extractable fault information, which confuses the fault classification algorithms. The fault classification results excluding Faults 3, 9, and 15 are also given in Table 1. Furthermore, RVM has a low correct fault classification rate (< 30%) for Faults 4, 5, 10, 11, 16, 19, and 20. These faults are conflated with each other, that is, RVM cannot distinguish between them. Compared with RVM, SVM has better fault classification results for these faults. However, SVM has poor correct fault classification rates for Faults 6 and 18, where the accuracy is only 10%. The confusion matrix in Table 2 shows that most test samples for Faults 6 and 18 are incorrectly classified as Fault 21.
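Both rates follow directly from a confusion matrix. A sketch with made-up counts for three fault classes:

```python
import numpy as np

# Hypothetical confusion matrix (not from the paper's tables):
# rows = true fault, columns = predicted fault
C = np.array([[45,  3,  2],
              [ 5, 40,  5],
              [ 0, 10, 40]])

cfcr = np.trace(C) / C.sum()             # correct / total test samples, as in (30)
precision = np.diag(C) / C.sum(axis=0)   # correct / classified as each fault, as in (31)
```

Note that the CFCR is a single overall number, while the precision is computed per fault from the column sums of the confusion matrix.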
The proposed SFA-RVM algorithm improves the correct fault classification rates. SFA-RVM achieves the best classification rates for Faults 5, 6, 10, 11, 12, 16, 19, and 20. Of note, the classification rates for SFA-RVM are over 70% for all faults excluding Faults 3, 9, and 15 (the last three columns in Table 1). Furthermore, the classification rates for Faults 1, 2, 5, 6, and 7 are over 90%. Thus, the proposed SFA-RVM is more robust than RVM and SVM.
Fault 10 is a random variation in the C feed temperature. RVM can hardly recognize Fault 10, with low fault classification rates (approximately 10%). SVM shows better fault classification results, but its rates are still lower than 50%. In previous fault detection studies, the fault detection rates of some modified process monitoring methods for Fault 10 are approximately 0.85 [7], [30], [31], while traditional methods such as PCA and DPCA can only detect 50% of the fault samples. The reason Fault 10 is hard to recognize is that it is a random fault whose amplitude varies considerably. The amplitudes of some fault samples are extremely low, which makes them similar to normal samples. Fault detection is a binary classification problem, which is much easier than fault classification, especially for the TE process. The fault classification rate of SFA-RVM (excluding Faults 3, 9, and 15) reaches 0.83, which is close to the best fault detection rate.
Fault 11 is a random change in the reactor cooling inlet temperature and has the same problem as Fault 10 in that it also has low fault amplitudes. The fault classification rates of RVM are 0.11 and 0.13, while SVM obtains better results with a classification rate over 50%. In previous studies, the fault detection rates of some advanced methods are approximately 0.85 [7], [30], [31]. The fault classification rate of SFA-RVM (excluding Faults 3, 9, and 15) is 0.74. It should be noted that Faults 4 and 11 both occur in the reactor cooling inlet temperature. The only difference between them is that Fault 4 is a step fault whereas Fault 11 is random. Some of the fault samples from Fault 11 are quite similar to samples from Fault 4, which confuses the fault classification model. It can be seen in Table 2 that 74 test samples belonging to Fault 11 are classified as belonging to Fault 4. In industrial processes, different faults may produce similar evidence because of the presence of the control system. Therefore, the definition of fault categories before model training is also important. Table 3 shows the overall classification rates for RVM, SVM, decision tree, SFA-based decision tree (SFA-decision tree), and SFA-RVM, as well as the classification rates for different numbers of training samples (100 and 200). For all faults (100 training samples), RVM can only recognize 39% of the test samples, while the accuracy of SVM reaches 55.8%. The SFA-RVM algorithm has an overall classification rate of 76.1%, making it the best of the fault recognition algorithms. The overall classification rates excluding Faults 3, 9, and 15 are higher than those for all faults. The RVM approach operates on the high-dimensional data directly, which causes its accuracy to be lower than 50%. Since SVM is not a probabilistic algorithm, its training set needs to be complete enough to contain a variety of samples.
If the training set is limited, the accuracy of SVM is sensitive to the size of the training set. For example, some of the TE process faults are random faults, which means that the fault amplitude varies. If training sets chosen randomly at different times differ greatly, the trained SVM models may also differ considerably. The proposed SFA-RVM algorithm extracts the inner dynamic features to maximize the separability between classes, and the dimension reduction avoids excessive computational complexity during RVM training. The extracted features are good dynamic representations of the TE process. RVM has probabilistic prediction and sparsity properties, which yield good generalization ability and robustness for small training sets. The proposed method thus takes advantage of both SFA, for extracting dynamic features, and RVM, for classification.

V. CONCLUSION
In this paper, a modified RVM with the SFA algorithm is proposed for fault classification in dynamic industrial processes. SFA extracts lower-dimensional dynamic features, which maximizes the separability between classes and reduces the computational complexity of RVM. Simultaneously, a one-versus-one multiclass RVM model is built to recognize the fault category when there are insufficient training samples. RVM classifies the type of fault, providing the connection between the detected faulty samples and known faults, and has the advantages of probabilistic prediction and sparsity for small training sets. Detailed comparative studies between the SFA-RVM method and the traditional RVM and SVM methods were carried out using the TE benchmark process. The simulations show that the proposed SFA-RVM method has much better overall fault classification rates than either the RVM or the SVM method alone. For all faults, RVM can only recognize 39% of the test samples, while the SFA-RVM algorithm has an overall classification rate of 76.1%, which is a remarkable improvement.
However, the proposed method has not been tested on a real process. Collecting sufficient faulty data can be an issue for real processes. Furthermore, the knowledge of the fault type may not be clear, that is, the training samples for fault classifiers may not become available as quickly as required. Limited by the diversity and quantity of training samples for fault classification, a realistic experimental dataset is hard to obtain. Finally, even when faulty data can be obtained, meaningful fault classification may not be possible because some faults have a very weak signal. Thus, future work will focus on applying this method to fault classification in a real process and improving its ability to handle weak faults.