Modified Label Propagation on Manifold With Applications to Fault Classification

In process monitoring, fault classification performance heavily relies on the labels of training data. However, labeled data are scarce and difficult to obtain because they require experienced human annotators. In this paper, a modified label propagation (MLP) method is proposed to propagate labels from labeled data to unlabeled data. The proposed label propagation algorithm has the following advantages: (1) It constructs a global and local consistency framework with the aid of a data graph, manifold learning, and data labels. This framework follows the assumption that data on the manifold will have similar structures, and nearby data will have similar labels. (2) Considering the inner relationship between the unlabeled data and historical data, a new definition for the initial label matrix is offered, which is significant for label propagation. (3) The new method propagates labels in a low-dimensional manifold space, unlike most existing label propagation methods, which propagate them in the original space. The results reveal that under the global and local consistency framework, the soft labels of unlabeled data are predicted more effectively. With the additional soft labels of unlabeled data, an MLP-based fault classification method is introduced. The simulation results obtained using a toy example demonstrate the label propagation performance of the MLP, and those obtained for the penicillin fermentation process verify the effectiveness of the MLP-based fault classification method.


I. INTRODUCTION
For process monitoring operations in control engineering, fault classification plays a very important role in locating the fault and helping operators take correct remedial measures [1]-[5]. However, data collected from an industrial process are usually difficult to classify because of the high-dimensional data characteristics and complex data relationships involved. Based on these facts, some classification methods have been proposed, such as Fisher discriminant analysis (FDA) [6]-[8], support vector machine (SVM) [9], [10], and the k nearest neighbor (kNN) classification [11], [12]. These methods are supervised learning methods, requiring that the classes of all training data are known; that is, all training data are labeled. However, in industrial production processes, labeled data are usually inadequate, and the acquisition of labeled data by employing skilled experts is expensive.
The associate editor coordinating the review of this manuscript and approving it for publication was Moussa Boukhnifer.
Hence, in recent years, to overcome the disadvantages of supervised learning methods, semi-supervised learning (SSL) methods have drawn research interest [13]-[16]. These methods can acquire knowledge via both labeled and unlabeled data for classification and are different from supervised learning methods that are heavily reliant on labeled data. Several SSL methods have been introduced in process monitoring applications. For example, Feng et al. proposed semi-supervised principal component analysis for process monitoring [17]. Yan et al. constructed a semi-supervised mixture-discriminant monitoring scheme for an injection molding process [18]. Zhong et al. proposed a semi-supervised FDA model for fault classification in industrial processes [19]. SSL methods can improve fault detection and fault classification performance even when labeled data are insufficient because the structures and features of unlabeled data are effectively explored. Label propagation is a type of SSL method that has attracted considerable attention recently [20]-[25]. Label propagation is a commonly used method for propagating labels of labeled data to unlabeled data using their similarities and initial states. According to whether the model can handle the outside data directly, existing label propagation methods can be divided into transductive and inductive ones. Transductive methods, such as linear neighborhood propagation [26], special label propagation [27], projective label propagation [28], sparse neighborhood propagation [29], adaptive neighborhood propagation [30], and positive and negative label propagation [31], can predict the labels of unlabeled data inside the given dataset but cannot predict the labels of unlabeled data outside it. Owing to its effectiveness and efficiency, label propagation is applied in many fields.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Several related methods have been researched as well. For example, Wang et al. proposed a label propagation method for synthetic data, and digit and text classification tasks [26]. Zhang et al. proposed a novel mechanism to obtain more supervised information using propagated soft labels through special label propagation [32]. Zhang et al. proposed a nonnegative sparse neighborhood propagation method for semi-supervised image classification [33]. Lin et al. proposed dynamic graph-fusion label propagation for semi-supervised multi-modality classification [34]. Zhang et al. proposed a projective label propagation framework involving label embedding, which can obtain the deep labels of all new data [28]. Zoidi et al. proposed a positive and negative label propagation method, which extends regular label propagation to negative label propagation [31].
However, the aforementioned existing label propagation methods have certain potential shortcomings that may degrade the classification results. First, data graph construction is an independent procedure before label propagation; thus, the similarity weights resulting from precalculations on the data graph may not be jointly optimal for subsequent label propagation. Therefore, traditional label propagation methods may suffer from inaccurately predicted results. Second, in traditional label propagation methods, the initial label vector is usually simply defined as a zero vector for the unlabeled data, without considering the inner relationship between the unlabeled data and historical data. However, the relationship between unlabeled data and labeled data and their implicit characteristics are valuable and should be explored for defining the initial label matrix. Third, most traditional label propagation methods propagate labels based on the original high-dimensional space. However, real-world process data usually contain various noises; undesirable, unnecessary, and irrelevant features; and even corruptions. These features may lead to inaccurate label prediction results in practical scenarios.
To address these drawbacks, a modified label propagation (MLP) method is proposed in this paper. The major contributions of this study are summarized as follows:
1) A global and local consistency framework is constructed based on a data graph, manifold learning, and label propagation. This framework follows the global consistency assumption that data on the manifold will have similar structures and the local consistency assumption that nearby data will have similar labels. By minimizing the objective function, optimal parameters such as the similarity weights and projection matrix are obtained and used for further label propagation.
2) In the initialization phase of label propagation, considering the inner relationship between the unlabeled data and historical data, a new definition of the initial label matrix is proposed based on the similarity and weight of each class, which is significant in label propagation processes.
3) By minimizing the feature variations with respect to neighboring structures, a low-dimensional manifold can be obtained. Moreover, the proposed method propagates labels in this low-dimensional manifold. Therefore, the label propagation result is expected to be more accurate because a low-dimensional manifold can remove noise and unfavorable features and preserve the significant features hidden in the data.
The remainder of this paper is organized as follows. A new MLP method is proposed in Section II. A fault classification method based on the MLP is introduced in Section III. A toy example and the penicillin fermentation process (PFP) are discussed in Section IV to demonstrate the effectiveness of the proposed approaches. Finally, the conclusions are summarized in Section V.

II. MODIFIED LABEL PROPAGATION METHOD
Label propagation is a method for propagating labels of labeled data to unlabeled data, according to the relationship between the two data classes [20]- [25]. However, traditional label propagation methods propagate labels after performing an independent data graph construction process, in which the similarity weights may not be optimal for subsequent label propagation. Moreover, most traditional label propagation methods propagate labels in the original high-dimensional space, which usually contains undesirable and unnecessary features. Considering this circumstance, an MLP method is proposed, which is described as follows.
A given dataset is mapped onto a graph, and each sample in the dataset corresponds to a node in that graph [27], [35]. The dataset is $X = [X_L, X_U] \in R^{s \times n}$, where $X_L = [x_1, \ldots, x_l] \in R^{s \times l}$ is the labeled dataset, $X_U = [x_{l+1}, \ldots, x_{l+u}] \in R^{s \times u}$ is the unlabeled dataset, $s$ is the original dimensionality of each sample, and $l + u = n$ is the number of samples. Assume that $C = \{1, 2, \ldots, c\}$ is the class label set and that each sample $x_i$ in $X_L$ has a unique label vector. The objective function of the MLP model is formulated in (1), where $P \in R^{s \times d}$ is a projection matrix onto the low-dimensional space; $S$ is the similarity weight matrix on the graph; $F = [f_1, f_2, \ldots, f_n] \in R^{n \times c}$ is the soft label matrix; $D \in R^{n \times c}$ and $Q \in R^{n \times c}$ are the initial label matrices, which will be explained later in this section; $\alpha$, $\beta$, $\gamma_1$, and $\gamma_2$ are the regularization parameters; and $I$ is the identity matrix.
The first term of (1) is the manifold smoothness term. It expresses the total variation in the manifold features with respect to the neighboring structures. It follows the global consistency assumption that data on the manifold will have similar structures.
The second term is the label smoothness term. It expresses the total variation in the labels with respect to the neighboring structures. This term follows the local consistency assumption that nearby data will have similar labels.
The third term is the fitting term; it expresses how well the predicted soft labels fit the initial labels. It is worth noting that the manifold smoothness and label smoothness terms share the same similarity weight matrix. Therefore, this method can construct a global and local consistency framework to explicitly integrate the data graph, manifold learning, and label propagation. Parameters P, S, and F in (1) are unknown. Because these parameters are coupled, there is no direct method to solve for them. To this end, we adopt an alternating optimization strategy that updates one of the parameters while fixing the others. The procedure for optimizing the objective function is described below in detail.
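For orientation, one objective that is consistent with the update rules stated in this section (the eigenvalue problem on $X L X^T$ and the closed-form update (10)) can be sketched as follows. The orthogonality constraint on $P$ and the unit weight on the manifold term are assumptions of this sketch, and any constraint on $S$ (for example, a neighborhood or normalization constraint) is left out because it cannot be inferred from those updates.

```latex
\min_{P,S,F}\;
  \underbrace{\left\| (I - S)\, X^{T} P \right\|_F^2}_{\text{manifold smoothness}}
  + \alpha \underbrace{\left\| (I - S)\, F \right\|_F^2}_{\text{label smoothness}}
  + \beta \underbrace{\left( \gamma_1 \left\| F - D \right\|_F^2
        + \gamma_2 \left\| F - Q \right\|_F^2 \right)}_{\text{fitting}}
  \quad \text{s.t. } P^{T} P = I .
% With L = (I - S)^T (I - S), the first two terms equal
% tr(P^T X L X^T P) and alpha * tr(F^T L F), respectively.
```

In this form both smoothness terms depend on the same matrix $L = (I - S)^T (I - S)$, which matches the remark that they share one similarity weight matrix.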
The variables are first initialized [22]. Then, with the other parameters fixed, the projection matrix P is calculated in the low-dimensional manifold [36]-[38]. The objective function in P reduces to an eigenvalue problem: $P_{t+1}$ at the (t + 1)th iteration is formed from the d eigenvectors that correspond to the d smallest eigenvalues of $X L_t X^T$ at the t-th iteration [39]. Here, $L_t = (I - S_t)^T (I - S_t)$, and d denotes the dimensionality of the low-dimensional manifold space.
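The projection update described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `update_projection`, the random data, and the row-normalized `S` are hypothetical choices made only so the example runs.

```python
import numpy as np

def update_projection(X, S, d):
    """Return P (s x d): eigenvectors of X L X^T for the d smallest eigenvalues.

    X: (s, n) data matrix with samples as columns.
    S: (n, n) similarity weight matrix.
    """
    n = X.shape[1]
    I = np.eye(n)
    L = (I - S).T @ (I - S)              # L_t = (I - S_t)^T (I - S_t)
    M = X @ L @ X.T                      # symmetric positive semidefinite, s x s
    M = 0.5 * (M + M.T)                  # remove round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(M) # eigenvalues returned in ascending order
    return eigvecs[:, :d]

# Hypothetical data: 20 samples with 4 variables each.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 20))
S = rng.random((20, 20))
np.fill_diagonal(S, 0.0)
S /= S.sum(axis=1, keepdims=True)        # row normalization is an assumption
P = update_projection(X, S, d=2)
Z = P.T @ X                              # low-dimensional features, d x n
```

Because `np.linalg.eigh` returns orthonormal eigenvectors, the columns of `P` satisfy $P^T P = I$ without further normalization.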
After the low-dimensional manifold projection matrix P is computed and the other parameters are fixed, the similarity weight matrix S is updated. S is calculated from the derivative of its objective function J(S) with respect to S.
By setting ∂J(S)/∂S to zero, $S_{t+1}$ at the (t + 1)th iteration is obtained. In this way, the similarity weights are obtained within the global and local consistency framework, which is optimal for the next iteration. Next, the predicted soft label matrix F is updated by minimizing the objective function (7). The first term of (7) is a label smoothness term. It indicates that similar samples have similar labels. The second term is a fitting term, which measures the difference between the predicted soft labels and initial labels.
Because initial labels are significant to the label propagation process, a detailed definition of initial labels is given in this section. Let $D = [d_1, d_2, \ldots, d_{l+u}] \in R^{(l+u) \times c}$ denote the initial labels of all data based on similarity. For labeled data, $d_{i,j} = 1$ if $x_i$ is labeled as $j \in \{1, 2, \ldots, c\}$; otherwise, $d_{i,j} = 0$. For unlabeled data, the initial values of $d_{i,j}$ are defined from the similarity weights. In Q, $[q_{i,1}, q_{i,2}, \ldots, q_{i,c}]$ is the initial label vector of $x_i$, computed according to (8). Fig. 1 illustrates the calculation procedure of $q_{i,j}$ for the unlabeled data. The red point at the center of Fig. 1 indicates unlabeled data. The black points surrounding the red point indicate labeled data. The digits on top of the black points are the labels, and the digits on the edges between the red point and black points indicate weights. In Fig. 1, the red point has six neighbors from three different classes. According to (8), the weights of the first, second, and third classes of data are 0.3 + 0.2 + 0.1 = 0.6, 0.3 + 0.4 = 0.7, and 0.2, respectively. The total weight is 0.6 + 0.7 + 0.2 = 1.5. Finally, the initial label vector of the red point is [0.6/1.5, 0.7/1.5, 0.2/1.5] ≈ [0.400, 0.467, 0.133].

Taking the derivative of (7) with respect to F and setting ∂J(F)/∂F to zero, the updated $F_{t+1}$ at the (t + 1)th iteration is

$$F_{t+1} = \left( \alpha L_{t+1} + \beta (\gamma_1 + \gamma_2) I \right)^{-1} \beta \left( \gamma_1 D + \gamma_2 Q \right). \quad (10)$$

Parameters 0 < α < 1 and 0 < β < 1 regulate the relative significance of the label smoothness and fitting terms in (1), respectively, and are restricted such that α + β = 1. Moreover, parameters 0 < γ_1 < 1 and 0 < γ_2 < 1 regulate the relative significance of the initial labels D and Q, respectively, and are restricted such that γ_1 + γ_2 = 1. Parameter adjustment is mainly based on data structures and historical experience or knowledge. For example, when the data in each class are concentrated, γ_1 can be increased accordingly. When there is a sudden change in the industrial production process or the process data are highly clustered around a few operating points, γ_2 can be increased accordingly.
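The Fig. 1 calculation of the initial label vector can be reproduced with a short sketch. The helper name `initial_label_vector` is hypothetical; the neighbor labels and edge weights are the ones quoted in the text for Fig. 1.

```python
import numpy as np

def initial_label_vector(neighbor_labels, neighbor_weights, n_classes):
    """Sum the edge weights per class and normalize by the total weight."""
    class_weight = np.zeros(n_classes)
    for label, weight in zip(neighbor_labels, neighbor_weights):
        class_weight[label - 1] += weight   # classes are numbered 1..c
    return class_weight / class_weight.sum()

# Fig. 1 example: six labeled neighbors from three classes.
labels = [1, 1, 1, 2, 2, 3]
weights = [0.3, 0.2, 0.1, 0.3, 0.4, 0.2]
q = initial_label_vector(labels, weights, n_classes=3)
print(np.round(q, 3))   # approximately 0.400, 0.467, 0.133
```

The entries sum to one, so each row of the initial label matrix can be read as a class-membership distribution for that unlabeled sample.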
Parameters P, S, and F are updated iteratively as described above until the convergence condition is satisfied, that is, until the difference between the predicted soft labels of two successive iterations falls below a threshold ε. The optimal values of P, S, and F are then obtained. Finally, the label of sample $x_i$ is determined as $\arg\max_j \{f_{ij}\}$, i.e., the column index of the largest element in $f_i$.
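The closed-form update (10), the convergence check, and the final label assignment can be sketched as below. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the Frobenius norm in the stopping test is an assumed choice, the four-sample `S`, `D`, and `Q` are made-up inputs, and the update of S is not included.

```python
import numpy as np

def update_soft_labels(S, D, Q, alpha, beta, gamma1, gamma2):
    """Closed-form update (10) for the soft label matrix F (n x c)."""
    n = S.shape[0]
    I = np.eye(n)
    L = (I - S).T @ (I - S)
    A = alpha * L + beta * (gamma1 + gamma2) * I
    B = beta * (gamma1 * D + gamma2 * Q)
    return np.linalg.solve(A, B)          # avoids forming the explicit inverse

def has_converged(F_new, F_old, eps=1e-6):
    # Frobenius norm of the change; the choice of norm is an assumption.
    return np.linalg.norm(F_new - F_old) < eps

# Hypothetical graph: samples 0-1 are neighbors, samples 2-3 are neighbors.
S = np.array([[0., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 0., 0., 1.],
              [0., 0., 1., 0.]])
# Samples 0 and 3 are labeled (classes 1 and 2); samples 1 and 2 are unlabeled.
D = np.array([[1., 0.], [0., 0.], [0., 0.], [0., 1.]])
Q = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])

F = update_soft_labels(S, D, Q, alpha=0.5, beta=0.5, gamma1=0.4, gamma2=0.6)
predicted = np.argmax(F, axis=1) + 1      # column index of the largest element
```

In this small example the two unlabeled samples inherit the class of their labeled neighbor, so `predicted` is `[1, 1, 2, 2]`.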

III. FAULT CLASSIFICATION BASED ON MODIFIED LABEL PROPAGATION
Label propagation is a transductive learning procedure for predicting the labels of unlabeled data. In other words, label propagation can predict the soft labels of unlabeled data only in a given dataset, which limits its generalization. In the following paragraphs, a fault classification approach is proposed based on the MLP and FDA models [6], [40]. The proposed MLP method is used to obtain the soft labels of unlabeled data. Then, the within-class scatter matrix $S_w^U$ and between-class scatter matrix $S_b^U$ of the unlabeled data, and the within-class scatter matrix $S_w^L$ and between-class scatter matrix $S_b^L$ of the labeled data can be obtained. Using them in FDA, a semi-supervised counterpart, called SFDA, is derived. The regularized between-class scatter matrix $S_b$ and the regularized within-class scatter matrix $S_w$ combine the scatter matrices of the labeled and unlabeled data, where θ is a trade-off parameter adjusting the proportion between the labeled data and unlabeled data. Then, the projection matrix W is computed by maximizing the discriminant criterion, where ω is the regularization parameter. The original data can be projected onto a lower-dimensional space via W. Then, a classifier is designed to classify data in this low-dimensional space. In this study, we used the probability density function as the classifier.
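A rough sketch of the SFDA projection step follows. Several details are assumptions rather than statements from the text: the scatter matrices are mixed as convex combinations weighted by θ, the regularization is added as ωI to the within-class scatter, and W is taken from the leading eigenvectors of $(S_w + \omega I)^{-1} S_b$. The function names and the synthetic two-class data are hypothetical, and samples are stored as rows here.

```python
import numpy as np

def scatter_matrices(X, y, classes):
    """Within- and between-class scatter of X (n x s, samples as rows)."""
    mean_all = X.mean(axis=0)
    s = X.shape[1]
    Sw = np.zeros((s, s))
    Sb = np.zeros((s, s))
    for c in classes:
        Xc = X[y == c]
        if len(Xc) == 0:
            continue
        mean_c = Xc.mean(axis=0)
        centered = Xc - mean_c
        Sw += centered.T @ centered
        diff = (mean_c - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    return Sw, Sb

def sfda_projection(XL, yL, XU, yU_soft, classes, theta=0.5, omega=1e-3, r=1):
    SwL, SbL = scatter_matrices(XL, yL, classes)
    SwU, SbU = scatter_matrices(XU, yU_soft, classes)
    Sw = theta * SwL + (1 - theta) * SwU       # assumed convex combination
    Sb = theta * SbL + (1 - theta) * SbU       # assumed convex combination
    Sw_reg = Sw + omega * np.eye(Sw.shape[0])  # assumed form of regularization
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw_reg, Sb))
    order = np.argsort(-eigvals.real)          # largest eigenvalues first
    return eigvecs[:, order[:r]].real          # W: s x r

# Hypothetical data: two classes separated along the first variable.
rng = np.random.default_rng(0)

def make_classes(n):
    X1 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(n, 2))
    X2 = rng.normal(loc=[5.0, 0.0], scale=0.5, size=(n, 2))
    return np.vstack([X1, X2]), np.array([1] * n + [2] * n)

XL, yL = make_classes(3)     # few labeled samples
XU, yU = make_classes(20)    # unlabeled samples with labels from propagation
W = sfda_projection(XL, yL, XU, yU, classes=[1, 2])
Z = XU @ W                   # projections, one row per sample
```

With θ close to 1 the projection is driven mainly by the labeled data, and with θ close to 0 by the soft-labeled data, which is the trade-off the parameter is described as controlling.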
First, the mean $\mu_j$ and covariance $\xi_j$ of the j-th class data are calculated in the low-dimensional space. They are calculated from $\mu_j^l$ and $\mu_j^u$, the means of the j-th class of labeled data and unlabeled data, respectively, and from $\xi_j^l$ and $\xi_j^u$, the covariances of the j-th class of labeled data and unlabeled data, respectively.
The linear discriminant analysis method assumes that each class of data obeys the Gaussian distribution. The conditional probability density function [34], [35] of the lower-dimensional projection z can be represented by the mean $\mu_j$ and covariance $\xi_j$:

$$p(z \mid j) = \frac{1}{(2\pi)^{r/2} \, |\xi_j|^{1/2}} \exp\left( -\frac{1}{2} (z - \mu_j)^T \xi_j^{-1} (z - \mu_j) \right),$$

where r is the dimensionality of the low-dimensional space. Suppose that the prior probability of each class is equal; then, according to Bayes' formula [43], the posterior probability P(z ∈ j | z) can be calculated. For any new data, $x_{new}$, the projection onto the low-dimensional space is z. Then, z is substituted into the conditional probability density functions of all classes of data. Thus, the conditional and posterior probabilities can be calculated. After that, the category of the new data is identified as the class with the largest posterior probability. The process modeling and monitoring procedures based on MLP are summarized in the following subsections.

A. MODELING
(2) Construct a neighborhood graph and initialize S, F, D, and Q.
(3) Update P, S, and F until F converges, and then obtain the soft label matrix F.
(4) Given a set of training data, including labeled data and soft labeled data, use the SFDA method to obtain the projection matrix W .
(5) Calculate µ j and ξ j of the j-th class of data.

B. MONITORING
(1) Obtain new data, x new .
(2) Calculate the low-dimensional projection z using z = W T x new .
(3) Calculate the conditional probability and posterior probability.
(4) Identify the class of x new.
The flowchart of the MLP-based fault classification approach is shown in Fig. 2.
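The monitoring steps above can be sketched as follows, assuming Gaussian class-conditional densities and equal priors as stated in this section. The function names, the identity projection `W`, and the class means and covariances are hypothetical values chosen only to make the example runnable.

```python
import numpy as np

def gaussian_density(z, mean, cov):
    """Multivariate Gaussian density of z for one class."""
    r = len(mean)
    diff = z - mean
    norm_const = np.sqrt((2 * np.pi) ** r * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm_const

def classify(x_new, W, means, covs):
    z = W.T @ x_new                              # step (2): projection
    densities = np.array([gaussian_density(z, m, c)
                          for m, c in zip(means, covs)])
    posteriors = densities / densities.sum()     # equal priors cancel out
    return int(np.argmax(posteriors)) + 1, posteriors

# Hypothetical two-class model in a two-dimensional projected space.
W = np.eye(2)
means = [np.array([0.0, 0.0]), np.array([5.0, 0.0])]
covs = [np.eye(2), np.eye(2)]

label_a, post_a = classify(np.array([4.5, 0.2]), W, means, covs)
label_b, post_b = classify(np.array([0.1, -0.3]), W, means, covs)
```

For samples far from every class the densities can underflow to zero, so a practical version would likely compare log-densities instead of normalizing raw densities.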

IV. CASE STUDY
In this section, a toy example is used to illustrate the label propagation performance of the MLP, and the PFP is adopted to describe the performance of the MLP-based fault classification method.

A. TOY EXAMPLE
In the toy example [44], [45], we introduce the two-moon dataset. The two-moon dataset contains two classes (called class 1 and class 2), each of which is located in a half-moon shape. We generate two testing datasets, namely testing dataset 1 and testing dataset 2. In testing dataset 1, each class consists of 60 samples; in testing dataset 2, each class consists of 30 samples. We consider the traditional label propagation (LP), linear neighborhood propagation (LNP), and positive and negative label propagation (PNLP) methods for comparison. Table 1 lists the accuracy levels of the LP, LNP, PNLP, and MLP methods used in this case study.
Let us consider testing dataset 2, as shown in Fig. 3. Five samples from each class are labeled and represented by red solid squares and blue solid triangles, respectively. The remaining 25 samples in each class are unlabeled data represented by black points. In this simulation, d is set to 1, k is 10, both α and β are 0.5, γ_1 and γ_2 are 0.4 and 0.6, respectively, and ε is 1e-6. The label propagation results obtained using LP, LNP, PNLP, and MLP are shown in Fig. 4. In this case study, the accuracy levels of the LP, LNP, and PNLP methods are 72%, 54%, and 66%, respectively. In contrast, the MLP method yields the desired label propagation result, with an accuracy of 94%. From these results, we infer that the MLP method can effectively leverage labeled and unlabeled data in the label propagation procedure, and that the optimal global and local consistency framework in the MLP is beneficial for label propagation.

B. PENICILLIN FERMENTATION PROCESS
The PFP is a complex biochemical process [46]- [49]. The process flow diagram of the PFP is shown in Fig. 5. It consists of two major operational phases: bacterial growth phase and penicillin secretory phase. Because data generated under different initial conditions and operation modes have different categories, the PFP is a good candidate for evaluating the performance of the MLP-based fault classification method.
Data used for this evaluation were generated using Pensim V2.0. Training data and testing data from different classes were obtained by setting different initial conditions, set points, temperature controllers, and controller types for monitoring the pH. To achieve the best fault classification performance, 14 measurement variables were selected for monitoring, which are listed in Table 2. The PFP was run under four different modes (the normal mode, Fault 1 mode, Fault 2 mode, and Fault 3 mode) to generate different types of data (Table 3). The normal mode was run when default initial conditions, set points, and temperature controller settings were used. A PID controller was used to regulate the pH. Fault 1 was caused by increasing the aeration rate with a ramp fault. Fault 2 was caused by increasing the agitator power with a step fault. Fault 3 was caused by increasing the substrate feed rate with a ramp fault. The normal operation mode lasted for approximately 220 h.
In the modeling phase, the training dataset contained 300 samples (120 normal samples and 60 samples each of Fault 1, Fault 2, and Fault 3) from four different classes. We generated two testing datasets, namely testing dataset 1 and testing dataset 2. In testing dataset 1, the magnitudes of Fault 1, ...
After the FDA, LP-SFDA, LNP-SFDA, PNLP-SFDA, and MLP-SFDA models are established, the corresponding projection matrix W in the five models can be obtained, from which the corresponding low-dimensional projection z is calculated. Fig. 6 shows the first, second, and third directions of the projection results of the test data with the FDA, LP-SFDA, LNP-SFDA, PNLP-SFDA, and MLP-SFDA methods. The projections of the test data in the low-dimensional subspace are separated using the five models. However, the projections obtained with FDA, as shown in Fig. 6(a), are closer to each other than those obtained with LP-SFDA, LNP-SFDA, PNLP-SFDA, and MLP-SFDA, as illustrated in Figs. 6(b), 6(c), 6(d), and 6(e), respectively. Fig. 6(a) demonstrates that when only a few labeled data are used, the FDA-based discriminant results are poor. This is because the FDA method relies on the information of labeled data. As observed in Figs. 6(b), 6(c), and 6(d), the projections obtained with LP-SFDA, LNP-SFDA, and PNLP-SFDA are farther away from one another than in Fig. 6(a). This reveals that the information from unlabeled data has been effectively used in the discrimination. In contrast, the projections obtained with MLP-SFDA are distinctly separate from each other. This indicates that, by using the soft labels of unlabeled data in a semi-supervised manner, the MLP-SFDA method can obtain a better discriminant subspace for more accurate predictions.
Next, the fault classification performances of FDA, LP-SFDA, LNP-SFDA, PNLP-SFDA, and MLP-SFDA are discussed. Fig. 7 shows the posterior probability values of the testing samples for the five methods. The final classification results produced by the five methods are shown in Fig. 8. The classification accuracy rates achieved with FDA, LP-SFDA, LNP-SFDA, PNLP-SFDA, and MLP-SFDA are 57.2%, 87.6%, 81.2%, 87.2%, and 97.6%, respectively. Specifically, according to Fig. 8(a), 23 samples that originally belong to Fault 2 are misclassified as belonging to Fault 3. For Fault 1, eight samples are misclassified as belonging to Fault 3. In the normal data, 76 samples are misclassified as belonging to Fault 3. Fig. 8(a) demonstrates that when there are only a few labeled data, the classification performance of FDA is poor because it depends on the information of labeled data. Figs. 8(b), 8(c), and 8(d) indicate that the classification accuracy of the LP-SFDA, LNP-SFDA, and PNLP-SFDA methods is higher than that of the FDA method, but there are still some misclassifications. Therefore, using a semi-supervised method improves classification performance. In contrast, Fig. 8(e) shows that the classification accuracy of the MLP-SFDA method is much higher than that of the FDA, LP-SFDA, LNP-SFDA, and PNLP-SFDA methods. Table 4 lists the classification accuracy rates achieved using FDA, LP-SFDA, LNP-SFDA, PNLP-SFDA, and MLP-SFDA in this case study. The results indicate improvements in classification performance owing to the label propagation performed by MLP and the semi-supervised use of unlabeled data. Therefore, based on the above results, we can conclude that the soft labels of unlabeled data can be predicted more accurately using the proposed method. With the additional soft labels of unlabeled data, the fault classification ability of MLP-SFDA is greater than that of FDA, LP-SFDA, LNP-SFDA, and PNLP-SFDA.

V. CONCLUSIONS
In this paper, an MLP method is proposed to accurately propagate labels from labeled data to unlabeled data. The toy example is utilized to evaluate the performance of MLP. Compared with the LP, LNP, and PNLP methods, the MLP method with a new global and local consistency framework has been validated in predicting the soft labels of unlabeled data accurately. In addition, the MLP-based fault classification method is introduced with additional soft labels of unlabeled data. As a proof of concept, the PFP is utilized to verify the fault classification performance of the proposed method. The results show that the MLP-SFDA method achieves higher classification accuracy than the traditional FDA, LP-SFDA, LNP-SFDA, and PNLP-SFDA methods and thus improves fault classification accuracy effectively.
Although the proposed method yields encouraging results, optimization of the model parameters still needs to be investigated, and more cases and more complex industrial process data will be required to test the MLP and MLP-SFDA-based fault classification methods in the future.