Cross-Task Fault Diagnosis Based on Deep Domain Adaptation With Local Feature Learning

Data-based intelligent fault diagnosis is a critical tool for the healthy operation of industrial processes. In actual industrial production, there are often few or even no labeled samples for the target monitoring problem, while large amounts of training data come from a different but related diagnosis task under variable working conditions. To exploit the labeled data from the related task for better monitoring performance, cross-task fault diagnosis based on deep domain adaptation with local feature learning is proposed. In our strategy, a two-stream stacked-autoencoder deep architecture is used to extract transferable features from data collected across the target diagnosis task domain and the related, data-rich monitoring task domain. Then, the maximum mean discrepancy is introduced to establish a deep transfer diagnosis model. Moreover, to further optimize the model, we propose local feature learning, which gives the test data better intra-class compactness and inter-class separability. Finally, the proposed method is verified on the Tennessee Eastman process and rolling bearing data; the results show that our approach achieves positive performance on cross-task fault diagnosis problems.


I. INTRODUCTION
Fault diagnosis technology can accurately identify the health status of industrial processes by detecting and analyzing various state parameters. Compared with conventional methods, machine learning technologies, such as support vector machine (SVM) [1], logistic regression (LR) [2] and K-nearest neighbor (KNN) [3], have achieved great success through highly efficient collection, feature extraction and classification of the various parameters recorded during equipment operation [4], [5].
Although fault diagnosis based on machine learning has achieved such high theoretical accuracy, significant deficiencies still exist in the currently proposed algorithms. Conventional machine learning methods work at full capacity only under a simple hypothesis: the training and test data are independent and identically distributed [6].
The associate editor coordinating the review of this manuscript and approving it for publication was Yakoub Bazi.
However, the performance of those methods can drop prominently when the distributions of the training and test data are different, a situation that frequently occurs in practical projects [7]. For example, in some complex industrial systems, labeled data obtained in the laboratory environment are sufficient, while those from real working conditions are scarce, especially labeled fault data. Other processes run under variable conditions, so a monitoring model constructed under one typical condition needs to be used under other working conditions. There is also the case where some kinds of fault data can be obtained easily while others are hard to collect, yet all fault types need to be diagnosed. Therefore, it is necessary to use the rich data related to the monitoring problem to realize fault diagnosis in the target process; we define this as a cross-task fault diagnosis problem.
This kind of problem is part of cross-domain learning [8]. In cross-domain learning, two concepts need to be understood: the dataset with a large number of labeled samples used to establish the model is identified as the source domain, and the dataset with limited or no labeled data for the application is considered the target domain. Because the working conditions of equipment in actual industrial applications are complex and changeable, a model built with training data from the source domain may not be directly applicable to the target domain. Even more frustrating, re-collecting faulty training samples to reconstruct the model for practical engineering applications is inefficient and often impractical. Besides, most traditional machine learning methods consume much time in the training phase, which is regarded as the necessary cost of better representation capabilities.
To overcome these challenges, domain adaptation techniques, which aim to extend a well-trained model from the source domain to the target domain, have attracted much attention and developed rapidly [9]. There are three main subdivisions of domain adaptation methods: feature-based, instance-based and parameter-based domain adaptation [10]. The feature-based methods, whose purpose is to learn a shared feature representation by minimizing the distribution discrepancy between the source and target domains [11], can be further distinguished by (a) the considered class of transformations [12], [13], and (b) the specific type of domain discrepancy measure, such as Maximum Mean Discrepancy (MMD) [14], Central Moment Discrepancy (CMD) [15], and Correlation Alignment (CORAL) [16]. Another strategy is instance-based domain adaptation, which considers that some samples across domains may remain poorly matched even in a shared subspace [17]. Therefore, it reweights the source data according to the shared information contained in the target domain, and the reweighted source data are then used for further learning. In addition, parameter-based domain adaptation represents another independent line of work [10], which constructs the model by exploiting the parameter information shared between the source and target domains [18].
Domain adaptation techniques for fault diagnosis have aroused much preliminary interest. Unsupervised domain adaptation learns discriminative representations from the source samples to build a model with high generalization performance for target data. A model based on singular value decomposition (SVD) feature extraction and transfer learning was built for intelligent bearing fault diagnosis [19]. Fault diagnosis techniques based on deep domain adaptation have lately become prevalent, as deep networks can learn more distinctive characteristics. Lu et al. propose a deep model based on domain adaptation for fault diagnosis through an MMD regularization term and a weight regularization term [20]. A new deep transfer learning method based on the Sparse Auto-Encoder for multi-condition fault diagnosis was proposed by Wen et al. [21].
As mentioned above, although fault diagnosis based on deep domain adaptation has achieved satisfactory results, existing published domain adaptation works place too much emphasis on aligning holistic source and target domain feature representations, neglecting local features, which have more primitive and transferable characteristics [22]. Generally, a method considering only holistic domain alignment can reduce, but not completely eliminate, the domain discrepancy. As illustrated in Fig. 1 (b), samples from the target domain can be roughly distinguished, but samples distributed at the margin of a cluster or away from the cluster center have a high probability of being misclassified. By comparison, more detailed parts can be extracted by learning local features: as shown in Fig. 1 (c), there is almost no misclassification of the target data and the samples from source and target can be well aligned. Therefore, domain adaptation methods that consider only holistic domain alignment limit their ability to achieve higher accuracy in multi-state classification for cross-task fault diagnosis.
In this paper, to exploit data from the related, data-rich monitoring task domain for better target fault diagnosis, we propose cross-task stacked autoencoders (SAE) based deep domain adaptation with transferable local feature learning (CSDA-L) for the fault diagnosis problem. The main contributions of our work are summarized as follows.
(1) A two-stream SAE-based deep model with shared weights is used to extract transferable features of the collected state parameters for fault diagnosis across domains.
(2) The key to domain adaptation for fault diagnosis is minimizing the domain discrepancy. Therefore, CSDA-L adopts MMD as the relevance criterion to minimize the distribution discrepancy in a Reproducing Kernel Hilbert Space (RKHS).
(3) We propose a more discriminative optimization strategy, local feature learning, which improves the intra-class compactness and inter-class separability of the samples.
Combining domain alignment and local feature learning, experimental results and analysis show that our model achieves positive performance on cross-task fault diagnosis problems. The paper is organized as follows. In Section II, descriptions and provisions are introduced. Our proposed methods and model structure are described in detail in Section III. A series of experiments and analyses are conducted in Section IV. Section V concludes our work.

II. PRELIMINARIES
In this section, we first present the problem formulation, and then the provisions of domain adaptation are introduced in detail. For clarity, the frequently used notations are summarized in Table 1.

A. PROBLEM DEFINITION
We begin by formally expressing cross-task domain adaptation. Our work is based on the following assumptions and definitions:
(1) This paper focuses on the application of domain adaptation technology, which mainly involves situations where the distributions between domains are different. Consequently, the settings of domain adaptation in our work are as follows: (a) the fault types in the source domain and the target domain are different, but the correlation between the fault types is significant; (b) all labeled samples in the source domain are used to construct the backbone model for cross-task fault diagnosis; (c) samples from the target domain are available to reduce the domain discrepancy and are classified in the test process.
(2) In principle, a domain consists of a feature space X and a marginal probability distribution P(x), written {X, P(x)}, where x ∈ X. When the source domain D_s and the target domain D_t are different, they have different feature spaces and marginal distributions, expressed as X_s ≠ X_t and P(x_s) ≠ P(x_t).
(3) For a given domain D, a transfer task T contains a label space Y and a prediction function f(x), T = {Y, f(x)}, where f(x) = P(y|x), y ∈ Y, can be regarded as the conditional probability distribution. Generally, the categories and the conditional probability distributions of the source and target domains are different, which means Y_s ≠ Y_t and P(y_s|x_s) ≠ P(y_t|x_t).
(4) In our work, we intend to train a transformation function F(x) satisfying P(F(x_s)) = P(F(x_t)) and P(y_s|F(x_s)) = P(y_t|F(x_t)). The sample features from D_s and D_t, despite their different distributions, are all mapped into a shared space; then the cross-task fault diagnosis model established on D_s can be extended to classify the unlabeled samples from D_t directly.

B. SAE ARCHITECTURE
Deep Neural Networks (DNN) have been successfully applied in various fields [23]. Typical representatives of DNNs applied to fault diagnosis include Deep Belief Networks (DBN) [24], SAE [25], Sparse Filtering (SF) [26], and Convolutional Neural Networks (CNN) [27]. Low computational cost, high interpretability and automatic feature selection make the SAE suitable as the backbone deep architecture in our work. The SAE is constructed by stacking multiple autoencoders; each autoencoder is an unsupervised three-layer neural network whose structure is shown in Fig. 2. An autoencoder is composed of two parts: an encoder that extracts features from the input data, and a decoder that reconstructs the input data from the extracted features [28]. Given a training sample x = [x_1, x_2, ..., x_m]^T, the encoding process can be defined as follows.
u_i = Σ_j w_ij x_j + b_i,  g_i = f(u_i),  (1)

where g_i denotes the i-th element of the hidden layer, the number of neurons is m, u_i indicates the induced local field, and w_ij and b_i are the weight and bias terms. The activation function f(·) of the autoencoder is the logistic sigmoid function. From the hidden layer to the reconstruction layer, the reconstructed features x̂ = [x̂_1, x̂_2, ..., x̂_m]^T are computed as

ũ_i = Σ_j w̃_ij g_j + b̃_i,  x̂_i = f(ũ_i),  (2)

where ũ_i, g_j, w̃_ij and b̃_i denote the i-th induced local field, the elements of the hidden layer, and the weight and bias terms of the decoding process. To obtain the optimal parameter set θ = {W, W̃, b, b̃}, the gradient descent (GD) algorithm is used; as equation (2) shows, w_ij, w̃_ij, b_i and b̃_i are the corresponding entries of the weight and bias matrices W, W̃, b and b̃.
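As a concrete illustration, the encode/decode pass described above can be sketched in NumPy. This is a minimal toy implementation, not the paper's code; the class and variable names are our own.

```python
import numpy as np

def sigmoid(u):
    # Logistic sigmoid activation, as used in each autoencoder layer.
    return 1.0 / (1.0 + np.exp(-u))

class Autoencoder:
    """Single three-layer autoencoder:
    encoder  g     = sigmoid(W x + b)
    decoder  x_hat = sigmoid(W_dec g + b_dec)"""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, (n_hidden, n_in))      # encoder weights w_ij
        self.b = np.zeros(n_hidden)                          # encoder biases b_i
        self.W_dec = rng.normal(0.0, 0.1, (n_in, n_hidden))  # decoder weights
        self.b_dec = np.zeros(n_in)                          # decoder biases

    def encode(self, x):
        # Hidden-layer features g from the induced local field u = Wx + b.
        return sigmoid(self.W @ x + self.b)

    def decode(self, g):
        # Reconstruction x_hat from the hidden features.
        return sigmoid(self.W_dec @ g + self.b_dec)

    def reconstruction_error(self, x):
        # Mean squared reconstruction error, the quantity GD would minimize.
        return float(np.mean((x - self.decode(self.encode(x))) ** 2))

ae = Autoencoder(n_in=8, n_hidden=4)
x = np.linspace(0.1, 0.9, 8)
print(ae.reconstruction_error(x))
```

Stacking several such autoencoders, each trained on the hidden output of the previous one, yields the SAE used as the backbone here.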
In this paper, the SAE is used to complete the initialization; the sparse representation learned by the SAE is then used to train the classifier and complete the training of the model. The structure of the SAE is shown in Fig. 3. The output of the last hidden layer of the SAE is regarded as the model output for final classification in the softmax layer.

C. MMD
Maximum Mean Discrepancy (MMD) is a non-parametric discrepancy indicator measuring the difference between distributions. Compared with parametric criteria such as Kullback–Leibler divergence [29] and Jensen–Shannon divergence [30], MMD is a kernel learning method that maps samples from both domains into an RKHS, a variant characteristic space, in which the domain discrepancy can be minimized [14]. The MMD between the source and target fault characteristics can be defined as

MMD(X_s, X_t) = ‖ (1/n_s) Σ_{i=1}^{n_s} φ(x_i^s) − (1/n_t) Σ_{j=1}^{n_t} φ(x_j^t) ‖_H,  (4)

where MMD denotes the distance between domains, computed with respect to a particular representation φ(·), φ: x_s, x_t → H, and H is an RKHS. We want to minimize this distance for superior domain adaptation results.

III. METHOD
Considering the cross-task fault diagnosis problem, a deep domain adaptation model with local feature learning is proposed. By minimizing the distance between the source and target domains and exploiting locally discriminative deep features, the target data can be expected to gain better intra-class compactness and inter-class separability, which leads to better monitoring performance. The details are described as follows.

A. SAE BASED DEEP DOMAIN ADAPTATION
The standard deep domain adaptation model is built on the neural network structure mentioned in Section II. We establish a two-stream SAE-based deep architecture with shared weights (only the labeled source examples are used to train the shared weights), as shown in Fig. 4. The upper stream deals principally with the source data and the other stream processes the target data. The architecture extracts the hierarchical characteristics of the source domain data to build an efficient classifier through supervised learning, and is combined with domain adaptation technology to complete the preliminary establishment of the model.
In our study, the samples are all collected from the same system, regardless of which fault category they belong to, which indicates that the correlations of cross-domain samples from different categories are quite significant. Thus, we follow standard settings for unsupervised deep domain adaptation [9]: the labeled source domain dataset is defined as D_s = {(x_i^s, y_i^s)}_{i=1}^{n_s} and the unlabeled target domain dataset as D_t = {x_j^t}_{j=1}^{n_t}. The source domain data and target domain data are drawn from different conditional probability distributions, i.e., P(y_s|x_s) ≠ Q(y_t|x_t). The architecture of the proposed cross-task fault diagnosis model is composed of three modules: the parameter-shared layers, which complete parameter initialization and share weight parameters; the adaptation layers, whose aim is to reduce the domain discrepancy and make a further optimization; and the output layer, which realizes the classification of faults. Since the SAE, whose structure and function are illustrated in the previous section, is the backbone model in our study, the high-level feature representation of the source g^{s(L)} is hierarchically elicited as g^{s(L)} = f(W^{(L)} g^{s(L−1)} + b^{(L)}), where W^{(L)} and b^{(L)} are the weights and biases of the L-th hidden layer. The SAE parameter set θ = {W, b} can be trained through supervised learning; then, to avoid the curse of dimensionality of the loss function, we optimize with the cross-entropy function:

L_c = (1/n_s) Σ_{i=1}^{n_s} c(θ | x_i^s, y_i^s),  (5)

where c(θ | x_i^s, y_i^s) denotes the standard classification loss, and y_i^s, ỹ_i^s are the true label and the predicted probability of the source sample x_i^s. After completing the two-stream SAE-based deep architecture, we fix the parameters of the bottom hidden layers and set the last hidden layer as the adaptation layer, inspired by Deep Domain Confusion (DDC) [31]. In our case, the adaptation layer maps the features of the source and target domains into a common feature subspace, so as to reduce the domain discrepancy and complete the transfer of knowledge.
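The source classification term is ordinary cross-entropy over the labeled source samples. A minimal sketch (function name and toy values are ours, not the paper's):

```python
import numpy as np

def cross_entropy_loss(y_true, y_prob, eps=1e-12):
    """Mean cross-entropy between one-hot source labels y_true (n, K)
    and predicted class probabilities y_prob (n, K)."""
    y_prob = np.clip(y_prob, eps, 1.0)  # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))

# Two source samples, three fault classes.
y_true = np.array([[1., 0., 0.], [0., 1., 0.]])
y_prob = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]])
print(cross_entropy_loss(y_true, y_prob))  # ≈ 0.2899
```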
Thus, the MMD metric is employed to compute the discrepancy across categories and domains. The domain discrepancy loss measured by MMD can be expressed as

L_MMD = ‖ (1/b) Σ_{i=1}^{b} φ(h_i^s) − (1/b) Σ_{j=1}^{b} φ(h_j^t) ‖_H^2,  (6)

where φ(·) operates on the features extracted from the source and target domains by the parameter-shared layers, H_s ∈ R^{b×L} and H_t ∈ R^{b×L} are the feature matrices in the adaptation layers, b is the batch size in the training phase, L is the number of units in the adaptation layer, and h_i^s ∈ R^L and h_j^t ∈ R^L denote the deep features in the adaptation layer. Obviously, domain features with different distributions can be drawn closer in the RKHS by decreasing (6).
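A minimal sketch of this batch-level discrepancy, using the identity map in place of the kernel feature map φ purely for illustration (a real implementation would typically use a Gaussian kernel):

```python
import numpy as np

def mmd_loss(H_s, H_t):
    """Empirical MMD between source and target adaptation-layer feature
    batches: squared distance between the two batch feature means
    (identity feature map as a simple stand-in for phi)."""
    diff = H_s.mean(axis=0) - H_t.mean(axis=0)
    return float(diff @ diff)

rng = np.random.default_rng(0)
H_s = rng.normal(0.0, 1.0, (32, 10))  # source batch, 10 adaptation units
H_t = rng.normal(0.5, 1.0, (32, 10))  # target batch, shifted distribution
print(mmd_loss(H_s, H_t))  # larger when the distributions differ
```

Minimizing this quantity during training pulls the source and target feature distributions together in the shared space.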

B. DEEP DOMAIN ADAPTATION WITH LOCAL FEATURE LEARNING
We propose deep local feature learning in the adaptation layers for cross-domain transfer to further optimize the model. As illustrated in Fig. 5, the core idea is that the features of each sample should be as close as possible to their corresponding class center, while samples from different classes must be separated by large distances in the feature space.
In more concrete terms, motivated by the Center loss [32] and the Center-Based discriminative loss [33], we further develop an intra-class loss L_1 and an inter-class loss L_2, which can be formulated as

L_1 = Σ_{i=1}^{b} max(‖h_i^s − c_{y_i}‖² − m_1, 0),  (7)

L_2 = Σ_{i,j=1, i≠j}^{K} max(m_2 − ‖c_i − c_j‖², 0),  (8)

where h_i^s ∈ R^L denotes the features extracted from the adaptation layer for the i-th input sample, c_i ∈ R^L, i ∈ {1, 2, ..., K}, is defined as the class center, K is the number of condition classes in all samples, and c_{y_i} ∈ R^L denotes the class center of the sample whose label is y_i.
As seen in (7) and (8), m_1 and m_2 are two constraint margins: the distances between samples and their class centers shall be at most m_1, while the distances between centers of different classes are forced to be no less than m_2. This principle makes the deep features of the samples more recognizable. To balance the contributions of the intra-class loss L_1 and the inter-class loss L_2, the local feature learning loss is defined as

L_d = L_1 + α L_2,  (9)

where α is the trade-off parameter. In the ideal situation, the class center c_i would be determined by averaging the deep characteristic values of all training data. However, this is quite demanding because the proposed method is implemented with mini-batch stochastic gradient descent (SGD). Thus, we make some necessary simplifications to the training strategy. For the inter-class loss L_2 in (9), the centers c_i and c_j estimating the inter-class separability are approximately calculated by averaging the deep features of the current batch samples. In contrast, the c_{y_i} used to measure the intra-class compactness should be updated as the deep features change; therefore, we update c_{y_i} in each iteration as

Δc_j = Σ_{i=1}^{b} δ(y_i = j)(c_j − h_i^s) / (1 + Σ_{i=1}^{b} δ(y_i = j)),  (10)

c_j^{t+1} = c_j^t − γ Δc_j^t,  (11)

where δ(condition) = 1 when the condition is satisfied and δ(condition) = 0 otherwise. The class center of each category is initialized before training and updated by the inputs via (10) and (11) in each iteration, and γ denotes the learning rate in (11). Since our work limits the distances between feature values and the class centers they belong to, and further enforces large margins between class centers, this is a considerable improvement over the Center loss. The local feature learning loss encourages the shared features to be more discriminative, which is significantly helpful for both domain adaptation and the final classification.
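The intra-class/inter-class penalties and the per-iteration center update can be sketched as follows. This is a toy NumPy version under our own naming; the paper's implementation details (batching, initialization) may differ.

```python
import numpy as np

def local_feature_loss(H, y, centers, m1, m2, alpha):
    """Intra-class term pulls each feature within margin m1 of its class
    center; inter-class term pushes centers at least m2 apart
    (squared Euclidean distances, hinge form). Returns intra + alpha*inter."""
    K = centers.shape[0]
    intra = 0.0
    for h, label in zip(H, y):
        d = np.sum((h - centers[label]) ** 2)
        intra += max(d - m1, 0.0)
    inter = 0.0
    for i in range(K):
        for j in range(i + 1, K):
            d = np.sum((centers[i] - centers[j]) ** 2)
            inter += max(m2 - d, 0.0)
    return intra + alpha * inter

def update_centers(H, y, centers, gamma):
    """Per-iteration center update: move each class center toward the mean
    of the current batch features carrying its label."""
    new_centers = centers.copy()
    for k in range(centers.shape[0]):
        mask = (y == k)
        n_k = mask.sum()
        if n_k > 0:
            delta = np.sum(centers[k] - H[mask], axis=0) / (1 + n_k)
            new_centers[k] = centers[k] - gamma * delta
    return new_centers

# Tiny batch: two classes whose features already sit on their centers,
# and whose centers are farther apart than m2 -> loss is zero.
H = np.array([[0., 0.], [4., 4.]])
y = np.array([0, 1])
centers = np.array([[0., 0.], [4., 4.]])
print(local_feature_loss(H, y, centers, m1=0.0, m2=1.0, alpha=0.5))  # → 0.0
```

With a misplaced center, `update_centers` drags it toward the batch features of its class, while `local_feature_loss` penalizes both loose clusters and centers that sit too close together.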
Above all, the cross-task fault diagnosis model is built by minimizing the following three loss functions: (1) the standard classification loss with respect to the source data; (2) the domain discrepancy loss measured by the MMD term; (3) the local feature learning loss. Based on these, the final objective function is written as

L = L_c + λ_1 L_MMD + λ_2 L_d,  (12)

where λ_1 > 0 and λ_2 > 0 are trade-off parameters balancing the domain discrepancy loss and the local feature learning loss.
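The combined objective is then a weighted sum of the three terms; a one-line sketch, assuming the trade-off values λ_1 = 0.25 and λ_2 = 0.1 reported later in the experiments:

```python
def total_loss(L_c, L_mmd, L_d, lam1=0.25, lam2=0.1):
    """Final objective: source classification loss plus weighted domain
    discrepancy (MMD) and local feature learning terms."""
    return L_c + lam1 * L_mmd + lam2 * L_d

print(total_loss(0.5, 0.2, 1.0))  # ≈ 0.65
```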

C. FRAMEWORK OF TRAINING AND DIAGNOSIS
The model proposed in this paper is an end-to-end cross-task industrial process fault diagnosis model, which utilizes labeled source samples and unlabeled target samples to train the deep domain adaptation model. The framework of the cross-task fault diagnosis model is illustrated in Fig. 6. In the training phase, a very wide range of input values will confuse the model and hinder training. Therefore, before training, the input data must be normalized, which prevents the absolute value of the input data from becoming so large that it saturates the output, and accelerates the convergence of SGD iterations. The normalization of the source data is as follows:

x̂^s = (x^s − x^s_min) / (x^s_max − x^s_min),  (13)

where x^s_min and x^s_max are the minimum and maximum values of each input variable. Considering that the standard classification loss L_c in (5) is stated by the cross-entropy function, and that L_MMD in (6) and L_d in (9) are both differentiable with respect to the input data, the parameters θ = {W, b} of the structure can be easily updated by standard back propagation and mini-batch SGD in each iteration. The gradient-based parameter updates are as follows:

W ← W − η ∇_W L(W, x_i^s, y_i^s),  (14)

b ← b − η ∇_b L(b, x_i^s, y_i^s),  (15)

where η represents the learning rate, ∇_W L(·) is the gradient of the loss function with respect to W, and ∇_b L(·) is the gradient of the loss function with respect to b. After training on the source domain, our model can be applied to the target domain for classification.
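The normalization and parameter update can be sketched as follows. The min-max form is one common choice and an assumption on our part (z-score standardization would serve the same saturation-avoiding purpose); function names are ours.

```python
import numpy as np

def normalize(X):
    """Column-wise min-max scaling to [0, 1] so that large-magnitude inputs
    do not saturate the sigmoid units."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min + 1e-12)  # epsilon guards constant columns

def sgd_step(W, b, grad_W, grad_b, eta=0.025):
    """One mini-batch SGD update of the network parameters."""
    return W - eta * grad_W, b - eta * grad_b

X = np.array([[1., 200.],
              [2., 400.],
              [3., 600.]])
print(normalize(X))  # each column scaled to span [0, 1]
```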
In the classification stage, the transferable target features are used to examine the reliability of the fault diagnosis model, which is completed at the final layer with softmax regression for multiclass classification. Correspondingly, the output probability o_k ∈ [0, 1] for class k is

o_k = exp(θ_k^T h) / Σ_{l=1}^{K} exp(θ_l^T h),  k = 1, 2, ..., K,

where θ_k denotes the parameters of the corresponding class, K is the number of classes in the training datasets, and Σ_{k=1}^{K} o_k = 1.
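A short sketch of the softmax output layer (toy parameters; the shift by the maximum logit is a standard numerical-stability trick):

```python
import numpy as np

def softmax_output(h, theta):
    """Class probabilities o_k from adaptation-layer features h (L,) and
    per-class parameters theta (K, L); the outputs sum to 1."""
    logits = theta @ h
    logits -= logits.max()  # numerical stability, does not change the result
    e = np.exp(logits)
    return e / e.sum()

theta = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.5, 0.5]])
h = np.array([2.0, 1.0])
o = softmax_output(h, theta)
print(o, o.sum())  # probabilities over 3 classes, summing to 1
```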

IV. EXPERIMENTAL ANALYSIS
To verify the effectiveness of the proposed method, several experiments are designed in this section according to the different characteristics of the source and target domain data.

A. DATA PREPARATION
In this section, we mainly evaluate the efficacy of our approach on data from the TE simulation platform and on the rolling bearing data from Case Western Reserve University (CWRU).

1) TE PROCESS DATA
Based on an actual chemical reaction process, the TE process was created as an open and challenging chemical model simulation platform, whose purpose is to provide a realistic industrial process for evaluating process control and monitoring methods [34]. As a data source for comparing various methods, the TE datasets have been widely used in research such as control optimization, process monitoring and fault diagnosis.
The TE process consists of five major units: a reactor, a condenser, a compressor, a separator and a stripper, and involves five gas components and two liquid products. Hence the entire process contains many highly correlated variables, including 41 measurement variables and 12 manipulated variables. The TE datasets are designed with 21 faults to simulate common faults and disturbances in actual industrial processes, which are regarded as 21 types of preprogrammed faults. In our work, all the fault types are summarized as {F_1, F_2, ..., F_21}; the description of parameters and the number of samples under each condition are shown in Table 2.
To simulate the scenarios for cross-task fault diagnosis, three settings should be followed.
(1) The labeled data of Domain A = {F_1, F_2, ..., F_5} are regarded as the source domain, containing 5 categories of fault data with a total of 2500 samples.
(2) Domain B = {F_6, F_7, ..., F_10} and Domain C = {F_11, F_12, ..., F_15}, without taking the labels of these data into account, are regarded as the target domains; the total number of samples is the same as for the source domain data.
(3) The purpose of the diagnostic tasks A → B and A → C is to classify the unlabeled fault samples from the target data into their correct categories with our domain adaptation approach.

2) ROLLING BEARING FAULT DATA
The rolling bearing data provided by the Bearing Data Center of CWRU were acquired on a bearing test rig consisting of a motor as the prime mover, a torque transducer/encoder, a dynamometer, and control electronics for speed control. The data comprise multivariate vibration series covering one normal condition and three fault types: fault in the outer race (FO), fault in the inner race (FI), and fault in the ball (FB). The state data were collected separately at different motor loads (0, 1, 2, and 3 hp) with a sampling frequency of 12 kHz. There are four fault sizes, 0.007, 0.014, 0.021 and 0.028 inches, corresponding to different damage diameters. In our work, six different working conditions were analyzed; the time-domain vibration waveforms of the rolling bearing data are shown in Fig. 7. The parameters and number of samples under each condition are shown in Table 3, and several settings are designed for the cross-task fault diagnosis.
(1) To verify the transfer effect in the datasets with different damage degrees, the rolling bearing data with damage diameter of 0.007 inches and 0.021 inches are designed for experimentation.

B. COMPARED METHODS
To verify the positive results of our model, our proposed method is compared with existing successful supervised and domain adaptation approaches:
(1) Logistic Regression (LR) [2]
(2) Support Vector Machine (SVM) [1]
(3) Cross-Domain Spectral Classification (CDSC) [35]
(4) Transfer Component Analysis (TCA) [36]
(5) Deep Domain Confusion (DDC) [31]
(6) Domain Adversarial Neural Network (DANN) [37]
LR and SVM are classical conventional supervised classification approaches that have been successfully applied to fault diagnosis. CDSC and TCA are both effective transfer subspace learning methods proposed for fault diagnosis issues; in particular, TCA is the representative technique for searching the feature subspace in the domain adaptation field. Among the methods based on deep transfer learning, DDC is the most traditional method for deep domain adaptation, and DANN is the most characteristic approach based on Generative Adversarial Nets (GAN).

C. EVALUATION METRICS
The F_1 score is adopted to evaluate the classification performance of the proposed fault diagnosis model. The F_1 score is a common evaluation metric demonstrating the effectiveness of a classification model [38], defined in terms of precision and recall. Precision and recall evaluate the misdiagnosis rate and the missed diagnosis rate in industrial process monitoring, respectively, and can be expressed as

precision = TP / (TP + FP),

recall = TP / (TP + FN),

where true positives (TP) denote the test samples correctly recognized as positive, false positives (FP) the test samples erroneously recognized as positive, true negatives (TN) the test samples correctly recognized as negative, and false negatives (FN) the test samples erroneously recognized as negative. The F_1 score is the weighted harmonic mean of precision and recall, defined as

F_1 = 2 × precision × recall / (precision + recall).

In our cross-task fault diagnosis problem, each fault type is in turn considered positive while all other types are considered negative. Since the F_1 score synthesizes the missed diagnosis rate and the misdiagnosis rate, it can comprehensively evaluate the effectiveness of the various methods.
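The per-class metrics can be computed directly from the label vectors; a small NumPy sketch (function name and toy labels are ours):

```python
import numpy as np

def f1_score_per_class(y_true, y_pred, k):
    """Precision, recall and F1 for class k, treating k as positive and all
    other classes as negative (one-vs-rest, as in the paper's setup)."""
    tp = np.sum((y_pred == k) & (y_true == k))
    fp = np.sum((y_pred == k) & (y_true != k))
    fn = np.sum((y_pred != k) & (y_true == k))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
print(f1_score_per_class(y_true, y_pred, 1))  # precision 2/3, recall 1.0, F1 0.8
```

Averaging the per-class F_1 values over all fault types gives the aggregate scores reported in Tables 4 and 5.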

D. IMPLEMENTATION DETAILS
To establish reasonable test schemes, we follow the standard evaluation principle for unsupervised deep domain adaptation: all labeled source data are used to build the model and all unlabeled target data are used for testing. The SAE is used as the backbone model and each layer is initialized with parameters pre-trained on the datasets. To investigate the necessity of the proposed local feature learning, we analyze results under the following two settings:
(1) the final loss function with only the classification loss and the domain discrepancy loss (λ_1 > 0 and λ_2 = 0), denoted as CSDA, which is similar to DDC except that DDC is based on a modified AlexNet [39];
(2) the final loss function with the classification loss, the domain discrepancy loss and the local feature learning loss (λ_1 > 0 and λ_2 > 0), denoted as CSDA-L.
Because the correlations of cross-domain samples are quite significant in our datasets, we fix the hidden layers shared from the pretrained model and then train the classifier layer via back propagation. After testing, the learning rate η is set to 0.025 and the learning rate γ of the local feature learning loss is fixed at 0.5. To avoid overfitting, each hidden layer possesses only 10 hidden units. The constraint margins are set to m_1 = 0 and m_2 = 120 throughout the experiments, and the trade-off parameter α is set to 0.5. To make the proposed model work steadily across different transfer tasks, the determination of the trade-off parameters λ_1 and λ_2 is critical. According to the analysis of the following experiments, we find that λ_1 = 0.25 and λ_2 = 0.1 is the most appropriate setting for our transfer tasks.
For the traditional methods LR and SVM, the classifier is built with the labeled source data and then classifies the unlabeled data from the target domain; the trade-off parameter of LR is selected from {0.001, 0.01, 0.1, 1, 10}, while the Gaussian kernel is adopted in SVM with its trade-off parameter set to 1. Besides, CDSC and TCA utilize all samples from D_s and D_t for dimensionality reduction, after which an SVM classifier completes label prediction; the optimal subspace dimension is determined by searching {4, 8, 16, 32}. For the methods based on deep domain adaptation, DDC and DANN make full use of all data from both D_s and D_t to reduce the discrepancy between distributions, then complete the transfer of knowledge to obtain an efficient classification model.

E. DIAGNOSIS PERFORMANCE EVALUATION 1) RESULTS ON THE TE DATA
The results of our method and the compared approaches on the TE dataset are shown in Table 4, and the F_1 scores for each fault type are shown in Fig. 8. From the fault classification results, the proposed method outperforms all the compared models on both tasks. The average F_1 score of CSDA-L reaches 0.9593, which is 0.1566 and 0.0668 higher than TCA and DANN, respectively. Due to its limited representation capacity, CSDA cannot effectively solve the domain adaptation problem, so its average F_1 score only reaches 0.7734. TCA achieves 0.8027, the best among the methods based on transfer subspace learning. As seen from Fig. 8, the proposed CSDA-L method achieves higher F_1 scores for each fault type, which indicates that our cross-task fault diagnosis model is more suitable for industrial process data.

2) RESULTS ON THE ROLLING BEARING FAULT DATA
From the working condition classification results listed in Table 5, it can be clearly seen that our method also achieves satisfactory performance on the rolling bearing dataset. In all cases, the average performance of CSDA-L is as high as 0.9526, better than the other methods. The classification results of TCA and DANN on the test data are 0.8729 and 0.9083, so our method improves on them by 0.0797 and 0.0443, respectively. The performance of CSDA-L compared with CSDA confirms the validity and rationality of local feature learning. Furthermore, the confusion matrix of each working state is presented in Fig. 9, where rows and columns represent the actual and the predicted health type, respectively. From the result, only a few working states are misclassified: about 7.60% of FO12 samples are misclassified as FO3 and 6.70% of FB samples are misclassified as FO6, while most samples are classified accurately. Since the above experimental results are obtained from a wide variety of datasets, it can be convincingly argued that CSDA-L establishes a robust adaptive fault diagnosis model for classifying cross-task fault types.
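The row-normalized confusion matrix convention used in Fig. 9 (rows actual, columns predicted, percentages per actual class) can be reproduced with a small sketch; the function names and toy labels below are illustrative assumptions:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix: rows are actual classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def per_class_rates(cm):
    """Row-normalise so each row sums to 1: entry (i, j) is the fraction of
    class-i samples predicted as class j (the percentages quoted in the text)."""
    return cm / cm.sum(axis=1, keepdims=True)
```

With this convention, an entry such as "7.60% of FO12 misclassified as FO3" is simply the off-diagonal rate in the FO12 row of the normalized matrix.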

F. FEATURE VISUALIZATION
For visualization purposes, we reduce the feature dimensionality to two with t-SNE embeddings [40]. To better demonstrate the transferability of our approach, we take the transfer task A → B on the TE dataset for feature visualization with t-SNE and plot the feature scatter for analysis.
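A minimal t-SNE sketch using scikit-learn follows; the stand-in features, the perplexity, and the PCA initialization are assumptions made here for illustration, whereas the paper embeds the learned adaptation-layer features of the A → B task:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for adaptation-layer features of two groups of samples
# (e.g., two fault categories, or source vs. target domain)
feats = np.vstack([
    rng.normal(0, 1, (30, 10)),
    rng.normal(3, 1, (30, 10)),
])

# Embed the 10-dimensional features into two dimensions for scatter plotting
emb = TSNE(n_components=2, perplexity=15, init="pca",
           random_state=0).fit_transform(feats)
```

The resulting 2-D coordinates can then be scattered and colored either by category label (as in Fig. 10) or by domain label (as in Fig. 11) to inspect compactness and alignment.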
The visualizations of the learned features in Fig. 10 (a) and 10 (b) show strong discrimination in the source domain; this experiment is based on category information. To show that our model works equally well on the target domain, we visualize the features of the adaptation layer learned by CSDA and CSDA-L on transfer task A → B of the TE dataset in Fig. 11 (a) and 11 (b) (the experiment based on domain information) for comparison. Two intuitive observations can be made.
(1) Compared with the features in Fig. 10 (a), obtained without the local feature learning loss L_d, the features obtained with L_c + L_MMD + L_d in Fig. 10 (b) show that samples from the same category are compact and samples across categories are much better separated. Visually, the features obtained without L_d leave more scattered points in the gaps between categories, and some samples are even misclassified, whereas the features obtained with our local feature learning loss cluster tightly with large gaps between categories. This demonstrates that CSDA-L encourages the cross-task fault diagnosis model to learn more distinctive features from the source domain.
(2) As shown in Fig. 11 (a), with only the classification loss and the domain discrepancy loss, the category information cannot be well aligned between the source and target domains. By comparison, for the features extracted by CSDA-L in Fig. 11 (b), the categories across different domains are aligned much better, and samples sharing the same category label from both domains lie much closer together. These observations confirm the advantages of our method: the CSDA-L model learns more distinguishing features in both the source and target domains and makes the target fault samples much more distinguishable through the optimization of the local feature learning.

G. CONVERGENCE PERFORMANCE
To assess the convergence performance of our approach, we compare the test errors during the training phase with those of other methods. Fig. 12 shows the test classification errors of the different methods on the TE dataset. It is noted that CSDA-L reduces the test error to the minimum within as few training epochs as possible. Besides, the trend of the test error curve suggests that CSDA-L converges stably compared with the other methods. Consequently, although the convergence rate of CSDA-L is lower than those of TCA and DANN, CSDA-L still achieves much better performance overall. Since we take the local deep feature information of the source domain into account in the training phase, our intelligent fault diagnosis model is both efficient and practical in achieving better performance.

H. PARAMETER ANALYSIS
1) CONSTRAINT MARGINS
The constraint margins m1 and m2 are determined based on the principle of minimizing intra-class distance and maximizing inter-class distance. Ideally, m1 should be as small as possible and m2 as large as possible, so we set m1 = 0, the smallest possible distance between a sample and its class center. Generally, classification precision improves as m2 increases; the classification precision and training time with respect to m2 ∈ {50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150} are shown in Fig. 13. The trends of precision and training time behave as expected: precision increases significantly at first, but large amounts of time are consumed as m2 continues to grow in the training phase. Through comparative analysis, the most appropriate value of m2 is therefore 120, which reduces training time while yielding good classification results.
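One common way to realize such margin constraints is a hinge-style loss that penalizes intra-class distances above m1 and inter-class center distances below m2. The sketch below is an assumption about the loss form (function name, class-center formulation, and toy data are all illustrative), not the paper's exact definition:

```python
import numpy as np

def local_feature_loss(feats, labels, m1=0.0, m2=120.0, alpha=0.5):
    """Hinge-style sketch: pull samples within m1 of their class center,
    push class centers at least m2 apart; alpha trades the two terms off."""
    classes = np.unique(labels)
    centers = {c: feats[labels == c].mean(axis=0) for c in classes}
    # intra-class term: distances to the class center exceeding m1
    intra = sum(
        np.maximum(
            np.linalg.norm(feats[labels == c] - centers[c], axis=1) - m1, 0
        ).sum()
        for c in classes
    )
    # inter-class term: center pairs closer than the margin m2
    inter = sum(
        np.maximum(m2 - np.linalg.norm(centers[a] - centers[b]), 0)
        for i, a in enumerate(classes) for b in classes[i + 1:]
    )
    return alpha * intra + (1 - alpha) * inter
```

Under this formulation, m1 = 0 means every deviation from the class center is penalized, while enlarging m2 keeps pushing centers apart until the margin is met, which matches the precision/training-time trade-off discussed above.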

2) TRADE-OFF PARAMETER
We validate the effect of the local feature learning loss by analyzing the contribution of the trade-off parameter λ2; all experimental results are presented in Fig. 14. Intuitively, a larger λ2 could yield more discriminative deep feature values and may strengthen the classification ability. Fig. 14 (a) shows the variation of average accuracy for λ2 ∈ {0.0001, 0.001, 0.003, 0.01, 0.03, 0.1, 1, 10} on the transfer task. We observe that the classification precision rises rapidly at first and then drops quickly as λ2 increases, tracing a convex curve. This is because overweighting the local features learned from the source domain leads to overfitting and reduces the generalization ability; it clearly demonstrates that a proper equilibrium between holistic domain alignment and local deep feature learning can significantly improve cross-task performance. Fig. 14 (b) reveals the relationship between λ2 and the convergence of the model: better convergence performance can be expected when λ2 is appropriately increased. All of the above results confirm our intuition that once the alignment of the holistic domains keeps pace with the learning of source deep feature values, a stable and desirable cross-task fault diagnosis model with excellent convergence and high accuracy can be obtained.
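The selection procedure over the λ2 grid can be sketched generically; the helper name and the toy score function (chosen only so that it peaks at 0.1, mimicking the convex curve in Fig. 14 (a)) are assumptions, not the paper's actual evaluation:

```python
import numpy as np

def select_lambda2(candidates, evaluate):
    """Evaluate each candidate trade-off value and return the best one
    together with the full score table (for plotting a curve like Fig. 14a)."""
    scores = {lam: evaluate(lam) for lam in candidates}
    return max(scores, key=scores.get), scores

grid = [0.0001, 0.001, 0.003, 0.01, 0.03, 0.1, 1, 10]
# Toy score, convex in log-scale and peaking at lambda_2 = 0.1 --
# a stand-in for the measured average accuracy of the transfer task
best, scores = select_lambda2(grid, lambda lam: -abs(np.log10(lam) + 1))
```

In practice `evaluate` would retrain the model with the given λ2 and measure average accuracy, so the sweep is the expensive step; the grid is kept coarse and log-spaced for that reason.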

V. CONCLUSION
In this paper, we have presented a cross-task fault diagnosis model based on deep domain adaptation with local feature learning, which addresses cross-task problems in which few or even no labeled samples exist for the target monitoring task while large amounts of training data come from a different but related diagnosis task. Firstly, a two-stream SAE-based deep structure is established to extract transferable features of the source and target diagnosis task domains. Afterwards, MMD, which aims to minimize the distribution discrepancy, is applied in the training stage. Moreover, to further improve the performance of the model, we propose local feature learning in our strategy, which encourages the target samples to exhibit better intra-class compactness and inter-class separability. Eventually, the experimental results and analysis demonstrate that our approach achieves better diagnosis performance for cross-task problems and shows good potential for multi-condition fault diagnosis.
YIN TANG received the B.S. degree in electrical engineering and automation from Jiangsu University, Zhenjiang, China, in 2018. He is currently pursuing the M.S. degree in electrical engineering with the University of Shanghai for Science and Technology.
His research interests include deep transfer learning, industrial process monitoring, and fault diagnosis.

He served as a Lecturer with the East China University of Science and Technology. From 2017 to 2019, he worked as a Postdoctoral Researcher with the University of Duisburg-Essen. His current research interests include process monitoring and system modeling of chemical and biological processes, data mining, and the feature extraction of processed data.

VOLUME 8, 2020