Transfer Learning Based Data Feature Transfer for Fault Diagnosis

The development of sensor technology provides massive data for data-driven fault diagnosis. In recent years, a growing number of scholars have studied artificial intelligence techniques to overcome the bottlenecks in fault diagnosis. Compared with other classification and prediction problems, fault diagnosis often faces data scarcity. To overcome the lack of fault data, transfer learning across different working conditions has gradually been introduced into fault diagnosis. This paper discusses the current mainstream AI-based fault diagnosis methods and analyzes the advantages of transfer learning for fault diagnosis. Then, a transfer component analysis (TCA) based method is proposed to transfer data features between different working conditions. Through the TCA-based method, the fault diagnosis model for the target working condition can be established with the help of data from a historical working condition, which alleviates the problem of data scarcity under the condition to be predicted. Different from other fault diagnosis studies, this paper considers the online maintenance process based on TCA, and a fault diagnosis framework that includes the online maintenance process is proposed. Finally, a case study of bearing diagnosis using data from Case Western Reserve University demonstrates the feasibility and effectiveness of the proposed TCA-based method and fault diagnosis framework.


I. INTRODUCTION
With the popularization of sensor technology, more and more equipment in industrial manufacturing has achieved effective digital monitoring [1], [2]. Through digital monitoring, it is possible to quantify the operating status of the equipment, and comprehensively evaluate the health status of the equipment, thereby improving the reliability of the equipment and ensuring the quality of product manufacturing.
A complete fault diagnosis process generally includes two steps: fault signal processing and fault pattern recognition. Based on digital monitoring, academia has carried out extensive research work on fault signal processing [3]-[5] and fault pattern recognition [6]-[8], respectively. Fault signal processing mainly focuses on the separation and extraction of fault features to guide the subsequent fault recognition work. Fault pattern recognition often focuses on how to identify the type of fault by matching fault features obtained by signal processing with existing faults in the system.
The associate editor coordinating the review of this manuscript and approving it for publication was Hao Luo.
On the one hand, high-precision pattern recognition and classification depend on the effective description and characterization of signals. In other words, an effective signal description provides highly discriminative signal features for pattern recognition, thereby improving its accuracy. On the other hand, the validity of signal characterization needs to be verified by pattern recognition. Recently, many scholars have proposed a large number of indicators and methods for signal characterization in the time domain and the frequency domain. However, for different types of signals, it is still necessary to use the subsequent pattern recognition step to verify that the extracted signal features are reasonable. It is worth noting that some scholars have begun to advocate deep neural networks and autoencoder methods that combine fault signal processing and fault pattern recognition into a single fault diagnosis step [7], [9]-[13]. The deep neural network approach, and its comparison with the traditional approach, is analyzed in Section II.
Since fault diagnosis is mainly conducted by data-driven methods, data play a vital role in the effectiveness of fault pattern recognition. However, it is well known that fault samples are few compared with normal samples. This class imbalance causes many data-sensitive classification models to overfit [14]. In extreme cases, the classification model will classify every sample as normal if it ignores the few fault samples [15], [16]. In this case, although the classification accuracy of the model can still approach 100%, the classification is no longer meaningful.
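The extreme case described above can be illustrated with a few lines of Python. This is only a sketch: the counts of 990 normal and 10 fault samples are hypothetical numbers chosen for illustration and do not come from the paper's data.

```python
# Hypothetical label counts chosen only for illustration: 0 = normal, 1 = fault.
y_true = [0] * 990 + [1] * 10

# A degenerate "classifier" that ignores the fault samples and always predicts normal.
y_pred = [0] * len(y_true)

# Overall accuracy: fraction of samples whose predicted label matches the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Fault recall: fraction of true fault samples that were actually detected.
fault_total = sum(t == 1 for t in y_true)
fault_hits = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fault_recall = fault_hits / fault_total

print(f"accuracy = {accuracy:.3f}, fault recall = {fault_recall:.3f}")
```

The accuracy is high even though no fault is ever detected, which is why accuracy alone is a poor indicator when fault samples are scarce.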
Theoretically, data class imbalance can be addressed by cost-sensitive weighting [15]. For equipment fault diagnosis in real industrial settings, however, the cost-sensitive weights are difficult to set. Therefore, more attention should be paid to the data source, that is, to how to obtain fault data that are as numerous and as comprehensive in type as possible. The easiest way to obtain fault characteristic data is to simulate the fault scenario. Unfortunately, for most equipment or components, it is not feasible to deliberately destroy them to obtain fault feature data. On the one hand, one piece of equipment generally has multiple fault types, and simulation experiments require substantial resources. On the other hand, since the equipment operates in specific scenarios, the fault characteristics differ between working conditions. Therefore, it is difficult to characterize fault features by exhausting all working conditions through simulation experiments.
In industrial practice, apart from large-scale experiments, fault characteristics are more likely to be obtained in the following two situations.
(1) Only a small amount of historical fault data is available for the equipment to be diagnosed, but fault data are also available for several similar pieces of equipment. Their functions are similar, with minor differences in parameters. Therefore, their faults have similar characteristics, and the data of one can be used to predict the fault types of another.
(2) A piece of equipment has run for a long time under a certain operating condition, so a large amount of fault data has accumulated. When the equipment is put into a similar working condition, the fault characteristics will also be similar. Therefore, the fault types under the new working condition can be predicted from the fault data collected under the old working condition.
In view of the above circumstances, this paper adopts feature transfer to address the low classification accuracy and invalid classification caused by insufficient fault data. Among feature transfer algorithms, TCA is a typical method for domain adaptation problems. It maps the data of both the source domain and the target domain to a high-dimensional Reproducing Kernel Hilbert Space (RKHS). In the RKHS, the distance between the source and target data is minimized, while their respective internal attributes are retained to the greatest extent.
The contributions of this paper are as follows. A TCA-based data feature extraction method is proposed to discover features shared by the source domain and the target domain. Then, different from other studies, this work further explores how to use a few data labels from the target domain to improve the accuracy of target domain fault diagnosis. In actual maintenance work, faults in the target domain occur gradually (as maintenance continues, the maintenance personnel keep labeling fault data in the target domain). Therefore, how to add the newly acquired target domain data to the fault diagnosis is a meaningful question.
The remainder of this paper is organized as follows. Section II reviews and compares the current mainstream methods of fault signal processing, fault pattern recognition and fault diagnosis. The feature transfer method based on transfer component analysis is introduced in Section III to support subsequent fault diagnosis. A case study is given to verify the effectiveness of the proposed method in Section IV. Finally, the conclusion is arranged in Section V.

II. RELATED WORK

A. FAULT DIAGNOSIS BASED ON SIGNAL PROCESSING
Signal processing-based fault diagnosis usually consists of signal feature extraction and fault pattern recognition. The following describes current research on signal feature extraction and fault pattern recognition.

1) SIGNAL FEATURE EXTRACTION
Feature extraction for fault signals computes statistical feature values for samples of a specific length. Signal features are mainly extracted from the time domain, the frequency domain and the time-frequency domain [17]-[20].

2) FAULT PATTERN RECOGNITION
As mentioned in Section I, the input of fault pattern recognition is the output of signal processing. Through the signal processing in the previous step, the fault signal is converted into multiple fault samples. For samples with known fault types, learning algorithms can be used to train a classification model, which is then used to identify and predict fault patterns.
VOLUME 8, 2020
In recent years, various machine learning algorithms have become the mainstream methods for fault pattern recognition. For example, [32] performed vibration fault recognition for hydroelectric generating units via a support vector machine. Reference [33] used a decision tree for spur gear fault diagnostics. A Bayesian network was used for intelligent building fault diagnosis in [34]. Moreover, machine learning-based pattern recognition is often combined with computational intelligence methods such as evolutionary computation and fuzzy systems [35], [36]. For example, some studies used algorithms such as PSO and ACO to optimize the parameters of machine learning models, thereby improving recognition accuracy [32], [37]. Other works combine methods such as fuzzy decision making and information fusion with machine learning to improve the robustness of the pattern classifier [33], [38].

B. FAULT DIAGNOSIS BASED ON DEEP NEURAL NETWORK
In recent years, a growing number of studies have discussed how to use deep neural networks for fault diagnosis [7], [39]-[46]. Neural network-based fault diagnosis can integrate signal feature extraction and fault pattern recognition. This type of work merges the two steps: it feeds the signals directly into a deep neural network as samples, uses the network to extract signal features, and obtains the pattern classification result at the last layer of the network. For example, [47], [48] discussed how to use deep neural networks with autoencoders for fault diagnosis. The first half of the neural network is responsible for signal feature extraction and the second half is responsible for pattern recognition. The advantage of such a fault diagnosis method is that it saves the cost of manual signal feature extraction, because the network learns to extract features during training.
However, such a fault diagnosis method has at least two obvious shortcomings.
• First, difficulty in network training. If the number of data samples and the feature dimension are relatively large, training a neural network becomes difficult. On the one hand, network training takes a long time, so it is difficult to put the model into use quickly. On the other hand, due to the large scale of the network, training convergence is difficult to guarantee.
• Second, poor interpretability. Due to the black-box training mode of neural networks, the trained parameters are hard to interpret, and the meaning of the fault features cannot be obtained intuitively and clearly. When the scale of the network is large, even if high recognition accuracy is obtained, it is difficult to judge whether the model has really learned the features or whether the high accuracy is caused by overfitting [14].

C. DISCUSSION ON FAULT DIAGNOSIS METHOD
Intelligent fault diagnosis differs from general artificial intelligence applications. The interpretability and effectiveness of its prediction results determine whether industrial activities can proceed smoothly, and can even affect industrial safety. Therefore, although neural networks have made some progress in this field, this paper still advocates using signal processing to extract features, which retains the interpretability of the fault characteristics to the greatest extent and makes it convenient for maintenance personnel to evaluate the results of fault pattern recognition.

D. TRANSFER LEARNING FOR FAULT PATTERN RECOGNITION
Research on fault transfer benefits from the development of transfer learning [49]. The mainstream research on fault transfer falls into the following three categories.
• (1) Transfer from Data. Fault data are scarce under a specific working condition, but if fault data from a similar working condition are known, these data can be used directly as auxiliary data. By introducing auxiliary data into the current working condition, a fault classification model can be established and used to predict potential faults in unknown signals.
• (2) Transfer from Model and Parameter [50]. Unlike data transfer, model transfer is performed from the perspective of the algorithm and the model. In other words, the structure and parameters of the classification model established under a known working condition are transferred to the new working condition. By fine-tuning the structure and parameters [51], the model established under the known working condition can readily classify and predict faults under the new working condition.
• (3) Transfer from Feature. Feature transfer focuses on the relationship between the auxiliary data and the data generated under the existing working condition. In general, the auxiliary data generated under past known working conditions are plentiful, and after a period of operation, the sensor system will also obtain some fault data under the existing working condition. Therefore, feature matching between the two data sets can be used to classify and predict based on the similarity of features.
All three transfer strategies help, to a certain extent, to alleviate the problem of sparse fault data under a specific working condition. From the perspective of industrial applications, however, they are not equally effective.
Data transfer is mainly suitable when the data are generated by similar distributions. Once the data distributions differ greatly between nominally similar working conditions, it causes a large error in predictive diagnosis.
The transfer of model and parameters is generally carried out on neural networks. Compared with data transfer, this method avoids the distribution error caused by the direct use of auxiliary data. From the perspective of model training, it does not need to train all parameters of the entire model, but only needs to fine-tune the existing model. However, the shortcoming of this method is also obvious: it depends strongly on the model. If the model selection or the fine-tuning of model parameters is not done well, the performance of the entire model will be very poor.
Feature transfer combines the features of the auxiliary data with the signal of the current working condition. This method accounts for the difference between data under different working conditions and allows a simpler fault prediction model. The process is as follows. In the first step, the signals from both the historical known working condition and the current working condition undergo the same spatial transformation, which significantly reduces the difference between the auxiliary data and the current target data. In the second step, the transformed auxiliary data and the transformed data of the working condition to be predicted are combined into a new sample set. Because the differences have been reduced, a simpler model can be used for fault classification and prediction, which reduces the dependence of the classification on the model. Based on these advantages, this paper advocates using feature transfer to transfer fault data between different working conditions. The specific modeling of fault diagnosis based on feature transfer is given in Section III.

III. FAULT DIAGNOSIS BASED ON FEATURE TRANSFER

A. MODEL ASSUMPTIONS
Assume the auxiliary fault data under known conditions is $D_S$, which is defined in Equation (1).
where $X_S$ represents the extracted fault signal features, $d$ represents the dimension of the fault signal features, $n_S$ represents the number of samples under known conditions, and $y_S$ represents the fault type corresponding to the fault signal.
The fault data under the working condition to be diagnosed is $D_T$, which is defined in Equation (2).
where $X_T$ represents the extracted fault signal features, $d$ represents the dimension of the fault signal features (consistent with the dimension of $X_S$), and $n_T$ represents the number of samples under the working condition to be diagnosed. The notation $y_T$ is used to indicate the fault type of the fault data under the working condition to be diagnosed.
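Equations (1) and (2) are not reproduced in this text. One form consistent with the definitions above is sketched here as an assumption rather than as the original typesetting; in particular, the one-row-per-sample layout of the feature matrices is a guess.

```latex
\begin{align}
D_S &= \{X_S,\, y_S\}, & X_S &\in \mathbb{R}^{n_S \times d}, \tag{1}\\
D_T &= \{X_T,\, y_T\}, & X_T &\in \mathbb{R}^{n_T \times d}. \tag{2}
\end{align}
```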

B. FEATURE TRANSFER BASED ON TRANSFER COMPONENT ANALYSIS
Feature transfer is performed by transfer component analysis (TCA) [52]. In TCA, the maximum mean discrepancy (MMD) is used to characterize the difference between the distribution of the auxiliary data and that of the data to be diagnosed. MMD is given by Equation (3).
That is, through the mapping $\phi$, $X_S$ and $X_T$ are mapped to a high-dimensional space, and in this space $X_S$ and $X_T$ should be as close as possible.
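Equation (3) is not reproduced in this text. The empirical MMD in its usual form from the TCA literature is given below as a sketch; the sample symbols $x_{S_i}$ and $x_{T_j}$ and the RKHS symbol $\mathcal{H}$ are notation assumed here, not taken from the source.

```latex
\begin{equation}
\mathrm{Dist}(X_S, X_T) =
\left\| \frac{1}{n_S}\sum_{i=1}^{n_S} \phi\!\left(x_{S_i}\right)
      - \frac{1}{n_T}\sum_{j=1}^{n_T} \phi\!\left(x_{T_j}\right) \right\|_{\mathcal{H}}^{2}
\tag{3}
\end{equation}
```

In words, the distance is the squared norm of the difference between the mean embedding of the auxiliary samples and the mean embedding of the samples to be diagnosed.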
Considering the difficulty of solving for the mapping $\phi$ directly, a kernel matrix $K$ is introduced, which is given by Equation (4).
The parameter matrix $L$ is defined by Equation (5).
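Equations (4) and (5) are not reproduced in this text. The definitions commonly used for TCA are sketched below under the assumption that the paper follows the standard formulation; the block names $K_{S,S}$, $K_{S,T}$, $K_{T,S}$ and $K_{T,T}$ are introduced here for illustration.

```latex
\begin{equation}
K = \begin{bmatrix} K_{S,S} & K_{S,T} \\ K_{T,S} & K_{T,T} \end{bmatrix}
\in \mathbb{R}^{(n_S+n_T)\times(n_S+n_T)}, \qquad K_{ij} = \phi(x_i)^{\top}\phi(x_j)
\tag{4}
\end{equation}
\begin{equation}
L_{ij} =
\begin{cases}
\dfrac{1}{n_S^{2}}, & x_i, x_j \in X_S,\\[6pt]
\dfrac{1}{n_T^{2}}, & x_i, x_j \in X_T,\\[6pt]
-\dfrac{1}{n_S n_T}, & \text{otherwise.}
\end{cases}
\tag{5}
\end{equation}
```

With these two matrices, expanding the squared norm of the MMD term by term gives $\mathrm{tr}(KL)$, which is why the mapping $\phi$ never has to be computed explicitly.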
Now, the original problem can be transformed into solving for $K$ in Equation (6).
The problem can be simplified further by dimension reduction, as given in Equation (7).
The matrix $W$ here has a lower dimension than $K$. As long as $W$ is solved, the original problem is solved.
Expressed in terms of $W$, the final optimization goal of TCA is formulated by Equation (8).
where $\mathrm{tr}(W^{\top} W)$ is a norm penalty on the parameters [53] and $H$ is a centering matrix defined by Equation (9).
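Equations (6) to (9) are not reproduced in this text. The block below sketches the standard TCA derivation and is an assumption about what the elided equations contain. The trade-off parameter $\lambda$, the intermediate matrix $\widetilde{W}$ and the all-ones vector $\mathbf{1}$ are notation introduced here, and the sign convention of Equation (6) varies between presentations.

```latex
\begin{align}
&\min_{K \succeq 0}\ \operatorname{tr}(KL) - \lambda \operatorname{tr}(K) \tag{6}\\
&\widetilde{K} = \bigl(K K^{-1/2}\widetilde{W}\bigr)\bigl(\widetilde{W}^{\top} K^{-1/2} K\bigr) = K W W^{\top} K,
 \qquad W = K^{-1/2}\widetilde{W} \in \mathbb{R}^{(n_S+n_T)\times m} \tag{7}\\
&\min_{W}\ \operatorname{tr}\bigl(W^{\top} K L K W\bigr) + \mu \operatorname{tr}\bigl(W^{\top} W\bigr)
 \quad \text{s.t.}\quad W^{\top} K H K W = I_m \tag{8}\\
&H = I_{n_S+n_T} - \frac{1}{n_S+n_T}\,\mathbf{1}\mathbf{1}^{\top} \tag{9}
\end{align}
```

The first term of (8) is the MMD of the transformed data, the second term penalizes the size of $W$, and the constraint keeps the transformed data from collapsing to a point by fixing its scatter.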
The optimization goal can be solved by Lagrangian duality [54]. Let $D$ be the matrix of eigenvectors corresponding to the $m$ largest eigenvalues of $(KLK + \mu I)^{-1} KHK$; then there is a relationship among $D$, $\phi(X_S)$, and $\phi(X_T)$, shown in Equation (10).
In this way, the transformed features of $X_S$ and $X_T$ are obtained, where $m$ determines the transformed feature dimension. Similar to Principal Component Analysis [55], the larger $m$ is, the closer the transformed features are.
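The eigenvector solution described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation. It assumes an RBF kernel, assumes that the transformed features are the rows of $KD$ (a common reading of Equation (10), which is not reproduced in this text), and uses synthetic data and hypothetical parameter values (`m`, `mu`, `gamma`).

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dist = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def tca(X_s, X_t, m=2, mu=1.0, gamma=0.5):
    """Return m-dimensional transformed source and target features (sketch)."""
    n_s, n_t = len(X_s), len(X_t)
    n = n_s + n_t
    X = np.vstack([X_s, X_t])
    K = rbf_kernel(X, X, gamma)                      # kernel matrix over all samples

    # MMD coefficient matrix L: 1/n_s^2, 1/n_t^2 or -1/(n_s*n_t) per block.
    e = np.concatenate([np.full(n_s, 1.0 / n_s), np.full(n_t, -1.0 / n_t)])
    L = np.outer(e, e)

    H = np.eye(n) - np.ones((n, n)) / n              # centering matrix

    # Eigenvectors of (K L K + mu I)^{-1} K H K for the m largest eigenvalues.
    M = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    eigvals, eigvecs = np.linalg.eig(M)
    top = np.argsort(-eigvals.real)[:m]
    D = eigvecs[:, top].real

    Z = K @ D                                        # assumed form of the transformed features
    return Z[:n_s], Z[n_s:]

# Synthetic stand-ins for source and target features with shifted distributions.
rng = np.random.default_rng(0)
X_s = rng.normal(0.0, 1.0, size=(60, 7))
X_t = rng.normal(0.5, 1.2, size=(40, 7))
Z_s, Z_t = tca(X_s, X_t, m=3)
print(Z_s.shape, Z_t.shape)
```

Because the kernel matrix has size $(n_S+n_T)\times(n_S+n_T)$, the cost of this direct approach grows quickly with the number of samples, which is acceptable for sample counts in the hundreds.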

C. FAULT DIAGNOSIS BASED ON FEATURE TRANSFER
Once the distance between the auxiliary data and the current target data is significantly reduced in the high-dimensional space induced by $\phi$, the auxiliary data can be used to build a fault classification model. The classification model can be a neural network, a support vector machine, a decision tree or another learning algorithm. In this paper, we use k-Nearest Neighbor (KNN) for classification. Figure 1 shows the framework of fault diagnosis based on feature transfer. One point in Figure 1 needs explanation: during fault diagnosis, fault data and fault discrimination results are generated continuously. Therefore, on the one hand, new fault data can be fed continuously into the TCA to iteratively update the results of feature transfer. On the other hand, maintenance personnel will continue to label fault data under the current working condition, and these labeled data can be added to the training of the fault classification model for a better fault diagnosis result.
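The classification step can be sketched with a small NumPy KNN. This is an illustration under stated assumptions: the two-dimensional clusters below are synthetic stand-ins for transformed source and target features, and the function name `knn_predict` and the value `k=5` are hypothetical choices, not taken from the paper.

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k=3):
    """Predict labels by majority vote among the k nearest training samples."""
    # Squared Euclidean distance between every test sample and every training sample.
    dist = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    votes = train_y[nearest]                          # labels of the k nearest neighbours
    return np.array([np.bincount(row).argmax() for row in votes])

# Synthetic stand-ins: labelled source features train the model,
# target features are the samples to be diagnosed.
rng = np.random.default_rng(1)
Z_source = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                      rng.normal(10.0, 0.5, size=(50, 2))])
y_source = np.repeat([0, 1], 50)
Z_target = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
                      rng.normal(10.0, 0.5, size=(20, 2))])
y_target = np.repeat([0, 1], 20)

y_pred = knn_predict(Z_source, y_source, Z_target, k=5)
accuracy = float(np.mean(y_pred == y_target))
print(f"target-domain accuracy = {accuracy:.3f}")
```

KNN has no trained parameters beyond the stored samples, so newly labeled data can be used simply by appending them to the training arrays.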
From the perspective of algorithm, our framework of fault diagnosis adopts TCA to predict the fault type in target domain. The advantage of TCA is that the implementation is simple. The method itself does not have too many limitations, and it is as easy to use as Principal Component Analysis (PCA). More importantly, compared with the traditional transfer learning framework, our framework also takes the online maintenance of the target domain into account. In other words, the framework allows the labeled fault data obtained in the target domain to be added to the training of the fault diagnosis model in real time. It will help to further improve the accuracy of fault diagnosis in the target domain.

IV. CASE STUDY

A. DATA DESCRIPTION
The data used in this paper come from the bearing data center of Case Western Reserve University (CWRU). CWRU provides test data for bearings of different sizes and the corresponding bearing faults. In the CWRU experiments, the bearing is installed on a 2 horsepower (hp) motor. The real-time acceleration data are measured at different sampling frequencies (12K and 48K) by sensors installed near and far from the motor bearing. For clarity, Table 1 shows the working conditions used in our case study, and Table 2 shows the fault diameters and the corresponding fault type IDs (for example, a fault of 0.007'' in the inner race of the bearing has fault type ID '1'). In addition, the experiment includes normal data, for which we use '0' as the type ID. Figure 3 shows four groups of time series signals at a speed of 1772 rpm. Since the normal data in the CWRU experiment are collected at 48K, they contain four times as many data points as the 12K data over the same time span. For intuitive presentation, 3200 points of the normal data are shown in sub-figure (a), and 800 points of fault data are shown in sub-figures (b), (c) and (d). The comparison shows that the data from normal operation vary over a smaller range and that the signal is smoother than the others. The comparison of (b) and (c) shows that the same fault type under different fault diameters has similar change patterns but different amplitudes, which makes feature transfer possible. The comparison of (b) and (d) shows obvious differences between the signals of different fault types, which makes classification possible.
The comparison in Figure 3 shows obvious differences between the time series signals of different faults, so time-domain indicators are used to extract the signal features. Here we mainly consider the seven time-domain indicators in Table 3 for feature extraction. The time window for feature extraction is set so that every 800 points in the 12K data form one sample and every 3200 points in the 48K data form one sample. All samples are processed with these time-domain indicators, and all features are normalized to ensure that the scales of the different indicators are consistent.
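The windowing and normalization procedure can be sketched as follows. Table 3 is not reproduced in this text, so the seven indicators used here (mean, standard deviation, RMS, peak-to-peak, skewness, kurtosis, crest factor) are an assumed, commonly used set and may differ from the paper's. The signal is synthetic noise standing in for a 12K record, and z-score scaling is one assumed way to make the feature scales consistent.

```python
import numpy as np

def time_domain_features(window):
    """Seven common time-domain indicators for one window (an assumed set)."""
    x = np.asarray(window, dtype=float)
    mean = x.mean()
    std = x.std()
    rms = np.sqrt(np.mean(x ** 2))
    peak_to_peak = x.max() - x.min()
    centred = x - mean
    skewness = np.mean(centred ** 3) / (std ** 3 + 1e-12)   # small constant avoids division by zero
    kurtosis = np.mean(centred ** 4) / (std ** 4 + 1e-12)
    crest_factor = np.max(np.abs(x)) / (rms + 1e-12)
    return np.array([mean, std, rms, peak_to_peak, skewness, kurtosis, crest_factor])

def extract_features(signal, window_size):
    """Cut a 1-D signal into non-overlapping windows and featurise each one."""
    signal = np.asarray(signal, dtype=float)
    n_windows = len(signal) // window_size
    windows = signal[: n_windows * window_size].reshape(n_windows, window_size)
    return np.vstack([time_domain_features(w) for w in windows])

def standardise(features):
    """Rescale each feature column to zero mean and unit variance."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / np.where(sigma > 0, sigma, 1.0)

# Synthetic stand-in for a 12K record long enough to yield 150 windows of 800 points.
rng = np.random.default_rng(0)
signal_12k = rng.normal(size=150 * 800)
features = standardise(extract_features(signal_12k, window_size=800))
print(features.shape)  # (150, 7)
```

For the 48K data the same functions would be called with `window_size=3200`, so that one sample covers the same time span at both sampling frequencies.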

C. FAULT RECOGNITION

1) FAULT RECOGNITION UNDER 12K
After feature extraction, 150 samples are obtained at 12K for each fault type and for normal operation. At 12K, the accuracy of fault prediction after feature transfer is tested for the 0.007'' case and the 0.021'' case separately. Here, each data set includes three fault classes and one normal class (four-class fault recognition). Table 4 gives the transfer results between each pair of the four speeds (see Table 1) for the 0.007'' case. For example, the entry in Row 2, Column 3 is the accuracy obtained when data features are transferred from Hp 2 to Hp 3. In the same way, Table 5 gives the results for the 0.021'' case. Tables 4 and 5 show that although the feature transfer uses only seven time-domain indicators, its prediction accuracy is high, which supports the effectiveness of the data feature transfer.

2) FAULT RECOGNITION UNDER 48K
Since more data are available at 48K, we test more fault types at 48K. Here we consider six fault types (1, 2, 4, 5, 7, 8) and normal operation (0), that is, a seven-class fault recognition problem, which is more difficult than the four-class problem under 12K. The fault recognition accuracy is tested at three speeds, Hp 2, Hp 3 and Hp 4, and the results are shown in Table 6. The classification accuracy remains high, which further supports the validity of feature transfer for fault recognition.
According to Figure 1, the predicted results under the current working condition can be compared with the actual fault found by maintenance personnel after inspection, which yields the correct label. Training on these correctly labeled data from the current working condition together with the auxiliary data from the similar historical working condition can further improve the accuracy of the model. Now, assume that 10% of the data have been labeled for the current working condition. That is, of the original 150 samples of each class, 15 samples have been labeled by maintenance personnel. Table 7 gives the transfer results between different speeds under 48K after 10% of the samples are labeled. Compared with Table 6, almost all the results improve significantly, which shows that this approach is feasible and again supports the rationality and effectiveness of our method.
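The 10% labeling scenario can be sketched as a data-assembly step. This is an illustration only: the features are random placeholders for TCA-transformed data, the source set size is an arbitrary assumption, and the helper name `labelled_mask` is hypothetical. Only the figures of seven classes, 150 samples per class and 15 labeled samples per class come from the text.

```python
import numpy as np

def labelled_mask(y, fraction, rng):
    """Mark `fraction` of each class as already labelled by maintenance personnel."""
    mask = np.zeros(len(y), dtype=bool)
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        n_pick = int(round(fraction * len(idx)))
        mask[rng.choice(idx, size=n_pick, replace=False)] = True
    return mask

rng = np.random.default_rng(0)
n_classes, per_class, n_features = 7, 150, 3

# Placeholder features standing in for TCA-transformed source and target data.
y_t = np.repeat(np.arange(n_classes), per_class)
Z_t = rng.normal(size=(len(y_t), n_features))
y_s = y_t.copy()
Z_s = rng.normal(size=(len(y_s), n_features))

mask = labelled_mask(y_t, 0.10, rng)

# Training set: all source samples plus the labelled 10% of the target samples.
Z_train = np.vstack([Z_s, Z_t[mask]])
y_train = np.concatenate([y_s, y_t[mask]])

# Evaluation set: the target samples that are still unlabelled.
Z_test, y_test = Z_t[~mask], y_t[~mask]
print(Z_train.shape, Z_test.shape)
```

As more target samples are labeled during maintenance, the same step can be repeated with a larger mask, so the training set grows without changing the classifier.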

D. COMPARISON WITH OTHER ALGORITHMS
To verify the effectiveness of our TCA-based method, a deep belief network (DBN), a support vector machine (SVM), an artificial neural network (ANN), a Bayesian method, a Bagging method and a Boosting method are compared with the proposed method. The accuracy results of these methods for fault prediction are given in [56]. Our proposed method uses KNN for classification after TCA. To illustrate the contribution of TCA to classification, KNN without TCA is used as the benchmark in the following experiment.
The comparison experiments are conducted on two data sets, termed Group A and Group B. Group A represents transferring the fault prediction model from Hp 1 to Hp 2. Group B represents transferring the fault prediction model from Hp 3 to Hp 4. Both data sets are sampled at 12K. Different from Section IV.C, the comparison experiments in this section consider all fault types (Inner Race, Ball, Outer Race) and all fault diameters (0.007'', 0.014'' and 0.021''). Therefore, there are nine fault classes and one normal class (i.e., ten-class fault recognition).
To eliminate the interference caused by different data, we use the same method as in [56] to generate the experiment data. For each fault class and the normal class, 300 samples are generated for training, and 200 samples are used for testing.
For a clearer expression, Table 8 gives the experiment data settings.  Table 9 gives the comparison results of different methods. It can be clearly seen that the proposed TCA based method

V. CONCLUSION
This paper studies data-driven fault diagnosis and analyzes the difference between classical signal processing-based fault diagnosis and deep learning-based integrated diagnosis. The analysis suggests that the signal processing-based pipeline yields better results and robustness, and that it can be applied to a variety of industrial fault diagnosis problems. By combining the fault diagnosis process with a data feature transfer method, a fault diagnosis framework is proposed. The framework is then applied to fault diagnosis on the CWRU bearing data, which demonstrates the validity and rationality of the proposed framework and the corresponding methods.
At present, the mainstream transfer learning methods require a high degree of similarity between the data domains, which restricts the use of transfer learning in fault diagnosis. Therefore, our future research will focus on fault transfer when the working conditions differ greatly. In addition, traditional transfer learning in fault diagnosis is always one-to-one (i.e., one source domain to one target domain). In many practical cases, there may be more than one historical working condition. How to use the data from several historical working conditions to assist fault diagnosis under the target working condition is an important direction for future research.