Limited Data Rolling Bearing Fault Diagnosis With Few-Shot Learning

This paper focuses on bearing fault diagnosis with limited training data. A major challenge in fault diagnosis is the infeasibility of obtaining sufficient training samples for every fault type under all working conditions. Recently deep learning based fault diagnosis methods have achieved promising results. However, most of these methods require large amount of training data. In this study, we propose a deep neural network based few-shot learning approach for rolling bearing fault diagnosis with limited data. Our model is based on the siamese neural network, which learns by exploiting sample pairs of the same or different categories. Experimental results over the standard Case Western Reserve University (CWRU) bearing fault diagnosis benchmark dataset showed that our few-shot learning approach is more effective in fault diagnosis with limited data availability. When tested over different noise environments with minimal amount of training data, the performance of our few-shot learning model surpasses the one of the baseline with reasonable noise level. When evaluated over test sets with new fault types or new working conditions, few-shot models work better than the baseline trained with all fault types. All our models and datasets in this study are open sourced and can be downloaded from https://mekhub.cn/as/fault_diagnosis_with_few-shot_learning/.


I. INTRODUCTION
Fault diagnosis is widely applied in diverse areas such as manufacturing, aerospace, automotive, power generation, and transportation [1]- [4]. Recently, intelligent fault diagnosis techniques with deep learning have attracted a lot of attention due to their avoidance of dependency on the time-consuming and unreliable human analysis and increased efficiency in fault diagnosis [5], [6]. However, most of these techniques require a large amount of training data. In real-world fault diagnosis, the signals of the same faults often bear large difference between different working conditions, leading to a major challenge in fault diagnosis: it is often impossible to obtain sufficient samples to make the classifier robust for every fault types. This situation can arise for several reasons: (1) industry systems are not allowed to run into faulty states The associate editor coordinating the review of this article and approving it for publication was Li Zhang. due to the consequences, especially for critical systems and failures; (2) most electro-mechanical failures occur slowly and follow a degradation path such that failure degradation of a system might take months or even years, which makes it difficult to collect related datasets [7]. (3) working conditions of mechanical systems are very complicated and frequently change from time to time according to production requirements. It is unrealistic to collect and label enough training samples [8]. (4) especially in real-world applications, fault categories and working conditions are usually unbalanced. It is thus difficult to collect enough samples for every fault type under different working conditions.
There are some studies about limited data fault diagnosis. Hang et al. [9] proposed the principal component analysis (PCA) and applied it in the field of high-dimensional unbalanced fault diagnosis data. Duan et al. [10] applied a new support vector data description method for machinery fault diagnosis with unbalanced datasets. In recent years, VOLUME 7, 2019 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/ deep learning has achieved impressive results in application areas such as computer vision, image and video processing, speech recognition, and natural language processing [11]. Deep learning methods have also been applied to fault diagnosis and obtained state-of-the-art results [6] using techniques such as auto-encoders (AE) [12]- [15], restricted boltzmann machine (RBM) [16]- [19], convolutional neural networks (CNNs) [20]- [24], recurrent neural networks (RNNs) [25]- [27], transfer learning based neural networks [28]- [31]. and generative adversary networks GANs e.g. Sun et al. [12] applied sparse auto-encoder (SAE) neural network and achieve superior performance for feature learning and classification in the field of induction motor fault diagnosis. Li et al. [16] [32] proposed a feature-based transfer neural network (FTNN). To address the unbalanced data in fault diagnosis, Cabrera et al. [33] use GANs model to assess the data distribution for every minority faulty mode to synthetically increase its size. The ability of deep neural networks to learn low-level and high-level features from abundant datasets has been well known and has been widely exploited in fault diagnosis [5], [6]. However, except transfer learning based methods which can address the time-changing working conditions issue and GANs methods which can address the unbalanced data issue, most of these intelligent fault diagnosis methods based on deep neural networks have not addressed one of the major challenges: limited fault samples for one or more fault types under all working conditions.
Recently, few-shot learning based deep neural networks have made great progress in addressing the data scarcity issue [34]- [37] and has become an exciting field of machine learning. Few-shot learning was first addressed in the 1980s [38]. Fei-Fei et al. [34] developed a variational Bayesian framework for one-shot image classification using the premise that previously learned classes could be leveraged to help forecast future ones when very few examples are available from a given category. Wolf et al. [35] chose to focus on a metric learning approach using a standard bag of features representation to learn a similarity kernel for image classification of insects. Wu et al. [36] address one-shot learning in the context of path planning algorithms for robotic actuation. Koch et al. [37] proposed siamese neural networks for one-shot image recognition. Vinyals et al. [39] applied matching networks for one-shot learning by employing the ideas from metric learning based on deep neural features and from recent advances that augment neural networks with external memories. Altae-Tran et al. [40] applied one-Shot learning for limited data drug discovery. Snell et al. [41] proposed prototypical networks for few-shot learning by learning a metric space in which classification can be performed by computing distances to prototype representations of each class. Qiao et al. [42] proposed few-shot image recognition by Predicting Parameters from activations. Zhang et al. [43] proposed a conceptually simple and general framework called MetaGAN for few-shot learning problems. Different from semi-supervised few-shot learning, their algorithms can deal with semi-supervision at both sample-level and task-level. However, despite the success of one-shot and few-shoting learning in other applications, to our knowledge, these methods have not been applied to solving the critical sample scarcity issue in rolling bearing fault diagnosis.
In this paper, we proposed a few-shot learning neural network approach for rolling bearing fault diagnosis with limited data. Our contribution in this paper includes: (1) We proposed a few-shot learning approach for bearing fault diagnosis with limited data, which is achieved by developing a siamese neural network model based on deep convolutional neural networks with wide firstlayer kernels (WDCNN) [8].
(2) We demonstrated for the first time that few-shot learning based diagnosis models can boost the performance of fault diagnosis by making full use of the same or different class sample pairs and recognize the test sample from the classes that have only a single or few samples. For example, in small training set with 90 training samples, our method can achieve an accuracy of 92.56% compared to 80.36% without using few-shot learning strategy. (3) As the number of training samples increases, the test performance does not monotonically increase when the test data set has a significant difference from training data set.
This rest of the paper is organized as follows: Section II describes the few-shot learning based fault diagnosis algorithm. Section III presents the experiments, results and discussion. Section IV concludes the paper.

A. FEW-SHOT LEARNING GENERAL STRATEGY
The few-shot learning general strategy is shown in Figure 1. It is based on multiple applications of one-shot learning. First, it trains a model with a collection of sample pairs with the same or different categories. Input is a sample pair with the same or different classes ( that two input samples are the same. Unlike traditional classification, the performance of few-shot learning is typically measured by N -shot K -way testing as shown in Figure 1(c).
In one-shot K -way testing, it is given a test sample x to classify and a support set S as shown in Equation 1, which contains K samples and each sample has a distinct label y.
And then it classifies the test sample according to the most similar sample in the support set as shown in Equation 2.
In N -shot K -way testing, the model is given a support set consisting of K different classes each having N samples (S 1 , . . . , S N ). And then the model has to determine which of the support set classes the test sample should belong to as shown in Equation 3.
The few-shot learning for bearing fault diagnosis is based on multiple applications of one-shot learning. As shown in Figure 2, there are three steps in few-shot learning: data preparation (top), model training (left), and model testing (right). The details of data preparation will be introduced in Section III. The input of model training is the collection of samples pair with the same or different category labels. The output of model training is a probability distance to judge if the pair are of the same or different category. The details of model testing will be explained in below subsection. Figure 3 shows our few-shot learning model for bearing fault diagnosis, which is a siamese neural network [37] based on deep convolutional neural networks with wide first-layer kernels (WDCNN) [8]. In this network model, two identical WDCNN networks are set up with the same network architecture and shared weights. The WDCNN network architecture as detailed in Table 1 is setup the same as [8]. It first uses wide kernels to extract features, and then uses small kernels to acquire better feature representation. This design strategy is due to the fact that designing a model with all small kernels is unrealistic and small kernels at the first layer can be easily disturbed by high frequency noise common in industrial environments.

1) INPUT
The input data is a sample pair of the same or different classes. We directly use the deep convolutional neural network (WDCNN) to extract features from the raw vibration signals. This is called end-to-end deep learning approach.

2) DISTANCE METRIC
Let M indicate the minibatch size and i, the ith minibatch. The twins sub-networks are optimised based on the distance metric between their outputs, which is calculated by Equation (4) where f is the WDCNN neural network.
The output is the ''distance'' of the feature vector outputs from the WDCNN twins in terms of whether their outputs are considered quite similar versus being quite dissimilar. It is obtained by Equation (5) which represents the probability that two input samples are the same, where sigm is the sigmoid function and FC is a dense fully connected layer.

4) LOSS FUNCTION
Let t j = y(x i 1j , x i 2j ) be a length-M vector which contains the labels for the minibatch. We let t i j = 1 whenever x i 1j and x i 2j are from the same fault class and t i j = 0 otherwise, where j is the jth sample pair from ith minibatch. The loss function is a regularized cross-entropy as: The network is optimized by Adam optimizer, which computes individual adaptive learning rates for each parameter. The parameters are updated by: where w (T ) means the parameters at epoch T, L (t) is the loss function, β 1 and β 2 are the forgetting factors for the first and second moments of gradients respectively, and m and v are moving averages.

6) TRAINING
The few-shot learning training for fault diagnosis are shown on the left in Figure 2. The input is a sample pair with the same or different clasess ( Equation 5, then the loss is calculated by Equation 6 and the model is optimised by Equation 7.

7) N-SHOT K -WAY TESTING
We use multiple one-shot K -way testings to simulate N -shot K -way Testing. As for five-shot N-way testing task, it is the same as N -shot K -way testing task. We repeat one-shot K -way testing five times as the five-shot data support set while each time the data support set S is randomly selected from the training data. After five times of one-shot N-way testing, we get five probability vector (P 1 , P 2 , P 3 , P 4 , P 5 ) and calculate the maximum sum of probability of the same label by Equation 3.

III. EXPERIMENTS AND RESULTS
To verify the performance of our few-shot learning algorithm for limited data fault diagnosis, we selected 12k drive end bearing fault data in the Case Western Reserve University (CWRU) Bearing Datasets [44], [45] as the original experiments data. As shown in Table 2, there are four types of the bearing fault location: normal, ball fault, inner race fault, and outer race fault. Each fault type contains three types: 0.007 inches, 0.014 inches, and 0.021 inches respectively, so we have ten types fault labels in total. Each fault label contains three type loads of 1, 2 and 3 hp (motor speeds of 1772, 1750 and 1730 RPM respectively).
In experiments, each sample is extracted from two vibration signals as shown in Figure 2  samples in total. In the following experiments, comparing with the baseline WDCNN method, we will discuss the effect of the proposed few-shot learning method to address the limited data fault diagnosis challenge. For following experiments, the detail compares baseline WDCNN and few-shot (One-shot and Five-shot) methods are shown in Table 3. It should be noted that the A, B, C, and D datasets are used to generate the data set in different experiments.

A. EFFECT OF THE NUMBER OF TRAINING SAMPLES ON PERFORMANCE
In this experiment, we will evaluate the effect of proposed few-shot learning method to address the first two challenges in limited data fault diagnosis: 1) industry systems are not allowed to run into faulty states due to the consequences, especially for critical systems and failures; 2) most electromechanical failures occur slowly and follow a degradation path such that failure degradation of a system might take months or even years [7]. We conducted a series of comparison experiments by setting the test dataset in Dataset D ( Table 2) as the test set and randomly selecting 60, 90, 120, 200, 300, 600, 900, 1500, 6000, 19800 samples respectively from the training samples of the whole dataset D. For every training set, the support vector machines (SVM) method uses whole training sets to fit models. We searched the proper parameters for the SVM algorithm. Other methods use 60% samples as the training set and the rest samples as the validation set. Then we evaluated the effect of the sample number on the performance of each training model. For each training set size, we repeated the sample selection process five times to generate five different training sets to deal with the bias of randomly selected small training sets. For each such random training sample set, we repeated the algorithm training and testing experiment four times to deal with the randomness of the algorithms. Together, for every series of experiments, we repeated twenty times in total.
First we compared the performances of deep learning algorithms and conventional machine learning algoithm, the SVM. As shown in Figure 4, it is clear that the accuracies of deep learning methods are much higher than those of SVM. For training sets with 60 samples, the WDCNN, One-shot, and Five-shot algorithms achieved accuracies of 73.97%, 79.33%, and 82.80% respectively, which are all sharply higher than 18.93%, the accuracy of SVM. For training sets with 90 samples, the WDCNN, One-shot, and Fiveshot algorithms achieved accuracies of 77.39%, 88.41%, and 91.37% respectively, which are all sharply higher than 26.56%, the accuracy of SVM. We also found that the accuracies of SVM, WDCNN and few-shot learning all increase while their standard deviations decline with increasing number of training samples. Next, we check whether few-shot learning algorithm performs better than standard WDCNN in experiments with small number of training samples (e.g. with 60, 90 and 120 samples). For all three training sample sizes, our few-shot model performs better with an average of 9% higher in accuracy than the WDCNN model. Especially when the training set size is set to 90, the few-shot model gets 13% higher in accuracy than the WDCNN model. When the number of training samples is further increased to 200, 300, and 600, results in Figure 4 shows that the performance of few-shot is slightly worse than that of WDCNN. The accuracies of both algorithms are actually very close and are both higher than 94%. When the number of training samples is even increased to 900 or more, the performances of these two algorithms become almost the same and the accuracies of both algorithms are higher than 98%. These performance comparisons show that our proposed few-shot learning algorithm enjoys the much better performance when trained with limited datasets without losing too much when there are aboundant training samples. Furthermore, it is also observed that the accuracies of five-shot learning are consistently better than the accuracies of one-shot learning, as shown in the third and fourth row in Figure 4 table.
To better understand the effect of few-shot learning in limited data diagnosis, Figure 5

B. PERFORMANCE UNDER NOISE ENVIRONMENT
In this experiment, we will evaluate effect of the proposed few-shot learning method to address the third challenge in limited data fault diagnosis: working conditions of mechanical systems are very complicated and change many times from time to time according to production requirement. It is unrealistic to collect and label enough training samples [8]. We discuss the performance under noise environment to simulate the change of working conditions in datasets D. Signalto-noise ratio (SNR) is defined as the ratio of signal power to the noise power, often expressed in decibels detailed as follows: SNR dB = 10 log 10 (P signal /P noise ) (8) where P signal and P noise are the power of the signal and the noise, respectively. In this case, the same as [8], the models are trained with the original data provided by CWRU, then tested with added different SNR white Gaussian test samples. The different SNR ranges from −4 dB to 10 dB. The smaller SNR value is, the stronger power of noise is. Table 4 shows the results of diagnosing noise signal by the proposed few-shot (one-shot and five-shot) learning models and the compared WDCNN model. The greener the background color, the better the result is. It is clear that the accuracy increases and is greener as the noise gets weak, e.g., the average accuracy is only near 40% when the SNR is −4 dB in 19800 training data samples, while the accuracy  surges to above 99% when SNR is 6 dB. And the accuracy increase as the data number for training gets more significant when the noise is not very strong.
As shown in Figure 6, to check the effect of the data number for training under the noise environment, we compare results with different training data number. From results, we easily found that five-shot is better than one-shot. For a small number of training data, when the features between training and test set are similar, the few-shot performance is better than the WDCNN. e.g., as shown in Figure 6(a), when the SNR equals to −4 dB as the features between training and test set have a big difference, the few-shot is inferior to the WDCNN. And when the SNR equals to 0 dB as the features between training and test set have some difference, few-shot and WDCNN performance are same. However, when SNR equals to 10 dB as the features between training and test set are similar, the few-shot is much better than the WDCNN, higher over 10%. Besides, when there has sufficient training data, the few-shot is better than the WDCNN. e.g., as shown in  Figure 6(b), the WDCNN is better than few-shot as the training data size increases. However, as shown in Figure 6(c), when the training data size increases to a certain amount, the gap between few-shot and WDCNN is decreasing. And as shown in Figure 6(d), when there is sufficient training data, the few-shot is better than the WDCNN.
In Figure 7, we compare the test results of three models under different noise environments. In Figure 7(a),(b) with 8 dB and 4 dB SNR respectively, there is less difference between the training set and the test set. The test results become better as the number of training samples increases. However, in Figure 7(c),(d) with 0 dB and −4 dB SNR respectively, there is large difference between the training set and the test set. The test performance does not monotonically increase as the number of training samples increases. Because the model trained by a small number of training samples is mainly considered to be under-fitting and the test accuracy increases as the number of training samples increases. Then when the number of training samples increases with the larger difference between training and test sets, the trained model performs well in the training set, but it may easily show overfitting on significantly different test sets which will cause test accuracy decrease.
As mentioned above, as the number of training samples increases, the test performance does not monotonically increase when the test data set has a significant difference from the training data set. Therefore, the appropriate number of training samples could get the best results on significant different test sets.

C. PERFORMANCE UNDER NEW CATEGORIES
In this experiment, we will evaluate effect of the proposed few-shot learning method to address the fourth challenge in limited data fault diagnosis: especially in real-world applications, fault categories and working conditions are usually unbalanced. It is thus difficult to collect enough samples for every fault type under different working conditions. We mainly pay attention to unbalanced fault categories. When new categories appear, traditional deep learning methods  need to retrain to deal with the new categories diagnosis. However, different from the traditional deep learning methods, the few-shot learning model can be directly used in the new category diagnosis by just giving a few new category samples. We train the WDCNN model from all categories and the few-shot model from 30% randomly new categories to all see categories in datasets D. We repeated each such experiment ten times to deal with the randomness of the algorithms.
The results are shown in Figure 8. The accuracy of the fewshot raises with the decrease of number of new categories. The accuracy of five-shot is better than the accuracy of oneshot. For the limited data with 90 training samples, the accuracy of few-shot is higher than the WDCNN model when number of new categories are equal or under 20%. Besides, for enough training samples, the accuracy of few-shot can get over 90% performance when number of new categories are equal or under 20%. Therefore, few-shot learning could enhance performance well under a few new categories. Thus, when new categories appear, the few-shot learning model could be directly used in the new category diagnosis and get better performance.

D. PERFORMANCE UNDER NEW WORK CONDITIONS
In this experiment, we will evaluate effect of the proposed few-shot learning method to address the fourth challenge in limited data fault diagnosis. We mainly pay attention in unbalanced working condition. When new categories appear, we would like to evaluate how well the few-shot performance. We use domain adaptation to simulate new work conditions. The description of scenario setting for domain adaptation is illustrated in Table 5, which the training or test set A, B, and C same as Dataset A, B, and C in Table 2.
As shown in Figure 9, for the limited data with 90 training samples, the average accuracy of few-shot learning method in the six scenarios performs better than the WDCNN method.    However, with enough training samples, the average accuracy of the WDCNN method in the six scenarios is more better than the few-shot learning method. Besides, both these methods perform poorly when training with set C and testing in set A. As shown in Table 6, as set A's speed is higher than set B, we think that set A is more complicated than set B, same as set B and set C. Therefore, when training with lower speed set C and testing in higher speed set A, there has lots features change. Thus it causes the poor performance. It can also be confirmed in Figure 10 results. As shown in Figure 10(a), when training with the complicated set A and testing in set C, the features change a little. Thus the test performance increases as the number of training samples increase. However, as shown in Figure 10(b), when training with the set C and test in the complicated set A, the features change a lot. Thus the test performance does not monotonically increase as the number of training samples increase.
The few-shot learning method could not be better when training with limited data to enough data and the WDCNN method could. However, the average accuracy of WDCNN method not better than the WDCNN trained with 200 labeled data ( Figure 4). We think that it is unreasonable to use the model trained only from one load condition and directly used in a new load, and the effect is not satisfactory. Because the model is trained only for one load situation, it is difficult for the model to learn the knowledge of the changes in the category features caused by load changes. Thus that may cause poor performance under new loads. So we designed the new scenario setting for domain adaptation illustrated    Table 7, by training with the labeled signals under two load and testing in unlabeled signals under new load.
As shown in Figure 11, for the limited data with 90 training samples, the average accuracy of these methods in the new scenario setting is better but not significate than in the old scenario setting. But when there are enough samples which allow to effectively learn the knowledge of the changes in the category features caused by load changes, the average accuracy of both methods perform much better than in the old scenario setting. For the few-shot method, it is higher over than 15%. For WDCNN method, it is higher over than 7%. Besides, in the new scenario setting, we find the few-shot performs better than the WDCNN.

IV. CONCLUSION
This paper presents a few-shot learning approach for rolling bearing fault diagnosis with limited data. Our algorithm addresses one of the major challenges in limited data fault diagnosis: the difficulty of obtaining sufficient numbers of samples in data-driven fault diagnosis. Our few-shot fault diagnosis model is based on the siamese neural network for one-shot learning. It works by exploiting sample pair of the same or different categories, measuring the ''distance'' of two WDCNN twins feature vector outputs in terms of whether their outputs are considered quite similar versus dissimilar.
Our method was validated on the CWRU Datasets with extensive experiments by comparing its performance with baseline (the popular WDCNN fault diagnosis model).
The experimental results showed that few-shot learning is effective for fault diagnosis with both limited data or sufficient data. By comparing testing results under different noise environments, we found that large difference between the training and test sets may cause the test performance to not monotonically increase as the number of training samples increases. When evaluated over test sets with new fault types or new working conditions, few-shot models work better than the baseline trained with all fault types. Furthermore, to facilitate open research in the field of bearing fault diagnosis, all our models in this work and datasets are open sourced and can be downloaded at the URL (https://mekhub.cn/as/fault_diagnosis_with_few-shot_learning/).