Improved Deep Transfer Auto-Encoder for Fault Diagnosis of Gearbox Under Variable Working Conditions With Small Training Samples

Introducing sufficient auxiliary data is a practical way to solve gearbox fault diagnosis tasks under variable working conditions. For this purpose, a new approach called the improved deep transfer auto-encoder is proposed for intelligent diagnosis of gearbox faults under variable working conditions with small training samples. First, a multi-wavelet is employed as the activation function to effectively learn useful features hidden in the non-stationary vibration data. Second, correntropy is used to modify the cost function to enhance the reconstruction quality. Third, an improved deep auto-encoder is pre-trained using sufficient auxiliary data in the source domain, and its parameters are transferred to the target model. Finally, the improved deep transfer auto-encoder is fine-tuned with small training samples in the target domain to adapt to the characteristics of the remaining testing data. The proposed approach is used to analyze two sets of experimental vibration data collected from gearboxes under variable working conditions. The results show that the proposed approach can accurately diagnose different gearbox faults even when the working conditions change significantly, and that it outperforms the existing methods.


I. INTRODUCTION
Due to their great loading capacity, large reduction ratio, high transmission efficiency and other prominent advantages, gearboxes are very widely applied in aircraft engines, wind turbines and high-speed railways. Different types of faults occur in a gearbox after long-term operation under high temperature, high speed, heavy loading and strong impact, which may lead to safety accidents [1]-[3]. Therefore, gearbox fault diagnosis has become an important part of intelligent maintenance and health management.
Artificial intelligence has attracted increasing attention in recent years for enhancing the automated monitoring and inference capabilities of industrial equipment [4]. For gearbox health monitoring, although intelligent diagnosis research has made gratifying progress [5]-[11], the following problems have still not been well solved. (1) The raw vibration signals collected from a gearbox are always nonlinear and non-stationary with a lot of background noise. In addition, different fault locations and fault severities lead to a diversity of fault types, which places high requirements on signal pre-processing and feature extraction [2]. (2) In consideration of economic cost and human labor, it is hard and unrealistic to obtain enough fault data in engineering practice, which results in a severe lack of training samples for the intelligent diagnosis model [12]. (3) The complexity of working conditions (variable speeds and variable loadings) may lead to significant distribution differences between the training and testing samples, meaning that an intelligent diagnosis model trained with vibration data collected under a certain working condition is usually not suitable for other cases [13]. Thus, to automatically learn the characteristic information hidden in the raw data and realize accurate fault identification under different working conditions, new techniques are urgently needed to improve the existing intelligent diagnosis methods.
The associate editor coordinating the review of this article and approving it for publication was Alicia Fornés.
Due to its powerful and automatic feature learning ability, deep learning has become a widely studied intelligent method for machinery fault diagnosis in the past few years [14]-[24]. However, the successful construction of intelligent diagnosis models with deep structures still depends on sufficient training data [13]. Moreover, the training data and testing data should follow the same distribution. Transfer learning is another great breakthrough in the artificial intelligence area, which aims to solve tasks across different but related domains based on existing knowledge [25]. By combining transfer learning and deep learning, distribution differences between the training data (source domain) and test data (target domain) can be tolerated to some degree. To date, transfer learning has produced several academic achievements in conventional pattern recognition fields [26]-[28]. For intelligent fault diagnosis of rotating machinery, several studies have begun to explore the application of transfer learning in the last three years. Zhang et al. [29] combined transfer learning and a neural network for bearing fault identification under changeable working conditions. Wen et al. [30] proposed a transfer diagnosis approach using a sparse auto-encoder for classifying different fault types of bearings under variable working conditions. Qian et al. [31] constructed a transfer learning network based on high-order Kullback-Leibler divergence to achieve intelligent fault diagnosis of gearboxes and bearings under variable working conditions.
The literature review shows that transfer learning has some potential to overcome the distribution difference problem of gearbox fault data collected under different operating conditions. However, in the transfer diagnosis cases mentioned above, the range of rotating speed variation is very small, so the distribution differences are not serious. In practical engineering, by contrast, the rotating speed and working load of a gearbox usually change greatly [32], leading to significant differences between data samples. Thus, it is of practical importance to build better deep transfer models to achieve gearbox fault diagnosis under obvious changes in working conditions.
In this paper, a new approach based on an improved deep transfer auto-encoder is proposed to diagnose different gearbox faults under variable working conditions with small training samples. First, a multi-wavelet is employed as the activation function to effectively learn useful features hidden in the non-stationary vibration data. Second, correntropy is used to modify the cost function to enhance the reconstruction quality. Then, an improved deep auto-encoder is pre-trained using sufficient auxiliary data in the source domain, and its parameters are transferred to the target model. Finally, the improved deep transfer auto-encoder is fine-tuned with small training samples in the target domain to adapt to the characteristics of the remaining testing samples. The proposed approach is used to analyze two sets of experimental vibration data collected from gearboxes under variable working conditions. The results show that the proposed approach can accurately diagnose different gearbox health conditions even when the working conditions change significantly, and that it performs better than the existing methods.
The rest of this paper is arranged as follows. Section II briefly introduces basic auto-encoder theory. The proposed approach is described in Section III. In Section IV, transfer fault diagnosis cases under variable working conditions are designed to verify the superiority of the proposed approach. Section V gives the final conclusions and future work.

II. BRIEF INTRODUCTION OF BASIC AUTO-ENCODER
As shown in Figure 1, an auto-encoder (AE) is composed of an encoder and a decoder, and it has become a popular base model for constructing various deep architectures due to its good capability for unsupervised feature learning. The encoder tries to obtain a representative feature representation of the input data, and the decoder aims to recover the input from that representation [33]. The main formulas of the basic AE model are as follows:

$$h = s_g\left(w^{(1)} x + b^{(1)}\right) \tag{1}$$

$$z = s_f\left(w^{(2)} h + b^{(2)}\right) \tag{2}$$

in which $x \in \mathbb{R}^N$ denotes an input data sample, $h \in \mathbb{R}^M$ denotes the feature representation, $z \in \mathbb{R}^N$ denotes the output, $\{w^{(1)}, b^{(1)}, w^{(2)}, b^{(2)}\}$ represents the parameter set, including the weights $w^{(1)}, w^{(2)}$ and biases $b^{(1)}, b^{(2)}$ in different layers, $s_g$ denotes the activation function of the encoder, usually selected as Sigmoid, and $s_f$ denotes the activation function of the decoder, which is selected according to the specific normalized range of the input data.
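As an illustration of Eqs. (1) and (2), the following is a minimal NumPy sketch of one forward pass through a basic AE. The function names, the layer sizes and the random initialization are assumptions made for this example only, and Sigmoid is used for both $s_g$ and $s_f$ purely for simplicity.

```python
import numpy as np

def sigmoid(a):
    """Logistic activation, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def ae_forward(x, w1, b1, w2, b2, s_g=sigmoid, s_f=sigmoid):
    """Encode x into h (Eq. 1), then decode h into z (Eq. 2)."""
    h = s_g(w1 @ x + b1)   # feature representation, shape (M,)
    z = s_f(w2 @ h + b2)   # reconstruction of the input, shape (N,)
    return h, z

# Illustrative sizes only: N input dimensions, M hidden nodes.
rng = np.random.default_rng(0)
N, M = 8, 3
x = rng.random(N)
w1, b1 = rng.normal(scale=0.1, size=(M, N)), np.zeros(M)
w2, b2 = rng.normal(scale=0.1, size=(N, M)), np.zeros(N)
h, z = ae_forward(x, w1, b1, w2, b2)
```

Training would then adjust `w1`, `b1`, `w2` and `b2` so that `z` approaches `x`.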
The training purpose of the AE model is to adjust the parameters so that the output stays as close as possible to the input. The most widely used cost function, referred to as (3), is given in [3]. In it, $x_i$ and $z_i$ are the $i$th dimension elements of $x$ and $z$, respectively, $r$ denotes the sparsity penalty coefficient, $\mu$ denotes the sparsity coefficient, $\hat{\mu}_j$ denotes the average activation value of the $j$th hidden node, and $\lambda$ denotes the weight decay coefficient.
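To make the roles of these symbols concrete, the sketch below implements one common form of the sparse auto-encoder cost: a squared reconstruction error, a Kullback-Leibler sparsity penalty weighted by $r$, and a weight decay term weighted by $\lambda$. The exact normalization and the form of the sparsity penalty are assumptions here, not a transcription of the cost function in [3]; the default values of `r`, `mu` and `lam` are the settings reported later in Section IV.

```python
import numpy as np

def kl_bernoulli(mu, mu_hat):
    """KL divergence between Bernoulli(mu) and Bernoulli(mu_hat), per hidden node."""
    return mu * np.log(mu / mu_hat) + (1 - mu) * np.log((1 - mu) / (1 - mu_hat))

def sparse_ae_cost(X, Z, H, weights, r=4.0, mu=0.08, lam=0.002):
    """Assumed form: reconstruction error + r * sparsity penalty + weight decay.

    X, Z: (samples, N) inputs and reconstructions; H: (samples, M) hidden activations.
    """
    recon = 0.5 * np.mean(np.sum((X - Z) ** 2, axis=1))
    mu_hat = np.clip(H.mean(axis=0), 1e-8, 1 - 1e-8)   # average activation per node
    sparsity = r * np.sum(kl_bernoulli(mu, mu_hat))
    decay = 0.5 * lam * sum(np.sum(w ** 2) for w in weights)
    return recon + sparsity + decay

# A perfect reconstruction with activations exactly at the target sparsity
# and zero weights gives a cost of (numerically) zero.
X = np.ones((5, 4))
H = np.full((5, 3), 0.08)
cost_perfect = sparse_ae_cost(X, X.copy(), H, [np.zeros((3, 4))])
```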

III. THE PROPOSED METHOD
A. MULTI-WAVELET ACTIVATION FUNCTION
Generally, the activation function employed in the hidden layer of a basic AE is Sigmoid or Tanh. Their main problems are computational complexity and gradient vanishing, which result in low-efficiency weight updating. The rectified linear unit (ReLU) is fast and can avoid gradient vanishing [13]. However, its non-zero-centered output and the dying neuron problem degrade the training performance [34].
More importantly, the raw vibration signals collected from a gearbox are always non-stationary with complex noise, and studies have shown that neural networks designed with conventional activation functions usually fail to achieve an exact mapping between multiple output patterns and non-stationary input data [35].
The wavelet neural network (WNN) has good time-frequency localization and zooming properties. Compared with conventional neural networks, the superiority of WNN for analyzing non-stationary signals has been verified in many classification and regression cases. The multi-wavelet neural network (MWNN) is an extension of WNN, which has faster convergence and better characteristics in the approximation of non-stationary signals [36]-[38]. To date, few studies have reported multi-wavelet activation functions applied in the deep learning field. Therefore, it is worth trying to design novel deep learning models using multi-wavelets to solve this task.
Currently, several multi-wavelets with excellent properties have been developed, such as the GHM multi-wavelet, CL multi-wavelet and SA4 multi-wavelet. However, the scaling functions of these multi-wavelets have no explicit expressions, which greatly increases the difficulty of calculating and updating the parameters of the auto-encoder model. The scaling functions of the multi-wavelet developed by Plonka and Strela not only have explicit expressions, but also hold good properties such as orthogonality, regularity, symmetry and compact support [38], which makes them good choices as activation functions of the auto-encoder to improve the analysis performance for non-stationary vibration signals collected from a gearbox. In this paper, the two scaling functions of this multi-wavelet, $\phi_1(t)$ and $\phi_2(t)$, are used, and their waveforms are given in Figure 2.
The structure of the improved AE model designed with the multi-wavelet activation function can be seen in Figure 3. Based on the two multi-wavelet scaling functions, for an input data sample $x$, the output of hidden node $j$ consists of $h_j^{out1}$ and $h_j^{out2}$, which refer to the output portions of $\phi_1(t)$ and $\phi_2(t)$ for hidden node $j$, respectively, and $h_j^{out}$ is the final output of the multi-wavelet functions. Here, $x_i$ is the $i$th dimension element of $x$, $W_{ij}$ is the weight between hidden node $j$ and input node $i$, $U_{ij-1}$ and $U_{ij-2}$ are the weights between hidden node $j$ and output node $i$ based on $\phi_1(t)$ and $\phi_2(t)$, respectively, and $a_j$ and $c_j$ are the scale factor and shift factor, respectively.
The activation function of the output layer is selected as Tanh, and the reconstructed output $z$ is calculated from the hidden-node outputs, where $z_i$ refers to the $i$th dimension element of $z$, and $M$ refers to the number of hidden nodes.
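The data flow through the improved AE can be sketched as follows. This is a hedged illustration only: the two functions `phi1` and `phi2` are simple compactly supported stand-ins (one symmetric, one antisymmetric) and are not the Plonka-Strela scaling functions, and the way the scaled, shifted weighted input is formed and the two branches are combined at the output is an assumption inferred from the symbol definitions above.

```python
import numpy as np

def phi1(t):
    """Placeholder symmetric scaling function with support [-1, 1] (assumption)."""
    return np.maximum(0.0, 1.0 - np.abs(t))

def phi2(t):
    """Placeholder antisymmetric scaling function with support [-1, 1] (assumption)."""
    return t * np.maximum(0.0, 1.0 - np.abs(t))

def improved_ae_forward(x, W, a, c, U1, U2):
    """Hidden outputs from two scaling functions, then a Tanh output layer.

    x: (N,) input; W: (N, M) input-to-hidden weights W_ij;
    a, c: (M,) scale and shift factors; U1, U2: (N, M) hidden-to-output
    weights for the phi1 and phi2 branches.
    """
    t = (x @ W - c) / a                 # scaled, shifted weighted input of each hidden node
    h1, h2 = phi1(t), phi2(t)           # the two output portions of hidden node j
    z = np.tanh(U1 @ h1 + U2 @ h2)      # reconstructed output
    return h1, h2, z

rng = np.random.default_rng(1)
N, M = 6, 4
x = rng.normal(size=N)
W = rng.normal(scale=0.2, size=(N, M))
a, c = np.ones(M), np.zeros(M)
U1 = rng.normal(scale=0.2, size=(N, M))
U2 = rng.normal(scale=0.2, size=(N, M))
h1, h2, z = improved_ae_forward(x, W, a, c, U1, U2)
```

In training, `W`, `U1`, `U2`, `a` and `c` would all be updated by gradient descent, which is why explicit expressions for the scaling functions matter.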

B. MODIFIED COST FUNCTION
The widely used cost function of the basic AE model given in (3) is sensitive to complex noise when learning features from non-stationary signals [3]. Correntropy, a robust measure criterion [39], focuses on the local similarity between two random vectors and has shown advantages in dealing with complex noisy signals. Here, correntropy, computed with a Gaussian kernel of kernel size $\sigma$, is used to modify the cost function to further reduce the reconstruction error. To avoid over-fitting, a weight decay term is also added to the cost function, which gives the modified cost function in (13). The weight parameters of the improved AE are adjusted through iterative stochastic gradient descent with learning rate $\eta$ by minimizing the modified cost function in (13).

C. IMPROVED DEEP TRANSFER AUTO-ENCODER
It is necessary to increase the depth of the improved AE model and introduce a softmax classifier at the highest level, so as to refine the quality of the learned features and provide classification ability at the same time. Specifically, each individual improved AE model is pre-trained in an unsupervised way by minimizing the modified loss function, and then the learned features of the previous improved AE model are fed into the input layer of the next improved AE model. The feature representations given by the last improved AE model are used as the input vector for the softmax classifier. Figure 4 shows the layer-by-layer construction process of the improved deep auto-encoder model with three improved AEs, in which Features I, II and III are learned from Improved AE 1, 2 and 3, respectively. More details about the construction of deep auto-encoders can be found in [16].
The improved deep transfer auto-encoder combines the improved deep auto-encoder and the idea of parameter transfer. The specific process is described as follows.
(1) Train an improved deep auto-encoder (containing a softmax classifier), denoted as Deep model (S), using the training samples in the source domain.
(2) Test the performance of the trained Deep model (S) with the testing samples in the source domain.
(3) Design another improved deep auto-encoder model, denoted as Deep model (T), with exactly the same structure as Deep model (S).
(4) Transfer the existing parameter knowledge of Deep model (S) to initialize Deep model (T), i.e., $W^{(T)} = W^{(S)}$, $U^{(T)} = U^{(S)}$.
(5) Fine-tune Deep model (T) using small training samples from the target domain to adapt to the characteristics of the remaining testing data.
By now, the construction of the improved deep transfer auto-encoder has been completed. It can be used for transfer diagnosis of gearbox faults under variable working conditions, and the flowchart is given in Figure 5.
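For the correntropy criterion of Section III-B, a small numerical sketch may help show why it is robust to noise. It uses the standard Gaussian-kernel estimator of correntropy and turns it into a cost by negating it and adding weight decay. That combined form, and the function names, are assumptions for illustration and may differ from the modified cost function in (13); the default `sigma` and `lam` are the settings reported in Section IV.

```python
import numpy as np

def gaussian_kernel(e, sigma):
    """Gaussian kernel with kernel size sigma, evaluated on the error e."""
    return np.exp(-e ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)

def correntropy(x, z, sigma=0.5):
    """Sample estimator of correntropy between input x and reconstruction z."""
    return np.mean(gaussian_kernel(x - z, sigma))

def correntropy_cost(x, z, weights, sigma=0.5, lam=0.002):
    """Assumed cost: negative correntropy (to be minimized) plus weight decay."""
    decay = 0.5 * lam * sum(np.sum(w ** 2) for w in weights)
    return -correntropy(x, z, sigma) + decay

x = np.zeros(10)
z_small = x.copy(); z_small[0] = 10.0      # one moderate outlier
z_large = x.copy(); z_large[0] = 1000.0    # one extreme outlier

# Squared error grows without bound with the outlier size ...
mse_small, mse_large = np.mean((x - z_small) ** 2), np.mean((x - z_large) ** 2)
# ... whereas the correntropy of both cases is practically identical,
# because the kernel saturates for large errors.
c_small, c_large = correntropy(x, z_small), correntropy(x, z_large)
```

Because each sample's contribution is bounded by the kernel peak, a few heavily corrupted points cannot dominate the cost the way they do with the squared error.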

IV. CASE STUDY
CASE 1: TRANSFER DIAGNOSIS BETWEEN DIFFERENT WORKING CONDITIONS
A. EXPERIMENTAL GEARBOX DATA DESCRIPTION
In this case study, gearbox fault data provided by the PHM 2009 Data Challenge are used to test the feasibility of the proposed approach [40]. Four spur gears are installed in the gearbox for simulating different health conditions, as shown in Figure 6. Vibration data are collected at a 66.67 kHz sampling frequency under five shaft speeds (30, 35, 40, 45 and 50 Hz) and two loadings (high and low). Some abbreviations are used for simplicity, i.e., 30L means the data from the working condition with 30 Hz (1800 rpm) shaft speed and low loading, and 50H means 50 Hz (3000 rpm) and high loading.
The vibration data from the input shaft are used in this case study. The source domain dataset is created from the vibration data collected under 30L, and the target domain data are from 50H. Eight gearbox health conditions are created under different working conditions, including one normal condition and seven types of combined fault conditions, listed in Table I.
Each health condition in the source domain has 145 samples, 120 of which are training samples, while each health condition in the target domain contains only 10 training (fine-tuning) samples. Each sample refers to a signal segment including 6000 sampling points with 70% (4200 points) overlap. The details about the source domain and target domain can be seen in Table II.
Data samples of the eight health conditions (after removing the mean) are plotted in Figure 7. There seems to be little similarity between the data samples from the source domain and the target domain, meaning that obvious changes of working conditions lead to a serious distribution discrepancy.
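The sample construction described above (6000-point segments with 70% overlap, i.e. a hop of 1800 points) can be sketched as below. The function name, the `offset` argument and the synthetic signal length are assumptions for the example; the signal length is chosen only so that exactly 145 segments fit.

```python
import numpy as np

def segment(signal, length=6000, overlap=0.7, offset=0):
    """Cut a 1-D signal into fixed-length segments with fractional overlap."""
    step = int(round(length * (1.0 - overlap)))      # 1800 points for 70% overlap
    starts = range(offset, len(signal) - length + 1, step)
    return np.stack([signal[s:s + length] for s in starts])

# Synthetic stand-in for one recorded channel (not the PHM 2009 data).
signal = np.arange(6000 + 144 * 1800, dtype=float)
samples = segment(signal)
```

Here `samples` has shape `(145, 6000)`, and consecutive rows share 4200 points.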

B. COMPARISONS WITH OTHER DEEP LEARNING METHODS WITHOUT TRANSFER STRATEGY
In order to verify the superiority of the transfer learning strategy, some existing deep learning techniques are used for comparison, including the basic DAE (deep auto-encoder with Sigmoid), DBN (deep belief network) and CNN (convolutional neural network). The following two things should be noted:
(1) The proposed method is first trained with 120 training samples from the source domain, and then fine-tuned with 10 training samples from the target domain. After that, it is used to analyze the remaining 20 testing samples.
(2) For all the comparative methods, the training and testing samples are both from the target domain (without parameter transfer). The numbers of training samples are 10, 40 and 100, respectively, while the number of testing samples is always set to 20.
A total of 10 repeated validations are carried out to examine both accuracy and stability. For each method, the input is the normalized form of the raw vibration data (6000-dimensional). It can be seen from Table III that the average testing accuracy given by the proposed method is 93.06% (1489/1600). The average accuracies of the nine comparative methods are 41.81%, 74.06%, 82.44%, 42.31%, 70.19%, 80.25%, 39.13%, 71.25%, and 89.31%, respectively, which are lower than that of the proposed method. Besides, the standard deviation given by the proposed approach is 0.6215, which is smaller than those of all the comparative approaches, meaning that the proposed approach has better stability. From the comparison results, the following two conclusions can be drawn. (1) The diagnosis performance of deep learning techniques depends heavily on sample size. Without large amounts of training samples, deep learning techniques usually fail to give satisfactory results. (2) The proposed method is more effective than other deep learning techniques without a transfer learning strategy. The main reason is that training a good deep neural network from scratch is difficult and time-consuming because many weights and biases are randomly initialized. In the proposed method, the number of adjusted parameters can be greatly reduced and a reasonable initialization can be achieved by transferring the parameters of the model pre-trained with the source domain data to the target model. To enable the target model to further adapt to the characteristics of the testing samples in the target domain, small target training samples are then used to fine-tune the well pre-trained model.
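The parameter transfer and fine-tuning strategy discussed above (steps (4) and (5) of Section III-C) can be sketched in a few lines. This is a simplified illustration under stated assumptions: the network is a tiny Tanh network with a softmax output rather than the improved deep auto-encoder, the "pre-trained" source model is just randomly initialized, the sizes are arbitrary, and only the softmax layer is updated during fine-tuning to keep the example short.

```python
import copy
import numpy as np

def init_model(sizes, rng):
    """List of layers, each a dict with weight matrix W and bias b."""
    return [{"W": rng.normal(scale=0.1, size=(m, n)), "b": np.zeros(m)}
            for n, m in zip(sizes[:-1], sizes[1:])]

def forward(model, x):
    """Return the last hidden features and the softmax class probabilities."""
    a = x
    for layer in model[:-1]:
        a = np.tanh(layer["W"] @ a + layer["b"])
    logits = model[-1]["W"] @ a + model[-1]["b"]
    e = np.exp(logits - logits.max())
    return a, e / e.sum()

def fine_tune_step(model, x, label, lr=0.1):
    """One gradient step on the softmax layer only (cross-entropy loss)."""
    feat, p = forward(model, x)
    grad = p.copy()
    grad[label] -= 1.0                      # d(loss)/d(logits)
    model[-1]["W"] -= lr * np.outer(grad, feat)
    model[-1]["b"] -= lr * grad

rng = np.random.default_rng(0)
source = init_model([20, 10, 5, 8], rng)    # stands in for the pre-trained Deep model (S)
target = copy.deepcopy(source)              # step (4): initialize Deep model (T) from (S)

x_t, y_t = rng.normal(size=20), 3           # one hypothetical target-domain sample
p_before = forward(target, x_t)[1][y_t]
for _ in range(25):                         # step (5): a few fine-tuning iterations
    fine_tune_step(target, x_t, y_t)
p_after = forward(target, x_t)[1][y_t]
```

Using a deep copy keeps the source model untouched, so the same pre-trained parameters can initialize several target models.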
The specific structure of the proposed method is 6000-2000-800-150-8, meaning that 2000, 800 and 150 nodes exist in the first, second and third hidden layers, respectively, which is determined by experimentation with a simple idea similar to [16]. The iteration numbers in the pre-training process and fine-tuning process are 60 and 25, respectively. The parameters $r$, $\mu$, $\lambda$ and $\sigma$ are 4, 0.08, 0.002 and 0.5, respectively. Most of these parameters are decided through experimentation. The structure of the basic DAE is also 6000-2000-800-150-8, the iteration numbers in the pre-training process and fine-tuning process are both set to 100, and the parameters $r$, $\mu$ and $\lambda$ are selected as 4, 0.08 and 0.002, respectively. The structure of the basic DBN is 6000-2000-800-150-8, and the learning rate, iteration number and momentum are set to 0.15, 100 and 0.85, respectively. The basic CNN, LeNet-5, consists of an input layer, two convolutional layers, two pooling layers and an output layer [41]. No further techniques are used to improve the basic DAE, DBN and CNN.
[42]. The statistical results of 10 validations are given in Figure 8 and Table IV. The average accuracy on the testing samples of the proposed approach is 93.06% (1489/1600), which is higher than those of the other 11 kinds of deep transfer models. Taking the third validation as an example, the confusion matrix is shown in Figure 9. Thus, the proposed deep transfer model has higher accuracy and better stability than other deep transfer models faced with the same transfer diagnosis task.
The good performance of the proposed deep transfer model mainly benefits from the replaced multi-wavelet activation function. Taking the third validation as an example, Figure 10 shows the reconstruction error curves of these models in the pre-training process. It can be observed that deep transfer model 1 gives a smaller reconstruction error and converges faster than the others.

D. CONSIDERATION FOR TIME-DELAY
As mentioned before, each sample refers to a signal segment including 6000 sampling points, and two consecutive samples have 70% (4200 points) overlap, meaning that the first sample is [1, 6000] (from the 1st data point to the 6000th data point), the second is [1801, 7800], and so on. In order to fully test the proposed method, the time-delay problem of time-series data is considered. The following two things should be noted:
(1) For source samples, the first sample is still [1, 6000], the second is [1801, 7800], and so on. Each condition has 145 samples, 120 of which are training samples.
(2) For target samples, the first sample is [1+n, 6000+n], the second is [1801+n, 7800+n], and so on. The time-delay parameter n ranges from 50 to 3000. Each condition has 10 samples for fine-tuning and 20 samples for testing.
Figure 11 shows the diagnosis results (average accuracy) of the proposed method under different time-delay parameters (50 to 3000). From Figure 11, it can be seen that the average testing accuracy of the proposed method decreases (from 93% to 86%) as the time delay increases. The reason is that distribution differences exist between training samples and testing samples, and the time delay leads to larger differences. However, even when the time-delay parameter reaches 3000, the result given by the proposed method is still higher than 86%, because small training samples in the target domain are used to fine-tune the pre-trained model to adapt to the characteristics of the remaining testing samples.
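The index arithmetic behind these windows is easy to check in code. With 6000-point samples and a 4200-point overlap the hop is 1800 points, so the k-th sample starts at 1 + (k - 1) * 1800 and ends 5999 points later; a time delay n simply shifts both ends. The helper name below is an assumption for this example.

```python
def sample_range(k, length=6000, overlap_points=4200, delay=0):
    """1-based inclusive [start, end] of the k-th sample (k = 1, 2, ...)."""
    step = length - overlap_points          # hop between consecutive samples
    start = 1 + (k - 1) * step + delay
    return start, start + length - 1

source_second = sample_range(2)             # source domain, no delay
target_second = sample_range(2, delay=50)   # target domain with n = 50
```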

CASE 2: TRANSFER DIAGNOSIS FROM CONSTANT SPEED TO VARIABLE SPEEDS
The transfer diagnosis task in CASE 1 actually belongs to piecewise variable working conditions. In practical engineering, however, dynamic working regimes are more common. Thus, the feasibility of the proposed method for transfer fault diagnosis from constant speed to variable speeds is considered in CASE 2. The experimental device, shown in Figure 12, mainly consists of a motor, the tested gear (37 teeth) and a bearing (SKF 6307). Four types of vibration data from the gear and bearing are collected with a sampling frequency of 8192 Hz, including tooth breakage, tooth breakage & outer race fault, tooth breakage & inner race fault and tooth breakage & ball fault.
Due to the lack of engineering data, the vibration data collected under constant speed (600 rpm) are treated here as lab data, and the data from variable speeds are used to simulate industrial on-site data. The details of the variable speeds for the four fault conditions are shown in Figure 13. It can be seen that the changing patterns of the rotating speeds are different. Each fault condition in the source domain has 158 samples, 140 of which are training samples, while in the target domain each fault condition contains only 30 training (fine-tuning) samples and 40 testing samples. It should be noted that for each fault condition, the 30 target training samples are selected from three different positions (beginning, middle and end) in the collected signal. Each sample consists of 1024 data points with 50% overlap. Raw signals of the four fault conditions in this case are plotted in Figure 14. From Figure 14, it can be seen that the target domain samples show strong non-stationary characteristics due to fluctuation of the rotating speed during the data acquisition process, leading to significant differences from the source domain.
In this case study, the fast Fourier transform, a well-known signal processing technique, is applied to acquire the frequency spectrum (512-dimensional) of each data sample, so as to reduce the differences among data samples caused by changeable speeds. Taking three samples from different speeds as examples, their frequency spectrums are shown in Figure 15. A total of 10 repeated validations are carried out to compare the diagnosis performance among deep transfer models 1, 2, 7, 8, 9 and 10, as listed in Table V. It can be seen that: (1) The average testing accuracies of all the methods based on the frequency spectrum are higher than those based on the raw data.
(3) Although the proposed method gives the best diagnosis results, its accuracy does not reach the level achieved in CASE 1. In order to further improve the accuracy, more advanced techniques should be introduced, such as nuisance attribute projection and order tracking. In summary, with the help of the multi-wavelet, the modified cost function and the parameter transfer idea, the diagnosis knowledge learned from constant speed working conditions can be transferred to variable speeds to some degree.
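The frequency-spectrum input used in this case (a 512-dimensional spectrum for each 1024-point sample at 8192 Hz) can be sketched as follows. Which 512 bins are retained and how the amplitude is scaled are assumptions in this example: the one-sided FFT of 1024 points yields 513 bins, and the Nyquist bin is dropped here.

```python
import numpy as np

def spectrum_512(sample, fs=8192):
    """One-sided amplitude spectrum of a 1024-point sample, truncated to 512 bins."""
    n = len(sample)
    if n != 1024:
        raise ValueError("expected a 1024-point sample")
    amp = np.abs(np.fft.rfft(sample)) / n        # 513 bins, from 0 Hz to fs/2
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)       # bin spacing fs/n = 8 Hz
    return freqs[:512], amp[:512]                # drop the Nyquist bin (assumption)

# Synthetic check signal: an 80 Hz sine falls exactly on bin 10.
t = np.arange(1024) / 8192
freqs, amp = spectrum_512(np.sin(2 * np.pi * 80.0 * t))
peak_bin = int(np.argmax(amp))
```

With an 8 Hz bin spacing, each sample's spectrum covers 0 Hz up to 4088 Hz.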

V. CONCLUSIONS
In this paper, a multi-wavelet activation function and a modified cost function are used to enhance the deep auto-encoder. Based on the improved deep auto-encoder and parameter transfer, an improved deep transfer auto-encoder is proposed to diagnose gearbox faults under variable working conditions with small training samples.
Two sets of experimental gearbox vibration data are used to validate the superiority of the proposed approach. The results demonstrate that the proposed method can accurately diagnose different gearbox faults even when the working conditions change significantly, and that it performs better than the existing methods. Deep transfer learning is able to solve hard tasks across largely different domains, and it has great potential to be applied in engineering practice. Although this paper preliminarily explores applications to piecewise variable working conditions and simple dynamic working regimes, more complex and practical cases remain to be considered.

FIGURE 1. The structure of a basic auto-encoder (AE) model.

FIGURE 2. The waveforms of two multi-wavelet scaling functions.

FIGURE 3. The structure of improved AE designed with multi-wavelet activation function.

C. IMPROVED DEEP TRANSFER AUTO-ENCODER
It is necessary to increase the depth of the improved AE model and introduce a softmax classifier at the highest level, so as to refine the quality of the learned features and provide classification ability at the same time. Specifically, each individual improved AE model is pre-trained in an unsupervised way by minimizing the modified loss function, and then the learned features of the previous improved AE model are fed into

FIGURE 4. The construction of improved DAE with three improved AEs.

(1) Train an improved deep auto-encoder (containing a softmax classifier), denoted as Deep model (S), using the training samples in the source domain.
(2) Test the performance of the trained Deep model (S) with the testing samples in the source domain.
(3) Design another improved deep auto-encoder model denoted as Deep model

FIGURE 5. The framework of the proposed approach.

FIGURE 6. Gearbox fault test rig provided by PHM 2009 Data Challenge.

FIGURE 7. The vibration waveforms of the data samples from the eight gearbox health conditions: source domain (black) and target domain (blue). (1-8 denote condition 1 to condition 8.)

FIGURE 8. Statistical diagnosis results of different deep transfer models.

TABLE 4. The comparison results among different deep transfer models.

FIGURE 9. Confusion matrix of the proposed method for the third validation.

FIGURE 10. The reconstruction error curves of different deep transfer models in the pre-training process.

FIGURE 11. The diagnosis results of the proposed method under different time-delay parameters.
It can be seen that the changing patterns of the rotating speeds are different. Each fault condition in the source domain has 158 samples, 140 of which are training samples, while in the target domain each fault condition contains only 30 training (fine-tuning)

FIGURE 12. Experimental setup for simulating faults of gear and bearing.

FIGURE 15. Frequency spectrums of three samples from different speeds.

TABLE 5. The comparison results in case 2.

TABLE 1. Descriptions of eight types of gearbox health conditions.

TABLE 2. Details about the source domain dataset and target domain dataset.

TABLE 3. Diagnosis results of different methods.