DCNN With Explicable Training Guide and its Application to Fault Diagnosis of the Planetary Gearboxes

The diagnosis performance of the Deep Convolutional Neural Network (DCNN) method is closely related to the generalization ability of the trained model. A common empirical training strategy is to randomly disperse the training samples and train the model on mini-batches. Two problems with this empirical method remain open. First, what is the theoretical basis for the random discretization of samples? Second, how can batch division be quantified scientifically? To address these two problems, the theoretical basis of random sample discretization is deduced and proved, and a quantitative batch division method is proposed on the basis of the proved thesis. The fault diagnosis results for the planetary gearbox show that (1) the model obtained with the proposed training guide has stronger generalization ability, and (2) the DCNN with the training guide can accurately and effectively diagnose the faults of the planetary gearbox.


I. INTRODUCTION
It is of great importance to monitor the health state of mechanical equipment. Generally, fault diagnosis comprises data acquisition, signal processing, feature extraction and pattern recognition [1]-[4]. For these four stages, many methods have been proposed, such as EMD, EEMD, VMD, SVM and BPNN [5]-[7]. However, these traditional fault diagnosis methods place high demands on the acquired vibration signals and on the signal processing methods, and they rely heavily on expert knowledge [8], [9].
In recent years, Deep Learning methods have been widely used in Computer Vision [10], [11] and Speech Recognition [12]. Owing to their powerful data processing and pattern recognition capabilities, some Deep Learning algorithms have also been used for mechanical equipment health monitoring [13]-[15].
As a typical representative of Deep Learning, the Deep Convolutional Neural Network (DCNN) has a strong ability to extract distributed features from the original signal and identify the fault patterns adaptively, which can reduce dependence on expert knowledge [16]-[20]. For example, Heng Li et al. [21] combined the short-time Fourier transform with DCNN for the fault diagnosis of rolling bearings, and the proposed method avoids the separate processes of feature extraction and classifier design. Zhou et al. [22] carried out fault diagnosis for rotating machinery based on a 1D deep convolutional neural network; compared with traditional fault diagnosis methods, their method achieved better performance. The literature shows that the diagnosis performance of DCNN is closely related to the generalization ability of the trained model. How can the DCNN model be fully trained with the existing samples to obtain an ideal diagnosis result? Around this issue, many scholars have carried out relevant research in the fault diagnosis field.
The associate editor coordinating the review of this manuscript and approving it for publication was Baoping Cai.
To improve the diagnosis performance of DCNN, there are two main strategies at present. One is to perform signal processing first and then input the processed signal into the DCNN. The other is to keep the original signal as the network input but to feed the input data in mini-batches.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
DCNN is generally used to process two-dimensional data, such as images. Some scholars transformed one-dimensional vibration data into two-dimensional data, including images, or stacked one-dimensional decomposed vibration data to form a two-dimensional matrix. For example, Wen et al. [23] converted the time-domain vibration signal into images and input them into the DCNN model for training and diagnosis. Zhao et al. [24] transformed the original time-domain signal into the frequency domain and carried out fault diagnosis based on the characteristic diagram of the spectrum data. Other scholars, such as Hu et al. [25], decomposed the signal using the EMD method, screened samples according to the kurtosis of each decomposition component, and then input the stacked components into the DCNN model for fault diagnosis. Islam and Kim [26] and Cao et al. [27] combined the DCNN model with the wavelet decomposition method and applied it to bearing fault diagnosis. Han et al. [28] proposed a fault diagnosis method based on an enhanced DCNN and applied it to the fault diagnosis of planetary gearboxes. Its core idea is to convert the one-dimensional signal into a two-dimensional signal and increase the receptive field, so as to extract fault feature information better.
By transforming the dimension or the domain, the above-mentioned methods aim to improve the generalization performance of the trained model. However, relying on signal processing undermines the main advantage of DCNN, because it still requires strong domain knowledge and expert experience; the mini-batch strategy therefore has more application and promotion value. For this reason, other researchers take the original data as the research object and carry out fault diagnosis without changing the original data. For example, Zhang et al. [29] considered dispersing the training samples and reducing the batch sample size, so as to realize fault diagnosis with small samples. There are two problems with the mini-batch strategy, which also exist in other model training methods. First, what is the basis for the random discretization of samples? Second, how should researchers scientifically quantify batch division?
To solve these problems, this paper identifies, through rigorous mathematical derivation, the key factors that affect the generalization performance of the trained DCNN model, so as to provide a theoretical basis and a guiding method for batch division. In addition, to test and verify the effectiveness of the DCNN model established under the training guide, a planetary gearbox failure experiment was designed. Furthermore, the method with the training guide, the original method and other deep learning methods were compared.
The main contributions of this paper are summarized in the following three points. First, the mathematical principle of the random discretization of samples has been deduced and proved. Second, a quantitative batch division method is proposed, guided by the proved thesis. Third, a planetary gearbox fault experiment has been designed, and the test results show that the DCNN model under the training guidance strategy achieves a significant improvement compared with the original methods and other Deep Learning methods. This paper is divided into five parts. The first part introduces the current state of the application of the DCNN method. The second part is a brief introduction to the DCNN method. The third part is the theoretical derivation, which provides the theoretical basis and the guiding method for training the DCNN model. The fourth part is the experimental verification. The fifth part concludes the paper.

II. BRIEF INTRODUCTION TO DCNN
Essentially, the typical 10-layer DCNN model shown in Fig. 1 has two parts: a feature extractor and a softmax classifier. The feature extractor is composed of one input layer, three convolutional layers (C-layers) alternating with max-pooling layers (P-layers), and two fully connected layers (FC-layers). The C-layers extract signal features, and the P-layers reduce computation time and gradually establish invariance to space and structure. After several alternating C-layers and P-layers, the FC-layers compute the class scores. The class scores are then input into the softmax classifier to obtain the diagnosis results.

A. THE CONVOLUTIONAL LAYER (C-LAYERS)
The C-layer enhances the characteristics of the original signal by means of convolution operations and reduces the noise at the same time. The filter kernel $W^l$ is described by Eq. (1), where $w_k^l$, $k \in \{1, 2, \ldots, d_h^l\}$, is an $m^l \times m^l$ linear filter embedded in the $l$-th layer and $d_h^l$ is the number of different kernels (filters) in $W^l$. A matrix $I_p^{l-1}$ of size $\omega^{l-1} \times \omega^{l-1}$ is convolved with the filter $w_k^l$. The feature learned by the model can be rewritten as Eq. (2), and the output of the activation function is given by Eq. (3).
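As a minimal sketch of the C-layer operation described above, the following pure-Python example slides one $m \times m$ filter over an $\omega \times \omega$ input and applies an activation. It rests on several assumptions that the text does not fix: the filter is applied without flipping (cross-correlation), no padding is used (so the output side is $\omega - m + 1$), a scalar bias is added, and the activation is a sigmoid. The names `conv2d_valid`, `sigmoid` and `activate` are hypothetical.

```python
import math

def conv2d_valid(image, kernel, bias=0.0):
    """Slide an m x m kernel over a square image without padding."""
    omega, m = len(image), len(kernel)
    out = omega - m + 1  # output side length for a "valid" convolution
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(m) for j in range(m)) + bias
             for c in range(out)]
            for r in range(out)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def activate(feature):
    """Apply the (assumed) sigmoid activation element-wise."""
    return [[sigmoid(v) for v in row] for row in feature]

# Toy 4 x 4 input and a 2 x 2 filter (illustrative values only).
image = [[float((r * 4 + c) % 5) for c in range(4)] for r in range(4)]
kernel = [[1.0, 0.0], [0.0, -1.0]]
feature = conv2d_valid(image, kernel)
activated = activate(feature)
```

With a $4 \times 4$ input and a $2 \times 2$ filter, the feature matrix is $3 \times 3$, and every activated value lies strictly between 0 and 1.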

B. THE POOLING LAYER (P-LAYER)
By reducing the resolution of the feature matrices obtained in the C-layers, the P-layer achieves spatial invariance. The P-layer applies local pooling to the feature matrices using a max-pooling operation. After the operation, the size of the matrices is given by Eq. (4), where $s$ is the down-sampling size; for example, when the mean sampling method is used, $s$ is 2.
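The effect of the down-sampling size $s$ can be illustrated with a short sketch. It assumes non-overlapping $s \times s$ blocks on a square feature matrix whose side is divisible by $s$, so that each side shrinks by a factor of $s$; both max pooling and mean sampling are shown because the text mentions both. The function name `pool2d` is hypothetical.

```python
def pool2d(feature, s=2, mode="max"):
    """Pool non-overlapping s x s blocks of a square feature matrix."""
    size = len(feature) // s  # each side shrinks by a factor of s
    if mode == "max":
        reduce_block = max
    else:
        reduce_block = lambda block: sum(block) / len(block)
    pooled = []
    for r in range(size):
        row = []
        for c in range(size):
            block = [feature[r * s + i][c * s + j]
                     for i in range(s) for j in range(s)]
            row.append(reduce_block(block))
        pooled.append(row)
    return pooled

feature = [[1.0, 2.0, 5.0, 6.0],
           [3.0, 4.0, 7.0, 8.0],
           [0.0, 1.0, 2.0, 3.0],
           [1.0, 0.0, 3.0, 2.0]]
max_pooled = pool2d(feature, s=2, mode="max")    # [[4.0, 8.0], [1.0, 3.0]]
mean_pooled = pool2d(feature, s=2, mode="mean")  # [[2.5, 6.5], [0.5, 2.5]]
```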

C. THE FULL CONNECTION LAYER (FC-LAYER)
In the FC-layer, each value of the input vector is connected to each value of the output vector. If the lengths of the input and output vectors are $M$ and $N$, respectively, the output vector of the $l$-th layer is calculated by Eq. (5), where $w_{ij}$ denotes the weight connecting the $i$-th input value to the $j$-th output value. The number of parameters of a fully connected layer is given by Eq. (6).

D. THE OUTPUT LAYER: SOFTMAX CLASSIFIER
The softmax classifier is described by Eq. (7), where $p_{W^{(l)}}(\cdot)$ ($i \in \{0, 1\}$) is a sigmoid function with parameters $W^{(l)}$, and $x^l$ is the feature learned by the DCNN. The parameter $W^{(l)}$ is learned from a training set. Eq. (7) produces a label between 0 and 1. The predicted class $\hat{i}$ and the prediction score $\hat{s}(\hat{i})$ are given by Eq. (8).
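The FC-layer and the output layer can be sketched together as follows. This is an illustration under stated assumptions rather than the exact form of Eqs. (5) to (8): a bias term per output is assumed (so the parameter count is $M \times N + N$), the classifier is the standard multi-class softmax, and the predicted class is taken as the index of the largest probability with that probability as its score. All function names are hypothetical.

```python
import math

def fully_connected(x, W, b):
    """W[i][j] is the weight from the i-th input to the j-th output."""
    M, N = len(W), len(W[0])
    return [sum(x[i] * W[i][j] for i in range(M)) + b[j] for j in range(N)]

def fc_parameter_count(M, N, with_bias=True):
    """M * N weights, plus N biases if biases are used (an assumption)."""
    return M * N + (N if with_bias else 0)

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(probs):
    """Return the predicted class index and its probability score."""
    i_hat = max(range(len(probs)), key=lambda i: probs[i])
    return i_hat, probs[i_hat]

x = [1.0, 2.0, 3.0]                        # M = 3 input values
W = [[0.1, 0.0], [0.0, 0.2], [0.3, -0.1]]  # N = 2 output values
b = [0.0, 0.5]
scores = fully_connected(x, W, b)
probs = softmax(scores)
i_hat, s_hat = predict(probs)
```

Here the class scores are approximately 1.0 and 0.6, so class 0 is predicted, and the probabilities sum to 1.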

III. TRAINING GUIDE METHOD OF DCNN

A. BASIS FOR RANDOM DISCRETIZATION OF SAMPLES
Assume that $m$ samples constitute the sample set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$ and that they belong to $n$ categories, where $x^{(i)}$ is the input signal vector and $y^{(i)}$ is the target value, namely, the fault-pattern index. The cost function $R(W, b)$ of the DCNN model is given by Eq. (9), where $W$ is the weight of each unit, $b$ is the bias term, and $h_{W,b}(x^{(i)})$ is the output of the last neural network layer, namely, the fault-pattern index of the sample $x^{(i)}$. The goal of training is to find the minimum of $R(W, b)$ by adjusting $W$ and $b$. From the diagnosis principle of the DCNN, it follows that the performance of the DCNN mainly depends on the parameters of the trained model. These parameters are obtained through multiple batches of training samples. In other words, for given training samples, the training strategy determines the performance of the DCNN to some extent.
So, the trained DCNN model can be expressed by Eq. (10), where $h(\cdot)$ is the training model, $\vec{x}_k$ is the batch sample set, $X$ is the total sample set with $\vec{x}_k \in X$ and $\mathrm{size}(X) = K \times n$, and $n$ is the batch sample capacity. Each batch of samples can train a DCNN, which is defined as a Batch Deep Convolutional Neural Network (BDCNN); $K$ is the number of batches, that is, the number of BDCNNs obtained, and $\vec{\theta}_k$ is the model parameter set obtained from that batch of training.
Based on Eq. (10), the Confidence Function (CF) of the DCNN is defined by Eq. (11), where $X$ and $Y$ represent the sample set and the label set, respectively, $I(\cdot)$ is the indicator function, $av_k(\cdot)$ is the mean value function, $Y$ is the correctly classified label set, and $J$ is the misclassified set.
$CF(X, Y)$ measures the extent to which the number of BDCNNs that classify correctly exceeds the number of BDCNNs that vote for any other, incorrect class during model training. The larger the value of $CF(X, Y)$, the stronger the diagnosis ability of the trained DCNN model. To measure the performance of the DCNN, the Generalization Error (GE) is introduced; the GE of the DCNN model is defined by Eq. (12), where the subscripts $X, Y$ indicate that the probability $P_{X,Y}(\cdot)$ is over the $X, Y$ space. Next, the factors connected to the GE are identified, and the conclusion is summarized as follows.
Conclusion: the Generalization Error of the DCNN model is positively correlated with the correlation between BDCNNs and negatively correlated with the classification ability of BDCNN.
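To make the roles of the indicator function, the mean over $k$ and the misclassified set concrete, the sketch below computes an empirical confidence value per sample from the votes of $K$ batch-trained models, and estimates the GE as the fraction of samples with negative confidence. It assumes that CF takes the margin form suggested by the description (share of correct votes minus the largest share of votes for any single wrong class) and that the GE is the probability of $CF(X, Y) < 0$; the function names and the vote data are hypothetical.

```python
from collections import Counter

def confidence(votes, true_label):
    """votes: the classes predicted by the K batch models for one sample."""
    K = len(votes)
    counts = Counter(votes)
    correct = counts.get(true_label, 0) / K
    # Largest vote share among the wrong classes (0 if there is none).
    wrong = max((n for label, n in counts.items() if label != true_label),
                default=0) / K
    return correct - wrong

def empirical_ge(all_votes, labels):
    """Fraction of samples whose confidence is negative."""
    cf = [confidence(v, y) for v, y in zip(all_votes, labels)]
    return sum(1 for value in cf if value < 0) / len(cf)

# Three samples, each voted on by K = 4 batch models (made-up votes).
all_votes = [[0, 0, 1, 0], [2, 2, 1, 1], [1, 3, 3, 3]]
labels = [0, 1, 1]
cf_values = [confidence(v, y) for v, y in zip(all_votes, labels)]
ge = empirical_ge(all_votes, labels)
```

For these votes the confidence values are 0.5, 0.0 and -0.5, so one sample out of three counts toward the empirical GE.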
Before proving the conclusion, some definitions or property theorems need to be introduced.
1. The BDCNN correlation is the correlation between the models obtained by batch training, and the detailed mathematical representation will be given in the following parts of the paper.
2. The diagnosis ability of BDCNN is the recognition ability of the model obtained by batch training, and the detailed mathematical forms will be given in the process of subsequent proof in this paper as well.
3. … $\to \xi$.
4. Borel strong law of large numbers [31]: suppose that $\{\xi_n\}$ is a sequence of independent, identically distributed random variables on the probability space $(\Omega, \mathcal{F}, P)$. If $P(\xi_n = 1) = p$ and $P(\xi_n = 0) = 1 - p$, $0 < p < 1$, and $S_n = \sum_{k=1}^{n} \xi_k$, then

$$P\left(\lim_{n \to \infty} \frac{S_n}{n} = p\right) = 1.$$

Let $K_r$ denote the number of $S_r$; in other words, $K_r$ represents the number of misclassified BDCNNs, for which [equation missing in source]. When $K \to \infty$, formula (15) holds according to the Borel strong law of large numbers, where the subscript $\theta$ indicates that the probability is over the space of the model parameter set $\theta$. Therefore, based on the almost-everywhere convergence theorem, for any $J$ there is a set $C$ of measure zero in the value space of $\vec{x}_k$ such that [equation missing in source] holds, so Eq. (17) holds.
Eq. (12) can then be rewritten step by step [equations missing in source]. Since Eq. (18) has been proved to hold, the upper bound of the generalization error GE can be obtained by analyzing $P_{X,Y}(CF(X, Y) < 0)$. For the diagnosis result of the DCNN model to be reliable, $E_{X,Y} CF(X, Y) > 0$ is required, where $E_{X,Y} CF(X, Y)$ represents the expected classification confidence of the DCNN over the samples, and $E_{X,Y} CF(X, Y) > 0$ indicates that the classification result is reliable. According to the Chebyshev inequality, an upper bound on $P_{X,Y}(CF(X, Y) < 0)$ follows [equations missing in source]. The classification ability of BDCNN is defined as $s$, and the average correlation between BDCNNs as $\rho$ [equations missing in source], where $\rho(\theta, \theta^*)$ represents the correlation between $rmg(\theta, X, Y)$ and $rmg(\theta^*, X, Y)$, and $sd(\theta)$ represents the standard deviation of $rmg(\theta, X, Y)$.
The upper bound of GE represented by s and ρ can be obtained by the following proof process.
For independent, identically distributed variables $\theta$ and $\theta^*$, the upper bound of GE in terms of $s$ and $\rho$ is obtained [equations missing in source]. From the foregoing definitions, $\rho$ represents the average correlation between BDCNNs, and $s$ represents the classification strength of BDCNN. The key factors that affect the diagnosis ability of DCNN have thus been found: the Generalization Error of the DCNN model is positively correlated with the correlation between BDCNNs and negatively correlated with the classification ability of BDCNN.
Based on this conclusion, the generalization ability of the DCNN model can be enhanced and the confidence in diagnosis results can be improved by reducing the correlation between BDCNN and improving the classification ability of BDCNN.
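The shape of the bound behind this conclusion can be sketched as follows. This is a reconstruction under assumptions, not the paper's own equations: it supposes that the derivation follows the standard margin-based argument for ensembles of classifiers, with $CF(X, Y)$ taken in its limiting form as $K \to \infty$, $s = E_{X,Y}\,CF(X, Y) > 0$, and $\rho \ge 0$ the average correlation defined through $rmg(\theta, X, Y)$ and $sd(\theta)$.

```latex
\begin{align}
GE = P_{X,Y}\bigl(CF(X,Y) < 0\bigr)
  &\le \frac{\operatorname{var}_{X,Y}\bigl(CF(X,Y)\bigr)}{s^{2}}
  && \text{(Chebyshev, using } s > 0\text{)} \\
\operatorname{var}_{X,Y}\bigl(CF(X,Y)\bigr)
  &= \rho \,\bigl(E_{\theta}\, sd(\theta)\bigr)^{2}
   \le \rho \, E_{\theta} \operatorname{var}_{X,Y}\bigl(rmg(\theta,X,Y)\bigr)
   \le \rho \,(1 - s^{2}) \\
GE &\le \frac{\rho \,(1 - s^{2})}{s^{2}}
\end{align}
```

Under these assumptions the bound grows with $\rho$ and shrinks as $s$ approaches 1, which matches the stated conclusion: lower correlation between BDCNNs and higher classification ability both tighten the bound on the GE.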
From Eq. (26), it can be seen that the correlation between BDCNNs is closely related to the correlation between the training samples. The correlation between samples in the same health state is certainly higher than that between samples in different health states. So, if the correlation between the samples within a batch is reduced by random sample discretization, the correlation between BDCNNs is reduced as well. This is why the generalization ability of the DCNN can be enhanced by dispersing the training samples.
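Random sample discretization followed by batch division can be sketched in a few lines. The example assumes that samples arrive grouped by health state, shuffles them with a fixed seed so that each batch mixes health states, and requires the batch capacity to divide the number of samples. The function name `shuffle_and_batch` and the toy data are hypothetical.

```python
import random

def shuffle_and_batch(samples, labels, batch_size, seed=0):
    """Randomly disperse the samples, then cut them into equal batches."""
    if len(samples) % batch_size != 0:
        raise ValueError("batch_size must divide the number of samples")
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)  # reproducible random permutation
    batches = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        batches.append(([samples[i] for i in idx], [labels[i] for i in idx]))
    return batches

# Five health states with four samples each, initially grouped by state.
labels = [state for state in range(5) for _ in range(4)]
samples = [f"sample_{i}" for i in range(len(labels))]
batches = shuffle_and_batch(samples, labels, batch_size=5)
```

Each sample keeps its own label through the shuffle; only the order in which samples enter the batches changes.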

B. OPTIMIZATION OF THE BATCH DIVISION
Based on the conclusion obtained, when we do the training sample batch division, we should consider how to improve the classification ability of BDCNN scientifically.
The experimental results in Fig. 2 show that the GE of the diagnosis model changes with the number of iterations and the batch sample capacity. Proper parameters can effectively improve the classification ability of BDCNN. Therefore, it is very important to find the balance between the number of iterations and the batch sample capacity.
Suppose that $X$ is the sample set with $\mathrm{size}(X) = K \times n$, where $K$ is the number of batches and $n$ is the batch sample capacity. The empirical strategy has been to use many batches of small capacity, so that the diagnosis model is trained more times and a better-trained model is obtained. The experimental results agree with this: Fig. 3 shows that the GE of the diagnosis model decreases as the number of iterations increases, so the classification ability of the model can be improved by increasing the number of training iterations.
However, as the experimental result in Fig. 4 shows, when the batch sample capacity is too small, the sample information learned by the training model is also meager, which is not conducive to improving the generalization ability of the diagnosis model.
The method adopted in this paper is to traverse and optimize the batch sample capacity, as shown in Fig. 5. On the premise of ensuring the prediction accuracy, the goal is to find the best batch sample capacity with fewer iterations.
The specific method applied is linear interpolation, and the batch division method consists of the following steps:
(1) Preprocess the training samples and input the batch samples.
(2) Initialize the DCNN parameters, and increase the batch sample capacity and the number of iterations in equal steps.
(3) Obtain the GE of the DCNN model under the different parameter sets, capturing as many inflection points of the model's diagnosis performance as possible.
(4) Use the linear interpolation method to obtain the Landform Map of the GE of the DCNN model under different iterations and batch sample capacities.
(5) Find the bottom of the Landform Map, preferring a larger batch sample capacity and fewer iterations; this is the optimal parameter set.
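Steps (3) to (5) can be sketched as follows. The GE values in `ge_grid` are made-up placeholders, not measurements from the experiments; in practice each entry would come from training and testing the DCNN at that grid point. The sketch assumes bilinear interpolation between grid points, and it reads "bottom of the Landform Map" as every point within a tolerance of the minimum GE, among which the largest batch capacity and then the fewest iterations are preferred. The function names and the tolerance are hypothetical.

```python
def lerp(a, b, t):
    return a + (b - a) * t

def bilinear(grid, xs, ys, x, y):
    """grid[i][j] is the GE at iterations xs[i] and batch capacity ys[j]."""
    def locate(axis, v):
        for k in range(len(axis) - 1):
            if axis[k] <= v <= axis[k + 1]:
                return k, (v - axis[k]) / (axis[k + 1] - axis[k])
        raise ValueError("point outside the grid")
    i, tx = locate(xs, x)
    j, ty = locate(ys, y)
    low = lerp(grid[i][j], grid[i][j + 1], ty)
    high = lerp(grid[i + 1][j], grid[i + 1][j + 1], ty)
    return lerp(low, high, tx)

def landform_map(grid, xs, ys, x_step, y_step):
    """Interpolate the GE on a finer, equally spaced mesh."""
    points = []
    for x in range(xs[0], xs[-1] + 1, x_step):
        for y in range(ys[0], ys[-1] + 1, y_step):
            points.append((x, y, bilinear(grid, xs, ys, x, y)))
    return points

def pick_optimum(points, tolerance=0.01):
    """Among near-minimal GE points, prefer larger capacity, then fewer iterations."""
    best = min(ge for _, _, ge in points)
    candidates = [p for p in points if p[2] <= best + tolerance]
    return max(candidates, key=lambda p: (p[1], -p[0]))

iterations = [50, 100, 150]   # coarse grid of iteration counts
capacities = [10, 20, 30]     # coarse grid of batch sample capacities
ge_grid = [[0.30, 0.28, 0.40],   # placeholder GE values, for illustration only
           [0.06, 0.05, 0.20],
           [0.05, 0.05, 0.18]]
points = landform_map(ge_grid, iterations, capacities, x_step=25, y_step=5)
optimum = pick_optimum(points, tolerance=0.01)
```

With these placeholder values, several points share a GE close to the minimum; the selection rule returns 100 iterations with a capacity of 20, because a capacity of 20 is the largest near-minimal one and 100 is the smallest iteration count that reaches it.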

C. THE OVERALL FRAMEWORK OF DCNN'S TRAINING GUIDE
Based on the training strategy proposed in this paper, the overall framework of the DCNN's training guide can be constructed as shown in Fig. 6. The detailed operation process is expressed as follows.
(1) In sample processing, shift-sample the original vibration data under each health status with a sampling window of a certain width.
(2) Convert the one-dimensional signal into a two-dimensional signal matrix using MATLAB's built-in dimension conversion function.
(3) Mix the signal matrices of all health statuses together and disperse them randomly.
(4) Carry out the batch division procedure described in Section III-B.
(5) Input the test samples and obtain the diagnosis results.
The training guide proposed in this paper has two highlights. First, the key factors that affect the diagnosis ability of the trained DCNN model have been found, which gives the design of training strategies a scientific basis. Second, building on the first highlight, the quantification of batch division can be guided, and a new division method is proposed.

IV. EXPERIMENTAL AND VERIFICATION

A. TEST RIG AND EXPERIMENT SETTING UP
In order to verify the feasibility of the training guiding method, the data were obtained on the planetary gearbox fault experimental platform. The planetary gearbox test rig is shown in Fig. 7.
The experiment was carried out on planetary gearbox 1. Four kinds of faults were planted artificially on the sun gear: worn tooth, eccentric, pitting and chipped tooth (shown in Fig. 8). The acceleration sensor was installed on planetary gearbox 1. The motor speed is 1200 r/min, the sampling frequency is 5 kHz, and the load is 41.2 N · m. Each collected signal contains 196608 sampling points. The time-domain waveforms of the collected vibration signals are shown in Fig. 9.
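A quick arithmetic check puts these settings in relation to each other. The sketch uses the 1024-point window that the sampling description below gives as an example, and it assumes the rotation frequency is that of the motor shaft (the sun gear speed is not stated). The variable names are hypothetical.

```python
sampling_rate_hz = 5_000   # 5 kHz sampling frequency
n_points = 196_608         # sampling points per record
motor_speed_rpm = 1_200    # motor speed in r/min
window = 1_024             # example sampling window width

record_seconds = n_points / sampling_rate_hz        # length of one record
shaft_hz = motor_speed_rpm / 60                     # motor shaft rotation frequency
points_per_rev = sampling_rate_hz / shaft_hz        # samples per shaft revolution
revs_per_window = window / points_per_rev           # revolutions covered by one window
non_overlapping_windows = n_points // window        # windows if the shift equals the width
```

Each record is therefore about 39.3 s long, one motor-shaft revolution spans 250 points, a 1024-point window covers roughly four revolutions, and a record holds exactly 192 non-overlapping windows.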

B. TRAINING SAMPLES PREPARING
The sampling method used in this paper is shown in Fig. 10. The original vibration signal is shift-sampled with a sampling window of a certain width (for example, 1024 points), and $n$ samples are finally obtained. The diagnostic object has five different health states, so $5 \times n$ samples are obtained. The dimension of each sample signal is then transformed, so that the sample data change from one-dimensional (1024 points) to two-dimensional ($32 \times 32$). To represent the sample form more intuitively, the sample matrix is given in the form of a confusion matrix, as shown in Fig. 6. The sample matrices of all health states are combined to form the training sample set, and then all the samples are randomly dispersed.
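The shift sampling and the dimension conversion can be sketched as follows. The shift step is not stated in the text, so `step` is a hypothetical parameter, and the synthetic ramp signal is for illustration only. The reshape here fills the matrix row by row, which is an assumption; MATLAB's `reshape` fills column by column, so the authors' matrices may be the transpose of these.

```python
def shift_sample(signal, window=1024, step=1024):
    """Cut a 1-D signal into windows of fixed width, shifted by `step` points."""
    return [signal[start:start + window]
            for start in range(0, len(signal) - window + 1, step)]

def to_matrix(sample, side=32):
    """Reshape a 1-D sample of side * side points into a side x side matrix."""
    if len(sample) != side * side:
        raise ValueError("sample length must equal side * side")
    return [sample[r * side:(r + 1) * side] for r in range(side)]

signal = [float(i) for i in range(4096)]            # synthetic stand-in signal
samples = shift_sample(signal, window=1024, step=512)
matrices = [to_matrix(s) for s in samples]
```

With 4096 points, a 1024-point window and a 512-point shift, seven overlapping samples are obtained, each reshaped to $32 \times 32$.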

C. CONSTRUCTION OF DCNN MODEL AND OPTIMIZATION OF PARAMETERS
The parameters related to the DCNN model are shown in Table 1. The structure 12C-2S-24C-2S means that the DCNN has two convolutional layers and two down-sampling layers; the numbers of kernels are 12 and 24, respectively, and the down-sampling layers use mean sampling. The number of iterations and the batch sample capacity are determined by the final optimization result.
Input the dispersed sample set into the DCNN model; the diagnostic results under different iterations and batch sample capacities are shown in Table 2 and Fig. 11. The grid search method and the interpolation method are used to obtain the Landform Map. As given in Table 1, the number of training samples is 900. Since the number of training batches in a single cycle must be an integer in the specific algorithm, the batch sample capacity should be a divisor of 900. In the optimization process, as the batch sample capacity decreases, the number of batches increases, but the training efficiency is reduced. So, provided that the ideal diagnostic accuracy is reached, the number of iterations should be as small as possible and the batch sample capacity as large as possible. The optimum number of iterations and batch sample capacity are found to be 150 and 10, respectively.
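The divisibility constraint can be checked directly. The sketch below lists the batch sample capacities that divide 900 and the resulting number of batches per training cycle. The last line assumes that one iteration means one full pass over all batches, which the text does not state explicitly; the variable names are hypothetical.

```python
n_train = 900  # number of training samples (Table 1)

# Admissible batch sample capacities: divisors of the training set size.
valid_capacities = [b for b in range(1, n_train + 1) if n_train % b == 0]
batches_per_cycle = {b: n_train // b for b in valid_capacities}

iterations, capacity = 150, 10  # optimum reported in the text
total_updates = iterations * batches_per_cycle[capacity]  # assumes 1 iteration = 1 full pass
```

There are 27 admissible capacities; with a capacity of 10, each cycle contains 90 batches, which gives 13500 batch updates over 150 iterations under the stated assumption.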

D. INFLUENCE OF RANDOM DISCRETIZATION OF SAMPLES ON THE DIAGNOSIS ACCURACY
For comparison, the Deep Belief Network (DBN) method and experiments without random sample discretization have been adopted. The results of DBN without random sample discretization are shown in Fig. 13 to Fig. 16. The results of DBN with the training guide are shown in Fig. 17 to Fig. 20. The results of DCNN without random sample discretization are shown in Fig. 21 to Fig. 24. The results of DCNN with the training guide are shown in Fig. 25 to Fig. 28. To make the diagnosis process intuitive, an example of the diagnosis process of a 4-layer DCNN is shown in Fig. 12. Furthermore, the model-learned feature $f_v$, the output $f_v * W + b$, the activated output $\mathrm{Sigmoid}(f_v * W + b)$ and the final diagnosis result are also shown visually.
The distributed data features $f_v$ are shown in Fig. 13, Fig. 17, Fig. 21 and Fig. 25; they are obtained at the fully connected layer from a single sample under each of the five health states. The outputs before activation are shown in Fig. 14, Fig. 18, Fig. 22 and Fig. 26; the formula of the output is $f_v * W + b$, in which $W$ is the weight and $b$ is the bias. The outputs after activation are shown in Fig. 15, Fig. 19, Fig. 23 and Fig. 27; the formula is $\mathrm{Sigmoid}(f_v * W + b)$. The diagnosis results are shown in Fig. 16, Fig. 20, Fig. 24, Fig. 28 and Table 3.
The diagnosis results show that the training strategy proposed in this paper can improve the generalization performance of the diagnosis model and obtain higher diagnosis accuracy indeed.

V. CONCLUSION
As a data-driven diagnosis method, DCNN's diagnosis performance mainly depends on whether the trained model has good generalization performance. In this paper, the theoretical derivation and experimental verification of the two problems mentioned above are carried out. The work and significance of this paper are summarized as follows:
(1) The thesis that the Generalization Error of the DCNN model is positively correlated with the correlation between BDCNNs and negatively correlated with the classification ability of BDCNN has been proved by mathematical deduction, and the theoretical basis for the random discretization of samples has been found.
(2) Based on the proved thesis, a quantitative batch division method is proposed, and the experimental results confirm that it works.
(3) The work is conducive to improving the interpretability of deep learning algorithms.
(4) An explicable training guide has been proposed for the popularization and application of DCNN in mechanical equipment fault diagnosis. Furthermore, the training strategy remains useful when the Deep Learning algorithm is changed.
PENG LUO received the B.S. and M.S. degrees in mechanical engineering from Hunan University, in 2012 and 2018, respectively. He is currently pursuing the Ph.D. degree in mechanical engineering with the National University of Defense Technology. His research interests include dynamic modeling, signal processing, and machinery fault diagnosis.
NIAOQING HU received the B.S., M.S., and Ph.D. degrees in mechanical engineering from the National University of Defense Technology, in 1989, 1992, and 2001, respectively. From 1993 to 1998, he was a Lecturer with the Department of Mechanical and Electronic Engineering and Instrumentation. From 1998 to 2003, he was an Associate Professor with the Department of Mechatronics and Automation. Since 2004, he has been a Professor with the National University of Defense Technology. He is the author of 4 books, over 260 articles, and more than 7 inventions. His research interests include condition monitoring, prognosis and health management, signal processing, mechanical dynamics, nonlinear systems, structure health monitoring, and artificial intelligence.