A Novel Method of Bearing Fault Diagnosis in Time-Frequency Graphs Using InceptionResnet and Deformable Convolution Networks

Bearing fault diagnosis has attracted increasing attention because of its importance to the health of rotating machinery. Data-driven models based on deep learning (DL) are increasingly used in fault diagnosis, and among them the convolutional neural network (CNN) has been widely adopted in recent research. However, a traditional CNN struggles to capture the right fault features because of its fixed geometric structure, especially under complex working conditions. To address this challenge, we propose a novel model, named DeIN, that combines InceptionResnetV2 with Deformable Convolution Networks. We replace standard convolution with deformable convolution in specific layers, and we design a main classifier and an auxiliary classifier to output the classification result, so that the model adapts to the non-rigid characteristics of the time-frequency graph (TFG) and obtains a larger receptive field. In the experiments, one-dimensional signals are transformed into TFGs and used as the model input, so that useful features can be found during training. To verify the generalization ability of the proposed model, we run a set of cross-over tests on two popular datasets; our model achieves the highest fault classification accuracies, 99.87% and 94.52%, compared with other state-of-the-art methods.


I. INTRODUCTION
Rolling bearings are an indispensable component of rotating machinery, and their condition largely determines the state of the machine. Because working environments and conditions are complex, rolling bearing failures are inevitable and unpredictable, and they severely impair the stability and performance of machines. Many bearing fault diagnosis methods have therefore been proposed. These methods can be divided into model-driven methods and data-driven methods [1]. In the era of big data, data-driven models have become more and more attractive. Deep learning is one such powerful data-driven method and has been successfully adopted in various areas such as computer vision, automatic speech recognition, natural language processing, audio recognition, remaining useful life prediction and bioinformatics [2]-[7].
The associate editor coordinating the review of this manuscript and approving it for publication was Lance Fiondella.
Deep learning models hierarchical representations of data and classifies (predicts) patterns by stacking multiple layers of information processing modules in a hierarchical architecture. Convolutional neural networks (CNNs), proposed by LeCun [8], are characterized by two key properties: spatially shared weights and spatial pooling. Because CNNs can effectively extract features from massive data, they have been widely adopted in intelligent fault diagnosis. Zhang et al. [9] proposed WDCNN (wide convolution kernel deep convolutional neural network), which works on the time-domain fault signal. Its first layer, with a large convolution kernel, is trained by the optimization algorithm so that it automatically removes noise that is useless for diagnosis, and the wide kernel gives WDCNN greater domain adaptability to fluctuations of the raw signal, with the advantage of learning diagnostic features automatically in the time domain. Adaptive batch normalization (AdaBN) and TICNN, which uses training interference and ensemble learning [10], were added to improve stability. However, convolution kernels of the same size have the same receptive field, and their output is fed to a fixed-size pooling layer, so the extracted features often suffer from background noise. Atrous convolution [11]-[13] changes the spacing of the kernel samples through the dilation ratio [14]: when the dilation ratio equals 1, the convolution is equivalent to standard convolution, and when it is greater than or equal to 2, the filter has 'holes', which can avoid part of the background noise. Chen et al. [15] proposed a deep inception net with atrous convolution (ACDIN), which performs well when trained on data from artificially damaged bearings and tested on time-domain data from naturally damaged bearings, and we also proposed a few-shot learning method [16] that uses Siamese networks on limited data. However, for raw fault signals it is often difficult to extract enough useful features for fault classification, especially under varied working conditions. Vibration signals have therefore been transformed into time-frequency graphs (TFGs) using the short-time Fourier transform (STFT), since TFGs reflect the characteristic information of bearings [17]-[19]. Zhu et al. [20] used a capsule network (ICN) to analyze TFGs, which effectively improved fault classification accuracy. Zhang et al. [21] proposed a 1-D convolutional adversarial network (A2CNN) for transfer learning: fast Fourier transform (FFT) coefficients were used as the input of A2CNN, and domain adaptation (DA) was applied between operating conditions, using the original labeled data and the target-domain information. Zhao et al. [22] applied the wavelet transform to the original signals and combined the analysis with other neural network architectures. Yuan et al. [23] proposed the Wavelet FCNN method for diagnosing wind turbine blade icing; the discrete wavelet coefficients were processed by the FCNN model to improve recognition accuracy [24].
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Most existing studies achieve high accuracy largely because they rely on the assumption that all data come from the same working condition and share the same distribution and feature space. In the real world, however, bearing behavior varies across working conditions. Moreover, a classic convolutional neural network has difficulty capturing the right fault features because of the fixed geometric structures of its building modules [25], [26]. Deformable convolution and deformable RoI pooling were therefore proposed [26], [27]. These modules enhance the transformation modeling capability of CNNs, and the reported experiments show that learning dense spatial transformations in deep CNNs is effective for sophisticated vision tasks such as object detection and semantic segmentation [28]-[30]. Furthermore, in convolutional networks such as InceptionResNet, combining convolution kernels of different sizes for feature extraction can improve recognition accuracy [31], [32].
In this paper, to achieve higher accuracy and better performance in bearing fault diagnosis under varying working conditions, we propose a novel model, DeIN, which applies deformable convolution to low-dimensional features to extract useful signal features and uses InceptionResNetV2 to process the output of the deformable convolution. In the experiments, DeIN adapts better to data from different working conditions in both feature extraction and recognition; this behavior is similar to an attention mechanism, in that the model finds the features that are valuable to it during training. We expect DeIN to be applicable to a wide range of PHM problems and to fault diagnosis in more fields. We compare the deep convolutional models ACDIN, WDCNN, AlexNet, ResNet, ICN, IRv2-lite and DeIN. By cross-testing between different conditions, we measure the generalization ability of each model. In particular, we performed multiple experimental comparisons between DeIN and IRv2-lite, which has no deformable convolution, and found that DeIN generalizes better. Because the data distributions of the same fault type are similar across working conditions, DeIN fits the fault-type features better than conventional convolutional neural networks. The main contributions of this article are as follows: (1) DeIN filters the background noise of signal data automatically by using deformable convolution kernels; we purposefully modify the DeIN structure and redefine the loss function so that it performs better on data from different working conditions. (2) With comparatively few parameters, DeIN is effective for fault diagnosis on both sufficient and limited data; when the training set is relatively small (5% of the total samples), DeIN reaches an accuracy of 88.54%, compared with 78.32% for WDCNN.
The rest of this paper is organized as follows: Section II describes the fault diagnosis algorithm based on deformable convolution and Inception-ResNet. Section III presents the experiments under different working conditions, together with results and discussion. Section IV concludes the paper.

II. PROPOSED DeIN ARCHITECTURE
The proposed DeIN architecture for bearing fault diagnosis is shown in Figure 1. The model takes experimental data combined with real fault data as input, and cross-over testing is used to diagnose bearing faults under different working conditions. DeIN's efficient feature extraction is used to fit multiple kinds of fault data and to improve the generalization of the model.

A. GENERAL CNN LAYER
In a convolutional layer, the input data is convolved with filter kernels, as described by Equation (1). Each filter uses the same kernel to extract features from local regions of the input, which is commonly referred to as weight sharing. $K_i^l$ and $b_i^l$ denote the weights and bias of the $i$-th filter kernel in layer $l$, $x^l(j)$ denotes the $j$-th local region in layer $l$, and $y_i^{l+1}(j)$ denotes the input of the $j$-th neuron in frame $i$ of layer $l+1$; the notation $*$ computes the dot product of the kernel and the local region. A pooling layer is usually added after the convolutional layer in a CNN architecture. Max-pooling is the most common choice; it reduces the number of parameters and provides positional invariance. The max-pooling layer is described by Equation (2), where $q_i^l(t)$ denotes the value of the $t$-th neuron in the $i$-th frame of layer $l$, $t \in [(j-1)W+1, jW]$, $W$ denotes the width of the pooling region, and $P_i^{l+1}(j)$ denotes the result of the pooling operation in layer $l+1$. After the convolution operation, the Rectified Linear Unit (ReLU), which is widely used in deep learning, is applied as the activation function. $y_i^{l+1}(j)$ denotes the output of the convolution operation and $a_i^{l+1}(j)$ denotes the result of passing $y_i^{l+1}(j)$ through the activation function $f$; the activation layer is described by Equation (3).
The auxiliary classifier mitigates vanishing gradients and provides regularization. A network with an auxiliary classifier can achieve higher accuracy and stability than one without it [15], [33]. According to the data features and training objectives, the model learns a set of weights that fit the ground truth by optimizing the parameters, as expressed in Equation (4), where $W_M = (W^{(1)}, \dots, W^{(M)})$ denotes the combined weights of all layers, $W_m = (W^{(1)}, \dots, W^{(m)})$ denotes the corresponding weights of an auxiliary classifier attached at the $m$-th layer, $w^{(i)}$ denotes the classifier weights for the $i$-th layer, $P(W_M)$ is the main output objective, and $Q(W_m)$ is the auxiliary output objective.
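The three layer operations defined above can be illustrated with a minimal 1-D sketch. Equations (1)-(3) are not reproduced in this text, so the forms used here (dot product plus bias, non-overlapping max over windows of width W, and ReLU as max(0, y)) are inferred from the symbol definitions; the function names and the example signal are illustrative assumptions, not the authors' code.

```python
def conv1d(x, kernel, bias):
    """Eq. (1), as inferred: y(j) = K * x(j) + b, a dot product per local region."""
    k = len(kernel)
    return [sum(kernel[m] * x[j + m] for m in range(k)) + bias
            for j in range(len(x) - k + 1)]


def relu(y):
    """Eq. (3), as inferred: a(j) = f(y(j)) = max(0, y(j))."""
    return [max(0.0, v) for v in y]


def max_pool(q, width):
    """Eq. (2), as inferred: P(j) = max of q(t) over the j-th window of `width` values."""
    return [max(q[j:j + width]) for j in range(0, len(q) - width + 1, width)]


# Hypothetical input signal and a hand-picked difference kernel.
signal = [1.0, -2.0, 3.0, 0.5, -1.0, 2.0, 4.0, -3.0, 1.5]
features = max_pool(relu(conv1d(signal, kernel=[1.0, 0.0, -1.0], bias=0.0)), width=2)
print(features)  # [0.0, 4.0, 5.0]
```

In a trained network the kernel and bias are learned; they are fixed by hand here only to keep the arithmetic easy to follow.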

B. PRINCIPLE OF DEFORMABLE CONVOLUTION
In the convolution operation described above, the standard convolution kernel, shown in Figure 2(a), has a fixed shape. If the sampling spacing changes, as in the atrous kernel shown in Figure 2(b), background effects are reduced by enlarging the receptive field. If, in addition, the receptive field can deform as the object changes, the model extracts useful features more easily, as shown in Figure 2(c). The principle of deformable convolution can be explained as follows: bilinear interpolation allows the offset $\Delta p_n$ to be learned during training by backpropagation. The offset $\Delta p_n$ is obtained by bilinear interpolation on the 2-D TFG, as shown in Figure 3 and Equations (5) and (6).
Through bilinear interpolation, we obtain the values $(r_1, r_2, r_3, r_4)$ of the four integer pixels closest to the point $P$; the weights $(w_1, w_2, w_3, w_4)$ of these pixel values are obtained through backpropagation (BP), as shown in Equation (7).
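The interpolation step can be sketched as follows. This is a minimal illustration of standard bilinear interpolation, in which the four weights follow from the fractional distances to the neighbouring pixels; Equations (5)-(7) are not reproduced in this text, so the exact form used by the authors is an assumption, and `bilinear_sample` is a hypothetical helper name.

```python
import math


def bilinear_sample(image, row, col):
    """Interpolate `image` (a list of rows) at a fractional (row, col) position.

    Assumes 0 <= row <= height - 1 and 0 <= col <= width - 1.
    """
    r0, c0 = math.floor(row), math.floor(col)
    r1 = min(r0 + 1, len(image) - 1)
    c1 = min(c0 + 1, len(image[0]) - 1)
    dr, dc = row - r0, col - c0
    # Weights of the four integer neighbours; they always sum to 1.
    w1, w2 = (1 - dr) * (1 - dc), (1 - dr) * dc
    w3, w4 = dr * (1 - dc), dr * dc
    return (w1 * image[r0][c0] + w2 * image[r0][c1]
            + w3 * image[r1][c0] + w4 * image[r1][c1])


# Hypothetical 2 x 2 patch of a TFG.
patch = [[0.0, 1.0],
         [2.0, 3.0]]
print(bilinear_sample(patch, 0.5, 0.5))  # 1.5
```

Because the output is a smooth function of the sampling position, gradients can flow back to the offsets during training, which is what makes the offsets learnable.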
The critical part of the deformable convolution method in this research is to obtain the offset $\Delta p_n$ by training the model; the relevant definitions are given in Equations (8) and (9),
where the grid $R$ defines the receptive field size in conventional convolution.
For each location $p_0$ on the output feature map $y$, the convolution is expressed as Equation (10),
where $p_0$ is the initial location and $p_n$ enumerates the fixed locations in $R$. In the bearing fault dataset, the TFGs are the output of the STFT layer. The offset field layers calculate the offset $\Delta p_n$ for the convolution kernel, which gives the kernel a deformable receptive field, as shown in Figure 4; the deformable convolution obtained by adding $\Delta p_n$ is expressed as Equation (11). In a convolutional network, the pooling operation is used for feature dimensionality reduction; it reduces the number of parameters, which helps avoid over-fitting and improves fault tolerance. In standard convolution the pooling layer size is usually fixed, so pooling the convolved feature map causes an unavoidable loss of features. The deformable pooling layer works like the deformable convolution layer: the shape of the pooling region is changed by adding the offset $\Delta p_{ij}$, as shown in Figure 5.
In this paper, we use max-pooling, which leads to faster convergence by selecting superior invariant features and thus improves generalization performance. The conventional max-pooling method is expressed as Equation (12), where $n_{ij}$ is the number of pixels in the bin and the $(i, j)$-th bin denotes two vectors of TFGs. The deformable max-pooling layer, obtained by adding the offset $\Delta p_{ij}$, is expressed as Equation (13).
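The deformable convolution of Equations (10) and (11) can be sketched for a single output location as follows. Since the equations are not reproduced in this text, the sketch assumes the standard formulation y(p0) = sum over n of w(p_n) * x(p0 + p_n + Δp_n) on a 3 × 3 grid R, with fractional positions read by bilinear interpolation. The offsets are supplied by hand here, whereas in DeIN they come from learned offset field layers; all names and the toy feature map are illustrative assumptions.

```python
import math


def bilinear_sample(x, row, col):
    """Read feature map x at a fractional position; outside the map counts as 0."""
    r0, c0 = math.floor(row), math.floor(col)
    value = 0.0
    for r in (r0, r0 + 1):
        for c in (c0, c0 + 1):
            if 0 <= r < len(x) and 0 <= c < len(x[0]):
                weight = (1 - abs(row - r)) * (1 - abs(col - c))
                value += weight * x[r][c]
    return value


# Regular 3 x 3 grid R of kernel locations p_n.
GRID = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


def deformable_conv_at(x, weights, offsets, p0):
    """y(p0) = sum_n w(p_n) * x(p0 + p_n + delta_p_n), one offset per grid point."""
    total = 0.0
    for (dr, dc), w, (odr, odc) in zip(GRID, weights, offsets):
        total += w * bilinear_sample(x, p0[0] + dr + odr, p0[1] + dc + odc)
    return total


# Hypothetical 4 x 4 feature map with x[r][c] = 4 * r + c.
x = [[float(4 * r + c) for c in range(4)] for r in range(4)]
ones = [1.0] * 9
no_offsets = [(0.0, 0.0)] * 9        # reduces to a standard 3 x 3 convolution
shift_right = [(0.0, 0.5)] * 9       # every sample moves half a pixel to the right

print(deformable_conv_at(x, ones, no_offsets, (1, 1)))   # 45.0
print(deformable_conv_at(x, ones, shift_right, (1, 1)))  # 49.5
```

With all offsets set to zero the result equals the ordinary convolution, which is a useful sanity check for any implementation.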

C. INCEPTIONRESNETV2 OF DeIN ARCHITECTURE
In InceptionResnetV2, the deformable convolution is shown in simplified form in Figure 6(a). Inception-ResnetA and Inception-ResnetB both include max-pooling layers; in DeIN, these become deformable max-pooling layers through an added offset layer, which increases the adaptability of DeIN to the data, as shown in Figure 6(b)(c). Because Inception-ResnetV2 has a large number of layers and parameters [34], it is not suitable for rapid diagnosis from TFGs, considering that the fault signal appears periodically within a short time. In this paper, Inception-ResnetV2 is therefore simplified, and computing resources are saved by reducing the number of network layers. IRv2-lite is this reduced version of InceptionResnetV2, designed to improve diagnostic speed.
In DeIN, Block16, Block8 and Block4 replace the standard InceptionResnetV2 layers, and deformable convolution and pooling layers extract features from the TFGs. The deformable convolutional layer placed at the input layer is called Offset-low, and those placed at the Aux output and Main output are called Offset-top. Because the fault signal appears periodically within a short time, we place the Aux output after Block16, so that a light network can be used to output the result. An overview of the IRv2-lite and DeIN structures is given in Figure 7; in Figure 7(b), an Offset value of 'yes' denotes a deformable pooling layer. Detailed parameters are listed in Table 1.

D. LOSS FUNCTION
We encode the labels as one-hot vectors, Equation (14), and choose the softmax function with cross-entropy as the loss function, which is commonly used in multi-class classification tasks. In the loss function, $p_i$ is the distribution of predicted values, $q_i$ is the distribution of true values, $x_j$ is the output of the fully connected (FC) layers, and $n$ denotes the number of classes.
During training, the model output tends toward a specific class because of a known defect of the softmax function [35]. We therefore introduce a uniform distribution (UD) term into the DeIN loss function, so that the loss fits not only the one-hot distribution but also the uniform distribution [36], [37]. The modified loss function improves classification accuracy and effectively prevents overfitting. It is calculated by Equation (17), where ε denotes the harmonic parameter; in this research, we set ε = 0.1.
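A loss of this kind can be sketched as follows. Equation (17) is not reproduced in this text, so the sketch assumes the common label-smoothing form, (1 - ε) times the cross-entropy against the one-hot target plus ε times the cross-entropy against the uniform distribution; the function names are illustrative, and the authors' exact formula may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of FC-layer outputs."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def smoothed_cross_entropy(logits, true_class, eps=0.1):
    """Assumed form: (1 - eps) * CE(one-hot, p) + eps * CE(uniform, p)."""
    p = softmax(logits)
    n = len(p)
    ce_one_hot = -math.log(p[true_class])
    ce_uniform = -sum(math.log(pi) for pi in p) / n
    return (1 - eps) * ce_one_hot + eps * ce_uniform


# A very confident, correct prediction over four classes (hypothetical logits).
logits = [10.0, 0.0, 0.0, 0.0]
plain = smoothed_cross_entropy(logits, true_class=0, eps=0.0)
smoothed = smoothed_cross_entropy(logits, true_class=0, eps=0.1)
print(plain < smoothed)  # True
```

The uniform term penalizes over-confident outputs, so the smoothed loss stays above the plain cross-entropy even when the prediction is correct, which discourages the logits from growing without bound.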

III. EXPERIMENTAL SETUP AND DATA DESCRIPTION

A. DATA PREPROCESSING
STFT is widely used for time-frequency processing of time-domain signals. It slides a window with a fixed stride over the signal to extract features; to limit information redundancy, the stride is usually half the window length. The time-frequency graphs (TFGs) are the output of the STFT, as shown in Figure 8. The proposed DeIN focuses on the feature areas of interest and ignores redundant features and background noise. The offset is the key element of deformable convolution: it makes the extraction area more flexible than that of fixed feature extractors on TFGs, as shown in Figure 9. During training, the offset is updated at each weight optimization step, so the model learns to extract the features that are valuable for the classification task. Furthermore, the standard pooling layer becomes a deformable max-pooling layer through an added offset, and this pooling layer extracts the feature areas of interest from the output of the deformable convolution. In the experimental preparation, the raw time-domain signals are transformed into TFGs. The shape of the input TFG is 64 × 64. The experiments were run on a computer with Ubuntu 18.04, an 8-core Intel Xeon(R) W-2123 CPU @ 3.60 GHz, 32 GB of memory and an NVIDIA GTX 1080 Ti GPU. The CNN models were implemented using the Keras 2.0 framework with TensorFlow 1.9 as the backend.
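The preprocessing step can be sketched with a small pure-Python STFT. The source does not state its window function or window length, so the Hann window and the values below are assumptions: a 126-point window with a stride of 63 (half the window) happens to turn a 4096-point sample into a 64 × 64 magnitude map, which matches the stated input shape but is not confirmed as the authors' setting.

```python
import cmath
import math


def stft_magnitude(signal, window_len, hop):
    """Return one row per frame, each holding window_len // 2 + 1 magnitudes."""
    # Hann window (an assumption; the source does not name its window).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (window_len - 1))
              for n in range(window_len)]
    n_bins = window_len // 2 + 1
    # Precomputed DFT factors exp(-2*pi*i*k*n/N) for the non-negative frequencies.
    twiddle = [[cmath.exp(-2j * math.pi * k * n / window_len)
                for n in range(window_len)] for k in range(n_bins)]
    frames = []
    for start in range(0, len(signal) - window_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(window_len)]
        frames.append([abs(sum(s * t for s, t in zip(seg, row)))
                       for row in twiddle])
    return frames


# Hypothetical test signal: a sine with exactly 10 cycles per 126-point window.
signal = [math.sin(2 * math.pi * 10 * n / 126) for n in range(4096)]
tfg = stft_magnitude(signal, window_len=126, hop=63)
print(len(tfg), len(tfg[0]))  # 64 64
```

In practice a library routine such as an FFT-based STFT would replace the explicit DFT loop; the loop is kept here only so the sketch has no dependencies.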

B. CWRU DATASET
The test stand of the CWRU bearing dataset is shown in Figure 10. Single-point damage was introduced to the inner ring, the outer ring and the ball of the bearing by electro-discharge machining (EDM). Because the position of the outer ring damage is relatively fixed, the location of the damage point relative to the bearing load zone has a direct impact on the vibration response of the motor/bearing system. In this study, the fault location at 6 o'clock and the 48 kHz sampling rate are selected. Vibration acceleration signals were recorded under motor loads of 0, 1, 2 and 3 horsepower (HP); the 1 HP, 2 HP and 3 HP conditions (0 HP is excluded) are named A, B and C, as shown in Table 2.
Each training dataset is divided into four categories according to the type of bearing fault: label '0' denotes a normal bearing, label '1' denotes 3 types of ball fault (EDM fault sizes of 0.007 inch, 0.014 inch and 0.021 inch), label '2' denotes 3 types of inner ring fault (EDM fault sizes of 0.007 inch, 0.014 inch and 0.021 inch), and label '3' denotes 3 types of outer ring fault (EDM fault sizes of 0.007 inch, 0.014 inch and 0.021 inch), as shown in Table 3. In this experiment, the CWRU training dataset contains 18000 samples, with 2000 and 3000 samples for validation and testing, respectively. For AlexNet, ResNet, ICN, IRv2-lite and DeIN, each input sample is a 64 × 64 TFG; ACDIN and WDCNN use the raw time-domain signal as input, and each sample consists of 4096 points.
The accuracy of these deep models is depicted in Figure 11. The A→A, B→B and C→C experiments on CWRU data reach approximately 100% [9], because the training and testing data have the same distribution; the situation is different under real working conditions. Therefore, to verify the generalization ability of DeIN, we apply cross-over tests, with results shown in Figure 11. ACDIN has an average accuracy of only 83.99%, the worst under cross-over testing. WDCNN, with an average accuracy of 87%, ranks second to last. The AlexNet, ResNet, ICN and IRv2-lite models, which use TFGs as input, are generally better than ACDIN and WDCNN. According to Table 2, condition A has a load of 1 HP and a rotational speed of 1772 rpm, B has 2 HP and 1750 rpm, and C has 3 HP and 1730 rpm. The data distribution differs because the rotational speed changes, and the speed difference is largest between A and C, so the accuracy of all deep learning models drops in the A→C and C→A cross-over tests. In A→C, ICN reaches 97.17% and IRv2-lite 86.1%, while DeIN reaches 98.93%, higher than both. In C→A, ICN reaches 94.93% and IRv2-lite 93.43%, while DeIN reaches 96.93%, again higher than both. Overall, the proposed DeIN has an average accuracy of 99.87%, which is much better than the other models; the accuracy distribution is shown in Figure 12. Because the C→A condition (C: 1730 rpm → A: 1772 rpm, low to high speed) yields low accuracy for the deep learning models throughout training, we visualize the features of the last hidden layer via t-SNE to better understand the training process, as shown in Figure 13(a)(b), and compare the confusion matrices for the C→A case, as shown in Figure 13(c)(d).
For both models, label '0' (normal bearing) clusters well. For IRv2-lite, the result for label '2' is not as good: label '2' is misclassified as label '3' in Figure 13(a). In the proposed DeIN, by contrast, label '2' is clearly classified well in Figure 13(b), and the confusion matrix in Figure 13(d) is better than that of IRv2-lite, which has no deformable convolution. This experiment indicates that deformable convolution improves the classification accuracy of DeIN by extracting useful features during network training.
To further examine the performance of DeIN under limited data, we use 5%, 10%, 15%, 25%, 50%, 70% and 100% of the 18000 training samples; the dataset is shown in Table 4. WDCNN, IRv2-lite and DeIN are used in this experiment, and the results are shown in Figure 14. IRv2-lite and DeIN clearly perform better than WDCNN, and DeIN, which has deformable convolution, performs better than IRv2-lite. With 5% of the samples, the accuracy of DeIN is 88.54%, which is 5-10% higher than that of WDCNN and IRv2-lite. With 100% of the training samples, DeIN reaches 99.87%, compared with 91.14% for WDCNN and 93.02% for IRv2-lite.
The box plot in Figure 15 shows that, across training with 5%, 10%, 15%, 25%, 50%, 70% and 100% of the samples, the results are most dispersed for the smallest training set (5%): because the training samples are shuffled and the selected subsets are stochastic, the results of A→C and C→A are unstable. As the training set grows to 100%, the results of A→C and C→A increase and gradually stabilize. This shows that DeIN can effectively fit the datasets and that limited data is a relevant research setting for deep learning models.
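The limited-data protocol can be sketched as follows. The fractions and the 18000-sample total come from the text; the random selection procedure, the seed handling and the function names are assumptions for illustration, since the source only says that the training samples are shuffled.

```python
import random

# Training-set fractions used in the limited-data experiment.
FRACTIONS = [0.05, 0.10, 0.15, 0.25, 0.50, 0.70, 1.00]


def subset_sizes(total, fractions=FRACTIONS):
    """Number of training samples kept for each fraction."""
    return [round(total * f) for f in fractions]


def draw_subset(samples, fraction, seed=0):
    """Random subset holding `fraction` of the samples (hypothetical helper)."""
    rng = random.Random(seed)
    return rng.sample(samples, round(len(samples) * fraction))


print(subset_sizes(18000))  # [900, 1800, 2700, 4500, 9000, 12600, 18000]
```

Repeating `draw_subset` with different seeds is one way to produce the spread of results that a box plot summarizes; with only 900 samples, the choice of subset has a visible effect on accuracy.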

C. PADERBORN BEARING DATASET
To generate experimental vibration data for damaged bearings, a modular test rig is used that allows different defects to be tested flexibly under different working conditions [38]. The test rig consists of several modules: (1) an electric motor, (2) a torque-measurement shaft, (3) a rolling bearing test module, (4) a flywheel and (5) a load motor, as shown in Figure 16.
Ball bearings with different types of damage are mounted in the bearing test module to generate the experimental data. The other rig is the bearing run-to-failure test rig in Figure 17, in which the applied radial force is higher than in usual bearing applications in order to accelerate the appearance of fatigue damage. Both the artificial-damage test rig and the run-to-failure test rig record at a high sampling rate (64 kHz).
On the two test rigs, 32 different bearings were tested under 3 different working conditions: 12 bearings with artificial damage, 14 bearings with natural damage from accelerated lifetime tests, and 6 healthy bearings. The main bearing fault locations are shown in Figure 18.
In the basic setup (set No. 0) of the operating parameters, the test rig runs at n = 1,500 rpm with a load torque of M = 0.7 Nm and a radial force on the bearing of F = 1,000 N. Three additional settings (set No. 1-3) are obtained by reducing the parameters one at a time to n = 900 rpm, M = 0.1 Nm and F = 400 N, respectively. The three conditions used here are named D, E and F, as shown in Table 5. For each bearing and load setting, 20 measurements were taken, each recording a vibration signal of approximately 4 s at a sampling rate of 64 kHz, for a total of approximately 256,000 points. For this experiment, the dataset contains signals from healthy bearings, artificially damaged bearings and bearings with real damage; all of these bearings ran under three different loads at a speed of 1500 rpm. The selected rolling bearing data files are listed in Table 6, and their details are given in Table 7.
To assess DeIN's ability, the working conditions are named D, E and F; D→E denotes that the model is trained under working condition D and tested under working condition E. Similarly, we conducted the experiments D→F, E→D, E→F, F→D and F→E. In the Paderborn experiment, the seven deep learning models ACDIN, WDCNN, AlexNet, ResNet, ICN, IRv2-lite and DeIN were evaluated, as shown in Figure 19. ResNet had the worst performance, with an average accuracy of 77.52%. The average accuracy of DeIN is far higher than that of the other six models. Because the rotational speed and the load torque are the same in D and F, the seven models generally achieve better results in D→F and F→D; the accuracy of DeIN in D→F and F→D reaches 99.63% and 99.70%, respectively, far exceeding the 78.73% and 79.53% of ACDIN. The overall performance of IRv2-lite is also very good, except in D→E and E→D. In the CWRU experiment, ICN has an accuracy of 97.16%, but in the Paderborn experiment it reaches only 82.05%, which is lower than DeIN's average accuracy of 94.52%. DeIN is much better than the other models, as the accuracy distribution in Figure 20 shows.
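The cross-over protocol described above can be sketched as follows: every ordered pair of distinct working conditions gives one train→test experiment, and a model's average accuracy is the mean over those pairs. The `evaluate` callback is a hypothetical placeholder for training on one condition and testing on another; no accuracy values from the paper are encoded here.

```python
from itertools import permutations


def cross_over_pairs(conditions):
    """All ordered (train, test) pairs of distinct working conditions."""
    return list(permutations(conditions, 2))


def average_accuracy(evaluate, conditions):
    """Mean accuracy over all cross-over pairs.

    `evaluate(train, test)` is a hypothetical callback returning accuracy in %.
    """
    pairs = cross_over_pairs(conditions)
    return sum(evaluate(train, test) for train, test in pairs) / len(pairs)


pairs = cross_over_pairs(["D", "E", "F"])
print([f"{train}->{test}" for train, test in pairs])
# ['D->E', 'D->F', 'E->D', 'E->F', 'F->D', 'F->E']
```

The same helper applies to the CWRU conditions A, B and C, where it yields the six cross-over tests between them.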
To better understand the behavior of IRv2-lite and DeIN on the Paderborn bearing datasets, Figure 21(a)(b) shows the t-SNE visualization of the last hidden layer, and Figure 21(c)(d) shows the confusion matrices. In Figure 21(a)(b), label '1' (outer race damage) clusters well for both models. For IRv2-lite, the separation of label '0' (normal bearings) is not as good: label '0' is misclassified as label '2' (inner race damage) in Figure 21(a). In the proposed DeIN, by contrast, label '2' is clearly classified well in Figure 21(b), and the confusion matrix in Figure 21(d) is better than that of IRv2-lite, which has no deformable convolution. This experiment again indicates that deformable convolution improves the classification accuracy of DeIN by extracting useful features during network training.

IV. CONCLUSION
In this paper, we have proposed DeIN, an approach for rolling bearing fault diagnosis that improves the model's ability to extract features from bearing fault data by combining deformable convolution with InceptionResnetV2. Experiments with various deep learning models show that bearing data transformed into TFGs by STFT achieve better fault diagnosis performance. DeIN, based on TFGs, has an excellent ability to suppress background noise, and the combination of the Auxiliary-output classifier and the Main-output classifier maintains the training speed of the model while improving diagnostic precision. Compared with the popular time-domain models WDCNN and ACDIN and with recent TFG-based fault diagnosis models (such as ICN), DeIN reaches higher accuracy, 99.87% and 94.52% on the CWRU and Paderborn datasets, respectively. Furthermore, to promote open source in the field of bearing fault diagnosis, all models and datasets from this work are open-sourced and can be downloaded at https://mekhub.cn/philbaz/Bearing_Fault_Diagnosis_in_TFGs_Using_DeIN. In the future, DeIN will be used to solve fault diagnosis problems in more fields.