Deep Residual Network for Identifying Bearing Fault Location and Fault Severity Concurrently

Fault diagnosis comprises two tasks, i.e., fault location detection and fault severity identification, both of which are significant to equipment maintenance. The former indicates which component is defective, and the latter provides evidence on the residual life of the component. However, traditional fault diagnosis methods, such as time-domain, frequency-domain, and time-frequency-domain methods, can only achieve one goal at a time. They are not able to produce features representative enough to deal with the two tasks simultaneously. In addition, the amount of equipment monitoring data is increasing enormously. There is an urgent need in the field of fault diagnosis to handle this massive data, obtain highly discriminative features, and produce accurate diagnosis results. Aimed at these problems, a deep residual network based on multi-task learning is proposed, which takes the detection of fault location and the judgment of fault severity into account simultaneously. This network is fed with two kinds of diagnostic information, which helps it mine the potential links between the two tasks of fault diagnosis and generate highly representative features. Moreover, a visualization method for the role of a deep neural network is proposed on the basis of maximizing activation values. It breaks with the traditional practice of using deep neural networks as black boxes. A real bearing experiment validates that the proposed method is reliable and effective for bearing fault diagnosis.


I. INTRODUCTION
Bearings are widely used in many rotating machines and play a significant role in them. According to statistics, about forty-five percent of machinery faults are caused by bearings [1]. In order to ensure the normal operation of equipment, many sensors are installed on it for monitoring its operating status, which results in a surge in the amount of data. The data brings many obstacles and challenges, but also excellent research material. Many researchers have given much consideration to the issue of handling massive status-monitoring data. They mainly focus on methods for producing highly representative features that can be applied to accomplish fault diagnosis tasks with high accuracy [2], [3].

The associate editor coordinating the review of this manuscript and approving it for publication was Wei Wei.
Since the vibration data of bearings contain rich information about their operating status, vibration-based feature extraction methods have attracted much attention [4]-[6]. They can be classified into two categories: traditional feature extraction methods based on signal processing, and modern intelligent feature extraction methods. Both have made great achievements in the past decades. The former can be further divided into three sub-classes. The first is time-domain statistical features, such as root mean square (RMS) [7], kurtosis [8], and so on. The second is frequency-domain characteristics, of which the fast Fourier transform (FFT) [9] is representative. The third is time-frequency-domain features; for example, the short-time Fourier transform (STFT) [10], the Wigner-Ville distribution (WVD) [11], empirical mode decomposition (EMD) [12], and the wavelet transform [13], [14] are all classical methods in this category. However, the above-mentioned methods are aimed at a single component, a single type of fault, and a single short time series. They are not good at dealing with massive data and multiple components. Hence, modern intelligent feature extraction methods have developed rapidly. The artificial neural network (ANN) [15], [16] and sparse coding (SC) [17], [18] are typical representatives. Sparse coding can be seen as one kind of ANN with a shallow structure. In essence, they are all data-driven methods, by which the fault diagnosis issue is converted into a pattern recognition problem. Though these methods do work in intelligent fault diagnosis of rotating machinery with massive data, they still have two shortcomings. The first is that their effectiveness depends heavily on input features designed manually by professional researchers or experts. The second is that their capacity for handling complex non-linear fault diagnosis problems is limited, since they have only shallow structures. To overcome these deficiencies, a promising feature extraction method named deep learning (DL) [19] has been developed in recent years.
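As a concrete illustration of the classical time-domain statistics mentioned above, the following minimal numpy sketch computes RMS and kurtosis; the signals and the impulse spacing are hypothetical stand-ins, not data from the paper's experiment.

```python
import numpy as np

def rms(x):
    """Root mean square: overall vibration energy level."""
    x = np.asarray(x, dtype=float)
    return np.sqrt(np.mean(x ** 2))

def kurtosis(x):
    """Normalized fourth moment: about 3.0 for a Gaussian signal,
    and noticeably larger when impulsive fault content appears."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    return np.mean((x - mu) ** 4) / sigma ** 4

# A healthy-like Gaussian signal vs. one with periodic impulses
rng = np.random.default_rng(0)
healthy = rng.standard_normal(12000)
faulty = healthy.copy()
faulty[::400] += 8.0  # hypothetical periodic fault impacts

print(rms(healthy), kurtosis(healthy))  # kurtosis near 3
print(rms(faulty), kurtosis(faulty))    # kurtosis clearly above 3
```

The kurtosis of the impulsive signal rises well above the Gaussian baseline, which is exactly why it is a popular hand-crafted indicator of bearing defects.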
Deep learning is one of the frontiers and research hotspots in machine learning and artificial intelligence, typically embodied as the deep neural network (DNN). It enhances the ability of the traditional ANN by adding many non-linear mapping layers, since the phenomenon of gradient vanishing is greatly restrained. The DNN has now achieved great success in many domains, such as image processing [20], [21], audio classification [22], [23], and others [24], including fault diagnosis [25]-[29]. From these fault diagnosis case studies, it can be seen that the DNNs used in them have two deficiencies. One is that the constructed DNNs can only deal with a single aspect of fault diagnosis, which in fact contains two aspects, i.e., fault location detection and fault severity identification. The other is that these DNNs are used as ''black boxes'' in most of the research cases. In other words, the features extracted by these DNNs cannot be clearly understood, which brings many obstacles to subsequent researchers: they cannot clearly understand why the features extracted by these DNNs perform well, or how to improve these models later.
Aimed at these problems, a deep residual network (DRN) based on multi-task learning, together with a generic visualization method of the DNN's role, is proposed. The former can handle the above-mentioned two tasks of fault diagnosis synchronously. It has two output layers, each corresponding to a different task. The tasks share part of the weight parameters, which helps to reduce the complexity of the network and mine the relationship between the two tasks. In other words, this network more easily learns intrinsic features that characterize the essential expression of bearing faults, since it is fed with more knowledge, i.e., information on both fault location and fault severity. Meanwhile, in order to help researchers understand why the proposed deep residual network works, this paper puts forward a visualization method of the network's role. The core of this method is that the activation value reflects the response of the target neuron to the input. Based on the maximization of the activation value, this method can find the sensitive input pattern of any neuron in the input feature space, where humans can understand it intuitively.
The main contributions of this paper are summarized as follows: (1) A deep residual network based on multi-task learning is proposed to handle bearing fault location detection and fault severity judgment simultaneously. It can produce highly discriminative feature representations and helps to mine the potential links between these two tasks of fault diagnosis.
(2) A mathematical model for visualizing the deep neural network is constructed on the basis of maximizing activation values, breaking with the traditional practice of using deep neural networks as ''black boxes''. (3) A real bearing experiment verifies the effectiveness and reliability of the proposed methods. The experimental results also reveal that the frequency and energy information of the resonance zone of the vibration signal is significant to bearing fault diagnosis.
The rest of the paper is organized as follows: Section 2 introduces the construction of the proposed deep residual network on the basis of multi-task learning. Section 3 deduces the generic visualization method of the DNN. Section 4 confirms the effectiveness of the proposed network model and visualization method with a real bearing experiment. The conclusion is drawn in Section 5.

II. DEEP RESIDUAL NETWORK BASED ON MULTI-TASK LEARNING

A. DEEP RESIDUAL NETWORK FOR FEATURE EXTRACTION
Inspired by the works [30], [31], a novel deep residual network aimed at one-dimensional time-domain vibration signals is constructed. Bearing vibration signals contain much operating status information, and the occurrence of a bearing fault is reflected in them; many vibration-based methods have verified this point of view [32]-[34]. Therefore, the raw time-domain vibration signal is used as the network input. Fig. 1 shows the overall structure of the proposed deep residual network. This network is designed by obeying the general CNN construction principle, i.e., (1) the backbone is made up of alternately stacked convolution layers and pooling layers; and (2) the tail is composed of fully-connected (FC) layers that project the extracted feature representation into class label information.
It is worth mentioning that the network input is just the raw vibration signal, without any signal preprocessing or handcrafted features. Twenty-seven consecutive convolution layers are employed to perform hierarchical feature learning, followed by one global average pooling (GAP) layer and two fully-connected layers. One FC layer corresponds to bearing fault location; the other corresponds to bearing fault severity. The reason why convolution is used as the sub-module is that the input vibration signal is considered to be locally stationary. In other words, for one convolution kernel, the local features calculated in one short-time window of the original signal are also suitable in other window regions. The head layer of the proposed network is designed specially, since the pre-activation block is employed [31]. The middle part contains 12 residual blocks, and each block includes two batch normalization (BN) layers and one dropout layer. The residual structure lets gradient information propagate easily to earlier layers, which in turn allows the network to be deeper and easier to train [30].
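The pre-activation block described above can be sketched as follows. This is a minimal single-channel numpy illustration of the BN-ReLU-conv ordering with an identity shortcut, not the authors' PyTorch implementation; the kernels, the inference-style normalization, and the omission of dropout are all simplifying assumptions.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Inference-style normalization over the sequence (sketch only)
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def conv1d_same(x, kernel):
    # 'same'-padded 1-D convolution for a single channel
    return np.convolve(x, kernel, mode="same")

def relu(x):
    return np.maximum(x, 0.0)

def preact_residual_block(x, k1, k2):
    """Pre-activation residual block [31]: BN -> ReLU -> conv,
    BN -> ReLU -> conv, then add the identity shortcut.
    (Dropout is omitted here; it is only active during training.)"""
    h = conv1d_same(relu(batch_norm(x)), k1)
    h = conv1d_same(relu(batch_norm(h)), k2)
    return x + h  # identity shortcut keeps gradients flowing

x = np.sin(np.linspace(0, 20, 1024))  # stand-in vibration segment
k1 = np.ones(16) / 16.0               # hypothetical kernels of size 16
k2 = np.ones(16) / 16.0
y = preact_residual_block(x, k1, k2)
print(y.shape)  # same length as the input
```

Because the shortcut is an identity mapping, the block's output stays the same length as its input, and the gradient of the loss reaches `x` directly through the addition.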
Batch normalization can also alleviate the problems that deep neural networks are hard to train and converge slowly. It speeds up network training by achieving a stable distribution of activation values and allowing a higher learning rate, since the internal covariate shift caused by the change of network parameters is reduced [35]. The core of this method is stated as follows. The symbol $\chi$ denotes one mini-batch of data in one training step, i.e., $\chi = \{x_1, x_2, x_3, \ldots, x_m\}$. The output of the batch normalization layer is computed by

$$y_i = \gamma \hat{x}_i + \beta \quad (1)$$

where $\gamma$ and $\beta$ are learned parameters; $\hat{x}_i$ is calculated through

$$\mu_\chi = \frac{1}{m}\sum_{i=1}^{m} x_i \quad (2)$$

$$\sigma_\chi^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_\chi)^2 \quad (3)$$

$$\hat{x}_i = \frac{x_i - \mu_\chi}{\sqrt{\sigma_\chi^2 + \epsilon}} \quad (4)$$

where $\epsilon$ is a small constant added for numerical stability. From (1) to (4), it can be found that batch normalization can also work as a form of data augmentation, which is able to prevent over-fitting to a certain degree. In the training process, a training sample is considered together with the other samples in its mini-batch. The network no longer produces deterministic values for a given training sample, since one training sample forms different mini-batches with different other samples, and each mini-batch has a direct effect on the learned parameters $\gamma$ and $\beta$. This is equivalent to bringing disturbances into the sample data, which means the training samples are augmented. A dropout layer is also introduced into the proposed deep residual network, so as to explicitly avoid the over-fitting problem. The core idea is to randomly deactivate some neurons with a certain probability $p$, meaning that some of the connections between two adjacent layers are cut off. This technique can be considered as sampling a ''small'' network from the original big network, and only this ''small'' network is optimized during the training stage. At test-time inference, dropout is no longer used. The whole network can then be seen as the average of many deep residual networks that share the same structure, which is very similar to the idea of ensemble learning.
This technique is able to largely improve the generalization ability of a DNN, outperforming many other regularization methods in many research cases [36]. The detailed procedure of the dropout method is as follows. The symbol $a^{(l)}$ denotes the vector of activation values in layer $l$. Then, in the training period, the input vector $i^{(l+1)}$ is calculated through

$$r_j^{(l)} \sim \mathrm{Bernoulli}(1-p) \quad (5)$$

$$i^{(l+1)} = r^{(l)} \odot a^{(l)} \quad (6)$$

where $\odot$ denotes the element-wise product and $r^{(l)}$ is the vector of independent Bernoulli random variables.
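The dropout masking step can be sketched as below. Note this uses the commonly preferred "inverted" dropout variant, which rescales by 1/(1-p) at training time so that no rescaling is needed at inference; this is an implementation choice on our part, not necessarily the exact variant used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p, rng):
    """Training: each activation is kept with probability 1-p
    (r ~ Bernoulli(1-p)) and scaled by 1/(1-p), so the expected
    value of the layer's output is unchanged."""
    r = rng.random(a.shape) >= p        # Bernoulli keep-mask
    return a * r / (1.0 - p)

def dropout_eval(a):
    # Inference: dropout is disabled; the full network is used
    return a

a = np.ones(10000)                      # toy activation vector
out = dropout_train(a, p=0.5, rng=rng)
print(out.mean())                       # close to 1.0 in expectation
```

Each surviving activation is doubled here (p = 0.5), and roughly half are zeroed, so the mean stays near the original value while each individual forward pass sees a different "small" sub-network.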

B. MULTI-TASK LEARNING
Though we could optimize a separate model for each diagnosis task and obtain acceptable results, information coming from the related task that helps improve the generalization ability of the model might be neglected. By sharing part or all of the feature representation, a fault diagnosis model can generalize well on the homologous task. There is a certain connection between the two tasks of fault diagnosis: the detection of fault location is related to the fault characteristic frequency in the resonance demodulation spectrum, while the evaluation of fault severity is connected with the energy of the resonance zone of the vibration signal. Considering this, the deep residual network based on multi-task learning is built. Multi-task learning enhances model generalization by utilizing the domain-specific information contained in the training signals of related tasks. It can be seen as a form of inductive transfer that introduces an inductive bias into a model to improve its performance, making the model prefer some hypotheses over others.
In this research, hard parameter sharing, one of the multi-task learning approaches for deep neural networks, is applied. As shown in Fig. 1, the proposed network shares the hidden convolution layers between the two tasks and keeps one task-specific FC output layer for each. This network is aimed at dealing with these two different but associated fault diagnosis tasks simultaneously. For the detection of fault location, the cost function is expressed as follows:

$$J_L(\theta_L) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k_L} 1\{y_L^{(i)} = j\}\log P\!\left(y_L^{(i)} = j \mid x^{(i)}; \theta_L\right) + \lambda_L R(\theta_L) \quad (7)$$

where the subscript $L$ stands for location; $\theta_L$ comprises the weight parameters of the feature-shared layers and the corresponding output layer of the fault location task; $m$ and $k_L$ are the number of samples and the number of fault location types, respectively; $1\{\cdot\}$ denotes the indicator function; $P(y_L^{(i)} = j \mid x^{(i)})$ is the probability of the $i$-th sample belonging to class $j$ given the feature $x^{(i)}$; $R(\theta_L)$ is the regularization term on the parameters $\theta_L$, and $\lambda_L$ is its weight factor. The task of fault severity evaluation is likewise converted into a classification problem, with the fault severity of the bearing divided into multiple levels. The cost function of this task is then

$$J_S(\theta_S) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k_S} 1\{y_S^{(i)} = j\}\log P\!\left(y_S^{(i)} = j \mid x^{(i)}; \theta_S\right) + \lambda_S R(\theta_S) \quad (8)$$

where the subscript $S$ stands for severity. This loss function is similar to (7). The overall optimization objective of the whole network model is then

$$J = J_L + \alpha J_S \quad (9)$$

where the symbol $\alpha$ weighs the relative importance of the two tasks. From (9), it is clear that this network is optimized for the two tasks simultaneously. The network is fed with more knowledge, which makes it learn more intrinsic and representative features. It is worth pointing out that hard parameter sharing greatly reduces the risk of over-fitting: it has been demonstrated that the risk of over-fitting the shared parameters is an order $N$ smaller than that of over-fitting the task-specific parameters, where $N$ denotes the number of tasks [37]. Intuitively, the more tasks the model learns concurrently, the more general a representation it has to learn to cover all tasks, and the smaller the chance of over-fitting on the original task. The proposed network is thus helpful for mining the potential links between the two tasks of bearing fault diagnosis.
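The combined objective in (9) can be sketched as follows. This is a minimal numpy illustration of two softmax cross-entropy losses summed with the weight alpha; the regularization terms R(theta) are omitted, and the batch, logits, and labels are hypothetical toy values.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    """Softmax cross-entropy, i.e., the data term of (7)/(8)."""
    p = softmax(logits)
    m = logits.shape[0]
    return -np.mean(np.log(p[np.arange(m), labels] + 1e-12))

def multitask_loss(loc_logits, loc_labels, sev_logits, sev_labels, alpha=1.0):
    """Overall objective J = J_L + alpha * J_S from (9);
    regularization terms are omitted in this sketch."""
    J_L = cross_entropy(loc_logits, loc_labels)
    J_S = cross_entropy(sev_logits, sev_labels)
    return J_L + alpha * J_S

# Toy batch: 3 samples, 4 fault locations and 4 severity levels
rng = np.random.default_rng(0)
loc_logits = rng.standard_normal((3, 4))
sev_logits = rng.standard_normal((3, 4))
J = multitask_loss(loc_logits, np.array([0, 1, 2]),
                   sev_logits, np.array([3, 0, 1]), alpha=0.5)
print(J)
```

In a framework such as PyTorch the same effect is obtained by summing two `CrossEntropyLoss` terms before calling backward, so the gradient of both tasks flows into the shared convolution layers.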

III. VISUALIZATION OF DEEP NEURAL NETWORK
In most research cases, a DNN is used as a ''black box''. The features in the hidden layers cannot be understood intuitively, and the reason why they perform so well is not clear; they are produced via multiple non-linear mapping layers between the input layer and the current layer, which makes it hard for humans to comprehend the abstract, high-dimensional features in the higher layers. Considering this problem, a visualization method of the role of a deep neural network is proposed on the basis of activation maximization. That is, an approximately optimal pattern is computed in the input feature space, namely the one that gives the chosen unit or neuron its largest activation value.

A. MATHEMATICAL MODEL
In this research, the visualization of what a neuron in the FC layers captures is taken as an example, which can help to mine the link between the two tasks of fault diagnosis; the method applies equally to any hidden unit in a DNN. For one hidden unit $h_i^{(j)}$ in the $j$-th layer, the activation value $a_i^{(j)}$ represents its response to a specific input. Since the activation function used in the proposed deep residual network is the rectified linear unit, the larger this value is, the more sensitive the hidden unit is to the input stimulus. Once the neural network is fully trained, the feature pattern extracted by this neuron can be obtained by solving for the input pattern that maximizes its activation value. In other words, we can solve the following optimization problem to find the most sensitive input pattern of this neuron in the input feature space:

$$x^{*} = \arg\max_{x} \, a_i^{(j)}(x), \quad \text{s.t. } \|x\|^2 \le E \quad (10)$$

where $x^{*}$ indicates the optimal solution in the input feature space. The symbol $E$ represents the maximum energy of all vibration signals in the training set, which limits the search space of the solution. On the basis of the Lagrange multiplier method, the above optimization problem can be converted into an extreme value problem of the following Lagrange function:

$$R(x, \lambda) = a_i^{(j)}(x) - \lambda \left(\|x\|^2 - E\right) \quad (11)$$

where $a_i^{(j)}(x)$ represents the activation value for the input vibration signal $x$, and the symbol $\lambda$ denotes the Lagrange multiplier. The Karush-Kuhn-Tucker (KKT) conditions of (11) are ignored, but this transformation indeed works well in practice.

B. OPTIMIZATION METHOD
From (11), it is obvious that the function $R(x, \lambda)$ is highly non-convex; it is therefore almost impossible to obtain the globally optimal solution. As such, the numerical optimization algorithm known as gradient descent (GD) is employed to search for an approximately optimal solution. Its process includes three steps, i.e., initializing the solution randomly, computing the gradient with respect to the parameters, and updating the parameters.
Following the generic steps of GD, we find that there is a big difference among the approximately optimal solutions when each of them is initialized randomly. The consistency of the solutions is quite poor, which leaves researchers puzzled as to what the target unit captures. This phenomenon is caused by the non-convexity of the function $R(x, \lambda)$: the solution is easily trapped in a local optimum. Aimed at this problem, a natural initialization method for the solution is proposed: any sample in the training set can serve as a good initial value. This initialization method is based on the following hypotheses: (1) the features in any hidden layer of a deep neural network are trained and learned from the training set; (2) viewed from the input feature space, the features extracted in a hidden layer can be considered as the whole or partial feature expression of the training samples.
The above assumptions limit the search range of the optimal solution. The result of the subsequent experiment supports the aforementioned hypotheses and validates the effectiveness of the initialization method: the proposed initialization makes the feature expression of the approximately optimal solution much more stable.
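The search described by (10)-(11) can be sketched with projected gradient ascent on a toy neuron. Here the "neuron" is a hypothetical ReLU of a linear response, the starting point stands in for a training sample that already weakly excites the unit, and the hard projection onto the energy ball replaces the Lagrangian formulation; none of this is the authors' exact procedure.

```python
import numpy as np

def activation(x, w):
    # Toy "neuron": ReLU of a linear response to the input pattern
    return max(np.dot(w, x), 0.0)

def grad(x, w):
    # Gradient of the ReLU neuron w.r.t. the input
    return w if np.dot(w, x) > 0 else np.zeros_like(w)

def maximize_activation(x0, w, E, lr=0.1, steps=200):
    """Projected gradient ascent on a(x) subject to ||x||^2 <= E,
    a practical stand-in for the Lagrangian in (11). x0 plays the
    role of a training sample, per the proposed initialization."""
    x = x0.copy()
    for _ in range(steps):
        x = x + lr * grad(x, w)
        n2 = np.dot(x, x)
        if n2 > E:                       # project back into the energy ball
            x = x * np.sqrt(E / n2)
    return x

rng = np.random.default_rng(0)
w = rng.standard_normal(64)
# Hypothetical "training sample": weakly correlated with the neuron's pattern
x0 = 0.05 * w + 0.01 * rng.standard_normal(64)
E = 64.0
x_star = maximize_activation(x0, w, E)
print(activation(x_star, w))  # far larger than activation(x0, w)
```

For this linear-ReLU toy the optimum is the rescaled weight vector itself, which mirrors the paper's observation: the recovered input pattern is whatever stimulus the unit is most sensitive to, with everything else suppressed.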

IV. EXPERIMENTAL VERIFICATION

A. EXPERIMENTAL SETUP
To investigate the effectiveness of the proposed method, an experiment is carried out on a defective bearing dataset from the Case Western Reserve University Bearing Data Center [38]. It is a widely used open public dataset that provides labels for bearing fault type and bearing fault severity simultaneously. As shown in Fig. 2, the experiment rig is mainly composed of a 2 hp motor, a torque transducer, a dynamometer, a load motor, and the tested bearings. The defective drive-end bearings are seeded with single-point faults at the outer raceway, inner raceway, and rolling elements by electro-discharge machining (EDM). The fault diameters fall into four categories, i.e., 7 mils, 14 mils, 21 mils, and 28 mils (1 mil = 0.001 inch), of which the first three are selected as test subjects in this research. The tested bearings are deep groove ball bearings of type 6205-2RS JEM SKF. Vibration data are collected by an accelerometer with a sampling frequency of 12 kHz.

B. DATASET PRODUCTION
In order to train the proposed deep residual network well, a simple data augmentation method is introduced: each continuous acquisition of the raw vibration signal is divided into multiple segments with fifty percent overlap. For example, one vibration signal with 1536 points can be broken up into two samples of 1024 points each. Note that there is no intersection of sampling points between the training set and the test set. In total, there are four types of bearing fault location, i.e., healthy state (HS), outer race fault state (ORFS), inner race fault state (IRFS), and rolling element fault state (REFS). Each of the latter three has three levels of defect severity (7 mils, 14 mils, 21 mils). Therefore, the bearing health status can be divided into ten classes. Detailed information on the training set and test set is provided in Table 1. Fig. 3(a) shows a typical vibration signal of each class.
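The overlapping segmentation above amounts to a sliding window with a stride of half the window length; a minimal sketch (the function name and API are our own):

```python
import numpy as np

def segment_signal(signal, length=1024, overlap=0.5):
    """Split one long acquisition into fixed-length samples with the
    given overlap ratio; trailing points that cannot fill a full
    window are discarded."""
    step = int(length * (1.0 - overlap))
    n = (len(signal) - length) // step + 1
    return np.stack([signal[i * step: i * step + length] for i in range(n)])

x = np.arange(1536)       # 1536 points -> exactly two 1024-point samples
samples = segment_signal(x)
print(samples.shape)      # (2, 1024)
```

With 50% overlap, the second half of each sample is identical to the first half of the next, which roughly doubles the number of training samples from a fixed acquisition.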

C. RESULTS OF FAULT DIAGNOSIS AND NETWORK VISUALIZATION
Based on the principles illustrated in Section 2, the proposed deep residual network is constructed. The length of each input vibration signal is 1024 points. Max-pooling (MP) is applied in the pooling layer. The size of all convolution kernels is 16. At every other residual block, the input signal is down-sampled with a ratio of 2. The activation function is the rectified linear unit. The number of neurons is set to 4 in each fully-connected layer; the two layers correspond to the 4 fault locations and the 4 fault severity levels, respectively. The network is implemented in PyTorch, a well-known open-source deep learning framework developed by Facebook.
The proposed deep residual network is trained on the bearing training set by the back-propagation (BP) algorithm [39]. When the model is fully trained, the weight parameters of all the layers are obtained. Then, features rich in semantic and discriminative information can be computed at the global average pooling layer of the proposed deep network by network inference. The model achieves high accuracy in both the fault location detection task and the fault severity judgment task; the diagnosis accuracy is listed in detail in Table 2 and Table 3. Fig. 3(b) shows the typical features of each class. From this figure, it can be seen that these features are hard for researchers to comprehend, even though there are some differences between them. Therefore, the visualization of what the neurons in the FC layers capture is carried out. The original vibration signals depicted in Fig. 3(a) are selected as the initialization values. On the basis of (11) and GD, the sensitive input patterns of the selected neurons are obtained; Fig. 4 depicts them in the input feature space.
The result is striking. The information of the fault characteristic frequencies and the energy of the resonance zone is retained fairly well, while the noise is heavily suppressed. For example, the sensitive input pattern of the neurons related to the normal health status is very similar to a DC signal with zero mean. The sensitive input signal of the neurons relevant to slight ORFS shows that the local disturbance between adjacent transient impulses is almost completely restrained and eliminated. From this result, it can be concluded that the frequency and energy information of the resonance zone of the vibration signal plays a crucial role in the real fault features. The result also demonstrates that the features extracted by the proposed deep residual network indeed characterize the intrinsic features of bearing faults.
To further investigate the effectiveness of the proposed method, a hierarchical diagnosis network (HDN) [25] based on the deep belief network (DBN) is employed for comparison. It is also one kind of DNN, namely a cascade network stacked from multiple DBNs; according to the literature review, it is one of the few models that can accomplish the two tasks of fault diagnosis. It diagnoses fault location and fault severity sequentially: the first DBN gives the result of fault location, and on this basis, the result of fault severity is produced by the latter three DBN models, each of which corresponds to one fault location type. The HDN is built with the same network parameters as in Ref. [25], except that the length of each raw vibration signal is shortened to 1024 points. The input feature of the HDN is the wavelet packet energy (WPE) calculated from a 4-level wavelet packet decomposition. The diagnosis results of the HDN are listed in Table 2 and Table 3. At the same time, to explore the effectiveness of multi-task learning, the proposed model is also trained with only one task of fault diagnosis while the other parts of the network remain unchanged. For the task of fault location detection, it is trained with only the loss function $J_L$; for the task of fault severity evaluation, it is trained with the single loss $J_S$. Their results are also listed in Table 3 under the names DRN-L and DRN-S, respectively.
All four models achieve comparable performance, while the proposed DRN performs best, with 98.9% accuracy in the fault severity evaluation task, compared with the HDN (98.4%) and DRN-S (97.8%). This result carries two messages. On the one hand, the proposed deep residual network is effective at extracting fault-related features from the original raw vibration signal, compared with the HDN, which employs handcrafted features. On the other hand, the DRN outperforms DRN-S, which indicates that the former learns more intrinsic features of bearing faults more easily by being fed more knowledge; the loss of the fault location task works as a good regularization term for the fault severity task. Overall, the DRN can be considered a comprehensive and precise bearing fault diagnosis model, detecting fault location and judging fault severity at the same time.
Finally, to study the generalization capability and noise robustness of the proposed DRN model, each sample in the test set is corrupted with additive Gaussian white noise at different signal-to-noise ratios (SNR), while the DRN is trained on the original training set without additive noise. This situation is very close to real industrial production, where the noise varies a lot; after all, labeled training samples cannot be obtained under every noisy environment. Fig. 5 shows the diagnosis accuracy of the DRN at different SNRs. The DRN model achieves more than 90% accuracy in both tasks when the SNR is larger than 0 dB. An SNR of 0 dB means that the power of the noise is equal to that of the original signal. The result demonstrates that the DRN model has good robustness against noise. In a real industrial environment, the performance of the DRN model could be further improved as the amount of data increases.
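Corrupting a signal at a prescribed SNR, as in this robustness test, can be sketched as follows; the test tone is a hypothetical stand-in for a vibration sample.

```python
import numpy as np

def add_awgn(signal, snr_db, rng):
    """Add white Gaussian noise so that the resulting SNR (in dB)
    matches snr_db; at 0 dB the noise power equals the signal power."""
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    noise = rng.standard_normal(signal.shape) * np.sqrt(p_noise)
    return signal + noise

rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 50 * np.linspace(0, 1, 12000))  # stand-in signal
y = add_awgn(x, snr_db=0.0, rng=rng)

# Verify the achieved SNR from the injected noise
measured = 10 * np.log10(np.mean(x ** 2) / np.mean((y - x) ** 2))
print(round(measured, 1))  # close to 0.0 dB
```

Sweeping `snr_db` over a range of values and re-evaluating the trained model on the corrupted test set reproduces the kind of robustness curve shown in Fig. 5.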

V. CONCLUSION
This paper proposes an intelligent diagnosis system for bearings based on a deep residual network. The input is just the raw time-domain vibration signal, not hand-crafted or pre-defined features designed by experts. The proposed network combines the loss function of the fault location task and that of the fault severity task; fed with more knowledge, it extracts more intrinsic features of bearing faults. The features extracted by this network can be used in both fault location detection and fault severity evaluation, and in this manner the subtle link between the two tasks can be discovered. The subsequent visualization of the sensitive input patterns indeed reveals that the frequency and energy information of the resonance zone of the vibration signal is significant to bearing fault diagnosis. To further deepen the understanding of this network and break with the traditional practice of using deep learning as a ''black box'', a visualization method of higher-layer features is proposed on the basis of maximizing activation values. In this process, a natural and elegant initialization rule for the solution is put forward, which helps researchers comprehend what the hidden units capture in the input feature space. In addition, to provide comprehensive evidence for the effectiveness of the DRN, another intelligent diagnosis method named HDN is employed for comparison, and the DRN trained with only one task loss, i.e., DRN-L and DRN-S, is used as a baseline. As a result, the DRN achieves comparable or even better results in the different fault diagnosis tasks, demonstrating good potential for bearing fault diagnosis with two tasks. In future work, the DRN may be combined with other fault information to further improve its diagnosis performance.
GUANGHUA XU (Member, IEEE) received the B.S., M.S., and Ph.D. degrees from Xi'an Jiaotong University, China, in 1995, all in mechanical engineering.
He is currently a Professor with the State Key Laboratory for Manufacturing Systems Engineering, Xi'an Jiaotong University. His current research directions are in mechanical system reliability, fault diagnosis, and wavelet analysis. He is currently a Professor with the School of Mechanical Engineering, Xi'an Jiaotong University. His research interests include machine learning, pattern recognition, condition monitoring and fault diagnosis, and automatic control.
QINGQIANG WU received the B.S. degree from Xi'an Jiaotong University, Xi'an, China, in 2013, where he is currently pursuing the Ph.D. degree with the State Key Laboratory for Manufacturing System Engineering.
His research interests include deep learning, computer vision, rehabilitation robots, and signal processing.

VOLUME 8, 2020