A Novel Intelligent Fault Diagnosis Method for Rolling Bearings Based on Compressed Sensing and Stacked Multi-Granularity Convolution Denoising Auto-Encoder

This paper investigates unsupervised automatic feature extraction from large amounts of unlabeled data for the fault diagnosis of rolling bearings in an automobile production line, where fault information is hard to identify from a single category of low-level features and the massive fault data are difficult to process. Unlike existing methods, which either combine compressive sensing with a single category of low-level features or extract features directly from raw data, we propose a novel intelligent fault diagnosis method for rolling bearings based on compressive sensing and a stacked multi-granularity convolution denoising auto-encoder network. The method uses a nonlinear projection to achieve compressed acquisition and addresses the lack of feature diversity by extracting diverse categories of high-level features. Moreover, a regularization method called ‘dropout’ is used to prevent overfitting during training. With the proposed method, the amount of measured data needed to retain the fault information is reduced, and classification accuracy is improved by extracting more robust features. Finally, the effectiveness of the proposed method is validated on data sets from rolling bearings in an automotive production line; the results show that it is superior to existing methods and achieves high diagnostic accuracy.


I. INTRODUCTION
Rotating machinery plays an important role in the modern automobile industry. As automotive production capacity grows, line stoppages caused by faults in rotating machinery lead to heavy economic losses and can even endanger the personal safety of workers. Hence, the condition monitoring of rotating machinery has attracted great attention [1]. According to statistics on historical machinery failures, almost 40% of the faults in rotating machinery come from rolling bearings, which suffer various faults because of harsh production conditions [2,3]. Consequently, reliable bearing fault diagnosis methods are meaningful and practical.
Vibration analysis is widely adopted in bearing fault diagnosis [4]. According to the literature, signal processing and intelligent diagnosis are the two main approaches that have proved effective [5,6]. Intelligent diagnosis methods mainly include two steps: feature extraction and fault recognition. Feature extraction is the more significant step; it aims to obtain representative characteristics from raw signals using signal processing methods such as spectral analysis, time-domain statistical analysis [7,8], transform-domain analysis [9,10], and entropy and adaptive decomposition [11][12][13]. Nevertheless, the extracted features may contain insensitive or redundant information. Dimension reduction strategies and feature selection methods, such as feature discriminant analysis [14] and principal component analysis, have therefore been presented to obtain sensitive characteristics; these choices affect computational efficiency as well as diagnosis results. For fault recognition, artificial intelligence methods such as the support vector machine (SVM) [15], random forest [16], artificial neural network (ANN), and k-nearest neighbor [17] are adopted to identify bearing faults. The general procedure is shown in Figure 1, where the left-hand side is the traditional method.
However, intelligent fault diagnosis methods still have two deficiencies. Firstly, traditional feature extraction methods rely heavily on diagnostic expertise and professional knowledge, and features must be extracted manually. These methods are also usually designed for one specific diagnosis problem and generalize poorly. Secondly, traditional artificial intelligence methods cannot effectively distinguish the essential differences in complex information from massive raw signals. High-dimensional signals are generally used to represent the information in a complex mechanical system, and vibration signals are commonly acquired according to the Shannon-Nyquist sampling theorem. In this traditional approach, multiple sensors operating over long periods at high sampling rates produce a large amount of data, which places high demands on transmission bandwidth, data storage, acquisition hardware, and subsequent processing [18]. Thus, how to extract features effectively from massive raw data and identify faults accurately is worth researching.
In this paper, we adopt compressive sensing (CS) to acquire the raw data, which is fundamentally different from Nyquist sampling. In recent years, CS has attracted considerable attention in areas such as the single-pixel camera [19], radar imaging [20], and the electrocardiogram [21]. CS reduces the amount of sampled data while retaining most of the useful information. To a certain extent, CS provides a new way of thinking about data acquisition and processing because of its low storage and computation requirements [22]. The general procedure of CS-based methods includes three parts: projection acquisition of the raw data, compression, and reconstruction. This procedure is shown in the middle of Figure 1.
Although they reduce the demands on data storage and computation, methods based on CS still rely on conventional techniques. Hinton et al. developed deep learning (DL) theory [23], which provides theoretical support for addressing the above difficulties. The fundamental idea of DL is to map data from the original space into a feature space by learning a nonlinear function with a multilayer neural network. Since its emergence, DL has been applied successfully to the automatic analysis of massive data in different fields, such as image recognition and speech recognition [24]. The development of DL has significantly reduced the dependence on expertise and on the manual selection of features in intelligent diagnosis [25]. This procedure is shown on the right-hand side of Figure 1. Common DL algorithms include the convolutional neural network (CNN), recurrent neural network (RNN), deep belief network (DBN), and auto-encoder (AE) [26]. Among these algorithms, the CNN is the most popular because of its sparse connections and weight sharing. However, it needs the backpropagation (BP) algorithm [27] to train the network on massive labeled data sets. Acquiring such data sets requires extensive resources, which restricts the application and development of the CNN. Under these circumstances, unsupervised learning [28], which can automatically extract features from unlabeled data, becomes a better choice. The auto-encoder (AE) is an unsupervised neural network. Nevertheless, an AE introduces too many network parameters because of the full connectivity between its layers [29][30][31][32]. In summary, existing DL methods have difficulty acquiring massive labeled data, and feature extraction also has many limitations, whereas unlabeled data can be obtained easily.
We therefore propose a novel fault diagnosis framework based on CS and a stacked multi-granularity convolution denoising auto-encoder (SMGCDAE). This framework provides a new solution for bearing fault classification. Firstly, CS reduces the amount of sampled data while retaining most of the useful information. Then, the multi-granularity convolution denoising auto-encoder (MGCDAE) combines an ensemble learning idea called the multi-granularity convolution kernel [35][36][37] with the denoising auto-encoder (DAE), which makes it possible to extract robust features from unlabeled data with fewer parameters and a lower computational cost. Because the kernel sizes vary, different features can be obtained: the convolution kernels provide different receptive fields, allowing them to capture diverse fault features. In addition, dropout [38] is used in the hidden layer of the auto-encoders to prevent overfitting by model averaging. In brief, the diversity of the features improves the generalization performance in fault diagnosis. Finally, we stack several MGCDAEs to form an SMGCDAE structure and use a pre-training method to train this network [39]. We summarize the main insights and contributions of this work as follows:
1. A novel bearing fault diagnosis framework that integrates CS with the SMGCDAE method is proposed. Compared with other techniques, this framework reduces environmental demands, transmission costs, and computation. The original vibration data are linearly mapped into a lower-dimensional space. The small amount of compressed signal contains most of the information and removes the dependence on diagnostic expertise and prior knowledge. In addition, the framework obtains more comprehensive key features by exploiting diverse characteristics and acquires robust features through unsupervised learning.
2. In a diagnosis case using a real data set from an automotive production line, the effects and selection of the key parameters of the proposed method are thoroughly studied. In addition, the experiments show the superiority of the proposed framework over traditional methods.
The remainder of this paper is organized as follows. Section 2 reviews the theory of CS and AE used in the proposed technique and presents the details of the SMGCDAE with dropout. Section 3 describes the proposed CS-SMGCDAE intelligent method for rotating machinery fault diagnosis. In Section 4, the performance of the proposed method is verified by experiments on data from an automotive production line. Finally, conclusions and future work are given in Section 5.

II. COMPRESSIVE SENSING
This section gives a brief introduction to CS [40], which is a special case of sparse representation. To a certain extent, CS can even be regarded as an extension of sparse representation. The basic idea of CS is that many real-world signals are sparse in some domain, e.g., under the Fourier Transform (FT), so that under certain conditions they can be reconstructed from fewer measurements. CS rests on two principles: one is the sparsity of the signals; the other is that the original signals can be measured in compressed form through their sparse representations. For the second principle, the measurement matrix must ensure minimal information loss, i.e., satisfy the Restricted Isometry Property (RIP). Briefly, we describe CS as follows.
Assume an unknown original signal $x \in \mathbb{R}^{N}$, which has $N$ data points. To allow these data points to produce a set of sparse components, for a given sparse transformation matrix $\Psi \in \mathbb{R}^{N \times N}$ with columns $\psi_i$, the mathematical definition of $x$ can be expressed as Equation (1):
$$x = \sum_{i=1}^{N} \psi_i s_i \quad (1)$$
or, more compactly,
$$x = \Psi s \quad (2)$$
where $s$ represents the sparse elements and is an $N \times 1$ column vector of coefficients. When the dictionary (sparse transformation) $\Psi$ is incoherent with the measurement matrix $\Phi$, the original signal can be reconstructed from the compressed measurements based on the theory of compressive sampling, which can be written as follows:
$$y = \Phi x = \Phi \Psi s = \Theta s \quad (3)$$
where $y$ is the compressed measurement, represented by an $M \times 1$ column vector, and $\Theta = \Phi \Psi$ is the measurement matrix, which must ensure minimal information loss, i.e., satisfy the Restricted Isometry Property (RIP) [41].
Definition 1.1: The measurement matrix $\Theta$ satisfies the Restricted Isometry Property (RIP) if there is a parameter $\delta \in (0,1)$ such that, for every sparse vector $s$,
$$(1-\delta)\|s\|_2^2 \le \|\Theta s\|_2^2 \le (1+\delta)\|s\|_2^2 \quad (4)$$
The size of the measurement matrix is $M \times N$, which depends on the compressive sampling rate ($M/N$), and the length of $y$ is significantly smaller than that required by Nyquist-rate sampling ($M \ll N$). Figure 2 shows the compressive sampling framework.
To some extent, the compressed data can retain most of the raw signal information if the measurement matrix satisfies the RIP. It was proved in [42] that the random Gaussian matrix satisfies the RIP with good universality. Hence, a random Gaussian matrix is employed as the measurement matrix to obtain the compressed data.
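The compressed acquisition described in this section can be illustrated with a short NumPy sketch. It is only a sketch under stated assumptions: the signal length, sparsity level, number of measurements, and the 1/sqrt(M) scaling of the Gaussian matrix are hypothetical choices, not values from this paper. It draws a random Gaussian matrix, compresses random sparse coefficient vectors, and checks empirically that the squared norm is roughly preserved, which is the behaviour that the RIP formalizes.

```python
import numpy as np

def gaussian_measurement_matrix(m, n, rng):
    """Random Gaussian matrix with entries N(0, 1/m) (the scaling is an assumption)."""
    return rng.standard_normal((m, n)) / np.sqrt(m)

def random_sparse_vector(n, k, rng):
    """Vector of length n with k non-zero Gaussian entries at random positions."""
    s = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)
    s[support] = rng.standard_normal(k)
    return s

def norm_ratios(theta, k, trials, rng):
    """Ratios ||theta s||^2 / ||s||^2 over random k-sparse vectors s."""
    n = theta.shape[1]
    ratios = []
    for _ in range(trials):
        s = random_sparse_vector(n, k, rng)
        ratios.append(np.sum((theta @ s) ** 2) / np.sum(s ** 2))
    return np.array(ratios)

rng = np.random.default_rng(0)
N, M, K = 1024, 256, 10          # hypothetical sizes, with M << N
theta = gaussian_measurement_matrix(M, N, rng)
ratios = norm_ratios(theta, K, trials=200, rng=rng)
print(f"min ratio = {ratios.min():.3f}, max ratio = {ratios.max():.3f}")
```

Ratios that stay close to 1 indicate that the compressed measurements keep most of the energy of the sparse representation; this empirical check is not a proof that the RIP holds.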

A. DEEP NEURAL NETWORK AND AUTO-ENCODERS
The deep neural network (DNN) is a deep architecture developed from deep learning. With such a network, complex nonlinear functions can be approximated and the representative information in the compressed measurements can be captured with small errors. In addition, a DNN can amplify the differences in the explanatory information contained in the original data and suppress irrelevant parts that cause interference, and can thus distinguish different fault classes.
An auto-encoder is a widely used unsupervised neural network with three layers. Its training target is to reconstruct the input data at the output via backpropagation [43]. As depicted in Figure 3, an auto-encoder consists of an encoder part and a decoder part, like many unsupervised feature learning methods. The encoder network transforms the high-dimensional input data into low-dimensional codes, which serve as the feature vectors. The decoder network reconstructs the inputs from these feature vectors.
The encoder network can be defined as a feature extraction function $f_\theta$. For each measured signal $x_i$, it computes a feature vector $h_i$, as shown in Equation (5):
$$h_i = f_\theta(x_i) \quad (5)$$
where $h_i$ is the feature representation obtained from $x_i$. The decoder network can be denoted by a recovery function $g_{\theta'}$, which transforms the feature space $h_i$ back into the input space, producing a reconstruction $\hat{x}_i$:
$$\hat{x}_i = g_{\theta'}(h_i) \quad (6)$$
The parameter sets of the auto-encoder are learned simultaneously so that $\hat{x}_i$ is similar to $x_i$, attaining the lowest possible reconstruction error $L(x_i, \hat{x}_i)$, where the loss function $L$ measures the discrepancy between $x_i$ and $\hat{x}_i$. Hence, we can obtain the following equation:
$$\{\theta, \theta'\} = \arg\min_{\theta, \theta'} \sum_{i} L\big(x_i, g_{\theta'}(f_\theta(x_i))\big) \quad (7)$$
In fact, the most commonly used form for the auto-encoder is an affine mapping, which preserves collinearity, followed by a nonlinearity [44]:
$$f_\theta(x) = s_f(Wx + b) \quad (8)$$
$$g_{\theta'}(h) = s_g(W'h + b') \quad (9)$$
where $s_f$ and $s_g$ are the activation functions of the encoder and decoder, respectively, e.g., hyperbolic tangent and sigmoid; $b$ and $b'$ are bias vectors, and $W$ and $W'$ are the weight matrices.
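The encoder and decoder mappings can be sketched as a single forward pass in NumPy. This is a minimal illustration, not the network used in the paper: the layer sizes, the random untrained weights, and the mean squared error as the loss are all assumptions, with tanh and sigmoid chosen because the text names hyperbolic and sigmoid activations as examples.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, W, b):
    """Feature extraction: affine mapping followed by tanh."""
    return np.tanh(W @ x + b)

def decode(h, W_dec, b_dec):
    """Reconstruction: affine mapping followed by sigmoid."""
    return sigmoid(W_dec @ h + b_dec)

def reconstruction_loss(x, x_hat):
    """Mean squared error, one possible choice of reconstruction loss."""
    return float(np.mean((x - x_hat) ** 2))

rng = np.random.default_rng(0)
n_in, n_hidden = 32, 8                       # hypothetical layer sizes
W = 0.1 * rng.standard_normal((n_hidden, n_in))
b = np.zeros(n_hidden)
W_dec = 0.1 * rng.standard_normal((n_in, n_hidden))
b_dec = np.zeros(n_in)

x = rng.random(n_in)                         # toy input scaled to [0, 1)
h = encode(x, W, b)
x_hat = decode(h, W_dec, b_dec)
print(h.shape, x_hat.shape, reconstruction_loss(x, x_hat))
```

Training would adjust the weights and biases to reduce this loss over all samples; the sketch only shows how a feature vector and a reconstruction are computed for one input.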

[Figure 3: structure of an auto-encoder, showing the input, hidden layer, and output, with the encoder and decoder parts.]
Although the original input data can be reconstructed perfectly from the learned feature representation, the generalization performance of the model is not good.

B. DROPOUT
Dropout is a technique used to prevent overfitting in fully connected layers. During each training iteration, the network randomly removes hidden units in each layer with a certain probability, so that each hidden unit must adapt its state without relying on other hidden units. In this study, dropout is applied to avoid extracting the same feature repeatedly and to prevent complex co-adaptations on the training data.

C. SINGLE MULTI-GRANULARITY CONVOLUTION DENOISING AUTO-ENCODER WITH DROPOUT
In real implementations, feature learning is susceptible to interference from complicated factors. For instance, instrumentation errors and inaccurate data collection can cause data deviation. Consequently, it is highly necessary to capture more information in the latent high-level feature representation. To learn high-level features effectively, we adopt multi-granularity convolution kernels. Under this concept, each convolution layer contains convolution kernels of varying sizes, with each kernel corresponding to a unique feature. This structure integrates high-level features to present more comprehensive information through various mappings.
In this paper, the proposed MGCDAE pipeline contains convolution kernels of three sizes: 1×1, 3×3, and 5×5. As the number of kernels increases, the amount of computation and the required runtime increase. Therefore, the 1×1 kernel is mainly applied to alleviate the computational bottleneck, reduce the number of network parameters, and decrease dimensionality. The first part of the multi-granularity structure is a 1×1 convolution layer. To keep the local connections of each pipeline as sparse as possible, another 1×1 convolution layer is also constructed (see Figure 4). To optimize the network, the convolution kernels are trained by a DAE with dropout in the training stage of the proposed network. The DAE, first introduced in 2008, is an unsupervised approach for extracting robust features. In this study, we first contaminate the original data to obtain noisy data and then extract robust features from it, which ensures stability and improves generalization performance. Here, random Gaussian noise is chosen to corrupt the raw data.
In the encoding process shown in Figure 4, Gaussian noise is randomly added to the original input vector $x$ to obtain a corrupted input vector $\tilde{x}$, which is passed through a linear mapping followed by a nonlinear activation function. In this way, $\tilde{x}$ is mapped to a latent vector representation $h$ by the function $f$.
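The corruption step can be sketched as follows. The paper does not define how a noise proportion such as 20% is measured, so this sketch assumes that it is the ratio between the standard deviation of the noise and that of the input; the function name and the toy sine input are likewise hypothetical.

```python
import numpy as np

def corrupt_with_gaussian_noise(x, noise_level, rng):
    """Return x plus zero-mean Gaussian noise.

    Assumption: noise_level is the ratio between the noise standard deviation
    and the standard deviation of x (e.g. 0.2 for "20% noise").
    """
    sigma = noise_level * np.std(x)
    return x + rng.normal(0.0, sigma, size=x.shape)

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0.0, 8.0 * np.pi, 512))   # toy stand-in for a vibration sample
x_tilde = corrupt_with_gaussian_noise(x, 0.2, rng)
print(np.std(x_tilde - x) / np.std(x))           # empirical noise-to-signal ratio
```

The denoising auto-encoder is then trained to reconstruct the clean input from the corrupted one, which is what pushes it towards robust features.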
In brief, we exploit the variety of convolution kernels to obtain different high-level representations. Because the features extracted from the individual pipelines have the same dimensions, they can be integrated using a weighted average. In other words, we put forward a feature fusion method based on matching the dimensions of the convolution layers. This method is beneficial for improving the generalization performance of the network.
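The fusion of branches with different kernel sizes can be illustrated with a simplified one-dimensional sketch. Several points are assumptions made only for illustration: the paper describes 1×1, 3×3, and 5×5 kernels, whereas this sketch uses 1-D kernels of length 1, 3, and 5 on a 1-D signal; the kernels are random rather than trained; and equal fusion weights are used by default.

```python
import numpy as np

def multi_granularity_features(x, kernels, weights=None):
    """Convolve x with kernels of different sizes and fuse by weighted average.

    'same' padding keeps every branch at the input length, so the branch
    outputs have matching dimensions and can be averaged element-wise.
    """
    branches = np.stack([np.convolve(x, k, mode="same") for k in kernels])
    if weights is None:
        weights = np.full(len(kernels), 1.0 / len(kernels))
    weights = np.asarray(weights, dtype=float)
    return np.tensordot(weights / weights.sum(), branches, axes=1)

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
kernels = [rng.standard_normal(size) for size in (1, 3, 5)]  # untrained, random kernels
fused = multi_granularity_features(x, kernels)
print(fused.shape)
```

Keeping every branch at the same output length is what makes the weighted average well defined; larger kernels see a wider neighbourhood of the signal, so each branch contributes a different view of the same input.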
After that, to optimize the training process, the dropout technique is applied to the network, which prevents overfitting in the fully connected layers. Technically, dropout is realized by randomly omitting neural units in the hidden layers with a probability $p$. We then obtain a dropped representation $\tilde{h}$ by an element-wise product with a masking vector $m$:
$$\tilde{h} = m \odot h \quad (11)$$
Because the neurons in the hidden layer are dropped randomly and the network is updated iteratively, a unique network is trained in each iteration. This operation greatly improves the subsequent classification performance.
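The masking operation can be sketched in a few lines of NumPy. The function name and the toy hidden vector are hypothetical, and the sketch applies only the plain mask described in the text; the rescaling of kept units that many implementations add is deliberately left out.

```python
import numpy as np

def dropout(h, p, rng):
    """Zero each entry of h independently with probability p; return result and mask."""
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    return mask * h, mask

rng = np.random.default_rng(0)
h = rng.standard_normal(1000)
h_tilde, mask = dropout(h, p=0.2, rng=rng)
print(f"fraction of dropped units: {1.0 - mask.mean():.3f}")
```

A fresh mask is drawn at every training iteration, so each iteration effectively trains a different thinned network.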

[Figure 5: stacked structure, MGCDAE1 through MGCDAEn followed by a softmax classifier.]
Inspired by [45,46], we exploit the remarkable abstraction ability of deep neural networks. Hence, multiple MGCDAEs are stacked into a deep neural network. The BP algorithm is used to train the first auto-encoder, MGCDAE1. The output of encoder 1 then becomes the input of the next auto-encoder, MGCDAE2. This process is shown in Figure 5.
Finally, a deep stacked MGCDAE (SMGCDAE) is formed from the MGCDAEs. We can calculate the latent feature representation $h_n$:
$$h_n = f(W_1 * h_{n-1} + b_1) \quad (14)$$
where $W_1$ is the weight matrix and $b_1$ is the bias vector. The aim of the SMGCDAE network is to improve the nonlinear mapping capabilities of the MGCDAE. We can obtain and fuse the high-level features by abstracting the initial feature layers. At last, the features are fed into the classifier to complete the final classification.
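The stacking idea, in which the output of one encoder becomes the input of the next, can be sketched as a chain of encoder layers. This is a simplified illustration under assumptions: dense matrix products stand in for the convolutional encoders, tanh is used as the activation, the weights are random and untrained, and the layer sizes are hypothetical.

```python
import numpy as np

def stacked_encode(x, layers):
    """Pass x through a list of (W, b) encoder layers and collect each output."""
    h = x
    features = []
    for W, b in layers:
        h = np.tanh(W @ h + b)   # output of this encoder feeds the next one
        features.append(h)
    return features

rng = np.random.default_rng(0)
sizes = [64, 32, 16, 8]     # hypothetical input size followed by three hidden sizes
layers = [(0.1 * rng.standard_normal((n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
features = stacked_encode(rng.standard_normal(sizes[0]), layers)
print([f.shape for f in features])
```

In a greedy layer-wise scheme each pair of weights would first be pre-trained as its own auto-encoder before the whole stack is fine-tuned; the sketch shows only the resulting forward pass.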

IV. SOFTMAX CLASSIFIER
In this study, we use the softmax classifier [47,48] to classify the fault types as follows:
$$P(y_i = j \mid x_i; \theta) = \frac{\exp(\theta_j^{T} x_i)}{\sum_{l=1}^{k} \exp(\theta_l^{T} x_i)}$$
where $k$ is the number of data categories and $\theta$ is the weight applied to the sample $x_i$. To minimize the reconstruction error, a cost function is also needed during the training of this deep network, so that the reconstructed data are as similar as possible to the original data.
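A softmax classifier turns a score for each fault category into a probability. The sketch below is illustrative only: the feature dimension, the random weights, and the function names are assumptions, and subtracting the maximum score before exponentiation is a common numerical safeguard rather than something stated in the paper.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

def predict(features, theta):
    """Class probabilities and predicted labels for features of shape (n, d)."""
    probs = softmax(features @ theta.T)      # theta has shape (k, d)
    return probs, np.argmax(probs, axis=1)

rng = np.random.default_rng(0)
k, d = 7, 8                                  # seven bearing conditions; d is hypothetical
theta = rng.standard_normal((k, d))
features = rng.standard_normal((5, d))
probs, labels = predict(features, theta)
print(probs.sum(axis=1), labels)
```

Each row of the probability matrix sums to one, and the predicted fault type is the category with the largest probability.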

V. PROPOSED FAULT DIAGNOSIS METHOD
Considering the restrictions of traditional approaches and the difficulty of processing massive raw data in bearing fault diagnosis, this paper first adopts a data acquisition method that realizes fault signal acquisition by transform-domain projection in the CS domain. Moreover, an SMGCDAE deep learning algorithm is constructed to realize intelligent diagnosis. The procedure of the proposed framework is shown in Figure 6. Firstly, the compressed data are acquired at a specific compression ratio. This yields a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $n$ is the total number of samples, $x_i$ is the i-th compressed sample, and $y_i$ is the label corresponding to $x_i$. The dataset is divided into two parts, the training set $D_{train} = \{(x_i, y_i)\}_{i=1}^{n_{train}}$ and the testing set $D_{test} = \{(x_i, y_i)\}_{i=1}^{n_{test}}$. The former is used to train the constructed SMGCDAE network with a greedy training method that includes two main processes: one is to initialize the weights of the network by pre-training the MGCDAEs; the other is to improve the performance of the network by fine-tuning it with the BP algorithm. The testing set is used to validate the performance of the proposed diagnosis network.

A. DATASET DESCRIPTION
This subsection aims to verify the superiority of the proposed method through the fault diagnosis of a 'scissor' lifter located in the assembly production line of an automotive company, as shown in Figure 7. In an automotive production line, the 'scissor' lifter is a stable and widely used type of car lifting equipment, mainly used to transport cars between sections of the production line at different heights. Hence, a breakdown of this important equipment causes huge losses. The most easily damaged part of the equipment is the rolling bearing on the rotating spindle, so the vibration signal of the bearings is extracted and analyzed. Seven rolling bearing conditions were included in the test: normal (NOR), outer race fault (stripping with a size of 40 mm × 3 mm, ORF1), outer race fault (pitting, ORF2), inner race fault (stripping with a size of 40 mm × 3 mm, IRF1), inner race fault (pitting, IRF2), rolling element fault (REF), and lubrication shortage fault (LSF), as illustrated in Figure 8. Each condition has 150 samples of length 4800, which together are denoted as set $X$.
Then, the compressed data under different compression ratios (CRs) can be obtained with the measurement matrix according to compressed acquisition theory. For example, given CR = 70%, a 1440×4800 random Gaussian matrix is generated and used to obtain the compressed sample $x'$. The raw dataset and the compressed dataset are denoted as $X$ and $X'$ in the subsequent processing. Table 1 details the bearing dataset information. As seen in Figure 9, the conditions can hardly be distinguished from the time-domain and frequency-domain waveforms of the 'scissor' lifter because of the complex test conditions.
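The relation between the CR and the size of the measurement matrix can be checked with a short sketch. The definition CR = (N - M) / N is inferred from the 1440×4800 example at CR = 70% and is not stated explicitly in the text; the function names and the synthetic data standing in for the bearing samples are hypothetical.

```python
import numpy as np

def measurement_rows(signal_length, cr):
    """Number of compressed points M, assuming CR = (N - M) / N."""
    return int(round(signal_length * (1.0 - cr)))

def compress(samples, cr, rng):
    """Project each row of samples (n_samples, N) with one shared Gaussian matrix."""
    n = samples.shape[1]
    phi = rng.standard_normal((measurement_rows(n, cr), n))
    return samples @ phi.T

rng = np.random.default_rng(0)
raw = rng.standard_normal((150, 4800))       # synthetic stand-in for one condition
compressed = compress(raw, 0.70, rng)
print(measurement_rows(4800, 0.70), compressed.shape)
```

With a signal length of 4800 and CR = 70%, this gives 1440 compressed points per sample, which matches the matrix size quoted above.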

B. EFFECT OF COMPRESSION RATIO(CR)
The CR is related to the length of the original signal and the size of the measurement matrix. The number of sampling points required by CS decreases as the CR increases. With a limited number of observations, complete information cannot be obtained from the raw data if the CR is too large; hence, the CR has an upper bound imposed by the RIP. Conversely, when the CR is less than 40%, a good compression effect is not achieved. Figure 10 shows the influence of the CR on diagnostic accuracy and computing time. Within the scope of our research, the computing time increases gradually as the CR decreases, whereas the accuracy remains very high and shows no clear positive or negative correlation with the CR. Based on these results and the influencing factors, a CR of 70% is selected. We conclude that a higher CR can be adopted if the computing time requirement is strict, although this could slightly reduce the accuracy. In addition, a higher CR lowers the requirements on communication and data storage.

C. COMPARISON
This section contains two phases of experimentation. Firstly, we investigate the differences in accuracy between the proposed method and its prototype. Then, to evaluate the effectiveness of our approach, we compare it with other existing classification approaches. The model parameters were set to conventional values taken from the literature: the number of filters is 96, the stride is 2, and the activation function of the neurons is the Leaky ReLU [49].
In this part, each method was run 10 times, and the results were averaged to obtain a general comparison. The average classification accuracy is shown in Table 2. To facilitate analysis, we divided the conditions into four datasets: the normal condition forms dataset1, ORF1 and ORF2 form dataset2, IRF1 and IRF2 form dataset3, and REF and LSF form dataset4. We initially focus on the influence of single-granularity versus multi-granularity convolution kernels on classification performance. We compare CAEs with three individual convolution kernel sizes against the multi-granularity convolution kernel, as shown in Table 2. In addition, 20% random Gaussian noise was added to the raw data during training to improve generalization performance, and the further comparison verifies the effectiveness of our approach. The results show that the accuracy of the MGCAE method was 90% on dataset1, 91% on dataset2, 87% on dataset3, and 85.5% on dataset4. These accuracies are higher than those of CAE(5×5), CAE(3×3), and CAE(1×1), which are inferior to the multi-granularity approach on all four datasets. The accuracy of the MGCDAE method was 91% on dataset1, 93% on dataset2, 90% on dataset3, and 87% on dataset4. Compared with the noise-free case, adding noise increased the accuracy by 1, 2, 3, and 1.5 percentage points, respectively. Hence, adding Gaussian noise yields robust features that improve classification accuracy. Figure 11 illustrates the effect of different noise levels on classification performance; the performance for different proportions of added Gaussian noise on the four datasets is shown in Figure 11 (a), (b), (c), and (d), respectively.
In these figures, a single MGCDAE is denoted by 'MGCDAE', and stacks of three, five, and seven MGCDAEs are denoted by 'stack-3', 'stack-5', and 'stack-7', respectively. The results show that the generalization performance of the model increases with the number of layers. Furthermore, compared with the noise-free condition (the proportion of added Gaussian noise is 0), adding noise during the training phase improves generalization performance. On dataset1, the accuracy of the model is highest when the proportion of added noise is 10%. On the other datasets, the proportion corresponding to the highest accuracy is different. Hence, the proportion of added noise has a certain impact on the prediction accuracy of the model.
To evaluate the feasibility and stability of our approach (MGCDAE-7), we compare it with other classification methods, namely random forests (RF), the convolutional neural network (CNN), the support vector machine (SVM), and the deep belief network (DBN). Again, 20% random Gaussian noise was added to the raw data. As shown in Table 3, the proposed SMGCDAE method achieves the highest average accuracies of 97%, 99%, 98%, and 93% on the four datasets, respectively, demonstrating superior performance across all of them. This performance benefits from the fact that the proposed method not only extracts robust features but also has a sparse network structure.

D. ANALYSIS OF DROPOUT
Finally, as a supplement, we investigate the effect of dropout on the performance of the proposed method. Here, the dropout rate is varied from 0 to 0.5 in steps of 0.1, 20% random Gaussian noise is added to the raw data, and the number of stacked MGCDAEs is 7. As shown in Figure 12, different dropout rates lead to different classification performance. The results show that the best classification performance is obtained when the dropout rate is close to 0.2; when the dropout rate exceeds 0.2, the performance decreases. This result indicates that an appropriate dropout rate is beneficial to the performance of the proposed method.

E. DISCUSSION
As the above experiments show, our approach has better generalization performance for general classification tasks. In these comparative experiments, we discussed the influence of several parameters on the experimental results, such as the size of the convolution kernel, whether noise is added, the proportion of added noise, the number of stacked MGCDAEs, and whether dropout is used. In short, adding a certain proportion of noise and stacking MGCDAEs with dropout can improve the accuracy of the model.
Although the results show that our approach achieves high-quality generalization performance, some issues remain. For example, it is not superior to other methods in computing time; in some cases, its computing time even exceeds that of the comparison methods. As the number of stacked MGCDAEs increases, the computing time could increase further, while the accuracy is unlikely to continue to grow. Furthermore, we use convolution kernels of three sizes to optimize the model, but how to automatically select the kernel sizes for different data types remains an open problem. The larger the convolution kernel, the greater the computation time and complexity.

VII. CONCLUSIONS AND FUTURE WORK
This paper proposed a novel intelligent bearing fault diagnosis method based on CS and an unsupervised feature extraction approach (SMGCDAE). The compressed data capture the discriminative information from which features can be extracted automatically. A CNN based on the DAE is then built to mine the useful information, and the fault classification is completed by a softmax classifier. In addition, the SMGCDAE improves on existing approaches by introducing the concept of multi-granularity convolution kernels and by using dropout to prevent overfitting. The case studies on bearing data sets demonstrated the robustness and effectiveness of this technique. The CS-SMGCDAE intelligent diagnosis method can obtain relatively high identification accuracy with a small amount of measured data. The proposed method provides a new idea for mechanical big data processing. In future work, we will further explore a general model and apply it to other datasets.