A Novel Unsupervised Learning Method Based on Cross-Normalization for Machinery Fault Diagnosis

Sparse representation is an important principle of unsupervised learning methods. To accurately identify the fault condition of machines, the desired feature distribution should show both population sparsity and lifetime sparsity. In this paper, to improve the accuracy and robustness of classification, a novel fault diagnosis method named Cross-sparse Filtering (Cr-SF) is proposed based on the cross l1/2-norms of the feature matrix, which represent the population sparsity and lifetime sparsity terms. After the weight training process, a novel nonlinear activation function is used for feature extraction in the test process. Cr-SF can learn discriminative features from raw data and accurately identify the fault condition. Rolling bearing fault and gearbox fault datasets are employed to validate the performance of the proposed method. The verification results confirm that Cr-SF is an effective tool for handling big data. The robustness and accuracy of the classification results using Cr-SF are comparable to those of convolutional networks, with a much faster training process.


I. INTRODUCTION
Rotating machinery is an important component of modern mechanical systems. As the main components of rotating machinery, bearings and gears are prone to failure during runtime, which may reduce working efficiency and even cause accidents and disasters [1], [2]. To ensure the safe and effective operation of machines and avoid safety accidents and economic losses, intelligent fault diagnosis of rotating machinery has attracted wide attention [3], [4]. Deep learning is an effective tool for learning discriminative information, which can reduce the dependence on human labor and prior knowledge. Therefore, it has become a promising diagnosis technology for the era of big data [5], [6].
In recent years, many deep learning methods have been developed for machinery fault diagnosis [7]. In [8], a stacked denoising autoencoder was first employed for rotating machinery diagnosis. Jia et al. proposed a local layer-based autoencoder with normalized weights for rolling bearing fault diagnosis, in which the weight vectors are visually interpreted and studied in the time and frequency domains [9]. The convolutional neural network (CNN), a popular and effective deep learning method, has attracted much research due to its stable feature extraction performance and strong noise adaptability [10]-[13]. An et al. proposed a rolling bearing fault diagnosis method using a feedback-mechanism convolutional neural network with sparse representation [14]. Sparse coding is used to sparsely express the deep learning model and remove redundant information. However, the limitation of CNN-based methods for real applications is that the training process is very time consuming. It should be noticed that these methods have many parameters to be weighted, which is also a time-consuming and priori-based process.
In the application of intelligent fault diagnosis, the key is to distinguish different samples and identify fault conditions accurately. Unsupervised learning, which is simple and efficient because it focuses on the feature distribution rather than an accurate expression of the raw data, can therefore make diagnosis simpler and more intelligent. In [6], sparse filtering and softmax regression are employed to construct a two-layer network structure for rotating machinery fault diagnosis; the extracted features are used to train the softmax regression and identify the health condition. Sparse representation is the core principle of unsupervised learning methods; sparse filtering uses row normalization and the column l1/2-norm of the normalized features to achieve a sparse distribution of features [15]. Zhang et al. discussed population and lifetime sparsity with generalized normalization and developed general normalized sparse filtering (GNSF) for fault diagnosis [16]. To address the deficiency that sparse filtering (SF) fails to consider the local structure of input samples, Zhang et al. proposed a novel method combining SF with a local structural regularization formulated to preserve the local structure of the input samples [17]. In addition, sparse filtering can also be applied to weak signature detection [18]. Jia et al. proposed a blind deconvolution method using convolutional sparse filtering and developed its general normalized 1-dimensional sparse filtering for impulsive signature enhancement [19].
Usually, the number of segments needs to be set in training, which changes the dimension of the training matrix and improves the feature learning ability of the algorithm. In the testing process, the weight vectors have already been trained, so the function of the segment number (N_s2) is to average the test features of each sample. However, both N_s1 and N_s2 are set to the same number in most studies.
In this paper, a novel fault diagnosis method is proposed based on the cross l1/2-norms of the feature matrix to improve the robustness and accuracy of the diagnosis model. A novel nonlinear activation function is introduced for feature extraction after the training process. The proposed method can learn discriminative information from original data and accurately identify the health condition. The proposed method is employed to classify the health condition of a bearing fault dataset and a gearbox fault dataset. The contributions of this paper can be summarized as follows. First, a novel unsupervised learning method using cross-normalization is developed for machinery intelligent fault diagnosis, which shows higher sensitivity to features and stronger feature extraction performance. Second, a novel nonlinear activation function is introduced to improve the accuracy and robustness of the diagnosis model. Third, this paper studies the influence of N_s1 and N_s2 on the performance of the algorithm.
The rest of this paper is organized as follows. In Section II, standard sparse filtering and feature distribution are briefly introduced. The proposed method is detailed in Section III. The application of the proposed method and the experimental results are discussed in Section IV. Finally, conclusions are given in Section V.

II. STANDARD SPARSE FILTERING AND FEATURE DISTRIBUTION
In this section, the theory of standard sparse filtering and the relationship between feature distribution and normalization are briefly introduced.

A. STANDARD SPARSE FILTERING
As shown in Fig. 1, sparse filtering can be regarded as a two-layer network [20], [21]. The feature matrix f ∈ R^{L×M} is obtained using f = Wx, where {x^i}_{i=1}^{M} are the training samples, W ∈ R^{L×N} is the weight matrix, M is the number of samples, and f_j^i denotes the jth feature of the ith sample. First, the normalized features are obtained using l2-normalization, as shown in (1) and (2): each row (feature) is normalized by its l2-norm, f̃_j = f_j / ||f_j||_2, and then each column (sample) is normalized, f̂^i = f̃^i / ||f̃^i||_2.
Then, the l1-norm of the normalized features f̂^i is used as the objective function for sparsity optimization, which can be written as

minimize_W  Σ_{i=1}^{M} || f̂^i ||_1
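The two normalizations and the l1 penalty above can be sketched numerically. This is a minimal illustration of the standard sparse filtering cost, not the authors' implementation; the soft-absolute smoothing constant follows the ε used later in this paper.

```python
import numpy as np

EPS = 1e-8  # soft-absolute smoothing constant (matches the paper's epsilon)

def sparse_filtering_objective(W, X):
    """Standard sparse filtering cost for weights W (L x N) and data X (N x M).

    Sketch only: soft-absolute features are l2-normalized across rows
    (per feature, Eq. (1)) and then across columns (per sample, Eq. (2));
    the cost is the l1-norm of the doubly normalized feature matrix.
    """
    F = W @ X                                           # linear features, L x M
    F = np.sqrt(F ** 2 + EPS)                           # soft absolute value
    F = F / np.linalg.norm(F, axis=1, keepdims=True)    # normalize rows
    F = F / np.linalg.norm(F, axis=0, keepdims=True)    # normalize columns
    return np.abs(F).sum()                              # l1 sparsity penalty
```

Minimizing this quantity over W (e.g. with L-BFGS) drives each sample's normalized feature vector toward a sparse point on the unit l2 sphere.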

B. FEATURE DISTRIBUTION
In the field of intelligent fault diagnosis, the discrimination of features between different fault conditions is important for classification accuracy. Competition exists among all features because of the l2-norm constraint across rows. When one feature increases, the others must decrease to ensure that all features lie on their l2-norm sphere. This is actually a cross-competition, which makes the feature matrix sparse in columns and discriminative in rows. Finally, the final features of samples with similar faults are uniformly distributed by averaging the local features of segments. However, one obvious weakness of sparse filtering is that it requires many samples for training; it is difficult to extract features from a single fault signal. Meanwhile, when the number of training samples or categories is small, the performance of the algorithm deteriorates. In this paper, we propose another sparsity optimization method, which also achieves cross-competition between rows and columns. The robustness of feature learning on a single fault signal is enhanced, and higher accuracy can be guaranteed with fewer training samples.

III. PROPOSED METHOD
The proposed method using cross-normalization is detailed in this section, and the proposed algorithm is used to construct a diagnosis method for the rotating machinery fault diagnosis.

A. PROPOSED METHOD WITH CROSS-NORMALIZATION
There are two terms in the objective function of the proposed method: the l1/2-norms of the rows and the l1/2-norms of the columns of the feature matrix, which differs from the objective function of sparse filtering. These two terms are optimized simultaneously to realize the cross-optimization of population sparsity and lifetime sparsity. Specifically, supposing that N_in and N_out are the input and output dimensions of the proposed method, the input signal x ∈ R^N is transformed into a segment matrix S ∈ R^{N_in×M} using random segmentation, where M = N_s1 × m, N_s1 is the number of segments in the training process, and m is the number of training samples.
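The random segmentation step can be sketched as follows. This is a plausible reading of the construction (segments drawn at random start positions), not the authors' exact procedure; `random_segment` is an illustrative name.

```python
import numpy as np

def random_segment(signal, n_in, n_s1, rng=None):
    """Slice one raw signal into n_s1 random segments of length n_in.

    Returns an (n_in x n_s1) matrix; stacking the matrices of m samples
    column-wise gives the training matrix S of size n_in x (n_s1 * m).
    """
    rng = np.random.default_rng(rng)
    starts = rng.integers(0, len(signal) - n_in + 1, size=n_s1)
    return np.stack([signal[s:s + n_in] for s in starts], axis=1)
```

For example, with the settings used later in the paper (segments of 100 points, N_s1 = 100), one raw sample yields a 100 x 100 block of the training matrix S.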
The linear expression of the training matrix and the weights, F = WS, is used to obtain the features, where W ∈ R^{N_out×N_in} is the weight matrix. Then, each row and each column of the feature matrix are normalized using their l2-norms, respectively, as shown in (4) and (5). To eliminate the influence of redundancy in the optimization process, the weight vectors are constrained to unit vectors. The final objective function of the proposed method is written as

L(W) = || F̂_r ||_{1/2} + λ || F̂_c ||_{1/2},

where F̂_r and F̂_c denote the row-normalized and column-normalized feature matrices and λ ≥ 0 controls the tradeoff between the two terms. L is nonsmooth and nonconvex; therefore, |F| is replaced by the soft-absolute function sqrt(F^2 + ε) with ε = 1×10^{-8}. The L-BFGS algorithm is then used to optimize the objective function, with the gradient written in terms of a matrix of all ones o ∈ R^{N_in×M}.
It should be noted that λ = 0 is a special case, in which the sparsity across samples is not optimized and the algorithm only ensures a sparse distribution of features within each sample.
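A minimal numeric sketch of the Cr-SF cost and its L-BFGS optimization follows. The exact l1/2 measure (here a sum of square roots of the soft-absolute features) and the placement of λ are our assumptions from the description above; `crsf_objective` and `train_crsf` are illustrative names, and a real implementation would supply the analytic gradient rather than rely on SciPy's finite differences.

```python
import numpy as np
from scipy.optimize import minimize

EPS = 1e-8  # soft-absolute smoothing constant from the paper

def crsf_objective(w_flat, S, n_out, lam):
    """Cross-normalized sparsity cost (a sketch of the Cr-SF idea).

    Rows and columns of F = W S are l2-normalized separately, and
    l1/2-style measures of both normalized matrices are summed, with lam
    trading off the two terms.
    """
    n_in = S.shape[0]
    W = w_flat.reshape(n_out, n_in)
    W = W / np.linalg.norm(W, axis=1, keepdims=True)   # unit weight vectors
    F = np.sqrt((W @ S) ** 2 + EPS)                    # soft absolute value
    F_row = F / np.linalg.norm(F, axis=1, keepdims=True)
    F_col = F / np.linalg.norm(F, axis=0, keepdims=True)
    return np.sqrt(F_row).sum() + lam * np.sqrt(F_col).sum()

def train_crsf(S, n_out, lam=1.0, rng=0):
    """Optimize the cost with L-BFGS and return the trained weight matrix."""
    w0 = np.random.default_rng(rng).standard_normal(n_out * S.shape[0])
    res = minimize(crsf_objective, w0, args=(S, n_out, lam),
                   method="L-BFGS-B", options={"maxiter": 50})
    return res.x.reshape(n_out, S.shape[0])
```

Setting `lam=0` reproduces the special case above: only the within-sample (column) sparsity term is optimized.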

B. DIAGNOSIS MODEL USING THE PROPOSED METHOD
This section presents an intelligent diagnosis model based on the proposed algorithm. The flow chart of the diagnosis model is shown in Fig. 2 and the general procedure is summarized as follows.
Step 1: Training matrix construction. The original signals {x^i}_{i=1}^{m} are segmented into a training set S ∈ R^{N_in×M}.
Step 2: Feature activation using the linear expression and normalization following Eqs. (4) and (5).
Step 3: Weights training using the proposed algorithm.
Step 4: Feature extraction using the nonlinear activation function Log(1 + WS) and the trained weights. This activation function is different from that of the training process. The waveform of the activation function is shown in Fig. 3.

Step 5: Classifier training. In this step, the extracted features {f^i}_{i=1}^{m} and the corresponding labels {y^i}_{i=1}^{m} are fed into the softmax classifier for training, where y^i ∈ {1, 2, 3, . . . , k} is the label, and the weight parameters of the classifier are obtained. The hypothesis h_θ(f) outputs the probability p(y^i = j | f) for each class, where θ_1, θ_2, . . . , θ_k are the parameters of the softmax model. The objective function of softmax regression can be expressed as the regularized negative log-likelihood, where m is the number of training samples, 1{·} is the indicator function, and k is the number of categories.
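The softmax hypothesis and cost in Step 5 can be sketched as below. This is the standard formulation, not the authors' code; the weight-decay coefficient defaults to the 1E-5 used in the experiments, and the function names are illustrative.

```python
import numpy as np

def softmax_hypothesis(theta, f):
    """Class-membership probabilities p(y = j | f) for one feature vector.

    theta is a (k x d) parameter matrix (one row per class); the usual
    exp-normalization gives a probability for each of the k classes.
    """
    scores = theta @ f
    scores = scores - scores.max()      # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum()

def softmax_cost(theta, F, y, decay=1e-5):
    """Mean negative log-likelihood over m samples plus a weight-decay term."""
    m = F.shape[1]
    P = np.stack([softmax_hypothesis(theta, F[:, i]) for i in range(m)], axis=1)
    nll = -np.log(P[y, np.arange(m)] + 1e-12).mean()
    return nll + decay * (theta ** 2).sum() / 2
```

Minimizing `softmax_cost` over theta yields the classifier weights used in Step 7.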
Step 6: Feature activation of the test samples using the nonlinear activation function introduced in Step 4. The collected signal is segmented into a matrix S_2 ∈ R^{N_in×N_s2}.
Step 7: Diagnostic results. The extracted test features are fed into the trained classifier, which outputs the diagnostic results.
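Steps 6 and 7's front end can be sketched as follows. The absolute value inside the logarithm is our assumption to keep the argument positive (the paper writes the activation as Log(1 + WS)), and random segmentation at test time mirrors the training construction.

```python
import numpy as np

def extract_test_features(W, signal, n_in, n_s2, rng=None):
    """Test-time feature extraction (sketch of the front end of Steps 6-7).

    The collected signal is segmented into S2 (n_in x n_s2), passed through
    the nonlinear activation log(1 + |W S2|), and the local features are
    averaged over the n_s2 segments to give one feature vector per sample.
    """
    rng = np.random.default_rng(rng)
    starts = rng.integers(0, len(signal) - n_in + 1, size=n_s2)
    S2 = np.stack([signal[s:s + n_in] for s in starts], axis=1)
    local = np.log1p(np.abs(W @ S2))    # nonlinear activation
    return local.mean(axis=1)           # average over the n_s2 segments
```

The averaged vector is then fed to the trained softmax classifier to produce the diagnosis.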

IV. EXPERIMENTAL VERIFICATION

A. DATA DESCRIPTION
In this experiment, the rolling bearing fault dataset provided by the Case Western Reserve University (CWRU) Lab is used to validate the performance of the proposed method. The main components of the motor bearing test bench are the testing bearings, an induction electrical motor, and other components. There are ten fault conditions and 1000 samples in total for this study, as presented in Table 1.

B. DISCUSSION ON THE FEATURE LEARNING PERFORMANCE
The feature learning performance of the proposed algorithm on single fault signals is studied first. The roller fault condition is selected for this analysis. Each sample is randomly divided into 100 segments, each containing 100 data points. The extracted features and the corresponding sparsity measured by the l1/2-norm are shown in Fig. 4 and Fig. 5, respectively. The feature sparsity measured by the l1/2-norm is about 6 for the proposed method, versus about 9.4 for sparse filtering. The distribution of features extracted by the proposed method shows obvious sparsity within the sample, while the features extracted by sparse filtering show little difference. The spectra of the trained weight vectors are shown in Fig. 6. Compared with sparse filtering, the vectors trained by the proposed method that correspond to features with larger values show obvious frequency components. When the amplitude is small, the weights have no frequency component, which indicates a strong filtering performance. We can conclude that the proposed method shows stronger feature extraction performance and higher sensitivity to features.

C. CASE STUDY 1: ROLLING BEARING FAULT DIAGNOSIS
In this experiment, the fault dataset detailed in the previous section and displayed in Table 1 is used to validate the performance of the proposed method. We randomly select 10% of the samples as training data. The accuracy and robustness of the diagnostic results under different values of λ are analyzed first. Each original sample is randomly divided into 100 segments with a dimension of 100. The weight decay term of softmax is set to 1E-5. The output and input dimensions are both 100. To reduce randomness, the diagnostic results under each parameter setting are averaged over 20 repeated trials, and the standard deviations are represented using error bars (the computation platform is a PC with an Intel i7 CPU and 8 GB RAM). Fig. 7 presents the influence of the W-norm constraint, the activation function and λ. When λ > 10, the diagnostic accuracy is significantly reduced; when λ is small, the robustness and accuracy are better. When λ = 0, only column constraints exist in the objective function, and the accuracy of the proposed method can still reach 99.59%. However, the calculation time is slightly longer when λ = 0: 13.2 s, compared with about 12.8 s when λ ≠ 0. Through the comparison of the three cases, we can see that the W-norm constraint and the activation function significantly improve the diagnostic accuracy of the model. When λ is small, the W-norm improves the accuracy more obviously; by contrast, when λ is large, the effect of the activation function is more obvious. Fig. 7 also shows that λ has a wide suitable range that achieves sufficient accuracy, which indicates that λ is a very easy parameter to choose.

1) PROPERTY OF THE PROPOSED METHOD
From the implementation of this method, we can infer that the W-norm constraint mainly affects the feature extraction and optimization of the algorithm, while the activation function, which is not part of the training process, mainly affects the extracted features and their distinguishability. Therefore, to further explore the properties of the proposed algorithm, we analyze the weight vectors and the corresponding spectra. Fig. 8 shows the spectrum color map of the weight vectors obtained by different methods. The weight vectors show obvious high-frequency noise under constraints. In order to realize the sparse distribution between features, the proposed method generates high-frequency or low-frequency dominated weight vectors; these vectors have little practical significance in the process of feature extraction. Another advantage of the W-norm is that the frequency components of the weights are more complete, similar to sparse filtering, which guarantees that more complete information is learned. Without the W-norm, the energy gap between the frequency components of the weights is large, and the extracted features are not comprehensive. This also explains why the performance of the constrained case is better than that of the unconstrained case.

2) PERFORMANCE WITH DIFFERENT SEGMENT NUMBERS
In the fault diagnosis process, there are two steps in which the number of segments of the original sample must be set. These two numbers are usually treated as one in most papers, but in fact their functions are quite different. The first segment number is set in the training process; it changes the dimension of the training matrix and improves the feature learning performance of the algorithm. The second segment number is set in the test process; since the weight vectors have already been trained, it mainly determines how many local features are averaged per sample. The influence of N_s1 and N_s2 on the performance of the algorithm is studied in this experiment. Fig. 9 displays the influence of N_s1 on the diagnosis performance and the comparison with sparse filtering under various values of N_s2. When N_s1 increases with a small N_s2, the accuracy becomes higher and the corresponding standard deviations become continuously smaller. When N_s2 > 100, increasing N_s1 cannot improve the accuracy and only increases the training time. Even when N_s1 = 5, the accuracy is higher than 99% as long as N_s2 is greater than 50. On the contrary, even if N_s1 is set to 500, the accuracy is only 98% with N_s2 = 20. As can be seen from Fig. 10, for any value of N_s1, the accuracy becomes higher as N_s2 increases, and as long as N_s2 is greater than 50, the accuracy of the proposed method reaches above 99%. We can also see that, under the same parameter settings, the accuracy of the proposed method is significantly higher than that of sparse filtering; even when N_s1 and N_s2 are small, the proposed method can still achieve high accuracy. N_s1 changes the dimension of the training matrix and affects the calculation efficiency: as shown in Fig. 11, the computation time increases linearly with N_s1.
In summary, N_s2 is a very important parameter for classification accuracy, while N_s1 has a great influence on the calculation efficiency.
To ensure both accuracy and calculation efficiency based on the above analysis, N_s1 = 20 and N_s2 = 100 are selected for the comparison with existing algorithms: sparse filtering in [6], GNSF and PCA in [17], CNN in [13] and the proposed method. The dataset under four working conditions is used for the training samples. In this comparison, the mean diagnostic accuracy and computation time are the final measures. The comparison results are displayed in Table 2. Sparse filtering, GNSF with p = 3 and q = 2, and the proposed method share the same parameter settings.
The accuracy of the proposed method, which reaches 99.85%, is better than that of sparse filtering and its improved version GNSF under the same parameter settings, and is close to that of CNN (99.63%). However, CNN needs more training samples and its training process is time-consuming. The calculation efficiency of the proposed method is significantly higher.

D. CASE STUDY 2: PLANETARY GEARBOX FAULT DIAGNOSIS
The gear fault dataset, collected from the bearing seat at the drive end of a gearbox, is used to demonstrate the performance of the proposed diagnosis method in this experiment. As depicted in Fig. 12, the test devices mainly include a driving electrical motor, a planetary gearbox, a vibration sensor and a tachometer. The rotation speed of the driving motor is maintained at 1500 rpm. There are four health conditions in this experiment: normal condition (NC), worn fault (WF), broken tooth (BT) and a compound fault of WF and BT (CF). Each health condition has 100 samples, collected at a sampling frequency of 12.8 kHz. Therefore, this experiment contains 400 samples and is a four-class classification problem, as displayed in Table 3. It should be noticed that, compared with the CWRU data, the gearbox fault dataset was collected in a machining workshop, which means environmental noise exists in the collected signals. Figs. 13 and 14 show the diagnostic results using different methods. The accuracy gradually increases with N_s1. When N_s2 > 50, N_s1 has little effect on the accuracy; however, increasing N_s1 at this point significantly increases the training time. The standard deviation of the diagnostic results of sparse filtering is larger, which shows that its robustness is poor. As can be seen from Fig. 13, the accuracies of both the proposed method and sparse filtering increase with N_s2, but sparse filtering has much lower accuracy than the proposed method. The proposed method can still attain high accuracy with fewer segments. For example, when N_s1 = 20 and N_s2 = 50, the accuracy of the proposed method is above 98%, whereas the accuracy of sparse filtering is only 87%; the highest accuracy of sparse filtering is only 97%, achieved with N_s1 = 100 and N_s2 = 100. Table 4 presents the comparison results of various methods with N_s1 = 5 and N_s2 = 100.
Similar to the results of bearing fault diagnosis, the proposed method achieves 99.64% test accuracy with 1.73 s of training time. The calculation efficiency of sparse filtering and the proposed method is close, but the accuracy of the proposed method is significantly higher than that of standard sparse filtering and its variant. CNN achieves an accuracy of 99.13%, but it needs 95 s for training, which is significantly more time-consuming than the proposed method.

V. CONCLUSION
Based on cross-normalization, a novel unsupervised learning method is proposed for machinery fault diagnosis in this paper. A novel nonlinear activation function is introduced to improve robustness and accuracy. The proposed method is applied to the diagnosis of rolling bearing and gearbox fault datasets.
The proposed method shows higher sensitivity to features and stronger feature learning performance. The extracted features using the proposed method show better sparsity. The proposed feature learning algorithm and nonlinear activation function can improve the robustness and accuracy of the diagnosis model.
The research on the two segment numbers shows that a larger segment number in the test process can improve the accuracy, and a smaller segment number in the training process can significantly improve the calculation efficiency. Therefore, in practical applications, we can choose a small number of training segments and a large number of test segments to ensure both accuracy and training efficiency.
The comparison experiments indicate that the proposed method is a promising fault diagnosis technology that can significantly improve accuracy and training efficiency.