Research and Application of Regularized Sparse Filtering Model for Intelligent Fault Diagnosis Under Large Speed Fluctuation

The speed of mechanical rotating parts often fluctuates during the working process. Vibration signals collected under constant speed have a strong correlation with the corresponding fault types. However, the mapping relationship becomes complex under large speed fluctuation, which is an urgent research subject in intelligent fault diagnosis. As an effective unsupervised learning method, sparse filtering (SF) has been successfully used in intelligent fault diagnosis. However, the generalization capability of this method in dealing with large speed fluctuation remains poor. To overcome this deficiency, this study adds regularization to the loss function of SF to obtain regularized SF methods. The frequency-domain signals under large speed fluctuation are directly input to the regularized SF for feature extraction, and softmax regression is used as a classifier for fault type identification. Experimental results on gearbox and bearing datasets show that the L1/2 regularized sparse filtering (L1/2-SF) model can solve the problem of large speed fluctuation more effectively than other regularized SF models can.


I. INTRODUCTION
Rotating parts are important components of many machines used in spacecraft, high-speed rail, and automobiles. Failure of these parts directly affects the operation of machines and results in huge economic losses and even casualties [1]. Thus, precise diagnosis of the health condition of machines is necessary. Intelligent fault diagnosis has achieved remarkable success in recent years [2], [3]. Most of the methods used in fault diagnosis can achieve high accuracy under the general assumption that training and testing datasets are extracted at constant rotating speeds [4], [5]. However, the rotating speed of mechanical transmissions varies under different operating conditions and loads [6]. For example, the wheel bearings of high-speed rail are influenced by the running speed and start-stop of the train [7], the speed of the automobile gearbox is regulated by the road condition [8], the speed of the top drive transmission system in an oil drilling machine fluctuates considerably during the working process [9], and the speed of wind turbine blades varies with the wind condition [10]. In summary, speed fluctuation widely exists in many types of mechanical equipment.

(The associate editor coordinating the review of this manuscript and approving it for publication was Qiang Miao.)
Increased attention has been paid to research on intelligent fault diagnosis under large speed fluctuation. Time-frequency analysis technologies [11], such as short-time Fourier transform, Wigner-Ville distribution [12], and wavelet transform [13], have been widely used in such research. Given that the vibration signal of mechanical equipment is time-varying and strongly coupled in the time-frequency domain, obtaining a high time-frequency resolution representation of speed fluctuation signals is difficult with these methods. Several researchers [14]-[16] utilized order tracking to process time-varying signals and found this approach effective in eliminating the effects of speed fluctuation. However, all of the methods mentioned above require considerable human labor and are inconvenient.
Unsupervised feature learning can overcome these shortcomings. In essence, unsupervised learning can be regarded as learning a nonlinear function for transforming unlabeled raw data from the original space to a feature space. This technique has been widely applied in audio, video, image, and other fields [17], [18]. Sparse filtering (SF) [19] is an unsupervised two-layer network that automatically extracts features without label participation. This method needs to adjust only one parameter, thus demonstrating simplicity and easy implementation. Lei et al. [4] utilized SF in bearing fault diagnosis by adopting a two-stage learning method, which greatly reduced human labor and eased the handling of big data in intelligent fault diagnosis. Yang et al. [20] used SF to extract sparse features directly from raw time-domain signals and classified health status using traditional support vector machines based on improved particle swarm optimization. An et al. [21] solved the overfitting problem in bearing fault diagnosis by removing multi-correlation operations in the weight matrix of SF.
Although SF can directly learn deep discriminative features from the original vibration signal, its generalization capability in dealing with large speed fluctuation remains poor. In deep learning, many strategies have been explicitly designed to reduce generalization errors, and these strategies are collectively known as regularization [22]. In this study, we add a regularization strategy to the objective function of SF to constrain the sparsity of the weight matrix and improve the generalization capability of standard SF. In addition, the initial phase of the time-domain signal differs across samples, which increases the difficulty of feature extraction under large speed fluctuation. The frequency spectra, however, make the samples exhibit regularity [2], which eliminates this interference. The frequency spectra of the vibration signal are therefore adopted as the model input. In this paper, regularized SF models are proposed to deal with the frequency-domain signal under large speed fluctuation.
Therefore, the main contributions of this paper are summarized as follows.
(1) To deal with the large speed fluctuation problem in fault diagnosis, several regularization strategies are introduced into the penalty term of the sparse filtering objective function to improve the feature extraction ability of sparse filtering.
(2) We explore a physical interpretation of the robustness of L1/2 regularization [23] in dealing with non-stationary signals.
(3) Through a comparison of different regularized SF models, we show that the L1/2 regularized sparse filtering (L1/2-SF) model obtains a sparser solution and higher precision than the other models.
The remainder of this paper is organized as follows. In Section 2, SF and the regularization strategy in deep learning are introduced. In Section 3, the regularization strategy is applied to SF and the proposed framework is described. In Sections 4 and 5, gear and bearing datasets are investigated using the different regularized SF models, respectively. Section 6 compares the proposed model with other methods, and the conclusion is presented in Section 7.

II. THEORETICAL BACKGROUND
A. SPARSE FILTERING
Among the many unsupervised algorithms, SF, a simple and effective feature learning algorithm, requires the least parameter adjustment, and it can directly analyze the optimal feature distribution bypassing the estimation of data distribution. As shown in Fig. 1, the inputs and outputs of SF are the sample dataset and the learned features, respectively.
First, the original time-domain data collected from each health condition are divided into M samples, which form the training set $\{x^i\}_{i=1}^{M}$, $x^i \in \mathbb{R}^{N_{in} \times 1}$. Each sample is then mapped into a feature vector $f^i \in \mathbb{R}^{N_{out} \times 1}$ by the weight matrix $W \in \mathbb{R}^{N_{out} \times N_{in}}$. The rectified linear unit function [24] is used as the activation function, so the j-th feature of the i-th sample is calculated as

$f_j^i = \mathrm{ReLU}(w_j^T x^i) = \max(0, w_j^T x^i)$. (1)

The feature matrix is composed of the $f_j^i$, and the L2-norm is first used to normalize each row (feature) of the matrix:

$\tilde{f}_j = f_j / \|f_j\|_2$. (2)

Then, each column (sample) is normalized by its L2-norm, and the features are mapped onto the unit L2-sphere:

$\hat{f}^i = \tilde{f}^i / \|\tilde{f}^i\|_2$. (3)

Afterward, the L1 penalty is used to optimize the sparsity of the normalized features. The cost function of SF is given as

$\min_W \sum_{i=1}^{M} \|\hat{f}^i\|_1$. (4)
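As an illustration, the two-step normalization and L1 cost above can be sketched in a few lines of NumPy. This is a minimal sketch with hypothetical dimensions and random inputs, using the ReLU activation described above (a small epsilon is added to avoid division by zero, which is not part of the formal definition):

```python
import numpy as np

def sparse_filtering_cost(W, X, eps=1e-8):
    """Sparse filtering cost for weight matrix W (N_out x N_in)
    on a batch X (N_in x M): ReLU features, row normalization,
    column normalization, then the L1 penalty of Eq. (4)."""
    F = np.maximum(W @ X, 0)                             # Eq. (1): ReLU features
    F = F + eps                                          # guard against zero rows/columns
    F = F / np.linalg.norm(F, axis=1, keepdims=True)     # Eq. (2): normalize each row
    F = F / np.linalg.norm(F, axis=0, keepdims=True)     # Eq. (3): normalize each column
    return np.abs(F).sum()                               # Eq. (4): L1 penalty

# hypothetical dimensions: N_out = 16, N_in = 32, M = 100 samples
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32))
X = rng.standard_normal((32, 100))
cost = sparse_filtering_cost(W, X)
```

Because every normalized column has unit L2-norm, each column's L1-norm lies between 1 and the square root of N_out, which bounds the cost and is why minimizing it pushes features toward sparsity.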

B. L1/2 REGULARIZATION
Regularization is the process of reducing test errors. The ultimate goal of a deep learning model is to achieve satisfactory performance when dealing with new data. However, overfitting occurs when the model assumption is complex or the input dataset is small. Regularization terms are often introduced into the learning process to prevent overfitting and improve generalization performance [25]. Many regularization strategies are currently available; several of them add constraints to the learning model to constrain the value of the parameters or add terms to the objective function to soft-constrain the value of the parameters. When the regularization term is added to the original objective function, this is equivalent to imposing a constraint on the original solution space and then determining the minimum value under this constraint. In general, the regularization method can be expressed as

$\min_w \sum_{i=1}^{n} l(y_i, f(x_i; w)) + \lambda \|w\|_k^k$, (5)

where $l$ represents the objective (loss) function, $\{(x_i, y_i)\}_{i=1}^{n}$ is a known dataset, and $\lambda$ denotes the regularization parameter that controls the model complexity of machine learning algorithms. If $f$ is a linear function and its objective function is a square loss, $\|w\|_k$ is generally used to represent the k-norm of the coefficients.
Almost all machine learning algorithms can be interpreted in the above-mentioned regularization form. If k = 0, L0 regularization is obtained; that is, the number of non-zero coefficients in the model is penalized, in line with the AIC and BIC rules. If k = 1, L1 regularization, corresponding to the Lasso algorithm, is obtained. If k = 2, L2 regularization, corresponding to ridge estimation, is obtained.
If too many redundant variables are present in a representation model, the feature extraction process should eliminate as many redundant variables as possible while ensuring that the real variables can be identified, which corresponds to the sparsity problem. We consider the variable selection problem of the following sparse linear model.
$y = Xb + \varepsilon$, (6)

where $y = (y_1, y_2, \cdots, y_n)^T$ is an $n \times 1$ response vector; $X = (X_1, X_2, \cdots, X_p)$ is an $n \times p$ design matrix with columns $X_i = (x_{1i}, x_{2i}, \cdots, x_{ni})^T$, $i = 1, 2, \cdots, p$; $b = (b_1, b_2, \cdots, b_p)^T$ is an unknown $p \times 1$ parameter vector; $\varepsilon$ is the random error; and $\sigma^2$ is the error variance. We assume that the real model is sparse, that is, only a few components of $b$ are non-zero. L0 regularization can be expressed as

$\min_b \|y - Xb\|_2^2 + \lambda \|b\|_0$. (7)

L0 regularization can produce the sparsest solution; thus, it is the most ideal variable selection method among all regularization methods. However, L0 regularization is inapplicable to massive or high-dimensional data because it requires solving a combinatorial optimization problem.
Therefore, L2 regularization is developed as

$\min_b \|y - Xb\|_2^2 + \lambda \|b\|_2^2$. (8)

L2 regularization can process high-dimensional and massive data and generate smooth solutions, thereby exhibiting excellent performance in solving the overfitting problem. However, this method does not possess sparsity.
Thus, L1 regularization is developed as

$\min_b \|y - Xb\|_2^2 + \lambda \|b\|_1$. (9)

L1 regularization can take advantage of its feature selection ability to produce a relatively sparse solution by solving a convex optimization problem [26]. However, because the objective function of this regularization model is not smooth near zero, the conventional derivative-based optimization methods cannot be used directly. Furthermore, the complexity of the solution might increase, and inconsistency in the selection of certain variables may arise [27].
Thus, L1/2 regularization based on a non-convex penalty is developed as

$\min_b \|y - Xb\|_2^2 + \lambda \|b\|_{1/2}^{1/2}$. (10)

The penalty term is now non-convex, which provides a new idea for dealing with the problems of variable selection and feature extraction. Many studies have reported that the L1/2 regularization model can produce a sparser solution than L1 regularization. In addition, it possesses excellent theoretical properties, such as robustness and unbiasedness [28]. Fig. 2 shows the corresponding penalty term graphs used to analyze the sparsity of the above-mentioned regularization models in a two-dimensional case (i.e., only two parameters, $w_1$ and $w_2$, need to be learned). The center of the red circles marks the smallest error, and the errors on each red contour line are the same. The regularization term adds an extra error on the blue boundary, and the errors on the blue boundary are also the same. The green point at the intersection of the red contour and the blue boundary minimizes the sum of the two errors and represents the regularized solution for $w_1$ and $w_2$. As shown in Fig. 2(a), the contour line and the L2 constraint region cannot intersect on a coordinate axis, so the solution is not sparse. In Fig. 2(b), the constraint region of L1 regularization is a square, so the contour line tends to intersect the square at a vertex, where one coordinate is zero, which makes the solution sparse. The constraint region of L1/2 regularization is shown in Fig. 2(c); it is even more likely to intersect the contour line on a coordinate axis, thereby increasing the sparsity of the solution.
In this example, only $w_2$ is retained ($w_1 = 0$) when L1 and L1/2 regularizations are used. However, the solution of L1 regularization is unstable when a batch dataset is used for training. Fig. 3 shows that the error curves of each training epoch are slightly different. The green point of L2 regularization does not move far, as displayed in Fig. 3(a), but the green point of L1 regularization can jump to many different places, as shown in Fig. 3(b), because the total error is almost the same at these locations. By contrast, the green point of L1/2 regularization remains unmoved, as presented in Fig. 3(c), indicating that this regularization produces the sparsest and most robust solution among the three regularization models.
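The sparsity argument above can be checked numerically. In the following sketch (with hypothetical two-parameter weight vectors, mirroring the $w_1$, $w_2$ example), a dense and a sparse solution share the same L1 penalty, yet L2 favours the dense one while L1/2 favours the sparse one:

```python
import numpy as np

def penalty(w, k):
    """Lk penalty sum(|w_i|^k) for k = 2, 1, or 0.5."""
    return np.sum(np.abs(w) ** k)

w_dense  = np.array([0.5, 0.5])   # spread-out solution
w_sparse = np.array([1.0, 0.0])   # sparse solution with the same L1 norm

# L1 cannot distinguish the two solutions ...
assert np.isclose(penalty(w_dense, 1), penalty(w_sparse, 1))
# ... L2 favours the dense solution ...
assert penalty(w_dense, 2) < penalty(w_sparse, 2)
# ... while L1/2 favours the sparse solution.
assert penalty(w_sparse, 0.5) < penalty(w_dense, 0.5)
```

This is the numerical counterpart of the geometric picture in Fig. 2: the concave L1/2 constraint region pulls the solution onto a coordinate axis where L1 is indifferent.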
Batches of samples with the same fault type can differ considerably because of speed fluctuation, which makes network training difficult. In this study, the excellent characteristics of the regularization strategy in machine learning are combined with the SF model to address the speed fluctuation problem in fault diagnosis.
The three regularization terms are then added to the objective function of SF.
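A minimal sketch of the resulting regularized objective is shown below, combining the ReLU-based SF cost of Section II-A with a generic Lk penalty on the weight matrix. The function name, dimensions, and random inputs are illustrative, not the authors' implementation; k = 2, 1, or 0.5 yields the L2-, L1-, and L1/2-regularized variants:

```python
import numpy as np

def regularized_sf_cost(W, X, lam=1e-5, k=0.5, eps=1e-8):
    """Sparse filtering cost (Eqs. (1)-(4)) plus an Lk penalty
    on the weight matrix W, weighted by lambda."""
    F = np.maximum(W @ X, 0) + eps                       # ReLU features
    F = F / np.linalg.norm(F, axis=1, keepdims=True)     # row normalization
    F = F / np.linalg.norm(F, axis=0, keepdims=True)     # column normalization
    sf_cost = np.abs(F).sum()                            # L1 penalty on features
    reg = lam * np.sum(np.abs(W) ** k)                   # Lk penalty on weights
    return sf_cost + reg

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32))
X = rng.standard_normal((32, 100))
cost_l12 = regularized_sf_cost(W, X, lam=1e-5, k=0.5)   # L1/2-SF objective
cost_sf  = regularized_sf_cost(W, X, lam=0.0)           # plain SF objective
```

In practice the objective would be minimized over W with a gradient-based optimizer; the sketch only evaluates it.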

III. PROPOSED FRAMEWORK
The framework of the proposed method is displayed in Fig. 4. The specific steps are as follows.
(1) Data collection: The original time-domain data collected from each health condition are evenly segmented into M samples, and dataset $\{s_j\}_{j=1}^{M}$ is formed through FFT, where $s_j \in \mathbb{R}^{N_{in} \times 1}$ denotes each sample containing $N_{in}$ Fourier coefficients, $N_{in}$ represents the input dimension of SF, and $N_{out}$ is the output dimension.
(2) Training dataset selection: K samples are randomly selected from dataset $\{s_j\}_{j=1}^{M}$ to form training dataset $\{x_j\}_{j=1}^{K}$, which is then rewritten in matrix form as $S \in \mathbb{R}^{N_{in} \times K}$.
(3) Model training: The obtained $S$ is input to the regularized SF model to train the weight matrix $W$.
(4) Feature learning: The feature vector $f_j \in \mathbb{R}^{N_{out} \times 1}$ of each sample $x_j$ is calculated using the trained $W$. The feature matrix of the training samples is expressed as $\{f_j\}_{j=1}^{K}$.
(5) Fault classification: The feature matrix is combined with the corresponding sample labels to train the softmax regression classifier, and the remaining samples are used as testing samples to evaluate the accuracy of the proposed method.
To provide more detail on the improved sparse filtering algorithm, the procedure of our method is described in Algorithm 1.
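Step (1) of the framework, even segmentation followed by FFT, can be sketched as follows. The signal here is random and purely illustrative, and `make_fft_samples` is a hypothetical helper; the 2400-point samples and 1200 Fourier coefficients of the gearbox case (Section IV) are used as dimensions:

```python
import numpy as np

def make_fft_samples(signal, n_points, n_samples):
    """Evenly segment a 1-D signal into n_samples samples of n_points
    points each, then keep the magnitudes of the first n_points // 2
    FFT coefficients of every sample."""
    segs = signal[: n_points * n_samples].reshape(n_samples, n_points)
    return np.abs(np.fft.rfft(segs, axis=1))[:, : n_points // 2]

# illustrative raw signal: 250 samples of 2400 points, as in the gearbox case
rng = np.random.default_rng(1)
raw = rng.standard_normal(2400 * 250)
S = make_fft_samples(raw, n_points=2400, n_samples=250)   # shape (250, 1200)
```

Each row of `S` would then be one frequency-domain sample $s_j$ with $N_{in} = 1200$ Fourier coefficients; transposing `S` gives the $N_{in} \times K$ input matrix of step (2).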

IV. CASE 1: FAULT DIAGNOSIS OF A PLANETARY GEARBOX
A. DATA DESCRIPTION
A specially designed planetary gearbox fault test bench is used to collect vibration signals under large speed fluctuation. Fig. 5 presents the test bench, which includes a motor, a planetary gearbox, two shaft couplings, and two bearing seats. As shown in Fig. 6, the following 10 health conditions of the gears are designed: normal condition (NC), sun wheel crack (WC), sun wheel pit (WP), sun wheel worn tooth (WW), pinion crack (PC), pinion pit (PP), pinion worn tooth (PW), wheel worn and pinion worn (WWPW), wheel pit and pinion crack (WPPC), and wheel pit and pinion worn (WPPW). The sampling frequency is set to 12.8 kHz, and the rotating speed ranges from 700 r/min to 1500 r/min. A total of 250 samples are collected for each health condition, and each sample contains 2400 data points; thus, 2500 samples are obtained in total. After applying FFT to each time-domain sample, 1200 Fourier coefficients are obtained per sample. Fig. 8 shows that distinguishing the different health conditions from the time and frequency domains of the signal is difficult because even samples under the same health condition have different feature distributions.

B. DIAGNOSIS RESULTS
In this section, the regularized SF models are used to deal with the gear dataset under large speed fluctuation. Two structural parameters, $N_{in}$ and $N_{out}$, need to be adjusted. After testing, we set both $N_{in}$ and $N_{out}$ to 1200. First, we randomly select 50% of the frequency-domain samples from each health condition. Second, these frequency-domain samples are used to create the training set for training the regularized SF model and obtaining $W$. Third, we use the remaining 50% of the samples as the testing set and calculate the learned features using $W$; a 1200-dimensional feature vector is obtained from each sample. Lastly, the learned features are input to the softmax classifier for classification.
We first investigate the selection of the regularization parameter. The diagnosis results of L1-SF are displayed in Fig. 9. The accuracies are stable after λ = 1E-5, so 1E-5 is chosen as the regularization parameter in this paper. The raw SF model without regularization is utilized for comparison. Each experiment is repeated 20 times to reduce the effects of randomness. The testing accuracies of the four methods are displayed in Fig. 10. L1/2-SF has the highest average testing accuracy (99.54%) and the lowest standard deviation (0.07%) among all the methods. The testing accuracies of L2-SF and L1-SF are 98.88% ± 0.28% and 98.63% ± 0.11%, respectively, which are slightly lower than that of L1/2-SF. The raw SF model, with an average testing accuracy of 98.6% ± 0.21%, exhibits the most unsatisfactory performance among the methods. In conclusion, L1/2-SF demonstrates the best performance when dealing with large fluctuations in rotating speed.
To confirm the advantages of L1/2-SF, we use t-distributed stochastic neighbor embedding (t-SNE) [29] to reduce the dimensions of the high-dimensional features obtained by the four methods and visualize the results. The results of the gear dataset processed by t-SNE are shown in Fig. 11. As shown in Figs. 11(a) and 11(b), the dimension reduction features of SF and L1-SF are not clustered well, and the scattered points of the gear samples under different health conditions are mixed. Fig. 11(c) illustrates that the features learned by L2-SF cluster better than those learned by SF and L1-SF; however, several samples are still not separated according to their health conditions. The dimension reduction result of L1/2-SF is displayed in Fig. 11(d): except for slight mixing of WC and PC, the gear samples under different health conditions are well distinguished, and feature points with the same health condition are gathered together. Therefore, L1/2-SF can effectively extract features from the frequency-domain signal and process gear datasets under large speed fluctuation.
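A t-SNE visualization of this kind can be reproduced with scikit-learn. The sketch below embeds hypothetical clustered features (random data standing in for the learned SF features, not the paper's actual results) into two dimensions:

```python
import numpy as np
from sklearn.manifold import TSNE

# hypothetical high-dimensional features: 3 health conditions,
# 20 samples each, in a 50-dimensional feature space
rng = np.random.default_rng(2)
centers = rng.standard_normal((3, 50)) * 5
feats = np.vstack([c + 0.1 * rng.standard_normal((20, 50)) for c in centers])

# embed into 2-D for scatter plotting, as in Fig. 11
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(feats)
```

The resulting `emb` array (one 2-D point per sample) would then be scatter-plotted with one color per health condition to assess how well the learned features cluster.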

V. CASE 2: FAULT DIAGNOSIS OF A MOTOR BEARING
A. DATA DESCRIPTION
The bearing data are measured from the test rig shown in Fig. 12(a). The rig includes a motor, three shaft couplings, a bearing seat, and a brake. The bearing dataset contains three fault types, namely, outer race fault (OF), inner race fault (IF), and roller fault (RF). The vibration signals of three damage severity levels (0.2, 0.4, and 0.6 mm) are separately collected for the three fault types, and the signals of NC are added to the dataset, resulting in 10 health conditions. The three faulty bearings are depicted in Fig. 12(b). The accelerometer is mounted on the bearing box with a fixed sampling frequency. For the speed fluctuation condition, four bearing examples (NC, IF0.2, OF0.4, and RF0.6) are randomly selected. Fig. 13 shows the irregular speed fluctuation information of the four bearing fault types between 500 and 3000 r/min, with the four different curves corresponding to the four selected examples.

B. DIAGNOSIS RESULTS
The network structure and parameter settings are similar to those in Case 1. Fifty percent of the samples are used for training, and the rest are used for testing. The classification accuracies of the 20 trials using the four methods are shown in Fig. 14. The testing accuracy of L1/2-SF is 99.99% in the second trial and 100% in all the other trials. L2-SF, with a testing accuracy of 99.93% ± 0.10%, ranks a very close second. By contrast, the average accuracies of SF and L1-SF are lower and have considerably larger standard deviations. This finding indicates that the L1/2-SF method achieves the highest accuracy and robustness among the four methods in diagnosing bearing fault types under large speed fluctuation.
The results of the bearing dataset processed by t-SNE are shown in Fig. 15. Fig. 15(a) indicates that the dimension reduction features of the SF model are the most unsatisfactory among the four methods, with only four bearing types separated and the other types of samples still mixed. The performance of L1-SF is slightly better than that of SF, as shown in Fig. 15(b). However, a large area of mixing is observed, which indicates failure to achieve fault classification. Figs. 15(c) and 15(d) present the respective feature visualization maps obtained by L2-SF and L1/2-SF methods. The feature points of the same health condition are gathered together. The L1/2-SF method has a better effect compared with L2-SF, and the mapped features between different fault types are obviously separated. This result reveals the superior performance of L1/2-SF in dealing with bearing fault diagnosis under large speed fluctuation.

VI. COMPARISON WITH OTHER METHODS
To prove the effectiveness of the proposed L1/2-SF model, it is compared with other intelligent fault diagnosis methods, as summarized in Table 1. The gearbox experiment shows that the L1/2-SF method achieves the highest testing accuracy of 99.54%. Meanwhile, L1/2-SF achieves the best testing accuracy of 100% when classifying the ten different bearing health conditions. The two experiments show that L1/2-SF is more efficient and accurate than the other three classical methods.

VII. CONCLUSION
(1) Different regularized SF models are proposed to deal with frequency-domain signals under large speed fluctuation. The results of case studies on gearbox and bearing datasets show that the L1/2-SF model can adaptively extract features from the frequency spectra under large speed fluctuation and is superior to the other regularized SF models.
(2) An intelligent fault diagnosis framework based on regularized SF is constructed for the problem of large speed fluctuation. The frequency spectra of the vibration signal are used as the input of the framework to overcome the problem of different initial phases of the time-domain signal and obtain high diagnosis accuracy.
(3) Given that the signal exhibits variability under large speed fluctuation, massive data are required to train the model. However, data are insufficient in many cases. Our future study will utilize data augmentation techniques to overcome this shortcoming.
(4) The two experiments show that the algorithm can be applied in the range of 500-3000 r/min. We will study the maximum speed range of the algorithm in future work.