A Feature Selection Approach Hybridizing Grey Wolf and Heap-Based Optimizers Applied to Bearing Fault Diagnosis

An effective bearing fault diagnosis model based on machine learning is proposed in this study. The model can be separated into three stages: feature extraction, feature selection, and classification. In the feature extraction stage, multiresolution analysis (MRA) and the fast Fourier transform (FFT) are applied to extract features from the raw signal measured from the rotating machine. In the second stage, a powerful feature selection method is proposed and applied; it is based on grey wolf optimization (GWO) and the heap-based optimizer (HBO), combined with new strategies. Finally, a support vector machine (SVM) and linear discriminant analysis (LDA) are used independently as the classifier. To verify the capability of the proposed model, four different datasets are used: the University of California Irvine (UCI) benchmark datasets, a bearing dataset, the Case Western Reserve University (CWRU) benchmark dataset, and the Machinery Failure Prevention Technology (MFPT) benchmark dataset. The proposed method is compared with existing methods, and the experimental results confirm its robustness.


I. INTRODUCTION
In modern industry, rotating machines are widely used in manufacturing, and the most important part of a rotating machine is the rolling bearing [1]. Rotating machine failures most commonly occur in the rolling bearing [2]-[4]. Such failures can lead to motor breakdown or even human injury [5]. Detecting faults of a rotating machine, or even avoiding damage that may occur in the future, is therefore a critical task. Thus, the main purpose of this study is to investigate an effective bearing fault diagnosis approach based on machine learning.
The process of machine learning can generally be separated into three stages: feature extraction, feature selection, and classification. The raw signal measured from the rotating machine is often mixed with various information and noise. The main purpose of feature extraction is to find the most important information and extract it as features, where the quality of the features can significantly affect the classification accuracy [6], [7]. In recent years, several feature extraction methods have been proposed, such as the Hilbert-Huang transform (HHT) [8], empirical mode decomposition (EMD) [9], the wavelet transform (WT) [10], the discrete wavelet transform (DWT) [11], multiresolution analysis (MRA) [12], the short-time Fourier transform (STFT) [13], and the fast Fourier transform (FFT) [14]. MRA was proposed by Mallat in 1989. In processing high-frequency signals, MRA performs well in decomposing the signal into different time windows and frequency bands [15], [16]. On the other hand, FFT is a well-known and useful algorithm for fault diagnosis, based on decomposing the discrete Fourier transform (DFT) to analyze the frequency domain [17], [18]. Hence, MRA and FFT were applied in this study. (The associate editor coordinating the review of this manuscript and approving it for publication was Yongming Li.)
After the stage of feature extraction, the feature set has been generated. However, as the amount of collected data grows considerably, the feature set extracted from the raw signal is usually accompanied by a great number of redundant features, especially in high-dimensional feature sets [19]. Redundant features can lead to lower classification accuracy [20], [21]. Therefore, feature selection plays a significant role in machine learning: it is a pre-processing stage applied to the data before it is fed into the classifier. With the evolution of feature selection, metaheuristic algorithms have gained attention for solving optimization problems [22]. The classical algorithms are: particle swarm optimization (PSO) [23], the genetic algorithm (GA) [24], ant colony optimization (ACO) [25], artificial bee colony (ABC) [26], the whale optimization algorithm (WOA) [27], the Harris hawks optimizer (HHO) [28], the grey wolf optimizer (GWO) [29], and the heap-based optimizer (HBO) [30]. GWO was proposed by S. Mirjalili et al. in 2014 and imitates the social hierarchy of grey wolves [29]. GWO succeeds in its exploitation phase but still suffers from local traps and premature convergence in the exploration phase [22]. HBO was proposed by Q. Askari et al. in 2020 [30]. Inspired by the corporate rank hierarchy (CRH), HBO simulates the CRH with the heap data structure, where the heap is a nonlinear tree structure [30]. In this study, a new feature selection method was proposed based on GWO and HBO to balance the exploration and exploitation phases, which improves the space-searching capability. The last stage of fault diagnosis based on machine learning is classification. (VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/)
In this stage, the classifier distinguishes the features in the optimal feature subset, which represents the fault condition. Several classification techniques have been applied in machine learning, such as the artificial neural network (ANN) [31], support vector machine (SVM) [32], decision tree (DT) [33], random forest (RF) [34], k-nearest neighbors (k-NN) [35], principal component analysis (PCA) [36], and linear discriminant analysis (LDA) [37]. Among these techniques, SVM is capable of making decisions on small datasets and generalizes well [38]. LDA can be utilized for both classification and dimensionality reduction, and performs well on complete datasets in classification tasks [39]. Thus, in this study, SVM and LDA were independently applied as the classifier. Combined with the feature extraction strategies and the proposed feature selection method, a bearing fault diagnosis model was accomplished.
In this study, the main purpose is to investigate an effective bearing fault diagnosis approach. To this end, the main contributions are as follows: 1) In the stage of feature extraction, a new strategy was proposed based on MRA and FFT, which extracts features from the raw signal more effectively. 2) In the stage of feature selection, a new method was proposed based on the GWO algorithm and the HBO algorithm, combined with two new strategies.
3) The proposed feature selection method was compared with other classical methods using the University of California Irvine (UCI) benchmark datasets and the bearing datasets to verify its capability [40]. On the other hand, to verify its robustness, the classification accuracy of the proposed bearing fault diagnosis model was compared with that of existing models using the Case Western Reserve University (CWRU) benchmark dataset and the Machinery Failure Prevention Technology (MFPT) benchmark dataset [41], [42]. The remaining sections of this study are organized as follows: Section II discusses the process of feature extraction. Section III reviews the GWO and HBO algorithms and gives the details of the proposed feature selection method. Section IV illustrates the procedure of the bearing fault diagnosis model. The experimental results are presented in Section V, where the UCI benchmark datasets, the bearing dataset, the CWRU benchmark dataset, and the MFPT benchmark dataset are applied to verify the capability of the model. Finally, Section VI presents the conclusion.

II. THE FEATURE EXTRACTION METHOD
In this study, two feature extraction methods, MRA and FFT, were applied. The procedure of feature extraction is shown in Fig. 1, where Fig. 1a shows the raw current signal measured from the induction motor. Fig. 1b illustrates the 25 features extracted from the raw signal by MRA, and Fig. 1c the other 25 features extracted by FFT. Table 1 gives the mathematical expression of each feature, respectively: mean, maximum (max), kurtosis (kur), skewness (ske), and root mean square (rms). Finally, 50 features were extracted in this stage.
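The five statistics of Table 1 can be sketched in plain Python (the study itself used MATLAB; the exact moment conventions, e.g. biased central moments for skewness and kurtosis, are an assumption here):

```python
import math

def stat_features(x):
    """Compute the five summary statistics of Table 1 for one sub-band."""
    n = len(x)
    mean = sum(x) / n
    mx = max(x)
    rms = math.sqrt(sum(v * v for v in x) / n)
    # biased central moments for skewness and kurtosis (assumed convention)
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    ske = m3 / m2 ** 1.5
    kur = m4 / m2 ** 2
    return {"mean": mean, "max": mx, "kur": kur, "ske": ske, "rms": rms}
```

Applying this function to each of the five detail sub-bands from MRA and to their five FFT spectra would yield the 25 + 25 = 50 features described above.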

A. MULTIRESOLUTION ANALYSIS
The main idea of MRA is to find the discrete wavelet transform (DWT) coefficient vectors to analyze the raw signal, as depicted in Fig. 2 [43]. The signal x(t) is decomposed into approximation coefficients and detail coefficients by using the scale function and the wavelet function [44]. Two one-dimensional filters are applied in this algorithm, respectively the high-pass filter H and the low-pass filter G, and ↓2 denotes down-sampling, which discards the odd-indexed elements of the signal. The calculation for decomposing the raw signal x(t) is shown in (1), and the calculations of the scale function and the wavelet function are shown in (2) and (3), respectively.
where A_j denotes the approximation coefficients and D_j the detail coefficients. Choosing an appropriate mother wavelet is an essential issue, since it significantly influences whether the features extracted from the signal are useful or not [45], [46]. In this study, the detail coefficients d1 to d5 were applied.
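A minimal sketch of one such filter-bank decomposition, using Haar filters for brevity (the paper does not restate its mother wavelet choice here, so the filter pair is an assumption):

```python
import math

# Haar analysis filters (assumed for illustration): low-pass G (scale
# function) and high-pass H (wavelet function), followed by ↓2 down-sampling.
G = [1 / math.sqrt(2), 1 / math.sqrt(2)]
H = [1 / math.sqrt(2), -1 / math.sqrt(2)]

def dwt_level(x):
    """One filter-bank level: return (approximation A, detail D) coefficients."""
    A = [G[0] * x[i] + G[1] * x[i + 1] for i in range(0, len(x) - 1, 2)]
    D = [H[0] * x[i] + H[1] * x[i + 1] for i in range(0, len(x) - 1, 2)]
    return A, D

def mra(x, levels=5):
    """Iterate the filter bank on the approximation to get details d1..d5."""
    details = []
    for _ in range(levels):
        x, d = dwt_level(x)
        details.append(d)
    return details  # [d1, d2, d3, d4, d5]
```

Each call to `dwt_level` halves the signal length, so five levels cover five successively lower frequency bands, matching the d1 to d5 coefficients used above.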

B. FAST FOURIER TRANSFORM
To obtain information from the frequency domain, the signal x(t) is transformed from the time domain into the frequency domain on a finite number of frequency lines f by using the FFT [47]. The calculation of the FFT is shown in (4). In this study, the inputs of the FFT are the detail coefficients d1 to d5, as shown in Fig. 1.
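The FFT itself can be sketched as the classic radix-2 Cooley-Tukey recursion (an illustrative stand-in for Eq. (4); in practice a library routine such as numpy.fft.fft or MATLAB's fft would be used):

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])   # DFT of even-indexed samples
    odd = fft(x[1::2])    # DFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out
```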

III. THE FEATURE SELECTION METHOD
In this section, a new feature selection method, heap-based binary grey wolf optimization (HBBGWO), is proposed; it hybridizes GWO and HBO and combines two new strategies to eliminate redundant and irrelevant features.

A. GREY WOLF OPTIMIZATION
GWO divides the wolves into α, β, δ, and ω. In the context of optimization, α stores the best solution found, β the second best, and δ the third, while ω denotes the lowest rank of the hierarchy. Hence, only α, β, and δ are stored. Following the habits of grey wolves, hunting can be divided into three stages: 1) encircling, 2) hunting, and 3) attacking [29].
In the stage of encircling, the wolves surround the prey so that the prey is at the center of the pack, with each wolf keeping a certain distance from the prey. To simulate this situation, the distance D between a wolf and the prey is given in (5), while the position of the wolves changes continuously according to the prey, as shown in (6).
where t is the current iteration number, A and C are coefficient vectors, X_p is the position of the prey, and X(t) is the position of a wolf. The calculations of A and C are shown in (7) and (8).
where r1 and r2 are stochastic vectors in [0, 1], and the calculation of the coefficient a is shown in (9). The coefficient a decreases gradually from 2 to 0 as t increases, and A is determined by a. When |A| < 1, the search agent switches from the exploration phase into the exploitation phase.
where T is the maximum number of iterations. In the stage of hunting, the positions of the wolves are updated according to (10).
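A sketch of Eqs. (5)-(10) in Python (positions as plain lists; averaging the α, β, and δ estimates in the hunting update follows the standard GWO formulation, which this section describes):

```python
import random

def gwo_step(wolves, alpha, beta, delta, t, T):
    """One GWO position update: a decays linearly from 2 to 0 (Eq. (9)),
    and each wolf moves to the mean of its alpha/beta/delta estimates."""
    a = 2 * (1 - t / T)                       # Eq. (9)
    new_positions = []
    for X in wolves:
        estimates = []
        for leader in (alpha, beta, delta):
            A = 2 * a * random.random() - a   # Eq. (7)
            C = 2 * random.random()           # Eq. (8)
            D = [abs(C * lp - xp) for lp, xp in zip(leader, X)]        # Eq. (5)
            estimates.append([lp - A * d for lp, d in zip(leader, D)]) # Eq. (6)
        # Eq. (10): average the three leader-guided estimates
        new_positions.append([sum(e[j] for e in estimates) / 3
                              for j in range(len(X))])
    return new_positions
```

When t reaches T, a = 0 forces A = 0, so every wolf collapses onto the leaders, which is the exploitation limit described above.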

B. HEAP-BASED OPTIMIZER
In the heap structure, each heap node represents an individual in a corporation. The highest-level node is the chief executive officer (CEO) of the company, which represents the most promising solution. The nodes at lower levels are subordinates, and those at higher levels are senior executives. HBO has three types of interaction between heap nodes: 1) interaction with the immediate boss, 2) interaction with colleagues, and 3) self-contribution [30]. The heap is updated continuously through the interactions between heap nodes to carry out the search. To hybridize correctly with GWO, the solution generated by HBO is denoted sol in this study.

1) INTERACT WITH IMMEDIATE BOSS
In the heap, the parent node can be regarded as the immediate boss. The interaction with the immediate boss is shown in (17).
where sol is the generated solution, B is the parent node, λ_k is the k-th bit of the stochastic vector λ, and γ is a parameter. λ is calculated in (18) and γ in (19).
where r is a random number uniformly distributed in [0, 1], T is the maximum number of iterations, and C is a user-defined parameter. In this case, the value of C is defined as T/25.
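Since Eqs. (18)-(19) are not restated above, the sketch below follows the formulas of the original HBO paper (an assumption on our part): each bit of λ is 2r − 1, and γ oscillates over T/C-iteration cycles.

```python
import random

def lam(dim):
    """Stochastic vector lambda (Eq. (18), per the HBO paper): bits in [-1, 1]."""
    return [2 * random.random() - 1 for _ in range(dim)]

def gamma(t, T, C):
    """Gamma (Eq. (19), per the HBO paper): restarts every T/C iterations
    and shrinks within each cycle, modulating the step toward the boss."""
    cycle = T / C            # with C = T / 25 this gives 25 cycles
    return abs(2 - (t % cycle) / (cycle / 4))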

2) INTERACT WITH COLLEAGUES
Heap nodes at the same level can be regarded as colleagues. The interaction with colleagues is shown in (20).
where S is the position vector of a colleague node, and f is the objective function that calculates the fitness value of a search agent.

3) SELF CONTRIBUTION
The self-contribution is shown in (21).
To combine the three types of interaction, a roulette wheel is applied to balance the exploration and exploitation phases. The probability of the roulette wheel is divided into p1, p2, and p3 based on the iteration number t. The combination of the interactions is shown in (22).
HBO applies a 3-ary heap to the structure of the CRH, meaning that each heap node can manage up to three subordinates. To make the heap work correctly, it is represented in the form of an array. The index of the parent of node i in a d-ary heap is given in (26).
where floor is the floor function. In a d-ary heap, each node can have up to d child nodes. The calculation of the index of the j-th child of node i is shown in (27).
To locate the position of each node, nodes belonging to different levels are assigned different depths. For example, a child node has a depth one greater than its parent node, and nodes located at the same depth are colleague nodes. The depth of the CEO node is 0. The depth of node i is given in (28).
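Eqs. (26)-(28) reduce to simple index arithmetic. The sketch below uses a standard 1-indexed d-ary heap convention (CEO at index 1), which may differ cosmetically from the paper's exact index formulas:

```python
from math import floor

def parent(i, d=3):
    """Index of the immediate boss of node i (Eq. (26); CEO at index 1)."""
    return floor((i - 2) / d) + 1

def child(i, j, d=3):
    """Index of the j-th subordinate (j = 1..d) of node i (Eq. (27))."""
    return d * (i - 1) + j + 1

def depth(i, d=3):
    """Depth of node i (Eq. (28)); the CEO node has depth 0."""
    lvl = 0
    while i > 1:
        i = parent(i, d)
        lvl += 1
    return lvl
```

With d = 3, nodes 2-4 are the CEO's direct subordinates (depth 1), nodes 5-13 have depth 2, and so on; colleague nodes are exactly those sharing a depth.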
For a node i, the index of its colleague node is randomly generated within the index interval of its own depth. To keep the heap consistent, HBO searches upward to insert node i, as shown in pseudo-code 1; the establishment of the heap architecture is completed as shown in pseudo-code 2. Feature selection is a binary problem; in other words, it is accomplished by toggling bits between 1 and 0. For example, converting a bit from 1 to 0 means the feature is deleted, and converting it from 0 to 1 means the feature is selected. Therefore, all bits must be represented in binary. At the beginning of the iterations, the position of each wolf is randomly initialized as in (29).
where rand1 is a random number uniformly distributed in [0, 1]. The GWO algorithm searches a continuous space; therefore, it cannot be utilized for feature selection until it is converted into a binary form. A binary version with the sigmoid function is used to deal with the feature selection problem: each bit is converted into 0 or 1 based on (30) [48].
where rand2 is a random number uniformly distributed in [0, 1], and sigmoid is the sigmoid function shown in (31).
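Eqs. (30)-(31) amount to a stochastic threshold on the sigmoid of each continuous position component; a minimal sketch:

```python
import math
import random

def binarize(position):
    """Map each continuous component to a 0/1 bit: draw rand2 ~ U[0, 1]
    and select the feature when rand2 falls below sigmoid(x)."""
    bits = []
    for x in position:
        s = 1.0 / (1.0 + math.exp(-x))              # Eq. (31)
        bits.append(1 if random.random() < s else 0) # Eq. (30)
    return bits
```

Strongly positive components are thus almost always mapped to 1 (feature selected) and strongly negative ones to 0 (feature deleted), while components near zero stay stochastic.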
GWO has superior performance in its exploitation phase but is still prone to falling into local traps [22]. Therefore, in this study, a new approach, HBBGWO, hybridizing GWO and HBO was proposed. The proposed method improves the capability of both global and local searching. In addition, to increase the chance of escaping from local traps, two strategies were proposed in this study; each strategy operates when its condition is met. The two strategies and the proposed feature selection method are detailed in the following subsections: 1) updating the solution based on the crossover operation, 2) updating the solution based on mutation, and 3) the proposed HBBGWO feature selection approach.

1) UPDATING SOLUTION BASED ON CROSSOVER OPERATION
The crossover operation is one of the operators of the GA, and it can enhance the capability of global searching. Its main process is to choose two parent genes, cut them at random points into several parts, and then combine the parts into a new solution whose length is the same as its parents' [49]. Good parents are therefore important in this process, since choosing good parents is more likely to generate a good solution.
The new sol is stored as the record after the interaction. In the first iteration, X_α is saved as the record; otherwise, the record is updated after every interaction. To trigger the crossover operation at an appropriate moment, a threshold is checked before it is performed: sol is first compared with the record, and the crossover operation is performed when sol is more than 90% similar to the record. In this study, a three-point crossover operation was applied; its process is shown in Fig. 3. A roulette wheel is also applied to choose one of the parents from the heap to cross over with sol. The weight of the roulette wheel is set to the fitness value of each heap node, so a good parent has a higher chance of being chosen. In this study, the k-NN classifier is adopted to calculate the fitness value. Considering only the reduction of the error rate, the calculation of the fitness value is shown in (32) [50].
where N_T is the number of correctly predicted instances and N_F is the number of incorrectly predicted instances.
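An illustrative sketch of this step (Python rather than the paper's MATLAB): the fitness of Eq. (32) is read here as the accuracy N_T / (N_T + N_F), consistent with the highest-fitness wolf being chosen as α and with the fitness percentages reported later; maximizing it minimizes the error rate. The three-point crossover cuts both parents at three random points and alternates the segments.

```python
import random

def fitness(n_true, n_false):
    """Eq. (32), read as classification accuracy (assumption: higher is better)."""
    return n_true / (n_true + n_false)

def three_point_crossover(p1, p2):
    """Cut both parents at three random points and alternate the segments,
    yielding a child of the same length as its parents (Fig. 3)."""
    n = len(p1)
    cuts = sorted(random.sample(range(1, n), 3))
    child, take_first, prev = [], True, 0
    for c in cuts + [n]:
        child.extend((p1 if take_first else p2)[prev:c])
        take_first = not take_first
        prev = c
    return child
```

In HBBGWO the second parent would be drawn from the heap by a fitness-weighted roulette wheel, then crossed with sol as described above.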

2) UPDATING SOLUTION BASED ON MUTATION
Mutation is another operator from the genetic algorithm [49]. The process of mutation is shown in Fig. 4: each chosen bit is flipped from 0 to 1 or from 1 to 0. Unlike the crossover operation, the purpose of mutation is to increase the diversity of solutions, which improves the ability of local searching. After the crossover operation, sol is checked again against the record: mutation is performed when the similarity of sol and the record reaches 100%, a threshold that helps search the space in more detail. Another threshold, shown in (33), was established to balance exploration and exploitation. If the random number rand3, uniformly distributed in [0, 1], satisfies (33), 10% of the bits are chosen at random for mutation; otherwise, 5% of the bits are chosen. Based on (33), there is no chance of choosing 10% of the bits for mutation after t exceeds 76% of T, which speeds up the search in the early stage and slows it down in the later stage.
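Since Eq. (33) itself is not reproduced in this excerpt, the sketch below uses an assumed stand-in: a probability that decays linearly and reaches zero at t = 0.76T, matching the stated property that the 10% rate is never chosen after that point.

```python
import random

def mutate(sol, t, T):
    """Flip 10% of the bits early in the search, 5% later; the linear
    schedule p_large is an assumed stand-in for Eq. (33)."""
    p_large = max(0.0, 1.0 - t / (0.76 * T))   # hits zero at t = 0.76 * T
    rate = 0.10 if random.random() < p_large else 0.05
    n_flip = max(1, int(rate * len(sol)))
    out = list(sol)
    for i in random.sample(range(len(sol)), n_flip):
        out[i] = 1 - out[i]   # flip the selected bit
    return out
```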

3) THE PROPOSED HBBGWO ALGORITHM
After the interaction, sol is used to update the wolves' positions. If the fitness value of sol is greater than or equal to the fitness value of α, then sol replaces α. If the fitness value of sol is less than α's but greater than or equal to β's, then sol replaces β, and so on. If the fitness value of sol is less than δ's, the positions of the wolves are not updated. The details of updating the positions of the wolves are given in pseudo-code 3. Finally, the new feature selection method HBBGWO was accomplished. Fig. 5 shows the flow chart of HBBGWO, and the details are explained as follows: Step 1: Set the parameters (population size N, user-defined parameter C, maximum number of iterations T).
Step 2: Initialize the positions of the wolves X_i by using (29).
Step 3: Calculate the fitness value of X_i by using (32). Step 4: Determine the positions of the wolves X_α, X_β, and X_δ: the wolf with the highest fitness value is determined as α, the second as β, and the third as δ.

Step 5: Store X_α as the record.
Step 6: Update the positions of the wolves X_i by using (6) and (10), and convert them into binary by using (30).
Step 7: Calculate the fitness value of X_i by using (32).
Step 8: Build the heap based on pseudo-code 1 and pseudo-code 2.
Step 10: Interact between the heap nodes by using (22).
Step 16: Choose 5% of the bits of sol stochastically to implement mutation.
Step 17: Update the record with sol. The record stores only the latest sol generated after the interaction.
Step 18: Calculate the fitness value of sol by using (32).
Step 19: Update the positions X_α, X_β, and X_δ based on pseudo-code 3.
Step 20: Check whether t exceeds the maximum number of iterations. If yes, stop the process; otherwise, go to Step 6 to start the next iteration.

IV. THE BEARING FAULT DIAGNOSIS MODEL
The bearing fault diagnosis model can generally be separated into three stages: feature extraction, feature selection, and classification, as illustrated in Fig. 6. In the stage of feature selection, the powerful feature selection method HBBGWO was proposed and applied. The proposed method eliminates redundant features more efficiently and robustly, which increases the classification accuracy. In this stage, an optimal feature subset is generated.
In the stage of classification, SVM and LDA were applied to distinguish the features of a normal motor bearing from those of motor bearings with faults of different severities. SVM is a widely used classifier because of its advantages in resolving small-sample and nonlinear problems. The basic idea of SVM is to map the input samples to a high-dimensional space and, during training, find the hyperplanes that best separate all the classes [51]. The hyperplane is then used to separate the classes of the testing samples. Because the parameters deeply influence the classification performance, 10-fold cross-validation is applied in this study [52], [53]. The main idea of LDA is to find a projection vector that enlarges the distance between classes while keeping samples from the same class close together [54]. During prediction, the class of a sample is determined from this projection. In this stage, the optimal feature subset is provided to the classifier to accomplish the task.

V. EXPERIMENTAL VERIFICATION AND RESULTS
In this study, the hardware includes a personal computer (4-core Intel Core i5-6500 CPU, 16 GB of RAM), and the software environment is MATLAB R2017a. Four different datasets were applied to certify the capability and robustness of the proposed motor fault diagnosis approach, as explained below.
Case Study 1: The UCI benchmark datasets were adopted to verify the searching capability of the proposed feature selection method. In this case study, the proposed feature selection method was compared with three other methods: BGWO, BPSO and GA.
Case Study 2: The bearing dataset was applied in this case study to verify the capability and the robustness of the model. The raw signal is the current signal measured from the experimental induction motors. In this case study, the proposed bearing fault diagnosis model with the proposed feature selection method was compared with three other methods mentioned in case study 1.
Case Study 3: The benchmark dataset provided by the CWRU machine learning repository was applied to certify the capability and robustness of the model. In this case study, the proposed bearing fault diagnosis model was compared with other state-of-the-art methods.
Case Study 4: Another benchmark dataset, provided by the MFPT society, was applied to verify the robustness of the model. In this case study, the proposed bearing fault diagnosis model was again compared with other state-of-the-art methods.

A. PARAMETERS SETTINGS
The proposed HBBGWO is a wrapper approach to feature selection, which needs a classification method as a predictor. The performance of the predictor is used as the objective function to evaluate the fitness value of a feature subset [55]. In this study, the k-NN algorithm is applied to evaluate the fitness value, as described in subsection III.C.1. The k-NN algorithm is one of the simplest machine learning algorithms and can evaluate the fitness value rapidly. The number of nearest neighbors k of the k-NN algorithm is set to 5, and the number of cross-validation folds is set to 10. Table 2 shows the parameter settings of the proposed feature selection method and the three other methods (BGWO, BPSO, and GA). All the parameters are used throughout the four case studies in this study.

B. CASE STUDY 1: UCI BENCHMARK DATASETS
1) DESCRIPTION OF THE DATASETS
Six UCI datasets were adopted in this case study, namely BreastEW, Ionosphere, PenglungEW, Segmentation, Sonar, and Vehicle. The details of each dataset are shown in Table 3. BreastEW is the breast cancer Wisconsin (diagnostic) dataset, which predicts whether a cancer is benign or malignant. Ionosphere is radar data used to classify radar signals returned from the ionosphere as good or bad. PenglungEW has a larger dimensionality with seven classes, and is therefore more sensitive. Segmentation is a dataset whose instances were drawn randomly from a database of seven outdoor images; the images were hand-segmented to create a classification for every pixel. Sonar is a dataset obtained by bouncing sonar signals off a metal cylinder or a cylindrical rock at various angles and under different conditions. Vehicle is a dataset of vehicle silhouettes; the images were acquired by a camera looking down at the model vehicle from a fixed angle of elevation.

2) COMPARISON WITH OTHER FEATURE SELECTION METHODS
To verify the proposed method comprehensively, three criteria were established: 1) the average fitness value, 2) the average number of selected features, and 3) the average operating time. The average fitness value reflects the space-searching capability of the proposed method, and the number of selected features significantly influences the efficiency of the classifier. Finally, the operating time directly reflects the computational cost. The reported results are averages over 30 independent runs. Table 4 shows the results of each method, with the best result for each criterion in bold. In terms of the fitness value, the proposed method performed best on Ionosphere, PenglungEW, Segmentation, Sonar, and Vehicle. The average fitness value of the proposed method is only 0.03% lower than BGWO's on BreastEW, but 0.2%, 2.19%, 0.46%, 0.35%, and 0.22% higher on the other datasets, respectively. On all the datasets, the proposed method obtained higher average fitness values than BPSO and GA, especially GA. Therefore, the proposed method has a better space-searching capability than the three other methods.
On the other hand, in terms of the average number of selected features, the proposed method outperformed the three compared methods on all the datasets. The average numbers of selected features of the proposed method are 0.4, 0.5, 59.4, 1.3, 3.4, and 0.2 fewer than BGWO's on the six datasets, respectively. Moreover, the results of the proposed method are substantially better than those of BPSO and GA. This significant reduction in the number of features helps the classifier accomplish its detection task more efficiently.
Finally, the results of the average operating times are shown in Fig. 7.

C. CASE STUDY 2: BEARING DATASET
1) DESCRIPTION OF THE DATASET
The raw current signals were measured from experimental induction motors in the normal condition and with three different severities of fault. Fig. 8 shows the experimental hardware: a 2 hp induction motor (4 poles, 380 V, 60 Hz), a dynamometer (11 kW, 2000 rpm, 69 Hz), a data acquisition device, and a personal computer. Fig. 9 shows the bearing with three different severities of artificial defect. Fig. 9a shows the square hole located on the inner race; the three hole sizes, 0.53 mm × 0.53 mm, 0.53 mm × 1.24 mm, and 0.53 mm × 1.96 mm, are shown in Fig. 9b, Fig. 9c, and Fig. 9d, respectively. 100 samples were collected for the normal condition and for each type of fault, 400 samples in total, each with a length of 2000 data points. The sampling rate is 1 kHz. The ratio of the training dataset to the testing dataset is 70:30, and both were randomly split from the optimal feature subset; the same splitting process is also applied in Case Studies 3 and 4. In reality, the raw signal measured from an induction motor is usually full of noise. To simulate the real situation more realistically, Gaussian white noise was added to the raw signal at five different levels (∞dB, 30dB, 20dB, 10dB, 0dB). The signal-to-noise ratio (SNR) indicates the noise level: ∞dB means no noise and 0dB the largest noise. The robustness of the bearing fault diagnosis model was verified under these different levels of noise.
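Adding white Gaussian noise at a prescribed SNR can be sketched as follows (an illustrative stand-in for the paper's MATLAB workflow, e.g. its awgn-style noise injection): the noise power is set from the measured signal power and the target SNR in dB.

```python
import math
import random

def add_awgn(signal, snr_db):
    """Add white Gaussian noise so the signal-to-noise ratio equals snr_db;
    pass snr_db = float('inf') for the noise-free (∞dB) case."""
    if math.isinf(snr_db):
        return list(signal)
    p_signal = sum(v * v for v in signal) / len(signal)   # mean signal power
    p_noise = p_signal / (10 ** (snr_db / 10))            # SNR definition in dB
    sigma = math.sqrt(p_noise)
    return [v + random.gauss(0.0, sigma) for v in signal]
```

Lower snr_db values thus yield proportionally stronger noise, with 0dB making the noise power equal to the signal power, the harshest condition tested above.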

2) COMPARISON WITH OTHER FEATURE SELECTION METHODS
The optimal feature subset generated by feature selection was fed to SVM and LDA, respectively. Fig. 10 shows the optimal convergence curves at each noise level. Fig. 10a shows the curves under ∞dB: although every feature selection method reached a fitness value of 100%, the proposed HBBGWO reached 100% within the fewest iterations. Fig. 10b shows the curves under 30dB of noise: BGWO, GA, and the proposed method reached a fitness value of 99.50%, with BGWO reaching the top slightly faster than the proposed method. Fig. 10c shows the curves under 20dB of noise: both GA and the proposed method reached a fitness value of 96.50%, but GA converged faster than the proposed method. Fig. 10d shows the curves under 10dB of noise, where the proposed method obtained the highest fitness value of 87.00% and reached the top first. Finally, Fig. 10e shows the curves under 0dB of noise: the proposed method still obtained the highest fitness value of 74.50%, higher than BGWO by 1.75%, BPSO by 5.75%, and GA by 2.50%. This figure shows the robustness of the proposed method under greater noise, so a better optimal feature subset is provided to the classifier. Table 5 shows the optimal features of each method under different levels of noise. The proposed method selects the fewest features at every noise level, which indicates its good feature-selecting capability in this case; fewer features also reduce the computational cost of the classifier. The meaning of each feature is described in Table 1.
In this case study, the major purpose is to verify the capability and robustness of the bearing fault diagnosis model. Two criteria were established: 1) the average classification accuracy and 2) the average operating time. Table 6 shows the average accuracy using the SVM classifier. Every method reached 100% under ∞dB of noise except the case without feature selection (without FS), and the accuracy decreased as the noise grew. The accuracy with any feature selection method is significantly higher than the accuracy without feature selection at every noise level, which points out the importance of the feature selection step. Under 30dB of noise, GA and the proposed method obtained the same accuracy of 99.20%, higher than BGWO by 0.87%. Under 20dB of noise, the proposed method obtained the highest accuracy of 95.83%, higher than BGWO by 4.16%. Under 10dB of noise, the proposed method reached the highest accuracy of 83.33%, higher than BGWO by 0.83%. Finally, the proposed method obtained the highest accuracy of 71.67% under 0dB of noise, higher than BGWO by 5%. The results show the robustness of the proposed method as the noise gets larger. Table 7 shows the average accuracy using the LDA classifier. The accuracy with feature selection is still higher than the accuracy without feature selection at every noise level. Both BGWO and the proposed method reached 100% accuracy under ∞dB of noise. The performance of the proposed method is worse than that of BPSO under 30dB and 20dB of noise, but the proposed method obtained the highest accuracy under 10dB and 0dB, again showing robustness under greater noise. Comparing Tables 6 and 7, the highest accuracy of the SVM classifier is higher than that of the LDA classifier at every noise level except ∞dB. Therefore, the SVM classifier is more suitable for this case study.
The average operating times of each method under different levels of noise are depicted in Fig. 11. The proposed method has the best performance under each level of noise; its average operating time is 12.6 sec less than that of BGWO. In this case study, the capability and the robustness of the proposed bearing fault diagnosis model were verified. On the other hand, the SVM classifier is more suitable than the LDA classifier in this case study.

D. CASE STUDY 3: CWRU BENCHMARK DATASET 1) DESCRIPTION OF THE DATASET
The experimental hardware includes a 2hp induction motor, a load motor, and a torque encoder. The raw signals are vibration signals measured from the induction motor, covering four levels of load (0hp, 1hp, 2hp, 3hp), three fault locations (inner race, outer race, ball), and three defect diameters (0.007 inches, 0.014 inches, 0.021 inches), with a sampling rate of 12kHz. The main purpose is to identify the severity and the location of the bearing fault; thus, the motor load is not considered in this study. Table 8 shows the details of the CWRU benchmark dataset after processing. Under the normal condition, the signal is cut into 100 samples for each level of load, for a total of 400 samples. For each fault condition, the signals are cut into 50 samples for each level of load with three different defect diameters, for a total of 200 samples. The length of every sample is 2000 data points. Finally, the data are separated into 10 classes.
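The segmentation described above, cutting each record into fixed-length samples of 2000 points, can be sketched as below; the non-overlapping windowing and the helper name are assumptions for illustration:

```python
import numpy as np

def segment(signal, n_samples, length=2000):
    """Cut a 1-D raw signal into n_samples non-overlapping segments of
    `length` points each; trailing points are discarded."""
    if len(signal) < n_samples * length:
        raise ValueError("signal too short for the requested segmentation")
    return signal[: n_samples * length].reshape(n_samples, length)

# e.g. one normal-condition record cut into 100 samples of 2000 points
raw = np.random.default_rng(0).standard_normal(250_000)
samples = segment(raw, 100)
```

Applying this per load level yields the sample counts in Table 8 (100 per load for the normal condition, 50 per load for each fault condition).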

2) COMPARISON WITH THE STATE-OF-THE-ART STUDIES
In recent years, several bearing fault diagnosis methods based on either machine learning or deep learning have been proposed. Chen et al. proposed a multi-scale convolutional neural network and long short-term memory (MCNN-LSTM) model, which takes raw vibration signals as input to an automatic feature learning network: two convolutional neural networks extract signal characteristics at different frequencies, and a long short-term memory network then identifies the fault type according to the learned features [56]. Li et al. proposed a rolling bearing fault diagnosis method combining wavelet threshold denoising, genetic algorithm optimized variational mode decomposition (GA-VMD), and a whale optimization algorithm based on von Neumann topology to optimize a least squares SVM (VNWOA-LSSVM) [57]. Yuan et al. proposed a rolling bearing fault diagnosis method based on a convolutional neural network and support vector machine (CNN-SVM), using the continuous wavelet transform to convert one-dimensional original vibration signals into two-dimensional time-frequency images that serve as input to the constructed model [58]. Zuo et al. proposed a bearing fault diagnosis method based on a spiking neural network (SNN), a third-generation neural network tailored for fault diagnosis in rotating machinery [59]. Zhang et al. proposed a self-supervised joint learning (SSJL) fault diagnosis method based on three-channel vibration images to reduce the dependence on labeled data [60].
In this case study, the proposed bearing fault diagnosis model was compared with the state-of-the-art models. Table 9 shows the classification accuracy of the proposed method and other state-of-the-art methods that also use the CWRU benchmark dataset. The proposed model obtained 99.55% accuracy with the SVM classifier and 97.09% accuracy with the LDA classifier; the SVM variant achieved the highest accuracy in this case study. The operating time of the proposed method is 107.5 seconds, longer than in case study 2 because the CWRU benchmark dataset is more complex than the bearing dataset. The proposed method not only reached a higher accuracy than the SNN model but can also detect more classes. Therefore, the proposed bearing fault diagnosis model has a better capability and is suitable for application in practical fault diagnosis tasks.

E. CASE STUDY 4: MFPT BENCHMARK DATASET 1) DESCRIPTION OF THE DATASET
MFPT is another benchmark dataset widely used in fault diagnosis. The details of the MFPT benchmark dataset after processing are shown in Table 10. The raw signals are vibration signals measured from the experimental motor under three health conditions: baseline, outer race fault, and inner race fault. The sampling frequency of the baseline condition is 97656Hz with a load of 270 pounds, and the signal is cut into 600 samples. The sampling frequency of the outer race and inner race failures is 48828Hz, with loads of 0, 50, 100, 150, 200, 250, and 300 pounds respectively. The signals were cut into 50 samples for each level of load, for a total of 350 samples over the seven types of load. The length of every sample is 2000 data points. Finally, the data are separated into 3 classes.

2) COMPARISON WITH THE STATE-OF-THE-ART METHODS
In this case study, the main purpose is also to compare the proposed bearing fault diagnosis model with the state-of-the-art methods. Yuan et al. proposed the CNN-SVM model, which was illustrated in the last subsection [58]. Sun et al. proposed the second-order time-reassigned multisynchrosqueezing transform (STMSST) [61], which is based on a Gaussian modulated linear group delay (GLGD) model and extracts fault features of the vibration signals from the obtained time-frequency images. Wang et al. proposed a lightweight convolutional neural network (LCNN) method, which largely satisfies the need for a small parameter count and storage space while keeping high accuracy [62]. Zuo et al. proposed the SNN-based bearing fault diagnosis method illustrated in the last subsection [59]. Jiang et al. tested a method based on the Chen-Lee chaotic system, in which chaotic dynamic error maps were used for analysis and feature status identification [63].
In this case study, the proposed bearing fault diagnosis model was compared with the state-of-the-art methods using the MFPT dataset. Table 11 shows the results of five existing fault diagnosis methods and the proposed method. The proposed model obtained 99.49% accuracy with the SVM classifier and 98.95% accuracy with the LDA classifier. Although the accuracy of the proposed model with the SVM classifier is lower than that of the LCNN method, it is still higher than the four other methods. From the results of case studies 2, 3, and 4, the proposed model with the SVM classifier is more suitable than with LDA, and the capability and robustness of the proposed model are confirmed. On the other hand, the operating time of the proposed method on the MFPT benchmark dataset is 87 seconds, which indicates that the complexity of the MFPT benchmark dataset is lower than that of the CWRU benchmark dataset.

VI. CONCLUSION
In this study, a new bearing fault diagnosis model was proposed. In the stage of feature extraction, the model uses MRA and FFT to extract features from the raw signal. In the stage of feature selection, a new method, HBBGWO, was proposed. HBBGWO is an enhanced method that hybridizes the GWO algorithm, the HBO algorithm, and two position-updating strategies. Finally, SVM and LDA were adopted respectively as the classifier to distinguish the optimal feature subsets provided by the feature selection methods.
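As a rough illustration of the feature extraction stage, the sketch below decomposes a signal with a minimal Haar-wavelet MRA and appends an FFT-derived feature; the paper's actual wavelet, decomposition depth, and feature definitions (listed in Table 1) are not reproduced here, so every choice in this sketch is an assumption:

```python
import numpy as np

def haar_mra(signal, levels=3):
    """Minimal Haar multiresolution analysis: each level splits the current
    approximation into a low-pass (approximation) and high-pass (detail) band."""
    details, approx = [], np.asarray(signal, dtype=float)
    for _ in range(levels):
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2))   # detail coefficients
        approx = (even + odd) / np.sqrt(2)          # approximation coefficients
    return approx, details

def extract_features(signal, levels=3):
    """RMS energy of every MRA band plus the dominant FFT frequency bin."""
    approx, details = haar_mra(signal, levels)
    feats = [np.sqrt(np.mean(band ** 2)) for band in details]
    feats.append(np.sqrt(np.mean(approx ** 2)))
    spectrum = np.abs(np.fft.rfft(signal))
    feats.append(float(np.argmax(spectrum[1:]) + 1))  # skip the DC bin
    return np.array(feats)

# A 50-cycle tone over 2000 points: its dominant rfft bin is 50.
sig = np.sin(2 * np.pi * 50 * np.arange(2000) / 2000)
feats = extract_features(sig)
```

The resulting feature vector is the kind of input that HBBGWO then prunes to an optimal subset before classification.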
Four different datasets were applied to verify the capability and the robustness of the proposed method. In case study 1, the UCI benchmark dataset was applied to certify the performance of the proposed feature selection method. The results show that the search capability of the proposed method is better than that of BGWO, BPSO, and GA. On the other hand, the number of selected features and the operating time are both smaller than those of the three other methods. In case study 2, the bearing dataset measured from induction motors was adopted to verify the bearing fault diagnosis model. Both the performance of the selected features and the operating time of the proposed method are better than those of the three other methods, and the proposed model also reached the highest classification accuracy under different levels of noise with the SVM classifier. In case study 3, the CWRU benchmark dataset was applied to compare the proposed bearing fault diagnosis model with the state-of-the-art methods. The proposed model reached 99.55% classification accuracy with the SVM classifier. In case study 4, the MFPT dataset was adopted, and the proposed model reached 99.49% classification accuracy with the SVM classifier. According to the results, the model using the SVM classifier is more suitable than the one using the LDA classifier in this study. In conclusion, although the procedure of the proposed method is complex and could be simplified, it can still detect bearing faults under noise and across different sources of signals effectively and with robustness.