Ensemble Pruning of RF via Multi-Objective TLBO Algorithm and Its Parallelization on Spark

Ensemble learning has been widely used in various fields. Still, too many base classifiers will affect the classification time of the ensemble classifier under the big data environment, while reducing base classifiers will affect the classification accuracy of the ensemble classifier. Therefore, the multi-objective teaching-learning-based optimization (MO-TLBO) algorithm is used to carry out ensemble pruning of random forest (RF) to improve the classification accuracy and speed of RF. MO-TLBO algorithm aims at maximizing classification accuracy and minimizing classification time, and it can find a sub-forest with higher classification accuracy and faster classification speed. In addition, considering the vast computational time of ensemble pruning of RF via MO-TLBO algorithm under the big data environment, a vote set is constructed to improve the fitness evaluation process. In the Spark platform, the RF improved by the MO-TLBO algorithm (MO-TLBO-RF) is parallelized based on data parallelism. The Shuffle optimization strategy is proposed to reduce the number of Shuffles in the execution of parallel MO-TLBO-RF. The proposed MO-TLBO-RF is applied to rolling bearing fault diagnosis. The experimental results show that the algorithm can obtain an RF with high fault diagnosis accuracy and fast fault diagnosis speed. The results also prove that the ensemble pruning time can be greatly reduced via the vote set and parallelization of MO-TLBO-RF.


I. INTRODUCTION
Ensemble learning combines multiple base classifiers to form an ensemble classifier, and it has been widely used in biology, transportation, energy, industry, medicine, and other fields [1]-[5]. The vibration signals collected from rolling bearings are time series, and ensemble learning is suitable for time series classification [6]-[8]. Moreover, owing to its high classification accuracy and strong generalization ability, research on fault diagnosis based on ensemble learning has grown steadily [9]-[12]. For example, Li et al. [9] first trained two different feature extractors to enhance feature representation, and then adopted ensemble learning to obtain fault diagnosis results from different classifiers trained by different feature extractors. Yu and Zhao [10] proposed a probabilistic ensemble learning method based on a Bayesian network, which can effectively improve the fault diagnosis accuracy. Ensemble learning can fully utilize multiple fault classifiers to diagnose faults more accurately, but the number of classifiers restricts the diagnosis speed. With the expansion of industrial production scale and the popularization of intelligent manufacturing, the operation of production equipment produces large volumes of data, which poses a new challenge to ensuring both the diagnosis accuracy and the diagnosis speed of equipment fault diagnosis using ensemble learning.
Ensemble pruning selects multiple base classifiers from the ensemble classifier to obtain a smaller but better ensemble classifier [13]. It mainly includes three approaches: pruning based on sorting [14]-[16], pruning based on clustering [17], [18], and pruning based on optimization [19]-[21]. Guo et al. [16] proposed a measurement method based on margin and diversity to evaluate the importance of base classifiers. Using this measure, multiple base classifiers can be selected in descending order to form an ensemble classifier with better performance. Cela and Suárez [18] used an energy-based clustering algorithm to cluster the base classifiers in the ensemble classifier, and obtained an ensemble classifier consisting of a representative classifier of each cluster. Zhou et al. [21] optimized trained neural networks by a genetic algorithm (GA) to obtain a smaller neural network ensemble with better generalization. Moreover, some researchers [22]-[27] combined the three approaches for ensemble pruning. For example, Zhu et al. [26] first filtered base classifiers based on the minimization of margin distance and then used an artificial fish swarm algorithm to find the best ensemble classifier from the remaining base classifiers. Onan et al. [27] combined a clustering algorithm and a multi-objective evolutionary algorithm to find candidate classifiers in each cluster, obtaining a smaller-scale ensemble classifier with better performance. In recent ensemble pruning techniques [28], [29], a meta-heuristic algorithm is first used to generate multiple individuals, the generated individuals are then filtered according to an index such as reduce-error, and finally the remaining individuals are used as the initial population of the meta-heuristic algorithm to search for the best ensemble model.
The above studies show that ensemble pruning can reduce the size of the ensemble classifier and improve the generalization of the ensemble classifier.
Finding the best combination of base classifiers in the ensemble classifier is a non-deterministic polynomial complete (NP-complete) problem [30]. In recent years, single-objective meta-heuristic algorithms have commonly been used for ensemble pruning to find a near-optimal solution in limited time, and the goal is generally the classification accuracy [31], [32]. Furthermore, multi-objective meta-heuristic algorithms can find a satisfactory solution under multiple performance criteria [33]-[35], and some researchers [36]-[40] have explored the application of multi-objective meta-heuristic algorithms in ensemble pruning and achieved good results. For example, Qian et al. [39] proposed a multi-objective particle swarm optimization algorithm to maximize the generalization and minimize the size of the ensemble classifier. Peimankar et al. [40] used a multi-objective evolutionary algorithm to maximize the classification accuracy and diversity. Existing studies use multi-objective meta-heuristic algorithms to effectively improve the classification accuracy and reduce the size of the ensemble classifier. However, they do not take the classification time of the ensemble classifier as a goal. Because the classification time of different base classifiers may differ, ensemble classifiers with the same number of base classifiers may have different classification time. Taking the minimization of the classification time as a goal makes it possible to find the ensemble classifier with the fastest classification speed among ensemble classifiers with the same number of base classifiers, which can further improve the classification speed. In addition, ensemble pruning via meta-heuristic algorithms usually has a huge computational cost under the big data environment.
Among meta-heuristic algorithms, swarm intelligence optimization algorithms, which solve combinatorial optimization problems by imitating biological activities, have been widely favored [41]. In recent years, some new swarm optimization algorithms have been proposed, such as the TLBO algorithm [42], monarch butterfly optimization algorithm [43], slime mould algorithm [44], moth search algorithm [45], hunger games search algorithm [46], Runge-Kutta method [47], and Harris hawks optimization algorithm [48]. Among these algorithms, the TLBO algorithm has no algorithm-specific parameters [49], i.e., only the number of individuals and the number of iterations of the population need to be set. Therefore, the TLBO algorithm is adopted in this paper.
To find a sub-forest with higher classification accuracy and faster classification speed, the MO-TLBO algorithm, whose two goals are the maximization of the classification accuracy and the minimization of the classification time, is proposed for ensemble pruning of RF. Furthermore, to reduce the enormous computational time of ensemble pruning of RF via the MO-TLBO algorithm under the big data environment, the RF improved by the MO-TLBO algorithm is parallelized on Spark according to data parallelism, the Shuffle optimization strategy is proposed, and a vote set is constructed.
The main contributions of this paper are as follows.
• The MO-TLBO algorithm whose two goals are the maximization of classification accuracy and the minimization of classification time is proposed, and a crossover operator with an adaptive crossover rate is designed to better find the best combination of base classifiers.
• Considering the vast computational time of ensemble pruning of RF via MO-TLBO algorithm under the big data environment, a vote set is constructed to improve the fitness evaluation process.
• MO-TLBO-RF is parallelized on Spark according to data parallelism, which greatly reduces the training time, ensemble pruning time, and the classification time of the RF model.
• The Shuffle optimization strategy is proposed to reduce the number of Shuffles in the execution of parallel MO-TLBO-RF, which further reduces the ensemble pruning time.
• A large number of experiments verify the effectiveness of MO-TLBO-RF. The results show that it not only has better fault diagnosis accuracy and generalization but also has faster training speed and fault diagnosis speed under the big data environment.
The rest of this paper is organized as follows. Section II introduces the classic TLBO algorithm. Section III discusses the proposed MO-TLBO-RF and its parallelization. Section IV presents the experimental results and analysis. Section V gives the conclusion of this paper.

II. THE CLASSIC TLBO ALGORITHM
The TLBO algorithm is a novel swarm intelligence optimization algorithm proposed by Rao et al. [42]. The algorithm solves the optimization problem by simulating the knowledge transfer behaviors of teachers and students. The TLBO algorithm is mainly composed of the teaching stage and the learning stage. In the teaching stage, teachers teach students knowledge to improve students' average scores. In the learning stage, students learn from other students to improve their scores. Suppose that there is a population $P = \{x_1, x_2, \cdots, x_n\}$, where $x_i^{old}$ and $x_i^{new}$ represent the i-th learner before and after learning, respectively. $f(x)$ represents the objective function, and the learner with the highest score of $f(x)$ in each iteration is the teacher. The detailed descriptions of the teaching stage and learning stage are as follows.
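In the teaching stage, a teacher teaches students knowledge according to their average scores, and students' positions are updated through knowledge transfer. The new position of student $x_i$ is calculated by
$$x_i^{new} = x_i^{old} + r_i \left(x_{teacher} - T_F \cdot x_{mean}\right) \tag{1}$$
where
$$T_F = \mathrm{round}\left[1 + \mathrm{rand}(0, 1)\right] \tag{2}$$
In (1), $r_i$ is a random floating-point number between 0 and 1, and $x_{teacher}$ represents the teacher. $T_F$ is the learning factor, which is used to determine the change of $x_{mean}$, and $x_{mean}$ represents the average value of positions of all students in one iteration. Only when $f(x_i^{new}) > f(x_i^{old})$ is the position of student $x_i$ updated.
In the learning stage, student $x_i$ randomly selects another student $x_j$ to communicate, and the student who has more knowledge can learn something new. The new position of student $x_i$ is calculated by
$$x_i^{new} = \begin{cases} x_i^{old} + r_i \left(x_i - x_j\right), & f(x_i) > f(x_j) \\ x_i^{old} + r_i \left(x_j - x_i\right), & \text{otherwise} \end{cases} \tag{3}$$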
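As a rough illustration of the two stages, the following sketch runs the classic continuous TLBO for a maximization problem. The function name `tlbo_maximize`, its parameters, and the clipping of positions to `bounds` are assumptions made for this example; they are not taken from [42].

```python
import random

def tlbo_maximize(f, dim, bounds, n_learners=20, n_iters=50, seed=0):
    """Classic TLBO sketch for maximization; returns (best_position, best_score)."""
    rng = random.Random(seed)
    lo, hi = bounds
    clip = lambda v: max(lo, min(hi, v))  # assumption: keep positions inside bounds
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_learners)]
    scores = [f(x) for x in pop]

    for _ in range(n_iters):
        # Teaching stage: shift each learner by r * (teacher - T_F * mean).
        teacher = pop[max(range(n_learners), key=scores.__getitem__)][:]
        mean = [sum(x[d] for x in pop) / n_learners for d in range(dim)]
        for i in range(n_learners):
            r = rng.random()
            tf = rng.choice((1, 2))  # same distribution as round(1 + rand(0, 1))
            cand = [clip(pop[i][d] + r * (teacher[d] - tf * mean[d]))
                    for d in range(dim)]
            s = f(cand)
            if s > scores[i]:  # accept only if the new position is better
                pop[i], scores[i] = cand, s

        # Learning stage: move away from a worse peer or toward a better one.
        for i in range(n_learners):
            j = rng.choice([k for k in range(n_learners) if k != i])
            r = rng.random()
            sign = 1.0 if scores[i] > scores[j] else -1.0
            cand = [clip(pop[i][d] + sign * r * (pop[i][d] - pop[j][d]))
                    for d in range(dim)]
            s = f(cand)
            if s > scores[i]:
                pop[i], scores[i] = cand, s

    best = max(range(n_learners), key=scores.__getitem__)
    return pop[best], scores[best]
```

Because a learner is replaced only when its score improves, the best score in the population never decreases from one iteration to the next.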

III. THE PROPOSED ALGORITHM
This section mainly discusses the basic process of finding the optimal sub-forest (i.e., the best combination of base classifiers) and the MO-TLBO-RF algorithm.

A. THE BASIC PROCESS OF FINDING THE OPTIMAL SUB-FOREST
To further improve the classification accuracy and speed of RF under the big data environment, ensemble pruning can be applied to RF. In this paper, the MO-TLBO algorithm is proposed to combine the decision trees in the original RF model into a sub-forest with high classification accuracy and fast classification speed, as described in the following steps.
Step 1. Get the original RF model. The training set is used to train an RF model with k decision trees.
Step 2. Construct the vote set. The original RF model is used to classify samples in the validation set, and the vote results of all samples and the classification time of each decision tree are recorded.
Step 3. Evaluate the original RF model. Based on the vote results of each decision tree, the classification accuracy of the original RF model for each label in the validation set can be obtained.
Step 4. Reduce the size of the vote set. If the original RF model can achieve 100% accuracy for a certain label, the vote results of the label are removed from the vote set.
Step 5. Find the optimal sub-forest. The MO-TLBO-RF algorithm is used to arrange and combine the decision trees in the original RF model to find the optimal sub-forest, and the vote set is used to complete the fitness evaluation of the sub-forest quickly.
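Steps 2 to 4 can be sketched as follows. This is a minimal single-machine illustration under stated assumptions: each decision tree is modeled as a plain Python callable that maps one sample to a label, and the helper names `build_vote_set`, `majority`, and `drop_solved_labels` are hypothetical.

```python
import time
from collections import Counter, defaultdict

def build_vote_set(trees, X_val, y_val):
    """Step 2: one row per sample, (label, [vote of tree 1, ..., vote of tree n])."""
    votes_by_tree, time_record = [], []
    for tree in trees:
        start = time.perf_counter()
        votes_by_tree.append([tree(x) for x in X_val])
        time_record.append(time.perf_counter() - start)  # time of this tree
    vote_set = [(y, [votes[i] for votes in votes_by_tree])
                for i, y in enumerate(y_val)]
    return vote_set, time_record

def majority(votes):
    """Majority vote over a non-empty list of labels."""
    return Counter(votes).most_common(1)[0][0]

def drop_solved_labels(vote_set):
    """Steps 3-4: drop rows whose label the full forest classifies perfectly."""
    total, correct = defaultdict(int), defaultdict(int)
    for label, votes in vote_set:
        total[label] += 1
        correct[label] += majority(votes) == label
    solved = {lab for lab in total if correct[lab] == total[lab]}
    return [row for row in vote_set if row[0] not in solved]
```

Only the rows of labels that the original forest still misclassifies remain, so every later fitness evaluation scans fewer rows.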

B. MO-TLBO-RF
This section describes the proposed MO-TLBO-RF algorithm, which is shown in Algorithm 1.

1) Weighted Multi-Objective
A multi-objective problem is usually solved by transforming the multiple objectives into a single objective. The common methods are the linear weighting method [50], hierarchical optimization method [51], and Pareto optimization method [52]. The two goals of the proposed MO-TLBO-RF algorithm are the maximization of classification accuracy and the minimization of classification time. Since the two goals can be transformed into a dimensionless single goal, the linear weighting method is adopted in this paper to avoid an excessive amount of calculation.
The classification accuracy is a dimensionless quantity, representing the proportion of correctly classified samples to the total samples. The classification time of the sub-forest is divided by the sum of the classification time of all decision trees in the original RF model, which converts the classification time into a dimensionless quantity representing the proportion of the classification time of the sub-forest to the classification time of the original RF model. Since the classification time is expected to be minimized, a new dimensionless quantity is obtained by subtracting the proportion from 1, which represents the proportion of the classification time of the unused decision trees to the classification time of the original RF model. Therefore, the fitness value of the sub-forest is calculated by (4), where $F_i$ represents the fitness value of the sub-forest i, $t_i$ is the classification time of the sub-forest i, and $t_{all}$ is the sum of classification time of all decision trees in the original RF model. timeWeight is a decimal that constrains MO-TLBO-RF to find the sub-forest with high classification accuracy and less classification time. If the value of timeWeight is too large, MO-TLBO-RF will tend to find the sub-forest with less classification time but low classification accuracy. If the value of timeWeight is too small, MO-TLBO-RF will tend to find the sub-forest with high classification accuracy but more classification time.

Algorithm 1 The MO-TLBO-RF algorithm
Input: An original RF model, the number of individuals in the population w, the iteration number of the population p, voteSet, timeRecord, timeWeight, and the early-stop threshold earlyStop
Output: An optimal sub-forest
1: Initialize the variable counter, the key-value pair counterMap, and the population by a random number generator;
2: Evaluate the fitnesses of individuals in the population by (4);
3: Identify the teacher (i.e., the best solution);
4: for j = 1 to p do
5:   for k = 1 to w − 1 do
6:     Get the new position of student k by (5) and (6);
7:     Get the classification accuracy of student k by voteSet;
8:     Get the classification time of student k by timeRecord;
9:     Evaluate the fitness of new position of student k by (4);
10:    if the new fitness value > the current fitness value then
11:      The current position and fitness are replaced with the new position and fitness;
12:      counterMap(student k) = 0;
13:    else
14:      counterMap(student k) += 1;
15:    end if
16:    Get the new position of student k by (6) and (7);
17:    Evaluate the fitness of new position of student k by (4);
18:    if the new fitness value > the current fitness value then
19:      The current position and fitness are replaced with the new position and fitness;
20:      counterMap(student k) = 0;
21:    else
22:      counterMap(student k) += 1;
23:    end if
24:    if counterMap(student k) ≥ 4 then
25:      Get the new position of student k by mutation operator;
26:      Evaluate the fitness of new position by (4);
27:      The current position and fitness are replaced with the new position and fitness;
28:    end if
29:  end for
30:  Identify the new teacher;
31:  if the new teacher == the current teacher then
     ...
36:  if counter > earlyStop then
37:    break;
38:  end if
39: end for
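The effect of timeWeight can be illustrated with a small sketch. The exact form of (4) is not reproduced in this text, so the combination below (accuracy plus timeWeight times the saved-time proportion) is only an assumed linear weighting, and the function name `fitness` is hypothetical.

```python
def fitness(accuracy, subforest_time, total_time, time_weight):
    """Assumed linear weighting of accuracy and the saved-time proportion.

    ASSUMPTION: F = accuracy + time_weight * (1 - t_i / t_all). The source only
    states that a linear weighting of these two dimensionless terms is used.
    """
    time_saving = 1.0 - subforest_time / total_time
    return accuracy + time_weight * time_saving

# A slower but more accurate sub-forest versus a faster, less accurate one.
slow_accurate = (0.95, 8.0)   # (accuracy, classification time)
fast_rough = (0.90, 2.0)
total = 10.0                  # classification time of the original RF model

small_w = [fitness(a, t, total, 0.05) for a, t in (slow_accurate, fast_rough)]
large_w = [fitness(a, t, total, 0.50) for a, t in (slow_accurate, fast_rough)]
```

Under this assumed form, a small weight ranks the accurate sub-forest first and a large weight ranks the fast one first, which matches the qualitative behavior described for timeWeight.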

2) Adaptive Crossover Operator
The TLBO algorithm was initially used to solve continuous optimization problems, but many studies show that the TLBO algorithm is also suitable for discrete optimization problems. For example, Shao et al. [53] proposed a discrete TLBO algorithm based on a teaching-probabilistic learning mechanism to solve the job-shop scheduling problem. Ender and Tansel [54] combined the TLBO algorithm and GA to solve discrete optimization problems.
To better find the optimal sub-forest, the TLBO algorithm and GA are combined. Binary coding is used to represent the position of an individual in the population, where 1 indicates that the corresponding decision tree is selected, as shown in Fig. 1. Considering that the knowledge learned by students from teachers has two possibilities of continuity and discontinuity, the crossover operator of bit crossover is used to simulate the knowledge transfer in the TLBO algorithm, as shown in Fig. 2. However, the crossover operator brings the problem of parameter adjustment of crossover rate and mutation rate. Rao [49] pointed out that the TLBO algorithm is a heuristic algorithm without algorithm-specific parameters, so there is no additional burden of parameters. Based on this point, a crossover operator with an adaptive crossover rate is designed. Note that when the crossover rate is greater than 1, it will be modified to 0.9 (i.e., a commonly used crossover rate).
In the teaching stage, each student's ability to receive knowledge is affected by the teachers' teaching methods and their learning enthusiasm, so the crossover rate of each knowledge transfer is different. The crossover rate $CR_{teaching}$ is calculated by (5) and (6). In (5), $f(x)_{mean}$ is the average value of students' fitness values, and student_num represents the number of students.
In (6), $T_F$ is the learning factor, which is used to control the fluctuation of the crossover rate to express the effect of learning knowledge.
In the learning stage, student $x_i$ randomly selects another student $x_j$ to communicate, and the student can learn something new from the gap between the two communicating students.
The crossover operator is also used to update the positions of students, and the crossover rate $CR_{learning}$ is calculated by (7), where $f(x_i)$ is the fitness value of student i, and $f(x_j)$ is the fitness value of student j.
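The binary position update can be sketched as a per-bit crossover. The adaptive rates of (5) to (7) are not reproduced here, so the rate is passed in as a plain number. Treating "bit crossover" as an independent per-bit copy from the knowledge source is an assumption, and the names `clamp_rate` and `bit_crossover` are hypothetical.

```python
import random

def clamp_rate(cr):
    """As stated in the text, a crossover rate greater than 1 is reset to 0.9."""
    return 0.9 if cr > 1.0 else cr

def bit_crossover(student, source, cr, rng):
    """Copy each bit from `source` with probability cr, otherwise keep the student's bit.

    ASSUMPTION: `source` is the teacher (teaching stage) or the selected peer
    (learning stage); the per-bit independent copy is an illustrative choice.
    """
    cr = clamp_rate(cr)
    return [s if rng.random() < cr else b for b, s in zip(student, source)]

rng = random.Random(1)
student = [0, 0, 0, 0, 0, 0, 0, 0]   # no decision tree selected
teacher = [1, 1, 1, 1, 1, 1, 1, 1]   # every decision tree selected
child = bit_crossover(student, teacher, 0.5, rng)
```

A higher rate pulls the student's selection of decision trees closer to that of the source, so a larger fitness gap can translate into a stronger knowledge transfer.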

3) Mutation Operator
In the TLBO algorithm, whether in the teaching or learning stage, the old individual is updated to the new individual only when the fitness value of the new individual is higher than that of the old individual. In other words, ideally, the individual can be updated twice in one iteration. Therefore, a counting table is used to record the number of consecutive times each individual has not been updated. When the teaching or learning stage is completed, the count is increased by one if an individual is not updated. When the count reaches four, the individual is considered to have fallen into a local optimal solution, and the mutation operator is carried out. The mutation operator helps the individual jump out of the local optimal solution and reduces the risk of premature convergence. In this paper, a mutation operator of bitwise negation with a 50% mutation rate is used to get a new individual, which directly replaces the old individual to maintain the diversity of the population.
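The counting table and the mutation trigger can be sketched as below. The helper names `mutate` and `update_stall_counter` and the constant `STALL_LIMIT` are hypothetical; the threshold of four and the 50% rate come from the text.

```python
import random

STALL_LIMIT = 4  # consecutive non-updates before mutation, as stated in the text

def mutate(position, rng, rate=0.5):
    """Bitwise negation: each bit is flipped independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in position]

def update_stall_counter(counter, improved):
    """Reset the counter on an update, otherwise count one more stalled stage."""
    return 0 if improved else counter + 1

rng = random.Random(3)
position = [1, 0, 1, 1, 0, 0, 1, 0]
counter = 0
# Two iterations in which neither the teaching nor the learning stage improves.
for improved in (False, False, False, False):
    counter = update_stall_counter(counter, improved)
if counter >= STALL_LIMIT:
    position = mutate(position, rng)  # replaces the old individual directly
    counter = 0
```

Since each iteration contains two stages, a student reaches the limit after two fully stalled iterations.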

4) Vote Set
When a swarm intelligence optimization algorithm is used in ensemble pruning, it is necessary to evaluate the fitness values of generated sub-forests. The traditional fitness evaluation process is as follows. Firstly, the binary coding is used to represent the selections of decision trees in the sub-forest. Secondly, the binary coding of an individual is decoded to get an actual sub-forest, i.e., an ensemble classifier. Thirdly, the validation set is used to evaluate the classification accuracy and classification time of the sub-forest. Finally, the fitness value of the sub-forest is calculated according to (4). In experiments, it is found that the above process is time-consuming when the validation set is large, so the computational time of using a swarm intelligence optimization algorithm for ensemble pruning will be very large under the big data environment. In order to quickly evaluate the fitness of the sub-forest, a vote set is constructed, and the new fitness evaluation process is as follows.
Step 1. The original RF model is used to classify samples in the validation set, and the vote results and classification time of each decision tree are recorded. Thus, the voteSet and timeRecord are obtained.
The voteSet is an $m \times (n + 1)$ matrix whose i-th row contains $label_i$ and the votes $vote_{1,i}, \cdots, vote_{n,i}$, where $label_i$ represents the label of sample i, $vote_{j,i}$ is the result of decision tree j voting on sample i, m is the number of samples in the validation set, and n is the number of decision trees. The timeRecord contains $t_1, \cdots, t_n$ and $t_{all}$, where $t_i$ is the classification time of decision tree i for the validation set, and $t_{all}$ represents the sum of classification time of each decision tree.
Step 2. According to the binary coding of the individual, the vote results and classification time of the corresponding decision trees are extracted from the voteSet and timeRecord, respectively. The vote results of the sub-forest are obtained according to the majority voting method, and the classification time of the sub-forest is obtained by summing the classification time extracted from the timeRecord.
Step 3. According to the sample labels in the voteSet, the classification accuracy of the sub-forest can be obtained, and the fitness value of the sub-forest is calculated by (4).
Compared with the traditional fitness evaluation process, the new process eliminates the decoding stage and avoids the repeated classification of decision trees for the validation set. It is found that the vote set can greatly reduce the time of evaluating the fitness of an individual. Note that the vote set replaces a validation set of m samples, each with d-dimension features and a 1-dimension label, with an $m \times (n + 1)$ matrix. If d is less than n, the voteSet is larger than the validation set, so the computational time is reduced at the cost of more memory space. If d is larger than n, the voteSet is smaller than the validation set, so the vote set reduces both the computational time and the memory space.
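Steps 2 and 3 of the new process can be sketched as follows. The row layout (label, list of votes) and the function name `evaluate_subforest` are assumptions for illustration; scoring an empty sub-forest as zero is also an assumption, since that case is not discussed in the text.

```python
from collections import Counter

def evaluate_subforest(mask, vote_set, time_record):
    """Return (accuracy, classification_time) of the sub-forest encoded by `mask`.

    mask[j] == 1 means decision tree j is selected. No tree is re-run: the
    stored votes and per-tree times are looked up directly.
    """
    selected = [j for j, bit in enumerate(mask) if bit]
    if not selected:
        return 0.0, 0.0  # assumption: an empty sub-forest scores zero
    correct = 0
    for label, votes in vote_set:
        winner = Counter(votes[j] for j in selected).most_common(1)[0][0]
        correct += winner == label
    accuracy = correct / len(vote_set)
    return accuracy, sum(time_record[j] for j in selected)

# Four validation samples voted on by three trees.
vote_set = [(0, [0, 0, 0]), (1, [1, 1, 1]), (1, [0, 1, 1]), (1, [1, 1, 1])]
time_record = [0.2, 0.3, 0.1]

full = evaluate_subforest([1, 1, 1], vote_set, time_record)
pruned = evaluate_subforest([0, 1, 1], vote_set, time_record)
```

In this toy vote set the pruned sub-forest keeps the accuracy of the full forest while dropping the time of the first tree, which is the kind of solution the pruning searches for.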

5) Time Complexity Analysis of the Serial MO-TLBO-RF
In the serial MO-TLBO-RF described in Algorithm 1, the outer for-loop repeats p times (see line 4), the inner for-loop repeats w − 1 times (see line 5), multiple fitness calculations are performed during each execution of the inner for-loop (see lines 9, 17, and 26), and the time complexity of the fitness calculation is $O(w)$. In addition, the time complexity of RF is $O(n(md \log m))$ [55]. Therefore, the time complexity of the serial MO-TLBO-RF algorithm is $O(n(md \log m) + pw^2)$. Although the time complexity of the fitness calculation is small, it becomes the most time-consuming part of the serial MO-TLBO-RF algorithm because of the large amount of data processed and the large number of evaluations.

C. THE PARALLELIZATION OF MO-TLBO-RF
This section discusses the parallel design, Shuffle optimization, and parallel implementation of MO-TLBO-RF.

1) The Parallel Design
Currently, the two main ways of parallelization are task parallelism and data parallelism. The parallelization process of MO-TLBO-RF based on task parallelism is shown in Fig. 3. Each element of the populationRDD includes a vote set, a teacher, and two students, where rStudent i represents a student randomly selected from all students except for the i-th student. The populationRDD includes w − 1 elements, so one iteration of the population is divided into w − 1 computational tasks which can be executed in parallel. Each task needs to complete the teaching stage, learning stage, and mutation. After the w − 1 tasks have been completed, the teacher is updated according to the fitness values of individuals in the population. The parallelization process of fitness evaluation based on data parallelism is shown in Fig. 4. The voteSet is transformed into an RDD voteSetRDD, and an element of voteSetRDD is a row of voteSet. After the update of the student position and the statistics of classification time have been completed on the Spark driver, the fitness evaluation is parallelized as follows. Firstly, the ensemble vote results of the student for each sample are obtained according to its position and the majority voting method. Secondly, the ensemble vote results are compared with the sample labels, where 0 means the wrong classification and 1 means the correct classification. Thirdly, the classification results obtained by each Spark worker are summed and divided by the total number of samples to get the classification accuracy. Finally, the fitness value of the new student position is calculated by (4).
If task parallelism is adopted, each element of populationRDD needs to be regarded as a partition, and each partition needs to contain a voteSet. Because students and the teacher will be updated in each iteration, populationRDD needs to be recreated according to the new population in each iteration, which will cause w − 1 copies of voteSet to be uploaded to the Spark cluster frequently, resulting in large additional data transfer overhead. If data parallelism is adopted, the voteSet used in Algorithm 1 can remain unchanged. In other words, the voteSetRDD can remain unchanged, which means that the voteSet only needs to be uploaded to the Spark cluster once, so the data transfer overhead is small. Therefore, MO-TLBO-RF is parallelized based on data parallelism.

2) The Shuffle Optimization Strategy
The Spark driver is responsible for executing the serial parts of the program, and all worker nodes are responsible for executing RDD computations in parallel. Spark programs need to avoid too many Shuffles, because a Shuffle usually produces a lot of data transfer overhead [56]. Therefore, the fewer Shuffles are performed in a Spark program, the shorter the parallel execution time.
The MO-TLBO-RF algorithm needs to perform 2(w − 1)p fitness evaluations without considering early stop and mutation. As shown in Fig. 4, the process of using the reduce operator to sum each element of statisticsRDD to obtain the classification accuracy will produce a Shuffle, so that the parallel MO-TLBO-RF algorithm needs to perform 2(w−1)p Shuffles. The computational time of the parallel MO-TLBO-RF algorithm can be significantly reduced by reducing the number of Shuffles. Therefore, the parallelizations of the teaching and learning stages in the MO-TLBO-RF algorithm are improved, which can be described as follows. At first all positions of students are updated and all classification time of all students is calculated on Spark driver, then the fitness values of all students are calculated in parallel on multiple worker nodes, and finally each student whose fitness value is lower than the new fitness value is updated. In the parallel MO-TLBO-RF algorithm using the Shuffle optimization strategy, an RDD can be used to calculate the fitness values of all students. As shown in Fig. 5, each element of classifyRDD includes a sample label and the ensemble vote results of all students for the label, the statisticsRDD can be obtained by comparing sample labels and the ensemble vote results of all students, and the process of using the reduce operator to sum each element of statisticsRDD to obtain the classification accuracies of all students will produce a Shuffle. The parallel MO-TLBO-RF algorithm using the Shuffle optimization strategy needs to perform 2p fitness evaluations so that it only needs to perform 2p Shuffles.
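The idea of evaluating all students in one pass can be sketched without a cluster. The example below uses Python's built-in `map` and `functools.reduce` as single-machine stand-ins for the corresponding RDD operators; the helper names and the row layout (label, list of votes) are assumptions, and every mask is assumed to select at least one tree.

```python
from collections import Counter
from functools import reduce

def correct_flags(row, masks):
    """For one vote-set row, return a 0/1 flag per student (1 = correct vote)."""
    label, votes = row
    flags = []
    for mask in masks:
        chosen = [v for v, bit in zip(votes, mask) if bit]
        winner = Counter(chosen).most_common(1)[0][0]
        flags.append(int(winner == label))
    return flags

def accuracies_one_pass(vote_set, masks):
    """One map plus one reduce yields the accuracies of all students at once."""
    mapped = map(lambda row: correct_flags(row, masks), vote_set)
    totals = reduce(lambda a, b: [x + y for x, y in zip(a, b)], mapped)
    return [t / len(vote_set) for t in totals]

def shuffle_counts(w, p):
    """Reduce passes without and with the optimization (no early stop or mutation)."""
    return 2 * (w - 1) * p, 2 * p

vote_set = [(0, [0, 0, 0]), (1, [1, 1, 1]), (1, [0, 1, 1]), (1, [1, 1, 1])]
masks = [[1, 0, 0], [0, 1, 1], [1, 1, 1]]   # positions of three students
accs = accuracies_one_pass(vote_set, masks)
```

Each row now produces a vector with one entry per student, so a single element-wise sum replaces the w − 1 separate sums of the per-student approach.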

3) The Parallel Implementation
The parallel MO-TLBO-RF using the Shuffle optimization strategy based on Spark platform is shown in Algorithm 2, which can be described as follows.
Step 1. Initialize the population. The population is initialized, and the fitness of each individual in the population is evaluated. The individual with the highest fitness value is the teacher, and the rest of the individuals are students.
Step 2. Update students in the teaching stage. All students carry out the teaching stage, the fitness values of all students are calculated in parallel, and each student whose fitness is lower than the new fitness is updated.
Step 3. Update students in the learning stage. All students carry out the learning stage, the fitness values of all students are calculated in parallel, and each student whose fitness is lower than the new fitness is updated.
Step 4. Carry out the mutation. Each student whose position has not been updated four consecutive times carries out the mutation operator, the fitness values of the mutated students are calculated, and the old students are replaced with the mutated students.
Step 5. Update the teacher. The optimal solution of the population is regarded as the new teacher. If the fitness value of the new teacher is larger than the fitness value of the current teacher, the teacher is updated, and the rest of the individuals are students.
Step 6. If the number of consecutive times the teacher has not been updated exceeds earlyStop, or the iteration number of the population exceeds p, the algorithm is stopped, and the current teacher is output as the optimal solution; otherwise, return to Step 2.

Algorithm 2 The parallel MO-TLBO-RF algorithm
Input: An original RF model, the number of individuals in the population w, the iteration number of the population p, voteSetRDD, timeRecord, timeWeight, and the early-stop threshold earlyStop
Output: An optimal sub-forest
1: Initialize the variable counter, the key-value pair counterMap, and the population by a random number generator;
2: Evaluate the fitnesses of individuals in the population by (4);
3: Identify the teacher;
4: for j = 1 to p do
5:   for k = 1 to w − 1 do
6:     Get the new position of student k by (5) and (6);
7:     Get the classification time of student k by timeRecord;
8:   end for
9:   Parallelly calculate the fitness values of all students by (4);
10:  for k = 1 to w − 1 do
11:    if the new fitness value > the current fitness value then
12:      The current position and fitness of student k are replaced with the new position and fitness;
13:      counterMap(student k) = 0;
14:    else
15:      counterMap(student k) += 1;
16:    end if
17:  end for
18:  for k = 1 to w − 1 do
19:    Get the new position of student k by (6) and (7);
20:    Get the classification time of student k by timeRecord;
21:  end for
22:  Parallelly calculate the fitness values of all students by (4);
23:  for k = 1 to w − 1 do
24:    if the new fitness value > the current fitness value then
25:      The current position and fitness of student k are replaced with the new position and fitness;
26:      counterMap(student k) = 0;
27:    else
28:      counterMap(student k) += 1;
29:    end if
30:  end for
31:  for k = 1 to w − 1 do
32:    if counterMap(student k) ≥ 4 then
33:      Get the new position of student k by mutation operator;
34:      Get the classification time of student k by timeRecord;
35:    end if
36:  end for
37:  if the number of mutated students > 0 then
38:    Parallelly evaluate fitnesses of mutated students by (4);
39:    for r = 1 to the number of mutated students do
40:      The current position and fitness of student r are replaced with the new position and fitness;
41:      counterMap(student r) = 0;
     ...

4) Time Complexity Analysis of the Parallel MO-TLBO-RF
In the parallel MO-TLBO-RF described in Algorithm 2, the outer for-loop repeats p times (see line 4), and multiple fitness calculations of the population are performed in parallel during each execution of the outer for-loop (see lines 9, 22, and 38). The time complexity of the parallel fitness calculation of the population is $O(w^2/(uv))$, where u is the number of worker nodes and v is the number of CPU cores within a worker node. Because the computational time of the six inner for-loops (see lines 5, 10, 18, 23, 31, and 39) is very small, their time complexities need not be considered. In addition, the time complexity of the parallel RF algorithm based on Spark is $O(n(md \log m)/(uv))$. Therefore, the time complexity of the parallel MO-TLBO-RF algorithm is $O((n(md \log m) + pw^2)/(uv))$.

IV. EXPERIMENTAL RESULTS AND ANALYSIS
A. EXPERIMENTAL DATASET
To evaluate the effectiveness of the proposed MO-TLBO-RF algorithm, the rolling bearing dataset provided by Paderborn University in Germany [57] is used to carry out experiments. The dataset includes normal state data, outer race fault data, and inner race fault data, which are current signals and vibration signals collected at a sampling frequency of 64 kHz. The fault data include artificial fault data and real fault data. The artificial damages of rolling bearings are made by drilling, an electric engraver, and electrical discharge machining. The real damages of rolling bearings are produced on an accelerated lifetime test rig.
Paderborn University [57] pointed out that vibration signals reflect the state of rolling bearings more accurately than current signals. In view of this, the vibration signals are selected to conduct experiments. In the data preprocessing, firstly, the vibration signals are divided into many samples, and each sample includes 4096 sampling points. Secondly, the wavelet packet decomposition [58] is used to decompose each sample at three levels. The data obtained from the third-level decomposition are used to calculate the wavelet energy to get eight time-frequency features. Thirdly, 10 time-domain features are extracted from the time-domain signal of each sample, which include the mean value, variance, standard deviation, root mean square, skewness, kurtosis, waveform factor, peak factor, pulse factor, and margin factor. Finally, 18-dimensional feature vectors are obtained by concatenating the time-frequency features with the time-domain features.
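The preprocessing can be sketched in plain Python as follows. Several details are assumptions made for illustration and are not stated in this paper: the Haar wavelet is used for the packet decomposition, the energy of a sub-band is the sum of its squared coefficients, population (not sample) moments are used, and the four shape factors follow common textbook definitions. The sketch also assumes a non-constant signal whose length is divisible by 8, which holds for 4096-point samples.

```python
import math

def haar_packet_energies(x, levels=3):
    """Energies of the 2**levels sub-bands of a Haar wavelet packet tree (natural order)."""
    if len(x) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    nodes = [list(x)]
    s = math.sqrt(2.0)
    for _ in range(levels):
        next_nodes = []
        for node in nodes:
            pairs = range(0, len(node), 2)
            approx = [(node[i] + node[i + 1]) / s for i in pairs]
            detail = [(node[i] - node[i + 1]) / s for i in pairs]
            next_nodes.extend([approx, detail])
        nodes = next_nodes
    return [sum(c * c for c in node) for node in nodes]

def time_domain_features(x):
    """The 10 time-domain features named in the text (definitions are assumptions)."""
    n = len(x)
    mean = sum(x) / n
    variance = sum((v - mean) ** 2 for v in x) / n
    std = math.sqrt(variance)
    rms = math.sqrt(sum(v * v for v in x) / n)
    abs_mean = sum(abs(v) for v in x) / n
    sqrt_abs_mean = sum(math.sqrt(abs(v)) for v in x) / n
    peak = max(abs(v) for v in x)
    return {
        "mean": mean,
        "variance": variance,
        "std": std,
        "rms": rms,
        "skewness": sum((v - mean) ** 3 for v in x) / n / std ** 3,
        "kurtosis": sum((v - mean) ** 4 for v in x) / n / std ** 4,
        "waveform_factor": rms / abs_mean,
        "peak_factor": peak / rms,
        "pulse_factor": peak / abs_mean,
        "margin_factor": peak / sqrt_abs_mean ** 2,
    }

def feature_vector(x):
    """8 wavelet-energy features followed by 10 time-domain features."""
    return haar_packet_energies(x, levels=3) + list(time_domain_features(x).values())
```

Because the Haar transform is orthonormal, the eight sub-band energies sum to the energy of the original sample, which gives a quick sanity check on the decomposition.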
To evaluate the rolling bearing fault diagnosis accuracy of the proposed MO-TLBO-RF algorithm under the real environment, the normal state data and real fault data are selected. The categorization of data files selected from the dataset provided by Paderborn University is shown in Table 1. These data are preprocessed to get the dataset DATA A. Moreover, to better evaluate the performance of the proposed algorithm under the big data environment, the sliding window [59] is used to augment the data listed in Table 1, and the augmented data are preprocessed to get the dataset DATA B, whose size reaches 35 GB. DATA A and DATA B are each divided into a training set, a validation set, and a test set according to the ratio of 7:1:2. Specifically, the feature dimension of the training set, validation set, and test set is 18, the dimension of the vote set is 100, and the sizes of the training set, validation set, and test set from DATA B are 24.5 GB, 3.5 GB, and 7.0 GB, respectively.
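A minimal sketch of the two operations described above is given below. It assumes that the sliding window of [59] means cutting overlapping fixed-length segments from a signal; the `stride` parameter is hypothetical, since the overlap used for DATA B is not stated here.

```python
def sliding_windows(signal, window, stride):
    """Cut overlapping segments of length `window`, advancing by `stride` samples."""
    last_start = len(signal) - window
    return [signal[s:s + window] for s in range(0, last_start + 1, stride)]

def split_sizes(total, ratio=(7, 1, 2)):
    """Sizes of the training, validation, and test sets for a given ratio."""
    denom = sum(ratio)
    return tuple(total * r / denom for r in ratio)

# A stride smaller than the window yields more (overlapping) samples.
segments = sliding_windows(list(range(10)), window=4, stride=2)
train_gb, valid_gb, test_gb = split_sizes(35.0)
```

Applied to the 35 GB of DATA B, the 7:1:2 ratio gives the 24.5 GB, 3.5 GB, and 7.0 GB reported above.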

B. EXPERIMENTAL SETTINGS
The Spark cluster used in the experiments includes one master node and four worker nodes. The hardware configuration of the master node includes one quad-core Intel Xeon E3-1225 V5 CPU at 3.3 GHz and 32 GB main memory. The hardware configuration of each worker node includes one eight-core Intel Core i7-9700K CPU at 3.6 GHz and 64 GB main memory. The software configuration of the Spark cluster is as follows: CentOS 8.1, Hadoop 3.2, and Spark 3.0. The parameter settings of RF are shown in Table 2. The detailed explanations of parameters can be found in [60], where numTrees represents the number of decision trees in RF. Increasing the value of numTrees can significantly improve the classification accuracy of RF. However, once numTrees reaches a certain value, the classification accuracy of RF no longer improves, while the training time and classification time of the RF model continue to increase significantly.
The parameters of MO-TLBO-RF are listed in Table 3. timeWeight is used to constrain MO-TLBO-RF to find the sub-forest with high classification accuracy and less classification time. The consequences of improper selection of timeWeight are described in Section III-B1. populationSize is the number of individuals in the population. The choice of this parameter affects the quality of the sub-forest and the ensemble pruning time. iterations denotes the iteration number of the population. earlyStop is the threshold for stopping the algorithm early. To avoid an excessively long execution time, when the number of consecutive iterations in which the teacher has not been updated exceeds the value of earlyStop, the algorithm is considered to have found the optimal sub-forest. Note that the parameters used in this paper are selected by the grid-search method, and the fitness calculated by (4) is used as the evaluation index of the grid-search method. In this paper, to accurately evaluate the performance of a fault diagnosis model, each experiment is repeated 30 times. The measurement results are expressed in the form of the mean and standard deviation (std).

C. MODEL TRAINING AND VERIFICATION
1) Comparison of RF, IRF, gcForest, and MO-TLBO-RF
To evaluate the effectiveness of the proposed MO-TLBO-RF, four different RF algorithms, i.e., Spark-RF [61], Spark-IRF [60], gcForest [62], and MO-TLBO-RF, are each used for fault diagnosis model training with DATA A. Spark-RF is the RF algorithm provided by Spark MLlib, Spark-IRF is an improved RF algorithm based on sub-forest optimization that is implemented with Spark, and gcForest is a recent open-source ensemble learning model. In this experiment, the parameter settings of Spark-RF are listed in Table 2, the parameter settings of Spark-IRF can be found in [60], and the key parameter settings of gcForest are as follows: the number of estimators in each cascade layer is set to 4, the number of decision trees in each estimator is set to 100, and the type of the predictor concatenated to the deep forest is specified as "forest". Table 4 shows the average fault diagnosis accuracies ob-  [57] is consistent with DATA A. The results show that the ensemble pruning of RF via the MO-TLBO algorithm can improve the fault diagnosis accuracy to a certain extent and that the data preprocessing method used in this paper is suitable for real fault data. The average fault diagnosis accuracy of MO-TLBO-RF is 0.07% higher than that of Spark-IRF, which means that the proposed MO-TLBO algorithm is more effective than the similarity-based sub-forest optimization method proposed in [60]. The average fault diagnosis accuracy of MO-TLBO-RF is 0.81% higher than that of gcForest, which indicates that the proposed MO-TLBO algorithm is more suitable for fault diagnosis of rolling bearings than gcForest. As shown in Table 4, compared with Spark-RF, the number of decision trees of Spark-IRF is reduced by 61%, and the number of decision trees of MO-TLBO-RF is reduced by 73%, which indicates that ensemble pruning can effectively reduce the number of decision trees.
Compared with Spark-IRF, the number of decision trees of MO-TLBO-RF is reduced by 31%, which means that a sub-forest with higher fault diagnosis accuracy and fewer decision trees can be found by using the MO-TLBO algorithm. The number of decision trees of MO-TLBO-RF is 1.39% of that of gcForest, which indicates that the structure of MO-TLBO-RF is much simpler than that of gcForest.
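The 31% figure can be cross-checked against the two reductions stated relative to Spark-RF. The helper below is a hypothetical illustration; it works from the rounded percentages quoted in the text rather than from the exact tree counts in Table 4, so a small rounding difference is expected.

```python
def relative_reduction(reduction_a, reduction_b):
    """Reduction of B relative to A when both are stated against the same baseline."""
    remaining_a = 1.0 - reduction_a
    remaining_b = 1.0 - reduction_b
    return 1.0 - remaining_b / remaining_a

# Spark-IRF keeps 39% of the trees and MO-TLBO-RF keeps 27% of them.
irf_to_tlbo = relative_reduction(0.61, 0.73)
```

The result is about 0.308, which agrees with the reported 31% up to rounding of the inputs.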
To further validate the effectiveness of the proposed MO-TLBO-RF in various scenarios, a series of experiments are conducted on 28 different datasets from the UCI machine learning repository [63]. Table 5 shows the comparison of classification accuracies of Spark-RF, gcForest, and MO-TLBO-RF for different datasets. As shown in Table 5, compared with Spark-RF and gcForest, MO-TLBO-RF has better classification accuracies for 22 different datasets. The experimental results show that the ensemble pruning via the MO-TLBO algorithm is effective on most datasets, and it has a minor negative effect on the classification accuracy of RF on a few datasets. Moreover, compared with Spark-RF and gcForest, MO-TLBO-RF has the lowest standard deviations on 20 different datasets, which means that the classification accuracies obtained by MO-TLBO-RF on different datasets are relatively stable. In addition, gcForest has lower classification accuracies on some datasets (such as ecoli, hayesroth, and mammographic mass), and the reason may be that gcForest is not suitable for datasets with low feature dimensions and few samples.
To demonstrate that the experimental results are statistically significant, the Bonferroni-Dunn test is performed on the data from Table 5, and the test details can be found in [64]. In Fig. 6, the critical difference, whose value is 0.60, is marked as CD, and the numbers on the axis represent the average rankings of the three RF algorithms used in the experimental comparison. As can be seen from Fig. 6, the average rankings of Spark-RF, gcForest, and MO-TLBO-RF are 1.86, 2.93, and 1.21, respectively. The difference in average ranking between Spark-RF and MO-TLBO-RF is larger than the critical difference, and the difference in average ranking between gcForest and MO-TLBO-RF is also larger than the critical difference. Fig. 6 therefore shows that MO-TLBO-RF is significantly better in classification accuracy.
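The critical difference can be reproduced from the number of algorithms and datasets. The sketch below uses the usual Bonferroni-Dunn formula CD = q_alpha * sqrt(k(k+1)/(6N)); the significance level is not stated in this section, so alpha = 0.05 with the tabulated critical value 2.241 for k = 3 is an assumption, chosen because it reproduces the reported value of 0.60.

```python
import math

def critical_difference(q_alpha, k, n_datasets):
    """Bonferroni-Dunn critical difference for k algorithms compared on n_datasets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))

Q_ALPHA_005_K3 = 2.241  # assumed: two-tailed Bonferroni-Dunn value, alpha = 0.05, k = 3

avg_rank = {"Spark-RF": 1.86, "gcForest": 2.93, "MO-TLBO-RF": 1.21}
cd = critical_difference(Q_ALPHA_005_K3, k=3, n_datasets=28)

# Rank gap between each baseline and MO-TLBO-RF, compared against the CD.
gaps = {name: rank - avg_rank["MO-TLBO-RF"]
        for name, rank in avg_rank.items() if name != "MO-TLBO-RF"}
significant = {name: gap > cd for name, gap in gaps.items()}
```

Under this assumption the rank gaps are 0.65 for Spark-RF and 1.72 for gcForest, both above the critical difference, consistent with the conclusion drawn from Fig. 6.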

2) Comparison of Different Swarm Intelligence Optimization Algorithms
To evaluate the effectiveness of the MO-TLBO algorithm, three different swarm intelligence optimization algorithms are used for ensemble pruning of RF, i.e., RF improved by multi-objective genetic algorithm (MO-GA-RF), RF improved by multi-objective whale optimization algorithm (MO-WOA-RF), and MO-TLBO-RF. In MO-WOA-RF, the sigmoid function is adopted, which can enable WOA to solve discrete optimization problems [65]. In this experiment, DATA A is used to evaluate the fault diagnosis accuracies of three different algorithms, and DATA B is used to evaluate the fault diagnosis time of three different algorithms on the Spark cluster. For the sake of fairness, the three algorithms use the same RF model trained by the same training set. It is worth noting that when evaluating the fault diagnosis time of the three algorithms, the fault diagnosis model is used to diagnose all data of DATA B. The parameter settings of MO-GA and MO-WOA are shown in Table 6.
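The role of the sigmoid function can be illustrated with a short sketch. It assumes the common transfer-function scheme in which each continuous coordinate is mapped to a probability and then sampled to a 0/1 decision (keep or drop a decision tree); the exact variant used in [65] may differ, and the function names are hypothetical.

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def binarize(position, rng):
    """Map a continuous position to a 0/1 mask: bit i is 1 with probability sigmoid(x_i)."""
    return [1 if rng.random() < sigmoid(x) else 0 for x in position]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
mask = binarize([50.0, -50.0], rng)
```

Strongly positive coordinates are almost always mapped to 1 and strongly negative ones to 0, while coordinates near zero behave like a fair coin flip.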

3) Validation of Model Generalization
To evaluate the generalization of MO-TLBO-RF, the artificial fault data are used to train fault diagnosis models by Spark-RF, Spark-IRF, and MO-TLBO-RF, and the trained models are used to diagnose the real fault data. The artificial fault data used for training and the real fault data used for testing are shown in Table 7, and they are preprocessed according to the data preprocessing process described in Section IV-A, where the test data are divided into a validation set and a test set according to the ratio of 3:7. The same generalization experiment has been carried out by Paderborn University [57] using RF, and the experimental data used in [57] are consistent with the data listed in Table 7. Fig. 8 presents the comparison of fault diagnosis accuracies on real fault data of different RF models trained with artificial fault data. As shown in Fig. 8, the average fault diagnosis accuracy of Spark-RF is 7.6% lower than that of the RF model in [57], which indicates that the data preprocessing method adopted in [57] better enhances the generalization of the RF model. The average fault diagnosis accuracies of Spark-IRF and MO-TLBO-RF are 2.6% and 13.2% higher than that of Spark-RF, respectively, which means that ensemble pruning can enhance the generalization of the RF model. The reason for the low fault diagnosis accuracy of Spark-RF is that it contains a large number of decision trees that cannot accurately diagnose real bearing faults. The average fault diagnosis accuracy of MO-TLBO-RF is 10 model than the sub-forest optimization based on similarity.
The average fault diagnosis accuracy of MO-TLBO-RF is 5.6% higher than that of the RF model in [57], which shows that the proposed MO-TLBO-RF algorithm is more effective.

D. PERFORMANCE ANALYSIS OF MODEL TRAINING AND FAULT DIAGNOSIS
1) Validation of Parallelization Effectiveness
To analyze the influence of parallelization on the training time and fault diagnosis time, the serial MO-TLBO-RF and parallel MO-TLBO-RF are each used to train the rolling bearing fault diagnosis model with DATA B, where the serial MO-TLBO-RF is performed using one CPU core of a single worker node and the parallel MO-TLBO-RF is performed on the cluster with four worker nodes. In addition, the rolling bearing fault diagnosis models are used to diagnose all data of DATA B.  will greatly affect the efficiency of model training. Fig. 9(b) presents the comparison of fault diagnosis time of the serial and parallel fault diagnosis models. As shown in Fig. 9(b), the fault diagnosis times of the serial RF and MO-TLBO-RF are 145.2 minutes and 24.2 minutes, respectively, and the fault diagnosis times of the parallel RF and MO-TLBO-RF are 10.3 minutes and 2.9 minutes, respectively. The fault diagnosis time of the parallel RF is 92.9% lower than that of the serial RF, and the fault diagnosis time of the parallel MO-TLBO-RF is 88.0% lower than that of the serial MO-TLBO-RF. The results show that parallelization can significantly improve the diagnosis speed of rolling bearing fault diagnosis models.
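The reported reductions follow directly from the measured times in Fig. 9(b). The helper names below are hypothetical; the sketch only redoes the arithmetic.

```python
def percent_reduction(before, after):
    """Relative reduction of a time measurement, in percent."""
    return 100.0 * (before - after) / before

def speedup(before, after):
    """How many times faster the second measurement is."""
    return before / after

# Fault diagnosis times in minutes, serial versus parallel, from Fig. 9(b).
rf_reduction = percent_reduction(145.2, 10.3)
tlbo_reduction = percent_reduction(24.2, 2.9)
rf_speedup = speedup(145.2, 10.3)
tlbo_speedup = speedup(24.2, 2.9)
```

Rounded to one decimal, the reductions are 92.9% and 88.0%, matching the text. The corresponding speedups are roughly 14.1 and 8.3 when going from one CPU core to the four-node cluster.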

2) Influence of the Vote Set on Ensemble Pruning
To analyze the influence of the vote set on the ensemble pruning time, the parallel MO-TLBO-RF with the vote set and that without the vote set are each used to train the rolling bearing fault diagnosis model with DATA B on the Spark cluster. For the sake of fairness, the two algorithms use the same validation set to carry out ensemble pruning for the same original RF model.  minutes to carry out ensemble pruning with the vote set, and the ensemble pruning time of MO-TLBO-RF with the vote set is 99.3% lower than that of MO-TLBO-RF without the vote set, which shows that the vote set can greatly reduce the ensemble pruning time. Therefore, the vote set makes it feasible to use multi-objective swarm intelligence optimization algorithms for ensemble pruning under the big data environment.
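The idea behind the vote set can be sketched in plain Python. The sketch assumes that the vote set stores, for every validation sample, the class predicted by each decision tree of the original RF, so that the accuracy of any candidate sub-forest is obtained by counting stored votes instead of running the trees again. The function names and the tie-breaking rule are hypothetical, and the actual implementation operates on a Spark RDD rather than on Python lists.

```python
from collections import Counter

def build_vote_set(trees, validation_features):
    """Run every tree once per validation sample; one row per sample, one column per tree."""
    return [[tree(x) for tree in trees] for x in validation_features]

def subforest_accuracy(vote_set, labels, mask):
    """Majority-vote accuracy of the sub-forest selected by a 0/1 mask over the trees."""
    selected = [t for t, keep in enumerate(mask) if keep]
    if not selected:
        return 0.0
    correct = 0
    for votes, label in zip(vote_set, labels):
        tally = Counter(votes[t] for t in selected)
        # Ties are broken in favour of the smallest class label (an assumption).
        predicted = max(sorted(tally), key=tally.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)
```

Once the vote set is built, each fitness evaluation costs only a pass over stored votes, independent of tree depth, which is consistent with the large reduction in pruning time reported above.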

3) Influence of the Shuffle Optimization Strategy on Ensemble Pruning
To analyze the influence of the Shuffle optimization strategy on the ensemble pruning time in the case of using the vote set, the parallel MO-TLBO-RF with the Shuffle optimization strategy and that without the Shuffle optimization strategy are used to train the rolling bearing fault diagnosis model with DATA B on the Spark cluster, respectively.
As seen in Fig. 11, the average ensemble pruning time of the parallel MO-TLBO-RF without the Shuffle optimization strategy and that with the Shuffle optimization strategy are 19 minutes and 6.7 minutes, respectively, i.e., the Shuffle optimization strategy reduces the ensemble pruning time by 64.7% on average. The results show that the Shuffle optimization strategy can significantly reduce the ensemble pruning time.

4) Comparison with Spark-IRF
To compare the ensemble pruning time and fault diagnosis time of Spark-IRF with those of the parallel MO-TLBO-RF, the two algorithms are used to train the rolling bearing fault diagnosis models with DATA B on the Spark cluster. Fig. 12(a) shows the time spent on original RF model training and ensemble pruning in the process of training fault diagnosis models with Spark-IRF and the parallel MO-TLBO-RF. As shown in Fig. 12(a), the average training time of the original RF model is 173.6 minutes, while the average ensemble pruning times of Spark-IRF and the parallel MO-TLBO-RF are 14 minutes and 6.7 minutes, respectively, accounting for 7.5% and 3.7% of the total model training time. The results show that the average ensemble pruning time of both Spark-IRF and the parallel MO-TLBO-RF is short. Compared with Spark-IRF, the parallel MO-TLBO-RF has less ensemble pruning time, because the time complexity of Spark-IRF is higher, and the computational time of Spark-IRF is related to the complexity of the decision tree, which increases with the dataset size. However, the computational time of the parallel MO-TLBO-RF is not related to the complexity of the decision tree; it is only related to the number of individuals and the size of the vote set. Therefore, the parallel MO-TLBO-RF is more suitable for training rolling bearing fault diagnosis models under the big data environment.
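The two percentages are the share of pruning in the total training time, where the total is the original RF training time plus the pruning time. The helper name below is hypothetical; the sketch only redoes the arithmetic from the values in Fig. 12(a).

```python
def pruning_share(train_minutes, prune_minutes):
    """Share of ensemble pruning in the total model training time, in percent."""
    return 100.0 * prune_minutes / (train_minutes + prune_minutes)

irf_share = pruning_share(173.6, 14.0)    # Spark-IRF
tlbo_share = pruning_share(173.6, 6.7)    # parallel MO-TLBO-RF
```

Rounded to one decimal, the shares are 7.5% and 3.7%, matching the reported values.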
As seen in Fig. 12(b), the average fault diagnosis times of Spark-RF, Spark-IRF, and the parallel MO-TLBO-RF are 10.3 minutes, 3.7 minutes, and 2.9 minutes, respectively. Compared with Spark-RF and Spark-IRF, the fault diagnosis time of the parallel MO-TLBO-RF is reduced by 71.8% and 21.6%, respectively. The results show that the parallel MO-TLBO-RF has a faster fault diagnosis speed than Spark-IRF. This is because the proposed MO-TLBO-RF can find a sub-forest with fewer decision trees, and these decision trees have a shorter classification time. Therefore, the parallel MO-TLBO-RF is more suitable for diagnosing rolling bearing faults under the big data environment.

5) Comparison with gcForest
To further validate the effectiveness of the parallel MO-TLBO-RF, gcForest and the parallel MO-TLBO-RF are used to train the rolling bearing fault diagnosis models with DATA B. Note that gcForest can only use the CPU cores of a single worker node to train a fault diagnosis model. Table 8 presents the comparison of the model training time and fault diagnosis time of gcForest and the parallel MO-TLBO-RF. As shown in Table 8, the model training time of the parallel MO-TLBO-RF is 90.18% lower than that of gcForest, and the fault diagnosis time of the parallel MO-TLBO-RF is 96.02% lower than that of gcForest. Table 8 demonstrates that the parallel MO-TLBO-RF performs better than gcForest in terms of model training time and fault diagnosis time. The reason is that gcForest needs more computational time to build multiple forests, and its diagnosis result is obtained after synthesizing the votes of the multiple forests. In contrast, the parallel MO-TLBO-RF can utilize all worker nodes of the Spark cluster to perform model training and fault diagnosis, it only needs to train a single forest, and its diagnosis result is obtained after synthesizing the votes of fewer decision trees. Table 9 presents the summary of the computational complexity of RF, Spark-RF, Spark-IRF, MO-TLBO-RF, and the parallel MO-TLBO-RF. In Table 9, n is the number of decision trees, m is the number of samples, d is the number of feature dimensions, u is the number of worker nodes, v is the number of CPU cores within a worker node, and the specific meaning of x, l, t, and ϕ can be found in [60]. As can be seen from Table 9, the time complexity of Spark-RF is the lowest, and the time complexity of the parallel MO-TLBO-RF is the second lowest, which demonstrates that the proposed parallel MO-TLBO-RF does not add much complexity while carrying out ensemble pruning of RF.

V. CONCLUSION
This paper proposes an RF improved by the MO-TLBO algorithm to find the sub-forest with high classification accuracy and less classification time. The proposed MO-TLBO-RF is applied to rolling bearing fault diagnosis. The diagnosis accuracy is 99.49% using the fault diagnosis model trained with real fault data. In view of the huge computational time of ensemble pruning of RF via the MO-TLBO algorithm under the big data environment, a vote set is constructed to improve the fitness evaluation process, which reduces the ensemble pruning time by 99.3%. In addition, MO-TLBO-RF is parallelized on Spark, which reduces the ensemble pruning time by 97.2%, increases the training speed of the fault diagnosis model by 13.4 times, and reduces the fault diagnosis time by 92.9%. To effectively reduce the number of Shuffles of MO-TLBO-RF, the Shuffle optimization strategy is proposed, which further reduces the ensemble pruning time by 64.7% in the case of using the vote set.
The size of the validation set will affect the performance of ensemble pruning. In the future, we will try to generate a smaller validation set containing more information for ensemble pruning, which is helpful for finding the sub-forest with better generalization performance in less pruning time.
ZHIBING WANG was born in Hunan, China, in 1974. He received the M.S. degree in computer science and technology from Hunan University of Technology, Zhuzhou, China, in 2010. He is currently an Associate Professor with the School of Computer Science, Hunan University of Technology, Zhuzhou, China. His research interests include industrial internet of things, trusted software, and knowledge graph.
XIAOJUN DENG was born in Hunan, China, in 1974. He received the M.S. degree in computer science and technology from National University of Defense Technology, Changsha, China, in 2004. He is currently a Full Professor with the School of Computer Science, Hunan University of Technology, Zhuzhou, China. His major research interests include industrial big data analysis, industry equipment health management, internet of things, and image processing. He has published many research articles in international conferences and journals, such as IJSN, JCSE, and IEEE Access. VOLUME 4, 2016