A Pruning Optimized Fast Learn++.NSE Algorithm

Due to its many typical applications, fast classification learning of accumulated big data in nonstationary environments is an important and urgent research topic. The recently proposed Learn++.NSE algorithm is one of the important results in this field, and a pruning version, named Learn++.NSE-Error-based, was given for accumulated big data to improve learning efficiency. However, our studies found that the Learn++.NSE-Error-based algorithm often prunes the newly generated base classifier in the next integration, which reduces the accuracy of the ensemble classifier. The newly generated base classifier is very important in the next round of ensemble learning and should be retained. Therefore, the two latest base classifiers are retained without being pruned, and a new pruning algorithm named NewLearn++.NSE-Error-based is proposed. Experimental results on a generated dataset and a real-world dataset show that NewLearn++.NSE-Error-based further improves the accuracy of the ensemble classifier while keeping the same time complexity as the Learn++.NSE-Error-based algorithm. It is suitable for fast classification learning of long-term accumulated big data.

I. Introduction

The method of efficient classification learning of big data that is continuously accumulated in nonstationary environments has become one of the focuses and hotspots in the research field of data mining [1][2][3][4][5][6][7]. For example, in the user comment analysis system of an e-commerce platform, a classification model can be established from positive and negative comment texts to understand the advantages and disadvantages of the enterprise's products, and then continuously improve the products and maintain the enterprise's competitiveness. User comments are affected by the product's life cycle, trends and maturity, which makes this a typical big data classification learning problem in nonstationary environments. As another example, in a smart grid control system, smart meters distributed around the site gradually collect users' electricity consumption information and return it to the data processing center, where a classification model can be mined to predict electricity consumption in advance, formulate energy-saving generation and transmission plans, and smooth the electricity load.
Obviously, users' electricity consumption data is affected by seasons, weather and time, and the smart meters themselves are subject to aging, failure and attenuation. Therefore, the smart control system of the smart grid also faces a typical problem of classification learning for accumulated big data in nonstationary environments. According to whether the algorithms actively detect changes in the generation environment of the accumulated big data, they can be divided into active detection algorithms and passive detection algorithms [8][9][10].
Among the numerous algorithms, the recently proposed Learn++.NSE algorithm [7][11][12] is worth particular attention. Learn++.NSE is a passive, multi-classifier ensemble batch classification algorithm. Experimental results on real-world and generated datasets show that, when classifying accumulated big data in nonstationary environments, Learn++.NSE achieves more accurate and stable classification results than single-classifier algorithms. The literature [11] also points out that, without pruning, Learn++.NSE achieves higher classification accuracy; especially in periodically recurring environments, retaining all base classifiers shows a clear advantage. However, if all base classifiers are retained, the ensemble time of Learn++.NSE keeps growing as the number of base classifiers increases. Therefore, Learn++.NSE also offers a pruning version for better execution efficiency, with two different pruning strategies: pruning according to the accuracy of the base classifiers, and pruning according to the generation time of the base classifiers.
In our previous research, we found that the pruning mechanism of the Learn++.NSE-Error-based algorithm, which performs pruning based on the accuracy of the base classifiers, can be further optimized. The main contributions of this paper include the following three aspects: 1) The shortcomings of the pruning mechanism of the original Learn++.NSE-Error-based algorithm were identified, and a new pruning mechanism was designed that retains the two latest base classifiers without pruning, which builds the foundation for the NewLearn++.NSE-Error-based algorithm.
2) Based on the new pruning mechanism, a new pruning algorithm for Learn++.NSE is proposed, which further improves the accuracy of the ensemble classification model for accumulated big data.
3) The proposed NewLearn++.NSE-Error-based algorithm was compared with the original Learn++.NSE-Error-based algorithm on a program-generated dataset and a real-world dataset.
The experimental results verify that, under the same learning scenario, the NewLearn++.NSE-Error-based algorithm further improves the accuracy of the ensemble classification model compared with the Learn++.NSE-Error-based algorithm, making it suitable for fast classification learning of accumulated big data.

II. The Related Research Work and Algorithms
The algorithms known to us for learning in nonstationary environments can be summarized in the mind map in Figure 1. Among the algorithms for classification learning of big data accumulated in nonstationary environments, online classification algorithms process one instance at a time, which lets them adapt more quickly to changes in the data generation environment, but with poor stability. In addition, online classification algorithms are susceptible to interference from the order of the learning data and from noise data [12].
The batch processing classification algorithm processes a batch of data at a time; it can rely on more data to overcome the interference of noise data, so its classification results are relatively stable. However, if the batch data is multi-source with different distribution probabilities, it is harder for the algorithm to handle. Typical batch processing classification algorithms include IGMM-CD [17], EDTU [18], Learn++.NSE [11], etc. To improve execution efficiency, a batch processing algorithm can make use of a parallel computation mechanism, under which different data batches are analyzed by different computers or different CPU cores.
The parallel computing mechanism accelerates data processing, which makes it very suitable for big data. Take PRLearn++.NSE [19] for example, which was proposed in our previous research to improve the efficiency of classification learning for big data: it uses the old base classifiers as a supplement to the new base classifier, constructing a fast, parallel ensemble mechanism that accelerates the execution of the original Learn++.NSE.
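The per-batch parallelism described above can be sketched as follows. This is a minimal illustration only: the majority-class learner is a placeholder, not the actual PRLearn++.NSE base learner, and a thread pool is used purely for simplicity (a real CPU-bound learner would favor a process pool).

```python
from concurrent.futures import ThreadPoolExecutor

def train_base_classifier(batch):
    """Train a toy majority-class 'classifier' on one batch.
    A batch is a list of (x, y) pairs; a real learner would go here."""
    labels = [y for _, y in batch]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority          # the fitted model

def train_batches_in_parallel(batches, workers=4):
    """Train one base classifier per batch concurrently, so different
    batches are handled by different workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(train_base_classifier, batches))

# Three toy batches -> three base classifiers, trained concurrently.
batches = [[((i,), i % 2) for i in range(10)] for _ in range(3)]
models = train_batches_in_parallel(batches)
print(len(models))  # 3
```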
The single-classifier algorithm generally has a low amount of calculation, but its classification accuracy is also low, and because it must retain the original classification information, it adjusts slowly when processing new data. Typical algorithms include VFDT [20], ODTC [21], iCaRL [22], and the online extreme learning algorithm WOS-ELMK [23], which uses neural networks for learning. The multi-classifier ensemble algorithm is more popular for the classification learning of accumulated big data where the data generation environment may change.
Compared with the single-classifier algorithm, the multi-classifier ensemble algorithm adjusts more easily to a new data generation environment by adding and removing base classifiers, and outdated classification information can be eliminated by deleting base classifiers. By flexibly adding and removing base classifiers and adjusting the weight of each, the multi-classifier ensemble algorithm generally obtains a lower classification error rate and a more stable classification result than the single-classifier algorithm. Typical multi-classifier ensemble algorithms include SEA [24], ONSBoost [25], DWM [26], Learn++.NSE, etc.
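The add/remove-and-weight idea above can be sketched as a minimal weighted-vote ensemble. This is an illustration only, not any specific published algorithm; all names are ours.

```python
from collections import defaultdict

class WeightedVoteEnsemble:
    """Minimal ensemble: base classifiers are callables x -> label,
    each paired with a voting weight."""
    def __init__(self):
        self.members = []            # list of (classifier, weight)

    def add(self, clf, weight):
        self.members.append((clf, weight))

    def remove_lowest_weight(self):
        """Drop the base classifier with the smallest voting weight,
        i.e. the one carrying the most outdated classification info."""
        if self.members:
            self.members.remove(min(self.members, key=lambda m: m[1]))

    def predict(self, x):
        """Weighted majority vote over all current members."""
        votes = defaultdict(float)
        for clf, w in self.members:
            votes[clf(x)] += w
        return max(votes, key=votes.get)

ens = WeightedVoteEnsemble()
ens.add(lambda x: "yes", 0.9)
ens.add(lambda x: "no", 0.3)
ens.add(lambda x: "no", 0.4)
print(ens.predict(0))        # "yes": 0.9 outweighs 0.3 + 0.4
ens.remove_lowest_weight()   # eliminates the 0.3-weight member
print(len(ens.members))      # 2
```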
The passive detection algorithm assumes that the probability distribution of data generation may change over time; therefore, regardless of whether a change has actually occurred, the classification model is updated whenever a new dataset arrives.
However, as base classifiers accumulate, the update efficiency decreases rapidly. Pruning base classifiers can improve the update efficiency, but useful classification information may be lost, so the trade-off must be weighed according to the actual situation. Typical passive detection algorithms include OLIN [27], Learn++.NSE, etc.
The active detection algorithm attempts to determine whether the probability distribution of data generation has changed, and updates the classification model only when the distribution has changed, avoiding invalid updates and improving execution efficiency. However, its judgment carries risks of false positives and false negatives, especially when disturbed by noise data. Typical active detection algorithms include CUSUM [28], JIT [29], ICI [30], etc. Neither approach is definitively superior; the choice depends on the data generation situation in actual applications. According to the research results in [11], under the same test conditions Learn++.NSE achieves a lower classification error rate than the SEA and DWM algorithms, and it can also handle variable change rates and periodically recurring data generation, which the above algorithms cannot handle well. Therefore, this paper takes Learn++.NSE-Error-based as the optimization research object and further improves the classification accuracy of its classification model.
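As a concrete illustration of active detection, a one-sided CUSUM change detector can be sketched as follows. This is a minimal sketch: parameter names, the slack and threshold values are illustrative, not taken from [28].

```python
def cusum_detector(values, target, slack=0.5, threshold=2.0):
    """One-sided CUSUM: accumulate deviations above `target` beyond a
    slack; signal a change when the cumulative sum exceeds `threshold`.
    Returns the index where the alarm fires, or None."""
    g = 0.0
    for i, x in enumerate(values):
        g = max(0.0, g + (x - target - slack))
        if g > threshold:
            return i          # change signalled here
    return None               # no change detected

# A monitored error rate jumps from 0.1 to 0.9 at index 10; the
# detector fires a few steps after the jump.
stream = [0.1] * 10 + [0.9] * 10
print(cusum_detector(stream, target=0.1))  # 16
```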

III. Learn++.NSE-Error-based Algorithm
Learn++.NSE is a passive, batch-type multi-classifier ensemble algorithm. Its inputs include the successive batches of training data D^t = {(x_i^t, y_i^t)}, i = 1, …, m_t, where m_t represents the number of data points and the batches gradually accumulate to form big data, together with the capacity of the base-classifier ensemble, ensembleSize.

Output: the ensemble classification model H^t.

Perform the following steps for each batch of data, as shown in Fig. 2:
1) If t = 1, initialize the instance weights uniformly.
2) Otherwise, evaluate the current ensemble classifier on the new batch D^t.
3) Update and normalize the instance weights D^t(i), giving higher weight to the instances that the current ensemble classifier misclassifies.
4) Call the classification algorithm on dataset D^t to get the base classifier h_t : X → Y.
5) Calculate the weighted error rate ε_k^t = Σ_{i=1}^{m_t} D^t(i)·[h_k(x_i^t) ≠ y_i^t], k = 1, …, t, of all base classifiers on the current dataset based on D^t. If ε_t^t > 1/2, regenerate the base classifier h_t; if ε_k^t > 1/2 for k < t, set ε_k^t = 1/2. Let β_k^t = ε_k^t / (1 − ε_k^t).
6) Use the sigmoid function to calculate the weighted average normalized error rate β̄_k^t, and assign each base classifier the voting weight W_k^t = log(1/β̄_k^t).

The purpose of this weighting is that, when calculating the error rate of a base classifier, a base classifier that cannot correctly classify the data misclassified by the current ensemble classifier receives a higher penalty and a lower voting weight. In other words, a base classifier that can correctly classify the data the current ensemble classifier misclassifies receives a higher voting weight. In this way, the overall ensemble classifier is positively optimized and strives to identify the data it currently cannot classify correctly.

The research results show that the Learn++.NSE algorithm significantly improves the classification accuracy compared with the single-classifier algorithm, the SEA algorithm and the DWM algorithm [11]. In addition, its pruning version, the Learn++.NSE-Error-based algorithm, has been given a Java implementation in the latest version of the massive online learning platform Massive Online Analysis (MOA).
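Steps 5) and 6) can be sketched as follows. This is a minimal sketch: function names are ours, and only the sigmoid time weighting and the log voting weight follow the formulas above.

```python
import math

def sigmoid_time_weights(t, k, a=0.5, b=10):
    """Sigmoid weights over batches j = k..t for base classifier k:
    recent batches get weight near 1, old ones near 0 (a controls the
    slope, b the cutoff), normalized to sum to 1."""
    w = [1.0 / (1.0 + math.exp(-a * (j - (t - b)))) for j in range(k, t + 1)]
    s = sum(w)
    return [wi / s for wi in w]

def voting_weight(betas, t, k, a=0.5, b=10):
    """Weighted-average normalized error beta_bar over the error
    history, then the voting weight W = log(1/beta_bar).
    `betas[j]` is beta_k^j for j = k..t."""
    w = sigmoid_time_weights(t, k, a, b)
    beta_bar = sum(wi * bi for wi, bi in zip(w, betas))
    return math.log(1.0 / beta_bar)

# A classifier whose normalized errors have been dropping gets a
# positive (useful) voting weight.
betas = [0.8, 0.6, 0.2, 0.1, 0.1]          # beta_k^j for j = 0..4
print(round(voting_weight(betas, t=4, k=0), 3))
```

Note how a low recent error drives beta_bar toward 0 and hence W toward a large positive value, matching the intent described above.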

IV. NewLearn++.NSE-Error-based Algorithm
A careful analysis of the implementation details of the Learn++.NSE-Error-based algorithm shows that, once the ensemble reaches its capacity ensembleSize, pruning by error rate often deletes the newly generated base classifier before the next integration. Since the newest base classifier carries the most up-to-date classification information, it is very important in the next round of ensemble learning and should be retained. The NewLearn++.NSE-Error-based algorithm therefore retains the two latest base classifiers without pruning and prunes among the remaining base classifiers according to their error rates. Its inputs and steps are otherwise the same as those of Learn++.NSE-Error-based, with batches D^t = {(x_i^t, y_i^t)}, i = 1, …, m_t, where m_t represents the number of data points and the batches gradually accumulate to form big data.

V. The Experimental Results and Analysis
We use the datasets described in the literature [11] (http://users.rowan.edu/~polikar/research/nse/) to analyze the classification learning effect of the proposed NewLearn++.NSE-Error-based algorithm.

A. The SEA dataset
The SEA dataset was introduced when the SEA algorithm was proposed, and it is currently a benchmark dataset for testing classification learning algorithms for accumulated big data in nonstationary environments. The dataset consists of three numeric fields and a class label. To increase the difficulty of classification learning, only two of the three numeric fields are related to the class label; the third is interference data. The class label is a two-valued attribute: when the sum of the two relevant numeric fields is less than the preset threshold, the class label is 2, otherwise 1. In addition, to increase the complexity of ensemble classification learning, 10% noise data was added to the training dataset during the experiment.
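A SEA-style batch can be sketched as follows. Only the labeling rule and the 10% noise level come from the description above; the field range and function names are illustrative.

```python
import random

def make_sea_batch(n, threshold, noise=0.10, seed=0):
    """Generate n instances with three numeric fields in [0, 10].
    Only f1 and f2 determine the label (their sum vs. `threshold`);
    f3 is irrelevant interference. Label 2 when f1 + f2 < threshold,
    else 1; `noise` is the fraction of labels flipped."""
    rng = random.Random(seed)
    batch = []
    for _ in range(n):
        f1, f2, f3 = (rng.uniform(0, 10) for _ in range(3))
        label = 2 if f1 + f2 < threshold else 1
        if rng.random() < noise:     # 10% class noise
            label = 3 - label        # flip 1 <-> 2
        batch.append(((f1, f2, f3), label))
    return batch

batch = make_sea_batch(1000, threshold=8.0)
print(len(batch))  # 1000
```

Drift is then simulated simply by generating later batches with a different `threshold` value, as the experiment below describes.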
At a given moment, the preset threshold changes, simulating a sudden change in the data generation environment.

B. The real-world weather dataset

The second experiment uses the real-world weather dataset from [11], in which 69% of the data were labeled rain=no and 31% were labeled rain=yes. During the experiment, 10% noise data was randomly generated. The accumulated big data was divided by year: every 365 records constitute one batch for classifier training, organized in a train-then-test fashion. That is, the first batch was used for training and the second batch for testing; then the second batch was used for training and the third batch for testing, and so on. A total of 49 classification prediction results were obtained. The average classification error rate and average g-mean of each prediction were recorded, as shown in Figure 9 and Figure 10. Considering the g-mean evaluation value, which accounts for the imbalanced training data categories, Figure 10 shows that the NewLearn++.NSE-Error-based algorithm outperforms Learn++.NSE-Error-based. The rank-sum test on the evaluations of the classification learning results in Table 1 also shows a significant difference between the two algorithms: the newly proposed NewLearn++.NSE-Error-based algorithm performs better than the Learn++.NSE-Error-based algorithm, which reflects the effectiveness of the optimized pruning strategy.
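The train-then-test protocol and the g-mean evaluation described above can be sketched as follows. Function names and the toy constant-prediction model are illustrative only.

```python
import math

def g_mean(y_true, y_pred, classes=(1, 2)):
    """Geometric mean of per-class recalls, suited to imbalanced data:
    it is 0 whenever an entire class is missed."""
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        if not idx:
            return 0.0
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return math.prod(recalls) ** (1.0 / len(recalls))

def train_then_test(batches, fit, predict):
    """Train on batch i, test on batch i+1: one evaluation result per
    consecutive pair, so n batches yield n-1 results."""
    results = []
    for train, test in zip(batches, batches[1:]):
        model = fit(train)
        y_true = [y for _, y in test]
        y_pred = [predict(model, x) for x, _ in test]
        results.append(g_mean(y_true, y_pred))
    return results

# Toy check: 50 yearly batches yield 49 prediction results, matching
# the count reported above.
batches = [[((i,), 1 + i % 2) for i in range(10)] for _ in range(50)]
scores = train_then_test(batches, fit=lambda b: None,
                         predict=lambda m, x: 1)
print(len(scores))  # 49
```

The constant-prediction model misses class 2 entirely, so every g-mean here is 0.0, illustrating why g-mean is a stricter measure than plain accuracy on imbalanced labels.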

VI. CONCLUSION
This paper optimized the pruning mechanism of the classic classification learning algorithm Learn++.NSE. Although this research has made some progress, there are still shortcomings. For example, the pruning strategy proposed so far remains unsatisfactory: once a base classifier is deleted by the current algorithms, it can never be selected again, which may prevent the ensemble classifier from tracking changes in the dataset well and reduce its accuracy. The reuse of deleted base classifiers will be a key breakthrough in future research.