DME: An Adaptive and Just-in-Time Weighted Ensemble Learning Method for Classifying Block-Based Concept Drift Stream

This study proposes a novel incremental learning algorithm called distribution matching ensemble (DME) in the context of adaptive weighted ensemble learning. In particular, DME estimates the distribution of each received data block with a Gaussian mixture model (GMM) and reserves the corresponding distribution information, while also maintaining a group of classifiers in a buffer. When a new data block that needs to be predicted is received, the similarity between its distribution and each reserved distribution is calculated by the Kullback-Leibler (KL) divergence, and these similarities then guide the weight assignment of each corresponding classifier to make an adaptive ensemble decision. DME thereby discards the underlying hypothesis that the most recent labeled data block always has the distribution most similar to that of the current unlabeled data block. In addition, to avoid unbounded growth of the ensemble buffer during incremental learning, we also develop two dynamic classifier update rules. Experimental results on synthetic and real-world streaming datasets show that the proposed DME algorithm is able to track and adapt to various types of concept drift just in time. In particular, on data streams with frequent reoccurring drifts, DME significantly outperforms several state-of-the-art algorithms.


I. INTRODUCTION
Learning from data streams, also called online learning or incremental learning, has been demonstrated to be very useful for a growing number of applications in which data become available continuously over time and/or there are time and space constraints. Examples of such applications are sensor network monitoring [1], malware detection [2], credit card fraud detection [3], spam filtering [4], and traffic management [5].
Two key challenges are posed by learning from data streams. The first is that the learner has to process each training example or each block of training instances once, "on arrival", without storage or reprocessing [6], [7].
The associate editor coordinating the review of this manuscript and approving it for publication was Joanna Kołodziej.
The second challenge is that the online environment is often non-stationary, and the data to be predicted by the learning model may change dynamically over time, a phenomenon called concept drift [8], [9]. That is to say, a learner deployed in a streaming environment must learn examples in one pass and must also adapt to dynamic changes of the data distribution, i.e., concept drift. The first requirement can be satisfied either by modifying conventional static learning algorithms so that they use only newly arriving data to tune the model parameters [10], [11], or by adopting ensemble learning models that construct a new learner on each newly arriving block of instances [12], [13], [14], [15], [16], [17]. The second requirement, i.e., handling concept drift, is more complicated because there are a number of distribution drift types [18], [19]. There are again two different strategies for adapting to concept drift: one adds a forgetting mechanism to a single learning model so that it gradually forgets old knowledge [20], and the other adaptively assigns weights to the classifiers in an ensemble so that it adapts to changes in the data distribution [21]. Compared with a single model, an adaptive weighted ensemble is a more flexible way to implement data stream learning [22].
Owing to their modularity, ensembles adapt to the current environment of a data stream by modifying their own structure, i.e., by replacing old component classifiers with new ones or by updating the weights in the voting formula [23]. According to the type of incoming data, adaptive weighted ensembles can be further divided into two categories: online ensembles and block-based ensembles [22]. The former refers to online learning, where the ensemble processes each training instance only once without storing or reprocessing it, whereas the latter refers to incremental learning that processes incoming data in blocks instead of processing each training example separately [24], [25].
Block-based ensembles work in environments where instances arrive in blocks (also called chunks). The component classifiers in the ensemble are evaluated on the most recently received labeled block, and their weights are updated accordingly. The weakest component classifier is selected based on the results of the evaluation and is then replaced by a new (candidate) classifier trained on the most recently received labeled block. The updated ensemble then makes predictions for the next unlabeled block. The Streaming Ensemble Algorithm (SEA) [26], which was the first of such ensemble algorithms, and the Accuracy Weighted Ensemble (AWE) [27], which was subsequently proposed, are the most representative methods of block-based adaptive weighted ensembles. Such algorithms cope well with gradual concept drifts, but they always respond to sudden concept drifts with a delay, as noted in [28]. This is because the existing block-based ensemble algorithms always weight component classifiers based on their performance on the most recently received labeled block and then make predictions for the next unlabeled block. Consequently, if a concept drift occurs on the next unlabeled block, the ensemble is likely to fail to provide a sufficiently accurate classification for it. In other words, the existing block-based adaptive weighted ensemble methods utilize only the information of the labeled blocks and ignore the useful information embedded in the next unlabeled block; thus they cannot adapt to concept drift immediately or provide just-in-time adaptive weight assignment for the component classifiers in the ensemble.
To solve the above problem, we propose a novel data stream learning algorithm called Distribution Matching Ensemble (DME), which also utilizes an ensemble framework with dynamically updated classifier weights. Unlike previous work, DME adopts a delayed weight assignment strategy that is activated when and only when the new unlabeled block is received. The weight assignment depends on the similarity between the data distribution of each reserved labeled block and that of the newly received unlabeled block. In particular, the Gaussian mixture model (GMM) is used to estimate the data distribution for the following two reasons: 1) a GMM is capable of approximating any distribution, and 2) a specific GMM [29] can be described with only a few parameters, so storing them does not violate the one-pass rule. Additionally, the Kullback-Leibler (KL) divergence [30] is used to estimate the similarity between two GMM distributions, and this similarity is then used to determine the weight of the corresponding classifier. The more similar two data distributions are, the higher the weight that should be assigned. Finally, we also design two dynamic classifier update rules to avoid unbounded growth of the set of component classifiers in the ensemble. We compared the proposed DME algorithm with five other state-of-the-art block-based adaptive weighted ensemble approaches on both synthetic and real streaming datasets. The results show that the proposed DME algorithm can respond to concept drift just in time and provides a more appropriate weight allocation scheme than the compared methods. In particular, when reoccurring concept drift occurs frequently, the superiority of the DME algorithm is more pronounced.
The rest of this paper is organized as follows. Section II presents the basic concepts and related work in the context of block-based adaptive weighted ensembles. In Section III, we describe the structure and procedure of the proposed DME algorithm in detail. In Section IV, the experimental settings, results, and analysis are provided in turn. Finally, Section V summarizes the contributions and findings of this work and indicates future work.

A. BASIC CONCEPTS
Without loss of generality, we suppose that a data stream can be divided into $n$ equal-sized blocks $B_1, B_2, \ldots, B_n$, each of which has $d$ examples. Each example can be represented as a two-tuple $(x, y)$, where $x$ is a vector of attribute values and $y \in \{C_1, C_2, \ldots, C_l\}$ denotes the class label of $x$, with $l$ denoting the number of classes. When a new block $B_i$ arrives, the weights of the component classifiers $CL_j \in \mathcal{E}$, where $\mathcal{E}$ denotes the ensemble buffer, are calculated by a classifier quality evaluation function $Q(\cdot)$, which is also called the weighting function. After the component classifiers are evaluated, a new classifier is built on the block $B_i$. If the current size of the ensemble is smaller than $k$, the given size of the ensemble buffer $\mathcal{E}$, then the newly built classifier is added to the ensemble; if the ensemble is already full, the weakest component classifier in $\mathcal{E}$ according to $Q(\cdot)$ is replaced by the new classifier. The generic learning procedure of the block-based adaptive weighted ensemble learning paradigm is described in Algorithm 1.
Algorithm 1
Output:
    $\mathcal{E}$: an updated ensemble with $k$ component classifiers
for each newly received data block $B_i \in S$ do
    make a decision for each instance in $B_i$ by combining the classifiers in $\mathcal{E}$;
    build a new component classifier $C$ using $B_i$ after it acquires real class labels;
    calculate the weight of each classifier $CL_j$ in the ensemble buffer $\mathcal{E}$ using $Q(\cdot)$;
    if $|\mathcal{E}| < k$ then
        $\mathcal{E} \leftarrow \mathcal{E} \cup \{C\}$;
    else
        replace the weakest ensemble member defined by $Q(\cdot)$ in $\mathcal{E}$ with $C$;
    end if
end for

Concept drift is a phenomenon in which the statistical properties of a target domain change over time in an arbitrary way [18], [19]. If all examples in the data stream come from the same concept, i.e., the same joint probability $P(x, y)$, then the concept is stable; otherwise, concept drift occurs. Generally speaking, concept drift can be roughly divided into four types, which are described in Fig. 1. In Fig. 1(a), a new concept occurs within a short period of time and replaces the previous concept completely, which is known as sudden (abrupt) drift. Sudden drift is apt to directly deteriorate the classification ability of the classifier or even make the classifier completely useless. In comparison with sudden drift, both gradual drift (see Fig. 1(b)) and incremental drift (see Fig. 1(c)) change the previous concept slowly, and the change lasts for a long period. To make clear the difference between gradual drift and incremental drift, the term intermediate concept [8], which describes the transformation between the starting concept and the ending concept, needs to be introduced. Specifically, the intermediate concept of gradual drift is a mixture of the starting concept and the ending concept, that is, the probabilities of observing examples from the two distributions change: one gradually increases and the other gradually decreases.
For incremental drift, in contrast, the intermediate concept is a unitary change of the data distribution, that is, the starting concept gradually changes until it becomes the ending concept. Reoccurring drift (see Fig. 1(d)) means that an old concept emerges again during learning. Here, the old concept may recur suddenly, gradually, or incrementally.
For different concept drift types, the learning algorithm is required to prepare different response mechanisms and to achieve different goals. For sudden drift, the algorithm should focus on detecting the drift rapidly and recovering in time to adapt to it. For gradual and incremental drifts, the algorithm needs to track the drift and limit the resulting performance loss as much as possible. For reoccurring drift, the emphasis is on the reusability of historical concepts. Most existing stream learning algorithms aim at dealing with one specific drift type and thus generally perform poorly when other drift types emerge. A successful data stream learning algorithm should be robust to any concept drift type, that is, it should be able to detect drift rapidly and change itself to adapt to the drift in time.

B. RELATED WORK
As indicated in Section I, although ensemble learning was initially developed for static environments to improve the classification accuracy and generalization ability of learning algorithms, it has been generalized to dynamic environments [22].
As the first dynamic ensemble algorithm proposed for streaming environments, SEA [26] fixes the number of component classifiers in the ensemble buffer and uses a heuristic strategy based on both accuracy and diversity to evaluate the quality of classifiers and update the ensemble buffer. Specifically, SEA considers the contribution of each component classifier to the decision to be equal; it therefore uses majority voting rather than weighted voting to make the final predictions. As a result, it cannot track and adapt to concept drifts in time. Therefore, SEA is a dynamic ensemble learning algorithm, but it is not an adaptive weighted ensemble learning algorithm.
AWE [27] is, in the true sense, the first generic adaptive weighted ensemble learning paradigm for dealing with concept drifts in data streams. In fact, AWE can be seen as an improved version of SEA. AWE first evaluates all classifiers in the ensemble buffer on a newly received labeled data block, and then trains a new classifier on that block to replace the weakest one in the ensemble buffer. Here, the quality evaluation relies on the error rate on the newly received labeled data block. Finally, AWE uses the evaluation results to assign weights to the classifiers in the updated buffer in order to predict the next incoming unlabeled data block. In comparison with SEA, AWE improves the capacity to adapt to concept drifts to a large extent. However, it is time-consuming because, to guarantee the accuracy of the quality evaluation, each newly trained classifier is evaluated by ten-fold cross-validation (10-CV).
Based on the framework of AWE, Brzezinski and Stefanowski proposed the Accuracy Updated Ensemble (AUE1) algorithm [31]. In AUE1, they designed a simpler weighting function to assign and update the weights of the component classifiers. In addition, if necessary, the component classifiers can learn incrementally. AUE1 offers higher accuracy than AWE, but it inherits 10-CV from AWE and is thus still time-consuming. Furthermore, Brzezinski and Stefanowski proposed the AUE2 algorithm [32], which can be seen as an improved version of AUE1. Specifically, AUE2 combines the error rate-based weighting mechanism with a specific incremental learning algorithm, namely the very fast decision tree (VFDT or Hoeffding tree) [33]. In addition, AUE2 assumes that the most recently received data block offers the closest representation of current and future data distributions. Therefore, it always regards the classifier built on the most recently received block as a 'perfect' classifier, and accordingly assigns it the largest weight when classifying the next data block. In contrast to AUE1, AUE2 abandons 10-CV and requires component classifiers to learn incrementally.
Learn++ is a family of algorithms that includes many adaptive weighted ensemble learning algorithms for data stream environments, such as Learn++.NC [34], Learn++.MF [35], and Learn++.NSE [36]. Within the Learn++ family, the Learn++.NSE [36] algorithm differs from the others in that it reserves all component classifiers in the ensemble buffer without replacement. In Learn++.NSE, the weight of each component classifier is updated by a sophisticated error rate-based weighting mechanism that considers its performance on both past and current data blocks. If the cumulative contribution of a component classifier drops to zero, it abstains from the decision until it performs well again. The advantage of Learn++.NSE is that it can effectively handle reoccurring concept drifts. However, the continual growth of the set of component classifiers also adds to the storage burden of the learning system.
Dynamic Weighted Majority (DWM) [37], proposed by Kolter and Maloof, is one of the most popular online learning algorithms that consider adaptation to concept drifts. However, DWM was originally designed to deal only with data streams in which instances are received one-by-one. Dynamic weighted majority for imbalance learning (DWMIL) [38] is an extended version of DWM adapted to the chunk-by-chunk environment. Specifically, DWMIL assigns an initial weight of 1 to each newly trained component classifier and then gradually decreases its weight according to the error rate feedback on subsequently received blocks; once its weight falls below a pre-designated threshold, the classifier is removed from the ensemble.
We note that the existing adaptive weighted ensemble learning algorithms share a common underlying hypothesis, that is, the most recently received data block always offers the closest representation of current and future data distributions. This seems sound for data streams with only gradual and/or incremental drifts; however, when sudden or reoccurring drifts emerge, the hypothesis is clearly invalid. In addition, nearly all existing methods assign weights adaptively depending only on error rates on the most recent labeled data block, while ignoring the information in the next unlabeled block. Consequently, the so-called weight adaptation is delayed by at least one block, so the learning algorithms cannot track and adapt to concept drifts in time or provide a just-in-time decision. This motivates us to design a more robust and just-in-time drift adaptation ensemble learning method in this study.

A. GAUSSIAN MIXTURE MODEL
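The Gaussian mixture model (GMM) [29] is a popular approach to estimating the probability density function (pdf) of a distribution. To approximate an arbitrary distribution, GMM assumes that an unknown pdf can be represented as a weighted sum of $m$ known Gaussian pdfs, i.e.,
$$p(x) = \sum_{i=1}^{m} \omega_i \mathcal{N}_i(x), \tag{1}$$
where $\omega_i$ denotes the weight of each Gaussian pdf, which can also be seen as the prior probability of each cluster from the viewpoint of clustering, and $\mathcal{N}_i$ denotes a Gaussian pdf in $x$ with mean vector $\mu_i$ and covariance matrix $\Sigma_i$. The parameters of the GMM are estimated by the expectation-maximization (EM) algorithm. At the E stage, the probability that each example belongs to each cluster is calculated by Eq. (2), and then the example will be assigned to the cluster with the highest probability.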
At the M stage, the parameters of pdf of each cluster are calculated using the following equations, so that a mixture pdf can be modeled.
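For reference, one EM iteration for a GMM is commonly written in the soft-assignment form below. This is the standard textbook version, given as a sketch under stated assumptions; it is not necessarily identical to the equations used in this study. The responsibility $\gamma_{ij}$ and the example index $j$ are notation assumed here, with $d$ the number of examples in the block and $z$ the dimension of $x$.

```latex
\begin{align*}
\mathcal{N}(x;\mu_i,\Sigma_i) &= (2\pi)^{-z/2}\,|\Sigma_i|^{-1/2}
  \exp\!\Big(-\tfrac{1}{2}(x-\mu_i)^{\top}\Sigma_i^{-1}(x-\mu_i)\Big) \\
\text{E stage:}\quad
\gamma_{ij} &= \frac{\omega_i\,\mathcal{N}(x_j;\mu_i,\Sigma_i)}
                    {\sum_{l=1}^{m}\omega_l\,\mathcal{N}(x_j;\mu_l,\Sigma_l)} \\
\text{M stage:}\quad
\omega_i &= \frac{1}{d}\sum_{j=1}^{d}\gamma_{ij}, \qquad
\mu_i = \frac{\sum_{j=1}^{d}\gamma_{ij}\,x_j}{\sum_{j=1}^{d}\gamma_{ij}}, \qquad
\Sigma_i = \frac{\sum_{j=1}^{d}\gamma_{ij}\,(x_j-\mu_i)(x_j-\mu_i)^{\top}}
                {\sum_{j=1}^{d}\gamma_{ij}}
\end{align*}
```

Assigning each example to its most probable cluster, as described for the E stage, corresponds to replacing $\gamma_{ij}$ by a 0/1 indicator in the M-stage sums.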
As long as a suitable number of clusters $m$ is designated, GMM can in theory accurately approximate any distribution. Therefore, in this study, we use the GMM approach to estimate the distribution of each data block in the data stream.

B. KULLBACK-LEIBLER DIVERGENCE
The Kullback-Leibler (KL) divergence [30], also known as the relative entropy, is often used to estimate the similarity and/or dissimilarity between two pdfs. For two pdfs $f$ and $g$ defined on $\mathbb{R}^z$, where $z$ is the dimension of the observed vectors, their KL divergence is defined as:
$$D_{KL}(f\|g) = \int f(x)\,\log\frac{f(x)}{g(x)}\,dx. \tag{6}$$
When $f$ and $g$ are both Gaussian pdfs, with mean vectors $\mu_f$, $\mu_g$ and covariance matrices $\Sigma_f$, $\Sigma_g$, the KL divergence has a closed-form expression:
$$D_{KL}(f\|g) = \frac{1}{2}\left[\log\frac{|\Sigma_g|}{|\Sigma_f|} + \mathrm{tr}\big(\Sigma_g^{-1}\Sigma_f\big) - z + (\mu_f-\mu_g)^{\top}\Sigma_g^{-1}(\mu_f-\mu_g)\right]. \tag{7}$$
However, for GMM pdfs, there is no closed-form expression. In such a case, the KL divergence can be approximated by other functions that can be calculated efficiently. In this study, we adopt the variational approximation strategy proposed by Hershey and Olsen [39] to address this problem.
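The closed-form Gaussian case can be illustrated with a short sketch. To stay within the Python standard library, it assumes diagonal covariance matrices, so each Gaussian is given by a list of means and a list of per-dimension variances; the function name `kl_gauss_diag` is hypothetical and not taken from the study.

```python
import math


def kl_gauss_diag(mu_f, var_f, mu_g, var_g):
    """KL(f || g) for two Gaussians with diagonal covariances.

    Assumption: covariances are diagonal, so the determinant, trace and
    quadratic terms of the general closed form reduce to per-dimension sums.
    """
    total = 0.0
    for mf, vf, mg, vg in zip(mu_f, var_f, mu_g, var_g):
        total += math.log(vg / vf)      # log-determinant ratio
        total += vf / vg - 1.0          # trace term minus one per dimension
        total += (mf - mg) ** 2 / vg    # Mahalanobis term
    return 0.5 * total


# Two unit-variance 1-D Gaussians whose means differ by 1.
print(kl_gauss_diag([0.0], [1.0], [1.0], [1.0]))  # 0.5

# The divergence is asymmetric when the variances differ.
print(kl_gauss_diag([0.0], [1.0], [0.0], [4.0]))
print(kl_gauss_diag([0.0], [4.0], [0.0], [1.0]))
```

With full covariance matrices, the same three terms would be computed with a determinant, a matrix inverse, and a quadratic form.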
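Let $L_f(g) = E_{X \sim f}[\log g(X)]$. The KL divergence can then be decomposed as:
$$D_{KL}(f\|g) = L_f(f) - L_f(g). \tag{8}$$
Write the two GMM pdfs as $f = \sum_a \pi_a f_a$ and $g = \sum_b \omega_b g_b$, where $f_a$ and $g_b$ are Gaussian components with weights $\pi_a$ and $\omega_b$. Lower bounds for $L_f(f)$ and $L_f(g)$ can be obtained by using Jensen's inequality:
$$L_f(f) \geq \sum_a \pi_a \left[\log \sum_{a'} \pi_{a'}\, e^{-D_{KL}(f_a\|f_{a'})} + L_{f_a}(f_a)\right], \tag{9}$$
$$L_f(g) \geq \sum_a \pi_a \left[\log \sum_b \omega_b\, e^{-D_{KL}(f_a\|g_b)} + L_{f_a}(f_a)\right]. \tag{10}$$
These lower bounds can be used as approximations for the corresponding quantities, which yields the approximation of the KL divergence [40]:
$$D_{KL}(f\|g) \approx \sum_a \pi_a \log \frac{\sum_{a'} \pi_{a'}\, e^{-D_{KL}(f_a\|f_{a'})}}{\sum_b \omega_b\, e^{-D_{KL}(f_a\|g_b)}}. \tag{11}$$
Note that a larger KL divergence between two pdfs means that the two pdfs are more different.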
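A minimal sketch of the variational approximation of Hershey and Olsen [39] follows. For mixtures $f = \sum_a \pi_a f_a$ and $g = \sum_b \omega_b g_b$, it evaluates $\sum_a \pi_a \log\big(\sum_{a'} \pi_{a'} e^{-D_{KL}(f_a\|f_{a'})} / \sum_b \omega_b e^{-D_{KL}(f_a\|g_b)}\big)$. The component representation as `(weight, means, variances)` tuples, the diagonal-covariance restriction, and all function names are assumptions made for illustration.

```python
import math


def kl_gauss_diag(mu_f, var_f, mu_g, var_g):
    """Closed-form KL(f || g) for Gaussians with diagonal covariances."""
    return 0.5 * sum(
        math.log(vg / vf) + vf / vg - 1.0 + (mf - mg) ** 2 / vg
        for mf, vf, mg, vg in zip(mu_f, var_f, mu_g, var_g)
    )


def _log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) to avoid underflow."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def kl_gmm_variational(f, g):
    """Variational approximation of KL(f || g) between two GMMs.

    f and g are lists of (weight, means, variances) with positive weights.
    """
    total = 0.0
    for w_a, mu_a, var_a in f:
        # log of sum_{a'} pi_{a'} exp(-KL(f_a || f_{a'}))
        log_num = _log_sum_exp(
            [math.log(w) - kl_gauss_diag(mu_a, var_a, mu, var) for w, mu, var in f]
        )
        # log of sum_b omega_b exp(-KL(f_a || g_b))
        log_den = _log_sum_exp(
            [math.log(w) - kl_gauss_diag(mu_a, var_a, mu, var) for w, mu, var in g]
        )
        total += w_a * (log_num - log_den)
    return total


f = [(0.5, [0.0], [1.0]), (0.5, [4.0], [1.0])]
g = [(0.5, [0.5], [1.0]), (0.5, [4.5], [1.0])]
print(kl_gmm_variational(f, g))  # small positive value: the mixtures are close
print(kl_gmm_variational(f, f))  # 0.0: a mixture matches itself
```

For single-component mixtures the expression reduces to the closed-form Gaussian divergence, which gives a convenient sanity check.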
To demonstrate the effectiveness and rationality of using KL divergence to adaptively assign weights for classifiers in ensemble, we provide a synthetic example in Fig.2.
Without loss of generality, in our example, we suppose that the data are two-dimensional and that there are only two classes. Taking (6,7) and (4,4) as the centroids of the two classes, we first generate 250 instances for each class, each following a Gaussian distribution. The data block generated in this way is called $B_1$. Then we rotate the centroids of the two classes counterclockwise to sequentially generate three new data blocks $B_2 \sim B_4$. Specifically, the distribution of $B_4$ is exactly the same as that of $B_1$. These data blocks simulate two different drift types: sudden drift and reoccurring drift. Taking any one block as the training block, we calculated the KL divergence between it and every other block, and meanwhile tested the accuracy on that block. Without loss of generality, we use naïve Bayes as the classification algorithm. The results are presented in Table 1.
The results in Table 1 support two conclusions: 1) the smaller the KL divergence between two distributions is, the more similar these two distributions are, and 2) a classifier trained on a data block is able to provide a better prediction for a data block with a similar distribution, but tends to give a worse prediction for a data block with a significantly different distribution.
The above conclusions indicate that the KL divergence is an effective tool for adaptively assigning weights to the classifiers in an ensemble. Additionally, it is worth noting that the KL divergence is asymmetric, that is, $D_{KL}(f\|g) \neq D_{KL}(g\|f)$ in general. However, the results in Table 1 suggest that $D_{KL}(f\|g) \approx D_{KL}(g\|f)$ for these blocks; thus, no matter which block is used as the reference, the KL divergence reflects the similarity/dissimilarity between two distributions well.

C. DISTRIBUTION MATCHING ENSEMBLE (DME) METHOD
In this study, we present a novel adaptive weighted ensemble learning algorithm called DME for block-based data stream learning. Unlike previous methods, which associate weights with error rates, DME adopts a new weight assignment rule that correlates weights with KL divergences. In addition, it modifies the principle of assigning weights in advance, which is used by nearly all previous methods. That is to say, DME adopts a delayed weight assignment strategy that provides adaptive weights when and only when a new unlabeled data block is received. Therefore, DME tracks and adapts to concept drifts just in time. In other words, DME discards a possibly wrong underlying hypothesis, namely that the most recently received labeled data block always offers the closest representation of current and future data distributions. At the least, this hypothesis is wrong when a sudden or reoccurring drift occurs between the last labeled data block and the current unlabeled data block. DME is built on a new underlying hypothesis: two data blocks with a smaller KL divergence have more similar distributions. DME thus takes advantage of the information embedded in both labeled and unlabeled data blocks, which can effectively prevent a delayed response to concept drift.
As indicated in Table 1, two similar data distributions share a small KL divergence, which runs contrary to the desired weight assignment. Therefore, we use the reciprocal of the KL divergence to determine the weight. Suppose there are $k$ component classifiers maintained in the ensemble buffer $\mathcal{E}$; then the weight $w_i$ of the $i$th component classifier is calculated as follows:
$$w_i = \frac{1/D_i}{\sum_{j=1}^{k} 1/D_j}, \tag{12}$$
where $D_i$ denotes the KL divergence between the GMM pdf corresponding to the $i$th data block in $\mathcal{E}$ and that of the current unlabeled data block. Eq. (12) guarantees that $\sum_{i=1}^{k} w_i = 1$. The learning procedure of the DME algorithm is described in Algorithm 2.
From the procedure of DME, we observe that it differs significantly from the generic adaptive weighted ensemble learning paradigm in the following three aspects. First, DME assigns weights after receiving the new unlabeled data block, whereas the generic paradigm does not. Second, the weight assignment of the generic paradigm relies on error rate feedback, whereas that of DME depends on distribution similarity. Third, the generic paradigm stores only $k$ classifiers and their weights in the buffer, while DME reserves $k$ classifiers and the GMM distribution information of the corresponding $k$ data blocks. The GMM information contains only the mean vectors, the covariance matrices, and the cluster weights, and the number of stored Gaussian components is restricted to $m \times k$; thus the storage is acceptable and does not violate the one-pass rule. As for time complexity, in contrast with the generic paradigm, the proposed DME algorithm adds one GMM modeling procedure on a data block and $k$ KL divergence calculations between two data blocks, but removes the procedure of running $k$ classifiers on the new data block to calculate error rates.
In fact, the analysis of the time complexity of DME may be very complicated, as it depends on multiple parameters, including $m$, $k$, $d$, and the parameters of the component classifier. According to our subsequent experiments, DME runs rapidly and can satisfy the requirement of real-time decision making.
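The reciprocal-divergence weighting and the resulting weighted vote can be sketched as follows. The weights are the normalized reciprocals of the KL divergences, so they sum to one. The guard `eps` against zero or negative divergences and the function names `dme_weights` and `weighted_vote` are assumptions added for illustration; they are not specified in the text.

```python
def dme_weights(divergences, eps=1e-12):
    """Normalized reciprocal weights: w_i = (1/D_i) / sum_j (1/D_j).

    Assumption: divergences at or below eps are clipped to eps so that
    the reciprocal stays finite and positive.
    """
    inverse = [1.0 / max(d, eps) for d in divergences]
    total = sum(inverse)
    return [v / total for v in inverse]


def weighted_vote(predictions, weights):
    """Return the label with the largest accumulated classifier weight."""
    scores = {}
    for label, weight in zip(predictions, weights):
        scores[label] = scores.get(label, 0.0) + weight
    return max(scores, key=scores.get)


# Divergences between three reserved blocks and the new unlabeled block.
weights = dme_weights([0.5, 2.0, 1.0])
print(weights)  # the first classifier, with the smallest divergence, dominates

# The closest block's classifier outvotes the other two for this instance.
print(weighted_vote(["a", "b", "b"], weights))  # a
```

Because the weights are computed only after the unlabeled block arrives, a classifier trained on an old but matching concept can regain a high weight immediately.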

Algorithm 2: DME
Input:
    $S$: a data stream
    $d$: the number of instances in a block
    $k$: the given size of the ensemble buffer
    $Q(\cdot)$: classifier quality evaluation function
    $m$: the number of mixture clusters in GMM
Output:
    $\mathcal{E}$: an updated ensemble with $k$ component classifiers and the distribution information of the corresponding data blocks
for each newly received data block $B_i \in S$ do
    $p_i \leftarrow$ the GMM pdf of $B_i$ estimated by the EM algorithm;
    for each reserved $p_j \in \mathcal{E}$ do
        $D_{ij} \leftarrow$ KL divergence between $p_i$ and $p_j$ calculated by Eq. (11);
    end for
    for each component classifier $CL_j \in \mathcal{E}$ do
        $w_j \leftarrow$ weight calculated by Eq. (12);
    end for
    make a decision for each instance in $B_i$ by combining the weighted classifiers in $\mathcal{E}$;
    $C \leftarrow$ new component classifier built on $B_i$ after it acquires real class labels;
    if $|\mathcal{E}| < k$ then
        $\mathcal{E} \leftarrow \mathcal{E} \cup \{C\} \cup \{p_i\}$;
    else
        replace the weakest ensemble member defined by $Q(\cdot)$ and its distribution information in $\mathcal{E}$ with $C$ and $p_i$;
    end if
end for

Another significant issue in DME is how to evaluate the quality of the component classifiers and update the buffer. The quality of a component classifier is defined and evaluated by the quality evaluation function $Q(\cdot)$. In this study, we design two different quality evaluation functions and classifier update rules. The first is called the lowest weight removing rule, abbreviated as Low. In the Low rule, the quality evaluation function $Q(\cdot)$ depends only on the real-time weight of each component classifier. When the ensemble buffer is updated, the component classifier with the lowest weight is replaced by the classifier trained on the new data block. The second is called the delayed average weight removing rule, abbreviated as Ave. In the Ave rule, both weight and time factors are considered by $Q(\cdot)$. Specifically, for a classifier $CL_i$, we use $\bar{w}_i$ and $t_i$ to represent its average weight and the number of data blocks it has experienced since it was added to the buffer, respectively.
We first consider the time factor, i.e., when $t_i = 2k$, the corresponding classifier is preferentially removed. If no component classifier meets this condition, we discard the classifier with the lowest average weight $\min(\bar{w}_i)$, while guaranteeing that its $t_i > 3$. This effectively avoids the accidental deletion of significant component classifiers. In this study, we adopt Ave as the default rule. The difference between the two rules is further compared and discussed in Section IV.
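The two update rules can be sketched as functions that return the index of the classifier to replace. Several details are assumptions made for illustration and are not specified in the text: the age test uses `>=` rather than strict equality as a defensive choice, the oldest classifier is chosen if several have reached the age limit, and all classifiers become eligible if none has an age above 3. The names `pick_low` and `pick_ave` are hypothetical.

```python
def pick_low(weights):
    """Low rule: index of the classifier with the lowest current weight."""
    return min(range(len(weights)), key=lambda i: weights[i])


def pick_ave(avg_weights, ages, k, min_age=3):
    """Ave rule: index of the classifier to replace.

    avg_weights[i] is the average weight of classifier i and ages[i] is the
    number of blocks it has experienced since entering the buffer.
    """
    # Time factor first: a classifier that has experienced 2k blocks goes.
    expired = [i for i, age in enumerate(ages) if age >= 2 * k]
    if expired:
        return max(expired, key=lambda i: ages[i])
    # Otherwise remove the lowest average weight among classifiers that are
    # old enough, which protects recently added members.
    eligible = [i for i, age in enumerate(ages) if age > min_age]
    if not eligible:  # assumed fallback when every member is still young
        eligible = list(range(len(ages)))
    return min(eligible, key=lambda i: avg_weights[i])


print(pick_low([0.2, 0.1, 0.7]))                   # 1
print(pick_ave([0.1, 0.5, 0.4], [2, 5, 4], k=3))   # 2: index 0 is too young
print(pick_ave([0.9, 0.5, 0.4], [6, 5, 4], k=3))   # 0: it has reached 2k blocks
```

The second call shows the intent of the delay: a newly added classifier with a temporarily low average weight is not discarded before it has been observed on a few blocks.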

IV. EXPERIMENTS
A. DATASETS DESCRIPTION
To clearly present the characteristics of the proposed DME algorithm, we used thirteen synthetic and two real-world streaming datasets to conduct comparison experiments. Next, we provide a detailed description of these streaming datasets.
Without loss of generality, we set the first nine synthetic streaming datasets to be two-dimensional, with the instances evenly divided between two classes. Here, the instances in each class follow a Gaussian distribution, and the initial centroids of the two classes are located at (2, 2) and (5, 5), respectively. We simulated different concept drifts by changing the centroids or by moving instances into the opposite class, an approach that has been adopted in [35] and [38].
For sudden drift, we generated two datasets, each of which contains 50,000 instances in 100 data blocks, that is, 500 instances per block. Sudden$_S$ and Sudden$_F$ both change the distributions sharply between two adjacent data blocks. The difference between them is that Sudden$_S$ changes only once, whereas Sudden$_F$ changes every 20 data blocks.
The Gradual$_F$ dataset simulates gradual concept drift in a data stream. In Gradual$_F$, we gradually move instances of each class into the opposite class. Between any two adjacent data blocks, 1% of the instances are exchanged.
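A sudden-drift stream of this shape can be sketched with the standard library alone. The block layout (100 blocks of 500 two-dimensional instances, two balanced classes, initial centroids (2, 2) and (5, 5), a change every 20 blocks) follows the description; the unit standard deviation and the choice to realize the drift by swapping the two class centroids are assumptions, since the text does not state them. The function names are hypothetical.

```python
import random


def make_block(centroids, n=500, std=1.0, rng=random):
    """One block of n instances, split evenly between the two classes."""
    block = []
    for label, (cx, cy) in enumerate(centroids):
        for _ in range(n // 2):
            point = (rng.gauss(cx, std), rng.gauss(cy, std))
            block.append((point, label))
    rng.shuffle(block)
    return block


def sudden_stream(n_blocks=100, period=20, n=500, seed=0):
    """Stream whose class centroids change sharply every `period` blocks."""
    rng = random.Random(seed)
    centroids = [(2.0, 2.0), (5.0, 5.0)]
    stream = []
    for b in range(n_blocks):
        if b > 0 and b % period == 0:
            centroids.reverse()  # assumed drift: the classes swap centroids
        stream.append(make_block(centroids, n=n, rng=rng))
    return stream


stream = sudden_stream()
print(len(stream), len(stream[0]))  # 100 500
```

Under the same assumptions, `sudden_stream(period=50)` gives a stream with a single change, at the midpoint of the 100 blocks.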
For reoccurring drift, we simulate it in a way similar to that used for generating sudden drifts. When drift occurs, the distributions of the two classes change sharply and simultaneously between two adjacent blocks, and then, after a period of time, the distributions are restored. To better observe how the various algorithms adapt to this drift type, we generated four different reoccurring drift datasets, including Reoccur$_5$ and Reoccur$_{10}$.
Incremental$_S$ and Incremental$_F$ both simulate incremental concept drift. The former changes the centroid of each class with a variance of 0.01 per block, while the latter changes the centroids with a variance of 0.1 per block.
The SEA generator [26], which is a well-known sudden drift data stream generator, is also used to generate datasets in this study. In the SEA concept, each instance consists of three attributes, of which only the first two are relevant, whereas the third can be seen as noise. All attributes share the same range of [0, 10]. Here, each concept is defined by $f_1 + f_2 \leq \theta$, where $f_1$ and $f_2$ represent the first two attributes, and $\theta$ is the threshold associated with the class label. We generated two datasets, each of which contains 100,000 instances with 500 per block. SEA$_S$ contains four concepts with drifts occurring every 50 blocks, and SEA$_F$ contains five concepts with drifts occurring every 40 blocks. In addition, 10% class noise has been added to each dataset.
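The SEA concept can be sketched as follows. The three attributes in [0, 10], the rule $f_1 + f_2 \leq \theta$, the block size of 500, the 10% class noise, and the layout of four concepts lasting 50 blocks each follow the description; the threshold values in `thetas` and the choice of label 1 for instances satisfying the rule are assumptions, since the text does not list them. The function names are hypothetical.

```python
import random


def sea_block(theta, n=500, noise=0.10, rng=random):
    """One SEA block: label 1 when f1 + f2 <= theta, flipped with prob. noise."""
    block = []
    for _ in range(n):
        x = [rng.uniform(0.0, 10.0) for _ in range(3)]  # third attribute is irrelevant
        y = 1 if x[0] + x[1] <= theta else 0
        if rng.random() < noise:
            y = 1 - y  # class noise
        block.append((x, y))
    return block


def sea_stream(thetas=(8.0, 9.0, 7.0, 9.5), blocks_per_concept=50, seed=0):
    """Stream of blocks with a sudden drift whenever the threshold changes."""
    rng = random.Random(seed)
    stream = []
    for theta in thetas:  # assumed thresholds, one per concept
        for _ in range(blocks_per_concept):
            stream.append(sea_block(theta, rng=rng))
    return stream


stream = sea_stream()
print(len(stream), sum(len(block) for block in stream))  # 200 100000
```

A five-concept variant with drifts every 40 blocks is obtained by passing five thresholds and `blocks_per_concept=40`.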
The hyperplane generator was originally designed to compare the performance of the CVFDT and VFDT algorithms [41]. It models a hyperplane satisfying $\sum_{i=1}^{d} w_i x_i = w_0$ in a $d$-dimensional space, where $x_i$ denotes the $i$th attribute value of $x$ and is restricted to between 0 and 1. The positive class corresponds to the condition $\sum_{i=1}^{d} w_i x_i \geq w_0$; otherwise, the instances are assigned to the negative class. In the hyperplane generator, concept drifts are generated by varying the weights $w_i$. In this study, we constructed two hyperplane datasets, each of which consists of 100,000 instances represented by 10 attributes, with 500 instances per block. The first dataset (Hyp$_S$) simulates incremental drift by modifying the weights $w_i$ with a variance of 0.01 for each data block. The second dataset (Hyp$_F$) speeds up the incremental drift by using a variance of 0.1 for each data block. Additionally, 5% noise has been added to both datasets.
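A rotating-hyperplane stream can be sketched as below. The attribute range, the labeling rule $\sum_i w_i x_i \geq w_0$, the 10 attributes, the block size, and the 5% noise follow the description. Three choices are assumptions made for illustration: the threshold is set to $w_0 = \frac{1}{2}\sum_i w_i$ to keep the classes roughly balanced, the per-block "variance" is interpreted as moving each weight by plus or minus `delta`, and the initial weights are drawn uniformly from [0, 1]. The function name is hypothetical.

```python
import random


def hyperplane_stream(n_blocks=200, n=500, dim=10, delta=0.01, noise=0.05, seed=0):
    """Blocks labeled by a hyperplane whose weights drift a little per block."""
    rng = random.Random(seed)
    w = [rng.random() for _ in range(dim)]  # assumed initial weights
    stream = []
    for _block in range(n_blocks):
        w0 = 0.5 * sum(w)  # assumed threshold for roughly balanced classes
        block = []
        for _instance in range(n):
            x = [rng.random() for _ in range(dim)]
            score = sum(wi * xi for wi, xi in zip(w, x))
            y = 1 if score >= w0 else 0
            if rng.random() < noise:
                y = 1 - y  # class noise
            block.append((x, y))
        stream.append(block)
        # Incremental drift: every weight moves by +delta or -delta.
        w = [wi + delta * rng.choice((-1, 1)) for wi in w]
    return stream


slow = hyperplane_stream(n_blocks=4, delta=0.01)
fast = hyperplane_stream(n_blocks=4, delta=0.1)
print(len(slow), len(slow[0]), len(slow[0][0][0]))  # 4 500 10
```

With the default arguments the function produces 200 blocks of 500 instances, matching the 100,000-instance size given above.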
The Electricity dataset (Elec) is a popular real-world dataset which has been widely used to evaluate the online/incremental learning algorithms. Elec records the energy prices acquired from the electricity market in New South Wales State, Australia. It contains 45,312 instances spanning the period from May 1996 to December 1998, in where each instance contains seven attributes. Specifically, each instance is labeled as up or down according to whether the current electricity price is higher or lower than the average price of the past 24 hours. The dataset is available at: http://moa.cms.waika to.ac.nz/datasets/.
The Weather dataset (Wea) is also a widely used streaming dataset. It contains 18,159 instances collected by the National Oceanic and Atmospheric Administration (NOAA), recording the weather at Offutt Air Force Base in Bellevue, Nebraska, USA from 1949 to 1999. Each instance is described by 8 attributes, including barometric pressure, humidity, and wind speed. The dataset is available at: ftp://ftp.ncdc.noaa.gov/pub/data/gsod.
Table 2 presents detailed information about the above-mentioned streaming datasets, where No.Inst, No.Attrs, No.Cls, Per.Noise, and No.Drifts denote the number of instances, the number of attributes, the number of classes, the percentage of noise, and the number of drifts embedded in the corresponding streaming dataset, respectively. For the two real-world datasets, both the number of drifts and the drift types are unknown.
Each dataset is divided into multiple consecutive blocks of d = 500 instances, as this setting is regarded as the most suitable for testing block-based ensemble learning algorithms [35].
Without loss of generality, we used naive Bayes as the base classifier for each compared algorithm [42]. The classification accuracy averaged over all data blocks is used to evaluate and compare the algorithms.
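The block-wise protocol can be illustrated with the short Python sketch below. It assumes a test-then-train order (each block is predicted before its labels are used for learning) and keeps a shorter trailing block when the stream length is not a multiple of 500; neither detail is spelled out here, so both are assumptions. `MajorityClass` is a hypothetical stand-in learner used only to keep the example self-contained; it is not the naive Bayes base classifier.

```python
from collections import Counter

def split_into_blocks(instances, block_size=500):
    """Cut a stream into consecutive blocks; the last block may be shorter."""
    return [instances[i:i + block_size]
            for i in range(0, len(instances), block_size)]

class MajorityClass:
    """Hypothetical stand-in learner: predicts the most frequent label so far."""
    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else 0

    def learn(self, block):
        self.counts.update(y for _, y in block)

def prequential_accuracy(blocks, model):
    """Average per-block accuracy (%), predicting each block before learning it."""
    accuracies = []
    for block in blocks:
        correct = sum(model.predict(x) == y for x, y in block)
        accuracies.append(100.0 * correct / len(block))
        model.learn(block)  # labels become available only after prediction
    return sum(accuracies) / len(accuracies)

# Elec has 45,312 instances: 90 full blocks plus one block of 312 instances.
elec_blocks = split_into_blocks(list(range(45312)))
```

Because every block contributes equally to the average, a short trailing block carries the same weight as a full one under this sketch.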

C. EXPERIMENTAL RESULTS AND ANALYSIS
The average classification accuracy (%) of the compared adaptive weighted ensemble algorithms is presented in Table 3, where the best result on each streaming dataset is highlighted in bold.
From the results in Table 3, we observe that the proposed DME algorithm performs significantly better than several other algorithms on the datasets with reoccurring drifts, indicating that it is particularly suitable for this drift type. On Reoccur 5, Reoccur 10, and Reoccur 15, DME clearly outperforms several other compared algorithms. In addition, DME shows a clear superiority on the Wea dataset. As a data stream recording weather variations over 50 years, Wea very likely contains reoccurring drifts. This behavior is explained by the mechanism of DME, in which weights are assigned according to the similarity between two data blocks: if a similar concept is still reserved in the ensemble buffer, its classifier can play an important role in predicting the instances of a newly received testing block. Therefore, DME adapts well to data streams with frequent reoccurring drifts. For the other drift types, although DME does not show a significant superiority over several state-of-the-art algorithms, it still produces comparable classification performance. The results show that both error rate and distribution similarity are useful indicators of the potential and contribution of each component classifier in an ensemble. Because DME is built on a totally different underlying hypothesis from several other algorithms, it can observe concept drift in time and provide just-in-time adaptation to the drift. In other words, DME decreases the potential risk of performance degradation caused by concept drifts.
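The similarity-driven weighting discussed above can be illustrated with a deliberately simplified Python sketch. It is not the DME weight rule itself: each block is summarized by a single univariate Gaussian rather than a GMM (so that the KL divergence has a closed form), the divergence is taken from the new block to each stored block, and the mapping from divergence to weight (proportional to 1 / (KL + eps)) is an assumption chosen for illustration.

```python
import math

def kl_gaussian(mu_p, var_p, mu_q, var_q):
    """Closed-form KL(p || q) for two univariate Gaussians."""
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def similarity_weights(divergences, eps=1e-6):
    """Assumed mapping: weight proportional to 1 / (KL + eps), normalized to 1."""
    raw = [1.0 / (d + eps) for d in divergences]
    total = sum(raw)
    return [r / total for r in raw]

def weighted_vote(predictions, weights):
    """Return the label that collects the largest total weight."""
    scores = {}
    for label, w in zip(predictions, weights):
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)

# New block summarized as N(0, 1); three stored blocks as (mean, variance).
stored = [(0.1, 1.0), (3.0, 1.0), (0.0, 1.2)]
divs = [kl_gaussian(0.0, 1.0, mu, var) for mu, var in stored]
weights = similarity_weights(divs)
```

In this toy setting, the classifier whose block lies far from the new block (mean 3.0) receives almost no weight, while the two classifiers trained on nearby distributions dominate the vote, regardless of how recently they were trained.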
Next, we statistically compared these ensemble learning algorithms using the Nemenyi test [40]. Specifically, the critical difference (CD) metric is used to show the differences among the algorithms. Fig.3 shows the CD diagram at a significance level of α = 0.10, in which the average ranking of each algorithm is marked along the axis. In a CD diagram, a group of algorithms that are not significantly different from one another is connected by a thick line.
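For reference, the critical difference of the Nemenyi test is commonly computed as $CD = q_\alpha \sqrt{k(k+1)/(6N)}$, where k is the number of algorithms and N the number of datasets. The sketch below evaluates this formula in Python. The value 2.589 is the commonly tabulated critical value for six algorithms at α = 0.10, and the number of datasets (12) is a hypothetical placeholder; both should be checked against the actual experimental setup.

```python
import math

def nemenyi_cd(q_alpha, n_algorithms, n_datasets):
    """Critical difference: CD = q_alpha * sqrt(k (k + 1) / (6 N))."""
    k, n = n_algorithms, n_datasets
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

def significantly_different(rank_a, rank_b, cd):
    """Two algorithms differ if their average ranks are more than CD apart."""
    return abs(rank_a - rank_b) > cd

# Hypothetical setting: six algorithms compared on 12 datasets, alpha = 0.10.
cd = nemenyi_cd(q_alpha=2.589, n_algorithms=6, n_datasets=12)
```

Under these assumed numbers, two algorithms would need average ranks roughly two positions apart to be declared significantly different, which explains why only the largest rank gaps appear as significant in a CD diagram.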
In Fig.3, we observe that DME is significantly superior to Learn++.NSE. The reason may be twofold: 1) DME avoids the unbounded growth of the ensemble buffer that occurs in Learn++.NSE, and 2) Learn++.NSE assigns weights to component classifiers based on their performance on both past and current data blocks, which makes it difficult to adapt to concept drifts in time. We also observe that although DME is not significantly superior to the four other algorithms, it achieves the lowest (that is, the best) average ranking; thus, DME is a more robust algorithm than the others.
Furthermore, to observe the characteristics of the algorithms more clearly, we tracked their classification accuracy trajectories throughout the whole learning procedure on four representative streaming datasets, namely Sudden F, Reoccur 10, Hyp S, and Wea (see Fig.4 to Fig.7). Fig.4 shows that on a data stream with sudden drifts, DME largely withstands the damage caused by the drifts. The reason is that although the ensemble buffer may contain no concept similar to the newly emerging one, DME still adaptively selects the stored data blocks whose distributions are closest to that of the new data block, which reduces the performance degradation. In addition, we note that DME recovers rapidly after encountering a sudden drift, because the classifier trained on the drifting data block immediately plays an important role, as its distribution is similar to that of the subsequent data blocks. At the very least, DME responds faster than several other competitors.
The results in Fig.5 again confirm that DME is particularly suitable for dealing with reoccurring drifts. Except for the first drift, which can be regarded as a sudden drift, DME adapts well to all other reoccurring drifts. This is because once an old concept recurs, as long as it is still reserved in the ensemble buffer, DME can adapt to it by adaptively activating the corresponding component classifier. Of course, if the recurrence cycle is long and the ensemble buffer is not large enough, DME would fail to adapt to reoccurring drifts.
Fig.6 presents the variation in classification accuracy of the incremental ensemble learning algorithms, with the aim of examining their reactions to gradual drifts. DME produces stable performance on this type of data stream, indicating that it adapts well to gradual drifts. Of course, most other algorithms also deal with this drift type well.
In general, real-world environments contain many unpredictable and uncertain conceptual changes, so real-world streaming data is expected to better reflect the quality of learning algorithms. Fig.7 shows the accuracy changes of the algorithms on the real-world Wea dataset. The Wea dataset appears to contain some drifts, but their types are unknown. DME performs stably throughout the whole learning procedure, indicating that it adapts well to real-world drifting streams.
We are also aware of a potential risk: in the extreme case in which two blocks share the same overall distribution but have completely reversed label distributions, the underlying hypothesis that supports DME would be wrong. Fortunately, this is almost impossible in real-world applications. Therefore, in comparison with the previously used underlying hypothesis, the underlying hypothesis proposed in this study is more reliable in theory.

D. COMPARISON OF TWO WEIGHT UPDATE RULES
Next, we compared the quality and running time of the two buffer update rules proposed in this study. The results are presented in Table 4.
From the results in Table 4, we observe that in most cases, DME with the Ave buffer update rule performs better than DME with the Low buffer update rule. We attribute this to the fact that the Ave rule evaluates the quality of each component classifier from two aspects simultaneously: its average contribution and its dwell time. Such behavior significantly lowers the probability of wrongly removing important component classifiers. We also note that the Ave buffer update rule is generally more time-consuming than the Low buffer update rule, but the increase in running time is acceptable in real-world applications. Overall, we recommend the Ave buffer update rule when applying the DME algorithm in practice.
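The contrast between the two rules can be illustrated with a toy Python sketch. The exact definitions of the rules are not restated in this subsection, so the version below is only one possible reading: the "Low"-style rule removes the member with the lowest weight on the most recent block, and the "Ave"-style rule removes the member with the lowest mean weight over the blocks it has spent in the buffer. The weight histories and the names `evict_low` and `evict_ave` are hypothetical.

```python
def evict_low(weight_history):
    """Assumed 'Low'-style rule: drop the member with the lowest latest weight."""
    return min(weight_history, key=lambda name: weight_history[name][-1])

def evict_ave(weight_history):
    """Assumed 'Ave'-style rule: drop the member with the lowest mean weight
    over its dwell time (one weight per block spent in the buffer)."""
    def mean_weight(name):
        history = weight_history[name]
        return sum(history) / len(history)
    return min(weight_history, key=mean_weight)

# Hypothetical weights received by three buffered classifiers, oldest first.
history = {
    "c1": [0.60, 0.50, 0.05],  # useful for a past concept, weak on the latest block
    "c2": [0.20, 0.15, 0.10],  # consistently mediocre
    "c3": [0.30],              # newly added
}
removed_by_low = evict_low(history)
removed_by_ave = evict_ave(history)
```

In this toy case the latest-weight rule discards `c1`, which could be needed again if its concept recurs, whereas the averaged rule discards the consistently weak `c2`. This is in line with the observation that averaging lowers the chance of removing an important classifier.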

E. DISCUSSIONS ABOUT PARAMETERS
Finally, we investigate the influence of two key parameters in DME: the size of the ensemble buffer, k, and the number of mixture components in the GMM, m. Specifically, we varied k from 5 to 20 and m from 3 to 10, both in increments of 1.
Fig.8 presents the corresponding variation in average classification accuracy (%) and running time per block (cs) on three representative datasets: Reoccur 10, Incremental S, and Elec.
The results in Fig.8 show that the average classification accuracy of DME is strongly associated with the parameter m: as m increases, the classification accuracy tends to improve. Since m controls the complexity with which a distribution is described, it determines the accuracy of estimating the pdfs, calculating the KL divergences, and assigning the weights. We also observe that the average classification accuracy can be greatly affected by the parameter k, the size of the ensemble buffer. The results indicate that on streaming data with reoccurring drifts, a large k should be chosen to better adapt to recurring old concepts, whereas on data streams with other drift types, a small k might be more suitable for DME to track and adapt to concept drifts. The running time grows linearly with both k and m. In real-world applications, readers are advised to choose these parameters according to their practical requirements.
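A parameter study of this kind amounts to a grid sweep over the 16 values of k and the 8 values of m, that is, 128 configurations per dataset. The Python sketch below shows the bookkeeping only. The `evaluate` callback is a hypothetical placeholder for running DME with a given (k, m) and returning its average accuracy, and `toy_accuracy` is an invented function used solely to make the example executable.

```python
from itertools import product

def sweep(evaluate, k_values=range(5, 21), m_values=range(3, 11)):
    """Evaluate every (k, m) pair and return {(k, m): score}."""
    return {(k, m): evaluate(k, m) for k, m in product(k_values, m_values)}

def best_setting(results):
    """Return the (k, m) pair with the highest score."""
    return max(results, key=results.get)

def toy_accuracy(k, m):
    """Invented stand-in for a DME run: improves with m, peaks at k = 12."""
    return 80.0 + m - abs(k - 12)

results = sweep(toy_accuracy)
best_k, best_m = best_setting(results)
```

Since the running time is reported to grow linearly with both k and m, the most accurate setting on the grid is not necessarily the one to deploy; the sweep results can also be used to pick the smallest configuration whose accuracy is acceptable.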

V. CONCLUSION
In this paper, we propose a novel adaptive weighted ensemble learning algorithm called DME to deal with block-based concept drift streaming data. Specifically, DME uses a GMM to estimate the distribution of each data block, KL divergence to estimate the similarity/dissimilarity between two data block distributions, and distribution similarity to adaptively assign decision weights to the component classifiers in the ensemble buffer. The experimental results on several artificial and real-world non-stationary data streams indicate that the proposed DME algorithm is able to track and adapt to various concept drifts just in time. Especially on data streams with frequent reoccurring drifts, DME shows a larger superiority over several state-of-the-art algorithms.
The contributions of this study can be summarized as follows: 1) A new underlying hypothesis describing how data streams drift is proposed to replace the old underlying hypothesis. 2) A distribution diversity-based weight assignment rule is proposed to replace the old error rate-based weight assignment rule, which addresses the delayed concept drift adaptation problem. 3) Two component classifier update rules are proposed to ensure that the component classifiers with the most potential and significance are retained in the ensemble buffer. In future work, we plan to further verify the effectiveness and superiority of the proposed DME algorithm in more real-world non-stationary data stream applications. Additionally, we will investigate how to extend the DME algorithm from block-based incremental learning to one-by-one online learning.