A Systematic Study of Online Class Imbalance Learning with Concept Drift

As an emerging research topic, online class imbalance learning often combines the challenges of both class imbalance and concept drift. It deals with data streams having very skewed class distributions, where concept drift may occur. It has recently received increased research attention; however, very little work addresses the combined problem where both class imbalance and concept drift coexist. As the first systematic study of handling concept drift in class-imbalanced data streams, this paper first provides a comprehensive review of current research progress in this field, including current research focuses and open challenges. Then, an in-depth experimental study is performed, with the goal of understanding how to best overcome concept drift in online learning with class imbalance. Based on the analysis, a general guideline is proposed for the development of an effective algorithm.


I. INTRODUCTION
With the wide application of machine learning algorithms to the real world, class imbalance and concept drift have become crucial learning issues. Applications in various domains such as risk management [1], anomaly detection [2], software engineering [3], and social media mining [4] are affected by both class imbalance and concept drift. Class imbalance happens when the data categories are not equally represented, i.e., at least one category is a minority compared to other categories [5]. It can cause learning bias towards the majority class and poor generalization. Concept drift is a change in the underlying distribution of the problem, and is a significant issue especially when learning from data streams [6]. It requires learners to be adaptive to dynamic changes.
Class imbalance and concept drift can significantly hinder predictive performance, and the problem becomes particularly challenging when they occur simultaneously. This challenge arises from the fact that one problem can affect the treatment of the other. For example, drift detection algorithms based on the traditional classification error may be sensitive to the imbalanced degree and become less effective; and class imbalance techniques need to be adaptive to changing imbalance rates, otherwise the class receiving the preferential treatment may not be the correct minority class at the current moment.
Although there have been papers studying data streams with an imbalanced distribution and data streams with concept drift respectively, very little work discusses the cases when both class imbalance and concept drift exist. This paper aims to provide a systematic study of handling concept drift in class-imbalanced data streams. We focus on online (i.e. one-by-one) learning, which is a more difficult case than chunk-based learning, because only a single instance is available at a time.
We first give a comprehensive review of current research progress in this field, including problem definitions, problem and approach categorization, performance evaluation and up-to-date approaches. It reveals new challenges and research gaps. Most existing work focuses on the concept drift in posterior probabilities (i.e. real concept drift [7], changes in P (y | x)). The challenges in other types of concept drift have not been fully discussed and addressed. In particular, the change in prior probabilities P (y) is closely related to class imbalance and has been overlooked by most existing work. Most proposed concept drift detection approaches are designed for and tested on balanced data streams. Very few approaches aim to tackle class imbalance and concept drift simultaneously. Among the limited solutions, it is still unclear which approach is better and when. It is also unknown whether and how applying class imbalance techniques (e.g. resampling methods) affects concept drift detection and online prediction.
To fill in the research gaps, we then provide an experimental insight into how to best overcome concept drift in online learning with class imbalance, by focusing on three research questions: 1) What are the challenges in detecting each type of concept drift when the data stream is imbalanced? 2) Among the proposed methods designed for online class imbalance learning with concept drift, which one performs better for which type of concept drift? 3) Would applying class imbalance techniques (e.g. resampling methods) facilitate concept drift detection and online prediction? Six recent approaches, DDM-OCI [8], LFR [9], PAUC-PH [10] [96], OOB [11], RLSACP [12] and ESOS-ELM [13], are compared and analyzed in depth under each of the three fundamental types of concept drift (i.e. changes in prior probability P (y), class-conditional probability density function (pdf) p (x | y) and posterior probability P (y | x)) in artificial data streams, as well as real-world data sets. To the best of our knowledge, they are among the very few methods that are explicitly designed for online learning problems with class imbalance and concept drift so far.
Finally, based on the review and experimental results, we provide some guidelines for developing an effective algorithm for learning from imbalanced data streams with concept drift. We stress the importance of studying the mutual effect of class imbalance and concept drift.
arXiv:1703.06683v1 [cs.LG] 20 Mar 2017
The contributions of this paper include: this is the first comprehensive study that looks into concept drift detection in class-imbalanced data streams; data problems are categorized in different types of concept drift and class imbalance with illustrative applications; existing approaches are compared and analysed systematically in each type; pros and cons of each approach are investigated; the results provide guidance for choosing the appropriate technique and developing better algorithms for future learning tasks; this is also the first work exploring the role of class imbalance techniques in concept drift detection, which sheds light on whether and how to tackle class imbalance and concept drift simultaneously.
The rest of this paper is organized as follows. Section II formulates the learning problem, including a learning framework, detailed problem descriptions, and an individual introduction to class imbalance and concept drift. Section III reviews the combined issue of class imbalance and concept drift, including example applications and existing solutions. Section IV carries out the experimental study, aiming to find out the answers to the three research questions. Section V draws the conclusions and points out potential future directions.

II. ONLINE LEARNING FRAMEWORK WITH CLASS IMBALANCE AND CONCEPT DRIFT
In data stream applications, data arrives over time in streams of examples or batches of examples. The information up to a specific time step t is used to build/update predictive models, which then predict the new example(s) arriving at time step t + 1. Learning under such conditions needs chunk-based learning or online learning algorithms, depending on the number of training examples available at each time step. According to the most agreed definitions [6] [14], chunk-based learning algorithms process a batch of data examples at each time step, such as the case of daily internet usage from a set of users; online learning algorithms process examples one by one and the predictive model is updated after receiving each example [15], such as the case of sensor readings at every second in engineering systems. The term "incremental learning" is also frequently used under this scenario. It usually refers to any algorithm that can process data streams with certain criteria met [16].
On one hand, online learning can be viewed as a special case of chunk-based learning. Online learning algorithms can be used to deal with data coming in batches. They both build and continuously update a learning model to accommodate newly available data, and simultaneously maintain its performance on old data, giving rise to the stability-plasticity dilemma [17]. On the other hand, the way of designing online and chunk-based learning algorithms can be very different [6]. Most chunk-based learning algorithms are not suitable for online learning tasks, because batch learners process a chunk of data each time, possibly using an offline learning algorithm for each chunk. Online learning requires the model to be adapted immediately upon seeing the new example, and the example is then immediately discarded, which allows high-speed data streams to be processed. From this point of view, designing online learning algorithms can be more challenging, but so far it has received much less attention than designing chunk-based ones. First, the online learner needs to learn from a single data example, so it needs a more sophisticated training mechanism. Second, data streams are often non-stationary (concept drift). The limited availability of training examples at the current moment in online learning hinders the detection of such changes and the application of techniques to overcome the change. Third, it is often seen that data is class imbalanced in many classification tasks, such as the fault detection task in an engineering system, where the fault is always the minority. Class imbalance aggravates the learning difficulty [5] and complicates the data status [18]. However, there is a severe lack of research addressing the combined issue of class imbalance and concept drift in online learning.
To fill in this research gap, this paper aims at a comprehensive review of the work done to overcome class imbalance and concept drift, a systematic study of learning challenges, and an in-depth analysis of the performance of current approaches. We begin by formalizing the learning problem in this section.

A. Learning Procedure
In supervised online classification, suppose a data generating process provides a sequence of examples $(x_t, y_t)$ arriving one at a time from an unknown probability distribution $p_t(x, y)$. $x_t$ is the input vector belonging to an input space $X$, and $y_t$ is the corresponding class label belonging to the label set $Y = \{c_1, \ldots, c_N\}$. We build an online classifier $F$ that receives the new input $x_t$ at time step $t$ and then makes a prediction. The predicted class label is denoted by $\hat{y}_t$. After some time, the classifier receives the true label $y_t$, used to evaluate the predictive performance and further train the classifier. This whole process will be repeated at following time steps. It is worth pointing out that we do not assume new training examples always arrive at regular and pre-defined intervals here. In other words, the actual time interval between time step $t$ and $t + 1$ may be different from the actual time interval between $t + 1$ and $t + 2$.
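The procedure above can be sketched as a test-then-train loop. This is a minimal illustration only: the `MajorityClassLearner` class, its `predict`/`learn_one` methods and the toy stream are hypothetical stand-ins for the classifier F, and the sketch assumes the true label becomes available before the next example arrives.

```python
from collections import Counter


class MajorityClassLearner:
    """Toy stand-in for the online classifier F (hypothetical):
    it predicts the most frequent label seen so far."""

    def __init__(self, default_label=0):
        self.counts = Counter()
        self.default_label = default_label

    def predict(self, x):
        if not self.counts:
            return self.default_label
        return self.counts.most_common(1)[0][0]

    def learn_one(self, x, y):
        self.counts[y] += 1


def run_online(stream, learner):
    """Predict the label of x_t first, then receive y_t and update the model."""
    predictions = []
    for x_t, y_t in stream:
        y_hat = learner.predict(x_t)   # prediction made before the label is known
        predictions.append(y_hat)
        learner.learn_one(x_t, y_t)    # true label arrives; the model is updated
        # the example (x_t, y_t) is not stored after this point
    return predictions


# Toy stream of (input vector, label) pairs; label 1 is the minority class.
stream = [([0.1], 0), ([0.2], 0), ([0.9], 1), ([0.3], 0)]
preds = run_online(stream, MajorityClassLearner())
```

On this toy stream the learner never predicts the minority label, which is the bias towards the majority class that the rest of this section discusses.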
One challenge arises when data is class imbalanced. Class imbalance is an important data feature, commonly seen in applications such as spam filtering [19] and fault diagnosis [2] [3]. It is the phenomenon when some classes of data are highly under-represented (i.e. minority) compared to other classes (i.e. majority). For example, if $P(c_i) \ll P(c_j)$, then $c_j$ is a majority class and $c_i$ is a minority class. The difficulty in learning from imbalanced data is that the relatively or absolutely under-represented class cannot draw equal attention from the learning algorithm, which often leads to very specific classification rules or missing rules for this class without much generalization ability for future prediction. It has been well studied in offline learning [20], and has attracted growing attention in data stream learning in recent years [21].
In many applications, such as energy forecasting and climate data analysis [22], the data generator operates in non-stationary environments. It gives rise to another challenge, called "concept drift". It means that the probability density function (pdf) of the data generating process is changing over time. For such cases, the fundamental assumption of traditional data mining, that the training and testing data are sampled from the same static and unknown distribution, does not hold anymore. Therefore, it is crucial to monitor the underlying changes, and adapt the model to accommodate the changes accordingly.
When both issues exist, the online learner needs to be carefully designed for effectiveness, efficiency and adaptivity. An online class imbalance learning framework was proposed in [18] as a guide for algorithm design. The framework breaks down the learning procedure into three modules: a class imbalance detector, a concept drift detector and an adaptive online learner, as illustrated in Fig. 1. The class imbalance detector reports the current class imbalance status of data streams. The concept drift detector captures concept drifts involving changes in classification boundaries. Based on the information provided by the first two modules, the adaptive online learner determines when and how to respond to the detected class imbalance and concept drift, in order to maintain its performance. The learning objective of an online class imbalance algorithm can be described as "recognizing minority-class data effectively, adaptively and timely without sacrificing the performance on the majority class" [18].

B. Problem Descriptions
A more detailed introduction about class imbalance and concept drift is given here individually, including the terminology, research focuses and state-of-the-art approaches. The purpose of this section is to understand the fundamental issues that we need to take extra care of in online class imbalance learning. We also aim at understanding whether and how the current research in class imbalance learning and concept drift detection are individually related to their combined issue elaborated later in Section III, rather than to provide an exhaustive list of approaches in the literature. Among others, we will answer the following questions: can existing class imbalance techniques process data streams? Would existing concept drift detectors be able to handle imbalanced data streams?
1) Class imbalance: In class imbalance problems, the minority class is usually much more difficult or expensive to collect than the majority class, such as the spam class in spam filtering and the fraud class in credit card application. Thus, misclassifying a minority-class example is more costly. Unfortunately, the performance of most conventional machine learning algorithms is significantly compromised by class imbalance, because they assume or expect balanced class distributions or equal misclassification costs. Their training procedure with the aim of maximizing overall accuracy often leads to a high probability of the induced classifier predicting an example as the majority class, and a low recognition rate on the minority class. In reality, it is common to see that the majority class has accuracy close to 100% and the minority class has very low accuracy between 0% and 10% [23]. The negative effect of class imbalance on classifiers, such as decision trees [20], neural networks [24], k-Nearest Neighbour (kNN) [25] [26] [27] and SVM [28] [29], has been studied. A classifier that provides a balanced degree of predictive performance for all classes is required.
The major research questions in this area are summarized and answered as follows:
(a) How do we define the imbalanced degree of data?
It seems to be a trivial question. However, there is no consensus on the definition in the literature. To describe how imbalanced the data is, researchers choose to use the percentage of the minority class in the data set [30], the size ratio between classes [31], or simply a list of the number of examples in each class [32]. The coefficient of variance is used in [33], which is less straightforward. The description of imbalance status may not be a crucial issue in offline learning, but becomes more important in online learning, because there is no static data set in online scenarios. It is necessary to have some measurement automatically describing the up-to-date imbalanced degree and techniques monitoring the changes in class imbalance status. This will help the online learner to decide when and how to tackle class imbalance. The issue of changes in class imbalance status is relevant to concept drift, which will be further discussed in the next subsection.
To define the imbalanced degree suitable for online learning, a real-time indicator was proposed: the time-decayed class size [18], expressing the size percentage of each class in the data stream. It is updated incrementally at each time step by using a time decay (forgetting) factor, which emphasizes the current status of data and weakens the effect of old data. Based on this, a class imbalance detector was proposed to determine which classes should be regarded as the minority/majority and how imbalanced the current data stream is, and then used for designing better online classifiers [11] [3]. The merit of this indicator is that it is suitable for data with an arbitrary number of classes.
(b) When does class imbalance matter?
It has been shown that class imbalance is not the only problem responsible for the performance reduction of classifiers. Classifiers' sensitivity to class imbalance also depends on the complexity and overall size of the data set. Data complexity comprises issues such as overlapping [34] [35] and small disjuncts [36]. The degree of overlapping between classes and how the minority class examples distribute in data space aggravate the negative effect of class imbalance. The small disjunct problem is associated with the within-class imbalance [37]. Regarding the size of the training data, a very large domain has a good chance that the minority class is represented by a reasonable number of examples, and thus may be less affected by imbalance than a small domain containing very few minority class examples. In other words, the rarity of the minority class can be in a relative or absolute sense in terms of the number of available examples [5].
In particular, authors in [38]  Borderline, rare and outlier data sets were found to be the real source of difficulties in learning imbalanced data sets offline, which have also been shown to be the harder cases in online applications [11]. Therefore, for any developed algorithms dealing with imbalanced data online, it is worth discussing their performance on data with different types of distributions.
(c) How can we tackle class imbalance effectively (state-of-the-art solutions)?
A number of algorithms have been proposed to tackle class imbalance at the data and algorithm levels. Data-level algorithms include a variety of resampling techniques, manipulating training data to rectify the skewed class distributions. They oversample minority-class examples (i.e. expanding the minority class), undersample majority-class examples (i.e. shrinking the majority class), or combine both, until the data set is relatively balanced. Random oversampling and random undersampling are the simplest and most popular resampling techniques, where examples are randomly chosen to be added or removed. There are also smart resampling techniques (a.k.a. guided resampling). For example, SMOTE [32] is a widely used oversampling method, which generates new minority-class data points based on the similarities between original minority-class examples in the feature space. Other smart oversampling techniques include Borderline-SMOTE [40], ADASYN [41], MWMOTE [42], to name but a few. Smart undersampling techniques include Tomek links [43], One-sided selection [44], Neighbourhood cleaning rule [45], etc. The effectiveness of resampling techniques has been demonstrated in real-world applications [46]. They work independently of classifiers, and are thus more versatile than algorithm-level methods. The key is to choose an appropriate sampling rate [47], which is relatively easy for two-class data sets, but becomes more complicated for multi-class data sets [48]. Empirical studies have been carried out to compare different resampling methods [30]. Particularly, it is shown that smart resampling techniques are not necessarily superior to random oversampling and undersampling; besides, they cannot be applied to online scenarios directly, because they rely on a static data set to exploit the relations among the training examples. Some initial effort has been made recently to extend smart resampling techniques to online learning [49].
Algorithm-level methods address class imbalance by modifying their training mechanism with the direct goal of better accuracy on the minority class, including one-class learning [50], cost-sensitive learning [51] and threshold methods [52]. They require different treatments for specific kinds of learning algorithms. In other words, they are algorithm-dependent, so they are not as widely used as data-level methods. Some online cost-sensitive methods have been proposed, such as CSOGD [53] and RLSACP [12]. They are restricted to the perceptron-based classifiers, and require pre-defined misclassification costs of classes that may or may not be updated during the online learning.
Finally, ensemble learning (also known as multiple classifier systems) [54] has become a major category of approaches to handling class imbalance [55]. It combines multiple classifiers as base learners and aims to outperform every one of them. It can be easily adapted for emphasizing the minority class by integrating different resampling techniques [56] [57] [58] [59] or by making base classifiers cost-sensitive [60] [61] [62] [63]. A few ensemble methods are available for online class imbalance learning, such as OOB and UOB [11] applying random oversampling and undersampling in Online Bagging [64], and WOS-ELM [65] training a set of cost-sensitive online extreme learning machines.
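To make the time-decayed class size [18] and the resampling idea behind OOB and UOB [11] more concrete, the sketch below tracks a decayed size per class and converts it into a Poisson rate for oversampling. It is an illustration under assumptions, not the published algorithms: the exponential update rule, the decay value 0.9, the rate "largest class size divided by own class size" and all names (`TimeDecayedClassSize`, `oversampling_rate`, `poisson`) are choices made here for the example.

```python
import math
import random


class TimeDecayedClassSize:
    """Tracks a time-decayed size (a running percentage) for each class."""

    def __init__(self, classes, decay=0.9):
        self.decay = decay                       # forgetting factor (assumed value)
        self.size = {c: 0.0 for c in classes}    # sums to less than 1 early on

    def update(self, y):
        # Old information is shrunk by `decay`; the class of the newest
        # example receives the remaining weight (1 - decay).
        for c in self.size:
            indicator = 1.0 if c == y else 0.0
            self.size[c] = self.decay * self.size[c] + (1.0 - self.decay) * indicator
        return self.size


def oversampling_rate(size, y):
    """Poisson rate for an example of class y: 1 for the currently largest
    class, larger than 1 for smaller classes (assumed rule)."""
    if size[y] <= 0.0:
        return 1.0
    return max(size.values()) / size[y]


def poisson(lam, rng):
    """Knuth's method for drawing from Poisson(lam); adequate for small rates."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


rng = random.Random(0)
tracker = TimeDecayedClassSize(classes=[0, 1])
for y in [0] * 9 + [1]:          # nine majority examples, then one minority example
    size = tracker.update(y)

# Number of times a base learner would be trained on the minority example.
copies = poisson(oversampling_rate(size, 1), rng)
```

In an Online Bagging ensemble, each base learner would draw its own `copies` value for every incoming example; an undersampling variant would instead use a rate below 1 for the majority class.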
It is worth pointing out that the aforementioned online learning algorithms designed for imbalanced data are not suitable for non-stationary data streams. They do not involve any mechanism handling drifts that affect classification boundaries, although OOB and UOB can detect and react to class imbalance changes.
(d) How do we evaluate the performance of class imbalance learning algorithms?
Traditionally, overall accuracy and error rate are the most frequently used metrics of performance evaluation. However, they are strongly biased towards the majority class when data is imbalanced. Therefore, other performance measures have been adopted. Most studies concentrate on two-class problems. By convention, the minority class is treated as the positive class, and the majority class is treated as the negative class. Table I illustrates the confusion matrix of a two-class problem, producing four numbers on testing data: true positives (TP), false negatives (FN), false positives (FP) and true negatives (TN). From the confusion matrix, we can derive the expressions for recall and precision:
$$\text{recall} = \frac{TP}{TP + FN}, \qquad \text{precision} = \frac{TP}{TP + FP}.$$
The learning objective of class imbalance learning is to improve recall without hurting precision. However, improving recall and precision can be conflicting. Thus, F-measure is defined to show the trade-off between them:
$$\text{F-measure} = \frac{(1 + \beta^2) \cdot \text{recall} \cdot \text{precision}}{\beta^2 \cdot \text{precision} + \text{recall}},$$
where β corresponds to the relative importance of recall and precision. It is usually set to 1. Kubat et al. [44] proposed to use G-mean to replace overall accuracy:
$$\text{G-mean} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}}.$$
It is the geometric mean of positive accuracy (i.e. TP rate) and negative accuracy (i.e. TN rate). A good classifier should have high accuracies on both classes, and thus a high G-mean.
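As a small worked example of these measures, the sketch below computes them from the four confusion-matrix counts, assuming the standard definitions recall = TP/(TP+FN), precision = TP/(TP+FP), F-measure = (1+β²)·precision·recall/(β²·precision+recall) and G-mean = sqrt(TP rate × TN rate). The function name `imbalance_metrics` and the example counts are illustrative, not taken from the paper.

```python
import math


def imbalance_metrics(tp, fn, fp, tn, beta=1.0):
    """Accuracy, recall, precision, F-measure and G-mean from confusion counts.
    The minority class is the positive class."""
    recall = tp / (tp + fn) if tp + fn else 0.0          # TP rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    tn_rate = tn / (tn + fp) if tn + fp else 0.0
    denom = beta ** 2 * precision + recall
    f_measure = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "recall": recall,
        "precision": precision,
        "f_measure": f_measure,
        "g_mean": math.sqrt(recall * tn_rate),
    }


# 10 minority examples of which only 1 is recognized; 990 majority examples.
m = imbalance_metrics(tp=1, fn=9, fp=0, tn=990)
```

Here the overall accuracy is 0.991 even though recall is only 0.1, while G-mean (about 0.32) exposes the poor minority-class performance.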
According to [5], any metric that uses values from both rows of the confusion matrix for addition (or subtraction) will be inherently sensitive to class imbalance. In other words, the performance measure will change as class distribution changes, even though the underlying performance of the classifier does not. This performance inconsistency can cause problems when we compare different algorithms over different data sets. Precision and F-measure, unfortunately, are sensitive to the class distribution. Therefore, recall and G-mean are better options.
To compare classifiers over a range of sample distributions, AUC (the Area Under the ROC Curve) is the best choice. A ROC curve depicts all possible trade-offs between TP rate and FP rate, where FP rate = FP/(TN + FP). TP rate and FP rate can be understood as the benefits and costs of classification with respect to data distributions. Each point on the curve corresponds to a single trade-off. A better classifier should produce a ROC curve closer to the top left corner. AUC represents a ROC curve as a single scalar value by estimating the area under the curve, varying in [0, 1]. It is insensitive to the class distribution, because both TP rate and FP rate use values from only one row of the confusion matrix. AUC is usually generated by varying the classification decision threshold for separating positive and negative classes in the testing data set [66] [67]. In other words, calculating AUC requires a set of confusion matrices. Therefore, unlike other measures based on a single confusion matrix, AUC cannot be used as an evaluation metric in online learning without memorizing data. Although a recent study has modified AUC for evaluating online classifiers [10], it still needs to collect recently received examples.
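The need to keep recent examples can be illustrated with a sliding-window estimate of AUC. The sketch below is not the algorithm of [10]; it simply assumes a fixed-size buffer of (score, label) pairs and computes AUC as the fraction of positive-negative pairs in which the positive example receives the higher score, with ties counted as one half. The names `auc` and `WindowedAUC` and the window size are illustrative.

```python
from collections import deque


def auc(scores, labels):
    """AUC as the probability that a positive example is scored above a
    negative one (ties count 0.5). Returns None if a class is absent."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


class WindowedAUC:
    """Keeps only the most recent `window_size` (score, label) pairs."""

    def __init__(self, window_size=100):
        self.buffer = deque(maxlen=window_size)

    def update(self, score, label):
        self.buffer.append((score, label))
        scores, labels = zip(*self.buffer)
        return auc(scores, labels)


monitor = WindowedAUC(window_size=4)
history = [monitor.update(s, y)
           for s, y in [(0.9, 1), (0.2, 0), (0.4, 1), (0.6, 0)]]
```

The pairwise computation is quadratic in the window size, which is acceptable for a short window but would need a sorted structure for larger ones.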
The properties of the above measures are summarized in Table II. They are defined under the two-class context. They cannot be used to evaluate multi-class data directly, except for recall. Their multi-class versions have been developed [68] [69] [70]. The "multi-class" and "online" columns in the table show whether the corresponding measure can be used directly without modification in multi-class and online data scenarios.
2) Concept drift: Concept drift is said to occur when the joint probability P (x, y) changes [7] [71] [72]. The key research topics in this area include:
(a) How many types of concept drift are there? Which type is more challenging?
Concept drift can manifest three fundamental forms of changes corresponding to the three major variables in Bayes' theorem [73]: 1) a change in prior probability P (y); 2) a change in class-conditional pdf p (x | y); 3) a change in posterior probability P (y | x). The three types of concept drift are illustrated in Figure 2. Compared with the original data distribution shown in Fig. 2(a), Fig. 2(b) shows the P (y) type of concept drift without affecting p (x | y) and P (y | x). The decision boundary remains unaffected. The prior probability of the circle class is reduced in this example. Such change can lead to class imbalance. A well-learnt discrimination function may drift away from the true decision boundary, due to the imbalanced class distribution. Fig. 2(c) shows the p (x | y) type of concept drift without affecting P (y) and P (y | x). The true decision boundary remains unaffected. Elwell and Polikar claimed that this type of drift is the result of an incomplete representation of the true distribution in current data, which simply requires providing supplemental data information to the learning model [74]. Fig. 2(d) shows the P (y | x) type of concept drift. The true boundary between classes changes after the drift, so that the previously learnt discrimination function does not apply any more. In other words, the old function becomes unsuitable or partially unsuitable, and the learning model needs to be adapted to the new knowledge.
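For reference, the three quantities are tied together by the factorization of the joint distribution. The equations below restate Bayes' theorem with the time index t used for the data generating distribution in Section II-A; the time-indexed notation for the individual factors is an assumption made here for illustration.

```latex
p_t(x, y) = p_t(x \mid y)\, P_t(y) = P_t(y \mid x)\, p_t(x),
\qquad
P_t(y \mid x) = \frac{p_t(x \mid y)\, P_t(y)}{p_t(x)},
\qquad
p_t(x) = \sum_{y \in Y} p_t(x \mid y)\, P_t(y).
```

Concept drift between time steps t and t + 1 means that the joint distribution differs between them, which can be traced to a change in one or more of these factors.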
The posterior distribution change clearly indicates the most fundamental change in the data generating function. This is classified as real concept drift. The other two types belong to virtual concept drift [21], which does not change the decision (class) boundaries. In practice, one type of concept drift may appear in combination with other types.
Existing studies primarily focus on the development of drift detection methods and techniques to overcome the real drift. There is a significant lack of research on virtual drift, which can also deteriorate classification performance. As illustrated in Fig. 2(b), even though these types of drift do not affect the true decision boundaries, they can cause a well-learnt decision boundary to become unsuitable. Unfortunately, the current techniques for handling real drift may not be suitable for virtual drift, because the two present very different learning difficulties and require different solutions. For instance, the methods for handling real drift often choose to reset and retrain the classifier, in order to forget the old concept and better learn the new concept. This is not an appropriate strategy for data with virtual drift, because the examples from previous time steps may still remain valid and help the current classification in virtual drift cases. It would be more effective and efficient to calibrate the existing classifier than to retrain it. Besides, techniques for handling real drift typically rely on feedback about the performance of the classifier, while techniques for handling virtual drift can operate without such feedback [7]. From our point of view, all three types are equally important. Particularly, the two virtual types require more research effort than our community currently dedicates to them. A systematic study of the challenges in each type will be given in Section IV.
Concept drift has further been characterized by its speed, severity, cyclical nature, etc. A detailed and mutually exclusive categorization can be found in [72]. For example, according to speed, concept drift can be either abrupt, when the generating function is changed suddenly (usually within one time step), or gradual, when the distribution evolves slowly over time. They are the most commonly discussed types in the literature, because the effectiveness of drift detection methods can vary with the drifting speed. While most methods are quite successful in detecting abrupt drifts, as future data is no longer related to old data [75], gradual drifts are often more difficult, because the slow change can delay or hide the hint left by the drift. We can see some drift detection methods specifically designed for gradual concept drift, such as Early Drift Detection method (EDDM) [76].
(b) How can we tackle concept drift effectively (state-of-the-art solutions)?
There is a wide range of algorithms for learning in non-stationary environments. Most of them assume and specialize in some specific types of concept drift, although real-world data often contains multiple types. They are commonly categorized into two major groups: active vs. passive approaches, depending on whether an explicit drift detection mechanism is employed. Active approaches (also known as trigger-based approaches) determine whether and when a drift has occurred before taking any actions. They operate based on two mechanisms: a change detector aiming to sense the drift accurately and timely, and an adaptation mechanism aiming to maintain the performance of the classifier by reacting to the detected drift. Passive approaches (also known as adaptive classifiers) evolve the classifier continuously without an explicit trigger reporting the drift. A comprehensive review of up-to-date techniques tackling concept drift is given by Ditzler et al. [14]. They further organise these techniques based on their core mechanisms, summarized in Table III. This table will help us to understand how online class imbalance algorithms are designed, which will be introduced in detail in Section III. There exist other ways to classify the proposed algorithms, such as Gama et al.'s taxonomy based on the four modules of an adaptive learning system [7], and Webb et al.'s quantitative characterization [77]. This paper adopts the one proposed by Ditzler et al. [14] for its simplicity.
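As one concrete illustration of the change detector in an active approach, the sketch below monitors the running classification error in the style commonly described for DDM [79]: it tracks the error rate p and its standard deviation s, remembers the point where p + s was lowest, and signals a warning or a drift when p + s rises well above that minimum. The thresholds of 2 and 3 standard deviations and the 30-example warm-up are commonly cited defaults assumed here; the class name and the strict inequalities (used to avoid spurious alarms when s is exactly zero) are choices made for this example, not the reference implementation.

```python
import math


class ErrorRateDriftDetector:
    """Error-rate-based change detector in the spirit of DDM (a sketch)."""

    def __init__(self, warning_level=2.0, drift_level=3.0, min_examples=30):
        self.warning_level = warning_level
        self.drift_level = drift_level
        self.min_examples = min_examples
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """`error` is 1 if the latest prediction was wrong, else 0.
        Returns 'stable', 'warning' or 'drift'."""
        self.n += 1
        self.errors += error
        p = self.errors / self.n                  # running error rate
        s = math.sqrt(p * (1.0 - p) / self.n)     # its standard deviation
        if self.n < self.min_examples:
            return "stable"
        if p + s < self.p_min + self.s_min:       # best point seen so far
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.drift_level * self.s_min:
            self.reset()                          # start monitoring the new concept
            return "drift"
        if p + s > self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "stable"


# Synthetic error stream: 10% errors for 200 steps, then 50% errors.
errors = [1 if i % 10 == 0 else 0 for i in range(1, 201)]
errors += [1 if i % 2 == 0 else 0 for i in range(201, 401)]

detector = ErrorRateDriftDetector()
first_drift = None
for step, e in enumerate(errors, start=1):
    if detector.update(e) == "drift":
        first_drift = step
        break
```

Because the monitored quantity is the overall error rate, a detector of this kind is dominated by majority-class performance when the stream is imbalanced.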
The best algorithm varies with the intended applications. A general observation is that, while active approaches are quite effective in detecting abrupt drift, passive approaches are very good at overcoming gradual drift [74] [14]. It is worth noting that most algorithms do not consider class imbalance. It is unclear whether they will remain effective if data becomes imbalanced. For example, some algorithms determine concept drift based on the change in the classification error, including OLIN [78], DDM [79] and PERM [80]. As we have explained in Section II-B 1), the classification error is sensitive to the imbalance degree of data, and does not reflect the performance of the classifier very well when there is class imbalance. Therefore, these algorithms may not perform well when concept drift and class imbalance occur simultaneously. Some other algorithms are specifically designed for data streams coming in batches, such as AUE [81] and the Learn++ family [74]. These algorithms cannot be applied to online cases directly.
(c) How do we evaluate the performance of concept drift detectors and online classifiers?
To fully test the performance of drift detection approaches (especially an active detector), it is necessary to discuss both data with artificial concept drifts and real-world data with unknown drifts. Using data with artificial concept drifts allows us to easily manipulate the type and timing of concept drifts, so as to obtain an in-depth understanding of the performance of approaches under various conditions. Testing on data from real-world problems helps us to understand their effectiveness from the practical point of view, but the information about when and how concept drift occurs is unknown in most cases. The following aspects are usually considered to assess the accuracy of active drift detectors. Their measurement is based on data with artificial concept drifts where drifts are known.
• True detection rate: the probability of detecting the true concept drift. It shows the accuracy of the detection approach.
• False alarm rate: the probability of reporting a concept drift that does not exist (false-positive rate). It characterizes the costs and reliability of the detection approach.
• Delay of detection: an estimate of how many time steps are required on average to detect a drift after the actual occurrence. It reflects how much time would be taken before the drift is detected.
Wang and Abraham [9] use a histogram to visualize the distribution of detection points from the drift detection approach over multiple runs. It reflects all three aspects above in one plot. It is worth noting that there are tradeoffs between these measures. For example, an approach with a high true detection rate may produce a high false alarm rate. A very recent algorithm, Hierarchical Change-Detection Tests (HCDTs), was proposed to explicitly deal with the tradeoff [82].
After the performance of drift detection approaches is better understood, we need to quantify the effect of those detections on the performance of predictive models. All the performance metrics introduced in the previous section on class imbalance can be used. The key question here is how to calculate them in streaming settings with evolving data, where the performance of the classifier may get better or worse from time to time. There are two common ways to depict such performance over time: holdout and prequential evaluation [7].
Holdout evaluation is mostly used when the testing data set (holdout set) is available in advance. At each time step or every few time steps, the performance measures are calculated based on the valid testing set, which must represent the same data concept as the training data at that moment. However, this is a very rigorous requirement for data from real-world applications.
In prequential evaluation, data received at each time step is used for testing before it is used for training. From this, the performance measures can be incrementally updated for evaluation and comparison. This strategy does not require a holdout set, and the model is always tested on unseen data.
When the data stream is stationary, the prequential perfor- If we use the accumulated measure based on all the historical data, the overall accuracy will be 93/110, which seems to be high but does not reflect the true performance on the new data concept. This problem can be solved by using a sliding window or a time-based fading factor that weights observations [83].
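The gap between an accumulated measure and a windowed one can be illustrated with a minimal sketch. The stream below is hypothetical: it is one possible way of arriving at an accumulated accuracy of 93/110 (90 correct predictions out of the first 100, then 3 out of the next 10), and the window length of 10 is likewise an assumption made for illustration.

```python
from collections import deque


def accumulated_and_windowed_accuracy(correct_flags, window=10):
    """Return (accumulated, windowed) prequential accuracy after each step.

    correct_flags holds 1 if the prediction made before training on the
    example was correct, and 0 otherwise.
    """
    recent = deque(maxlen=window)  # keeps only the latest `window` outcomes
    hits = 0
    history = []
    for t, flag in enumerate(correct_flags, start=1):
        hits += flag
        recent.append(flag)
        history.append((hits / t, sum(recent) / len(recent)))
    return history


# Hypothetical outcomes: old concept (90/100 correct), new concept (3/10 correct).
flags = [1] * 90 + [0] * 10 + [1] * 3 + [0] * 7
acc_all, acc_recent = accumulated_and_windowed_accuracy(flags, window=10)[-1]
print(f"accumulated: {acc_all:.3f}, last-10 window: {acc_recent:.3f}")
```

The accumulated value stays high because it is dominated by the old concept, whereas the windowed value only reflects the most recent predictions.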

III. OVERCOMING CLASS IMBALANCE AND CONCEPT DRIFT SIMULTANEOUSLY
Following the review of class imbalance and concept drift in Section II, this section reviews the combined issue, including example applications and existing solutions. When both exist, one problem affects the treatment of the other. For example, the drift detection algorithms based on the traditional classification error may be sensitive to imbalanced degree and become less effective; the class imbalance techniques need to be adaptive to changing P (y), otherwise the class receiving the preferential treatment may not be the correct minority class at the current moment. Therefore, their mutual effect should be considered during the algorithm design.

A. Illustrative Applications
The combined problems of concept drift and class imbalance have been found in many real-world applications. Three examples are given here, to help us understand each type of concept drift.
1) Environment monitoring with P (y) drift: Environment monitoring systems usually consist of various sensors generating streaming data at high speed. Real-time prediction is required. For example, a smart building has sensors deployed to monitor hazardous events. Any sensor fault can cause catastrophic failures. Machine learning algorithms can be used to build models based on the sensor information, aiming to predict faults in sensors accurately and in a timely manner [3]. First, the data is characterized by class imbalance, because obtaining a fault in such systems can be very expensive. Examples representing faults are the minority. Second, the number of faults varies with the faulty condition. If the damage gets worse over time, the faults will occur more and more frequently. This implies a prior probability change, a type of virtual concept drift.
2) Spam filtering with p (x | y) drift: Spam filtering is a typical classification problem involving class imbalance and concept drift [84]. First of all, the spam class is the minority and suffers from a higher misclassification cost. Second, the spammers are actively working on how to break through the filter, which means that the adversary's actions are adaptive. For example, one of the spamming behaviours is to disguise emails by changing their content and presentation, implying a possible class-conditional pdf (p (x | y)) change [7].
3) Social media analysis with P (y | x) drift: Social media (e.g. Twitter, Facebook) is becoming a valuable source of timely information on the internet. It attracts a growing number of people, sharing, communicating, connecting and creating user-generated data. Consider the example where a company would like to make relevant product recommendations to people who have shown some type of interest in their tweets. Machine learning algorithms can be used to discover who is interested in the product from the large number of tweets [85]. The number of users who have shown the interest is always very small. Their information tends to be overwhelmed by other unrelated messages. Thus, it is very important to overcome the imbalanced distribution and discover the hidden information. Another challenge is that users' interest changes from time to time. Users may lose their interest in the current trendy product very quickly, causing posterior probability (P (y | x)) changes.
Although each of the above examples is associated with only one type of concept drift, different types often coexist in real-world problems, and which types are present is hard to know in advance. In the spam filtering example, which emails count as spam also depends on users' interpretation. Users may re-label a particular category of normal emails as spam, which indicates a posterior probability change.

B. Approaches to Tackling Both Class Imbalance and Concept Drift
Some research efforts have been made to address the joint problem of concept drift and class imbalance, due to the rising need from practical problems [86] [1]. Uncorrelated Bagging is one of the earliest algorithms, which builds an ensemble of classifiers trained on a more balanced set of data through resampling and overcomes concept drift passively by weighting the base classifiers based on their discriminative power [87] [88] [89]. Selectively recursive approaches SERA [90] and REA [91] use similar ideas to Uncorrelated Bagging of building an ensemble of weighted classifiers, but with a "smarter" oversampling technique. Learn++.CDS and Learn++.NIE are more recent algorithms, which tackle class imbalance through the oversampling technique SMOTE [32] or a sub-ensemble technique, and overcome concept drift through a dynamic weighting strategy [92]. HUWRS.IP [93] improves HUWRS [94] to deal with imbalanced data streams by introducing an instance propagation scheme based on a Naïve Bayes classifier, and uses Hellinger distance as a weighting measure for concept drift detection. This method relies on finding examples that are similar to the current minority-class concept, which however may not exist. So, Hellinger Distance Decision Tree (HDDT) was proposed to use Hellinger distance as the decision tree splitting criterion, which is imbalance-insensitive [95]. All these approaches belong to chunk-based learning algorithms. Their core techniques work when a batch of data is received at each time step, i.e. they are not suitable for online processing. Developing a true online algorithm for concept drift is very challenging because of the difficulties in measuring minority-class statistics using only one example at a time [14].
To handle class imbalance and concept drift in an online fashion, a few methods have been proposed recently. Drift Detection Method for Online Class Imbalance (DDM-OCI) [8] is one of the very first algorithms detecting concept drift actively in imbalanced data streams online. It monitors the reduction in minority-class recall (i.e. true positive rate). If there is a significant drop, a drift will be reported. It was shown to be effective in cases when minority-class recall is affected by the concept drift, but not when the majority class is mainly affected. A Linear Four Rates (LFR) approach was then proposed to improve DDM-OCI, which monitors four rates from the confusion matrix: minority-class recall and precision, and majority-class recall and precision, with statistically-supported bounds for drift detection [9]. If any of the four rates exceeds the bound, a drift will be confirmed. Instead of tracking several performance rates for each class, prequential AUC (PAUC) [10] [96] was proposed as an overall performance measure for online scenarios, and was used as the concept drift indicator in the Page-Hinkley (PH) test [97]. However, it needs access to historical data. DDM-OCI, LFR and the PAUC-based PH test are active drift detectors designed for imbalanced data streams, and are independent of classification algorithms. They aim at concept drift with classification boundary changes by default. Therefore, if a concept drift is reported, they will reset and retrain the online model. Although these drift detectors are designed for imbalanced data, they themselves do not handle class imbalance. It is still unclear how they perform when working with class imbalance techniques.
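To make the idea of an active detector monitoring a performance indicator more concrete, the following is a minimal sketch of a Page-Hinkley test that raises an alarm on a sustained decrease of the monitored value. It is not the implementation used in [10] or [97]: the class name, the parameter values `delta` and `threshold`, and the synthetic indicator stream (a stand-in for a measure such as PAUC) are all assumptions made for illustration. The model reset that follows an alarm is not shown.

```python
class PageHinkleyDrop:
    """Page-Hinkley test for a sustained decrease in a monitored value."""

    def __init__(self, delta=0.005, threshold=1.0):
        self.delta = delta          # tolerated magnitude of change (hypothetical value)
        self.threshold = threshold  # alarm threshold (hypothetical value)
        self.n = 0
        self.mean = 0.0             # running mean of the monitored value
        self.cum = 0.0              # cumulative deviation from the running mean
        self.cum_max = 0.0          # largest cumulative deviation seen so far

    def update(self, value):
        """Feed one observation; return True if a drift alarm is raised."""
        self.n += 1
        self.mean += (value - self.mean) / self.n
        self.cum += value - self.mean + self.delta
        self.cum_max = max(self.cum_max, self.cum)
        # A large gap below the running maximum signals a persistent drop.
        return self.cum_max - self.cum > self.threshold


# Synthetic indicator: stable at 0.9 for 200 steps, then dropping to 0.6.
stream = [0.9] * 200 + [0.6] * 200
detector = PageHinkleyDrop()
first_alarm = next(
    (t for t, v in enumerate(stream, start=1) if detector.update(v)), None
)
print("first alarm at time step", first_alarm)
```

A smaller `threshold` shortens the detection delay at the price of more false alarms, which is the tradeoff between detection measures discussed in Section II.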
Besides the above active approaches, the perceptron-based algorithms RLSACP [12], ONN [98] and ESOS-ELM [13] adapt the classification model to non-stationary environments passively, and involve mechanisms to overcome class imbalance. RLSACP and ONN are single-model approaches with the same general idea. Their error function for updating the perceptron weights is modified, including a forgetting function for model adaptation and an error weighting strategy as the class imbalance treatment. The forgetting function has a predefined form, allowing the old data concept to be forgotten gradually. The error weights in RLSACP are incrementally updated based either on the classification performance or the imbalance rate from recently received data. It was shown that weight updating based on the imbalance rate leads to better performance. ESOS-ELM is an ensemble approach, maintaining a set of online sequential extreme learning machines (OS-ELM) [99]. For tackling class imbalance, resampling is applied in a way that each OS-ELM is trained with an approximately equal number of minority- and majority-class examples. For tackling concept drift, voting weights of base classifiers are updated according to their G-mean performance on a separate validation data set from the same environment as the current training data. In addition to the passive drift detection technique, ESOS-ELM includes an independent module, ELM-store, to handle recurring concept drift. ELM-store maintains a pool of weighted extreme learning machines (WELM) [65] to retain old information. It adopts a threshold-based technique and hypothesis testing to detect abrupt and gradual concept drift actively. If a concept drift is reported, a new WELM will be built and kept in ELM-store. If any stored model performs better than the current OS-ELM ensemble, indicating a possible recurring concept, it will be introduced in the ensemble. ESOS-ELM assumes the imbalance rate is known in advance and fixed.
It needs a separate data set for initializing OS-ELMs and WELMs, which must include examples from all classes. It is also necessary to have validation data sets reflecting every data concept for concept drift detection, which can be a quite restrictive requirement for real-world data.
With a different goal of concept drift detection from the above, a class imbalance detection (CID) approach was proposed, aiming at P (y) changes [18]. It reports the current imbalance status and provides information of which classes belong to the minority and which classes belong to the majority. Particularly, a key indicator is the real-time class size $w_k^{(t)}$, the percentage of class $c_k$ at time step $t$. When a new example $x_t$ arrives, $w_k^{(t)}$ is incrementally updated by the following equation [18]:

$$w_k^{(t)} = \theta \, w_k^{(t-1)} + (1 - \theta) \, [(x_t, c_k)]$$

where $[(x_t, c_k)] = 1$ if the true class label of $x_t$ is $c_k$, and 0 otherwise. $\theta$ ($0 < \theta < 1$) is a pre-defined time decay (forgetting) factor, which reduces the contribution of older data to the calculation of class sizes along with time. CID is independent of learning algorithms, so it can be used with any type of online classifier. For example, it has been used in OOB and UOB [11] for deciding the resampling rate adaptively and overcoming class imbalance effectively over time. OOB and UOB integrate oversampling and undersampling respectively into the ensemble algorithm Online Bagging (OB) [64].
Oversampling and undersampling are among the simplest and most effective techniques for tackling class imbalance [30].
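The time-decayed class size used by CID can be sketched as follows, assuming the update takes the form of an exponentially decayed running fraction, i.e. each class size is multiplied by θ and the class of the incoming example receives an extra (1 - θ). The function names, the initial class sizes of zero, the value θ = 0.9 and the synthetic label stream are assumptions made for illustration, not settings taken from [18].

```python
def update_class_sizes(sizes, true_label, theta=0.9):
    """One decayed update: w_k <- theta * w_k + (1 - theta) * [label == k]."""
    return {
        k: theta * w + (1.0 - theta) * (1.0 if k == true_label else 0.0)
        for k, w in sizes.items()
    }


def minority_class(sizes):
    """Return the class with the smallest current class size."""
    return min(sizes, key=sizes.get)


# Synthetic stream with a P(y) change: "pos" is 10% of the data in the
# first phase and 90% in the second phase.
phase1 = ["pos" if t % 10 == 0 else "neg" for t in range(500)]
phase2 = ["neg" if t % 10 == 0 else "pos" for t in range(500)]

sizes = {"pos": 0.0, "neg": 0.0}  # initial class sizes (assumed)
for label in phase1:
    sizes = update_class_sizes(sizes, label)
minority_before = minority_class(sizes)

for label in phase2:
    sizes = update_class_sizes(sizes, label)
minority_after = minority_class(sizes)

print(minority_before, "->", minority_after)
```

Because old examples are discounted geometrically, the class sizes follow the swap in class proportions without storing any past data, which is what allows a resampling rate to be chosen adaptively.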
The properties of the above online approaches are summarized in Table IV, answering the following six questions in order:
• How do they handle concept drift (the type based on the categorization in Table III)?
• Do they involve any class imbalance technique to improve the predictive performance of online models, in addition to concept drift detection?
• Do they need access to previously received data?
• Do they need additional data sets for initialisation or validation?
• Can they handle data streams with more than two classes (multi-class data)?
• Do they involve any mechanism handling P (y) drift?

IV. PERFORMANCE ANALYSIS
Having completed the review of online class imbalance learning, we now aim at a deep understanding of concept drift detection in imbalanced data streams and of the performance of the existing approaches introduced in Section III-B. Three research questions will be looked into through experimental analysis: 1) What are the difficulties in detecting each type of concept drift? Little work has given separate discussions on the three fundamental types of concept drift, especially the P (y) drift. It is important to understand their differences, so that the most suitable approaches can be used for the best performance. 2) Among existing approaches designed for imbalanced data streams with concept drift, which approach is better and when? Although a few approaches have been proposed for the purpose of overcoming concept drift and class imbalance, it is still unclear how well they perform for each type of concept drift. 3) Do class imbalance techniques affect concept drift detection and online prediction, and if so, how? No study has looked into the mutual effect of applying class imbalance techniques and concept drift detection methods. Understanding the role of class imbalance techniques will help us to develop more effective concept drift detection methods for imbalanced data.

A. Data Sets
For an accurate analysis and comparable results, we choose two most commonly used artificial data generators, SINE1 [79] and SEA [100], to produce imbalanced data streams containing three simulated types of concept drift. This is one of the very few studies that individually discuss P (y), p (x | y) and P (y | x) types of concept drift in depth. In addition, each generator produces two data streams with a different drifting speed -abrupt and gradual drifts. The drifting speed is defined as the inverse of the time taken for a new concept to completely replace the old one [72]. According to speed, drifts can be either abrupt, when the generating function is changed completely in only one time step, or gradual, otherwise. The data streams with a gradual concept drift are denoted by 'g' in the following experiment, i.e. SINE1g [76] and SEAg. Every data stream has 3000 time steps, with one concept drift starting at time step 1501. The new concept in SINE1 and SEA fully takes over the data stream from time step 1501; the concept drift in SINE1g and SEAg takes 500 time steps to complete, which means that the new concept fully replaces the old one from time step 2001. The detailed settings for generating each type of concept drift are included in the individual subsections.
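The timing of the abrupt and gradual drifts described above can be sketched as a schedule that decides, at each time step, whether an example is drawn from the old or the new concept. The linear increase of the new-concept probability during a gradual drift is an assumption made for illustration, as are the function names and the seed; the sketch does not reproduce the SINE1 or SEA generators or their class imbalance settings.

```python
import random


def new_concept_probability(t, drift_start=1501, duration=0):
    """Probability that the example at time step t follows the new concept.

    duration=0 gives an abrupt drift; duration=500 gives a gradual drift
    that is complete from time step drift_start + 500.
    """
    if t < drift_start:
        return 0.0
    if duration == 0 or t >= drift_start + duration:
        return 1.0
    return (t - drift_start) / duration  # linear ramp (assumed)


def concept_schedule(n_steps=3000, drift_start=1501, duration=0, seed=0):
    """Return a list with 0 (old concept) or 1 (new concept) per time step."""
    rng = random.Random(seed)
    return [
        int(rng.random() < new_concept_probability(t, drift_start, duration))
        for t in range(1, n_steps + 1)
    ]


abrupt = concept_schedule(duration=0)
gradual = concept_schedule(duration=500)
print(sum(abrupt), sum(gradual))
```

In the gradual schedule, both concepts are active between time steps 1501 and 2000, and only the new concept generates data from time step 2001 onwards.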
After the detailed analysis of the three types of concept drift, three real-world data sets with unknown concept drift are included in our experiment: PAKDD 2009 credit card data (PAKDD) [101], Weather data [75] and UDI TweeterCrawl data [102]. Data in PAKDD are collected from the private label credit card operation of a Brazilian retail chain. The task of this problem is to identify whether the client has good or bad credit. The "bad" credit is the minority class, taking 19.75% of the provided modelling data. Because the data have been collected from a time interval in the past, gradual market change occurs. The Weather data set aims to predict whether rain precipitation was observed on each day, with inherent seasonal changes. The class of "rain" is the minority, at an IR of 31%. The original Tweet data include 50 million tweets posted mainly from 2008 to 2011. The task is to predict the tweet topic. We choose a time interval, containing 8774 examples and covering seven tweet topics [103]. Then, we further reduce it to 2-class data by using only two out of seven topics for our experiment. These real-world data will help us to understand the effectiveness of existing concept drift and class imbalance approaches in practical scenarios, which usually have more complex data distributions and concept drift.

B. Experimental and Evaluation Settings
The approaches listed in Table IV, which are explicitly designed for the combined problem of class imbalance and concept drift, are discussed in our experiment. The three active drift detection methods (DDM-OCI, LFR and PAUC-PH) are each used with both the traditional Online Bagging (abbr. OB) [64] and OOB with CID [11] for classification. Because OOB applies oversampling to overcome class imbalance and OB does not, this comparison helps us to observe the role of class imbalance techniques (oversampling in our experiment) in concept drift detection. UOB is not chosen, because undersampling may cause unstable performance, which may indirectly affect our observation [11]. Between RLSACP and ONN, due to their similarity and the stronger theoretical support for RLSACP, only RLSACP is included in our experiment.
Considering RLSACP and ESOS-ELM are perceptron-based methods, we use the Multilayer Perceptron (MLP) classifier as the base learner of OB and OOB. The number of neurons in the hidden layer of MLPs is set to the average of the number of attributes and classes in data, which is also the number of perceptrons in RLSACP and ESOS-ELM. All ensemble methods maintain 15 base learners. For ESOS-ELM, we disable the "ELM-Store", which is designed for recurring concept drift; we allow its ensemble size to grow to 20. In addition, ESOS-ELM requires an initialisation data set to initialize ELMs, and validation data sets to adjust misclassification costs. When dealing with artificial data, we use the first 100 examples to initialize ESOS-ELM, and generate a separate validation data set for each concept stage. We track the performance of all the methods from time step 101. In summary, ten algorithms join the comparison from Table IV.

To evaluate the effectiveness of concept drift detection methods and online learners, we adopt the prequential test (as described in Section II) for its simplicity and popularity. Prequential recall of each class (defined in Eq. 1) and prequential G-mean (defined in Eq. 4) are tracked over time for comparison, because they are insensitive to imbalance rates. When discussing the generated artificial data sets with ground truth known, we also compare the true detection rate (abbr. TDR), total number of false alarms (abbr. FA) and delay of detection (abbr. DoD) (as defined in Section II) among methods using any of the three active drift detectors (i.e. DDM-OCI, LFR and PAUC-PH). The calculation of TDR, FA and DoD is based on the following understanding: before a real concept drift occurs, all the reported alarms are considered as false alarms; after a real concept drift occurs, the first detection is seen as the true alarm; after that and before the next real concept drift, the subsequent detections are considered as false alarms.
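The counting rule for alarms stated above can be written down as a short sketch for data streams with a single known drift. The function names and the example alarm times are hypothetical; how the per-run values are aggregated (TDR as the fraction of runs with a true detection, FA as a total over runs, DoD averaged over the runs in which the drift was detected) is an assumption made for illustration.

```python
def score_run(alarms, drift_start):
    """Score the alarms of one run against a single known drift.

    Alarms before drift_start are false alarms; the first alarm at or after
    drift_start is the true detection; any later alarm is a false alarm.
    Returns (detected, number_of_false_alarms, delay_or_None).
    """
    alarms = sorted(alarms)
    after = [t for t in alarms if t >= drift_start]
    detected = len(after) > 0
    delay = after[0] - drift_start if detected else None
    false_alarms = len(alarms) - (1 if detected else 0)
    return detected, false_alarms, delay


def summarize(runs, drift_start):
    """Aggregate TDR, total FA and mean DoD over several runs."""
    scores = [score_run(alarms, drift_start) for alarms in runs]
    tdr = sum(1 for detected, _, _ in scores if detected) / len(scores)
    total_fa = sum(fa for _, fa, _ in scores)
    delays = [delay for detected, _, delay in scores if detected]
    mean_dod = sum(delays) / len(delays) if delays else None
    return tdr, total_fa, mean_dod


# Hypothetical alarm time steps from three runs; the drift starts at 1501.
runs = [[1200, 1530, 1800], [1510], []]
tdr, total_fa, mean_dod = summarize(runs, drift_start=1501)
print(tdr, total_fa, mean_dod)
```

In this example the first run contributes one early and one late false alarm, the second run detects the drift without false alarms, and the third run misses the drift.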
Furthermore, because we are particularly interested in how the learner performs on the new data concept in the artificial data sets, we calculate the average recall and G-mean over all the time steps before the concept drift starts and after the concept drift completely ends. It is worth noting that the recall and G-mean values are reset to 0 when the drift starts and ends for an accurate analysis. We use the Wilcoxon signed-rank test at the confidence level of 95% as our significance test in this paper.

C.1. P (y) Concept Drift
This section focuses on the P (y) type of concept drift, without p (x | y) and P (y | x) changes. Data streams SINE1 and SINE1g have a severe class imbalance change, in which the minority (majority) class during the first half of data streams becomes the majority (minority) during the latter half. SEA and SEAg have a less severe change, in which the data stream is balanced during the first half and becomes imbalanced during the latter half. The concrete setting for each data stream is summarized in Table V. Table VI compares the detection performance of the three active concept drift detectors, in terms of TDR, FA and DoD. The first column is the data ID number, as denoted in Table V. We can see that DDM-OCI and LFR are sensitive to class imbalance changes in data. They present a very high true detection rate; in particular, LFR has 100% TDR in all cases regardless of whether resampling is used to tackle class imbalance. PAUC-PH does not report any concept drift, showing 0% TDR in all cases. This is because DDM-OCI and LFR use time-decayed metrics as the indicator of concept drift, which have higher sensitivity to performance change in general than the prequential AUC used by PAUC-PH. LFR shows even higher TDR than DDM-OCI, because it tracks four rates in the confusion matrix instead of one. For the same reason, DDM-OCI and LFR have a higher chance of issuing false alarms than PAUC-PH. For DDM-OCI, the TDR on SEA and SEAg shows that oversampling in OOB increases the probability of reporting a concept drift, compared to OB. This is because more examples are used for training in OOB, which improves the performance on the minority class for concept drift detection. Table VII compares recall and G-mean of all models over the new data concept, i.e.
performance over time steps 1501-3000 for data streams with an abrupt change and performance over time steps 2001-3000 for data streams with a gradual change, showing whether and how well the drift detector can help with learning after concept drift is completed. The first column is the data ID number, as denoted in Table V. In SINE1 and SINE1g, the negative class is the minority after the change; in SEA and SEAg, the positive class is the minority after the change. In terms of minority-class recall, we can see that ESOS-ELM performs significantly best, but ESOS-ELM sacrifices majority-class recall, especially in SINE1 and SINE1g. In terms of G-mean, OOB and OOB using PAUC-PH perform significantly best, which shows they can best balance the performance between classes. It is worth noting that PAUC-PH is the drift detection method with 0% TDR based on Table VI. It means that OOB plays the main role in learning. It also explains why OOB and OOB using PAUC-PH have very close performance. None of the OB and OOB models using the other active drift detectors shows competitive recall and G-mean. Especially for those using DDM-OCI and LFR, the high number of false alarms causes too much resetting and performance loss; OOB can increase the chance of producing a false alarm, because more minority-class examples join the training.
Therefore, we conclude that, for P (y) type of concept drift, it is not necessary to apply any drift detection techniques that are not specifically designed for class imbalance changes; the use of these drift detectors could be even detrimental to the predictive performance due to false alarms and performance resetting; the adaptive resampling in OOB is sufficient to deal with the change and maintain the predictive performance; when using OOB with other active concept drift detectors, the number of false alarms and performance resetting need to be carefully considered.

C.2. p (x | y) Concept Drift
The data streams in this section only involve p (x | y) type of concept drift, without P (y) and P (y | x) changes. The class imbalance ratio is fixed to 1:9 and we let the positive class be the minority, so that the data stream is constantly imbalanced. The concept drift in each data stream is controlled by p (x) of the negative class, as shown in Table VIII. Table IX compares the detection performance of the three active concept drift detectors. Similar to our previous results, DDM-OCI and LFR are more sensitive to p (x | y) changes than PAUC-PH. When DDM-OCI and LFR work with OOB, their TDR is 100%; and LFR has higher FA and shorter DoD than DDM-OCI, due to the larger number of indicators it monitors. PAUC-PH shows 0% TDR in most cases of working with both OB and OOB. Different from P (y) changes, when DDM-OCI and LFR work with OB, their TDR is rather low, which suggests that their sensitivity is dependent on the class imbalance techniques. Unlike the cases with class imbalance changes, where it is possible for the minority-class examples to become more frequent, the data streams generated in this section have a fixed minority class with a constantly small prior probability. In other words, it would be more difficult to recognize examples from this minority class, which indirectly affects the detection sensitivity of DDM-OCI and LFR. When oversampling is applied, which introduces more training examples for the minority class, the performance metrics (G-mean, recall and precision) monitored by DDM-OCI and LFR can be substantially improved. It also increases the possibility of reporting a concept drift. This explains the low detection rate of DDM-OCI and LFR when working with OB and their high detection rate when working with OOB. Table X compares recall and G-mean of all models over the new data concept. As we expected, almost all OB models show significantly worse minority-class recall and G-mean. On SINE1 and SINE1g data, minority-class recall of OB models is as low as 0, which may hinder the detection of any concept drift.
Among the OOB models, those using DDM-OCI and LFR perform significantly worse than OOB using PAUC-PH and OOB itself, and the latter two show very close performance. This is because DDM-OCI and LFR trigger concept drift with false alarms, and cause model resetting multiple times. Along with the resetting, the useful and valid information learnt in the past is forgotten at the same time.
For the two passive models, RLSACP and ESOS-ELM do not perform very well compared to OOB. Generally speaking, for imbalanced data streams with p (x | y) changes, class imbalance seems to be a more important issue than concept drift, considering that the learning model without triggering any concept drift detection achieves the best performance. Besides, while the adopted class imbalance technique can improve the final prediction, it can also improve the performance of active concept drift detection methods, depending on their working mechanism.

C.3. P (y | x) Concept Drift
The data streams in this section only involve P (y | x) type of concept drift, without P (y) and p (x | y) changes. Following the settings in Section IV-C.2, we fix the class imbalance ratio to 1:9 and let the positive class be the minority, so that the data stream is constantly imbalanced. As shown in Table XI, the data distribution in SINE1 and SINE1g involves a concept swap, and this change occurs probabilistically in SINE1g; the data distribution in SEA and SEAg has a concept threshold moving, and this change occurs continuously in SEAg. The change in SEA and SEAg is less severe than the change in SINE1 and SINE1g, because some of the examples from the old concept are still valid under the new concept after the threshold moves completely. The concept drift discussed in this section belongs to the real concept drift category, which affects the classification boundary and is expected to be captured by all concept drift detectors. According to Table XII, we can see that DDM-OCI and LFR have difficulty in detecting the concept drift when working with OB, because of the poor recall and G-mean produced by OB, which is also observed and explained in Section IV-C.2. When DDM-OCI and LFR work with OOB, their detection rate TDR is greatly improved (above 90% in most cases). This is because the improved performance metrics facilitate the detection. LFR is more sensitive to the change, which produces higher FA and shorter DoD. Different from previous observations in terms of concept drift detection performance, PAUC-PH working with OB produces 100% TDR and low FA on data streams SINE1 and SINE1g, but PAUC-PH does not work well with OOB on the same data. It is interesting to see that oversampling does not always play a positive role in drift detection. 
One possible reason is that class imbalance techniques may sometimes hide the performance drop caused by the real concept drift while they try to maintain the overall predictive performance, especially for AUC-type metrics in our case. On data streams SEA and SEAg, PAUC-PH does not report any concept drift, probably due to the less severe concept drift.
The recall and G-mean over the new data concept in Table XIII further confirm the above analysis. The OB models produce very low minority-class recall and thus low G-mean. RLSACP and ESOS-ELM do not perform well on the new data concept either. By comparing the models that capture concept drifts (DDM-OCI+OOB, LFR+OOB, PAUC-PH+OB) and the models without reporting any concept drift (PAUC-PH+OOB, OOB), it seems that class imbalance causes a more difficult learning issue than the real concept drift in our cases. The models solely tackling class imbalance produce the significantly best recall and G-mean. The rather low imbalance ratio (i.e. 1:9) could be a reason. It would be worth discussing various imbalance levels in data with concept drift in our future work, in order to find out when it is worthwhile considering concept drift in imbalanced data streams. By comparing the results in Table XIII, Table X and Table VII, the P (y | x) type of concept drift indeed leads to the largest performance reduction. It is consistent with our understanding that the real concept drift is the most radical type of change in data. However, existing approaches do not seem to tackle it well when data streams are very imbalanced. To develop better concept drift detection methods, the key issues here include how best to make them work together with class imbalance techniques and how to tackle the performance loss brought by false alarms.

D. Comparative Study on Real-World Data
After the detailed analysis of the three types of concept drift, we now look into the performance of the above learning models on the three real-world data sets (PAKDD [101], Weather [75] and Tweet [102]) described in Section IV-A. Based on the experimental results on the artificial data, we focus on the best active concept drift detection method (PAUC-PH+OOB) and the best passive one (ESOS-ELM) here for a clear observation, in comparison with OOB. The three methods use the same parameter settings as before. The initialisation and validation data required by ESOS-ELM are the first 2% of the examples of each data set.
Without knowing the true concept drifts in real-world data, we calculate and track the time-decayed G-mean by setting the decay factor to 0.995, which means that the old performance is forgotten at the rate of 0.5%. All the compared metrics in the following figures are the average of 100 runs. Fig. 3 presents the time-decayed G-mean curves from OOB, PAUC-PH+OOB and ESOS-ELM on the three real-world data sets. In the PAKDD plot, we can see that the G-mean level is relatively stable without any significant drop; in contrast, G-mean in the Tweet plot declines over time. This may suggest that the concept drift in PAKDD is less significant or influential than that in Tweet. Compared to the gradual market and environment change in PAKDD, the tweet topic change can be much faster and more noticeable. Therefore, although PAUC-PH detects 3 concept drifts in PAKDD, the two methods, OOB and PAUC-PH+OOB, do not show much difference. On Tweet, PAUC-PH+OOB presents better G-mean than using OOB alone, showing the positive effect of the active concept drift detector in fast changing data streams.
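The time-decayed G-mean tracked above can be sketched as follows. This is a minimal illustration rather than the exact evaluation code: it assumes that the recall of a class is updated only when an example of that class arrives, that the first example of a class initialises its recall, and that G-mean is reported as 0 until every class has been seen. The class name `TimeDecayedGMean` and its interface are hypothetical.

```python
import math


class TimeDecayedGMean:
    """Tracks per-class time-decayed recall and their geometric mean."""

    def __init__(self, decay=0.995, classes=(0, 1)):
        self.decay = decay
        # Recall per class; None until the first example of that class arrives
        # (an assumption of this sketch).
        self.recall = {c: None for c in classes}

    def update(self, y_true, y_pred):
        """Update the recall of the true class and return the current G-mean."""
        hit = 1.0 if y_pred == y_true else 0.0
        prev = self.recall[y_true]
        if prev is None:
            self.recall[y_true] = hit
        else:
            # Old performance keeps weight `decay`; the newest outcome
            # contributes weight (1 - decay), i.e. 0.5% for decay = 0.995.
            self.recall[y_true] = self.decay * prev + (1.0 - self.decay) * hit
        return self.gmean()

    def gmean(self):
        values = list(self.recall.values())
        if any(v is None for v in values):
            return 0.0  # undefined until every class has been observed
        return math.prod(values) ** (1.0 / len(values))
```

Because G-mean is a product of class recalls, a model that ignores the minority class scores close to 0 however accurate it is on the majority class, which is why the metric is suited to tracking performance on imbalanced streams.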

E. Further Discussions
In this section, we summarize and further discuss the results in the above comparative study on the artificial and real-world data. We also answer the research questions proposed at the beginning of this paper. When dealing with imbalanced data streams with concept drift, we have obtained the following:
• When both class imbalance and concept drift exist, class imbalance status and class imbalance changes are shown to be more crucial issues than the traditional concept drift (i.e. p (x | y) and P (y | x) changes) in terms of the online prediction performance. It is necessary to adopt adaptive class imbalance techniques (e.g. OOB discussed in our experiment), in addition to using concept drift detection methods alone (e.g. DDM-OCI, LFR). Most existing papers that proposed new concept drift detection methods for imbalanced data so far did not consider the effect of class imbalance techniques on final prediction and concept drift detection.
• P (y | x) concept drift (i.e. real concept drift) is the most severe type of change in data, compared to p (x | y) and P (y) concept drift. This is based on the observation on the final prediction performance. For all three types of concept drift, existing concept drift approaches do not show much benefit in performance improvement. Concept drift is hard to detect when no class imbalance technique is applied. The drift detection performance of these approaches is affected by the class imbalance technique, depending on their detection mechanism.
• For P (y) concept drift, it is not necessary to apply any concept drift detection methods that are not designed for class imbalance changes, due to their false alarms and model resetting. It is crucial to detect and handle the class imbalance change in time.
• From the results on real-world data, we see that the effectiveness of traditional concept drift detectors (e.g. PAUC-PH) depends on the type of concept drift. For fast and significant concept drift, applying PAUC-PH seems to be more beneficial to the prediction performance.
• Among existing methods designed for imbalanced data with concept drift (4 active methods and 2 passive methods), the passive methods (i.e. ESOS-ELM and RLSACP) do not perform well in general. Although they contain both class imbalance and concept drift techniques, firstly, their class imbalance technique is not effectively adaptive to class imbalance changes, so that a wrong imbalance status might be used during learning; secondly, they are restricted to the use of certain perceptron-based classifiers, so that the disadvantages of the classifiers are also inherited by the online model. For example, the training of OS-ELM in ESOS-ELM requires initialisation and validation data sets reflecting the correct data concepts, and the weighted OS-ELM was found in earlier studies to over-emphasize the minority class and sometimes present large performance variance [11].
• Among the three active methods discussed in this work, which are DDM-OCI, LFR and PAUC-PH, DDM-OCI and LFR are more sensitive to concept drift than PAUC-PH, with a higher detection rate but also higher false alarms. In addition, the detection performance of DDM-OCI and LFR can be greatly improved by OOB. The explanation can be found in the previous analysis.
Overall, all these results suggest that class imbalance and concept drift need to be studied simultaneously when we design an algorithm to deal with imbalanced data with concept drift. Their mutual effect must be taken into consideration. Hence, we propose the following key issues to be considered for an effective algorithm:
This paper gives the first systematic study of handling concept drift in class-imbalanced data streams. In the context of online learning, we provide a thorough review and an experimental insight into this problem.
First, a comprehensive review is given, including the problem description and definitions, the individual learning issues and solutions in class imbalance and concept drift respectively, the combined challenges and existing solutions in online class imbalance learning with concept drift, and example applications. The review reveals research gaps in the field of online class imbalance learning with concept drift. Specifically, little work has looked into the concept drift issue in imbalanced data streams systematically, although a few methods have been proposed for this purpose; P (y) type of concept drift is closely related to the class imbalance issue, but it has not been investigated properly so far; most existing concept drift detection methods are only designed for or tested on balanced data streams.
Second, to fill in these research gaps, we carry out a thorough empirical study by looking into the following research questions: 1) What are the challenges in detecting each type of concept drift when the data stream is imbalanced (i.e. changes in P (y), p (x | y), and P (y | x))? 2) Among the proposed methods designed for online class imbalance learning with concept drift, i.e. DDM-OCI [8], LFR [9], PAUC-PH [10], OOB [11], RLSACP [12] and ESOS-ELM [13], which one performs better for which type of concept drift? 3) Would applying class imbalance techniques (e.g. resampling methods) facilitate the concept drift detection and online prediction? By generating artificial data streams with different types of class imbalance and concept drift and experimenting on real-world data, we draw the following conclusions.
For the first research question, a P (y) change can be easily tackled by an adaptive class imbalance technique (e.g. OOB used in this work). The traditional concept drift detectors, such as LFR, DDM-OCI and PAUC-PH, do not perform well in detecting a p (x | y) change. The prediction performance on an imbalanced data stream with p (x | y) changes can be effectively improved by solely using an adaptive class imbalance technique. A P (y | x) change is the most challenging case for learning, where the traditional active and passive concept drift detection methods do not bring much performance improvement. Class imbalance is shown to be a more crucial issue in terms of final prediction performance.
For the second research question, the two passive methods, RLSACP and ESOS-ELM, do not perform well in general. DDM-OCI and LFR are sensitive to different types of concept drift, with a high detection rate but also high false alarms. PAUC-PH is more conservative in terms of drift detection. Based on the observation on minority-class recall and G-mean, the combination of PAUC-PH and OOB was shown to be the best approach among all.
For the third research question, it is necessary to apply adaptive class imbalance techniques when learning from imbalanced data streams with concept drift, as they bring the most prediction performance improvement. In our experiment, the class imbalance technique OOB facilitates the concept drift detection of DDM-OCI and LFR.
This paper also provides guidelines for future algorithm design. Several important issues are pointed out for consideration. There are still many challenges and learning issues in this field that are worthy of ongoing research, such as more effective concept drift detection methods for imbalanced data streams, studying the mutual effect of class imbalance and concept drift, and more real-world applications with different types of class imbalance and concept drift.