Online Learning With Adaptive Rebalancing in Nonstationary Environments

An enormous and ever-growing volume of data is nowadays becoming available in a sequential fashion in various real-world applications. Learning in nonstationary environments constitutes a major challenge, and this problem becomes orders of magnitude more complex in the presence of class imbalance. We provide new insights into learning from nonstationary and imbalanced data in online learning, a largely unexplored area. We propose the novel Adaptive REBAlancing (AREBA) algorithm that selectively includes in the training set a subset of the majority and minority examples that appeared so far, while at its heart lies an adaptive mechanism to continually maintain the class balance between the selected examples. We compare AREBA with strong baselines and other state-of-the-art algorithms and perform extensive experimental work in scenarios with various class imbalance rates and different concept drift types on both synthetic and real-world data. AREBA significantly outperforms the rest with respect to both learning speed and learning quality. Our code is made publicly available to the scientific community.


I. INTRODUCTION
Efficient and effective analysis methods for the ever-increasing volume of sequential data in a wide range of applications are of paramount importance. In practical applications, data is evolving or drifting over time, i.e., data is drawn from nonstationary distributions. Various factors can trigger a nonstationarity effect or concept drift, for example, seasonality or periodicity effects, changes in users' habits, interests or preferences, and hardware or software faults [1]. Learning in nonstationary environments constitutes a major challenge. In such environments, a classifier with learning capabilities is of vital importance as it will provide adaptive behaviour and help maintain optimal performance. The problem becomes significantly more complex if class imbalance co-exists with concept drift. In this context, class imbalance refers to sequential data with skewed class distributions; it is a difficult problem because it renders a traditional learning algorithm ineffective, owing to poor generalisation ability and weak prediction power for the minority class examples [2].
Learning from nonstationary and imbalanced data has been studied separately, but several key challenges remain open when the joint problem is considered. The majority of existing work focuses on batch (or chunk-by-chunk) learning, i.e., when examples arrive in batches (or chunks). In this paper we address the combined challenges of drift and imbalance in online (or one-by-one) learning, i.e., when a single example arrives at each step. The design of batch learning algorithms differs significantly from that of online learning and, therefore, the majority are typically unsuitable for online learning tasks [3]. Addressing these key challenges can have a significant impact in various application areas, e.g., in critical infrastructure systems, smart buildings, finance and banking, security and crime, healthcare and environmental sciences [3]-[6].
The desired properties of an online classifier learning from nonstationary and imbalanced data are [4], [7]: (1) Learning new knowledge: the classifier should learn novel knowledge as new data arrives. (2) Preserving previous knowledge: being able to preserve previous knowledge relies on the ability of the classifier to determine what previous knowledge is still relevant (and hence to preserve it) and what has now become irrelevant (and hence to discard or "forget" it). (3) High performance: the classifier should obtain high performance on both the majority and minority classes. (4) Fast operation: the classifier should operate in less than the example (or batch) arrival time. (5) Fixed storage: the classifier should use no more than a fixed amount of memory for any storage; ideally, it should be capable of incremental learning, i.e., when learning occurs on a single instance (or batch) without considering (and hence storing) previous data. Balancing the trade-offs between the aforementioned properties is a challenging task.
The contributions made are: (i) We provide new insights into learning from nonstationary and imbalanced data, a largely unexplored area which focuses on the combined challenges of class imbalance and concept drift in online learning. (ii) We propose the novel Adaptive REBAlancing (AREBA) algorithm that maintains the aforementioned desired properties. AREBA selectively includes in the training set a subset of the positive and negative examples that appeared so far, while at its heart lies an adaptive rebalancing mechanism to continually maintain class balance. (iii) We compare AREBA to strong baselines and state-of-the-art algorithms and perform extensive experimental work in scenarios with various imbalance rates and different drift types on both synthetic and real-world data. AREBA significantly outperforms the rest with respect to both learning speed and learning quality. (iv) To our knowledge, this paper is one of the very few studies that examines online imbalance learning under each type of drift independently. For reproducibility of our results we make the datasets used and our code publicly available to the community.
The organisation of this paper is as follows. Section II provides the background material necessary to understand the contributions made. Section III provides an in-depth review of related work. AREBA is presented in Section IV. Our experimental setup is described in Section V. An analysis of the proposed method is given in Section VI, followed by a comprehensive comparative study in Section VII. We conclude in Section VIII, where we discuss some important remarks, the pros and cons of AREBA, and pointers for future work.

II. BACKGROUND
We consider a data generating process that provides at each time step t a sequence S_t of M examples or instances drawn from an unknown probability distribution p_t(x, y), where x_t ∈ R^d is a d-dimensional input vector belonging to input space X ⊂ R^d, and y_t ∈ Y is the class label with Y = {0, 1}. The focus of this paper is on binary classification and, as a convention, the positive class represents the minority class. When the observed sequence S_t consists of only a single instance (i.e., M = 1), it is termed online (or one-by-one) learning; otherwise it is termed batch (or chunk-by-chunk) learning [1]. The design of batch learning algorithms differs significantly from that of online learning as they are designed to process batches of data, possibly by utilising an offline learning algorithm [3]. Therefore, the majority of batch learning algorithms are typically not suitable for online learning tasks [3]. This work focuses on online learning.
An online classifier receives a new example x_t at time step t and makes a prediction ŷ_t based on a concept h : X → Y such that ŷ_t = h(x_t). The classifier then receives the true label y_t, its performance is evaluated using a loss function, and it is then trained, i.e., its parameters are updated accordingly based on the loss incurred. This process is repeated at each time step. Depending on the application, new examples do not necessarily arrive at regular and pre-defined intervals.
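This predict-then-train protocol can be sketched in a few lines of Python (a minimal illustration only; the logistic classifier and its interface below are our own stand-ins, not any of the methods compared later):

```python
import math

class OnlineLogistic:
    """Minimal online logistic classifier used only to illustrate the loop."""

    def __init__(self, d, lr=0.1):
        self.w, self.b, self.lr = [0.0] * d, 0.0, lr

    def predict_proba(self, x):
        z = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1.0 / (1.0 + math.exp(-z))

    def predict(self, x):
        return 1 if self.predict_proba(x) >= 0.5 else 0

    def train(self, x, y):
        err = self.predict_proba(x) - y          # gradient of the log loss
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err

# predict-then-train: predict y_hat_t, receive y_t, incur the loss, update once
clf = OnlineLogistic(d=2)
stream = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 0.2], 1)]
for x_t, y_t in stream:
    y_hat = clf.predict(x_t)                     # prediction before the label
    clf.train(x_t, y_t)                          # incremental parameter update
```

Note that the prediction is always made before the true label is revealed, which is what makes prequential evaluation possible.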
If data is sampled from a long, potentially infinite, sequence, which is typically the case for big data applications, it is unrealistic to expect that all the previously observed data will always be available. If learning occurs on the most recent single instance only, without taking into account previously observed data, it is termed incremental learning [1]. For online and incremental learning, the cost at time t is calculated using the loss function l as follows: J = l(y_t, ŷ_t).
This framework is suitable for human-in-the-loop learning. As mentioned, the label becomes available as the next example arrives, i.e., verification latency does not exist. Algorithms of this framework, including AREBA, are typically trained from user interaction by domain experts. Various widely studied domains fit into this framework and hence satisfy this assumption: in financial fraud [6], a banker can indicate every few minutes whether a credit card transaction is fraudulent; for rain prediction [4], an expert can report every few hours whether rain precipitation was observed; in healthcare applications [8], a doctor can provide the x-ray result as soon as it is completed. The framework may not be ideal in some cases (e.g., for real-time applications). Relaxing this assumption will be part of our future work, but we take a first step towards this direction and examine the robustness of all algorithms under conditions where this assumption is violated.
According to Bayesian decision theory, a classification problem can be described by the prior probability p(y) and the class-conditional probability or likelihood p(x|y) for all classes y [7]. The classification decision is made according to the posterior probability, which for class y is expressed as:

p(y|x) = p(x|y)p(y) / p(x),

where p(x) = Σ_{y∈{0,1}} p(x|y)p(y). Class imbalance [2] is a key challenge in learning and occurs when at least one data class is under-represented, thus constituting a minority class. For a binary classification problem, class imbalance at time t occurs if:

p_t(y = 1) < p_t(y = 0),

where class 0 (negative) and class 1 (positive) represent the majority and minority class respectively. Concept drift represents a change in the joint probability, and the drift between time steps t_0 and t_1 is defined as follows:

∃x : p_{t_0}(x, y) ≠ p_{t_1}(x, y).

Concept drift can occur in three forms: (i) a change in the prior probability p(y), (ii) a change in the class-conditional probability or likelihood p(x|y), and (iii) a change in the posterior probability p(y|x). In real-world applications, the three forms can appear together. A change in the posterior probability p(y|x), which may or may not be due to a change in p(x), is known as real drift because the true decision boundary changes. A change in the distribution of the incoming data p(x) without affecting p(y|x) is known as virtual drift because the true decision boundary remains unchanged; however, the classifier's learnt decision boundary may drift away from the true one. While other drift characteristics [9] are important, our focus is on the drift type. As mentioned, along with [3], this paper is one of the few studies that examines online imbalance learning under each type of concept drift independently.
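The Bayesian decision rule can be made concrete with a toy computation (the 1-D likelihood functions and the skewed prior below are invented purely for illustration); it also shows how an imbalanced prior pulls the decision towards the majority class even when the likelihood favours the minority class:

```python
def posterior(prior, likelihood, x):
    """Bayes rule: p(y|x) = p(x|y) p(y) / p(x), with p(x) = sum_y p(x|y) p(y)."""
    evidence = sum(likelihood[y](x) * prior[y] for y in (0, 1))
    return {y: likelihood[y](x) * prior[y] / evidence for y in (0, 1)}

# invented 1-D likelihoods and a skewed prior, p(y=1) << p(y=0)
prior = {0: 0.9, 1: 0.1}
likelihood = {0: lambda x: 1.0 if x < 0.5 else 0.2,
              1: lambda x: 0.2 if x < 0.5 else 1.0}

post = posterior(prior, likelihood, x=0.8)
# at x = 0.8 the likelihood of class 1 is five times larger, yet the skewed
# prior still makes class 0 the Bayes decision
```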

III. RELATED WORK
This section provides an in-depth review of related work and describes the state-of-the-art methods. For exhaustive surveys, the interested reader is directed towards these excellent papers: [1], [7] for drift methods, [2] for imbalance methods, [3] for methods that address both, and [10] for online ensembles.

A. Concept drift
Concept drift algorithms are classified as memory-based, change detection-based and ensembling [7].
1) Memory-based: algorithms typically employ a sliding window approach to maintain a set of recent examples that a classifier is trained on; a representative algorithm of this category is FLORA [11]. A key challenge is to determine a priori the window size, as a larger window is better suited for a gradual drift while a smaller window is suitable for an abrupt drift. To address this, methods use an adaptive sliding window [11] or multiple sliding windows [12]. This is also known as an abrupt forgetting approach because the examples that fall outside the window are immediately dropped out of memory. Alternatively, gradual forgetting approaches employ the full memory but the influence of older examples deteriorates, e.g., using an exponential decay weighting strategy [13].
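The abrupt-forgetting behaviour of a sliding window can be sketched as follows (a generic illustration of the window memory, not FLORA itself):

```python
from collections import deque

class SlidingWindowMemory:
    """Abrupt-forgetting sketch: keep only the W most recent examples; the
    classifier is (re)trained on this window at each step."""

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)  # oldest examples drop out

    def observe(self, x, y):
        self.window.append((x, y))

    def training_set(self):
        return list(self.window)

memory = SlidingWindowMemory(window_size=3)
for i in range(5):
    memory.observe([float(i)], i % 2)
# only the 3 most recent examples (i = 2, 3, 4) remain in memory
```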
2) Change detection-based: algorithms that employ explicit mechanisms to detect concept drift. These methods are based on sequential analysis and control charts, e.g., the Page-Hinkley (PH) test [14], the cumulative sum (CUSUM) [14] and just-in-time (JIT) classifiers [15], [16], and on monitoring two distributions, e.g., adaptive windowing (ADWIN) [17]. These approaches are also known as active detectors and are generally suitable for detecting abrupt concept drift but may fail to work well in prediction settings with gradual or recurring concept drift [1], although recently JIT classifiers have been extended to address recurring concept drift [18].
3) Ensembling: an ensemble of classifiers can improve performance and provide the flexibility of injecting new data by adding classifiers or "forgetting" irrelevant data by removing or updating existing classifiers [19]. It can be computationally costly; recall that one of the desired properties of a classifier is to operate fast, in less than the example arrival time. Popular methods are the streaming ensemble algorithm (SEA) [20], Learn++.NSE [21], diversity for dealing with drifts (DDD) [22] and online bagging (OB) [23]. Another method is the accuracy updated ensemble (AUE) [24], which combines accuracy-based weighting mechanisms known from chunk-based ensembles with the incremental nature of Hoeffding Trees. Its follow-up work, Online AUE (OAUE), is provided in [25]. The interested reader is directed towards [10] for a survey on ensemble learning for data streams.
None of the aforementioned approaches consider class imbalance. Below we discuss how these are combined with class imbalance methods to address the joint problem.

B. Class imbalance
Cost-sensitive learning and resampling algorithms have recently shown particular success in this area [3].
1) Cost-sensitive learning: The cost-sensitive online gradient descent method (CSOGD) [26] uses the loss function:

l_t = c_p I_{y_t=1} l(y_t, ŷ_t) + c_n I_{y_t=0} l(y_t, ŷ_t),    (4)

where I_condition is the indicator function that returns 1 if the condition is satisfied and 0 otherwise, and c_p, c_n ∈ [0, 1] with c_p + c_n = 1 are the misclassification costs for the positive and negative classes respectively [26]. The authors use the perceptron classifier and stochastic gradient descent, and apply the cost-sensitive modification to the hinge loss function, achieving excellent results. The downside of this method is that the costs need to be pre-defined; however, the extent of the class imbalance may not be known in advance. Moreover, in nonstationary environments, it cannot cope with imbalance changes (i.e., p(y) drift) as the pre-defined costs remain static. This issue can be resolved by introducing an adaptive cost strategy. One way to achieve this is by using class imbalance detection (CID) [27] to determine the imbalance rate in an online manner. The authors define a time-decayed class size metric, where for each class k, its size s_k is updated at each time t according to:

s_k^t = θ s_k^{t-1} + (1 - θ) I_{y_t = k},    (5)

where 0 < θ < 1 is a pre-defined time decay factor that gives less emphasis to older data. This metric can determine the imbalance rate at any given time; for instance, for a binary classification problem where the positive class is the minority (s_p^t < s_n^t), the imbalance rate at time t is given by s_n^t / s_p^t. Another method that uses an adaptive cost strategy with a perceptron-based classifier is RLSACP [28]. EONN [29] uses an ensemble of cost-sensitive online neural networks to cope with drift and imbalance. As with CSOGD, the costs are pre-defined, thus limiting its adaptability to evolving data.
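The time-decayed class size metric of Eq (5) can be implemented directly (a sketch; θ = 0.99 is the value used later in our experiments):

```python
def update_class_sizes(sizes, y, theta=0.99):
    """One step of Eq (5): s_k <- theta * s_k + (1 - theta) * [y == k]."""
    return {k: theta * s + (1.0 - theta) * (1 if y == k else 0)
            for k, s in sizes.items()}

sizes = {0: 0.0, 1: 0.0}
for y in ([0] * 9 + [1]) * 10:               # a 10%-imbalanced label stream
    sizes = update_class_sizes(sizes, y)

minority = min(sizes, key=sizes.get)         # class with smaller decayed size
imbalance_rate = sizes[1 - minority] / sizes[minority]
```

Because the decay factor down-weights old labels, the metric tracks p(y) drift: if the positive class later becomes frequent, its decayed size overtakes the negative one within a few hundred steps.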
2) Resampling: Traditionally, in offline learning, resampling techniques alter the training set to deal with the skewed data distribution, specifically, oversampling techniques "grow" the minority class while undersampling techniques "shrink" the majority class. The simplest and most popular technique is random oversampling (or undersampling) where data examples are randomly added (or removed) respectively [30]. More sophisticated techniques exist, for example, the use of Tomek links [31] discards borderline examples while the SMOTE [32] algorithm generates new minority class examples based on the similarities to the original ones. Recently, Generative Adversarial Networks (GANs) have been used to approximate the distribution and generate data for the minority class [33].
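Random oversampling, the simplest of these techniques, can be sketched as follows (an offline illustration with toy data; duplicates are drawn at random until the classes are balanced):

```python
import random

def random_oversample(examples, seed=0):
    """Random oversampling sketch: duplicate randomly chosen minority
    examples until both classes have equal counts."""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for x, y in examples:
        by_class[y].append((x, y))
    minority = min(by_class, key=lambda k: len(by_class[k]))
    majority = 1 - minority
    deficit = len(by_class[majority]) - len(by_class[minority])
    extra = [rng.choice(by_class[minority]) for _ in range(deficit)]
    return examples + extra

data = [([float(i)], 0) for i in range(8)] + [([100.0], 1), ([101.0], 1)]
balanced = random_oversample(data)            # now 8 examples per class
```

Random undersampling is the mirror image: instead of duplicating minority examples, it discards randomly chosen majority examples until the counts match.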
Resampling has been demonstrated to be a powerful technique for addressing online imbalance learning problems as well. Uncorrelated bagging (UCB) [34] is an ensembling technique that is trained on all the minority examples observed so far, plus a subset of the most recent majority examples. This technique has two drawbacks: it assumes that the distribution of the minority class is stationary, and it does not handle the accumulated minority class examples for lifelong learning. SERA [35] and REA [36] are based on UCB and use more intelligent oversampling techniques. Other notable examples are the Learn++CDS and Learn++NIE [4] methods, both of which use the aforementioned Learn++.NSE method (Section III-A3) to address concept drift; the former incorporates the SMOTE algorithm to address class imbalance while the latter employs a variation of bagging. Despite their effectiveness, all these techniques are only suitable for batch learning and not for online learning, which is the focus of this paper.
ESOS-ELM [37] is an ensemble of online sequential extreme learning machines that are trained on balanced subsets of the data stream. It relies, however, on the assumption that drift does not affect the minority class. Oversampling-based online bagging (OOB) [38] is an online ensembling method that extends the OB method (Section III-A3), which addresses concept drift. It works by adaptively shifting the learning bias from the majority to the minority class through resampling, utilising the CID method (Section III-B1) and its time-decayed class size metric defined in Eq (5). Its basic idea is as follows: OOB updates each classifier of the ensemble K times; if a minority example arrives the value of K increases, otherwise it decreases. The effectiveness of OOB has been demonstrated using two types of classifiers, namely, Hoeffding trees and neural networks. The ensemble size varies for each study, e.g., in [38] 50 trees and 50 neural networks were used, while in [3] 15 neural networks were used. The approach can be computationally costly and may hinder online learning in high-speed sequential applications for two reasons: first, because of the multiple classifiers (ensembling), and second, because each classifier gets updated multiple times per time step. Analogous to OOB, its authors also introduce [38] the undersampling-based online bagging (UOB) algorithm.
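OOB's resampling rule can be sketched as follows (our own simplified reading: the Poisson rate λ is set from the ratio of class sizes, whereas the exact rule in [38] uses the time-decayed sizes of Eq (5)):

```python
import math
import random

def poisson(lam, rng):
    """Sample K ~ Poisson(lam) with Knuth's method."""
    l, k, p = math.exp(-lam), 0, 1.0
    while p > l:
        k += 1
        p *= rng.random()
    return k - 1

def oob_updates(y, sizes, rng):
    """Number of times each ensemble member trains on the incoming example:
    K ~ Poisson(lambda), with lambda > 1 when the example belongs to the
    current minority class (oversampling) and lambda = 1 otherwise."""
    s = sizes[y]
    lam = max(sizes.values()) / s if (s == min(sizes.values()) and s > 0) else 1.0
    return poisson(lam, rng)

rng = random.Random(0)
sizes = {0: 0.9, 1: 0.1}                      # negative class dominates 9:1
ks = [oob_updates(1, sizes, rng) for _ in range(1000)]
avg_k = sum(ks) / len(ks)                     # concentrates near lambda = 9
```

The multiple updates per step are exactly the cost mentioned above: with an ensemble of 20 members and λ = 9, a single minority example triggers on the order of 180 classifier updates.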

C. Open Challenges
Several key challenges still remain open when the joint problem of imbalance and drift is considered. The authors in [10] state that "working with class-imbalanced and evolving streams is still in early stages", while the study in [3] "reveals research gaps in the field of online imbalance learning with concept drift". Specifically, many existing methods are capable of addressing only a single problem, either imbalance or drift, but not the joint problem. Other methods that address the joint problem reveal weaknesses under conditions where one or both of the problems become very challenging; for instance, we will demonstrate these weaknesses under conditions where class imbalance is extreme (e.g., 0.1%). This paper introduces the concept of maintaining separate and balanced queues for each class, and has a dual nature as it merges ideas from memory-based and resampling algorithms.
Also, besides its type, drift can be classified by its severity, speed, predictability, frequency and recurrence [39]. A more recent study characterises drift by subject, frequency, transition, re-occurrence and magnitude [9]. Therefore, in practice, it is very difficult to characterise concept drift. As discussed in Section III-A2, explicit or active drift detectors can perform well under specific drift characteristics. However, no detector can universally perform satisfactorily under any combination of drift characteristics [40]. In fact, detailed characteristics of drift have not been consistently investigated in the literature [10]. Our proposed approach therefore learns the concept drift without its explicit characterisation and detection, and adapts the classifier continuously.

IV. PROPOSED METHOD
We now introduce the Adaptive REBAlancing (AREBA) algorithm. Its central idea is to selectively include in the training set a subset of the positive and negative examples that appeared so far. At its heart lies an adaptive rebalancing mechanism that dynamically modifies the queue sizes to maintain class balance between the selected examples. AREBA extends our recently introduced algorithm queue-based resampling (QBR) [41].

A. Queue-based Resampling (QBR)
The memory size B ∈ 2Z^+ (i.e., B ≥ 2 and even) determines how many previously observed examples can be stored. The selection of the examples is achieved by maintaining at any given time t two separate queues, q_n^t for the negative class and q_p^t for the positive class, each with a capacity (maximum length) of B/2; |q_n^t| and |q_p^t| are the current lengths of the queues. Let z_i = (x_i, y_i); for any two z_i, z_j ∈ q_n^t (or q_p^t) such that j > i, z_j arrived more recently in time.
An example showing how QBR works when B = 10 for 100 time steps is shown in Figure 1. Negative examples are shown in green, positive ones in light red, and the minority class is the positive class. The class imbalance is set to CI = 10%, i.e., p(y = 1) = 0.1, and for the sake of illustration, positive instances arrive at times that are multiples of ten (t = 10, 20, ..., 100). Initially, both queues are empty (shown as empty boxes) but their capacity is set to one (shown in the parenthesis). At t = 0 a negative example (z_0) arrives, which is appended to the negative queue, and the queue's capacity is incremented by one. At t = 4 the negative queue is full and has reached the full capacity B/2. At t = 10 the first positive example (z_10) arrives, which is appended to the positive queue, and the queue's capacity is incremented. At t = 100 the positive queue is full and has reached the full capacity B/2. Note that at this time both queues contain the most recent B/2 examples, i.e., z_95, ..., z_99 and z_60, z_70, ..., z_100 for the negative and positive queue respectively.
The union of the two queues is then taken to form the new training set. The cost function is given by:

J = (1/|q_t|) Σ_{(x_i, y_i) ∈ q_t} l(y_i, ŷ_i),    (7)

where q_t = q_p^t ∪ q_n^t and |q_t| ∈ [1, B]. At each step the classifier is updated once based on the cost J incurred. QBR's pseudocode is shown in Algorithm 1.

Algorithm 1 Queue-based Resampling (QBR)
1: Input:
2:   f: classifier
3:   B: total storage size (B ≥ 2)
4: Initialisation:
5:   queues q_p^0, q_n^0 = {}
6:   queue capacities q_p^0.cap = q_n^0.cap = 1
7: for each time step t do
8:   receive example x_t ∈ R^d
9:   predict class ŷ_t ∈ {0, 1}
10:  receive true label y_t ∈ {0, 1}
11:  if y_t == 1 then
12:    q_p^t.append(z_t)
13:  else
14:    q_n^t.append(z_t)
15:  if y_t == 1 then
16:    if q_p^t.cap < B/2 then
17:      q_p^t.cap = q_p^t.cap + 1    ▷ increase capacity
18:    else
19:      pass    ▷ queue no longer grows
20:  else
21:    if q_n^t.cap < B/2 then
22:      q_n^t.cap = q_n^t.cap + 1
23:    else
24:      pass
25:  prepare the training set q_t = q_p^t ∪ q_n^t
26:  calculate cost J on q_t using Eq (7)
27:  update classifier once f.train()

In Lines 5-6 the queues are initially empty with a capacity of one each. In Lines 11-14 a new example is appended to its relevant queue based on its true label. The append function behaves exactly as in Fig. 1, i.e., it inserts the most recent example in a queue while discarding the oldest one. Consider the case of t = 10, where the most recent example in the negative queue is z_9: when the example z_9 arrived at t = 9, the example z_4 was discarded from the queue. In Lines 15-24 the capacity of the relevant queue is incremented. In Lines 25-27 the training set is prepared, the cost is calculated and the classifier is updated once. Of particular importance is the observation that the original class imbalance problem still persists in the queues for a sustained period of time. Let us revisit time t = 10 in Figure 1: while q_n^t is full, q_p^t contains only a single example. Recall that to train the classifier we first take the union of the queues and then calculate the cost. Class imbalance is reduced as positive examples arrive, and the problem eventually disappears at t = 100 when the queues become balanced.
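QBR's queue memory can be implemented compactly (a sketch mirroring Algorithm 1; the class interface is our own):

```python
from collections import deque

class QueueBasedResampling:
    """Sketch of QBR's memory (mirroring Algorithm 1): one bounded queue per
    class, each growing up to a capacity of B/2; training uses the union."""

    def __init__(self, B):
        assert B >= 2 and B % 2 == 0
        self.half = B // 2
        self.cap = {0: 1, 1: 1}               # initial capacity of one each
        self.q = {0: deque(), 1: deque()}

    def observe(self, x, y):
        q = self.q[y]
        q.append((x, y))                      # insert the most recent example
        while len(q) > self.cap[y]:
            q.popleft()                       # discard the oldest example
        if self.cap[y] < self.half:
            self.cap[y] += 1                  # grow the relevant queue

    def training_set(self):
        return list(self.q[0]) + list(self.q[1])

# replay the Figure 1 stream: B = 10, positives at t = 10, 20, ..., 100
qbr = QueueBasedResampling(B=10)
for t in range(101):
    y = 1 if (t > 0 and t % 10 == 0) else 0
    qbr.observe([float(t)], y)
```

After t = 100 the negative queue holds z_95, ..., z_99 and the positive queue holds z_60, z_70, ..., z_100, exactly as in Figure 1.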

B. Adaptive Rebalancing (AREBA)
Adaptive rebalancing introduces a novel element that dynamically modifies the queue lengths in order to constantly maintain balance between the queues. Without this element, the initial class imbalance problem would still persist in the queue-based system, as discussed in the previous section. AREBA's pseudocode is shown in Algorithm 2.

Algorithm 2 Adaptive REBAlancing (AREBA)
1: Input:
2:   f: classifier
3:   B: total storage size (B ≥ 2)
4:   θ: time decay factor
5: Initialisation:
6:   queues q_p^0, q_n^0 = {}
7:   queue capacities q_p^0.cap = q_n^0.cap = 1
8:   class sizes s_p^0 = s_n^0 = 0
9: for each time step t do
10:  receive example x_t ∈ R^d
11:  predict class ŷ_t ∈ {0, 1}
12:  receive true label y_t ∈ {0, 1}
13:  update positive class size s_p^t = θs_p^{t-1} + (1-θ)I_{y_t=1}
14:  update negative class size s_n^t = θs_n^{t-1} + (1-θ)I_{y_t=0}
15:  if y_t == 1 then
16:    q_p^t.append(z_t)
17:  else
18:    q_n^t.append(z_t)
19:  if q_p^t.is_empty() then
20:    if q_n^t.cap < B then
21:      q_n^t.cap = q_n^t.cap + 1    ▷ increase capacity
22:    else if q_n^t.cap == B then
23:      pass    ▷ queue no longer grows
24:  else if q_n^t.is_empty() then
25:    if q_p^t.cap < B then
26:      q_p^t.cap = q_p^t.cap + 1
27:    else if q_p^t.cap == B then
28:      pass
29:  else
30:    if s_p^t ≤ s_n^t then    ▷ if positive class is minority
31:      if q_p^t.is_full() then
32:        if q_p^t.cap < B/2 then
33:          q_n^t.cap = q_p^t.cap    ▷ rebalance the majority queue
34:          q_p^t.cap = q_p^t.cap + 1
35:        else
36:          if q_n^t.cap ≠ q_p^t.cap then
37:            q_n^t.cap = q_p^t.cap
38:    if s_n^t ≤ s_p^t then    ▷ if negative class is minority
39:      if q_n^t.is_full() then
40:        if q_n^t.cap < B/2 then
41:          q_p^t.cap = q_n^t.cap
42:          q_n^t.cap = q_n^t.cap + 1
43:        else
44:          if q_p^t.cap ≠ q_n^t.cap then
45:            q_p^t.cap = q_n^t.cap
46:  prepare the training set q_t = q_p^t ∪ q_n^t
47:  calculate cost J on q_t using Eq (7)
48:  update classifier once f.train()

We describe in Figure 2 how AREBA works through the same example as before. Negative examples are shown in green, positive ones in light red, and the minority class is the positive class. The class imbalance is set to CI = 10%, i.e., p(y = 1) = 0.1, and for the sake of illustration, positive instances arrive at times that are multiples of ten (t = 10, 20, ..., 100). Initially, both queues are empty (shown as empty boxes) but their capacity is set to one (shown in the parenthesis). At t = 0 a negative example (z_0) arrives, which is appended to the negative queue, and the queue's capacity is incremented by one. Contrary to QBR, each queue is allowed to have a maximum capacity of B (rather than B/2). Since no positive examples are observed in the beginning, at t = 9 the queue q_n is full and has reached the maximum capacity B.
The first positive example (z_10) arrives at t = 10 and is appended to q_p. Rebalancing is now initiated and the capacities of q_n and q_p become 1 and 2 respectively. The queue q_n only contains its most recent example (z_9); hence, the queues are now balanced. The queues remain balanced until the second positive example (z_20) arrives at t = 20. The queue q_p now contains the two most recent positive examples (z_20, z_10) while q_n contains the most recent negative example (z_19). At t = 20 the capacities of q_n and q_p become 2 and 3 respectively. At t = 21 another negative example (z_21) arrives, which is appended to the relevant queue, thus the queues are again balanced. At t = 101 each queue is full and has a capacity of B/2. AREBA thus proposes an adaptive mechanism that dynamically alters the queue sizes to maintain balance between the examples contained in the queues. To achieve this, it is necessary to decide in an online fashion which class is the minority and which is the majority. AREBA adopts the CID method's time-decayed class size metric defined in Eq (5).
AREBA is shown in Algorithm 2. In Lines 7-8 the capacity of each queue is initialised to 1. In Lines 13 and 14 the class size metrics are updated. The new example is appended to its relevant queue (Lines 15-18). We then check if one of the queues is empty (Lines 19-28). For instance, if the positive queue is empty, we increase the capacity of the negative queue by 1; this corresponds to the cases t = 0 to t = 9 in Figure 2. Since the positive class is the minority class, the rest of the cases in Figure 2 are captured by Lines 30-37. Line 31 checks if the positive queue is full (in our illustration this occurs at t = 10, 20, ..., 100) and we then adapt the capacities accordingly (Lines 32-37) depending on whether the capacity of the positive queue has reached B/2. Similarly, Lines 38-45 are applicable when the negative class is the minority class.
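The capacity-adaptation logic can be sketched as follows (our own condensed reading of Algorithm 2; it reproduces the Figure 2 trace for B = 10 but omits the classifier update):

```python
from collections import deque

class AdaptiveRebalancing:
    """Sketch of AREBA's adaptive queue memory: capacities are adapted online
    so that the two queues stay class-balanced; a lone class may grow up to B
    before the first example of the other class arrives."""

    def __init__(self, B, theta=0.99):
        self.B, self.theta = B, theta
        self.cap = {0: 1, 1: 1}                # per-class queue capacities
        self.q = {0: deque(), 1: deque()}
        self.s = {0: 0.0, 1: 0.0}              # time-decayed class sizes, Eq (5)

    def observe(self, x, y):
        for k in (0, 1):
            self.s[k] = self.theta * self.s[k] + (1 - self.theta) * (y == k)
        self.q[y].append((x, y))
        self._trim(y)
        other = 1 - y
        if not self.q[other]:                  # other class never seen yet
            if self.cap[y] < self.B:
                self.cap[y] += 1               # grow freely up to B
        else:
            mino = 0 if self.s[0] <= self.s[1] else 1
            majo = 1 - mino
            if len(self.q[mino]) == self.cap[mino]:   # minority queue is full
                if self.cap[mino] < self.B // 2:
                    self.cap[majo] = self.cap[mino]   # shrink majority queue
                    self.cap[mino] += 1               # let minority queue grow
                else:
                    self.cap[majo] = self.cap[mino]   # keep queues balanced
                self._trim(majo)

    def _trim(self, k):
        while len(self.q[k]) > self.cap[k]:
            self.q[k].popleft()                # discard the oldest examples

    def training_set(self):
        return list(self.q[0]) + list(self.q[1])
```

Replaying the Figure 2 stream (positives at t = 10, 20, ..., 100) leaves both capacities at B/2 = 5, with the queues holding z_95, ..., z_99 and z_60, z_70, ..., z_100.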
In summary, AREBA introduces the concept of maintaining separate and balanced queues for each class and its effectiveness is attributed to its dual-nature as it combines ideas from memory-based and resampling methods. These and other important remarks are discussed in detail in Section VIII.

V. EXPERIMENTAL SETUP
This section describes the synthetic and real-world datasets used along with any data pre-processing steps performed. It also describes the baseline and state-of-the-art methods used along with their selected parameters. It further describes the performance metrics and the evaluation method used.
A. Datasets
1) Synthetic datasets: They provide the flexibility to control the imbalance level, when to introduce drift, and the drift type. We experiment with imbalance of 10% (mild), 1% (severe) and 0.1% (extreme) and examine each drift type individually to inspect the advantages and limitations of all the compared methods. The three synthetic datasets used are described below; we use their balanced and imbalanced versions, with and without drift.
Circle [42]: It has two features (x_1, x_2), and the decision boundary is a circle with centre (x_1c, x_2c) and radius r_c. The circle with (x_1c, x_2c) = (0.4, 0.5) and r_c = 0.2 is created. Instances inside the circle are classified as positive, otherwise as negative.
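Generating the Circle concept can be sketched as follows (for illustration we assume features drawn uniformly from [0, 1]^2, which makes roughly πr_c^2 ≈ 12.6% of examples positive before any subsampling for imbalance):

```python
import random

def circle_example(rng, centre=(0.4, 0.5), radius=0.2):
    """Draw one example from the Circle concept: instances inside the circle
    are labelled positive, instances outside negative."""
    x1, x2 = rng.random(), rng.random()       # assumed uniform on [0, 1]^2
    inside = (x1 - centre[0]) ** 2 + (x2 - centre[1]) ** 2 <= radius ** 2
    return (x1, x2), 1 if inside else 0

rng = random.Random(0)
data = [circle_example(rng) for _ in range(1000)]
positives = sum(y for _, y in data)           # near pi * r^2 of the examples
```

Drift can then be simulated by changing the centre or radius mid-stream (real drift), or by changing the sampling region of the features (virtual drift).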
2) Real-world datasets: They are typically more complex than synthetic ones and have a large number of noisy features, but the true nature of concept drift may be unknown. The five datasets used in this paper cover various application domains, specifically, healthcare, security and crime, finance and banking, image classification and environmental sciences.
Cervical Cancer [8]: The dataset was collected at the Hospital Universitario de Caracas in Caracas, Venezuela and contains demographic information, habits and historical medical records of 858 patients. The task is to predict the outcome of a biopsy with respect to cervical cancer and each class label was decided by a team of six experts. The dataset is highly imbalanced as 55 out of the 858 cases (6.4%) correspond to cases of cervical cancer. The number of features is 45. The exact nature of drift is unknown but it is expected to occur due to the fact that cancer cells gain genetic variation over time and also due to changes in patients' habits.
Fraud [6]: The dataset contains transactions made by credit cards in September 2013 by European cardholders. This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset contains 30 features and is severely imbalanced as the positive class (frauds) accounts for 0.172% of all transactions. The exact nature of concept drift is unknown but it is expected to occur due to the adaptive nature of adversarial actions.
Credit Score [43]: The dataset contains demographic and financial / banking information. The task is to predict the credit score (good, bad) of a customer. The dataset contains 1000 entries, out of which 300 correspond to a bad credit i.e. class imbalance is 30%. The number of features is 24. The exact nature of concept drift is unknown but may occur due to customer attempts to fake or improve their credit score.
MNIST [44]: It is a database of handwritten digits ("0"-"9") where each image is 28×28. Contrary to the rest of the datasets, whose data type is numeric, MNIST is commonly used for training image processing systems for visual tasks. The dataset is diverse as the digits were written by ∼250 adult and student writers. We selected digit "7" to be the majority class with 6000 instances and digit "2" to be the minority class with 60 instances, therefore, imbalance is close to 1%. Drift can occur as new handwriting styles appear during learning.
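The construction of such an imbalanced subset can be sketched as follows (index selection only, with a toy label list standing in for the real MNIST training labels):

```python
import random

def make_imbalanced_subset(labels, majority_digit, minority_digit,
                           n_majority, n_minority, seed=0):
    """Select indices for an imbalanced binary task: n_majority examples of
    one digit and n_minority of another, shuffled together."""
    rng = random.Random(seed)
    maj = [i for i, y in enumerate(labels) if y == majority_digit]
    mino = [i for i, y in enumerate(labels) if y == minority_digit]
    idx = rng.sample(maj, n_majority) + rng.sample(mino, n_minority)
    rng.shuffle(idx)
    return idx

# toy label list standing in for the real MNIST training labels
labels = [7] * 7000 + [2] * 900
idx = make_imbalanced_subset(labels, 7, 2, n_majority=6000, n_minority=60)
```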
Forest Cover Type [45]: The dataset consists of cartographic information obtained from the US Forest Service. The task is to predict the forest cover type for given 30x30 meter cells from the Roosevelt National Forest in Colorado. We have selected the cover type "1" to be the majority class with 200000 instances and type "4" to be the minority class with 2000 instances, therefore, imbalance is close to 1%. This dataset has been used in many concept drift studies, e.g., [24].

B. Compared methods
All compared methods share the same base classifier which is a fully-connected neural network of one 8-neuron hidden layer, except for MNIST which consists of two 512-neuron hidden layers to deal with its complex data type (i.e. images). The base classifier is configured as follows: He Normal [46] weight initialisation, the Adam [47] optimisation algorithm, LeakyReLU [48] as the activation function of the hidden neurons, sigmoid activation for the output neuron, and the binary cross-entropy loss function. The learning rate is 0.01 for the synthetic, Credit Score and Forest Cover Type datasets, 0.1 for Cervical Cancer and 0.0001 for Fraud. For MNIST the learning rate is 0.001 and L2 regularisation is set to 0.01.
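Since the full training code is released separately, the configuration above can be illustrated with a minimal NumPy sketch of the base classifier's forward pass and loss. The input dimensionality (10 features) and the LeakyReLU slope (0.01) are our assumptions for illustration only; the released implementation should be consulted for the exact details.

```python
import numpy as np

rng = np.random.default_rng(0)

def he_normal(fan_in, fan_out):
    # He Normal initialisation: zero-mean Gaussian with std = sqrt(2 / fan_in)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def leaky_relu(z, alpha=0.01):
    # alpha is an assumption; the paper does not state the slope used
    return np.where(z > 0, z, alpha * z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y_true, y_pred, eps=1e-12):
    # binary cross-entropy loss for the single sigmoid output neuron
    p = np.clip(y_pred, eps, 1 - eps)
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# one 8-neuron hidden layer, as used for all datasets except MNIST;
# an input of 10 features is illustrative
W1, b1 = he_normal(10, 8), np.zeros(8)
W2, b2 = he_normal(8, 1), np.zeros(1)

def forward(x):
    h = leaky_relu(x @ W1 + b1)
    return sigmoid(h @ W2 + b2)

x = rng.normal(size=10)
p = float(forward(x)[0])   # probability of the positive class
```

The Adam optimiser and the per-dataset learning rates from the text would drive the (omitted) weight updates.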
Baseline. A baseline algorithm where no mechanisms to address class imbalance or concept drift exist. This baseline method is an online and incremental learning algorithm.
Sliding window. A memory-based method that uses a single sliding window to address drift, but no mechanism to address imbalance exists. This is in contrast to QBR which utilises one sliding window per class. The window size is set to W = 100. It is an online but not incremental learning algorithm as it requires access to W − 1 previously observed data examples.
Adaptive CS. A state-of-the-art adaptive cost-sensitive learning method. It uses the CSOGD cost function defined in Eq. (4), initialised to c = c_p / c_n = 0.95 / 0.05 = 19 as suggested by its authors. These costs are adapted according to the CID approach, whose time-decayed class size metrics are defined in Eq. (5). The time-decayed factor is set to θ = 0.99. To overcome stability issues in performance, we set an upper bound to the ratio, i.e., 1 ≤ c ≤ 50, and we update the costs every 250 steps. This method is both online and incremental.
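A hedged sketch of this cost adaptation follows. Only the θ = 0.99 factor, the [1, 50] bound, the initial ratio of 19 and the 250-step update interval come from the text; the initialisation of the time-decayed class sizes and the exact form of the CID update are illustrative assumptions.

```python
import numpy as np

theta = 0.99          # time-decayed factor
w = {0: 0.5, 1: 0.5}  # time-decayed class sizes (initialisation is an assumption)
c = 19.0              # initial cost ratio c_p / c_n = 0.95 / 0.05

def observe(y):
    # update the time-decayed class size of each class (Eq. 5 style):
    # decay both sizes, then credit the class of the arriving example
    for k in (0, 1):
        w[k] = theta * w[k] + (1.0 - theta) * (1.0 if y == k else 0.0)

def update_cost():
    # adapt the cost ratio to the current imbalance, bounded in [1, 50]
    global c
    c = float(np.clip(w[0] / max(w[1], 1e-12), 1.0, 50.0))

rng = np.random.default_rng(1)
for t in range(1, 5001):
    y = int(rng.random() < 0.01)   # stream with a 1% minority class
    observe(y)
    if t % 250 == 0:               # update the costs every 250 steps
        update_cost()
```

Under a 1% minority class the unclipped ratio tends towards ~99, so the bound at 50 is what keeps the costs stable.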
OOB. A state-of-the-art online resampling algorithm (Section III-B2). We will consider OOB with 20 classifiers and the special case OOB single where only a single classifier is used. The time-decayed factor is set to θ = 0.99. OOB is an online and incremental learning algorithm but unlike other methods, it performs multiple updates of the classifier at each time step.
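OOB's resampling can be sketched as drawing the number of per-example training repetitions K from a Poisson distribution whose rate reflects the current imbalance; minority examples are then trained on more than once per step. The initialisation of the class sizes and the exact rate formula are illustrative assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.99
w = {0: 0.5, 1: 0.5}   # time-decayed class sizes (initialisation assumed)

def oob_updates(y):
    # update the time-decayed class sizes, then draw the number of training
    # repetitions K for this example; minority examples get a rate > 1
    for k in (0, 1):
        w[k] = theta * w[k] + (1.0 - theta) * (1.0 if y == k else 0.0)
    lam = w[0] / max(w[1], 1e-12) if (y == 1 and w[1] < w[0]) else 1.0
    return int(rng.poisson(lam))

# simulate a stream with a 10% minority class
maj_updates, min_updates = [], []
for _ in range(2000):
    y = int(rng.random() < 0.1)
    (min_updates if y == 1 else maj_updates).append(oob_updates(y))
```

On average, minority examples receive several classifier updates while majority examples receive about one, which is how oversampling is realised online.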
AREBA. The proposed method whose pseudocode is provided in Algorithm 2. The time-decayed factor is set to θ = 0.99. AREBA is an online learning algorithm. When B = 2, the method (referred to as AREBA 2) becomes a near-incremental learning algorithm as it requires access only to a single old example. In our study, we will examine various values of the memory size, which we will refer to as AREBA B.
We now discuss some aspects concerning the computational cost of the algorithms. For all methods, one-pass learning is used, i.e., the base classifier is updated once (#epochs = 1) at every step, except for OOB which is updated K times (#epochs = K). A single batch update is performed within an epoch; this is to allow fast learning in high-speed applications. For AREBA, the queues q_t form the batch, while for Sliding, the window W is the batch. For the rest, the batch size is 1 as they are incremental algorithms. The same classifier gets updated throughout the duration of an experiment. Even in the presence of drift, we never reset the classifier or introduce any new classifier(s).
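The class-balanced queue memory at the heart of AREBA can be sketched as follows. This is an illustrative reconstruction of the mechanism described in the text, not the authors' released implementation; in particular, how the minority class is identified (here via time-decayed class sizes) and the exact trimming rule are assumptions.

```python
from collections import deque

class ArebaQueues:
    """Sketch of AREBA's per-class queues with adaptive rebalancing."""

    def __init__(self, B, theta=0.99):
        self.B = B
        self.theta = theta
        self.q = {0: deque(), 1: deque()}   # one queue per class, bounded by B
        self.w = {0: 0.5, 1: 0.5}           # time-decayed class sizes (assumed init)

    def add(self, x, y):
        # track the class distribution, then store the arriving example
        for k in (0, 1):
            self.w[k] = self.theta * self.w[k] + (1 - self.theta) * (1.0 if y == k else 0.0)
        self.q[y].append(x)
        if len(self.q[y]) > self.B:
            self.q[y].popleft()
        # adaptive rebalancing: the majority-class queue is trimmed so that it
        # never holds more examples than the minority-class queue
        maj = 0 if self.w[0] >= self.w[1] else 1
        target = max(len(self.q[1 - maj]), 1)
        while len(self.q[maj]) > target:
            self.q[maj].popleft()

    def training_batch(self):
        # the union of both queues is the balanced training set q_t at each step
        return list(self.q[0]) + list(self.q[1])
```

Because the majority queue holds only the most recent examples matching the minority queue's length, the batch stays class-balanced while still behaving like a sliding window.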

C. Performance metrics
Traditionally, classifiers are evaluated using the overall accuracy metric. When class imbalance exists, accuracy becomes problematic as it is biased towards the majority class. This occurs because any metric that uses values from both columns of the confusion matrix will be inherently sensitive to class imbalance [2]. Therefore, it is necessary to adopt a performance metric which is not sensitive to class imbalance.
The geometric mean (G-mean) [2] is such a suitable metric. It evaluates the degree of inductive bias in terms of the positive accuracy Acc+ (or recall) and the negative accuracy Acc− (or specificity), as defined below [49]:

Acc+ = TP / P,  Acc− = TN / N,  G-mean = √(Acc+ × Acc−),

where TP, P, TN and N are the number of true positives, total positives, true negatives and total negatives, respectively. G-mean has some desirable properties as it is high when both Acc+ and Acc− are high and when their difference is small.
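In code, the metric and its components read:

```python
import math

def g_mean(tp, p, tn, n):
    """G-mean = sqrt(Acc+ * Acc-), with Acc+ = TP/P (recall)
    and Acc- = TN/N (specificity)."""
    acc_pos = tp / p
    acc_neg = tn / n
    return math.sqrt(acc_pos * acc_neg)
```

This makes the bias of plain accuracy concrete: a classifier that ignores a 1% minority class (TP = 0) attains 99% overall accuracy yet a G-mean of 0, which is why accuracy is unsuitable under imbalance.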

D. Evaluation method
To evaluate, compare and assess predictive sequential learning algorithms we adopt the prequential error with fading factors method. It has been proven that for learning algorithms in stationary data this method converges to the Bayes error [50]. This method does not require a holdout set and the predictive model is always tested on unseen data. In our experimental study the fading factor is set to θ = 0.99.
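A possible sketch of the prequential G-mean with fading factors follows: at each test-then-train step the per-class counts are decayed by θ before the new prediction is scored, so recent behaviour dominates. Applying the fading factor to the confusion-matrix counts (rather than to a loss term) is our assumption about how the method of [50] transfers to G-mean.

```python
class PrequentialGmean:
    """Prequential G-mean with fading factor theta (test-then-train:
    each example is evaluated before the model is trained on it)."""

    def __init__(self, theta=0.99):
        self.theta = theta
        self.tp = self.p = self.tn = self.n = 0.0

    def update(self, y_true, y_pred):
        # decay all counts, then credit the current prediction
        for name in ('tp', 'p', 'tn', 'n'):
            setattr(self, name, self.theta * getattr(self, name))
        if y_true == 1:
            self.p += 1.0
            self.tp += float(y_pred == 1)
        else:
            self.n += 1.0
            self.tn += float(y_pred == 0)

    def value(self):
        if self.p == 0 or self.n == 0:
            return 0.0
        return ((self.tp / self.p) * (self.tn / self.n)) ** 0.5
```

With θ = 0.99 the effective horizon is on the order of a few hundred examples, so the curve reacts to drift without a holdout set.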
In all simulations we plot the prequential metric (e.g. G-mean) at every step, averaged over 50 repetitions, including error bars displaying the standard error around the mean. Additionally, in all experiments we test for statistical significance using a one-way repeated measures ANOVA, followed by post-hoc multiple comparison tests with Fisher's least significant difference correction procedure to show which of the compared methods are significantly different from the others.

VI. EMPIRICAL ANALYSIS OF AREBA
This section presents a two-fold analysis of the proposed method. In particular, we examine and discuss the roles of the adaptive rebalancing mechanism and the memory size.

A. Role of the adaptive rebalancing mechanism
This analysis has been conducted on the stationary Circle dataset, i.e., without concept drift. Figures 3a-3c depict how the memory size affects the learning performance of QBR. The left, middle and right columns correspond to experiments with class imbalance of CI = 10%, 1%, 0.1%, respectively.
In Fig. 3a, where imbalance is mild i.e. CI = 10%, the final performance remains unaffected for B ≥ 50, while for B < 50 it performs slightly worse. However, the learning speed is severely affected by the choice of B i.e. it gets slower with an increasing value of B. For instance, QBR with B = 500 equalises the performance of B = 2 at about t = 2000.
In Fig. 3b where class imbalance is severe i.e. CI = 1%, both the learning speed and final performance are severely affected by the choice of B. The learning speed gets slower with an increasing value of B, for instance, QBR with B = 500 equalises the performance of B = 2 at about t = 5000. In regard to the final performance, after a certain threshold (here, B ≥ 50), it significantly deteriorates as the value of B increases. For instance, QBR with B = 1000 does not even equalise the performance of B = 2 after 5000 time steps.
In Fig. 3c where imbalance is extreme (CI = 0.1%), the learning speed and final performance are severely affected by B, as previously. QBR with B = 2 is by far the best algorithm. QBR with B = 50, 100 only start closing the gap after 5000 time steps while B = 500, 1000 perform poorly at t = 5000.
The analogous experiments for AREBA are shown in Figures 4a-4c. Irrespective of the imbalance severity, after a certain threshold (here, B ≥ 50), all AREBA versions behave almost identically. To sum up, these important remarks can be made:
• Without the adaptive rebalancing mechanism, QBR is very sensitive to the choice of the memory size B and, as a result, the learning speed and/or the final performance are significantly affected. The problem becomes more acute as class imbalance becomes more severe.
• The adaptive rebalancing mechanism makes AREBA robust to the choice of the memory size B, irrespective of the imbalance severity. Specifically, the higher the value of B the better; however, after a (dataset-specific) threshold, the improvement is negligible.

B. Role of the memory size
This section identifies the role of the memory size in the presence of outdated concepts. We examine both the learning speed and the final performance. To examine the learning speed we present the learning curves; a steep curve means the algorithm is learning faster. The learning curves present the prequential G-mean at every time step. To examine the learning quality we present the final performance, i.e., the one obtained at the last time step of the curve.
The analysis is conducted on the synthetic datasets on two settings, based on [3]. The short setting (5000 steps, drift at t = 2500) allows us to examine AREBA's behaviour immediately after the drift, while the long (20000 steps, drift at t = 10000) allows us to examine AREBA's long-term behaviour.
For the Circle dataset the drift is defined as follows. We start with the case of mild imbalance (Fig. 5a). For the part of the curves before the drift, we conclude that the higher the value of B the better; however, after a (dataset-specific) threshold, the improvement is negligible. This has been studied in Section VI-A (Fig. 4a); therefore, from now on, we turn our attention to the part of the learning curves after the drift.
For the learning speed, AREBA with B = 500, 1000 is slower than B = 20, 100 which in turn is slower than B = 2. For the final performance, AREBA with B = 2 significantly outperforms the rest. We observe that the smaller the value of B the better the results. This is also clear in Fig. 5e. This is expected as immediately after a drift, a larger value of B means that more outdated examples exist in the queues.
Surprisingly, this no longer holds when imbalance becomes severe or extreme (Figs. 5b and 5c) where all AREBA versions yield the same final performance. This is attributed to two reasons. Firstly, a small value of B is suitable in case of outdated examples with mild imbalance (Fig. 5a). Secondly, in case of severe imbalance without drift, a large value of B is suitable (Fig. 4b). In the presence of both drift and severe imbalance, the aforementioned are in conflict with each other and, interestingly, it appears that imbalance becomes the key issue when it is severe rather than drift.
Overall, an important trade-off exists: in stationary settings, the higher the value of B the better the results, whereas immediately after a drift, a smaller value of B is preferable. For the Sine dataset (Fig. 6a), the learning speed of AREBA with B = 500, 1000 is slower than with B = 20, 100. In case of severe imbalance (Fig. 6b), we can conclude that the key issue becomes the imbalance rather than the drift. Notably, AREBA with B = 1000 outperforms B = 2 and equalises the performance of B = 100, 500 despite containing significantly more outdated examples in its queues. In Fig. 6e, the best final performance when imbalance is mild (CI = 10%) occurs when B = 100, while the best final performance when imbalance is severe (CI = 1%) occurs when B = 20. Under conditions of mild imbalance (Fig. 6a), the aforementioned trade-off is not as clear as it was for the Circle dataset (Fig. 5a). Therefore, we note that some experimentation to find a suitable choice of B becomes even more important.
For the Sea dataset the posterior drift is defined in Eq. (11). Notice that the experiment runs for 20000 steps to inspect AREBA's long-term behaviour. We start with the case of mild imbalance (Fig. 7a). While immediately after the drift AREBA with B = 500, 1000 appears to be learning slower, given additional time, the two eventually outperform the rest. This is expected because after a long time without any recurring drift, the data can be considered stationary and, hence, we reach the same conclusion as previously (Section VI-A, Fig. 4a).
For severe imbalance (Fig. 7b), we conclude that the key issue becomes the imbalance rather than the drift. Notably, AREBA with B = 1000 outperforms B = 2 and equalises B = 20, 500 despite containing significantly more outdated examples in its queues. The best final performance when imbalance is severe occurs when B = 100. This is also shown in Fig. 7e. For the case of extreme class imbalance i.e. CI = 0.1%, all AREBA versions (except when B = 2) perform similarly, although, B = 20 has a slight advantage.
To sum up, the following important remarks can be made:
• For mild class imbalance, the general trend is that the lower the value of the memory size B, the better AREBA performs. This is attributed to the fact that a smaller number of outdated examples is contained in the queues. The optimal value of B depends on the dataset.
• When severe imbalance exists, it appears to be the key problem rather than drift. Surprisingly, AREBA with large values of B has been shown to perform similarly to or even better than AREBA with lower values. Interestingly, this is somewhat in contradiction to the previous point, as the former contains more outdated examples in the queues.
• The previous are useful guidelines to help with the selection of the memory size parameter. However, some experimentation is still necessary to obtain the best value of B. We discuss the role of B further in Section VIII.

A. Stationary data
We describe our work on stationary synthetic data. Figs. 8-10 show the results for the Sine dataset with imbalance of CI = 10%, 1%, 0.1%, respectively. For completeness, we also present the prequential recall and specificity. Based on the analysis in Section VI, we choose B = 20 for the Sine dataset.
In Fig. 8a, the best performance is achieved by AREBA 20. The rest reach a similar performance, with the exception of the Baseline. Both AREBA versions learn faster, with OOB / OOB single being the second best. It is noteworthy that the 20-classifier ensemble (OOB) performs similarly to the single-classifier OOB single. While this may seem surprising at first, it is in fact consistent with the results of its authors [38], who concluded that resampling, and not ensembling, is the reason behind the effectiveness of the approach.
In Fig. 9a, where severe imbalance exists, the two AREBA versions significantly outperform the rest. Under extreme imbalance (Fig. 10a), AREBA performs more than ten times better than the rest. AREBA's effectiveness is attributed to the following. While all algorithms perform well on the majority class (Figs. 9c and 10c), AREBA has a superior performance on the minority class (Figs. 9b and 10b). This means that the rest perform poorly on the minority class; recall that the G-mean is high when both Acc+ and Acc− are high and when their difference is small.
The reason for AREBA's advantage on minority class examples is the concept of maintaining separate and balanced queues for each class. Let us first consider the incremental learning algorithms Baseline, Adaptive CS, OOB single and OOB. These algorithms use only the most recent arriving example, which they later discard. As a result, under severe or extreme imbalance, they do not experience many minority class examples. Let us now consider Sliding, which is a memory-based algorithm as it implements a single sliding window. Under severe or extreme imbalance, this method may still suffer from the same problem if only a small number of minority class examples are found in its sliding window. The problem can be alleviated with a larger window size; however, it would then be unable to rapidly react to concept drift. In contrast, the proposed AREBA can afford to have a small window size and, at the same time, experience a sufficient number of minority class examples. AREBA, like all methods, depends on the classifier being able to forget old knowledge quickly enough to react to drift. Therefore, as discussed in Section VI-B, the optimal value of B is application-dependent and tuning is required.

Fig. 11. Comparative study on Sine with imbalance of 1% and different drift types: (a) prior drift, (b) likelihood drift, (c) posterior drift. AREBA is robust to drift, can fully recover and significantly outperforms other state-of-the-art methods, as was the case before the drift occurred.

To sum up, these remarks are made:
• AREBA 20 outperforms all algorithms in all imbalance scenarios, while AREBA 2 is the second best. Under extreme class imbalance, AREBA has been shown to perform ten times better than state-of-the-art algorithms.
• Interestingly, resampling methods (AREBA, OOB single) seem to better handle online imbalance learning than cost-sensitive learning methods (Adaptive CS).
• As the imbalance becomes severe, the performance of all algorithms declines significantly. AREBA has been shown to be affected less seriously, thus being more robust to it.
• Methods without a mechanism to handle class imbalance in online learning perform poorly (Baseline, Sliding).

B. Nonstationary data
We describe our work on nonstationary synthetic data. To examine each drift type independently, we devise the following experiments on the Sine dataset based on [3]. In all experiments, the imbalance is set to CI = 1% i.e. p(y = 1) = 0.01.
A drift in the prior probability p(y) occurs as shown in Eq. (12), a drift in the likelihood p(x|y) as shown in Eq. (13), and a posterior probability drift as shown in Eq. (10). Figures 11a-11c show a comparative study on the Sine dataset with a prior, class-conditional and posterior probability drift, respectively. Overall, the proposed AREBA 20 outperforms the rest in all cases, while AREBA 2 is the second best. The first point to note is about virtual drift; recall that this drift type does not alter the true decision boundary but it can affect the learnt one. Generally, the state-of-the-art algorithms (AREBA, OOB single, Adaptive CS) not only are robust to virtual drift, but an overall slight improvement is observed (Figures 11a and 11b). This is because more of the feature space is revealed to the algorithm after the drift occurs; for instance, in Eq. (13) more examples with x_1 ≥ 0.6 will be observed.
The second point is about posterior drift. In this case, the performance of all algorithms declines significantly after the drift (Fig. 11c). Recall from Eq. (10) that the way we defined posterior drift is by performing a "concept swap". Therefore, all algorithms must start re-learning the new concept. The third point is about OOB single. It outperforms its ensemble version when drift is virtual (Figs. 11a and 11b). After a posterior drift occurs, OOB is significantly better than OOB single, but they obtain a similar final performance. Again, this is in alignment with [38], as on a number of occasions the single classifier has outperformed its ensemble version. From now on, we will consider only OOB single. To sum up, the following important remarks can be made:
• AREBA 20 outperforms all algorithms in all drift scenarios, while AREBA 2 is the second best.
• A drift in p(y|x) is the most severe type of data alteration, and this clearly reflects on all algorithms' performance.
• Algorithms with a mechanism to address drift (e.g. Sliding) but without one to address imbalance perform poorly.
• In settings where there is solely drift (no imbalance), AREBA B and Sliding significantly outperform all algorithms; they are almost identical when W = 2B. AREBA 2 and Baseline jointly follow. Adaptive CS and OOB single are outperformed by the rest (the figures for these results are not included).
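The "concept swap" used to induce posterior drift can be illustrated as follows. The Sine concept shown (positive below the sine curve) is a plausible stand-in, since the paper's exact generator in Eq. (10) is not reproduced in this excerpt.

```python
import math

def sine_concept(x1, x2):
    # illustrative Sine concept: positive if the point lies below the sine curve
    return 1 if x2 < math.sin(x1) else 0

def label(x1, x2, t, t_drift=2500):
    y = sine_concept(x1, x2)
    # posterior drift as a "concept swap": at the drift point, the class labels
    # of the two regions are exchanged, so p(y|x) inverts while p(x) is unchanged
    return y if t < t_drift else 1 - y
```

Because the decision boundary itself is what flips, a classifier must actively unlearn the old concept, which is why Fig. 11c shows every algorithm's performance dropping before recovery.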

C. Data with noisy class labels
We study each algorithm under conditions where the class label (i.e. ground truth) is incorrectly provided. The study aims at emulating realistic scenarios that can be encountered in deployed settings. One such scenario is when an expert is unable to provide the label (violation of the label availability assumption). Another scenario is when the label can be provided, but not always timely i.e. before the arrival of the next example (violation of the no verification latency assumption). In these scenarios, an estimated label could be provided by an automated mechanism which may or may not be true.
To model this behaviour, we specify a probability threshold by which the true label cannot be provided. The degree to which the aforementioned assumptions are violated in our experiments, i.e., the probability threshold, is 10%. Specifically, with a probability of 10% we revert the true class label of the arriving example; therefore, each algorithm is not always trained using the ground truth. We repeat the earlier experiments under these conditions. The performance obtained in scenarios with 10% noise and 1% (severe) imbalance is a G-mean of 0.6, which is close to the performance obtained by the proposed algorithm in the case of 0.1% (extreme) imbalance without drift (Figure 10a).
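The label-noise protocol is straightforward to sketch; the random generator and its seed are incidental choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_label(y_true, p_flip=0.10):
    # with probability p_flip, revert the true class label of the arriving
    # example, so the learner is not always trained on the ground truth
    return 1 - y_true if rng.random() < p_flip else y_true
```

Each algorithm is then trained on `noisy_label(y)` while the prequential metric is still scored against the true `y`.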

D. Real-world data
We describe now our work on real-world data. To examine the learning speed we present the learning curves in Figures 13a -13e. The final mean performances are shown in Table I; the best performing algorithm based on ANOVA and its post-hoc tests (Section V-D) is shown in bold font which denotes statistical significance over the others, and the standard deviation is shown in brackets. The tables with the p-values are found in the supplementary materials. Following the guidelines derived from our analysis in Section VI-A we use AREBA 20 in three datasets and AREBA 50 in the other two.
In Fig. 13a, AREBA 50 achieves G-mean = 0.8 at t ≈ 350, while Adaptive CS obtains this score at t ≈ 650. The learning speed is of utmost importance as, in practice, this means Adaptive CS would need to observe about 300 more biopsies to match AREBA's performance. In Fig. 13c, AREBA 20 achieves G-mean = 0.6 at t ≈ 50, while Adaptive CS obtains this score at t ≈ 450. In Fig. 13e, all algorithms equalise AREBA 20 at t ≈ 10000. In Fig. 13b, AREBA and OOB single behave similarly.
In Table I, on Cervical Cancer, AREBA 20/50 is joint first with AREBA 2 and outperforms the second-placed Adaptive CS by more than 9%. In Credit Score, AREBA 20/50 outperforms the second AREBA 2 by 3.5% and the third Adaptive CS by 6%. In MNIST, AREBA 20/50 achieves a superior performance, outperforming the second Adaptive CS by more than 12% and the third AREBA 2 by more than 15%. In Fraud, AREBA 2 outperforms the rest, but the improvement over AREBA 20/50 and OOB single is less than 1%. In Forest Cover Type, all algorithms obtain the same final performance. Surprisingly, Adaptive CS performs better overall than OOB single, while the opposite holds true in the synthetic datasets. To sum up, these important remarks can be made:
• AREBA outperforms other algorithms in all real datasets. These are cases where it would have an impact in practice.
• In regard to the choice of memory size B, the results are consistent with those observed in our studies with synthetic data. Specifically, AREBA 2 either performs similarly to AREBA 20 or closely but worse.

VIII. DISCUSSION AND CONCLUSION
We discuss below important aspects of AREBA, its advantages and limitations, and directions for future work.
Dual nature. AREBA's effectiveness is attributed to a few important characteristics. Maintaining separate and balanced queues for each class helps address the imbalance problem. Propagating past examples into the most recent training set can be viewed as a form of oversampling. The fact that examples are carried over a series of steps allows the classifier to "remember" old concepts. To address the drift challenge, the classifier also needs to be able to "forget" old concepts. This is achieved by AREBA's memory-based nature: by bounding the length of the queues, these essentially behave like sliding windows. Hence, AREBA can cope with both imbalance and drift. We have shown that the proposed synergy is seamless, and it significantly outperforms algorithms that belong solely to resampling or memory-based methods, and even algorithms that belong to other types, e.g., cost-sensitive methods. Lastly, recall that no drift detector can perform satisfactorily under every situation. When domain expertise can foresee the drift's nature, AREBA can work in cooperation with change detection-based methods so that they complement each other.
QBR vs AREBA. QBR was first introduced in our preliminary study [41], which ran experiments only on synthetic datasets and examined only a limited range of imbalance rates and a single drift type. In this work, we conducted extensive experimental work and stress-tested QBR under conditions of severe imbalance and different drift types. We have identified, discussed and analysed QBR's limitations and the conditions under which they occur. It turns out that QBR's limitations arise from the fact that the imbalance problem persists in the queues (as described with reference to Fig. 1). To overcome this, the paper proposes AREBA with two major design changes over QBR. Firstly, it allows each queue to be of size B (rather than B/2). Secondly, the queues remain balanced throughout time using a dynamic mechanism (adaptive rebalancing). These changes make AREBA very effective when dealing with online learning tasks under drift and imbalance.
Role of the memory size. B's role is of great importance and goes beyond that of just controlling the maximum queue lengths. It controls the "level" of incremental learning: the smaller its value, the closer to being incremental; specifically, when B = 2, AREBA becomes near-incremental. Finding a suitable value is dataset-specific and requires some trial-and-error. To reduce this tedious process, we have provided in Section VI-B some guidelines to help determine B.
Choice of classifier. AREBA does not impose any restrictions on the selection of the classifier. Some classifiers, however, are more suitable than others for online learning. Our scope was on neural networks which have been shown to work well (e.g. [3], [26]). Hoeffding Trees have also been shown to work well (e.g. [24]). Future work will apply AREBA with trees to examine if the observed improvement can generalise.
Verification latency. The learning framework used is suitable for human-in-the-loop learning and assumes that no verification latency exists. Involving humans, however, may cause delays in receiving the labels. In practice, avoiding or reducing potential delays would require mechanisms to collect labels in an automatic manner. We have shown in Section VII-C that AREBA maintains its dual-nature benefits and still outperforms other state-of-the-art algorithms in conditions where this assumption is violated. Future work will relax this assumption and examine other paradigms, such as active learning [51].
To conclude, we introduced the novel Adaptive REBAlancing (AREBA) algorithm to address the problem of class imbalance in nonstationary environments. We provided new and interesting insights into the joint problem of imbalance and concept drift. Our study compared AREBA to four other baseline and state-of-the-art algorithms and showed that it significantly outperforms them in the vast majority of the compared settings.