Adaptive Learning With Extreme Verification Latency in Non-Stationary Environments

Existing Data Stream Mining algorithms assume the availability of labelled and balanced data streams. However, in many real-world applications such as Robotics, Weather Monitoring, Fraud-Detection systems, Cyber Security, and Human Activity Recognition, the vast amounts of high-speed data generated by Internet of Things sensors and the real-time data on the Internet are unlabelled. Furthermore, the prediction models need to learn in Non-Stationary Environments due to evolving concepts. Manual labelling of these data streams is not practical because it requires domain expertise and a prohibitive amount of time and resources. To deal with such scenarios, existing approaches rely on self-learning or Cluster-Guided Classification (CGC) to predict pseudo-labels, which are then used to update the prediction models. Previous studies have yet to establish a clear and conclusive view as to when and why one pseudo-labelling approach should be preferred to another, and what causes an approach to fail. In this research, we propose a novel approach, “Predictor for Streaming Data with Scarce Labels” (PSDSL), which is capable of intelligently switching between self-learning, CGC and micro-clustering strategies based on the problem it is applied to, i.e., the different characteristics of the data streams. PSDSL introduces a novel approach called Envelope-Clustering to resolve conflicts during cluster labelling, together with a confidence measure to ensure the quality and correctness of the labels assigned to the clusters. The automatic parameter tuning mechanism of PSDSL eliminates the human dependency and determines the best value for the number of centroids from the initial labelled data. The predictive performance of PSDSL is evaluated on non-stationary datasets, synthetic data streams, and real-world datasets.
The approach has shown promising results on randomised datasets as well as on synthetic data streams when compared with state-of-the-art approaches. This is the first large-scale study on an adaptive extreme verification latency approach that supports automatic parameter tuning and intelligent switching of the pseudo-labelling strategy, thus reducing the dependency of machine learning on human input.

In many real-world applications of online data stream mining, the data originates from different sources such as sensor devices, social media, business/financial transactions, etc. The data evolves over time, and therefore extracting worthwhile knowledge is hard to achieve in such Non-Stationary Environments (NSEs). The underlying probability distributions of the data stream change over time, resulting in concept drifts [12].
This change occurs in the set of input variables 'x' and/or the class labels 'y', i.e., $P_t(x, y) \neq P_{t+1}(x, y)$ at time 't'. Two different types of concept drift exist. In ''virtual drift'' [19], only the distribution of the input data 'x' changes, i.e., $P_t(x) \neq P_{t+1}(x)$, and the class labels are not affected, i.e., $P_t(y|x) = P_{t+1}(y|x)$. ''Real drift'' refers to any gradual or sudden change in the class labels due to changes in the distribution $P(y|x)$.
An Initially Labelled Non-Stationary Environment (ILNSE) combines the challenges of Extreme Verification Latency (EVL) and NSEs. For example, autonomous robots [65] are initially trained inside a specific environment on labelled data and known classes. Later, they are sent to explore an unknown environment without human supervision. The robots also need to adapt themselves to changing environments, and must do so without true class labels for the sensor data. Another application is credit card fraud detection [1], in which the true class label of a particular transaction is unknown: it is impossible to say whether it is fraud or non-fraud until the user receives and reviews the monthly statement. This is a verification latency (VL) scenario, because the true class labels only become available in the future, and it is also an NSE, because customers' spending patterns change seasonally and/or during holidays due to changes in their geographic locations, and these factors can result in concept drifts.
Learning under scarcity of class labels is challenging in data streams because the true class labels for future emerging data instances are not available. In data stream clustering, the algorithms try to group semantically similar objects together. However, the clusters could easily be misclassified in the absence of true class labels. The choice of pseudo-labels in cluster labelling could be problematic, as the pseudo-labels are predicted using the same model, and in NSEs these labels could make the models less reliable. In Cluster Guided Classification (CGC), the labels from the nearest clusters are transferred; however, the algorithm does not implement a confidence measure to assure the quality and correctness of the labels assigned to the clusters.
In data streams, instances arrive in sequential order and are fed directly into the online learning models; storing and referring to previous data is therefore not practical due to time limitations. The output of an adaptive classifier at every time step depends on the instances the classifier has been trained on to date. Hence, performance depends on the order of instances in the dataset, and Žliobaitė [68] suggested executing multiple tests with randomised copies of a data stream. Existing benchmarks for non-stationary datasets [8], [26] are designed to evaluate CGC on EVL by inducing gradual shifts in the clusters. CGC shows promising results due to the high purity of clusters [76]; however, when the order of these datasets is randomised, CGC performance drops considerably. This supports the view that the existing CGC approaches succeed only under certain conditions.
Real-time streaming data is usually unlabelled and unstructured; therefore, supervised learning algorithms are not very effective due to their dependency on class labels. Most of the algorithms for learning from NSEs, including prior efforts on Heterogeneous Dynamic Weighted Majority (HDWM) [13], have focused on supervised approaches. The terms homogeneous and heterogeneous refer to the data mining algorithms used in the process, where homogeneous refers to the use of only one data mining algorithm, and heterogeneous refers to the use of different data mining algorithms [44]. HDWM makes use of ''seed'' learners, in which different types of classifiers maintain the diversity of the ensemble. To support EVL, HDWM was extended into a new approach, PSDSL. Fig. 1 shows the positioning of the presented work in the literature. It is placed at the intersection of EVL and NSE, while data stream clustering and stream classification both intersect with EVL and NSE.
PSDSL automatically selects the best classifier from a pool of heterogeneous classifiers. It can also switch the pseudo-labelling strategy, i.e., cluster guided, self-learning or micro-clustering, and selects whichever approach is beneficial based on the characteristics of the data stream. We also introduce a new approach called envelope-clustering to resolve conflicts during cluster labelling, and suggest a confidence measure to ensure the quality and correctness of the labels assigned to the clusters.
PSDSL is empirically evaluated against existing state-of-the-art approaches, namely COMPacted Object Sample Extraction (COMPOSE) [21], Learning Extreme VErification Latency with Importance Weighting (LEVEL IW) [11], Stream Classification Algorithm Guided by Clustering (SCARGC) [8] and Micro Cluster for Classification (MClassification) [70], on benchmark NSE datasets [8], Massive Online Analysis (MOA) [25] data streams and real-world datasets [8]. We also introduce a hyperparameter tuning mechanism in PSDSL which helps the algorithm to automatically suggest the best value for the number of centroids 'k'. The predictive performance of PSDSL and SCARGC was also evaluated after randomising the benchmark non-stationary datasets, i.e., shuffling the order of the training instances. The results showed that PSDSL performed significantly better than existing approaches when the instances of the datasets were randomised or noisy.
This paper is structured as follows. Section II presents related work. Section III then describes the proposed PSDSL approach. The experimental setup and experimental evaluation are described in Section IV. Finally, Section V provides a discussion of the presented research and sets out concluding remarks.
Several approaches exist that address the problems associated with NSE and EVL in isolation. However, few algorithms address both issues simultaneously. ILNSE is a challenging task because the learning algorithms have no access to the true class labels directly after the drift occurs. From the literature, it is not clear when and under what conditions one approach is better than the other, and what causes one approach to fail. Most real-world data streams are continuous and infinite. Unlike in offline data mining, in data streams there is no prior information about the number of classes, and this number may change in the future. In some specific conditions, the CGC algorithms could be more effective than self-learning if the data favours clustering, i.e., forms high-purity clusters. These issues make it difficult to choose the right EVL approach for different problems.
The number of centroids 'k' has a great influence on clustering results. In offline machine learning, this parameter is iteratively tuned on finite datasets; however, data streams are infinite and arrive at high speed, and the existing EVL approaches rely on the manual selection of parameter 'k'.
Set out below are the questions addressed in the presented research, which are followed by answers to the research questions and contributions.
• RQ3: What strategy should be adopted if one of the EVL approaches fails?
The following is a summary of the answers to the research questions as an extract of the findings of the research presented in this paper. To deal with EVL under NSEs, the most successful approaches are self-learning and CGC. SCARGC [8] is a well-known algorithm that is based on CGC to deal with EVL under NSE. SCARGC performs well on certain datasets in which centroids are moving at a constant speed; however, when the order of the training instances is shuffled, its predictive performance is significantly reduced. This confirms the influence of randomisation on the prediction capabilities of the CGC approach (RQ1). To address (RQ3), PSDSL is made capable of intelligently switching between self-learning and CGC, based on the problem it is applied to, i.e., the different characteristics of the data streams. In SCARGC [8] and COMPOSE [21], which is another approach that addresses EVL under NSEs, the values of parameter 'k' are chosen manually to achieve the best results on different datasets (RQ2). MClassification [70] does not require the number of clusters to be known prior to execution, as SCARGC does; however, it is computationally expensive. LEVEL IW [11] is highly dependent on the value of the Gaussian kernel bandwidth. PSDSL, by contrast, is capable of hyperparameter tuning for the best values of parameter 'k' and does not depend on the Gaussian kernel.
This paper provides the following novel contributions:
• Empirically evaluated PSDSL against standalone approaches based on micro-clusters, self-learning and CGC. The predictive accuracies of PSDSL, COMPOSE, LEVEL IW, SCARGC and MClassification are evaluated on benchmark NSE datasets [8].
• Introduced the envelope-clustering (centroid-based clustering for micro-instances) approach to resolve conflicts during cluster labelling, and suggested a confidence measure to ensure the quality and correctness of the labels assigned to the clusters.
• Introduced a hyperparameter tuning mechanism in PSDSL to automatically suggest the best value for the number of centroids 'k'.
• Evaluated the predictive performance of PSDSL and SCARGC after randomising the benchmark non-stationary datasets, i.e., shuffling the order of the training instances.
In the comparative analysis of existing EVL learning algorithms presented by Umer and Polikar [77], the average prediction accuracy of SCARGC is the highest across all the benchmark non-stationary datasets [26] when compared with COMPOSE, LEVEL IW and MClassification; therefore, only SCARGC was implemented in MOA [25] for comparison on the MOA data streams.

II. RELATED WORK
A recent comprehensive survey and comparative analysis of some of the EVL algorithms is available in the literature [77]. Data stream classification is a process to extract effective knowledge and thereby unlock valuable insights arising from large amounts of real-time data. In semi-supervised data stream mining, a clustering step is followed by a classification step, and the two are applied repeatedly in a closed loop. The clustering task is to group similar objects. Pseudo-labelling is the process of using a model trained on labelled data to predict labels for unlabelled data. The clustering algorithms make use of unlabelled data to predict the pseudo-labels, which are fed to the classifiers to update the prediction models. EVL can be handled using self-learning [45], [46], active learning [47], [48] or CGC [8], [49], [50]. Dyer et al. [21] applied clustering and ensemble learning to deal with label scarcity and drift handling. Graph mining, clustering, and ensemble approaches have been used by Zhang et al. [22] for mining data streams with concept drifts. Trees with a clustering approach were used by Xindong et al. [23] to deal with recurrent drifts. The following section discusses some of the more promising online ensemble classifiers that can be used in EVL under NSEs.

A. ONLINE ENSEMBLES FOR NSE
Ensembles are learning models grouped together in an effort to improve on the prediction capabilities of single classifiers. Ensemble methods are one of the most promising research directions [51]. A comprehensive survey on ensemble learning for data streams is available in the literature [20]. The following are some of the online ensembles dealing with concept drifts.
Weighted Majority Algorithm (WMA) [17] combines different types of base classifiers, each with an initial weight equal to '1'. The weight of a base classifier is updated each time it makes a wrong prediction. The final prediction is made based on the weighted majority vote among the base learners. WMA base learners are heterogeneous, potentially helping to produce more diverse ensembles. However, one of the drawbacks of WMA is that it uses a fixed number of base learners.
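The weighting scheme described for WMA can be sketched as follows. This is a minimal illustration and not the implementation from [17]: the class name `WeightedMajority`, the representation of base learners as plain callables, and the multiplicative penalty `beta = 0.5` are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, Hashable, List, Sequence


class WeightedMajority:
    """Weighted majority vote over a fixed pool of (possibly heterogeneous) experts."""

    def __init__(self, experts: Sequence[Callable[[object], Hashable]], beta: float = 0.5):
        if not 0.0 < beta < 1.0:
            raise ValueError("beta must lie in (0, 1)")
        self.experts: List[Callable[[object], Hashable]] = list(experts)
        self.beta = beta  # hypothetical penalty factor applied on a wrong prediction
        self.weights: List[float] = [1.0] * len(self.experts)  # every expert starts at weight 1

    def predict(self, x):
        # Sum the weights of the experts voting for each label.
        votes = defaultdict(float)
        for expert, weight in zip(self.experts, self.weights):
            votes[expert(x)] += weight
        return max(votes, key=votes.get)

    def update(self, x, y_true) -> None:
        # Penalise only the experts whose own prediction was wrong.
        for i, expert in enumerate(self.experts):
            if expert(x) != y_true:
                self.weights[i] *= self.beta
```

Because the pool size never changes, an expert that is repeatedly wrong keeps its slot with a shrinking weight; this is the fixed-size limitation noted above.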
Dynamic Weighted Majority algorithm (DWM) [18] is similar to WMA but it uses a dynamic ensemble size. Despite using the WMA weighting mechanism, DWM does not exploit one of the key aspects of WMA, the use of different types of base models.
Heterogeneous Dynamic Weighted Majority (HDWM) [13] is the ensemble adopted in this study, as it makes use of different types of ''seed'' learners to maintain the diversity of the ensemble and to overcome the problems of existing dynamic ensembles, which may undergo a loss of diversity due to the exclusion of base learners.
Additive Expert Ensembles (AddExp) [52] uses weighted majority vote and adds a new base model at every wrong prediction by the ensemble. Online Accuracy Updated Ensemble [53] combines block-based and online ensemble methods.

B. EVL APPROACHES FOR NSE
Several approaches have been proposed to handle EVL and NSE over the last few years. In particular, the following algorithms were developed to address this problem.
SCARGC [8] applies K-Nearest Neighbour to build the classification models. The algorithm stores instances in batches or in a pool. The initial classifier is trained using labelled instances; it predicts pseudo-labels for the unlabelled instances, which are stored in the pool. When the pool size reaches 'θ', a user-provided value, clusters are formed and the new centroids receive their labels from the previous centroids. The new centroids are then used to predict new class labels for the pool data. The algorithm follows a closed loop by switching between clustering and classification.
MClassification [70] uses the concept of micro-clusters [38]. The algorithm uses the tuple $(N, \vec{LS}, \vec{SS}, y)$ to store sufficient statistics from a set of examples. The authors of [70] calculated the centre and radius of micro-clusters using eq. (1) and (2).
where:
• $\vec{LS}$ is the linear sum of the N data points
• $\vec{SS}$ is the squared sum of the N data points
• N = number of data points
• y = class label for a set of data points
New data points are absorbed into the existing micro-clusters, which updates their radius and centroid. If the radius would exceed the threshold set by the user, a new micro-cluster is created instead, and this process repeats in a loop for each newly received unlabelled data point. For example, a new data point $\vec{x}$ can be absorbed into $MC_A = (N_A, \vec{LS}_A, \vec{SS}_A)$ by updating the summary statistics in the following way:
$$N_A \leftarrow N_A + 1, \quad \vec{LS}_A \leftarrow \vec{LS}_A + \vec{x}, \quad \vec{SS}_A \leftarrow \vec{SS}_A + \vec{x}^2$$
Similarly, when merging two disjoint micro-clusters $MC_A$ and $MC_B$, the union of these two clusters is equal to the sum of its parts, and the sufficient statistics are calculated as:
$$MC_{A \cup B} = (N_A + N_B, \; \vec{LS}_A + \vec{LS}_B, \; \vec{SS}_A + \vec{SS}_B)$$
Micro-clusters are generated for the initial labelled examples, and new unlabelled data instances are accorded their respective labels from the nearest clusters based on the Euclidean distance.
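The additive behaviour of the micro-cluster statistics can be illustrated with a short sketch. The class `MicroCluster` and its method names are hypothetical. The centre is taken as LS/N and the radius as the root of the per-dimension variance summed over dimensions, sqrt(Σ_d (SS_d/N − (LS_d/N)²)); these are the usual CluStream-style definitions and are assumed here, since they may differ in detail from eq. (1) and (2) of [70].

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MicroCluster:
    n: int            # N: number of absorbed points
    ls: List[float]   # LS: per-dimension linear sum
    ss: List[float]   # SS: per-dimension squared sum
    label: int        # y: class label of the micro-cluster

    @classmethod
    def from_point(cls, x: Sequence[float], label: int) -> "MicroCluster":
        return cls(1, [float(v) for v in x], [float(v) * v for v in x], label)

    def absorb(self, x: Sequence[float]) -> None:
        # Incremental update: N + 1, LS + x, SS + x^2.
        self.n += 1
        self.ls = [a + v for a, v in zip(self.ls, x)]
        self.ss = [a + v * v for a, v in zip(self.ss, x)]

    def merge(self, other: "MicroCluster") -> "MicroCluster":
        # The union of two disjoint micro-clusters is the sum of their statistics.
        # Keeping the label of `self` is an assumption of this sketch.
        return MicroCluster(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            [a + b for a, b in zip(self.ss, other.ss)],
            self.label,
        )

    def centre(self) -> List[float]:
        return [a / self.n for a in self.ls]

    def radius(self) -> float:
        # Assumed definition: sqrt of the summed per-dimension variance.
        var = sum(s / self.n - (l / self.n) ** 2 for l, s in zip(self.ls, self.ss))
        return math.sqrt(max(var, 0.0))  # guard against tiny negative rounding errors
```

Only three quantities per cluster are stored, so absorbing a point or merging two clusters costs time proportional to the number of dimensions, independent of how many points the cluster already summarises.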
COMPOSE [21] also addresses the EVL problem. Initially, the labelled instances are used to build a base classifier, either a Gaussian mixture model or k-nearest neighbour, to obtain a hypothesis and predict class labels. It then constructs the α-shape (density estimation) using the Compaction Percentage and assigns the labels that typically lie in the centre of the feature space for each class. The Core Support Extraction (CSE) extracts those newly labelled data drawn from the central region of the current distribution.
LEVEL IW [11] relies on a least-squares probabilistic wrapper classifier, which predicts the labels for the unlabelled test data; these data then become the labelled training data for the current time step. To predict the labels for the unlabelled test data, the algorithm takes four parameters: 1) the training data at the current time step, 2) the corresponding labels, 3) the unlabelled test data at the current time step, and 4) the kernel bandwidth value σ. The algorithm then follows a closed loop.

C. DATA STREAM CLUSTERING
A series of surveys incorporate the latest developments in the field of Semi-Supervised Learning (SSL) methods, which are closely related to label scarcity issues [24]. Several surveys and reviews on stream clustering algorithms are available [27], [42], [43], [63]. Examples of data stream clustering algorithms are incremental k-means [56], E-Stream [54], HUE-Stream [55], CluStream [38], StreamKM++ [57], StreamLS [37] and SWClustering [58]. Density-based algorithms are intended to group arbitrary-shaped clusters; examples are DenStream [59], LDBSCAN [60], D-Stream [61], and MR-Stream [62]. However, existing surveys focus on offline learning for static data and make two basic assumptions: 1) large training datasets are available; and 2) training and test data are stationary. The CluStream algorithm divides the clustering process into online and offline components. The online component computes and stores summary statistics of the data stream as micro-clusters. The offline component applies k-means to these micro-clusters to form macro-clusters.

D. DRIFT DETECTION IN EVL
Several approaches for drift adaptation are available in the literature [14], [15], [16], [18], [33]. Some data stream clustering algorithms adapt to concept drift implicitly as part of the learning process. More specifically, in EVL, when new instances arrive, the clusters are updated to reflect new concepts. The number of clustering algorithms explicitly addressing concept drift is very limited.
To address the non-stationary nature of data, most available algorithms apply window models, with the exception of two, ODAC (Online Divisive-Agglomerative Clustering) [28] and FEAC-Stream (Fast Evolutionary Algorithm for Clustering data stream) [29], which use explicit concept drift adaptation. ODAC partitions the streams into different time windows. It constructs an incremental tree-like hierarchy of clusters and continuously monitors the diameters of clusters. FEAC-Stream uses the Page-Hinkley Test [30] to detect concept drifts.
CUSUM (cumulative sum approach) [31] was applied in the work of Namitha and Santhosh [32] for identifying virtual drifts in data stream clustering problems. MC-NN (Micro-Cluster Nearest Neighbour) [64], [74] aims to keep a recent and accurate summary of the data stream, and these micro-clusters are used for feature selection and detecting concept drift.

E. HYPERPARAMETER TUNING APPROACHES
Hyperparameters are parameters that need to be initialised before learning begins; these parameters control the learning process. Several data stream clustering algorithms apply k-means [34] due to its simplicity, scalability, and empirical success in many real-world applications [35]. However, one of the pitfalls of k-means is its dependency on the number of centroids 'k', which must be specified prior to the learning. To extend k-means-based algorithms to evolving data streams with a variable 'k', de Andrade Silva and Hruschka [36] describe an algorithmic framework that enables the automatic estimation of 'k' based on the data.
The authors applied three state-of-the-art algorithms for clustering data streams (Stream LSearch [37], CluStream [38], and Stream KM++ [39]) combined with two well-known algorithms for estimating the number of centroids 'k', namely Ordered Multiple Runs of k-means [40] and Bisecting k-means [41].

F. LIMITATIONS OF THE APPROACHES DESCRIBED IN THE LITERATURE
This section highlights the gaps in the literature and explains in what way PSDSL fills these gaps. The most influential parameter for LEVEL IW is the value of the kernel width σ used in the Gaussian kernel. COMPOSE relies on Core Support Extraction (CSE), which is computationally very expensive, especially for high-dimensional data. Furthermore, the process is critically dependent on the parameter 'CP', which defines the Compaction Percentage of current labelled instances to use as core supports, and selecting the best value for it is problematic. Importantly, PSDSL does not rely on CSE or the CP parameter.
When analysing the results published in the respective papers of LEVEL IW and COMPOSE, it is difficult to determine which performs better; the outcome seems to be strongly dependent on the application. COMPOSE showed better results than LEVEL IW when there was a significant class overlap. COMPOSE uses the parameter 'k', the number of centroids, and LEVEL IW uses σ, which is the value of the kernel bandwidth. In the case of complete class overlap, when no ground truth data is available, it is extremely difficult for either algorithm to recover. PSDSL is made capable of switching the learning strategies based on the problem it is applied to; this strategy is explained in Section III-A. PSDSL also introduces envelope-clustering to recover from class overlaps. This is explained in Section III-C.
The predictive performance of SCARGC is highly dependent on clustering, and it also requires some prior knowledge such as the number of centroids 'k' and pool size 'θ' which may significantly affect the predictive performance when such information is not available. To choose the best value of 'k' which is suitable for a particular data stream, the algorithm needs to run several times with different values of 'k' and pick the 'k' that gives the best predictive accuracy. PSDSL applies an auto parameter tuning mechanism which determines the best value of 'k'.
VOLUME 10, 2022

III. THE PSDSL ALGORITHM
PSDSL is implemented in MOA [25], an open-source framework for data stream mining. PSDSL is designed to work under EVL and NSEs and performs the following tasks on the initial labelled data.
1) Decide on the best classifier from a pool of heterogeneous classifiers.
2) Decide on the pseudo-labelling strategy, i.e., cluster guided or self-learning using classifiers.
3) Build offline micro-clusters and apply them online on-demand only in the case of drift detection.
4) Perform hyperparameter tuning to determine the best value of 'k'.
Clusters are generated from the unlabelled data stream and periodically updated in real time. To handle the virtual drifts that occur due to changes in the distribution of input data, i.e., $P_t(x) \neq P_{t+1}(x)$, PSDSL establishes a mapping between previous and current clusters ($C_t \rightarrow C_{t+1}$) by assigning each current centroid the label of its 'k' nearest past centroids. An overview of PSDSL is shown in Fig. 2. In step 1, a set of heterogeneous classifiers is trained on a small number of labelled instances, and ground truth clusters are formed. The information on the ground truth clusters is passed to the hyperparameter tuning (step 2) and the switching of pseudo-labelling states (step 3), which are explained in Sections III-D and III-A respectively. In step 4, overlapping of the clusters is determined (explained in Section III-C); if the confidence levels of the cluster labels fall below a user-provided threshold, envelope-clusters are formed to resolve the conflict in labelling. Finally, in step 5, the pseudo-labels are fed back to update the classifiers for predictions.

A. SWITCHING OF PSEUDO-LABELING STATES
PSDSL can switch between three learning states for pseudo-labelling: 1) cluster guided, 2) self-learning and 3) micro-clustering. The switching mechanism of PSDSL is illustrated in Fig. 3.
In a situation where pseudo-labelling does not improve the predictive performance on the initially labelled data, PSDSL switches off the pseudo-labelling state. For this, Ensemble 'GT (Ground Truth)' is trained on the complete set of initial labelled data, while Ensemble 'PL (Pseudo-Labelling)' is trained on only 80% of the training data. Ensemble 'PL' predicts the pseudo-labels for the remaining 20% and trains itself on these pseudo-labels.
If the prediction accuracy of Ensemble 'PL' improves over that of 'GT', the self-learning state is enabled; otherwise it is suspended. The cluster guided state is enabled when the mean value of F1-P and F1-R [69] is higher than a user-provided threshold 'ρ'. The evaluation metrics F1-P and F1-R are named as they are in MOA [25]. The F1-measure is the harmonic mean of precision and recall. F1-P calculates the total F1-score for each found cluster instead of for all ground truth clusters, while F1-R is calculated by maximising F1 for each ground truth class.
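The two enabling rules above can be written as a small decision function. This is a sketch only: the names `decide_states` and `PseudoLabelStates` are hypothetical, and the accuracies and F1 values are assumed to have been measured beforehand on the initial labelled data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PseudoLabelStates:
    self_learning: bool
    cluster_guided: bool


def decide_states(acc_gt: float, acc_pl: float,
                  f1_p: float, f1_r: float, rho: float) -> PseudoLabelStates:
    """Apply the two enabling rules to pre-computed evaluation results.

    acc_gt: accuracy of Ensemble 'GT' (trained on all initial labelled data)
    acc_pl: accuracy of Ensemble 'PL' (trained on 80%, self-trained on 20%)
    f1_p, f1_r: cluster evaluation scores on the ground truth clusters
    rho: user-provided threshold for the cluster guided state
    """
    # Self-learning is enabled only if pseudo-labelling improved on the baseline.
    self_learning = acc_pl > acc_gt
    # The cluster guided state is enabled when mean(F1-P, F1-R) exceeds rho.
    cluster_guided = (f1_p + f1_r) / 2.0 > rho
    return PseudoLabelStates(self_learning, cluster_guided)
```

For example, `decide_states(0.90, 0.88, 0.8, 0.7, 0.6)` leaves self-learning suspended and enables the cluster guided state, since pseudo-labelling lowered the accuracy while the mean F1 of 0.75 exceeds ρ.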

B. CLUSTER GUIDED CLASSIFICATION IN PSDSL
In PSDSL, clusters guide the classification algorithms. Fig. 4 shows concepts v 1 (t) and v 2 (t+1) as a function of time 't', where 'x' are the input attributes and 'c' are the classes.
The steps used in this approach are given below.
1) In concept v 1 , at time (t), the labelled instances generate {C 1 . . . C n } clusters representing {c 1 . . . c n } classes in the initial labelled data.
2) When unlabelled data arrives at time (t+1) and the data distribution changes, the concept v 1 changes to v 2 and the clusters receive labels from the nearest clusters.
3) More new data arrives at time (t+2) for which class labels are missing (shown in white circles) and pseudo-labels are required.
4) At time (t+3), the unlabelled instances 'x' receive pseudo-labels from the nearest clusters 'C' using the Euclidean distances between 'C' and 'x'.
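Steps 2 and 4 both reduce to a nearest-neighbour lookup under the Euclidean distance, which can be sketched as below. The function names are hypothetical, and centroids are assumed to be plain coordinate tuples produced by some clustering step that is not shown.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def nearest_index(x: Point, centroids: Sequence[Point]) -> int:
    """Index of the centroid closest to x under the Euclidean distance."""
    return min(range(len(centroids)), key=lambda i: math.dist(x, centroids[i]))


def transfer_labels(prev_centroids: Sequence[Point], prev_labels: Sequence[int],
                    new_centroids: Sequence[Point]) -> List[int]:
    """Step 2: each new centroid inherits the label of its nearest previous centroid."""
    return [prev_labels[nearest_index(c, prev_centroids)] for c in new_centroids]


def pseudo_label(instances: Sequence[Point], centroids: Sequence[Point],
                 labels: Sequence[int]) -> List[int]:
    """Step 4: each unlabelled instance takes the label of its nearest centroid."""
    return [labels[nearest_index(x, centroids)] for x in instances]
```

A short usage example: with previous centroids `[(0, 0), (10, 10)]` labelled `[0, 1]` and drifted centroids `[(9, 9), (1, 1)]`, `transfer_labels` yields `[1, 0]`, which can then be passed to `pseudo_label` for the instances in the pool.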

C. ENVELOPE-CLUSTERING
The micro-clustering state is applied on demand when overlaps between clusters are detected. When clusters overlap, the nearest-cluster labelling approach encounters common issues such as losing the correct labels. Envelope-clustering detects and resolves conflicts in the labels assigned to the clusters. Current micro-clusters receive their labels from the previously labelled clusters by voting for the class labels among their 'k' nearest neighbours. Consider a scenario where one group of clusters crosses another. As shown in Fig. 5, one group of clusters, 'C 2', is stationary, and 'C 1' is crossing it. There are two possible outcomes: 1) the Triangle cluster 'C 1' transfers its label to the Circle cluster 'C 2' upon intersection, converting the Circle cluster to Triangle; or 2) the Triangle cluster 'C 1' gets re-labelled upon intersection with 'C 2', thus turning the Triangle cluster into Circle. PSDSL generates envelope-clusters by transforming the micro-clusters into micro instances. Envelope-clusters are generated using centroid-based clustering, such as k-means. When no cluster overlaps are detected, only online micro-clustering is applied to calculate and store the summary statistics of the data stream; the macro-clusters are generated offline only when overlaps are detected, which increases the processing speed of micro-clustering. Finally, the conflicted clusters receive labels from the corresponding envelope-clusters. Section III-C-1 describes conflict detection and resolution steps in detail.

1) CONFLICT DETECTION
The confidence level for the cluster labelling based on the votes is calculated as in (3) below. When the confidence level falls below a user-provided threshold 'α', the algorithm reports the drift; otherwise, it transfers the labels to the corresponding clusters.

$$\text{Confidence Level} = \frac{\text{Max}(\lambda) - \text{Min}(\lambda)}{N} \qquad (3)$$
where λ are the class votes, Min and Max are the minimum and maximum number of votes per class, and 'N' is the total number of votes.
FIGURE 5. Cluster overlapping in the 1Csurr dataset [26], showing one class surrounding the other and resulting in two outcomes: 1) 'C 1' transfers its label to 'C 2', or 2) 'C 1' gets re-labelled upon intersection with 'C 2'.

2) CONFLICT RESOLUTION
It is necessary to resolve clustering conflicts and label the remaining clusters. The conflicted clusters receive their labels from the corresponding nearest envelope-clusters. Fig. 6 shows a plot for the 1Csurr dataset [26] as an example; the circle and triangle clusters are successfully labelled from previous clusters (unfilled circles and triangles) with high confidence levels. The figure shows six conflicts (diamonds) at threshold α = 0.5 with 3-nearest neighbours. For λ = [1, 2], i.e., one vote for class '0' and two votes for class '1', the confidence level is (2-1)/3 ≈ 0.33, which is below the 0.5 threshold. When there were no conflicts, λ = [3, 0], the confidence level of (3-0)/3 = 1.0 exceeded 0.5 and resulted in a successful label transfer, shown as filled circle and triangle clusters. Fig. 7 shows the resolution of the conflicts, in which the labels of the diamond conflicted clusters are assigned using the nearest envelope-clusters.
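The detection and resolution steps can be combined in one sketch. It assumes the confidence level is (Max − Min)/N over the per-class votes, which agrees with the worked values above, and that the envelope-cluster centroids and their labels have already been computed (for instance by k-means over the micro-cluster centres, which is not shown). All function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Point = Sequence[float]


def confidence_level(votes: Sequence[int]) -> float:
    """(Max - Min) / N over the per-class vote counts, zero counts included."""
    total = sum(votes)
    if total == 0:
        raise ValueError("at least one vote is required")
    return (max(votes) - min(votes)) / total


def knn_votes(centre: Point, prev_centres: Sequence[Point], prev_labels: Sequence[int],
              classes: Sequence[int], k: int = 3) -> List[int]:
    """Count the class votes of the k nearest previously labelled clusters."""
    order = sorted(range(len(prev_centres)),
                   key=lambda i: math.dist(centre, prev_centres[i]))
    counts = Counter(prev_labels[i] for i in order[:k])
    return [counts.get(c, 0) for c in classes]


def label_cluster(centre: Point, prev_centres: Sequence[Point], prev_labels: Sequence[int],
                  env_centres: Sequence[Point], env_labels: Sequence[int],
                  classes: Sequence[int], k: int = 3, alpha: float = 0.5) -> Tuple[int, bool]:
    """Return (label, conflict_flag) for one current micro-cluster centre."""
    votes = knn_votes(centre, prev_centres, prev_labels, classes, k)
    if confidence_level(votes) < alpha:
        # Conflict: fall back to the label of the nearest envelope-cluster.
        j = min(range(len(env_centres)), key=lambda i: math.dist(centre, env_centres[i]))
        return env_labels[j], True
    # No conflict: transfer the majority label from the neighbours.
    return classes[votes.index(max(votes))], False
```

With two classes and k = 3, the only possible vote patterns are a unanimous [3, 0] (confidence 1.0) or a 2-to-1 split (confidence about 0.33), so a threshold of α = 0.5 flags every non-unanimous vote as a conflict.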

D. HYPERPARAMETER TUNING
This step is an automated hyperparameter tuning approach used in PSDSL that determines the number of centroids 'k' to be used in clustering from the few initial labelled instances. Because the ground truth is available for these instances, the cluster evaluation uses extrinsic methods to assign a score to the clustering. It applies the mean values of (F1-P), (F1-R) [69], and purity (P) [76] to determine the optimum value of 'k'. Purity is a measure of the quality of clusters and determines the extent to which clusters contain a single class.
FIGURE 6. Conflict detection in micro-clusters using class votes from 3-nearest neighbours with threshold α = 0.5. The diamond shapes represent conflicts in cluster labelling due to low confidence values.
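One way to picture the selection of 'k' is to score the cluster assignments obtained for each candidate value against the ground truth labels and keep the best one. The sketch below is an interpretation rather than the MOA implementation: F1-P is taken as the best class F1 averaged over found clusters, F1-R as the best cluster F1 averaged over ground truth classes, and the final score as the mean of F1-P, F1-R and purity. The clustering itself is assumed to be done elsewhere, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Hashable, Mapping, Sequence


def _f1(overlap: int, cluster_size: int, class_size: int) -> float:
    """F1 of one (cluster, class) pair from their overlap and sizes."""
    if overlap == 0:
        return 0.0
    precision = overlap / cluster_size
    recall = overlap / class_size
    return 2.0 * precision * recall / (precision + recall)


def clustering_score(assignments: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Mean of F1-P, F1-R and purity for one clustering of labelled instances."""
    if len(assignments) != len(labels) or not labels:
        raise ValueError("assignments and labels must be non-empty and equally long")
    clusters = defaultdict(Counter)  # cluster id -> class counts inside it
    for c, y in zip(assignments, labels):
        clusters[c][y] += 1
    class_sizes = Counter(labels)
    counts = list(clusters.values())
    sizes = [sum(cnt.values()) for cnt in counts]
    # Purity: share of instances that belong to the majority class of their cluster.
    purity = sum(max(cnt.values()) for cnt in counts) / len(labels)
    # F1-P (assumed): best-matching class per found cluster, averaged over clusters.
    f1_p = sum(max(_f1(cnt[y], size, class_sizes[y]) for y in class_sizes)
               for cnt, size in zip(counts, sizes)) / len(counts)
    # F1-R (assumed): best-matching cluster per ground truth class, averaged over classes.
    f1_r = sum(max(_f1(cnt[y], size, class_sizes[y]) for cnt, size in zip(counts, sizes))
               for y in class_sizes) / len(class_sizes)
    return (f1_p + f1_r + purity) / 3.0


def select_k(candidates: Mapping[int, Sequence[Hashable]], labels: Sequence[Hashable]) -> int:
    """Pick the k whose cluster assignments score highest against the labels."""
    return max(candidates, key=lambda k: clustering_score(candidates[k], labels))
```

Combining purity with the two F1 variants matters because purity alone tends to reward larger values of 'k', whereas F1-P and F1-R penalise splitting a class over many small clusters.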

E. PSDSL PSEUDOCODE
The pseudocode for PSDSL is given in Algorithm 1. In EVL, the initially available labelled examples are of significant use in hyperparameter tuning to determine the optimal value of 'k' (the number of centroids). This parameter tuning approach is described in Algorithm 1.1. These labelled examples also play an important role in automatically deciding the best pseudo-labelling approach, such as self-learning or CGC. The switching mechanism is described in Algorithm 1.2, whereas concept drift detection and handling using clustering is described in Algorithm 1.3. The symbols and notations used in the algorithms are described in Table 1.
The PSDSL algorithm maintains a set of 'm' base classifiers and clusters. The inputs to the algorithm are 'n' training examples, of which the first τ instances are labelled and the remainder are entirely unlabelled. As shown in Algorithm 1, both labelled and unlabelled instances incrementally create the micro-clusters (line 5). When labelled instances arrive (line 6), a clustering algorithm is executed to generate C_t; it divides the data into clusters, associates each cluster with one of the classes (line 7), and trains the initial classifier ε.
When unlabelled instances arrive (line 9), the algorithm determines the learning state. If the self-learning state is active, it applies a prequential evaluation, predicting the pseudo-labels with the best classifier in the ensemble and re-training the ensemble on these predicted pseudo-labels (lines 10-12). When the self-learning state is inactive, it performs CGC (lines 14-32).
The unlabelled examples are stored in a pool or batch of size θ (line 14), the value of which is set by the user, and the tasks listed in lines 16-30 are performed periodically. The pool data is periodically analysed for potential drifts due to cluster overlaps in the micro-clusters (line 17). This process returns labelled micro-cluster instances and reports the state of drifts as described in Algorithm 1.3.
If drift is detected, envelope-clusters are formed using micro-cluster instances such that each cluster represents a class in the data (line 19). Envelope-clusters then transfer their labels to the nearest conflicted micro-clusters (lines 20-22). If no drift is detected, the clustering algorithm obtains C_{t+1} on the pool data (line 24) by applying the best value of 'k' obtained in Algorithm 1.1. Each new centroid receives its label from the nearest centroid, using the Euclidean distances between C_t and C_{t+1} (line 25). Finally, a set of heterogeneous base classifiers is trained using the pseudo-labelled instances (line 29).
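The label transfer between consecutive clusterings in the no-drift case can be sketched as follows. This is a simplified illustration rather than the authors' implementation; the function names and the toy centroids are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as coordinate sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def transfer_labels(old_centroids, old_labels, new_centroids):
    """Give each new centroid the label of its nearest old centroid."""
    new_labels = []
    for c_new in new_centroids:
        nearest = min(range(len(old_centroids)),
                      key=lambda i: euclidean(old_centroids[i], c_new))
        new_labels.append(old_labels[nearest])
    return new_labels


previous = [(0.0, 0.0), (5.0, 5.0)]   # centroids at time t
current = [(4.6, 5.3), (0.5, 0.2)]    # centroids at time t+1, slightly moved
print(transfer_labels(previous, [0, 1], current))  # [1, 0]
```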

1) ALGORITHM: HYPER PARAMETER TUNING
As outlined in Algorithm 1.1, there are three input parameters: a set of labelled instances, a clustering algorithm, and Kmax, the maximum number of centroids (k), provided by the user. Initially, the ground truth centroids are generated using the labelled instances (line 2) such that k = c, where 'c' is the number of classes. Lines 3 and 4 generate the micro-clusters and evaluate their purity, µPurity. Line 5 begins the loop that determines the best value of 'k' by iterating over the range from k = 2 to Kmax. In line 6, new clusters are generated after eliminating the ground truth labels from the labelled data; the user-provided clustering algorithm is applied with each incremented value of 'k'. In line 7, the ground truth clustering and the current clustering are evaluated.

2) ALGORITHM: SWITCHING LEARNING STATES
Algorithm 1.2 outlines the switching algorithm; it takes µPurity and Purity as inputs, and the parameter ρ is the switching threshold set by the user. The ensemble ε_GT (Ground Truth) is trained on the initial labelled data (line 5); this training set is then split in the ratio 80% to 20% (line 6). Another ensemble, ε_PL (Pseudo Label), trains on 80% of these training examples, then predicts the pseudo-labels for the remaining 20% and retrains itself on the predicted pseudo-labels (lines 7 and 8). As the ground truth labels of the initial training set are known, the predictive performance of ε_GT and ε_PL can be compared: if the overall prediction accuracy of ε_PL is higher than that of ε_GT, the self-learning state becomes active; otherwise, pseudo-labelling is suspended.
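The self-learning check can be sketched as below, under several assumptions that are not taken from the paper: a nearest-centroid classifier stands in for the ensembles, ε_PL is retrained from scratch on the 80% true labels plus the 20% pseudo-labels, and both models are scored on the full initial labelled set. All names are hypothetical.

```python
import math


def fit(X, y):
    """Nearest-centroid stand-in for an ensemble: one mean vector per class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for d, value in enumerate(x):
            acc[d] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc) for label, acc in sums.items()}


def predict(model, x):
    """Return the class whose centroid is closest to x."""
    return min(model, key=lambda c: math.sqrt(sum((p - q) ** 2 for p, q in zip(model[c], x))))


def accuracy(model, X, y):
    return sum(predict(model, x) == t for x, t in zip(X, y)) / len(y)


def self_learning_active(X, y, train_fraction=0.8):
    """True when the pseudo-label model beats the ground-truth model."""
    cut = int(len(X) * train_fraction)
    model_gt = fit(X, y)                               # trained on all true labels
    model_pl = fit(X[:cut], y[:cut])                   # trained on the 80% split
    pseudo = [predict(model_pl, x) for x in X[cut:]]   # pseudo-labels for the 20% split
    model_pl = fit(X, y[:cut] + pseudo)                # retrained with pseudo-labels
    return accuracy(model_pl, X, y) > accuracy(model_gt, X, y)


X = [(0, 0), (5, 5), (0, 1), (5, 6), (1, 0), (6, 5), (1, 1), (6, 6), (0.5, 0.5), (5.5, 5.5)]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
print(self_learning_active(X, y))  # False: both models are already perfect here
```

On this well-separated toy data the two models tie, so the strict "higher than" rule leaves self-learning inactive.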

3) ALGORITHM: DETECT CLUSTER DRIFT
The algorithm to detect cluster drift is given in Algorithm 1.3. The current micro-clusters C_{t+1} are associated with the previous clusters C_t by measuring the similarity to the 'k' nearest centroids q_i^t, i = 1, ..., k, using the Euclidean distance, i.e., Dist(q^t, q^{t+1}) (line 7). The 'k' nearest clusters vote for the class labels of the current clusters (line 12). To calculate the conflict ratio, the min-max values of the votes are applied to the formula (line 15). If the ratio is above the user-provided drift threshold, the current micro-clusters are assigned the label of the majority class vote.
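A compact sketch of this voting step is given below. It assumes integer class labels 0..n_classes-1 and treats a ratio below the threshold as a conflict; the function names and the one-dimensional layout of the toy centroids are illustrative only.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def vote_label(new_centroid, old_centroids, old_labels, n_classes, k=3, alpha=0.5):
    """Label a current centroid from its k nearest previous centroids.

    Returns (label, ratio); label is None when the ratio is below alpha.
    """
    order = sorted(range(len(old_centroids)),
                   key=lambda i: euclidean(old_centroids[i], new_centroid))
    counts = Counter(old_labels[i] for i in order[:k])
    votes = [counts.get(c, 0) for c in range(n_classes)]
    ratio = (max(votes) - min(votes)) / sum(votes)
    label = votes.index(max(votes)) if ratio >= alpha else None
    return label, ratio


old = [(0, 0), (1, 0), (2, 0), (6, 0), (7, 0), (8, 0)]
labels = [0, 0, 0, 1, 1, 1]
print(vote_label((0.9, 0), old, labels, n_classes=2))  # (0, 1.0)
label, ratio = vote_label((4.1, 0), old, labels, n_classes=2)
print(label, round(ratio, 2))                          # None 0.33
```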

F. COMPLEXITY OF PSDSL
PSDSL is a single-pass algorithm which splits the data stream into batches of predefined size such that each batch contains n examples. These batches are processed sequentially, requiring less computational time and space because only the information regarding the centroids and data points of the current batch is stored in memory. The complexity of PSDSL depends on the choice of learners. PSDSL intelligently switches learning strategies and applies an HDWM classifier for self-learning, or k-means and CluStream for CGC and micro-clustering respectively. Under EVL, when labelled data arrives, PSDSL executes hyperparameter tuning (Algorithm 1.1) and the switching of learning strategy (Algorithm 1.2) only once.
Hyperparameter tuning (Algorithm 1.   as n is much larger than q, k and i. The value of the parameter k is constant, having already been tuned using Algorithm 1.1. PSDSL requires fewer iterations for convergence because the initial centroids are trained on the initial labelled data from the stream and the new centroids obtain their labels from the nearest centroids. COMPOSE, on the other hand, is of order O(n^((d+1)/2)), i.e., exponential in dimensionality [77]. SCARGC has the worst time complexity, which is O(nki).

IV. EXPERIMENTS AND RESULTS
This section investigates the PSDSL algorithm and compares its performance with SCARGC [8], LEVEL IW [11], COMPOSE [21], and MClassification [70]. To verify statistically significant differences between algorithms, the Friedman test was applied, which is a suitable non-parametric test for comparing multiple algorithms on multiple datasets [66]. The Friedman test was applied with α = 0.05 to test the null hypothesis that there is "no statistical difference between the algorithms". The Nemenyi post-hoc test [67] was applied to identify which pairs of algorithms differ from each other. In EVL, few initial ground truth labels are available; therefore, internal evaluations were applied to the clusters, i.e., Purity, Precision and Recall [69].
Methods used for evaluating learning models in previous data stream mining studies include prequential, holdout [25] and Kappa statistics [72]. A prequential accuracy estimate is appropriate when all classes are approximately balanced [73]. Because it cannot be ascertained whether the classes in a stream are balanced, the Kappa statistic is a more sensitive measure for quantifying the predictive performance of streaming classifiers.
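For reference, the Friedman statistic can be computed from average ranks as sketched below, using the standard chi-square form of the test. The accuracy table is made up for illustration and is unrelated to the results reported in this paper.

```python
def average_ranks(scores):
    """Average rank per algorithm (rank 1 = highest score); ties share the mean rank.

    `scores` is a list of rows, one per dataset, each holding one score per algorithm.
    """
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
            for idx in order[i:j + 1]:
                totals[idx] += rank
            i = j + 1
    return [t / len(scores) for t in totals]


def friedman_statistic(avg_ranks, n_datasets):
    """Chi-square form of the Friedman statistic with k - 1 degrees of freedom."""
    k = len(avg_ranks)
    return 12 * n_datasets / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)


table = [[0.90, 0.80, 0.70],   # hypothetical accuracies: 3 datasets x 3 algorithms
         [0.95, 0.85, 0.60],
         [0.70, 0.80, 0.60]]
ranks = average_ranks(table)
print([round(r, 2) for r in ranks])                         # [1.33, 1.67, 3.0]
print(round(friedman_statistic(ranks, len(table)), 2))      # 4.67
```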
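The contrast between accuracy and Kappa on imbalanced data can be seen in a small example. The sketch below uses the usual Cohen's Kappa definition, κ = (p0 - pc) / (1 - pc), where p0 is the observed accuracy and pc the chance agreement; the function name is hypothetical.

```python
from collections import Counter


def kappa(y_true, y_pred):
    """Cohen's Kappa: agreement between predictions and truth, corrected for chance."""
    n = len(y_true)
    p0 = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    pc = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if pc == 1:
        return 0.0  # a single class throughout: no agreement beyond chance is measurable
    return (p0 - pc) / (1 - pc)


# A classifier that always predicts the majority class on a 9:1 stream:
y_true = [0] * 9 + [1]
y_pred = [0] * 10
print(sum(t == p for t, p in zip(y_true, y_pred)) / 10)  # 0.9 accuracy
print(kappa(y_true, y_pred))                             # 0.0
```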

A. EXPERIMENTAL SETUP
The evaluation procedure uses Kappa statistics and prequential testing. Prequential testing is a facility of MOA [25] in which each instance is used to test the model before it is used for training, and the accuracy is updated incrementally. PSDSL was compared with existing EVL approaches and with the 'Static' and 'Benchmark' settings. To determine how PSDSL performs with and without pseudo-labelling, the 'Static' approach was used, in which PSDSL does not apply pseudo-labelling. Further, to analyse the consequences of unlabelled examples in the data stream and their impact on predictive performance, 95% of the class labels were removed from each dataset and PSDSL was compared with the 'Benchmark' setting, in which all the training examples are labelled.
All the experiments are evaluated in terms of time consumption and predictive performance. Processing time is measured in seconds and is based on the CPU time used for training and testing. All the experiments were performed on machines with a Core i7 @ 3.4 GHz and 4 GB of RAM. The experiments were performed on non-stationary datasets [26], MOA-generated streams [25] and real-world datasets. The details of the algorithms and parameters used in the experiments for the existing EVL approaches are provided in Table 2.
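The test-then-train idea behind prequential evaluation can be sketched in plain Python as follows. This is not MOA's implementation; the majority-class learner and the toy stream are placeholders chosen only to keep the example self-contained.

```python
from collections import Counter


class MajorityClass:
    """Trivial incremental learner: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(stream, model):
    """Test on each instance first, then train on it; return the running accuracy."""
    correct, history = 0, []
    for seen, (x, y) in enumerate(stream, start=1):
        correct += int(model.predict(x) == y)  # test before training
        model.learn(x, y)                      # then train on the same instance
        history.append(correct / seen)
    return history


stream = [((0.1,), 1), ((0.4,), 1), ((0.9,), 0), ((0.3,), 1)]
print(prequential_accuracy(stream, MajorityClass())[-1])  # 0.5
```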

1) NON-STATIONARY DATASETS
Non-stationary datasets used in the experiments were provided by the authors of SCARGC [8] and are available to the machine learning community [26]. These datasets have been randomised and made available for further research [9]. They provide datasets with incremental changes over time. Here Unimodal Gaussian datasets represent two bi-dimensional Gaussian clusters rotating around a common axis. The distance between the Gaussian components changes over time. Class overlap exists in these datasets. The datasets UG-2C-2D, UG-2C-3D, and MG-2C2D were originally proposed by Dyer et al. [21].

2) MOA DATA STREAMS
The artificial data streams used in the experiments are generated through the MOA workbench [25]; the number of instances is 100,000 and the batch size is 1000 in all the streams. The MOA commands to generate these streams are available in Appendix I.
1) SEA: the data stream contains three attributes x_i ∈ R, with each value of x_i between 1.0 and 10.0. The target concept is determined using the equation y = [x_0 + x_1 + x_2 ≤ θ], such that θ ∈ {7, 8, 9, 9.5}.
2) RandomTree generates a stream based on a randomly generated tree.
3) LED generates a stream defined by a 7-segment LED display, and the task is to predict the digit (0-9).
4) Hyperplane is a flat surface in n-dimensional space that is useful for simulating gradually drifting concepts. Its orientation and position can be modified by slightly changing the relative size of the weights.
5) Random Radial Basis Function (RRBF) consists of a fixed number of randomly positioned centroids, each with a single standard deviation, class label and weight.
6) Keystroke dataset [8]: the task is to predict one of four users based on their typing patterns. The dataset contains keystroke records obtained from users who typed a fixed password in 8 different sessions.

The description of the datasets used in the experiments is provided in Table 3 and Table 4.
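To make the SEA concept in the list above concrete, the sketch below labels instances with the equation exactly as written there. It is not the MOA generator: the function names, the seed, and the choice to apply the four thresholds in consecutive equal-sized blocks (an abrupt drift at each block boundary) are assumptions for illustration.

```python
import random


def sea_label(x, theta):
    """y = [x0 + x1 + x2 <= theta], returned as 0 or 1."""
    return int(x[0] + x[1] + x[2] <= theta)


def sea_stream(n, thetas=(7, 8, 9, 9.5), seed=1):
    """Yield n labelled instances; the threshold changes at each block boundary."""
    rng = random.Random(seed)
    block = max(1, n // len(thetas))
    for i in range(n):
        theta = thetas[min(i // block, len(thetas) - 1)]
        x = tuple(rng.uniform(1.0, 10.0) for _ in range(3))
        yield x, sea_label(x, theta)


x = (2.0, 3.0, 3.5)                        # attribute sum is 8.5
print(sea_label(x, 8), sea_label(x, 9))    # 0 1
```

The same instance changes class when θ moves from 8 to 9, which is the kind of concept change the stream is meant to simulate.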
The batch size for the MOA Stream is 300. The information about drifts and class overlap is not available for the real-world datasets. Next, in Sections IV-B and IV-C, the predictive capabilities of PSDSL are tested on MOA data streams, benchmark non-stationary datasets and real-world datasets.

B. COMPARATIVE ANALYSIS OF PSDSL ON BENCHMARK DATASETS
Predictive accuracies of PSDSL, COMPOSE, LEVEL IW, SCARGC and MClassification (MC) were evaluated on benchmark non-stationary datasets [26] that were also used in the original SCARGC publication. Table 5 shows the prediction accuracies; the Friedman statistic χ²_r is 18.93 (df = 5, n = 15). The p-value of .0019 shows a significant difference between the algorithms (p < .05). The numbers in brackets represent the ranks: the lower the rank, the higher the predictive performance. Fig. 8 shows a critical difference diagram on ranked accuracies for the non-stationary datasets. For 6 algorithms and 16 datasets, the Critical Difference (CD) for the Nemenyi test [80] at α = 0.05 is CD = 1.82. The solid bar shows no significant differences between COMPOSE, LEVEL IW, SCARGC, MClassification and PSDSL; however, these performed significantly better than 'Static'. Table 6 presents the evaluation time in seconds; the results show that PSDSL achieved similar accuracies in less average computation time (8.01 seconds) on the non-stationary datasets.
Thus, LEVEL IW is found to be the second lowest performing algorithm in terms of computational complexity after MClassification and performs significantly worse than all other algorithms except SCARGC and PSDSL.

C. ANALYSIS OF MOA DATA STREAMS
The previous set of experiments was performed on offline datasets; a comprehensive analysis was also made on MOA data streams. As can be seen from Table 5, the average prediction accuracy of SCARGC is the highest (93.64%) across the benchmark datasets; therefore, we implemented it in MOA to compare it with our approach. A recent comparative analysis in the literature [77] reports no statistical significance at α = 0.05 for classification accuracy among COMPOSE, LEVEL IW, MClassification and SCARGC. LEVEL IW performs rather poorly on benchmark datasets with significant between-class overlap. MClassification and LEVEL IW are found to be computationally inefficient, as shown in Table 6. To analyse SCARGC and PSDSL, prequential accuracies, Kappa statistics and evaluation time were used, and the ranks for each algorithm were calculated. SCARGC and PSDSL were compared with the 'Static' and 'Benchmark' approaches, which are described in Table 2. The first batch, i.e., 300 instances, was kept labelled and the class labels of the remaining data stream were removed.

1) PREQUENTIAL ACCURACIES
In EVL these accuracies could not be evaluated due to the scarcity of true class labels; as true labels become available, the accuracy is calculated and presented for comparison. Table 7 presents the average accuracy (in %) achieved by the methods over the 12 MOA streams. The best results were highlighted in a comparison between the proposed method PSDSL and SCARGC, benchmark, and Static.
The overall results show that PSDSL performed better than all other approaches. The Friedman statistic χ²_r is 24.05 (df = 3, n = 11); the p-value of .00002 shows a significant difference between the algorithms (p < .05). The numbers in brackets represent the ranks.
To determine which algorithm(s) performed differently, Fig. 9 shows the critical difference diagram on ranked accuracies for the MOA streams. The connected solid lines represent groups of algorithms that are similar to each other, and any two algorithms are significantly different if the difference between their average ranks is at least CD [66]. For four algorithms and 12 streams, the CD for the Nemenyi test [67] at α = 0.05 is 1.41. The results show two groups of algorithms, i.e. PSDSL-Benchmark and SCARGC-Static. Significant differences are found between PSDSL and SCARGC: the performance of PSDSL is closer to the Benchmark, while no significant difference was found between SCARGC and Static.
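The critical difference can be checked with the standard Nemenyi formula, CD = q_α · sqrt(k(k+1) / (6N)), where k is the number of algorithms and N the number of datasets. The sketch below hard-codes the commonly tabulated critical values q_0.05 for k = 2 to 6; the function name is hypothetical.

```python
import math

# Commonly tabulated critical values for the two-tailed Nemenyi test at alpha = 0.05.
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850}


def nemenyi_cd(k, n_datasets):
    """Critical difference in average rank for k algorithms over n_datasets."""
    return Q_ALPHA_005[k] * math.sqrt(k * (k + 1) / (6 * n_datasets))


# Four algorithms with N = 11, the n reported with the Friedman statistic above.
print(round(nemenyi_cd(4, 11), 2))  # 1.41
```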

2) KAPPA STATISTICS
The Kappa evaluation measure is widely used in data stream mining, as it can handle both multi-class and imbalanced class problems. The larger the Kappa value, the more generalised and better the classifier. The Kappa statistics show results similar to those for average accuracy, with PSDSL performing significantly better than the other algorithms. Table 8 provides the Kappa statistics for the experiments. Table 9 presents the evaluation time in seconds for Static, Benchmark, SCARGC and PSDSL on the MOA streams. The results show that PSDSL achieved better average accuracy (72.7%) in less average computation time (58.38 seconds) than SCARGC (57.0% in 120.11 seconds), although it was not as fast as Benchmark and Static, because these do not apply pseudo-labelling.

4) SIGNIFICANT FINDINGS
As PSDSL does not apply a CGC approach on MOA streams and switches to a self-learning state, this improvement is due to the switching mechanism of heterogeneous base classifiers. Fig. 10 shows the predictive accuracy plots for MOA streams in which no drift is induced. The results show that PSDSL performed significantly better than SCARGC on all the MOA streams when there are no concept drifts. Fig. 11 shows the predictive accuracy plots for MOA streams in which artificial drift is induced. The results show that in EVL, when CGC fails, recovering from concept drift is challenging due to the unavailability of true class labels. In the SEA (Abrupt) and RandomTree (Recurring Drift) streams, all the algorithms restored learning after the sudden drifts. However, the graphs show that, before the first and after the last drift, the predictive performance of PSDSL is higher than that of the competing algorithms. This demonstrates that under EVL conditions PSDSL adapted to abrupt as well as recurring drifts better than the other algorithms. On LED, which is a multi-class problem, and Hyperplane, which contains incremental drifts, none of the approaches adapted to the drifts. Overall, PSDSL performed better than the other approaches on drift-induced MOA streams.
SCARGC performed best on the non-stationary datasets; however, its predictive performance did not improve when applied to MOA data streams. To further investigate the cause(s) of this failure, a randomisation analysis was carried out and is presented in Section IV-D.

D. ANALYSIS OF RANDOMIZATION
This experiment analyses the sequence of training data and its influence on prediction accuracies for CGC algorithms. In data streams, continuous data arrives at high speed and there is practically no control over the sequence of training data presented to the learning algorithms. Randomisation is thus different from noise, as it is not a random displacement of examples but a random order in which data instances are presented to the learning algorithm. In this section, RQ1 is addressed: are existing ILNSE approaches always successful when applied to different problems, and why do these approaches sometimes fail? The benchmark non-stationary datasets [8] [26] are randomised by shuffling the order of examples in the datasets [9]. Fig. 12 shows a plot of the Four Class Rotating (4CR) [26] dataset. The plot '4CR original dataset' on the left shows the initial 1000 examples, and the plot '4CR randomised' on the right shows the initial 1000 instances after shuffling the 144k instances in the dataset. The centroids in the dataset rotate gradually; therefore, examples originally located beyond the first 1000 positions appeared in the first batch and resulted in a noise effect. The change in the order of examples resulted in the loss of cluster boundaries. This is the scenario in real-time data streams, i.e., there is no control over the order of examples.
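The effect of shuffling on the first batch can be illustrated with instance positions alone. In the sketch below, integers stand in for the 144k instances of 4CR, and the function name and seed are arbitrary choices.

```python
import random


def randomise(instances, seed=0):
    """Return a shuffled copy; instance values are untouched, only their order changes."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    return shuffled


original = list(range(144_000))            # position of each instance in the stream
first_batch = randomise(original)[:1000]   # first batch after shuffling
late = sum(1 for position in first_batch if position >= 1000)
print(late)  # count of first-batch instances that originally arrived later
```

Since only 1000 of the 144,000 instances originally belonged to the first batch, almost every instance in the shuffled first batch comes from a later, rotated position of the clusters.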
CGC approaches rely on the assumption that the data follows a normal (Gaussian) distribution. This supports the clustering process by helping to generate distinct clusters. This assumption also makes CGC a more effective choice for imputing missing class labels. However, the normality (Gaussian) assumption of EVL approaches cannot hold for randomised datasets or for real-world data streams, as most such streams are unstructured and contain noise.
The results in Table 10 show the prediction accuracies achieved by the SCARGC and PSDSL algorithm on original and randomised datasets. The results show that SCARGC had a significant drop in average prediction accuracy by 35.3% on randomised datasets, whereas PSDSL only dropped by 20.9%.

E. SWITCHING MECHANISM IN PSDSL
FIGURE 12. Plot of the initial 1000 instances of the original Four Class Rotating (4CR) dataset (left) [26] versus the randomised 4CR dataset (right). The centroids in the 4CR dataset rotate gradually, and instances originally located after the first 1000 appear earlier once shuffled, resulting in a noise effect.

To address RQ3 (what strategy should be adopted if the CGC or self-learning approaches fail?), PSDSL is made capable of intelligently switching between learning states: CGC with k-means, micro-clusters, or self-learning. Table 11 shows that the switching mode in PSDSL depends on the {F1-P, F1-R and purity} scores of k-means and micro-clusters; PSDSL adopts the learning mode of whichever scores higher. For values lower than the threshold 'ρ', such as in randomised datasets or MOA streams, it switches to self-learning. Further, it monitors the performance of pseudo-labelling. If pseudo-labelling does not improve the predictive performance on the initial labelled data, PSDSL suspends the pseudo-labelling.
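The switching rule described in this section can be paraphrased in a few lines. The sketch assumes that each clustering mode is summarised by a single score (for example, the mean of F1-P, F1-R and purity); the function name, the state names and the tie-breaking in favour of k-means are illustrative choices rather than details from the paper.

```python
def choose_learning_state(kmeans_score, micro_score, rho):
    """Pick the learning state from the two clustering scores and the threshold rho."""
    if max(kmeans_score, micro_score) < rho:
        return "self-learning"           # neither clustering is reliable enough
    if kmeans_score >= micro_score:
        return "cgc-kmeans"
    return "micro-clustering"


print(choose_learning_state(0.92, 0.85, rho=0.7))  # cgc-kmeans
print(choose_learning_state(0.60, 0.81, rho=0.7))  # micro-clustering
print(choose_learning_state(0.41, 0.38, rho=0.7))  # self-learning
```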

F. HYPERPARAMETER TUNING
This section presents the analysis carried out to address RQ2: does this approach depend on parameters that require manual tuning by the users before inducing the training models? As shown in Table 12, SCARGC applies k = 4 for 1CSurr, which is a binary class problem; similarly, SCARGC applies k = 4 for FG_2C_2D and MG_2C_2D, which contain 5 and 2 classes respectively. Furthermore, the real-world dataset 'keystroke' contains 4 classes, but SCARGC applies k = 12 (number of centroids). In SCARGC these values need to be chosen manually by the user to achieve the best results. In contrast, PSDSL automatically tuned the best values for 'k'. As is evident from the table, for most of the datasets PSDSL predicted values for 'k' that were similar to those in SCARGC. The difference is that the parameter 'k' was set manually in SCARGC, while PSDSL automatically adapts the parameter 'k' to optimise the classification results over time.

G. PARAMETER SENSITIVITY ANALYSIS
The influence of the PSDSL parameters pool size (θ) and number of labelled examples |T| on the prediction accuracy is analysed. Fig. 13 shows the prediction accuracy in % for values of θ from 300 to 1500 and of |T| from 50 to 1000. As is clear from the plot, increasing the pool size increases the prediction accuracy; however, |T| has no significant effect on the accuracy.

V. DISCUSSION AND CONCLUSIONS
The twin constraints of lack of domain expertise availability and time-resources make the goal of evolving predictive models increasingly more impractical given the relentless volumes of data flowing across the network-centric cyberphysical IoT and semantic media spaces.
This study directly responds to this challenge in proposing a novel algorithm, PSDSL that can responsively switch between Self-Learning, micro-clustering and CGC; whichever approach is beneficial, based on the characteristics of the data stream. Accordingly, a new approach has been introduced, called envelope-clustering, which resolves the conflicts during the cluster labelling process. This approach applies a Confidence Measure to enhance the overall integrity of the labelling by ensuring the quality and correctness of labels assigned to the clusters.
It was concluded that the existing approaches such as SCARGC or COMPOSE perform well for certain datasets in which centroids are moving with a constant velocity. However, when SCARGC was evaluated after shuffling the training instances of the same datasets by changing the training orders, its predictive performance was significantly reduced. The results showed that PSDSL performed significantly better than SCARGC on most real-time data streams including randomised data instances. Thus, the prediction performance of pseudo-labelling has been evaluated by automatically switching between self-labelling and clusters labelling based on the characteristics of the training instances. This study has demonstrated that the PSDSL algorithm performed better than SCARGC for some non-stationary datasets when these were randomised. PSDSL was evaluated on artificially induced MOA streams and real-world data streams and the results showed significantly enhanced performance over SCARGC for most of the MOA streams.
Finally, it was found that, for SCARGC to achieve the best results on different datasets, the values of 'k' needed to be chosen manually, whereas PSDSL achieved similar predictive accuracies without the need for manual selection of the parameter 'k'. Thus, the novel approach proposed in this paper further paves the way for reducing the dependency of machine learning on human input, which liberates the process from this hard constraint, a critical bottleneck, and enables mass-scale deployment of dynamically adaptive labelling of data instances in various emerging data streams.

He has published over 60 articles in peer-reviewed conferences, journals, and book chapters. He has been working in the field of data mining for more than ten years, focusing on the research domain of big data analytics. His particular research interests lie in (i) developing scalable algorithms for building adaptive models for real-time streaming data; (ii) developing scalable parallel data mining algorithms and workflows; and (iii) applications in big data analytics.
Dr. Stahl is a member of the British Computer Society (BCS) and has been elected three times as a Committee Member of the BCS's Specialist Group on Artificial Intelligence (SGAI), serving on the committee since 2013.
ATTA BADII is currently a Research Professor of computer science with a focus in the fields of AI and data science. He has established a track record of key contributions to over 50 projects, including over 30 large-scale collaborative research programs, many of which he has initiated and led as the technical coordinator. He has pioneered several paradigms in user-centered assistive-ambient technologies, particularly personalized context-aware security-privacy protective design, and hardware-accelerated ML for real-time AI-assisted application domains such as anomaly detection in online time series data streams as applied to safety alerting, cyber-attack detection and financial transaction fraud detection.
He has served on various editorial and research steering boards as a Coordinator/Technical Leader/Invited Expert, for example, as the Chair of the Security Architectures and Virtualisation Taskforce of the European Road Map Project SECURIST, the Chair of the VideoSense: European Video-Analytics Network of Excellence, and Expert Advisor on the large-scale EPSRC-funded Research Program on Big Data and Human Rights at the University of Essex.