Dynamic Feature Selection for Clustering High Dimensional Data Streams

Change in a data stream can occur at the concept level and at the feature level. Change at the feature level can occur if new, additional features appear in the stream or if the importance and relevance of a feature changes as the stream progresses. This type of change has not received as much attention as concept-level change. Furthermore, a lot of the methods proposed for clustering streams (density-based, graph-based, and grid-based) rely on some form of distance as a similarity metric and this is problematic in high-dimensional data where the curse of dimensionality renders distance measurements and any concept of “density” difficult. To address these two challenges we propose combining them and framing the problem as a feature selection problem, specifically a dynamic feature selection problem. We propose a dynamic feature mask for clustering high dimensional data streams. Redundant features are masked and clustering is performed along unmasked, relevant features. If a feature’s perceived importance changes, the mask is updated accordingly; previously unimportant features are unmasked and features which lose relevance become masked. The proposed method is algorithm-independent and can be used with any of the existing density-based clustering algorithms which typically do not have a mechanism for dealing with feature drift and struggle with high-dimensional data. We evaluate the proposed method on four density-based clustering algorithms across four high-dimensional streams; two text streams and two image streams. In each case, the proposed dynamic feature mask improves clustering performance and reduces the processing time required by the underlying algorithm. Furthermore, change at the feature level can be observed and tracked.


I. INTRODUCTION
Along with time and memory constraints, change is an important consideration in data stream mining. Recognising and reacting to change is important for accurate, real-time analysis. Change in a stream can happen in a number of ways. Let $S = [x_t]_{t=0}^{\infty}$ denote a stream, where $x_t$ is a vector in $d$ dimensions arriving at time $t$. Let $Y$ represent the set of $k$ discovered clusters: $Y = \{y_1, \ldots, y_k\}$. We can represent the assignment of a point $x_i$ to a cluster $y_j \in Y$ as a conditional probability $P_t(y_j \mid x_i)$: the probability of $x_i$ belonging to cluster $y_j$ at time $t$. One possible type of change is concept evolution. Concept evolution occurs when an entirely new cluster $y_m$ appears in the stream, $y_m \notin Y$. Another type of change in a data stream can occur in the form of concept drift. This occurs if the characteristics of the data change, i.e., if the underlying process generating $x$ changes. Typically, this kind of drift is referred to as virtual drift, a change in $P_t(x)$. A second type of drift is known as real drift, a change in $P_t(y \mid x)$. For example, at time $t$ point $x_i$ is assigned to cluster $y_j$, but at $t + \delta$, $x_i$ is assigned to cluster $y_m$. This would occur if, for example, clusters $y_j$ and $y_m$ have drifted into different positions in the feature space.

The associate editor coordinating the review of this manuscript and approving it for publication was Zhanyu Ma.
A third type of change which has not received as much attention is a change at the feature level. Change at the feature level can occur in two ways: feature drift and feature evolution. Assume an incoming instance $x$ in $d$ dimensions, $x = \{f_1, \ldots, f_d\}$. Feature drift occurs if the importance, discriminatory power or relevance of a feature $f_i$ changes over the course of a stream. For example, in text mining the relevance of a particular word can change over time. Feature evolution occurs when new features appear in the stream; for example, additional words might appear in a text stream, so that $d$, the dimensionality of $x$, changes.
A lot of attention has been given to clustering streams in the presence of change at the concept level but very little to change at the feature level. Furthermore, a lot of the methods proposed for clustering streams (density-based, graph-based) rely on distance as a similarity metric and this is problematic for high-dimensional data where the curse of dimensionality renders distance measurements and any concept of 'density' difficult. To address these two challenges we propose combining them and framing the problem as a feature selection problem.
Feature selection (FS) aims to identify a subset of the most relevant features $f$ from the set of all features $F$ (in this work, 'feature' and 'dimension' are used interchangeably). Traditionally, $f$ would be used to cluster data and all redundant features ($\{f_i : f_i \in F \text{ and } f_i \notin f\}$) are ignored for future points. This might not be a sensible approach to non-stationary data as $f$ is likely to change over time. A significant change could require previous clusters to be abandoned and new clusters discovered on the latest data. This would be especially true for clustering algorithms that rely on some form of distance as a similarity metric; it might not be possible to cluster two points composed of different feature subsets, e.g., if the number of 'important' features changes ($|f_t| \neq |f_{t+1}|$) or a previously important feature is no longer considered important ($f_i \in f_t$ but $f_i \notin f_{t+1}$). Motivated by these challenges we propose using a dynamic feature mask for clustering high dimensional data streams. A stream is split into windows of size β and unsupervised FS is performed after each window. Redundant features are masked and clustering is performed along unmasked, relevant features. If a feature's perceived importance changes, the mask is updated accordingly: previously unimportant features can be unmasked and features which lose relevance become masked. As new features appear in the stream, the size of the mask is changed. Clustered points contain all features (not just a subset of relevant features) but the clustering process only considers the subset of relevant features.
In summary, we propose a novel Dynamic Feature Mask method for clustering high dimensional data streams and the main contributions of this work are:
• Feature Drift and Feature Evolution can be detected and tracked in a fully unsupervised way and the importance of features can be monitored over time.
• The method is algorithm-independent and can be used with any of the existing density-based stream clustering algorithms which typically do not have a feature drift mechanism and are unable to deal with high dimensional data.
• Applied to an existing stream clustering algorithm, the proposed method can reduce the time requirements and increase accuracy.
We offer an overview of related work in Section II along with a more detailed description of relevant background work in Section III. The proposed method is presented in Section IV. Experimental results are described in Section V. Finally, conclusions are given in Section VI.

II. RELATED WORK
Much research has been carried out on Feature Selection (FS) and good overviews of this research are available in [37] and [2]. The majority of this research has focused on supervised methods, whereby a feature's importance is estimated by its correlation with the class label. Features (or subsets of features) with the greatest discriminatory power between classes are selected. Generally, FS methods can be divided into filter methods and wrapper methods. Filter methods are independent of the model and can be seen as a preprocessing step which ranks features according to some criterion, after which the top n features are selected. Popular methods include the Fisher Score [25], Information Gain [36] and Pearson Coefficient [11]. Wrapper methods use a model or an underlying classifier to iteratively evaluate subsets of features. An example would be the GA-SVM [29], in which a genetic algorithm searches for subsets of features and these potential subsets are evaluated using a traditional support vector machine.
Unsupervised methods, too, can be divided into filter and wrapper methods. Unsupervised wrapper techniques use a clustering algorithm to evaluate feature subsets [35]. This method is usually computationally expensive and succumbs to what Alelyani et al. described as the "Chicken and Egg Dilemma" [2]. When attempting to cluster and select features simultaneously, is it better to first find features and then cluster, or first cluster and then select features?
Unsupervised filter methods are based on the intrinsic properties of the data, for example, the assumption that data from the same class are usually close in the decision space. Based on this assumption, features are selected by their locality preserving power, or Laplacian Score. This idea has been applied for unsupervised FS in [9] and is explained in detail in Section III. Infinite-Feature Selection [44] selects features by exploiting the convergence properties of power series of matrices. A subset of features is analogous to a path between different feature distributions. In [13], the authors propose a filter method which selects features based on their ability to preserve the original structure of the data. Their algorithm, Multi-Cluster Feature Selection (MCFS), measures the correlation between different features using spectral analysis techniques and selects those which can most preserve the structure. This algorithm is explained in greater detail in Section III.
Most of the research into FS has assumed a static batch of data but recently more work has been focusing on FS in streaming data. A comprehensive overview of this recent work is provided in [4]. Again, the majority of this research has been on supervised FS. The work by Katakis et al. [33] was one of the first to address the FS problem in streaming data. Here, the authors address the problem of a large, dynamic feature space. They use the example of a text stream, the feature space being all possible words. As more text arrives, new words (features) appear and the size of the known feature space grows and changes. Cumulative statistics based on the word count in each class of document are recorded.
Using the chi-squared metric, the top n words in each document are selected as inputs for a Naive-Bayes classifier. As a new document arrives, the cumulative statistics are updated. Features can be promoted or demoted from the top n and the classifier is updated with these new features. Heterogeneous Ensemble for Feature Drift (HEFT) [42] uses a Fast Correlation Based Filter [27] as a supervised filter method to select the top features in each windowed chunk of a data stream. A classifier is trained using the top features and added to an ensemble where each classifier is trained on a different feature subset. Carvalho and Cohen [15] used the weights of an online classifier to estimate the importance of each feature. Interestingly, the authors found that using some of the lowest ranked features improved the classification accuracy. The authors report using 90% of the top features and 10% of the bottom features.
Feature Selection based on Symmetric Uncertainty (a concept taken from Information Theory) was introduced in [7] and extended in [5] with Dynamic Symmetrical Uncertainty Selection for Streams (DISCUSS). DISCUSS is classifier independent and acts as a filter method in a sliding window. Features are selected using a merit-guided strategy whereby the perceived merit of a subset of features is a function of how predictive of a class the subset is, and also how much redundancy there is within the subset. This selection method was shown to improve the performance of two different types of classifiers. Adaptive Boosting for FS (ABFS) was introduced by Barddal et al. [6] and uses a combination of boosting [23] and decision stumps (a decision tree whereby the root node is connected to the terminal nodes) to select features. Boosting gives higher weights to training instances which are harder to classify, then decision stumps are used to select features from these difficult-to-classify samples. ABFS is shown to improve classification rates while also reducing computational overheads. Other supervised approaches dynamically select features implicitly [12], [24]. DX-Miner [40] is a streaming classification algorithm that incorporates dynamic FS. The algorithm can use either a supervised or an unsupervised filter method. For the supervised method, the previous three windows are stored and the Information Gain metric is used to select the top features from these recent windows. In the unsupervised case, the authors suggest that the n highest frequency features could be used but this is not discussed any further. This was extended in [46], where the authors used DX-Miner with MCFS as the filter method. An unsupervised FS method for data streams with linear time and space was proposed in [30]. Matrix sketching is used to maintain a low rank approximation of the data. At every time-step t, the top features are selected, though all data until time t is used for selecting the top features. The authors reported that this gave memory problems with comparative algorithms, and in a dynamic stream it is perhaps better to disregard data as the stream progresses and old data is no longer relevant.
Other work in feature processing is concerned with the task of creating new features (as opposed to selecting) from the existing set of features [38], [39], [49]. In [49] a Deep Neural Network is developed to detect spoofing in an automatic Speaker Verification System and the paper reports that selecting features dynamically works more effectively than static selection.
Clustering data streams differs from traditional clustering. There are additional time and memory constraints, usually only a single pass of the data is afforded, and some form of change is expected. Many approaches have been proposed including grid based methods [10], [45] and partitional methods [26], although density based methods appear to be the most common. Density clustering methods [17] identify clusters as areas of high density separated by areas of low density. They have the advantages that the number of clusters does not need to be specified a priori, clusters can have any shape (not just hyper-spherical), and they can have an intrinsic summarisation method: the micro-cluster. Micro-clusters are d-dimensional spheres which summarise a group of local points. The set of connected micro-clusters forms the cluster. CluStream [1] was one of the first to employ micro-clusters to cluster dynamic data streams. A two-phase approach was proposed, whereby data is first summarised online and the summaries are then clustered off-line. This two-phase approach was extended in MR-Stream [47], D-Stream [45], DenStream [14], and others. A good overview on density based stream clustering is provided in [3]. More recent proposals for density clustering include Ant Colony Stream Clustering (ACSC) [19], which uses a decentralised swarm intelligence approach; CEDAS [31] and SNCStream+ [8], which use a graph structure with micro-clusters as nodes; and Multi-Density Stream Clustering (MDSC) [20], which combines both online and off-line phases into a single online phase and can discover clusters with varying levels of density.
In summary, the majority of research on dynamic FS for data streams assumes a supervised setting [15], [33], [40], [42]; it is typically used for classification tasks and is not suitable for clustering. Existing stream-clustering algorithms can deal with change at the concept level (concept drift and concept evolution) [14], [19], [20], [31]. However, these methods suffer from the curse of dimensionality and are not designed to track change at the feature level. The method proposed in this paper aims to address these two challenges: tracking change at the feature level and dynamically clustering in high dimensions.

III. BACKGROUND
The proposed method requires an unsupervised feature selector. We evaluate three existing static methods for maintaining the dynamic feature mask. Each method is described below along with the clustering algorithms we use to evaluate the proposed dynamic feature mask.

A. UNSUPERVISED FEATURE SELECTION

1) VARIANCE
The simplest, yet effective, method of unsupervised feature selection is the maximum-variance method. The variance of a feature is the average squared deviation of the feature's value from the mean. Let $X = \{x_1, \ldots, x_N\}$ represent $N$ instances, where $f_{ri}$ is the value of feature $f_r$ in instance $x_i$ and $\mu_r$ is the mean of $f_r$ over $X$:

$$Var(f_r) = \frac{1}{N} \sum_{i=1}^{N} (f_{ri} - \mu_r)^2 \tag{1}$$

A larger variance suggests the feature has a greater representative power. The intuition here is that if a feature does not vary much (if it has a near-constant value for each different class) it has little predictive power. However, if a feature is sufficiently different for each class it is potentially more useful when discriminating between classes.
The variance for each feature is calculated, the features are ranked in descending order, and the top n features are selected.
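As a minimal sketch of the maximum-variance selector described above, the following Python function ranks the features of one window by population variance and returns the indices of the top n. The function name `top_n_by_variance`, the zero-based feature indices and the toy window are illustrative assumptions, not part of the original description.

```python
from statistics import pvariance


def top_n_by_variance(window, n):
    """Return the (zero-based) indices of the n highest-variance features."""
    d = len(window[0])
    # Population variance of each feature (column) over the window.
    variances = [pvariance([x[i] for x in window]) for i in range(d)]
    # Rank features in descending order of variance and keep the top n.
    ranked = sorted(range(d), key=lambda i: variances[i], reverse=True)
    return sorted(ranked[:n])


# Toy window: feature 0 is almost constant, features 1 and 2 vary more.
window = [
    [0.0, 1.0, 5.0],
    [0.1, 9.0, 5.0],
    [0.0, 4.0, 5.5],
]
selected = top_n_by_variance(window, 2)
```

The selected indices can then be turned into a binary mask over the d features.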

2) LAPLACIAN SCORE
The Laplacian Score [9] aims to preserve the local geometric structure in data. This local structure is modelled in a nearest-neighbour graph and features which respect this graph are selected.
• A nearest-neighbour graph $G$ is created with $N$ nodes and an edge is created between nodes $i$ and $j$ if $x_i$ and $x_j$ are neighbours ($x_i$ is among $x_j$'s $k$ nearest neighbours or vice versa).
• A weight matrix $S$ of $G$ models the local structure. An RBF function with a constant $t \in \mathbb{R}$ is used to weigh the edge between nodes $i$ and $j$:

$$S_{ij} = \begin{cases} e^{-\|x_i - x_j\|^2 / t} & \text{if nodes } i \text{ and } j \text{ are connected} \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

• The importance of a feature is considered to be the degree to which it respects $G$ and the weight matrix $S$. The Laplacian Score $L_r$ for feature $f_r$ is estimated by minimising:

$$L_r = \frac{\sum_{ij} (f_{ri} - f_{rj})^2 S_{ij}}{Var(f_r)} \tag{3}$$

where $f_{ri}$ is the value of feature $f_r$ in instance $x_i$. A good feature will have a larger $S_{ij}$ (thus a smaller $f_{ri} - f_{rj}$) and should have a high variance. So, the Laplacian Score for a good feature should be small. The Laplacian Score for each feature is calculated, the features are ranked in ascending order, and the top n features are selected.
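The steps above can be sketched in plain Python as follows. This is an illustrative sketch rather than the reference implementation of [9]: it assumes the commonly used formulation in which each feature is centred by its degree-weighted mean and the denominator is the degree-weighted variance, and the function name `laplacian_scores`, the defaults `k=2` and `t=1.0`, and the toy data are assumptions made for the example.

```python
import math


def laplacian_scores(X, k=2, t=1.0):
    """Laplacian Score of every feature in X; a smaller score is better."""
    n, d = len(X), len(X[0])
    # Squared Euclidean distance between every pair of instances.
    sq = [[sum((a - b) ** 2 for a, b in zip(X[i], X[j])) for j in range(n)]
          for i in range(n)]
    # k nearest neighbours of each instance (the instance itself excluded).
    knn = [set(sorted((j for j in range(n) if j != i),
                      key=lambda j: sq[i][j])[:k]) for i in range(n)]
    # RBF weight on an edge if i is among j's neighbours or vice versa.
    S = [[math.exp(-sq[i][j] / t) if (j in knn[i] or i in knn[j]) else 0.0
          for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in S]
    total = sum(deg)
    scores = []
    for r in range(d):
        f = [x[r] for x in X]
        # Centre the feature by its degree-weighted mean (assumed formulation).
        mean = sum(fi * di for fi, di in zip(f, deg)) / total
        f = [fi - mean for fi in f]
        # Numerator: sum over all pairs of (f_i - f_j)^2 * S_ij.
        num = sum((f[i] - f[j]) ** 2 * S[i][j]
                  for i in range(n) for j in range(n))
        # Denominator: degree-weighted variance of the feature.
        var = sum(fi * fi * di for fi, di in zip(f, deg))
        scores.append(num / var if var > 0 else float("inf"))
    return scores


# Feature 0 separates two groups of points; feature 1 is unstructured.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1],
     [5.0, 0.8], [5.1, 0.2], [5.2, 0.6]]
scores = laplacian_scores(X)
```

On this toy data the structured feature receives the smaller score, so ranking in ascending order would select it first.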

3) MULTI CLUSTER FEATURE SELECTION (MCFS)
MCFS [13] uses spectral clustering to select the features which have the most structure-preserving power. Spectral clustering is performed using the top eigenvectors of the graph Laplacian. As in the Laplacian Score, a nearest-neighbour graph $G$ and weight matrix $S$ are created. The data manifold in $S$ is "unfolded" to a "flat" embedding of data points and features are selected using the "flat" embedding. From $S$, a diagonal matrix $D$ is created whose values are column sums of $S$: $D_{ii} = \sum_j S_{ij}$. From these matrices, the graph Laplacian $L = D - S$ is created and the "flat" embedding of the data can be found by solving the generalised eigen-problem:

$$L y = \lambda D y \tag{4}$$

The feature scores are evaluated using the resultant eigenvectors $Y = \{Y_1, \ldots, Y_k\}$, where $k$ is the number of clusters in the data. In static batch data, $k$ might be known a priori, or can be tuned to find the best solution. In the streaming case, $k$ cannot be known, so we use the number of clusters discovered in the previous window.
MCFS scores for each feature are sorted in descending order and the top n are selected.

B. STREAM CLUSTERING ALGORITHMS
To evaluate our proposed method, we use four density based stream clustering algorithms: MDSC [20], ACSC [19], CEDAS [31] and DenStream [14]. Clusters are defined as areas of high density separated by areas of low density. Points which are close in the feature space (measured using the Euclidean distance) are summarised in micro-clusters.
A micro-cluster containing $N$ points $\{X_j\}$, $j = 1, \ldots, N$, is described using four components: $N$, the number of points described by the micro-cluster, each of which is a $d$-dimensional vector; $LS$, the linear sum of these points (i.e., $\sum_{j=1}^{N} X_j$); $SS$, the squared sum of these points (i.e., $\sum_{j=1}^{N} X_j^2$); and $t$, the time stamp.
$LS$ and $SS$ are $d$-dimensional vectors. From the first three components, we can obtain the centre $c$ and radius $r$ of the micro-cluster as follows:

$$c = \frac{LS}{N} \tag{5}$$

$$r = \sqrt{\frac{SS}{N} - \left(\frac{LS}{N}\right)^2} \tag{6}$$

The time stamp $t$ records the most recent time a micro-cluster was updated (in CEDAS, the time stamp is referred to as the 'energy' of the micro-cluster). The concept of 'dense' is governed by a parameter $\epsilon$, which is the maximum radius allowed for a micro-cluster. It is sensitive and data-dependent. This is a user parameter in ACSC and CEDAS, but MDSC discovers this parameter adaptively, removing a sensitive, manually tuned parameter. Two micro-clusters $a$ and $b$ are considered density reachable if:

$$dist(a_{cen}, b_{cen}) \le a_r + b_r \tag{7}$$

where $a_{cen}$ and $b_{cen}$ are the centres of micro-clusters $a$ and $b$, and $a_r$ and $b_r$ are their radii. The set of density reachable micro-clusters forms the macro-cluster.
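To make the micro-cluster summary concrete, here is a minimal Python sketch under stated assumptions: the radius is taken as the square root of the summed per-dimension variances derived from N, LS and SS (the usual cluster-feature form), and two micro-clusters are treated as density reachable when the distance between their centres does not exceed the sum of their radii. The class and function names are hypothetical, and the individual algorithms (MDSC, ACSC, CEDAS, DenStream) differ in their details.

```python
import math


class MicroCluster:
    """Summary of a group of points: N, LS, SS and a time stamp t."""

    def __init__(self, d):
        self.n = 0              # number of summarised points
        self.ls = [0.0] * d     # linear sum per dimension
        self.ss = [0.0] * d     # squared sum per dimension
        self.t = 0              # time of the most recent update

    def insert(self, x, t):
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, x)]
        self.ss = [a + b * b for a, b in zip(self.ss, x)]
        self.t = t

    def centre(self):
        return [v / self.n for v in self.ls]

    def radius(self):
        # Assumed form: sqrt of the summed per-dimension variances.
        var = sum(s / self.n - (l / self.n) ** 2
                  for l, s in zip(self.ls, self.ss))
        return math.sqrt(max(var, 0.0))  # guard against rounding below zero


def density_reachable(a, b):
    # Assumed criterion: centre distance no larger than the sum of the radii.
    return math.dist(a.centre(), b.centre()) <= a.radius() + b.radius()


def build(points):
    mc = MicroCluster(len(points[0]))
    for t, p in enumerate(points):
        mc.insert(p, t)
    return mc


a = build([[0.0, 0.0], [2.0, 0.0]])
b = build([[1.5, 0.0], [2.5, 0.0]])
c = build([[10.0, 0.0], [11.0, 0.0]])
```

Because only N, LS and SS are stored, a point can be absorbed in O(d) time without keeping the raw data.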
MDSC consists of two on-line components. Newly arriving points are assigned to a live cluster if there is a suitable cluster; otherwise, they are passed to the buffer.
Points in the buffer could be noise points, the seed of a new cluster, or a signal of drift. New clusters are discovered at intervals (with a local, adaptive $\epsilon$) in the buffer.
CEDAS treats micro-clusters as nodes in a graph. Micro-clusters which are connected by edges form the macro-cluster.
DenStream uses the online/offline model whereby micro-clusters are formed online and these micro-clusters are clustered off-line using DBSCAN [14].

Algorithm 1 (steps 2 to 9)
2: Current Features (CF) ← Select(buffer)
3: Use CF to generate the Current Mask (CM) (Eqn. (8))
4: Use CM to update the Feature Values (FV) (Eqn. (9))
5: Use FV to update the Feature Mask (DFM) (Eqn. (10))
6: Clear buffer
7: Store latest FV off-line
8: Apply DFM to p (Eqn. (13))
9: for each cluster C do

IV. PROPOSED METHOD
In our proposed method, a feature mask is maintained and clustering is performed according to this mask. A stream of instances arrives online. When a point arrives, it is passed to the clustering algorithm and, also, a copy of the point is stored in an offline buffer. When the buffer reaches a pre-defined size β, feature selection is performed on the buffer and the feature mask is updated. This mask is used for the clustering process until the next β points arrive in the stream. We refer to this chunk as the β-window. Below, we first describe the dynamic feature mask (DFM) and the process of updating and maintaining it, and then outline the clustering process using this mask.
Assuming a window of β points in $d$ dimensions, we first perform unsupervised feature selection on this window and extract the top $n$ features. We call this subset of features the Current Features $CF$. Formally: $CF = \{cf_1, \ldots, cf_n\}$, where $\{cf_i \in \mathbb{N}^+ \mid cf_i \le d\}$. This subset of features $CF$ is used to create a binary mask, which is called the Current Mask $CM$.
Here, $CM = \{cm_1, \ldots, cm_d\}$, where:

$$cm_i = \begin{cases} 1 & \text{if } i \in CF \\ 0 & \text{otherwise} \end{cases} \tag{8}$$

Note here that $|CM| = d$ and the $n$ features in $CF$ will be represented as 1 and the others as 0. These two sets ($CF$ and $CM$) are calculated at each β-window and are used to update a persistent vector of the feature values ($FV$). The feature values are the perceived importance or relevance of each feature at any given time.
$FV$ is updated after each window using the values in $CM$ (Eqn. (9)): each entry $fv_i$ is the rolling average of feature $i$'s importance (according to the $CM$ at each window) as the stream progresses. Finally, the DFM is updated based on each feature's importance in $FV$ and a pre-defined threshold λ. $DFM = \{dfm_1, \ldots, dfm_d\}$, where:

$$dfm_i = \begin{cases} 1 & \text{if } fv_i \ge \lambda \\ 0 & \text{otherwise} \end{cases} \tag{10}$$

The λ threshold ($\lambda \in \mathbb{R}$, $0 \le \lambda \le 1$) dictates the length of time a feature is considered relevant if it is no longer selected in the top $n$ features. A high threshold makes it harder for a new feature to be considered and easier for an existing feature to be discarded. A lower threshold maintains a previously important feature's relevance in the DFM even if it is no longer selected.
After each β-window, a snapshot of the feature values is stored offline. This can be used to quickly examine a feature's importance over time.
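The mask-maintenance cycle described above (Current Mask, Feature Values, DFM, snapshot) can be sketched as follows. Two points are assumptions made only for illustration: the rolling average is implemented as an equal-weight blend of the previous value and the current mask entry, and a feature is unmasked when its value is at least λ. The name `update_mask` and the zero-based feature indices are likewise hypothetical.

```python
def update_mask(current_features, fv, lam):
    """One β-window update: returns (CM, new FV, DFM)."""
    d = len(fv)
    selected = set(current_features)
    # Current Mask: 1 for features selected in this window, 0 otherwise.
    cm = [1 if i in selected else 0 for i in range(d)]
    # ASSUMPTION: rolling average as an equal-weight blend of old FV and CM.
    fv = [(v + c) / 2 for v, c in zip(fv, cm)]
    # ASSUMPTION: a feature is unmasked when its value reaches the threshold.
    dfm = [1 if v >= lam else 0 for v in fv]
    return cm, fv, dfm


d, lam = 4, 0.3
fv = [0.0] * d
history = []  # off-line snapshots of FV, one per β-window
# Features selected by some unsupervised selector in three successive windows.
for cf in ([0, 1], [1, 2], [1, 2]):
    cm, fv, dfm = update_mask(cf, fv, lam)
    history.append(list(fv))
```

In this toy run feature 0 is selected once and then decays below λ, so it becomes masked again, while features 1 and 2 stay unmasked; `history` plays the role of the stored snapshots.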
To initialise the process, we read β points into the buffer, create the DFM and then perform clustering using this mask. After initialisation we have a DFM and a set of clusters. Incoming points are clustered using the DFM. In density clustering algorithms, clusters are composed of micro-clusters and an incoming point is assigned to the most appropriate micro-cluster. This is determined by the distance from the point $p$ to a micro-cluster $m$'s centre $c$, provided that this distance is less than $r$, the radius of the micro-cluster. The distance is measured along each feature $f_i$ in $p$ to the centre of $m$. With the DFM, we are only interested in taking the distance along the relevant unmasked features.
Centre $c$ and radius $r$ for a micro-cluster $m$ (Eqn. (5) and Eqn. (6), respectively) require the Linear Sum ($LS$) and Squared Sum ($SS$) of the $N$ points described by $m$. To recap, $m$ describes $N$ points ($X_j$, $j = 1, \ldots, N$), and $X_j$ is composed of $d$ features $X_{ji}$, $i = 1, \ldots, d$, where $j$ is the instance and $i$ the feature. The Linear Sum of feature $i$ is calculated as $\sum_{j=1}^{N} X_{ji}$ and the Squared Sum of the feature is $\sum_{j=1}^{N} X_{ji}^2$. To apply the mask, we multiply each feature by its counterpart in the binary DFM and consider only the non-zero features:

$$LS_i = \sum_{j=1}^{N} X_{ji} \cdot dfm_i \tag{11}$$

$$SS_i = \sum_{j=1}^{N} (X_{ji} \cdot dfm_i)^2 \tag{12}$$

For the incoming point $p$ we do the same:

$$p_i = p_i \cdot dfm_i \tag{13}$$

The process of maintaining the DFM and clustering using it is outlined in Algorithm 1.
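A minimal sketch of clustering along unmasked features only: the distance from an incoming point to a micro-cluster centre is accumulated over the features whose DFM entry is 1, and the point is assigned to the nearest micro-cluster whose radius it falls within. Skipping masked features is equivalent to multiplying both vectors by the binary mask. The helper names and the representation of a micro-cluster as a `(centre, radius)` pair are assumptions for the example.

```python
import math


def masked_distance(p, c, dfm):
    """Euclidean distance between p and c along unmasked features only."""
    return math.sqrt(sum((pi - ci) ** 2
                         for pi, ci, m in zip(p, c, dfm) if m))


def assign(p, micro_clusters, dfm):
    """Index of the closest micro-cluster containing p, or None."""
    best, best_dist = None, float("inf")
    for idx, (centre, radius) in enumerate(micro_clusters):
        dist = masked_distance(p, centre, dfm)
        # The point must fall inside the micro-cluster's radius.
        if dist < radius and dist < best_dist:
            best, best_dist = idx, dist
    return best


# Feature 1 is irrelevant here and dominates the unmasked distance.
micro_clusters = [([1.0, 0.0, 1.2], 0.5), ([5.0, 100.0, 5.0], 0.5)]
p = [1.0, 100.0, 1.0]
with_mask = assign(p, micro_clusters, [1, 0, 1])
without_mask = assign(p, micro_clusters, [1, 1, 1])
```

With the mask the point joins the first micro-cluster; without it, the irrelevant feature pushes the point outside every radius. The masked loop also touches fewer dimensions, which is the source of the time saving reported later.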

V. EXPERIMENTAL STUDY
In this section we present our experimental results using the proposed method. We describe the metrics and evaluate the method on four high-dimensional data streams, exhibiting feature drift, feature evolution, concept drift and concept evolution. We then perform a sensitivity analysis and offer some discussion on the results.

A. PERFORMANCE METRICS
Discovered clusters are evaluated across four metrics: Purity, F-Measure [32], Rand Index [43] and Cluster Mapping Measure [34]. In each of the datasets we use, we know the "correct" solution as each instance is labelled. Accordingly, the clustering performance is measured with respect to this ground truth. With each metric, the ideal clustering solution will have a value close to 1 and a poor solution will have a value close to 0.
Purity measures how homogeneous a cluster is. The F-Measure (sometimes called F-Score or F1-Score) is the harmonic mean of the precision and recall scores. The Rand Index measures the accuracy of the clustering solution. It rewards true positives and true negatives and penalises false positives and false negatives.
In the following, $R$ represents the clustering result returned by the algorithm. $R$ contains $n$ clusters. In every identified cluster $R_i$ ($i = 1, \ldots, n$), $V_i$ represents the most frequently appearing class label in cluster $R_i$, $V_i^{sum}$ is the number of instances of $V_i$ in $R_i$, and $V_i^{total}$ represents the total number of instances of $V_i$ in the current window. From these, we define the following measures for cluster $R_i$:

$$Precision_{R_i} = \frac{V_i^{sum}}{|R_i|}, \qquad Recall_{R_i} = \frac{V_i^{sum}}{V_i^{total}}, \qquad Score_{R_i} = \frac{2 \cdot Precision_{R_i} \cdot Recall_{R_i}}{Precision_{R_i} + Recall_{R_i}}$$

We can now express Purity (P) and F-Measure (F) in terms of the total number of clusters discovered, as follows:

$$P = \frac{1}{n} \sum_{i=1}^{n} Precision_{R_i}, \qquad F = \frac{1}{n} \sum_{i=1}^{n} Score_{R_i}$$

The Rand Index (R) is a measure of agreement between two clustering solutions, the solution identified by the algorithm and the ground truth, and is defined as follows:

$$R = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP, TN, FP, and FN denote the number of true positive, true negative, false positive and false negative decisions, respectively.
Unlike the previous three metrics, the Cluster Mapping Measure (CMM) was developed specifically for evaluating evolving data streams. The metric considers aging points, missed points, misplaced points, and noise. It is based on a mapping component which handles disappearing and emerging clusters. The metric is described in detail in [34].
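The three label-based metrics can be sketched in Python as below. The sketch assumes that purity and F-Measure are averages over the discovered clusters of the per-cluster precision and F-score of each cluster's majority label, and that the Rand Index is computed over all pairs of instances; the function names and the toy labelling are illustrative. CMM is omitted because its definition is given in [34] rather than here.

```python
from collections import Counter
from itertools import combinations


def purity_and_f_measure(clusters, labels):
    """clusters and labels are parallel lists: cluster id and true class."""
    class_totals = Counter(labels)
    members = {}
    for c, y in zip(clusters, labels):
        members.setdefault(c, []).append(y)
    precisions, scores = [], []
    for ys in members.values():
        # Majority label of the cluster and how often it occurs there.
        v, v_sum = Counter(ys).most_common(1)[0]
        precision = v_sum / len(ys)
        recall = v_sum / class_totals[v]
        precisions.append(precision)
        scores.append(2 * precision * recall / (precision + recall))
    n = len(members)
    return sum(precisions) / n, sum(scores) / n


def rand_index(clusters, labels):
    """Fraction of instance pairs on which clustering and labels agree."""
    agree = total = 0
    for i, j in combinations(range(len(labels)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = labels[i] == labels[j]
        agree += same_cluster == same_class  # counts TP and TN pairs
        total += 1
    return agree / total


clusters = [0, 0, 0, 1, 1, 1]
labels = ["a", "a", "b", "b", "b", "b"]
purity, f_measure = purity_and_f_measure(clusters, labels)
rand = rand_index(clusters, labels)
```

In this toy case one instance of class "b" sits in the cluster dominated by "a", which lowers all three scores below 1.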

B. DATASETS
Here, we describe the four datasets used to evaluate our proposed method: two image streams and two text streams. An overview is presented in Table 1.
We first take the popular MNIST benchmark dataset and convert it to a stream in order to simulate concept and feature drift. MNIST consists of 26,000 grey scale, handwritten digits. To convert to a stream, we take five classes from the original dataset (digits 0-4) and introduce them to the stream in a sequential order. The first 4,000 instances contain images of digits 0 and 1 (shuffled), the following 4,000 points contain images of digits 0, 1 and 2, and so on. The makeup of the stream is presented in Table 2. The features in this stream are pixels and the discriminatory power of a pixel will change over the course of the stream.
For example, the subset of pixels which can best describe digits 0 and 1 might not be useful to discriminate between the digits which appear later in the stream.
The second image stream is the Columbia Object Image Library (COIL-20) dataset [41], which consists of 1,440 normalised grey scale images. Images of 20 household objects are taken at different angles. We convert it to a stream by reading the data in order. The different image classes arrive in sequence (class 1, then class 2, etc.). These different images simulate concept drift; for example, the stream might contain an image of a toy race-car, then an image of a tea-cup. The subset of features (in this case, pixels) which are useful to describe the race-car is perhaps not the best subset of pixels to describe the tea-cup. In this way feature drift is simulated.
We further evaluate the proposed method on two benchmark text streams: 20Newsgroups and the Topic Detection and Tracking Corpus (TDT-2). 20Newsgroups is a collection of 14,000 documents separated into 7 topics and further divided into 20 different sub-topics. Some of these sub-topics are very closely related (for example, PC Hardware and Mac Hardware), so in our evaluation we take the root of the topic as the ground truth. This gives 7 topics: 'Alternative', 'Computers', 'Miscellaneous', 'Recreation', 'Science', 'Society', and 'Talk'.
We split the datasets into chunks of 1,000 and shuffle each chunk in order to remove any bias (for example, a window containing only documents belonging to a single topic). We shuffle chunk-by-chunk in order to maintain the progression of topics in the stream and each chunk contains between 2 and 5 topics. As a pre-processing step we remove stop words ('a', 'the', 'and', etc.) from the dataset, giving a feature space of 60,881 words. We refer to this data stream as Newsgroup (as opposed to 20Newsgroups). As old topics disappear from the stream and new topics are introduced, concept drift is simulated. The features (in this case, words) which are useful to describe one topic might not be useful to describe another. For example, 'RAM', 'Keyboard', 'JPEG' might be useful features to describe the concept 'Computers' but useless to describe 'Society'. In this way feature drift is simulated.
TDT-2 [21] consists of data taken from 6 sources: 2 newswires, 2 radio, and 2 television programmes. We use TDT-2 which consists of 2 months of reports and is often used as the training set in text-classification tasks. It consists of 9,494 documents divided into 30 topics. Again, we remove stop words, divide the dataset into chunks of 1,000 and shuffle each chunk to remove any bias, and simulate a stream by reading the data in sequential order.
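The chunk-by-chunk shuffling used for the text streams can be sketched as a small generator: the data are cut into consecutive chunks, each chunk is shuffled internally, and the chunks are emitted in their original order so that the progression of topics is preserved. The function name, the fixed seed and the toy input are assumptions made for the example.

```python
import random


def chunk_shuffled_stream(instances, chunk_size=1000, seed=0):
    """Yield instances chunk by chunk, shuffling only within each chunk."""
    rng = random.Random(seed)  # fixed seed so the stream is reproducible
    for start in range(0, len(instances), chunk_size):
        chunk = list(instances[start:start + chunk_size])
        rng.shuffle(chunk)
        yield from chunk


# Toy example: ten ordered items, chunks of four.
stream = list(chunk_shuffled_stream(list(range(10)), chunk_size=4))
```

Every item stays within its original chunk, so a topic that first appears in a given chunk cannot leak into an earlier one.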

C. EVALUATION
In this section, we evaluate clustering performance using the proposed DFM on 4 high-dimensional data streams. On each stream:
• We evaluate three different selection methods for creating and maintaining the DFM;
• We evaluate the performance of a clustering algorithm with the DFM, without a mask, and with a static mask. The static mask performs feature selection on the first window and is never updated as the stream progresses.
On the COIL stream, we use a β-window of 100 points and evaluate three unsupervised methods to maintain the mask. The first window contains two classes. We evaluate feature subsets of different sizes (the top 250 features, the top 150, and the top 100) and we try to recreate the original images using the selected features. This is illustrated in Table 3. LS and Var appear to select similar features. The clustering performance using a DFM with different selectors and feature sizes is presented in Table 4. MCFS with 250 features creates the best DFM across the three metrics.
The comparative performance of the DFM with a static mask and no mask is presented in Table 5. Clustering using the DFM returns a better performance than clustering without a mask.Clustering using a static mask returns the worst performance out of the three.
The MNIST stream has a larger number of samples but fewer dimensions. We use a β-window of 1,000 points and evaluate three unsupervised methods to maintain the mask.
The first window contains two classes: digits 0 and 1. We evaluate feature subsets of different sizes (the top 100 features, the top 50, and the top 25) and the features selected by each FS method are displayed in Table 6. As in COIL, LS and Var appear to select similar features. The average performance using each FS method (with different feature subset sizes) over the entire stream is presented in Table 7. Again, MCFS creates a better mask than LS and Var.

TABLE 6. Features selected on MNIST.

TABLE 7. Performance of different selection methods on MNIST using Purity (P), F-Score (F), Rand-Index (R), and Cluster mapping measure (C).
We also report the time each algorithm requires in Table 8. LS and Var appear to select similar features but LS takes substantially longer. The clustering performance using a DFM with different selectors and feature sizes is presented in Table 7. MCFS with 100 features creates the best DFM across the three metrics. The comparative improvement (in three metrics and required time) in the underlying clustering algorithm (MDSC) is illustrated in Fig. 1.
The DFM improves clustering on all three metrics and also requires less time. This is because fewer pairwise distance calculations are required. Without a mask, measurements are taken along each of the dimensions but with a mask only the important features are considered. This time measurement includes the time it takes to perform feature selection. The static mask is fastest; it requires fewer pairwise calculations and does not perform feature selection after the first window. Although it is faster, the performance suffers and it is better to use no mask at all rather than a static mask.
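To make the saving concrete, the sketch below computes a Euclidean distance along only the unmasked dimensions. It is an illustration under assumptions: the function name `masked_distance`, the choice of Euclidean distance, and the toy vectors are not taken from the paper, and the underlying clustering algorithms may use a different measure.

```python
import math

def masked_distance(x, y, mask):
    """Euclidean distance computed only along the feature indices in `mask`."""
    return math.sqrt(sum((x[i] - y[i]) ** 2 for i in mask))

x = [0.0, 3.0, 9.0, 0.0]
y = [0.0, 0.0, 1.0, 4.0]

# Without a mask every dimension contributes one squared difference.
full = masked_distance(x, y, range(len(x)))
# With a mask only the selected features (here indices 1 and 3) are visited.
masked = masked_distance(x, y, [1, 3])
```

The cost of each distance call grows with the number of indices visited, so a mask of 100 features on a stream with roughly 1,000 dimensions visits about a tenth of the terms per pair.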
Over the entire Newsgroup stream, the Maximum Variance selection method creates the best mask, as can be seen in Table 9. The first window contains two topics: 'Alternative' and 'Computers'. The top 5 features selected by Maximum Variance are: {jpeg, image, graphics, Jesus, God}. Using the Feature Values vector, which is updated after each window, we can track the importance of a word as it changes over time. As an illustrative example, we take two words selected in the first window, 'jpeg' and 'space'. Their perceived importance over the course of the stream is displayed in Fig. 2. 'jpeg' is considered important for the first five windows but begins to lose importance as the 'computers' topic disappears from the stream. It is never selected again and by the end of the stream its perceived importance is zero. 'Space' is also selected in the context of computing and its importance drops as the 'computing' topic disappears from the stream. However, the word becomes relevant again later in the stream, in a different context; 'space' is once again selected when the 'Science' topic is present in the stream. 'Space' is selected alongside features such as 'satellite', 'NASA', and so on.
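As a rough illustration of variance-based selection on a single window, the sketch below ranks features by their variance and keeps the top n. It assumes that Maximum Variance amounts to ranking features by per-window population variance; the function name `top_variance_features` and the toy window are hypothetical and not taken from the paper.

```python
from statistics import pvariance

def top_variance_features(window, n):
    """Return the indices of the n features with the largest variance.

    `window` is a list of equal-length feature vectors, for example
    word counts for the documents in one window.
    """
    d = len(window[0])
    variances = [pvariance([row[j] for row in window]) for j in range(d)]
    ranked = sorted(range(d), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:n])

# Toy window: features 0 and 2 are constant, features 1 and 3 vary.
window = [
    [0, 5, 1, 2],
    [0, 0, 1, 2],
    [0, 9, 1, 4],
    [0, 2, 1, 0],
]
selected = top_variance_features(window, n=2)
```

Constant features have zero variance and are therefore never ranked above a feature that varies within the window.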
The performance of the clustering algorithm (with the DFM) over the course of the stream is presented in Table 10. Without a mask, no clustering solution is found. This is likely because of the high dimensionality (>60,000). However, with a static mask of 150 features, a solution is returned. Clustering performance is further improved using a dynamic mask.
On the TDT-2 stream, the Maximum Variance selection method provides the best DFM. The comparative performance with the other two selection methods is displayed in Table 11, and the clustering improvement in Table 12.
The results across each data stream are summarised in Table 13. On each of the four data streams, on all three metrics, the MDSC algorithm is improved using the proposed DFM.
The text-streams have much higher dimensionality than the grey-scale images. So, we take larger feature subsets (up to 500 features) and we use a window-size of 1,000.
In all of the previous experiments we used MDSC to test the proposed DFM. We also evaluate on three other density-based clustering algorithms: Ant Colony Stream Clustering (ACSC) [19], CEDAS [31], and DenStream [14]. On each dataset, we use the best selection method discovered in previous experiments: MCFS with 100 and 250 features for MNIST and COIL, respectively, and Maximum Variance with 150 and 250 features for Newsgroup and TDT-2, respectively. The comparative results are displayed in Table 14 using ACSC, Table 15 using CEDAS, and Table 16 for DenStream. On every stream, each of the underlying clustering algorithms is improved by the proposed DFM.

D. SENSITIVITY ANALYSIS
In this section, we examine the sensitivity and effect of the two parameters required to create and maintain the DFM: the threshold value λ and the window size β. We experiment on the MNIST data stream described in Section V-B using MCFS with 100 features as the selector. For illustrative clarity, we use a metric 'Score'. Score is the average of purity, Rand Index and F-Score.
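The 'Score' metric can be written out directly. The sketch below assumes an unweighted mean and uses the common majority-label definition of purity; the Rand Index and F-Score inputs in the usage lines are placeholder numbers chosen for illustration, not results from the paper.

```python
from collections import Counter

def purity(true_labels, cluster_labels):
    """Fraction of points that carry the majority true label of their cluster."""
    clusters = {}
    for t, c in zip(true_labels, cluster_labels):
        clusters.setdefault(c, Counter())[t] += 1
    majority = sum(max(counts.values()) for counts in clusters.values())
    return majority / len(true_labels)

def score(purity_value, rand_index, f_score):
    """Unweighted mean of the three metrics."""
    return (purity_value + rand_index + f_score) / 3

truth = ["a", "a", "a", "b", "b", "b"]
found = [0, 0, 1, 1, 1, 1]
p = purity(truth, found)   # cluster 0 is pure, cluster 1 has one stray point
s = score(p, 0.6, 0.9)     # 0.6 and 0.9 are placeholders, not measured values
```

Averaging gives a single curve per parameter setting, at the cost of hiding which of the three metrics is responsible for a change.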
λ determines the length of time a feature remains relevant if it is no longer selected in the top n features. It is a threshold for the Feature Values and determines which features are considered in the clustering process. We experiment with values in the range 0.1 to 1.0. The results are displayed in Fig. 3.
Clustering performance is stable with a slight drop after a value of 0.5. If the threshold is too high (1.0 in this example), the performance suffers dramatically.
If the threshold is too high, no features are considered so the clustering process does not happen. In this case, clustering would only occur on features which have been selected in every window. This is perhaps unlikely in a dynamic stream. This is illustrated in Fig. 3. Here, we display the number of features that are considered in the clustering process. We are selecting the top 100 features in each window and with a low threshold, previously important features remain relevant for a long time even if they are no longer being selected. This can be seen with a λ value of 0.01: approximately 300 features are considered as 'important' at each time-step. With a high threshold of 1.0, no features are considered important by the end of the stream. Using a value of 0.5, the number of selected features remains at roughly 100. For each experiment described in this paper, we use a λ value of 0.5.
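The effect of λ on the number of unmasked features can be reproduced with a toy Feature Values vector. The update rule below (an equal-weight blend of the previous value and a 0/1 'selected' flag, controlled by a hypothetical `alpha`) is an assumption made purely for illustration; the paper's own update equations are not restated in this section and may differ.

```python
def update_feature_values(values, selected, alpha=0.5):
    """Blend each feature's previous value with a 0/1 'selected' flag.

    The rule and `alpha` are assumptions for this sketch only.
    """
    return [(1 - alpha) * v + alpha * (1.0 if j in selected else 0.0)
            for j, v in enumerate(values)]

def mask_from_values(values, lam):
    """Indices of features whose value is at least the threshold lam."""
    return [j for j, v in enumerate(values) if v >= lam]

# Four features; the first window selects features 0 and 1.
values = [1.0, 1.0, 0.0, 0.0]
# In the next two windows feature 0 stays selected, feature 1 is dropped,
# and feature 2 becomes relevant.
for selected in ({0, 2}, {0, 2}):
    values = update_feature_values(values, selected)

low = mask_from_values(values, 0.1)    # stale feature 1 is still unmasked
mid = mask_from_values(values, 0.5)    # roughly the currently selected set
high = mask_from_values(values, 1.0)   # only features selected in every window
```

Under this assumed rule a low threshold keeps previously selected features unmasked for several windows, while a threshold of 1.0 keeps only features that were selected in every window, which mirrors the behaviour described above.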
The parameter β determines the number of points which should be collected in the buffer before feature selection is performed and the DFM is updated. We first measure the time it takes to perform FS on different β-windows. We examine window-sizes from 500 to 10,000 and measure using seconds. The results are displayed in Fig. 4. It can be seen that the relationship between time and β is not quite linear and it is more efficient to use smaller values for β. This is confirmed when we measure the clustering performance using the different window sizes. The score decreases as β increases. We used a value of 1,000 for β in all experiments described except for COIL-20, which is comparatively small, so we used a value of 100.
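The role of β as a buffer size can be sketched as a simple processing loop. This is a simplified illustration, not the paper's algorithm: the names `run_stream`, `select_features`, and `cluster_point` are hypothetical, no mask is applied until the first window fills, and the mask is replaced outright after each window rather than maintained through thresholded Feature Values.

```python
def run_stream(points, beta, select_features, cluster_point):
    """Cluster each point under the current mask; refresh the mask every beta points."""
    mask = None    # assumption: no mask until the first window is full
    buffer = []
    for p in points:
        visible = p if mask is None else [p[j] for j in mask]
        cluster_point(visible)
        buffer.append(list(p))    # keep an unmasked copy for the selector
        if len(buffer) == beta:
            mask = select_features(buffer)
            buffer = []           # start a fresh, non-overlapping buffer
    return mask

def nonconstant_features(window):
    """Toy selector: keep features that take more than one value in the window."""
    d = len(window[0])
    return [j for j in range(d) if len({row[j] for row in window}) > 1]

seen = []  # records how many dimensions the clusterer sees for each point
points = [[0, 1, 5], [0, 2, 5], [0, 3, 5], [0, 4, 5], [0, 5, 5]]
final_mask = run_stream(points, beta=2,
                        select_features=nonconstant_features,
                        cluster_point=lambda v: seen.append(len(v)))
```

A smaller β means the selector runs more often on less data, which is the trade-off examined in Fig. 4.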

E. DISCUSSION
In each of the experiments, the proposed DFM method improves the performance of an underlying clustering algorithm. This is true for each of the evaluated clustering algorithms. Of the three feature selectors evaluated to create the mask, MCFS and Maximum Variance outperform the Laplacian Score. On the image streams with a lower dimensionality, the MCFS method creates the best mask. On the text-streams with higher dimensionality, Maximum Variance creates the best mask.
On the text-streams with high dimensionality (up to 60,000 features), the underlying clustering algorithms were unable to return a solution without a mask. With a static mask (a mask with features selected from the first window and never updated as the stream progresses), the performance is improved and a solution is returned.
However, a dynamic mask further improves this performance. A dynamic mask also allows the importance of a feature to be observed and tracked over time. This was illustrated in the Newsgroup data stream (Fig. 2); two features were selected and their perceived importance over the course of the stream was tracked, revealing feature drift. On the image-streams with a lower dimensionality, a clustering solution can be found without a mask but performance is improved with a dynamic mask. Not only is performance improved but less processing time is required. Fewer features are considered in the clustering process, therefore fewer pairwise calculations are required. On the image-streams (≈ 1,000 dimensions), a static mask actually deteriorates the clustering performance. This suggests that, in the presence of feature drift and concept evolution, it is preferable not to perform feature selection at all, rather than traditional static selection methods. In the presence of feature drift, as features become redundant and new features become relevant, the static mask is never updated and clustering is performed along irrelevant features and omits newly important features.
Despite never selecting features which create the best mask, the Laplacian Score method requires the most time. On the higher dimensional text-streams, the Maximum Variance method selects the best features and also requires the least amount of time, as demonstrated in Table 8. This method requires O(N + d) time, where N is the number of instances with d dimensions. MCFS takes longer (it requires O(N² + d³) [13]) and was found to be better suited to the (comparatively) lower dimensional streams.

VI. CONCLUSIONS
This paper presents a Dynamic Feature Mask (DFM) for unsupervised dynamic feature selection in non-stationary data streams. Redundant features are masked and clustering is performed along unmasked, relevant features. If a feature's perceived importance changes, the mask is updated accordingly: previously unimportant features can be unmasked and features which lose relevance become masked. The method is proposed to address two challenges in data stream clustering: 1) feature drift, a change at the feature level in a stream, and 2) the problem of clustering high-dimensional streams where the curse of dimensionality renders distance measurements and the concept of 'density' difficult.
The proposed method is algorithm-independent and can be used with any existing density based clustering algorithm.
There are many density-based clustering algorithms in the literature and they typically do not have a mechanism to deal with feature drift or with very high-dimensionality.
We evaluated the proposed method on four density-based clustering algorithms (MDSC, CEDAS, ACSC, and DenStream) across four high-dimensional streams: two text streams and two image streams. In each case, the proposed DFM improves clustering performance and, furthermore, reduces the processing time required by the underlying algorithm.
An unsupervised feature selection method is required to create and maintain the DFM and we evaluate three existing methods: Laplacian Score, Multi-Cluster Feature Selection, and Maximum Variance. Experimental results suggest that on the lower dimensional (≈ 1,000 dimensions) streams, MCFS is the best selector for the mask. On the higher dimensional text streams (up to 60,000 dimensions), the Maximum Variance method selects the best features to maintain the mask. The Laplacian Score did not return the best features on any stream and was shown to require considerably more time than the other two methods.
On each stream, we compare the DFM with a static feature mask. In the static case, the mask is created on one window at the beginning of the stream and is never updated. The dynamic mask performs better on each stream. On the higher dimensional streams, the static mask is preferable to no mask (without a mask the clustering algorithms could not return a solution at all) but on the lower dimensional streams it is preferable to use no mask rather than a static mask.
Future work will investigate the suitability of the proposed method for density-based classification methods in high-dimensional data streams with feature drift.

10: Apply DFM to C (Eqns. 11 & 12)
11: Clust(p)
12: Add copy of p to buffer
13: counter++
14: Read next point

ACSC uses the tumbling window model and an ant-inspired swarm intelligence method for clustering. Windows are non-overlapping chunks of the stream and clusters are incrementally formed over a single pass of each window. In the ant metaphor, micro-clusters are 'ants' and ants form 'nests' with similar ants. The resultant nests are returned as the clustering solution.

FIGURE 1. Comparative improvement in clustering performance on the MNIST stream.

FIGURE 2. Tracking feature-drift on two words in the NewsGroup stream.

FIGURE 3. Sensitivity of λ with respect to clustering performance (left) and the effect of the parameter on the number of features considered in the clustering process (right).

FIGURE 4. Time required to perform FS on different values for β (left) and sensitivity of β with respect to clustering performance (right).

TABLE 1. Description of datasets used in experiments.

TABLE 8. Average time required (secs.) for feature selection on different window sizes.

TABLE 9. Selection methods for creating the DFM on NewsGroup stream.

TABLE 10. Average clustering performance over NewsGroup stream.

TABLE 13. Performance of dynamic mask with MDSC.

TABLE 14. Performance of dynamic mask with ACSC.

TABLE 15. Performance of dynamic mask with CEDAS.

TABLE 16. Performance of dynamic mask with DenStream.