Dynamic Sparse Subspace Clustering for Evolving High-Dimensional Data Streams

In an era of ubiquitous large-scale evolving data streams, data stream clustering (DSC) has received lots of attention because the scale of the data streams far exceeds the ability of expert human analysts. It has been observed that high-dimensional data are usually distributed in a union of low-dimensional subspaces. In this article, we propose a novel sparse representation-based DSC algorithm, called evolutionary dynamic sparse subspace clustering (EDSSC). It can cope with the time-varying nature of subspaces underlying the evolving data streams, such as subspace emergence, disappearance, and recurrence. The proposed EDSSC consists of two phases: 1) static learning and 2) online clustering. During the first phase, a data structure for storing the statistic summary of data streams, called EDSSC summary, is proposed which can better address the dilemma between the two conflicting goals: 1) saving more points for accuracy of subspace clustering (SC) and 2) discarding more points for the efficiency of DSC. By further proposing an algorithm to estimate the subspace number, the proposed EDSSC does not need to know the number of subspaces. In the second phase, a more suitable index, called the average sparsity concentration index (ASCI), is proposed, which dramatically promotes the clustering accuracy compared to the conventionally utilized SCI index. In addition, the subspace evolution detection model based on the Page-Hinkley test is proposed where the appearing, disappearing, and recurring subspaces can be detected and adapted. Extinct experiments on real-world data streams show that the EDSSC outperforms the state-of-the-art online SC approaches.


I. INTRODUCTION
H IGH-DIMENSIONAL data streams are generated at an unprecedented scale in various realms, such as media, communication, finance, meteorology, etc., [1]- [4]. These data streams are often high dimensional, unlabeled, large scale, and evolving, which present huge challenges for data stream clustering (DSC). Most existing DSC algorithms, including the classic ones, such as CluStream [5] and DenStream [6], or even the more recent ones, such as STRAP [7], EDMStream [8], and CEDAS [9], are inadequate to address these challenges. In addition, most existing DSC algorithms are based on nonevolutionary models (e.g., CluStream) or simple evolutionary models (e.g., DenStream, STRAP, and CEDAS) which cannot adapt to the complicated dynamics of clusters' structures in the real world [8], [10]. Therefore, there is a pressing need to study more effective DSC algorithms to process evolving data streams that are high dimensional and large scale.
Motivations: It has been realized that many real-world high-dimensional data, such as motion [11], face [12], and texture [13], actually lie in a union of low-dimensional subspaces [1], [2], [14]- [18]. Subspace clustering (SC) refers to the problem of simultaneously clustering the data into multiple subspaces and finding a low-dimensional subspace to fit each group of points [19]- [22]. Currently, representation-based SC (RBSC) approaches have been dominating the field and represent the state of the art. They are based on the hypothesis that each data point in a union of subspaces can be represented as a linear combination of other points, that is, the so-called selfexpressiveness property. Popular RBSC approaches include sparse SC (SSC) [1], low-rank representation (LRR) [16], and their variants.
RBSC approaches have been extensively studied in the static data clustering field. However, existing RBSC approaches cannot be directly applied to the DSC problem due to the following reasons.
1) Most RBSC approaches, such as SSC and LRR, are batch processing based [as shown in Fig. 1(a)] and, thus, cannot deal with evolving data streams. As contrasted in Fig. 1, for data streams, it is obviously unwise and even impracticable to collect all the data points and then process them due to the limitations in computing and storage resources. In addition, batch processing manners cannot achieve real-time processing and cannot meet the basic requirements of online learning scenarios. changes over time underlying the data streams [as shown in Fig. 1(b)] that cannot be revealed.
3) The difficulty is balancing two conflicting goals: a) saving more points for pursuing good clustering performance of SC or b) discarding more points for pursuing efficiency of DSC. On the one hand, the selfexpressiveness property of data requires saving lots of data points to obtain more robust results, leading to an increase in computational complexity and storage assumption of most existing SC approaches [23]. On the other hand, the DSC algorithms usually discard many points to fulfill the restrictions on computing and storage usage. Recently, research on online SC algorithms using the selfexpressiveness property has grown in popularity with some representative algorithms, such as SSSC [23], SLRR [23], online LRR [24], OLRSC [25], and SLSR [23], proposed. Most of these approaches achieve online SC based on twophase frameworks, that is, the static phase for global subspace learning and the online phase for subspace classifying (see Fig. 2). However, we think these approaches are too infant to deal with DSC problems from the following aspects: first, they are based on the assumption that subspaces as well as the subspace structure remain stationary and can be all learned in the static phase, without taking the evolving subspace structure into consideration (which is impractical in the DSC field). Second, they lack proper data structures for storing statistic summaries of data streams, which is one of the basic requirements for DSC goals. For example, SSSC, SLSR, and SLRR tend only to save the data points used in the static phase and discard all the data points in the second phase. Online LRR saves all the newly added data to refine the subspace structure learned in the first phase. Clearly, it is not reasonable to simply reserve or discard all the data points to solve DSC tasks. Contributions: In this article, a novel online SC method, called evolutionary dynamic SSC (EDSSC), is proposed for evolving high-dimensional DSC tasks. Like the recent approaches SSSC [23], SLSR [23], SLRR [23], and online LRR [24], our proposed EDSSC consists of two phases: 1) static learning and 2) online clustering, but fundamental differences exist: EDSSC does not assume that the global subspace structure must be covered by the data points in the static learning phase and must remain unchanged in the online clustering phase. Instead, EDSSC can detect and adapt to the evolving subspace structure based on the proposed subspace evolution detection strategy. In the online clustering phase, EDSSC identifies the representative points of the newly appearing subspaces and then updates the models with these new points, instead of keeping all the points (such as online LRR [24]) or discarding all the points (such as SSSC [23]). The statistic information of the data streams with identified representative points are stored in a data structure that is referred to as EDSSC summary. The main contributions of this article are summarized as follows.
1) We are among the first to formulate and study online SC on evolving data streams, that is, the task of performing SC on data streams whose points lie in a union of evolving subspaces. We provide a mathematical formulation of online SC and propose a novel online SC method EDSSC by making use of the selfexpressiveness property of data. 2) We propose a new index, called the average sparsity concentration index (ASCI), to help EDSSC to identify if the newly arriving point is a normal point or an outlier. This issue has not acquired sufficient attention in state-of-the-art online RBSC approaches. Compared to the widely adopted index, that is, the sparsity concentration index (SCI) [15], we contend ASCI is better than SCI, especially when the number of data points among subspaces is imbalanced (see Example 2). 3) We propose a subspace structure evolution detection model in EDSSC based on the Page-Hinkley (PH) test [7], [26] and three typical subspace evolutions, such as subspace emergence, disappearance, and recurrence, can be detected. In addition, unlike the state-of-the-art online RBSC algorithms, EDSSC does not require the number of subspaces to be known as prior information. Instead, a method to estimate the number of subspaces is proposed to ensure that EDSSC is more practical. 4) We perform extensive experiments on evolving data streams derived from real-world public datasets, including facial images (AR, ExYaleB, and MPIE) and handwritten digits or characters (USPS, PenDigits, EMNISTletter, and MNIST), showing that EDSSC achieves significant improvement in clustering quality [measured by accuracy and normalized mutual information (NMI)] over the existing representative DSC (CEDAS [9] and STRAP [7]) or online RBSC methods (SSSC [23], SLRR [23], SLSR [23], and OLRSC [25]). For example, on the ExYaleB data stream, EDSSC achieves 75.01% accuracy and 86.47% NMI compared with 57.14% accuracy and 74.43% NMI of OLRSC which has the best performance among all baseline algorithms. This article is a substantial extension of our conference paper [27]. The proposed D-SSC model in [27] gets further improved in this article from the following aspects. First, we propose a new subspace evolution detection model based on the PH test which is different from the one adopted by D-SSC [27]. In comparison, the proposed subspace evolution detection model relies on fewer parameters, making EDSSC more adaptable to evolving data streams. Second, we solve the estimation of the number of problem subspaces that is not fully resolved in [27]. Third, as concluded above, a more proper index, that is, ASCI is proposed in this article to ensure EDSSC is more robust than D-SSC in processing the imbalanced data streams. In addition, we further analyze EDSSC in space and time complexity and parameter sensitivity. Finally, more extensive experiments are performed in this article where more real-world datasets and more state-of-the-art approaches are tested.
Organizations: The remainder of this article is organized as follows. Section II provides a review of the state of the art. Section III formulates the evolving high-dimensional DSC problem. Section IV presents the principles and methodology behind EDSSC and describes the corresponding pseudocode as well as the complexity analysis of the EDSSC. In Section V, we verify and analyze the performance of the proposed algorithm by comparing with SSSC [23], SLRR [23], SLSR [23], OLRSC [25], CEDAS [9], and STRAP [7]. Finally, the study is concluded in Section VI.

II. RELATED WORK
Our approach draws on methods from DSC and SC. In this section, we will give a brief overview of the related work.

A. High-Dimensional Data Stream Clustering
Although a plethora of DSC algorithms has been proposed, the clustering of high-dimensional data streams is still in an immature stage. One of the main reasons is that in the high-dimensional space, all pairs of points tend to be almost equidistant from each other due to an effect, called "curse of dimensionality" [28]. Yet only a few approaches have been proposed to tackle the high-dimensional DSC problem. They can be divided into two categories: 1) full-space-based ones and 2) subspace-based ones.
Full-Space-Based DSC: Most partition-based methods cannot keep effective performance when dealing with highdimensional data streams. However, STRAP [7], one of the latest methods of partition-based DSC algorithms, has been successfully employed to process the KDD'99 dataset (34D). Compared to partition-based DSC methods, density-based [6], [9] and synchronization-based ones [29] are more suitable for processing high-dimensional data streams because, in theory, they can find arbitrary-shaped clusters that exist in the entire feature space without requiring the number of clusters. For example, DenStream [6], CEDAS (density based) [9], and SyncTree (synchronization based) [29] are recently proposed and hold state-of-the-art performance among full-space DSC algorithms. They achieve high-performance DSC by combining the static clustering theories (i.e., density-based clustering or synchronization-based clustering [30]- [34]) with dynamic frames (a graph structure or a hierarchical tree). It should be noted that one of the major limitations of the full-space DSC algorithms is that they can only detect clusters existing in the full space. However, clusters are always being hidden since many irrelevant dimensions exist in the high-dimensional space. Therefore, the clustering accuracy of these algorithms cannot be guaranteed for high-dimensional data streams.
Subspace-Based DSC: As discussed above, the cluster structures of high-dimensional data are quite challenging to be directly discovered due to the curse of dimensionality. Luckily, it has been found that some of the high-dimensional data points may be possibly grouped together in certain subspaces.
Recently, a few DSC algorithms [28], [35]- [39] are proposed to detect the cluster structures in low-dimensional subspaces. HPStream [28] first introduces the projected clustering method to process high-dimensional data streams. However, it needs a predefined number of clusters that is impractical in real scenarios. HDDStream [35] introduces the density-based projected clustering idea to the DSC field. PreDeConStream [36] extends a static projected-based clustering method, called PreDeCon [40], toward the data stream processing field. SubCluTree [38] uses an adaptive grid strategy to perform a bottom-up detection of the subspace candidates. In addition, it can adapt to the varying speeds of the streaming data. Generally, the subspace-based DSC algorithms cost considerable time for searching the preferred dimension in all possible subspaces of the full space, which cannot meet the low computational complexity requirement of data stream processing [36].

B. Subspace Clustering
The goal of SC is to reveal the data structures by detecting the clusters in the low-dimensional subspaces of the original feature space. Actually, SC is quite a general concept with tremendous algorithms proposed in different directions. For example, the axis-parallel SC methods, for example, [41]- [43], aim to find the clusters only in the axis-parallel subspaces. The arbitrarily oriented SC algorithms, such as 4C [44], CASH [45], and ORSC [46], make it possible to find the clusters in arbitrarily oriented subspaces. Particularly, the ORSC achieves state-of-the-art performance in accuracy and scalability based on the synchronization clustering theory [29]- [33]. Furthermore, the disjoint SC focuses on segmenting the entire space into a union of subspaces that are disjoint or even independent. Among these research directions of SC, the disjoint SC methods are closely related to our work in this article.
Meanwhile, the disjoint SC methods are accepting worldwide attention due to their wide application in machine-learning fields. Therefore, in the following, SC refers to disjoint SC methods unless otherwise stated. Currently, the RBSC methods utilizing the self-expressiveness property of data have become the state-of-the-art methods in the field of SC. The main idea of RBSC methods is to first learn an affinity graph for the data points (subspace recovery) and then apply spectral clustering to the graph (subspace segmentation) [14]. The RBSC algorithms can be divided into two categories: 1) static ones and 2) online ones.
Definition 1 (Self-Expressiveness Property): For a collection of N high-dimensional data points X = [x 1 . . . x N ] d×N ∈ R d×N , with each point from a union of independent linear or affine subspaces, each data point x i can be represented as a linear or affine combination of other points, that is where Static RBSC: If we have a static dataset X, then we have where C = [c 1 c 2 . . . c N ] ∈ R N×N and diag(C) is the vector of the diagonal elements of C. C is referred to as the representation matrix.
For a system of equations such as (2) with infinitely many solutions [1], one can restrict the set of solutions by minimizing an objective function, that is where f (·) denotes the objective function. Generally, static RBSC approaches vary from their objective functions in (3) and there exists two popular objective functions, that is, norm-based and low-rank-based ones. Typical approaches that utilize the two objective functions are SSC [1], LSR [47] (norm-based objective functions), and LRR (lowrank-based objective functions) [16], respectively. For SSC, f (C) = C 0 , where · 0 denotes the 0-norm. For LSR, f (C) = C F , where · F denotes the Frobenius norm. While for LRR [16], f (C) = rank(C). That is, SSC enforces C to be sparse, LSR tends to group highly correlated data together, while LRR encourages C to be low rank [47]. Actually, most RBSC algorithms so far are static SC algorithms and actually are variants or extensions of SSC, LSR, LRR, or a combination of them (e.g., LSS [48]). The optimal C * obtained by (3) will then be utilized to build an affinity matrix that is denoted as W. The SC results will be obtained after applying spectral clustering algorithms [49]- [52] to W. Now, we take LRR as an example to briefly introduce the solving of (3). Generally, considering that real-world datasets may contain noise corruption, (3) is usually converted to the following optimization problem: where λ > 0 is a balance parameter, E is an additional error matrix that is assumed to be sparse, and · denotes a certain regularization strategy. However, (4) is hard to optimize due to the rank function. In practice, one can solve (4) by using the 2,1 -norm to regularize the second term and the nuclear norm as a surrogate to replace the rank function where · * means the nuclear norm, which is the best convex envelope of the rank [24]. The optimization problem (5) is convex and can be solved by several methods. In this article, we utilize a state-of-the-art approach, called the inexact augmented Lagrange multiplier (IALM), for its accuracy and efficiency [53], [54]. Then, the skinny SVD of C is obtained An affinity matrix W can be built as follows: where U = U * (S * ) (1/2) . Finally, one can get the clustering (segmentations) of the data by applying spectral clustering algorithms (e.g., normalized cuts (NCuts) method [16]) to W.
Online RBSC: Popular online RBSC includes SSSC, SLSR, SLRR [23], and online LRR [24]. However, these online SC algorithms, in general, fail to process evolving data streams as analyzed in Section I.
Here, we comprehensively summarize the architectures of state-of-the-art online RBSC algorithms in a unified framework depicted in Fig. 2. There are two distinct phases, that is, static phase and online phase, as well as four main steps included (i.e., steps 1-4 in Fig. 2): first, the input data are divided into two parts, that is, static data and online data. The static data and online data are used for static learning and online clustering, respectively. Second, a certain static SC approach, such as SSC, LSR, LRR, is performed over the static data to obtain the static result, which could be regarded as a preparation work for the following online clustering. Then, the representation matrix or representation coefficients are obtained based on the static data, which relies on the processing manner of the algorithm. Finally, the online data points will be assigned to the corresponding subspaces that are found in the static results.

III. PROBLEM FORMULATION
A high-dimensional data stream is a sequence of timestamped high-dimensional data points that lie in a union of low-dimensional subspaces. Assume that at each timestamp t, we receive one data point x t , x t ∈ R d×1 . We arrange t points which we have received in total in a matrix Note that data streams have two features that are significantly different from static datasets: the first distinctive feature is the extremely limited number of accesses to the received points (most of the points should be immediately discarded after being accessed once or a few times for decreasing computing and storage consumption). The second main difference to static data is the potentially evolving property, that is, the evolving data streams.
The goal of this article is to perform DSC on the evolving high-dimensional data streams, that is, providing a timevarying SC result S t at each timestamp t which reflects the partition of received points X t such that the points belonging to the same subspace can be assigned to the same cluster.
The first feature of data streams aforementioned results in which we could not expect to obtain S t by repeatedly performing static clustering on X t . Instead, the incremental learning manner should be applied to process data streams, that is The evolving feature of the data streams requires that no assumption should be made that the subspace structure remains unchanged over time. Generally, the evolution can be caused by three types of subspace evolution [55]- [58], that is, subspace emergence, disappearance, and recurrence, which are formulated as follows.
Subspace Emergence: It refers to the occurrence of a new subspace at timestamp t. In particular, a subspace S emerges at timestamp t if S / ∈ S 1 ∪ S 2 ∪ · · · ∪ S t−1 and S ∈ S t . Subspace Disappearance: It is defined as an existing subspace that is not visited by the recently arrived data points.

IV. PROPOSED EDSSC APPROACH
Even though the self-expressiveness property has been successfully used to perform SC on static datasets, it is quite challenging to achieve dynamic SC on data streams based on the property because it is hard to balance two conflicting goals: 1) saving points for good SC performance and 2) discarding points for low computational complexity. In this section, an efficient algorithm, EDSSC, is proposed for balancing the two competing goals and addressing (8), that is, performing DSC on evolving data streams.
EDSSC consists of the following two phases: 1) static learning and 2) online clustering, which are presented in detail in Sections IV-A and IV-B, respectively. The subspace evolution problem is solved in Section IV-C where the subspace evolution detection strategy is given. Finally, the pseudocode and the complexity analysis of EDSSC are provided in Section IV-D.

A. Static Learning and EDSSC Summary Structure
As indicated in (8), the previous clustering result is needed to identify the clustering result at the current timestamp. Therefore, the ultimate goal of static learning is to learn an initial subspace structure based on a certain amount of points initially received to initialize the EDSSC model.
In the static learning phase, we mainly address two challenges: 1) proposing a method to know the number of subspaces lying in the data used in the static learning and 2) designing a data structure, called the EDSSC summary, for storing statistic summaries of the data stream. In a broad sense, the EDSSC summary here is the clustering result. Hence, we continue using S t to denote the EDSSC summary.
Specifically, we denote the EDSSC summary as 1) n t l is the total number of data points assigned to subspace l up to timestamp t; 2) R t l is referred to as a reserved data matrix of subspace l, which saves selected points from subspace l up to t; 3) T t l records the timestamps of all the points that are assigned to subspace l up to t; 4) t l records the ASCI (which will be described in Section IV-B) of all the points belonging to subspace l up to t. EDSSC needs to be initialized at the static learning phase where the first batch of T 0 (T 0 > d) data points X T 0 is processed, namely, EDSSC needs to cluster X T 0 to obtain the representation matrix C * T 0 first, which can be solved via the optimization problem in (5).
The affinity matrix at timestamp T 0 , denoted as W T 0 , is then built by (7). The clustering result will be obtained after applying spectral clustering algorithms on W T 0 . Note that the number of subspaces, that is, k T 0 , is required to be input for most spectral clustering algorithms.
Generally, estimating the number of subspaces underlying the data matrix is still one of the most challenging problems in clustering [59]- [61]. This is partly because it is quite subjective to definite what a subspace is. Subsequently, most state-ofthe-art online RBSC methods (including SSSC, SLRR, SLSR, OLRSC, etc.) assume that k T 0 is defined by users as prior, which limits their usage in real-world applications.
Currently, singular-based Laplacian matrix decomposition is one of the main techniques to solve the problem. Liu et al. [16] estimated the number of clusters by counting the small singular values of a normalized Laplacian matrix that should be smaller than a given cut-off threshold. However, the method is extremely sensitive to the selection of the threshold. This article proposes a new method that uses singular-based Laplacian matrix decomposition to automatically estimate the number of subspaces without requiring any input threshold. Assume we have obtained the similarity matrix W of the data matrix X d×N [via (6) and (7)]. Then, we can obtain the normalized Laplacian matrix L as follows: Then, the eigenvalues of L are obtained, which are denoted as Here, assume that the eigenvalues have been sorted in an increasing order, that is, 0 = σ 1 ≤ σ 2 ≤ · · · ≤ σ N . Note that 0 is one of the eigenvalues.
Then, the estimation of k T 0 could be obtained by (a is a constant and a > 1) (11) and After estimating the number of subspaces underlying X T 0 , we can further pursue the eigenvectors corresponding to the smallest k T 0 eigenvalues. The clustering result of X T 0 can be then obtained by performing a basic clustering (e.g., k-means clustering) on the eigenvectors matrix. Note that T T 0 l and T 0 l are empty for l ∈ [1, k T 0 ]. Limited by the storage space, it is unwise to have all the points of each subspace in their respective R t l . Therefore, it is reasonable to only select and save a small part of points as representatives of each subspace. Actually, representative selecting is another challenging topic that has not been well solved. There are a few algorithms designed to accomplish the task, such as DS3 [62] and ESC [14]. However, these methods are inefficient for largescale data processing. Here, for each subspace, we adopt a uniform random sampling approach of which the time complexity is only O(1) and the performance are comparable with the other complex sampling techniques [23]. We call the points processed in the static learning phase and online clustering as supporting points and streaming points, respectively. Since the number of supporting points in each subspace may vary significantly, we design a logarithmic function to control the number of points reserved in each subspace as follows. Assume that we have n sup supporting points and n res points will be reserved in R, then n res can be decided by the following: where N 0 , a 0 , b 0 , c 0 , and n 0 are constants and can be predefined by the users for different tasks. Our recommended setting for n 0 and N 0 is n 0 = 2 and N 0 = d. The basic principle for (13) is when n sup < N 0 , all the n sup will be reserved. While when n sup ≥ N 0 , a subset of the supporting points will be reserved and the larger the n sup , the lesser the ratio n res /n sup . Generally, we expect that when n sup = d, n res = d and when n sup = 4d, n res = 2d (d is the dimension of the data point). This can be easily achieved by setting a 0 and b 0 according to the following equations: It should be noted that we use (13) to balance the contradiction between the saving points for accuracy and discarding points for efficiency. Note that we set n 0 = 2, N 0 = D, and c 0 = 1.005 in this article. In specific, the initialization of the EDSSC summary is summarized in Algorithm 1.

B. Online Clustering Based on Sparse Representation
After the static learning phase, the EDSSC is ready to cluster the following points in an online manner, that is, the online clustering phase. For each arriving data point in the online clustering phase, EDSSC must decide if it is a normal point first (if the point belongs to one of the subspaces we have found in the EDSSC summary). The abnormal points are called outliers and should be rejected. Generally, the outlier detection is one of the most important research topics in the DSC research community [63], [64]. Here, with the help of the self-expressiveness property, EDSSC is capable of automatically accepting the normal points and rejecting the outliers.
Depending on whether the subspaces can effectively represent the current pattern of data streams, EDSSC divides the found subspaces into two states, that is, active and inactive, at each timestamp. The inactive state means the corresponding subspaces have expired. Both states can be converted to each other over time. Section IV-C will demonstrate how they convert to each other under the subspace evolution detection strategy of EDSSC. Note that only the active subspaces are stored in the EDSSC summary. For the inactive subspaces, EDSSC does not directly delete them but stores them to another reservoir D t = {D t m } h t m=1 which we refer to as the remove reservoir. Here, we assume that at timestamp t, there are h t inactive subspaces in D t . The inactive subspace can be denoted as here. Z t is made up by all reserved data matrix of active and inactive subspaces at timestamp t.
For a new arriving data point x t (t > T 0 ), we first get its representation coefficients under Z t , that is We can also use · 1 norm to relax (15) to obtain the solution c * t , referred to as representation coefficients. Considering x t could be either an outlier or a normal point, it is necessary to identify x t before we assign x t into a subspace. According to the self-expressiveness property, we know that c * t has the block sparsity property if x t is a normal point. That is, nonzero elements of c * t are concentrated on a certain part (this part corresponds to the reserved matrix of its own subspace). While for outliers, their solutions c * t do not have the block sparsity property. Without loss of generality, we further demonstrate the difference between the results of outlier and normal point in the following example. the difference of the coefficients of a normal and an outlier. The Cropped Extended Yale B database contains 2432 images from 38 subjects (64 for each subject) and the size of each image is 192 × 168 cropped from the original Extended Yale B database [66]. We further downsize each image to 48 × 42 for computational efficiency. Here, the 38 subjects are referred to as Sub1, Sub2, . . . , Sub38, respectively. The first four subjects (Sub1-Sub4) are utilized in this example. We randomly select 20 images from Sub1, Sub2, and Sub3, respectively, (60 images in total) and treat each image as a column to form matrix Z 2016×60 . Obviously, for Z, the new arriving images to be assigned will be normal points if they are from Sub1-Sub3. Otherwise, they would be outliers. We select one image from the remaining images of Sub2, which is denoted as x normal . Then, an image, denoted as x outlier , is randomly selected from Sub4. The representation coefficients for x normal and x outlier are obtained by (15) and are illustrated in Fig. 3(a). For the coefficients of the normal point x normal , the nonzero coefficients only correspond to the images from the same subspace (Sub2) of x normal . However, the nonzero coefficients for the outlier are very different.
In order to quantitatively measure the sparse concentration of c * t , we propose the ASCI to quantify how concentrated the coefficients are on a single subspace.

Definition 2 [Average Sparsity Concentration Index (ASCI)]:
The ASCI of a coefficient vector c ∈ R n is defined as where k is the number of internal subparts which c can be divided into. The function δ j (·) : R n → R n selects the coefficients associated with the jth (j ∈ [1, k]) subpart in c and keeps its elements which correspond to other subsparts in c as 0. ζ j is the length of the jth subspart in c. ASCI(c) ∈ [0, 1] and a higher ASCI(c) means the coefficients of c is more likely to concentrate on a single subpart. Proposition 1: The ASCI index is equivalent to the SCI index 2 [15] when the length of all the subsparts of the vector to be measured is equal.
Proof of Proposition 1: Suppose we have a q × 1 coefficient vector c containing k sequentially concatenated but disjoint subparts {c i } k i=1 (i.e., for ∀i, j ∈ [1, k], c i ∩ c j = ∅ and c 1 ∪ c 2 · · · ∪ c k = c) and the length of each subpart c i is denoted as ζ i . When all the subparts are of equal length, we have ζ i = q/k for ∀i ∈ {1, . . . , k}. Then, the ASCI (16) can be further derived as Therefore, the SCI index can be viewed as a special form of the ASCI index. For more general cases when the subparts of the vector are of different length, the ASCI index takes the length of each subpart into consideration. We contend this could ensure that ASCI is more suitable to evaluate the sparsity concentration of a coefficient vector than the SCI index, especially when the length of subparts is highly imbalanced, as illustrated in Example 2.
Example 2 (ASCI Versus SCI): For ease of demonstration, we assume that Z consists of two subparts, Z 1 and Z 2 and the length of the two subparts are imbalanced. Without loss of generality, we assume ζ 1 is obviously larger than ζ 2 . Accordingly, Z 1 and Z 2 are referred to as over-represented subject and under-represented subject, respectively. We randomly select 60 and six images from Sub1 and Sub2, respectively, to form Z 2016×66 . Then, an image, denoted as x test , is randomly selected from the rest images from Sub2. After being taken into (15), the representation coefficients c * are obtained and demonstrated in Fig. 3(b). Interestingly, as can be found in Fig. 3(b), the coefficients do not ideally concentrate on the second subpart (which corresponds to the images from Sub2) as expected. The main reason is when there is an imbalance among internal subparts of Z, the representation coefficients c * is more likely to have nonzero entries corresponding to the columns in Z from the over-represented subject. Actually, a similar phenomenon is found recently in [14] which demonstrates that the imbalanced case is pretty common in practice. In this experiment, ASCI(c * ) = 0.9021 while SCI(c * ) = 0.1002. Obviously, under the SCI index, the test x test (which should be identified as a normal point) is easier to be wrongly identified as an outlier.
We perform additional experiments by selecting other subjects as over-represented and under-represented subjects. Specifically, we sequentially select each of the 38 subjects as the over-represented subject and select one from the remaining subjects as the under-represented subject. The other settings are equal to the experiment above. We perform 40 774 experiments (38 × 37 × 58) in total and we obtain the corresponding ASCI(c * ) and SCI(c * ). Note that under a given τ , these ASCI(c * ) and SCI(c * ) will be compared to τ and then identified to be normal points or outliers. Here, with different τ , we further depict the accuracy of these 40 774 test samples being identified as normal points under ASCI and SCI indexes [see Fig. 3(c)]. As shown in Fig. 3(c), the ASCI index dramatically outperforms than SCI under different τ with a significant improvement on the accuracy. Even though it has been observed that such imbalanced case are quite more common in real life [14], fortunately, the imbalanced case can be better handled by ASCI because the length of each subpart in c * is accounted in (16). Then, for the under-represented subject, the influence of the imbalanced case will be dramatically mitigated or even eliminated.
It should be noted that we denote ASCI(c) as ω(c) for brevity. For c * t obtained by (15), we can then calculate the corresponding ω(c * t ). Here, we choose a threshold τ ∈ [0, 1] and accept x t as a normal point if ω c * t ≥ τ (17) and otherwise reject as an outlier. The outlier will be saved in an outlier reservoir, denoted as O t . Precisely, and O t = {x i , ω i , t i }. n t o is the number of outliers at timestamp t. While for a normal point, it can be either from an active subspace or from an inactive subspace. There are k t active subspaces and h t inactive subspaces whose reserved data matrix corresponds to the k t + h t subparts in Z t . We then assign the normal point x t based on the following optimization function: where r j (·) is the residual if x t is assigned into the jth subspace. The optimal j * will be obtained by (18) which corresponds to the j * th subpart in Z t . If j * ≤ k t , the corresponding active subspace S j * would be updated, otherwise the inactive subspace D j * −k t would be updated.

C. Online Subspace Evolution Detection
In real-world application, most data streams have nonstationary properties, commonly known as concept drift [55]. The concept drift leads to a time-varying subspace structure, that is, subspace evolution. Depending on the drift speed, such subspace evolution can be further divided into abrupt and gradual subspace evolution. It should be pointed out that here we mainly focus on the abrupt case because it is relatively more ubiquitous and easy to be detected in unsupervised tasks [67]. Therefore, in this article, the subspace evolution mainly refers to the abrupt case unless otherwise stated. Concretely, three specific types of subspace evolution are considered, that is, subspace emergence, disappearance, and recurrence. An online subspace evolution detection strategy is designed to ensure that EDSSC can deal with these evolving cases. Specifically, the subspace emergence and recurrence detection are based on the PH test [26], [68].
The PH test is a scalar change point detection (CPD) test method and has been successfully proved and verified in [7]. Compared with other concept drift detection methods, such as PCA-based approaches [34], the PH test allows us to not directly detect the changes of data stream X t , to avoid increasing the computational cost and complicating the algorithm. Here, inspired by [7], we detect the emergence and recurrence through observing the variation tendency of the outlier rate and the recurring points rate, respectively. The subspace disappearance detection is based on a fading function.

1) Subspace Emergence and Recurrence Detection Based on PH Test:
Considering high-dimensional DSC is an unsupervised task, we employ the PH test, which is one of the classical statistical hypothesis testing methods to detect the concept drift. We first give a brief demonstration of how the PH test works [7], [26], [68].
Assume the observed random variable is p. At each timestamp t, we get the empirical average of p p i (19) and the sum of differences between p andp t during the time interval [1, t]. δ is a positive real value which controls the test model. Meanwhile, t records the historical maximum value of γ up to current timestamp t, that is, t = max{γ 1 , . . . , γ t }.
At each timestamp, the gap PH t between t and γ t is obtained and the PH test is triggered if the gap is above a threshold η, that is The PH test has been theoretically verified to detect a negative drop in the mean of the Gaussian distribution [7]. Obviously, η controls the flexibility of the detection and an appropriate η setting should depend on the stream itself. In [7], a proper way about setting η is proposed and employed in our work. η can be set as where f is a constant that controls the sensitivity of the detection and t 0 is the first timestamp when PH t = 0. Now, we consider the subspace emergence. Subspace emergence will result in an obvious feature in the data stream, that is, a large number of outliers found in the data stream within a short time interval. Hence, we define a variable p which can quantitatively reflect whether outliers appear in large quantities in a short time. That is

Algorithm 2 EDSSC Algorithm
p t is monitored under the PH test at each timestamp. When outliers appear in large quantities in a short time, p t will show a downward trend which can be detected by the PH test. It should be pointed out that when the subspace emergence is detected, the points in O will be taken into (5) to find the clustering result. The points will also be randomly sampled then. The new added subspaces summary S * will be updated into the EDSSC summary finally.
Subspace recurrence is the case that once disappeared subspaces are active again, that is, their points are observed again in the data streams during the very recent timestamps. Actually, in this sense, subspace recurrence has a common evolving feature with subspace emergence. That is, subspace recurrence is equivalent to the emergence of the second time or even higher time. Hence, for each inactive subspace D t m in the remove reservoir D t , we define a variableṗ t m , that iṡ 2) Subspace Disappearance Detection Based on Fading Function: The data points of data streams are potentially infinite which implies that many of the previously active subspaces will gradually expire. These subspaces should be detected in time to ensure that the clustering results are more suitable to reflect the current patterns of data streams. EDSSC defines a variablep t l for each active subspace, that is where max{T t l } is the last timestamp when a data point was assigned to subspace l and β is a preset parameter which controls the sensitivity of the EDSSC model to subspace disappearance detection. As can be inferred from (25),p t l is based on a sigmoid function andp t l will drop sharply when t > max{T l }+β. Hence, EDSSC monitorsp t l for all active subspace at each timestamp and accept the subspace as an active subspace ifp t l ≥ 0.5. Inactive subspaces will be removed from the EDSSC summary to avoid the summary being redundant.

D. Algorithm Flow and Complexity Analysis of EDSSC
We summarize the complete procedure of EDSSC in Algorithm 2. Precisely, the EDSSC involves four main steps.
First, the first bunch of data is collected and processed to initialize the EDSSC summary (Section IV-A).
Second, for each of the arriving points, its ASCI value will be calculated by (15) and (16). Then, the point will be accepted as a normal point or rejected as an outlier according to (17). For normal points, they will be updated in the EDSSC summary or the remove reservoir by (18). While the outliers will be put into the outlier reservoir O t (Section IV-B).
Third, the PH test will be utilized to check if subspaces recur in remove reservoir D t and new subspaces emerge in outlier reservoir O t (Section IV-C1).
Finally, all active subspaces in EDSSC summary S t will be checked if they are still active at timestamp t. The inactive subspaces will be removed from S t to the remove reservoir D t (Section IV-C2).
Since the variables used for change detection consume ignorable memory, the memory usage of EDSSC is mainly dominated by the saving of representatives. According to (13), the number of saved points is approximately logarithmic to the total number of points. Therefore, for n points, the approximate space complexity will be O(d log 2 (n)), which is comparable to the state-of-the-art online RBSC algorithms (e.g., SSSC, SLSR, and SLRR) and much less than those of the static RBSC algorithms [e.g., O(dn 2 ) of SSC and LRR].
In the static learning phase, the time complexity of EDSSC is mainly determined by the solving of (5), which is approximately O(T 3 0 ) [23]. While in the second phase, EDSSC needs to compute (15) for the following n − T 0 points and (5) if subspace emergence is detected, which is approximately O(ζ n 3 o + (n − T 0 ) log 2 (n)). Here, ζ denotes the number of times that subspace emergence detection is triggered and n o is the number of points in the outlier reservoir when the emergence detection is triggered. Due to T 0 , n o n, the time complexity of EDSSC is much smaller than those of LRR (O(d 2 n + n 3 )) and SSC (O(dn 3 )) [23] and relatively larger than those of SSSC, SLRR and online LRR due to the ζ times computing of (5). However, such an additional computational cost, that is, O(ζ n 3 o ), is inevitable to find the new subspaces which are not learned in the static phase instead of assuming all the subspaces having been learned in the first phase. The time complexity of EDSSC would be comparable with the other algorithms when processing data streams without any new subspaces emergence.

V. EXPERIMENTS
In this section, experiments are performed on publicly available datasets to evaluate the performance of EDSSC. Moreover, the state-of-the-art online SC algorithms, that is, SSSC [23], SLRR [23], SLSR [23], and OLRSC [25], as well as DSC algorithms CEDAS [9] and STRAP [7], are selected as baseline algorithms. Section V-B demonstrates the influence of parameters on the performance of the proposed algorithm. Section V-C compares the results of all the evaluated algorithms on small-scale evolving data streams. In Section V-D, we investigate the performance of these evaluated algorithms on medium-scale evolving data streams. In addition, the performance of EDSSC, as well as other baseline algorithms on evolving large-scale data streams, are reported in Section V-E.

A. Datasets and Experimental Setup
The experiments are performed on seven evolutionary streams generated from seven public datasets. Particularly, they are divided into three categories according to their scales, that is, small scale, medium scale, and large scale, as shown in Table I. For computational efficiency, some datasets have been preprocessed. For AR datasets, we select 1400 images equally from 100 persons and the dimension of each sample has been reduced from 19 800 to 167. Similar to AR, we also use a subset of the MPIE dataset which consists of 4400 images from 100 persons of the MPIE dataset. EMNIST-letter consists of 13 000 samples of 26 English letters and MNIST 30K is a subset whose samples are randomly selected from the MNIST dataset. In order to generate the evolutionary data stream, in the initial stage, we only select some samples from some of the classes and the samples of classes that are not selected emerge in the online phase.
We compare the proposed algorithm with four state-of-theart online RBSC algorithms, that is, SSSC [23], SLRR [23], SLSR [23], OLRSC [25], and two state-of-the-art related DSC algorithms, CEDAS [9] and STRAP [7]. All the algorithms are implemented with MATLAB. For SSSC, SLRR, SLSR, and OLRSC, there exists a common and vital parameter λ which is utilized to balance the data fidelity and the regularization term when solving the 1 -minimization problem. For CEDAS and STRAP, a parameter r is required as an input to control the radius of a cluster. For a fair comparison, the parameters of all algorithms are tuned for obtaining the best performance, as summarized in Table II.
All the algorithms are measured using accuracy and NMI between the results given by the algorithms and the ground truth. The values of accuracy and NMI are real numbers between 0 and 1. Particularly, larger values mean the given result matches the ground truth more. In our experiments, the  accuracy and NMI are the average values obtained by running each program ten times on each data stream.

B. Impact of Parameters
Before comparing the performance of all algorithms, we would like to investigate the influence of parameters in EDSSC. There are two key parameters, that is, τ and f , which would impact EDSSC. Precisely, τ is used to identify the normal points from outliers and f in (22) controls the threshold η in (21) for change detection in PH tests. To study how these parameters impact on EDSSC, we chose different parameters settings of τ and f and run the EDSSC on the PenDigits stream. The corresponding results are illustrated in Fig. 4, from which the following observation could be obtained. Fig. 4(a) depicts the clustering quality (accuracy and NMI) of EDSSC under different τ settings (we keep f = 25). When the τ is relatively small (e.g., τ < 0.5) or large (e.g., τ > 0.75), the clustering quality is lower. The reason is a smaller τ results in that more outliers are accepted as normal points and a larger τ gets more normal points rejected as outliers. Moreover, a larger τ is easier to trigger PH detection getting clustering quality decreased.
We report the results of the effect of f on the EDSSC method (we fix τ = 0.5) in Fig. 4(b) from which we find that f will impact the clustering quality. If f is set too large for the PH test to get triggered, then the new emerging subspaces would not be found timely.

C. Online Subspace Clustering on Small-Scale and Evolving Data Streams
In this section, we focus on studying the performance of EDSSC and baseline algorithms on small-scale and evolving data streams. We report their performance in Table III from which we can find the following.
Our EDSSC succeeds in achieving high clustering accuracy and NMI in processing the three small-scale and evolving data  III  PERFORMANCE COMPARISON AMONG DIFFERENT ALGORITHMS ON  THREE SMALL-SCALE DATA STREAMS (AR, EXYALEB, AND MPIE) streams. For example, on the ExYaleB data stream, EDSSC achieves 75.01% accuracy and 86.47% NMI compared with 57.14% accuracy and 74.43% NMI of OLRSC which has the best performance among all baseline algorithms. It proves the feasibility of our EDSSC algorithm, especially the effectiveness of the detection strategy of subspace evolution, the estimation algorithm of the number of subspaces, and the proposal of the ASCI index.
It can be observed that our EDSSC achieves much higher clustering quality than the state-of-the-art RBSC algorithms, that is, SSSC, SLRR, SLSR, and OLRSC. Such superiority of EDSSC is more obvious on the ExYaleB and MPIE data streams. The reason is that these RBSC algorithms are based on the assumption that the subspace structure should be unchanged. This assumption leads to that they fail to detect and adapt the evolving subspace, which naturally results in the failure in the clustering. It should be pointed out that for the AR stream, the NMI value of EDSSC is slightly less than that of SLSR. One possible reason is that the spatial distribution of data points in the AR datasets is rather complex. Hence, it is relatively difficult to accurately estimate the number of subspaces underlying the AR stream without any prior information (EDSSC estimates 126 subspaces). However, SLSR, SSSC, SLRR, and OLRSC need an accurate number of subspaces as prior information. In the experiments of this article, we provide them with the accurate number of subspaces, which will promote their NMI values.
Compared with the two state-of-the-art DSC algorithms, that is, STRAP and CEDAS, EDSSC is more suitable in processing these high-dimensional evolutionary data streams. This is mainly because STRAP and CEDAS are based on traditional distance measurements which are not effective in high-dimensional space to indicate the aggregation of the points belonging to the same subspace. However, our EDSSC fully exploits the self-expressiveness property of the data points, thus achieves better performance in high-dimensional space.

D. Online Subspace Clustering on Medium-Scale and Evolving Data Streams
This section investigates the proposed method as well as baseline algorithms on three medium-scale handwritten digit (or character) streams, that is, USPS, PenDigits, EMNIST-letter. Table IV reports the accuracy and NMI of the tested algorithms, from which the following could be observed. The proposed EDSSC achieves the best performance on the USPS and PenDigits streams among all the performed algorithms. For example, on the USPS data stream, the accuracy and NMI of EDSSC can reach 67.01% and 78.65% followed by SLRR whose accuracy and NMI only 60.43% and 67.20%. For the EMNIST-letter stream, EDSSC achieves the highest accuracy with 67.62%. The NMI of EDSSC on EMNIST-letter is 75.58%, which is less than SLSR (83.50%). However, as discussed before, it is mainly because the spatial distribution of the EMNIST-letter stream is complex. The number of subspaces estimated by EDSSC is 32, which is slightly larger than the actual.
It can be observed that the performance of our EDSSC is more stable when handling different kinds of evolving data streams. Nevertheless, the performance of the baseline algorithms fluctuates. For example, SLSR performs relatively acceptable on processing the EMNIST-letter stream with accuracy = 65.60% and NMI = 83.50%. However, it achieves only accuracy = 50.43% and NMI = 59.60% in USPS stream processing. The performance of SSSC on the USPS stream as well as the EMNIST-letter stream is not comparable to the other RBSC algorithms but its performance becomes relatively acceptable on the PenDigits stream processing. The possible reason is that these baseline algorithms are based on different theory to capture the subspace structure. For example. SSSC is based on sparse restriction while our EDSSC is based on the low-rank restriction in (3). The latter is more powerful to capture the structure among different kinds of streams [23].

E. Online Subspace Clustering on Large-Scale and Evolving Data Stream
We further evaluate the proposed and the baseline algorithms on a large-scale and evolving data stream. The the MNIST 30K stream is used in this experiment which consists of 30000 data points randomly selected from 10 classes in the MNIST database. Besides subspace emergence, we further add subspace disappearance, recurrence in the data stream to test the performance of the algorithms. The temporal distribution and evolving property of MNIST 30K are shown in Fig. 5.
As can be seen from Fig. 5, we divide the data stream into five phases denoted as P1-P5, respectively. There are five subspaces emerging in P1 followed by three subspaces and two subspaces emerging in the second and third phases, respectively. Then, in P4, five subspaces disappear and two of them recur in the next phase.
The performance of the algorithms on the MNIST 30K stream has been reported on the right side of Table IV. As shown in Table IV, the proposed EDSSC outperforms the baseline algorithms in the accuracy and NMI value of the result. Furthermore, in order to study the performance of each algorithm for subspace evolution, we depict the real-time number of subspace recovered by each algorithm on MNIST 30K in Fig. 6. From Fig. 6(a), it can be observed that EDSSC is effective in detecting and adapting the subspace evolution with the three types of subspace evolution being detected accurately. However, in Fig. 6(b), it can be found that the number of subspaces keeps as equal to 5 during the entire process. The reason is that SSSC, SLRR, SLSR, and OLRSC cannot deal with the subspace evolution and assume all the subspaces should be covered in the static learning phase. Therefore, the number of subspaces is unchanged. In addition, these four algorithms actually cannot estimate the number of subspaces in the static phase and they require the number of the subspace must be input as prior information. In Fig. 6(c), we can find that the CEDAS has the potential to deal with the subspace emergence and disappearance. However, it cannot recover the subspace accurately because CEDAS is based on traditional distance measurement which is not effective in high-dimensional space. Meanwhile, note that the subspace recurrence cannot be detected at all. As shown in Fig. 6(d), STRAP can only detect subspace emergence. Therefore, the number of subspaces can reach 9 from 5 around t = 14 000. But after that, the number of subspaces keeps unchanged. This is because the subspace disappearance and recurrence can not be detected or acted by STRAP.

VI. CONCLUSION
In this article, we focused on one unsolved but vital problem, that is, how to cluster the evolving and highdimensional data streams. To this end, a novel DSC algorithm, called EDSSC, was proposed which can perform online SC on the evolving and high-dimensional data streams. EDSSC is a two-phase algorithm, including a static learning phase and an online clustering phase. Compared with the state-ofthe-art algorithms, EDSSC satisfactorily handles the following issues. First, in the static learning phase, our EDSSC was capable of estimating the number of subspaces by the proposed method (10). Second, by proposing a data structure, called EDSSC summary, EDSSC perfectly reached a balance between the two competing goals: saving more points for the accuracy or discarding more points for efficiency. Third, EDSSC does not make the assumption that the subspace structure of the data streams must be unchanged. Instead, EDSSC can detect and act subspace evolution properly by the proposed subspace detection strategy based on the PH test. Finally, in the learning phase, the arriving point can be assigned to the proper subspace with the help of the proposed ASCI index. We have proved that ASCI is better than the commonly utilized SCI index.
In future studies, we will concentrate on exploring a proper strategy enabling EDSSC to detect gradual subspace evolution. In addition, we will focus on the automatic optimization of relevant parameters affecting the performance.