Online Robust Subspace Clustering With Application to Power Grid Monitoring

In this work, a robust subspace clustering algorithm is developed to exploit the inherent union-of-subspaces structure in the data for reconstructing missing measurements and detecting anomalies. Our focus is on processing an incessant stream of large-scale data such as synchronized phasor measurements in the power grid, which is challenging due to computational complexity, memory requirement, and missing and corrupt observations. In order to mitigate these issues, a low-rank representation (LRR) model-based subspace clustering problem is formulated that can handle missing measurements and sparse outliers in the data. Then, an efficient online algorithm is derived based on stochastic approximation. The convergence property of the algorithm is established. Strategies to maintain a representative yet compact dictionary for capturing the subspace structure are also proposed. The developed method is tested on both simulated and real phasor measurement unit (PMU) data to verify the effectiveness and is shown to significantly outperform existing algorithms based on simple low-rank structure of data.


I. INTRODUCTION
Grid status monitoring is an important prerequisite for reliable and efficient operation of the power system. Failure to identify anomalies in the grid states in an accurate and timely manner may result in inefficient use of resources, equipment damages, system instabilities, and even cascading blackouts. It is also critical that cyber-attacks to the power system and its data be detected, and their effect mitigated promptly.
The benefit of employing synchrophasor measurements for power system monitoring has been widely recognized in recent years [1], [2]. The phasor measurement unit (PMU) is a device that can measure voltage phasors at buses, typically at a rate of 30 samples per second or higher. Based on the global The associate editor coordinating the review of this manuscript and approving it for publication was Michele Magno .
positioning system (GPS), PMU measurements can achieve precise synchronization across a wide region. Compared to the conventional supervisory control and data acquisition (SCADA) protocol, which provides measurements at a frequency of a few samples per minute, PMU data can reveal dynamic changes in the grid states, significantly improving the monitoring capability.
As the high-speed large-scale spatio-temporal measurement of the grid is enabled, data-driven approaches for power system monitoring became important. Departing from the more conventional model-driven paradigm, which requires detailed and accurate system models and parameters, the data-driven methods utilize the ample measurements to learn the salient structures in the data and perform inferences. Voltage stability was predicted using singular value decomposition (SVD), spline extrapolation, and neural networks, trained on measured and simulated data [3], [4], [5], [6]. Line outages were identified using machine learning techniques [7], [8]. A real-time event identification method was developed based on unique subspace signatures present in the dynamics of PMU data [9]. Co-occurring multiple events were recognized from frequency measurements through sparse coding techniques [10]. The onset of events was detected based on compressed PMU data using dynamic programming optimization [11]. Periodic forced oscillations caused by rogue inputs were detected from PMU measurements in [12].
There are some challenges associated with analyzing the large-scale data generated by PMUs. The processing of up to tera-bits of daily measurements for large-scale systems require significant dimensionality reduction that can still preserve informative features in the data [13], [14]. Furthermore, such data are often generated in a streaming fashion and need to be processed in real time. Thus, processing algorithms are constrained to be of low computational complexity. Based on the data, fast and accurate inference must be performed to detect anomalies and disturbances occurring in the systems. As the data volume grows large, it is natural to have missing and corrupt entries in the data due to various reasons, such as sensor failure and communication errors. In critical infrastructures, the measurements may also be tampered by cyberattacks, which must be detected and isolated.
Exploiting inherent structures in the data shows promises in mitigating these challenges. Missing measurements in power grid data were reconstructed using low-rank matrix completion approaches in [15]. Subspace clustering approaches were devised and tested for synchrophasor data [16], [17]. Measurements deviating from the postulated models were detected as indicating disturbance events or cyber-attacks [17], [18], [19]. The classification problem to detect the attacks in the smart grid was tackled using supervised and semi-supervised methods [20], [21]. However, these works were based on batch implementations, where the processing takes place offline after measurements have been collected, or the model is re-trained frequently. Thus, the methods may incur large computational and memory burden and may not be suitable for real-time monitoring applications. Deep learning models, such as the convolutional neural network (CNN), deep autoencoder (DAE), generative adversarial network (GAN), and long short-term memory (LSTM) network, were employed to detect anomalies in power system data, but they typically demand large datasets and long training time [22], [23], [24], [25], [26], [27].
To mitigate these issues, online algorithms are desired, which can process the data stream sequentially and incrementally. Online identification of low-dimensional subspaces from data with missing entries was tackled in [28]. Online algorithms for robust principal component analysis (PCA) were derived for processing outlier-corrupted data with lowdimensional structures in [29] and [30]. Online learning of sparse coding dictionary models was proposed in [31]. An online nonnegative matrix factorization algorithm was derived in [32]. An online algorithm for missing PMU data estimation was developed based on low rank matrix completion in [33].
In this paper, the subspace clustering model is adopted, in which data points are assumed to lie in a union of subspaces [34]. Such a model is readily justified since subspaces can capture different states of the dynamic system operations [3], [18]. While the subspace clustering model can subsume low-rank subspace models, it can capture the richer union-of-subspaces structure, and thus can be used in high-rank situations as well-i.e., when the sum of the dimensionalities of individual subspaces is higher than the ambient measurement dimension [16], [35].
A host of approaches have been developed for learning subspace clustering structures, including the k-plane clustering [36], generalized PCA [37], sparse subspace clustering [38], and low-rank representation (LRR) [39]. Here the LRR framework is employed for its good performance and extendability to online implementation. It is noted that an online algorithm for the LRR model was developed in [40]. However, in addition to neglecting the missing data issue and employing a costly second-order update rule, it was suggested in [40] to use the entire data matrix itself as the representative dictionary. Therefore, the algorithm can be started only after the data collection is done. If the initial portion of the streaming data is used as the dictionary, the dictionary may lose the representative power if the data distribution changes over time.
In our work, the dictionary for subspace representation is updated in an online fashion as well. More specifically, the dictionary atoms are selected incrementally yet judiciously based on an appropriate sparsification criterion. Furthermore, to maintain the dictionary size under a given memory budget, the atoms are adaptively discarded, which is called a pruning procedure.
Some preliminary results were presented in conferences [17], [41]. Compared to these precursors, the current journal version contains slightly different formulations for the consistency of batch and online problems. Furthermore, rigorous proofs are provided for the convergence of the derived batch and online algorithms. Extended numerical tests are performed with synthetic data using grids of different sizes, and with real PMU data. Various aspects of the performance of the proposed algorithms are also compared with existing alternatives that adopt low-rank models of data.
Our contributions can be summarized as follows. 1) We formulate a novel LRR-based robust subspace clustering problem accommodating both incomplete and corrupt measurements. 2) Both batch and online algorithms are derived with provable convergence guarantees. The online algorithm enjoys low computational complexity, processing delay, and storage overhead. In addition, it can track slow variations in the underlying dynamics. Online sparsification and pruning procedures are also proposed to maintain a compact dictionary to represent the subspaces.
3) The algorithms are applied to the synchrophasor data analysis for power grid monitoring. Numerical tests are performed to validate the performance of reconstructing missing measurements and detecting anomalous events using both simulated and real PMU data. The algorithms are compared with existing alternatives and shown to be significantly superior. The rest of the paper is organized as follows. In Section II, the subspace clustering formulation with missing and corrupt data is presented. The batch and online algorithms are derived, and then strategies to update the dictionaries on the fly are discussed in Section III. Results from the numerical tests are presented in Section IV. The conclusion is provided in Section V.

II. LOW RANK REPRESENTATION FOR INCOMPLETE DATA
Let us denote the N -channel measurements obtained from various sensors at time t as z t ∈ R N . Collecting the measurements over T time intervals, matrix Z is formed as Z := [z 1 , z 2 , . . . , z T ] ∈ R N ×T . The subspace clustering model postulates that a (noise-free) measurement vectorz t lies in one of K subspaces {S k } K k=1 ; that is,z t ∈ ∪ K k=1 S k . Such a model is useful for PMU data since small variations in the node voltages lead to the measurements that are well approximated by a linear subspace, and significant variations in the operating point or disturbances in the system would result in measurements that belong to different subspaces [3], [16]. Thus,z t can be represented by a linear combination of a set of template vectors, called atoms, which are collected as the columns in a dictionary matrix D ∈ R N ×M , where M is the dictionary size or the number of atoms. Then, with c t ∈ R M representing the vector of coefficients, one has z t ≈ Dc t .
Interestingly, upon defining C := [c 1 , c 2 , . . . , c T ] ∈ R M ×T , the LRR model insists that C has a rank that is much smaller than min{M , T }. It turns out that this constraint can reveal the subspace clustering structure. More specifically, consider the optimization problem with where ∥C∥ * denotes the nuclear norm of C, or the sum of its singular values. Minimizing ∥C∥ * is tantamount to promoting a low rank in C [42]. If D =Z andZ has the skinny SVD given byZ = U 0 0 V ⊤ 0 ( ⊤ denotes transposition), it can be shown that the solution of (1) is C = V 0 V ⊤ 0 [39]. Under the assumption that the data {z t } are clean, that is, they lie exactly in the union of subspaces, and the set of subspaces {S k } are independent, which means that the dimension of the union (the rank ofZ) is equal to the sum of the individual dimensions of the subspaces, it can be shown that the (i, j)-th entry of V 0 V ⊤ 0 is nonzero only ifz i andz j belong to the same subspace [43]. Thus, V 0 V ⊤ 0 can serve as the affinity matrix, which indicates whether a given pair of data points lie in the same subspace. In fact, various subspace clustering techniques first compute the affinity matrix, and then extract the clusters [34]. Note also that if D contains the atoms that span the individual subspaces {S k }, then the resulting C can still be used as the affinity matrix [39].
As the data may be collected at a high rate and transported over some communication infrastructure, it may happen that some measurements do not arrive at the processing unit on time, resulting in incomplete measurements. Let denote the set of indices for the entries of Z corresponding to the observed measurements, and c the missing ones. Also, define operator P (Z), which keeps the observed entries unchanged, while setting the missing ones to zero. In other words, with the (i, j)-entry of Z denoted as Z ij , the (i, j)-entry of P (Z) is defined as In addition, the measurements may become corrupted by gross errors in the communication channel, or even by cyberattacks. Such measurements need to be detected and isolated. Anomalous measurements containing significant deviations from nominal operational states also need to be identified. Assuming that only a small portion of the data matrix Z is corrupted, one can adopt a robust LRR model, which postulates that Z = DC + E, where E ∈ R N ×T is a sparse matrix. Overall, an optimization problem for robust LRR with incomplete measurements can be posed as where ∥E∥ 1 is the ℓ 1 -norm of E defined as the sum of absolute values of all entries in E, and µ > 0 is a parameter that balances the low rank of C and the sparseness of E. Group or other structured sparsity can be easily incorporated to capture the correlations in the corruption patterns. It can be easily verified that at the optimum, E will have the entries in c equal to 0, or E| c = 0. Therefore, the optimal objective of (3) is equal to that of The optimal C for (3) will be identical to the optimal C for (4). Once the optimal E for (4) is denoted as E, the optimal E for (3) is simply given by E = P (E).

III. ALGORITHM DERIVATION A. BATCH ALGORITHM
An iterative algorithm to solve (4) can be derived, which processes the entire data set Z in a batch fashion. The batch solution is useful for offline analysis of historical data and serves as a performance benchmark for the online algorithm that will be developed in Section III-B.
To solve (4) efficiently, the alternating direction method of multipliers (ADMM) is employed [44]. First, a copy of variable C is introduced as C to obtain a formulation equivalent to (4) as min C,E,C Upon introducing Lagrange multiplier matrices 1 and 2 associated with constraints (5b) and (5c), the augmented Lagrangian can be written as where ρ > 0 is a positive constant, and ∥ · ∥ F denotes the Frobenius norm. The ADMM is an iterative algorithm that minimizes L ρ alternatingly with respect to two blocks of variables. Here, C is taken as one block of variables, and (C, E) as the other block. Then, at iteration k, based on the k-th (current) iterates C k , C k , E k , k 1 , k 2 , the ADMM generates the next iterates as It is noted that (7) can be split into the optimizations with respect to C and E separately. The optimization for C can be written as = arg min Thus, with the SVD of C k + ρ −1 k 2 = U V ⊤ , and upon defining a soft-thresholding operator S µ (·) as C k+1 is given by [45] where S 1/ρ ( ) applies the soft-thresholding operation entry-wise. The optimization problem for E k+1 is given by whose closed-form solution is given by Finally, the update equation for C k+1 can be derived from (8) as where I is the M × M identity matrix. The resulting batch robust subspace clustering (RSC) algorithm is described in Table 1. Compared to a similar algorithm derived in [39], the algorithm in Table 1 can handle missing entries and provides a provable convergence guarantee. The convergence follows from the standard ADMM literature. On the contrary, the algorithm in [39] alternates among three blocks of variables, for which convergence has not been established in general [46].  Table 1) The iterates {C k , E k , k 1 , k 2 } generated from the batch algorithm are bounded, and every limit point of {C k , E k } is an optimal solution to (3). Proof: See Appendix A.

B. ONLINE ALGORITHM
Contrary to the batch analysis method that processes a bulk of measurements together, the online algorithm updates the VOLUME 11, 2023 estimates of C and E each time a new datum arrives. This allows low-delay real-time analysis. Since only the latest data sample is involved, the online update rules are typically of low complexity and require a small memory footprint as well.
To facilitate the derivation of an online algorithm, (4) is first re-written as an unconstrained problem as (18) where λ > 0 is a parameter tuning the rank of C. Note that the nuclear norm of a matrix C, whose rank is no larger than R, can be expressed as [42] ∥C∥ * = min Based on this, (18) is modified to [see also [47]] Proposition 2: If the optimal C for (18) has a rank no larger than R, then (20) achieves the same optimal objective as (18). Proof: Let A * and B * be the optimal solution to (20), and set C * = A * B * ⊤ . If one minimizes 1 2 (∥A∥ 2 F + ∥B∥ 2 F ) subject to C * = AB ⊤ (call the corresponding optimal solutionǍ andB), one should be able to further reduce the objective of (20) by usingǍ andB. However, since A * and B * are already optimal for (20), one must have 1 2 , and e t the t-th column of E, i.e., E = [e 1 , · · · , e T ]. The set t denotes the set of indices of the observed entries at time t, that is, t := {n : (n, t) ∈ }. Thanks to reformulation (20), the problem is now separable into different time slots and can be rewritten as The basic idea for deriving the online algorithm is to update b t and e t based on P t (z t ) at each time t, without revisiting the past entries {b τ } and {e τ } for τ = 1, 2, · · · , t −1. Furthermore, A is updated based on the stochastic gradient descent (SGD) method to reduce computational complexity [48].
First, the update for b t and e t at time t is based on the previous iterate of A t−1 via solving To solve this problem, the coordinate descent method is adopted, where b t and e t are obtained by fixing one and solving for the other alternately until they both converge. Let e t | t and e t | c t denote the entries of e t whose indices are in t and c t , respectively. Then, the coordinate descent proceeds for l = 0, 1, 2, · · · as where S µ (x) for vector x applies element-wise. The update for A is based on the SGD method. The key observation is that as T increases, the objective of (21) approaches the expected value, thanks to the law of large numbers. That is, upon defining and problem (21) tends to where the expectation E[·] is with respect to z t and t . Instead of computing the gradient of the entire cost function, the SGD takes the instantaneous derivative using only the current data sample. Thus, the update rule for A becomes where ρ t > 0 is the step size. The derived online RSC algorithm is listed in Table 2. The convergence of the algorithm is established in the following proposition. Table 2) Suppose that {z t } are bounded and the iterates {A t } generated by the algorithm in Table 2 are bounded as well. Then, for ρ t = 1/Lt with a large enough L > 0, {A t } converge to the stationary point of (28) almost surely.

Proposition 3: (Convergence of the Algorithm in
Proof: See Appendix B.

C. DICTIONARY UPDATE
Clearly, for the union-of-subspaces structure to be captured, D must contain the vectors that span the individual subspaces {S k }. If this is true, provided that the subspaces are independent, and the data samples are strictly drawn from ∪ K k=1 S k , it can be shown that C from (1) will have its (m, t)-entry equal to zero if the m-th atom in D and the t-th data sample z t lie in different subspaces [39].
A choice for the dictionary often made in the literature is to use the batch dataset Z itself by setting D = Z [40]. Obviously, such a choice does not result in truly online processing since the algorithm would then require the entire data set to be available first. One could instead make use of the historical data. However, this might incur sizable memory requirements. Moreover, the dictionary should still be updated from time to time to keep up with any slow drift in the data distribution. A desirable option would be to start with a small initial dictionary, which is then updated continuously based on the incoming data stream.
In this work, instead of learning the dictionary from the data as in the dictionary learning frameworks [31], it is constructed directly from the online measurement vectors. A critical issue in this context is to be selective in adding a new measurement into the dictionary so that the size of the dictionary does not grow excessively, but still the diversity of the atoms is maintained so that a good representation capability is achieved. For this, online sparsification and pruning procedures are proposed next.

1) ONLINE SPARSIFICATION
Various online sparsification strategies have been developed, in particular, in the context of kernel-based adaptive filtering literature [49], [50], [51]. The main idea is to accept a new datum into the dictionary only when it is deemed to sufficiently contribute to the diversity of the dictionary based on an appropriate metric. Representative sparsification criteria include minimum pairwise distance, approximate linear dependence, coherence, and Babel measure, all of which can be shown to eventually upper-bound the condition number of the kernel (Gram) matrix [51]. In this work, the minimum pairwise distance measure is adopted for its simplicity.
Let us first focus on the case where there are no missing entries in the data. Suppose that at time t − 1, M t−1 data vectorsž 1 ,ž 2 , . . . ,ž M t−1 are in the dictionary as D t−1 = [ž 1 ,ž 2 , . . . ,ž M t−1 ], wherež m ∈ {z 1 , z 2 , . . . , z t−1 }. Then, at time t, a new datum z t arrives. The distance metric computed for z t is given by If κ t is greater than some threshold δ 2 , then z t is admitted to the dictionary. That is, one sets M t = M t−1 + 1 anď z M t = z t . Otherwise, z t is discarded. When there are missing entries, that is, c t ̸ = ∅, a simple pragmatic strategy is to discard this measurement, which is used in the numerical tests. An alternative would be to use the reconstructed vector Through sparsification, the size of the dictionary M t tends only to increase as time goes on. One can show that as long as the data vectors z t belong to a compact set, the size of the dictionary does not increase without bound [49]. However, the resulting size of the dictionary may still be too large in practice. To contain the size of the dictionary within a prescribed memory budget, a pruning procedure may be employed as explained next.

2) PRUNING
When M t exceeds a memory budget M for the dictionary, that is, from the last sparsification step, M t = M + 1 results, pruning removes an atom from the current dictionary D t . A reasonable strategy is to discard the atom that has the least contribution for representing the data seen so far [52]. Thus, upon denoting the m-th row of A t as a t (m, :), one finds andž m * is removed from D t . The corresponding row in A t should also be removed before the next iteration. Note that VOLUME 11, 2023 the atom that was just added to the dictionary is not pruned immediately. An alternative strategy would be to look at {c t := A t−1 b t }, but this is not pursued here since it requires a larger memory overhead. The overall online RSC algorithm including the dictionary update is described in detail in Table 3.

IV. NUMERICAL TESTS
The performance of the proposed algorithms was verified by numerical tests. Average performance values were obtained based on 20 runs. First, the results using the simulated PMU data are reported, followed by the real PMU data results.

A. TESTS WITH SIMULATED PMU DATA 1) A 23-BUS SYSTEM
The algorithms were first tested on simulated PMU data generated using the PSS/E simulator [53]. The simulated power system consists of 23 buses and 6 generators, as shown in Fig. 1. The PMU at each bus acquires measurements at a sampling rate of 40 samples per second and a signal-to-noise power ratio (SNR) of 92 dB. More detailed grid parameters are provided in [17]. To simulate events, the transmission line connecting buses 3001 and 3003 was tripped at t = 10 seconds and closed at t = 70 seconds. Then, the same line was tripped again at t = 110 seconds and closed back at t = 170 seconds. This is to ensure the convergence of online algorithms in the first half of the data so that the algorithm performance can be assessed accurately in the steady state during the second half. For simplicity, only the voltage magnitudes were used in the experiment. Employing multiple modalities, such as using the magnitudes and the angles together, resulted in very similar results. The voltage magnitudes are shown in the top panel of Fig. 2. For preprocessing, the nominal per-unit quantity 1 was subtracted to construct the data matrix Z. To verify that the proposed algorithms can accurately recover missing measurements, 5% of the entries in Z were randomly removed. We also manually identified the outliers in Z as the part from the beginning of an event to when the system returns to its steady state. Let O denote the set of the time indices of the events (outliers). Then, Z| O and Z| O c represent the outlier and the inlier columns of Z, respectively. The middle panel of Fig. 2 shows the voltage magnitudes reconstructed using the algorithm in Table 3. The memory budget M was set to 100, and λ = 10 −2 , µ = 5 × 10 −3 , ρ t = 10 −2 , and δ 2 = 10 −6 were used. The initial dictionary D 0 was constructed by randomly choosing M 0 = 5 columns from Z| O c . The maximum rank R of the low-rank matrix C was set to 10 based on the singular value distribution of Z. It can be seen that the reconstructed voltage magnitudes match well with the true values. The bottom panel of Fig. 2 shows the sparse error component E estimated from the algorithm. It is seen that the estimated sparse errors indicate the disturbance events accurately.
To check the convergence of the algorithm, the instantaneous objective h(b t , e t , A t , z t , t ; D t ) in (26) is plotted in the top panel of Fig. 3. The objective rapidly converges both initially and after each disturbance event. In the bottom panel of Fig. 3, the Frobenius norm of the difference of the consecutive iterates of {A t }, namely ∥A t −A t−1 ∥ F , is plotted. It can be seen that whenever there is a disturbance in z t , A t tries to track it. However, as the effects of the disturbances vanish quickly as seen in Fig. 2, A t captures only the nominal structure, and the disturbances are detected by the sparse error component.
Next, the performance of the proposed algorithms is compared with that of benchmark algorithms. Specifically, the online robust principal component analysis (ORPCA) algorithm [29] and the improved version [54] of the robust online subspace estimation and tracking algorithm (ROSETA) [55] are considered. The ORPCA objective in [29,Eq. (6)] is very similar to the objective function of (21), but it does not account for missing observations. Thus, the ORPCA formulation is modified here, which turns out to be the same as (21) with D = I. As for the improved ROSETA (iROSETA), it can be verified that the objective function can be expressed as where A is again M × R. Note also that µ is a parameter to be specified, whereas γ is adapted automatically by the algorithm. It can be seen that (33) lacks the regularizers on A and b t , which means that the rank of the subspace is fixed to R specified, and R no longer plays the role of the maximum rank. In the experiment, we set the values of R for the RSC and ORPCA algorithms equal to 10, but for iROSETA, both R = 5 and R = 10 were tested. Fig. 4 depicts the receiver operating characteristic (ROC) of detecting the disturbance events. Each curve shows the trade-off between the true positive rate and the false positive rate. The false positive rate is the probability that the algorithm erroneously detects an anomaly when the grid is in the normal condition. The true positive rate is the probability that the algorithm correctly detects a disturbance event during such an event. When a disturbance event happens, the sparse error vector e t will have non-zero entries. To compute the true positive rate, we assume that the event is detected if any of the sparse error vectors during the event duration is non-zero. The  curves with the triangle, circle, and square markers depict the results obtained by the batch RSC, online RSC, and ORPCA algorithms, respectively. The curves with the cross markers correspond to the performance of iROSETA, where the solid curve is for R = 5 and the dash-dot for R = 10. Each curve was obtained by varying µ (as well as λ for ORPCA and online RSC) in the corresponding algorithm. As expected, the batch RSC algorithm performs the best. Among the online algorithms, the online RSC algorithm is seen to be superior to other algorithms. In the case of iROSETA, the one with R = 5 yields better performance than R = 10, as the true rank of Z is in fact closer to 5.
We also compared the reconstruction performance of the missing observations. To do this, the average normalized mean-square error (NMSE) between the true and the reconstructed measurements is computed for the missing observations outside the disturbance events. (Recall that the missing observations cannot be reconstructed for outliers.) Specifically, the average NMSE is defined as  whereẐ = DC. Since λ and µ not only affect the NMSE but also the event detection performance, one must hold the event detection performance of the algorithms at the same level for fair comparison. In Fig. 5, the average NMSEs are compared at different false positive rates. It can be seen again that among the online algorithms, online RSC achieves the smallest NMSEs across all false positive rates less than 0.5.
To see why online RSC performs better than ORPCA in terms of both event detection and missing entry estimation even though the formulations of the two algorithms are very similar, the evolution of the estimated low-rank subspaces is examined. In Fig. 6, the top panel shows the differences ∥A t − A t−1 ∥ F of successive iterates of A t in ORPCA, where A t spans the low-rank subspace at each time t. Likewise, the variation of the subspace in online RSC can be captured by Fig. 6. Recall that the difference of the two algorithms lies in that in online RSC, dictionary D t is continually estimated, while in ORPCA, essentially an identity matrix is used as the dictionary. It can be observed from the figure that ORPCA takes much more time to stabilize than online RSC, both initially as well as after disturbance events, even though ORPCA employs computationally costly block coordinate descent (BCD) while online RSC uses SGD. It can be deduced that the dynamic dictionary of online RSC with efficient sparsification and pruning strategies significantly improves the stability of the algorithm in the presence of outliers, which contributes to the performance.

2) AN IEEE 68-BUS SYSTEM
The proposed algorithm was tested with a data set for a larger power system. The data set was generated for an IEEE 68-bus system using the GridSTAGE (Grid Spatio-Temporal Adversarial scenario GEneration) simulator, a multivariate spatio-temporal data generation tool for simulating adversarial scenarios [56], [57]. A sampling rate of 50 samples per second and an SNR of 92 dB were used. A load change event started at t = 1 second and ended at t = 1.25, which is repeated once more from t = 31 to t = 31.25. Note that even though an event occurs during 0.25 seconds, it takes about 10 seconds for the grid to return to its steady state. Thus, we view the entire period from the start of an event to the return to the steady state as the disturbance duration. Also, while the data set contains voltage magnitudes, voltage phase angles, and frequencies, only the frequencies (shown in the top panel of Fig. 7) were utilized in our experiment to show the various availability of the algorithm. The nominal frequency 60 Hz was subtracted from each entry when constructing the data matrix Z for preprocessing. To test the performance of estimating missing observations, 5% of the entries in Z were randomly obliterated.
The middle panel of Fig. 7 depicts the reconstructed one from the incomplete frequency measurement. Parameters were set as λ = 10 −4 , µ = 10 −6 , and ρ t = 10 −2 . The sparsification and pruning procedures were utilized with δ 2 = 10 −6 . It can be seen that the reconstructed frequencies match the true ones faithfully. The bottom panel of Fig. 7 depicts the sparse error matrix E, where the disturbance events are well captured.
The average NMSE performance of the online RSC, ORPCA, and iROSETA algorithms is shown in Fig. 8. The rank parameter R was set to 30 for online RSC, 50 for ORPCA, and 20 for iROSETA, which correspond to the best NMSE values for the individual algorithms. It is observed that online RSC achieves the best performance when the false positive rate is smaller than 0.2. It should be noted that operating with the false positive rate higher than 0.2 is hardly useful as the true positive rate of all three algorithms already reach 100% when the false positive rate is as small as 1%. Compared to the NMSE performance for the 23-bus system  shown in Fig. 5, it is also observed that the NMSE performances for different algorithms are similar here. We suspect that this is due to the higher redundancy in the data resulting from a larger number of channels.

B. TESTS WITH REAL PMU DATA
The algorithm was tested with real PMU data acquired from the Texas ERCOT grid [58]. The data set is comprised of the measurements from four PMU stations, located at the Harris 69 kV Substation, McDonald Observatory, the University of Texas-Pan American, and Brazos Electric in Waco, taken for one hour at the rate of 30 samples per second. The data set contains the voltage magnitudes, phase angles, and frequencies. The voltage magnitudes were preprocessed by subtracting the mean and also dividing by the mean value. For the phase angles, the first bus was chosen as a reference, and the angle differences between other buses and the reference bus were calculated. Then, the mean of the differences were subtracted. For the frequencies, the nominal value 60 Hz was subtracted from the frequency measurements. The resulting data matrix is repeated once more to construct Z corresponding to a 2-hour duration.
The second half of Z, containing the voltage magnitudes, phasor angles, and frequencies, is plotted in the top panel of Fig. 9. It is seen that the real data contain a few outliers, which may be due to actual events or corrupt measurements. The middle panel corresponds to the reconstruction from the online RSC algorithm using the data with 5% of the entries randomly missing. The bottom panel shows the sparse errors. We used λ = 1, µ = 5 × 10 −4 , δ 2 = 10 −6 , and ρ t = 10 4 . The memory budget was set as M 0 = 5 and M = 100. It is seen that the sparse error captures all the prominent outliers. It also produces small nonzero values throughout, which may indicate events worth looking or could simply be ignored through appropriate thresholding, depending on the detailed requirements of grid monitoring. Fig. 10 depicts the performance of recovering incomplete measurements as the missing percentage varies from 1% to 20%. The three curves correspond to the NMSE performances of online RSC, ORPCA, and iROSETA, which were obtained by fixing the false positive rates around 0.1, and varying µ and λ. The values of R were set to 10 for the online RSC and ORPCA, and to 5 for iROSETA, respectively, based on the singular value distribution of Z. As can be seen, the online RSC algorithm performs much better than others. It can be noted that the recovered missing entries are quite accurate, even though the number of PMUs is small and realistic noise is included in the data. VOLUME 11, 2023

V. CONCLUSION
A RSC formulation has been proposed that can reconstruct missing measurements and detect corrupt entries based on a union-of-subspaces structure present in the data. Both batch and online algorithms have been derived with convergence guarantees. The online RSC algorithm enjoys low computational complexity and small memory footprint-suitable for real-time processing of large-scale streaming data. To construct a representative yet compact dictionary for capturing the subspaces, online sparsification and pruning methods were also proposed. The algorithms were applied to synchrophasor measurement data for power grid monitoring. The numerical tests performed on simulated and real PMU data validated the effectiveness of the proposed algorithms. In particular, the online RSC algorithm with sparsification and pruning strategies was shown to achieve a performance on par with the batch counterpart using low-complexity updates, faithfully reconstructing missing entries and accurately capturing disturbance events. The performance of the proposed algorithm was also shown to outperform existing alternatives. Future research directions include incorporating nonconvex spectral regularizers for better performance [47], building event classifiers based on the reduced-dimensionality features, and distributed implementation for scalability.

APPENDIX A PROOF OF PROPOSITION 1
The claim can be proved by the straightforward application of C 1 = R M ×T × {P (Z)}, C 2 = R (M +N )×T , G 1 (x) = 0, and G 2 (z) = ∥C∥ * + µ∥P (E)∥ 1 . It is noted that is a variable constrained to be equal to constant P (Z). Clearly G 1 and G 2 are closed and convex, and the optimal solution set of (5) can be assumed nonempty. ■

APPENDIX B PROOF OF PROPOSITION 3
The convergence can be established by viewing the SGD update as an instance of the stochastic successive upper-bound minimization (SSUM) algorithm for solving (28) [60]. To use the SSUM algorithm, one needs to construct a tight upper-boundĝ(A,Ā, z t , t ) of g(A, z t , t ) at A =Ā. To do this, first note that the solution (b t , e t ) to (27) is unique and thus Danskin's theorem can be invoked to assert that the gradient of g with respect to A is given by Since z t is bounded, ∇ A g is Lipschitz continuous. Let L be the Lipschitz constant. Then, it can be shown that g(A,Ā, z t , t ) := g(Ā, z t , t ) is a tight upper-bound of g(A, z t , t ). That is, g(A,Ā, z t , t ) ≥ g(A, z t , t ) for all A.
Furthermore, it is verified thatĝ is strongly convex in A, g andĝ are continuous in A, and g,ĝ, and their derivatives are bounded. Then, the SSUM iterates are given by where ρ t = 1/Lt, which is exactly the update in (29). Thus, A t is guaranteed to converge to the stationary point of (28) almost surely [60,Theorem 1]. ■