Structural Tensor Learning for Event Identification With Limited Labels

The increasing uncertainty of distributed energy resources raises the risks of transient events for power systems. To capture event dynamics, Phasor Measurement Unit (PMU) data is widely utilized due to its high resolution. Notably, Machine Learning (ML) methods can process PMU data with feature learning techniques to identify events. However, existing ML-based methods face the following challenges due to salient characteristics of both the measurements and the labels: (1) PMU streams have a large size with redundancy and correlations across the temporal, spatial, and measurement-type dimensions, yet existing work cannot effectively uncover the structural correlations to remove redundancy and learn useful features. (2) The number of event labels is limited, but most models focus on learning with labeled data only, suffering risks of non-robustness to different system conditions. To overcome these issues, we propose an approach called Kernelized Tensor Decomposition and Classification with Semi-supervision (KTDC-Se). First, we show that the key is to tensorize data storage, filter information via decomposition, and learn discriminative features via classification, which leads to an efficient exploration of structural correlations via high-dimensional tensors. Second, the proposed KTDC-Se can incorporate rich unlabeled data to seek decomposed tensors invariant to varying operational conditions. Third, we make KTDC-Se a joint model of decomposition and classification so that there is no biased selection of the two steps. Finally, to boost the model accuracy, we add kernels for non-linear feature learning. We demonstrate the superiority of KTDC-Se over state-of-the-art methods for event identification using PMU data.

I. INTRODUCTION
To cope with the growing uncertainty and maintain system stability, a power system requires advanced tools for event identification. To capture event dynamics, Phasor Measurement Units (PMUs) provide synchronized phasor measurements with high granularity (e.g., 30 or 60 samples per second) [1]. Therefore, PMU-based power system event identification is one of the central topics for improving system reliability. With synchrophasors recording system dynamics, many efforts analyze measurement patterns to identify when, where, and what type of events occur. To find the event initiation time, methods like change point detection [2] can detect abnormal intervals that imply events (e.g., line trip, generator trip, single-phase-to-ground fault, phase-to-phase fault, and three-phase fault). However, to learn more about event types and locations, analyzing high-dimensional PMU streams effectively remains challenging.
One idea to find event types and locations is to use expert information. For example, one can use signal transformation or filtering to map the time-series data into some physically meaningful domain for comparison with predefined thresholds. These methods use wavelet transformation [3], Kalman filtering [4], and Swing Door Trending (SDT) [5], etc. For example, [5] utilizes a swing door to compress data with a pre-defined door width, and the detectable events must have a certain level of slope rate. However, as these methods need to pre-define some measures or thresholds, their usage may be biased by the specific design and test cases. Therefore, can we have a general model?
To obtain a general form, previous work proposes to use existing events and their labels to train in a Supervised Learning (SL) manner. Such Machine Learning (ML) models typically extract features to minimize a loss function. For instance, Decision Tree (DT) [6] treats each measurement as a factor determining the final decision. Although transparent, such a method is inefficient in exploiting complex measurement correlations. Therefore, [7] proposes Support Vector Machine (SVM) to assign each input measurement a weight to form the final feature. There are also more complex and powerful models such as Convolutional Neural Network (CNN) [8] and Graph Neural Network (GNN) [1]. They consider the spatial correlations with square and graph convolutions, respectively. One can also couple the temporal information in Long Short-Term Memory (LSTM) units. For example, [9] uses LSTM to extract periodic patterns and data inertia in time. However, for PMU data, it is desirable to simultaneously consider correlations among the spatial, temporal, and measurement-type dimensions. One could keep increasing the model complexity, but PMU streams quickly accumulate into terabytes (TBs) of training data due to high volumes, large dimensionality, and complex correlations among the space, time, and measurement-type (e.g., voltage magnitude, angle, frequency, etc.) dimensions.
To make the computation feasible, e.g., for real-time analysis, past work pre-processes PMU data with various dimension reduction techniques before the learning phase [10], [11], e.g., Principal Component Analysis (PCA) [12] and Independent Component Analysis (ICA) [13]. The pre-processing can also select core statistics, including mean values [14] and wavelet basis coefficients [15], etc. We further categorize these data processing techniques by their treatments along the temporal and spatial dimensions. In the time domain, [16] utilizes Kullback-Leibler (KL) divergence to measure the distribution change of time-series measurements to detect events. [17] studies wavelet transformation families to transform signals into features. Then, these features are input to K-Nearest Neighbors (KNN) and Naive Bayes (NB) to classify events. Similarly, [18] employs wavelet transformation and Kalman filtering to process frequency measurements and produce features. [19], [20] utilize the Teager-Kaiser energy operator on PMU streams to identify event time and locate events. Subsequently, PMUs that are close to events are selected as input to KNN and LSTM.
In the spatial domain, [21] converts the frequency and angle measurements into an image and utilizes CNN to capture the local features. [22] develops a graph theory-based network partitioning algorithm to group different buses and locate events. More studies investigate both dimensions together. [23], [24] develop a Graph Signal Processing (GSP) method to capture temporal correlations of PMU measurements and reorganize PMUs based on the magnitude of the temporal correlations. This reorganization enables the subsequent supervised learning model to better extract spatial information and identify events. Some researchers store data in a 2-dimensional matrix and study the corresponding matrix-based algorithms. [25] develops the two-Dimensional Orthogonal Locality Preserving Projection (2-D-OLPP)-based method to extract spatial-temporal correlations. [26] calculates the subspace of the PMU data matrix and produces measures over the subspace for event identification. [27], [28], [29] employ PCA on the data matrix to yield compact feature vectors. In general, these methods are diversified and have different feature extraction capacities based on their own biases. There is a need for a unified and efficient design to capture the complex correlations of synchrophasor data among the space, time, and measurement-type domains.
In this paper, we propose to merge the two steps of 3-D dimension reduction and supervised learning into one step to avoid bias and improve efficiency. The key idea is to define a structure that holds key information in different dimensions, e.g., the measurement types ignored by [8], [14], [15]. Based on this idea, we design a tensor learning framework to extract physically meaningful features with simple computations. Fig. 1 shows the design, which naturally includes the temporal, spatial, and measurement-type dimensions. The classification process in the tensor learning directly provides the classification results, while the dimension reduction for efficiency is via tensor decomposition. Tensor unfolding can then convert the decomposed core tensor to vectors for classification, enabling the end-to-end model in Fig. 1. While tensor learning can extract key information quickly and systematically, we also want to preserve the nice property of nonlinear feature extraction, as in CNN. For this purpose, we mathematically derive a kernelization of the classifier to process non-linear physical relationships in power systems.
While the proposed model is highly efficient, many realistic cases do not have enough labels for training. For example, [30] notes that out of 1,013 PMU events recorded by a utility, only 84 events are labeled. Limited labeled data, however, decreases the learning accuracy. One idea is to employ semi-supervised learning to take advantage of widely available unlabeled data in power systems, shown in the middle part of Fig. 1. But there are challenges in integrating semi-supervised learning and kernel-based tensor learning. For example, we need to enforce the same decomposition model for labeled and unlabeled data. We achieve the integration by aggregating all 3-D PMU tensors into a 4-D tensor, where the additional dimension is the number of PMU tensors. Then, a direct tensor decomposition of the 4-D tensor maintains the same decomposition parameters for all 3-D tensors, whether the data is labeled or not. To train the resulting Kernelized Tensor Decomposition and Classification with Semi-supervision (KTDC-Se) model, we develop an efficient coordinate descent method. Finally, when the number of tensors is significantly large, we modify our training method into a mini-batch-based scheme to save computational storage.
For the numerical verification, the KTDC-Se method is tested extensively under various conditions on the Illinois 200-bus system, the South Carolina 500-bus system [31], and realistic data sets from our utility partners. These conditions include different loading conditions, PMU penetrations, etc. The benchmark methods include various supervised and semi-supervised learning approaches, and cross-validation is used for evaluating model accuracy. The results show that our proposed method can efficiently obtain highly accurate event identification and localization in large systems with many data streams coming from PMUs. In general, we have the following contributions:
• Design the KTDC-Se model to incorporate massive labeled and unlabeled PMU streams, employing tensors to uncover the complex multi-dimensional correlations and create compact and informative features to identify events;
• Derive a fast coordinate descent algorithm and its variational mini-batch version to train our KTDC-Se; and
• Conduct extensive experiments to demonstrate the high performance of KTDC-Se over other models with synthetic and real-world datasets.
To emphasize the contributions, we summarize the basic principles of our designs as follows. (1) The proposed model employs tensors to explore high-dimensional correlations and create more compact and informative features for accurate and fast inference. (2) The proposed model is a unified framework for dimension reduction and supervised learning, which prevents the bias induced by separate steps and metrics. (3) The proposed model can take in rich unlabeled data for better performance.
The remainder of the paper is organized as follows: Section II introduces the notations and tensor preliminaries for the model. Section III defines the problem. Section IV proposes our KTDC-Se. Section V illustrates the learning algorithm. Section VI conducts experiments for the baselines and KTDC-Se, and Section VII concludes the paper.

II. NOTATIONS AND PRELIMINARIES
To integrate PMU data reduction and machine learning in one tensorized framework, we first introduce the basics of tensor algebra and the corresponding notations [32], [33]. Table I presents the basic notations for the different types of variables and operations.

A. Tensor Notations
Multi-mode data can be stored in a so-called tensor [34], a multi-dimensional array. The number of dimensions of a tensor is referred to as its order. For example, a scalar is a 0-order tensor, a vector a 1-order tensor, and a matrix a 2-order tensor. A D-order tensor is then written as X ∈ R^{I_1×I_2×···×I_D}, where I_n denotes the size of the nth mode.

B. Tensor Operations
There are many types of operations for a tensor, such as folding, unfolding, and products. This subsection introduces the operations used in the KTDC-Se method.
1) Mode-n Unfolding of a Tensor: A tensor can be unfolded into a matrix, a process also known as matricization. Specifically, for X ∈ R^{I_1×I_2×···×I_D}, the mode-n unfolding X_(n) ∈ R^{I_n×(I_1···I_{n-1}I_{n+1}···I_D)} arranges the mode-n fibers of the tensor as its columns. Mathematically, the result is X_(n)(i_n, j) = X(i_1, . . ., i_D), where X(i_1, . . ., i_D) denotes the (i_1, . . ., i_D)th entry of tensor X and j indexes the remaining modes in a fixed order.
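The unfolding above can be sketched in a few lines of numpy. This is only an illustrative sketch: numpy's row-major `reshape` fixes one particular ordering of the columns, and other conventions permute them.

```python
import numpy as np

def unfold(X, n):
    """Mode-n unfolding: move mode n to the front, then flatten the rest.

    Returns a matrix of shape (I_n, product of the remaining mode sizes).
    """
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

# A 3-order tensor with modes (time, PMU, measurement type).
X = np.arange(24).reshape(2, 3, 4)
print(unfold(X, 0).shape)  # (2, 12)
print(unfold(X, 1).shape)  # (3, 8)
print(unfold(X, 2).shape)  # (4, 6)
```

Each row of `unfold(X, n)` collects all entries sharing one index of mode n, which is exactly the fiber arrangement described above.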
2) n-Mode Product: For a tensor X ∈ R^{I_1×I_2×···×I_D} and a matrix U ∈ R^{K×I_n}, the n-mode product Y = X ×_n U is denoted as: Y(i_1, . . ., i_{n-1}, k, i_{n+1}, . . ., i_D) = Σ_{i_n=1}^{I_n} X(i_1, . . ., i_D) U(k, i_n), or equivalently Y_(n) = U X_(n).
3) Tensor Tucker Decomposition: For a D-order tensor X ∈ R^{I_1×···×I_D}, one key research topic is to find an approximation using a set of small tensors. For example, PMU data is of high volume and low rank [35]. Thus, a low-rank approximation is preferred to efficiently represent the PMU data and remove the redundant information. This can be achieved via tensor decomposition. Specifically, the so-called Tucker decomposition is [32]: X ≈ G ×_1 U_1 ×_2 U_2 ··· ×_D U_D = Σ_{r_1=1}^{R_1} ··· Σ_{r_D=1}^{R_D} G(r_1, . . ., r_D) u_{r_1}^1 ◦ ··· ◦ u_{r_D}^D, where G ∈ R^{R_1×···×R_D} is the core tensor in the factorization, U_i ∈ R^{I_i×R_i} is a base matrix along mode i, u_{r_i}^i is the r_i th column of U_i, and ◦ is the outer product.
Finally, the Tucker decomposition can be rewritten in a matrix format: X_(n) = U_n G_(n) (U_D ⊗ ··· ⊗ U_{n+1} ⊗ U_{n-1} ⊗ ··· ⊗ U_1)^T, where ⊗ is the so-called Kronecker product and (·)^T represents the matrix transpose. In summary, the introduced tensor operations lay the foundations for the KTDC-Se integrated model with certain physical interpretations. Specifically, tensor decomposition provides efficient feature extraction while maintaining certain physical structures in the core tensor G. Further, tensor unfolding converts G to vectors that can be input to a classifier, which enables an end-to-end model of decomposition and classification.
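The matrix format of the Tucker decomposition can be checked numerically. The sketch below builds a small Tucker model and verifies the mode-1 identity; note that with numpy's row-major unfolding the Kronecker factors appear in the reverse order compared to the column-major convention written in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def unfold(X, n):
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def mode_n_product(X, U, n):
    # Contract mode n of X with the columns of U; the result has U.shape[0] on mode n.
    return np.moveaxis(np.tensordot(U, X, axes=(1, n)), 0, n)

# Small Tucker model: core G and base matrices U1, U2, U3.
G = rng.standard_normal((2, 3, 4))
U1 = rng.standard_normal((5, 2))
U2 = rng.standard_normal((6, 3))
U3 = rng.standard_normal((7, 4))
X = mode_n_product(mode_n_product(mode_n_product(G, U1, 0), U2, 1), U3, 2)

# Matrix form of the decomposition (row-major unfolding reverses the
# Kronecker order relative to the column-major convention in the text).
lhs = unfold(X, 0)
rhs = U1 @ unfold(G, 0) @ np.kron(U2, U3).T
print(np.allclose(lhs, rhs))  # True
```

The same check works for any mode after permuting the Kronecker factors accordingly.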

III. PROBLEM FORMULATION
In this section, we define the target problem using the above notation. Before introducing the formulation, we first identify the study scope with the following points.

A. Data Preparation of PMU Tensors
For PMU streams, we follow the idea of [1], [14], [23] to extract a window of PMU signals with sufficient information to indicate the event dynamics for training a classifier. As shown in Fig. 2, each window of data can be formalized into a PMU tensor X ∈ R^{T×L×M}, where T denotes the number of time slots in each window, L denotes the number of PMUs, and M denotes the number of measurement types (e.g., Voltage Magnitude (VM), Voltage Angle (VA), Frequency (F), etc.). Further, with a slight abuse of notation, the total N PMU tensors are denoted as the total event tensor X ∈ R^{N×T×L×M}, and we assume the first H (H < N) sub-tensors along the first dimension have labels. Correspondingly, the label vector is denoted as y ∈ Z^{H×1}. In addition, to make the tensor data comparable between different measurement types, we implement normalization to restrict each entry of the tensor to the range [0, 1]. More specifically, the normalization happens entry-wise across the total N tensors in Fig. 2: among the N values of a specific entry, the maximum value is mapped to 1, the minimum value is mapped to 0, and intermediate values are correspondingly transformed into (0, 1).
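The entry-wise normalization across the N windows can be sketched as follows (a minimal sketch; it assumes every entry actually varies across the N windows so the denominator is nonzero).

```python
import numpy as np

rng = np.random.default_rng(1)
# Total event tensor: N windows, T time slots, L PMUs, M measurement types.
N, T, L, M = 8, 15, 4, 3
X = rng.standard_normal((N, T, L, M)) * 10 + 5

# Entry-wise min-max normalization across the N windows, so that different
# measurement types (VM, VA, F) become comparable on a [0, 1] scale.
x_min = X.min(axis=0, keepdims=True)
x_max = X.max(axis=0, keepdims=True)
X_norm = (X - x_min) / (x_max - x_min)  # assumes x_max > x_min entry-wise

print(X_norm.min(), X_norm.max())  # 0.0 1.0
```

In practice one would also guard against constant entries (e.g., by adding a small epsilon to the denominator) before feeding the normalized tensor to the model.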
1) Data Labels: Event Types and Locations: We focus on the tasks of distinguishing event types and locating events. Since we are proposing data-driven approaches without the requirement of domain knowledge, a diversified set of system event types and locations can be considered. For example, in the experiment, we consider five event types, including line trip, generator trip, single-phase-to-ground fault, phase-to-phase fault, and three-phase fault. Further, for each event type, we randomly create 2 different locations in the system. Similar treatments are implemented in many related studies [14], [15], [23], [31], [36].
a) Model Capacity of Tackling Multi-class Events: Our proposed method can handle multi-class events. We emphasize that in the following modeling process, each of our trained classifiers is binary, to make better use of kernel methods with hinge loss for high accuracy, similar to Support Vector Machine (SVM) [37]. Nevertheless, power system event identification is essentially a multi-class classification problem, i.e., each unique combination of an event type and an event location can indicate a unique label. Thus, the one-against-one method is utilized for training multiple classifiers. Specifically, we train multiple binary models at the same time, and their majority vote leads to the final event label, which can be an arbitrary integer. One can refer to [38] for this method, which has better performance than other multi-class SVMs.
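The one-against-one voting scheme can be sketched as follows. The nearest-centroid rule here is only a toy stand-in for the trained kernel classifiers in the paper; the voting logic is the part being illustrated.

```python
import numpy as np
from itertools import combinations

def ovo_predict(binary_predict, classes, x):
    """One-against-one voting: every class pair gets a binary classifier,
    and the class collecting the most votes wins."""
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        winner = binary_predict(a, b, x)   # returns a or b
        votes[winner] += 1
    return max(votes, key=votes.get)

# Toy stand-in for the trained binary models: nearest class centroid.
centroids = {0: np.array([0.0, 0.0]),
             1: np.array([5.0, 0.0]),
             2: np.array([0.0, 5.0])}

def binary_predict(a, b, x):
    da = np.linalg.norm(x - centroids[a])
    db = np.linalg.norm(x - centroids[b])
    return a if da <= db else b

print(ovo_predict(binary_predict, list(centroids), np.array([4.5, 0.5])))  # 1
```

With C classes this trains C(C-1)/2 binary models, matching the description of training multiple binary classifiers whose majority vote yields the final integer label.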
b) Model Capacity of Tackling Cascading Events: Identifying cascading events is important for power systems. For cascading events, the time interval between two events usually lasts for minutes or hours [39]. However, our model considers a time interval within a few seconds (e.g., 0.5 s in our experiments) because our goal is to conduct fast event identification. Thus, one approach is to treat cascading events as independent events. Then, our training process can find the correct decision boundary to classify events, including both single and cascading events. The above process focuses on quickly identifying events but ignores sequentially predicting cascading events. While the latter task is out of our study scope, our model still has a certain capacity to predict cascading failures given enough event files and labels.
Specifically, one can utilize a two-digit label y_1 y_2, where y_1 represents the current event status and y_2 represents the following event in an event sequence. If the current event status is a single event, we can set y_2 = 0, which indicates the following system condition is normal. Subsequently, the two-digit label needs to be converted to an integer by considering all the labels in the training dataset for multi-class classification. Finally, as stated above, we train multiple binary models to vote for the integer label. In general, the above process utilizes a data-driven way to estimate the correlations of cascading events during training for cascading event predictions. As long as we have enough data, our tensor-based information compression and feature extraction lead to efficient event identification and prediction.
The above notations and treatments summarize the common scenarios for power system event identification and how the proposed KTDC-Se model can handle them. In general, we define our problem as follows.
• Problem type: semi-supervised event identification using PMU data.
• Given: a total event tensor X and a label vector y.
• Find: an abstract mapping f(X) = y to compress the information in X and use the compressed information to identify event labels in y.

IV. PROPOSED MODEL
The design of the above mapping f can be diversified. However, existing work suffers from a key challenge: biased selections of data compression and event identification without proper integration. In this section, we design an end-to-end model that makes full use of the tensor structure to achieve fast computations, physical interpretations, high capacity with non-linear feature extraction, and high accuracy under semi-supervision.

A. An Integrated Model With Efficiency and Physical Meanings
For an efficient model, we need to remove the redundancy in PMU measurements. Section I shows the drawback of traditional methods: they can hardly explore the high-dimensional correlations. Further, Section II illustrates that a tensor is a natural container for high-dimensional data and that tensor Tucker decomposition is an excellent approach to uncover cross-dimension correlations. However, it is still unclear how we can design an efficient model to remove redundancy and capture event information for different PMU event tensors, and how we can guarantee the model robustness given a few labeled and rich unlabeled tensors.
For a detailed design, we show the motivation in Fig. 3, where different colors represent different components of data. First, we use light blue to represent the tensor data X and G, and orange, blue, red, and green to represent the decomposed parameter matrices A, B, C, and D, respectively. These colors help to distinguish the types of decomposed tensors. Second, to emphasize the dimensions of the decomposed tensors, we bold the corresponding lines in G and use different colors in B, C, and D to distinguish the bold lines, so the reader can easily see how the dimensions match. Third, we draw the outputs y_1, y_2, and y_3 as circles, where the solid line represents y_1 = 1 and the two types of dotted lines represent y_2 = −1 and y_3 = ? (i.e., unknown).
For each PMU tensor X, the left part of Fig. 3 visualizes the process of a Tucker decomposition into base matrices B, C, and D and a core tensor G. G maintains the same structure as the PMU tensor X, leading to specific physical interpretations. Specifically, the base matrices can be viewed as the bases along different dimensions, and the core tensor G represents the interactions among these bases [11]. Thus, we can assume the bases for different PMU tensors are similar as long as the number of bases is sufficiently large. In contrast, the interaction tensor G contains discriminative event information.
Then, to maximally remove the redundancy, we directly keep the same bases B, C, and D for different PMU tensors during the decomposition, as shown in the middle part of Fig. 3. Namely, we utilize a direct Tucker decomposition of a 4-D total event tensor X, so that B, C, and D are naturally kept the same. Further, we want the core tensor G to contain distinguishing event information. Therefore, we employ the event labels to conduct a supervised classification for dissimilarity maximization. The above procedure is for labeled tensors. For unlabeled tensors, only the decomposition procedure is implemented, which increases the model robustness to different loading conditions. The concrete mathematical model of the joint optimization is formulated in the following subsection.

B. Semi-Supervised Optimization for the Joint Model
In the semi-supervised learning setting, the 4-D total event tensor X in Section IV-A contains all labeled and unlabeled data. Mathematically, we implement the Tucker decomposition for X as follows:

X ≈ G ×_1 A ×_2 B ×_3 C ×_4 D = Σ_{r_1=1}^{N} Σ_{r_2=1}^{R_2} Σ_{r_3=1}^{R_3} Σ_{r_4=1}^{R_4} G(r_1, r_2, r_3, r_4) a_{r_1} ◦ b_{r_2} ◦ c_{r_3} ◦ d_{r_4},   (1)

where G ∈ R^{N×R_2×R_3×R_4} is the core tensor, i.e., the compressed tensor with small information redundancy. The matrices to scale the core tensor are A ∈ R^{N×N}, B ∈ R^{T×R_2}, C ∈ R^{L×R_3}, and D ∈ R^{M×R_4}, and a_{r_1}, b_{r_2}, c_{r_3}, and d_{r_4} are the r_1th, r_2th, r_3th, and r_4th columns of A, B, C, and D, respectively. R_2 < T, R_3 < L, and R_4 < M are the pre-defined dimensions of the reduced tensors to achieve the information compression.
In the decomposition of (1), the first dimension of the core tensor G is fixed to N so that the decomposition still yields N features to represent the different PMU tensors. Furthermore, the decomposition can be rewritten in mode-1 matrix format as:

X_(1) ≈ A G_(1) (D ⊗ C ⊗ B)^T.   (2)

Based on (2), the joint optimization is:

min J = J_1 + γ_1 J_2 + γ_2 J_3,   (3)

where J, J_1, J_2, and J_3 denote the total loss, the reconstruction loss J_1 = ||X_(1) − A G_(1) (D ⊗ C ⊗ B)^T||_F^2, the classification loss J_2 = Σ_{i=1}^{H} l(w^T f_i + b, y_i), and the regularization term J_3 = ||w||_2^2, respectively. E = [I^{H×H}, 0^{H×(N−H)}] ∈ R^{H×N} denotes a selection matrix that selects the first H feature instances (i.e., the instances with labels) in G_(1) for the classification. || · ||_F and || · ||_2 denote the Frobenius norm and the l_2 norm, respectively. l(·, ·) represents the classification loss function; the hinge loss of Support Vector Machine (SVM) is considered in this paper, where f_i^T, the ith row of EG_(1), is the ith instance of the transformed features and y_i is the ith label in y. The hinge loss is defined via [1 − t]_+^p = max(0, 1 − t)^p. Usually, one can take p = 1 or p = 2 for l_1- or l_2-SVM [10], respectively. γ_1 and γ_2 are positive hyper-parameters to reweight the three terms in (3).
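The three-term objective can be sketched numerically as follows. This is a minimal sketch under the assumption that, before kernelization, the classifier is linear with weight vector `w` and bias `b` (the kernelized version replaces the inner products, as derived later in the text).

```python
import numpy as np

def ktdc_se_loss(X1, A, G1, B, C, D, w, b, y, gamma1, gamma2, p=2):
    """Sketch of the joint objective: reconstruction + hinge classification
    + regularization. y holds labels (+1/-1) for the first H instances only."""
    H = len(y)
    # J1: Tucker reconstruction error in mode-1 matrix form.
    recon = A @ G1 @ np.kron(np.kron(D, C), B).T
    J1 = np.linalg.norm(X1 - recon, 'fro') ** 2
    # J2: hinge loss on the labeled features (selection E keeps rows 0..H-1 of G1).
    F = G1[:H]
    margins = y * (F @ w + b)
    J2 = np.sum(np.maximum(0.0, 1.0 - margins) ** p)
    # J3: l2 regularization of the classifier weights.
    J3 = np.dot(w, w)
    return J1 + gamma1 * J2 + gamma2 * J3

# Toy dimensions consistent with the notation (N windows, H labeled).
rng = np.random.default_rng(2)
N, T, L, M, R2, R3, R4 = 6, 5, 3, 2, 2, 2, 2
A = rng.standard_normal((N, N))
G1 = rng.standard_normal((N, R2 * R3 * R4))
B, C, D = (rng.standard_normal((T, R2)), rng.standard_normal((L, R3)),
           rng.standard_normal((M, R4)))
X1 = rng.standard_normal((N, M * L * T))
w, b = rng.standard_normal(R2 * R3 * R4), 0.1
y = np.array([1.0, -1.0, 1.0, -1.0])
print(ktdc_se_loss(X1, A, G1, B, C, D, w, b, y, gamma1=1.0, gamma2=0.1) >= 0)  # True
```

Note that unlabeled windows (rows H..N-1 of G1) contribute only through the reconstruction term, which is exactly how the semi-supervised data enters the objective.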

C. Efficient Kernelization for Powerful Non-Linear Feature Extractions
In the last two subsections, we successfully merged the data reduction and machine learning models into one optimization under semi-supervision. However, many PMU measurements have non-linear correlations, and adding non-linearity is challenging due to the computational cost. For example, adding sigmoid or polynomial functions to the loss function l in (3) significantly increases the computations. This motivates the usage of the kernel trick for nonlinear feature calculations. Specifically, (i) the kernel trick shows that we can find a kernel function such that the inner product computations, i.e., the main calculation procedure for the loss in (3), can be conducted in the original feature space rather than on the features transformed by polynomial or sigmoid functions. (ii) The input of the classifier in (3) is a feature vector f with dimension r = R_2 · R_3 · R_4. If we use d-degree polynomial features to represent the original features, we obtain an r^d-dimensional feature vector φ(f). The inner product over the r^d-dimensional features then involves r^d multiplications and r^d − 1 summations, i.e., 2r^d − 1 operations in total, which is expensive. (iii) However, the kernel trick enables another computation procedure with a much smaller cost. Let f_1 and f_2 denote two r-dimensional features. One only needs the inner product within the original r-dimensional space, with r multiplications and r − 1 summations; based on the kernel trick, only d extra multiplications are then needed to obtain the same result as in (ii). In general, the total operation count is 2r − 1 + d, which saves a lot of computational resources compared to (ii). Thus, we propose to utilize kernel functions to lift the data to a high-dimensional or even infinite-dimensional feature space, while the kernel trick lets the calculation happen in the original data space, which easily maintains efficient calculations [40].
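The equivalence behind the kernel trick can be verified on a small example: for the degree-2 polynomial kernel with offset c = 1, an explicit (weighted) feature map reproduces the kernel value exactly, while the kernel evaluation itself stays in the original r-dimensional space. The feature map below is a standard construction, not taken from the paper.

```python
import numpy as np
from itertools import combinations_with_replacement
from math import sqrt

def phi_poly2(f):
    """Explicit degree-2 polynomial feature map (with c = 1) whose inner
    product reproduces the kernel (f1.f2 + 1)^2."""
    feats = [1.0]
    feats += [sqrt(2) * v for v in f]                 # linear terms
    for i, j in combinations_with_replacement(range(len(f)), 2):
        coef = sqrt(2) if i != j else 1.0             # cross terms
        feats.append(coef * f[i] * f[j])
    return np.array(feats)

f1 = np.array([0.5, -1.0, 2.0])
f2 = np.array([1.5, 0.3, -0.7])

lifted = phi_poly2(f1) @ phi_poly2(f2)   # work grows with the lifted dimension
kernel = (f1 @ f2 + 1.0) ** 2            # O(r + d) work via the kernel trick
print(np.isclose(lifted, kernel))  # True
```

The lifted vector already has 10 entries for r = 3; for realistic r = R_2 · R_3 · R_4 and higher degrees d, evaluating the kernel directly is dramatically cheaper, which is the point of (iii) above.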
Specifically, due to the Representer theorem [41], the classifier weight can be expanded as w = Σ_{i=1}^{H} α_i φ(f_i), so the inner product of the classification model can be rewritten as w^T φ(f) = Σ_{i=1}^{H} α_i k(f_i, f), where each α_i ∈ R is a variable in the feature space. Based on this equation, the kernelized learning process is:

min J = J_1 + γ_1 Σ_{i=1}^{H} [1 − y_i (Σ_{j=1}^{H} α_j k(f_j, f_i) + b)]_+^p + γ_2 α^T K α,   (4)

where α is the vector of all α_i s and K is the kernel matrix with K(i, j) = k(f_i, f_j). Eventually, we re-emphasize the nice properties of KTDC-Se: (1) using tensors to capture multi-dimensional correlations, (2) proposing a joint model for decomposition and classification, (3) introducing kernels for non-linear features, and (4) conveniently tackling both labeled and unlabeled data.
1) Treatment for New Data: If other types of events come without label information and never appear in the historical dataset, our model cannot directly output the event type and location, since this scenario is beyond our study scope. However, we can assign these new events a new label meaning "to be determined". By doing this, we can view these new events as labeled data and input them into our model. This procedure is helpful in providing guidance.
2) Treatment for Imbalanced Datasets: For power systems, the frequency of different events can vary significantly, leading to an imbalanced dataset. The classifier may then be biased towards event types with more data. To mitigate the imbalance issue, there are approaches including reweighting [42], [43], [44], cost-sensitive training [45], [46], [47], and learning rate adaptation [48]. The reweighting technique aims to balance the different classes by modifying the weights of the samples. Cost-sensitive training assigns higher costs to misclassifying the minority class. Learning rate adaptation adjusts the learning rate for gradient-based methods and encourages more weight updating for the minority class. All three types of methods are applicable to our framework.

V. LEARNING ALGORITHM
The optimization in (4) is non-convex. Thus, an alternating optimization algorithm is proposed that updates each individual variable while fixing the others, i.e., the so-called coordinate descent method. This method is prevailing in the domain of tensor learning due to its efficiency and good convergence properties [10], [32], [49]. Further, to calculate the gradients and avoid non-differentiability, the l_2 SVM [10], [50] is utilized, i.e., p = 2 in the loss function l(·, ·). To update each variable, KTDC-Se only needs to calculate the gradient with respect to that variable and apply gradient descent. The calculation of the gradients is shown as follows.

A. Gradients of Matrices A, B, C, and D
1) Gradient of A: Based on the matrix format of the Tucker decomposition and for the convenience of later derivations, we define H_1 = G_(1) (D ⊗ C ⊗ B)^T. Then, the gradient of the loss function in (4) with respect to A is:

∇_A J = −2 (X_(1) − A H_1) H_1^T,   (5)

which follows from expanding the Frobenius norm of the reconstruction loss via the matrix trace tr(·).

2) Gradients of B, C, and D: By symmetry, H_2 = G_(2) (D ⊗ C ⊗ A)^T, H_3 = G_(3) (D ⊗ B ⊗ A)^T, and H_4 = G_(4) (C ⊗ B ⊗ A)^T can be defined. Then, ∇_B J, ∇_C J, and ∇_D J are calculated as follows:

∇_B J = −2 (X_(2) − B H_2) H_2^T,  ∇_C J = −2 (X_(3) − C H_3) H_3^T,  ∇_D J = −2 (X_(4) − D H_4) H_4^T.   (6)

B. Gradient of Mode-1 Unfolding Matrix G (1) of the Core Tensor
To update G_(1), the following gradients are derived separately. For the reconstruction loss, denote H = (D ⊗ C ⊗ B). Then, the gradient is:

∇_{G_(1)} J_1 = −2 A^T (X_(1) − A G_(1) H^T) H.   (7)

For the classification loss, denote ŷ_i = Σ_{j=1}^{H} α_j k(f_i, f_j) + b for simplification. Based on the chain rule, the gradient ∇_{f_j} J_2 requires the kernel derivatives ∂k(f_i, f_j)/∂f_j. To elaborate, polynomial and Radial Basis Function (RBF) kernels are utilized as examples. For the polynomial kernel k(f_i, f_j) = (f_i^T f_j + c)^d, where c is a constant and d is the degree of the polynomial, ∂k(f_i, f_j)/∂f_j = d (f_i^T f_j + c)^{d−1} f_i. For the RBF kernel k(f_i, f_j) = exp(−λ ||f_i − f_j||_2^2), where λ is a positive constant, ∂k(f_i, f_j)/∂f_j = 2λ (f_i − f_j) k(f_i, f_j) can be obtained. Thus, stacking the per-instance gradients gives ∇_{G_(1)} J_2 = [∇_{f_1} J_2, . . ., ∇_{f_H} J_2, 0, . . ., 0]^T. For the regularization term, the dependence on G_(1) is through the kernel entries in α^T K α; similarly, ∇_{G_(1)} J_3 = [∇_{f_1} J_3, . . ., ∇_{f_H} J_3, 0, . . ., 0]^T can be obtained. Summing the gradients of the three loss terms brings the total gradient ∇_{G_(1)} J.
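The two kernel derivatives used above can be sanity-checked against finite differences. This sketch assumes the standard polynomial and RBF kernel forms stated in the text (c = 1, d = 3, λ = 0.5 are arbitrary test values).

```python
import numpy as np

def poly_kernel_grad(fi, fj, c=1.0, d=3):
    """Gradient of the polynomial kernel k(fi, fj) = (fi.fj + c)^d w.r.t. fj."""
    return d * (fi @ fj + c) ** (d - 1) * fi

def rbf_kernel_grad(fi, fj, lam=0.5):
    """Gradient of the RBF kernel k(fi, fj) = exp(-lam * ||fi - fj||^2) w.r.t. fj."""
    k = np.exp(-lam * np.sum((fi - fj) ** 2))
    return 2.0 * lam * (fi - fj) * k

# Central finite-difference check of both analytic gradients.
rng = np.random.default_rng(3)
fi, fj, eps = rng.standard_normal(4), rng.standard_normal(4), 1e-6
kernels = [(poly_kernel_grad, lambda a, b: (a @ b + 1.0) ** 3),
           (rbf_kernel_grad, lambda a, b: np.exp(-0.5 * np.sum((a - b) ** 2)))]
for grad, k in kernels:
    num = np.array([(k(fi, fj + eps * e) - k(fi, fj - eps * e)) / (2 * eps)
                    for e in np.eye(4)])
    print(np.allclose(grad(fi, fj), num, atol=1e-4))  # prints True twice
```

These per-pair derivatives are exactly the building blocks that the chain rule combines into ∇_{G_(1)} J_2 for the labeled rows of the core tensor.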

C. Gradients of Classifier Parameters α and b
1) Gradient of α: The learning weight α is coupled with the classification loss and the regularization. For the classification loss with the l_2 hinge, ∂J_2/∂ŷ_i = −2 y_i [1 − y_i ŷ_i]_+. Note that ŷ_j can be explicitly expressed by α, i.e., ŷ_j = k_j^T α + b, where k_j is the jth column vector of the kernel matrix K and the kernel matrix is defined as K(i, j) = k(f_i, f_j), so that ŷ = Kα + b1, where 1 is an all-one column vector. Thus, the above gradient can be written in matrix format. Specifically, define the diagonal matrix I_0 that satisfies I_0(i, i) = 1 if 1 − y_i ŷ_i > 0 and I_0(i, i) = 0 otherwise. Since y_i ∈ {−1, +1}, we have y_i (1 − y_i ŷ_i) = y_i − ŷ_i, and one can obtain the general format ∇_α J_2 = −2 K I_0 (y − ŷ). Further, for the regularization, it is easy to find that ∇_α J_3 = 2 K α. Finally, the total gradient is ∇_α J = −2 γ_1 K I_0 (y − ŷ) + 2 γ_2 K α.
2) Gradient of b: Similarly, we calculate the gradient with respect to b: ∇_b J = −2 γ_1 1^T I_0 (y − ŷ), where ŷ is the vector of all ŷ_i s. With the above derivations, the final learning algorithm and flowchart are presented in Algorithm 1 and Fig. 4.

D. Training on Mini-Batches
The above learning process may suffer storage issues when the number of training data N is large. Specifically, when updating B, C, and D in (6), the Kronecker products used to calculate H_2, H_3, and H_4 require large storage for A ∈ R^{N×N}. To mitigate this issue, we modify Algorithm 1 to train on mini-batches to save memory [51], [52].
Mathematically, we divide the training tensor X and labels y into K mini-batches {X_i}_{i=1}^{K} and {y_i}_{i=1}^{K}, respectively. Each X_i contains some labeled data with labels y_i and many unlabeled data. Then, in each iteration, we update the mini-batch-independent weights Ã, B, C, D, and b. Ã ∈ R^{Ñ×Ñ} is the matrix along the first dimension of each mini-batch tensor, where Ñ = ⌊N/K⌋ and ⌊·⌋ represents the floor function. Notably, keeping Ã the same for different mini-batches is an additional restriction to maintain the similarity of A, which does not appear in the direct training of Algorithm 1. However, this restriction further guarantees that the discriminative information is contained in the core tensors.
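The mini-batch split of the total event tensor can be sketched as follows. This is a simplified contiguous split; in practice one would shuffle or stratify the windows so that every batch keeps some labeled data, as the text assumes.

```python
import numpy as np

def make_minibatches(X, y, K):
    """Split the total event tensor X (labeled windows first) and labels y
    into K mini-batches of floor(N/K) windows each; leftover windows are
    dropped so every batch has the size required by the shared A-tilde."""
    N = X.shape[0]
    n_tilde = N // K                       # floor(N / K)
    batches = []
    for k in range(K):
        idx = np.arange(k * n_tilde, (k + 1) * n_tilde)
        labeled = idx[idx < len(y)]        # only the first H windows carry labels
        batches.append((X[idx], y[labeled]))
    return batches

rng = np.random.default_rng(4)
X = rng.standard_normal((10, 6, 3, 2))     # N = 10 windows
y = np.array([1, -1, 1, -1, 1])            # H = 5 labeled
for Xb, yb in make_minibatches(X, y, K=3):
    print(Xb.shape[0], len(yb))
```

Every batch has exactly Ñ = ⌊N/K⌋ windows, which is what allows the single Ñ × Ñ matrix Ã to be shared across batches.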

E. Testing on Mini-Batches
For the cross-validation process or online testing, we have another total test tensor X̃ ∈ R^{Ñ×T×L×M} that must go through the decomposition and classification to obtain the labels ỹ ∈ Z^{Ñ×1}, where we fix Ñ to be the number of PMU tensors in one mini-batch. The reason for fixing Ñ is that our mini-batch training yields a parameter matrix Ã ∈ R^{Ñ×Ñ} that must be utilized for the test tensor decomposition. Therefore, there should be Ñ PMU tensors in the total test tensor. For real-time testing, if the number of available test PMU tensors is insufficient, we can repeat the testing tensors or utilize data from the historical dataset to complete the testing mini-batch. The testing procedures are as follows.
1) Obtain the Test Feature Matrix G̃_(1): To obtain G̃_(1), we utilize the learned parameters Ã, B, C, and D. Setting the gradient in (7) to 0 yields a closed-form solution, where X̃_(1) represents the mode-1 unfolding of tensor X̃.
2) Predict the Label Vector ỹ: Based on the Representer theorem, we need the learned weights α and b, the historical features in F = G_(1), and the test features in G̃_(1) = [f̃_1, . . ., f̃_Ñ] to predict labels. Specifically, we first calculate a test kernel matrix K̃(i, j) = k(f̃_i, f_j), where f̃_i ∈ G̃_(1) and f_j ∈ F. Then, the predicted labels are obtained accordingly. Remark: Since we have to train multiple binary models, our method requires a certain amount of computation for training. However, model prediction is fast. For example, we show the average training and testing times in Table IV in the experiments. Our method (KTDC-Se) has the second-largest training time and the second-lowest testing time. The fast testing performance arises because we significantly reduce the number of model parameters via the Tucker tensor decomposition. This structural decomposition guarantees that KTDC-Se has compact and informative feature vectors, bringing fast and accurate event identification. Therefore, in realistic applications, we can conduct offline training with more time and computational resources, and the light model can then be deployed quickly for online inference.
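The prediction step above can be sketched as follows (names are illustrative, and the sign thresholding for the binary decision is an assumption; the paper's exact prediction function is given in its (18)):

```python
import numpy as np

def predict(F_train, F_test, alpha, b, kernel):
    """Representer-theorem style prediction sketch:
    K_test(i, j) = kernel(f_test_i, f_train_j), then threshold
    K_test @ alpha + b for the binary label."""
    K_test = np.array([[kernel(fi, fj) for fj in F_train] for fi in F_test])
    return np.sign(K_test @ alpha + b)
```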
We can further improve efficiency by changing the structure of the model design. For example, we can have one base model and restrict all sub-models to share it, where the sub-models represent different classification branches. This multi-branch design enables one-time training for all classifiers. Moreover, it fits easily into our framework because of our gradient-based training algorithms (i.e., Algorithms 1 and 2): the chain rule enables gradient information to be passed from the branches to the base, similar to multi-branch neural networks [53]. Due to the space limit, we leave this topic to future work.

VI. EXPERIMENTS
For validation, we test on synthetic datasets generated from the Illinois 200-bus system and the South Carolina 500-bus system [31], [54]. We also validate our results with realistic data from our utility partners.

A. Dataset Description
We utilize the Illinois 200-bus system and the South Carolina 500-bus system [31] to generate event data. Five event types are considered: line trip, generator trip, single-phase-to-ground fault, phase-to-phase fault, and three-phase fault. For each event type, we consider 2 different event locations. Thus, there are 10 unique combinations of event types and locations, i.e., 10 event labels.
Then, we vary the loading conditions in the simulation to generate diversified event files. In total, we have 80 event files, each of which contains 10 s of event data. Further, we consider the data resolution to be 60 samples per second, yielding 600 samples per event file. To extract tensors from these streams, we cut the PMU streams with a moving window of length 0.5 s (i.e., 30 samples) and a moving gap of 0.083 s (i.e., 5 samples). Fig. 5 illustrates why 0.5 s is an appropriate window length. Specifically, the plot visualizes the PMU streams over time, where the x-axis represents time and the y-axis represents the measurements (VM denotes voltage magnitude, VA denotes voltage angle, and F denotes frequency). We find that a 0.5 s window appropriately includes partial event information to identify events, while not being so long as to prevent fast, real-time detection. Finally, in Section VI-G, we conduct a sensitivity analysis of the window length, which shows that lengths in [0.4, 0.7] s can train an accurate model.
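The window cutting above can be sketched as follows (function names are illustrative; the sketch only shows the mechanics of a 30-sample window with a 5-sample gap over a 600-sample stream, and the paper's exact bookkeeping of which windows are kept as tensors may differ):

```python
import numpy as np

def cut_windows(stream, win=30, gap=5):
    """Slice a (T, channels) PMU stream into overlapping windows:
    window length `win` samples, consecutive starts `gap` apart."""
    T = stream.shape[0]
    starts = range(0, T - win + 1, gap)
    return np.stack([stream[s:s + win] for s in starts])

stream = np.random.randn(600, 4)   # 10 s at 60 samples/s, 4 example channels
windows = cut_windows(stream)      # (600 - 30) // 5 + 1 = 115 windows
```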
In total, we have 5840 PMU tensors. For each PMU tensor, we have T = 30 for the time dimension and L = 200η_1 or L = 500η_1 for the 200-bus and the 500-bus systems, respectively, where η_1 is the PMU penetration ratio.
Finally, we also test our proposed method using real-world PMU data from our partner utility in Arizona, USA. These files contain 5 labels in total, covering 3 types of line faults (single-phase-to-ground fault, phase-to-phase fault, and three-phase fault) at 2 locations. After tensorizing the data from 35 PMUs, we obtain X ∈ R^{511×30×35×3} and y ∈ Z^{511×1}.
Software: For the simulation, we employ a commercial-grade simulator, Positive Sequence Load Flow (PSLF) [55] from General Electric (GE). For model development and validation, we use Python with the PyCharm IDE.
• For the dataset from the utility in Arizona, USA, the total event tensor is X ∈ R^{511×30×35×3} and the input data dimension is 3150. We set R_2 = 6, R_3 = 6, and R_4 = 2, so the feature dimension is 72. For the benchmark methods, we employ PCA to reduce the dimensionality of the raw data and obtain features of dimension 160.
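The dimension figures above follow directly from the tensor shape and the Tucker ranks:

```python
# Each Arizona PMU tensor slice is 30 (time) x 35 (PMUs) x 3 (measurement
# types); the Tucker ranks R2 = 6, R3 = 6, R4 = 2 give the core-feature size.
raw_dim = 30 * 35 * 3    # 3150 values per raw sample
feat_dim = 6 * 6 * 2     # 72 values per decomposed feature
compression = raw_dim / feat_dim
```

This roughly 44x compression is what makes the downstream kernel classifier cheap at test time.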

B. Benchmark Methods
First, we train our KTDC-Se with only the labeled data as a benchmark to demonstrate the impact of the unlabeled data. Further, we employ state-of-the-art Semi-Supervised Learning (SSL) methods as benchmarks. The details of these methods are as follows.
• Deep Residual Network (Resnet) [56]: Resnet is an efficient deep learning model for classification. For this supervised learning model, we utilize only labeled data as a comparison. As PMU data have high dimensionality (e.g., 9000 for the 500-bus system with η_1 = 0.2), Principal Component Analysis (PCA) is utilized to pre-process the data before training the Resnet.
• MixMatch [57]: MixMatch guesses low-entropy labels for unlabeled instances with data augmentation and then develops a probabilistic procedure to mix the labeled and unlabeled data to train a deep learning classifier. Similarly, we employ PCA to reduce the dimensionality of the mixed dataset from MixMatch and input it into a Resnet [56] as the final classifier. For a fair comparison, the Resnet has the same architecture as the first benchmark.
• FixMatch [58]: FixMatch first generates pseudo labels for data with weak data augmentation and then applies a criterion to decide whether each pseudo label is retained. Finally, data with retained pseudo labels undergo strong data augmentation for classifier training. As with MixMatch, we utilize PCA + Resnet as the final classifier, and the Resnet has the same architecture as the first benchmark.
• Semi-supervised Ladder Network (SSLN) [59]: SSLN combines supervised and unsupervised learning in deep neural networks via a joint loss function in a ladder network with an auto-encoder structure. Similarly, we pre-process the data with PCA.
During testing, the hyper-parameters of all models are fine-tuned with 3-fold cross-validation to achieve the best accuracy. In general, by comparing the testing accuracy of KTDC-Se with Resnet and KTDC-Se-L, we can illustrate the effectiveness of using unlabeled data. By comparing the testing accuracy of KTDC-Se and the other methods, we can evaluate the performance of an integrated model versus two-stage models; in particular, Resnet, MixMatch, FixMatch, and SSLN have two separate steps of data pre-processing and learning, each with its own biased selections. By comparing the label prediction time of KTDC-Se and the other methods, we can evaluate the efficiency of the methods for real-time inference. Finally, by comparing the kernel choices, we can understand how non-linear kernels boost performance.
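All four benchmarks above first compress the raw PMU vectors with PCA. A minimal SVD-based PCA sketch (the function name is illustrative; the experiments use 160 components on vectors of up to 9000 dimensions):

```python
import numpy as np

def pca_reduce(X, k):
    """Project rows of X onto their top-k principal components via an
    economy SVD of the mean-centered data matrix."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T
```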

C. Joint Optimization of KTDC-Se Is Better Than Two-Stage Models
In this subsection, we evaluate the integrated design by comparing our KTDC-Se with other two-stage models. By default, we set the kernel of KTDC-Se to be polynomial with degree d = 2. Thanks to the integration without biased selection of the two-stage compression and classification, our model performs much better than the benchmarks. We report the results on simulated and real-world data as follows.
For the simulated data, we first fix η_2 = 0.3 to divide the labeled and unlabeled datasets for training and testing. Fig. 6(a) and (b) demonstrate the results for the two systems. Since for each η_1 of one system we conduct 5 randomizations and 3-fold cross-validation of the PMU location selection, there can be multiple values of the testing accuracy for each testing scenario. Thus, we present box plots in Fig. 6(a) and (b) to show the average and the variance.
We find that our KTDC-Se has an average accuracy improvement of 13.3%, 13.6%, and 9.3% over MixMatch, FixMatch, and SSLN, respectively. Notably, the latter 3 benchmark models utilize PCA to pre-process the data so that they can be trained at a reasonable cost. Even though these 4 methods utilize the same labeled and unlabeled data for training, the better performance of KTDC-Se shows that an integrated model can outperform two-stage models. This is because the integrated model avoids the biased selection of two separate models. Further, the joint model enables the identification of discriminative core tensors that are sensitive to the event labels.
In our experiment, we implement 5 random samplings of PMU locations for a given PMU penetration η_1 for the two systems. Intuitively, if the PMUs are closer to the event, the accuracy should be higher. Then, for a fixed η_1 and fixed event locations, the relative PMU locations with respect to the event location cause the accuracy variance. Therefore, the 500-bus system usually has a higher accuracy variance than the 200-bus system since the 500-bus system covers a larger range.
This is also validated by Fig. 4 in the manuscript when η_1 ∈ {0.05, 0.1, 0.15}. However, when η_1 = 0.2, we find that the 500-bus system has a higher accuracy mean and a lower accuracy variance. After careful checking, we find the reason is that 5 is a small number of random samplings of PMU locations. Specifically, when η_1 = 0.2, 4 out of 5 tests in the 500-bus system have PMU locations close to the event location, whereas for the 200-bus system only 2 tests have PMU locations close to the event locations. This shows that the 500-bus system data may occasionally yield better performance.
We also utilize the F1 score to evaluate the performance for the 200- and 500-bus systems. The F1 score is the harmonic mean of the precision and recall metrics, which gives a much better evaluation than accuracy for datasets with imbalanced classes [60]. Under this setting, we re-evaluate Section VI-D using the F1 score. Fig. 6(c) and (d) illustrate that our proposed KTDC-Se model still has the best performance in different scenarios under the F1 score, which demonstrates the superiority of our model. Compared to the test accuracy, the F1 scores of all methods are relatively lower because the F1 score jointly considers the precision and the recall rate. Nevertheless, under the F1 score, KTDC-Se is still the best method.
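The metric itself is simple to state in code, which makes the imbalanced-class argument concrete: a classifier that ignores a rare class drives recall (and hence F1) down even when accuracy stays high.

```python
def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall, computed from
    true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```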
For the real-world data, we report the testing accuracy in Table II. We observe that the average accuracy improvements of KTDC-Se are 9.7%, 9%, and 8.8% over MixMatch, FixMatch, and SSLN, respectively. This again supports the advantage of using an integrated model.
We further evaluate the accuracy for each event type for the 200-bus system with η_1 = 0.1. The results are shown in Table III; other scenarios yield similar results. We use LT to denote line trip, GT generator trip, SP single-phase fault, PP phase-to-phase fault, and TP three-phase fault, and the numbers 1 and 2 denote the two different locations. The results illustrate that three-phase faults are easier to identify since they are more severe than the others. On the other hand, the single-phase and phase-to-phase faults have lower accuracy since they exhibit similar fault behaviors.

D. Semi-Supervised Learning Boosts KTDC-Se Model Performance
In this subsection, we compare KTDC-Se, Resnet, and KTDC-Se-L to illustrate the effectiveness of semi-supervised learning. Specifically, for the synthetic data, the average accuracy improvements are 18% and 13.3% over Resnet and KTDC-Se-L, respectively. For the real-world data, the corresponding improvements are 15.2% and 11.8%. These results show a significant improvement when using unlabeled data.
The performance can be explained as follows. First, we compare KTDC-Se and KTDC-Se-L. When performing the tensor decomposition of the PMU tensors, the unlabeled tensors in KTDC-Se enable the core tensors to capture different loading conditions and maintain similarity across conditions as long as the event label is the same. Then, the trained classifier can successfully handle different operational scenarios. Second, we compare KTDC-Se-L and Resnet. We find that Resnet performs even worse because it also employs PCA to pre-process the data; as illustrated in Section VI-C, such two-stage models perform worse.
To further understand how much labeled data is needed for good performance of KTDC-Se, we use the simulated data with η_1 = 0.1 fixed and vary η_2 ∈ {0.1, 0.2, 0.3, 0.4, 0.5} for the amount of labeled data. As a comparison, we also run KTDC-Se-L on the labeled data. Further, we present the optimal testing accuracy obtained by using labels for all PMU tensors to train the model, i.e., η_2 = 1. The average testing accuracies are shown in Fig. 7(a) and (b).
Based on the plots, we have the following observations. (1) Training with extra unlabeled data significantly improves the accuracy, as explained before. (2) KTDC-Se efficiently utilizes the limited labels. For example, when η_2 = 0.1, the testing accuracy is still higher than 94%. This implies that KTDC-Se can filter information and group data by event labels in the feature space; such a good result comes from the joint learning process that generates compact and discriminative features. (3) As η_2 increases, the testing accuracy of KTDC-Se gradually increases toward the optimal accuracy. Further, with only 30% of the labeled data, KTDC-Se achieves an accuracy close to the optimal one, which again indicates the high efficiency of KTDC-Se in utilizing limited labels.

E. Tensor-Based Framework Enables Fast Inference
The model efficiency can be evaluated via the training and testing time. We utilize the simulated data in Section VI-C and report the computational time of all methods on the testing dataset in Table IV. Based on the results, we draw the following conclusions. (1) SSLN has the largest computational time due to its largest model size with ladder networks, decoders, and encoders.
(2) KTDC-Se has the second-largest training time since each KTDC-Se is a binary classifier and we utilize multiple binary models to form the final decision; thus, the total training time of multiple models is higher. We further find that KTDC-Se has the second-lowest testing time, which demonstrates its efficiency. KTDC-Se-L has the lowest time because it uses less data (i.e., only labeled data) for training, so its test kernel matrix has a much smaller size than that of KTDC-Se. According to the prediction function in (18), KTDC-Se-L therefore has a lower testing time than KTDC-Se.
However, when both labeled and unlabeled data are used, KTDC-Se is much more effective at inference than the other methods. First, KTDC-Se employs tensor decomposition to explore cross-dimension correlations and obtain a very compact core tensor for testing, whereas the other methods use PCA to compress data and need more features to achieve their best accuracy. Second, the other methods use a deep model to extract non-linear features, requiring more computational time. Third, Resnet, FixMatch, and MixMatch share the same DNN architecture and consume similar testing times, while the ladder network in SSLN has a larger size and needs more time to predict labels.

F. Non-Linear Kernelization Largely Increases the Accuracy
In this subsection, we study the effect of kernelization on the capacity of the classifier. Specifically, we report the testing accuracy for the 200-bus system with η_1 = 0.1 and η_2 = 0.3; other results are similar and omitted due to the space limit. We study the case with a linear kernel (i.e., d = 1 for the polynomial kernel) and cases with different non-linear kernels. For the polynomial kernel, we vary d ∈ {1, 2, 3}; for the RBF kernel, we vary λ ∈ {0.05, 0.1, 0.15}. The results are shown in Table V. First, comparing the d = 1 case with the other kernels, we find that adding non-linear kernels significantly increases the accuracy because the PMU measurements have highly non-linear correlations. Second, the polynomial kernel reaches an accuracy of 95.5% when d ≥ 2, while the RBF kernel reaches around 94.7% when λ ≥ 0.1. This shows that the polynomial kernel, especially with d = 2, is better than the others; a partial reason is that quadratic correlations are inherent in the power flow equations.
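The two kernel families compared in Table V can be sketched as follows (the +1 offset in the polynomial kernel and the exp(-λ‖x - z‖²) parameterization of the RBF kernel are assumptions; the paper only specifies the grids for d and λ):

```python
import numpy as np

def poly_kernel(x, z, d=2):
    """Polynomial kernel of degree d; d = 1 reduces to an (affine)
    linear kernel, matching the baseline case in the text."""
    return (x @ z + 1.0) ** d

def rbf_kernel(x, z, lam=0.1):
    """RBF kernel with width parameter lam (assumed form)."""
    return np.exp(-lam * np.sum((x - z) ** 2))
```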

G. Sensitivity Analysis of the Window Length
Finally, we vary the length of the moving window to determine the suitable region. Specifically, we conduct the experiments for the 200-bus system with η_1 = 0.1 and η_2 = 0.3, and set the kernel of KTDC-Se to be polynomial with degree d = 2. The results are shown in Table VI, from which we make the following observations. First, when the window length is less than 0.3 s, the accuracy decreases: when the window is too small, the event dynamics may not be fully covered and the classification encounters errors. Second, the test accuracy decreases slightly when the window length is larger than 0.8 s, because a large window contains both normal and event information, which reduces the classification accuracy. Third, the best window length is 0.6 s, and there is a robust region [0.4, 0.7] s that produces an accurate model.

VII. CONCLUSION
The increasing placement of PMUs leads to better power system situational awareness and event identification. Specifically, ML-based methods can quickly identify event types and locations. However, the high volume and complex correlations of PMU measurements make existing ML methods inefficient. Moreover, recent approaches focus on supervised learning while much event data is unlabeled, and the inability to utilize rich unlabeled data hurts model robustness to diversified loading conditions. To tackle these challenges, we propose the Kernelized Tensor Decomposition and Classification with Semi-supervision (KTDC-Se). Specifically, we treat PMU measurements as tensors and employ an advanced tensor decomposition to remove redundant information and retain useful event features, significantly boosting model efficiency. Simultaneously, the labeled data in the core tensors are input to a kernelized classification model for accurate classification, while the unlabeled data contribute to the decomposition process. This guarantees model robustness and accuracy. Numerically, we show that KTDC-Se achieves the best accuracy compared to other semi-supervised learning methods on both synthetic and real-world systems.

Fig. 2. Illustration of the moving window-based division to generate PMU tensors.

• KTDC-Se-L: KTDC-Se-L trains a KTDC-Se model with only labeled data by setting N = H in the model, which demonstrates the effectiveness of employing unlabeled data for training a classifier.

Fig. 7. Results of the sensitivity analysis with respect to labeled data ratios.

(3) The total training time of KTDC-Se is still comparable to that of the other models, because training each binary model requires much less computation than the benchmark methods; more specifically, the input dataset for each KTDC-Se model covers only two event classes and has a much smaller size. (4) Resnet, MixMatch, and FixMatch have relatively close training times since they share the same PCA-processed input data and similar DNN models for classification. (5) KTDC-Se-L has the smallest training time since it only utilizes labeled data.

TABLE I
OVERVIEW OF TENSOR NOTATIONS AND OPERATORS

TABLE II
TESTING ACCURACY (%) (MEAN ± STANDARD DEVIATION) FOR REAL-WORLD PMU DATA

TABLE IV
AVERAGE TRAINING/TESTING TIME (S) ON THE TRAINING/TESTING DATASET FOR DIFFERENT METHODS