Missing Value Replacement for PMU Data via Deep Learning Model With Magnitude Trend Decoupling

This paper develops a forecasting-based missing value replacement model for Phasor Measurement Unit (PMU) data during power system events. The proposed forecasting model leverages a sequence-to-sequence (Seq2Seq) long short-term memory (LSTM) network with an attention mechanism, which is trained with both pre-event and post-event data. The trained forecasting model is used to accurately estimate and recover missing measurements in PMU data. To improve the accuracy of the proposed model, we introduce two novel techniques: 1) integrating a prior knowledge matrix into the attention mechanism, which effectively preserves correlations within PMU data, and 2) decoupling the magnitude and trend components of the residual forecast so that separate forecasting models are trained for each, which improves robustness to noisy signals. Numerical studies on real-world PMU data collected from the North American electric power grid demonstrate that our proposed model achieves a 5% to 30% error reduction compared to the baseline models for all key power grid measurements in both root mean square error and mean absolute error metrics. Furthermore, our model exhibits robust missing data recovery performance even when nearly all of the grid event data is missing.



I. INTRODUCTION
Phasor Measurement Units (PMUs) are measurement devices used in electric power systems to retrieve time-synchronized, real-time measurements of electrical quantities such as voltage and current phasors (magnitudes and phase angles) at various locations in the power transmission grid. In the past two decades, electric utilities around the world [1], [2] have widely deployed PMUs in the bulk power system to enhance system observability and improve situational awareness for system operators. In the United States alone, more than 2500 PMUs have been deployed in transmission grids, resulting in a significant increase in streaming data available for power grid monitoring. In academia, many researchers have developed PMU data-based algorithms which include power system state estimation [3], [4], event detection (including fault detection) [5], [6], event classification [7], [8], [9], [10], and offline event analysis [11].
In the power industry, various PMU applications related to situational awareness for high-frequency dynamics, such as renewable energy sources [12], [13], continuous generator model validation [14], asset health monitoring [15], [16], [17], and linear state estimation [18], [19], have been adopted by Regional Transmission Operators (RTOs) and Independent System Operators (ISOs) in synchrophasor projects. However, there are very few commercial PMU applications, such as eLSE [20], due to poor PMU data quality and grid operators' limited understanding of data-driven PMU applications. Poor PMU data quality caused by bad data and missing values is one of the most pressing technical challenges for PMU applications to overcome. Missing value replacement is crucial, as data drops and latency in PMU data can significantly deteriorate the performance of synchrophasor-based applications, particularly those involving real-time feedback control [21]. The major factor affecting data quality is PMU data loss [22]. Such missing data can be caused by PMU hardware malfunction, GPS time-synchronization issues, and data transmission/communication delays. The current generation of PMU applications merely isolates bad PMU measurements and rarely recovers missing PMU measurements. Such data recovery functions are crucial for PMU applications to capitalize on measured data. The most advanced data recovery function is based on model-based estimates and is currently implemented in PMU applications [18]. However, such data recovery functions are used for steady-state PMU measurements only, and applying them to grid event PMU applications is still in the research phase. Although power systems normally operate in steady state, PMU data integrity during power system events is more critical to power system operators because blackouts or brownouts may occur following major grid disturbances. Missing value recovery during events therefore deserves more attention in future PMU applications.
The availability of PMU measurements from across the United States has greatly accelerated the research on machine learning-based PMU data analytics. Many researchers have developed data-driven algorithms to recover missing PMU data. The current approaches for replacing missing values in streaming PMU data can generally be classified into two categories.
The first category of methods for handling missing data in PMUs is matrix completion, which relies on past and currently available PMU measurements to fill in the gaps. Reference [23] fills in missing PMU data by exploiting other PMUs' dynamic behaviors, averaging the outputs of bagged multiple linear regression models. Reference [24] performs a tensor decomposition on PMU measurements that have missing data and converts the decomposed factor matrices back to determine the missing values. Reference [25] introduces an event-decomposition participation method that separates PMU data into steady-state and dynamic components, taking advantage of the low-rank matrix characteristic of streaming PMU data. Reference [26] uses low-rank tensor factorization and subspace selection (known as the OnLine Algorithm for PMU data processing, or OLAP) to replace missing values. Reference [27] proposes a variant of OLAP specializing in the temporal aspect of PMU data that primarily uses past values to fill in missing data, and reference [28] advocates that this version is the most advanced algorithm in the field of PMU missing value replacement. However, these matrix completion-based methods are not able to handle extreme cases where all or a majority of PMUs are out of service for a period of time, such as when GPS signals are lost or malfunctioning.
The second category is the forecasting-based method, which primarily uses historical PMU measurements to fill in the missing data. A key strength of forecasting-based methods is that they can cope with the aforementioned severe conditions, i.e., most or all of the PMUs being out of service. Reference [29] employs an autoregressive model that uses the past three observations to infer the current PMU measurements. A Lagrange interpolation method is proposed in [30] that effectively recovers missing values using a few past observations and the Lagrange polynomial coefficients. Reference [31] proposes a time series-based prediction model that effectively combines Kalman filtering and smoothing algorithms to improve data accuracy. Methods in this category fill in the missing PMU data using the observations from the past few time steps. However, during power system events (e.g., line faults and generator tripping), the PMU measurements often demonstrate abrupt and significant dynamic changes. The observations from the past few time steps alone are not sufficient to forecast event behaviors hundreds of milliseconds ahead.
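To illustrate this category, a generic Lagrange extrapolator in the spirit of [30] is sketched below; the window size and the exact per-channel handling are our assumptions, not details from that reference:

```python
import numpy as np

def lagrange_fill(t_known, y_known, t_missing):
    """Estimate missing samples by evaluating the Lagrange polynomial
    built from a few past observations (t_known, y_known)."""
    est = np.zeros(len(t_missing))
    for k, t in enumerate(t_missing):
        total = 0.0
        for i, ti in enumerate(t_known):
            li = 1.0                      # Lagrange basis polynomial L_i(t)
            for j, tj in enumerate(t_known):
                if i != j:
                    li *= (t - tj) / (ti - tj)
            total += y_known[i] * li
        est[k] = total
    return est

# A quadratic signal is recovered exactly from three past samples.
print(lagrange_fill([0, 1, 2], [0.0, 1.0, 4.0], [3, 4]))  # [ 9. 16.]
```

Because the polynomial is built only from a few past samples, such schemes extrapolate poorly through the abrupt dynamics of a grid event, which motivates the learned model proposed here.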
To the best of our knowledge, no data-driven missing value replacement schemes (in the second category) are based on a deep neural network trained with historical power grid event data. This paper employs the forecasting-based scheme and develops a deep learning model that is trained with hundreds of recorded power system events and corresponding PMU data to replace missing values. The proposed model uses a sequence-to-sequence (Seq2Seq) long short-term memory (LSTM) neural network equipped with an attention mechanism, which is well-suited for predicting multivariate time-series data. The model is specifically designed to accurately predict PMU data 2-8 seconds into the future using 1 second of pre-event time-series data. Observed short-term correlation patterns among the time-series data inspire us to express these patterns as a prior knowledge matrix and integrate this matrix into the attention mechanism to improve data forecast accuracy. Additionally, we introduce a novel method, which decouples the magnitude and trend components of the data and trains two separate Seq2Seq LSTM models to forecast these components simultaneously. The forecasting model is trained with hundreds of real-world power system events and the corresponding PMU data. The testing results show that our proposed model can provide accurate forecasts and effectively replace missing data even when all PMU measurements are missing simultaneously.
The main contributions of this paper are listed as follows:
• Development of a forecasting-based model with the following two novel techniques to fill in missing measurements in PMU data during events, even when no PMUs are available:
- Integrating the prior knowledge matrix into the attention mechanism embedded in the Seq2Seq LSTM to improve the data prediction performance.
- Decoupling the PMU data into magnitude and trend components and separately training two deep learning models to better estimate the upward/downward trend of missing PMU data and improve the prediction accuracy.
• Superb prediction accuracy for missing PMU data compared to five baseline deep learning-based missing value replacement models.
• Outstanding prediction accuracy specifically for a large fraction of missing PMU data (over 30%) compared to the state-of-the-art matrix completion-based missing value replacement model.

The rest of this paper is organized as follows: Section II formulates the PMU data forecasting problem and illustrates the overall framework of the proposed method. Section III presents two innovative techniques in the deep learning model. Section IV evaluates the performance of the proposed and baseline models for replacing missing PMU data with a large-scale real-world PMU dataset. Section V concludes this paper.

II. PROBLEM SETUP AND OVERALL FRAMEWORK
The focus of this section is the formulation of the online PMU data forecasting problem and the introduction of a deep learning-based missing value replacement framework. The proposed model utilizes an attentional Seq2Seq LSTM network, which is enhanced with the incorporation of a prior knowledge matrix and magnitude trend decoupling scheme. The input for the model is the historical time-series data for PMU consisting of four different measurements (real power, reactive power, voltage magnitude, and frequency). The output is a prediction of these same measurements that can be used to replace missing values.

A. PROBLEM SETUP
The grid-wide sensor system of each interconnection in the U.S. includes a large number of PMUs (hundreds). In this paper, we formulate the replacement of missing values in PMU data as a multivariate time-series forecasting problem.
Notation 1: X = [x_1, x_2, ..., x_T] is the pre-processed multivariate PMU time-series data, where T is the length of the time series X, x_i ∈ R^n represents the recorded values of the different PMUs at time i, and n is the number of PMUs.
Notation 2: X_{i,w} = [x_i, x_{i+1}, ..., x_{i+w-1}] is a continuous subsequence of X with length w and start time index i, which is used in Notation 3.
Notation 3: X_{t-m+1,m} is the historical time-series with the fixed length m (1 second), which is used as the input of the forecasting model.
Notation 4: Y_t = [x_{t+1}, x_{t+2}, ...] is the future time-series, which is used as the output of the forecasting model.
The goal of this forecasting task is to predict the future measurement value (Y t ) using past time-series values (X t−m+1,m ). X t−m+1,m and Y t are matrices with two dimensions that correspond to the spatial and temporal information, i.e., the PMU IDs and timestamps. The output time-series' lengths are 2 seconds for voltage-related events and 8 seconds for frequency-related events.
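The sliding-window construction of the input/output pairs above can be sketched as follows; the function name and the toy dimensions are illustrative, not the paper's:

```python
import numpy as np

def make_forecast_pairs(X, m, horizon):
    """Slice a PMU time-series X (shape: T x n) into input/output pairs.

    m       : input window length, e.g. 30 samples = 1 s at 30 Hz
    horizon : output length, e.g. 60 samples = 2 s for voltage events
    """
    inputs, outputs = [], []
    for t in range(m, X.shape[0] - horizon + 1):
        inputs.append(X[t - m:t])          # X_{t-m+1,m}: past m samples
        outputs.append(X[t:t + horizon])   # Y_t: next `horizon` samples
    return np.stack(inputs), np.stack(outputs)

X = np.random.randn(300, 4)               # 10 s of 4 channels at 30 Hz
inp, out = make_forecast_pairs(X, m=30, horizon=60)
print(inp.shape, out.shape)               # (211, 30, 4) (211, 60, 4)
```

For frequency-related events, the same routine would be called with `horizon=240` (8 seconds at 30 Hz).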

B. OVERALL FRAMEWORK
The overall framework of the proposed deep learning-based missing value replacement model is depicted in Fig. 1. The raw dataset undergoes noise removal, outlier detection, and normalization using an Apache Spark Cluster [32]. Then, the pre-processed and normalized training dataset is used to calculate the prior knowledge matrix. The deep learning neural network is constructed based on the Seq2Seq LSTM with an attention mechanism and prior knowledge matrix embedding. The model is trained using the pre-processed PMU data. The deep learning-based model consists of two separately trained neural network modules, the magnitude forecasting module and the trend forecasting module, which are both depicted in Fig. 1 and discussed later (see III-A2). The future PMU data is forecasted by combining the results from the aforementioned two modules. Each module will be detailed in the next section.

III. TECHNICAL METHODS
In this section, we describe the technical methods used in the forecasting model, which tries to predict the residual time series in the future based on the past time series data. First, the time series is separated into the magnitude and trend components, and two separate neural networks are trained to predict each component. The forecast for PMU data is generated by combining the outputs of these two neural networks. Second, we provide a detailed description of the neural network structure of our proposed model with the uniquely designed prior knowledge matrix.

A. OVERVIEW OF THE FORECASTING MODEL 1) FORECASTING RESIDUAL
To predict future PMU data, the proposed model learns the residual mapping and predicts the change (residual) in measurements after the last input measurement instead of directly predicting the PMU data time-series. The final prediction is obtained by adding the output of the neural network to the last input measurement, as shown in the following equation:

y_τ = x_t + F_forecast(x_{1,t}, y_{1,τ-1})

where y_τ is the predicted PMU data at time step τ, and F_forecast(x_{1,t}, y_{1,τ-1}) is the output from the decoder of the LSTM network. The forecasting model F_forecast only needs to learn the residual mapping between y_τ and x_t.
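The post-processing step that turns residual outputs back into absolute forecasts is a one-liner; a minimal sketch with illustrative names:

```python
import numpy as np

def residual_to_forecast(x_t, residuals):
    """Recover absolute forecasts from the network's residual outputs,
    i.e. add the predicted changes to the last input measurement.

    x_t       : last input measurement, shape (n,)
    residuals : predicted changes relative to x_t, shape (horizon, n)
    """
    return x_t + residuals                 # x_t broadcasts over the horizon

x_t = np.array([1.00, 0.98])               # e.g. per-unit voltage magnitudes
res = np.array([[0.01, -0.02],
                [0.03, -0.05]])
print(residual_to_forecast(x_t, res))
```

Learning the residual rather than the absolute value keeps the network's targets centered near zero, which is generally easier to fit.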

2) MAGNITUDE-TREND DECOUPLING
The multivariate time-series data stream from PMUs is characterized by noisy signals, which significantly lower the mean directional accuracy (MDA) of its forecasting result. An inaccurate trend in predicting future values can have far more severe consequences than an erroneous magnitude, and the ambient noise in PMU data can significantly reduce the accuracy of trend predictions in forecasting models. To address this issue, this paper proposes breaking down the forecasting problem into two simpler sub-problems: magnitude forecasting and trend forecasting. This is accomplished by leveraging two distinct models, one dedicated to capturing magnitude and the other specifically designed to capture trend. We decouple the magnitude and trend components of the residual time-series data using the following equation:

F_forecast(x_{1,t}, y_{1,τ-1}) = F_mag(x_{1,t}, y_{1,τ-1}) ⊙ F_tre(x_{1,t}, y_{1,τ-1})

where ⊙ denotes the element-wise product, F_tre(x_{1,t}, y_{1,τ-1}) predicts the future trend (i.e., the trend of the time-series data), with each forecasted value equal to +1 (increasing) or -1 (decreasing), and F_mag(x_{1,t}, y_{1,τ-1}) predicts the magnitude of the future PMU data point, i.e., a positive value indicating the absolute value of the change in the measurement.
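The decoupling of residual targets into the two training signals, and the recombination of the two modules' outputs, can be sketched as follows (function names are ours):

```python
import numpy as np

def decouple(residuals):
    """Split residual targets into trend (+1/-1) and magnitude (|r|)."""
    trend = np.where(residuals >= 0, 1.0, -1.0)
    magnitude = np.abs(residuals)
    return trend, magnitude

def recombine(trend_pred, magnitude_pred):
    """Merge the two modules' outputs back into a residual forecast."""
    return np.sign(trend_pred) * np.abs(magnitude_pred)

r = np.array([0.3, -0.5, 0.0])
trend, mag = decouple(r)
print(trend, mag)                          # [ 1. -1.  1.] [0.3 0.5 0. ]
print(recombine(trend, mag))               # [ 0.3 -0.5  0. ]
```

The trend module thus solves a binary up/down problem that is far less sensitive to ambient noise than regressing the signed residual directly.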
In this study, we start from a Seq2Seq LSTM model with an attention mechanism to forecast PMU data, subsequently utilizing the model to impute missing values. This model is used by two separate modules for trend and magnitude, and the outputs from these modules are combined to generate the final prediction. The architecture of the neural network will be discussed in further detail in the following subsection.

B. NEURAL NETWORK ARCHITECTURE
The neural network architecture of the Seq2Seq LSTM equipped with the attention mechanism is illustrated in Fig. 2. Since the network design is based on the Seq2Seq structure, it has no limit on the input and output data length.
This model consists of two key components: the encoder and the decoder, and the decoder is equipped with an attention layer and prior knowledge matrix. Both the encoder and the decoder have a bi-directional recurrent neural network (RNN) structure. The attention mechanism with the prior knowledge matrix in the decoder extracts a weighted feature to model the temporal dependency between the input and output time series. How the above two components extract and learn the temporal correlations in the time-series data is discussed below.

1) ENCODER
The encoder is built on an RNN, which can learn inherent features in past time-series values to predict the future. Given time-series data, an RNN generally defines a recurrent function, R, and calculates the hidden state, h_t ∈ R^m, for each time step, t, as:

h_t = R(x_t, h_{t-1})

where the function R depends on the type of RNN cell (a loop-shaped neural network structure in the hidden layers) used. Backpropagation through time allows the hidden states to capture temporal features during training, but it can suffer from gradient vanishing or exploding over longer timescales, which can significantly degrade the model performance.
To cope with this issue, the LSTM network was introduced as a variant of the basic RNN. LSTM units consist of a memory cell and three controlled gates, as described in [33]. Therefore, we chose to use an LSTM network in our forecasting model to effectively capture time dependencies and extract features for each time step in the input sequence.
Additionally, the proposed model utilizes a bi-directional LSTM network to capture temporal features at each time step by incorporating both past time-series values, hf t , and future values, hb t . This allows for a more comprehensive understanding of the temporal dynamics within the time-series data set.
forward: hf_t = LSTM(x_t, hf_{t-1}) (4)
backward: hb_t = LSTM(x_t, hb_{t+1}) (5)

As shown in (4) and (5), the forward and backward LSTMs in the encoder network simultaneously process the time-series data in opposite directions. The forward LSTM processes data from t = 1 to T using a forward hidden layer, while the backward LSTM processes data from t = T to 1 using a backward hidden layer. The final hidden state for each time step, denoted as h_t, is obtained by concatenating the outputs of the forward and backward hidden layers at time step t. The time-series data is fed to the encoder as the input. The output is a time-series of hidden states h_t that are further processed in the decoder layer's attention mechanism to calculate the attention distribution for each time step in the output time-series data.
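The forward/backward passes in (4) and (5) and the concatenation of their states can be illustrated with a minimal numpy sketch; a plain tanh-RNN cell stands in for the LSTM cell to keep the example short, and all dimensions are illustrative:

```python
import numpy as np

def rnn_pass(X, Wx, Wh, reverse=False):
    """Plain tanh-RNN pass (a stand-in for one LSTM direction)."""
    T, k = X.shape[0], Wh.shape[0]
    h = np.zeros(k)
    states = np.zeros((T, k))
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for t in steps:
        h = np.tanh(X[t] @ Wx + h @ Wh)    # recurrent update
        states[t] = h
    return states

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))               # 1 s of 4 channels at 30 Hz
Wx = rng.normal(size=(4, 8))
Wh = rng.normal(size=(8, 8)) * 0.1
hf = rnn_pass(X, Wx, Wh)                   # forward states hf_t
hb = rnn_pass(X, Wx, Wh, reverse=True)     # backward states hb_t
h = np.concatenate([hf, hb], axis=1)       # encoder output h_t
print(h.shape)                             # (30, 16)
```

Note how h_t at every step carries information from both the past (via hf_t) and the future of the input window (via hb_t).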

2) ATTENTION-BASED DECODER
The decoder neural network is also built based on a bi-directional LSTM network. To prevent squashing all the time-series data into a fixed-length vector (the last hidden state from the encoder), [34] and [35] proposed the attention mechanism, which enables the forecasting model to selectively focus on specific aspects of the input time-series data. Because all the input hidden states are equally and fully leveraged, the model's performance is not affected by the hidden states squashing. Thus, our proposed forecasting model capitalizes on the attention mechanism to propagate predominant features in the time-series data from the encoder to the decoder. Moreover, a prior knowledge matrix is integrated into the attention mechanism as fixed parameters to sharpen the attention.
The calculation of the forecast value at each time step in the decoder is based on the LSTM hidden state at the current time step, d_τ, all the hidden states of the input time-series data, h_{1..t}, the latest previous forecasting result, y_{τ-1}, and a pre-calculated prior knowledge matrix, P:

y_τ = f(d_τ, h_{1..t}, y_{τ-1}, P)

Each element of the prior knowledge matrix, P_{i,j}, is calculated as the average Pearson correlation coefficient between the measurements x_i and y_j. The prior knowledge matrix directly captures the temporal correlations between the various time steps of the input and output time-series.
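One way to pre-compute such a matrix from the training windows is sketched below; averaging the Pearson correlations over the measurement channels is our reading of the description above, and the function name is illustrative:

```python
import numpy as np

def prior_knowledge_matrix(inputs, outputs):
    """P[i, j] = Pearson correlation between input time step i and output
    time step j, averaged over the n measurement channels.

    inputs  : training input windows, shape (N, m, n)
    outputs : training target windows, shape (N, h, n)
    """
    N, m, n = inputs.shape
    h = outputs.shape[1]
    P = np.zeros((m, h))
    for c in range(n):
        # rows = time steps, columns = training samples
        full = np.corrcoef(inputs[:, :, c].T, outputs[:, :, c].T)
        P += full[:m, m:]                  # cross-correlation block (m x h)
    return P / n

rng = np.random.default_rng(0)
P = prior_knowledge_matrix(rng.normal(size=(50, 30, 4)),
                           rng.normal(size=(50, 60, 4)))
print(P.shape)                             # (30, 60)
```

Because P is computed once from the training set, it enters the attention mechanism as fixed (non-trainable) parameters.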
The attention mechanism is implemented by a two-layer perceptron, followed by an LSTM network in the decoder. For each encoder output h_i, its attention score, l(d_τ, h_i), is computed as:

l(d_τ, h_i) = V_a^T tanh(W_a [d_τ ; h_i])

where d_τ refers to the decoder's hidden state at time step τ, h_i denotes the hidden state of the input time-series from the encoder at time step i, and V_a^T and W_a are the weights of the two-layer perceptron in the attention mechanism.
The attention weight, β_τ, is calculated by applying softmax normalization to the attention scores l:

β_τ(d_τ, h_i) = exp(l(d_τ, h_i)) / Σ_{j=1}^{t} exp(l(d_τ, h_j))

The attention weight β_τ(d_τ, h_i) represents the degree of correlation between the τ-th output and the input at time i. To synthesize the temporal features across all the input time steps, the context vector, c_τ, is used. It is calculated as the weighted sum of all encoder outputs as shown below:

c_τ = Σ_{i=1}^{t} β_τ(d_τ, h_i) h_i

Each context vector at time τ, c_τ, combines the encoder's hidden states at every time step. It should be noted that the context vector contains features from all previous time-series data.
Finally, the concatenation of the context vector c_τ and the hidden state d_τ forms the feature vector for the output series, and a fully connected layer is used to obtain the forecast result for time step τ. This is shown in (12):

y_τ = W [c_τ ; d_τ] + b (12)

where W and b are the weights and bias of the fully connected layer.
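Putting the decoder pieces together, one attention step can be sketched as follows; the additive way the prior-knowledge column biases the scores is an assumption made for illustration, not necessarily the paper's exact integration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_step(d_tau, H, Va, Wa, prior_col=None):
    """One decoder time step of additive attention.

    d_tau     : decoder hidden state, shape (k,)
    H         : encoder hidden states h_1..h_t, shape (t, k)
    prior_col : optional column of the prior knowledge matrix P, added
                to the scores before the softmax (assumed integration)
    """
    scores = np.array([Va @ np.tanh(Wa @ np.concatenate([d_tau, h_i]))
                       for h_i in H])              # l(d_tau, h_i)
    if prior_col is not None:
        scores = scores + prior_col                # sharpen with prior knowledge
    beta = softmax(scores)                         # attention weights
    c_tau = beta @ H                               # context vector c_tau
    return beta, c_tau

rng = np.random.default_rng(0)
k, t = 8, 30
beta, c = attention_step(rng.normal(size=k), rng.normal(size=(t, k)),
                         rng.normal(size=k), rng.normal(size=(k, 2 * k)))
print(beta.shape, round(beta.sum(), 6), c.shape)   # (30,) 1.0 (8,)
```

The concatenation [c_τ ; d_τ] from this step would then be passed through the final fully connected layer of (12).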

IV. NUMERICAL STUDY
This section assesses the performance of the proposed forecasting model by comparing it to five forecasting-based models and the matrix completion model. The prediction accuracy of the forecasting-based models is evaluated using two indicators: root mean square error (RMSE) and mean absolute error (MAE). These indicators are chosen due to their suitability for evaluating voltage and frequency behaviors during voltage and frequency events. The prediction accuracy of the matrix completion model is evaluated using the mean absolute percentage error (MAPE) when a significant portion of PMU data is missing. The contribution of the prior knowledge matrix is also analyzed by visualizing the attention weights.

A. EXPERIMENT SETUP
1) DATASET DESCRIPTION
The data for this study was collected from the PMUs located in the Western Electricity Coordinating Council (WECC) over a two-year period (2016-2017) at a sampling frequency of 30 Hz. The original data set consists of 42 PMUs, totaling over 6 Terabytes. However, one PMU was excluded from the analysis due to being mostly out of service, resulting in a final sample of 41 PMUs. In addition, historical operator logs from RTOs and ISOs provided 958 power system event labels, which were divided into voltage-related (caused by system faults at transformers, lines, and buses) and frequency-related (generating unit trip) events.

2) DATA PRE-PROCESSING
The data cleaning process involves the removal of outliers from the dataset, filling in missing data using the method described in [25], calculating the real and reactive power, and organizing the streaming PMU data into a three-dimensional tensor. The three dimensions correspond to the timestamp, PMU ID, and measurement channel. This process is performed using a data pipeline implemented with a five-node Apache Spark and Hadoop cluster, which allows for the efficient handling of large-scale data. In Fig. 3, we present examples of voltage-related and frequency-related events to demonstrate the behavior of four distinct measurements taken during the event after data pre-processing. The start time of the event is at 9 seconds. The event labels include the timing and category/type of each event.

3) DATASET AND CASE STUDY SETUP
After cleaning the data, we pre-process it based on the event types. During the model training process, we use different forecasting lengths depending on the event type, as shown in Fig. 4. This figure illustrates the period for data extraction of each event in the dataset. The event period sequence that we extracted starts one second before the event label time stamp and continues for either two or eight seconds after the event label time stamp, depending on the event type.
The PMU data are extracted from all 958 labeled events. 60% of the events are selected as the training dataset, 20% of the events are used as the validation dataset for the hyperparameter tuning, and the remaining 20% are to evaluate the model performance.
The model parameters are learned with the training dataset, while an early-stopping function is used to monitor the loss on the validation dataset to avoid overfitting. RMSE is adopted as the evaluation metric for the proposed model, which is given by:

RMSE = sqrt((1/N) Σ_{i=1}^{N} (ŷ_i - y_i)^2)

where ŷ is the forecasted data and y is the measured data. The training process uses the Adam optimizer and the learning rate is set to 0.001. A Linux server with 4 Nvidia RTX 2080 Ti GPUs is used to train the proposed and baseline machine learning models.
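The early-stopping logic mentioned above can be sketched as follows; the patience value and class name are illustrative, not the paper's settings:

```python
import numpy as np

class EarlyStopping:
    """Stop training once the validation loss has not improved for
    `patience` consecutive epochs (a generic sketch)."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = np.inf
        self.count = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best:
            self.best, self.count = val_loss, 0
        else:
            self.count += 1
        return self.count >= self.patience

stopper = EarlyStopping(patience=2)
print([stopper.step(v) for v in [1.0, 0.9, 0.95, 0.96]])
# [False, False, False, True]
```

In a training loop, the `step` call would run once per epoch after evaluating the validation RMSE, and the weights from the best epoch would be restored when it returns True.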

B. BASELINE METHODS
The present study investigates the performance of five baseline forecasting models: linear regression, fully connected neural network, convolutional neural network (CNN), Seq2Seq LSTM model, and the Seq2Seq LSTM model with Luong attention. The parameters for each model are provided in Table 1, including the range of hyperparameters used in training. All models are trained using the Adam optimizer and the RMSE loss function on the same dataset. The specific characteristics and performance of each model will be discussed in the subsequent subsections.

1) LINEAR REGRESSION
The linear regression model assumes the output measurements have a linear relationship with the input subsequence.

2) FULLY CONNECTED NEURAL NETWORK
This model uses a multi-layer fully connected neural network to model the relationship between the input and the output subsequences. This neural network model consists of 3 layers of fully connected neurons.

3) CONVOLUTIONAL NEURAL NETWORK
This model uses CNN to perform the forecasting task. This neural network model comprises five layers with a sliding filter size of 5 × 5, and 64 filters for each internal layer.

4) Seq2Seq LSTM MODEL
This model uses the original Seq2Seq LSTM neural network to conduct the forecasting task. Both the encoder and decoder are two-layer Bi-LSTM models. The number of neurons for every layer of the LSTM model is 128.

5) ATTENTIONAL Seq2Seq LSTM MODEL
This model appends the Luong attention mechanism to the above-mentioned Seq2Seq LSTM model.

C. PERFORMANCE INDICATORS
Two evaluation metrics are used to compare the prediction performance of the five baseline models and our proposed forecasting model:

RMSE = sqrt((1/N) Σ_{i=1}^{N} (ŷ_i - y_i)^2)
MAE = (1/N) Σ_{i=1}^{N} |ŷ_i - y_i|

where ŷ is the forecasted data and y is the measured data. RMSE emphasizes large errors, whereas MAE treats individual errors equally and therefore weights outliers less heavily than RMSE. Note that voltage events tend to contain abrupt changes in the voltage signal, while frequency events generally exhibit gradual changes in the system frequency.
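For reference, the two metrics can be computed as below; the toy vectors show how a single large error inflates RMSE more than MAE:

```python
import numpy as np

def rmse(y_pred, y_true):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def mae(y_pred, y_true):
    return np.mean(np.abs(y_pred - y_true))

pred = np.array([1.0, 2.0, 4.0])
true = np.array([1.0, 2.0, 2.0])           # one sample with a large error
print(round(rmse(pred, true), 3))          # 1.155
print(round(mae(pred, true), 3))           # 0.667
```

This difference is why RMSE is the more telling metric for voltage events, whose abrupt transients produce occasional large errors.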

D. PMU DATA FORECASTING ACCURACY
The performance metrics of various PMU data prediction algorithms are presented in Table 2. The abbreviation ''PKM'' represents the ''prior knowledge matrix'' while ''MTD'' stands for ''magnitude-trend decoupling.'' As shown in Table 2, the Seq2Seq LSTM algorithm demonstrates superior forecasting performance for all measurement types (active power, reactive power, voltage, frequency) compared to linear regression, fully connected neural networks, and CNN. This is likely due to the LSTM's ability to model temporal correlations in the PMU time-series effectively. The incorporation of the attention mechanism in the ATT-Seq2Seq-LSTM model leads to a reduction in the forecasting error by enabling the model to better capture the temporal dependencies between the input and output timeseries.
The integration of the prior knowledge matrix results in a reduction of prediction error across all four measurement types, particularly for real and reactive power. The application of the magnitude and trend decoupling technique also enhances the expressive power of the forecasting model, leading to significant improvements in PMU data forecasting performance. In the following subsections, we examine the ways in which the proposed techniques contribute to the success of the proposed forecasting model.
Note that the RMSE is larger than the MAE for most of the measurement types in Table 2. This occurs when large errors appear in the prediction results, mostly because PMU time-series include transient signals with significant noise. As shown in the table, our proposed forecasting model significantly lowers the prediction error in voltage signals during voltage events. During frequency events, our proposed forecasting model significantly lowers the active power prediction error. This is because voltage and active power signals are the most predominant signals in voltage and frequency events, respectively. Fig. 5 shows that the MTD is the major contributor to improving the forecasting model accuracy for voltage events. On the other hand, not only the MTD but also the PKM contributes to improving the forecasting accuracy for frequency events, especially for the active power signals. Thus, both the PKM and MTD effectively reduce the error of the proposed forecasting model for both event types.

E. IMPACTS OF MAGNITUDE-TREND DECOUPLING
Table 3 shows the prediction error for the magnitude component and the forecasting accuracy for the trend component of the PMU data for the proposed and baseline methods. The prediction error for the magnitude component is measured by RMSE and the forecasting accuracy for the trend component is quantified by the accuracy, i.e., the up/down binary agreement rate. The RMSE and accuracy reported in the table are the average results across the four measurement types. As shown in Table 3, the proposed decoupling technique reduces the magnitude prediction error and increases the trend prediction accuracy, which significantly contributes to the reduction in overall forecasting error. It can also be observed that the proposed forecasting model exhibits higher accuracy in predicting the trend for frequency-related events than voltage-related events. This is due to the fact that the signals in the frequency-related events are less noisy.
Although the prior knowledge matrix technique also improved the accuracy of the trend prediction and reduced the error of the magnitude prediction, this improvement is limited in comparison to the improvement achieved through the use of the magnitude trend decoupling technique. Thus, using two neural networks to separately predict the magnitude and trend components of the PMU data is the key innovation to effectively and accurately learn the dynamic behavior in the power system.

F. VISUALIZATION OF ATTENTION WEIGHTS
The attention weights between the input and output time-series of the forecasting model at each time step are illustrated in Fig. 6. These attention weights are calculated without considering the prior knowledge matrix. The input consists of the subsequence that starts at one second before the event label, while the output corresponds to the forecasted PMU data. Higher attention weights between the input and output are indicated by blue colors, while weaker weights are indicated by green colors. As shown in the figure, the attention weights are much higher between the last few time steps of the input time-series and the first few time steps of the output time-series. In other words, the attention mechanism focuses more on the final few input data points when predicting future PMU data. Note that the Seq2Seq LSTM model without the attention mechanism relies solely on the final input time step's hidden state of the encoder to make predictions about the future PMU data by the decoder. However, the information about the previous time steps of the input sequence may not be fully captured by the final time step's hidden state, which leads to performance deterioration.

G. IMPACTS OF PRIOR KNOWLEDGE MATRIX ON ATTENTION WEIGHTS
The heat map for the prior knowledge matrix between the input and output time-series are illustrated in Fig. 7. The blue and green colors indicate a strong and weak correlation, respectively. In general, stronger correlations can be observed between the final few time steps of the input time-series and the first few time steps of the output time-series. This pattern is quite similar to that shown in Fig. 6.
The impacts of incorporating the prior knowledge matrix on the attention weights are demonstrated in Fig. 8. Here the attention weights are calculated by the model that integrates the prior knowledge matrix with the attention mechanism. By comparing Figs. 6 and 8, it is evident that incorporating the prior knowledge matrix into the attention mechanism leads to an increased focus on the final few time steps of the input sequence when predicting the first few time steps of the output time-series. The combination of the attention mechanism and the prior knowledge matrix, therefore, highlights the importance of the final few time steps in the input time-series. The improved forecasting performance reported in Table 2 further supports the significant role of the prior knowledge matrix in the attention mechanism.
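One simple way to fuse a fixed prior with learned attention is to bias the raw scores with the (log-)prior before the softmax. The paper's exact fusion rule is not given in this section, so the additive scheme below, the `alpha` weight, and the normalization choices are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_with_prior(scores, prior, alpha=1.0):
    """Bias raw attention scores with a prior knowledge matrix before the
    softmax; a hypothetical fusion rule, not necessarily the paper's."""
    eps = 1e-12  # avoid log(0) for zero-prior entries
    return softmax(scores + alpha * np.log(prior + eps))

T_out, T_in = 4, 6
rng = np.random.default_rng(1)
scores = rng.standard_normal((T_out, T_in))         # learned attention logits
prior = np.abs(rng.standard_normal((T_out, T_in)))  # e.g., |correlation| between steps
prior /= prior.sum(axis=1, keepdims=True)           # normalize each output row
weights = attention_with_prior(scores, prior)       # rows still sum to one
```

Input steps with a strong prior correlation to an output step receive a boosted weight, which is consistent with the sharpened focus seen in Fig. 8 relative to Fig. 6.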

H. MISSING PMU DATA REPLACEMENT PERFORMANCE
This subsection compares the missing value replacement performance of the proposed PMU data forecasting model and a state-of-the-art matrix-completion algorithm, OLAP. The experiments are conducted with missing data ratios ranging from 10% to 90%.

1) CASE STUDY SETUP
For each event and PMU time-series pair, it is assumed that the input data is complete and that p% of the output time-series, starting from its first time step, is missing. Both the proposed and baseline models estimate the missing values. The mean absolute percentage error (MAPE) for the estimated missing data points, averaged over all testing events, is used as the performance metric.
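For concreteness, the MAPE metric used here can be computed as below; the per-unit voltage values are hypothetical examples, not data from the paper.

```python
import numpy as np

def mape(actual, estimated):
    """Mean absolute percentage error over the estimated missing points
    (assumes the true measurements are nonzero, e.g., per-unit voltages)."""
    actual = np.asarray(actual, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    return 100.0 * np.mean(np.abs((actual - estimated) / actual))

true_vals = np.array([1.00, 0.98, 0.97])  # hypothetical ground-truth values
est_vals = np.array([1.01, 0.97, 0.99])   # hypothetical imputed values
err = mape(true_vals, est_vals)           # percentage error, here about 1.36%
```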

2) BASELINE METHOD
The state-of-the-art matrix completion-based method, OLAP [26], is used as the baseline method for missing value replacement. OLAP uses singular value decomposition to process the PMU data matrix and identifies the best linear combination of left singular vectors to fill in the missing PMU data. Fig. 9 shows the missing value replacement algorithms' MAPE under different missing data percentages. The MAPE is calculated by averaging over all event labels and four measurement channels for voltage and frequency events. It can be observed that when the percentage of missing values is 30% or lower, OLAP achieves lower MAPE than the proposed forecasting-based model. This is not surprising because OLAP capitalizes on the other PMUs' data at the times when some PMUs have missing data. However, when the percentage of missing data is above 30%, our proposed model significantly outperforms OLAP. The comparative advantage is more pronounced when the missing data percentage is higher. The MAPE of our proposed model is nearly flat across missing data percentages because the forecasting-based model does not leverage the other PMUs' measurements at the time steps when PMUs have missing values.
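The SVD-based imputation idea behind the baseline can be sketched in simplified form. This is not the actual OLAP algorithm from [26]; the function `svd_impute_channel`, the rank choice, and the toy signal are assumptions that only illustrate fitting an incomplete channel to the temporal subspace spanned by the leading left singular vectors of the complete channels.

```python
import numpy as np

def svd_impute_channel(X, ch, miss_t, rank=2):
    """Simplified sketch of SVD-based imputation in the spirit of the OLAP
    baseline. X is (time x channels); the observed samples of channel `ch`
    are fitted to the leading left singular vectors of the other channels,
    and the fitted combination fills the missing time steps."""
    others = [j for j in range(X.shape[1]) if j != ch]
    U, _, _ = np.linalg.svd(X[:, others], full_matrices=False)
    U = U[:, :rank]                                    # temporal subspace basis
    obs_t = [t for t in range(X.shape[0]) if t not in set(miss_t)]
    coef, *_ = np.linalg.lstsq(U[obs_t], X[obs_t, ch], rcond=None)
    X_filled = X.copy()
    X_filled[list(miss_t), ch] = U[list(miss_t)] @ coef
    return X_filled

# Rank-1 toy example: every channel is a scaled copy of one temporal pattern,
# so the missing samples of channel 0 are recovered exactly.
t = np.linspace(0, 1, 20)
pattern = 1.0 + 0.05 * np.sin(2 * np.pi * t)
X = np.column_stack([a * pattern for a in (1.0, 0.9, 1.1)])
truth = X[:, 0].copy()
X_miss = X.copy()
X_miss[10:15, 0] = 0.0  # pretend these samples are missing
X_rec = svd_impute_channel(X_miss, ch=0, miss_t=range(10, 15), rank=1)
```

The sketch also makes the failure mode plain: the fit relies on other channels being observed at the same times, which is exactly why a matrix-completion baseline degrades when most or all PMUs lose data simultaneously.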

3) MISSING VALUE REPLACEMENT PERFORMANCE COMPARISON
Finally, the proposed forecasting-based model can be utilized to replace missing PMU values in the most severe scenario, that is, when all PMUs are offline due to a GPS satellite malfunction. In this case, the baseline method, OLAP, cannot estimate any missing values.

V. CONCLUSION
This paper proposes a deep learning-based forecasting algorithm to fill in missing PMU data during power system events. The proposed deep learning model is trained using PMU data from hundreds of real-world voltage-related and frequency-related events. The proposed model is built on top of a Seq2Seq LSTM model with an attention mechanism. The two innovative techniques, integrating the prior knowledge matrix into the attention mechanism and decoupling the PMU data into magnitude and trend components, successfully improve the prediction accuracy and reduce the forecasting error.
The missing PMU data replacement performance of the proposed method and of the baseline deep learning and matrix completion-based methods is evaluated using real-world PMU data during system events in WECC. The testing results show that our proposed model outperforms the baseline deep learning models in terms of prediction accuracy. When the missing value percentage is higher than 30%, our proposed model achieves much lower missing data estimation error than the state-of-the-art matrix completion-based approach. Further in-depth numerical study results quantify the impacts of the two innovative techniques (prior knowledge matrix and magnitude-trend decoupling) on the proposed forecasting model's accuracy. These two unique techniques reduced the RMSE of the proposed model by 10-20% compared to the best baseline method.
In future work, it would be interesting to explore the use of a hybrid forecasting model that combines the current model with another method, such as the matrix completion method, to better handle a mixture of voltage and frequency events. Additionally, extending the scope of event types to include power swing oscillation and converter-driven oscillation would lead to a more comprehensive approach to missing value replacement in the power system. It should be noted that the grid topology is not disclosed, and only PMU measurements are available for these case studies. If grid topology is available, other advanced neural networks, such as graph convolutional networks, can be promising to improve the missing value imputation performance.

ACKNOWLEDGMENT
Disclaimer: This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.