Hd-Deep-EM: Deep Expectation Maximization for Dynamic Hidden State Recovery Using Heterogeneous Data

Uncertain power generations and loads are continually integrated into the power system, causing high risks of dynamic events. To better monitor systems, advanced meters like Phasor Measurement Units (PMUs) can record high-resolution system dynamic states in real time. However, due to the high cost, the placement of PMUs is limited in a system. Thus, it's hard to obtain all the dynamic system states via PMUs. On the other hand, traditional sensors in a Supervisory Control and Data Acquisition (SCADA) system can broadly cover the system, though they only provide low-resolution measurements. In this article, we propose to utilize PMU and SCADA sensor data to recover the missing dynamic states in the power generation system. The problem has the following challenges based on unique properties of data: (1) Spatially, PMU and SCADA sensors have different locations. Thus, it's a must to approximate the data correlations to estimate the missing data accurately. (2) Temporally, the dynamic transitions in SCADA samples are scarce, urging efficient utilization of the SCADA data to approximate the dynamics. For challenge (1), we employ Deep Neural Networks (DNNs) with high capacities to capture spatial-temporal information to predict dynamic states. For challenge (2), we develop a new mechanism to utilize SCADA data efficiently. Specifically, we iteratively reuse the predicted dynamic states in the SCADA data to retrain the DNN model, gradually increasing the performance. The effectiveness of the proposed training procedure is theoretically verified via the framework of Expectation-Maximization (EM). Thus, our model to fuse heterogeneous data is termed Heterogeneous data Deep EM (Hd-Deep-EM). Finally, we demonstrate the high performance of the Hd-Deep-EM in diversified synthetic and realistic power systems.


I. INTRODUCTION
A POWER-GENERATION system integrates stochastic components and loads to facilitate clean and low-cost production and consumption of power, such as Photovoltaic (PV) power and wind power that highly depend on uncertain weather conditions.These components make the power system vulnerable to dynamic environmental events.To better capture system dynamics, Phasor Measurement Units (PMUs) are continually integrated into the power system to provide synchronized phasor measurements with 30-120 samples per second [1].This high resolution enables accurate monitoring of the dynamic states.
However, installing a single PMU is relatively expensive and costs around several thousands of dollars.According to the US Department of Energy [2], the cost of a single PMU in the United States ranges from $40,000 to $180,000.Meanwhile, the conventional measuring system of power systems, i.e., Remote Terminal Unit (RTU) based SCADA system, has been largely deployed over the last decade [3].As a comparison, the installation cost for an RTU sensor ranges from $200 to $5,000 [4].In addition, the expense of installing PMUs is not restricted to the equipment alone.Other costs include communication infrastructure, data management systems, and cybersecurity precautions to assure the security and dependability of the obtained data.These expenses may rapidly accumulate, making it difficult for utilities to adopt PMUs on a large scale [5], [6].Compared to PMUs, SCADA measurements have relatively low resolution and can't fully record system dynamics.For example, the SCADA system provides only 0.5-2 measurement samples per second [7].
Therefore, numerous studies focus on optimal PMU placement to balance the PMU costs and the benefits from PMUs (e.g., system observability).For example, [8] uses a dual search algorithm to achieve observability.[9] presents a way to quantify the benefits associated with the development of a Wide-Area Measurement System (WAMS) to reach an optimal PMU placement solution.However, in our article, we propose a brandnew perspective to interpolate missing dynamic information of SCADA measurements based on a few PMUs.Hence, with limited computational costs, SCADA sensors can be converted to "virtual PMUs" with the capacity to infer the intermediate missing values.Moreover, our DNN in Hd-Deep-EM can achieve real-time inference for the intermediate data since the forward computations of DNN, i.e., the DNN prediction, are very fast.The goal is achieved by combining the information of SCADA data and PMU measurements to accurately estimate the missing dynamic states between SCADA data, under the domain of Dynamic State Estimation (DSE).
DSE is an important topic in power systems with the growing uncertainties in the system from both the generation and the demand sides [10].DSE is challenging due to the complex dynamic transitions of system states.Existing methods can be categorized into the following groups.The first group utilizes PMU measurements and full system physical equations to estimate dynamic states.The process includes prediction to forecast the future system states and filtering to remove bad data and decrease errors based on current measurements, i.e., the so-called Kalman Filter (KF) [11], [12], [13].KF-based methods require strict assumptions, e.g., lossless network with voltage magnitude to be fixed at 1 p.u.To make the methods more appropriate for real-world systems, Extended Kalman Filter (EKF) [14], [15], [16] and Unscented Kalman Filter (UKF) [17], [18] are employed.However, these methods still require the system dynamic model.
Therefore, the second group of methods utilizes PMU data without system parameters.Specifically, they introduce Koopman operator-based KF (KKF), a data-driven approach with Maximum Likelihood Estimation (MLE) to estimate the system transition and observation matrices [19], [20], [21].In particular, Koopman operator can convert nonlinear dynamics to a proper linear approximation, making the computation easy and enabling the embedded MLE process.However, these methods still require certain statistical assumptions on the state distributions (e.g., Student's t-distribution).This may not be generalizable to diversified system conditions with the increasing penetration of renewable energy.Moreover, they still demand a certain number of PMUs to achieve system-level observability.
Then, the third group of work assumes limited PMUs with a large number of SCADA sensors across the system.Then, they try to estimate the missing values of SCADA data via methods like information fusion [22], Bayesian methods [23], and regression methods [24], [25], [26].These methods, however, are hard to capture complex spatial-temporal correlations of power system datasets.Additionally, they typically suffer overfitting issues since the SCADA data contain limited dynamic information.
In this article, we focus on the Deep Learning (DL) techniques that can effectively capture the complex spatial-temporal correlation of PMU and SCADA data.These methods utilize hierarchical feature learning to extract features spatially or temporally.To capture spatial correlations, methods like Convolutional Neural Networks (CNNs) [27], [28], [29] and Graph Neural Network (GNN) can employ square or graph-based kernels for convolution and extraction of local features.To obtain temporal correlations, Recursive Neural Networks (RNNs) [30], [31] and Long Short-Term Memory (LSTM) [32] employ gates to filter sequential information and learn useful patterns.In this article, we make use of the capacity of DNN to analyze spatial-temporal data for dynamic state estimation.However, DNN-based methods require relatively large datasets with diverse information.DNNs may overfit when the information is limited to uncover the dynamic states.Notably, such a scenario often happens for SCADA data due to the low resolution.Thus, it's quite challenging to build a good DNN model to predict the dynamic states of SCADA systems.
For the overfitting issues, we claim that they always exist no matter how we improve the DNN model.Specifically, the overfitting is due to the limited samples in the low-resolution RTU sensors due to the different sampling rates.If we only use the limited RTU data as output data in the training dataset, most of the dynamic information is missing.Therefore, DNN tends to overly fit the RTU samples without the motivation to explore how to estimate intermediate states.
This article addresses the system problem by explicitly introducing "more" training data with dynamic information.Specifically, the new training data is derived from the predicted hidden states and the corresponding input PMU data, containing rich dynamic information.Subsequently, the DNN can be trained with all the data to further improve the prediction.These two steps form an iterative way of improving the neural network model's performance.To elaborate on the effectiveness of the procedure, we show that the iterative process is essentially an Expectation-Maximization (EM) method with good convergence performance.Thus, we named our method Heterogeneous data Deep EM (Hd-Deep-EM).
EM algorithms impact numerous topics in power systems.They mostly target probabilistic load or power source monitoring [33], [34], [35], where EM helps to estimate the underlying distributions with the maximal likelihood.For example, [33], [34], [35] develop a Gaussian Mixture Model (GMM) to model the density function of the load/source, and utilize EM algorithm to find a suboptimal solution.Some other applications appear in PMU-based cyber attack detection [36] and joint impedance and topology estimation [37].For example, in [36], the objective becomes the log-likelihood of data with underlying cyber attacks.
There are also other studies to combine deep learning and EM algorithms for different domains.Specifically, in signal processing, [38] develops a generalized expectation maximization to estimate channel states and detect signals.[39] detects signals by leveraging the deep unfolding to represent the iterative EM algorithm and improve the detection accuracy.In computer vision, [40] utilizes online EM algorithm to improve the inference in the deep Bayesian network, achieving good image classification performances.[40] reconstructs images via EM network and employs the power of EM algorithm to tackle noises.In general, those methods can hardly be applied to power system PMU and SCADA data due to different resolutions.They also don't provide solid theoretical guarantees.In this article, we theoretically and experimentally clarify the effectiveness and efficiency of Hd-Deep-EM for dynamic hidden state estimation.
For the numerical verification, the Hd-Deep-EM method is tested extensively at various conditions on the synthetic 200and 500-bus systems datasets, where the loading conditions and PMU/SCADA sensor penetrations are diversified for extensive testing.To show the power of Hd-Deep-EM in facing the spatial and temporal difference, we compare the proposed Hd-Deep-EM method with shallow EM algorithm and EM algorithm with a Vanilla neural network.To show the power of Hd-Deep-EM in tackling the temporal trend from the SCADA measurements, we compare Hd-Deep-EM with the traditional neural network.The benchmark methods include EM and deep learning approaches, and Mean Square Error (MSE) is used for evaluating interpolation model accuracy.The results show that our proposed Hd-Deep-EM method can efficiently interpolate missing data to the SCADA measurement.
To summarize, we have the following contributions.(1) We formulate a data-driven problem to utilize synchrophasors from limited PMUs and widespread RTU sensor measurements to estimate the RTU data's dynamic states and interpolate the missing values.The goal is promising since we can utilize limited PMUs to boost the data quality of all RTU sensors.Additionally, the idea is practical due to the high data correlations within a system.(2) We propose our Hd-Deep-EM to solve the problem.Hd-Deep-EM has 3 strong properties to achieve high performance.First, Hd-Deep-EM employs Deep Neural Networks (DNNs) to approximate the correlation from PMUs to RTU sensors.While PMUs and RTU sensors locate at different places, the universal approximation power of DNNs enables Hd-Deep-EM to capture such correlations.Second, we utilize a window of PMU data to predict one sample of RTU sensors at a time slot.Therefore, the DNN in the Hd-Deep-EM can capture the temporal trends for the prediction.This also enables the prediction to be robust to noise, bad data, and the loss of synchronization.Third, the training of the DNN easily faces overfitting issues due to the limited data in RTU sensors with little dynamic information.Therefore, we propose an iterative process to utilize the predicted data in the RTU sensors to re-train the DNN, following an Expectation-Maximization (EM) manner.Thus, the strong theoretical guarantees from the EM algorithm support the convergence of Hd-Deep-EM.(3) We conduct comprehensive numerical validations on transmission grids and show Hd-Deep-EM has the best performance compared to other methods.
The rest of the article is organized as follows: Section II defines the problem.Section III proposes our Hd-Deep-EM.Section IV conduct experiments for baselines and Hd-Deep-EM and Section V concludes the article.

II. PROBLEM FORMULATION
We denote that N p samples are for d p numbers of PMUs and N s samples are for d s numbers of SCADA data.Then, let D p ∈ R N p ×d p represent the PMU data matrix and D s ∈ R N s ×d s for the SCADA data.Moreover, due to the difference in resolution among PMUs and SCADA data, we have N p N s .Then, we denote the missing data matrix as Z ∈ R (N p −N s )×d s .In the transmission system, it's reasonable to assume the SCADA sensors can achieve the full system observability [41], i.e., d s is comparable to the system node number.Our contribution is to use few PMUs to interpolate the data for a large number of SCADA sensors, achieving the dynamic state estimation via the dynamic information from PMUs.
We also describe the measurement types of PMUs and RTUs here.PMUs are highly accurate devices that can measure the phasor quantities of voltage and current, such as magnitude, phase angle, and frequency, in real-time.PMUs can also measure the rate of change of these parameters, which is known as the frequency deviation or ROCOF (Rate of Change of Frequency) [42].These measurements are used to monitor the stability of the power system and detect any disturbances or anomalies that may occur.RTUs, on the other hand, are used for remote monitoring and control of power system equipment such as transformers, switches, and breakers.They can measure a range of parameters, including voltage, current, power, and energy consumption [43].RTUs can also provide information on the status of equipment, such as whether it is on or off and any faults or errors that may occur.In general, the above raw data can be input to the DNN model.Then, with nonlinear activations like Rectified Linear Unit (ReLU) [44], the intermediate neurons after activations can represent the nonlinear features of the model.In Experiment, we mainly focus on the prediction of voltage magnitude as system states.However, other data have similar performance.
With the defined PMU and SCADA data flows, we need to process the raw data and formalize the training data to estimate the dynamic states.While one can build a one-to-one mapping from the PMU measurements to the SCADA data simultaneously, the mapping may not be robust if data on either side have outliers.Thus, to improve the robustness, we propose utilizing the PMU data of neighboring time slots as the input.Fig. 1 illustrates the process where we utilize a group of PMU data (i.e., the blue box) as input to estimate the SCADA sample (i.e., the green box).Naturally, if we input the intermediate PMU data that don't have the corresponding SCADA data, the prediction will be an estimate of the missing dynamic states of the SCADA data.
To predict the measurement states, mathematically, we employ a moving window to segment PMU streams based on the resolution of the SCADA data.
As shown in Fig. 1, for each timestamp of SCADA data, the PMUs data will extract a k × d p moving window that has a center of the target timestamp.Subsequently, each extracted moving window will be vectorized into an input vector.This procedure makes the processed PMUs and SCADA data have the same number of samples N s .Therefore, we have X p ∈ R N s ×(k×d p ) as the input PMU data matrix and D s ∈ R N s ×d s as the output SCADA data.In order to recover the Z with function f , we also apply the moving window to the intermediate PMU streams to predict the missing states Z. Specifically, we denote the PMU data matrix for recovering Z as Xp ∈ R (N p −N s )×(k×d p ) .
Finally, the strategy of this article is to learn the mapping rule The complete problem definition can be summarized as follows: r Problem: Deep expectation maximization for data interpo- lation.
r Given: a set of PMU-based samples with D p ∈ R N p ×d p as the high resolution data and D s ∈ R N s ×d s as the low resolution data, respectively.
r Output: a regression model to learn data mapping f be- tween PMU data and SCADA data and interpolate the missing value of the SCADA data.

III. PROPOSED MODEL
The design of the above mapping f can be diversified.However, existing works can not sufficiently capture the spatialtemporal correlations of PMU and SCADA data.In this section, we first illustrate that a well-designed DNN for the function f can efficiently handle the spatial-temporal correlations and estimate the dynamic states.However, the training of the DNN may suffer overfitting due to the scarce dynamic information in SCADA data.Then, we show the key is to resolve the issue is to completely make use of the hidden dynamic states to update the mapping.Such a hidden value estimation and parameter updating process lie in the domain of the Expectation-Maximization (EM) algorithm.

A. Expectation-Maximization Algorithm for Hidden State Estimation
EM algorithms are developed to tackle problems with latent variables, e.g., the hidden states in SCADA data streams.Specifically, the EM algorithm is an iterative process to separately estimate the statistics of latent variables and the corresponding parameters that determine the likelihood of the latent data.Thus, to apply EM algorithms to our problem, the first step is to identify what parameters should be used to determine the latent variables (i.e., the hidden states).
In general, one can categorize the parametric models into the generative and the discriminative models.The generative model estimates the joint distributions between the known and the unknown variables to generate new data.However, it's hard to choose an appropriate model to estimate hidden states.For example, GMM is usually used in most EM algorithms.However, the GMM model is not capable to represent the distribution of hidden states in power systems with complex spatial-temporal correlations.
Thus, in this article, we propose to utilize a discriminative mapping f to directly map from PMU data to SCADA data.The advantage of a discriminative design is to maintain the spatial-temporal correlations in the PMU data, thus boosting the approximation accuracy of the hidden states.Mathematically, we can write the EM steps as follows.
r E step: Estimate the expected values of the hidden states in SCADA data streams.Namely, input the corresponding PMU data to f and output hidden states.r M step: Update parameters of f for better estimation.
Obviously, the design of f should be carefully considered.In the following section, we display our design using deep learning techniques to achieve accurate dynamic state estimation.
Fig. 2 visualizes the process of the EM algorithms.To investigate the convergence of the EM algorithm, many studies are done and one can refer to [45] for more details.

B. Deep Neural Networks to Capture Spatial-Temporal Correlations
The core of our Hd-Deep-EM algorithm is the deep neural network (DNN) model, which is applied to approximate the data interpolation mapping between PMU to SCADA measurement.Intuitively, a DNN model can help capture the temporal-spatial correlations and learn useful features for the task.Specifically, the DNN model has multi-layer feed-forward neural network structure, which consists of typical three-level network architecture: one input layer, several hidden layers, and one output layer.Since the input data comes from PMU data X p and the output data comes from the SCADA data D s , we denote x as the variable for X p , y as the truth variable for D s , and ŷ as the predicted variable using the DNN.Then, the DNN function can be written as follows.
where N l denotes the number of layers for the neural network, h 0 denotes the input matrix of the network, h i is the output matrix of the i-th hidden layer, b i is the bias term, and g is the activation function for each hidden layer.We elaborate more on the model of the DNN based on the following perspectives.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply. 1) Structure and Activations: DNN has a hierarchical structure to gradually extract non-linear features for the task.Generally, the i-th hidden layer models the interactions between the (i − 1)-th features by introducing a connection weight matrix W i and a bias vector b i .Then, the result is activated via a non-linear activation function g(•).The choice of g can vary, and some common options include Rectified Linear Unit (ReLU), sigmoid function, and tangent function, etc. Notably, ReLU is the most commonly used because ReLU-based DNN can efficiently tackle the gradient vanishing problem and has a high efficiency to compute [46].
2) Loss Function: After defining the structure and the inner activation functions, we need to train the DNN with a loss function.For this regression problem, the Mean Square Error (MSE) can smoothly measure the difference between the predicted output ŷ and the true output y: where W = {W i } i is the set of layer-wise weights of the DNN, and B = {b i } i is the set of bias terms.
3) Training Process: With the defined loss function, training is conducted via minimizing the loss with given input/output data.The minimization can be written as: To solve the optimization, many algorithms can be utilized.For example, one can utilize either Stochastic Gradient Descent (SGD) [47] or the Adam [48] to train the model.The defined DNN has the ability to learn the local features and determine the spatial-temporal correlations.To well-train the DNN model to approximate the dynamic states, we require the data to contain enough dynamic information.Nevertheless, the SCADA data has little dynamic information, decreasing the DNN model performance.

C. Deep Expectation Maximization to Boost the Training of the DNN
To resolve SCADA data scarcity, the key is to introduce more SCADA data with system dynamic information.As the dynamic systems states of the SCADA system are hidden, the problem can be decomposed into (1) estimating hidden values and ( 2 In general, the EM algorithm tries to solve an optimization with missing quantities.For the optimization to train the DNN in (4), the hidden dynamic states of the SCADA data are introduced.Thus, the optimization can be rewritten as: where z represents the variable of missing values in the SCADA data.Namely, Z = {z i } . Clearly, z is also the output of the DNN f given the input PMU data x at the corresponding time.However, the unknown knowledge of z makes the direct optimization of (5) impossible.
Therefore, the EM algorithm solves (5) in an iterative manner.In the Expectation (E) step, the algorithm tries to approximate the expected values of z using the current model and data.Then, in the Maximization (M) step, the model tries to maximize the profit, i.e., minimize the loss, to obtain a better model.More specifically, for the k-th iteration, EM algorithm has the following formulations.Based on the definition of z, the expected estimation of z k can be written as: where W k and B k represent the parameters of the DNN at the k-th iteration.The input data x comes from the intermediate PMU matrix Xp that corresponds to the hidden timestamps of the SCADA system.Then, the estimated Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.r M step: Retrain the DNN model to minimize the loss: The obtained parameters W k+1 and B k+1 can formalize the (k + 1)-th DNN model.Since z k is known values, (7) can be conveniently trained using SGD or Adam algorithms.All quantities of variable z k can formulate the missing state matrix Z.For different iterations, the values of Z can change.Thus, we also add the superscript and utilize Z k to represent the predicted missing values in the k th iteration, as shown in Fig. 3. Finally, we summarize the complete algorithm for the Hd-Deep-EM model in Algorithm 1.

D. Double Expectation Maximization Algorithms for Noisy Data
For realistic datasets, noise may lower the performance of the state estimation.To tackle noise data, we modify our Hd-Deep-EM algorithm for better estimation.Specifically, we can assume the noises are independent of measurements and can be easily modeled via a distribution model like GMM.Then, a natural idea is to decompose the model for estimating hidden states and noises.Since both data can be estimated via EM algorithms, we modify our Hd-Deep-EM and propose a Double Hd-Deep-EM algorithms for noisy data.
Specifically, in each iteration, we utilize the DNN model f to represent the mapping from input PMU data to the output.In the meantime, we propose to leverage another Gaussian model to estimate the noises, formulating the double EM algorithms (Fig 3).Mathematically, the double EM steps can be written as follows: r E1 step: Estimate the missing values of SCADA data z k using (6).r E2 step: Compute the noise data Note that for known SCADA data, we utilize y − n k+1 to take place of y to achieve the denoising.In general, the algorithm can be summarized as Algorithm 2. We denote our algorithm as the Double Hd-Deep-EM algorithm.

E. Theoretical Support of Double Hd-Deep-EM Algorithm for Denoising
In the Double Hd-Deep-EM algorithm, we estimate parameters for both the DNN model and the underlying distribution of noises, which requires certain guarantees for convergence to the true noise distribution.The EM algorithm can provide the convergence theorem which states the decreasing distance between the true likelihood of the noise distribution and the estimated likelihood using the data in (8).Specifically, we modify the convergence theorem from [45] and propose the following theorem: Theorem 1: Let θ denote the parameters of the noise distribution.Then, let {θ t } denote the parameter sequence in the EM iteration and t represents the iteration index.If θ * is a limit point of {θ t }, then (1) θ * is a stationary point of the likelihood function of the noise data in (8), and (2) the sequence {l(θ t )} is non-decreasing and converges to l(θ * ), where l(θ) represents the likelihood of the noise distribution.
The proofs of Theorem 1 can be found in [45].Basically, Theorem 1 justifies the convergent dynamics of the EM algorithm.More specifically, Theorem 1 proves that the convergent point θ exists for the parameters of the true distribution of the noises.Then, Theorem 1 illustrates the convergence to θ using the EM algorithm.
The convergence is further supported by numerical validations where we investigate the proposed Hd-Deep-EM algorithm Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

A. Dataset Description
In the experiment, we employ synthetic datasets from 200and 500-bus systems [49], [50].For data preparation, we employ a commercial-grade simulator, Positive Sequence Load Flow (PSLF) to simulate PMU data with 60 samples per second.To diversify datasets, we vary the loading conditions during the simulation, which leads to different system dynamics.Specifically, we can change the initial condition of the load flow in PSLF by different loading curves.We employ the load profiles in real-world PJM Interconnection [51] as different initialization points for the simulated systems.For example, Fig. 4 shows 4 loading curves from the PJM data.We randomly select their points at different time slots for the initialization, which boost the diversity of our experiments.Moreover, we introduce three different fault events to our data set to validate the robustness of our algorithm.As shown in Fig. 5, we visualize the event using Voltage Magnitude (VM) data obtained from PSLF.Further, we assume PMUs only locate at partial buses, with a penetration η = 0.05.For the rest data, we assume they are SCADA data with a sampling process.We obtain the SCADA data with 0.5 samples per second.
Totally, we have over 40 files of 10-second simulated data for each test system.To obtain X p from these time series, we utilize the moving window with the length to be 0.  20×10) for the 200-bus system and D s ∈ R 20×475 and X p ∈ R 20×(20×25) for the 500-bus system.They are matrices of PMUs and SCADA measurements data for one window.Further, we have Z ∈ R 580×190 X ∈ R 580×200 for the 200-bus system and Z ∈ R 580×475 X ∈ R 580×500 for the 500-bus system.They are total extracted matrices for missing SCADA data and PMU data to recover Z.

B. Benchmark Method
As benchmarks for the effectiveness of the proposed model, we employ three distinct methods.The details of these methods are as follows.
r EM algorithm + Linear Regression (LR): LR fits a linear model with coefficients to minimize the error [52].However, the LR model does not consider the spatial differences between PMU and SCADA data.During the testing, the hyper-parameters for all models are fine-tuned to achieve the best accuracy.Especially, we report the detailed results with respect to Mean Square Error (MSE) between the truth and predict Z matrix (i.e., missing dynamic states of the SCADA system) for all methods.
The true Z matrix is the missing value not reported in the RTU sensors.They are in fact unknown to us.Therefore, we propose the following procedures to prepare data for validation.
r Simulation: We run a business-level simulator, Positive Se- quence Load Flow (PSLF), to simulate PMU data with 60 samples per second for 200-bus and 500-bus systems [49], [50].For each node, we can obtain the PMU streams.
r Down-Sampling: We selected a limited number of buses and assumed they are placed with PMUs.Then, the rest buses should be equipped with SCADA sensors.We achieve this target by doing down-sampling for the streams of 60 samples per second.We extract 1 out of 30 samples in the streams as the blue points.These samples are treated as the measurements of the SCADA sensors.Correspondingly, the unselected data can formulate the true Z matrix.Notably, Z is only used in testing to evaluate the prediction error.

C. Deep Model is Better Than Linear Model
In this subsection, we evaluate the effectiveness of our Hd-Deep-EM algorithm in tackling the spatial difference by comparing our Hd-Deep-EM algorithm to LR + EM.The results show that our model performs better than benchmarks.Specifically, we report the prediction performance of simulated data as follows.
Figs. 6 and 7 demonstrates the performances for the two methods in two different test system.The y-axis is the MSE error, and the x-axis represents the number of iterations during training.From the above figures, we find that our Hd-Deep-EM algorithm is lower than the MSE error of 0.05 and 0.4, compared to shallow EM+ LR in 200-and 500-bus test system after convergence.The better performance of the Hd-Deep-EM algorithm shows that the deep learning model can be better than the LR model in our complex scenario.

D. Consider Temporal Correlations Increase DNN Performance
In this subsection, we evaluate the effectiveness of our Hd-Deep-EM algorithm in integrating the temporal correlations.Specifically, we compare Hd-Deep-EM0 and Double Hd-Deep-EM.Due to the window-based segmentation, our Fig. 7. Testing error for the 500-bus system.Double Hd-Deep-EM can successfully integrate neighboring measurements in the temporal domain to estimate the dynamic states at one time slot.However, Hd-Deep-EM0 doesn't have this treatment.Then, we report the results of simulated data as follows.
Figs. 6 and 7 demonstrate the performances for the two methods.From the above figures, we find that our Double Hd-Deep-EM has an average of 0.05 and 0.06 testing MSE error for both test systems after convergence.In comparison, Hd-Deep-EM0 has an error of 0.07 and 0.2.The performance of these two methods can be explained as follow.Window segmentation can capture the data before and after the SCADA time stamp, which can benefit from finding the temporal trending of the PMU data and avoiding the negative impacts of the outliers.The better performance of the Hd-Deep-EM algorithm with window segmented data shows that our proposed deep learning model can easily capture the temporal trend for PMU data in our complex scenario.Finally, we observe that in Fig. 7, DNN model can perform better than Hd-Deep-EM0 that doesn't consider the temporal correlations.This verifies our claims that considering proper temporal correlations in the DNN and Hd-Deep-EM1 (noiseless) can improve the performance of estimating dynamic hidden states.

E. EM Procedure Provides Better Training of the DNN
In this subsection, we evaluate the effectiveness of introducing the EM procedure by comparing Hd-Deep-EM1 and a DNN Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.model.First, we show a case study of how the EM algorithm can improve the training of DNN.We consider a line trip in Figs. 8 and 9. Specifically, Fig. 8 shows the iterative process and our EM framework helps the DNN to make more use of the dynamic data and enable the DNN prediction to gradually converge to the true blue curve.Fig. 9 demonstrates the final comparison of our Hd-Deep-EM and vanilla DNN.We can find that only when we combine EM algorithm with DNN model can we achieve a generalizable model to accurately predict the dynamic data for the RTU sensors.
Then, we report the statistical results of simulated data as follows.Figs. 6 and

F. Window-Based Prediction Enable Certain Robustness to Non-Synchronization
We claim that our method has certain robustness to the nonsynchronization issues.The main reason is that we utilize a window of PMU data, rather than single time-slot data, as the input to the DNN model in Hd-Deep-EM.Therefore, our DNN model utilizes an interval of PMU data to predict one point in the SCADA data streams.Due to the high temporal correlations, our DNN contains certain tolerance for the mismatch.If the non-synchronization doesn't exceed the range of the window, DNN can deliver a good estimation.To support our claim, we conduct numerical validations for the 200-bus case with 10 PMUs, shown in Table I.We find that as long as the delay time between PMU and SCADA sensors is less than 1 s, our Hd-Deep-EM can always attain good performance.

G. Double Hd-Deep-EM Provides Better Performance in Tackling Noise
In this subsection, we evaluate the effectiveness of interpolating noisy data by comparing Double Hd-Deep-EM among all three benchmark models and our Hd-Deep-EM1 method.In the experiment, we introduce noise into our dataset to validate the effectiveness in tackling noisy data.Specifically, we report the results of simulated data as follows.As shown in Figs. 10  and 11, we find that our Double Hd-Deep-EM has an average testing error reduction of 0.04, 0.02, 0.02 and 0.01, compared to EM+LR, Hd-Deep-EM0, DNN method and our proposed Hd-Deep-EM1 for the 200 bus system.Furthermore, Double Hd-Deep-EM has an average testing error reduction of 0.01, 0.02, 0.03 and 0.005, compared to bench mark methods and Hd-Deep-EM1 for the 500 bus system.The results show that our Double Hd-Deep-EM has the best performance of robustness to noisy data.

H. Hd-Deep-EM Has Comparable Testing Time
In this subsection, we compare the training and testing time for different methods, shown in Table II.The results show that the EM framework will generally increase the training Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.time even when we utilize the same iteration numbers.This is reasonable because vanilla DNN will only utilize RTU samples, i.e., the blue dots in Fig. 9.However, the EM framework can bring more training samples, making the training process more time-consuming.However, the training time is still affordable for the offline process.Moreover, if we compare the testing time, we find that all Machine Learning (ML) methods have a very quick inference time, which is due to the fast forward computation of the ML model.

I. Sensitivity Analysis With Respect to Different PMU Locations
An essential point is about the placement and number of Phasor Measurement Units (PMUs) compared to Supervisory   Then, we add the following experiments on the Illinois 200bus system to check how the locations of PMUs can affect the final results.Specifically, in our experiments in the manuscript, we conduct uniform selections for 10 PMUs over the complete grids for 10 times, making PMUs widely located in the grid.In this revised version, we propose a biased selection for 10 PMUs, shown in Fig. 12.We define 5 different regions in the grid with 10 PMU placement and make the rest nodes placed with SCADA sensors to monitor the whole grid.With obtained data, we conduct different methods and report the results of MSE error in Table III.
First, we observe that for all biased selections of PMUs, all methods perform worse than the uniform selection.This is because PMU measurements have higher correlations to the neighboring SCADA data.Thus, if PMUs can be placed in a wide range, their information can be better utilized to interpolate the SCADA data of the complete grid.Second, we find that Region 4 has relatively better performance since it locates in the central area of the grid, which makes the PMUs better to monitor the whole grid.Thirdly, Region 2 has the third good performance.This is because we conduct 4 out of 10 faults around Region 2. In general, if PMUs are widely placed or close to dynamic events, the performance will be better.

J. Sensitivity Analysis With Respect to Noise Levels
In addition, we further evaluate Double Hd-Deep-EM performance with other benchmarks models in 3 different level of noise injection to the simulated data.To quantify the level of the noise, we utilize the Signal-to-Noise Ratio (SNR) which compares the level of a desired signal to the level of noise.Signal-to-Noise  Ratio can be defined as the ratio of signal power to noise power.The result of MSE error is shown in the tables below.
According to the performance in Tables IV and V, we find that our Double Hd-Deep-EM has an average testing error reduction over different noise level of 0.013, 0.017, 0.014 and 0.042, compared to Hd-Deep-EM1, EM+LR, Hd-Deep-EM0 and DNN method for 200 bus system.And Double Hd-Deep-EM has an average testing error reduction over different noise level of 0.004, 0.031, 0.005 and 0.043, compared to bench mark methods for 500 bus system.

K. Event Identification Accuracy Comparison With Different Dataset
To evaluate the effectiveness of interpolation, we propose to compare the interpolated SCADA data, original SCADA data and PMU data for the data-driven task: identifying system events.Specifically, the same event identification algorithm is performed among the three datasets.We utilize the 5 different popular Supervised Learning methods to evaluate the results.The selected methods are K Nearest Neighbor (KNN), Naive Bayes (NB), Decision Tree (DT), Logistic Regression (LR), and Support Vector Machine (SVM).As shown in Fig. 13, we find that our interpolated data can best improve the different learning methods, with KNN being the best.Averagely, our merged dataset has an average increase of 7.92%, and 36.35%,compared to the original PMU and SCADA datasets.

L. Investigations of Different Activation Functions and Initialization Methods
For the DNN models, the choices of activation functions and initialization methods for the weights are biases are important.Therefore, in this subsection, we investigate the performance of different designs.For the effectiveness of activation functions, we conduct the following experiments, as shown in Table VI.We find that the changes in the final results are insignificant   for different hybrid activation functions [54] compared to ReLU functions.Thus, it's reasonable to utilize ReLU as the activation function given its lowest complexity.
For the initialization methods, our experiments before utilize the He initialization method [55].However, we agree that different initialization may cause different results.Some other initialization methods in Pytorch include Uniform Initialization, Normal Initialization, Xavier Initialization, etc [56].To further quantify the impacts of these initialization methods, we conduct the tests for the 200-bus case with 10 PMUs.The result is shown in Table VII.We find that although we change different initialization for the weights and biases, the results are similar.

V. CONCLUSION
The increasing integration of renewable energy brings risks of system dynamic events.To achieve better monitoring of system dynamics, Phasor Measurement Units (PMUs) are placed to yield high-resolution data files.However, PMU units are limited in a system due to the high cost.For regions without PMUs, we propose to estimate the dynamic states using PMU and SCADA data.The problem is challenging due to the spatial differences between PMU and SCADA data, as well as the temporal scarcity of the SCADA measurements.To solve these issues, we employ Deep Neural Networks (DNNs) to capture the spatial-temporal correlations efficiently.The DNN can be trained to map from PMU data to SCADA data, bringing a reasonable estimate of the missing measurements of SCADA data.However, the lack of dynamic information in the SCADA data disables a well-trained DNN.
Thus, we propose to utilize the estimated data to retrain the DNN model and build the complete algorithm in an iterative approach of estimation and retraining.Such a process can be explained under the framework of Expectation-Maximization (EM) algorithm.Thus, we name our method Hd-Deep-EM.Subsequently, we improve our Hd-Deep-EM in realistic applications by considering the noises.Specifically, we propose to utilize another EM algorithm to estimate the distribution of the noise data, bringing a Double Hd-Deep-EM algorithm.To validate the Hd-Deep-EM model, we utilize diversified synthetic and realworld datasets.All results show that Hd-Deep-EM and Double Hd-Deep-EM perform better than other existing methods.

Fig. 1 .
Fig. 1.Moving window-based segmentation procedure to process the PMU streams.
) updating the model parameters as shown in Fig. 3. Mathematically, this iterative procedure is an Expectation-Maximization (EM) algorithm.This subsection derives the Hd-Deep-EM algorithm from training the DNN model, leading to the so-called Hd-Deep-EM model.

r
E step: Estimate the missing values of SCADA data z k .

Algorithm 1 :
Hd-Deep-EM Algorithm.Input: PMU data matrix X p , SCADA data matrix D s , PMU data matrix for recovering data Xp .Hyper-parameters: number of iterations K and DNN-related parameters like batch size and learning rate.Output: Missing data matrix Z for the dynamic states.1: Initial parameters of the DNN f .2: for k = 1 to Kdo 3: E step: Use (6) to estimate the k-th dynamic states.4: M step: Optimize (7) to update parameters of the k-th DNN.5: end for values of z k can be added to train the (k + 1)-th DNN model, which improves the training with more accurate dynamic information in z k .Thus, the training process is the M step, shown as follows.

r
EM algorithm + Vanilla Deep Neural Network: This method considers a Hd-Deep-EM framework with a vanilla DNN.Vanilla DNN does not take temporal correlations into account.In particular, window-based segmentation is configured with k = 1 and does not consider neighboring data for training.To differentiate from the suggested model, this method is referred to as Hd-Deep-EM0 while the proposed model is referred to as Hd-Deep-EM1.In addition, the proposed approach with noise consideration is referred to as the Double Hd-Deep-EM model.

r
Deep Neural Network (DNN) [53]: DNN is extracted from the Hd-Deep-EM to directly train the mapping without EM algorithm.DNN can utilize spatial and temporal correlations effectively.However, the lack of dynamic information in SCADA data may hinder the DNN's effectiveness.In general, by comparing the testing accuracy of the proposed model and the benchmark models, we can evaluate the effectiveness of Hd-Deep-EM.As claimed in the proposed model, our Hd-Deep-EM can (1) capture the complex spatial correlations between PMU and SCADA data, (2) incorporate temporal correlations for a robust estimator, and (3) use the EM framework to iteratively incorporate the temporal dynamics of the SCADA data for better DNN training.Especially compare our Double Hd-Deep-EM with the LR+ EM to illustrate the high representational power of the DNN to capture spatial correlations.Then, we compare Double Hd-Deep-EM with Hd-Deep-EM0 to understand to impacts of considering temporal correlations to estimate the hidden states.Finally, we compare Double Hd-Deep-EM with DNN to illustrate the design of the EM framework.
7 demonstrate the performances for the two methods.We find that our Double Hd-Deep-EM has an average testing error reduction of 0.01 and 0.02, compared to the DNN method.The reason, when training the DNN model, TABLE I TESTING ERROR FOR DIFFERENT DELAY TIME the output SCADA data has limited dynamic information.Thus, direct training can't guarantee the DNN model performance to predict dynamic states.On the other hand, our Hd-Deep-EM can iteratively predict the dynamic states and reuse the predicted results for training, leading to a better training procedure to incorporate dynamic information.

Fig. 10 .
Fig. 10.Testing error for the 200-bus system with noisy data.

Fig. 11 .
Fig. 11.Testing error for the 500-bus system with noisy data.
Control and Data Acquisition (SCADA) sensors.Specifically, the locations of PMUs and SCADA sensors can largely influence on the performance of Deep Learning (DL) algorithms in power systems.
) Update the parameters of Gaussian model using n k .Then, obtain n k+1 using the Gaussian model.
r M2 step: Update parameters of DNN model f via:

TABLE II THE
TRAINING AND TESTING TIME (S) FOR DIFFERENT DATA INTERPOLATION METHODS FOR THE SCENARIO ON THE 200-BUS SYSTEM WITH 1000 ITERATIONS

TABLE III TESTING
ERROR FOR DIFFERENT PMU PLACEMENT TEST CASES

TABLE IV THE
MSE FOR DIFFERENT DATA INTERPOLATION METHODS IN DIFFERENT NOISE LEVEL FOR ILLINOIS 200-BUS SYSTEMS

TABLE V THE
MSE FOR DIFFERENT DATA INTERPOLATION METHODS IN DIFFERENT NOISE LEVEL FOR SOUTH CAROLINA 500-BUS SYSTEMS

TABLE VI TESTING
ERROR FOR DIFFERENT ACTIVATION FUNCTION

TABLE VII TESTING
ERROR FOR DIFFERENT INITIALIZATION FUNCTION