A Method for Cleaning Power Grid Operation Data Based on Spatiotemporal Correlation Constraints

Bad data in a power system fail to accurately reflect its operating state. Cleaning bad data is thus necessary for data-driven situational analysis of the power system. Methods that identify and modify bad data based on state estimation may fail to converge in iterative calculations and yield poor results when the errors are significant. State estimation also involves non-linear computation, which incurs a significant computational overhead and makes it difficult to handle large volumes of data efficiently. To solve these problems, this paper proposes a method for detecting and identifying bad data based on spatiotemporal correlation constraints. In the spatial domain, we establish the equivalent power balance condition from the topology of the network. In the temporal domain, we establish data constraints based on the similarity of data among time sections. We then check the data against the combined spatiotemporal correlation constraints. The proposed method was applied to an IEEE 14-bus system and a provincial grid, and the results show that it can reliably detect and modify bad data and thereby significantly improve data quality. The proposed approach avoids non-linear calculations, is computationally efficient, and can quickly detect and modify bad data.

technology for advanced applications in energy management systems [9], [10], and state estimation has come a long way after years of development [11]. Given the topology of the entire network and the line parameters for a single time section, redundant measurements are used to solve for the estimated data by least-squares estimation, which minimizes the error. Bad data are then detected based on the difference between the measured and the estimated data [12]-[14]. While state estimation has contributed significantly to the safe operation of power systems, it has a few limitations:
1) State estimation is applicable only to a single time section, and does not reflect the constraints and connections among multiple sections.
2) The redundancy equations for state estimation increase the computational overhead, making it unsuitable for handling large amounts of data.
3) State estimation requires complete parameter information (network topology and line parameters). Incorrect information can lead to estimation errors.
4) Significant errors in the data may cause iterative calculations to fail to converge, which negatively impacts the identification of bad data.
Some researchers have developed time series and clustering techniques combined with data mining [15]-[17], but such work is limited to specific application scenarios and concrete goals. With the development of artificial intelligence (AI), a large number of AI-based methods for bad data detection have been developed. Some researchers use AI methods such as deep learning (DL) and generative adversarial networks (GAN) to detect the presence of bad data in a sample. The principle is to learn the distribution characteristics of the data and then identify as bad those data that deviate from them. Within their target scenarios, these methods have achieved high accuracy in identifying bad data [18], [19], but their recognition performance varies over different ranges, and they cannot guarantee consistent detection over different cycles. Owing to their computational complexity, they also cannot meet the processing requirements of online time-series data.
Traditional data mining-based methods to analyze the operational behavior of systems necessitate mechanism modeling [20], and computational complexity is a bottleneck for them when rapidly processing large volumes of data. Big data-based techniques offer an alternative to traditional modeling analysis for solving these problems. They obtain information directly and efficiently by using correlations in the data, without relying on modeling. Power grid operational data feature spatial and temporal correlations [21]. Spatially, the components of the physical grid are topologically connected, and these connections give rise to power constraints. Temporally, there are similarities in the data among sections. Nevertheless, such correlations have not been sufficiently exploited in traditional methods.
This paper draws on ideas from big data. We use the correlations in the data that follow from the spatiotemporal constraints of system operation, and organize them into constraints on the data. Based on these constraints, we propose a method to identify and clean bad data. The principle of the method is to directly detect and modify bad data by combining the equivalent power balance relationship reflected by the equivalent topology with the power constraint relationship between time sections. We applied the proposed method to a simulation system and a power grid. The simulation-based examples show that it is feasible, can identify bad data, and is computationally efficient. For bad data containing significant errors, the proposed method is more effective than traditional methods in modifying them. Experiments on the grid showed a substantial improvement in the quality of the dataset after data cleaning.

A. RELATIONSHIP BETWEEN SYSTEM STATES AND DATA
Power systems are complex artificial energy systems that operate following the law of conservation of energy. Because the various components of the power system network are connected, the operation of each component also obeys energy conservation.
The operating status of the power system cannot be obtained directly, and we can perceive its operating state only by observing measured data. Therefore, there is a causal relationship between the operating state of the system and the operating data. If the power balance is satisfied during operation, the measured data, as a medium that reflects the power system, should also satisfy the power balance under topological constraints.

B. TEMPORAL AND SPATIAL CONSTRAINT RELATIONSHIPS IN POWER SYSTEM OPERATION
The power system operates in compliance with physical constraints and the law of energy conservation at all times, as illustrated in Figure 1.
Topological relationships hold among the various components of the grid. For a normally operating grid, the node-injected power $P^{\mathrm{real}}_{ii}(t_k)$ and line power $P^{\mathrm{real}}_{ij}(t_k)$ satisfy the following rules:
For a node, the input and output powers satisfy the power balance. The sum of the actual powers associated with the node is zero:
$$P^{\mathrm{real}}_{ii}(t_k) + \sum_{j} P^{\mathrm{real}}_{ij}(t_k) = 0 \tag{1}$$
where $j$ ranges over the nodes connected to node $i$.
For a line, the sum of the actual powers at both ends is the power loss in the branch:
$$P^{\mathrm{real}}_{ij}(t_k) + P^{\mathrm{real}}_{ji}(t_k) = P_{\mathrm{loss}ij}(t_k) \tag{2}$$
224742 VOLUME 8, 2020
The operation of the power system is continuous in time. In general, under normal operation, there are no significant, abrupt differences in the operational data before and after a given time section.

C. RELATIONSHIP BETWEEN BAD DATA AND CONSTRAINTS
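The measured grid data $P^{\mathrm{mes}}(t_k)$ may be mixed with an error $\delta(t_k)$, causing them to deviate from the actual data $P^{\mathrm{real}}(t_k)$. The relationship among the measured data, actual data, and errors in the line power and node-injected power is shown in (3)-(5):
$$P^{\mathrm{mes}}(t_k) = P^{\mathrm{real}}(t_k) + \delta(t_k) \tag{3}$$
$$P^{\mathrm{mes}}_{ii}(t_k) = P^{\mathrm{real}}_{ii}(t_k) + \delta_{ii}(t_k) \tag{4}$$
$$P^{\mathrm{mes}}_{ij}(t_k) = P^{\mathrm{real}}_{ij}(t_k) + \delta_{ij}(t_k) \tag{5}$$
where $P^{\mathrm{mes}}_{ii}(t_k)$ and $P^{\mathrm{mes}}_{ij}(t_k)$ are the measured node-injection and branch powers, respectively, $P^{\mathrm{real}}_{ii}(t_k)$ and $P^{\mathrm{real}}_{ij}(t_k)$ are their actual values, and $\delta_{ii}(t_k)$ and $\delta_{ij}(t_k)$ are the errors in the node-injection and branch powers, respectively.
The sum of node-related powers for the measured data is shown in (6).
$$P^{\mathrm{mes}}_{ii}(t_k) + \sum_{j} P^{\mathrm{mes}}_{ij}(t_k) = P^{\mathrm{real}}_{ii}(t_k) + \sum_{j} P^{\mathrm{real}}_{ij}(t_k) + \delta_{ii}(t_k) + \sum_{j} \delta_{ij}(t_k) \tag{6}$$
Because the actual powers associated with a node sum to zero, (6) reduces to (7):
$$P^{\mathrm{mes}}_{ii}(t_k) + \sum_{j} P^{\mathrm{mes}}_{ij}(t_k) = \delta_{ii}(t_k) + \sum_{j} \delta_{ij}(t_k) \tag{7}$$
We discuss the following cases:
1) According to (7), when $P^{\mathrm{mes}}_{ii}(t_k) + \sum_{j} P^{\mathrm{mes}}_{ij}(t_k) \neq 0$, the relevant measurements do not satisfy the physical constraint, and thus $\delta_{ii}(t_k) + \sum_{j} \delta_{ij}(t_k) \neq 0$. It follows that bad data must exist.
2) According to (7), when $P^{\mathrm{mes}}_{ii}(t_k) + \sum_{j} P^{\mathrm{mes}}_{ij}(t_k) \approx 0$, the relevant measurements satisfy the physical constraint. Excluding the possibility that the errors of correlated measurements cancel one another, it follows that there are no bad data.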
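The node check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the tolerance `tol_mw`, and the sign convention (injection and line powers signed so that they sum to zero at a balanced node) are hypothetical and are not taken from the paper.

```python
from typing import List


def node_residual(p_inj: float, p_lines: List[float]) -> float:
    """Sum of the measured injection and line powers at one node (MW).

    Assumption: powers are signed so that a balanced node sums to zero.
    By the node balance, this residual equals the sum of measurement errors.
    """
    return p_inj + sum(p_lines)


def node_has_bad_data(p_inj: float, p_lines: List[float], tol_mw: float = 1.0) -> bool:
    """Flag the node when the residual exceeds a (hypothetical) tolerance."""
    return abs(node_residual(p_inj, p_lines)) > tol_mw


# Toy data: 50 MW injected, two outgoing lines.
clean = node_has_bad_data(50.0, [-30.0, -20.0])    # residual is 0 MW
corrupt = node_has_bad_data(50.0, [-30.0, -5.0])   # residual is 15 MW
print(clean, corrupt)
```

A nonzero residual only shows that some measurement at the node is bad; it does not say which one, which is why the later sections extend the constraint set.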
For line power, both (2) and (3) should be satisfied between the measured and the real data.
We can obtain (9) from (3) and (8):
$$P^{\mathrm{mes}}_{ij}(t_k) + P^{\mathrm{mes}}_{ji}(t_k) = P_{\mathrm{loss}ij}(t_k) + \delta_{ij}(t_k) + \delta_{ji}(t_k) \tag{9}$$
where $P^{\mathrm{mes}}_{ii}(t_k)$ and $P^{\mathrm{mes}}_{ij}(t_k)$ are the node-injection power and that measured at the branch, respectively, and $\delta_{ii}(t_k)$ and $\delta_{ij}(t_k)$ are their respective errors.
Theoretically, it must be the case that $P_{\mathrm{loss}ij}(t_k) > 0$. We discuss the following cases:
1) According to (9), when $P^{\mathrm{mes}}$ […] The relevant measurements do not satisfy the physical constraints, and thus […] 0. It follows that bad data must exist.
2) According to (9), when $P^{\mathrm{mes}}_{ij}(t_k) + P^{\mathrm{mes}}_{ji}(t_k) \approx 0$ holds, the physical constraint is satisfied in the vast majority of cases, ignoring the case in which the errors of correlated measurements cancel one another. It follows that there are no bad data.
3) According to (9), when $P^{\mathrm{mes}}$ […] 0. It follows that there are no bad data.
In continuous stable operation, no significant difference is observed in the operating state of the power system between adjacent time sections. The real data should thus satisfy:
$$\left| P^{\mathrm{real}}(t_k) - P^{\mathrm{real}}(t_{k+1}) \right| \le \varepsilon_P \tag{10}$$
If the measured data also satisfy the above formula, this means that $|\delta(t_k) - \delta(t_{k+1})| \le \varepsilon_P$; otherwise, bad data must exist. We define the rate of change in power between two adjacent sections as in (11).
We calculated the rate of change in power between adjacent time sections for data from the Jilin network. The results showed that the percentage of samples with a rate of change within 5% was as high as 80.6%. The fluctuation in power between adjacent sections was slight. This result is valuable as a reference for judging the quality of data at a given moment. We can set a threshold for the rate of change in power, and use the resulting temporal constraint on the power data as an additional condition for detecting bad data.
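As an illustration of the temporal constraint, the following Python sketch computes a rate of change between adjacent sections and the share of samples within a limit. The exact definition in (11) is not reproduced here; the relative-change formula, the function names, and the default 5% limit used as a threshold are assumptions made for this example only.

```python
from typing import List


def rate_of_change(p_now: float, p_next: float) -> float:
    """Relative change between two adjacent sections (assumed definition)."""
    if p_now == 0.0:
        return 0.0 if p_next == 0.0 else float("inf")
    return abs(p_next - p_now) / abs(p_now)


def share_within(series: List[float], limit: float = 0.05) -> float:
    """Fraction of adjacent pairs whose rate of change is within `limit`."""
    if len(series) < 2:
        raise ValueError("need at least two time sections")
    rates = [rate_of_change(a, b) for a, b in zip(series, series[1:])]
    return sum(r <= limit for r in rates) / len(rates)


def flag_jumps(series: List[float], limit: float = 0.05) -> List[int]:
    """Indices of sections whose change from the previous section exceeds `limit`."""
    return [
        k + 1
        for k, (a, b) in enumerate(zip(series, series[1:]))
        if rate_of_change(a, b) > limit
    ]


series_mw = [100.0, 102.0, 101.0, 130.0, 131.0]  # toy line power, MW
print(share_within(series_mw))  # 3 of the 4 adjacent pairs are within 5%
print(flag_jumps(series_mw))    # the jump into section 3 is flagged
```

A flagged jump is only a hint: a genuine load change also produces one, so in the text this temporal condition supplements the spatial balance checks rather than replacing them.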
In summary, if the data strictly satisfy the temporal and spatial constraints, they reflect the state of the power grid. In case they do not, bad data must exist, and the available data thus do not accurately reflect the state of the power grid. Therefore, the spatiotemporal constraint can be used to detect bad data.

III. DETECTION AND MODIFICATION OF BAD DATA BASED ON TEMPORAL AND SPATIAL CORRELATION CONSTRAINTS
With the power balance constraint, we can detect the presence of bad data but cannot identify which measurements are bad. To identify them, we need additional reference samples, which we obtain by extending the constraint relationship.

A. EQUIVALENT POWER BALANCE
Any node in the grid satisfies the power balance shown in (12):
$$P^{\mathrm{real}}_{ii}(t_k) + \sum_{j} P^{\mathrm{real}}_{ij}(t_k) = 0 \tag{12}$$
The equation can be organized as (13), where $P^{\mathrm{in}}_i(t_k)$ is the sum of any combination of the powers $P^{\mathrm{real}}_{ii}(t_k), P^{\mathrm{real}}_{i1}(t_k), \cdots, P^{\mathrm{real}}_{ij}(t_k)$, and $P^{\mathrm{out}}_i(t_k)$ is the sum of the complementary combination of powers. The equivalent power model of a single node is shown in Figure 2.
When two nodes are connected, each satisfies the power balance. There may be a power difference between them. The equivalent power model is shown in Figure 3.
As shown in Figure 3, where $P^{\mathrm{in}}$ […] In the two-node equivalent power model, the sum of the equivalent power between nodes $i$ and $j$ is the power loss of line $i$-$j$. The following formula shows this: […] a) When $P^{\mathrm{real}}_{\mathrm{loss}ij}(t_k) \approx 0$ holds, then $P^{\mathrm{in}}$ […]
When multiple nodes are connected, their equivalent equilibrium power is as shown in Figure 4.
As shown in Figure 4, where $P^{\mathrm{in}}$ […], $m = j$, and $P^{\mathrm{out}}_j(t_k) = P^{\mathrm{real}}_n(t_k)$. In the multi-node equivalent power model, the sum of the equivalent power between nodes $i$ and $j$ is the loss of power in line $i$-$j$, as shown in (15).
The effect of power loss on equivalent equilibrium power is as follows: a) when $P^{\mathrm{real}}_{\mathrm{loss}ij}(t_k) \approx 0$ holds, then $P^{\mathrm{in}}$ […]
In summary, when the line power loss is low, we can apply the equivalent power balance to detect bad data. In grid operation, line power losses in a 220 kV transmission line amount to 1% to 3%, which is negligibly small compared with the actual transmission capacity of the line. This implies that the proposed approach is suitable for this case.
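The low-loss condition can be illustrated with a short Python sketch that checks whether the two end measurements of a line balance up to a small loss. The function names, the sign convention (power leaving a node into the line is positive, so the two ends sum to the loss), and the 3% bound taken from the 1% to 3% range quoted above are assumptions of this example, not the paper's implementation.

```python
def loss_ratio(p_ij: float, p_ji: float) -> float:
    """Line loss relative to the larger end power (assumed definition).

    Assumption: p_ij and p_ji are signed so that p_ij + p_ji is the loss.
    """
    larger_end = max(abs(p_ij), abs(p_ji))
    if larger_end == 0.0:
        return 0.0
    return (p_ij + p_ji) / larger_end


def line_is_balanced(p_ij: float, p_ji: float, max_loss_ratio: float = 0.03) -> bool:
    """True when the apparent loss is non-negative and within the bound."""
    ratio = loss_ratio(p_ij, p_ji)
    return 0.0 <= ratio <= max_loss_ratio


print(line_is_balanced(100.0, -98.0))   # 2% apparent loss: plausible
print(line_is_balanced(100.0, -80.0))   # 20% apparent loss: suspicious
print(line_is_balanced(100.0, -101.0))  # negative loss: violates the constraint
```

Treating a slightly negative apparent loss as a violation is a strict choice; with noisy measurements a small negative tolerance may be more practical.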

B. METHODS OF DETECTING AND CORRECTING BAD DATA
The equivalent power $P^{\mathrm{inmes}}_i(t_k)$ in the measured data is composed of the true equivalent power $P^{\mathrm{in}}_i(t_k)$ and the equivalent error $\delta^{\mathrm{in}}_i(t_k)$:
$$P^{\mathrm{inmes}}_i(t_k) = P^{\mathrm{in}}_i(t_k) + \delta^{\mathrm{in}}_i(t_k)$$
and correspondingly for the other equivalent powers. Based on the above equilibrium constraints, we can detect bad data in the equivalent powers by comparing the equivalence errors.
We discuss the following cases:
1) When $\delta^{\mathrm{in}}_i(t_k) \gg \delta^{\mathrm{out}}_i(t_k) \approx \delta^{\mathrm{in}}_j(t_k) \approx \delta^{\mathrm{out}}_j(t_k)$, there is a large difference between $P^{\mathrm{in}}_i(t_k)$ and $P^{\mathrm{out}}_i(t_k)$, $P^{\mathrm{in}}_j(t_k)$, $P^{\mathrm{out}}_j(t_k)$. This means that $P^{\mathrm{in}}_i(t_k)$ contains bad data, which can be modified using the remaining three measurements. The estimated error $\hat{\delta}^{\mathrm{in}}_i(t_k) = \frac{\delta^{\mathrm{out}}_i(t_k) + \delta^{\mathrm{in}}_j(t_k) + \delta^{\mathrm{out}}_j(t_k)}{3}$ cannot be exactly zero, but the three errors used for estimation are all insignificant, thus ensuring that $\hat{\delta}^{\mathrm{in}}_i(t_k)$ is smaller than $\delta^{\mathrm{in}}_i(t_k)$. Further, due to (21): […] The direct superposition of insignificant errors reduces the error with a certain probability. We replace the original error with the estimated error. Then, after the data have been modified, they change from the original $P^{\mathrm{inmes}}_i(t_k)$ to $P^{\mathrm{in}}_i(t_k) + \hat{\delta}^{\mathrm{in}}_i(t_k)$.
2) When $\delta^{\mathrm{in}}_i(t_k), \delta^{\mathrm{out}}_i(t_k) \gg \delta^{\mathrm{in}}_j(t_k) \approx \delta^{\mathrm{out}}_j(t_k)$, this means that $P^{\mathrm{in}}_i(t_k)$ and $P^{\mathrm{out}}_i(t_k)$ contain bad data. If two quantities in the equivalent power set are significantly out of equilibrium, it can be concluded that both contain bad data, and the remaining two measurements can be used to modify them. Even though the estimated error $\hat{\delta}^{\mathrm{in}}_i(t_k) = \frac{\delta^{\mathrm{in}}_j(t_k) + \delta^{\mathrm{out}}_j(t_k)}{2}$ cannot be exactly zero, $\hat{\delta}^{\mathrm{in}}_i(t_k)$ is guaranteed to be smaller than $\delta^{\mathrm{in}}_i(t_k)$. If the estimated error is used instead of the original error, then, after modification, the data change from the original $P^{\mathrm{inmes}}_i(t_k)$ to $P^{\mathrm{in}}_i(t_k) + \hat{\delta}^{\mathrm{in}}_i(t_k)$.
3) When three or more data items differ significantly, it is impossible to apply the above balancing constraint to identify bad data. It is then necessary to supplement the data of the given section with data from the previous section. Theoretically, there is little difference between the data of adjacent sections in power system operation, and the data measured at both ends of a line increase and decrease at the same time. Thus, in case of a significant difference among $P^{\mathrm{in}}_i(t_k)$, $P^{\mathrm{out}}_i(t_k)$, $P^{\mathrm{in}}_j(t_k)$, and $P^{\mathrm{out}}_j(t_k)$ that cannot be resolved by the equivalence equilibrium constraint, we need to complement the equivalence dataset with that of the previous section: $P^{\mathrm{in}}_i(t_{k-1})$, $P^{\mathrm{out}}_i(t_{k-1})$, $P^{\mathrm{in}}_j(t_{k-1})$, and $P^{\mathrm{out}}_j(t_{k-1})$. The criterion for modifying the data of the given section is to select the equivalent power with the highest confidence, that is, the one with $\min |P(t_k) - P(t_{k-1})|$, and to modify the remaining three accordingly, or to apply the data from the previous section directly instead.
The process for the identification and correction of bad data is shown in Figure 5, and the steps are as follows:
Step 1: We organize the line power and node power into the line-equivalent and node-equivalent power datasets, respectively, based on their affiliations.

Line-equivalent Power Dataset: $U_{Pij}(t_k) = \{P^{\mathrm{in}}_i(t_k), P^{\mathrm{out}}_i(t_k), P^{\mathrm{in}}_j(t_k), P^{\mathrm{out}}_j(t_k)\}$
Node-equivalent Power Dataset: […]
Step 2: For the line-equivalent power dataset $U_{Pij}(t_k)$, we identify bad data in $P^{\mathrm{out}}_i(t_k)$ and $P^{\mathrm{in}}_j(t_k)$ (that is, $P_{ij}(t_k)$ and $P_{ji}(t_k)$) according to the power balance constraint.
Step 3: If there are no bad data in $P_{ij}(t_k)$ and $P_{ji}(t_k)$, the system is observable. We can estimate the node-injected power directly from the line-equivalent power dataset, and determine whether the original node-injected power contains bad data. Finally, we modify it according to Kirchhoff's law. If bad data are present, we proceed to the next step.
Step 4: When no more than two elements of the line-equivalent power dataset contain bad data, we can identify the double-ended measurement of the line containing bad data according to the principle of equivalence balance. After modifying the data according to the line-balancing principle, we return to Step 2 to judge the data once again. When three or more elements of the line-equivalent power dataset contain bad data, we proceed to the next step.
Step 5: Bad data items are unidentifiable when three or more elements of the line-equivalent power dataset contain them. We need to add trusted measurements to identify the bad data. We group the trusted line measurements into a set $L$ and proceed to the next step.
Step 6: We test the node-equivalent power dataset against the power balance constraint to determine the reliable node powers, and generate the reliable power dataset $J$. We test whether $L \cup J$ meets the requirements of system observability. If it does, we identify the bad data according to the line and node power sets, modify them according to the rules of system operation, and return to Step 1 for another judgment of the data; if not, we proceed to the next step.
Step 7: If the system does not satisfy the requirements of observability, there are multiple bad data in the measured power. We then need to use measured data from the previous section to check the data of the given section. By comparing $U_{Pij}(t_k)$ with $U_{Pij}(t_{k-1})$, we select the equivalent power with the highest confidence, that is, the one with $\min |P(t_k) - P(t_{k-1})|$, and use it to modify the remaining three. We finally return to Step 2 for another judgment.
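The three correction cases for one line's equivalent powers can be sketched as follows. This is a simplified illustration, not the full procedure of Figure 5: it assumes that the four equivalent powers have been sign-normalized so that they are approximately equal when the line loss is negligible, and the function name, the agreement-counting rule, and the tolerance `tol` are hypothetical choices made for the example.

```python
from statistics import mean
from typing import List, Sequence


def correct_equivalent_powers(
    current: Sequence[float],
    previous: Sequence[float],
    tol: float,
) -> List[float]:
    """Return a corrected copy of the four equivalent powers of one line.

    Assumption: values are sign-normalized, so balanced values agree within `tol`.
    """
    n = len(current)
    # For every value, count how many of the other values agree with it.
    agree = [
        sum(1 for j in range(n) if j != i and abs(current[i] - current[j]) <= tol)
        for i in range(n)
    ]
    best = max(agree)
    trusted = [i for i in range(n) if best >= 1 and agree[i] == best]
    # The trusted values must also agree with one another.
    consistent = bool(trusted) and (
        max(current[i] for i in trusted) - min(current[i] for i in trusted) <= tol
    )
    bad = [i for i in range(n) if i not in trusted]

    if consistent and not bad:
        return list(current)  # all values balance: nothing to modify
    if consistent and len(bad) <= 2:
        # Cases 1 and 2: replace the outliers with the mean of the trusted values.
        estimate = mean(current[i] for i in trusted)
        return [estimate if i in bad else current[i] for i in range(n)]
    # Case 3: no trusted group; anchor on the value that changed least
    # since the previous section and use it for all entries.
    anchor = min(range(n), key=lambda i: abs(current[i] - previous[i]))
    return [current[anchor]] * n


prev = [100.0, 100.0, 100.0, 100.0]
one_bad = correct_equivalent_powers([100.0, 101.0, 99.0, 140.0], prev, tol=5.0)
two_bad = correct_equivalent_powers([100.0, 101.0, 140.0, 60.0], prev, tol=5.0)
many_bad = correct_equivalent_powers(
    [100.0, 120.0, 140.0, 60.0], [101.0, 100.0, 99.0, 100.0], tol=5.0
)
print(one_bad, two_bad, many_bad)
```

Averaging the trusted values mirrors the idea that the estimated error is the mean of the insignificant errors; the observability tests of Steps 3, 5, and 6 are outside the scope of this sketch.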

A. COMPARISON OF DATA CLEANING EFFECTS OF DIFFERENT METHODS
To test the feasibility of the method proposed in this paper, we used the IEEE 14-bus system as the test system and state estimation as the comparative method. The system contained 14 nodes and 20 lines. The required variables and parameters of the two methods are shown in Table 1.

TABLE 1. Comparison of variables and parameters

State estimation is based on the premise of knowing the line parameters of the network, which requires a variety of measurements to ensure that the system is observable so that the remaining measurements can be estimated. The method proposed in this paper, however, applies all active measurements in the network, where the number of measurements is 54. It does not require information on the network parameters, nor does it require a step to select quantities that guarantee system observability. This significantly simplifies the pre-calculation work needed.
Based on the power flow data of the IEEE 14-bus system, we simulated a daily load profile to obtain time-varying power flow data recorded at one-minute intervals, $P(t_k) = [P_{ii}(t_k)\; P_{ij}(t_k)\; P_{ji}(t_k)]^T$.
$P_{ii}(t_k)$ is the node-injected power, $P_{ij}(t_k)$ is the power at the head of the line, and $P_{ji}(t_k)$ is the power at the end of the line. We mixed errors into $P(t_k)$ to form $P_{\mathrm{error}}(t_k)$, and then executed state estimation and the proposed method to obtain the results of state estimation, $\hat{P}(t_k) = [\hat{P}_{ii}(t_k)\; \hat{P}_{ij}(t_k)\; \hat{P}_{ji}(t_k)]^T$, and those of topological constraint modification, which have the same structure. We defined the overall deviation in the results of each of the two methods, and the mean deviation of each method as its overall deviation divided by $N$, where $N$ is the number of measurements; the mean deviation of the proposed method is denoted $\mu_{tp}(t_k)$.
We added random errors ranging from 1% to 10% to $P(t_k)$. Then, we calculated the average deviation of the two methods for 1,000 time sections under different errors. The result is shown in Figure 6.
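The experimental setup can be illustrated with a small Python sketch that injects bounded random relative errors and computes a mean deviation. The paper's exact deviation formulas are not reproduced here; the sum of absolute differences used as the overall deviation, the uniform error model, and the function names are assumptions made for illustration.

```python
import random
from typing import List


def inject_errors(p_true: List[float], rel_error: float, rng: random.Random) -> List[float]:
    """Perturb each measurement by a uniform relative error in [-rel_error, rel_error]."""
    return [p * (1.0 + rng.uniform(-rel_error, rel_error)) for p in p_true]


def mean_deviation(p_est: List[float], p_true: List[float]) -> float:
    """Overall deviation (assumed: sum of absolute differences) divided by N."""
    total = sum(abs(e - t) for e, t in zip(p_est, p_true))
    return total / len(p_true)


rng = random.Random(0)               # fixed seed for repeatability
p_true = [50.0, -30.0, 29.4, -20.0]  # toy section, MW
p_error = inject_errors(p_true, rel_error=0.10, rng=rng)
print(mean_deviation(p_error, p_true))  # deviation before any cleaning
```

In a comparison like the one reported in Figure 6, the same function would be applied to the output of each cleaning method and averaged over the time sections.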
As shown in Figure 6, the two methods had similar results in terms of error detection. However, as the injected error increased, our method gradually outperformed the state estimation-based method.
We detected the error according to (29).
The actual error was calculated according to (30).
The identification accuracy of the proposed method is shown in Figure 7.
The figure shows that the proposed method was more accurate at identification when the perturbation error was significant. Figure 8 shows that as the injected error increased, the average accuracy of error identification tended to increase gradually. This shows that our method more accurately identified significant errors.
We applied the same erroneous data to the state estimation method; its error detection accuracy is shown in Figure 9.
The results of the state estimation method were similar to those of the proposed method. The identification accuracy of the latter was lower in case of small perturbation errors but higher in case of large perturbation errors. A comparison of the average error detection accuracy of state estimation and our method for different perturbation errors is shown in Figure 10.
As shown in the figure above, our method had a higher accuracy of detection than state estimation for significant errors ranging from 15 MW to 40 MW.

B. COMPARISON OF COMPUTATION TIME
The computer used for the experiments was configured as follows: processor: Core i7 @ 2.60 GHz; memory: 8 GB; hard disk: 128 GB SSD; operating system: Windows 10. We applied state estimation and our method to the IEEE 14-, 39-, 57-, and 118-bus systems over 1,000 sections. The measured computation time is shown in Figure 11.
As the system's topology became more complex, the computation time of our approach increased slowly, while that of state estimation increased significantly. This difference occurred mainly because state estimation involves a large number of operations, whereas the kernel of our method is a simple linear calculation that significantly simplifies the computation process and reduces the computation overhead.

C. CHANGE IN DATA QUALITY PRE AND POST DATA REVISION
We modified all the measured power values with significant errors based on the two methods, and compared the power balance of the pre- and post-modification data of the nodes.
The power deviation at the nodes is shown in (32).
The mean value of the total power deviation at nodes in each section is shown in (33).
We cleaned the measured data for a provincial network using the proposed method. The average power deviation at the nodes of each section over the time series before cleaning, $W^{\mathrm{or}}_{jd}$, and after cleaning, $W^{\mathrm{se}}_{jd}$, are shown in Figure 12. The mean value of the total power deviation at the nodes was 9.22 MW in the original data and 0.73 MW in the modified data, a reduction by a factor of about 12.6. This shows that the associated powers at the nodes were more balanced after data modification. According to the criterion for determining the bus power balance in code DL-516-2017 for the operation and administration of power dispatching automation systems, we used 10 MW as the limiting value of the power balance error at the node. The modified node imbalance measures were all less than 10 MW, indicating that data quality had improved.

V. CONCLUSION
The main contribution of this paper is a method for cleaning bad data in power grids that uses spatiotemporal correlation constraints to detect and modify them. In contrast with the traditional modeling approach, the proposed method applies the correlation between the data through spatial and temporal constraints. We used the IEEE 14-bus system and one year of data from a provincial network to test our method. The main conclusions are as follows:
1. Our method imposes fewer conditions than state estimation, and is not dependent on parameters of the transmission components. It does not involve non-linear operations, incurs a smaller computational overhead, and is thus more efficient.
2. After injecting errors of different magnitudes into the data, we applied the proposed method and state estimation to them. The results showed that the two methods had similar cleaning effects on bad data when the errors were small, indicating that our proposed method is feasible; when the errors were significant, the proposed method was more accurate than state estimation, which shows its superiority.
3. We applied our method to actual grid operation data, and the results showed that it can reduce power balance errors at the nodes. After cleaning, the data tended to be power balanced and were of better quality.