Distribution Grid Topology and Parameter Estimation Using Deep-Shallow Neural Network With Physical Consistency

To better monitor and control distribution grids, exact knowledge of the system topology and parameters is a fundamental requirement. However, such information is usually incomplete due to the limited number of sensors in the grid, so estimating the system parameters from partial data is a critical topic for distribution systems. Because unobservable system quantities and measurement noise induce highly nonlinear relationships, Deep Neural Networks (DNNs) are widely utilized for accurate estimation. Traditional approaches either treat DNNs as black boxes or embed only limited physical knowledge into them; consequently, they cannot guarantee that the DNN model is consistent with the physical equations and hence lack accuracy and interpretability. Therefore, we propose a Deep-Shallow neural Network (DSN) for distribution system estimation. The key idea is to create virtual nodes to represent nodes without sensors and to deploy a DNN to approximate the missing quantities at the virtual nodes. The Power Flow (PF) equations can then be estimated via a shallow neural network, achieving physical consistency. Isolated by the virtual nodes, the whole system decomposes into a set of reduced graphs with approximate PF equations. We further introduce a Reinforcement Learning (RL)-based search algorithm to connect the reduced grids into one connected system, and the DSN is fine-tuned accordingly to achieve consistency with the PF equations of the connected graph. Finally, the paper illustrates the superiority of the proposed DSN due to its physical consistency: comprehensive experiments demonstrate the high performance of our DSN over other methods in distribution grids.

I. INTRODUCTION

The growing penetration of Distributed Energy Resources (DERs) promotes clean and economic energy consumption. However, DERs bring issues like bi-directional power flows [1], instantaneous system power imbalance [2], and more voltage violations [3]. To enable better monitoring and control, significant work has been proposed based on the assumption of complete system information on topology and parameters [4].
However, such knowledge is often incomplete due to limited instrumentation and low investment interest [5], e.g., in secondary distribution networks. Second, distribution grids can experience relatively frequent topological and parameter changes; such reconfigurations typically aim at better operation [6], [7], [8], regular maintenance [9], or quick restoration after system outages [10]. Third, many plug-and-play components are currently owned by individual users, so their connectivity and status are not always available to the distribution grid operators. Overall, it is desirable to accurately estimate the distribution system parameters and topology for better monitoring.
To achieve this goal, existing methods can be mainly categorized into three groups: statistical test-based methods, traditional Machine Learning (ML) methods, and Deep Learning (DL) methods. The statistical test methods compute certain statistics from electrical measurements to determine the connectivity of the system. For instance, [11], [12] calculate voltage statistics (e.g., covariance) to recover the topology, while [1], [13], [14] estimate the edge weights via voltage and current expectations; graphical model-based methods are then employed to learn the topology. References [15] and [16] compute the second-order statistics of voltage and power injections to estimate the grid topology. However, these works mostly focus on topology recovery and do not estimate system parameters.
To estimate system parameters, traditional ML-based methods treat the system identification problem as a regression task. Then, by minimizing the error of the model output, one can learn model parameters that represent the system parameters and topology. The linear regression model is the most common method to learn a linear form of the PF equations with certain variable transformations [9], [17], [18], [19], [20], [21], [22], [23], [24]: the model is linear with respect to the edge parameters given transformed input/output physical quantities. These approaches rely on some unrealistic assumptions regarding the grid. For example, [17], [18], [19] require all
nodes of the system to have micro Phasor Measurement Units (micro-PMUs). References [21], [23] assume nodes without micro-PMUs have zero power injections. Reference [9] relaxes the zero-injection assumption but still requires the unmetered nodes to have low power correlations with metered nodes.
Finally, [20], [22] require special perturbation techniques to actively gather data. To summarize, the incomplete placement of micro-PMUs or other meters limits the capacity of linearly expressing the power flow equations, thus impacting the performance of the above-referenced methods. Such observations are further theoretically supported in [21], where Moffat et al. prove that it is impossible to recover the complete topology and parameters given limited phasor measurements. At the same time, it is valuable to accurately learn the system equations within the metered nodes to construct so-called reduced grids [25]. To achieve reduced grids, the key is to capture the complex nonlinear correlations induced by missing data, measurement noise, and plug-and-play devices with user privacy considered. Therefore, some researchers have tried to employ nonlinear ML models like Support Vector Regression (SVR) [26], [27]. However, the ability of SVR to capture a representation of distribution grids is limited. To enhance this ability, Neural Networks (NNs), as universal approximators, are primarily considered for modeling the power flow equations [28], [29], [30]. The above-referenced works treat NNs as black boxes to learn the mapping rule. However, black-box models provide little physical interpretability and cannot guarantee the generalizability of the learned model to different loading conditions. Thus, in this paper, we try to answer the following question: can we maximize the recovery of distribution system information while deploying the universal approximation capability in the unrecoverable areas? To recover the power flow equations of partially observable distribution systems and maximize the recovery of the physics, the key is to decompose a DNN model into a linear part with locally full observability and a nonlinear part with a high capacity to handle the randomness from hidden quantities.
We propose a Deep-Shallow neural Network (DSN) to obtain maximal physical consistency. The DSN shrinks to the physical equations for sub-areas with full observability: under full observability, only the shallow NN in the DSN is active. Further, the DSN creates virtual nodes to represent an unobservable area; the deep NN in the DSN is then activated to approximate the variables of the virtual nodes. The physical quantities related to the virtual nodes are constrained into a safe range, and the local topology of the virtual nodes is guaranteed by controlling the sparsity of the DSN. Additionally, the virtual nodes isolate the distribution system into multiple reduced grids. To reconnect these reduced grids, we propose to utilize Reinforcement Learning (RL) to search the connections and find the most suitable one, leading to a reduced and connected grid. Correspondingly, the DSN model is fine-tuned to accurately represent the final grid with maximal physical consistency. To summarize, this paper makes the following contributions:
• Introducing the problem of maximizing topology and parameter recovery for interpretation while minimizing the approximation error in the non-recoverable regions;
• Deriving the DSN as a solution to the proposed problem with a structural design and physically-constrained initialization;
• Proposing to fine-tune the DSN and produce a connected and reduced graph using RL methods;
• Providing extensive experiments on diversified physical systems against many state-of-the-art models to demonstrate the superiority of the proposed methods.
The remainder of the paper is organized as follows: Section II defines the problem. Section III proposes the DSN model and the RL-based DSN fine-tuning. Section IV conducts experiments for the baseline and DSN models. Section V discusses potential applications of the DSN, and Section VI concludes the paper.

II. PROBLEM FORMULATION
A distribution system can be modeled as a graph G = (V, E), where V is the node set and E is the branch set. Further, the node set V can be categorized based on the availability of sensors, e.g., micro-PMUs. Specifically, we denote O as the observed node set with phasor data and H as the unobserved node set without such data, so that O ∪ H = V and O ∩ H = ∅. Our goal is to build an accurate PF mapping of the power grid using measurements of O.
Before elaborating on our method, we present the underlying PF model, which can be obtained with complete system parameters and state measurements. Mathematically, for the i-th node, we denote u_i and v_i as the real and imaginary components of its voltage phasor, and p_i and q_i as the corresponding active and reactive power. These quantities formalize the power flow equations. For ease of later derivations, we utilize the PF model in rectangular coordinates:
$$p_i = \sum_{k=1}^{|V|} \Big[ G_{ik}\,(u_i u_k + v_i v_k) + B_{ik}\,(v_i u_k - u_i v_k) \Big], \qquad q_i = \sum_{k=1}^{|V|} \Big[ G_{ik}\,(v_i u_k - u_i v_k) - B_{ik}\,(u_i u_k + v_i v_k) \Big], \tag{1}$$

where |·| for a set represents the set cardinality, and G_ik and B_ik represent the real and imaginary parts of the (i, k)-th entry of the admittance matrix, respectively. However, for distribution systems, the parameters G_ik and B_ik are usually unknown, and the measurements of u_i, v_i, p_i, and q_i are only available when i ∈ O. More importantly, [21] theoretically proves that it is impossible to obtain complete system parameters using partial measurements. Thus, we cannot obtain the exact PF model and need an accurate approximation tool. Namely, we need to find a map f such that y ≈ f(x), where x = [u_1, ..., u_|O|, v_1, ..., v_|O|]^T and y = [p_1, ..., p_|O|, q_1, ..., q_|O|]^T are the measurement variables from nodes in O, and T denotes the transpose operation.
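To make the rectangular-coordinate model concrete, the following minimal NumPy sketch evaluates (1) for all nodes at once; the array names `G`, `B`, `u`, and `v` are illustrative placeholders for the admittance-matrix components and voltage components.

```python
import numpy as np

def power_injections(G, B, u, v):
    """Evaluate the rectangular-coordinate PF equations (1).

    G, B: (n, n) real/imaginary parts of the bus admittance matrix.
    u, v: (n,) real/imaginary parts of the nodal voltage phasors.
    Returns (p, q): active and reactive power injections at every node.
    """
    # Multiplicative voltage terms, e.g., uu[i, k] = u_i * u_k.
    uu, vv = np.outer(u, u), np.outer(v, v)
    vu, uv = np.outer(v, u), np.outer(u, v)
    p = np.sum(G * (uu + vv) + B * (vu - uv), axis=1)
    q = np.sum(G * (vu - uv) - B * (uu + vv), axis=1)
    return p, q
```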
There are diversified ML models to represent the mapping function f. In this paper, we propose that the selected model should have two distinguishing properties. First, the model should have enough capacity to capture the nonlinear correlations of the input voltages, including the explicit quadratic relationship of observed voltages and the implicit approximation of hidden voltages. Second, the model should maintain consistency with the PF equations. Specifically, the model parameters in f should retain the same values and sparsity as the PF model, so that the learned f generalizes to different system loading conditions. To summarize, we have the following problem definition.
• Problem: ML-based PF approximation with maximal physical consistency.
• Input: N measurements {x_n}_{n=1}^N and {y_n}_{n=1}^N from observed nodes, where x_n and y_n are the n-th samples of the vector variables x and y, respectively.
• Output: an accurate mapping f with high approximation capacity and maximal physical consistency.
We utilize voltage phasors for parameter estimation, which may exhibit high correlations; hence, it is necessary to verify the estimation accuracy. We show that the following treatments are effective. First, a nonlinear transformation can reduce the correlations of the voltage phasors. For example, the nonlinear terms u_i u_k, u_i v_k, and v_i v_k are treated as the input to the linear layer of the DSN model introduced below. Although the correlation coefficients ρ between Voltage Magnitudes (VMs) or Voltage Angles (VAs) are large, the values of ρ reduce significantly for the variables u_i u_k, u_i v_k, and v_i v_k. Table I shows the average absolute ρ values calculated from measurements of the IEEE 18-bus system with real-world load profiles. The nonlinear terms have much smaller values, bringing high performance for topology identification and parameter estimation.
Second, the correlation can be further decreased if we consider incremental data [14], [15], [16]. Basically, we can calculate the incremental term Δx_t = x_{t+1} − x_t, where t is the index of the time slot. Table I shows that the correlation value for the incremental term is significantly reduced compared to that of the raw voltage phasors. In the following derivations, we assume the notations defined in this section represent the incremental data. Third, in Equation (2) below, the l_1-based regularization with suitable tuning of λ_1 can lead to excellent recovery of the topology and line parameters [31]. Due to the above three treatments, we show in Table II in the experiments that our DSN with l_1 regularization obtains accurate topology and line parameters.
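The first two treatments can be checked empirically. Below is a minimal sketch, assuming hypothetical voltage arrays `u` and `v` of shape (T, n), that builds the incremental data, forms the multiplicative input terms, and reports the average absolute correlation among features:

```python
import numpy as np

def preprocess(u, v):
    """u, v: (T, n) time series of voltage components for n observed nodes."""
    # Treatment 2: incremental data to decorrelate the raw phasors.
    du, dv = np.diff(u, axis=0), np.diff(v, axis=0)
    # Treatment 1: multiplicative terms u_i*u_k, u_i*v_k, v_i*v_k, which
    # serve as inputs to the linear (shallow) layer of the DSN.
    feats = np.concatenate([
        np.einsum('ti,tk->tik', du, du).reshape(len(du), -1),
        np.einsum('ti,tk->tik', du, dv).reshape(len(du), -1),
        np.einsum('ti,tk->tik', dv, dv).reshape(len(dv), -1),
    ], axis=1)
    return du, dv, feats

def mean_abs_corr(X):
    """Average absolute off-diagonal correlation among the columns of X."""
    rho = np.corrcoef(X, rowvar=False)
    off = rho[~np.eye(len(rho), dtype=bool)]
    return np.mean(np.abs(off))
```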

III. THE DSN MODEL AND RL-BASED FINE-TUNING

A. Deep-Shallow Neural Network From Elementary Examples
In this subsection, we investigate the design of the neural networks to achieve consistency with the PF equations. First, we consider two elementary examples for the model design.
Case 1. All nodes are observable (metered) in the system. Under this condition, Equation (1) implies that we can construct a 2-layer shallow NN to represent the PF equations. The top left of Fig. 1 illustrates a 4-node fully metered system. The corresponding PF equations can be represented by the shallow neural network at the bottom left of Fig. 1, where the input goes through a multiplication layer and a linear summation layer to predict the output. The multiplication layer computes multiplicative terms like u_1^2 and u_1 u_2 in (1), and the summation layer enforces the weighted summation of the multiplicative terms, which is consistent with the PF equations.
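A minimal PyTorch sketch of this two-layer structure follows; the multiplication layer is implemented as a fixed, parameter-free outer product of the input voltage components, and the weights of the single linear summation layer play the role of the G_ik/B_ik coefficients. The class name is illustrative.

```python
import torch
import torch.nn as nn

class ShallowPFNet(nn.Module):
    """2-layer shallow NN: multiplication layer + linear summation layer."""
    def __init__(self, n_nodes):
        super().__init__()
        d_in = 2 * n_nodes                     # inputs  [u_1..u_n, v_1..v_n]
        d_out = 2 * n_nodes                    # outputs [p_1..p_n, q_1..q_n]
        # Linear summation layer; its weights mirror the PF coefficients.
        self.linear = nn.Linear(d_in * d_in, d_out, bias=False)

    def multiply(self, x):
        # Multiplication layer: all pairwise products such as u_1^2, u_1*u_2.
        return torch.einsum('bi,bk->bik', x, x).flatten(1)

    def forward(self, x):                      # x: (batch, 2*n_nodes)
        return self.linear(self.multiply(x))   # (batch, 2*n_nodes)
```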
Next, we consider the second case.
Case 2. Only one node is observable in the system. As shown in the top right of Fig. 1, only node 1 has measurements. Then, we cannot directly utilize the shallow NN due to the missing quantities. A natural solution is to directly deploy a black-box DNN to map from the inputs to the outputs, as shown in the middle right of Fig. 1. Obviously, such a black-box model cannot maintain physical consistency and can easily encounter issues like overfitting. To handle this problem, we propose a structural design of the DNN model with a certain level of physical consistency and interpretability. Specifically, we propose to create a virtual node in Case 2 and form a 2-node system with the virtual node and the observed node. As shown in the top right of Fig. 1, in the reduced grid, node 1 is observable and node 2 is the virtual node that represents the aggregation of the unobservable nodes. Then, the voltage and power of the virtual node and the observed node fit the shallow NN as in Case 1. In particular, the shallow NN can take in measurements of [u_1, v_1, u_2, v_2]^T and output [p_1, q_1, p_2, q_2]^T. Based on the power balance constraint, p_2 and q_2 are completely determined by u_1, v_1, u_2, v_2, and the parameters of line 12. Thus, we only need to build the mapping from [u_1, v_1, u_2, v_2]^T to [p_1, q_1]^T and learn accurate line parameters. However, the prerequisite is to obtain good estimates of u_2 and v_2 for node 2.
To estimate u_2 and v_2, we make use of the universal approximation power of DNNs. Namely, we build another DNN model to capture the voltage measurement correlations and map from the input [u_1, v_1]^T to estimates of the hidden components [u_2, v_2]^T. Subsequently, the neurons representing [u_2, v_2]^T can be treated as inputs to the shallow NN, formulating a Deep-Shallow neural Network (DSN). The bottom right of Fig. 1 represents the DSN structure for Case 2. Notably, there are no measurements of u_2 and v_2, so we cannot directly guarantee that the intermediate neurons in the DSN take the exact values of u_2 and v_2. To mitigate this problem, we can enforce physical constraints via the clear definitions of voltages and line parameters in the DSN, i.e., the intermediate neurons and the weights of the shallow layer. More specifically, we can calculate other power system quantities from the voltage and line-parameter values, which enables the implementation of different physical constraints; more details are given in Section III-B. In general, the deep-shallow hierarchy and the physical constraints make the DSN consistent with the underlying PF equations.

B. General DSN Case to Create Reduced Graphs
This subsection extends the DSN model to the general case where a subset of the nodes in the distribution grid is observed. We specify the model equations and formally propose an efficient training algorithm. The right part of Fig. 2 illustrates a distribution grid with part of the nodes observable. To build a DSN for such a grid, the key is to identify where to place the virtual nodes and how to formulate the reduced grid.
For this goal, the focus is on small grid components, as in Case 2 of Section III-A. Specifically, we decompose the whole graph G of the distribution grid into |O| unit-graphs {G_i = (V_i, E_i)}_{i=1}^{|O|}, each centered at one observable node with radius 1 (the distance between two connected nodes is defined to be 1). For each center node i, denote Neigh(i) = {k | ik ∈ E}; then {i} ∪ Neigh(i) is the node set of G_i. Based on the presence of hidden nodes in G_i, a unit-graph may or may not require a virtual node, which leads to the following categorization.
F-Graph. We denote a fully-observable unit graph as an F-Graph, as shown in the middle part of Fig. 2. Namely, for an F-Graph G_i, all nodes in G_i have micro-PMUs. For example, in Fig. 2, nodes {1, 2, 3} form an F-Graph where node 1 is the center node and nodes 2 and 3 are neighboring nodes. Thus, F = {i | i is the center of an F-Graph}. Then, based on (1), ∀i ∈ F and all k ∈ Neigh(i), the data [v_i, u_i, v_k, u_k]^T and [p_i, q_i]^T can serve as the inputs and outputs of the 2-layer shallow NN in the DSN, respectively. Nevertheless, the unknown connectivity implies missing knowledge of F, i, and G_i. This issue can be solved via a sparsity-based pre-training method. Basically, one can denote the mapping of the 2-layer shallow NN as f_{S1}(x) = W_1 φ_1(x), where φ_1(·) represents the map that creates the multiplicative terms like u_i^2 and u_i u_k for input neurons u_i and u_k, and W_1 is the linear weight matrix. The pre-training objective is

$$\min_{W_1}\; \frac{1}{N} \sum_{n=1}^{N} \big\| y_n - W_1 \phi_1(x_n) \big\|_2^2 + \lambda_1 \| W_1 \|_1, \tag{2}$$

where ||·||_p is the l_p norm and λ_1 is a positive constant. Namely, we add the l_1 norm of W_1 as a regularization term to enforce sparsity, which selects the correct multipliers for p_i and q_i where i ∈ F. Correspondingly, the sparse selection helps to identify the connection between nodes i and k and produces non-zero terms if k ∈ Neigh(i), which has been verified in existing work [18], [32], [33]. Notably, for a node h ∉ F, the sparse selection does not work and results in under-fitting and a large loss due to the missing data. Thus, to distinguish the node set F, we create an evaluation standard based on the loss:

$$\frac{1}{N} \sum_{n=1}^{N} \big( y_n[i] - f_{S1}(x_n)[i] \big)^2 \le \epsilon, \tag{3}$$

where y_n[i] is the i-th element of y_n and ε is a positive threshold. If (3) holds for the outputs associated with node i, then i ∈ F. Then, based on the non-zero terms in W_1, the system can easily find the set Neigh(i). Specifically, let ψ(i, k) denote the vector of neuron indices in the multiplication layer of the shallow NN such that the elements in ψ(i, k) represent the indices of the multipliers u_i u_k, u_i v_k, v_i u_k, and v_i v_k, sequentially. Then, we denote

$$\mathrm{Neigh}(i) = \big\{ k \mid \text{the columns of } W_1 \text{ indexed by } \psi(i, k) \text{ contain non-zero entries} \big\}. \tag{4}$$

Initial Guess and Constraints on the DSN based on F-Graphs. To fully utilize the pre-trained results, the following initialization and constraints, denoted as the F-Graph initialization and constraints, are employed.
• Initialization 1. The pre-training using Equation (2) brings good parameter estimates for the lines in G_i, ∀i ∈ F. Thus, we keep the corresponding parameters in W_1 and zero out the rest of its elements, since those elements (i.e., line parameters outside G_i) are inaccurately estimated.
• Initialization 2. The symmetry of the admittance matrix helps to re-assign the estimated parameters to other entries of W_1. In general, this re-assignment enforces the coefficients of the same multiplier to be identical in different equations; e.g., the estimated coefficient of u_i u_h is the same in the equations for p_i and p_h.
• Constraint 1. The known connectivity of F-Graphs imposes a constraint on the subsequent training: ∀i ∈ F, the entries of W_1 that were zeroed out remain fixed at zero.
The above initialization and constraints are necessary for the following reasons. (1) They restrict the partial weights in the shallow NN of the DSN to be consistent with the line parameters and connections within F-Graphs. (2) They give a good estimate of the weights from the pre-training; therefore, the shallow NN in the DSN can completely represent the PF equations within F-Graphs. (3) They make full use of the symmetry of the admittance matrix to guarantee the weights for nodes h ∈ O \ F.
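A minimal sketch of the sparsity-based pre-training in (2) and the node-set test (3), assuming the `ShallowPFNet` from the earlier sketch and hypothetical training tensors `X` and `Y`:

```python
import torch

def pretrain_sparse(model, X, Y, lam1=1e-3, lr=1e-2, epochs=500):
    """Minimize MSE + lam1 * ||W_1||_1 as in (2) via SGD."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((model(X) - Y) ** 2).mean() \
               + lam1 * model.linear.weight.abs().sum()
        loss.backward()
        opt.step()
    return model

def identify_F(model, X, Y, eps=1e-4):
    """Apply test (3): a small per-node loss marks the center of an F-Graph."""
    with torch.no_grad():
        err = ((model(X) - Y) ** 2).mean(dim=0)   # per-output MSE
    n = Y.shape[1] // 2                            # outputs are [p_1..p_n, q_1..q_n]
    return [i for i in range(n) if err[i] < eps and err[i + n] < eps]
```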
Despite the above efforts, there are hidden nodes in Neigh(h) for some h ∈ O \ F. Thus, we need to investigate the impacts of these hidden nodes and how to represent the PF equations within the neighborhood of h. Specifically, we propose the following definition.
P-Graph. We denote a partially-observable unit graph as a P-Graph, as shown in the middle part of Fig. 2. Namely, for a P-Graph G_h, there exists at least one node k ∈ Neigh(h) such that node k does not have a micro-PMU. For example, in Fig. 2, nodes {6, 4, 7} form a P-Graph where node 6 is the center node and nodes 4 and 7 are the 1-hop neighboring nodes; node 4 is hidden. Then, we have P = {h | h is the center of a P-Graph}. As we have identified F using (3), we can compute P = O \ F. Note that, as a result of the above identification and parameter initialization process, the connectivity and line parameters between nodes i and h have been determined if i ∈ F and h ∈ P. However, other connections can hardly be identified directly due to the hidden data. Thus, for each P-Graph, one can create a virtual node as in Case 2 of Section III-A. The virtual node is an aggregated node that represents the remainder of the grid for one P-Graph. Therefore, we can assume that ∀h ∈ P, each node in Neigh(h) is either in F or a virtual node. Since all F-Graphs are known, we can identify the P-Graph G_h = (V_h, E_h) as
$$V_h = \{h\} \cup \big( \mathrm{Neigh}(h) \cap F \big) \cup \{v\}, \qquad E_h = \big\{ hk \mid k \in \mathrm{Neigh}(h) \cap F \big\} \cup \{hv\}, \tag{5}$$

where v is the index of a virtual node. The right part of Fig. 2 summarizes all the F-Graphs and P-Graphs for the presented system.
Virtual Node Creation and the Complete DSN Structure. We introduce a DNN model f_D to map from x to the voltages of the virtual nodes, whose number equals |P|. For convenience of later derivations, we denote the vectors of voltage components for nodes in F, nodes in P, and the virtual nodes as x_F, x_P, and x_V, respectively, where x_V is computed as x_V = f_D(x). Similarly, let y_F and y_P be the power data of nodes in F and P, respectively. By default, x = [x_F; x_P] and y = [y_F; y_P], where [·; ·] denotes vector concatenation. According to the above designs, y_F is accurately approximated by f_{S1}(x). Therefore, we need another shallow NN to approximate the contributions of the virtual nodes to y_P while maintaining physical consistency.
Subsequently, we introduce f_{S2}([x_P; x_V]) = W_2 φ_2([x_P; x_V]), where φ_2(·) maps the voltage components to their multiplicative terms and W_2 is the linear weight matrix. With the defined f_{S1}, f_{S2}, and f_D, we can construct our DSN as shown in Fig. 3. The model formula is

$$f(x) = f_{S1}(x) + \big[\, \mathbf{0};\; f_{S2}\big([x_P;\, f_D(x)]\big) \,\big], \tag{6}$$

where the zero vector 0 is concatenated since the virtual nodes contribute nothing to y_F. Generally, we enforce f_{S1} to approximate y_F, and f_{S1} together with f_{S2} to approximate y_P. Therefore, f_{S1} and f_{S2} can represent the PF equations of the reduced grids with virtual nodes. Moreover, we can conveniently enforce physical constraints for the virtual nodes, making the training of the DSN more efficient.
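Putting the pieces together, below is a minimal sketch of the DSN forward pass in (6), with a hypothetical two-layer MLP standing in for f_D; the F-/P-Graph connectivity constraints (non-trainable zero weights in W_1 and W_2) are omitted for brevity.

```python
import torch
import torch.nn as nn

def multiply(x):
    """Multiplication layer: all pairwise products of the entries of x."""
    return torch.einsum('bi,bk->bik', x, x).flatten(1)

class DSN(nn.Module):
    def __init__(self, d_xF, d_xP, d_xV, d_yF, d_yP, hidden=64):
        super().__init__()
        d_x = d_xF + d_xP
        self.f_D = nn.Sequential(              # deep NN: x -> x_V (virtual voltages)
            nn.Linear(d_x, hidden), nn.ReLU(), nn.Linear(hidden, d_xV))
        self.W1 = nn.Linear(d_x * d_x, d_yF + d_yP, bias=False)        # f_S1
        self.W2 = nn.Linear((d_xP + d_xV) ** 2, d_yP, bias=False)      # f_S2
        self.d_xF, self.d_yF = d_xF, d_yF

    def forward(self, x, I=(0.9, 1.1)):
        x_P = x[:, self.d_xF:]
        x_V = self.f_D(x).clamp(*I)            # output clipping into I = [I_1, I_2]
        y = self.W1(multiply(x))               # f_S1(x): PF part over observed nodes
        y2 = self.W2(multiply(torch.cat([x_P, x_V], dim=1)))   # f_S2([x_P; x_V])
        zero = torch.zeros_like(y[:, :self.d_yF])
        return y + torch.cat([zero, y2], dim=1)                # Eq. (6)
```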
Constraints of the DSN based on P-Graphs. Prior knowledge of distribution grids can enforce certain constraints on the P-Graphs. For example, we can enforce the following constraints, common to all distribution grids, denoted as the P-Graph constraints.
• The voltage components of the virtual nodes x_V should be constrained in an interval I = [I_1, I_2] to maintain voltage stability.
• The known connectivity of the P-Graphs requires a constraint: the entries of W_2 that correspond to node pairs outside the P-Graphs remain fixed at zero.
While we numerically observe that the above two constraints bring good results, other domain knowledge can also be introduced, e.g., ranges of the line parameters and line power. More specifically, we can utilize the voltage values in the intermediate neurons x_V and the line-parameter values in W_2 to calculate the line power and introduce similar penalty terms for line parameters and line power. In general, the virtual-node voltages x_V and the line parameters in W_2 provide significant possibilities to enforce power system constraints and make the DSN consistent with the power flow equations, which do not appear in previous work. During training, the connectivity constraints on the F-Graphs and P-Graphs are easily satisfied by keeping the corresponding zero weights non-trainable at all times. For the interval constraint on x_V, one can add penalty terms when x_V falls outside the interval. Specifically, we define the interval distance d(x_V) = ||max(I_1 − x_V, 0)||_1 + ||max(x_V − I_2, 0)||_1 and train the DSN with the objective

$$\min\; \frac{1}{N} \sum_{n=1}^{N} \Big( \big\| y_n - f(x_n) \big\|_2^2 + \lambda_2\, d\big(f_D(x_n)\big) \Big), \tag{7}$$

where λ_2 is a positive constant. The following two steps further ensure that x_V falls inside the interval.
• Initialization. For the DNN f_D(·) in the DSN, we set the bias vector of the last layer to 1, i.e., all entries are ones, and set the entries of all remaining weight matrices and bias vectors to zero. Consequently, the initial output of f_D(·) is 1.

• Output clipping. The clipping technique helps to restrict a quantity within a certain range [34]. For entries of x_V = f_D([x_F; x_P]) that are smaller than I_1 (or larger than I_2), we simply set the corresponding entries to I_1 (or I_2).
Finally, since the above process provides the connections within the F-Graphs and P-Graphs, these unit graphs can be connected to obtain a set of reduced grids {G̃_k}_k. Namely, we check ∀i ∈ O and use the neighborhoods Neigh(i) to build the connections. Further, the weights in W_1 and W_2 determine the parameters of {G̃_k}_k. The complete training process of the DSN is summarized in Algorithm 1; a minimal sketch of the initialization and clipping steps is given below.
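The sketch assumes `f_D` is an `nn.Sequential`, as in the DSN sketch above, and also includes the interval penalty used when training with (7); the interval bounds `I1` and `I2` are illustrative.

```python
import torch

def init_fD(f_D):
    """Zero all weights/biases, then set the last layer's bias to ones,
    so the initial output of f_D is the all-ones vector."""
    for p in f_D.parameters():
        torch.nn.init.zeros_(p)
    last = [m for m in f_D if isinstance(m, torch.nn.Linear)][-1]
    torch.nn.init.ones_(last.bias)

def interval_penalty(x_V, I1=0.9, I2=1.1):
    """Distance of x_V from the interval [I1, I2]; zero when inside."""
    return (torch.relu(I1 - x_V) + torch.relu(x_V - I2)).sum()

def clip_output(x_V, I1=0.9, I2=1.1):
    """Output clipping: project the entries of x_V back into [I1, I2]."""
    return x_V.clamp(I1, I2)
```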
Algorithm 1 Train-DSN({x_n}_{n=1}^N, {y_n}_{n=1}^N)
Input: Training data {x_n}_{n=1}^N and {y_n}_{n=1}^N.
Output: the DSN model f(x), the reduced grids {G̃_k}_k, and the corresponding parameters.
Hyper-parameters: penalty coefficients λ_1 and λ_2, threshold ε to identify F, batch size M, and learning rate η.
1: Build f_{S1}(x). Then, train f_{S1}(x) under the objective (2) via Stochastic Gradient Descent (SGD) [35].
2: Identify the sets F and P using (3).
3: Obtain the F-Graphs and P-Graphs by (4) and (5), respectively.
4: Implement the F-Graph initialization and constraints for f_{S1}(·) and the P-Graph constraints for f_{S2}(·).
5: Utilize F and P to construct the DSN via (6).
6: Train the DSN under the objective (7) via SGD.
7: Connect the F-Graphs and P-Graphs to obtain the set of reduced grids {G̃_k}_k, and utilize the trained W_1 and W_2 to identify the line parameters of {G̃_k}_k.

Remark: We provide the following explanations and a summary of the recoverable and non-recoverable line parameters. First, the authors of [21] prove that it is impossible to recover all the parameters and connectivity without full-rank measurement matrices, let alone with partially-available phasor
measurements. Thus, our goal in the DSN model is to recover as many parameters as possible while maintaining the functional form and constraints, bringing excellent performance under different operational conditions. Second, to achieve this goal, we decompose the system into F-Graphs and P-Graphs in Section III-B. We show in Equation (2) that the parameters of the F-Graphs can be estimated via a 2-layer shallow NN with sparsity regularization, i.e., the parameters of a line i_1 i_2 where i_1, i_2 ∈ F and of a line i_1 h_1 where i_1 ∈ F and h_1 ∈ P; these are all the recoverable parameters. Third, without additional information, it is impossible to accurately recover the parameters of the remaining lines, including a line h_1 h_2 where h_1, h_2 ∈ P and a line h_1 v where h_1 ∈ P and v is a neighboring virtual node. This is because there are infinitely many combinations of voltages and line parameters related to the virtual nodes that fit the observed measurements. However, assigning them physically feasible values leads to physical consistency and high generalizability of the DSN model. The above reduced grids {G̃_k}_k are isolated by the virtual nodes. In general, we make the DSN consistent with the PF equations of {G̃_k}_k via the sparsity regularization and the range/connectivity constraints in the F-Graph and P-Graph constraints, which reduces overfitting and promotes generalizability. Nevertheless, the model for the reduced grids may still be overly complicated due to too many virtual nodes; for example, some virtual nodes at the same location can be merged with the same voltage. Thus, we finally utilize an RL-based search algorithm to connect the reduced graphs, fine-tune the DSN, reduce the intermediate neurons (i.e., the inputs to f_{S2}(·)), and find the best fit.

C. RL-Based Search to Fine-Tune DSN and Connect Reduced Graphs
For the |P| virtual nodes, there are up to $2^{\binom{|P|}{2}}$ possible connection patterns to be searched, which results in a high computational cost when |P| is large. To reduce the computational complexity, we restrict the search to a valid space by requiring the search result to be a sparse and connected graph.
Mathematically, we denote the virtual node set as N = {n_1, n_2, ..., n_|P|} to distinguish these nodes from the previous node indices. The search starts from node n_1 and chooses another node n_i ∈ N \ {n_1} to make a connection n_1 n_i; we denote the action of selecting n_i as a_1. Subsequently, action a_2 chooses the next node n_h from the set N \ {n_1, n_i} to make the connection n_i n_h. After repeating this process |P| − 1 times with sequential actions {a_1, a_2, ..., a_{|P|−1}}, there is a tree structure over all the virtual nodes, resulting in a connected graph. For example, in Fig. 4, the search process adds connections via actions a_1 and a_2 to obtain a tree structure.
However, there may exist loops within the connections of the virtual nodes that better represent the true grid. Thus, we enable M + 1 extra actions {a_{|P|}, ..., a_{|P|+M}} that may, but need not, create loop structures. More specifically, for 0 ≤ m ≤ M, let n_k denote the current node; the action a_{|P|+m} then selects a node from the set N. If the new connection already exists, the graph remains unchanged after this action; for instance, in Fig. 4, action a_3 builds the connection between n_3 and n_2, which already exists. Otherwise, another connection is built, constructing a loop; action a_4 in Fig. 4 connects nodes n_3 and n_1 and forms a triangle. We only need to restrict M to a moderate number so that sparsity is guaranteed. There are the following approaches to determine a moderate value.
(1) We can directly assign a large M (e.g., M = |P|). Since a loop-building action can repeatedly select an existing connection (e.g., a_3 and a_4 in Fig. 4), a search algorithm can still find the optimal structure with a large M and enough search trials. (2) We can assign a reasonable value based on prior knowledge; for example, one may set M = −1 if the network is radial. (3) We can dynamically increase the value of M during the search, which saves computational time and improves convergence. In general, we believe there is large room to explore search-based topology identification in future work. In the numerical section, we set M = 1 since our test cases are radial distribution grids; eventually, we find that there are no loops in the best-fit topology.
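A sketch of one search rollout under these rules follows: the first |P| − 1 actions grow a tree over the virtual nodes, and the M + 1 extra actions may add loop-closing edges, with duplicates leaving the graph unchanged. Random choices stand in for the learned Q-policy, and all names are illustrative.

```python
import random

def rollout(n_virtual, M):
    """Generate one action sequence and the resulting virtual-node edge set."""
    nodes = list(range(n_virtual))          # virtual nodes n_1 .. n_|P|
    edges, connected = set(), [0]           # start the tree from n_1
    current = 0
    for _ in range(n_virtual - 1):          # tree-building actions
        choice = random.choice([n for n in nodes if n not in connected])
        edges.add(frozenset((current, choice)))
        connected.append(choice)
        current = choice
    for _ in range(M + 1):                  # loop-enabling actions
        choice = random.choice(nodes)       # may duplicate an existing edge
        if choice != current:
            edges.add(frozenset((current, choice)))  # no-op if already present
    return edges
```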
In general, the above |P| + M iterations generate a state sequence {s_1, s_2, ..., s_{|P|+M+1}} and an action sequence {a_1, a_2, ..., a_{|P|+M}}, where each s_k represents the connectivity among the virtual nodes at the k-th step. This state-action transition process defines a Markov Decision Process (MDP) that can be conveniently optimized with Reinforcement Learning (RL) techniques. In this paper, we employ Q-learning due to its efficient convergence [36].
After defining the states and actions, we need a reward function to evaluate an action sequence and guide the RL training toward the optimal action set. Intuitively, we prefer the selected actions (i.e., connections within N) to better fit the training data. Thus, we formulate the reward based on the Mean Square Error (MSE) [37] of the DSN on the training dataset, so that the search results best fit the data. The architecture of the DSN is then fine-tuned according to the searched grid topology.
A simple and efficient fine-tuning process is adopted as follows. We assume the line parameters are zero between two connected virtual nodes. Namely, we can merge the virtual nodes n_i and n_k as long as line n_i n_k is a searched line in s_{|P|+M+1}; the voltage estimates for nodes n_i and n_k are then equal. Therefore, in the DSN model, one only needs to merge the hidden neurons for the estimated voltages of n_i and n_k, leading to a new DSN structure. We denote the new mapping function as f̃(x).
Subsequently, we can train f̃(x) using the same objective as (7). The error of the trained f̃(x) helps to construct the reward. Basically, the metric is the Normalized Root-Mean-Square Error (NRMSE) [38], [39],

$$\mathrm{NRMSE} = \frac{1}{\sigma_y} \sqrt{ \frac{1}{N} \sum_{n=1}^{N} \big\| y_n - \tilde{f}(x_n) \big\|_2^2 }\,,$$

where σ_y is the standard deviation of the outputs. The reward is then calculated as R = 1/(1 + NRMSE). R is a terminal reward evaluating the performance of the state and action sequences after one search; consequently, for each state transition (s_i, a_i, s_{i+1}), we also assign R as the reward. With the defined states, actions, and rewards, our Q-learning algorithm iteratively updates the Q function:

$$Q_{t+1}(s_k, a_k) = (1 - \alpha)\, Q_t(s_k, a_k) + \alpha \Big( R + \gamma \max_a Q_t(s_{k+1}, a) \Big), \tag{8}$$

where Q(s_k, a_k) evaluates the k-th state-action pair (1 ≤ k ≤ |P| + M), and the subscript t indicates the t-th episode. α and γ are the pre-defined learning rate and discount factor, respectively. The optimization max_a Q_t(s_{k+1}, a) is a lookup of the Q-table that determines the maximum Q value at s_{k+1}. Finally, we employ the ε-greedy method to encourage exploration of random actions in Q-learning. We summarize the procedure in Algorithm 2.

Algorithm 2 Q-Learning to Search G̃
Input: Training data {x_n}_{n=1}^N and {y_n}_{n=1}^N.
Output: the DSN model f̃(x), the connected reduced grid G̃, and the corresponding parameters.
Hyper-parameters: penalty coefficients λ_1 and λ_2, threshold ε to identify F, batch size M, learning rate η, probability p for the ε-greedy action selection in Q-learning, α and γ as the learning rate and discount factor for the Q-function update, the maximum number of episodes T, and the extra action number M.
1: for t = 1, ..., T do
2:   for k = 1, ..., |P| + M do
3:     Draw a random sample from the uniform distribution, p_0 ∼ U(0, 1).
4:     If p_0 < p, choose a random action for a_k; otherwise, choose a_k = arg max_a Q_t(s_k, a).
5:     Conduct a_k and transit from s_k to s_{k+1}.
6:   end for
7:   Generate the state sequence {s_1, s_2, ..., s_{|P|+M+1}} and the action sequence {a_1, a_2, ..., a_{|P|+M}}.
8:   Utilize the connections in s_{|P|+M+1} to generate G̃_t and the corresponding DSN model f̃_t(x).
9:   Train f̃_t(x) via SGD and calculate the reward R_t.
10:  Utilize (8) to update the Q function.
11: end for
12: Set f̃(x) = f̃_T(x) and G̃ = G̃_T. Then, utilize the trained W_1 and W_2 to identify the line parameters of G̃.

Our model is not necessarily limited to radial networks. The observed nodes are categorized into F-Graphs and P-Graphs, and both can be loopy. In particular, for F-Graphs, loopy structures and line parameters can be estimated directly via the optimization in Equation (2) as long as all nodes are observable, e.g., the triangle structures in Fig. 2 and Fig. 4. For P-Graphs, the remaining unidentifiable connections and parameters are related to the virtual nodes. Although we add one singleton virtual node with one line for each P-Graph, our RL-based search algorithm enables the construction of loops among different virtual nodes, as stated in this subsection and displayed in Fig. 4. Therefore, our model can tackle loopy structures in both observed and hidden regions.
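A compact sketch of the tabular Q-learning loop with the terminal reward R = 1/(1 + NRMSE) and the update (8) is given below. The function `evaluate_nrmse` stands for training the merged DSN f̃(x) via SGD and computing its NRMSE, and is assumed given; the fixed-size action set is an illustrative simplification.

```python
import random
from collections import defaultdict

def q_search(n_steps, n_actions, episodes, evaluate_nrmse,
             alpha=0.1, gamma=0.9, p=0.1):
    """Tabular Q-learning over virtual-node connection sequences."""
    Q = defaultdict(float)                    # Q[(state, action)], default 0
    best, best_r = None, -1.0
    for _ in range(episodes):
        traj, state = [], ()                  # a state is the tuple of past actions
        for _ in range(n_steps):              # n_steps = |P| + M actions per episode
            if random.random() < p:           # epsilon-greedy exploration
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda b: Q[(state, b)])
            traj.append((state, a))
            state = state + (a,)
        R = 1.0 / (1.0 + evaluate_nrmse(state))   # terminal reward from NRMSE
        next_states = [s for s, _ in traj[1:]] + [state]
        for (s, a), s2 in zip(traj, next_states):
            target = R + gamma * max(Q[(s2, b)] for b in range(n_actions))
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target   # update (8)
        if R > best_r:
            best, best_r = state, R
    return best, Q
```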
Remarkably, our DSN is a flexible model that makes the best use of the input information to provide a good approximation of the PF equations. Basically, (1) the structural design of the DSN is based on the identification of the F-Graphs and P-Graphs; consequently, the line parameters in the F-Graphs can be accurately estimated and stored in the weights of the DSN. (2) The functional form and the plug-and-play physical constraints for virtual nodes further improve the consistency of the DSN with the underlying PF equations of the reduced grids; the end-to-end training of the DSN then refines the model and guarantees high generalizability to diversified operational conditions. (3) The RL-based search algorithm connects all reduced grids and fine-tunes the DSN model. Therefore, the final DSN model is optimal for representing the PF equations of a connected graph and best fits the data.
With the above properties, the obtained DSN model can (i) reduce to the PF equations if the system is fully observable, (ii) estimate as many line parameters as possible for F-Graphs and some lines of P-Graphs, and (iii) conduct universal approximation with physical constraints if only singleton nodes are observable. In conclusion, under different observability levels, our DSN model delivers a good solution for system modeling. Our numerical results in Fig. 5 and Table II further support these claims: even when the observability level γ = 0.1, the power prediction error and the line-parameter estimation error remain limited compared to other methods.

IV. EXPERIMENTS

A. Dataset Description
In our experiments, we utilize simulated data from standard IEEE distribution systems [40] and residential load profiles [1]. IEEE provides standard power system models, including the grid topology, parameters, and generation models, for accurate simulations. The model files and the simulation platform, MATPOWER [40], are based on MATLAB. We incorporate the IEEE 18-, 33-, 115-, and 123-bus distribution networks for testing. To conduct the simulations, load files are required as inputs to the systems; specifically, we employ the Pacific Gas and Electric Company's (PG&E's) hourly residential load profile (8760 samples for one year) in Northern California [1]. With the above data, MATPOWER produces the system states of voltage and power in (1).

B. Benchmark Models
To fully investigate the interpretability and generalizability of the DSN, we compare it with other advanced DNN models. Specifically, we use the following benchmarks.
• ResNet [41]: the Deep Residual Network creates shortcut connections that pass deep information directly to the shallow layers. This skip connection not only helps to avoid vanishing gradients during training but also contributes to generalization, since low-complexity features are connected to the output, decreasing the model complexity.
• SINDy [32], [42]: Sparse Identification of Nonlinear Dynamics (SINDy) utilizes sparse regression to recover the parameters of physical systems, while the regression bases can be selected via DNNs. In our experiments, we employ the quadratic terms as the symbolic bases and eliminate the DNN part for simplicity.
• DNN with Dropout [43]: the dropout method randomly disables neurons during training, preventing the neurons from over co-adapting and increasing generalizability.
• Physics-conditioned GAN [44]: the physics-conditioned Generative Adversarial Network (GAN) is a cutting-edge method for distribution grid state estimation with pseudo-measurements of hidden nodes. In Section IV-F, we demonstrate that our DSN can also conduct state estimation with pseudo-measurements for hidden nodes and achieve better performance.

C. Model Evaluation
We propose the following metrics to evaluate the generalizability and interpretability of the DSN and the benchmark models.
Generalizability. The generalizability of an ML model represents its capacity to take the place of the underlying PF equations when the system condition changes. To evaluate it, we conduct 5-fold cross-validation and use the Mean Square Error (MSE) on the testing dataset as the measure of generalization error.
Parameter estimation. SINDy and our DSN can estimate the distribution system parameters. To evaluate the system identification performance, we consider the lines among the node set O, where both the line-weight estimation error and the connectivity should be evaluated. Since the connectivity can be converted to the sparsity of the Laplacian matrix, we utilize the so-called normalized Total Vector Error (nTVE) [9] to evaluate the difference between the estimated Laplacian L̂ and the true Laplacian matrix L:

$$\mathrm{nTVE} = \frac{\| \hat{L} - L \|_F}{\| L \|_F} \times 100\%.$$
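Under the Frobenius-norm reading of this metric, the computation is a one-liner:

```python
import numpy as np

def ntve(L_hat, L):
    """normalized Total Vector Error (%) between estimated and true Laplacians."""
    return 100.0 * np.linalg.norm(L_hat - L, 'fro') / np.linalg.norm(L, 'fro')
```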

D. Results for Model Generalizability
In our experiments, we test different systems with changing micro-PMU penetration levels γ ∈ {0.1, 0.2, ..., 0.9} to comprehensively compare model generalizability across methods, using 5-fold cross-validation. The results are shown in Figs. 5(a) to 5(d). For each trial, our DSN always achieves the lowest MSE on the validation dataset. Further, the MSE of our DSN decreases as γ increases, while for the other methods, the MSE increases.
The lowest generalization error comes from (1) the well-extracted local governing equations, which generalize to different datasets, and (2) the physical constraints that keep the physical variables within their physical ranges. Also, increasing the sensor penetration γ allows more physical parameters to be captured, further decreasing the MSE. For the other methods without physical consistency, however, the MSE increases due to the growth of the output dimensionality.

E. Results for Network Parameter Estimation
In this subsection, we show the results for the second function of our DSN, i.e., estimating the edge weights to construct the underlying physical equations. For the line parameters and connectivity among the observed nodes O, we calculate the nTVE (%) for evaluation. The comparison is between our DSN and SINDy, since the other DNNs cannot estimate the physical equation parameters.
The results are shown in Table II. Generally, our DSN far outperforms SINDy for all systems when γ < 0.5; empirically, the DSN's nTVE is around 10%∼25% of SINDy's. As γ increases, the performance of the DSN and SINDy become closer, yet the DSN's nTVE is still only around 40%∼60% of SINDy's. The reasons are as follows. (1) The DSN employs the testing criterion (3) to decompose O into F and P; the initialization rule of the DSN then ensures that the weights shared between F and P are accurately estimated in the pre-training on the F-Graphs, whereas for SINDy the shared-weight estimation incurs errors due to the hidden quantities. (2) When γ < 0.5, the hidden nodes are dominant, so the DSN performs much better than SINDy. (3) As γ increases, the number of hidden nodes decreases, so SINDy's inaccurate estimation of the shared weights diminishes, bringing the two methods closer.

F. Results for State Estimation of Hidden Nodes
In this subsection, we demonstrate an additional function of our DSN: estimating the voltages of hidden nodes when extra pseudo-measurements are available. In particular, for partially observable systems, pseudo-measurements of the hidden regions are often needed to enable state estimation for all hidden states [44]. Namely, one needs access to historical data of the hidden nodes, which may have low quality and large errors. Existing work employs Gaussian mixture models [45], 2-stage clustering [46], NNs to process pseudo-measurements [47], [48], and Generative Adversarial Networks (GANs) [44] for distribution grid state estimation. With pseudo-measurements, we find that, despite the lack of system parameters, our DSN model can also deliver accurate hidden-state estimation. This is mainly because our DSN can (1) utilize the pseudo-measurements to pre-train the DNN in the DSN and approximate a deterministic map between observed and hidden states without the PF equations, (2) employ the shallow NNs to learn the underlying PF equations and further regularize the approximated hidden states, and (3) enforce power system constraints on the approximated hidden states.
While the problem deserves further investigation under different system assumptions in future work, we present a preliminary study assuming that we know the locations and low-quality pseudo-measurements of the hidden nodes. Hence, we use the measurements of the observed states and the pseudo-measurements of the observed and hidden states to train the DNN in the DSN. Subsequently, the known locations of the hidden nodes enable us to replace the virtual nodes with the hidden nodes to construct the shallow NNs in the DSN. Finally, we jointly utilize the measurements of observed nodes and the low-quality pseudo-measurements of hidden nodes to estimate the line parameters and hidden states. To numerically support our claim, we compare our method with the recent physics-conditioned GAN [44] in terms of the MSE of the hidden-state estimation. Specifically, we employ 8760 voltage and power samples for observed nodes and 500 voltage pseudo-measurements for observed and hidden nodes; the former are used for end-to-end training of the DSN model, while the latter help to pre-train the DNN in the DSN.
Fig. 6 demonstrates the result: our DSN achieves a 40%∼60% error decrease compared to the physics-conditioned GAN. The key reason is that the physics-conditioned GAN maps from noise vectors to the hidden data, while our built-in DNN captures the relationship between the voltages of observed and hidden nodes, and these voltage correlations are easier to learn. The noise-vector-based transformation, in contrast, highly depends on the data quality and sample size of the pseudo-measurements. Moreover, although the physics-conditioned GAN assumes known system parameters to conduct state estimation, our DSN learns the underlying parameters to facilitate the inference of hidden states; the accurate parameter estimates play a role similar to the physical law embedded in the physics-conditioned GAN.

V. DISCUSSION OF THE UTILIZATION OF DSN
We believe that our DSN model can serve as a foundation for steady-state system modeling and many other applications. Specifically, we present the following potential uses.
• Parameter estimation and topology identification with partial observability. For distribution systems with sparse measurement locations, our DSN model provides accurate parameter estimation for the identifiable portions (e.g., the line parameters of F-Graphs). Specifically, the sparsity of the shallow NNs in the DSN implies the underlying topology, and the corresponding weight matrices can be simplified to obtain the line parameters. The identified topology and parameters lay the foundation for distribution grid operation and optimization.
• Power prediction. Since our DSN is a mapping from observed voltage to power, power prediction is a
natural function given the input voltage. This function can be useful when sensor communication fails or the power data are tampered with; our DSN, with its high generalizability, can thus provide a validation of the power consumption.
• State estimation. While our DSN is a forward mapping from voltage to power, it can be treated as a building block for distribution system state estimation, an important function for system modeling in the face of distributed generation and bi-directional power flow. First, since the DSN is a neural network, we can embed it as the encoder of an autoencoder [49]; the decoder then transforms power back to voltage, becoming an NN-based state estimator. Notably, the high physical consistency of our DSN implicitly boosts the performance of the decoder with physical constraints. Additionally, as stated in Section IV-F, our DSN can estimate the states of hidden nodes given some pseudo-measurements.
• Outage detection. Renewable energy and bi-directional power flows may increase the chance of line outages in the distribution grid. Our DSN provides power prediction and line-parameter estimation, which can help detect and locate line outages that cause significant load trips [50].
• Bad data or false data detection. The learned parameters in the DSN can help calculate the residual error for bad or false data detection [51]. Traditional detection methods rely on full system observability and complete knowledge of the system parameters, whereas our DSN provides a feasible approach under partial observability. Mathematically, the residual error is the difference between the power predicted by the DSN and the measured power; see the sketch at the end of this section.
• Renewable power and load profile generation. Power profile generation for residential loads or renewable energy is important for system stability. Recent advances employ Generative Adversarial Networks (GANs) to generate profiles due to their high approximation power for data distributions [52]. While traditional GANs directly map from noise vectors to power data in their generators, one can create another generator that maps from normal system states to power, i.e., our DSN model.
The above examples imply that when the underlying PF equations are missing, our DSN model provides an important function to correlate system states, power, and parameters. Owing to its physical consistency and NN structure, the DSN can easily be combined with other NN-based functions for power system applications. Beyond the listed points, there is still large room to explore further uses of our DSN, which we will discuss in future work.
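A minimal sketch of the residual-based bad-data check, assuming a trained DSN `dsn` and measured inputs/outputs `x` and `y_measured` (names and the threshold rule are illustrative):

```python
import torch

def flag_bad_data(dsn, x, y_measured, tau=3.0):
    """Flag samples whose power residual exceeds tau standard deviations."""
    with torch.no_grad():
        r = (y_measured - dsn(x)).norm(dim=1)     # per-sample residual
    return r > r.mean() + tau * r.std()           # boolean mask of suspects
```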

VI. CONCLUSION
We propose a Deep-Shallow neural Network (DSN) to identify the Power Flow (PF) equations of distribution systems under partial observability. The DSN automatically shrinks to the PF equations for fully observable areas and aggregates hidden nodes into virtual nodes to create reduced grids.
The DSN then maintains the physical ranges and topology of the reduced grids with proper physical constraints. Finally, we provide a Reinforcement Learning (RL)-based search algorithm that connects the reduced grids into one connected grid; in this process, the DSN is fine-tuned to represent the PF equations of the connected grid. Extensive experiments over various distribution systems demonstrate reduced errors compared to contemporary methods.


Fig. 4. An illustrative example of the RL-based search for the connected graph.


Fig. 5. Results of MSE for different models and different distribution grids.

TABLE I. THE ABSOLUTE AVERAGE VALUE FOR THE CORRELATION COEFFICIENT.