Deep Statistical Solver for Distribution System State Estimation

Implementing accurate Distribution System State Estimation (DSSE) faces several challenges, among them the lack of observability and the high density of the distribution system. While data-driven alternatives based on Machine Learning models could be a choice, they suffer in DSSE because of the lack of labeled data. In fact, measurements in the distribution system are often noisy, corrupted, or unavailable. To address these issues, we propose the Deep Statistical Solver for Distribution System State Estimation (DSS$^2$), a deep learning model based on graph neural networks (GNNs) that accounts for the network structure of the distribution system and for the governing power flow equations. DSS$^2$ leverages hypergraphs to represent the heterogeneous components of the distribution system and updates their latent representations via a node-centric message-passing scheme. A weakly supervised learning approach is put forth to train DSS$^2$ in a learning-to-optimize fashion w.r.t. the Weighted Least Squares loss with noisy measurements and pseudomeasurements. By feeding the GNN output into the power flow equations and the latter into the loss function, we force DSS$^2$ to respect the physics of the distribution system. This strategy enables learning from noisy measurements, acts as an implicit denoiser, and alleviates the need for ideal labeled data. Extensive experiments with case studies on the IEEE 14-bus, 70-bus, and 179-bus networks show that DSS$^2$ outperforms the conventional Weighted Least Squares algorithm by a margin in accuracy, convergence, and computational time, while being more robust to noisy, erroneous, and missing measurements. DSS$^2$ achieves a competitive, yet lower, performance compared with the supervised models that rely on the unrealistic assumption of having all the true labels.


I. INTRODUCTION
Distribution systems are taking a more active role in the energy transition. These active distribution systems require more extensive monitoring and control, which is possible by developing Distribution System State Estimation (DSSE) [1]. Currently, state estimation (SE) is mostly limited to transmission systems, and several challenges hinder extending SE to distribution systems. First, conventional SE algorithms for transmission systems are difficult to adapt to distribution systems, as the assumptions differ. Additionally, the distribution grid lacks real-time measurements. Conventional algorithms assume full observability of the grid with redundant measurements, which is impractical [2]. To address the observability issue of distribution systems, forecasted values based on historical data, called pseudomeasurements, are used to compensate for the lack of measurements, but they are often inaccurate and can impact the SE accuracy [3]. Also, the Weighted Least Squares (WLS) method used for SE is time-consuming and sensitive to data noise for large distribution systems [4]. Multi-area SE has been widely investigated to speed up the estimation process [5], [6], and has been extended to distribution systems [7]. However, the convergence and sensitivity issues remain, and division into multiple areas brings in communication and time-synchronization challenges.
Different algorithms have been proposed to improve the robustness and convergence of SE, notably the branch-current WLS, the Least Absolute Value, and the Generalized Maximum Likelihood [4], [8]. Branch-current algorithms are more robust to parameter selection and uncertainty and are better suited to the weakly meshed and radial topologies of the distribution system [9]. However, these algorithms suffer from the lack of quality measurements in wide distribution systems and from the uncertainty of distributed loads and generators. Kalman Filters aim to improve speed and estimation performance under low observability. They are linked to the Forecast-Aided SE concept, where model-based approaches use the previous states as extra information to enhance accuracy and speed [4], [10]-[13]. However, Kalman Filter SE is limited by the assumptions of system linearization and Gaussian-distributed measurements, which reduce its accuracy and robustness. Indeed, power systems are highly nonlinear, and measurements can show a non-Gaussian distribution [14].
Data-driven techniques showed promising results in performing fast DSSE without the above-mentioned assumptions. Deep learning models showed remarkable results in fitting data for the SE task [15]-[18]. This supervised learning approach trains neural networks to fit labels, which are the grid's state variables. These labels are usually provided from simulations, as getting them from the grid is often impossible. As such, these approaches suffer from the scarcity of real labelled data, and supervised learning is only possible using simulation data. Therefore, the models can only fit simulators and not real systems. Even though some approaches try to improve this technique by introducing some inductive bias [19] and physics-awareness [20], they all require extensive supervised learning using large labelled databases [4].
Combining model-based and data-driven approaches is a promising research direction to overcome the limitations of the model-based techniques with data-driven tools [11]. This approach is considered in [21] to combine the efficiency of Kalman Filters with the robustness of supervised Deep Learning architectures. It showed interesting results in low-dimensional problems; however, it struggles with high-dimensional problems due to the need for labelled data and unstable training. In the field of 'hybrid' approaches, [22] combines data with physics and develops a model-specific deep neural network (DNN) by unrolling an SE solver to enhance estimation performance and reduce computational expense. However, the model is trained with fully labeled data, and the physics-awareness of the approach is limited and does not include the structure of the system.
To address the lack of labeled data for training data-driven models, the concept of weakly supervised learning has been proposed [23]. This is similar to supervised learning, but it is used for tasks where data is only partially or inaccurately labeled, relying on physical information, mathematical tools, or model-based modules to enhance the training process. While unsupervised learning also trains models without labels, weakly supervised learning still relies on imperfect values defined as labels. Although weakly supervised learning is highly practical for tasks such as DSSE, it has not yet been investigated for this particular problem.
Topology and parameter estimation in distribution systems from limited measurements have also been explored in the literature, with promising results. In [24], a model-based algorithm using linear regression is proposed for topology and parameter estimation from limited measurements. In [25], smart meter data is utilized to estimate the topology through an Ordinary Least Squares method. Meanwhile, [26] proposes a data-driven Lasso algorithm for topology estimation.
Graph Neural Networks (GNNs) are a particular family of deep learning models that use the underlying network structure as an inductive bias [27], [28] to tackle the curse of dimensionality and reduce the data demand. GNNs have also shown robustness to perturbations in the network topology [29]-[31], which makes them appealing data-driven alternatives for the DSSE task. GNNs are investigated for power system applications, where the electrical lines correspond to the graph's edges and the buses correspond to the graph's nodes [32], and the data varies for the specific application.
GNNs have been investigated for their potential use in SE in power systems, as shown in [19], [33]-[35]. The models in [19], [33] demonstrate that GNNs can accurately perform SE while being robust against noise and missing data. Moreover, [34] shows that GNNs can provide fast and robust SE, and any inaccuracies in the data would only impact local estimation. Additionally, [35] highlights that GNN models can handle fast sampling measurements, thereby improving SE in power systems. However, the heterogeneity of components in power systems cannot be accurately modeled using simple graphs, and thus, GNNs have limited expressivity in the graph model for power system applications.
Despite the increasing development of GNN applications in power systems and growing research on deep learning for DSSE, the literature on GNNs for DSSE is limited to the parallel works [36], [37]. In [36], an electrical-model-guided GNN is used to perform DSSE and compared to conventional methods and other machine learning techniques. This approach demonstrated higher accuracy and robustness, indicating the potential of GNN-based approaches. In [37], a GNN model is combined with matrix completion techniques to perform DSSE without the need for a detailed system model, highlighting the robustness of GNN approaches to model inaccuracies. While these approaches show promising results, they rely on labeled data for training, which is impractical due to the limited observability of the system's state.
In this paper, we propose the Deep Statistical Solver for Distribution System State Estimation (DSS$^2$), a GNN model based on the Deep Statistical Solver architecture [38] specialized for optimization tasks on power systems. The model is trained in a weakly supervised manner to tackle the issues of data scarcity and inaccurate labeling. Such weak supervision succeeds because the training loss function includes the physical information of the network and the physical laws of the power flow equations, rendering labels obsolete. Specific contributions include:
1) DSS$^2$, the Deep Statistical Solver model for accurate data-driven DSSE using a weakly supervised approach.
2) Adding physical constraints as penalization to the loss function, enhancing the model's performance.
3) The innovative use of weak supervision in the context of data-driven DSSE, which leverages the power flow equations to restrict the model's search for the mapping function, hence reducing the data demand and improving robustness to inaccurate measurements.
Our goal with this proposition is to enhance the accuracy of grid estimation in situations where there are limited measurements and low-quality input values, such as pseudomeasurements. To achieve this, we propose using a deep learning model that learns solely from the grid's measurements and the physical information of the system. We validate the model using various case studies on the IEEE 14-bus, 70-bus, and 179-bus systems and compare it to the WLS algorithm baseline and other Deep Learning architectures. The proposed DSS$^2$ is up to 15 times faster, 4 times more accurate, and more robust than the standard WLS algorithm. Our model also outperforms supervised learning approaches, being 10 times more accurate in line-loading estimation while alleviating the need for labelled data. Interestingly, our approach performs better on larger networks: the GNN learns in the neighborhood of buses, so the larger the power network, the more data there is to learn from. The source code of this work is available on GitHub [39]. This paper presents the methodology in Sections II and III, introducing the Deep Statistical Solver model and extending its usage to DSSE, respectively. Sec. IV presents the case studies and compares the performance to the baseline algorithm and other data-driven models. Sec. V concludes this work.

A. Conventional problem formulation
The state estimation problem aims at finding the state vector $x$ based on a noisy measurement vector $z$. Conventionally, we consider the voltage amplitude and angle at every grid bus as state variables, and $z$ can include any measurement type:
$$x = [V_0, \dots, V_{n-1}, \varphi_1, \dots, \varphi_{n-1}]^\top, \qquad z = [z_1, \dots, z_m]^\top, \tag{1}$$
where we consider $n$ buses and $m$ measurements. $V_i$ is the voltage amplitude at bus $i$, and $\varphi_i$ the voltage phase angle.
We have $2n - 1$ state variables, as $\varphi_0$ is set to 0 by the slack bus convention. Linking the measurement vector $z$ to the state vector $x$, we define a measurement function $h(x)$:
$$z = h(x) + \varepsilon, \tag{2}$$
where $\varepsilon \in \mathbb{R}^{m \times 1}$ is the measurement noise vector, and $h(x)$ are the power flow equations shown in Eq. (3) [40].
In this measurement function, Eq. (3), $\Delta\varphi_{ij} = \varphi_i - \varphi_j + \phi_{ij}$ is the voltage angle difference across the line that connects bus $i$ to bus $j$, $\phi_{ij}$ is the shift angle of the transformer, if any, and $Y_{ij}$ and $Y_{s,ij}$ are respectively the line and shunt admittance of the line between bus $i$ and bus $j$. Measuring flows at bus $i$, we have $P_{ij\rightarrow}$ and $Q_{ij\rightarrow}$ as the active and reactive power flow from bus $i$ to bus $j$, and $P_{ij\leftarrow}$ and $Q_{ij\leftarrow}$ as the power flow from bus $j$ to bus $i$. The current flow $I_{ij}$ follows the same convention. Finally, we derive the active and reactive power injections at bus $i$, $P_i$ and $Q_i$, from the power flows. All these outputs are possible elements of $z$, depending on the measurement infrastructure.
The measurement function h(x) contains equations that connect the state variables V i and φ i to all types of measurements in the network.The first two lines correspond to identity functions that link the state variables to their direct measurements.Lines 3-6 consist of AC power flow equations used to derive power flows from the state variables.Lines 7-8 derive current flows from the previously derived power flows.Finally, lines 9-10 derive the nodes' power injection by ensuring exact power balance in each network node.
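To make the structure of the measurement function more concrete, the following minimal sketch computes branch flows and a bus injection from the state variables. It assumes a standard pi-model line with half of the shunt admittance at each end; this is a common textbook form and not necessarily identical to Eq. (3). The function names `branch_flow` and `bus_injection` are hypothetical.

```python
import cmath
import math


def branch_flow(v_i, phi_i, v_j, phi_j, y_series, y_shunt=0j, shift=0.0):
    """Power and current entering line (i, j) at bus i, pi-model (assumed).

    Voltages are in p.u., angles in rad, admittances in p.u. (complex).
    """
    # Complex bus voltages; the transformer shift is added to the sending
    # angle so that the angle difference equals phi_i - phi_j + shift.
    u_i = cmath.rect(v_i, phi_i + shift)
    u_j = cmath.rect(v_j, phi_j)
    # Series current plus half of the line charging current at the sending end.
    i_ij = (u_i - u_j) * y_series + u_i * y_shunt / 2
    s_ij = u_i * i_ij.conjugate()  # complex power S = U * conj(I)
    return s_ij.real, s_ij.imag, abs(i_ij)


def bus_injection(flows_leaving_bus):
    """Active/reactive injection at a bus as the sum of the flows leaving it."""
    p = sum(f[0] for f in flows_leaving_bus)
    q = sum(f[1] for f in flows_leaving_bus)
    return p, q


# Lossless line with reactance x = 0.1 p.u. and a 0.1 rad angle difference.
y = 1 / (0.1j)
p_ij, q_ij, i_mag = branch_flow(1.0, 0.1, 1.0, 0.0, y)
p_ji, q_ji, _ = branch_flow(1.0, 0.0, 1.0, 0.1, y)
```

For a lossless line, the active power sent from bus $i$ equals the power received at bus $j$, so `p_ij + p_ji` vanishes up to rounding.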
The measurement function $h(x)$ is nonlinear, and Eq. (2) includes the probabilistic noise vector $\varepsilon$. In SE, we are interested in finding the inverse relation $h^{-1}(z)$ to estimate the state vector $x$ while compensating the error $\varepsilon$. The conventional SE approach, shown in Fig. 1a, uses the iterative Newton-Raphson algorithm to minimize a WLS objective function [40]. This technique uses the redundancy of measurements to provide an accurate estimation. However, the approach requires at least the same number of measurements as state variables, meaning $m \geq 2n - 1$, and the system needs to be fully observable. Moreover, matching this requirement but failing to provide enough redundancy highly impacts the estimation accuracy. In practice, $m \approx 4n$ achieves satisfying results, which is impractical for the distribution system [41]. The iterative process may even diverge in case of poor observability or high noise level in the measurements [4].
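As an illustration of the iterative WLS solver, the sketch below applies the Gauss-Newton form of the update (the normal-equation step commonly used in SE) to a toy problem with a single state variable. The measurement model `h(v) = [v, G * v**2]`, the coefficient `G`, and all numerical values are assumptions made for this example only.

```python
def wls_gauss_newton(z, sigma, h, jac, x0=1.0, tol=1e-8, max_iter=20):
    """Scalar-state WLS: iterate dx = (H^T W H)^-1 H^T W (z - h(x))."""
    x = x0
    for iteration in range(1, max_iter + 1):
        residuals = [zk - hk for zk, hk in zip(z, h(x))]
        H = jac(x)  # one Jacobian entry per measurement
        gain = sum(Hk * Hk / sk ** 2 for Hk, sk in zip(H, sigma))
        rhs = sum(Hk * rk / sk ** 2 for Hk, rk, sk in zip(H, residuals, sigma))
        dx = rhs / gain
        x += dx
        if abs(dx) < tol:  # converged: update below tolerance
            return x, iteration, True
    return x, max_iter, False  # did not converge within max_iter


G = 0.5  # hypothetical coefficient of the quadratic "power-like" measurement


def h(v):
    return [v, G * v * v]


def jac(v):
    return [1.0, 2.0 * G * v]


# Noise-free measurements of a true state of 1.02 p.u., flat start at 1.0.
z_exact = h(1.02)
v_hat, n_iter, converged = wls_gauss_newton(z_exact, [0.01, 0.02], h, jac)
```

With two measurements for one state, the problem is redundant; removing a measurement or adding large noise is an easy way to observe how the iteration count and accuracy degrade.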
Another approach to approximate $h^{-1}(z)$ is to train an Artificial Neural Network (ANN) to map this function. ANNs approximate functions using a series of nonlinear operations parameterized by their trainable weights $\theta$ [42]. These weights are estimated during the training phase to approximate the given relation. For the SE task, the model is trained to approximate the inverse relation $h^{-1}(z)$, considering the measurement vector $z$ as the input of the ANN and the state vector $x$ as its output. We use the state estimation convention of $x$ as output and $z$ as input of the model:
$$\hat{x} = f_\theta(z) \approx h^{-1}(z). \tag{4}$$
In a common supervised learning approach, a label vector $y$ is assigned as the true value of $x$ for each measurement sample (one measurement vector $z$). In the training process, shown in Fig. 1b, the model fits the data using the available labels as reference. This approach, although quite efficient, is impractical for DSSE due to the lack of labelled data. Instead, our contribution combines Deep Learning and WLS optimization to propose a weakly supervised learning approach, alleviating the need for labels.

B. Weakly supervised learning
To develop a weakly supervised learning approach for the DSSE task, we incorporate the power flow equations from Eq. (3) within the training phase, as shown in Fig. 1c. The H2MGNN algorithm takes the measurement vector $z$ as an input and provides an estimated state vector $\hat{x}$ as an output, which is then used to retrieve the estimated values $h(\hat{x})$ of the network using the power flow equations (Eq. (3)). The loss function of the model is then set to be the minimization objective of the WLS approach:
$$\mathcal{L}(\hat{x}, z) = \sum_{k \in \mathcal{M}} \frac{\left(z_k - h_k(\hat{x})\right)^2}{\sigma_k^2}, \tag{5}$$
with $\sigma_k$ the standard deviation of the uncertainty of measurement $k$, assumed Gaussian, and $\mathcal{M}$ the measurement set. While SE aims at considering the actual noise $\varepsilon$ of the measurement vector, a 'first guess' of this value, $\sigma_k$, is assumed. $|\mathcal{M}| = m$ is the number of measurements, where $|\cdot|$ is the cardinality of a set. We assume uncorrelated measurements.

Fig. 1: State estimation with (a) the Weighted Least Squares approach, (b) the supervised learning approach, and (c) the proposed weakly supervised DSS$^2$ approach. (a) WLS with a Newton-Raphson solver uses an initial guess $x_0$ of the state vector $x$ and iteratively updates $\hat{x}$ until the update $\Delta x$ falls below the tolerance $\epsilon$ or a maximum number of iterations is reached, (b) supervised learning uses a label vector $y$ to train an ANN to fit the output $\hat{x}$ to the input $z$, and (c) the proposed weakly supervised approach considers the power flow equations $h(x)$ (Eq. (3)) to get the estimated measurements and fits the output $\hat{x}$ of the GNN model to the input $z$ without labels. The target optimization is similar to WLS.
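A minimal sketch of the WLS objective described in this subsection, written for plain Python lists. The function name `wls_loss` and the example values are illustrative assumptions; uncorrelated measurements are assumed, so the weight of each residual is simply the inverse of its variance.

```python
def wls_loss(z, h_x, sigma):
    """Sum of squared residuals, each weighted by 1 / sigma_k**2."""
    return sum(((zk - hk) / sk) ** 2 for zk, hk, sk in zip(z, h_x, sigma))


# The same 0.1 residual counts 25 times more when sigma is 0.1 than when it is 0.5.
loss_accurate = wls_loss([1.0], [1.1], [0.1])
loss_inaccurate = wls_loss([1.0], [1.1], [0.5])
```

This weighting is what lets accurate measurements dominate the loss, while pseudomeasurements with a large standard deviation contribute only weakly.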
The detailed training procedure shown in Fig. 1c can be divided into four steps:
1) An estimated state vector $\hat{x}$ is given by the H2MGNN algorithm from an input measurement vector $z$.
2) The power flow equations integrated into the measurement function $h(x)$ are used to retrieve the network's values from the estimated state $\hat{x}$.
3) The estimated values $h(\hat{x})$ are compared to the actual measurement vector $z$ in the Weighted Least Squares loss function. An estimation error is retrieved from each input measurement and weighted by the inverse of the measurement's variance.
4) The sum of all weighted estimation errors forms the loss function, and we backpropagate its partial derivatives and apply gradient descent to tune the H2MGNN accordingly.
In this training process, the model is trained by minimizing the error between the measurements $z$ and the estimated values $h(\hat{x})$, and the inverse of the measurement variance $\sigma^2$ is used as a weight to emphasize learning from the most accurate measurements. With this loss function, we implement a weakly supervised learning approach where we use the input measurements $z$ as noisy, imperfect labels that the H2MGNN needs to fit through the power flow equations, and no ground-truth labels are used for training. Function (3) is differentiable w.r.t. the output state variables, and the gradient can be expressed using the measurement Jacobian matrix $H(x) = \nabla h(x)$.
With this method, the target optimization of the training phase is exactly the conventional WLS minimization problem, allowing the model to learn an input-output mapping that represents this function.Our goal is to achieve a similar level of performance as the WLS approach while improving the model's numerical stability, computation time, robustness, and observability requirements.
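The training idea of this subsection can be illustrated on a toy problem. The sketch below is not the H2MGNN: it replaces the GNN with a three-weight linear model and uses a hypothetical one-state measurement function `h(x) = [x, G * x**2]`. It only shows how a model can be trained by gradient descent on the WLS loss through $h$ and its Jacobian, using noisy measurements as the only targets. The true states are generated solely to evaluate the result and are never used in training.

```python
import math
import random

G = 0.5                # hypothetical coefficient of the toy measurement model
SIGMA = (0.01, 0.01)   # assumed measurement standard deviations


def h(x):
    return (x, G * x * x)


def jac(x):
    return (1.0, 2.0 * G * x)


def features(z):
    # Centre and scale the raw measurements so plain gradient descent is stable.
    return (1.0, (z[0] - 1.0) / 0.03, (z[1] - G) / 0.03)


def predict(w, z):
    # Flat start: with zero weights the model outputs 1 p.u.
    return 1.0 + sum(wi * fi for wi, fi in zip(w, features(z)))


def wls(z, x):
    return sum(((zk - hk) / sk) ** 2 for zk, hk, sk in zip(z, h(x), SIGMA))


def train(samples, lr=5e-6, epochs=300):
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for z in samples:
            x = predict(w, z)
            # dL/dx = -2 * sum_k H_k * (z_k - h_k(x)) / sigma_k^2
            dl_dx = -2.0 * sum(Hk * (zk - hk) / sk ** 2
                               for Hk, zk, hk, sk in zip(jac(x), z, h(x), SIGMA))
            for j, fj in enumerate(features(z)):
                grad[j] += dl_dx * fj / len(samples)  # chain rule to the weights
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


rng = random.Random(0)
truth = [rng.uniform(0.97, 1.03) for _ in range(200)]  # evaluation only
samples = [(v + rng.gauss(0, SIGMA[0]), G * v * v + rng.gauss(0, SIGMA[1]))
           for v in truth]
w = train(samples)

loss_before = sum(wls(z, 1.0) for z in samples) / len(samples)
loss_after = sum(wls(z, predict(w, z)) for z in samples) / len(samples)
rmse_raw = math.sqrt(sum((z[0] - v) ** 2 for z, v in zip(samples, truth)) / len(truth))
rmse_est = math.sqrt(sum((predict(w, z) - v) ** 2
                         for z, v in zip(samples, truth)) / len(truth))
```

Because two noisy measurements constrain a single state, the trained model tends to average them, which is why its error against the hidden true state can end up below that of the raw voltage measurement.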

C. Physical penalization terms in the loss function
We propose aiding the WLS learning loss (5) with different penalization terms to reduce the number of local minima and 'guide' the outputs towards physically feasible solutions. We assume our model does not need to estimate unstable states, as protection schemes are faster and more reliable, so we guide the learning process to only estimate stable states in the output. Considering stable networks, we add three terms to the loss:
• Voltage level stability criteria: power systems ensure a voltage level between $V_{LB} = 95\%$ and $V_{UB} = 105\%$ per unit to remain stable. Therefore, a two-sided penalization [...] on the line loading when the prediction gives a loading higher than $l_{UB} = 100\%$.
Adding these terms to the loss function yields the loss used in the training process, where $\lambda_0, \lambda_1, \lambda_2, \lambda_3$ are hyperparameters set to balance the effect of each term during training. These terms push the model output towards physically plausible boundaries and prevent it from diverging toward local minima that are well beyond the physical margins of the system.
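A sketch of how such penalization terms might be combined with the WLS loss. Only the voltage band (0.95 to 1.05 p.u.) and the 100% loading limit come from the text; the squared-hinge form of each penalty, the function names, and the default weights are assumptions, and the sketch covers only these two terms.

```python
def relu(x):
    return max(0.0, x)


def voltage_penalty(voltages, v_lb=0.95, v_ub=1.05):
    """Two-sided penalty: only the part outside [v_lb, v_ub] is penalised."""
    return sum(relu(v_lb - v) ** 2 + relu(v - v_ub) ** 2 for v in voltages)


def loading_penalty(loadings, l_ub=1.0):
    """One-sided penalty on line loadings above l_ub (1.0 = 100%)."""
    return sum(relu(l - l_ub) ** 2 for l in loadings)


def penalised_loss(wls, voltages, loadings, lam=(1.0, 1.0, 1.0)):
    """Weighted sum of the WLS loss and the two penalty terms."""
    return (lam[0] * wls
            + lam[1] * voltage_penalty(voltages)
            + lam[2] * loading_penalty(loadings))


total = penalised_loss(2.0, [1.0, 0.90], [0.5, 1.2])
```

States inside the bounds contribute nothing, so the penalties leave the WLS optimum untouched whenever it is already physically plausible.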

III. THE DEEP STATISTICAL SOLVER MODEL
This section presents the H2MG structure, the modelling of the heterogeneous components of the distribution grid, the Hyper-Heterogeneous Multi Graph Neural Network (H2MGNN), and how to train the H2MGNN with weak supervision for DSSE by applying Sec. II.

Fig. 2: (a) power grid, (b) standard graph model, and (c) hypergraph model, in which components are modelled as hyperedges connected to any number of connection ports with their features.

A. Hyper-Heterogeneous Multi Graph (H2MG)
The H2MG uses hypergraphs to model power grids. Power grids are complex networks where different heterogeneous components are connected, as shown in Fig. 2a. Modelling power networks solely with vertices and edges, as done with standard graph models (Fig. 2b), leads to information losses when merging grid components together into graph objects. More versatile modelling of such networks is possible using hypergraphs (Fig. 2c), where each component can be modelled as a specific hyperedge, which can mitigate the loss of information.
The H2MG formalism is defined by:
• Objects as hyperedges: every object in the network is modelled as a hyperedge that can connect to any number of vertices. This is shown in Fig. 2, where each component is modelled separately as a hyperedge: we represent lines and transformers as hyperedges connected to two vertices, whereas buses are modelled as hyperedges connected to one vertex.
• Vertices as ports: vertices represent the interfaces between objects. In a hypergraph, vertices are connection points between the components (the hyperedges). These connection points between components in a power system are the network's buses. Therefore, we model buses both as hyperedges (network components) and as vertices (network interfaces).
• Hyper-Heterogeneous Multi Graph: the collection of hyperedges connected through vertices forms a hypergraph, and we call this hypergraph heterogeneous if it contains multiple classes of objects. Hyperedges carry features and outputs, while vertices, as connection ports, do not carry input-output information.
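The formalism above can be sketched as a small data structure. The class and method names (`Hyperedge`, `H2MG`, `neighbours`) are hypothetical and only illustrate the idea that every object is a hyperedge attached to one or more ports, while the ports themselves carry no features.

```python
from dataclasses import dataclass, field


@dataclass
class Hyperedge:
    cls: str        # object class, e.g. "bus", "line", "transformer"
    ports: tuple    # vertices (connection ports) the object attaches to
    features: dict = field(default_factory=dict)


@dataclass
class H2MG:
    hyperedges: list = field(default_factory=list)

    def add(self, cls, ports, **features):
        self.hyperedges.append(Hyperedge(cls, tuple(ports), features))

    def neighbours(self, vertex):
        """All hyperedges connected to a given vertex (port)."""
        return [e for e in self.hyperedges if vertex in e.ports]

    def classes(self):
        return sorted({e.cls for e in self.hyperedges})


# Three buses, one line and one transformer (illustrative values only).
grid = H2MG()
for bus in (0, 1, 2):
    grid.add("bus", (bus,), v_meas=None)
grid.add("line", (0, 1), closed=True)
grid.add("transformer", (1, 2), shift=0.0)
```

Bus 1 appears once as a one-port hyperedge and twice as a port of other objects, which mirrors the dual role of buses described above.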

B. H2MG Neural Network (H2MGNN)
The H2MGNN is a GNN architecture that works with H2MG models. It uses a recursive process to learn information from the hypergraph and related features. It is a recurrent and residual GNN architecture, with trainable mappings implemented as standard ANNs and trained through standard backpropagation. As presented in Algorithm 1, we consider four types of variables:
• vertex latent variables, considering a vertex set $\mathcal{V}$ corresponding to the interface role of buses: $h^v_i, \forall i \in \mathcal{V}$;
• hyperedge latent variables, considering $c \in \mathcal{C}$ as the objects' class, and $e \in \mathcal{E}^c$ as the objects' hyperedge: $h^c_e$;
• hyperedge inputs $z^c_e$;
• hyperedge outputs $\hat{x}^c_e$.
In our setting, the hyperedge index $e$ refers to the object's connections: considering a vertex $i$ and its neighbouring vertex $j$, $e = i$ for a bus, and $e = ij$ for a line.
In the initialization of Algorithm 1, the hyperparameter $d$ sets the dimension of the latent variables. We initialize these latent variables with a flat start (zero values) and set the predicted output variables to the initial values $\hat{x}^c_{e,0}$, which depend on the task. For DSSE, a common initialization is $V_i = 1$ p.u. and $\varphi_i = 0$ rad. Then, the H2MGNN algorithm recursively updates these variables in the system with trainable mappings $\phi_\theta$. An iteration variable $t$ is defined to weigh each iteration in the update process, and $T$ is the maximum number of iterations. At each iteration, latent variables are updated by an increment defined through a message-passing step similar to conventional GNN algorithms. Algorithm 1 loops over the vertices $i \in \mathcal{V}$, the classes $c \in \mathcal{C}$ and the hyperedges $e \in \mathcal{E}^c$, with $\mathcal{N}(i)$ the set of hyperedges connected to vertex $i$, and $o$ the connection port of a hyperedge (if connected to multiple ports). The final output of the model is stored in the hyperedge outputs $\hat{x}^c_e$.
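The loop structure described above can be sketched as a generic recurrent, residual message-passing scheme over a hypergraph. This is not Algorithm 1 itself: the latent dimension is reduced to a scalar, the trainable mappings are replaced by a fixed `tanh` stand-in named `phi`, and the `1 / T` scaling of each increment is an assumption.

```python
import math


def phi(*args):
    """Stand-in for a trainable ANN mapping: tanh of an equal-weight sum."""
    return math.tanh(sum(args))


def h2mgnn_sketch(vertices, hyperedges, T=5):
    h_v = {i: 0.0 for i in vertices}        # vertex latents, flat start
    h_e = [0.0] * len(hyperedges)           # hyperedge latents, flat start
    x_hat = [e["x0"] for e in hyperedges]   # task-dependent initial outputs
    for _ in range(T):
        # Node-centric step: each vertex aggregates messages from the
        # hyperedges connected to it, then receives a residual increment.
        updated = {}
        for i in vertices:
            messages = [phi(h_v[i], h_e[k], e["z"])
                        for k, e in enumerate(hyperedges) if i in e["ports"]]
            updated[i] = h_v[i] + sum(messages) / T
        h_v = updated
        # Hyperedge step: latents and outputs are incremented from the
        # latent state of the ports each hyperedge is attached to.
        for k, e in enumerate(hyperedges):
            port_state = sum(h_v[i] for i in e["ports"]) / len(e["ports"])
            h_e[k] += phi(port_state, h_e[k], e["z"]) / T
            x_hat[k] += phi(h_e[k]) / T
    return h_v, h_e, x_hat


vertices = [0, 1]
hyperedges = [
    {"ports": (0,), "z": 1.0, "x0": 1.0},    # bus 0, carries an input
    {"ports": (1,), "z": 0.0, "x0": 1.0},    # bus 1, no input
    {"ports": (0, 1), "z": 0.0, "x0": 0.0},  # line between the buses
]
h_v_one, _, _ = h2mgnn_sketch(vertices, hyperedges, T=1)
h_v_five, h_e_five, x_hat_five = h2mgnn_sketch(vertices, hyperedges, T=5)
```

With a single iteration, the input at bus 0 has not yet reached vertex 1; with more iterations, it propagates through the line hyperedge, which illustrates why the number of iterations $T$ bounds how far information travels.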

C. Proposed DSS 2 implementation
As presented in Fig. 1c, we use the H2MGNN model to estimate the state variables $x$ from the measurements $z$ through Alg. 1 and train it through the weakly supervised approach with the WLS as target optimization. For the DSSE described in Sec. II, the DSS$^2$ model approximates the inverse of the measurement function Eq. (3) as Eq. (4). To simplify the model, we consider balanced systems and only model the positive sequence of the networks. We model buses, lines and transformers, and integrate generators and loads as the nodes' power injections, as commonly done for the DSSE task.
The input features follow the WLS algorithm where, for each measurement, we consider two values: the measured value and its uncertainty. Voltage angles are considered as possible inputs of the model to allow the use of synchronized phasors, but are not required, as most distribution systems do not carry such measurements. We also add, as topology parameters, all other features needed to compute the measurement function Eq. (3). The features and parameters assigned to each class of components are listed in Table I. The model's output is every bus's voltage amplitude and angle, as is typical in SE. Finally, we add booleans to detail components: $\mathbb{1}_z$ defines buses with zero injection (no consumption or generation), $\mathbb{1}_s$ defines slack buses, and $\mathbb{1}_{cl}$ defines closed lines. These booleans simplify the model and provide more information about the network to the DSS$^2$ model. In other words, this simplification considers 'virtual measurements' to enforce zero power injection at buses without injection ($\mathbb{1}_z$), no power flow at the buses connected to an open line ($1 - \mathbb{1}_{cl}$), and $V_s = 1$ p.u. and $\varphi_s = 0$ rad at the slack bus $s$, where $s$ is the index of the vector $\mathbb{1}_s$ that equals 1. Since distribution grids typically have a limited number of measurements, we assume a low number of measurements and incorporate pseudomeasurements to complete the observability of the system. These pseudomeasurements are based on historical demand data and are added as active and reactive power injections $P_i$ and $Q_i$ for buses where observability is lacking.
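The way measured values, pseudomeasurements and zero-injection 'virtual measurements' could be merged into (value, uncertainty) input pairs is sketched below. The function name, the dictionary layout and the three standard deviations are hypothetical choices for illustration, not values taken from the paper.

```python
def injection_features(buses, measured, pseudo, zero_injection,
                       sigma_meas=0.01, sigma_pseudo=0.3, sigma_virtual=1e-4):
    """Return {bus: (value, sigma)} for the power injection input feature."""
    feats = {}
    for bus in buses:
        if bus in zero_injection:
            # Virtual measurement: injection known to be zero, tiny uncertainty.
            feats[bus] = (0.0, sigma_virtual)
        elif bus in measured:
            feats[bus] = (measured[bus], sigma_meas)
        else:
            # No sensor: fall back to the historical pseudomeasurement,
            # flagged as much less reliable through a larger sigma.
            feats[bus] = (pseudo[bus], sigma_pseudo)
    return feats


feats = injection_features(
    buses=[0, 1, 2, 3],
    measured={1: -0.42},
    pseudo={0: -0.10, 1: -0.40, 2: -0.25, 3: -0.30},
    zero_injection={3},
)
```

Since the uncertainty enters the WLS loss as an inverse-variance weight, the pseudomeasurement at bus 2 influences training far less than the real measurement at bus 1.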

IV. CASE STUDIES
Case studies were undertaken to provide insights into the proposed approach and evidence of its efficacy. After stating the case study settings and showing the efficiency of the proposed weakly supervised learning approach, we analyse the performance of DSS$^2$ by exploring the trade-off between available labels and accuracy, and then investigate the accuracy, convergence and computational speed for larger networks. Finally, we investigate the performance of the proposed approach under different measurement noise levels, with disturbed measurements, and under higher and lower load levels and renewable power.

A. Test systems and setup
We considered the 14-bus CIGRE MV distribution grid with PV and Wind distributed energy resources (DER) activated [43], the 70-bus Oberrhein MV/LV sub-grid, and the whole 179-bus Oberrhein grid from [41]. The networks and the measurement locations for each network are shown in Fig. 3. These measurements $\mathcal{M}$ either measure the power flow over lines or the voltages at buses and were assumed with different Gaussian noise, as further discussed.

Table I: Features and parameters per class of components. [...] Line admittance: $Y_{ij}$; shunt admittance: $Y_{s,ij}$. Output features: voltage magnitude $V_i$; voltage angle $\varphi_i$.

Fig. 3: Three test networks consisting of transformers, lines, MV/LV buses and HV buses: (a) 14-bus CIGRE MV grid from [43], (b) 70-bus Oberrhein MV sub-grid from [41], (c) 179-bus Oberrhein MV grid from [41]. The state can be estimated using the set of power flow measurements and voltage measurements. Relevant indices are indicated, and lines' indices are underlined. Case studies on the 70-bus grid focus on the marked buses.
For each network, 8640 load samples were collected, equivalent to one year of hourly data. Each load scenario considers load levels of 24 consecutive samples discretized hourly for all loads in the network. These load scenarios resulted from a Monte Carlo sampling on standard load profiles taken from [44], considering a 15% uncertainty. For each sample in each scenario, assuming balanced systems, the full true state was computed with an AC power flow using PandaPower 2.9 [41] and Python 3.8. One sample's full true state comprises the active and reactive power levels of all loads and generators, the bus voltage levels and phases, and the line loadings. System operators do not have access to this full true state; however, some key variables are provided by the measurements specified earlier.
These observed variables were assumed corrupted with zero-mean Gaussian white noise at the measurement locations. Standard deviations between 0.5% and 2% were assumed for the voltage and current measurement noise, and between 1% and 5% for the active and reactive power measurement errors. Pseudomeasurements of power levels were considered at every (unobserved) bus using generic load and generation profiles taken from [44].
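The noise model can be sketched as follows. The text gives the standard deviations as percentages without stating the reference value, so this sketch assumes they are relative to the true value of each measurement; the function name `add_gaussian_noise` is hypothetical.

```python
import random


def add_gaussian_noise(true_value, rel_std, rng):
    """Corrupt a true value with zero-mean Gaussian noise.

    rel_std is the standard deviation as a fraction of |true_value|
    (assumption); the absolute sigma is returned for use as a WLS weight.
    """
    sigma = rel_std * abs(true_value)
    return true_value + rng.gauss(0.0, sigma), sigma


rng = random.Random(42)
# 10000 noisy readings of a 1.0 p.u. voltage with 1% standard deviation.
draws = [add_gaussian_noise(1.0, 0.01, rng)[0] for _ in range(10000)]
sample_mean = sum(draws) / len(draws)
```

Returning the absolute sigma alongside the noisy value keeps the measured value and its uncertainty together, as required by the input features of the model.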
The dataset was split into train, validation, and test sets, following an 80/10/10 split.In supervised learning, the measurement vector z at the measurement locations mentioned above represents the input to the model, and the full state represents the label y.
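A minimal sketch of an 80/10/10 split over the 8640 samples. Whether the original split was shuffled or chronological is not stated, so the seeded shuffle here is an assumption, as is the function name `split_indices`.

```python
import random


def split_indices(n_samples, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle sample indices and cut them into train/validation/test parts."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_train = int(round(fractions[0] * n_samples))
    n_val = int(round(fractions[1] * n_samples))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


train_idx, val_idx, test_idx = split_indices(8640)
```

For 8640 samples this yields 6912 training, 864 validation and 864 test samples.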
Several baseline models were considered: the standard SE WLS algorithm [45], a standard ANN model trained with supervised learning, and the DSS$^2$ model trained with supervised learning (referred to as sup. DSS$^2$). The WLS algorithm from PandaPower 2.9 was used, and the deep learning models were implemented in TensorFlow 2.8 [46]. The ANN was designed with 5 layers of 32 hidden units, tanh activation functions and a Glorot normal initializer. The code to reproduce the case studies of this paper can be accessed on GitHub [39].

B. Efficiency of the weakly-supervised learning
This section investigates the efficiency of the weakly supervised learning DSS$^2$ approach and the hyperparameters that can impact the state estimation accuracy. The penalization factor λ = λ_0 = λ_1 = λ_2, the batch size, the dropout rate r, the ℓ_2-regularizer, and the number of iterations T were fixed. A grid search tuned the hyperparameters learning rate within the ranges α ∈ {0.001, 0.002,

C. Trade-off between accuracy and available labels
This case study investigates the performance of the proposed weakly supervised DSS$^2$ model on the 14-bus system compared to three baselines. The second column in Table III summarizes the results. The RMSE for voltages of the proposed weakly supervised DSS$^2$ was more than three times lower than that of the WLS, 2.5% versus 9.9%. In more detail, Figure 5a shows the voltage estimation RMSE per bus. The RMSE was lower than the 0.5% threshold for all buses, showing successful learning from voltage measurement data while handling measurement noise. The difference in RMSE between the observed buses (1, 8, and 12) and the unobserved buses is small, showing the capability of our DSS$^2$ model to extrapolate to all buses. The supervised models (ANN, sup. DSS$^2$) estimated the voltage more accurately, as expected, because they learned from the ideal true voltage data and thus had an unfair, impractical advantage.
The line loading RMSE of the weakly supervised DSS$^2$ was equivalent to that of the WLS and outperformed the supervised models by a wide margin, according to Table III. This observation is instructive: the supervised models poorly estimated indirect quantities such as the line loading, which are calculated from the state variables using the power flow equations. The models only output the state variables, and the supervised models did not adequately capture the coupling between state variables that determines the line loading. In contrast, the weakly supervised model learned this coupling directly through the power flow equations and therefore estimated the line loading more accurately.
In more detail, Figure 5b shows the loading estimation error per line. The weakly supervised DSS$^2$ model had a very high accuracy on measured lines (lines 0 and 10) and their extensions (lines 1 and 11). However, there was a clear drop in performance for the estimation of transformer loading, shown at indexes 12 and 13. The simple modelling of transformers or slack may have led to this reduced accuracy, as the transformers and lines were considered in the same class of models. As a result of this simple modelling, the H2MGNN applied the same mapping to these components, which may have reduced the accuracy of transformer estimations.
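As a rough illustration of the per-bus metric discussed above, the following sketch computes one RMSE value per bus over a set of test samples. It assumes a plain RMSE on per-unit voltages; the exact normalization used for Table III is not stated in this section, and the function name and toy data are hypothetical.

```python
import math


def rmse_per_bus(v_true, v_est):
    """Return one RMSE value per bus.

    v_true, v_est: lists of samples; each sample is a list of
    per-bus voltages (assumed to be in per-unit).
    """
    n_samples = len(v_true)
    n_buses = len(v_true[0])
    rmse = []
    for b in range(n_buses):
        sq_err = sum((v_est[s][b] - v_true[s][b]) ** 2 for s in range(n_samples))
        rmse.append(math.sqrt(sq_err / n_samples))
    return rmse


# Hypothetical toy data: 2 samples, 3 buses.
v_true = [[1.00, 0.98, 1.02], [1.01, 0.99, 1.00]]
v_est = [[1.00, 0.97, 1.02], [1.01, 1.00, 1.03]]
errors = rmse_per_bus(v_true, v_est)
```

Comparing such per-bus values between observed and unobserved buses is one way to check how well an estimator extrapolates across the network.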

D. Convergence, accuracy and computation speed in large networks
This case study investigates the performance of the proposed DSS$^2$ compared to the WLS in larger networks, the 70-bus and 179-bus networks, along three performance criteria: the convergence rate, the accuracy, and the computational time. The 2nd and 3rd columns in Table III summarize the results.
When analysing the convergence rate, the DSS$^2$ always converged, whereas the WLS never converged in the 179-bus network. The WLS was unstable in this large and noisy network, which led to these poor convergence rates. The convergence issues of the WLS with noisy measurements in large systems are well known [4], [10]. Many noisy measurements constrain the Newton-Raphson solver and can lead to divergence. More specifically, the WLS had issues handling flow measurements. In response to these issues, and to compare the accuracy and computational times of DSS$^2$ with WLS, only voltage measurements and pseudomeasurements were used in WLS to increase the convergence rate (WLS* in the table). This increased the convergence rate in the 70-bus system but not in the 179-bus system. Therefore, in the 179-bus system, the tolerance of the Newton-Raphson iterative process and the number of iterations were increased (WLS** in the table). Increasing these parameters raised the convergence rate at the cost of lower accuracy and slower processing.
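To make the role of the tolerance and the iteration limit concrete, the sketch below solves a toy single-state WLS problem with Gauss-Newton-style iterations, a common way to solve the WLS problem (the text refers to the solver as Newton-Raphson). It is not the implementation used in the case studies: the function name, the measurement functions, and the numerical values are all hypothetical.

```python
def wls_gauss_newton(z, sigma, h, dh, x0, tol=1e-8, max_iter=20):
    """Scalar-state WLS solved by Gauss-Newton iterations.

    z: measurements, sigma: their standard deviations,
    h(x): list of model values, dh(x): list of derivatives dh_i/dx.
    Returns (estimate, converged, iterations_used).
    """
    x = x0
    for k in range(1, max_iter + 1):
        hx, jac = h(x), dh(x)
        weights = [1.0 / s ** 2 for s in sigma]
        # Normal equation for one state: step = (H^T W r) / (H^T W H).
        num = sum(w * j * (zi - hi) for w, j, zi, hi in zip(weights, jac, z, hx))
        den = sum(w * j * j for w, j in zip(weights, jac))
        step = num / den
        x += step
        if abs(step) < tol:  # stop once the update is below the tolerance
            return x, True, k
    return x, False, max_iter  # iteration limit reached without converging


# Hypothetical toy problem: two measurements of one state, true value x = 2.
h = lambda x: [x, x * x]
dh = lambda x: [1.0, 2.0 * x]
z, sigma = [2.0, 4.0], [0.01, 0.02]

x_hat, converged, n_iter = wls_gauss_newton(z, sigma, h, dh, x0=1.0)
_, converged_short, _ = wls_gauss_newton(z, sigma, h, dh, x0=1.0, max_iter=1)
```

A looser `tol` or a larger `max_iter` makes it more likely that the loop reports convergence, which mirrors the trade-off described for WLS**: more runs converge, but the accepted solutions can be less accurate and take longer to compute.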
When analysing the accuracy, a key advantage of the DSS$^2$ becomes visible. DSS$^2$ outperformed the WLS in every metric in the two larger networks. Models based on GNNs, such as DSS$^2$, learn from local operations (in the neighbourhood of buses) and extrapolate to other locations (to other neighbourhoods of buses). Therefore, the more buses and lines in the network, the more local operations there are to learn from, which can further enhance the model's accuracy. Also, these networks have more static loads and fewer DERs than the 14-bus network, so the variation of voltage and line loading was smaller, and the estimated values from the DSS$^2$ became more accurate. Figures 6a and 6b compare the estimated voltage levels through a sampling period in the 70-bus system for the measured bus 34 and the unmeasured remote bus 223, respectively. The DSS$^2$ model estimated the voltage at measured nodes with high accuracy despite the noisy measurements. However, the model generalized less well when estimating voltage at remote, unmeasured nodes.
When analysing the computation times, in the last row of Table III, the DSS$^2$ increasingly outperformed the WLS for larger networks. The computational time of the WLS and the DSS$^2$ increased from the 70-bus network to the 179-bus network by factors of 10 and 2, respectively. The DSS$^2$ thus scaled to the larger network 5-fold better than the WLS algorithm. The WLS needed more iterations for this larger system until the Newton-Raphson process converged, even though the tolerance was increased, which typically decreases the computational time. The DSS$^2$ also showed a lower variance in the computational times, as it is not based on an iterative algorithm.

E. Measurement noise
This case study compares the robustness of the DSS$^2$ and the WLS to measurement noise in the 70-bus network. The level of measurement noise refers to the standard deviation $\sigma_i$ of the Gaussian noise added to the measurements. Three different levels of noise were considered. The default level had 1% noise on ideal measurements of voltage and current, and 2% noise on the ideal measurements of active and reactive power; the low level had 0.5% and 1% noise, and the high level 3% and 5%, respectively.
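The three noise levels can be sketched as follows. This is a minimal illustration that assumes the percentages are standard deviations relative to the ideal measurement value; the dictionary keys, the function name, and the sample voltages are hypothetical.

```python
import random

# Relative standard deviations per noise level, as listed in the text.
NOISE_LEVELS = {
    "low": {"voltage_current": 0.005, "power": 0.01},
    "default": {"voltage_current": 0.01, "power": 0.02},
    "high": {"voltage_current": 0.03, "power": 0.05},
}


def add_measurement_noise(ideal_values, kind, level, rng):
    """Add zero-mean Gaussian noise to each ideal measurement.

    Assumption: the standard deviation is a fraction of the ideal value.
    """
    rel_sigma = NOISE_LEVELS[level][kind]
    return [rng.gauss(v, rel_sigma * abs(v)) for v in ideal_values]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
ideal_voltages = [1.00, 0.98, 1.02]  # hypothetical per-unit voltages
noisy_voltages = add_measurement_noise(ideal_voltages, "voltage_current", "high", rng)
```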
At high noise, Fig. 7 shows that the RMSE of the DSS$^2$ was more than 10 times lower than that of the WLS, showing the significantly higher robustness of DSS$^2$. DSS$^2$ was about as accurate at low and high noise as at the default noise level. DSS$^2$ learned to process many noisy signals with different standard deviations within the range of the high noise level and across GNN structures. The dropout step during training improved the capability of the DSS$^2$ model to handle stochasticity, including noise. Fig. 8 compares the voltage level estimation at high measurement noise for bus 34. The DSS$^2$ successfully cancelled the increased noise, whereas the WLS algorithm struggled to stay accurate.

F. Missing and erroneous measurements
This case study investigates the impact of missing and erroneous measurements on the DSS$^2$ and the WLS algorithm in the 70-bus network. Case (i) assumed a missing voltage measurement on bus 39 that was naively replaced with its historical mean value. Case (ii) assumed erroneous voltage measurements on buses 39, 58 and 80, and erroneous active power flow measurements in lines 162 and 165, with a higher deviation from the true state values than the expected (standard) deviation. Case (iii) assumed missing voltage measurements on buses 34, 39 and 80 and erroneous voltage measurements on bus 58.
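The naive replacement of a missing measurement by its historical mean can be sketched as below. The function name, the use of `None` to mark a missing sample, and the sample values are assumptions made for illustration only.

```python
from statistics import fmean


def impute_with_historical_mean(measurements, history):
    """Replace missing samples (None) by the mean of past measurements."""
    fallback = fmean(history)
    return [fallback if m is None else m for m in measurements]


# Hypothetical voltage series (p.u.) for one bus with one missing sample.
history = [0.99, 1.01, 1.00, 1.00]
filled = impute_with_historical_mean([1.0, None, 0.99], history)
```

The imputed value carries no information about the current operating point, so an estimator has to rely on neighbouring measurements to recover the state at that bus.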
Fig. 9 shows the results. The DSS$^2$ had high robustness to missing and erroneous measurements in all three cases, with an RMSE similar to that of the default case (no missing or erroneous measurements). However, the erroneous measurement case (ii) impacted the WLS, with an increase of around 20% in relative voltage RMSE. Fig. 10 focuses on one bus with a missing measurement, bus 34 in case (iii). The measurement at bus 34 was missing for the whole sequence and was naively replaced by the empirical mean value (light blue).
A key insight of this analysis is that the DSS$^2$ was not impacted by this missing value and still provided an accurate estimation. Interestingly, the DSS$^2$ model was not trained to handle such events. However, using known patterns from neighbouring information, the DSS$^2$ remained accurate. Indeed, the GNN architecture increased the interpolation capabilities by incorporating the data symmetries w.r.t. the underlying graph.

G. Changes in power levels of load and renewables
This case study investigates the generalization capability of the DSS$^2$ (and the WLS) to changes in the power levels of loads and distributed generation compared to the training dataset. Three cases altered the power levels for the testing dataset on the 70-bus network:
• (−30%, +30%): 30% decrease in generation, 30% increase in load.
• (+25%, +100%): 25% increase in generation, 100% increase in load, to simulate a system near overload.
• (−75%, +60%): 75% decrease in generation, 60% increase in load, to simulate larger voltage deviations.
Note that the DSS$^2$ model was never trained on such cases; only default power levels were used for training. Fig. 11 shows the results. In the case of a 'small' load change (−30%, +30%), the DSS$^2$ showed good estimation performance with only a small increase in RMSE. However, in the cases (+25%, +100%) and (−75%, +60%), the RMSE increased significantly. The lines were highly loaded in the case (+25%, +100%); hence, the loading estimation was highly impacted. In the case (−75%, +60%), the deviation in voltage was more harmful to the voltage estimation. These results explored the limits of the changes in loading levels that the DSS$^2$ model could handle. Good results were obtained for changes in loads of around 30%, showing that the DSS$^2$ model generalizes well to state estimation tasks under limited changes in operating conditions. However, the model became sensitive when the network was extremely loaded or under strong voltage deviations, and it no longer generalized well under such extreme conditions.
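The three test scenarios can be expressed as simple scaling factors applied to the generation and load profiles. The sketch below is only an illustration of that idea; the scenario keys, the function name, and the profile values are hypothetical.

```python
# Relative changes (generation, load) for the three test scenarios.
SCENARIOS = {
    "(-30%, +30%)": (-0.30, +0.30),
    "(+25%, +100%)": (+0.25, +1.00),
    "(-75%, +60%)": (-0.75, +0.60),
}


def scale_profiles(generation, load, scenario):
    """Scale generation and load power values by the scenario's changes."""
    d_gen, d_load = SCENARIOS[scenario]
    scaled_gen = [p * (1.0 + d_gen) for p in generation]
    scaled_load = [p * (1.0 + d_load) for p in load]
    return scaled_gen, scaled_load


# Hypothetical default power levels (kW).
gen_kw, load_kw = [10.0, 20.0], [5.0]
gen_small, load_small = scale_profiles(gen_kw, load_kw, "(-30%, +30%)")
gen_over, load_over = scale_profiles(gen_kw, load_kw, "(+25%, +100%)")
```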

V. DISCUSSION AND CONCLUSIONS
This paper introduces the Deep Statistical Solver for Distribution System State Estimation. This Deep Learning architecture incorporates the power flow equations in the loss function for physics awareness. Our proposed DSS$^2$ approach uses the same objective function as the WLS, which allows training the model with a noisy and poorly labelled dataset. This approach is called weakly supervised learning, and we combine it with a GNN architecture to enhance the learning from local patterns and the robustness of the model.
A remarkable advantage of the DSS$^2$ is that, in our case studies, the larger the power network, the better the performance. The DSS$^2$ is based on a GNN architecture that learns from local patterns (in the neighbourhood of buses). Hence, the larger the network, the more local patterns the GNN-based architecture can learn from. We consider this remarkable because conventional power system analysis, for example state estimation, often scales poorly with network size, whereas DSS$^2$ showed the reverse effect. Another outstanding advantage is that, through learning in the neighbourhood of buses, the DSS$^2$ model becomes robust and invariant to changes in individual values, such as missing or erroneous measurements. This is an important practical advantage over conventional methods (and the studied supervised models) that depend on the accuracy of individual measurements.
Our different case studies show that the DSS$^2$ is faster, more robust, and more scalable than the WLS, as DSS$^2$ does not involve iterative algorithms and learns from local patterns and noisy measurements. Compared to supervised models, the weakly supervised DSS$^2$ shows equivalent speed and voltage accuracy while outperforming the supervised models in estimating indirect values such as line loading. We conclude that learning from the power flow equations and from the neighbourhood are the strengths of DSS$^2$; these incorporate a coupling between voltage magnitudes and voltage angles to fit the measurements. Finally, the DSS$^2$ model does not require labels, as the approach is weakly supervised and learns from the power flow equations. This type of learning makes the DSS$^2$ model more practical than other ML-based approaches, as labels are scarce.
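The idea of training against the WLS objective without true labels can be illustrated with a small sketch: the loss compares noisy measurements with the values that the estimated state implies through a measurement function, weighted by the measurement standard deviations. The measurement function below is a deliberately simplified toy (two bus voltages and a resistive flow between them), not the AC power flow equations used in the paper, and all names and numbers are hypothetical.

```python
def wls_loss(z, h_of_x, sigma):
    """Weighted least squares objective: sum of ((z_i - h_i(x)) / sigma_i)^2."""
    return sum(((zi - hi) / si) ** 2 for zi, hi, si in zip(z, h_of_x, sigma))


def toy_measurement_function(x, resistance=0.1):
    """Toy mapping from state (v1, v2) to measurements (v1, v2, flow).

    Assumption: a linear resistive flow stands in for the power flow equations.
    """
    v1, v2 = x
    return [v1, v2, (v1 - v2) / resistance]


# The estimated state would come from the model; here it is fixed by hand.
x_hat = [1.00, 0.98]
z = [1.01, 0.97, 0.21]        # noisy measurements, no true labels involved
sigma = [0.01, 0.01, 0.02]    # measurement standard deviations

loss = wls_loss(z, toy_measurement_function(x_hat), sigma)
```

Because the loss only needs measurements and their standard deviations, minimizing it over the model parameters requires no ground-truth state, which is what makes the training weakly supervised.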
Our implementation of the DSS$^2$ has limitations. Firstly, the penalization method used in training impacts the quality of estimation but does not guarantee convergence during testing. Feasible solutions cannot be guaranteed with a data-driven method that focuses on individual accuracy. Secondly, in our implementation of the H2MG architecture, the assumption of modelling transformers as lines may have particularly limited the accuracy of the transformer loading estimation. There, the model was 'forced' to learn a similar input-output mapping for lines and transformers, which may reduce the expressivity of the model. Thirdly, the estimation of the DSS$^2$ is impacted when the load power level in the network varies significantly. The generalization ability of the DSS$^2$ showed a limit at around 30% load changes. The results under changed power levels are encouraging but leave room for improvement.
Future work could investigate the types of measurements and meter placement decisions that would maximize the performance of the DSS$^2$. Adding an algorithm that detects changes in the data could help quantify the confidence of state estimations by DSS$^2$. Combining the DSS$^2$ for state estimation with a state-of-the-art anomaly detector could improve generalization. Also, an extension to unbalanced systems is deemed possible by adopting unbalanced system models and power flow equations, and it should be investigated in the future. Furthermore, the network model in the deep learning architecture could be improved. The proposed model is simple; however, the H2MGNN architecture allows for advanced modelling of components that can further increase the expressivity and performance of the DSS$^2$. Finally, future work should explore robustness to model inaccuracies and implementation for distribution grids that undergo topology changes. This can be achieved by leveraging the robustness of GNNs to graph variation. Such an implementation would further improve the practicality, robustness, and accuracy of the model.

Fig. 2: Modelling the network (a) with two generators, three loads, two lines, and two transformers as (b) a standard graph and (c) an H2MG. The standard graph (b) has vertices and edges with their features represented as boxes. The H2MG (c) models the components (bus, line, and transformer) as hyperedges connected to any number of connection ports with their features.

Fig. 6: Estimation of the voltage level at (a) bus 34 and (b) bus 223 of the 70-bus network under normal conditions and across the sampling period, using WLS and DSS$^2$. True voltage and measurement are shown as references.

Fig. 8: Estimation of the voltage level at bus 34 of the 70-bus network under high noise level and across the sampling period, using WLS and DSS$^2$. True voltage and measurement are shown as references.

Fig. 10: Estimation of the voltage level at bus 34 of the 70-bus network under missing measurement conditions and across the sampling period, using WLS and DSS$^2$. True voltage and measurement are shown as references.
added to the loss function to enforce this criterion.
• Phase angle stability criteria: large variations in phase angles are improbable in stable systems. For example, a phase angle difference of more than $\Delta\varphi_{UB} = 0.25$ rad between two neighbouring buses would characterize an unstable network. Therefore, we add a second two-sided penalization $[\Delta\varphi - \Delta\varphi_{UB}]^+ + [-\Delta\varphi_{UB} - \Delta\varphi]^+$ to the loss function to constrain this phase angle difference to $\Delta\varphi_{UB} = 0.25$.
• Line loading stability criteria: power systems regulators ensure the network's security by applying safety margins to line loading. To keep the model output within a physical range, we apply a third penalization $[l - l_{UB}]^+$
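The penalization terms in the list above can be sketched as follows, reading $[a]^+$ as $\max(a, 0)$. The bound of 0.25 rad is taken from the text; the line loading bound is not given in this fragment, so it is left as a required argument, and the function names are hypothetical.

```python
def positive_part(a):
    """[a]^+ = max(a, 0): zero inside the allowed range, linear outside."""
    return max(a, 0.0)


DELTA_PHI_UB = 0.25  # rad, upper bound on the phase angle difference (from the text)


def phase_angle_penalty(delta_phi, ub=DELTA_PHI_UB):
    """Two-sided penalty [dphi - ub]^+ + [-ub - dphi]^+."""
    return positive_part(delta_phi - ub) + positive_part(-ub - delta_phi)


def line_loading_penalty(loading, ub):
    """One-sided penalty [l - l_UB]^+; ub is a hypothetical bound chosen by the user."""
    return positive_part(loading - ub)


inside = phase_angle_penalty(0.10)    # within bounds, no penalty
above = phase_angle_penalty(0.40)     # exceeds +0.25 rad
below = phase_angle_penalty(-0.30)    # below -0.25 rad
overload = line_loading_penalty(1.2, ub=1.0)
```

Both terms are zero whenever the model output stays inside the physical range, so they only influence training when a bound is violated.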

TABLE I: Features and topology parameters of the H2MGNN (modelled as an H2MG).
$V_i$ - $\sigma_{V_i}$; Voltage angle: $\varphi_i$ - $\sigma_{\varphi_i}$; Active power inj.: $P_i$ - $\sigma_{P_i}$; Reactive power inj.: Q

TABLE II: Hyperparameter values of DSS$^2$ trained for three power networks.
The efficiency of the learning approach is shown in Figures 4a and 4b when training on the 14-bus network.

TABLE III: Mean (standard deviation in parentheses) values of performance metrics in default conditions. WLS* is the WLS algorithm without flow measurements, and WLS** is the WLS algorithm with increased tolerance. The bold font shows the best model or algorithm for each metric.