Open Data-Driven Automation of Residential Distribution Grid Modeling with Minimal Data Requirements

Abstract: In the present paper, we introduce a new method for the automated generation of residential distribution grid models based on novel building load estimation methods and a two-stage optimization for the generation of the 20 kV and 400 V grid topologies. Using the introduced load estimation methods, various open or proprietary data sources can be utilized to estimate the load of residential buildings. These data sources include available building footprints from OpenStreetMap, 3D building data from OSM Buildings, and the number of electricity meters per address provided by the respective distribution system operator (DSO). For the evaluation of the introduced methods, we compare the resulting grid models by utilizing different available data sources for a specific suburban residential area and the real grid topology provided by the DSO. This evaluation yields two key findings: First, the automated 20 kV network generation methodology works well when compared to the real network. Second, the utilization of public 3D building data for load estimation significantly increases the resulting model accuracy compared to 2D data and enables results similar to models based on DSO-supplied meter data. This substantially reduces the dependence on such normally proprietary data.


I. INTRODUCTION
The global path to carbon-neutral energy generation leads to more renewable energy sources, such as wind parks and photovoltaic systems, in the electricity grid [1], [2]. These renewable energy sources are often distributed throughout the grid, in contrast to the traditionally centralized energy generation by fossil fuels or nuclear power. As such, these decentralized energy sources are often located in the lower grid levels, which leads to an increased generation in the distribution grid level, especially through rooftop photovoltaics. This requires detailed simulations of all grid levels, from high-voltage (HV) to low-voltage (LV). Simulations of the medium-voltage (MV) and LV grids are particularly important since those grids were not originally designed to incorporate large quantities of generation. Furthermore, these simulations are crucial to develop local and district-based solutions for challenges caused by the transition to carbon-neutral energy generation.
This work was conducted within the framework of the Helmholtz Program Energy System Design (ESD). The authors gratefully acknowledge funding by the German Federal Ministry of Education and Research (BMBF) within the Kopernikus Project ENSURE 'New ENergy grid StructURes for the German Energiewende'.
However, the required grid data and models are often not readily available due to data privacy concerns, which necessitates labor-intensive modeling processes before the actual simulations can be performed. Thus, the goal of the present work is to further develop methods for the automated creation of grid models, building on our previous work [3], using readily available open data sources. While the goal is not the exact recreation of existing grids, the generated models aim to be realistic enough that they could describe the real grid. These grid models are intended to support various use cases, e.g., machine learning applications and realistic simulations on a wide range of simulation software, such as load flow calculations, quasi-dynamic simulations, root mean square (RMS) and electromagnetic transient (EMT) simulations.
The main contribution of the present paper is a new two-stage optimization method for the automated generation of LV grids relying solely on openly available data sources, such as OpenStreetMap (OSM) [4] and OSM Buildings [5], that further enables the automated model generation of the 20 kV medium-voltage grid. This significantly reduces the data requirements compared to previous approaches [3], [6]-[15]. Furthermore, we present a comprehensive comparison of various data sources, both open and proprietary, for the automated generation of distribution grid models.
The remainder of this paper is structured as follows: In Section II, we provide an overview of current related work. In Section III, we describe the methodology behind our proposed model generation method, including the building load estimation with various data sources, and our optimization-based transformer placement method. We then evaluate the proposed method and compare the results using the different data sources in Section IV before discussing the results in Section V. Finally, we conclude with our findings and give a brief outlook on our future work in Section VI.

II. RELATED WORK
Since the demand for distribution grid models is apparent for many use cases centering around grid simulations and studies, the data-driven generation of such models is a well-researched topic.
The process for creating so-called Reference Network Models (RNMs) described in [6] marks a fundamental work in the field of automated power grid modeling. This approach utilizes data on the location and demand of customers, locations and capacity of distributed generation (DG) and transmission substations, and economic and technical parameters to generate European-style power grid models from the HV level down to individual LV customers. The grid models are created by a heuristic branch-exchange method to minimize the cost of the grid using a pre-defined catalog of standard equipment. Building on this, [7] creates and publishes several models of selected test areas in the MATPOWER [16] format and compares key indicators of these models with real-world data provided by European DSOs. A further development of this process is described in [8], which focuses on the development of an online tool for the automated model generation from OSM data, area parameters (e.g., consumer density, power factor, and MV/LV transformer locations), and DSO indicators.
Further enhancements to [6] are described in [9]-[11], with the adaptation to U.S.-style distribution grids and an appropriate validation approach for this grid type. The approach introduced in [9] is capable of creating complex U.S.-style power grids with their typical single-phase connections and voltage regulators on the LV level. While the previously mentioned methods expect the location and demand of customers as a direct input, [9] describes a method for estimating this demand based on land use data (usually from commercial vendors) and a library of reference buildings. Considering the complex nature of U.S.-style distribution grids, this approach creates detailed models in the OpenDSS [17] and CYME formats.
Other approaches, such as [12], focus on the German power grid. This approach utilizes OSM data and the known total number of LV networks in Germany to generate a total of 500 000 LV distribution grid topologies. The generated topologies are validated on a statistical basis with real grid data, such as the number of nodes and edges per LV grid and the total line length. Other approaches in the same research context focus on the generation of transmission grid [13] and MV grid [14] models instead of the lower voltage levels. The method described in [15] is focused on regional distribution grid structures as found in Germany and utilizes nine different data sources to classify buildings for load estimation and for generating the MV and LV grid topologies. The method generates MATPOWER [16] models and is validated on a high-level basis against real networks.

Despite all of these approaches, the research in this domain is still open in a few key areas: Most of the methods available in the literature utilize some kind of proprietary data or very specific knowledge that prohibits a wider applicability. Furthermore, to the best of our knowledge, there has not been a comparison of models generated using different available data sources. Lastly, many available methods create models in very basic formats, such as MATPOWER. In the present work, we address those three areas by presenting a method that requires minimal data input and generates versatile PowerFactory models. Furthermore, we compare the generated models under different input data circumstances.

III. METHODOLOGY
In this section, we first describe the building load estimation for low-voltage grids before describing the two optimization stages of the grid model generation, as shown in Figure 1. The load estimation uses one of several data sources to estimate the load per building, which is a required input of the two-stage grid layout optimization. In the first stage of this optimization, the 20 kV grid topology with its 20/0.4 kV substations is generated using a k-means clustering approach for the substation placement and a travelling salesman problem (TSP) optimization for the line routing between the stations. In the second stage, the underlying 400 V grid is generated by solving a variation of the minimum cost flow linear optimization problem. Since the methodology described in this work focuses on areas dominated by underground cables instead of overhead lines, whose routes might be available in map data, we make the common assumption that cables are laid out along roads and paths.

A. Building Load Estimation Based on Variable Data Sources
Since most existing distribution grids evolved over many years and traditionally only loads were considered in the planning of low-voltage distribution grids, we neglect photovoltaics (PV) generation in our methodology for estimating the required sizes of equipment, such as transformers and cables. The impact of renewables in the low-voltage grid is rather part of studies that can be carried out with the generated grid models, see, e.g., [18]. Thus, the automatically generated network topologies mainly depend on the assumed loads for the identified buildings in the modeled area, since the network topology is generated with an optimization algorithm based on load data.
For the building load estimation, we consider three different open and proprietary data sources. These data sources are OSM data ($O_{2D}$), 3D OSM Buildings ($O_{3D}$), and finally the number and location of electricity meters provided by a local distribution system operator ($EM$). The load estimation methods described in the following are, however, used for residential buildings only. Non-residential buildings are identified via tags included in OSM data and require a special treatment.
In order to estimate the load of a residential building $i$, we utilize the household standard load profile H0 [19], which is multiplied by an estimated yearly energy consumption $E_i$ (1). The estimation of the yearly consumption of a building starts with estimating the energy consumption $E_U$ of a single residential unit (2), where $nR$ is the average number of residents in a household according to [20], $A$ is the floor area of the residential unit, and $nLA$ is the statistical number of large electrical appliances per household for the selected test area [21]. Specific values used for the application of this method can be found in Table I.
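The scaling of a standard load profile by a yearly consumption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `scale_profile`, the quarter-hourly step, and the normalisation (the scaled series integrates to the yearly energy) are assumptions made for the example.

```python
def scale_profile(h0_profile_kw, yearly_energy_kwh, step_hours=0.25):
    """Scale a load profile so that its energy equals yearly_energy_kwh.

    h0_profile_kw: power values of the unscaled profile (assumed in kW)
    step_hours:    duration of one time step (assumed 15 minutes)
    """
    # Energy contained in the unscaled profile
    profile_energy_kwh = sum(h0_profile_kw) * step_hours
    if profile_energy_kwh <= 0:
        raise ValueError("profile must contain positive energy")
    # One scalar factor per building preserves the profile shape
    factor = yearly_energy_kwh / profile_energy_kwh
    return [p * factor for p in h0_profile_kw]


# Toy profile with four time steps instead of a full year
building_load = scale_profile([1.0, 2.0, 1.0, 0.0], yearly_energy_kwh=2.0)
```

Because only one scalar factor is applied per building, all buildings share the same profile shape and differ only in magnitude.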
The number of residential units in a building $i$, if only the floor area is considered, is estimated as $nU_i^{2D}$ according to (3), where $A_U$ is the median floor area of a residential unit. This median value $A_U$ is determined over the floor areas $A_i$ of all buildings in the geographically defined target region:

$$A_U = \operatorname{median}_i \left( A_i \right) \tag{4}$$

To model the energy consumption of individual buildings $E_i$, which is based on the consumption per unit $E_U$, the base areas of the buildings are decisive and may vary greatly depending on the data basis.

1) OSM data to determine the base area of buildings: Using the building floor area, the energy consumption per building $E_i^{2D}$ is obtained by multiplying $E_U$ with the number of residential units $nU_i^{2D}$, see (3), and a scaling factor $S_U$:

$$E_i^{2D} = E_U \cdot nU_i^{2D} \cdot S_U \tag{5}$$

The scaling factor $S_U$ can be approximated by the average number of stories per building and can be adjusted downward to accommodate a larger proportion of single-family homes.
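The 2D estimate can be illustrated with a short sketch. The rounding rule for the number of units (ratio of building area to median area, rounded, with at least one unit) is an assumption, since the exact form of (3) is not reproduced here; the per-unit consumption `e_unit_kwh` is treated as a given input, and all function names are hypothetical.

```python
from statistics import median


def estimate_units_2d(floor_areas_m2):
    """Estimate residential units per building from floor areas only."""
    a_u = median(floor_areas_m2)  # median floor area over the target region
    # Assumed rounding rule: nearest integer, at least one unit per building
    return [max(1, round(a / a_u)) for a in floor_areas_m2]


def energy_2d(floor_areas_m2, e_unit_kwh, s_u=2.0):
    """Yearly energy per building: E_U * number of units * scaling factor S_U."""
    units = estimate_units_2d(floor_areas_m2)
    return [e_unit_kwh * n * s_u for n in units]


areas = [100.0, 120.0, 480.0]  # hypothetical building footprints in m^2
units = estimate_units_2d(areas)
energies = energy_2d(areas, e_unit_kwh=3000.0, s_u=2.0)
```

Since the same factor `s_u` is applied to every building, a large apartment tower and a bungalow with equal footprints receive the same estimate, which is the weakness the height-based variant addresses.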
2) OSM data and height information from OSM Buildings: In this estimation, the building height $H_i$ is used to approximate the number of floors of a building, based on data provided by OSM Buildings [5]. According to [22], the floor height is between 2.5 m and 3 m. Within this range, an average floor height $h_f$ is chosen for the further calculation. The corresponding energy consumption $E_i^{3D}$ is calculated as

$$E_i^{3D} = E_U \cdot nU_i^{2D} \cdot \frac{H_i}{h_f} \tag{6}$$

where the additional factor $H_i / h_f$ replaces the previously introduced scaling factor and is individual for each building $i$.
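A small sketch of the height-based variant, under the assumption that the floor height is set to 2.75 m (the midpoint of the stated range; the value actually used is not given here). The function name and the example numbers are hypothetical.

```python
def energy_3d(n_units_2d, height_m, e_unit_kwh, floor_height_m=2.75):
    """Yearly energy of one building using its height instead of a global factor.

    The per-building factor height_m / floor_height_m approximates the
    number of floors and takes the place of the global scaling factor S_U.
    """
    floors = height_m / floor_height_m
    return e_unit_kwh * n_units_2d * floors


# Hypothetical building: 2 units per floor, 8.25 m high, 3000 kWh per unit
e_building = energy_3d(n_units_2d=2, height_m=8.25, e_unit_kwh=3000.0)
```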
3) Electricity meter data supplied by public utilities: The number of electricity meters (EMs) per building, $nU_i^{EM}$, provided by the public utilities (DSO), is used for the direct calculation of the year-round load data. In this data, buildings with multiple households have an additional electricity meter for general electricity. This additional meter is counted with a fictitious electricity consumption of 0.1 households. The load for a building is estimated according to (7), where $A_{U,i}$ is the average floor area of a residential unit in building $i$ with floor area $A_i$; $A_{U,i}$ is calculated according to (8).

The generated load time series, as described in (1), can then be utilized to perform quasi-dynamic simulations to evaluate the models under a wide range of conditions.

B. Stage 1: Automated Generation of the 20kV Grid
In this section, we introduce the first stage of the optimization-based grid generation that handles the creation of a 20 kV distribution grid. This stage is based on map data and the load estimation introduced in the previous section, and consists of two consecutive steps: first, the estimation of the number and locations of the 20/0.4 kV substations and, second, the grid topology generation connecting these substations.
1) Calculation of the Number and Locations of Substations: In order to estimate the adequate number of transformers for the grid, the results of the load estimation are used. This estimation is based on the peak load share per building $P_{peak,i}$, which is calculated as

$$P_{peak,i} = nU_i \cdot 2\,\text{kW} \tag{9}$$

where $nU_i$ is the estimated number of residential units of building $i$, which depends on the utilized data source. In accordance with [23], each household accounts for a peak load share of 2 kW. Assuming a power factor $\lambda$ for the household loads and a loading of at most $L_T$ for each of the 20/0.4 kV substations with a rating of $R_T$, the needed number of transformers $nT$ is determined by applying the ceiling function to the quotient of the total power $P$ consumed by the $N$ buildings and the adjusted transformer rating:

$$nT = \left\lceil \frac{\sum_{i=1}^{N} P_{peak,i}}{\lambda \cdot L_T \cdot R_T} \right\rceil \tag{10}$$

In order to then compute the locations of the stations, a k-means clustering algorithm is used [24]. The clustering operates on the geographical coordinates of the buildings, weighted by the case-specific number of households per building $nU_i$, and the center of each of the $nT$ returned clusters is used as a transformer location. Once this is done, a graph representation of the street layout is obtained that includes the closest HV/MV substation and the area where the grid is to be generated. Then the calculated station positions are added as nodes to this graph structure in the same way that each building is appended.
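The two steps of this subsection, counting the transformers and placing them with a weighted k-means, can be sketched in plain Python. This is an illustrative sketch only: the loading limit of 0.8 and the power factor of 0.95 are assumed example values, the 630 kVA rating is the transformer type reported for the study area, and the simple Lloyd iteration stands in for whatever k-means implementation is used in practice.

```python
import math
import random


def transformer_count(units_per_building, rating_kva=630.0, max_loading=0.8,
                      power_factor=0.95, peak_per_unit_kw=2.0):
    """Ceiling of total peak load over the usable transformer rating."""
    total_peak_kw = peak_per_unit_kw * sum(units_per_building)
    usable_kw = power_factor * max_loading * rating_kva  # adjusted rating
    return math.ceil(total_peak_kw / usable_kw)


def weighted_kmeans(points, weights, k, iterations=50, seed=0):
    """Lloyd's algorithm where each point counts with its weight (households)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the buildings
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance
        labels = [min(range(k),
                      key=lambda c: (p[0] - centers[c][0]) ** 2
                      + (p[1] - centers[c][1]) ** 2)
                  for p in points]
        # Update step: weighted mean of the members of each cluster
        new_centers = []
        for c in range(k):
            members = [(p, w) for p, w, l in zip(points, weights, labels) if l == c]
            total_w = sum(w for _, w in members)
            if total_w == 0:  # keep the center of an empty cluster
                new_centers.append(centers[c])
                continue
            new_centers.append((sum(p[0] * w for p, w in members) / total_w,
                                sum(p[1] * w for p, w in members) / total_w))
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


n_t = transformer_count([100, 150, 250])  # 500 units -> 1000 kW peak
buildings = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
households = [1, 3, 1, 1]
centers, labels = weighted_kmeans(buildings, households, k=2)
```

In the toy example, the left cluster center is pulled towards the building with three households, which mirrors how multi-story buildings attract the transformer location.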
2) Optimization-based 20 kV Grid Topology Generation: An optimization approach is then used to determine the associated network topology for the identified 20 kV stations. The problem is modeled as a travelling salesman problem [26], with the starting and ending point being the HV/MV substation. The places to visit are the 20 kV stations, and the distances are the shortest paths on the corresponding graph, which is weighted with the actual, geographical length of each road. This problem is approximated using the Christofides algorithm [27] and can be formalized as follows:

$$\min \sum_{(i,j) \in \mathit{Paths}} l_{i,j} \, x_{i,j} \tag{11}$$

$$x_{i,j} = x_{j,i} \quad \forall (i,j) \in \mathit{Paths} \tag{12}$$

$$\sum_{j : (i,j) \in \mathit{Paths}} x_{i,j} = 2 \quad \forall i \in \mathit{Stations} \tag{13}$$

The MV/HV substation together with the 20 kV stations forms the set of nodes, $\mathit{Stations}$, while $\mathit{Paths}$ is the set of the shortest paths $(i,j)$ between each of the nodes in $\mathit{Stations}$. Furthermore, $l_{i,j}$ is the length of the shortest path between node $i$ and node $j$. The decision variable $x_{i,j}$ equals 1 if the path between node $i$ and node $j$ is part of the solution, and 0 otherwise. (12) ensures symmetry, while (13) makes sure that each station is connected to two other ones. The last constraint (14), a subtour elimination constraint, ensures that the solution is indeed a single 20 kV ring and not just a union of other, smaller rings.
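The ring construction can be illustrated on a toy road graph. Note that this sketch deliberately swaps the Christofides approximation for an exhaustive search over all visiting orders, which is only practical for a handful of stations; the graph, node names, and function names are hypothetical.

```python
import heapq
from itertools import permutations


def shortest_lengths(graph, source):
    """Dijkstra: shortest road distance from source to every reachable node."""
    dist = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, length in graph[node].items():
            nd = d + length
            if nd < dist.get(neighbour, float("inf")):
                dist[neighbour] = nd
                heapq.heappush(queue, (nd, neighbour))
    return dist


def shortest_ring(graph, substation, stations):
    """Shortest closed tour from the substation through all stations.

    Exhaustive enumeration (not Christofides); distances between stations
    are shortest paths on the road graph.
    """
    nodes = [substation] + list(stations)
    lengths = {n: shortest_lengths(graph, n) for n in nodes}
    best_route, best_length = None, float("inf")
    for order in permutations(stations):
        route = [substation, *order, substation]
        total = sum(lengths[a][b] for a, b in zip(route, route[1:]))
        if total < best_length:
            best_route, best_length = route, total
    return best_route, best_length


# Toy road graph: a square with unit edge lengths (in km)
roads = {
    "S": {"A": 1.0, "C": 1.0},
    "A": {"S": 1.0, "B": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"B": 1.0, "S": 1.0},
}
ring, ring_length = shortest_ring(roads, "S", ["A", "B", "C"])
```

For realistic station counts, an approximation such as the Christofides algorithm named in the text replaces the enumeration, while the shortest-path distance matrix stays the same.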

C. Stage 2: Automated Generation of the Low-Voltage Grid
In our previous approach [3], residential LV grid models are derived from the street layouts available in OSM as possible cable routes. In the present paper, a Python application converts these layouts to a graph representation and adds the buildings of the residential area as found in OSM. For each of these buildings, load data is computed using one of the datasets described in Section III-A. In order to obtain the grid topology, a variation of the minimum cost flow optimization problem modifies the graph to comply with electrical low-voltage grid topology standards. This optimization is carried out with specified 20 kV substation locations. The optimization problem is formulated as a mixed integer linear program (MILP) with binary decision variables and is stated in (15)-(18).

Generally speaking, the program decides which of the edges of the graph will be used as cables for power delivery by choosing the cheapest way to supply all demand. Therefore, the objective function (15) minimizes the cost $cost_{i,j}$ to install a cable from node $i$ to node $j$, which is proportional to the potential cable route length. A binary decision variable $install_{i,j}$ is introduced in order to only account for potential routes that are actually used or, in other words, where a cable is installed and the route is used to supply buildings with power: It is 1 if the cable is installed and 0 in all other cases. The first flow conservation constraints (16) ensure that, for all nodes $i$ that are not a source, the power flowing in, $flow_{j,i}$, equals the consumption $residual_i$ of the node itself or is flowing out again. For sources that represent the secondary substations, the power flowing out equals the externally provided power, which is also denoted $residual_i$. A positive $residual_i$ indicates an external source, while a negative one indicates consumption. For this consumption, possible PV injection is not considered in the optimization for the aforementioned historical reasons; only the load data computed in the previous steps is used. The second constraints (17) ensure radiality in the obtained grid, as LV distribution grids are usually operated in a radial mode. Hence, only one cable can be used to supply a node with power. Lastly, the constraints (18) make sure that a certain loading limit $cap^{max}_{i,j}$ for the cables is not exceeded and that energy can only flow over a cable that is also installed.

The algorithm repeats this optimization while lowering the maximum cable loading limit permitted in the optimization for a cable that is installed alongside a public way, in order to obtain a topology that uses all the secondary substations equally. In this context, this limit is also referred to as the cable capacity. In this improved version of the cable capacity estimation (CCE), the approach for lowering this capacity changed from the Newton Bisection method (NB-CCE) described in [3] to the proposed Inverse Proportional approximation (IP-CCE) given in (19), where $cap^{max}_i$ describes the maximum cable capacity in the $i$-th iteration of the optimization and $N$ is an adjustable parameter. Once the capacity restriction of an iteration makes the problem infeasible, the capacity of the last feasible iteration is selected and the results of that optimization are used for the rest of the workflow.
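The objective and constraints (15)-(18) are described only in prose in this section. One way to write them down that is consistent with that description is sketched below. This is a reconstruction under assumptions, not necessarily the authors' exact formulation: the edge set $E$ and node set $V$ of the street graph are names introduced here, and flows are assumed to be non-negative on directed edges.

```latex
\begin{align}
\min \quad & \sum_{(i,j) \in E} cost_{i,j} \cdot install_{i,j}
  && \text{(role of (15))} \\
\text{s.t.} \quad
 & \sum_{j:(j,i) \in E} flow_{j,i} - \sum_{j:(i,j) \in E} flow_{i,j} = -\,residual_i
  && \forall i \in V \quad \text{(role of (16))} \\
 & \sum_{j:(j,i) \in E} install_{j,i} \le 1
  && \forall i \in V \quad \text{(role of (17))} \\
 & 0 \le flow_{i,j} \le cap^{max}_{i,j} \cdot install_{i,j}
  && \forall (i,j) \in E \quad \text{(role of (18))} \\
 & install_{i,j} \in \{0, 1\}
  && \forall (i,j) \in E
\end{align}
```

With this sign convention, a consuming node (negative $residual_i$) has a net inflow equal to its consumption, while a secondary substation (positive $residual_i$) has a net outflow equal to the externally provided power, matching the verbal description of the flow conservation constraints.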
This new approximation allows for fast progress at the start of the optimization method and finer steps towards the end.This ensures that the found final capacity is actually close to the theoretical optimum that is the lowest integer capacity value with which the model remains feasible and that the obtained solution is still balanced regarding secondary substation

D. PowerFactory Model Generation
In the last step, the obtained grid data structure is converted to a DIgSILENT PowerFactory model. Since system parameters such as line impedances cannot be estimated from the utilized data sources, this approach uses a library of system components that are widely used in the target area. This library can, of course, be adapted to reflect component choices of the local DSO if these are known. This also includes the automated creation of grid diagrams and an initial load flow calculation. For each individual building, a busbar system as shown in Figure 2 can be created. Although the present paper only describes the methodology to estimate the loads for these house models, this approach offers the flexibility required for the analysis of future scenarios with various PV, battery and electric vehicle settings. This Python tool chain is fully automated, relies solely on a good OSM data coverage of the target area, and the obtained grids can be readily used for power grid analysis studies. However, due to varying grid design approaches around the world, its usage is limited to regions with European-style distribution grids.

IV. EVALUATION
In this section, we describe the study area and the evaluation criteria before presenting the results. A meaningful comparison between the proposed method and the methods found in literature is infeasible due to several factors, such as a focus on different voltage levels [13], [14], different grid styles [9], and missing implementation details [8], [12], [15]. Thus, we focus on the evaluation of the impact of various available data sources and the comparison with real grid topology data provided by the DSO.
For this evaluation, we consider the three data sources introduced in Section III and generate the distribution grid model either with a priori knowledge of the 20/0.4 kV transformer locations together with the 20 kV grid topology ($T_K$) or with a calculation of the transformer locations and the 20 kV network topology ($T_C$). Altogether, this results in six combinations of $(d, t)$ tuples for data source and transformer data:

$$(d, t) \in \{O_{2D}, O_{3D}, EM\} \times \{T_K, T_C\}$$

With this wide variety of data source combinations, we aim to investigate the impact of data source quality on the quality of the resulting grid models. In addition to the six models generated by these combinations, we consider a topological model (DSO) based on GIS data provided by the DSO of the target area. This model is not generated by the method described in this paper and has two significant differences compared to the generated models: First, it includes switching devices that allow the reconfiguration of the grid topology, and second, it contains two separate cables for most of the streets, one on each side.
For the analysis of the generated models, graph metrics are applied to the automatically generated topologies, and the results of network calculations based on voltage drops and line loadings are compared. While the (DSO) model is considered as the reference model for the topological comparison of the models, we consider $(EM, T_K)$ as the reference model for the evaluation of the electrical properties. This is because the (DSO) model lacks information regarding the demand, whereas $(EM, T_K)$ is the generated model based on the most accurate available data, i.e., electricity meter data and known transformer positions. From these comparisons, statements are made regarding the quality of the automatically generated networks. In particular, we answer the question of which data are sufficient for an automated modeling of the power grid that permits reliable statements.

A. Study Area
The study area is selected as in the analysis in [3], where detailed information was collected from an on-site inspection. This allows a direct comparison of the generated grid topology with previous results. The study area spans roughly half a square kilometer and contains 241 buildings, including single houses, duplex houses, town houses as well as apartment towers. Even though the area is mostly residential, it also encompasses non-residential buildings such as a school, a community center and some shops. As the few shops in the area are in buildings that also encompass residential units, for this analysis they are treated as residential units for the sake of simplicity. The consumption of buildings with a purely non-residential use is calculated using the overall area available in the building multiplied by the average kWh per m² for the building type. For the present school and kindergarten, this value is 20 and 22 kWh per m² per year, respectively, and for the community center it is 9 kWh per m² per year [28], [29]. For the overall area, the number of levels of a building is obtained either from the level tag within the OSM dataset or assumed to be two, thus $S_U = 2$ in this case.

Fig. 3: Distribution grid modeling with a priori known number and location of substations using different data sources for the load modeling comprising the combinations $(X, T_K)$. (c) Case $(EM, T_K)$: Grid model using the DSO data on the number of electricity meters at each building for a given number and location of transformers. Note that each building is modelled as a subsystem with a PV model, residential load, cable and, as a preparation for future applications, a battery. The color of the nodes and lines indicates the affiliation to a specific transformer, with a consistent coloring between the three variants.

Fig. 4: Distribution grid modeling with a calculated number and position of substations using different data sources for the load modeling comprising the combinations $(X, T_C)$. The locations of the transformers are calculated with a k-means approach for the spatial load distribution in each case. The color of the nodes and lines indicates the affiliation to a specific transformer. As there is no one-to-one relation between the transformers of the different variants, the coloring is not consistent.

B. Load Estimation Comparison
Comparing the load estimations based on available 2D and 3D OSM data to the reference estimation based on DSO-supplied electricity meter (EM) data results in the deviations shown in Table II. The table shows that the 2D-based estimations result in a significantly lower load overall, while still containing buildings with significantly higher load estimations. Thus, a simple adjustment of the scaling factor $S_U$ would not be sufficient, as a higher value would also increase the positive deviations. On the other hand, the 3D-based estimations, while also containing some buildings with large deviations, match the overall load estimate very closely and perform better than a scaled 2D-based estimate would.

C. Topological Comparison
In this section, we perform a topological comparison of the models generated using the different data sources and generation methods. First, we describe our findings on the geographic representations of the models before quantifying them using two graph metrics, i.e., the number of nodes per transformer and the eccentricity of the transformers.
a) Geographic representation: The model in Figure 3a, generated by $(O_{2D}, T_K)$, shows six 400 V subgrids with similar sizes that are not very compact and in some cases spread across the whole study area. Especially notable are the two cables at the right edge of the area. Figure 3b shows the model created by $(O_{3D}, T_K)$, which includes much more compact subgrids with larger size differences. Notably, the brown and green subgrids are significantly smaller compared to the previous case. Figure 3c shows the $(EM, T_K)$ model with even smaller subgrids in the lower part of the study area. Consequently, the remaining subgrids are significantly larger. This model also includes an unrealistic connection via a footpath in the brown subgrid. Examining the models with calculated transformer positions, it is noticeable that the $(O_{2D}, T_C)$ case, as shown in Figure 4a, contains only two transformers compared to the usual six transformers in all other models. The $(O_{3D}, T_C)$ model in Figure 4b

b) Number of nodes per transformer: In this metric, a node represents a busbar in the generated PowerFactory model. For each building, our generation methods create two busbars: one for the building itself and one at the connection of the main cable and the house cable. The number of nodes per transformer is calculated by determining the supplying transformer for each 400 V busbar. Figure 6 shows the number of nodes per transformer for the six generated models. The two models using only the 2D OSM data reveal a very even distribution of nodes between the different transformers, due to the homogeneity of the estimated loads. Furthermore, it is noteworthy that the $(O_{2D}, T_C)$ case consists of only two transformers with around 250 connected nodes each, which is highly unrealistic. According to DSO data, one 630 kVA transformer, which is the installed transformer type in the study area, serves on average 79 residential units. The other cases show a more heterogeneous distribution of nodes due to the large differences in building load estimations, which are far more realistic for some multi-story buildings. While containing more nodes overall due to some implementation details, the (DSO) model is located between the $(O_{3D}, T_K)$ and $(EM, T_K)$ variants and matches $(O_{3D}, T_C)$ well.

c) Eccentricity per transformer: The eccentricity of a node is defined as the maximum distance to all other nodes in a graph. Thus, when determining the eccentricity of a transformer $T_i$, we calculate

$$ecc(T_i) = \max_{n \in N_{T_i}} d(T_i, n) \tag{20}$$

where $N_{T_i}$ is the set of 400 V nodes that are supplied by $T_i$, and $d$ is the cable distance between two nodes. Essentially, the eccentricity of a transformer describes the maximum cable length between the transformer and the buildings supplied by it. The comparison of eccentricities in the generated models is illustrated in Figure 7. This plot reveals two outliers in the $(O_{2D}, T_C)$ case with an eccentricity of 2.1 km and 1.2 km, respectively, while all other transformers have an eccentricity below 0.75 km. The (DSO) model features very similar eccentricity values to the $(O_{3D}, T_C)$ variant with a small range of values, because of the configuration criterion to keep the individual partitions compact. In general, the $EM$-based models exhibit a wider distribution of eccentricities than their $O_{3D}$-based counterparts, whereas the $T_C$-based models have overall lower eccentricities than their $T_K$-based equivalents.
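Both graph metrics can be computed with a single shortest-path search per transformer. The sketch below assumes that the 400 V subgrids are not connected to each other (radial operation), so every node reachable from a transformer is supplied by it; the toy grid and all names are hypothetical.

```python
import heapq


def cable_distances(grid, transformer):
    """Dijkstra over cable lengths, starting at the transformer busbar."""
    dist = {transformer: 0.0}
    queue = [(0.0, transformer)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, length_km in grid[node].items():
            nd = d + length_km
            if nd < dist.get(neighbour, float("inf")):
                dist[neighbour] = nd
                heapq.heappush(queue, (nd, neighbour))
    return dist


def transformer_metrics(grid, transformers):
    """Number of supplied nodes and eccentricity for every transformer."""
    metrics = {}
    for t in transformers:
        dist = cable_distances(grid, t)
        supplied = [n for n in dist if n != t]  # reachable 400 V nodes
        metrics[t] = {
            "nodes": len(supplied),
            "eccentricity_km": max((dist[n] for n in supplied), default=0.0),
        }
    return metrics


# Toy grid with two separate radial subgrids (cable lengths in km)
grid = {
    "T1": {"a": 0.1},
    "a": {"T1": 0.1, "b": 0.2, "c": 0.05},
    "b": {"a": 0.2},
    "c": {"a": 0.05},
    "T2": {"d": 0.3},
    "d": {"T2": 0.3},
}
metrics = transformer_metrics(grid, ["T1", "T2"])
```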

D. Electrical Comparison
In this section, we perform a comparison of the electrical properties of the generated distribution grid models. In a first step, we analyze the voltage profiles of all cases and verify

This also correlates with the shorter radial feeder length of 0.461 km compared to 0.726 km in the reference model. In both diagrams, the voltage profile with the steepest drop is associated with a transformer with high loads in multi-story buildings supplied with short cables. Thus, this is another indicator of the appropriate choice of methodology for approximating the location and number of transformers, based on load estimation from building data, which itself also seems promising.

Further, we analyze the statistical distribution of line loadings of the 0.4 kV cables as shown in Figure 10. First, it is noticeable that the distributions for the $(O_{2D}, X)$ cases differ markedly from the others. This is due to the weak load estimation based on 2D OSM data, which leads to overall lower line loadings. To further analyze the similarities of the line loading distributions, a similarity index is defined based on the Euclidean distance of the histogram distributions. The full table of pairwise similarity indices of the cases is given in Figure 11. The highest similarity, i.e., the smallest distance, with the reference case $(EM, T_K)$ (shown in red in Figure 11) is observable for case $(EM, T_C)$ with 15.68, where both share the same smart meter data distribution over the buildings. The second-best match with the reference is the case $(O_{3D}, T_K)$ with 57.48, where the transformer number and positions are identical and a priori known. The very close third-best option is $(O_{3D}, T_C)$ with 59.7, which does not need any additional input data for network topology generation at all and seems a promising approach for future use.

The second finding from the similarity index table concerns the quality of the approximation of the number and location of the transformer stations. For this purpose, we compare given ($T_K$) and calculated ($T_C$) transformer properties for each case. The values shown in blue in Figure 11 indicate a high accordance between the ($T_C$) and corresponding ($T_K$) models.
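A similarity index of this kind can be sketched as the Euclidean distance between two histograms of line loadings. The binning (four equal bins from 0 to 100 percent) and the use of raw counts are assumptions for the example; the bin layout and any normalisation behind the reported values are not specified here.

```python
import math


def histogram(values, bin_edges):
    """Count values per bin; the last bin includes its upper edge."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for b in range(len(counts)):
            is_last = b == len(counts) - 1
            if bin_edges[b] <= v < bin_edges[b + 1] or (is_last and v == bin_edges[-1]):
                counts[b] += 1
                break
    return counts


def similarity_index(loadings_a, loadings_b, bin_edges):
    """Euclidean distance between two loading histograms (0 means identical)."""
    hist_a = histogram(loadings_a, bin_edges)
    hist_b = histogram(loadings_b, bin_edges)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(hist_a, hist_b)))


edges = [0, 25, 50, 75, 100]              # assumed bins in percent loading
case_a = [10, 20, 30, 60]                 # hypothetical line loadings
case_b = [10, 40, 45, 90]
index_ab = similarity_index(case_a, case_b, edges)
index_aa = similarity_index(case_a, case_a, edges)
```

A smaller index means more similar loading distributions. The index ignores where in the grid a loading occurs, which is why the spatial comparison is handled separately.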
Note that this is only a statistical comparison without consideration of the real spatial distribution, which is handled in Section IV-E.
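As an illustration of a histogram-based similarity index, the following minimal sketch computes the Euclidean distance between two line-loading histograms. The bin edges, the use of raw bin counts without normalization, the function name, and the loading values are assumptions chosen for illustration; they are not taken from the evaluated models.

```python
import numpy as np

def similarity_index(loadings_a, loadings_b, bin_edges):
    """Euclidean distance between two line-loading histograms.

    Smaller values mean more similar distributions. Both histograms
    use the same bin edges so that the bins are directly comparable.
    """
    hist_a, _ = np.histogram(loadings_a, bins=bin_edges)
    hist_b, _ = np.histogram(loadings_b, bins=bin_edges)
    return float(np.linalg.norm(hist_a - hist_b))

# Hypothetical line loadings in percent (illustrative values only).
bin_edges = np.arange(0, 101, 10)          # assumed 10 %-wide bins
reference = [5, 12, 18, 22, 35, 41, 47, 63]
candidate = [4, 11, 25, 27, 33, 45, 52, 61]
weak = [1, 2, 3, 4, 5, 6, 7, 8]            # loadings that are far too low

d_candidate = similarity_index(reference, candidate, bin_edges)
d_weak = similarity_index(reference, weak, bin_edges)
print(d_candidate, d_weak)
```

In this toy example, the candidate with a similar loading distribution obtains a smaller distance to the reference than the case with uniformly low loadings, which mirrors the qualitative behavior described for the 2D-based cases.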

E. GIS-based Comparison
To evaluate the approach for the automatic 20 kV network generation, the calculated transformer positions TC are compared to the known positions TK for selected cases. This includes the comparison of cases (EM, TC) and (O3D, TC) to case (EM, TK). In particular, each calculated position is mapped to the closest known transformer. If this mapping is not unambiguous, the closest calculated position is chosen to be mapped to the known transformer location in question. That way, there is a one-to-one comparison, the results of which are given in Table III.
Given that the study area measures 582 m from its northernmost to its southernmost point and 766 m from east to west, the rather low distances indicate that the methodology works well. From the spatial vicinity of the calculated and known transformers, it becomes evident that the cluster-based computation of the transformer locations yields good results. In addition, the fact that the values change only slightly from the EM to the O3D case means that the OSM data with height information of buildings is sufficient to obtain good results for the transformer placement.
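The mapping rule described above can be sketched as follows. This is a minimal illustration assuming planar coordinates in meters and straight-line distances; the function name and the coordinates are hypothetical and do not correspond to the study area.

```python
import math

def match_transformers(calculated, known):
    """Map each known transformer to its closest calculated position.

    Returns {known_index: (calculated_index, distance)}. Every calculated
    position is first assigned to its nearest known transformer; if several
    calculated positions claim the same known transformer, only the closest
    one is kept, so the result is one-to-one.
    """
    best = {}
    for ci, c in enumerate(calculated):
        # Nearest known transformer for this calculated position.
        ki, dist = min(
            ((k, math.dist(c, p)) for k, p in enumerate(known)),
            key=lambda pair: pair[1],
        )
        # Resolve ambiguity: keep the closest calculated position.
        if ki not in best or dist < best[ki][1]:
            best[ki] = (ci, dist)
    return best

# Hypothetical positions in meters (illustrative values only).
known = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0)]
calculated = [(10.0, 0.0), (30.0, 0.0), (90.0, 10.0), (5.0, 95.0)]

matches = match_transformers(calculated, known)
for ki, (ci, dist) in sorted(matches.items()):
    print(f"known {ki} <- calculated {ci}: {dist:.1f} m")
```

Here, the first two calculated positions both lie closest to the first known transformer, and only the nearer one is retained for the comparison.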

F. Runtime Evaluation
The runtime of the introduced Inverse Proportional Cable Capacity Estimation (IP-CCE) method (19), which is part of the optimization used for the low-voltage grid topology generation (see Section III-C), is evaluated in comparison to the Newton bisection method from [3].
To obtain the runtimes, the application was timed on a Windows machine with 32 GB of RAM and an Intel® Core™ i7-10700K CPU with 8 physical cores running at 3.8 GHz. As can be seen in Figure 12, the runtime of the optimization was reduced by 1307 seconds, which lowers the total runtime of the model generation from 1618 to 311 seconds. This corresponds to an overall speedup factor of about 5.2. While most of the program's steps do not differ and therefore take roughly the same time, the difference in the optimization process can be clearly seen. This improvement is especially important because, with bigger models to be created, the optimization time is expected to grow exponentially.
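The reported figures can be related to each other with a short calculation. The total speedup and the time saved follow directly from the stated runtimes. The split into optimization and remaining steps is only an estimate, assuming that the factor of 8.6 stated in the caption of Figure 12 refers to the optimization step alone and that all other steps take the same time in both variants; the symbols are introduced here for illustration.

```latex
\begin{align}
S_\text{total} &= \frac{1618\,\text{s}}{311\,\text{s}} \approx 5.2, \\
\Delta t &= 1618\,\text{s} - 311\,\text{s} = 1307\,\text{s}, \\
t_\text{opt}^\text{IP} &\approx \frac{\Delta t}{8.6 - 1} \approx 172\,\text{s}, \qquad
t_\text{opt}^\text{NB} \approx 8.6 \cdot 172\,\text{s} \approx 1479\,\text{s}.
\end{align}
```

Under these assumptions, the steps outside the optimization account for roughly 311 s - 172 s = 139 s in both variants, which is why the overall speedup (5.2) is smaller than the speedup of the optimization itself (8.6).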

V. DISCUSSION
The presented new model generation approach enables the automated generation of distribution grid models from minimal open data sources. The evaluation of the models generated using various data sources, from open 2D and 3D data to proprietary electricity meter data and GIS data including transformer positions, demonstrates the viability of open data sources for the automated generation of realistic distribution grid models. However, the evaluation also shows that the 2D data alone, as available in OSM, does not yield realistic models with the presented approach. On the other hand, combined with the height information found in OSM Buildings, the generated models closely resemble the models utilizing proprietary electricity meter data supplied by a local DSO, which confirms the viability of the household estimation method.
While the load estimation is shown to be crucial for the subsequent estimation of the transformer number and positions and the cabling, the transformer placement itself is less important for the evaluated metrics. Especially for the loadings of the 400 V lines, the influence of the transformer placement is nearly negligible, as Figure 11 shows. In general, however, the transformers placed by the deployed optimization algorithm lead to more efficient cable layouts (see Figure 7).
Overall, the evaluation shows the viability of our new method to generate realistic distribution grid models from openly available building data. In some instances, the estimated household numbers might even be more accurate than the DSO data, since some buildings contain an unrealistically high number of meters. However, a thorough comparison to a real model under various load scenarios still needs to be performed. Furthermore, the load estimation method proposed in the present work is limited to residential buildings and needs to be expanded to nonresidential and mixed buildings.

VI. CONCLUSION AND OUTLOOK
The present paper introduces a new distribution grid model generation method that enables the automated generation of grid models relying solely on openly available data sources, i.e., OpenStreetMap (OSM) and OSM Buildings. It introduces a building load estimation method that is based on the estimation of households and utilizes open 2D and 3D building data. While the household estimation based on 2D data results in poor model quality, our evaluation shows that the 3D-based estimation performs similarly to proprietary electricity meter data supplied by the distribution system operator (DSO) of the study area. While our evaluation highlights the importance of the available data for the load estimation, it also shows the relatively small importance of the actual 20/0.4 kV transformer placement for the generation of realistic models.
In the future, we will consider seasonal PV generation and thus residual loads, together with further methods for the load estimation. These include a way of incorporating commercial and industrial areas and buildings into the model, as well as a way to automatically detect the kinds of residential areas that are present in order to handle them individually. Furthermore, our approach has to be developed further to accommodate more varied and challenging network topologies, such as inner-city blocks with buildings inside the inner courtyard, or overhead lines, which are still present around our study area and thus in Germany. Moreover, as seen in the (DSO) model, real grids contain switchgear that allows the reconfiguration of the network topology to react to certain grid events. We will work on methods to incorporate these switching capabilities into the model generation process and to find realistic configurations for the switchgear. Additionally, the comparison of various data sources and methods will help to select the most suitable open data source for automated, parallelized large-scale modeling of the power system by embedding the distribution grid models into the higher voltage levels and, finally, into co-simulation with other energy sectors.

Fig. 1: Two-stage optimization method for automated distribution grid generation with load estimation based on various alternative data sources.

Fig. 2: Detailed view of the automatically generated components in each house model. Note that for the analysis in this study, only the residential load is considered. However, this approach offers the flexibility required to study various potential future scenarios.
(a) Case (O2D, TK): Grid model using the OSM dataset for given number and location of transformers. (b) Case (O3D, TK): Grid model using the OSM Buildings dataset with included height data of buildings for given number and location of transformers.
(a) Case (O2D, TC): Grid model using the OSM dataset for a calculated number and location of transformers. (b) Case (O3D, TC): Grid model using the OSM Buildings dataset with included height data of buildings. The number and location of transformers are calculated. (c) Case (EM, TC): Grid model using the DSO data on the number of electricity meters at each building. The number and location of transformers are calculated.

Fig. 5: A simplified representation of the (DSO) model, showing the areas supplied by the six different transformers. The coloring is equivalent to Figure 3.
presents a realistic partitioning into subgrids, with two smaller subgrids in the lower part and four evenly distributed larger subgrids in the upper part of the study area. The (EM, TC) model in Figure 4c contains very unevenly distributed subgrids with a high variance in size. Compared to the six generated models, the (DSO) model offers flexibility in its topology, as it contains switchgear at 15 different locations with a variety of switching options, resulting in a very high number of interconnection variants. Since the data provided by the DSO does not include details about the switching states of the switchgear, some assumptions are required to obtain a valid configuration. The configuration that results in the partitioning shown in Figure 5 is based on the transformer positions and aims to balance the consumption in each area while keeping each area compact. Overall, the (DSO) model shows the closest similarity to the (O3D, TC) version of the generated models, which requires the least a priori knowledge of sensitive data.

Fig. 6: The number of nodes per transformer shows a very even distribution for both O2D cases and a wider distribution for the other cases.

Fig. 7: The eccentricity per transformer shows the maximum distance between a transformer and the connected 400 V nodes. This metric reveals two very large distances for the first case, while the other cases show more evenly distributed eccentricities. In general, the EM-based cases show a wider distribution than the 3D-data-based cases.

Fig. 8: The voltage profiles for case (EM, TK) show steep decreases for the brown and green feeders, which contain multifamily buildings with around 70 smart meters each.

Fig. 9: The voltage profiles for case (O3D, TC) also show a steep decline for the feeder containing the multifamily buildings (pink). In this case, however, the overall voltage band is much narrower.

Fig. 10: The histograms of line loadings at 0.4 kV show a close agreement between the O3D- and EM-based models. Furthermore, the line loadings in these models are only slightly affected by the transformer placement method.

Fig. 11: The similarity indices of the cases based on histograms of line loadings show a close agreement between models with the same load estimation and different transformer placements (smaller values indicate higher similarity). Furthermore, they confirm the similarity of the O3D and EM models.

Fig. 12: Runtime comparison in seconds of the introduced Inverse Proportional Cable Capacity Estimation (IP-CCE) to the Newton bisection method (NB-CCE) from [3]. A speedup by a factor of 8.6 is achieved for the optimization with the new method for cable capacity estimation. The 20 kV graph generation includes the steps described in Section III-B, while the 400 V graph generation comprises Section III-C. The last step includes the generation of a PowerFactory model as described in Section III-D.

TABLE I: Specific values used for the load and transformer estimation.

TABLE II: Deviation of OSM-based load estimations compared to the load estimation based on DSO-supplied electricity meter data.

TABLE III: The results of the location-based comparison show very similar distances between known and calculated transformer positions for the O3D and EM models.