On the Properties of Next Generation Wireless Backhaul

With the advent of 5G, cellular networks require a high number of base stations, possibly interconnected with wireless links, an evolution introduced in the last revision of 5G as the Integrated Access and Backhaul (IAB). Researchers are now working to optimize the complex topologies of the backhaul network, using synthetic models for the underlying visibility graph, i.e., the graph of possible connections between the base stations. The goal of this paper is to provide a novel methodology to generate visibility graphs starting from real data (and the data sets themselves together with the source code for their manipulation), in order to base the IAB design and optimization on assumptions that are as close as possible to reality. We introduce a GPU-based method to create visibility graphs from open data, we analyze the properties of the realistic visibility graphs, and we show that different geographic areas produce very different graphs. We run state-of-the-art algorithms to create wireless backhaul networks on top of visibility graphs, and we show that the results that exploit synthetic models are far from those that use our realistic graphs. Our conclusion is that the data-based approach we propose is essential to design mobile networks that work in a variety of real-world situations.


I. INTRODUCTION
THE performance requirements of the fifth generation of cellular networks (5G) foresee an extreme densification of Base Stations, named next Generation NodeB (gNB). While the target of 4G was to reach 8-10 Base Stations per km², 5G will need tens and maybe hundreds of gNBs per km² [1], and networks beyond 5G probably even more. The interconnection of gNBs providing gigabit performance to mobile terminals requires novel solutions in the access network [2], [3]. Release 16 of the 3GPP standard for 5G introduced the concept of Integrated Access and Backhaul (IAB) to support densification without skyrocketing costs [4]. The core idea of IAB is that gNB functions can be split in two parts: the majority of the gNBs are IAB-nodes that collect the traffic from user terminals and have no wired connection, while some gNBs are IAB-donors that are connected by fiber to the network core. A wireless backhaul between gNBs needs to be created to route the user traffic from IAB-nodes to the closest (or best) IAB-donor.
The same concept is at the core of 6G, which plans to go beyond the mmWave frequencies of 5G and aims at THz communications [5]. Higher frequencies provide higher capacity, but have limited communication range and almost no capacity to penetrate obstacles, thus requiring Line of Sight (LoS) between transmission end-points, which results in an even higher densification of IAB-nodes. Release 16 foresees only a directed acyclic graph (DAG) topology, possibly multi-hop, but the extension to meshed topologies to enhance performance, reliability, and dependability is already under discussion.
Given this trend, wireless mesh networks will play a key role in future cellular networks, which we collectively refer to as nextG, whose backhauls will use mmWave and THz communications with massive MIMO antennas, enabling steerable directional beams that can be activated dynamically; finally they will integrate computing inside the network with Multi-Access Edge Computing (MEC) to support smart "verticals" like cooperative driving [6], [7]. We call this kind of network Next Generation Wireless Backhaul (NGWB).
Given a city made of hundreds of thousands of possible locations for IAB-nodes (buildings, traffic lights, light posts, etc.), and given some performance parameters (delay, throughput, reliability, energy efficiency, ...), the design of NGWBs requires first the selection of the node placement, next the decision on links activation [8], [9] and where to place network functions [10], and finally how to properly route traffic.
It is clear that NGWBs open extremely interesting perspectives for the application of network science to communication networks. Yet, one of the limitations we face is the lack of data to design and test algorithms. So, as happens in many contexts, synthetic topologies based on heuristic considerations are used in design and simulation, which makes scientific results depend on some abstract representation of reality. The accuracy and credibility of analysis and simulations in a context that cannot be validated with real data is an old problem [11], and it is still a problem today, as in very popular research areas only a tiny fraction of papers provide source code and data to reproduce results [12]. This paper takes a first step in the direction of providing a robust methodology to study the multi-faceted challenges that NGWB networks pose, with a data-based approach suitable to exploit network science, and Open Data and Software to build upon.
The main focus of this paper is the analysis of several network topologies obtained with a 3D ray-tracing approach boosted by the use of Graphical Processing Units (GPUs) to speed up computations. Starting from open data describing building elevation in 9 different regions populated by about 15 M people, we obtain the visibility graph, i.e., the graph $G_v(N_v, E_v)$ where $N_v$ is the set of all locations on buildings in the area that we consider potential gNB locations and $E_v$ is the set of all the potential links that could be realized considering the Line-of-Sight (LoS) constraint. We do not consider street lamps and similar locations, as they can, at most, be used as DAG extensions to a meshed NGWB. Since LoS is a prerequisite to make wireless links at high frequencies, the visibility graphs enable the study of NGWB with a novel perspective. We try to answer some key questions like: i) is a wireless mesh backhaul network feasible in areas with different population densities? and ii) is it reasonable to assume that the visibility graphs of realistic networks in different settings have common features? We anticipate that the answer to the first question is positive, while the answer to the second is negative, which confirms the basic intuition that motivated our work: we need real data to obtain realistic results. We corroborate this intuition by comparing some key metrics computed on realistic topologies with graphs obtained from SoA synthetic models, showing extreme differences. Finally, we go beyond the observation of graph properties, and we apply one state-of-the-art algorithm for the creation of a wireless mesh backbone using both the realistic graphs and the original model used by the authors. The analysis confirms once more that there is a striking difference between the results on synthetic models and on realistic graphs.
A wireless mesh network like NGWB adds one more level below the physical connection graph: the visibility graph $G_v(N_v, E_v)$, which defines the potential communication links between nodes, as contrasted to the actual communication links that define the graph of the physical connections $G_\ell(N_\ell, E_\ell)$. The former depends on the properties of the 3D space where the network is built, and is made of all the potential nodes that are in LoS and all the edges connecting them. The latter depends on the algorithm that is used to embed a communication network in a visibility graph, i.e., to choose which nodes and which edges are activated. Needless to say, the properties of $G_v$ strongly influence the construction of $G_\ell$, which in turn strongly influences the performance of applications running on $G_\ell$.
Summing up, the contribution of this work moves along three intertwined lines: i) the introduction of a data-driven methodology for the computation of the visibility graph $G_v$, which is the fundamental constraint to build the communication graph $G_\ell$; ii) the description and analysis of $G_v$ to understand its properties and features; iii) the study of how the features of a realistic $G_v$ impact the design of $G_\ell$ with state-of-the-art algorithms.

II. BACKGROUND AND STATE OF THE ART
We divide the SoA analysis into three parts, detailed in the subsections that follow.
1) SoA on Visibility Graphs $G_v$: The study of visibility graphs is essential also outside the networking area, for instance, in landscaping [13]. Early studies on the impact of $G_v$ in communications are limited to non-urban sites, due to the low accuracy of Digital Elevation Models (DEMs), and to small areas, due to the inherent complexity of determining the visibility [14]. The availability of accurate digital maps and technologies such as Laser Imaging Detection and Ranging (LiDAR), which bring the vertical accuracy of building height estimation down to centimeters with submeter horizontal mapping, and the use of GPUs, which allow extremely fast computation of some tasks, empower visibility analysis with classical ray-tracing algorithms [15] in large urban areas [16]. In this paper we use [15] as a building block to calculate the viewshed from a single point, and we implement a new methodology that enables the computation of the visibility graph on large areas with tens of thousands of buildings. The open source code for Nvidia GPUs is available online.
2) SoA on Wireless Backhaul Creation: In the recent past, several works proposed algorithms to design a wireless backhaul $G_\ell$ for 5G networks trying to optimize different metrics: reliability [17], energy efficiency [18], and cost [19]. These studies rely on Millimeter Wave (mmWave) links or free-space-optical links [20], both requiring LoS between the endpoints. The introduction of IAB in the latest revision of 5G provided a concrete application and reinforced the interest in the creation of efficient backhaul topologies [4], [8], [9], [21]. The design of $G_\ell$ is not only interesting per se, but has direct implications on MEC and Virtual Network Function placement, which inevitably depend on the network topology [22], [23].
To the best of our knowledge, the works that focus on the creation of $G_\ell$ take a very simplistic approach to model the characteristics of $G_v$, due to the total absence of literature that characterizes the realistic properties of $G_v$. Since the publication of realistic $G_v$ topologies is one of our contributions, we stress its importance by showing that the performance of some state-of-the-art proposals for the creation of $G_\ell$ changes significantly when they are applied to a real $G_v$ rather than to one generated with simplistic assumptions. The work closest to this paper is our previous contribution [24], which uses a similar ray-tracing approach, but with the objective of comparing growth strategies for Community Networks. The methodology is also different: this work uses a GPU-based approach that enables the computation of the whole $G_v$ for entire cities, extending the analysis to an unprecedented scale. What we retain from our previous work is the name of the software tool, called TrueNets.
In a recent paper [25], we took advantage of the same open data and a similar GPU-based approach to find the optimal placement for gNBs in a LoS coverage context.
3) Path Loss Models for Urban Areas: In the context of cellular networks and flying networks, there is a strong interest in trying to estimate the probability of LoS and the path loss over a link connecting two points in space. We mention and later use for comparison two approaches taken from the literature, even if they were not derived in the context of rooftop backhauls. To the best of our knowledge, no synthetic model for rooftop backhauls (or in general mesh networks) LoS probability exists, thus we chose for comparison two models derived for scenarios that are close to our analysis. The results we present in Section V-D show that they are, as predictable, inadequate, thus calling for either data-driven network design or further efforts in LoS probability modeling.
The first one uses state-of-the-art stochastic models for LoS such as the ETSI 3GPP [26] or the WINNER II [27] models. These approaches model the probability of LoS with an exponential decay for different scenarios (Urban-Micro, Urban, Rural, Suburban). The LoS probability for the Urban-Micro scenario, the one closest to our scenario, is defined as:
$$P^{\mathrm{ETSI}}_{\mathrm{LoS}}(d) = \min\left(\frac{18}{d}, 1\right)\left(1 - e^{-d/36}\right) + e^{-d/36} \qquad (1)$$
where $d$ is the distance between the two points. The formulas for the other models differ only in the parameters' values and can be found in the WINNER II documentation. Although this model refers to the LoS probability between a base station and a terminal on the ground, it was used in the works that address the design of IABs [4], [21] that we use as a benchmark.
The second model we consider is provided by Al-Hourani [28] and is intended to estimate the LoS probability between two flying objects at heights $h_1$ and $h_2$ respectively, at a certain distance $d$. The LoS probability defined in [28] depends on $r_0$, the average radius of the buildings (approximated as cylinders), on $\lambda_0$, the average building density, and on $G(h)$, the complementary cumulative distribution function (CCDF) of the building heights. Contrary to the ETSI model, the Al-Hourani model requires the knowledge of the properties of the area where it is applied. Whenever we use it, its parameters are fitted to the data of the area we consider.
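As an illustration of the first model, the following minimal Python sketch evaluates a distance-dependent Urban-Micro LoS probability. It assumes the commonly published 3GPP form with constants 18 m and 36 m; the function name `p_los_umi` is hypothetical and is not taken from the TrueNets code.

```python
import math


def p_los_umi(d: float) -> float:
    """LoS probability at distance d (metres), Urban-Micro scenario.

    Assumed form: min(18/d, 1) * (1 - exp(-d/36)) + exp(-d/36).
    """
    if d <= 0:
        return 1.0  # co-located points are trivially in LoS
    decay = math.exp(-d / 36.0)
    return min(18.0 / d, 1.0) * (1.0 - decay) + decay


if __name__ == "__main__":
    for d in (10, 50, 100, 500):
        print(f"d = {d:4d} m -> P_LoS = {p_los_umi(d):.3f}")
```

Under these assumptions the probability equals 1 up to 18 m and then decays with distance, regardless of the heights of the two endpoints, which is one reason why such a model can be expected to fit rooftop links poorly.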

III. VISIBILITY ANALYSIS: GENERATING $G_v$
We first define some fundamental concepts of visibility analysis, which we then apply to generate $G_v$.
A Digital Elevation Model (DEM) is a matrix of geolocated altitudes with a certain spatial density, which is the resolution of the digital map considered (in our case, 1 point per square meter). DEMs can be obtained using different technologies, such as Radar, Photogrammetry, LiDAR, etc. However, the only datasets that are both widely available and sufficiently precise are the ones obtained through LiDAR technologies by Public Administrations and released as Open Data. In our research we take advantage of the datasets published by the Italian Ministry of Environment.
Given the origin of the x/y plane set in (1,1), the DEM is a matrix of real numbers $E \in \mathbb{R}^{m_x \times m_y}$, where $m_x$ and $m_y$ are the number of samples in the x and y dimensions respectively, and $E_{x,y}$ is the elevation measured in $(x, y) \in \mathbb{N}^2$, which are the integer indexes of the points in the map. Consider an ordering $o(x, y)$ on the matrix indexes and let $i = o(x, y)$ and $j = o(x', y')$. Any ordering is valid, but we can consider $o(x, y) = x m_x + y$ if $m_x \geq m_y$, or $o(x, y) = y m_y + x$ otherwise, to fix ideas. We call $p_i$ and $p_j$ the points in space identified by the triplets $(x, y, z)$, where $i = o(x, y)$ and $z = E_{x,y}$, and $(x', y', z')$, where $j = o(x', y')$ and $z' = E_{x',y'}$, respectively.
The visibility analysis is the process of determining, for every point $p_i$ (called the observer point), all the other points $p_j$ that have visibility with $p_i$. Let $\mathcal{L}(E, p_i, p_j)$ be a function that returns 1 if there is direct LoS between $p_i$ and $p_j$ and 0 otherwise. The visibility analysis is defined formally as the process of obtaining the viewshed from $p_i$, i.e., a matrix $V^i$ of size $(m_x \times m_y)$ defined as:
$$V^i_{x',y'} = \mathcal{L}(E, p_i, p_j), \quad j = o(x', y').$$
Evaluating $\mathcal{L}(E, p_i, p_j)$ is a computationally intensive task, but recent advances in GPU-optimized algorithms [16] allow performing this process in areas that contain tens of thousands of buildings, with billions of potential links to be tested.
The design of this process, together with the software provided to the community, is one of the contributions of this paper.
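To make the definition of the viewshed concrete, the following CPU-only Python sketch tests LoS by sampling the straight segment between two points of a toy DEM and comparing the ray height with the terrain. It is only an illustration under simplifying assumptions (nearest-cell sampling, a fixed number of samples per cell, the observer cell marked as visible to itself); it is not the GPU implementation of the paper, and the names `line_of_sight` and `viewshed` are hypothetical.

```python
def line_of_sight(E, p_i, p_j, samples_per_cell=2):
    """Return 1 if the segment p_i -> p_j clears the DEM E, else 0.

    E is indexed as E[x][y]; p_i and p_j are (x, y, z) triplets with
    integer x, y. Sampling density is an assumption of this sketch.
    """
    (x0, y0, z0), (x1, y1, z1) = p_i, p_j
    steps = int(max(abs(x1 - x0), abs(y1 - y0)) * samples_per_cell)
    for s in range(1, steps):
        t = s / steps
        x = round(x0 + t * (x1 - x0))
        y = round(y0 + t * (y1 - y0))
        if (x, y) in ((x0, y0), (x1, y1)):
            continue  # the endpoint cells cannot obstruct themselves
        if E[x][y] > z0 + t * (z1 - z0):
            return 0  # terrain or building rises above the ray
    return 1


def viewshed(E, p_i):
    """Matrix with 1 where the DEM surface point (x, y) is in LoS with p_i."""
    x0, y0, _ = p_i
    return [
        [1 if (x, y) == (x0, y0) else line_of_sight(E, p_i, (x, y, E[x][y]))
         for y in range(len(E[0]))]
        for x in range(len(E))
    ]


if __name__ == "__main__":
    dem = [[0.0] * 5 for _ in range(5)]
    dem[2] = [10.0] * 5           # a 10 m wall across the map
    for row in viewshed(dem, (0, 2, 2.0)):
        print(row)
```

In the toy map the cells behind the wall are reported as not visible, while the top of the wall itself and the cells in front of it are visible.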
We obtained the DEM of 9 real-world areas, corresponding to 9 administrative municipalities in Italy. The areas are classified as urban, suburban, or rural, and their properties are reported in Table I. To process the data we use the Numba library, which exploits the CUDA architecture for NVIDIA GPUs.
Our goal is to study the visibility graph $G_v$ where each vertex of the graph corresponds to one point on the roof of an existing building, so we need the shapes of the buildings in the specified areas. This information is obtained using two different sources: OpenStreetMap and the open data made available by the single municipalities. The latter source is more accurate but not necessarily up to date, while the former is updated more frequently, especially in urban areas, but may miss some buildings. For each area, we use the source that provides the highest density of buildings per km², without attempting to merge the data, a process that would be too error-prone. Given a closed polygon $s_k$ that represents building $b_k$, we implement a function that provides a binary matrix $S^k$ of the same dimension as $E$ so that:
$$S^k_{x,y} = \begin{cases} 1 & \text{if } (x, y) \in A(s_k) \\ 0 & \text{otherwise} \end{cases}$$
where $A(\cdot)$ returns the area enclosed by $s_k$. In the following we detail the steps needed to create the annotated visibility graph $G_v$ once we have $S^k$ for all the buildings in a certain area. The code and the data used in the research are available to the community for further research and for the validation or falsification of the results. Algorithm 1 outlines the general methodology to build $G_v$ once the node locations $N_v$ have been determined. The algorithm requires $E$, which is the DEM, and $N_v$, which is the set of node locations, and computes the line of sight for every pair of points in $N_v$. The detailed algorithm that defines the function $\mathcal{L}$ is available in the additional material.
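The binary matrix associated with a building footprint can be illustrated with a short pure-Python sketch based on the even-odd ray casting test. It assumes that a cell belongs to the building when its centre falls inside the polygon, which is a choice of this example and not necessarily the rule used by the authors; `point_in_polygon` and `building_mask` are hypothetical names.

```python
def point_in_polygon(x, y, polygon):
    """Even-odd ray casting; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for a in range(n):
        (xa, ya), (xb, yb) = polygon[a], polygon[(a + 1) % n]
        if (ya > y) != (yb > y):  # the edge straddles the horizontal ray
            x_cross = xa + (y - ya) * (xb - xa) / (yb - ya)
            if x < x_cross:
                inside = not inside
    return inside


def building_mask(polygon, m_x, m_y):
    """Binary m_x-by-m_y matrix: 1 where the cell centre is inside polygon."""
    return [
        [1 if point_in_polygon(x + 0.5, y + 0.5, polygon) else 0
         for y in range(m_y)]
        for x in range(m_x)
    ]


if __name__ == "__main__":
    footprint = [(1, 1), (4, 1), (4, 4), (1, 4)]  # a 3 m x 3 m building
    for row in building_mask(footprint, 6, 6):
        print(row)
```

With a DEM resolution of one point per square meter, the number of ones in the mask approximates the footprint area in square meters.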

A. Roof Node Placement
The first step for the computation of $G_v$ is to determine $N_v$, which requires assigning one point $p_i$ to each building $b_k$, so $|N_v|$ equals the number of buildings in the area. The precise position is important because it influences the chances of having LoS with the other chosen points. In the real world this is done by inspecting the roof and visually searching for the place with the best visibility towards other buildings. In an automated procedure, we need an algorithm that tries to maximize the probability of LoS towards other buildings.
For every $s_k$ we are looking for a point $p_i$ with coordinates $(x, y) \in A(s_k)$ so that the resulting graph $G_v$ has the highest number of edges. An exact algorithm needs to explore all the possible combinations of all the $(x, y) \in A(s_k)$ for every $k \in [1, \ldots, |N_v|]$, whose number grows exponentially with $|N_v|$. By modeling the problem as a colored graph, where all the points on the roof of building $b_k$ are given the same color, we can reduce it to the search for a connected rainbow subgraph, which has been proven to be NP-hard [29]. For this reason we need a heuristic.
A reasonable assumption is that points that have good visibility toward the centroids of buildings will also have good visibility toward the points with good visibility on those buildings, so that, starting from centroids, we can select these points with an algorithm that scales quadratically with the number of buildings. However, centroids may actually fall outside the building roof (e.g., internal courts), so we use the point $c_k$ of coordinates $(x_k, y_k, z_k)$, where $(x_k, y_k)$ are the coordinates of a suitable point that ensures $(x_k, y_k) \in A(s_k)$, as defined by the function GEOSPointOnSurface from the GEOS library. For the sake of readability we call these points pseudocentroids.
We define the pseudocentroid $c_k = (x_k, y_k, z_k = E_{x_k,y_k} + 2)$, adding 2 m to $z_k$ since we assume gNB antennas are elevated on the roof with a pole, and we compute all the viewsheds $V^k$ from any point $c_k$ to any other point also elevated by 2 m. Since $\mathcal{L}$ is a symmetric function, given a point $(x, y, E_{x,y} + 2)$, each element $V^k_{x,y}$ represents the availability of LoS from $(x, y, E_{x,y} + 2)$ to $c_k$. Summing all $V^k$ we obtain $V = \sum_{k=1}^{|N_v|} V^k$, which is a cumulative visibility matrix whose values range from 0 to $|N_v|$, indicating how many pseudocentroids $c_k$ are in LoS from any elevated point on every building.
The second step is searching, for every building $b_k$, the point $\bar{p}_i$ that has the highest visibility of pseudocentroids. This is obtained by masking $V$ with $S^k$: the product $V * S^k$ (where $*$ is the element-wise multiplication) selects all the values of $V$ belonging to points inside $b_k$, and we take the coordinates of its maximum:
$$(\bar{x}, \bar{y}) = \arg\max_{(x,y)} \left(V * S^k\right)_{x,y}.$$
The point $\bar{p}_i = (\bar{x}, \bar{y}, E_{\bar{x},\bar{y}} + 2)$ is going to be the position of the gNB antennas for building $b_k$. We repeat this process for every building and we obtain $N_v$ as the collection of all the $\bar{p}_i$. From now on, when we mention a generic point $p_i$, we always refer to points chosen with this procedure.
Algorithm 1: Algorithm for the computation of $G_v$.
Require: $E$ (DEM of the area), $N_v$ (Set of points),
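The two steps of the roof node placement heuristic (accumulating the viewsheds of the pseudocentroids, then picking the best roof cell of each building) can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' GPU code: the mask is used as a filter rather than as an element-wise product, ties are broken by scan order, and the names `cumulative_visibility` and `best_roof_point` are hypothetical.

```python
def cumulative_visibility(viewsheds):
    """Element-wise sum of equally sized 0/1 viewshed matrices."""
    m_x, m_y = len(viewsheds[0]), len(viewsheds[0][0])
    return [[sum(Vk[x][y] for Vk in viewsheds) for y in range(m_y)]
            for x in range(m_x)]


def best_roof_point(V, S_k, E, pole=2.0):
    """Roof cell of building k that sees the most pseudocentroids.

    V is the cumulative visibility matrix, S_k the binary building mask,
    E the DEM; the returned z includes the 2 m antenna pole.
    """
    best = None
    for x, row in enumerate(S_k):
        for y, inside in enumerate(row):
            if inside and (best is None or V[x][y] > V[best[0]][best[1]]):
                best = (x, y)  # first maximum in scan order wins ties
    if best is None:
        raise ValueError("empty building mask")
    x, y = best
    return (x, y, E[x][y] + pole)


if __name__ == "__main__":
    V1 = [[1, 0, 1], [0, 1, 1]]
    V2 = [[1, 1, 0], [0, 1, 1]]
    V = cumulative_visibility([V1, V2])
    S_k = [[0, 1, 1], [0, 0, 1]]
    E = [[5.0, 5.0, 5.0], [5.0, 5.0, 7.0]]
    print(V, best_roof_point(V, S_k, E))
```

Filtering by the mask avoids selecting a cell outside the building when every roof cell has zero visibility, a corner case in which the argmax of a masked product would be ambiguous.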

B. Building the Visibility Graph
Once the set $N_v$ has been determined, we need to build the set $E_v$ by computing $\mathcal{L}(p_i, p_j)$ for each pair of points $p_i, p_j \in N_v$, tracing a ray between $p_i$ and $p_j$ and checking whether any obstacle intersects the ray. As in the previous task, implementing this algorithm on a GPU allows parallelizing the task among the large number of cores available. Each core is assigned to one point $p_i \in N_v$ and calculates the LoS towards any other point $p_j \in N_v$, $p_j \neq p_i$, using well-known algorithms from the ray-tracing literature [15]. The pseudocode of our specific implementation is available in the additional material. Fig. 1 shows the visibility graph of the sample area. The properties of the resulting graphs are reported in Table II and commented on in Section V.
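The construction of the edge set amounts to evaluating a LoS test on every pair of selected points. The following sequential Python sketch shows the structure of that loop; the LoS test is passed in as a generic callable (a toy predicate is used in the example), and the function name `build_visibility_graph` is hypothetical. The GPU implementation described above distributes the same pairwise work across cores instead of iterating.

```python
from itertools import combinations


def build_visibility_graph(nodes, los):
    """Return the edge set as index pairs (i, j), i < j, with los(p, q) true.

    nodes is a list of (x, y, z) points; los is any symmetric LoS test,
    so each unordered pair is evaluated only once.
    """
    return {
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(nodes), 2)
        if los(p, q)
    }


if __name__ == "__main__":
    # Toy predicate (assumption): two points "see" each other when they
    # are at most 1 unit apart along x. A real test would trace a ray on
    # the DEM.
    toy_los = lambda p, q: abs(p[0] - q[0]) <= 1
    pts = [(0, 0, 12.0), (1, 0, 9.0), (3, 0, 15.0)]
    print(build_visibility_graph(pts, toy_los))
```

Exploiting the symmetry of the LoS function halves the number of tests with respect to evaluating every ordered pair.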

C. Complexity Analysis
The worst-case time complexity for building $G_v$ is $O(|N_v|^2) \cdot C(\mathcal{L})$, where the first part represents the iteration on every pair of nodes and the second part represents the complexity of the LoS computation between two given points. $C(\mathcal{L})$ depends on the longest LoS link, for which the algorithm needs to evaluate every space element intersected by the ray.
Assuming that $E$ is a square matrix, the longest link would be its diagonal, which is $\sqrt{2|E|}$. This leads to $C(\mathcal{L}) = O(\sqrt{|E|})$, and thus to a general complexity of $O(|N_v|^2 \cdot \sqrt{|E|})$. In order to express the complexity solely in terms of the size of the area, we can take advantage of the fact that the number of buildings grows linearly with the area. This leads to an overall complexity of $O(|E| \cdot \sqrt{|E|})$. On the other hand, the SoA algorithm for viewshed analysis [15] has a complexity of $O(|E|^2)$ to calculate a single viewshed. In order to calculate the visibility graph the complexity would be $O(|N_v| \cdot |E|^2)$, which again expressed only in terms of the area is $O(|E|^2 \cdot \sqrt{|E|})$. Computing such an algorithm on datasets like U3 (more than 50 k nodes and 100 km²) would not be feasible using a normal CPU; however, modern GPUs, with their high number of cores and large RAM, make it possible to speed up the process and compute the whole visibility graph in a reasonable time. We used an NVIDIA Tesla P100 GPU, which has 3584 cores and 16 GB of memory. This allows the computation of the whole process for the largest city in the data set in roughly one hour, at a speed of 40 M links per second.

IV. GENERATION OF $G_\ell$
Given $G_v$ and the position of a set $N_\ell \subseteq N_v$ of gNBs, $G_\ell$ is embedded into $G_v$ by choosing a set $E_\ell \subseteq E_v$ that will create the wireless backhaul. Our goal is to show that the realistic data set of $G_v$ topologies that we publish is key to obtaining realistic results in the design of a next generation wireless backhaul $G_\ell$. Contrary to classical large-scale, low-density mesh networks that span across multiple municipalities [30], an NGWB is expected to extend a wired backhaul in localized regions, so in this section we describe three processes: i) how we create realistic localized $G_v$ and $N_\ell$; ii) how we create simplistic localized $G_v$ and $N_\ell$ for comparison; iii) two strategies from the state of the art to choose $E_\ell$, whose performance will strongly vary when applied to realistic or simplistic data.

A. A Realistic Localized $G_v$ and $N_\ell$
In each of the 9 areas we select 5 sub-areas of approximately 1 × 1 km, and in each sub-area we call $\Gamma$ the border of the convex hull that includes all the buildings fully contained in the sub-area, with $A(\Gamma) \approx 1$ km² its area. We choose $N_v$ by assigning a point $p_i$ to each building in the area with the procedure described in Section III, and finally we produce the localized realistic $G_v$ by applying Algorithm 1 to these points. Given a desired density $\rho$ of gNBs per km² ($\rho \in \{30, 60\}$), we set the size $|N_\ell| = \rho A(\Gamma)$ and we pick a random set $N_\ell \subseteq N_v$. In the figures we simply refer to data generated with this process with the TrueNets label.
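The selection of the gNB set in a sub-area can be illustrated with the following self-contained Python sketch, which computes the convex hull of the building locations, its area, and a random sample whose size is the desired density times that area. Several simplifications are assumptions of this example: each building is represented by a single point in metres rather than by its polygon, the sample size is rounded to the nearest integer, and the function names are hypothetical.

```python
import random


def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula; result in squared input units."""
    n = len(vertices)
    s = sum(vertices[i][0] * vertices[(i + 1) % n][1]
            - vertices[(i + 1) % n][0] * vertices[i][1] for i in range(n))
    return abs(s) / 2.0


def sample_gnbs(building_points_m, rho_per_km2, seed=0):
    """Pick round(rho * hull area in km^2) random buildings as gNB sites."""
    area_km2 = polygon_area(convex_hull(building_points_m)) / 1e6
    size = min(len(building_points_m), round(rho_per_km2 * area_km2))
    return random.Random(seed).sample(building_points_m, size)


if __name__ == "__main__":
    grid = [(x, y) for x in range(0, 1001, 100) for y in range(0, 1001, 100)]
    print(len(sample_gnbs(grid, 30)), len(sample_gnbs(grid, 60)))
```

On a 1 km × 1 km toy grid of buildings, densities of 30 and 60 gNBs per km² yield 30 and 60 selected sites respectively.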

B. Simplistic Localized $G_v$ and $N_\ell$
Given the same 9 areas and $|N_\ell|$ as defined above, we generate two versions of a simplistic $G_v$. As a baseline we pick $N_v = N_\ell$ using a Homogeneous Poisson Point Process (which we refer to as HPPP), in which locations are chosen uniformly at random in $A(\Gamma)$, without any relation to the building maps. In order to create $E_v$, for each pair of nodes in $N_v$ we add an edge with a probability given by the ETSI model using (1). This simple strategy is the one used in most of the papers that propose approaches for the creation of wireless backhauls, such as the ones from Polese et al. [4], [21] mentioned later on.
In a second, slightly more realistic approach, we pick $N_\ell$ as in the realistic case of Section IV-A, but we use the ETSI model to generate the edges $E_v$. This is an intermediate model in which the 2D distribution of the points is not completely uncorrelated from the city map, but is similar to the 2D distribution of the buildings. Yet, without 3D information, the edges are chosen with a synthetic model and not with the $\mathcal{L}$ function. As we use the OpenStreetMap data, we refer to this process as OSM.
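The two simplistic generators differ only in how node positions are obtained, so they can be sketched with two small Python functions: one that draws uniform positions and one that adds each edge with a distance-dependent probability. This is a sketch under stated assumptions: positions are drawn in a square of side `side_m` rather than in the convex hull of the buildings, the LoS probability is passed in as a callable `p_los` (the paper uses the ETSI model), and all names are hypothetical.

```python
import math
import random


def hppp_nodes(n_nodes, side_m, rng):
    """n_nodes positions drawn uniformly at random in a side_m square."""
    return [(rng.uniform(0.0, side_m), rng.uniform(0.0, side_m))
            for _ in range(n_nodes)]


def random_edges(nodes, p_los, rng):
    """Add edge (i, j), i < j, with probability p_los(distance(i, j))."""
    edges = set()
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            if rng.random() < p_los(math.dist(nodes[i], nodes[j])):
                edges.add((i, j))
    return edges


if __name__ == "__main__":
    rng = random.Random(42)
    # Placeholder decay (assumption), standing in for the ETSI formula.
    toy_p_los = lambda d: math.exp(-d / 200.0)
    nodes = hppp_nodes(30, 1000.0, rng)          # HPPP-style baseline
    print(len(random_edges(nodes, toy_p_los, rng)), "edges")
```

For the intermediate OSM-style variant, `random_edges` would be called on the 2D coordinates of the points selected on real buildings instead of on the output of `hppp_nodes`.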

C. Choosing $E_\ell$
Both the described processes produce a visibility graph $G_v$ and a set $N_\ell$ of gNBs. In the IAB terminology, some of the nodes are donors, i.e., they are connected to the wired backbone, and the other nodes need to build a multi-hop path to some donor. We randomly choose $\lceil 0.1 \rho A(\Gamma) \rceil$ donors (at least 10% of the gNBs), as in [21]. For each donor a Directed Acyclic Graph (DAG) is created that interconnects the reachable nodes, and the union of all the DAGs provides $G_\ell$. We mentioned in Section II-A2 that there are several proposals to choose the DAGs, and thus create the backhaul graphs, among which we pick two. The first is one of the heuristics proposed by Polese et al. [21], which assumes no centralized coordination. We report the results for the algorithm named DPS_WF, in which each node tries to connect with a multi-hop path to the physically closest donor. The algorithm is distributed and greedy, thus not optimal. The second one is an optimal centralized strategy for the creation of the backhaul with the smallest distance (in hops) from each node to its donor. The strategy chooses $E_\ell$ as the union of all the edges that are in the shortest path from any node to the closest donor, computed with the classical Dijkstra's algorithm. It is optimal in the sense that it minimizes the distance between each node and its donor.
Both algorithms are taken from the literature and we use them to test the impact of $G_v$ on the properties of $G_\ell$.
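The centralized shortest-path strategy can be sketched with a multi-source breadth-first search, which on unweighted graphs returns the same hop distances as Dijkstra's algorithm. The sketch below is an illustration and not the code used in the paper: it keeps a single parent per node (one shortest path instead of the union of all of them), donors are drawn at random as 10% of the gNBs rounded up, and the function names are hypothetical.

```python
import random
from collections import deque


def pick_donors(n_gnbs, seed=0):
    """Randomly choose ceil(0.1 * n_gnbs) donor indices."""
    n_donors = (n_gnbs + 9) // 10  # integer ceiling of n_gnbs / 10
    return random.Random(seed).sample(range(n_gnbs), n_donors)


def shortest_hop_backhaul(n_nodes, edges, donors):
    """Return (backhaul, hops) from a multi-source BFS rooted at the donors.

    backhaul holds directed pairs (child, parent) pointing toward the
    closest donor; hops maps each reachable node to its hop distance.
    """
    adj = {u: [] for u in range(n_nodes)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    hops = {d: 0 for d in donors}
    queue = deque(donors)
    backhaul = set()
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in hops:
                hops[v] = hops[u] + 1
                backhaul.add((v, u))
                queue.append(v)
    return backhaul, hops


if __name__ == "__main__":
    path = [(0, 1), (1, 2), (2, 3), (3, 4)]
    links, hops = shortest_hop_backhaul(6, path, donors=[0, 4])
    print(sorted(links), hops)
```

Nodes that do not appear in `hops` have no path to any donor in the visibility graph, which corresponds to gNBs that cannot be served by the wireless backhaul.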

V. ANALYSIS OF $G_v$
This section presents the features of the visibility graph $G_v$ generated considering the whole areas and the 5 sub-areas of 1 km². Due to space constraints, in the rest of the paper we include and comment on only a small set of the figures we generated, which are enough to robustly support our conclusions. The rest of the figures can be found in the supplementary material, together with the links to the data sets and the source code.

A. Size of the Giant Component
Before we provide the results on the analysis of $G_v$, we want to highlight the importance of the algorithm chosen to select $\bar{p}_i$. We compute $G_v$ with three different strategies: i) using the heuristic described in Section III-A; ii) using the pseudocentroid $c_k$; and iii) using the highest point on the roof. Fig. 3 shows the relative difference in terms of number of edges between our heuristic and the other two. For instance, in the comparison between the heuristic (h) and the pseudocentroid (c) placement, the metric is
$$\frac{|E_v^{h}| - |E_v^{c}|}{|E_v^{c}|},$$
where $E_v^{h}$ and $E_v^{c}$ are the edge sets obtained with the two placements. It is clear from the figure that the improvement with respect to the pseudocentroids is substantial. In fact, in areas such as U3 the number of edges doubles. The comparison with the highest point of the roof still shows a relevant gain, up to 10% in U3, with one exception: R1, which is a mountain rural area composed of isolated hamlets at very different heights, where overall visibility is more influenced by the position of the hamlets than by the characteristics of the buildings. Note that, in absolute terms, in U3 we gain 7,696,210 edges while in R1 we lose 13,187 edges, so overall the advantage is considerable. Moreover, the highest point on the roof may be hardly accessible (e.g., a chimney).
This confirms the importance of a solid and repeatable methodology to produce the visibility graphs, like the one we provide. It also suggests that in the real world small differences in node positions result in large differences in the graph properties, i.e., the network density is extremely sensitive to small differences in node placement. This makes it an interesting challenge to define generic synthetic models able to capture that variability. From now on we only consider the points $p_i$ selected with the heuristic.
Table II reports the number of nodes that cannot be connected to the giant network component. It is always less than 2.1% for the urban and suburban areas and below 10% in the rural areas, with the exception of R1 where, due to the morphology of the area, roughly 17% of the nodes are not in the giant component. However, in the process of network construction that we followed we did not devote any effort to obtaining full connectivity, while in a real setting several ad-hoc solutions can be introduced, e.g., higher masts on roofs or repeater nodes in strategic locations even in the absence of a building (recall that R1 is in a mountain area). So the first key finding is that an NGWB mesh network covering almost entire cities or vast suburban/rural areas is possible in 8 out of 9 settings, without any specific attempt to maximize coverage. This is in itself an extremely interesting result, as it confirms that the concept of IAB is feasible in practice and encourages several applications that rely on LoS links, such as backup networks, community networks, the extension of wired access where it cannot be provided, or even dedicated networks made of Free Space Optical links for quantum key distribution [31]. An additional result that provides further insights is shown in Fig. 4, where the size of the giant component is evaluated when only a randomly selected subset of the buildings is used.
The figure shows that even with a small percentage of randomly selected buildings it is still possible to build a backhaul connecting most of the nodes. In fact with just 30% of the buildings it is possible to connect more than 95% of the nodes in 7 out of 9 areas, and in all cases more than 80% of the selected nodes are connected.
Note that this positive result should be considered as a lower bound, as operators can easily improve it choosing high buildings or those in strategic positions, instead of choosing at random.
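The experiment with randomly selected buildings can be reproduced on any edge list with a few lines of Python. The sketch below keeps a random fraction of the nodes, merges the endpoints of the surviving edges with a union-find structure, and returns the share of kept nodes that fall in the largest connected component. The function name, the rounding of the subset size, and the toy graph are assumptions of this example.

```python
import random


def giant_component_fraction(n_nodes, edges, keep_fraction=1.0, seed=0):
    """Share of the kept nodes that belong to the largest component."""
    rng = random.Random(seed)
    kept = set(rng.sample(range(n_nodes), round(keep_fraction * n_nodes)))
    if not kept:
        return 0.0
    parent = {u: u for u in kept}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for u, v in edges:
        if u in kept and v in kept:
            parent[find(u)] = find(v)
    sizes = {}
    for u in kept:
        root = find(u)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / len(kept)


if __name__ == "__main__":
    toy_edges = [(0, 1), (1, 2), (3, 4)]
    print(giant_component_fraction(5, toy_edges))        # all buildings
    print(giant_component_fraction(5, toy_edges, 0.6))   # 60% of buildings
```

Averaging the returned value over several seeds for each fraction gives a curve of the same kind as the one discussed above.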

B. Coverage, Degree Distribution, Link Length, LoS Probability
The topological properties of G v do not show clear regularities among different areas or even inside the same area. Fig. 5 reports the degree distribution of all rural and urban areas, and shows that while the distributions for U2 and U3 seem to suggest a power law trend, the same trend is less distinguishable for U1 (also due to a more compact distribution). Rural areas have a noisy behavior indicating irregular degrees, also due to the smaller number of nodes in the area, and R3 has a second mode close to the maximum degree. The average degree is high, as it ranges from 145 to 1317. In the network with the lowest (highest) edge density, the average degree corresponds to 1.8% (20%) of the number of nodes.
The physical length of the links of G v displays large differences even among areas of the same kind. Fig. 6 reports the cumulative distribution function of the link length for all the areas, and it shows that the curves for the three areas of the same kind are always distinct and start to diverge very early in the graph.
These observations have two key direct implications for the performance of NGWBs. The first is that a high average degree implies a huge number of G ' networks that can be embedded on G v . This is very important because it makes it possible to divide the physical network into a large number of virtual backbones to support different applications. This can be an enabler of the network slicing features of 5G: a high density of links provides many possible physical IABs to map slices on, each one with different performance in terms of delay, robustness, etc. The second implication is that the diversity in the link length strongly impacts the performance of the network and the choice of the technology used to build links on the selected edges, as the propagation of signals changes significantly with the technology selected. This calls for techniques to build G ' that are tailored to the specific target area, discouraging a one-size-fits-all approach, and justifies the need for the real-world data sets we publish.

C. Antenna Elevation
The differences in G v shown in Section V-B are due to two concurrent factors: a different distribution of building elevation, mostly due to terrain factors, and a different distribution of buildings in the 2D map of the area. Here we focus on the first one. Fig. 7 reports the ECDF of the z value for all the points p i (accounting for the ground elevation, building height and 2 m pole) in all areas and 2 sets of sub-areas. The z values are referred to the lowest point of the area. Fig. 7 shows clear differences between areas, without a recognizable pattern even among areas of the same kind. R3 shows a bimodal behavior that is due to the ground altitude, rather than the buildings' height, while R2 shows a smooth trend that is quite different from the other curves. If we zoom in on a single area, we again find different situations. U1 is surrounded by mountains, so that a different choice of the sub-area yields even more variability (Fig. 7(b)). The same cannot be said for S2, in which the five sub-areas show a very similar behavior (Fig. 7(c)). Again, we observe that the variability of the data does not allow a single model for all the areas, and not even for sub-areas inside the same one. This has a strong impact on the accuracy of simulations that use a single model to describe every scenario, as we discuss in Section VI.
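The construction of the curves in Fig. 7 can be sketched as follows. The sketch assumes that the ground elevation and building height of each point are available as two aligned lists, and that the reference level defaults to the lowest ground elevation among the inputs; both the function names and this default are assumptions made for illustration.

```python
def antenna_elevations(ground, building_height, pole=2.0, reference=None):
    """z value of each antenna: ground + building + pole, relative to a
    reference level (assumed here to be the lowest ground elevation)."""
    if reference is None:
        reference = min(ground)
    return [g + h + pole - reference for g, h in zip(ground, building_height)]


def ecdf(values):
    """Empirical CDF as a list of (value, cumulative share) pairs."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


# Toy example with three buildings (all values in meters).
z = antenna_elevations([100.0, 110.0, 105.0], [10.0, 5.0, 20.0])
for value, share in ecdf(z):
    print(value, share)
```

Computing one such ECDF per area or sub-area and overlaying them makes the differences described above directly comparable.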

D. Comparison With SoA LoS Models
We now compare the probability of LoS estimated on the graphs generated with TrueNets with the two LoS models introduced in Section II-A3: the ETSI (or WINNER) model in (1), and the Al-Hourani model in (2).
As explained, these two models were not derived for rooftop backhauls, thus the goal of this analysis is to understand whether these models are adequate for the design of rooftop backhauls or not.
Consider a certain area on which we compute $G_v$, and let us call $E_p$ the set of all the potential edges, i.e., all the couples $(p_i, p_j)$ of the points in $N_v$, with $i \neq j$. An edge $e = (p_i, p_j)$ is present in $E_p$ even if $\mathbb{1}_{\mathrm{LoS}}(p_i, p_j) = 0$, i.e., even if $p_i$ and $p_j$ are not in LoS. We call $d(e)$ the length of $e$. For computational reasons, we extract a random fraction $r$ ($r = 1\%$ in urban areas and $10\%$ in sub-urban and rural areas) of edges from $E_p$, called $\hat{E}_p$, and we bin the edges in $\hat{E}_p$ based on the edge length, with interval $D = 200$ m:

$$B_l = \{\, e \in \hat{E}_p : l \le d(e) < l + D \,\}$$

Then for each $e \in B_l$ we compute the LoS probability $P^{\mathrm{ETSI}}_{\mathrm{LoS}}(d(e))$ using the ETSI model as in (1), and we calculate the average on the whole bin:

$$M^{\mathrm{ETSI}}_{\mathrm{LoS}}(l) = \frac{1}{|B_l|} \sum_{e \in B_l} P^{\mathrm{ETSI}}_{\mathrm{LoS}}(d(e)) \tag{7}$$

We repeat the same procedure to obtain $M^{\mathrm{WINNER}}_{\mathrm{LoS}}(l)$ (using the WINNER model for rural areas) and $M^{\mathrm{AL\text{-}H}}_{\mathrm{LoS}}(l)$ (for the Al-Hourani model). In the latter case we compute $P^{\mathrm{AL\text{-}H}}_{\mathrm{LoS}}(d(e), z, z')$ using (2), with $e = (p_i, p_j)$ and $z, z'$ the elevation values of $p_i$ and $p_j$. We replace $P^{\mathrm{ETSI}}_{\mathrm{LoS}}$ with $P^{\mathrm{AL\text{-}H}}_{\mathrm{LoS}}$ in (7) and we obtain $M^{\mathrm{AL\text{-}H}}_{\mathrm{LoS}}(l)$. Finally, we use TrueNets to compute $\mathbb{1}_{\mathrm{LoS}}(p_i, p_j)$ and we have a fourth value:

$$M_{\mathrm{LoS}}(l) = \frac{1}{|B_l|} \sum_{e = (p_i, p_j) \in B_l} \mathbb{1}_{\mathrm{LoS}}(p_i, p_j)$$

Fig. 8 reports the results for one urban, one sub-urban, and one rural area; the x axis of each graph is cut to avoid the noise introduced by bins with less than 0.1% of the sampled edges. The upper graphs report the numerosity of the bins, which shows that in the urban areas the density of buildings smooths the distribution, with some small fluctuations. We observed this behavior also in the other two urban areas. The sub-urban area maintains some regularity, while the rural area shows a completely different behavior. In this case, the area seems to be partitioned in small clusters that generate the multi-modal shape of Fig. 8(c), due to a settlement structure made of small, dense hamlets scattered in a mountain environment: two hamlets facing each other across a valley at distance x give a very large contribution of LoS links around this value.
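The binning procedure just described can be sketched in a few lines. The sketch assumes that each sampled edge is given as a pair (length in meters, LoS flag in {0, 1}) and that the distance-based model is passed as a callable; `binned_los` and `toy_model` are hypothetical names, and `toy_model` is a placeholder that does not reproduce (1) or (2).

```python
import math
from collections import defaultdict


def binned_los(edges, model, bin_width=200.0):
    """Group edges by length and return, for each bin start, the tuple
    (edge count, mean model probability, measured LoS share)."""
    acc = defaultdict(lambda: [0, 0.0, 0])
    for length, los in edges:
        start = bin_width * math.floor(length / bin_width)
        entry = acc[start]
        entry[0] += 1               # numerosity of the bin
        entry[1] += model(length)   # sum of model probabilities
        entry[2] += los             # number of edges actually in LoS
    return {b: (n, s / n, k / n) for b, (n, s, k) in sorted(acc.items())}


def toy_model(d):
    """Placeholder exponential decay; NOT the ETSI or Al-Hourani formula."""
    return math.exp(-d / 1000.0)


# Toy example: three sampled edges falling in two bins.
sample = [(50.0, 1), (150.0, 0), (250.0, 1)]
for start, stats in binned_los(sample, toy_model).items():
    print(start, stats)
```

Passing the same edge sample with different `model` callables yields one averaged curve per model, to be compared with the measured LoS share in each bin.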
The curves in the bottom part of the figure report the four values of M LoS . We observe that the Al-Hourani model seems to differ largely from TrueNets values, which is due to two effects. The first is that Al-Hourani was designed for short links (below 250 m) so its application to longer links extends beyond its initial purpose. In the supplementary material we report the curves for the range 0-500 m that show (for some cases) a trend that is closer to M LoS . Second, Al-Hourani models the LoS of drones moving in a 3D space, so the position of the drones is in the empty area among buildings, while we put our nodes on top of the buildings. We already observed the large difference caused by a change in the position of p i inside the same building area (Table II), so it is not surprising that the Al-Hourani model does not fit a real-world LoS probability between building roofs.
The ETSI model for Urban-Micro has a trend that is reasonably similar to M LoS but on a totally different scale, which suggests that it could be adapted to fit the real data at least in the initial part of the curve. One of the most interesting observations is that the WINNER models for the area types we consider (suburban and rural) yield completely different results: not only are they pure exponential models, but they also follow a decay that is often completely different from the measured values.

E. Robustness
One intrinsic limit of both the ETSI and Al-Hourani models is that they assume radial symmetry, so given p i they assume the probability of LoS with p j is independent of the angle of the segment between p i and p j , which is of course not true in a real setting. To quantify the effect of the LoS model we need to evaluate the properties of visibility graphs generated with different LoS probability models. We already studied the properties of G v on entire areas, so here we focus on G v built among a subset of buildings in sub-areas that are most likely to be interesting for designing real communication graphs.
We use the methodology described in Section IV-A to select a random set of buildings in one sub-area, and then we build the visibility graph G v among all the p i on the selected buildings. We repeat the process ten times with different random seeds (r = 30 gNB/km 2 ) in all the 45 sub-areas, obtaining a total of 45 × 10 random topologies.
Since availability and fault tolerance are key requirements for nextG, we focus on a robustness metric: the Effective Graph Resistance $R_G$. $R_G$ takes into account the presence of parallel (possibly disjoint) paths, and is computed as the average of the resistance between any two nodes $s, d$ in the network, computed as if the graph were an electrical resistive circuit where the links have unitary resistances (the interested reader can refer to Ellens et al. for details [32]). $R_G$ is defined in terms of the eigenvalues (ordered by their value) of the Laplacian matrix of graph $G_v$. $R_G$ strictly decreases when an edge is added to the network, so the smaller $R_G$ is, the more robust the mesh is in general, and the larger the overall capacity of a mesh built on top of it, as it can exploit more disjoint paths between nodes. For each generated graph we compute the relative percent difference as follows:

$$100 \cdot \frac{R_G^{\mathrm{ETSI}} - R_G^{\mathrm{TrueNets}}}{R_G^{\mathrm{TrueNets}}} \tag{10}$$

for the case of the ETSI model, and similarly for the Al-Hourani model. Fig. 9 reports the values averaged on all graphs. It is evident that even using the ETSI model, which was somewhat closer to the TrueNets data in Fig. 8, the robustness metric is completely different, and so will be the properties of the potential embedded G ' . The Al-Hourani model maintains a higher similarity, but robustness still differs in a range between +33.6% and −63.4%. Again, this has direct implications on the design of reliable networks and on the support of integrated network slices.
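For readers who want to reproduce the metric, the sketch below uses the form given by Ellens et al., $R_G = N \sum_{i=2}^{N} 1/\mu_i$, where $\mu_i$ are the non-zero Laplacian eigenvalues of a connected graph with $N$ nodes. The symbol $\mu_i$ and the choice of the sum (rather than a per-pair average) are assumptions; a constant normalization does not change the relative difference in (10) when the compared graphs have the same number of nodes. To stay dependency-free, the code avoids an eigenvalue solver and uses the equivalent identity $R_G = N\,(\mathrm{tr}((L + J/N)^{-1}) - 1)$, with $J$ the all-ones matrix, evaluated in exact rational arithmetic. It is meant for small graphs only.

```python
from fractions import Fraction


def effective_graph_resistance(num_nodes, edges):
    """R_G of a connected undirected graph with nodes labeled 0..N-1."""
    n = num_nodes
    # M = L + J/N: same eigenvalues as L, except that 0 is replaced by 1.
    m = [[Fraction(1, n) for _ in range(n)] for _ in range(n)]
    for u, v in edges:
        m[u][u] += 1
        m[v][v] += 1
        m[u][v] -= 1
        m[v][u] -= 1
    # Gauss-Jordan inversion of M.
    inv = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            raise ValueError("graph is not connected")
        m[col], m[pivot] = m[pivot], m[col]
        inv[col], inv[pivot] = inv[pivot], inv[col]
        p = m[col][col]
        m[col] = [x / p for x in m[col]]
        inv[col] = [x / p for x in inv[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                f = m[r][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[col])]
                inv[r] = [a - f * b for a, b in zip(inv[r], inv[col])]
    trace = sum(inv[i][i] for i in range(n))
    return n * (trace - 1)


def relative_percent_difference(model_value, reference_value):
    """Relative percent difference of a model with respect to a reference."""
    return 100 * (model_value - reference_value) / reference_value


# Adding the edge (0, 2) to the path 0-1-2 lowers the resistance.
path = effective_graph_resistance(3, [(0, 1), (1, 2)])
triangle = effective_graph_resistance(3, [(0, 1), (1, 2), (0, 2)])
print(path, triangle, relative_percent_difference(path, triangle))
```

For graphs of realistic size a numerical eigenvalue routine would be the practical choice; the exact version is used here only to keep the example self-contained and easy to check by hand.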

VI. ANALYSIS OF G '
To underline the impact of realistic data in the evaluation of scientific contributions we selected two state-of-the-art algorithms for the generation of G ' , and we compare their performance when applied to synthetic G v models, based on HPPPs (Homogeneous Poisson Point Processes) to place nodes and on synthetic models of LoS probability, or to realistic G v estimated with TrueNets. The first one is a greedy distributed algorithm from Polese et al. [21], which was proposed as an algorithm to deploy IAB networks [4]. The second one is the classical centralized shortest path algorithm from Dijkstra, used to compute a multi-source spanning tree. As documented in Section IV, we dissect the comparison in three parts, showing the impact of both modifications incrementally and finally together. The baseline, which we call HPPP+ETSI, is the synthetic model used by Polese [21]: N ' is selected with an HPPP and E ' is based on the ETSI UMi model. The intermediate one, which we call OSM+ETSI, uses the realistic positions of N ' from real-world data and retains the ETSI model for the selection of E ' . Finally, TrueNets uses both realistic node locations and realistic edge LoS measured from our data set.
For each of the 45 sub-areas of 1 km 2 and for each strategy we generate 10 different possible networks, each one with a different random choice of N ' and donor nodes and a different resulting E ' (the process is the same as explained in Section IV-A), and we average the results. As usual, we report only the minimal set of results needed to support our conclusions; more results that corroborate the same conclusions are available in the supplementary material.
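The node placement of the HPPP+ETSI baseline can be illustrated with a minimal sketch of a homogeneous Poisson point process on a square sub-area. The function name `sample_hppp`, the square shape, and the default values are assumptions made for illustration; the Poisson count is drawn with Knuth's multiplication method, which is adequate for the small means considered here.

```python
import math
import random


def sample_hppp(density_per_km2, side_km=1.0, seed=0):
    """Sample node positions (in km) from a homogeneous Poisson point
    process with the given density on a side_km x side_km square."""
    rng = random.Random(seed)
    mean = density_per_km2 * side_km ** 2

    # Knuth's method: count how many uniform factors are needed before
    # the running product drops to exp(-mean) or below.
    limit = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()

    # Given the count, the points are independent and uniform in the square.
    return [(rng.uniform(0, side_km), rng.uniform(0, side_km))
            for _ in range(count)]


# Ten independent placements for one 1 km^2 sub-area at 30 nodes/km^2.
placements = [sample_hppp(30, seed=s) for s in range(10)]
print([len(p) for p in placements])
```

Unlike this synthetic placement, the OSM+ETSI and TrueNets strategies draw node positions from real buildings, which is exactly the difference the comparison is designed to expose.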

A. Distributed Algorithm
The first result, reported in Fig. 10 a), shows the ratio of IAB-nodes that cannot be connected with a multi-hop path to any IAB-donor. These nodes are isolated and cannot be part of the backhaul network. This result indicates that the main difference is due to the LoS probability model. In fact, regardless of the node selection strategy for N ' , the fraction of disconnected nodes remains very similar. In all three areas the difference between the baseline and OSM+ETSI (blue and black bars) is below 30%. On the other hand, the usage of a realistic visibility graph dramatically impacts the ratio of unconnected nodes, with differences up to 700% due to the strong difference in the LoS probability, as reported in Fig. 8. As the density of gNBs increases, the fraction of disconnected nodes becomes marginal and the differences between strategies also decrease, as one may expect.
The second result, reported in plot b) of Fig. 10, is the metric used by the authors in the original research: the distance between each IAB-node and its donor gNB in terms of hops, which is a key parameter to estimate the latency in an NGWB, but also to estimate the effective capacity given a technology to set up links on edges. The plots report the ECDF of the hop count and show that not only does the G ' generated using synthetic data have between 20% and 30% fewer nodes (Fig. 10 a)), but the path to the donor is also much longer.

B. Centralized Algorithm
The results found for the distributed algorithm generally hold true also for the centralized (and optimal) algorithm. The first result, reported in plot a) of Fig. 11, shows that the usage of a realistic visibility graph still provides some gains in the connectivity of nodes. In fact, although the values are more compressed, the usage of a realistic visibility graph reduces the number of unconnected nodes.
The second result, reported in plot b) of Fig. 11, also confirms the results found for the distributed algorithm. By using a realistic visibility graph, the paths are generally shorter due to the fact that longer edges are less likely to be present in synthetic graphs than in reality. Additionally, we note that when employing a centralized algorithm the average number of hops diminishes significantly, suggesting that the algorithm proposed in [21] was not conceived to minimize the number of hops. Without further findings, a centralized algorithm may be preferable for delay-sensitive applications.
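The two metrics used in this section (hop count to the donor and ratio of unconnected IAB-nodes) can be sketched with a multi-source version of Dijkstra's algorithm. The sketch assumes unit edge weights, so that path length equals hop count, and a graph given as a dictionary of adjacency lists; the function names are hypothetical and the code does not reproduce the distributed algorithm of [21].

```python
import heapq


def hops_to_nearest_donor(adj, donors):
    """Multi-source Dijkstra with unit weights. Returns a dictionary
    {node: hop count to its closest donor} for every reachable node."""
    dist = {d: 0 for d in donors}
    heap = [(0, d) for d in dist]
    heapq.heapify(heap)
    while heap:
        hops, u = heapq.heappop(heap)
        if hops > dist[u]:
            continue  # stale queue entry
        for v in adj[u]:
            if hops + 1 < dist.get(v, float("inf")):
                dist[v] = hops + 1
                heapq.heappush(heap, (hops + 1, v))
    return dist


def disconnected_ratio(adj, donors):
    """Share of IAB-nodes (non-donors) with no path to any donor."""
    donors = set(donors)
    dist = hops_to_nearest_donor(adj, donors)
    iab_nodes = [n for n in adj if n not in donors]
    unreachable = [n for n in iab_nodes if n not in dist]
    return len(unreachable) / len(iab_nodes)


# Toy example: donor 0 serves nodes 1 and 2; nodes 3, 4 and 5 are cut off.
graph = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3], 5: []}
print(hops_to_nearest_donor(graph, [0]))
print(disconnected_ratio(graph, [0]))
```

Replacing the unit increment with a per-edge weight turns the same routine into a weighted shortest-path tree, should link length or capacity be preferred over hop count.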
We can conclude that, in this case, the use of synthetic data produces results that are extremely pessimistic, primarily in terms of admission to the network, but also in terms of performance, so that an evaluation based on these models may hamper the development of a technology that is instead very promising. These results do not change if we use synthetic models to estimate the LoS and realistic node positions, but they change dramatically when we use realistic data for both node positions and LoS availability, confirming the importance of data-based approaches and models for the generation of G v .

VII. CONCLUSION
The key finding of this paper is the observation that one-size-fits-all synthetic models for NGWBs are not accurate enough to capture the complexity of the real world. Past experience tells us that when we ignore the variability of real situations, systems simply fall short in delivering their expected social or economic value, or are not deployed due to pessimistic performance expectations.
For this reason, we propose a novel approach to the problem, that is, to use open data together with ray-tracing to produce accurate data sets to study the performance of networks in their real environments, and we provide an initial data set together with open source code to extend it in the future. We believe the data and the methodology we provide help researchers to obtain robust results in several network planning tasks for NGWBs. Among them we mention:
• The study of the realistic coverage of nextG networks. We know that nextG requires critical densification of BSes, but only data can realistically quantify it;
• The study of the real applicability of flying networks to extend nextG coverage. Drones have been proposed for this task when needed, but again, their efficacy needs to be evaluated using real data;
• New, more accurate models that can capture the features of existing places for which we do not possess open data.
The last point is an intriguing one. With more data sets we can abstract new models that make it possible to estimate the performance of networks even in the absence of all the data. These models may depend on data that are easier to access than the full DEM, such as satellite images, ground elevation profiles (without building elevation), or statistical distributions of building height in existing areas. Yet these models need new instruments coming from network science and data analysis to complement the classical measurement campaigns that generated the ones we use today.