Clustering Algorithm-Based Network Planning for Advanced Metering Infrastructure in Smart Grid

Nowadays, legacy electrical distribution systems are migrating to a new modern electric grid with the capability of supporting different applications such as advanced metering infrastructure (AMI), distributed energy resources, and electric vehicles. Among these applications, AMI is playing an important role in delivering data from customers to power utilities, supporting reliable real-time monitoring, and remote operation of power quality data and voltage profile. The AMI consists of smart meters, data aggregation points (DAPs), the utility control center, and communication networks. Appropriate network planning plays an essential role in facilitating the exchange of data between consumers and power utilities as well as accommodating new smart grid applications and future growth. This work proposes an optimal placement of DAPs for AMI based on machine learning clustering techniques in residential grids. Network partitioning is introduced to create sub-networks and graph algorithms generate a deployment topology given optimization constraints. A new measurement metric called coverage density is considered to indicate neighborhood area networks (NAN) zones with the appropriate coverage. Three real scenarios of NAN are considered: urban, suburban, and rural. The proposed algorithm is evaluated and compared with conventional heuristic optimization methods with respect to average and maximum distance between smart meters and DAPs, coverage density, and execution time.


I. INTRODUCTION
Advanced metering infrastructure (AMI) plays an important role in future smart grids, with many benefits for power utilities as well as for consumers. Smart meters (SMs) measure electrical consumption inside houses and communicate these measurements to power utilities. In Chile, about 250,000 SMs have been installed, and 6 million SMs are expected to be installed by 2025 covering residential and industrial customers [1]. In the United States, electric companies had installed more than 88 million SMs out of 154.1 million meters, covering nearly 57.1% of households, by the end of 2018 [2]. In Great Britain, about 34% (9.4 million) of all electrical meters are SMs, which are operated by large energy suppliers in domestic properties, and the number is still growing [3].
The associate editor coordinating the review of this manuscript and approving it for publication was Lin Zhang.
The AMI is one of the main components of the future smart grid paradigm, supporting the delivery of data from utilities to customers and vice versa [4]. Compared with the one-way communications used in legacy systems, two-way communication between customers and utilities enables the exchange of electrical consumption, billing information, and grid status, which is relevant information for utilities and the stakeholders involved. The AMI network is hierarchically composed of different communication levels, including home area networks (HANs), neighborhood area networks (NANs), wide area networks (WANs), and the utility control center, as depicted in Figure 1. The main component of the HAN is the SM, which acts as a gateway communication device between electric appliances at the home level and the power utility. A NAN is a group of nearby HANs that can be considered a single zone or cluster. Each NAN is connected to a data aggregation point (DAP), which aggregates the data from the HANs and forwards it to the utility control center. The WAN links the NANs to the utility control center using different communication technologies. When data arrives at the meter data management system (MDMS) located at the utility control center, it is processed for billing and other demand control services.
The AMI brings several benefits for consumers and power companies. One of the most important is providing historical data for billing and analysis, which allows customers to make informed choices about energy usage based on price [5]. This benefit extends to demand response (DR), in which power utilities can raise the price during stressful conditions to alleviate grid load (caused by consumption exceeding normal usage patterns) and, similarly, lower the price when the system is not overloaded [6], [7]. DR can also encourage greater user participation in maintaining grid stability [8]. On the utility side, thanks to the large volume of data generated by AMI operation, different machine learning methods can be used to classify customers by tariff range in order to implement DR [9].
Communication technologies play an important role in the correct operation of AMI networks. Several communication schemes could be deployed to interconnect the different AMI layers covering HAN, NAN, and WAN. However, there is no single correct choice of communication technology for AMI networks. For instance, wireless schemes are widely used due to their low cost, high number of possible connected devices, and ease of deployment [10]. However, channel interference between technologies and terrain constraints are major drawbacks, as they directly affect these communications. Wireless technologies include WiFi, ZigBee, LoRa, and SigFox. On the other hand, wired schemes can be considered when planning AMI networks given their high data rates, fault tolerance, and long transmission ranges [11]. Nonetheless, they are economically unviable for the relatively low data rate communication at NAN premises and, given their elevated deployment costs, cannot be considered for every AMI network context. Therefore, many considerations should be taken into account while designing the AMI network, such as terrain constraints, dense areas, and difficult access for installing nodes. In such configurations, both HANs and NANs are the main points of data generation and data aggregation. The communication between these two layers is critical for correct operation, as it carries sensitive power consumption data as well as control and alert signals [5], [12].
Considering the importance of the AMI network and the fact that it is a critical element in several smart grid applications, several research topics are related to this area, including security issues, new communication technologies, and network planning. Security concerns must be taken into consideration while deploying these networks, and the feasibility of new communication paradigms such as LoRaWAN, SigFox, and narrow-band IoT is attracting considerable interest. However, AMI network planning and the allocation of data aggregation points in AMI (the DAP placement problem) have not been well explored. Given the importance of DAP allocation when deploying AMI networks at NAN premises, network engineers must carefully plan their infrastructure, as these devices will aggregate the measurement data of HANs.
This work aims to fill the gap regarding the DAP placement problem in the AMI network. Considering DAP placement as a network planning problem, this work proposes a new resolution method considering clustering techniques using machine learning. In previous research work, most proposals are based on integer linear programming (ILP) models and heuristic methods given the multiple variables presented. We propose clustering techniques for DAP placement problem to minimize the distance between DAPs and SMs by dividing neighborhoods into sub-networks. The proposed system is evaluated for three different neighborhood areas: urban, suburban, and rural premises, with the idea of showing that the proposed technique can also be applied to different neighborhoods without loss of generality. We introduce a new measurement metric called coverage density that will set references for network engineers to know if certain zones are being correctly covered.
The rest of the paper is structured as follows. Section II introduces the related work. Section III presents the mathematical formulation of the problem. Section IV describes the proposed algorithms and solutions employed. Section V presents the achieved solutions. Finally, Section VI concludes the paper with challenges and future directions.

II. RELATED WORK
The main components of the AMI network are smart meters, data aggregation points (DAPs), and meter data management systems (MDMS). The DAPs are responsible for aggregating the measurement data reported by every smart meter in the NAN and forwarding it to the MDMS through the wide area network. Because NANs are composed of many HANs (e.g., hundreds or thousands), there is a constant flow of information that must be considered during the design stage. Given this critical task, DAPs are crucial for the correct operation of NANs, and their location is an issue that must be taken into consideration by utilities and network engineers. The DAP location problem has not received much attention from academia and industry, even though proper DAP location leads to better network performance.
Prior studies have covered AMI from different perspectives, including network planning, optimization techniques, network simulation, and communication technologies. Authors in [13] presented a genetic algorithm (GA) to solve the network planning problem, i.e., terminal assignment at minimum cost, considering specific transmission mediums (Wi-Fi and GSM) and constraining the solution to minimum wireless interference. Results have been presented in terms of the access points versus base stations necessary to provide minimum coverage. However, the work has been evaluated only for an urban area scenario. A genetic algorithm has also been considered in [14], where concentrator redundancy and reliability are the key criteria for choosing links in networks with a fault tolerance context. The main constraints of the proposed model are the cost of power estimation, device reliability, the number of desired concentrators, and network cost. Nonetheless, the work does not test the proposal on realistic topographies that would reflect real environments.
Optimization techniques have been applied in the AMI context [15] given their ability to model several concepts and propose new solutions based on design criteria such as minimum cost deployment, minimum power consumption, maximum reliability, and maximum connectivity fault tolerance. Authors in [16] proposed the use of a minimum spanning tree (MST) algorithm with the idea of providing optimal routes and minimum connectivity. The optimization problem has been solved with heuristic approaches considering distance minimization and coverage maximization. Authors in [17] studied the architecture deployment in order to minimize the average packet delay. With the use of WiFi and optical fiber, the problem has been solved in three stages: the wireless access problem, passive optical network (PON) connectivity, and wavelength division multiplexing (WDM) PON terminal assignment. The first stage uses a connectivity graph, the second a Steiner tree-based ILP model, and the third K-Means clustering to assign optical network units (ONUs) to their respective optical line terminals (OLTs). Fuzzy logic has been employed in [18] to plan a network with fault tolerance at NAN premises, considering the transmission power of devices (i.e., smart meters) and communication performance indexes based on reading success rates and average response times under a multi-hop communication scheme. Despite considering real neighborhood scenarios, all previous proposals have been constrained to only one neighborhood area.
Network partitioning is a concept that can extend the already known ILP models. Authors in [19] used network partitioning to introduce two algorithms based on K-Means clustering that minimize the average and maximum distance while varying the number of concentrators. Instead of choosing an arbitrary point that achieves the aforementioned objectives, existing points or nodes of the network are selected as the concentrators, assuming a multi-hop or mesh scheme. The same idea has been applied in [20], which considers fault tolerance in addition to the connectivity problem and models it as a set covering problem (SCP). The problem has been solved through a heuristic based on K-Means clustering, taking into consideration cost, computational costs (i.e., memory), and the desired level of redundancy.
Real implementation and simulation are the most widely used tools to validate a network design. In [21], the authors designed a communication scheme using PLC technology and validated it with the use of NS-3 simulation considering the IEC 61968 standard regarding the delay, message types, and packet size. The same concept has been used in [22], where authors validated their designed communication scheme using WiMAX for NAN and optical fiber for WAN premises using OPNET simulations considering the IEEE 2030-2011 standard regarding transmission delay, which needs to be between 4ms and 15 seconds. Furthermore, the OPNET simulation tool was employed in [8], testing WiFi and ZigBee schemes for proper AMI communication in terms of delay and throughput for a utility-provided NAN. Authors in [23] planned a LoRaWAN-based AMI network using the Forsk Atoll radio network simulation tool in terms of physical layer constraints (bandwidth, spreading factor, and link budget), coverage (Okumura-Hata propagation model) and capacity estimation of required smart meters. The same framework has been applied in [24] but for a SigFox-based AMI. Coverage, reliability, latency, battery life, and security concerns are analyzed in [25] to prove that narrow-band IoT communications are suitable for AMI network requirements, considering implementation of this technology at a university campus in China.
To give insights into possible bottlenecks of the LTE infrastructure for AMI, a tool has been used for studying the average access delay, collision probability, and traffic generated in an AMI network based on LTE communications in Montreal, Canada [26]. In [27], the planning of a PLC-based AMI network is presented using the IBM CPLEX tool and a Markovian model for the media access control (MAC) layer of PLC communications. The main parameters of the study are route reliability, connections, and the number of required DAPs for an urban area. Not only technological issues but also resources must be considered while planning a network. In [28], using an economic cost model based on a real implementation of AMI in Korea, the authors showed that a hybrid AMI communication scheme is cheaper than a single scheme and could decrease total costs by 19.1%.
Issues regarding interoperability, standards, and protocols must be considered when designing AMI networks. Device Language Message Specification (DLMS) is a concept used for abstract modelling of communication entities, which translates into a common language between AMI devices [29]. Companion Specification for Energy Metering (COSEM) sets rules, based on existing standards, regarding data exchange between energy meters. These schemes have already been applied in European utilities, along with PLC communications, and in North America with the use of RF mesh infrastructure [30]. However, as different schemes support DLMS/COSEM communication, coordination between utilities and regulations must exist in order to make AMI networks interoperable [31]; these issues must be tackled when planning AMI networks.
Novel communication technologies known as low power wide area networks (LPWAN) are being used to provide connectivity to smart devices given their low costs, low power consumption, and long communication range. Examples are LoRaWAN [23], SigFox [24] and narrow-band IoT [25]. However, the aforementioned approaches do not tackle the DAP placement problem under their domain as the focus is concentrated on the feasibility of such technologies. Table 1 presents a comparison among previous related work.
Network design and planning of AMI have not been adequately studied in the literature. There is a lack of methods and tools for AMI network planning, where available methods are based on specific technology or applications. Most of the previous work considered different technologies and planned their network based on different assumptions. However, considerations of real scenarios have not been given proper importance. Considering the critical role of DAPs in the AMI network for the correct operation, the DAP placement problem has not received the proper importance in previous investigations.
This work aims to fill the gap in network design and planning for AMI. The main contributions are: (1) a framework for the optimal placement of data aggregation points (DAPs) for AMI in residential grids; (2) the generation of an AMI topology by the proposed framework from the real geographic data of smart meters; (3) analytical models to evaluate three different AMI scenarios, namely a densely populated area (urban), a suburban area, and the countryside (rural); and (4) a performance evaluation with respect to average and maximum distances between smart meters and DAPs, coverage density, and execution time.
VOLUME 9, 2021

III. PROBLEM DEFINITION
In a typical distribution power system, the main objective of the AMI network is to provide connectivity to all smart meters in NANs. Given the fact that the connectivity among smart meters could be achieved using different wired/wireless communication technologies, this work will focus on the general problem without considering medium constraints, network cost, or compatibility issues with other technologies.
Consider a neighborhood area network (NAN) modeled by a connected, undirected graph G = (V, E) with non-negative edge weights in a 2-dimensional space, where nodes (vertices) V are smart meters (SMs) and edges E are links. Under the constraints of the AMI and NAN context, links have weights associated with the distance between SMs. It is worth noting that these weights can also be modeled as energy, cost, traffic load, or delay.
In the AMI network, every smart meter node is represented by its coordinates (longitude, latitude) as an (x, y) pair and has a communication range modeled by a circle with a radius of r_c meters. Considering that these SMs represent houses, SMs will have neighbors or adjacent SMs. The criterion for adjacency is that SM i is a neighbor of (adjacent to) SM j if d(i, j) ≤ r_c, where d(i, j) is the distance between SMs i and j, given that i, j ∈ V.
As NANs consist of hundreds or thousands of SMs, it is imperative to divide the entire network into subnetworks. Each of these subnetworks has its own data concentrator (DC) or data aggregation point (DAP), which receives the data sent from SMs and forwards it to the wide area network (WAN), which finally delivers the data to the utility control center. The concentrators are modeled by the set DAPS = {DAP_1, DAP_2, ..., DAP_n}, where n is the number of desired DAPs to be placed. Each subnetwork consists of a concentrator node or cluster head (the actual DAP) and its associated SM nodes. The nodes associated with every DAP are denoted by the set SN = {SN_1, SN_2, ..., SN_n}, where SN_n represents the nodes associated with subnetwork n.
The main objective of this work is to find the optimal placement for DAPs in AMI network that minimizes or shortens the distance between DAPs and SMs. To achieve this objective, we propose the objective function as given below.
Distance Minimization: The objective function (1) is subject to Constraints 2 to 5. Constraint 2 indicates that the union of all subnetworks yields the original network. Constraint 3 implies that no two subnetworks share members, i.e., the subnetworks are pairwise disjoint. Constraint 4 implies that all members of a subnetwork are able to communicate with each other, either directly or through adjacent nodes. Constraint 5 indicates that every DAP is chosen among the SMs that belong to the corresponding subnetwork.
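One possible way to write a formulation that matches the constraint descriptions above is sketched below. The subnetwork index k, the summed-distance form of the objective, and the wording of Constraint 4 as connectivity of the induced subgraph are assumptions made for illustration; d(·,·), V, G, SN_k, and DAP_k follow the definitions given earlier.

```latex
\begin{align}
\min_{DAP_1,\dots,DAP_n} \quad & \sum_{k=1}^{n} \sum_{i \in SN_k} d(i, DAP_k) \tag{1}\\
\text{subject to} \quad & \bigcup_{k=1}^{n} SN_k = V \tag{2}\\
& SN_k \cap SN_l = \emptyset, \quad \forall\, k \neq l \tag{3}\\
& \text{the subgraph of } G \text{ induced by } SN_k \text{ is connected}, \quad \forall\, k \tag{4}\\
& DAP_k \in SN_k, \quad \forall\, k \tag{5}
\end{align}
```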
The described problem is known as the DAP Placement Problem, a particular case of the Facility Location Problem proposed in Linear Programming. Given the fact that this problem is considered NP-hard, heuristic approaches must be taken into account in order to solve this problem. The following section explains the proposed solution using machine learning clustering technique, known as K-Medoids. The proposed solution is compared with the solution given in Ref. [19]. Also, the performance of the proposed solution is evaluated for different NAN scenarios and clustering metrics.

IV. DAP PLACEMENT CLUSTERING ALGORITHM
This section explains the DAP placement problem, including distance measurement among smart meters and the proposed clustering algorithm.

A. DISTANCE MEASUREMENT
Given that every SM is modeled by its (longitude, latitude) coordinates, a proper approximation of the real distance between meters is needed. These coordinates cannot be treated directly as points in an X-Y Euclidean space because they represent points on the surface of the earth, which is curved. A measure suited to this geometry is the Haversine Distance, which computes the minimum great-circle or spherical distance between two points on a sphere given their longitudes and latitudes.
Let (λ_1, φ_1) and (λ_2, φ_2) be the (longitude, latitude) coordinates of SM_1 and SM_2, respectively. The shortest spherical distance between these two points is d, given the radius R of the sphere (for these results, R = 6373 km) and the central angle θ = d/R between the two points. The haversine of the central angle is

$$\operatorname{Haversine}(\theta) = \sin^2\left(\frac{\theta}{2}\right),$$

and the Haversine Distance d is then calculated as

$$d = 2R \arcsin\sqrt{\sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \, \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}.$$
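As a minimal sketch of this distance measure, the standard haversine formula can be written in Python as follows. The function name `haversine_m`, the argument order, and the choice of meters as the output unit are assumptions for illustration; only the radius R = 6373 km comes from the text.

```python
import math

EARTH_RADIUS_KM = 6373.0  # value of R stated in the text

def haversine_m(lon1, lat1, lon2, lat2, radius_km=EARTH_RADIUS_KM):
    """Great-circle distance in meters between two (longitude, latitude) points in degrees."""
    lam1, phi1, lam2, phi2 = map(math.radians, (lon1, lat1, lon2, lat2))
    # Haversine(x) = sin^2(x / 2), applied to the latitude and longitude differences
    h = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin((lam2 - lam1) / 2) ** 2)
    # Clamp against rounding error before inverting the haversine
    theta = 2 * math.asin(min(1.0, math.sqrt(h)))
    return radius_km * 1000.0 * theta
```

Evaluating this function for every pair of smart meters gives the pairwise distances d(i, j) used in the rest of the framework.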

B. DISTANCE BETWEEN ADJACENT SMART METERS
After defining how distance is measured between smart meters, SMs can be classified as adjacent or non-adjacent. Let i, j ∈ V and consider r_c as the transmission range of both smart meters. If d(i, j) ≤ r_c, both smart meters are adjacent. This means that any pair of adjacent smart meters is able to communicate directly without the need for relay nodes.
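The adjacency rule can be illustrated with a short Python sketch that turns a matrix of pairwise distances into the edge weights of the graph. The function name `adjacency_weights`, the use of infinity for missing links, and the three-meter example are hypothetical choices, not taken from the original implementation.

```python
import math

def adjacency_weights(dist, r_c):
    """Return a weight matrix holding d(i, j) where d(i, j) <= r_c and infinity otherwise."""
    n = len(dist)
    w = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        w[i][i] = 0.0
        for j in range(n):
            if i != j and dist[i][j] <= r_c:
                w[i][j] = dist[i][j]  # adjacent: direct link weighted by distance
    return w

# Hypothetical example: three meters on a line, 60 m apart, with r_c = 80 m
dist = [[0, 60, 120], [60, 0, 60], [120, 60, 0]]
w = adjacency_weights(dist, r_c=80)
```

In this example the first and third meters are 120 m apart, so they are non-adjacent and have no direct link.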

C. DISTANCE BETWEEN NON-ADJACENT SMART METERS
When a pair of smart meters cannot communicate directly, i.e., d(i, j) > r_c, these SMs are non-adjacent. Consequently, they must communicate through relay nodes. This implies the use of multi-hop communication between SMs that are far from each other. Another problem then arises: which route to follow so that every SM can communicate with every other SM (regardless of whether they are adjacent). As our network is modeled as a graph, any search algorithm from graph theory could be applied. In particular, the Floyd-Warshall (FW) algorithm is used to find the route from every source node to every destination node. The FW algorithm looks for the shortest route from one source node to any other destination node, considering from 1 to n possible intermediate nodes, and chooses the route with minimum cost (in this case, based on distance). The general procedure to obtain the distances and optimal routes from source to destination yields a minimum-distance matrix D and a routing matrix P.
Figure 2 shows the proposed framework for AMI network planning. The input stage consists of the smart meter coordinates, the transmission range of the smart meters, and the desired number of DAPs. Then, the distance and route matrices are computed using the aforementioned methods.
Next, existing clustering methods such as K-Means must be reviewed. K-Means is a powerful tool for assigning labels to unlabeled data. It chooses a centroid that minimizes a certain distance measure with respect to the cluster points; in common cases, this centroid is an arbitrary point in space. However, such a centroid is not suitable in our context for two reasons: communication scheme constraints may require other SMs to act as relays for other nodes, and an arbitrary centroid presumes a direct communication link between every node and its cluster head, which may not be feasible due to distance. Therefore, cluster heads or centroids must be part of the set of nodes or vertices of the network graph.
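The shortest-route step described in this subsection can be sketched with a textbook Floyd-Warshall implementation in Python. The next-hop encoding of the routing matrix P, the helper `route`, and the small example graph are assumptions for illustration; the source only states that the procedure produces a distance matrix D and a routing matrix P.

```python
import math

def floyd_warshall(w):
    """Return (D, P): shortest-path distances and a next-hop routing matrix."""
    n = len(w)
    D = [row[:] for row in w]
    # P[i][j] is the node that follows i on the shortest route to j (None if unreachable)
    P = [[j if w[i][j] < math.inf else None for j in range(n)] for i in range(n)]
    for k in range(n):            # allow node k as an intermediate relay
        for i in range(n):
            for j in range(n):
                via_k = D[i][k] + D[k][j]
                if via_k < D[i][j]:
                    D[i][j] = via_k
                    P[i][j] = P[i][k]
    return D, P

def route(P, src, dst):
    """Rebuild the node sequence from src to dst using the routing matrix."""
    if P[src][dst] is None:
        return []
    path = [src]
    while src != dst:
        src = P[src][dst]
        path.append(src)
    return path

# Hypothetical example: meters 0 and 2 are non-adjacent and must relay through meter 1
inf = math.inf
w = [[0, 60, inf], [60, 0, 60], [inf, 60, 0]]
D, P = floyd_warshall(w)
```

Here `route(P, 0, 2)` recovers the two-hop route through the relay node, and `D[0][2]` holds its total length.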

D. CLUSTERING ALGORITHMS
We propose to solve the network planning problem with a machine learning clustering algorithm called K-Medoids [32]. It seeks cluster heads for which the total dissimilarity to all cluster objects is minimal. In the context of this work, dissimilarity is understood as distance. This description corresponds to the optimization objective in Equation (1) and translates into partitioning points with respect to a cluster head that is chosen among the existing nodes inside every cluster or sub-network.
This algorithm is used in data science, in particular in machine learning, as part of the tool-set of clustering algorithms. There are three reasons for choosing it. First, labeling points is analogous to sub-network partitioning, which is the purpose of NAN network planning. Second, cluster heads are chosen among existing nodes, which makes the result feasible for NAN partitioning. Third, and most importantly, there is a lack of information on and development of machine learning techniques in the AMI context. As stated in [9], machine learning techniques have been widely used for load analysis, load forecasting, load management, and connection verification tasks for the operation of smart grids, but not for a planning scenario similar to the one presented here.
The detailed description of the K-Medoids algorithm is presented in Algorithm 1, as given below:
1) Randomly select k out of the total n nodes as cluster heads or medoids.
2) Associate every node in the network to one of the chosen k medoids according to the minimum dissimilarity or distance (equations 1, 2, 3, 4 and 5).
3) For each medoid and for each node in every subnetwork: swap the current medoid and node.
4) Repeat step 2).
5) Compare the total distance before and after the swap. If the distance between every member has been reduced, keep the swap. In other cases, undo the swap.
6) Repeat steps 3), 4) and 5) until convergence, i.e., there is no more distance reduction for each subnetwork or cluster.
7) Save set of DAPs (daps) and set of subnetworks (subnetworks).

Please note that the AMI architecture is a tree network topology, which is the output of our network planning framework as a general case. As DAPs act as gateways in the AMI network context, every smart meter must have connectivity to a DAP, either directly via a single hop or through other smart meters (relay nodes) following a multi-hop routing scheme. On the other hand, if every smart meter has direct communication with their DAP, i.e., no relay nodes, a star topology would be the output.
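The steps of Algorithm 1 can be sketched in Python as shown below. This is an illustrative reading of the steps, not the authors' implementation: the function names, the fixed random seed, the use of the total node-to-medoid distance as the acceptance criterion in step 5, and the one-dimensional example data are all assumptions. The input D is assumed to be a matrix of shortest-path distances between all nodes.

```python
import random

def assign(D, medoids):
    """Step 2: attach every node to its nearest medoid; return labels and total distance."""
    labels = [min(medoids, key=lambda m: D[i][m]) for i in range(len(D))]
    total = sum(D[i][labels[i]] for i in range(len(D)))
    return labels, total

def k_medoids(D, k, seed=0):
    n = len(D)
    rng = random.Random(seed)
    medoids = rng.sample(range(n), k)                  # step 1: random initial medoids
    labels, best = assign(D, medoids)                  # step 2: initial assignment
    improved = True
    while improved:                                    # step 6: repeat until no reduction
        improved = False
        for idx in range(k):
            members = [i for i in range(n) if labels[i] == medoids[idx]]
            for node in members:
                if node in medoids:
                    continue
                trial = medoids[:idx] + [node] + medoids[idx + 1:]  # step 3: swap
                trial_labels, total = assign(D, trial)              # step 4: reassign
                if total < best:                                    # step 5: keep or undo
                    medoids, labels, best = trial, trial_labels, total
                    improved = True
    # step 7: DAPs and their subnetworks
    subnetworks = {m: [i for i in range(n) if labels[i] == m] for m in medoids}
    return medoids, subnetworks

# Hypothetical example: two groups of three nodes on a line
pos = [0, 1, 2, 10, 11, 12]
D = [[abs(a - b) for b in pos] for a in pos]
daps, subnetworks = k_medoids(D, k=2)
```

Because medoids are always drawn from the node set, every returned DAP is an existing smart meter, which is the property that motivates K-Medoids over K-Means in this setting.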

V. TESTING, PERFORMANCE, AND EVALUATION
In this section, we present the data used for testing the aforementioned algorithms along with their respective metrics. The main parameters of the study are distance minimization, coverage density, and execution time. We evaluate the methods on a 3 GHz 9th generation Intel Core i5 CPU using a single thread with 8 GB of RAM. The simulation tool was built on a Debian 10 Linux machine and implemented using the Python 3 programming language. Results were compared against the state-of-the-art metrics proposed by the authors in [19], which are the average and maximum distance from SMs to DAPs, along with our proposed metrics.
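As a rough illustration of the two distance metrics, the following Python sketch computes the average and maximum SM-to-DAP distance for a given placement. The function name, the dictionary layout (DAP index mapped to its member nodes), the exclusion of each DAP's zero distance to itself from the average, and the example data are assumptions.

```python
def distance_metrics(D, subnetworks):
    """Average and maximum distance from each smart meter to its assigned DAP."""
    dists = [D[sm][dap]
             for dap, members in subnetworks.items()
             for sm in members
             if sm != dap]  # assumption: the DAP itself is not counted
    return sum(dists) / len(dists), max(dists)

# Hypothetical example: six nodes on a line, DAPs at nodes 0 and 4
pos = [0, 1, 2, 10, 11, 12]
D = [[abs(a - b) for b in pos] for a in pos]
avg_dist, max_dist = distance_metrics(D, {0: [0, 1, 2], 4: [3, 4, 5]})
```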

A. NEIGHBORHOOD AREA NETWORK
Real neighborhoods were employed for the evaluation of the proposed algorithms. The NANs chosen are an urban neighborhood in Santiago, Chile, a suburban neighborhood in Puerto Montt, Chile, and a rural area also in Puerto Montt, Chile. Each node of the NAN is modeled by its (longitude, latitude) coordinates, which were obtained using Google Maps. These areas are shown in Figures 3-5. Before any analysis, it is important to note that the main objective is to provide connectivity to the maximum number of eventual customers in the given areas. Several communication schemes can be considered when implementing these systems, from wired to wireless transmission mediums, from cellular to 5G networks, and different packaging and information gathering methods. Nonetheless, all of the aforementioned methods are constrained by the following facts: there will be groups of nodes connected to one concentrator node, which communicates with its own nodes and with WAN nodes. Individual SMs will not necessarily have direct communication with their assigned concentrator node in the cluster, which implies that nodes will have to relay data in those cases until it reaches the destination. Therefore, in the latter case there must be at least one route from any source node to the cluster head or concentrator.
We ran several tests for every network and metric, ranging from 1 to 10 DAPs and repeating each experiment 10 times in order to have a proper sample to analyze and draw conclusions from. In the next subsections, 3 figures per case are shown to avoid information overload.

B. PLACEMENT AND ROUTES
This section presents the results from the execution of the proposed algorithm on three real neighborhoods regarding DAP placement and routes.
The communication range was studied for the three different neighborhoods, increasing the value of r_c linearly from 20 m to 180 m until minimum connectivity between every SM was achieved. This process resulted in choosing 80 m for urban and suburban areas, while 180 m was used for rural premises. These parameters are fixed for every test and analysis made in this study. The DAP placement problem is an under-explored topic in the smart grid domain. Thus, the selection of a certain communication scheme (as shown in related work) limits the general study of this topic. Also, considering the elevated costs of installing any wired scheme for AMI networks, and the fact that not many case studies have been presented in the literature, our proposed algorithm, named "K-Medoids", is compared with the state-of-the-art solution "CDPAavg" given in [19]. To the best of our knowledge, this is the first work that studies the DAP placement problem considering output topology, clustering results, and routes for three different neighborhoods.
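The range selection procedure can be sketched in Python as a sweep that stops at the first r_c for which the graph of smart meters is connected. The 20 m step size, the breadth-first connectivity check, the function names, and the four-node example are assumptions; the text only states that r_c was increased linearly from 20 m to 180 m.

```python
from collections import deque

def is_connected(dist, r_c):
    """Breadth-first search over links with d(i, j) <= r_c, starting from node 0."""
    n = len(dist)
    seen = {0}
    queue = deque([0])
    while queue:
        i = queue.popleft()
        for j in range(n):
            if j not in seen and dist[i][j] <= r_c:
                seen.add(j)
                queue.append(j)
    return len(seen) == n

def smallest_connecting_range(dist, start=20, stop=180, step=20):
    """Return the first tested r_c (in meters) that connects every node, or None."""
    for r_c in range(start, stop + 1, step):
        if is_connected(dist, r_c):
            return r_c
    return None

# Hypothetical example: four meters on a line with gaps of 50 m, 70 m, and 70 m
pos = [0, 50, 120, 190]
dist = [[abs(a - b) for b in pos] for a in pos]
r_min = smallest_connecting_range(dist)
```

In this example the largest gap is 70 m, so the sweep passes over 20 m, 40 m, and 60 m and stops at 80 m.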
For the urban area, both algorithms perform as expected, as shown in Figures 6.a and 7.a. On increasing the number of concentrators, i.e., DAPs ≤ 6, K-Medoids achieve a more uniform distribution of nodes, in contrast to the results given by CDPAavg. The remaining scenarios (DAPs ≥ 7), considering cluster head location, show no major difference in the distribution obtained by both algorithms. Therefore, the results are similar despite the optimization objective considered.
Concerning the routing, on a lower number of concentrators scenario, almost every border node in the subnetworks must employ relay nodes (up to 3 hops) to reach the corresponding concentrator, as depicted in Figures 6.a and 7.a. Also, the existence of a single route connecting zones (the small cluster of nodes at the upper left corner) is a critical issue for this type of placement. If the link is not operational, the entire zone is isolated from the network. When deploying more DAPs, the number of hops required to reach the destination is reduced, as shown in other figures. For a dense concentrator number, this number is even reduced, as shown in Figures 6.c and 7.c.
Regarding the suburban area, results are presented in Figures 8 and 9. When the number of DAPs is small (e.g., DAPs ≤ 3), the result achieved by both algorithms is very similar with respect to cluster head location. Both algorithms capture the same zones of the map and distribute DAPs according to their objectives. The same occurs when the number of concentrators increases, as stated when DAPs ≤ 6. Nonetheless, as more DAPs are placed in the neighborhood (Figures 8.c and 9.c), cluster distribution tends to be similar between both methods with the existence of big and small sub-networks, meaning that as DAPs increases there is no noticeable impact on topologies (considering that their optimization objectives are different).
As expected for routing, with a small number of DAPs some nodes still need a multi-hop scheme to reach the destination. Nonetheless, as distances are not as large as in the rural network case, when DAPs ≥ 6 practically every node (except for a pair shown in Figure 8.b) can communicate directly with its concentrator. It should be noted that we assumed the communication range of DAPs and SMs to be circular. However, real constraints of wireless communication, such as medium access control and interference, should be included when calculating distances between nodes, as some of them will not be able to communicate.
Finally, the performance for the rural area is shown in Figures 10 and 11 for CDPAavg and K-Medoids, respectively. The DAP locations obtained with the CDPAavg method tend to be at the center of every cluster zone, which corresponds to its minimization objective; this kind of placement is therefore expected. On the other hand, K-Medoids aims to minimize the dissimilarities (in this context, distances) between nodes and the cluster head. It obtains placements that differ from those of CDPAavg, which is explained by the randomly chosen initial DAP locations (and the corresponding initial seed for the random generators).
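To make the role of the random initial DAP locations concrete, the following is a minimal sketch of a k-medoids procedure using the simple alternating (assign, then update) scheme. It assumes Euclidean distance as the dissimilarity and is not necessarily the variant used in the original work; the function name, the `seed` parameter, and the sample coordinates are illustrative.

```python
import math
import random


def k_medoids(points, k, seed=0, max_iter=100):
    """Alternating k-medoids; returns (medoid indices, cluster labels)."""
    rng = random.Random(seed)
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    def assign(medoids):
        # Each SM joins the cluster of its closest medoid (candidate DAP).
        return [min(range(k), key=lambda c: dist[i][medoids[c]])
                for i in range(n)]

    medoids = rng.sample(range(n), k)  # random initial DAP locations
    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            # Keep the old medoid if a cluster ends up empty.
            members = [i for i in range(n) if labels[i] == c] or [medoids[c]]
            # New medoid: the member with the smallest total distance.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][i] for i in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)


# Hypothetical SM coordinates (meters): two well-separated groups.
sms = [(0, 0), (10, 0), (20, 0), (1000, 0), (1010, 0), (1020, 0)]
print(k_medoids(sms, k=2, seed=0))
```

Because every medoid is itself one of the input points, each selected DAP location coincides with an existing SM site. On less clearly separated data, different seeds can lead to different final placements.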
Nonetheless, it is worth noting that when the number of DAPs increases (see Figures 10.c and 11.c), the final placements can be considered similar, as both methods obtain similar cluster sizes (given by their convex hulls) but at different geographical locations. As seen in the figures, both methods produce large and small clusters, meaning that similar zones are found. Additionally, for every cluster, the concentrator is located at the center of the subnetwork. Hence, when the number of concentrators is larger, the final placement of the DAP will tend to be around the center of the cluster, independent of the chosen optimization objective.
Concerning network routing in a rural area, and independent of the optimization algorithm, it is clear that when there are fewer concentrators per area, more hops are needed between nodes and their DAP than in a scenario with a considerable number of DAPs, due to the larger distances between nodes. Even with a considerable number of DAPs (e.g., 9 DAPs), there still exist nodes that need at least 3 hops to reach the destination. Nonetheless, deploying more DAPs reduces the number of hops, which translates into less communication delay and better response times for metering and control signals from utilities to customers and vice versa.

C. DISTANCE COMPARISON
This subsection presents the analysis of the distance percentages obtained by each method for the three neighborhoods with respect to DAP placement and routes. We analyzed the average and maximum distances from SMs to their associated DAPs for each neighborhood in terms of their percentages of occurrence. The plots are shown in Figures 12, 14, 16 and 13, 15, 17. To have a proper sample for comparison, we ran both algorithms 10 times and took the average and maximum distances for each DAP scenario, considering all the samples gathered per iteration. This yields the stepped (staircase) curve shape for each algorithm and metric. The analysis follows the one presented by the authors of [19] in order to use the same comparison criteria.
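The AVG and MAX distance metrics, and the gap between their smallest and largest values across repeated runs, can be computed along the following lines. This is a minimal sketch assuming straight-line SM-DAP distances (not route lengths) and hypothetical data; `run_metrics` and `summarize` are illustrative names, and the exact sampling used for the published curves may differ.

```python
import math
from statistics import mean


def run_metrics(sm_points, dap_points, assignment):
    """Return (AVG, MAX) SM-to-DAP distance for one algorithm run.

    assignment[i] is the index of the DAP serving smart meter i.
    """
    d = [math.dist(sm, dap_points[a]) for sm, a in zip(sm_points, assignment)]
    return mean(d), max(d)


def summarize(runs):
    """Return the (min, max) interval of AVG and of MAX over all runs."""
    avgs, maxs = zip(*(run_metrics(*run) for run in runs))
    return {"avg_range": (min(avgs), max(avgs)),
            "max_range": (min(maxs), max(maxs))}


# Hypothetical data: the same four SMs with two different placements.
sms = [(0, 0), (30, 0), (100, 0), (140, 0)]
run_a = (sms, [(0, 0), (100, 0)], [0, 0, 1, 1])
run_b = (sms, [(0, 0), (140, 0)], [0, 0, 0, 1])
print(summarize([run_a, run_b]))
```

A narrower interval for a given method corresponds to the more "bounded" curves discussed in this subsection.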
Considering the urban area with a low number of concentrators, both algorithms achieve similar average (AVG) and maximum (MAX) distances between smart meters and DAPs (SM-DAP). The same behavior is present when DAPs ≤ 5, but when DAPs = 6 (see Figures 14.b and 15.b) K-Medoids starts to show a much narrower, more bounded curve (in both AVG and MAX distances). Therefore, the gap between the minimum and maximum distance is smaller for K-Medoids than for CDPAavg. As a consequence, node placement will be more uniform and the intra-cluster distance with respect to the cluster head will be smaller. When more concentrators are placed, there is no major difference in terms of AVG or MAX distances, which means both methods achieve similar results.
Regarding the suburban area, both methods achieve similar distance results, as shown in Figures 14.a and 15.a. For DAPs = 4, CDPAavg outperforms K-Medoids in both AVG and MAX distance, which is the only scenario where this happens. For DAPs = 6, the gaps between minimum and maximum distance for K-Medoids are smaller than the ones achieved by CDPAavg, which indicates that clusters will be more uniform regarding node distribution and intra-cluster distance. In all other cases, the gaps are very similar between methods, which shows that both achieve comparable results. In particular, when DAPs ≥ 8, the K-Medoids MAX distance is less than 80 m. In this work, we assume an ideal case with no transmission delay regarding access control and interference, except for processing tasks. This translates into direct communication from SMs to their respective aggregator.
For the rural area, both methods achieve similar distance thresholds, which are represented by the steps of the curves. Nonetheless, the interval between minimum and maximum distance in both scenarios (AVG and MAX distances) is more compressed for K-Medoids than for CDPAavg in almost every DAP scenario. This translates into more controlled zones for better equipment operation, as the variation of the measures is smaller than with CDPAavg. An interesting result is the one presented in Figures 12.b and 12.c. It is natural to think that as more DAPs are installed, the communication distance will generally decrease, but this is not the case for CDPAavg.
Since there is no constraint that limits the number of cluster members, clusters with a large number of members are expected. This also explains the long distance gaps, for every DAP, in the CDPAavg distance plots, which are due to the extended coverage range of certain clusters as seen, for instance, in Figure 6.b.
Consider the average (AVG) distance measure as the best case and the maximum (MAX) distance as the worst case. For the rural area, we can conclude that with K-Medoids, in the best case every node is connected directly to its associated DAP, given that the largest distance is less than the communication range. In the worst case, there is still communication, but through at least 2 or even 3 relay nodes.

D. NODE DISTRIBUTION AND EXECUTION TIME
This section analyzes and comments on the node distribution (in terms of numbers) and algorithm execution times obtained from the experiments. Achieved results are presented in Tables 2, 3 and 4. As we ran both algorithms 10 times and achieved different results, the results presented here are a single instance execution out of the 10-run tests performed to show the possible variations of the obtained results.
For the urban area scenario, since the area has many nodes concentrated in different zones, for low DAP coverage (DAPs < 3) the node distributions of both algorithms can be considered similar, as can the coverage density measure presented in Table 2.a. When the number of concentrators in the area increases, i.e., DAPs ≤ 6, the zones are clustered as expected, providing accurate separations for both algorithms and similar coverage, as seen in Table 3.a.
When there is a large number of concentrators, i.e., DAPs ≥ 7, the results can be considered very similar, with no significant differences despite the different optimization objectives, as can be seen in Figures 6.c and 7.c. From this information we conclude that, for high-density areas, the results will be very similar with respect to the number of smart meters in each cluster. Nonetheless, K-Medoids outperforms the CDPAavg algorithm by 5 orders of magnitude in execution time in all these tests, showing that comparable results can be achieved in much less time.
Regarding the suburban area scenario, and considering that this configuration can be seen as a small-scale urban area spread over a larger field, the results of both algorithms closely resemble those obtained for the urban area. For DAPs ≤ 3, the results are very similar; as the number of concentrators increases, the chosen cluster heads vary due to the different optimization criteria, which can be seen in the number of SMs in every cluster (see Table 3.a). In other scenarios, i.e., DAPs ≤ 6, both algorithms achieve comparable results with no great variation in the number of cluster members or the coverage density, which can also be seen graphically in Figures 8.b and 9.b. In the final case, when DAPs ≥ 7, results are very similar considering node members and coverage. However, execution time is still a major differentiating factor between the methods, with a difference of 4 orders of magnitude in every DAP scenario. Both achieve very close results, but K-Medoids obtains them in a greatly reduced time window.
Finally, for the rural area, when few concentrators are present (i.e., DAPs ≤ 3), K-Medoids achieves a higher coverage density than CDPAavg (see Table 2.c). For the next scenarios (DAPs ≤ 6), both methods achieve similar results in terms of the number of members and coverage density. Analogously to the other networks, when many concentrators are present, the results are quite similar with no major difference other than the execution times, which in these tests differed by 3 orders of magnitude. Therefore, the conclusion is that existing machine learning methods yield results that are comparable and very similar to those achieved by a well-defined heuristic method such as CDPAavg.
Several other aspects need to be addressed. For example, there is a trade-off between the number of DAPs and the coverage area, which depends entirely on the utility's requirements when installing these devices. When a DAP fails, the problem becomes more complex, as an entire sector of SMs will be disconnected from the main grid. This problem motivates the study of the resilience and reliability of AMI networks under DAP and SM failures. Other implications for AMI networks arise from the application of LPWAN technologies, whose low cost, wide coverage, and low power consumption motivate economic and market studies related to LPWAN. Furthermore, as novel technologies arise around AMI networks, the proposed framework can be tested with the new paradigms in order to reduce costs and improve smart grid connectivity.

VI. CONCLUSION
This paper proposed a network planning framework based on machine learning clustering techniques for the DAP placement problem. First, different inputs for the network planning process are considered, including smart meter coordinates, the transmission range of smart meters, and the desired number of DAPs. Then, distances and routes are computed in order to provide the proposed clustering method with the data needed to compute solutions and metrics. To evaluate the proposed algorithm, a simulation model was developed for different neighborhoods, including urban, suburban, and rural premises. The results were compared against a typical DAP placement heuristic. The proposed framework locates DAPs and SMs in such a way that the difference between the minimum and maximum distances is much smaller, translating into better coverage among areas. The results of the proposed algorithm are ready to use for planning NAN topologies, which can be deployed for different premises by utilities and network engineers. The proposed solution provided accurate network planning for possible NAN deployments in different premises without loss of generality. Given that machine learning techniques are widely used in data mining, the experiments show that they can also be used in other contexts, such as network planning for DAP placement problems. Future work aims to investigate the feasibility of the proposed solution for LPWAN networks and network resilience in case of smart meter or DAP failures.