Unsupervised Clustering for 5G Network Planning Assisted by Real Data

The fifth generation (5G) of networks is being deployed to provide a wide range of new services and to manage the accelerated traffic load of the existing networks. In present-day networks, data has become more valuable than ever for inferring the traffic load and the existing network infrastructure in order to minimize the cost of new 5G deployments. Identifying the region of highest traffic density in megabytes (MB) per km² has an important implication for minimizing the cost per bit for the mobile network operators (MNOs). In this study, we propose a base station (BS) clustering framework based on unsupervised learning to identify the target area, known as the highest traffic cluster (HTC), for 5G deployments. We propose a novel approach assisted by real data to determine the appropriate number of clusters k and to identify the HTC. The algorithm, named NetClustering, determines the HTC and the appropriate value of k by fulfilling the MNO's requirements on the highest traffic density (MB/km²) and the target deployment area (km²). To compare the appropriate value of k and other performance parameters, we use the Elbow heuristic as a benchmark. The simulation results show that the proposed algorithm fulfills the MNO's requirements on the target deployment area (km²) and the highest traffic density (MB/km²) with significant cost savings and achieves higher network utilization compared to the Elbow heuristic. In brief, the proposed algorithm provides a more meaningful interpretation of the underlying data in the context of clustering performed for network planning.


I. INTRODUCTION
5G and beyond networks are conceived to provision the use cases defined by 3GPP, from ultra-reliable low-latency communication (URLLC) services to enhanced mobile broadband (eMBB) and massive machine-type communications (mMTC) [1]. These use cases are offered as services that must sustain the tight requirements of applications such as virtual reality, vehicle-to-everything (V2X) and mission-critical communications. Inevitably, network infrastructures supporting 5G services are being planned. In this respect, network planning assisted by real network data is of utmost importance, as it is the process that identifies the highest traffic region (per km²) at cluster level, determines the appropriate number of clusters and provides a lower cost per bit. Network planning not only provides the intended coverage and capacity to the subscribers [2], [3] but is also an effective way to reduce capital and operational expenditure (CAPEX and OPEX, respectively) for the MNOs [4], [5].
Traditionally, network planning utilizes forecast data or estimated traffic demand to provide extended coverage or capacity enhancements in the target area [23]-[25]. However, the requirements of new 5G use cases demand data-driven network planning to deliver ultra-low latency, higher data rates and ultra-reliable network deployments [26] at a lower cost per bit. Acquiring the traffic and infrastructure data of an MNO enables network planning to deploy the network at a lower cost per bit [27] by identifying the densest traffic region. Therefore, proposals for 5G network planning should not only explore how to deal with data acquisition but also include data-driven decision making by adopting appropriate ML techniques.
Different studies investigate ML and big data aspects for next generation networks (NGNs) [28]-[30]. The NGNs are expected to be highly complex systems due to heterogeneity in devices, networks, services and application requirements. ML techniques have proven versatile in big data analytics, data-driven decision making, parameter estimation and multi-objective optimization problems [31], [32]. Both ML and big data approaches can be applied to NGN scenarios and techniques including massive multiple-input multiple-output (mMIMO), machine to machine (M2M), vehicular ad-hoc networks (VANETs) or internet of things (IoT) [33]. At the same time, ML techniques can play a very important role in bringing new frameworks of data analytics for efficient control, operation, optimization and network planning of 5G and beyond [34], [35]. In particular, clustering techniques have received much interest in the academic community to handle network problems in a variety of settings, like CR [36], mobile ad-hoc networks [37], VANETs [38], [39], wireless sensor networks [40], [41], IoT [42], [43], fog computing along with small cells [44], [45] and 5G [46].
In this work, we propose a network planning algorithm to perform network clustering and provide highest traffic cluster (HTC) as a deployment area for new 5G services fulfilling the MNO's requirements of the traffic density and lower cost per bit. We utilize available real mobile data [47] with an unsupervised clustering technique to identify the HTC of minimum cost per bit. This study contains substantial contributions that distinguish it from the existing work on clustering-based network planning. In the first place, real open data are used with the k-means clustering technique to clusterize the MNO's area into k clusters. Second, an algorithm is proposed to identify the HTC and appropriate value of k, based on MNO's requirements which are not previously investigated in the existing literature.
The manuscript is organized as follows. In Section II, we provide the state of the art for clustering techniques adopted in several network problems. Subsequently, we contextualize clustering for network planning in Section III. In Section IV, we develop the proposed framework for clustering and analysis to determine the appropriate value of k and to identify the HTC. Section V contains a detailed explanation of the developed algorithm, while the traditional Elbow method is discussed in Section VI. Finally, we present our results and conclusions in Sections VII and VIII, respectively.

II. RELATED WORK AND CONTRIBUTION
In mobile communication networks, clustering has mainly been applied from two viewpoints, namely users and BSs, which are grouped according to a defined criterion. These criteria may range from interference minimization to throughput maximization and spectral efficiency improvement. In this section we compile the most relevant areas for network-related clustering.
From a user-centric perspective, a number of works have addressed throughput maximization [48]- [50]. The authors of [48] study resource allocation to maximize user throughput and the clusters are formed by taking into account the physical distance and social ties between the users, also ensuring fairness among the clusters. In [49], a joint clustering plus scheduling algorithm is proposed to maximize throughput by limiting the cluster size. The cluster size is associated with the increase in the number of users in terms of fairness and throughput degradation. To balance the trade-off between throughput and fairness a dynamic power optimization and user allocation problem is investigated in [50] with a limit of two users per cluster to enhance throughput, resulting in fixed members and a large number of clusters.
User clustering has also been investigated to improve spectral and energy network efficiency [51]-[53]. In [51], a two-step clustering scheme first divides small cells into disjoint cell clusters according to the neighboring relationship, and then the UEs in each cell cluster are further grouped into UE groups with the target of minimizing intra-cluster interference. A final two-step power allocation scheme maximizes the network energy efficiency. A statistical framework has been proposed in [52] to improve both spectral and energy efficiency, with the cluster size being sensitive to changes in user and BS densities. In [53], the authors propose user clustering to enhance energy efficiency by lowering signaling overhead. In this work, the user with the best channel quality communicates with the cellular BS on behalf of the whole cluster to reduce overhead and minimize energy consumption. However, as a large cluster size means high intra-cluster signaling overhead, the cluster size is bounded by the overhead generated inside the cluster.
Besides, VANETs offer a suitable scenario to study vehicle clustering to face some of the challenges that characterize vehicular networks. The clustering of vehicles is explored in [54] based on their moving speeds, the number of hops and road conditions, where provisioning of the desired data rates is achieved by serving users with different clusters. In [55], the authors propose vehicle clustering up to three hops to get stable clusters in terms of low latency and high packet delivery. By limiting the number of clusters, they overcome the handover problem and achieve better connectivity. In [56], an SDN-based scenario is investigated where the clustering technique is adapted to cluster vehicles based on data acquisition of real-time road conditions. The results show that packet delivery improves significantly when clusters are formed based on real-time road conditions.
Interference management has long been a popular issue investigated in the context of BS clustering. The authors of [57] propose a clustering strategy to mitigate inter- and intra-cluster interference. They find that a higher interference threshold yields smaller clusters and also reduces the overhead and network latency. In [58], a greedy approach is used to form clusters by randomly selecting an initial BS as a cluster head and adding nearby BSs until the cluster size threshold is reached. The analysis shows lower interference when the cluster sizes are small, so a larger cluster size can be seen as a compromise with the inter-cell interference. The optimization technique of BS clustering proposed in [59] makes use of antenna down-tilt to limit the interference to a small number of cells, where the BSs that jointly serve the users are clustered together. The authors recommend a smaller tilt for the antennas residing in the same cluster and a larger tilt for the antennas of the other clusters.
Transversal to clustering areas is how to determine the ideal cluster size or number of clusters, a critical factor for solving a particular network problem. For instance, the problem of congestion due to large scale data communication is addressed in [60]. The appropriate number of clusters is determined by simulation and analyzing the variations in packet delay, packet loss and the number of vehicles in corresponding clusters. Increasing the number of clusters reduces the number of collisions, lowers the packet loss ratio and prevents congestion. The authors of [61] propose a novel scheme to adjust the size and the number of clusters in a large-scale cloud radio access network (C-RAN) where the processing and computational complexity of large scale channel metrics is enhanced by clustering. In [62], a clustering technique is proposed to regulate cluster size such that users need to take permission from BS before joining the cluster. When the cluster size is small, there is insufficient multichannel diversity, which reduces the transmitter's gain. When the cluster size increases, the multichannel diversity improves, and thus enhances the retransmission efficiency.
After reviewing the above literature, our question remains unanswered: what should be the appropriate size or number of clusters from a network planning perspective? Though some works have addressed clustering-based network planning in the past [63]-[66], they cannot be straightforwardly applied to 5G networks due to the very different requirements. Some previous studies have incorporated clustering into network planning [67]-[70]. The authors of [67] use a clustering approach for cell planning such that the signaling overhead is minimized. In [68], a clustering approach is used to automate 5G network planning by identifying the appropriate geographical locations for BS placement based on service quality targets and regulatory constraints. Another work [69] proposes a network planning tool based on a clustering algorithm to optimize service quality in order to meet the desired targets. The work of [70] proposes a framework to process input data from several sources and perform learning-based clustering to enhance the self-planning, self-healing and self-optimization capabilities of 5G networks. However, traffic density, network utilization, the MNO's requirements and cost per bit in the context of BS clustering are not jointly addressed in the present-day literature. In the next subsections, the takeaways from the literature review are provided, followed by the authors' contributions.

A. TAKEAWAYS
Based on the above discussion, we find that clustering for users and BSs is conducted on many performance measures like throughput, spectral and energy efficiency, interference, latency and packet delivery for particular use cases. The same performance indicators regulate the cluster size or number of clusters. Traditionally, clustering techniques have been examined in diverse network problems; however, disclosing insights from real network data, such as the traffic load and infrastructure information of the BSs, is not addressed. To overcome these gaps, we consider real network data and new performance metrics to address cost minimization and regulate the number of clusters. The new technological knowledge sought in this study is the correlation of clustering, network planning and performance metrics like target area (km²), traffic density (MB/km²), cost per MB ($) and network utilization (%). These performance metrics are considered in the context of the MNO's financial and technical requirements in the proposed clustering framework.

B. CONTRIBUTIONS
This study proposes a clustering methodology in the context of network planning. In brief, this paper brings the following contributions:
1) We propose the utilization of an unsupervised clustering technique assisted by real network data to determine the 5G deployment area.
2) In contrast to the conventional method, a new methodology is proposed that incorporates the MNO's criteria in computing the appropriate number of clusters denoted by k.
3) We develop and propose a learning-based network clustering algorithm to identify the highest traffic density cluster (HTC), which serves as the 5G deployment area to offer new services with minimum cost per MB.

III. CLUSTERING FOR NETWORK PLANNING
The advent of 5G in recent years suggests that new network deployments will be carried out by identifying the densest traffic area per km² in order to minimize the cost per bit by achieving higher network utilization. We consider the cost per bit metric, which is used in several disciplines (e.g., electronics, information theory, satellite systems, optical and communication networks) to model and analyze the cost associated with the delivery or transfer of data (see [71]-[75] for details) in order to decide on cost-effective solutions. It is an efficient metric to model different parameters and conduct comparisons across MNOs' networks, technologies and spectrum. The cost per bit values are usually considered as absolute, e.g., $15 per MB or GB, depending on the considered model and scenario. In this study, we consider the delivery of data between user and BS, whose associated cost per MB is considered corresponding to the geographical area of the cluster (km²) based on an appropriate value of k, the traffic density (MB/km²) and the network utilization (%). The network utilization is based on the MNO's traffic density, which reveals different traffic loads with respect to geography. The MNO's traffic and infrastructure information can be acquired through the network data with the aim of determining the highest traffic region at cluster level. The utilization of legacy network data in this regard can be beneficial to reveal traffic and infrastructure correlations. As an alternative to conventional data sources, crowdsourced real network data of the MNO can be utilized for the planning of 5G and beyond to infer network traffic and infrastructure. Once the data is available, the next step is to choose the learning technique best suited to the data's labeled or unlabeled nature. Supervised learning is defined by its use of labeled datasets for classification or regression problems [76].
VOLUME 4, 2016
Given the unlabeled nature of the crowdsourced data, we believe that unsupervised clustering assisted by real network data can reveal insights into the highest traffic density area. In unsupervised learning, clustering refers to revealing unseen patterns from unlabeled data in the form of k clusters [77]. It consists of organizing the data such that the similarity within a cluster is high compared to the similarity between clusters [78]. For clustering problems, the k-means clustering technique is widely adopted for traffic analysis, network analytics, data mining and pattern recognition [79]. When compared to algorithms of the same class, k-means ensures convergence, reconfigurability, automation, scalability, high computing efficiency and low computational complexity with large datasets [80]. Researchers show that k-means clustering provides exceptional results in network traffic classification, with a precision of up to 90% [81]. However, choosing the right value for k, i.e., the total number of required clusters, plays a significant role in revealing accurate insights from the considered data.
The widely used legacy method to determine the value of k is the Elbow method [82], [83], which does not interpret the data insights while suggesting the value of k. However, we believe that determining the value of k is subjective and depends on what one is trying to achieve with the given targets and constraints. Therefore, considering the crowdsourced network data, we scrutinize the value of k determined by the Elbow method in the context of achieving the MNO's requirements. The appropriate value of k has twofold importance for the MNO. First, the value of k corresponds to the coverage area of the clusters, which affects the deployment cost through the required number of radio sites. Second, the number of clusters k correlates with the traffic density (MB/km²), which decides the potential network utilization of the newly deployed gNBs (Next Generation Node-Bs).
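As a rough illustration of the benchmark, the Elbow heuristic can be sketched as follows, assuming the within-cluster sum of squares (WCSS) has already been computed for consecutive values of k. The function name `elbow_k`, the second-difference rule for locating the bend and the WCSS values are assumptions made for this sketch; the Elbow method is usually applied by visual inspection, and this is only one common way to automate it.

```python
def elbow_k(wcss_by_k):
    """Pick the k at the sharpest bend of a WCSS curve.

    wcss_by_k: dict mapping consecutive integer k values to their WCSS.
    Returns None when fewer than three k values are available.
    Assumption: the bend is located by the largest discrete second difference.
    """
    ks = sorted(wcss_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        # Second difference: how much the decrease in WCSS slows down at k.
        bend = wcss_by_k[prev_k] - 2 * wcss_by_k[k] + wcss_by_k[next_k]
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Hypothetical WCSS values, for illustration only.
wcss = {1: 100.0, 2: 40.0, 3: 35.0, 4: 33.0, 5: 32.0}
print(elbow_k(wcss))  # 2
```

Note that the selection depends only on the geometry of the WCSS curve; no traffic, area or cost information enters the decision.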
Consider the illustration of small and large values of k provided in Fig. 1(a) and (b), respectively. For the smaller value k = 3, the corresponding clusters (C1 to C3) are larger than the clusters (C1 to C6) for k = 6. For k = 3, the accumulated traffic samples of the BSs in C3 total 21, which identifies it as the HTC compared to C1 and C2. On the other hand, for k = 6 the total number of traffic samples of the HTC, C4, is 12. The HTC for k = 3, namely C3, has a larger coverage area than the HTC for k = 6, namely C4. Therefore, a large number of gNBs will be required for C3 to deploy new 5G services, which will result in a higher deployment cost, and the network may be over-budgeted if the financial limit of the MNO is exceeded. Besides, the traffic density (MB/km²) will be lower, which may lead to under-utilization of the network. As a result, the under-utilization of the network will eventually raise the cost per MB for the MNO. In contrast, C4 of Fig. 1(b) has a smaller coverage area, which means that the MNO will offer its new 5G services in a very limited area with fewer deployed gNBs than C3 of Fig. 1(a). However, the traffic density (MB/km²) of C4 will be higher, which means that traffic demand is higher and may result in an over-utilized network. In this case, the deployment may be under-budgeted and the cost per MB will decrease, but at the same time the MNO will not be able to reach a larger number of subscribers due to the smaller coverage area. Therefore, clustering for network planning should account for cost-effectiveness and improved network utilization. An adequate strategy is required to handle not only the MNO's budgetary limits but also network over- and under-utilization. The proposed clustering framework is developed on real mobile data to ensure that the network utilization is improved and the cost per MB is minimized.

IV. NETWORK CLUSTERING FRAMEWORK
In this section, we introduce the vision of network clustering, which will be fully developed in Sections IV-A to IV-D following the scheme of Fig. 2. Each row e of the network database D contains the data variables $r_e, m_e, t_e, n_e, l_e, s_e, \ldots$:
• $r_e$: radio access technology (RAT) of row e.
The network data acquisition entity of Fig. 2 processes the database D through two phases, i.e., Pre-Selection and Post-Selection. The Pre-Selection phase acquires the MNO's data while the Post-Selection phase ensures the reliability of the acquired data. Through these two phases, the network data D of the MNO m is processed by the NetDataDrilling algorithm (see [27] for details) to find h, the TAC ID with the highest traffic, as shown in Fig. 2. The corresponding geographical area of the TAC ID h is represented by $A_h$. The network data corresponding to h is then fed to the clustering algorithm, which divides the area $A_h$ into k clusters. For clustering we use the k-means learning algorithm, discussed in Section IV-A. The clusters obtained from the k-means technique are evaluated based on the traffic samples to determine the HTC (labeled as v), as described in Section IV-B. Next, we perform radio network dimensioning to determine the relevant parameters of the HTC (see Fig. 2). Conventional radio network dimensioning (CRND) is performed for the HTC to determine the site range $R_v$ and the area $A_v$ covered by the HTC, as discussed in Section IV-C. At the same time, network dimensioning for new 5G services is performed for the HTC area with the objective of determining the offered capacity $\beta_v$, the minimum offered data rate $r_{\min}$ and the required radio sites $Z_v$. Finally, cluster analysis, the nucleus of our study, is performed to determine the appropriate k fulfilling the MNO's requirements on the coverage area $A_v$ and the highest traffic density $D_v$ (MB/km²), as discussed in Section IV-D. The MNO's requirements are provided as bounds on the coverage area $(a_l, a_u)$ and the traffic density $(d_l, d_u)$. In cluster analysis, we formulate the problem of estimating the potential network utilization $P_v$ and the cost per MB $C_v$ of the HTC given that the MNO's requirements are fulfilled.
The result of the cluster analysis is the 5G deployment area $A_v$ (km²) and the traffic density (MB/km²) of the HTC (see Fig. 2) with the lowest cost per MB $C_v$ and the highest network utilization $P_v$.
The different entities of this framework are developed in the subsequent sections.

A. CLUSTERING ALGORITHM
The k-means clustering is an efficient unsupervised ML algorithm widely used to clusterize data into k clusters [84]. The k-means method consists of a twofold mechanism. First, it selects k data points known as centroids, and the other data points are assigned to their closest centroid based on Euclidean distance. Second, once the clusters are formed, the centroid of each cluster is recomputed. This mechanism iterates until the cluster formation converges. The k-means algorithm tends to minimize the following objective function:
$$J = \sum_{j=1}^{k} \sum_{i} \lVert x_i^{(j)} - c_j \rVert^2 \qquad (1)$$
where the inner sum runs over the data points of the jth cluster and $\lVert x_i^{(j)} - c_j \rVert^2$ represents the squared distance between data point $x_i^{(j)}$ and the centroid $c_j$ of the jth cluster. In our study, there are E data points that represent the geographical coordinates of the cell towers inside the area $A_h$. In the database D, $x_i^{(j)}$ represents the latitude and longitude $(l_e^1, l_e^2)$ of the cell tower i inside the jth cluster. The latitudes and longitudes $(l_e^1, l_e^2)$ of the cell towers are fed to k-means to clusterize the area $A_h$ with respect to the Euclidean distance.
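The two-step mechanism described above can be sketched in plain Python as follows. This is a minimal illustration of Lloyd's iteration rather than the implementation used in the study: the function names `kmeans` and `objective`, the random initialization with a fixed seed and the toy coordinates are assumptions, and latitude and longitude are treated as planar coordinates so that the Euclidean distance applies directly.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's iteration on 2-D points; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k data points act as initial centroids
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        # Update step: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels, centroids


def objective(points, labels, centroids):
    """Sum of squared distances between each point and its centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


# Hypothetical tower coordinates forming two well-separated groups.
towers = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(towers, k=2)
print(labels, round(objective(towers, labels, centroids), 3))
```

The result of Lloyd's iteration depends on the initial centroids, so in practice several seeds are usually tried and the run with the lowest objective is kept.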

B. HIGHEST TRAFFIC CLUSTER
The HTC refers to the cluster with the highest number of traffic samples corresponding to the cell towers residing inside that cluster. The objective of finding the HTC is to determine the geographical area of the subscribers with the highest traffic demands. Thus, once the clusters are obtained from the clustering algorithm, we aim to find the HTC among the k clusters. Compared to the other clusters, the HTC carries the highest traffic density (MB/km²), which maximizes the network utilization and decreases the cost per MB for the MNO. The computation of the HTC is based on the aggregated number of samples of all the existing cell towers as:
$$v = \arg\max_{j \in \{1, \ldots, k\}} V_j, \qquad V_j = \sum_{i=1}^{E} s_{j,i} \qquad (2)$$
where v is the cluster label of the HTC, which contains a unique set of cell towers $\mathcal{N} = \{n_1, \ldots, n_e, \ldots, n_E\}$, where $n_e$ represents the CID and $N = |\mathcal{N}|$ is the total number of towers in the HTC. The term $s_{j,i}$ refers to the traffic samples of the cell tower i inside the cluster j, whereas the variable $V_j$ represents the total number of traffic samples of the jth cluster.
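The selection of the HTC can be sketched as a per-cluster aggregation followed by an arg-max. The function name `highest_traffic_cluster` and the per-tower sample counts below are hypothetical; the counts are chosen only so that the winning cluster accumulates 21 samples, as C3 does in the illustration of Fig. 1(a).

```python
def highest_traffic_cluster(labels, samples):
    """Return (v, traffic, towers) for one clustering result.

    labels:  cluster label of each cell tower
    samples: traffic samples of each cell tower (same order as labels)
    traffic: total samples per cluster; towers: tower count per cluster
    """
    traffic, towers = {}, {}
    for label, s in zip(labels, samples):
        traffic[label] = traffic.get(label, 0) + s
        towers[label] = towers.get(label, 0) + 1
    # The HTC is the cluster with the largest aggregated sample count
    # (ties resolve to the cluster encountered first).
    v = max(traffic, key=traffic.get)
    return v, traffic, towers


# Hypothetical towers for k = 3.
labels = ["C1", "C1", "C2", "C2", "C3", "C3", "C3"]
samples = [4, 5, 6, 3, 8, 7, 6]
v, traffic, towers = highest_traffic_cluster(labels, samples)
print(v, traffic[v], towers[v])  # C3 21 3
```

The tower count of the HTC is returned alongside the label because the subsequent dimensioning step needs the number of towers inside the HTC.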

C. RADIO NETWORK DIMENSIONING
Radio network dimensioning (RND) is an essential phase of the network planning process that determines the required number of radio sites and the cell range such that the coverage and capacity requirements are fulfilled, depending on path loss, transmit power, data rates and frequencies [27]. For 4G BSs the CRND approach is used, where the corresponding 4G frequency bandwidth and related data rate are considered to obtain the coverage area of the legacy sites. We determine the 4G coverage area based on CRND because the current users are being provided coverage by 4G. In this way, the radio sites required to provide coverage with new gNBs can be determined. In this study, RND is performed for the HTC in order to calculate its area based on the coverage provided by the 4G cells. The coverage area of the HTC enables network planners to decide the radio and capacity requirements of the new network deployments. Since the network data belongs to the legacy network, we use the CRND approach of LTE. We perform CRND to determine the site range $R_v$ of the cell towers inside the HTC. We compute the site range $R_v$ in order to estimate the coverage area $A_v$ of the HTC based on the LTE service model [6], [8], [27] as:
$$A_v = A \, R_v^2 \, W_{aux} \qquad (3)$$
where $W_{aux}$ is the number of cell towers within the HTC and A = 1.95 is the area coefficient.
On the other hand, we perform service-based network dimensioning (SBND) with the NetDimensioning algorithm [27] for the HTC area $A_v$ based on the frequencies and data rates of 5G. In contrast to CRND, 5G SBND is performed with the NR parameters to obtain the gNB sites required to provide coverage in $A_v$ and to determine the required capacity in the area. The SBND algorithm [27] provides the capacity $\beta$ of a gNB, the minimum data rate $r_{\min}$ and the number of radio sites $Z_v$ required in the HTC area. Thus, the cost $Y_v$ associated with the deployment of gNB sites in the HTC area is determined as:
$$Y_v = \vartheta \, Z_v \qquad (4)$$
where $\vartheta$ is the cost of deployment per gNB in dollars [85]. The aggregated network capacity of the HTC, $\beta_v$ in MB, is determined in the same manner as given in (5), where $r_{\min}$ is the minimum data rate offered by the designed capacity $\beta$ when all the users per gNB are active in the HTC. The designed capacity $\beta$ is obtained by probabilistically characterizing the 5G radio resource control (RRC) states such that the data rates are guaranteed for the MNO's defined percentage of time (see the capacity model of [27]).
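The two dimensioning quantities that feed the cluster analysis can be sketched as follows. The sketch assumes that the coverage area is the number of HTC towers times the per-site area (area coefficient 1.95 times the squared site range) and that the deployment cost is the per-gNB cost times the number of required sites. The function names and all numeric inputs (12 towers, a 0.8 km site range, 9 gNB sites, $50,000 per gNB) are hypothetical.

```python
import math

AREA_COEFFICIENT = 1.95  # area coefficient A quoted in the text


def htc_coverage_area(num_towers, site_range_km, coeff=AREA_COEFFICIENT):
    """Coverage area of the HTC in km^2 (assumed form: W_aux * A * R_v^2)."""
    return num_towers * coeff * site_range_km ** 2


def deployment_cost(num_gnb_sites, cost_per_gnb_usd):
    """Total deployment cost in dollars: per-gNB cost times required sites."""
    return num_gnb_sites * cost_per_gnb_usd


# Hypothetical inputs for illustration only.
area = htc_coverage_area(num_towers=12, site_range_km=0.8)
cost = deployment_cost(num_gnb_sites=9, cost_per_gnb_usd=50_000)
print(round(area, 2), cost)
```

In the framework, the site range would come from CRND and the number of gNB sites from the NetDimensioning algorithm of [27]; here both are simply passed in as numbers.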

D. CLUSTER ANALYSIS
The cluster analysis is the core of our framework: it determines the network parameters and adapts the value of k such that the MNO's requirements on the traffic density (MB/km²) and the deployment area (km²) are fulfilled. The MNO's requirements are considered not only to cater for the financial constraint but also to improve efficiency in terms of network utilization. The financial constraint controls the deployment cost of the target area where new 5G services will be deployed. At the same time, these services are likely to be offered within the area of highest traffic density in MB/km², thus achieving higher network utilization with the lowest cost per MB for the MNO. We define the traffic density of the HTC as $D_v$ in MB/km². The traffic density $D_v$ depends on the number of samples $s_{j,i}$ of the cell tower i of cluster j and the coverage area $A_v$ of the cluster, and is computed as given in (6). Next, we compute the network utilization, which is a very important parameter in determining the cost per MB of the HTC. Supported by SBND, each user in the HTC will experience a data rate equal to $r_{\min}$. Hence, we consider that the minimum volume of traffic transferred between the user and the BS is represented by the conversion of the traffic samples $s_{j,i}$ into the current network utilization $T_v$ in MB as:
$$T_v = \alpha \, r_{\min} \sum_{i=1}^{E} s_{j,i} \qquad (7)$$
where $\alpha = 0.125$ is the number of bytes in a bit, used to convert bits into bytes, and the term $\sum_{i=1}^{E} s_{j,i}$ represents the total number of samples of cluster j. Thus, the cost per MB $C_v$ of the HTC is a function of the total deployment cost $Y_v$ of the gNBs and the current network utilization $T_v$ as:
$$C_v = \frac{Y_v}{T_v} \qquad (8)$$
To determine the potential network utilization percentage $P_v$ of the HTC, we can use the aggregated network capacity $\beta_v$ in (5) as:
$$P_v = \frac{T_v}{\beta_v} \times 100 \qquad (9)$$
where $T_v$ is the current network utilization in (7). The two relevant metrics in the proposed framework are the coverage area $A_v$ and the traffic density $D_v$ (MB/km²) of the HTC.
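The chain of metrics in the cluster analysis can be illustrated with a short sketch. Several forms here are assumptions rather than the authors' exact expressions: the current utilization is taken as the bit-to-byte factor times the minimum data rate (in Mbit/s) times the total HTC samples, the traffic density is taken as that utilization divided by the coverage area, and the aggregated capacity is passed in as a number because its expression in (5) is not reproduced here. The function name and all inputs are hypothetical.

```python
ALPHA = 0.125  # bytes per bit, as quoted in the text


def cluster_metrics(total_samples, r_min_mbps, area_km2,
                    deployment_cost_usd, capacity_mb):
    """Return the HTC metrics as a dict (assumed forms, see comments)."""
    t_v = ALPHA * r_min_mbps * total_samples   # current utilization in MB (assumed form)
    d_v = t_v / area_km2                       # traffic density in MB/km^2 (assumed form)
    c_v = deployment_cost_usd / t_v            # cost per MB in dollars
    p_v = 100.0 * t_v / capacity_mb            # potential utilization in percent
    return {"T_v": t_v, "D_v": d_v, "C_v": c_v, "P_v": p_v}


# Hypothetical inputs for illustration only.
m = cluster_metrics(total_samples=1000, r_min_mbps=8.0, area_km2=4.0,
                    deployment_cost_usd=50_000.0, capacity_mb=2000.0)
print(m)  # {'T_v': 1000.0, 'D_v': 250.0, 'C_v': 50.0, 'P_v': 50.0}
```

With these numbers, doubling the traffic of the HTC while keeping the same deployment would halve the cost per MB and double the utilization, which is the relationship the framework exploits when searching for k.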
The constraints to determine the appropriate value of k are translated from the MNO's requirements for new 5G deployments, which are controlled by bounds on the deployment area, $a_l \le A_v \le a_u$ (km²), and on the traffic density, $d_l \le D_v \le d_u$ (MB/km²), of the HTC, respectively. These upper and lower bounds are imposed qualitatively rather than quantitatively, although the translation may not be obvious. The coverage area $A_v$ in km² is constrained by the MNO to control the financial aspect of the deployment, as larger clusters imply a higher deployment cost. Therefore, over- or under-budgeting for the new 5G deployments is handled through the bounds on $A_v$. On the other hand, the traffic density $D_v$ in MB/km² is the second requirement of the MNO, used to achieve an adequate level of network utilization and thereby handle the cost per MB. The higher $D_v$, the higher the traffic and the lower the cost per MB, and vice versa. We need to ensure that the network is neither under- nor over-utilized, thus bounds on $D_v$ are imposed accordingly. To introduce these MNO requirements within the proposed framework, a network clustering algorithm is introduced, which is discussed in the next section.
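The MNO's criterion amounts to a pair of interval checks applied to the HTC obtained for each k. A minimal sketch is given below; the function names, the dictionary keys, the candidate values and the use of inclusive bounds are assumptions made for illustration.

```python
def within(value, bounds):
    """True when lower <= value <= upper (inclusive bounds are an assumption)."""
    lower, upper = bounds
    return lower <= value <= upper


def feasible_candidates(candidates, area_bounds, density_bounds):
    """Keep the per-k HTC records that satisfy both MNO bounds."""
    return [
        c for c in candidates
        if within(c["area_km2"], area_bounds)
        and within(c["density_mb_km2"], density_bounds)
    ]


# Hypothetical HTC records for three values of k.
candidates = [
    {"k": 3, "area_km2": 12.0, "density_mb_km2": 40.0},   # too large, too sparse
    {"k": 4, "area_km2": 8.0, "density_mb_km2": 75.0},    # inside both bounds
    {"k": 6, "area_km2": 3.0, "density_mb_km2": 140.0},   # too small, too dense
]
kept = feasible_candidates(candidates, area_bounds=(5.0, 10.0),
                           density_bounds=(50.0, 100.0))
print([c["k"] for c in kept])  # [4]
```

The example mirrors the trade-off discussed for Fig. 1: a small k yields an over-budgeted, under-utilized HTC, a large k yields an under-budgeted, over-utilized one, and the bounds retain the values of k in between.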

V. NETWORK CLUSTERING ALGORITHM
This section presents the network clustering algorithm, which is the implementation of the proposed framework presented in Fig. 2 and whose pseudocode is given in Algorithm 1. We recall that, to the best of our knowledge, this is the first network clustering proposal that takes into account radio dimensioning and the MNO's requirements on coverage area $A_v$ and traffic density $D_v$ (MB/km²), assisted by real network data. In this section, we develop the proposed algorithm, named NetClustering, to clusterize $A_h$ and to determine the appropriate value of k according to the MNO's requirements. First, we use k-means to clusterize the geographical area $A_h$ into k clusters and identify the HTC. We perform the CRND technique to determine the LTE site range $R_v$ and compute the area $A_v$ of the HTC. We also perform the SBND technique with the NetDimensioning algorithm to get the required number of radio sites $Z_v$, the gNB site capacity $\beta$ and the minimum data rate $r_{\min}$ for the HTC. We develop the mechanism to analyze the clusters and the corresponding network traffic from real mobile data following the MNO's requirements to determine the appropriate value of k.
The proposed NetClustering (Algorithm 1) is developed to solve two problems. First, it performs the data acquisition with the NetDataDrilling procedure and determines the area $A_h$. Second, based on the MNO's requirements, it obtains the appropriate value of k and determines the HTC for new 5G deployments. The proposed algorithm takes its inputs (see Algorithm 1) and is designed to be evaluated on a range of k = $[K_{\min}, K_{\max}]$. We introduce some auxiliary variables to store intermediate results. $V_{aux}$ represents the summation of the samples of those cells within the HTC for each k, thus having a dimension equal to k. $W_{aux}$ represents the summation of cell towers inside the HTC (with dimension equal to the number of cells forming the HTC).
First, we call the NetDataDrilling procedure [27] to obtain the TAC ID $h$ of the highest traffic TAC area $A_h$ (step 5). The data points corresponding to $h$ are represented in $D(t_e)$ in step 6. Next, network clustering is performed for each $k$ (for loop in step 7). In step 8, clustering is performed by calling the k-means algorithm for the corresponding value of $k$. We pass the coordinates of the cell towers, $l_e = (l_e^1, l_e^2)$, of the area $A_h$ as $D(t_e, l_e)$. The output of the k-means algorithm is the set of cluster labels $L$, which has a dimension equal to $k$. To sum the traffic samples per cluster, we begin the loop in step 9, keeping in $p$ the index of the $n_e$ CIDs (step 10) belonging to the $k$ clusters. In step 11, if CID $n_e$ belongs to $D(n_e)$, we save the aggregated traffic samples per cluster in $V_{aux}$ (step 12) and the number of cell towers per cluster in $W_{aux}$ (step 13). In step 16, we find the index $v$ of the HTC, and the corresponding label is represented by $L_v$ (step 17).
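The clustering and aggregation steps described above can be sketched as follows. This is a minimal illustration, not the authors' Algorithm 1: it uses a plain Lloyd's k-means written with the standard library, treats longitude and latitude as planar coordinates, and keeps per-cluster tower counts in `w_aux`. The function names, the tuple layout `(lon, lat, traffic_mb)` and the sample towers are all hypothetical.

```python
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on 2-D points; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                              + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels


def highest_traffic_cluster(towers, k):
    """towers: list of (lon, lat, traffic_mb). Returns (v, v_aux, w_aux)."""
    labels = kmeans([(lon, lat) for lon, lat, _ in towers], k)
    v_aux = [0.0] * k  # aggregated traffic per cluster
    w_aux = [0] * k    # number of cell towers per cluster
    for (_, _, traffic), lab in zip(towers, labels):
        v_aux[lab] += traffic
        w_aux[lab] += 1
    v = max(range(k), key=lambda c: v_aux[c])  # index of the HTC
    return v, v_aux, w_aux


# Hypothetical towers: a low-traffic group and a high-traffic group.
towers = [(0, 0, 10.0), (0, 1, 10.0), (1, 0, 10.0),
          (10, 10, 100.0), (10, 11, 100.0)]
v, v_aux, w_aux = highest_traffic_cluster(towers, k=2)
print(v_aux[v], w_aux[v])
```

In practice a library implementation of k-means and a proper geographic projection would replace the hand-written parts; the point here is only the flow from labels to per-cluster traffic to the HTC index.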
Next, the CRND procedure is called (step 18) to obtain the cell tower range $R_v$ in the HTC area. Based on $R_v$, the coverage area $A_v$ of the HTC is computed in step 19. In step 20, the HTC area $A_v$ is provided to the NetDimensioning algorithm [27] to get the required number of radio sites $Z_v$, the gNB capacity $\beta$ and $r_{\min}$ for the HTC. In step 21, we compute the total cost of deployment $Y_v$ for the $Z_v$ radio sites to be deployed in the HTC area $A_v$. Next, in step 22, we compute the current traffic $T_v$ of the HTC. To obtain the traffic density $D_v$ of the HTC, we utilize (6) in step 23. Next, we compute the network capacity $\beta_v$ of the HTC (step 24) based on the required number of radio sites $Z_v$ and the data rate $r_{\min}$. Based on the previous computations, we then determine $C_v$ (the estimated cost per MB of the HTC) in step 25, along with the potential network utilization percentage $P_v$ in step 26. The MNO's criterion is then introduced (step 27) along with the bounds $[(a_l, a_u), (d_l, d_u)]$ on $A_v$ and $D_v$, respectively. Given that the MNO's criterion is fulfilled, the results are updated in step 28. Finally, the results of the HTCs for the range of $k = [K_{\min}, K_{\max}]$ are compiled in step 30.
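The final filtering and compilation steps can be sketched as below, assuming the per-$k$ metrics of the HTC have already been computed. The `HtcResult` record, the `select_k` function, the tie-breaking rule (highest utilization first, then lowest cost per MB) and every number in the example are assumptions made for illustration; they are not taken from Algorithm 1 or from Table 1.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class HtcResult:
    k: int
    area_km2: float         # A_v
    density_mb_km2: float   # D_v
    cost_per_mb: float      # C_v
    utilization_pct: float  # P_v


def select_k(results: Sequence[HtcResult],
             area_bounds: Tuple[float, float],
             density_bounds: Tuple[float, float]) -> Optional[HtcResult]:
    """Keep the HTCs that satisfy the MNO's bounds, then pick one of them."""
    a_l, a_u = area_bounds
    d_l, d_u = density_bounds
    feasible = [r for r in results
                if a_l <= r.area_km2 <= a_u
                and d_l <= r.density_mb_km2 <= d_u]
    if not feasible:
        return None  # no k in the range meets the MNO's criterion
    # Assumed ranking: highest utilization, ties broken by lowest cost per MB.
    return max(feasible, key=lambda r: (r.utilization_pct, -r.cost_per_mb))


# Made-up per-k results, for illustration only.
candidates = [
    HtcResult(k=2, area_km2=150.0, density_mb_km2=30.0, cost_per_mb=0.9, utilization_pct=4.0),
    HtcResult(k=10, area_km2=40.0, density_mb_km2=60.0, cost_per_mb=0.6, utilization_pct=8.0),
    HtcResult(k=20, area_km2=25.0, density_mb_km2=80.0, cost_per_mb=0.5, utilization_pct=10.0),
    HtcResult(k=50, area_km2=5.0, density_mb_km2=200.0, cost_per_mb=0.3, utilization_pct=15.0),
]
best = select_k(candidates, area_bounds=(20.0, 50.0), density_bounds=(50.0, 100.0))
print(best)
```

Note that the candidate with the highest utilization overall (`k=50` in this toy data) is rejected because its area and density fall outside the bounds, which is the role the MNO's criterion plays in the algorithm.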
The complexity of Algorithm 1 depends mainly on three aspects. First, it depends on the size of the database $D$, represented by the number of samples $E$ in millions. The NetDataDrilling procedure in step 5 processes $D$ with a finite number of iterations (see [27] for details). Thus the complexity term for NetDataDrilling can be represented as , where $|D(t_e)|$ represents the number of TACs in $D$. The second aspect rests on the k-means algorithm (step 8), which clusters the area $A_h$ into $k$ clusters with a finite number of iterations given by the for loop in step 7. Thus the complexity term is represented as $O(k \times |D(t_e)| + |\Delta K|)$, where $\Delta K$ represents the granularity with which $k$ is incremented for the next iteration. The third aspect is the complexity of the RND algorithm, given by $O(B \times |\Delta P| + |\Delta Q|)$. The term $B$ represents the bandwidth on which cellular services are configured and evaluated, while $\Delta P$ and $\Delta Q$ are the transmit power and cell load granularities for the next iteration [27]. Note that the RND algorithms are executed independently for CRND (step 18) and NetDimensioning (step 20), respectively.

VI. ELBOW METHOD
The Elbow method is primarily based on the k-means learning technique and computes the sum of squared distances (distortions) from each point to its assigned centroid as a function of $k$ [86]. The appropriate value of $k$ is selected by running the k-means algorithm across a range of $k$. The method plots the distortion as a function of $k$ and chooses the $k$ at the point where the distortion drops drastically, forming the smallest angle. The distortions can be calculated using (1), as explained in Section IV-A. The pseudocode of the Elbow heuristic is given in Algorithm 2. We start by initializing the value of $k = 2$ in step 1, and clustering is performed for a range of $k = [K_{\min}, K_{\max}]$ (for loop in step 3). We measure the distortions by using (1), and the values are stored in $V_k$, which has a dimension equal to $k$ (step 4). In step 5, the distortion value for each $k$ is added to the result, and finally the result is returned in step 7.
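A minimal sketch of the two ingredients described above: the distortion of a clustering and the choice of the elbow as the point forming the smallest angle. Because an angle depends on how the axes are scaled, the sketch rescales both axes to $[0, 1]$ first; that normalization, the function names and the distortion values in the example are assumptions made for illustration and are not taken from Algorithm 2.

```python
import math


def distortion(points, labels, centroids):
    """Sum of squared distances from each 2-D point to its assigned centroid."""
    return sum((p[0] - centroids[lab][0]) ** 2 + (p[1] - centroids[lab][1]) ** 2
               for p, lab in zip(points, labels))


def elbow_k(ks, distortions):
    """Return the k at which the normalized distortion curve forms the smallest angle.

    Assumes ks is increasing, has at least three entries, and that the
    distortions are not all equal.
    """
    k_span = ks[-1] - ks[0]
    d_min, d_span = min(distortions), max(distortions) - min(distortions)
    xs = [(k - ks[0]) / k_span for k in ks]
    ys = [(d - d_min) / d_span for d in distortions]
    best_k, best_angle = None, math.inf
    for i in range(1, len(ks) - 1):
        # Vectors from the current point to its left and right neighbours.
        ax, ay = xs[i - 1] - xs[i], ys[i - 1] - ys[i]
        bx, by = xs[i + 1] - xs[i], ys[i + 1] - ys[i]
        cos = (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
        angle = math.acos(max(-1.0, min(1.0, cos)))
        if angle < best_angle:
            best_k, best_angle = ks[i], angle
    return best_k


# Synthetic distortion values with a sharp drop after the first k.
ks = [2, 3, 4, 5, 6]
synthetic = [100.0, 30.0, 25.0, 22.0, 20.0]
print(elbow_k(ks, synthetic))
```

The endpoints of the range cannot be selected by this rule, since an angle needs a neighbour on each side.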

Algorithm 2: Elbow Heuristic
The complexity of Algorithm 2 is similar to that of k-means, as the distortion values are computed independently over the finite number of iterations of the for loop in step 3. Thus, the complexity of the Elbow heuristic is given as $O(k \times |D(t_e)| + |\Delta K|)$, where $\Delta K$ represents the granularity of $k$.

VII. RESULTS AND ANALYSIS
This section presents the results and the corresponding analysis of the parameters considered in our study, i.e., cost per MB $C_v$, potential network utilization $P_v$ (%), traffic density $D_v$ (MB/km$^2$) and the 5G deployment area $A_v$ (km$^2$). The simulation results of the proposed NetClustering algorithm are presented and compared with the Elbow heuristic.
The simulation includes the area $A_h$ shown in Fig.3, obtained from the NetDataDrilling procedure [27], where LTE cell towers are located across the area. The simulation curves presented in this section have been obtained by averaging the results of 1,000 executions, each corresponding to one independent, random and uniform user distribution. In this study, we investigate two scenarios with different bandwidths $B = \{30, 50\}$ MHz, while the other simulation parameters are provided in Table 1.
With reference to Algorithm 1, the NetDataDrilling procedure [27]

FIGURE 3. Highest traffic TAC area $A_h$ obtained from the NetDataDrilling procedure [27], where the blue circles represent cell towers plotted at their longitudes and latitudes.

The Elbow heuristic suggests the number of clusters $k = 3$, as shown in the elbow curve of Fig.4. These three clusters are obtained from the clustering algorithm and are shown in Fig.5, where the HTC is shown with its towers plotted as white circles. On the other hand, the same database corresponding to the ID $h$ is fed to the proposed NetClustering algorithm for a range of clusters $K_{\min}$ to $K_{\max}$. The curves of cost per MB ($C_v$) and potential network utilization ($P_v$) are presented in Fig.6. The network utilization curve is non-decreasing as the number of clusters increases from 2 to 100. As $k$ increases, the area $A_h$ tends to be divided into smaller coverage areas per cluster, which increases both the traffic density $D_v$ (MB/km$^2$) and the network utilization $P_v$, as the services are offered in a limited area within a smaller cluster. However, the overall achievable value of $P_v$ is not more than 20% and 17.4% for $B = \{30, 50\}$ MHz, respectively. The reason for this low value of $P_v$ is that the designed bandwidth provides more capacity than the subscribers in each cluster currently require. On the other hand, the cost per MB $C_v$ curves tend to decrease with smaller clusters, as the coverage is provided in a more concentrated area of subscribers with higher traffic demands. When the value of $k$ is large, the area is divided into multiple smaller regions; therefore, it becomes convenient for the MNOs to deploy an adequate number of radio sites that fulfills the current requirement of the subscribers' traffic. A smaller cluster means that the MNO has to deploy new radio sites in a smaller area, which decreases the deployment cost $Y_v$.
Network planning at the cluster level minimizes $C_v$ for the MNOs; however, the opportunity cost is paid in offering new services within a limited geographical area. Besides, if the traffic density $D_v$ and $P_v$ are not considered while deciding the appropriate $k$, the deployed network may become over-utilized over time for larger values of $k$, as more subscribers will be acquiring new services within a smaller, concentrated area. In contrast, the deployed network may become under-utilized with smaller values of $k$, as can be seen in the trend followed by the $P_v$ curves for both bandwidth scenarios in Fig.6 Table 1. Within this range, the highest potential network utilization and the lowest cost per MB are provided by the cluster obtained for $k = 29$ (see dotted lines in Fig.6) for both bandwidth cases. The clustered area $A_h$ for $k = 29$ is presented in Fig.7, with cell towers plotted as white circles. It is clear that the larger value $k = 29$ results in smaller clusters (Fig.7) compared to the larger clusters for $k = 3$ (Fig.5). However, the geographical region of the highest traffic remains the same under the HTC obtained from both the Elbow heuristic and the NetClustering algorithm. The comparison between the values of $k$ suggested by the Elbow method and the NetClustering algorithm, based on cost per MB $C_v$, potential network utilization $P_v$, traffic density $D_v$ (MB/km$^2$) and deployment area $A_v$ (km$^2$), is provided in Fig.8 for the given bandwidths. For the Elbow heuristic, the deployment area $A_v$ of the HTC is independent of the bandwidth and equals $A_v = 123.97$ km$^2$ for both bandwidth cases, as shown with the green bars in Fig.8. In the case of NetClustering, the deployment area is $A_v = 20.19$ km$^2$ and is within the bounds provided by the MNO.
The results for $k = 3$ (Elbow heuristic) show under-utilization of the network, because coverage is provided over a larger area while the potential network utilization is lower, with $P_v = 5.11\%$ and $P_v = 4.28\%$ for the $B = \{30, 50\}$ MHz cases, respectively. On the other hand, the HTC for $k = 29$ (NetClustering) has better network utilization, with $P_v = 11.2\%$ and $P_v = 9.42\%$.
The traffic density values for the HTC based on the Elbow heuristic are very low, i.e., $D_v = 38.17$ MB/km$^2$ for both bandwidth cases. The lower $D_v$ value shows that the formation of large clusters (for $k = 3$) is not suitable in this region, as the traffic load is low. On the other hand, the HTC obtained by NetClustering has a better traffic density, with $D_v = 85.16$ MB/km$^2$ (for $k = 29$), as shown in Fig.8 Fig.8. The lower cost per MB is due to the fact that the appropriate value $k = 29$ results in smaller clusters, which saves deployment cost for the MNO and provides cheaper clusters in terms of cost per MB. Deploying the new 5G gNBs within the smaller clusters for $k = 29$ seems to be a better choice for the MNOs. It saves the deployment cost $Y_v$ and, at the same time, achieves a lower cost per MB $C_v$ with better network utilization $P_v$ and higher traffic density $D_v$ (MB/km$^2$). However, the price paid is the computational complexity in terms of CPU time, as shown in Fig.9. The complexity curves are obtained by increasing the number of samples $E$ (or data points $d_e$) from 1 to 10 million. The NetClustering algorithm consumes more CPU time, but its CPU time still grows linearly with $E$. In this case, the computational complexity is far less important than the cost, as clustering can be performed off-line on a non-real-time basis.
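As a quick arithmetic check, the values reported above for the two HTCs can be placed side by side as ratios of NetClustering ($k = 29$) to the Elbow heuristic ($k = 3$). The ratios are computed here from the quoted figures and rounded; they are not additional results of the study.

```latex
\[
\begin{aligned}
  \frac{P_v^{(k=29)}}{P_v^{(k=3)}}\bigg|_{B=30\,\mathrm{MHz}} &= \frac{11.2}{5.11} \approx 2.19, &
  \frac{P_v^{(k=29)}}{P_v^{(k=3)}}\bigg|_{B=50\,\mathrm{MHz}} &= \frac{9.42}{4.28} \approx 2.20, \\[4pt]
  \frac{D_v^{(k=29)}}{D_v^{(k=3)}} &= \frac{85.16}{38.17} \approx 2.23, &
  \frac{A_v^{(k=29)}}{A_v^{(k=3)}} &= \frac{20.19}{123.97} \approx 0.163 .
\end{aligned}
\]
```

In other words, the HTC selected by NetClustering covers roughly one sixth of the area of the Elbow-based HTC while offering a little more than twice its traffic density and potential utilization.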

VIII. CONCLUSION
Revealing insights about the current traffic loads from the existing network infrastructure assists network planning in reducing the cost for the MNOs. In this paper, we address the network planning problem of identifying the HTC area $A_v$ with the highest traffic density $D_v$ (MB/km$^2$) by employing clustering based on ML. The appropriate value of $k$ and the corresponding HTC are identified by the NetClustering algorithm, fulfilling the MNO's requirements on $A_v$ and $D_v$. We show that the proposed algorithm determines the value of $k$ such that the potential network utilization $P_v$ is higher while the cost per MB $C_v$ is minimized. The performance comparison is based on the cost per MB $C_v$, traffic density $D_v$, deployment area $A_v$ and network utilization $P_v$ of the HTC. We compare these parameters and observe that the NetClustering algorithm not only attains up to 45% cost savings per MB but also achieves higher network utilization $P_v$ compared with the Elbow heuristic. We have evaluated our proposed algorithm on two bandwidth scenarios, 30 and 50 MHz, and it shows consistent performance. As future research lines, we are committed to exploring ML applications in the context of network data combined with radio dimensioning of mmWave for 5G and beyond. In the sixth generation (6G), one of the primary use cases is ultra-massive machine type communication (umMTC) with a density of $10^7$ devices per km$^2$. The proposed clustering framework can be evolved to optimize spectrum and energy efficiency in large-scale IoT scenarios for newly defined MNO requirements.