Mobile Network Coverage Prediction Based on Supervised Machine Learning Algorithms

The need for wider coverage and high-performance quality of mobile networks is critical due to the maturity of Internet penetration in today’s society. One of the primary drivers of this demand is the dramatic shift toward digitalization caused by the Covid-19 pandemic. Meanwhile, the emergence of the 5G wireless standard and the increasingly complex operating environment of mobile networks make traditional prediction models less reliable. With its recent advancements and promising capabilities, machine learning (ML) is seen as an alternative to the traditional approaches for ground-to-ground (G2G) mobile communication coverage prediction. In this study, various ML models have been tested and evaluated to develop an ML-based received signal strength prediction model for mobile networks. The challenge is to identify a practical ML model that fulfills the computing speed criteria while still meeting the prediction accuracy requirements. Six categories of ML models, namely Linear Regression (LR), Artificial Neural Network (ANN), Support Vector Machine (SVM), Regression Trees (RT), Ensembles of Trees (ET), and Gaussian Process Regression (GPR), comprising more than 20 types of established algorithms/kernels, have been tested and evaluated in this paper to identify the best contender in terms of speed and accuracy. Findings from the evaluation showed that the GPR model is the most accurate model for Reference Signal Received Power (RSRP) prediction in terms of $RMSE$ and $R^{2}$, followed by ET, RT, SVM, ANN and LR. Nevertheless, prediction speed and model training time are also important factors in determining the most practical model for RSRP prediction in real-world mobile network planning applications.
Finally, the ET model with the Random Forest (RF) algorithm has been selected and is highly recommended as the most practical ML model for developing a rigorous RSRP prediction model for multiple frequency bands and multiple environments. The developed prediction model can be utilized for network analysis and optimization.

Partnership Project), 5G is divided into two categories: Sub-6 GHz 5G, which operates in the same spectrum band as 4G LTE (450 MHz to 6 GHz), and mmWave 5G, which operates in the 24.25 GHz to 52.6 GHz spectrum band [13]. For 5G networks operating at mmWave, the signal is more vulnerable to the effects of the transmission environment, which consequently shortens the effective communication distance [3], [5], [6]. The antennas need to be placed nearer to users, which translates to about ten times more antennas to be installed compared to the previous 4G requirement (approximately 100 to 200 meters apart) [2], [5], [6], [14]. Therefore, 5G network development requires extensive and critical radio network planning and analysis. It will involve numerous types of data and factors to identify the best positions for antenna deployment [15]. Because planning and analysis of this complexity are not feasible with traditional methods, high-level data analytics using machine learning (ML) is required [5], [9]. This is consistent with the latest technical specifications issued by 3GPP (Release 18), where ML capabilities are beginning to be adopted as part of advanced network planning for future 5G deployments [16].
ML is a branch of artificial intelligence (AI) and computer science that focuses on the use of data and algorithms, and it has recently gained popularity in the field of wireless communication [2], [17], [18]. The ML-based prediction model is seen as a game-changer in modern mobile network planning because it can produce more accurate results than the traditional empirical prediction model. It is also more efficient in terms of data processing than the deterministic prediction model [12], [19]-[22]. In addition, ML-based models can improve their accuracy over time without having to be reprogrammed [9].
In traditional prediction methods, mobile network planning is inflexible [22]. The prediction is constrained to certain specifications and conditions such as frequency, antenna height and environmental characteristics. In reality, however, the operating environment of modern radio networks is more diverse and complex [23]. Therefore, it is necessary to explore and build a new prediction method that can operate more flexibly and universally to adapt to the operational complexities of modern mobile networks. For ground-to-ground (G2G) mobile communication scenarios, the authors in [18], [22], [24]-[28] have tested and evaluated the capabilities of several ML models for mobile network performance prediction. Overall, the results of these studies showed that ML-based prediction models outperformed the traditional methods in terms of computational efficiency, applicability, and accuracy.
ML-based prediction models for path loss in an urban environment were implemented in Beijing, China, by the authors in [27], using Artificial Neural Network (ANN), Support Vector Regression (SVR) and Random Forest (RF) models. The resulting root-mean-square error (RMSE) is in the range of 4 dB to 5 dB. The work was performed at 877.26 MHz, 2021.4 MHz, and 2127.6 MHz, with the following input parameters: (i) the separation distance between transmitter ($T_x$) and receiver ($R_x$); and (ii) frequency. The work was based on 2,558 datasets, with ratios of 53.6% and 46.4% for model training and testing, respectively. Similarly, the work in [25] was conducted in an urban environment in Lisbon, Portugal, at the 3.7 GHz and 26 GHz frequency bands, using a real 5G network. The input parameters and response variable are almost the same as in [27], but the dataset is two times larger. Only SVR and RF models were studied, with RMSE values of around 6 dB to 7 dB.
Studies for the urban environment using a simulation dataset were carried out by [18] on the 2140 MHz frequency band only. A total of 5,150 datasets were generated using ray-tracing techniques. Like the previous two studies, the response variable is path loss. However, three new input parameters were introduced in this work: (i) $T_x$ height and $R_x$ height; (ii) coordinates of $R_x$; and (iii) the status of the signal propagation path, being either line of sight (LOS) or non-line of sight (NLOS). The predictive results are based on SVR and RF models, with RMSE values between 2 dB and 4 dB. However, according to the authors in [26], the $R_x$ location ought to be eliminated to get a prediction model with better generalization and adaptability that could be advantageously employed in various geological areas.
Simulated datasets for urban environments were also used by [24], but the focus is only on NB-IoT networks, a branch of mobile communication that uses low bandwidth. The frequencies tested were 900 MHz and 1800 MHz, with the following input parameters: (i) the separation distance between $T_x$ and $R_x$; (ii) the height of the building involved; and (iii) the signal propagation path status (LOS/NLOS). The dataset was almost seven times larger than that of [18], with sampling ratios of 80% and 20% for training and testing, respectively. However, only ANN and RF models were tested and evaluated, with RMSE values of around 4 dB to 6 dB.
An ML-based prediction model was implemented by [28] for suburban environments on the 450 MHz, 1450 MHz, and 2300 MHz frequency bands in South Korea. The input parameters are almost the same as in the previous studies, except for one new input parameter: the ratio of the $T_x$ height to the $R_x$ height. However, the authors did not clearly state the actual size of the dataset. Only ANN and GPR models were studied, with RMSE values between 8 dB and 9 dB for both models.
In [26], the ML-based prediction model was tested in a rural environment in Greece using ANN, SVR, and RF models. The RMSE value is between 4 dB and 5 dB. This work was only implemented on a single frequency band of 3.7 GHz. The input parameters used are as follows: (i) 3-dimensional (3D) distance between $T_x$ and $R_x$; (ii) $T_x$ height and $R_x$ height above sea level (ASL); and (iii) the signal propagation path status (LOS/NLOS). Meanwhile, the response variable is path loss. The dataset contained around 2,200 samples, with a sampling ratio of 70%, 15%, and 15% for training, validation, and testing, respectively.
VOLUME 10, 2022
Lastly, the research work in [22] is more comprehensive than the other studies mentioned earlier. The ML-based model was tested in multiple environments: a combination of dense cities, open areas, and inland lakes. The input parameters are also quite comprehensive, consisting of (i) 2-dimensional (2D) and 3D separation distance between $T_x$ and $R_x$; (ii) frequency; (iii) height difference between $T_x$ and $R_x$; (iv) $T_x$ tilt angle; (v) $T_x$ azimuth angle; (vi) transmitting power; (vii) clutter and building height information; and (viii) the vertical distance of the $R_x$ from the major lobe signal line of the $T_x$ antenna. Moreover, this is the only work that uses Reference Signal Received Power (RSRP) as the response variable. In addition, the dataset is very large, consisting of 12,011,833 datasets. Unfortunately, it was only tested on an RF model, with an RMSE value of 6.11 dB. The works mentioned above are further summarized in Table 1 for a clear comparison.
Although there is no specific rule of thumb for determining the number of datasets suitable for developing an ML-based mobile network performance prediction model, we found that a larger dataset may provide more consistent predictive performance, even if the dataset is generated through computer simulation. In this study, the response variable for the developed prediction model is RSRP because it is the key parameter directly representing the network signal level at the UE location [29] in 4G LTE and 5G NR networks. We decided to use RSRP instead of path loss in this study to present the final output in a more meaningful way. In addition, the ML models tested in previous works for G2G mobile communications are limited to ANN, SVR, RF, and GPR, and the algorithms/kernels applied to these models are limited to certain types only.
As such, the contributions of this study can be summarized as follows:
• An extensive measurement campaign consisting of 21,323 datasets collected in various outdoor environments using an Android-based drive test solution, for which the cleaned dataset is available for download at [31].
• A guide to generating input parameters using online-based radio planning tools (CloudRF).
• A comprehensive methodology for preparing a clean study dataset for ML-based RSRP predictive modeling development.
• Assessment and validation of different state-of-the-art ML models and their performance for application in G2G mobile network signal strength prediction.
• Development of an ML-based model for faster mobile network coverage prediction.
• The ML-based RSRP prediction model is developed for various deployment scenarios, such as urban, suburban, and open areas. The model is also compatible with multiple frequency bands.
• An evaluation and optimization approach for received signal strength in existing 4G LTE mobile networks.
• Performance analysis of the ML-based model against previous work done by [22], [25], but considering an extension of scenarios and other improvements in methodology, as shown in Table 1.
The rest of the paper is organized as follows: Section II describes the principles of the considered ML models. Section III presents the experimental setup for data collection and dataset preparation. The learning, testing, and validation of ML models are outlined in Section IV. The performance of all tested ML models is evaluated and discussed in Section V, followed by model applications and optimization approaches in Section VI. Finally, Section VII concludes the article.

II. MACHINE LEARNING BASED MODELS FOR MOBILE NETWORKS RSRP PREDICTION
Predicting the RSRP of mobile networks can be categorized as a regression problem [26], [32]. Regression problems belong to supervised ML, where the models are trained on given labeled data [27]. The state-of-the-art supervised ML-based models tested and examined in this study are discussed below.

A. LINEAR REGRESSION (LR)
LR is an ML-based method that finds the linear relationship between the input parameters and the response variable [33], [34]. LR is easy to implement, and its results are easy to interpret. However, its disadvantage is the initial assumption of a linear relationship between input parameters and response variables, which is not suitable when dealing with non-linear relationships [35].
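As a minimal illustration of the linear relationship that LR looks for, the following Python sketch fits a straight line to a single input parameter by ordinary least squares. The function name `fit_line` and the toy numbers are assumptions made for illustration only; they are not taken from the study, which evaluated LR with several input parameters in MATLAB.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and co-deviation of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data: a perfectly linear signal-versus-distance trend.
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [-82.0, -84.0, -86.0, -88.0])
```

When the true relationship is curved, the same fit still returns a line, which is the limitation noted above.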

B. ARTIFICIAL NEURAL NETWORK (ANN)
ANN is a network of artificial neurons that mimics the way the human brain functions [36]. The basis of ANN operation is to find the best correlation to represent the relationship between the input parameters and the response variable [26]. ANN is suitable for non-linear regression problems and shows good predictive performance with a large dataset [27]. An ANN is built from three segments: the input layer, the hidden layer, and the output layer, where data processing and forecasting activities are carried out in the hidden layer [36]. Therefore, ANN prediction performance is highly influenced by the hidden layer settings. However, excessive use of hidden layers makes the models more complex and tends to degrade their generalization capabilities [37], [38].

C. SUPPORT VECTOR MACHINE (SVM)
SVM, also known as SVR when applied to regression, is a supervised ML method that uses kernel functions to solve regression problems. The kernel converts the dataset to different dimensions to obtain the best hyperplane settings to represent the correlation between input parameters and response variables [18], [26]. Thus, a non-linear dataset can be mapped into linear relationships in higher-dimensional spaces [27], [32], [39]. SVM is less prone to overfitting and has excellent generalization capabilities but is less efficient when dealing with large datasets, particularly those with a lot of noise [35], [40].

D. REGRESSION TREES (RT)
RT is a decision tree-based method that uses a tree-like structure to make predictions based on the rules set at each node [23], [41]. Thus, RT performance is highly influenced by the number of nodes from the root to the leaf [42]. Data processing and analysis in RT are easy to interpret. RT is also capable of handling missing values in the study datasets. However, it is easily affected by noise, leading to overfitting if the tree is too deep [35], [40].

E. ENSEMBLES OF TREES (ET)
ET is an approach that combines multiple RTs to produce an output [24]. In principle, ET is a group of weak learners combined to form a strong learner [43]. Therefore, it requires more computational power and a longer training time than RT. There are two types of ET techniques: Bagging and Boosting [44]. Bagging is a technique that decreases the variance of the prediction by generating subsets of data chosen randomly with replacement from the original training dataset [26]. Meanwhile, Boosting is an iterative technique that adjusts the weight of an observation based on the prior learner's result [15]. ET can overcome the overfitting issues of RT and is more robust to noise [35], [45]. RF is one of the popular ET algorithms that uses the bagging technique to perform regression predictions [26], [46].
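The bagging principle described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation evaluated in this study: the weak learner is a hypothetical one-split regression stump on a single feature, the function names and the default of 25 learners are invented for the example, and the random feature subsampling that RF adds on top of bagging is omitted.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression stump on a single feature (a deliberately weak learner)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not left or not right:
            continue  # a split must leave samples on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # No valid split exists (all x identical): predict the mean.
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_bagged(xs, ys, n_learners=25, seed=0):
    """Train each learner on a bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    learners = []
    for _ in range(n_learners):
        # Bootstrap: draw n indices at random WITH replacement.
        sample = [rng.randrange(n) for _ in range(n)]
        learners.append(fit_stump([xs[i] for i in sample],
                                  [ys[i] for i in sample]))
    return lambda x: sum(f(x) for f in learners) / len(learners)
```

Averaging over learners trained on different bootstrap samples is what reduces the variance of a single tree; boosting would instead train the learners sequentially on reweighted observations.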

F. GAUSSIAN PROCESS REGRESSION (GPR)
GPR is a Bayesian non-parametric model [15] that utilizes kernel functions in solving regression problems [47]. It derives the relationship between input parameters and response variables from unknown functions [28]. GPR can produce good predictions in high dimensions, even with small datasets [48], and supports non-linear and complex problems [49]. However, the drawback of GPR is that it requires high computational power [50], [51].

III. DATA COLLECTION AND DATASET PREPARATION
Fig. 1 describes the overall concept of execution of this study. The study dataset is constructed from a comprehensive measurement campaign conducted in the Federal Territory of Putrajaya, a planned city that serves as the administrative capital of Malaysia. Putrajaya was selected based on the following factors: (i) it has a unique town planning architecture that combines multi-environment characteristics, i.e., high-rise and mid-rise buildings, single- and double-story terrace houses, lakes, densely vegetated parks and open areas, in one territory [6], [52], as shown in Fig. 2; and (ii) it is one of the testbed locations for 5G NR network deployment, besides Cyberjaya and several other major cities [53], [54]. However, at the time of this study, no 5G NR mobile network was officially operating for public consumers. Therefore, RSRP measurements could only be performed on the existing 4G LTE networks. Although RSRP has different characteristics in 4G LTE and 5G NR networks, its purpose is the same: the UE periodically measures RSRP to perform cell selection/reselection and handover [30]. Since the development of the prediction model is based on a modular approach, it can be easily extended to 5G NR network parameters in the future.
The hardware and software involved in the measurement campaign are highlighted in Table 2, while Fig. 3 shows the data types and the relationship between the platforms in the measurement. To minimize the fast-fading effect, especially that caused by Doppler shift, the measurement campaign was conducted at a vehicle speed below 40 km/h [21].
Before dataset preparation activities are executed, the drive test data needs to be cleaned. The data collected under static conditions are removed to ensure the data are free from errors [57]. After the data cleaning process was completed, UE location information in decimal degree format was extracted into a .csv file. It was used as the reference input for generating the following parameters: (i) 2D distance between eNodeB (eNB) and UE; (ii) height of the eNB antenna above sea level (ASL); (iii) height of the UE ASL; (iv) angle of the signal from the eNB to the position of the UE (tilt); and (v) signal propagation path status between eNB and UE. These parameters were created using a web-based radio planning tool called CloudRF [58]. The terrain and clutter reference data in CloudRF have a resolution of 10 meters. In addition, CloudRF integrates high-resolution 3D building information from OpenStreetMap [59].
As shown in (1) and (2), two additional parameters were derived from the above-generated parameters, i.e., the 3D separation distance between eNB and UE ($D_3$) [22], [25], [26] and the height ratio between the eNB antenna and the UE ($H_R$) [28]. Based on [60], the calculation of these two parameters as applied in this study is illustrated in Fig. 4, where $T_1$ and $T_2$ represent the height of the eNB antenna and the height of the UE at ASL, respectively, while $D_2$ is the 2D separation distance between eNB and UE. Therefore, $D_3$ is defined by:

$$D_3 = \sqrt{D_2^{2} + (T_1 - T_2)^{2}} \tag{1}$$

while $H_R$ was calculated using the following formula:

$$H_R = \frac{T_1}{T_2} \tag{2}$$

The signal propagation path status between eNB and UE was labeled as '0' to represent no obstacle present (LOS), while '1' represents the NLOS condition. This parameter was labeled as Obstacle (Obs). The input parameters mentioned above are further summarized in Table 3 for a clear explanation. The selection of these five parameters as input data was based on the knowledge of electromagnetic wave propagation, which has been applied in previous works (as summarized earlier in Table 1). The $F_q$ and RSRP were extracted directly from the measurement campaign data.
Before implementing ML model training and testing activities, outliers in $D_3$, $H_R$, Tilt, and RSRP were identified and removed. This process was done using interquartile range (IQR) analysis on the 21,323 datasets, as shown in Fig. 6. Finally, a total of 18,048 cleaned datasets were prepared and ready for the ML model training and testing process. Overall, the above process is summarized and illustrated in the flow chart shown in Fig. 5.
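The derivation of $D_3$ and $H_R$ and the IQR-based outlier removal described above can be sketched as follows. This is a hedged illustration rather than the procedure actually run in this study: the function names, the dictionary-based row format, and the conventional 1.5 × IQR fence are assumptions, since the fence multiplier and the quartile method are not stated in the text.

```python
import math
import statistics


def distance_3d(d2, t1, t2):
    """3D eNB-UE separation from the 2D distance d2 and the ASL heights t1 (eNB) and t2 (UE)."""
    return math.sqrt(d2 ** 2 + (t1 - t2) ** 2)


def height_ratio(t1, t2):
    """Ratio of the eNB antenna height to the UE height, both above sea level."""
    return t1 / t2


def iqr_bounds(values, k=1.5):
    """Lower and upper fences Q1 - k*IQR and Q3 + k*IQR (k = 1.5 is an assumed convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(rows, columns, k=1.5):
    """Keep only the rows whose value lies inside the IQR fences for every listed column."""
    bounds = {c: iqr_bounds([row[c] for row in rows], k) for c in columns}
    return [row for row in rows
            if all(bounds[c][0] <= row[c] <= bounds[c][1] for c in columns)]
```

With rows stored as dictionaries keyed by hypothetical column names such as `"D3"`, `"HR"`, `"Tilt"` and `"RSRP"`, a call like `remove_outliers(rows, ["D3", "HR", "Tilt", "RSRP"])` would mirror the cleaning step on those four variables.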

IV. MODEL TRAINING AND VALIDATION
The MATLAB 2020a Regression Learner and Neural Net Fitting applications were used to train and validate the ML-based RSRP predictive model. The simulation was performed on a 10th-generation Intel Core i5 laptop with onboard Radeon graphics. The regression learners utilized in this study are summarized in Table 4. All these learners were examined using 10-fold cross-validation (CV), a resampling method that splits the dataset into ten portions and repeats the train-and-test procedure over different iterations to avert overfitting, following similar work done by the authors in [18], [22], [25]. The parallel training function was disabled to optimize the machine processing power for each examined algorithm/kernel. Meanwhile, the Neural Net Fitting application used a two-layer feed-forward neural network with 40 hidden neurons, as shown in Fig. 7. 70% of the dataset was used for training, and the remaining 30% was split evenly between validation and testing. Different training algorithms were examined, namely Levenberg-Marquardt, Bayesian Regularization and Scaled Conjugate Gradient.
To examine and validate the performance of each ML model, it is important to assess the statistical error between the measured and the predicted RSRP values. RMSE, as shown in (3), is a commonly used metric to evaluate the performance of regression prediction models. It is given, in decibels [22], by:

$$RMSE = \sqrt{\frac{1}{n_{sample}} \sum_{i=1}^{n_{sample}} \left(y_i - \hat{y}_i\right)^{2}} \tag{3}$$

where $n_{sample}$ is the total number of samples, $y_i$ is the actual value, and $\hat{y}_i$ is the predicted value. Smaller values of RMSE indicate a better prediction by the ML model. According to [26], predictive models with RMSE values of less than 7 dB are considered acceptable, especially in an urban environment.
On the other hand, we used the coefficient of determination ($R^{2}$), as shown in (4), to reveal the degree of performance of the prediction models. It describes how well the input parameters in the model explain the variability of the response variable. The larger the $R^{2}$ value, the more variability is explained by the model. It is given by [28]:

$$R^{2} = 1 - \frac{\sum_{i=1}^{n_{sample}} \left(y_i - \hat{y}_i\right)^{2}}{\sum_{i=1}^{n_{sample}} \left(y_i - \bar{y}\right)^{2}} \tag{4}$$

where $\bar{y}$ is the mean of the actual values and the remaining symbols are as defined for (3).
Other important factors observed while validating the model are the training time and the prediction speed, which indicate the level of complexity and efficiency of the model. Therefore, a predictive model with a balanced performance between RMSE, $R^{2}$, training time, and prediction speed is highly recommended as the most practical ML-based model for developing a rigorous RSRP prediction model for multiple bands and multiple environments.
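To make the validation procedure concrete, the sketch below combines 10-fold CV with the RMSE and $R^{2}$ metrics in plain Python. It is only an approximation of what the MATLAB applications do: the function names, the shuffling seed, the `fit` callback interface, and the choice to score pooled out-of-fold predictions (rather than averaging per-fold scores) are all assumptions made for illustration.

```python
import math
import random


def rmse(actual, predicted):
    """Root-mean-square error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / n)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (undefined if actual is constant)."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, folds[i]


def cross_validate(fit, xs, ys, k=10, seed=0):
    """Collect out-of-fold predictions, then score them once with RMSE and R^2.

    `fit(train_xs, train_ys)` is a hypothetical callback that returns a
    prediction function for a single input.
    """
    predictions = [None] * len(xs)
    for train_idx, test_idx in k_fold_indices(len(xs), k, seed):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        for i in test_idx:
            predictions[i] = model(xs[i])
    return rmse(ys, predictions), r_squared(ys, predictions)
```

Because every sample is predicted by a model that never saw it during training, the resulting scores estimate performance on unseen data, which is the purpose of CV stated above.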

V. RESULTS AND DISCUSSIONS
The detailed performance of the examined ML models is shown in Table 5 and Table 6. Subsequently, the best contender of each model category was selected and plotted in a Pareto chart, as shown in Fig. 8.
Based on the results, we can conclude that the GPR model outperforms the others in terms of RMSE and $R^{2}$. This is because GPR, a non-parametric kernel-based probabilistic model, can handle small and large datasets with fewer errors, even in high-dimensional space. GPR is able to learn from the overall distribution of the datasets and properly tune the hyperparameter settings to produce smooth results. However, GPR comes with drawbacks such as high computational power requirements and longer processing time, especially when dealing with large and imbalanced datasets. Imbalanced datasets arise when a significant disproportion within the data causes an unequal class distribution.
The next-best RMSE and $R^{2}$ values are achieved by the ET model with the Bagged Trees algorithm (also known as RF). RF is faster than GPR in terms of processing time. Compared with Exponential-GPR, which is the most balanced model in the GPR category, RF is 85 times faster to train. It also predicts at around 72,000 observations per second, roughly 14 times the rate of about 5,100 observations per second achieved by Exponential-GPR. This is because RF, which is based on the decision tree method, is simpler to implement than GPR models.
The model category ranked third in this study was RT, in which the Medium Tree algorithm showed the most balanced performance compared with the other RT algorithms. The RMSE and $R^{2}$ values achieved were 6.46 dB and 0.67, respectively. Although both RT and RF are based on the decision tree technique, the bagging mechanism in RF produces better prediction accuracy than RT, which relies on the prediction result of a single decision tree.
Fourth place was held by SVM. Although SVMs are kernel-based models like GPR that can manipulate the datasets in different dimensional spaces, the best predictive performance observed in this study is only 6.62 dB for RMSE and 0.66 for $R^{2}$. This is possibly due to the limitation of the SVM model, which is less efficient when dealing with large datasets.
Fifth place was occupied by the ANN models, whose best RMSE and $R^{2}$ results were 6.82 dB and 0.64, respectively. Although the ANN RMSE is only 0.64 dB higher than that of RF, the ANN requires 12.6 times longer to train. This is due to the complex data processing running in the hidden layers. Typically, improving the accuracy requires increasing the number of hidden layers, but this worsens the training time and prediction speed. In some cases, increasing the number of hidden layers does not improve the prediction accuracy and may even degrade it by harming the model's generalization capability. As a result, the model cannot make accurate predictions when a new input dataset is introduced. Therefore, increasing the number of hidden layers in the ANN needs to be done with caution.
Lastly, the best result among the LR models is 8.65 dB for RMSE and 0.41 for $R^{2}$. These results directly indicate that the LR model is unsuitable for developing the ML-based RSRP predictive model. This is possibly due to the limitation of LR-based models, which are not suitable for problems with non-linear relationships.
Concerning the study's objective of generating a more flexible and faster prediction model, RF proved to be the most practical option for developing an optimal ML-based RSRP predictive model. To further enhance the RF model's prediction capabilities, the optimum hyperparameter settings must be identified. To achieve this, the Optimizable Ensemble function was executed, and the result is shown in Fig. 8. The newly generated values of RMSE and $R^{2}$ are 5.74 dB and 0.74, respectively. Thus, the RMSE of the RF model decreased by 0.44 dB and its $R^{2}$ increased by 0.04. The optimum hyperparameter settings for the RF model are 495 learners, a minimum leaf size of two, and three predictors to sample.
In Fig. 9, we compare the RF model prediction performance with the previous work done by [22], [25]. The newly generated values of RMSE and $R^{2}$ of the RF model in this study are comparable to the values in the previous works. Although [22] uses more diverse input parameters with a massive dataset, it does not show a significant difference in prediction accuracy and model variability. In some cases, employing too many input parameters makes the ML learning algorithm more complex and degrades the model's generalization capabilities. In addition, outlier detection and removal must be executed properly before proceeding to the ML training process, especially when dealing with a massive number of datasets. Meanwhile, the lack of influential input parameters in [25] limits the actual capability of the RF algorithm to produce optimal predictive performance. This is because some other influential input parameters, such as the signal propagation status (LOS/NLOS), are very significant in generating an accurate predictive result. Therefore, based on these comparison results, we can conclude that the five input parameters utilized in this study are adequate to produce an optimal RSRP prediction model.

VI. OPTIMIZATION
In this section, optimization approaches for existing 4G LTE networks using the developed prediction model are presented. Certain parameters, such as the height and tilt angle of the eNB antenna, are adjusted to forecast the improvement that can be achieved on poor RSRP readings. Poor RSRP readings from Cell ID 11, eNB 131163 are utilized as a study sample, as shown in Fig. 11. A total of 45 out of 1,260 point locations were identified as receiving poor RSRP readings. Table 7 shows the original and new settings of Cell ID 11, eNB 131163. Over 100 prediction iterations were performed per configuration parameter using the proposed model to obtain these new settings. Accordingly, different antenna height and tilt configuration values were utilized in each prediction iteration. The configuration values that resulted in the best coverage were then selected as the optimal new settings. Model input parameters such as $D_3$, $H_R$ and Tilt must be adjusted accordingly, while $F_q$ and Obs are considered unchanged.
As a result, an improvement in RSRP readings is shown in Table 8 and Fig. 12. It can be concluded that adjusting both the height and the tilt angle of the antenna is able to improve the level of received signal strength at 39 of the 45 point locations (86.7%) that experienced poor RSRP levels.
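The search over antenna height and tilt described in this section can be sketched as a simple exhaustive grid search. The code below is a hypothetical illustration, not the procedure used in this study: `predict_rsrp` stands in for the trained RF model (including the recomputation of $D_3$, $H_R$ and Tilt for each candidate setting), and the RSRP threshold separating poor from acceptable readings is an assumed parameter because its value is not stated in the text.

```python
from itertools import product


def search_antenna_settings(predict_rsrp, locations, heights, tilts, threshold_dbm):
    """Return (covered, height, tilt) for the setting that covers the most locations.

    predict_rsrp(location, height, tilt) is a placeholder for the trained
    prediction model; a location counts as covered when its predicted RSRP
    is at or above threshold_dbm. Ties keep the first setting encountered.
    """
    best = None
    for height, tilt in product(heights, tilts):
        covered = sum(1 for loc in locations
                      if predict_rsrp(loc, height, tilt) >= threshold_dbm)
        if best is None or covered > best[0]:
            best = (covered, height, tilt)
    return best
```

An exhaustive search is affordable here because the prediction speed of the RF model is high; a slower model would make evaluating every height and tilt combination at every poor location far more costly.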

VII. CONCLUSION
This paper presented and examined several ML model categories with various algorithms/kernels that aim to predict mobile network performance through RSRP in multiple bands and multiple environments. For this purpose, ML models including LR, ANN, SVM, RT, ET, and GPR were applied and evaluated. The models were trained using data from measurement campaigns carried out on 4G LTE frequency bands in Malaysia, i.e., 1800 MHz and 2600 MHz, in diverse environments around Putrajaya. The results showed that the GPR model is the most accurate model for RSRP prediction in terms of RMSE and $R^{2}$. However, due to its considerable drawback in training time and prediction speed, the second-best model, the RF model, is highly recommended as the most practical ML-based model for developing a rigorous RSRP prediction model for multiple bands and multiple environments. At the end of the article, optimization approaches for existing 4G LTE networks were demonstrated by utilizing the capability of the developed ML-based RSRP prediction model. Future work is to train and test the RF algorithm with real 5G NR measured data using the same approach. Furthermore, the influence of spectrum bandwidth and of the UE position with respect to the front, side, and back lobes of the antenna radiation pattern will be explored. In addition, the correlation with other network key performance parameters, such as RSRQ, will be studied and tested in future work.