Enhanced Wireless Channel Estimation Through Parametric Optimization of Hybrid Ray Launching-Collaborative Filtering Technique

In this paper, an enhancement of a hybrid simulation technique that combines collaborative filtering with a deterministic 3D ray launching algorithm is proposed. Our approach implements a new data depuration methodology for low definition simulations to reduce the number of noisy simulation cells. This is achieved by processing the maximum number of permitted reflections and applying memory-based collaborative filtering with a nearest neighbors approach. The depuration of the low definition ray launching simulation results consists of discarding the estimated values of cells reached by a number of rays lower than a set threshold. Discarded cell values are considered noise owing to the high error they exhibit when compared with high definition ray launching simulation results. Thus, by applying the collaborative filtering technique to both empty and noisy cells, the overall accuracy of the proposed methodology is improved. Specifically, the size of the data collected from the scenarios was reduced by more than 40% after identifying and extracting noisy/erroneous values. In addition, despite the reduced amount of training samples, the new methodology provides an accuracy gain above 8% when applied to the real-world scenario under test, compared with the original approach. Therefore, the proposed methodology provides more precise results from a low definition dataset, increasing accuracy while exhibiting lower complexity in terms of computation and data storage. The enhanced hybrid method enables the analysis of larger complex scenarios with high transceiver density, providing coverage/capacity estimations in the design of heterogeneous IoT network applications.

relations and hence overall quality of service metrics, mainly constrained by interference. Deterministic wireless channel modelling techniques, such as ray launching or ray tracing, provide accurate results both in terms of received power estimation and time-dependent variables (e.g., power delay profiles or delay spread distributions). The main drawback of those techniques is the potentially large computational cost, which is dependent on scenario size, consideration of detailed scenario topology and the inclusion of additional effects, such as diffraction or diffuse scattering [3], [4].
Recently, approaches based on artificial intelligence have been explored in the field of electromagnetic analysis, in areas such as EM scattering, inverse scattering, direction of arrival estimation, radar and remote sensing [5], as well as in other network-oriented aspects, such as dynamic resource allocation algorithms [6]. In the case of wireless channel modelling, several channel estimation works have been presented in relation to: massive MIMO systems [7], node distribution in wireless sensor networks [8], machine learning assisted path loss prediction [9], an empirical connectivity model for an extended environmental monitoring network optimized by machine learning on an extensive data set [10], a low altitude propagation model based on a machine learning approach [11], or an enhanced empirical propagation model combined with machine learning techniques, derived from extensive measurement sets in the UHF band and focused on coverage analysis of DTV systems [12], among others. Future trends are foreseen for upcoming beyond-5G systems, in which multi-state, multi-dimensional networks can be analyzed and optimized with the aid of quantum machine learning techniques [13]. Different applications within wireless channel characterization, resource allocation optimization or system level enhancement [14]-[20] are presented in Table 1.
Deterministic methods can provide accurate wireless channel estimations for complex scenarios with a high density of constitutive elements, particularly in the case of indoor scenarios. However, as previously stated, accuracy comes at the cost of high computational load. This cost is driven mainly by the precise definition of the elements within the scenario under consideration and, in the case of volumetric approaches, by the discretization level of the physical propagating wave front into the equivalent set of rays within the defined solid angle. In order to reduce computational complexity, several approaches have been proposed, based on the combination of deterministic Ray Launching techniques with other methods, such as neural networks and the electromagnetic diffusion equation [21], [22]. This has given rise to hybrid simulation, an approach that has also provided improved results in tasks such as tracking in nonlinear systems, supported by fuzzy systems [23]. In [24], we proposed the combination of an in-house 3D Ray Launching (3D RL) code with collaborative filtering (CF) recommender systems [25]. The main idea was to use the ability of CF methods to predict ratings and thereby infer the values of empty cells in matrices obtained from low definition (LD) simulations, reducing the computational complexity with respect to high definition (HD) simulations. The proposed methodology was applied to the RF power distribution in the complete volume of several scenarios, and it could be extended to other parameters as well.
In this article, we present an optimization methodology to decrease computational cost, based on the analysis of the permitted maximum number of rebounds (NR) of the launched rays, initially described in [26]. The study of multiple simulation databases (created with different NR values) enables the analysis of the minimum number of rays per simulation cell, which in turn reduces estimation errors in the LD to HD result association phase. The proposed approach provides increased accuracy, while reducing computational cost with respect to HD simulations. The new method increases efficiency by discarding the results of those cells in which the total number of rays detected is lower than a set value (the minimum number of incident rays per cell), which acts as an effective threshold. The corresponding threshold is obtained by means of 3D ray launching simulation analysis. The discarded values exhibit high error compared with high definition ray launching simulations and are therefore equivalent to noise. The application of collaborative filtering techniques then significantly improves the overall accuracy.
The rest of the article is organized as follows: Section II provides background on recommender systems and collaborative filtering. Section III describes the validation of the 3D RL simulation tool against wireless channel measurement results. Section IV describes the hybrid CF-3D RL methodology with the optimized NR parameter analysis. Section V presents the experimental validation, ending with the concluding remarks.

A. RECOMMENDER SYSTEMS AND COLLABORATIVE FILTERING
Nowadays, data management is facing a paradigm shift due to the widespread adoption of cyber-physical systems, which will increase the volume of data and change the way data are exchanged [27]. As a consequence, real-time services face several changes due to increased regulations, client demands and big data challenges [28]. In this context, automatic recommendation systems [28] are gaining momentum due to their inherent characteristics, which provide manageable and personalized information to users [30], [31]. Collaborative filtering [24] encompasses disparate recommendation methods and is nowadays the most widely used technique due to its adaptability to the input data. CF relies on the assumption that users who share similar behavior/experience in specific topics will have similar tastes or interests according to some quantifiable metric. Usually, the relationships between users and items are stored in the form of n × m matrices (i.e., n users and m items), where each cell (i, j) stores the evaluation of user i on item j. Fig. 1 shows an example of such data representation.
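As a toy illustration of this matrix representation, the following Python sketch builds a small user-item matrix with unknown entries; the sizes and rating values are made up for the example and are not taken from the paper.

```python
import numpy as np

# Illustrative user-item rating matrix (n = 4 users, m = 5 items).
# Each cell (i, j) holds the evaluation of user i on item j;
# np.nan marks relationships that have not been observed yet.
ratings = np.array([
    [5.0, np.nan, 3.0, np.nan, 1.0],
    [4.0, 2.0,    np.nan, 1.0, np.nan],
    [np.nan, 5.0, 4.0, np.nan, 2.0],
    [1.0, np.nan, np.nan, 4.0, 5.0],
])

# CF aims at inferring the np.nan entries from the observed ones.
print(np.isnan(ratings).sum(), "unknown cells out of", ratings.size)
```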
The literature classifies CF methods into three main categories according to the data they manage [24], [32]: (i) memory-based methods, which use all the available data about users, items and relationships; (ii) model-based methods, which create a model (e.g., by using machine learning, dimensionality reduction or statistical models) from the complete set; and (iii) hybrid methods, which incorporate other data sources (e.g., social networks, demographic data). Nevertheless, despite the benefits provided by CF methods, there are several challenges that such systems need to face, the most acute being cold start, scalability, sparseness and privacy issues [32]-[36]. For more on CF, we point the interested reader to [32], [37] for a review of the state of the art and the most relevant advances and trends.
In this paper, we adopt the most well-known memory-based CF variant with the k-nearest neighbors approach (KNN), in which similarities between users are computed according to a metric (e.g., Euclidean distance, cosine similarity) to find their closest neighbors (i.e., their most similar profiles). Therefore, given a pattern with inconsistent/erroneous values, we select its k most similar patterns (according to a ground truth database) to infer/predict the RF power level [38], [25].
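A minimal Python sketch of this memory-based KNN prediction idea is given below; the toy database, the pattern values and the choice of plain Euclidean distance with an unweighted mean are illustrative assumptions, not the exact configuration used in [24].

```python
import numpy as np

def knn_predict(target, patterns, k=3):
    """Memory-based CF with a k-nearest-neighbours approach (sketch).

    `target` is a pattern with np.nan in the positions to infer;
    `patterns` is the ground-truth database (rows without missing values).
    Similarity is the Euclidean distance restricted to the known cells.
    """
    known = ~np.isnan(target)
    # Distance of every stored pattern to the target, using known cells only.
    dists = np.linalg.norm(patterns[:, known] - target[known], axis=1)
    neighbours = patterns[np.argsort(dists)[:k]]
    # Missing cells are predicted as the mean value of the k closest patterns.
    prediction = target.copy()
    prediction[~known] = neighbours[:, ~known].mean(axis=0)
    return prediction

# Example: infer the RF power level of one empty cell in a flattened pattern.
db = np.array([[-60., -62., -65., -64.],
               [-55., -57., -59., -58.],
               [-70., -71., -74., -73.]])
query = np.array([-61., -63., np.nan, -64.])
print(knn_predict(query, db, k=2))
```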

III. 3D RAY LAUNCHING TOOL VALIDATION
As stated previously, an in-house 3D Ray Launching algorithm has been used in this work to predict radio wave propagation in a complex indoor environment. The algorithm is a geometry-based deterministic approach in which different parameters can be considered as inputs, namely the number of reflections, operation frequency, transmitted power, bit rate, angular and spatial resolution, and the radiation pattern of the considered antennas. A detailed 3D scenario is created considering all the obstacles within it, by means of the conductivity and relative permittivity of all the materials at the frequency of operation of the system. The main drawback of these methods is their high computational complexity due to the analysis of the full 3D space. To overcome this problem, several articles in the literature present convergence analyses of different approaches in terms of the number of reflections or the launched ray density. In [39], a quasi-analytical ray propagation model to obtain the RF field within an aircraft cabin is proposed. The convergence analysis of the algorithm is presented in terms of rays' propagation time and number of bounces, showing that the high content of metallic parts inside the cabin leads to a slow rate of convergence. Another study in [40] presents a hybrid method of GO/PO and the physical theory of diffraction, where the dependence of field convergence on the maximum number of reflections is investigated. The work in [41] proposes ray density normalization within a novel stochastic ray launching approach to accurately predict signal levels in curved geometries. In [42], a mixed ray launching/tracing propagation modelling method for large areas is proposed, analyzing the convergence in terms of the number of reflections and diffractions over such areas. The study in [43] analyzes the number of rays needed to achieve convergence in a ray tracing approach for RCS modeling of large complex objects.
In the same way, the convergence analysis in terms of launching rays and number of reflections has been performed for the in-house 3D ray launching algorithm, and it is presented in [44], showing the optimal parameters to be used in the algorithm to achieve good accuracy with affordable computational time. Considering these parameters, in this section, the 3D RL validation is presented for a scenario under test (SUT).
The SUT is 'Laboratori 231', an indoor scenario of 8.8 m × 4.7 m × 3.7 m, located at the School of Engineering of the Rovira i Virgili University, in Tarragona, Catalonia, Spain (cf., Fig. 2). As can be seen, the scenario is a small conference hall in which mainly tables and chairs are present. Due to its reduced size and the high density of obstacles, the scenario is a complex one in terms of multipath propagation components. In order to perform the validation of the proposed 3D RL tool, the SUT has been modelled for its simulation (cf., Fig. 3 and Fig. 5). The dimensions of the scenario, the shapes and sizes of the elements within, and their material properties have been set as close as possible to the real scenario. Table 2 shows the dispersive material properties used in the simulations.
The 3D RL parameters used are summarized in Table 3. Note that the parameters have been chosen to match the equipment employed in the measurement campaign: the transmitter was an XBee mote (ZigBee) with a whip antenna, and the receiver was a monopole antenna (Titanis 2.4 GHz Swivel SMA Antenna from Antenova) coupled to an Agilent FieldFox N9912A spectrum analyzer. The location of the transmitter (red dot) and the measurement points (green dots) are represented in Fig. 3. Fig. 4 shows the comparison between the measured RF power level and the estimated values obtained by the 3D RL algorithm for 12 different measurement points. As expected, the results show good agreement, with a mean error of 0.23 dB and a standard deviation of 2.43 dB. Therefore, the simulation tool is considered validated for this SUT and can be used to perform the hybrid ray launching-collaborative filtering study proposed in this paper.
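For completeness, the agreement statistics above follow from a point-wise comparison of measured and simulated values, as in the sketch below; the arrays are placeholders, not the actual data of the measurement campaign.

```python
import numpy as np

# Hypothetical measured vs. 3D RL estimated RF power levels (dBm) at the
# 12 measurement points; the values below are placeholders only.
measured  = np.array([-52.1, -55.3, -57.8, -60.2, -58.4, -61.0,
                      -63.5, -59.7, -56.2, -62.8, -64.1, -58.9])
simulated = np.array([-51.4, -56.0, -55.9, -61.3, -57.2, -62.5,
                      -60.8, -60.1, -57.0, -64.2, -62.3, -59.4])

errors = measured - simulated
print(f"mean error = {errors.mean():.2f} dB, std = {errors.std(ddof=1):.2f} dB")
```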

IV. METHODOLOGY
The proposed optimization is applied to the LD simulations performed with the presented 3D RL algorithm, both for the creation of the DBs and for the simulation of the SUT, to which the complete improved methodology for obtaining the RF power distribution is applied.
The permitted maximum NR in a 3D RL simulation gives the maximum number of interactions between the launched rays (from the transmitter) and the obstacles within the scenario. This parameter is fixed by the user before the simulation. The NR affects the accuracy of the obtained results, with accuracy increasing as this number grows. However, a higher NR implies more calculation time. Besides, the accuracy of the results tends to converge at a specific NR, which for the kind of indoor scenarios evaluated is six rebounds. In other words, for NR > 6 the accuracy improvement is negligible, but the required calculation time increases significantly. As an illustrative example of the effect of the permitted maximum NR, Fig. 5 shows the RF power distribution estimations obtained at a height of 2 m in the SUT, for NR = 0 (LD1), NR = 2 (LD3), NR = 4 (LD5) and NR = 6 (LD7). In the previous methodology [24], we built DBs using LD simulations with NR = 3. However, in this new approach, we use a set of indoor scenarios, described in Section IV.A, to create DBs from NR = 0 to NR = 6. Each scenario has been simulated in LD for the seven cases.
NR affects the accuracy of LD simulation results because it influences the number of rays that reach each simulation cell: the smaller the NR, the smaller the number of rays that reach each cell, and the higher the number of cells that are not reached by any launched ray. These empty cells (i.e., cells without an RF power level) in LD simulations are filled by our 3D RL-CF method. However, empty cells are not the only problem. We detected non-empty cells that lessen the accuracy of the 3D RL-CF results. These cells usually exhibit a very low RF power level, because they have been reached by a very low number of rays during the simulation. To address this problem, we propose a depuration method, explained in Section IV.C, to suppress cells with inaccurate values. A comparison between the previous and the new methodologies is shown in Fig. 6. The depuration process is highlighted in orange.
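A minimal sketch of the depuration idea is shown below, assuming the LD RF power grid and the per-cell ray count are available as arrays; the variable names and values are illustrative.

```python
import numpy as np

def depurate(power, rays, r_min):
    """Mark cells reached by fewer than r_min rays as empty (np.nan),
    so that the CF stage later infers them together with the truly
    empty cells. Sketch under the assumptions of this section."""
    cleaned = power.copy()
    cleaned[rays < r_min] = np.nan
    return cleaned

# power: LD RF power estimates per cell; rays: number of incident rays per cell.
power = np.array([[-60., -95., np.nan], [-58., -61., -99.]])
rays  = np.array([[12,    1,    0],     [ 9,    6,    2]])
print(depurate(power, rays, r_min=4))
```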

A. DATA COLLECTION - DATABASE CREATION
Ten scenarios have been defined to build the DBs needed to apply the proposed methodology. All scenarios are indoor and similar in terms of morphology and density, as in [3]. In this case, a different PC has been used for the simulations: Intel(R) Core(TM) i5-4690 CPU @ 3.50 GHz, with 32 GB RAM. The features of the scenarios used to build the DBs are summarized in Table 4. The Size column shows the length (l), width (w) and height (h) of the scenario, and the Density column shows the percentage of the scenario volume that is occupied by objects (i.e., not air). It is important to note that the features of the scenarios used to build the DBs limit the scenarios that can be analyzed. As can be seen in Table 4, the chosen SUT features lie within the limits of the DB scenarios. Moreover, the presented methodology can be applied to any scenario as long as the DBs are built according to the required features.
Each scenario has been simulated in LD using 3D RL with several values for NR (from 0 to 6). A total of 70 simulations have been used to create seven DBs (i.e., {DB_LDi, ∀i ∈ [1, 7]}), each containing the results of the simulations with a given NR value. Fig. 7 shows the simulation time for all cases.
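The database creation step can be summarized with the following sketch, where `run_ld_simulation` is a stand-in stub for the in-house 3D RL tool (not a real API) and the returned grids are dummy data.

```python
import numpy as np

def run_ld_simulation(scenario, max_rebounds):
    """Placeholder for the in-house 3D RL tool: returns a dummy RF power
    grid so the grouping logic below can run; not the real simulator."""
    rng = np.random.default_rng(hash((scenario, max_rebounds)) % 2**32)
    return rng.uniform(-90, -40, size=(10, 10))

scenarios = [f"scenario_{s:02d}" for s in range(1, 11)]

# 10 scenarios x 7 NR values = 70 LD simulations, grouped into DB_LD1..DB_LD7.
db_ld = {f"DB_LD{nr + 1}": [run_ld_simulation(s, nr) for s in scenarios]
         for nr in range(7)}
print({name: len(sims) for name, sims in db_ld.items()})
```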

B. DATA AGGREGATION FOR R_MIN COMPUTATION
In this section, we detail the cumulative distributions obtained for all scenarios in terms of the number of rays R_i per cell, the MAE and the sparseness. Note that we represent the aggregate numbers after applying LD simulations using NR values from 0 to 6. These data are used to determine the value of R_min in Section IV.C. Fig. 8 shows the aggregate distribution of the number of cells reached by a specific number of rays R_i. We can observe that, as expected, the higher the number of rays, the lower the number of cells. This outcome is related to the one depicted in Fig. 9, which shows the cumulative percentage of sparseness of the scenarios for all R_i values. For example, for R_i = 0 the sparseness is 0%, and thus all cells have at least this number of incident rays. For R_i = 1, only 69% of all cells satisfy R_i ≥ 1. For R_i = 30, the sparseness is above 97%, which means that less than 3% of the cells registered R_i ≥ 30. These outcomes show that most of the cells of the system have a low R_i. Considering that R_min is set to 4, as later discussed in Section IV.C, this means that on average we discard more than 60% of the values (cf. Fig. 9) due to their poor quality when performing LD simulations. In this regard, Fig. 10 shows the mean absolute error (MAE) of all values as a function of R_i. Clearly, the higher the R_i, the lower the error in the values. However, as seen in Fig. 9, selecting only cells with high R_i would lead to high sparseness, hindering the prediction process.
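The cumulative sparseness curve of Fig. 9 can be expressed as in the following sketch, where the ray counts are synthetic and only serve to illustrate the computation.

```python
import numpy as np

def sparseness_curve(rays_per_cell, max_rays=30):
    """Cumulative sparseness: for each candidate threshold r, the
    percentage of cells that would be discarded because R_i < r.
    Sketch using synthetic ray counts."""
    total = rays_per_cell.size
    return {r: 100.0 * np.count_nonzero(rays_per_cell < r) / total
            for r in range(max_rays + 1)}

# Synthetic example: most cells receive few rays, a few receive many.
rng = np.random.default_rng(0)
rays = rng.geometric(p=0.25, size=10_000) - 1     # counts starting at 0
curve = sparseness_curve(rays)
print({r: round(curve[r], 1) for r in (0, 1, 4, 30)})
```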

C. DATA DEPURATION PROCESS
We state that if the number of rays, R_i, passing through a cell, C_i, during the simulation is small, the value in that cell will be inaccurate. With the aim of identifying those non-empty cells with a small R_i, containing inaccurate values, we set a quality threshold for the minimum number of rays, R_min, that must pass through a cell for its value to be considered valid. R_min is set once for the whole volume of the scenarios.
It is worth emphasizing the difference between the number of rebounds, NR, and the number of rays, R_i, in a cell. Although they are closely related concepts, the former determines how many times rays may interact with objects in the simulation, while the latter counts the number of rays that pass through each cell.
In order to determine R_min, we analyze the error (i.e., the difference between the values obtained in LD and HD simulations) and the sparseness of the LD simulations for several values of R_min. First, we compute the error of all scenarios using the mean absolute error (MAE) as follows:

$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|LD_i - HD_i\right|$,

where n is the number of non-empty cells, LD_i is the value of the LD simulation for cell i, and HD_i is the value of cell i in the HD simulation. We only compare non-empty LD cells with their HD counterparts (i.e., empty cells are discarded). The error is classified according to the number of incident rays per cell to obtain an aggregated error for all simulations. Next, in order to compute the sparseness, we classify the cells according to their number of incident rays and count the percentage of cells that have at least a given number of incident rays. The R_min value is selected to balance the trade-off between sparseness and MAE. In a nutshell, we want to eliminate inaccurate cells, but we have to minimize sparseness to preserve enough patterns/information in the DBs. Note that a high R_min value means better quality simulation results (values) but a smaller number of patterns in the DBs. A shortage of patterns affects CF predictions, since it prevents the recommendation algorithm from finding proper analogies. A simple procedure to determine R_min is to find the discrete value at which both curves (i.e., error and sparseness) intersect. The result, depicted in Fig. 11, is that the quality threshold R_min should be set to 4. Therefore, in the depuration process, cells with R_i < R_min = 4 are discarded. Fig. 12 shows a graphical example of the refinement process.
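The selection of R_min can be sketched as follows; the data are synthetic and the intersection of the two curves is approximated numerically on normalized axes, so this is an illustration of the procedure rather than the exact implementation used to obtain Fig. 11.

```python
import numpy as np

def select_r_min(ld, hd, rays, candidates=range(1, 31)):
    """Select the quality threshold R_min as the trade-off between the
    MAE of the retained cells and the sparseness they would introduce.
    Minimal sketch: the intersection of the two curves is approximated
    by the point where their normalized values balance."""
    valid = ~np.isnan(ld)
    mae, sparseness = [], []
    for r in candidates:
        keep = valid & (rays >= r)
        mae.append(np.abs(ld[keep] - hd[keep]).mean())
        sparseness.append(1.0 - keep.sum() / valid.sum())
    # Normalize the error curve to [0, 1] and find where it crosses sparseness.
    mae = (np.array(mae) - min(mae)) / (max(mae) - min(mae) + 1e-12)
    sparseness = np.array(sparseness)
    return candidates[int(np.argmin(np.abs(mae - sparseness)))]

# Synthetic LD/HD planes: fewer incident rays -> noisier LD estimate.
rng = np.random.default_rng(1)
rays = rng.integers(0, 40, size=(50, 50))
hd = rng.uniform(-90, -40, size=(50, 50))
ld = hd + rng.normal(0, 10.0 / (rays + 1), size=(50, 50))
print("R_min =", select_r_min(ld, hd, rays))
```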

D. DATABASE SIZE COMPARISON - ORIGINAL VS. DEPURATED
Following the procedure described in [24], recommender/CF databases containing squared 2D patterns of several sizes q × q have been created from the simulations (cf., Section IV.A). We have created two sets of databases: one set, {DB_Manhattan}, obtained using the original simulation results, and another set, {DB_Rmin}, obtained using their depurated versions resulting from the application of the procedure described in Section IV.C, that is, discarding all the patterns that have one or more cells below the R_min incident-ray threshold. In each set, we distinguish 21 different databases depending on the size of the 2D pattern (i.e., 2×2, 3×3 and 4×4) and the NR of the LD simulations (i.e., from LD1 (NR = 0) to LD7 (NR = 6)). Fig. 13 shows a comparison between these two sets in terms of the number of generated patterns. Table 5 shows the aggregated results. The proposed depuration procedure reduces the size of the databases (i.e., the number of stored patterns) by more than 30% in all cases.
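A possible sketch of the pattern-database construction with and without depuration is given below; the non-overlapping tiling and the synthetic data are assumptions made for the example and may differ from the exact pattern extraction of [24].

```python
import numpy as np

def extract_patterns(power, rays, q, r_min=None):
    """Slice a 2D LD simulation plane into q x q patterns.
    If r_min is given, patterns containing any cell with fewer than
    r_min incident rays are discarded (the {DB_Rmin} variant);
    otherwise all patterns are kept (the {DB_Manhattan} variant)."""
    patterns = []
    rows, cols = power.shape
    for i in range(0, rows - q + 1, q):
        for j in range(0, cols - q + 1, q):
            block_rays = rays[i:i + q, j:j + q]
            if r_min is not None and (block_rays < r_min).any():
                continue
            patterns.append(power[i:i + q, j:j + q])
    return patterns

rng = np.random.default_rng(2)
power = rng.uniform(-90, -40, size=(20, 20))
rays = rng.integers(0, 12, size=(20, 20))
print(len(extract_patterns(power, rays, q=3)),
      "patterns kept without depuration vs",
      len(extract_patterns(power, rays, q=3, r_min=4)), "with R_min = 4")
```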

V. EXPERIMENTAL RESULTS OF THE PROPOSED ALGORITHM
The proposed depuration algorithm can be applied in two different phases: (Phase I) during the creation of the knowledge databases, to reduce the number of patterns and filter inaccurate values from the LD simulations, and (Phase II) before the prediction on the SUT simulation, to filter inaccurate values of the SUT LD simulation before applying the prediction algorithm. Depending on the phase(s) in which the depuration algorithm is applied, we can distinguish four cases:
• Case A: No depuration is applied (original approach [24])
• Case B: Depuration is applied only in Phase I
• Case C: Depuration is applied only in Phase II
• Case D: Depuration is applied in both phases
Our depuration algorithm reduces the number of patterns in the databases, contributing to a better efficiency of the overall solution (cf., Section IV.D). However, beyond efficiency improvements, in this section we assess the effect of the depuration algorithm on the prediction accuracy. To do so, we apply the 2D CF-RL hybrid method proposed in [24] with several prediction strategies (cf., Table 6) on the original simulation of the SUT and on its depurated counterpart.
Prediction strategies differ in the size of the 2D pattern (i.e., q ∈ [2, 4]) and in the database source (i.e., the original simulation results, {DB_Manhattan}, or their depurated counterparts, {DB_Rmin}). Each strategy is applied to the existing simulations obtained with different NR (i.e., from LD2 to LD7). LD1 (i.e., NR = 0) is discarded because its depurated database, DB_Rmin for LD1, is empty for any q ∈ [2, 4] (cf., Fig. 13). In each strategy, we set k = 100 as the maximum number of neighbours used to compute the prediction. The accuracy of the results is measured in terms of the MAE between our predictions and the HD simulation values. All results are shown in Table 7, highlighted in different colours depending on the case. The average MAE for each case is given in Table 8.
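The evaluation setup can be summarized with the following sketch, where the four cases are encoded as configuration flags and the accuracy metric is computed over predicted cells only; the prediction step itself (the 2D CF-RL method of [24]) is omitted, and the planes used below are synthetic placeholders.

```python
import numpy as np

def mae(predicted, reference):
    """Mean absolute error between predicted cells and the HD simulation,
    computed only over the cells for which a prediction exists."""
    mask = ~np.isnan(predicted)
    return np.abs(predicted[mask] - reference[mask]).mean()

# The four evaluation cases: whether depuration is applied when building
# the knowledge databases (Phase I) and/or to the SUT LD simulation
# before prediction (Phase II).
cases = {
    "A": dict(depurate_dbs=False, depurate_sut=False),   # original approach [24]
    "B": dict(depurate_dbs=True,  depurate_sut=False),
    "C": dict(depurate_dbs=False, depurate_sut=True),
    "D": dict(depurate_dbs=True,  depurate_sut=True),
}

# Illustrative check of the metric with synthetic planes.
hd = np.full((4, 4), -60.0)
pred = hd + np.random.default_rng(3).normal(0, 2.0, size=(4, 4))
print(cases)
print(f"example MAE = {mae(pred, hd):.2f} dB")
```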
It can be observed that using our depuration algorithm helps to reduce the error. The original approach [24], Case A, is the one with the worst performance, whilst Case D is the best performer. However, the differences are not substantial. Comparing cases A and B, and cases C and D, we observe that the effect of using the original simulations {DB_Manhattan} or the depurated ones {DB_Rmin} is minimal, slightly favoring the use of the depurated ones (i.e., −0.02 dB). Similarly, comparing cases A and C, and cases B and D, we can see the effect of using the original simulation of the SUT or its depurated counterpart. In this case, using the depurated SUT simulation is clearly better (i.e., −0.98 dB). It is worth noting that applying our depuration algorithm to the SUT increases its sparseness significantly (cf., Table 9). However, despite this apparent loss of information, the results exhibit a lower error, which supports the usefulness of our proposal not only in terms of efficiency but also in terms of error reduction. Regarding strategies, there is no clear evidence supporting the use of one or another, although there is a slight trend towards a lower error as the pattern size increases. Surprisingly enough, using simulations with a small NR (e.g., LD2, LD3) performs better than using those obtained with a larger NR (e.g., LD7). This occurs because the higher the NR, the more specific and characterised the scenario becomes, and finding similar patterns becomes more difficult. The obtained results therefore indicate that the proposed methodology maintains the power level estimation accuracy while reducing the database size. Consequently, computational complexity is reduced, enabling an increase in the computational volume of the scenario under test.

VI. CONCLUSION
We have proposed an optimization of our previous 2-dimensional RL-CF approach, based on the analysis of the NR parameter. We presented a methodology to obtain the R_min value, which is computed by finding a trade-off between the sparseness and the error between LD and HD values. The R_min value enables the depuration of simulations by removing invalid values, which reduces the DBs size and increases efficiency and accuracy. While the previous approach applied CF only to the empty cells of LD 3D RL simulations, the new method also discards the results of the cells reached by a number of rays lower than the set threshold R_min. These discarded cell values are considered noise due to the high error they exhibit when compared with high definition ray launching results. Thus, by applying the collaborative filtering technique to both empty and noisy cells, the overall accuracy of the proposed methodology is significantly improved. The main contributions of this article are summarised as follows: (i) we studied the NR parameter to observe its effect on LD simulations; (ii) we computed the R_min depuration value according to the sparseness and error values obtained for each simulated scenario considering different NRs; and (iii) we showed that our optimized, depuration-based proposal obtains faster and more accurate results when applied to both the databases and the SUT. Future work will focus on the clustering of scenarios according to their features and characteristics, so that scenarios with similar contexts will be grouped, enhancing the results, especially in simulations with high NR. The proposed methodology enables coverage/capacity analysis whilst reducing the computational cost for wireless systems in scenarios with high node density, morphological complexity and size. Note that the methodology can be applied to any kind of scenario by building and feeding the databases accordingly. The proposed methodology deals with frequency dependent power level characterization. A great deal of interest lies in the analysis of time delay characteristics and their effect on overall system performance [45]. The application of hybrid 3D RL + CF techniques to time domain parameters will also be explored as a future work line.