A Deep Learning and Geospatial Data-Based Channel Estimation Technique for Hybrid Massive MIMO Systems

This paper presents a novel channel estimation technique for the multi-user massive multiple-input multiple-output (MU-mMIMO) systems using angular-based hybrid precoding (AB-HP). The proposed channel estimation technique generates group-wise channel state information (CSI) of user terminal (UT) zones in the service area by deep neural networks (DNN) and fuzzy c-Means (FCM) clustering. The slow time-varying CSI between the base station (BS) and feasible UT locations in the service area is calculated from the geospatial data by offline ray tracing and a DNN-based path estimation model associated with the 1-dimensional convolutional neural network (1D-CNN) and regression tree ensembles. Then, the UT-level CSI of all feasible locations is grouped into clusters by a proposed FCM clustering. Finally, the service area is divided into a number of non-overlapping UT zones. Each UT zone is characterized by a corresponding set of clusters named as UT-group CSI, which is utilized in the analog RF beamformer design of AB-HP to reduce the required large online CSI overhead in the MU-mMIMO systems. Then, the reduced-size online CSI is employed in the baseband (BB) precoder of AB-HP. Simulations are conducted in the indoor scenario at 28 GHz and tested in an AB-HP MU-mMIMO system with a uniform rectangular array (URA) having 16x16=256 antennas and 22 RF chains. Illustrative results indicate that 91.4% online CSI can be reduced by using the proposed offline channel estimation technique as compared to the conventional online channel sounding. The proposed DNN-based path estimation technique produces same amount of UT-level CSI with runtime reduced by 65.8% as compared to the computationally expensive ray tracing.


I. INTRODUCTION
The next generation communication networks experience the severe challenge of an increasing number of connections between the base station (BS) and the dense-deployed user terminals (UTs). It brings a huge demand for high-speed data transmission despite a severe shortage in bandwidth resources. Massive multi-input multi-output (mMIMO) systems with an extremely large number of antenna elements can significantly improve the spectral and energy efficiency in millimeter-wave (mmWave) frequencies via focusing narrow and high-gain beams towards a small region and thus increase the overall data rate and channel capacity via spatial multiplexing [2]. For downlink transmission, multi-user mMIMO (MU-mMIMO) using hybrid precoding (HP) is prevalently deployed at the BS to tackle abovementioned challenges, where HP supports a large number of antenna elements via a concatenation of a radio frequency (RF) beamformer, few power-hungry RF chains and a lowdimension baseband precoder [3]- [5].
Configuring the hybrid precoder requires the complete knowledge of channel state information (CSI) for both RF beamformer and baseband precoder regarding a massive number of antenna elements. Considering the configuration parameters of the RF beamformer design, there are two possible strategies: i) using fast time-varying instantaneous CSI obtained by conventional online channel sounding [6]- [8], or ii) using slow time-varying CSI such as the angular information and the channel covariance matrix [9]- [14]. In the first approach where the full-size fast timevarying instantaneous CSI is utilized as input parameters for the hybrid precoder design, a large-size CSI training and feedback overhead is required during online connections. However, in the second approach where the RF beamformer is designed based on the offline slow time-varying CSI, only the small-dimensional fast time-varying instantaneous CSI for the baseband precoder is required for the hybrid precoder configuration.

A. RELATED WORKS
Channel estimation is essential for MU-mMIMO systems to perform the interference management and to ensure the reliable transmission with linear precoding. In [6], a novel channel estimation approach for the hybrid precoding in MU-mMIMO systems is proposed to estimate the angle information and gain information independently by decomposing the instantaneous CSI, where the RF beamformer is constructed based on the angle-of-arrival (AoA) information. Joint user scheduling and MU-MIMO HP algorithm for mmWave mMIMO systems are discussed in [7], where the RF beamformer in the BS adopts the channel matrix derived from instantaneous CSI via singular value decomposition (SVD). In [8], the size of the feedback CSI overhead for the RF beamformer in a hybrid processing scheme is practically reduced by quantising the phase entries with limited available bits while achieving satisfactory data rate performance. However, all above studies cannot completely remove the pilot and feedback overhead for the RF beamformer because they use the full-size instantenous CSI.
In terms of the second scheme with slow time-varying CSI, [12] presents a joint spatial division multiplexing (JSDM) technique that serves user locations that have similar channel covariance matrices at a group level. As the channel covariance matrix for each group changes extremely slowly compared to the coherence time of the instantaneous channel matrix, the required online CSI for the RF beamformer is notably decreased by exploiting the groupwise channel covariance information obtained via low protocol overhead. However, this scheme still relies on online channel sounding for the groupwise channel covariance matrix. In [9]- [11], an angular-based HP (AB-HP) technique is proposed for mMIMO systems based on user groups, where the input parameters of the RF beamformer are the groupwise angle-of-departure (AoD) information. However, this scheme simply assumes that the AoD information for each user group are known. Therefore, obtaining the slow time-varying CSI between the BS and UT groups without any knowledge of the online channel, especially estimating the angle information, is of great importance to fully drop the large-size CSI overhead required for the RF beamformer. In [15], the CSI between the BS and the UT (i.e., the directional properties of the channel) is estimated by firstly conducting real-life channel measurements in the environment for the original channel matrix and then extracting the angle information from it via space alternating generalized expectation (SAGE) maximization. The CSI has a high accuracy comparing to CSI obtained by online channel sounding, but the channel measurement campaign is extremely inflexible and time-consuming if the environment and antennas have any change. In [16], a channel estimation framework containing the deterministic ray tracing and the cluster extraction is presented. The deterministic ray tracing overcomes the inflexibility in changeable environment and does not need any knowledge of antennas, but it could be computationally complex if multiple propagation mechanisms are considered and could not achieve a accurate CSI if propagation mechanisms are not correctly assumed. Deep learning has been introduced in UT-level CSI acquisition because of its strong power to predict unknown variables from measured data. [17] leverages a feedforward neural network (FFNN) to predict the range of AoA and AoD for the dominant paths of each UT based on data directly collected at the BS. In [18], a deep neural network (DNN) based framework composed of a series of parallel multilayer classifiers is proposed for high-precision AoA/AoD estimation. However, none of them explores the relationship between the environment and the UT-level CSI. As the UT-level CSI depends on the layout of the environment, it is reasonable to generate the UT-level CSI using a data-driven artificial intelligence (AI) model learning from a large amount of three-dimensional (3D) geospatial data of the environment and the related propagation rules. Meanwhile, the COST 2100 stochastic channel model in [19] produces the clusterwise CSI in terms of UT-level CSI for densely deployed UTs within the scope of the entire area by clustering. Nonetheless, conventional clustering techniques such as the KPowerMeans [20], the kmeans [21] and the k-nearest neighbors [22], are not robust enough to imperfect UT-level CSI produced by the deep learning based technique. Fuzzy clustering such as UKmeans in [23] and the conventional fuzzy c-means (FCM) clustering algorithm in [24], can be promising solution to combat the weakness of the imperfect UT-level CSI.

B. CONTRIBUTIONS AND ORGANIZATION
In this paper, a learning-based channel estimation approach for the MU-mMIMO hybrid precoding RF beamformer has been presented. The preliminary UT-level CSI of a selected set of UTs is generated by offline ray tracing based on known geospatial data and then used for a DNN learning the correlation between the UT-level CSI and the geospatial data. The UT-level CSI produced by offline ray tracing and the DNN-based path estimation technique forms clusterlevel CSI by a proposed FCM clustering algorithm. The proposed UT grouping algorithm further groups feasible UT locations (in a considered service area) into non-overlapping UT zones and calculates the corresponding UT-group CSI based on cluster-level CSI. Finally, the RF beamformer in the MU-mMIMO system is configured using offline UTgroup CSI instead of online channel sounding to avoid large CSI overhead.
The main contributions are summarized below: • Deep learning model for UT-level CSI acquisition: The ray tracing in the preliminary work is computationally expensive even on the assumption that only two propagation mechanisms are considered when calculating the UT-level CSI for each feasible UT location. This paper introduces a deep learning model for the UT-level CSI acquisition, which generates same amount of slow time-varying UT-level CSI for feasible UT locations in the service area with runtime significantly reduced. We provide extra analysis and conduct extensive simulations for the comparison between the deep-learning approach and the non-deep-learning approach. • FCM Clustering for imperfect UT-level CSI: Distinct from hard clustering techniques used for stochastic channel models in [19]- [22], the proposed FCM clustering produces robuster cluster-level CSI by introducing fuzziness between clusters, in order to combat the imperfect UT-level CSI caused by the DNN-based path estimation. Compared to the previous FCM clustering technique presented in the preliminary work, the proposed FCM in this paper further prunes clusters by neglecting UT-level CSI distant from the cluster center when calculating the cluster-level CSI. Therefore, the cluster-level CSI is more representative of UT-level CSI in each cluster. • UT zone formation: The offline channel estimation technique provides slow time-varying UT-group CSI of each UT zone in the service area for the RF beamformer, where UT zones are defined as 3D irregular shaped non-overlapping spaces corresponding to the geospatial data of the service area. The rest of this paper is organized as follows. The system model is introduced in Section II. Section III presents the proposed channel modeling method using ray tracing and FCM and its illustrative results in the indoor scenario. In Section IV, we propose two approaches using deep learning for CSI acquisition and the comparison between the ray tracing and the deep learning methods. Section V provides the example of this offline channel modeling for AB-HP design in the MU-mMIMO system. Finally, the paper is concluded in Section VI.

II. MASSIVE-MIMO HYBRID PRECODING & GEOMETRY-BASED 3D CHANNEL MODEL
In this section, we present the hybrid precoding architecture and the geometry-based 3D channel model.

A. HYBRID PRECODING ARCHITECTURE
We consider a total of K single-antenna UTs in the service area, where K a ≤ K UTs are active, K − K a UTs are inactive. The BS is equipped with M antennas deploying the AB-HP technique proposed in [9] to serve K a active UTs as illustrated in Fig. 1. The downlink signal vector transmitted by the BS is denoted as s = FBa ∈ C M , where F∈C M×N RF is the RF beamformer, B∈C N RF ×Ka is the BB precoder, a ∈ C Ka is the downlink data symbol vector to K a simultaneous UTs, which is encoded by i.i.d Gaussian codebooks (i.e., i.i.d. entries of a follow the distribution of CN (0, 1)) 1 , and N RF is the number of RF chains. For the power constraint of P T , we have E s 2 2 ≤ P T . Hence, in comparison to the conventional fully digital precoding (FDP), the number of RF chains in the AB-HP technique is reduced from M to N RF . Meanwhile, in order to support K a UTs, the number of RF chains should follow K a ≤N RF M . Then, the received signal at the k th UT is written as follows: where h k ∈ C M is the channel vector of the k th UT, b k ∈ C N RF is the k th column of B, a k is the k th element of a, and w k ∼ CN 0, σ 2 is the complex Gaussian noise. The full channel matrix between the BS and all UTs is defined as The signal-to-interferenceplus-noise ratio (SINR) at the k th UT is derived as [9]: where Afterwards, the average sum-rate of the MU-mMIMO system is obtained by [9]: Moreover, we consider a total of G UT groups in the service area, where K a active UTs are mapped to G a ≤ G active UT groups according to the similarity of slow time-varying UT-level CSI and K g UTs are in the g th UT group such that K a = Ga g=1 K g . Then, the AB-HP technique can develop the RF beamformer in blocks F = [F 1 , · · · F Ga ] only using the UT-group CSI, where F g ∈ C M ×N (g) RF for each group requires N (g) RF . In other words, the AB-HP technique only utilizes the slow time-varying AoD information to develop the RF beamformer F. Afterwards, instead of using the full channel matrix H, the BB precoder employs the reducedsize effective channel matrix H = HF ∈ C Ka×N RF seen from the BB-stage. Unlike the conventional FDP, the AB-HP technique can significantly reduce the hardware cost/complexity and the CSI overhead. In this paper, the development of RF beamformer F and BB precoder B is omitted for focusing on the channel estimation as the main motivation of this paper. Please see [9] for the design of RF beamformer F and BB precoder B.

B. GEOMETRY-BASED 3D CHANNEL MODEL
According to the geometry-based 3D channel model [9]- [11], the channel vector between the BS and the k th UT is defined as follows: where N c is the number of clusters, L j,k is the number of paths in the j th cluster with j = 1, · · · , N c , N L = Nc j=1 L j,k is the total number of paths, g k,j l ∼ CN 0, 1 N L is the complex path gain of the l th path in the j th cluster, P k,j is the average power of the j th cluster. Here, Λ (ϕ k,j l , θ k,j l ) and, φ(ϕ k,j l , θ k,j l ) ∈ C M denotes the antenna gain and phase response vector, respectively, at the corresponding (ϕ k,j l , θ k,j l ) angle of departure (AoD) direction where the azimuth AoD (AAoD) of the l th path in the j th cluster is ϕ k,j l ∈ µ ϕ D k,j − σ ϕ D k,j , µ ϕ D k,j + σ ϕ D k,j , i.e., specified by the mean AAoD µ ϕ D k,j and the spread of AAoD σ ϕ D k,j . Similarly, the elevation AoD (EAoD) of the l th path in the j th cluster is θ k,j l ∈ µ θ D k,j − σ θ D k,j , µ θ D k,j + σ θ D k,j , i.e., specified by the mean EAoD µ θ D k,j and the spread of EAoD σ θ D k,j .

III. CHANNEL ESTIMATION BASED ON GEOSPATIAL DATA AND FUZZY C-MEANS
As discussed above, designing the RF beamformer F needs the knowledge of CSI in terms of a large M channel gain and phase response pairs and hence requires a large feedback communications overhead between UTs and the BS if using conventional online channel sounding. To eliminate this requirement, instead of online channel sounding, we propose to perform offline CSI computations based on 3D geospatial data of the service area and FCM technique as shown in Fig. 2. Firstly, the data preprocessing simplifies the original geospatial data representation to reduce the computational complexity in the subsequent phases. Next, the path estimation generates UT-level CSI in terms of all paths from the known BS position to all feasible UT locations in the service area by offline ray tracing. Finally, the UT zone formation derives all non-overlapping UT zones with their associated CSI via 3 main steps: (i) FCM clustering to form the relationship between existent paths and clusters via the membership matrix, (ii) cluster parameterization to develop the clusters, their regions and parameters, and (iii) UT grouping to establish the non-overlapping UT zones and their associated CSI.

A. GEOSPATIAL DATA PREPROCESSING
The original geospatial data defines objects by vertices of their planes in Universal Transverse Mercator (UTM) Coodinates System [28], where curved planes are modeled by several flat planes. For easy data intepretaion and visualization, vertices are re-defined in a relative coordinate system by setting a new origin instead of in the original absolute coordinate system. Then, for computational simplicity, we quantize all vertices by proper resolution to merge objects not affecting the propagation, such as the interior and the exterior walls of buildings.

B. PATH ESTIMATION
The offline ray tracing generates UT-level CSI for the entire service area via four steps: (i) collect feasible UT locations in the service area from its geospatial data, (ii) produce paths in terms of all objects between the BS and all feasible UT locations, and (iii) compute path parameters according to the trajectory and the path loss model for all paths. For each BS-UT location pair, paths that carry signals arriving at the UT with a power level of 25 dB lower than the strongest signal are labeled as non-existent paths [29] while the remaining existent paths are considered as the existent paths. In this research, only dominant line-of-sight (LoS) propagation, i.e., single-segment paths, and single reflection propagation, i.e., 2-segment paths, are considered in the simulation.

1) Feasible UT locations
In consideration of LoS propagation and single reflection propagation, feasible UT locations are only selected from UT locations outside of the objects in the service area based on the geospatial data. Besides, in order to reduce the required computational complexity, CSI of feasible UT locations distributed 0.5 meter apart in all dimensions in the service area is adoped for the offline channel estimation.

2) LoS path
The LoS path has one segment. Its trajectory is obtained by examining line-plane intersections of the segment and other objects. The LoS path is non-existent if the segment contains line-plane intersections on any objects. Otherwise, this LoS path is existent. Its path loss is given by [30]: where λ is the wavelength, η is the path loss exponent in [31], d is the length of all segments and A m is the frequency-dependent molecular absorption attenuation given by [30]: where k(f ) is the molecular absorption coefficient for frequency f available in [32]. As the molecular absorption does not affect the propagation loss much at low frequencies, k(f ) = 0 for low f (e.g., at sub-6 GHz).

3) Reflection path
The single-reflection path has two segments connecting the BS location, the reflected points, and the UT location. Its trajectory is computed based on the mirror theory. The single-reflection path is non-existent if any of segments passes other objects. Otherwise, this single-reflection path is existent. Its path loss is the sum of (5) and the reflection loss given by [33]: where R c is the reflection coefficient from Fresnel's equations and R s is the surface roughness factor associated with the material of the interacted objects [34].

4) Path representation
The vector of the s th segment in a path, defined as (x s y s z s ) in Cartesian coordinates, is calculated by subtracting one point from the other in the direction from the BS to the UT. To get the angle information of the s th segment, the segment vector is converted from Cartesian coordinates to spherical coordinates (ϕ s θ s d s ), given by: where atan2 refers to the four-quadrant inverse tangent [35], ϕ s is the azimuth angle, θ s is the elevation angle, d s is the segment length.
Paths are characterized by two parameters, a path vector derived from its segments and the received signal strength (RSS). Using angle and distance information of all segments in the l th path (i.e., (ϕ 1 θ 1 d 1 ) in the direction of BS to UTs for LoS paths, (ϕ 1 θ 1 d 1 ) in the direction of the BS to the reflected point and (ϕ 2 θ 2 d 2 ) in the direction of the reflected point to the UT for single-reflection paths), the l th path is characterized by the vector x l , defined as: where ϕ D l , θ D l , ϕ A l , θ A l are AAoD, EAoD, azimuth angle of arrival (AAoA), elevation angle of arrival (EAoA) ranging in [−π, π] rad, respectively, and τ l is the delay in second.
For the existent LoS path, , where P t is the transmit power, G t is the transmit antenna gain, G r is the receive antenna gain, L is the path loss obtained by (5) for the LoS path and by (5) and (7) for the single-reflection path. For the non-existent path, P l = −∞.

C. UT ZONE FORMATION
We consider that a total of N paths are present for a total of K feasible UT locations in the considered service area, which are obtained via the path estimation. N r paths X = [x 1 , · · · , x Nr ] arrive at UTs (i.e., existent paths) and N −N r paths are blocked by objects (i.e., non-existent paths). In this module, N r existent paths are grouped to N c clusters and K UT locations are divided into G non-overlapping UT zones in the considered service area.

1) FCM clustering
Using FCM clustering, each existent path can belong to multiple clusters with the membership from 0 to 1, where higher membership indicates stronger association with the cluster. The proposed clustering algorithm uses the following iterative process to group N r existent paths into N c clusters. At first, it initializes the membership matrix U = [u lj ] ∈ R Nc×Nr where u lj is the membership of the existent path x l in the j th cluster such that Nc j=1 u lj = 1. In the n th iteration (where n = 1, 2, · · · ), the j th cluster centroid is calculated based on the RSS, given by: where m > 1 is a fuzzy control constant affecting the cluster fuzziness. Subsequently, elements of the membership matrix U are updated as [24]: where E lj is the similarity between the existent path x l and the j th cluster to address the angle periodicity, given by: where x li and v ji denote the i th components of vectors x l and v j , respectively. Finally, the objective function in the n th iteration is defined as [24]: The iterative process continues to re-compute (10)-(12) until convergence, i.e., ∆J = J n − J n−1 < , where J 0 = ∞.

2) Validity
The number of clusters should be selected to appropriately represent all existent paths without high complexity. Thus, we examine the validity of a number of clusters c to determine the best number of clusters based on the partition coefficient V PC (c), the partition entropy V PE (c), the partition index V SC (c), the separation index V S (c), the Xie and Beni's index V XB (c), defined as [37]: where L j is the number of existent paths in the j th cluster and Nc j=1 L j = N r . The FCM clustering runs repeatedly for c ∈ {2, 3, · · · , N Cmax } to get the cluster prototypes, where the upbound N Cmax depends on the scenario. A best partition has higher V PC , and lower V PE , V SC , V S and V XB [24].

3) Cluster parameterization
The existent path x l is assigned to the j th cluster if u lj > u lq , q = 1, · · · , j − 1, j + 1, · · · , N c . Then, the j th cluster contains a path set X j = x 1 , · · · , x Lj . Instead of using X j , the j th cluster can be represented by [38]: where µ P j and σ P j are the mean and the spread of the cluster power calculated by: , are the mean and the spread of the angles and the delays, given by: The EAoA and AAoA information in the cluster become redundant for formulating RF beamformer in the BS and can be omitted. We apply path pruning to the original clusters in order to eliminate existent paths (i.e., outliers) that have too large Euclidean distances from their cluster center centroid, as well as to preserve the cluster parameters at the same time [39]. For the j th cluster, consider that the original path index set is C j0 = [I 1 , · · · , I Lj ], where I l denotes the index of the l th path. The algorithm first calculates the cluster parameters C j0 based on the original paths C j0 and take them as benchmarks. Then, in the n th iteration, the algorithm computes the point-to-center distances for all paths in C j(n−1) , discards the existent path with excessive distance (i.e., the l th existent path if E lj > E mj for m = l ≤ L j ) and creates the new path index set C jn = [I 1 , · · · , I l−1 , I l+1 , · · · , I Lj ]. Afterwards, the new cluster parameters C jn are recalculated by (19) and (21), and compared with the benchmark C j0 element by element. If values of the cluster parameters C jn are larger than 95% of C j0 (e.g., µ Pn , the algorithm continues pruning the existent paths in the cluster. Otherwise, the j th cluster takes C j(n−1) as its parameters and the iteration stops. With path pruning, the cluster center centroid moves closer to dense paths in the cluster and the spreads of the clusters are more representative of the majority of paths.

4) Cluster region (CR) identification
The j th CR R j is the geospatial region that includes all feasible UT locations having existent paths in X j . Part of (a) Real-life service area.
(b) 3D geospatial model. R j containing feasible UT locations with existent paths having u lj < 0.6 is defined as fuzzy sub-CR R F,j . The other part containing feasible UT locations with existent paths having u lj ≥ 0.6 is named the deterministic sub-CR R D,j . Therefore, UTs can receive the signals transmitted via the j th cluster if located in R D,j while the UT is less accessible if located in R F,j . Note that one cluster region can partially overlap with other cluster regions.

5) UT grouping
Feasible UT locations having the same cluster group Z g = C 1 , · · · , C Ng are assigned to the g th UT group. Feasible UT locations in the g th group further form the geospatial g th UT zone in the service area, which include all CRs of the cluster group Z g . Note that UT zones do not overlap each other. The UT-group CSI of the g th UT zone is represented by the parameters of the cluster group Z g .

D. ILLUSTRATIVE RESULTS
As illustrative results, the proposed channel estimation method is implemented in an indoor scenario at 28 GHz, as shown in Fig. 3, with related parameters and results summarized in Table 1 and Table 2, respectively. Fig. 4 presents the ray tracing example of one UT location, where there are 6 existent paths and 200 nonexistent paths. Fig. 5 shows the validity results of the indoor 28 GHz scenario. In this case, the best number of clusters is N c = 58 because V PC and V PE reach the local maxima and minima, respectively, while V S , V SC and V XB virtually drop to 0. Fig. 6 demonstrates the clusters in the indoor 28 GHz  Fig. 6(a) seriously overlap each other in AoD domain, because existent paths with similar AAoD and EAoD, and diverse delays, are assigned to separate clusters, as shown in Fig. 6(b). Fig. 6(c) presents an example of grouping existent paths distributed near π in AAoD domain into one cluster (C 11 ), proving that the proposed FCM addresses the angle periodicity problem in the conventional FCM. Besides, the minimum RSS among all existent paths is −62.23 dBm while the minimum cluster power is −58.26 dBm. In this case, UTs having the existent path with low RSS may receive a signal via the corresponding cluster with power level below the receiver sensitivity if the antenna gain in the BS is designed tightly in terms of the cluster power. Thus, the difference between the cluster power and the RSS of existent paths should be taken into consideration to meet the link budget when designing the antenna gain based on the proposed channel model. After the path pruning, the number of clusters decreases to N c = 56 in order to reduce the channel complexity. Fig. 7 presents three pruned clusters: C 25 , C 27 , C 30 in the angle-delay domain, where paths that are independent from the dense paths in the cluster are successfully identified and removed by the path pruning algorithm. The cluster parameters of C 25 are almost consistent before and after the path pruning while that of C 27 has a large drop in AAoD spread, from 0.3266 rad to 0.1227 rad, as summarized in Table 3. It is because dropped outliers in C 27 are spatially distant from its cluster center centroid, which falsefully  shifts the cluster center centroid in the direction with sparse paths.
In Fig. 8, two clusters (C 1 and C 6 ) result in an overlap in their cluster regions (CR1 and CR6). UTs located in the overlapping cluster region can receive signals through both clusters. Fig. 9 shows that when the degree of the cluster fuzziness rises, existent paths in one cluster have more divergent AoD and decreasing average maximum membership (from 0.96 to 0.025), leading to less representative cluster parameters. Thus, the cluster fuzziness should be controlled at a low level. Fig. 10(a) shows all UT zones for the indoor 28 GHz scenarios. Fig. 10      (c) One cluster crossing π in AAoD domain.

IV. PATH ESTIMATION BASED ON DNN AND 3D GEOSPATIAL DATA
The path estimation in Fig. 2 spends long simulation/ measurement time to obtain UT-level CSI in terms of all paths required for the channel estimation completely by offline ray tracing. We design a DNN-based path estimation model to produce part of required paths for the channel estimation based on the geospatial data without ray tracing.
In this section, we study and develop DNN-based path estimation models, including a one-step model based on the

A. DATA PREPROCESSING
For the u th path, there are 15 input features defined by is the x, y, z coordinates of the BS location, V u U T is the x, y, z coordinates of the UT location, V u P,1 , V u P,2 , and V u P,3 are the x, y, z coordinates of three points V 1 , V 2 , V 3 defining the interacted plane P.
Consider a dataset S = {o 1 , · · · , o N } with a total of N paths for K feasible UT locations required to comprehensively describe the desired channel, where K feasible UT locations in the service area are known from the geospatial database, o u is the vector composed of input features, the class tag and path parameters for the u th path. We build two groups according to K feasible UT locations: one group with K T r UT locations and the other group with K T est UT locations, where K T r + K T est = K. Paths of K T r UT locations are obtained by the offline ray tracing and taken as the training set S T r = {o 1 , · · · , o N T r }. Paths of K T est UT locations in the test set S T est = {o 1 , · · · , o N T est } are unknown and required to be predicted by the proposed   DNN-based path estimation model. To investigate the effects of training time on the prediction performance of the DNNbased path estimation model, we consider three sets of ratios for training and testing: K T r : K T est = 0.7 : 0.3, K T r : K T est = 0.5 : 0.5 and K T r : K T est = 0.3 : 0.7. If K T r UT locations are densely distributed in the service area, the predicted paths for the remaining K T est UT locations would not well match their true paths. Thus, we assume that K T r UT locations are randomly and uniformly selected from K feasible UT locations without replacement.

B. ONE-STEP MODEL
The one-step model maps the input features to the path parameters by a FFNN as shown in Fig. 12, where the hidden layer is activated by the tanh function and the output layer is activated by the linear function. For the u th path, 15 input features F u form an input vector f R u ∈ R 1×15 , and six path parameters form an output vector t u ∈ R 1×6 . We assume that the topology of the one-step model that fits the mapping from the input vector to one of the six path parameters (i.e., AAoD) also fits the mapping from the input vector to all six path parameters. Hence, a number of FFNNs in different topologies are trained with the input vector and AAoD (i.e., (f R u , ϕ Du ) for the u th path) in the training set by Levenberg-Marquardt backpropagation algorithm, and evaluated by the root mean square error (RMSE) and R. We choose the topology of the one-step model, such as the number of hidden layers and the number of neurons in each hidden layer, from the FFNN with a lower RMSE and a higher R. Afterwards, the coefficients of the one-step model are obtained by training it with paths in S T r (i.e., (f R u , t u ) for the u th path). The 1D-CNN and the regression tree ensembles are used in the proposed two-step model as presented in Fig. 13, where the 1D-CNN classifies a large number of paths into two classes: the existent class and the non-existent class, and the regression tree ensembles only apply to a reduced small number of existent paths for path parameter prediction. The reasons why the 1D-CNN and the regression tree emsemble win against other machine learning techniques are that the 1D-CNN can preserve the relative relationships between x, y, z coordinates in the input features compared to the FFNN in the one-step model, and the regression tree ensemble can handle small datasets with a high degree of errors and missing values with balanced variance and bias.

1) One-dimensional convolutional neural network
For the u th path, 15 input features F u are organized as a three-dimensional input vector f c u ∈ R 1×15×1 , and the corresponding output is the categorical class tag v u with two values, where v u = 1 if the u th path is existent and v u = 0 if the u th path is non-existent.
The proposed 1D-CNN is composed of an input layer, L c convolutional (CONV) layers, a max pooling layer, a dropout layer and a fully connected (FC) layer, as shown in Fig. 14. In the l th CONV layer, the input vector I l ∈ R 15×1×C il , where C il is the depth of the input vector, is normalized by the batch normalization technique and padded with zeros in the width and height dimensions for all depths I p l ∈ R 17×3×C il , convolves with C ol different filters F j ∈ R 3×3×C il , j = 1, · · · , C ol (e.g., C ol =30) using stride s = 1 to produce I c l,j = I p l * F j ∈ R 15×1×1 . The output vector of the convolution operation I c l ∈ R 15×1×C ol is activated by the rectified linear unit (ReLU) φ ReLU (I c l ). The output vector of the l th CONV layer is given by O c l = φ ReLU (I c l ) ∈ R 15×1×C ol . After L c consecutive CONV layers, the input vector of the following max pooling layer I P ∈ R 15×1×CoL c is operated with a window of size 1 × 2 and stride s = 1, in order to extract the max value between the neighboring elements in each depth of the input vector. The output of the max pooling layer O P ∈ R 14×1×CoL c is the input of the dropout layer with a dropout ratio of 0.5 to reduce the number of trainable parameters in the subsequent layers. The output of the dropout layer O D ∈ R 7CoL c ×1 connects to N f i = 7C oLc (e.g., 210) neurons in the FC layer. Finally, the FC layer activated by the softmax function maps N f i = 7C oLc neurons to N o = 2 neurons to predict the class tagv for each path.
As the number of paths in the training set for each class is imbalanced (existent:non-existent=2%:98%) as shown in Table 2, the 1D-CNN is trained with the binary focal loss given by [40]: where p v is the probability of one path belonging to the VOLUME 4, 2016 class v such that 1 v=0 p v = 1, γ is a modulating factor to weigh up the hard-classified paths, and α v is the weighting factor of the class v to weigh up the existent paths such that 1 v=0 α v = 1. The performance of the 1D-CNN is gauged by prediction, recall, F-score and Cohen's kappa instead of accuracy, since the class imbalance between the two classes leads to high accuracy by simply labeling all paths as non-existent paths [41]. Cohen's kappa κ represents a minimum acceptable level of agreement of the predicted path class ranging from 0 to 1, where κ = 0 denotes a slight agreement and κ = 1 denotes an almost perfect agreement [42]. It is given by: where p 0 is the proportion of observed agreement and p e is the proportion of chance agreement. For a binary-class problem, p e is given by [42]: 2) Regression tree ensembles There are six parallel regression tree ensembles, where each predicts one of the path parameters, as shown in Fig. 15. Unlike the training set S T r used in the 1D-CNN, this module is trained on the reduced training set S T r = [o 1 , · · · , o Nv ], where only N v out of N T r paths in the existent class v = 1 are involved. For the u th existent path inS T r , the input vector is f R u ∈ R 1×15 and the outputs are t u = {x u , P u } ∈ R 1×6 for six regression tree ensembles, respectively. The performance of the regression tree ensemble is evaluated by RMSE. For AAoD prediction, the entire training set of the regression tree ensemble isS D T r (i.e., (f R u , ϕ Du ) for the u th existent path). There are N E regression trees in the regression tree ensemble. For the e th regression tree in the regression tree ensemble, its training subsetS De T r of size N e is generated fromS D T r of size N v by randomly drawing N e paths paths with replacement.S De T r is randomly divided into 5 folds for 5-fold cross validation. The output of the regression tree ensemble is produced by averaging the outputs of N E regression trees. In order to reduce the depth of regression trees, only 12 input features V u U T , V u P,1 , V u P,2 , V u P,3 are taken as candidate predictors, as V u BS is the same for all paths inS T r .

D. ILLUSTRATIVE RESULTS
To evaluate the performance of the proposed DNN-based path estimation framework, we present a one-step model trained with 70% of paths and three two-step models trained with 70%, 50% and 30% of paths required for the channel estimation for the indoor 28 GHz scenario. The search range of the hyperparameters for the DNN-based path estimation model is given in Table 4.

1) One-step model
The performance of the one-step model for AAoD prediction trained with 70% of paths is shown in Table 5. The best topology of the FFNN for AAoD prediction has 25 neurons in the first hidden layer and 15 neurons in the second hidden layer. It achieves RM SE = 0.0995 rad and R = 0.924 on the test set, indicating that this network learns well from the training samples. However, Fig. 16 shows that there are huge differences between the predicted AAoD and the true AAoD for specific paths. Particularly, for a non-existent path, the one-step model falsefully assigns a non-zero value to its AAoD.
2) Two-step model trained with 70% of required paths For the 1D-CNN, we first search for a best α 1 in the focal loss, followed by the number of filters in each CONV layer and the number of CONV layers by training them with 70%   of required paths. Then we find the best topologies for six regression tree ensembles. Table 6 shows that two α 1 , 0.25 and 0.32 with the top two Cohen's kappa, are selected for further hyperparameter tuning. Table 7 illustrates that the 1D-CNN trained using the focal loss with α 1 = 0.32 show a great increase in precision (3%), recall (0.7%), F-score (1.8%) and Cohen's kappa (2%), compared with the 1D-CNN trained using the conventional cross entropy loss function. However, the model trained with the focal loss having α 1 = 0.1 performs worse than models in other three settings. Thus, we set α 1 = 0.32 in the focal loss function for the desired 1D-CNN.  Table 8 shows that the 1D-CNN with 30 filters in all CONV layers performs in a superior way than 1D-CNNs with more or less filters in each CONV layer, with recall = 88.09%, precision= 95.44%, F-score= 91.61% and Cohen's kappa= 0.91. Thus, we set 30 filters in all CONV layers for the desired 1D-CNN. Table 9 illustrates that #8 1D-CNN with 9 CONV layers with recall= 92.00%, precision= 96.31%, F-score= 94.11% and Cohen's kappa= 0.9399, leads to a perfect fit to the observed input-output pairs, compared to #1 1D-CNN with 2 CONV layers which is too crude to perform path prediction due to its unacceptable Cohen's kappa (0.29) and compared to #9 1D-CNN with 10 CONV layers which has a worse performance. Thus, the two-step model trained with 70% of required paths consist of #8 1D-CNN and regression tree ensembles summarized in Table 10. Meanwhile, Fig. 17 presents predicted and true paths in AAoD, where the predicted values closely approach the ideal fitting curve.

3) Two-step model trained with 50% of required paths
We first tune the number of CONV layers in the #8 1D-CNN and train the newly-built 1D-CNN and six regression tree ensembles with 50% of required paths. Performances of 1D-CNNs with the number of CONV layers ranging from 5 to 12 are presented in Table 11. The performance of the #12 1D-CNN becomes inferior when it is trained with 50% of required paths as compared to the #8 1D-CNN trained with 70% of required paths. However, when more CONV layers are added to the 1D-CNN, such as #13-#15 1D-CNNs, their performances are improved and even surpass that of the #8 1D-CNN. Thus, the two-step model trained with 50% of required paths is composed of the #15 1D-CNN and six   regression tree ensembles summarized in Table 12.

4) Two-step model trained with 30% of required paths
Similarly, we tune the number of layers of the #8 1D-CNN and train this two-step model with 30% of required paths. Table 13 demonstrates that the performance of the 1D-CNN with 9 CONV layers (#16) is comparable to that with 12  CONV layers (#17). Thus, the two-step model trained with 30% of required paths consists of the #16 1D-CNN and six regression tree ensembles summarized in Table 14.

5) Overall performance of DNN-based path estimation models
Comparing the performance of the one-step model(0.7:0.3) and the two-step model(0.7:0.3) in Table 15 indicates only a slight improvement in RMSE, but Fig. 17 and Fig. 16 clearly show the noticeable fitting performance enhancement of the two-step model over the one-step model. Besides, though the two-step model trained with 30% of required paths learns less from the UT-level CSI given by ray tracing, it achieves comparable and even better performance in path parameter prediction comparing to the two-steo model trained with 50% and 70% of required paths, as shown in Table 15.

6) Comparison between ray tracing and DNN-based path estimation model
There are 73, 336 paths for 356 feasible UT locations required for the channel estimation within the desired space in Fig. 18. The two-step model (0.7:0.3) generates 1, 325 existent paths and 72, 011 non-existent paths. The two-step model (0.5:0.5) generates 1, 350 existent paths and 71, 986 non-existent paths. The two-step model (0.3:0.7) generates 1354 existent paths and 71, 982 non-existent paths. As a reference, the ray tracing algorithm produces 1, 715 existent paths and 71, 621 non-existent paths. The AoDs of existent paths obtained by three two-step models and by ray tracing are presented in Fig. 19. Obviously, most of existent paths for unknown feasible UT locations can be accurately predicted by the two-step model except some paths having positive AAoD.
The relative times on average to acquire paths for one UT location by ray tracing and three two-step models are shown in Table 16. Compared with the time taken by ray tracing, the time taken by three two-step models is reduced by 26.4%(0.7:0.3), 46% (0.5:0.5) and 65.8% (0.3:0.7), respectively. Therefore, by means of deep neural networks, the UT-level CSI required for the channel estimation is produced in a much shorter time with acceptable errors.

V. SUM-RATE PERFORMANCE
Throughout this section, the proposed channel estimation model is utilized for the MU-mMIMO system illustrated in Fig. 1. The sum-rate performance of two-stage AB-HP and single-stage FDP techniques is investigated con-    sidering the indoor scenario specified in Table 1. Also, Table 17 summarizes the simulation setup, where the BS is equipped with a uniform rectangular array (URA) having M = 16 × 16 = 256 antennas.
In Fig. 20, there are G a = 2 UT zones in the service area, where the one on the left has four clusters and the one on the right has five clusters. Here, the RF beamformer of the AB-HP technique generates the corresponding beams by using the UT-group CSI of each UT zone obtained via VOLUME 4, 2016  the ray tracing. It is shown that the generated beams are well-aligned with the multiple paths and UT zones. AB-HP first builds the RF beamformer F ∈ C 256×22 , which needs N RF = 22 RF chains. Then, the BB precoder is generated via the reduced-size effective channel matrix H = HF ∈ C Ka×22 seen from the BB-stage. On the other hand, the conventional single-stage FDP requires N RF = M = 256 RF chains and the full channel matrix H ∈ C Ka×256 . Thus, the AB-HP technique utilizing the proposed channel modeling decreases the CSI overhead size by 91.4% compared to FDP and the number of RF chains are remarkably reduced from 256 to 22. Fig. 21 demonstrates the sum-rate performance of the FDP and AB-HP techniques obtained via (2) and (3). Here, we assume that K a = 2, 4, 6, 8, 10 UTs are equally distributed to G a = 2 groups (i.e., K g = Ka Ga ). In order to compare the accuracy of ray tracing and two-step model (0.3:0.7) on the sum-rate performance, the RF beamformer of AB-HP is also fed by the two-step model (0.3:0.7) in addition to ray tracing. Results show that AB-HP with ray tracing can provide a comparable sum-rate performance with respect to the conventional FDP. For instance, at P T = 40 dBm, the performance degradation is only 0.4 bps/Hz (7.2 bps/Hz) for K a = 2 (K a = 10) users. Moreover, AB-HP with ray tracing achieves 99.1% (97.8%) of sum-rate obtained by FDP for K a = 2 (K a = 10) users, while, AB-HP greatly relaxes the CSI overhead and the hardware cost/complexity as mentioned above. Furthermore, AB-HP with the two-step model (0.3:0.7) also closely converge the sum-rate performance of AB-HP with ray tracing, whereas, the two-step model decreases the time for the path prediction by 65.8% (please see Table 16). For example, two-step model (0.3:0.7) provides 99.6% (99.2%) of the sum-rate performance achieved by ray tracing model for K a = 2 (K a = 10) users.

VI. CONCLUSION
A deep learning based channel estimation approach using 3D geospatial data, the 1D-CNN and the proposed FCM for MU-mMIMO hybrid precoding RF beamformer has been developed. It provides the RF beamformer with offline UTgroup CSI for different UT zones in the service area, which significantly reduces the online CSI overhead by 91.4%. Using a DNN-based technique and ray tracing for UT-level CSI acquisition further shortens the measurement time by 65.8%, compared to only ray tracing. Meanwhile, the UTgroup CSI of each UT zone is robust to the imperfect UT-level CSI produced by the DNN-based technique because of the fuzziness introduced into the clusters by the proposed FCM. Finally, considering the MU-mMIMO systems, the illustrative results verify that the AB-HP configured with offline UT-group CSI generated by the proposed DNNbased channel estimation technique can successfully achieve a comparable sum-rate performance with the same AB-HP technique with offline UT-group CSI produced by computationally expensive ray tracing and the FDP technique with full-size online CSI.