Hybrid Precoding Based on Monopulse Ratio for Millimeter Wave Systems With Limited Feedback

Multiuser hybrid precoding in 5G NR mmWave communication systems faces significant challenges, such as establishing accurate directional radio links and maintaining those links for mobile stations (MSs) moving in outdoor environments. A conventional solution relies on finite codebook-based beam sweeping for first-stage beam acquisition and on sweeping adjacent beam pairs for second-stage beam tracking. However, such a solution leaves inevitable residual AoA/AoD errors even after the best beam pair is established in the first stage, and it incurs nonnegligible overhead to sound adjacent beam pairs in order to maintain the best beam pair in the second stage. To overcome these problems, a novel codebook-based two-stage solution is proposed that combines a beam tracking protocol with a low sounding overhead, a unique receiver structure employing a beam scheduler and a beam tester, and a fine-accuracy residual AoA/AoD error estimation algorithm based on the monopulse ratio concept. The solution is unique because its receiver structure for residual AoA/AoD error estimation exploits the cyclic prefix in OFDM systems, inspired by the monopulse ratio in radar systems. Moreover, it can be applied to MSs with a single RF chain. Using the proposed receiver structure and algorithm, the solution can establish a more accurate directional beam pair right after the initial beam sweeping in the first stage. For beam tracking in the second stage, it estimates the residual AoA/AoD errors of the current best beam pair rather than sweeping adjacent beam pairs, thereby reducing beam tracking overhead. Numerical evaluation and computer simulations show that the proposed solution offers more accurate beam acquisition (i.e., an average array gain improvement of several dB) and incurs considerably lower beam sounding overhead than the conventional solution.
Lastly, a ray-tracing tool is used to demonstrate that our solution is effective in practical channel parameters for outdoor environments.


I. INTRODUCTION
The associate editor coordinating the review of this manuscript and approving it for publication was Liang Yang. VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Millimeter wave (mmWave) band is highly anticipated for use in a mobile broadband radio access system. MmWave communication can solve the large bandwidth requirement issues caused by the ever-increasing data rate demands of user devices [1]-[6]. However, path loss in the mmWave band is more severe than that in lower frequencies [6]. To compensate for path loss, it is necessary to use directional transmission and reception using the array gain of large antenna arrays [2], [7]. Fortunately, large antenna arrays have feasible form factors because of the small wavelength in the mmWave band [8], [9]. Full digital baseband (BB) precoding is impractical because of the high power consumption and hardware (HW) costs of radio frequency (RF) chain components in the mmWave band [10]-[12]. Furthermore, analog-only beamforming exhibits a significantly degraded performance in downlink multiuser environments [13]. As a tradeoff, researchers [6], [14]-[17] recently proposed hybrid precoding comprising analog RF precoding and digital BB precoding. In fact, this technology has low cost and good performance, and therefore, it is now considered a key enabling technology in mmWave communication systems [6], [7]. As a use case in the fifth generation (5G) new radio (NR) access specification, enhanced mobile broadband (eMBB) [6] aims at higher peak and average data rates than the fourth generation (4G) long-term evolution (LTE) technology.
Therefore, researchers have studied how 5G NR supports large bandwidths in mmWave frequency bands for commercial deployment [18] and also (analog RF) beam-related operations, called beam management in 3GPP [19]; the objective of such research was to set up and maintain good-quality directional links. In addition, using multiple RF chains at the base station (BS), 5G NR supports digital BB precoding for spatial multiplexing in multiuser multiple-input multiple-output (MU-MIMO) with limited feedback, a feature already added to LTE to improve spectral efficiency. The number of simultaneous streams for multiple users (i.e., the number of RF chains) is far less than the number of transmit antennas. Hence, good analog RF precoding using limited feedback is expected to be essential for hybrid precoding in mmWave MU-MIMO systems. In the initial access phase, finding the best RF beam pair between the BS and each mobile station (MS), as shown in Fig. 1, is a crucial operation to achieve sufficient array gain and good link quality for subsequent data transmission. Further, fast movement or self-rotation of an MS may occur during the data transmission phase [20]. Thus, to maintain a stable link under mobility or time-varying angles, the system needs sounding overheads in protocols for periodic candidate beam pair measurements (called beam tracking), as shown in Fig. 1. There are several challenges to be considered when designing a hybrid precoding to meet the above-mentioned requirements. First, as a tradeoff between accuracy and delay, most mmWave systems employ codebook-based hybrid precoding where RF precoders/combiners are selected from a finite-size RF beam search codebook. Thus, there can be a considerably large sounding overhead for the periodic measurement of the candidate RF beam pairs to maintain the concurrent best beam pair, as in Fig. 1.
Second, among many types of codebooks, discrete Fourier transform (DFT) based beam-steering codebooks have been widely used because they can achieve the maximum array gain at the designated beam directions and are well parameterized with beam directions. However, the performance of an average array gain can be degraded by the residual angle of arrival (AoA)/angle of departure (AoD) error because of the finite grid of angles in the codebook. Finally, MSs may have a single RF chain to lower HW cost or activate only a single RF chain to save power. As a result, when sweeping the adjacent candidate RF beam pairs to track the best beam pair, there might be nonnegligible sounding overheads in protocols, thereby causing loss of overall system throughput. In summary, multiuser hybrid precoding in the mmWave systems must overcome challenges so that beam acquisition is as accurate as possible and the beam tracking protocol is employed with sounding overheads as low as possible even for the MS with the single RF chain.

A. RELATED WORK
We classify the literature on hybrid precoding in mmWave MIMO systems into three research trend categories. In the first category, researchers assume that downlink antenna-level channel state information is available at the transmitter (CSIT); they further assume that, to maximize spectral efficiency, joint/sequential optimization of the analog/digital precoding matrix pair is performed toward the optimal, fully digital precoder $\mathbf{F}_{opt}$ calculated from the provided perfect (or estimated) antenna-level CSIT [6], [21]-[29]. These research outputs are meaningful for revealing the performance upper bound of hybrid precoding with perfect (or estimated) antenna-level CSIT. Nevertheless, channel state information (CSI) is difficult to acquire in practice because of the considerable estimation overhead and the low signal-to-noise ratio (SNR) before analog beamforming [13], [30].
The second category is hybrid precoding that exploits statistical CSI [31] or spatial channel covariance [32]- [37]. In [31], by assuming small angular spreads, the transmit channel covariance is obtained from the statistics of the channel such as the mean of AoD, the angular spread of AoD, and the mean delay of channel clusters; the RF precoders are selected as the eigenvectors of the channel covariance matrix. In [32]- [37], the channel covariance matrix is utilized to design the analog precoding of a hybrid precoding without using instantaneous CSIT.
The third category is based on RF beam sweeping to find the best RF beam pair. Therefore, this category is better suited than the others for commercial systems such as 5G NR [13], [30], [38]-[46]. In the literature, RF beam sweeping has a long history of both research and practical deployment. In [38], codebook-based analog precoding was considered in wireless personal area networks for stationary device-to-device wireless connections. In [30], to attain fast beam alignment in mmWave backhaul links, the authors used adaptive subspace sampling and a hierarchical codebook designed by subarray-wise beam broadening. To estimate the AoA error, the authors in [41] used the phase difference across spatially separated receive subarrays along with multiple receive RF chains. Their aim was to reduce RF beam codebook search complexity in single-user and single-carrier systems. Their method, however, does not compensate for the residual AoA error and cannot be applied to the single RF chain condition. In [40], the authors designed a sequential search on RF and BB codebooks for a single user; this approach provides an initial version of two-stage hybrid precoding for a single-user system. In [45], a novel analog codebook using nonuniform quantization and a low-complexity hybrid precoding was proposed to reduce the feedback overhead. In [13], [42], [43], [46], two-stage hybrid precoding methods were considered for single RF chain users. Reference [13] uses a full sweeping over DFT-based RF beam search codebooks and a multiuser zero-forcing (ZF) BB precoding based on the reported effective BB channels for a multiuser narrowband system. References [42] and [43] are enhancements of [13] using Kalman or minimum mean squared error (MMSE) precoding. These, however, do not further investigate compensating for the residual AoA/AoD error. Reference [46] presents a tone-based AoA estimation for a narrowband multipath channel.
Although it can improve AoA estimation accuracy, the AoA error compensation capability is not fully exploited. In [44], an auxiliary beam pair-enabled AoA/AoD error estimation is suggested, which measures and compensates the residual AoA/AoD error using measurements of the left and right adjacent beams near the current best beam pair. It can be applied to multiuser wideband OFDM systems with a single RF chain receiver. Although it requires additional auxiliary beams, it shows that there is still room for performance improvement in RF beam steering with a finite-size DFT-based codebook. Hereafter, the conventional solution is defined as the method that uses the DFT-based codebook for beam acquisition/tracking and does not estimate/compensate the residual AoA/AoD errors.

B. CONTRIBUTIONS
In this paper, we present a novel solution for two-stage multiuser hybrid precoding based on monopulse ratio for mmWave systems. The contributions of the paper are as follows.
• We propose an enhanced solution for two-stage multiuser hybrid precoding that can achieve more accurate beam acquisition/tracking and considerably lower sounding overheads in a beam tracking protocol than the conventional solution. The proposed solution consists of a novel beam tracking protocol, a unique receiver structure, and its algorithm to estimate the residual AoA/AoD error. It works well for a single RF chain MS in both wideband multipath channels and time-varying angle conditions.
• Even with inter-block interference (IBI) and multipath interference (MPI), the cyclic prefix (CP) combined with the monopulse ratio can be successfully exploited for residual AoA error estimation with a considerably reduced root mean squared error (RMSE); thus, more accurate beam acquisition and beam tracking are possible in both the first and second stages.
• The proposed solution requires sounding only at the current best beam pair in the second stage. This is more efficient than the conventional solution, which sweeps all adjacent beam pairs around the current best beam pair. We demonstrate that it can reduce overheads significantly under realistic channel and mobility scenarios.
We use the following notation: $\mathbf{A}$ is a matrix, $\mathbf{a}$ is a vector, $a$ is a scalar, and $\mathcal{A}$ is a set. $\|\mathbf{A}\|_F$ is the Frobenius norm of $\mathbf{A}$, whereas $\mathbf{A}^T$, $\mathbf{A}^*$, $\mathbf{A}^{-1}$, and $\mathbf{A}^{+}$ are its transpose, Hermitian, inverse, and pseudoinverse, respectively. $\mathbf{I}_N$ is the $N \times N$ identity matrix. $\mathcal{U}(a, b)$ is a uniform distribution between $a$ and $b$. $\mathbf{A} \otimes \mathbf{B}$ is the Kronecker product of $\mathbf{A}$ and $\mathbf{B}$. $\mathcal{A}^n$ is the $n$-ary Cartesian power of a set $\mathcal{A}$.

II. SYSTEM MODEL
A single BS multiuser mmWave system is shown in Fig. 2. We assume that $U$ MSs are selected by an appropriate user selection algorithm such as [47]-[51], and each MS has $N_r$ antennas and a single RF chain. To perform multiuser hybrid precoding, the BS has $N_t$ antennas and $N_{RF}$ RF chains. It is assumed that $U \leq N_{RF}$. Subsequently, we explain the transmitter, RF codebook constraints, channel model, receiver operation, and optimization problem for a fully connected architecture, and then, we mention its extensibility to sub-connected architectures.
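We describe the transmitter operation for downlink hybrid precoding in a multiuser MIMO OFDM system. At each subcarrier $k$, the BB precoder output is $\mathbf{x}[k] = \mathbf{F}_{BB}[k]\mathbf{s}[k]$, and an $N$-point IFFT converts it to the time domain samples $\mathbf{x}(0), \ldots, \mathbf{x}(N-1)$. The CP of length $N_g$ ($\leq N$) is added to the IFFT outputs to make one OFDM block of length $N_b = N_g + N$.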
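Finally, an $N_t \times U$ frequency flat RF precoder $\mathbf{F}_{RF} = [\mathbf{f}^{RF}_1, \mathbf{f}^{RF}_2, \ldots, \mathbf{f}^{RF}_U]$ is applied so that an OFDM block at the output of the BS transmitter antennas becomes $\mathbf{F}_{RF}[\mathbf{x}(N - N_g), \ldots, \mathbf{x}(N-1), \mathbf{x}(0), \ldots, \mathbf{x}(N-1)]$. Therefore, the effective transmit signal at subcarrier $k$ is $\mathbf{F}_{RF}\mathbf{x}[k] = \mathbf{F}_{RF}\mathbf{F}_{BB}[k]\mathbf{s}[k]$, which is confirmed via the IFFT/FFT relation.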
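The construction of one OFDM block from the IFFT outputs can be sketched as follows. This is a minimal illustration for a single stream of scalar samples; the function name `add_cyclic_prefix` and the toy sample values are assumptions made here and are not part of the paper.

```python
def add_cyclic_prefix(symbol, n_g):
    """Prepend the last n_g samples of an OFDM symbol as its cyclic prefix."""
    n = len(symbol)
    if not 0 < n_g <= n:
        raise ValueError("CP length must satisfy 0 < n_g <= N")
    return symbol[n - n_g:] + symbol


# Toy time-domain symbol with N = 8 samples and a CP of N_g = 3 samples.
x = [complex(i, -i) for i in range(8)]
block = add_cyclic_prefix(x, 3)  # block length N_b = N_g + N = 11
```

The first $N_g$ samples of the block repeat the last $N_g$ samples of the symbol, which is the redundancy that the CP-based estimator discussed later relies on.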
Considering analog phase shifters as the RF precoding implementation, there are several constraints on RF precoders. The entries of $\mathbf{F}_{RF}$ satisfy $|[\mathbf{F}_{RF}]_{m,n}|^2 = 1/N_t$. In addition, the phases of the analog phase shifters are quantized to predefined finite values [6]. For a $B_{BS}$-bit phase shifter, $[\mathbf{F}_{RF}]_{m,n} = \frac{1}{\sqrt{N_t}} e^{j\phi_{m,n}}$, where $\phi_{m,n}$ is one of the $2^{B_{BS}}$ uniformly quantized phase values. Here, we assume $N_t \leq 2^{B_{BS}}$ for a uniform linear array (ULA), and $\max(N_{t,x}, N_{t,y}) \leq 2^{B_{BS}}$ for a uniform planar array (UPA) with $N_{t,x} \times N_{t,y}$ antenna elements. The BB precoders are normalized for all $k$ to satisfy the total average power $P$. In this paper, our RF precoder codebook, denoted $\mathcal{F}_{HW}$, is an oversampled DFT-based beam steering codebook with size $|\mathcal{F}_{HW}| = N_{HW} N_t$ (ULA) or $|\mathcal{F}_{HW}| = N_{HW}^2 N_t$ (UPA), where $N_{HW} > 1$ denotes the oversampling factor per dimension (azimuth/elevation). In particular, for BS phase shifters that have $B_{BS}$ bits, $\mathcal{F}_{HW}$ is defined for ULA or UPA as the set of array response vectors $\mathbf{a}_t(\cdot)$ steered to a grid of spatial frequencies indexed by integers $a, b$; the meaning of the array response vector $\mathbf{a}_t(\cdot)$ can be found in [13], [52], [53]. Here, each column vector $[\mathbf{F}_{RF}]_{:,u}$ becomes a member of $\mathcal{F}_{HW}$. The RF combiner codebook $\mathcal{W}_{HW}$ is defined similarly with $B_{MS}$.
A wideband geometric channel with multiple paths can be modeled as in [27], [54] for mmWave OFDM systems. The delay-$d$ MIMO $N_r \times N_t$ channel matrix $\mathbf{H}_{u,d}$ of the $u$-th MS for ULA or UPA is written as a sum over rays of the per-ray gain, pulse shaping response, and array response vectors, where $N_p = N_{cl} N_{ray}$ is the number of all rays (paths) in $N_{cl}$ channel clusters, $N_{ray}$ is the number of rays per cluster, $\ell$ is a ray index, $\alpha_{u,\ell}$ is a complex gain with $E[|\alpha_{u,\ell}|^2] = N_t N_r / N_p$, and $T_s$ is the system sampling time, i.e., $T_s = 1/B = 1/(N \Delta f)$ for a system bandwidth $B$ and a subcarrier spacing $\Delta f$. Here, $p(dT_s - \tau_{u,\ell})$ denotes a pulse shaping filter evaluated at $t = dT_s$ with time delay $\tau_{u,\ell}$.
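The BS array response vector, $\mathbf{a}_t(\psi^t_{u,\ell})$ for ULA or $\mathbf{a}_t(\psi^{t,x}_{u,\ell}, \psi^{t,y}_{u,\ell})$ for UPA, is written in terms of the spatial frequencies of the $\ell$-th ray. The relation from the $\ell$-th ray's spatial frequency $\psi^t_{u,\ell}$ of ULA or $(\psi^{t,x}_{u,\ell}, \psi^{t,y}_{u,\ell})$ of UPA to the physical azimuth (elevation) angle of departure $\phi^t_{u,\ell}$ ($\theta^t_{u,\ell}$) is given by [52]; for ULA, $\psi^t_{u,\ell} = \pi \sin(\phi^t_{u,\ell})$, where the antenna element distance equals $0.5\lambda$, $\lambda = 3 \times 10^8 / f_c$ is the carrier wavelength, and $\phi^t_{u,\ell} \in [-\pi, \pi)$, $\theta^t_{u,\ell} \in [-\pi/2, \pi/2)$. The array response vector of the MS is defined similarly.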
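To illustrate why the finite grid of a beam steering codebook limits the array gain, the sketch below evaluates the normalized gain when the true spatial frequency lies midway between two codewords. It assumes a ULA response with entries $e^{jn\psi}/\sqrt{N}$ (one common convention, not necessarily the exact form in [52]) and a codebook grid spacing of $2\pi/(N_{HW}N)$; the function names are hypothetical.

```python
import cmath
import math


def steering_vector(psi, n_ant):
    """Assumed ULA response: entries exp(j*n*psi)/sqrt(n_ant)."""
    return [cmath.exp(1j * n * psi) / math.sqrt(n_ant) for n in range(n_ant)]


def array_gain(psi_true, psi_steer, n_ant):
    """Normalized gain |w^* a|^2, equal to 1 for a perfectly aligned beam."""
    a = steering_vector(psi_true, n_ant)
    w = steering_vector(psi_steer, n_ant)
    return abs(sum(wi.conjugate() * ai for wi, ai in zip(w, a))) ** 2


def worst_case_gain(n_ant, oversampling):
    """Gain when the true spatial frequency is midway between two codewords."""
    half_step = math.pi / (oversampling * n_ant)
    return array_gain(half_step, 0.0, n_ant)


n_ant = 16
gain_search = worst_case_gain(n_ant, oversampling=1)  # non-oversampled grid
gain_hw = worst_case_gain(n_ant, oversampling=4)      # 4x oversampled grid
loss_db = -10 * math.log10(gain_search)               # worst-case loss in dB
```

Under these assumptions the non-oversampled grid can lose close to 4 dB in the worst case, whereas a four-times finer grid keeps the loss to a fraction of a dB, which is the gap that residual AoA/AoD error compensation aims to close.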
As a DFT of the $D$-tap time domain channel matrix $\mathbf{H}_u(t) = \sum_{d=0}^{D-1} \mathbf{H}_{u,d}\,\delta(t - dT_s)$ with $D \leq N_g$ and $(D-1)T_s \geq \tau_{u,\ell}$ for all paths $\ell$, the corresponding frequency domain channel matrix for subcarrier $k$ is given as [53] $\mathbf{H}_u[k] = \sum_{d=0}^{D-1} \mathbf{H}_{u,d}\, e^{-j2\pi kd/N}$, and it is further decomposed into the rays' gain, delay, and angle parameters for ULA or UPA. Here, $\beta_{u,\ell}[k]$ is the frequency selective gain caused by the rays' delay $\tau_{u,\ell}$ and the pulse shaping filter $p(t)$ [34]. Note that $\beta_{u,\ell}[k]$ becomes frequency flat if the system bandwidth is sufficiently small, i.e., $\max_{\ell} \tau_{u,\ell} \ll T_s = 1/B$ [55].
Then, the received signal at the antennas of the $u$-th MS is $\mathbf{r}_u(n) = \sum_{d=0}^{D-1} \mathbf{H}_{u,d}\mathbf{F}_{RF}\mathbf{x}(n-d) + \mathbf{n}_u(n)$, where $\mathbf{n}_u(n) \sim \mathcal{CN}(\mathbf{0}, \sigma^2 \mathbf{I}_{N_r})$ is an AWGN vector, and the signal after the RF combiner of the $u$-th MS is expressed as $y_u(n) = \mathbf{w}_u^* \mathbf{r}_u(n)$. The frequency domain received signal in subcarrier $k$ after the FFT (discarding the cyclic prefix) is given as $y_u[k] = \mathbf{w}_u^* \mathbf{H}_u[k]\mathbf{F}_{RF}\mathbf{F}_{BB}[k]\mathbf{s}[k] + n_u[k]$, where $\mathbf{w}_u$ has the same HW constraints as the BS RF precoders (i.e., equal gain magnitude and uniformly quantized phase set $\mathcal{S}_{MS}$ with $B_{MS}$ bits), and $n_u[k]$ is a frequency domain representation of $\mathbf{w}_u^* \mathbf{n}_u(n)$.
The optimization problem of RF precoder/combiner selection from finite sets and BB precoder design is to find $\mathbf{F}_{RF}$, $\{\mathbf{F}_{BB}[k]\}$, and $\{\mathbf{w}_u\}$ that maximize the chosen objective. Our chosen objective function is the system achievable sum-rate $R_{sum} = \sum_{u=1}^{U} R_u$, where $R_u$ is the achievable data rate of the $u$-th user per subcarrier. The optimal solution for the problem needs an exhaustive search over the discrete sets and assumes that the channel is available for all $k$ and $u$. Hereafter, we refer to $\mathcal{W}_{HW}$ and $\mathcal{F}_{HW}$ as RF beam HW codebooks, and distinguish them from the RF beam search codebooks, $\mathcal{W}$ and $\mathcal{F}$, which are used for beam sweeping in the first stage of two-stage multiuser hybrid precoding. Even though we described the system model with a fully connected architecture as a representative case, it is straightforward to incorporate a sub-connected architecture into the system model [17].
The corresponding internal elements in the RF precoder $\mathbf{F}_{RF}$, the codebooks $\mathcal{F}$, $\mathcal{F}_{HW}$, and the vectors $\mathbf{a}_t(\cdot)$, $\mathbf{a}_t(\cdot, \cdot)$ are set to zeros based on fixed or even dynamic subarray configurations between RF chains and antenna elements in the sub-connected structure [34]-[37]. Similar reasoning can be applied to the MS side.
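The DFT relation between the delay taps and the per-subcarrier channel in the system model can be sketched for a single scalar entry of the channel (for instance, an effective channel after RF precoding and combining). The helper name `taps_to_subcarriers` and the tap values are illustrative assumptions.

```python
import cmath
import math


def taps_to_subcarriers(taps, n_fft):
    """H[k] = sum_d h_d * exp(-j*2*pi*k*d/n_fft) for k = 0, ..., n_fft - 1."""
    return [
        sum(h * cmath.exp(-2j * math.pi * k * d / n_fft) for d, h in enumerate(taps))
        for k in range(n_fft)
    ]


# A single tap gives a frequency flat response on every subcarrier.
flat = taps_to_subcarriers([0.8 + 0.6j], n_fft=8)

# Two taps give a frequency selective response across subcarriers.
selective = taps_to_subcarriers([1.0, 0.5], n_fft=8)
```

With one tap the response is identical on all subcarriers, matching the remark that the gain becomes frequency flat when the delay spread is negligible relative to the sampling time.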

III. PROPOSED SOLUTION FOR TWO-STAGE MULTIUSER HYBRID PRECODING
We first explain the frame structure and concept of the proposed solution for two-stage multiuser hybrid precoding, and then, we describe its protocol, receiver structure, and algorithm. Fig. 3 illustrates the concept of the proposed solution in comparison with the conventional solution. The frame comprises the first and second stages. The first stage can be divided into DL Beam Sweep and UL Beam Sweep. The second stage consists of $M_{CSI}$ repetitions of a CSI period. The CSI period consists of CSI Acquisition and DL MU-MIMO Data Transmission. This frame structure is well suited for the two-stage multiuser hybrid precoding schemes in [13], [40], [42], [43].

A. FRAME STRUCTURE AND CONCEPT OF PROPOSED SOLUTION
The conventional solution in Fig. 3 can be viewed as a codebook-based method for beam acquisition and beam tracking procedures in 5G NR mmWave systems [56]. As shown in Fig. 3(a), the DL Beam Sweep in the first stage implies joint BS/MS beam sweeping using known reference signals (e.g., 5G NR synchronization signal (SS) blocks in an SS burst for initial directional cell search in the idle mode [57], [58]). Through beam pair tests (e.g., SNR measurement), each MS determines the best beam pair and best MS transmit beam for the following uplink beam sweeping. As shown in Fig. 3(b), the UL Beam Sweep is for the BS to perform beam sweeping and to find out the best BS beam for each MS. This is an abstraction of the 5G NR preamble based random access procedure in the standalone mode [58]. As shown in Fig. 3(c), the CSI Acquisition in the second stage has two functions. The first function is the measurement of candidate beam pairs adjacent to the current best beam pair for the MS under mobility. Based on the measurement, the BS and MS can switch to always-the-best beam pair. The N CSI in the figure denotes the number of CSI reference signal (CSI-RS) symbols per CSI period. For the ULA type, N CSI can be 9 (= 3 × 3) for the conventional solution. The second function is the BB CSI estimation using the CSI-RS symbols. Each MS estimates the BB CSI and reports the quantized BB CSI estimates to the BS. As shown in Fig. 3(d), during the DL MU-MIMO Data Transmission, the BS transmits simultaneous data streams using multiuser BB precoding. The conventional solution uses the RF precoder/combiner only in the RF beam search codebook and does not estimate/compensate the residual AoA/AoD errors.
The proposed solution shown in Fig. 3 has more accurate beam acquisition and reduced sounding overheads in beam tracking protocols owing to the use of monopulse ratio based AoA/AoD error estimators. As shown in Fig. 3(a), after DL beam sweeping, the residual AoA error of the best beam pair is identified using the monopulse ratio and then compensated. As indicated in Fig. 3(b), after UL beam sweeping, the residual AoD error of the best beam pair is estimated and compensated in the same manner. As illustrated in Fig. 3(c), the proposed solution needs N CSI = 2 CSI-RS symbols (one for AoD tracking and the other for AoA tracking) for the ULA type (similar sounding overhead reduction can be achieved for the UPA type). Therefore, in this exemplary case, our solution can significantly reduce the number of CSI-RS symbols, N CSI , for beam tracking, e.g., about 78% reduced CSI-RS overhead compared to the conventional solution in the ULA type. The difference is that the reporting of the candidate beam pairs measurement is replaced with AoD error reporting. In Fig. 3(d), similar to the conventional solution, multiuser MIMO data transmission using multiuser BB precoding, such as [48], [49], [59], follows. Note that the proposed solution selects the RF precoder/combiner from the RF beam HW codebook.
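The overhead figure quoted for the ULA case follows from simple arithmetic on the number of CSI-RS symbols per CSI period. The sketch below reproduces it; the function name `csi_rs_reduction` is an illustrative choice.

```python
def csi_rs_reduction(n_conventional, n_proposed):
    """Fractional reduction in CSI-RS symbols per CSI period."""
    return 1 - n_proposed / n_conventional


n_conventional = 3 * 3  # 3 BS beams x 3 MS beams around the current best pair
n_proposed = 2          # one symbol for AoD tracking, one for AoA tracking
reduction = csi_rs_reduction(n_conventional, n_proposed)  # 7/9, i.e. about 78%
```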
The advantage of the proposed solution over the conventional solution in Fig. 3 can be summarized as follows. First, the conventional solution has a limitation on array gain maximization. The BS and MSs cannot achieve maximum array gain because of the quantized grid of the RF beam search codebook even when the finer RF beam HW codebooks are available to the BS and MSs. However, the proposed solution estimates and compensates for the finer resolution AoA/AoD errors, and thus, it has better array gain and more accurate beam acquisition. Second, our proposed solution can detect and track changes in the AoA/AoD error of the current best beam pair even with a single RF chain. Therefore, it does not need to measure adjacent candidate beam pairs, and thus, it reduces the sounding overhead in the beam tracking protocol. In contrast, the conventional solution sweeps adjacent candidate beam pairs.

B. DESCRIPTION OF THE PROPOSED SOLUTION
Our proposed solution consists of the protocol, receiver structure, and algorithm to enhance the two-stage multiuser hybrid precoding. First, we present the motivation of the proposed solution as follows.
The conventional solution is limited in array gain maximization and incurs non-negligible sounding overheads in protocols for measuring adjacent candidate beam pairs under mobility. Increasing the codebook size (e.g., using oversampled DFT-based RF beam search codebooks) is not a good remedy, as it worsens beam-sweeping overheads in the protocols. To remove these limitations, we devise a receiver structure that can perform finer-resolution AoA/AoD error estimation even though the non-oversampled DFT-based RF beam search codebook is used for beam sweeping. We note that the monopulse ratio is a well-established AoA error estimator used in radar systems to track moving targets. However, simply adopting the monopulse ratio raises another obstacle: it requires two separate RF chains for the respective sum and difference beam (nulling at the boresight) measurements [60]. To avoid this obstacle, one can measure the sum and difference beam outputs in a time division multiplexing manner with a single RF chain. However, if an entire OFDM block is allocated to obtain the difference beam output, then the entire OFDM block would be wasted in terms of data rate.
Here, we are motivated by classic methods in which the CP in OFDM systems is exploited for time/frequency synchronization purposes [61]. In this paper, the CP is exploited for angle synchronization. Therefore, we design an algorithm for the unique receiver structure equipped with a monopulse ratio-based AoA/AoD error estimator; this is integrated within one OFDM block. Thus, even with a single RF chain, our solution can detect and track the direction of change of the current best beam pair. Consequently, our solution does not need to measure adjacent candidate beam pairs, thereby considerably reducing the sounding overhead of beam tracking in protocols. The overhead reduction ratio is discussed in subsection V-B.

1) PROTOCOL
In Fig. 4, the protocol of the proposed solution is designed to have more accurate beam acquisition and lower sounding overhead in beam tracking than the conventional solution. In the first stage, the proposed solution estimates and compensates the initial AoA/AoD error of the best beam pair found by DL/UL beam sweeping. This feature cannot be provided by the conventional solution. In the second stage, the sounding overhead $N_{CSI}$ for beam tracking is significantly smaller than that in the conventional solution because the proposed solution does not need to monitor adjacent candidate beam pairs.
As shown in the left half of Fig. 4, in the first stage of the conventional solution, the BS broadcasts directional reference signals and each MS finds the best beam pair by using codebook-based beam sweeping. Each MS sets the RF combiner corresponding to the identified best MS beam, and sends random access preambles. Then, the BS finds the best BS beam using codebook-based beam sweeping, sets the RF precoder as the determined best BS beam. Therefore, the BS and MSs establish directional radio links. In the second stage, at each CSI period, the MSs measure adjacent candidate beam pairs using N CSI CSI-RS symbols for beam tracking and BB CSI estimation. Based on the measurement results, each MS reports the quality of the candidate beam pairs and quantized BB CSI estimates. The BS sends the MU-MIMO data streams to MSs using the calculated BB precoders. At the end of the CSI period, the BS and MSs update the RF precoders/combiners based on the measurement results of the candidate beam pairs for the next CSI period.
As shown in the right half of Fig. 4, in the first stage of the proposed solution, the BS broadcasts directional reference signals and each MS finds the best beam pair by codebook-based beam sweeping. As a discriminating feature, each MS estimates/compensates the initial AoA error of the found best beam pair, sets the RF combiner corresponding to the estimated AoA, and sends random access preambles. Then, the BS finds the best BS beam by codebook-based beam sweeping. The BS estimates/compensates the initial AoD error of the found best BS beam, and sets the RF precoder considering the estimated AoD. Therefore, at the end of the first stage, the BS and MSs set up more accurate directional radio links than in the conventional solution. In the second stage, at each CSI period, for both AoD/AoA error estimation of the current best beam pair and BB CSI estimation, the MSs require far fewer CSI-RS symbols for sounding (smaller $N_{CSI}$) than in the conventional solution. Based on the measurement results, each MS reports the AoD error and quantized BB CSI estimates. The BS sends MU-MIMO data streams to MSs using the calculated BB precoders. At the end of a CSI period, the BS and MSs update the RF precoders/combiners based on the estimated AoD/AoA error for the next CSI period.
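The second-stage update described above can be sketched as a simple loop in which one residual error measurement per CSI period replaces the sweep of adjacent beam pairs. This is a sketch under stated assumptions: `estimate_error` is a hypothetical callable standing in for the monopulse ratio based estimator (here an idealized, noise-free one), and the steering direction is snapped to an assumed HW codebook grid with spacing $2\pi/(N_{HW}N)$.

```python
import math


def track_beam(psi_path, psi_init, estimate_error, grid_step):
    """Update the steering direction once per CSI period from one error estimate."""
    psi_steer = psi_init
    steer_log = []
    for psi_true in psi_path:
        e_hat = estimate_error(psi_true, psi_steer)
        # Compensate the error, then snap to the nearest HW codebook direction.
        psi_steer = round((psi_steer + e_hat) / grid_step) * grid_step
        steer_log.append(psi_steer)
    return steer_log


def ideal_estimator(psi_true, psi_steer):
    """Idealized stand-in that returns the exact residual error (assumption)."""
    return psi_true - psi_steer


n_ant, n_hw = 16, 4
grid_step = 2 * math.pi / (n_hw * n_ant)
psi_path = [0.20 + 0.01 * t for t in range(20)]  # slowly drifting spatial frequency
steer_log = track_beam(psi_path, 0.25, ideal_estimator, grid_step)
```

In this idealized setting the steering direction stays within half a grid step of the drifting true direction at every CSI period, using a single measurement per period.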

2) RECEIVER STRUCTURE
a: MONOPULSE RATIO
We first explain the monopulse ratio, and then, the unique receiver structure to make the discriminating features feasible using the CP. Finally, we present the monopulse ratio based AoA/AoD error estimators built in the receiver structure. Some subscripts (subscripts u, 1) in the following equations are omitted for brevity.
We briefly explain the monopulse ratio [52], [60] for the ULA type. For a target AoA $\hat{\psi}^r$, the sum beam, which always belongs to the DFT-based HW codebook, is denoted as $\mathbf{w}_{sum} = \mathbf{a}_r(\hat{\psi}^r) \in \mathcal{W}_{HW}$, and the difference beam, which is used only for estimating the AoA error, is denoted as $\mathbf{w}_{diff}$. The monopulse ratio, plotted in Fig. 6 together with the absolute values of the sum and difference beam outputs $g_s$ and $g_d$, is defined from the ratio of the difference beam output to the sum beam output, where the AoA error is $e_{\psi^r} = \psi^r - \hat{\psi}^r$ for a true AoA $\psi^r$. Note that the monopulse ratio can be considered a near-linear approximation of $e_{\psi^r}$ when $|e_{\psi^r}| \leq \pi/N_r$.
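For the UPA, the sum beam for the target AoA $(\hat{\psi}^{r,x}, \hat{\psi}^{r,y})$ is $\mathbf{w}_{sum} = \mathbf{w}^x_{sum} \otimes \mathbf{w}^y_{sum} = \mathbf{a}_r(\hat{\psi}^{r,x}) \otimes \mathbf{a}_r(\hat{\psi}^{r,y})$. The difference beam for the spatial frequency $x$ or $y$ is $\mathbf{w}^x_{diff}$ or $\mathbf{w}^y_{diff}$, respectively. The AoA errors are defined as $(e_{\psi^{r,x}}, e_{\psi^{r,y}}) = (\psi^{r,x} - \hat{\psi}^{r,x}, \psi^{r,y} - \hat{\psi}^{r,y})$ for true AoAs $(\psi^{r,x}, \psi^{r,y})$.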
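The monopulse principle for the ULA can be illustrated numerically. The sketch below is not the paper's exact definition: it assumes a ULA response with entries $e^{jn\psi}/\sqrt{N}$ and a classic split-aperture difference beam obtained by flipping the sign of the second half of the sum beam (a phase shift of $\pi$, which a quantized phase shifter can realize). For that particular pair, the ratio of difference to sum outputs equals $-j\tan(N e/4)$, so the error can be recovered in closed form.

```python
import cmath
import math


def steering_vector(psi, n_ant):
    """Assumed ULA response: entries exp(j*n*psi)/sqrt(n_ant)."""
    return [cmath.exp(1j * n * psi) / math.sqrt(n_ant) for n in range(n_ant)]


def difference_beam(w_sum):
    """Assumed difference beam: sign flip on the second half of the sum beam."""
    half = len(w_sum) // 2
    return [w if i < half else -w for i, w in enumerate(w_sum)]


def beam_output(w, a):
    """Combiner output w^* a."""
    return sum(wi.conjugate() * ai for wi, ai in zip(w, a))


def monopulse_ratio(psi_true, psi_target, n_ant):
    """Ratio of difference beam output to sum beam output (n_ant even)."""
    a = steering_vector(psi_true, n_ant)
    w_sum = steering_vector(psi_target, n_ant)
    return beam_output(difference_beam(w_sum), a) / beam_output(w_sum, a)


def error_from_ratio(ratio, n_ant):
    """Invert ratio = -j*tan(n_ant*e/4), valid for |e| < 2*pi/n_ant."""
    return -(4.0 / n_ant) * math.atan(ratio.imag)


n_ant = 16
ratio = monopulse_ratio(psi_true=0.55, psi_target=0.50, n_ant=n_ant)
e_hat = error_from_ratio(ratio, n_ant)  # recovers the 0.05 rad offset
```

The difference beam has a null at boresight, so the ratio vanishes when the target AoA equals the true AoA and grows nearly linearly for small errors.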

b: RECEIVER OPERATION
In Fig. 7, the MS receiver structure is designed to make the protocol of the proposed solution feasible, assuming the ULA type. The beam scheduling module schedules the target MS beam to test in an OFDM block. The beam testing module then configures the target AoA $\hat{\psi}^r$ in the RF beam setter and gathers an AoA error report $\hat{e}_{\psi^r}$ from the monopulse ratio based AoA error estimator and the reference signal received power (RSRP) from the OFDM demodulator. The RF beam setter configures the difference beam $\mathbf{w}_{diff}$ to the phase shifters in the CP duration, and subsequently, it switches to the sum beam $\mathbf{w}_{sum}$ in the OFDM symbol duration. Using the difference beam and sum beam outputs, the monopulse ratio based AoA error estimator computes the estimate $\hat{e}_{\psi^r}$. The OFDM demodulator demodulates the OFDM symbol and reports the RSRP to the beam testing module. The RSRP can be used for best beam pair selection. When the OFDM block is not used for AoA error estimation, the RF beam setter holds the sum beam in the entire OFDM block.

c: MONOPULSE RATIO BASED AOA ERROR ESTIMATOR
In Fig. 7, the received signal at the antennas of the $u$-th MS is modeled with a sample index $n$ for $n = 0, \ldots, N-1$ as $\mathbf{r}_u(n) = \sum_{d=0}^{D-1} \mathbf{H}_{u,d}\mathbf{v}_u x_b(n-d) + \mathbf{n}_u(n)$, where it is assumed that the BS uses only one RF chain and its RF precoder $\mathbf{v}_u \in \mathcal{F}$ for broadcasting the reference signals $x_b(n)$ with the full bandwidth. After analog combining, the input signal $y_u(n)$ of the AoA error estimator in Fig. 8 is denoted as $y^{diff}_u(n) = \mathbf{w}^*_{diff}\mathbf{r}_u(n)$ during the CP interval and $y^{sum}_u(n) = \mathbf{w}^*_{sum}\mathbf{r}_u(n+N)$ during the last $N_g$ samples in the current OFDM symbol. In the expansion of these signals, $\mathbf{n}^N_u(n) = \mathbf{n}_u(n+N)$, and $\{x_b(n)\}_{n=-1}^{-D+1}$ in (22) are the last $D-1$ samples of the previous OFDM symbol.
First, by taking a cross-correlation with the pseudo-random reference signal sequences $\{x_b(n)\}$, the least square estimates of the $d$-th tap of the time domain BB channel seen through the difference beam and the sum beam (over $W$ taps, where $W$ denotes the channel estimator tap length and is a receiver design parameter) can be obtained as in [62]. Here, the cross-correlation window length $P$ can be appropriately selected to suppress the IBI and the MPI from the previous and its own OFDM symbols as well as the AWGN; the effect of the selected $P$ on the MSE of MS AoA estimation is discussed in Section V.
Then, the monopulse ratio estimate $R$ is computed from the difference beam and sum beam channel estimates at the dominant tap index $d$. A detailed explanation of how (26) can estimate the monopulse ratio, and hence the residual AoA error w.r.t. the best receive beam steering direction, is provided in Appendix A. For UPA, the monopulse ratio estimate $R_x$ or $R_y$ is computed with the difference beam $\mathbf{w}^x_{diff}$ or $\mathbf{w}^y_{diff}$, respectively.
Finally, based on $R$ or $(R_x, R_y)$, we can estimate the AoA error of the target MS beam for ULA or UPA as $\hat{e}_{\psi^r}$ or $(\hat{e}_{\psi^{r,x}}, \hat{e}_{\psi^{r,y}})$, respectively. Note that the proposed AoA error estimator at the MS can be easily applied to a sub-connected architecture with conventional fixed subarrays. The only difference is a reduced array gain compared to the fully connected architecture. For the dynamic subarray case, however, the situation is more complicated. The synthesis of the sum or difference beam under the dynamic subarray cannot utilize a simple closed-form analytic solution and requires a complex numerical computation [63]-[65]. Nevertheless, assuming such a numerical method is available for providing sum and difference beams satisfying both (20) and (21), the proposed AoA error estimation can be applied to the dynamic subarray case in a straightforward manner.
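The estimator steps above (difference beam during the CP, sum beam during the last $N_g$ samples, correlation with the reference, and ratio) can be sketched in a strongly simplified setting. The assumptions are ours, not the paper's: a single noise-free path with one channel tap, a time domain QPSK reference, a correlation window $P = N_g$, a ULA response with entries $e^{jn\psi}/\sqrt{N}$, and a split-aperture difference beam with the closed-form inversion $e = -(4/N)\arctan(\mathrm{Im}\,R)$ that applies to that pair.

```python
import cmath
import math
import random


def steering_vector(psi, n_ant):
    """Assumed ULA response: entries exp(j*n*psi)/sqrt(n_ant)."""
    return [cmath.exp(1j * n * psi) / math.sqrt(n_ant) for n in range(n_ant)]


def combine(w, r):
    """Analog combiner output w^* r."""
    return sum(wi.conjugate() * ri for wi, ri in zip(w, r))


def estimate_aoa_error(psi_true, psi_target, n_ant, n_fft, n_g, seed=0):
    rng = random.Random(seed)
    qpsk = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
    x = [rng.choice(qpsk) / math.sqrt(2) for _ in range(n_fft)]  # reference symbol
    block = x[n_fft - n_g:] + x                                  # CP + symbol
    a = steering_vector(psi_true, n_ant)                         # single-path channel
    w_sum = steering_vector(psi_target, n_ant)
    w_diff = [w if i < n_ant // 2 else -w for i, w in enumerate(w_sum)]
    # The RF beam setter holds the difference beam in the CP, then the sum beam.
    y_diff = [combine(w_diff, [ai * s for ai in a]) for s in block[:n_g]]
    y_sum = [combine(w_sum, [ai * s for ai in a]) for s in block[n_g:]]
    # The CP and the last n_g symbol samples carry the same reference samples.
    ref = x[n_fft - n_g:]
    h_diff = sum(y * s.conjugate() for y, s in zip(y_diff, ref))
    h_sum = sum(y * s.conjugate() for y, s in zip(y_sum[n_fft - n_g:], ref))
    ratio = h_diff / h_sum
    return -(4.0 / n_ant) * math.atan(ratio.imag)


e_hat = estimate_aoa_error(psi_true=0.60, psi_target=0.50, n_ant=16, n_fft=64, n_g=16)
```

Because the CP repeats the tail of the symbol, the two correlations share the same reference energy and their ratio isolates the beam responses. With noise, IBI, and multipath, the choice of window length and dominant tap discussed in the text becomes important.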
Remark 1: The monopulse ratio based AoD estimator at the MS can be designed using the same principle. For the AoD ψ t of the target BS beam assuming the ULA type, the BS applies f diff = DiffBeam(f sum ) during the CP interval and then switches to f sum = a t (ψ t ) during the OFDM symbol interval. The MS keeps the RF combiner as the current best MS beam (i.e., a r (ψ r )) in the entire OFDM block.
Remark 2: The monopulse ratio based AoD estimator at BS is designed using TDD channel reciprocity.
Remark 3: Although not explicitly shown in this paper because of space limitations, we can see from the simulation results that we can improve the performance of AoA error estimation in the low SNR region by adopting a moving average filter employing several OFDM symbols in the case where the AoA does not vary much during the averaging window.
Remark 4: In this paper, the phase settling time of the analog phase shifter is assumed to be negligible compared to the CP duration, as reported in [66].

3) ALGORITHM
We present the algorithm of the proposed solution for the two-stage multiuser hybrid precoding in Algorithm 1 for mmWave wideband OFDM systems. In Algorithm 1, the BS updates the RF precoders using the reported AoD errors, and the MSs update the RF combiners using the AoA errors. In the input, we state the relationship between the RF beam search codebooks $\mathcal{F}$, $\mathcal{W}$ and the finer steering RF beam HW codebooks $\mathcal{F}_{HW}$, $\mathcal{W}_{HW}$. In the first stage, the BS and MSs estimate and compensate for the initial AoA/AoD error of the best beam pair found by beam sweeping to set up accurate directional links. In the second stage, the BS and MSs require a considerably reduced number $N_{CSI}$ of CSI-RS symbols for beam tracking and BB CSI estimation because, unlike the conventional solution, our solution does not need to monitor adjacent candidate beam pairs. As for the multiuser BB precoding in Algorithm 1, our solution can accommodate any multiuser precoding such as ZF precoding [13], [46], [49], [59], MMSE precoding [43], and other enhanced precodings in [42]. For illustrative purposes, the ZF precoding is used in Section V.
Remark 5: If N CSI = 1 and UL beam sweeping plus AoA/AoD error compensation are removed in Algorithm 1, we get the conventional solution as in [13], [40].
Remark 6: The computational cost of the proposed solution for the downlink beam sweeping is $D \triangleq 4 \times N \times |\mathcal{W}| \times |\mathcal{F}|$ real multiplications; this is common with the conventional solution. In addition, the proposed AoA error estimation in the first stage needs $8 \times P \times W + 2 \times W + 6$ real multiplications by treating one real division as 4 real multiplications as in [67]. Note that the operation in (28) is neglected because it can be implemented using a look-up table. The increased cost $C$ is bounded above as $C \le 9 \times P \times W \le 9 \times N_g^2$, and $C$ is far less (e.g., about 2%) than the cost $D$ when considering typical scenarios (e.g., $N_g = 144$, $N = 2048$, $|\mathcal{F}| = 64$, and $|\mathcal{W}| = 16$ [13], [57]). Thus, we can see that the computational cost of the proposed solution is comparable to that of the conventional solution.
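The 2% figure in Remark 6 can be checked with a few lines of arithmetic. The sketch below assumes the worst case $P = W = N_g$, which is what the bound $9 N_g^2$ implies; the function names are illustrative only.

```python
def sweep_cost(n_fft, n_w, n_f):
    """Real multiplications for downlink beam sweeping: D = 4 * N * |W| * |F|."""
    return 4 * n_fft * n_w * n_f

def estimator_cost(p, w):
    """Real multiplications for the AoA error estimator: 8*P*W + 2*W + 6."""
    return 8 * p * w + 2 * w + 6

# Typical scenario quoted in Remark 6.
n_g, n_fft, n_f, n_w = 144, 2048, 64, 16

d_cost = sweep_cost(n_fft, n_w, n_f)
c_worst = estimator_cost(n_g, n_g)   # assumption: P = W = N_g (worst case)
c_bound = 9 * n_g ** 2               # upper bound 9 * N_g^2

print(f"D = {d_cost}, C (worst case) = {c_worst}, bound = {c_bound}")
print(f"bound / D = {100 * c_bound / d_cost:.2f} %")
```

With these values the bound is roughly 2.2% of $D$, in line with the "e.g., 2%" quoted above.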
The AoA error compensation routine in Algorithm 1 can be explained as follows. First, we assume that for the target beam pair $(g_u, v_u) = (a_r(\psi_r), a_t(\psi_t))$ or $(g_u, v_u) = (a_r(\psi_{r,x}, \psi_{r,y}), a_t(\psi_{t,x}, \psi_{t,y}))$, the AoA error estimate $\hat{e}_{\psi_r}$ or $(\hat{e}_{\psi_{r,x}}, \hat{e}_{\psi_{r,y}})$ is available. Subsequently, the AoA error is compensated for by setting the RF combiner for ULA or UPA as

$$w_u = a_r\big(Q_{MS}(\psi_r + \hat{e}_{\psi_r})\big), \ \text{or} \tag{30}$$

$$w_u = a_r\big(Q_{MS}(\psi_{r,x} + \hat{e}_{\psi_{r,x}}),\ Q_{MS}(\psi_{r,y} + \hat{e}_{\psi_{r,y}})\big), \tag{31}$$

where the return value of the phase quantization function $Q_{MS}(\cdot)$ belongs to the set $S_{MS}$ to meet $w_u \in \mathcal{W}_{HW}$. The AoD error compensation routine in Algorithm 1 is performed using the AoD error (measured at the BS or reported from the MSs) for ULA or UPA as

$$f_u^{RF} = a_t\big(Q_{BS}(\psi_t + \hat{e}_{\psi_t})\big), \ \text{or} \tag{32}$$

$$f_u^{RF} = a_t\big(Q_{BS}(\psi_{t,x} + \hat{e}_{\psi_{t,x}}),\ Q_{BS}(\psi_{t,y} + \hat{e}_{\psi_{t,y}})\big). \tag{33}$$
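As a rough illustration of the ULA compensation step in (30), the sketch below quantizes the corrected steering angle and compares the array gain before and after compensation. The set $S_{MS}$ is not specified in this section, so a uniform grid of $2^{B_{MS}}$ phases over $[-\pi, \pi)$ is assumed, and the error estimate is taken to be perfect purely for illustration. The helper names are hypothetical.

```python
import numpy as np

def steering(psi, n_ant):
    """Unit-norm ULA steering vector for spatial frequency psi (radians)."""
    return np.exp(1j * psi * np.arange(n_ant)) / np.sqrt(n_ant)

def quantize_phase(psi, n_bits):
    """Assumed Q_MS: nearest multiple of 2*pi / 2**n_bits, wrapped to [-pi, pi)."""
    step = 2 * np.pi / 2 ** n_bits
    q = np.round(psi / step) * step
    return (q + np.pi) % (2 * np.pi) - np.pi

n_r = 4
b_ms = int(np.log2(n_r)) + 3             # bits per phase shifter (log2(N_r) + 3)
psi_beam = 2 * np.pi * 1 / n_r           # direction of the best codebook beam
e_true = 0.37 * np.pi / n_r              # residual AoA error after beam sweeping
e_hat = e_true                           # assumption: perfect error estimate

arrival = steering(psi_beam + e_true, n_r)
w_before = steering(psi_beam, n_r)
psi_comp = quantize_phase(psi_beam + e_hat, b_ms)
w_after = steering(psi_comp, n_r)        # compensated combiner as in (30)

gain_before = abs(np.vdot(w_before, arrival)) ** 2
gain_after = abs(np.vdot(w_after, arrival)) ** 2
print(f"normalized gain before: {gain_before:.3f}, after: {gain_after:.3f}")
```

The remaining loss after compensation comes only from the finite phase resolution, which is bounded by half a quantization step.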

IV. PERFORMANCE ANALYSIS
To verify that the proposed solution has more accurate beam acquisition, we evaluate the amount of gain achieved in the RF beam steering array gain and in the rate performance. The reduced sounding overhead in the beam tracking protocol under mobility is demonstrated in the next section.

A. IMPROVEMENT IN ARRAY GAIN
We present a numerical evaluation that shows that loss of the array gain in the conventional solution with the DFT-based RF beam search codebook can be compensated for using properly designed AoA estimators. For simplicity, a single path channel is assumed. For the evaluation, the RF beam search codebook size is set to |W| = N osf × N r and |F| = N osf × N t with an oversampling factor (osf) N osf ≥ 1.
1) CONVENTIONAL SOLUTION WITH N osf = 1
For ULA, the average normalized array gain can be obtained by assuming $e_{\psi_r} \sim U(-\frac{\pi}{N_r}, \frac{\pi}{N_r})$ and the equations in [52] as

$$\frac{N_r}{2\pi}\int_{-\pi/N_r}^{\pi/N_r} \frac{\sin^2\!\big(N_r e_{\psi_r}/2\big)}{N_r^2 \sin^2\!\big(e_{\psi_r}/2\big)}\, de_{\psi_r} \overset{(a)}{\approx} \frac{2}{\pi}\int_{0}^{\pi/2} \frac{\sin^2 t}{t^2}\, dt, \tag{36}$$

where $\sin x \approx x$ for small $x$ and the change of variable $N_r e_{\psi_r}/2 = t$ are used when deriving (a). By numerical integration, (36) amounts to 0.7737 ≈ −1.1 dB. Note that the approximation (36) does not depend on the number of receive antennas. For independent AoD and AoA errors, the total average normalized array gain is approximately −2.2 dB. For UPA, with independent azimuth and elevation AoA errors, the total average normalized array gain is approximately −4.4 dB.

2) CONVENTIONAL SOLUTION WITH N osf > 1
In Table 1, we evaluate the average normalized array gain in the MS for the ULA type for some values of $N_{osf}$ using a numerical integration assuming $e_{\psi_r} \sim U(-\frac{\pi}{N_{osf} N_r}, \frac{\pi}{N_{osf} N_r})$. Note that, for a negligible loss (e.g., less than 0.1 dB), $N_{osf} \ge 4$ is required. However, this causes an $N_{osf}^2 \ge 16$ times increase in the overhead at the first stage, which may not be a practical choice.
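The numerical integration behind the −1.1 dB figure and the $N_{osf} \ge 4$ requirement can be reproduced with a short sketch. It uses the small-angle form, i.e., the average of $(\sin t / t)^2$ with $t = N_r e_{\psi_r}/2$ uniform on $(-\frac{\pi}{2 N_{osf}}, \frac{\pi}{2 N_{osf}})$, so the result is independent of $N_r$. The function names and the midpoint-rule grid size are arbitrary choices, and the printed values are not claimed to reproduce Table 1 digit for digit.

```python
import numpy as np

def avg_gain_small_angle(n_osf, n_points=200_000):
    """Mean of (sin t / t)^2 for t uniform on (-pi/(2*n_osf), pi/(2*n_osf)).

    Midpoint rule on an even number of cells, so t = 0 is never sampled.
    """
    half_width = np.pi / (2 * n_osf)
    t = (np.arange(n_points) + 0.5) / n_points * 2 * half_width - half_width
    return np.mean(np.sinc(t / np.pi) ** 2)   # np.sinc(x) = sin(pi*x)/(pi*x)

def to_db(x):
    return 10 * np.log10(x)

gains_db = {n_osf: to_db(avg_gain_small_angle(n_osf)) for n_osf in (1, 2, 4, 8)}
for n_osf, g_db in gains_db.items():
    print(f"N_osf = {n_osf}: average normalized array gain = {g_db:.2f} dB")
```

Under this approximation the loss per angle drops below 0.1 dB at $N_{osf} = 4$ but not at $N_{osf} = 2$, consistent with the statement above.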

3) PROPOSED SOLUTION
If we assume that the residual AoA error estimate $\hat{e}_{\psi_r}$, when $e_{\psi_r} = 0$, has a normal distribution with mean 0 and variance $\sigma^2_{\hat{e}_{\psi_r}}$, i.e., $\hat{e}_{\psi_r} \sim \mathcal{N}(0, \sigma^2_{\hat{e}_{\psi_r}})$, then the average normalized array gain with the residual AoA error is obtained by averaging the normalized array gain over this distribution, which is numerically evaluated and summarized in Table 2. For a negligible performance loss (e.g., less than 0.1 dB), the required RMSE needs to be in the range of $\sigma_{\hat{e}_{\psi_r}} \le 2^{-3}\frac{\pi}{N_r}$. Note that similar results were already reported in [68].
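The RMSE requirement can be sanity-checked numerically. The sketch below is not the paper's exact expression, which is not reproduced here. It assumes the same small-angle gain $(\sin t / t)^2$ with $t = N_r \hat{e}_{\psi_r}/2$ and evaluates the Gaussian expectation with Gauss-Hermite quadrature. The argument `sigma_over_beam` is a hypothetical name for $\sigma_{\hat{e}_{\psi_r}} N_r / \pi$.

```python
import numpy as np

def avg_gain_gaussian(sigma_over_beam, n_nodes=64):
    """E[(sin t / t)^2] for t = N_r * e / 2 with e ~ N(0, sigma^2),
    where sigma = sigma_over_beam * pi / N_r (the result is N_r-free)."""
    s_t = sigma_over_beam * np.pi / 2                 # std of t
    x, w = np.polynomial.hermite.hermgauss(n_nodes)   # weight exp(-x^2)
    t = np.sqrt(2.0) * s_t * x
    return np.sum(w * np.sinc(t / np.pi) ** 2) / np.sqrt(np.pi)

for exponent in (-1, -2, -3, -4):
    gain = avg_gain_gaussian(2.0 ** exponent)
    print(f"sigma = 2^{exponent} * pi/N_r: {10 * np.log10(gain):.3f} dB")
```

Under these assumptions the loss is below 0.1 dB at $\sigma_{\hat{e}_{\psi_r}} = 2^{-3}\pi/N_r$ and above it at $2^{-2}\pi/N_r$, which matches the threshold stated above.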

B. RATE GAP ANALYSIS
The rate gap of the proposed solution for the two-stage hybrid precoding is defined as the difference between the rate R u achieved by the infinite sized RF/BB codebook and the rate R Q u with finite sized codebooks as in [13]. We follow the rate gap analysis for single-path (N p = 1) mmWave channels as in [13] to confirm the performance enhancement of the proposed solution.
The achievable rate of user $u$, assuming exact AoD/AoA matched RF beam steering with an infinitely large RF/BB codebook, an infinitely large number of bits in the phase shifters, and perfect inter-user interference cancellation by ZF precoding, is given as where $\mathrm{SNR} = P/\sigma^2$ and the $u$-th MS effective channel $h_u^*[k]$, assuming a single-path mmWave channel ($N_p = 1$) and ULA structure, can be written for each subcarrier $k$ as The achievable rate of user $u$ assuming imperfect inter-user interference cancellation because of the finite size RF/BB codebooks is defined as Here, the $u$-th MS effective BB channel $h_u^{Q*}[k]$ can be represented for the ULA type for subcarrier $k$ as where $\hat{\psi}^t_{u,1}$ and $\hat{\psi}^r_{u,1}$ are the estimated AoD and AoA, respectively.
Then, the average rate loss upper bound in terms of the normalized array gains after fixing the RF precoder/combiner for a single path channel can be obtained as (see [13]) where the first term of the rate loss upper bound has an inverse power of the normalized array gains at the BS and MS. This shows that the array gain loss reduction in the first stage is transferred to the rate enhancement of the two-stage multiuser hybrid precoding for wideband OFDM systems as proved for narrowband systems in [13].

V. SIMULATION RESULTS
In this section, we perform extensive numerical simulations to compare the proposed solution with the conventional solution that has nonoversampled (N osf = 1) or oversampled (N osf = 8) RF beam search codebook.

A. MORE ACCURATE BEAM ACQUISITION
We randomly generate the channel matrix for each user, and we perform DL/UL beam sweeping during the first stage. The channel is not changed at the transition from the first stage to the second stage. The parameter $N_{CSI}$ for beam tracking is set to 1 as beam tracking is not required in the second stage. Therefore, only BB CSI estimation is performed in CSI Acquisition. After the RF precoders/combiners and BB precoders are determined, the average achievable rate per user $E\big[\frac{1}{U}\sum_{u=1}^{U} R_u\big]$ is used as a metric for the performance comparison. Here, the rate $R_u$ is evaluated from the signal-to-interference-noise ratio (SINR) for each subcarrier $k$. Note that the overhead for beam acquisition and beam tracking is not considered in the metric of performance comparison. The simulation parameters are as follows:
• A BS with $N_t = 64$ (8 × 8 UPA) and 2 RF chains.
• All MSs in the system have $N_r = 16$ (4 × 4 UPA), and $U$ MSs are assumed to be randomly selected out of all MSs in the first stage.
• The carrier frequency is 28 GHz and the system RF bandwidth is $B = 491.52$ MHz for the subcarrier spacing $\Delta f = 120$ kHz. The system sampling time is $T_s = 1/(N \Delta f) \approx 2.0345$ ns. We adopt a raised cosine filter with a roll-off factor of 0.1 for the pulse shaping filter $p(t)$ [69], and the OFDM parameters are set to $N = 4096$ (8.33 µs) and $N_g = 288$ (0.59 µs) [57].
• The channel models used are the single-path, single-cluster multiray, and 4-cluster multiray models [54]. Each cluster has the same delay (i.e., the rays have the same delay in a cluster) [25], [70], [71]. The maximum allowable excess channel delay $D - 1$ is set to $N_g - 1$, and the cluster delays are set to be uniformly distributed in $[0, (D-1)T_s)$. The azimuth/elevation mean AoA/AoD of the cluster are assumed to be uniformly distributed in $[-\pi, \pi)$ / $[-\pi/2, \pi/2)$, and the rays per cluster are set to have a Laplacian angular distribution around the mean angle of the cluster with an angular spread of 10 degrees. The channel parameters are generated such that the strongest cluster resides in the first half of the CP duration and the AoA/AoD estimates of the first-stage beam sweeping with $N_{osf} = 1$ are in the neighborhood (i.e., half of the beam width) of the finer ($N_{osf} = 8$) first-stage beam sweeping results. Note that the IBI caused by the previous OFDM symbol is considered in the DL/UL angle error estimation; MUI is also considered in the UL angle error estimation. The number of random channel realizations is $10^4$.
• The channel estimation correlation length $P$ can be 32, 64, and 128. The channel estimator length $W$ is set to $N_g$, because the MS receiver does not know the number of channel paths and associated delays.
• The SNR in the figures is the user-wise SNR, $\mathrm{SNR}_{user} = P/(U\sigma^2)$.
• The number of bits for the analog phase shifters is set as $B_{BS} = \log_2(\max(N_{t,x}, N_{t,y})) + 3$ and $B_{MS} = \log_2(\max(N_{r,x}, N_{r,y})) + 3$. Thus, $N_{HW}$ becomes $2^3$.
• The ZF precoding [13] is used for the multiuser BB precoding on each subcarrier $k$.

As a fundamental performance for multicluster channels with $N_{ray} = 1$, Fig. 9 shows an improved normalized RMSE of the proposed AoA estimation compared to the conventional solution according to the number of channel clusters $N_{cl}$, the presence of IBI, and the channel estimation correlation length $P$. Here, the normalized RMSE is obtained from the difference between the azimuth AoA of the strongest cluster and the AoA estimate, and the conventional solution with $N_{osf} = 1$ shows a relatively larger normalized RMSE, i.e., $\big(\int_{-\pi/N_r}^{\pi/N_r} (e_{\psi_r})^2 \frac{N_r}{2\pi}\, de_{\psi_r}\big)^{1/2} / (\pi/N_r) = 1/\sqrt{3}$. The MPI caused by the increased number of channel clusters and the IBI resulting from the previous symbol worsen the RMSE of the proposed solution. However, we observe that as larger values of $P$ are used, the RMSE can approach the performance where the ideal (i.e., noise-free) sum beam and difference beam BB channel estimates are used for the monopulse ratio algorithm. In a noise-limited region as shown in Fig. 10, the gain of using a large $P$ can be seen more clearly; therefore, we set $P = 128$ as a default value in the following tests. From the above observations, it is confirmed that the CP can be successfully utilized for AoA error estimation considering the effects caused by typical IBI and MPI scenarios.
As a next step, we evaluate the SE for the proposed solution. In Fig. 11, for a single user analog-only beamforming case in the single path channel, the performance gain of the proposed solution over the conventional solution using a nonoversampled RF beam search codebook is almost the same as that obtained from the analysis in subsection IV-A. As another example, we show the performance of the system employing ULA ($N_t = 8$ and $N_r = 4$) instead of UPA. As already explained in IV-A, the performance gain is about 2.2 dB lower than the gain obtained by using UPA. In Fig. 12, even for a wideband multicluster channel (4 clusters, 20 rays/cluster), the proposed solution has nearly comparable performance to the conventional solution using an oversampled RF beam search codebook. As an upper bound of all other solutions, the achievable rate of the full digital precoding obtained from a genie-aided antenna-level channel is plotted. Note that RF beamsteering to the strongest ray may not be better than RF beamsteering with the nonoversampled RF beam search codebook, especially when the channel has nonuniformly (e.g., Laplacian) distributed rays per cluster.
In Fig. 13, for multiuser hybrid precoding in the wideband multicluster channel (4 clusters, 20 rays/cluster), we compare the achievable rates of several solutions. It is observed that the proposed solution has comparable performance to the conventional solution using a large oversampled RF beam search codebook, and it shows about 4 dB performance gain over the conventional solution using a nonoversampled codebook. In the figure, we plot the performance of the analog-only beam steering to show that the multiuser interference in a high SNR regime causes considerable performance degradation if not properly cancelled. For comparison, the performance of the full digital block diagonalization performed per subcarrier in [13], [59] and the single user performance with perfect inter-user interference cancellation are plotted.

In Fig. 14, with emphasis on the number of limited feedback bits for the BB precoders, we evaluate the proposed solution for a wide range of sizes for the BB precoder random vector quantization codebook. The simulation results show that more than a certain number of BB CSI quantization bits are required to maintain the hybrid precoding gain over the analog-only beam steering.

B. LOWER SOUNDING OVERHEAD BEAM TRACKING
Although it is assumed that the channel does not vary during the time of the first stage beam sweeping or one CSI period, we may check the performance in a practical time-varying channel. Here, the channel parameters and matrix $H_{u,d}$ for a single cluster with 20 rays are assumed to be updated as a linear Gauss-Markov process in every $(N_g + N)T_s$ (i.e., at every OFDM symbol boundary) as $\phi^t_{u,\ell} \leftarrow \phi^t_{u,\ell} + \delta\pi/180$, $\theta^t_{u,\ell} \leftarrow \theta^t_{u,\ell} + \delta\pi/180$, where $\eta \sim \mathcal{N}(0, 1)$ denotes a per-ray innovation random variable and $\delta = 0.1$ degree. Here, the correlation coefficient $\rho$ is defined through $J_0(\cdot)$, the zeroth-order Bessel function of the first kind. Fig. 16 for every OFDM symbol. Note that these cases are actually impractical and used only for comparison. From Fig. 15, it is confirmed that our proposed solution tracks near the best beam pair quite well even in a practical fast time-varying channel with a far lower sounding overhead.

As a demonstration, rather than relying on simulations under a geometric statistical channel model, we adopt a ray-tracing tool to show that our proposed solution performs well under more realistic time-varying channel parameters (angles, powers, and relative delays) for a moving MS scenario in a specific site, following trends in [11], [12], [72], [73]. In Fig. 17, we show a map of Rosslyn as a test site in the commercial ray-tracing tool called Wireless InSite [74]. We set each BS antenna element gain to be directional with a 120 degree half-power beamwidth and 180 degree first null, and the MS antenna element gain to be isotropic. The BS transmit power is set to 45 dBm. The height of the BS is 5 m, and that of the MS is 2 m. The velocity of the MS is set to 20 m/s, and the distance between the MS trajectory points is set to 0.2 m (0.2 m for every 10 ms). The carrier frequency is 28 GHz, and the bandwidth used is 491.52 MHz. The other parameters are all the same as those used in the previous subsection.
We construct the channel with Doppler effects using the channel parameters (power and azimuth/elevation AoD and AoA of each path) from the ray-tracing tool. The sounding overhead for the proposed solution is $N_{CSI} = 4$ symbols per CSI period (= 10 ms, 1121 symbols), whereas the conventional solution with the nonoversampled RF beam search codebook has an overhead of $N_{CSI} = 81 = 9 \times 9$ symbols, as shown in Fig. 16. We also simulate the conventional solution with the oversampled RF beam search codebook, which has a sounding overhead of $N_{CSI} = 6561 = 81 \times 81$ symbols as shown in Fig. 16. This is an infeasible number in one CSI period; however, it is implemented as an offline search for performance comparison. The received signal power after the RF combiner is plotted for each solution in Fig. 18, which shows that the beam tracking performance of the proposed solution is comparable to that of the conventional solution using an oversampled RF beam search codebook, with a significantly lower sounding overhead in the beam tracking protocol.
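The overhead figures above follow directly from the OFDM numerology of Section V-A ($N = 4096$, $N_g = 288$, subcarrier spacing 120 kHz). A small sketch, with variable names chosen only for illustration, recomputes the number of OFDM symbols per 10 ms CSI period and the fraction spent on sounding by each scheme.

```python
n_fft, n_g = 4096, 288          # FFT size and CP length (samples)
delta_f = 120e3                 # subcarrier spacing (Hz)
csi_period = 10e-3              # CSI period (s)

t_s = 1.0 / (n_fft * delta_f)               # sampling time, about 2.0345 ns
t_symbol = (n_fft + n_g) * t_s              # OFDM symbol duration incl. CP
symbols_per_period = int(csi_period / t_symbol)

# Sounding symbols per CSI period quoted in the text.
n_csi = {
    "proposed": 4,
    "conventional (nonoversampled)": 81,
    "conventional (oversampled)": 6561,
}

print(f"T_s = {t_s * 1e9:.4f} ns, symbols per CSI period = {symbols_per_period}")
for name, n in n_csi.items():
    ratio = n / symbols_per_period
    note = " (exceeds one CSI period)" if n > symbols_per_period else ""
    print(f"{name}: {n} symbols, {100 * ratio:.2f} % of the period{note}")
```

The proposed scheme spends well under 1% of each CSI period on sounding, the nonoversampled conventional scheme about 7%, and the oversampled one needs more symbols than a period contains.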

VI. CONCLUSION
We proposed a novel solution for the two-stage multiuser hybrid precoding in mmWave OFDM systems such as 5G NR, which can achieve more accurate beam acquisition and lower sounding overheads in protocols than the conventional solution even for an MS with a single RF chain. To support the proposed two-stage protocol with lower overheads, a novel receiver design employing AoA/AoD error estimators based on the monopulse ratio was proposed. We numerically evaluated the expected improvement in both the average normalized array gain and the rate gap, and we confirmed it via computer simulations under various channel conditions. In addition, as a practical demonstration, we used a ray-tracing tool to confirm that the proposed solution works well in a realistic scenario. In future work, the proposed solution will be extended to the case of using multiple RF chains at MSs in multicell environments.

APPENDIX A ESTIMATION OF THE AOA ERROR
We present our reasoning on how (26) can be used for the residual AoA error estimation in the following cases. For simplicity, we assume that the BS has a fully connected architecture with the ULA type and the delay of each cluster has sample-spaced values. Note that the extension of the proof to a sub-connected architecture case is straightforward.
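Second, for a single-cluster multiray channel ($N_p = N_{ray}$) with $\tau_{u,\ell} = 0, \forall \ell$, the time-domain channel matrix becomes $H_{u,0} = \sum_{\ell=1}^{N_p} \alpha_{u,\ell}\, a_r(\psi^r_{u,\ell})\, a_t^*(\psi^t_{u,\ell})$ for $d = 0$. Again, the goal is to find the best AoA $\psi^r_{u,\max}$ such that

$$\psi^r_{u,\max} = \arg\max_{|\psi^r_u - \psi^r_{u,\mathrm{ini}}| < \pi/N_r} \big|a_r^*(\psi^r_u)\, H_{u,0}\, v_u\big|^2$$

with $v_u = a_t(\psi^t_{u,\mathrm{ini}})$. If we denote $a_t^*(\psi^t_{u,\ell})\, v_u = \gamma_{u,\ell}$, then $H_{u,0} v_u = \sum_{\ell=1}^{N_p} \alpha_{u,\ell}\gamma_{u,\ell}\, a_r(\psi^r_{u,\ell})$. For $S_{nbr} = \{\ell \mid 1 \le \ell \le N_p,\ |\psi^r_{u,\ell} - \psi^r_{u,\max}| < \pi/N_r\}$, we may approximate $H_{u,0} v_u \approx \sum_{\ell \in S_{nbr}} \alpha_{u,\ell}\gamma_{u,\ell}\, a_r(\psi^r_{u,\ell})$, and for $\ell \in S_{nbr}$, we can approximate

$$a_r(\psi^r_{u,\ell}) = a_r(\psi^r_{u,\max} + \epsilon_{u,\ell}) \approx a_r(\psi^r_{u,\max}) + \frac{\partial a_r(\psi)}{\partial \psi}\bigg|_{\psi = \psi^r_{u,\max}} \cdot \epsilon_{u,\ell},$$

where $\epsilon_{u,\ell} = \psi^r_{u,\ell} - \psi^r_{u,\max}$, similarly as in [77]. Then, we obtain

$$H_{u,0} v_u \approx \Big(\sum_{\ell \in S_{nbr}} \alpha_{u,\ell}\gamma_{u,\ell}\Big)\, a_r(\psi^r_{u,\max}) + \Big(\sum_{\ell \in S_{nbr}} \alpha_{u,\ell}\gamma_{u,\ell}\,\epsilon_{u,\ell}\Big)\, \frac{\partial a_r(\psi)}{\partial \psi}\bigg|_{\psi = \psi^r_{u,\max}}.$$

Then, $R = \operatorname{Im}\{\hat{h}_u^{\mathrm{diff}}(0)\,(\hat{h}_u^{\mathrm{sum}}(0))^*\} / |\hat{h}_u^{\mathrm{sum}}(0)|^2$ in (26) can be considered as an approximate ML estimate of the monopulse ratio $r = \frac{w_{\mathrm{diff}}^* H_{u,0} v_u}{w_{\mathrm{sum}}^* H_{u,0} v_u}$, which is similar to that in the first case. However, in this case, $r$ itself may have some bias that can cause an additional MSE compared to the first case even at high SNR. When the channel has multiple clusters with multiple rays, we may assume that there is a dominant channel cluster with its delay tap $d$ so that similar reasoning can be applied. Although some bias may occur because of the approximation as in the second case and further degradation due to MPI is expected in a multicluster case, (26) can successfully estimate the monopulse ratio in typical channel environments as shown in the simulation results.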