Rate-Maximizing Zero-Forcing Hybrid Precoder for MU-MISO-OFDM

Hybrid precoders, consisting of an analog hardware-constrained part operating at radio frequency (RF) and a digital part operating at baseband, reduce the RF implementation complexity and power consumption of multi-antenna transceivers, at the expense of some rate loss compared to an all-digital precoder. The analog and digital parts of the hybrid precoder are commonly designed by performing a constrained matrix decomposition (MD) of the all-digital precoder, which aims to minimize the Euclidean distance between the matrices corresponding to the hybrid and the all-digital precoder. In contrast, in this contribution we determine the zero-forcing (ZF) hybrid precoder that directly maximizes the weighted sum rate (WSR) of a MU-MISO-OFDM communication system, taking into account various hardware constraints on the analog part. The resulting maximum rate serves as a useful benchmark for comparison with other ZF hybrid precoders. In a multi-carrier massive MIMO scenario, the rate-maximizing ZF precoders show a considerable performance advantage over MD-type hybrid precoders, indicating that the latter precoders are far from optimum. This contribution also investigates the trade-off between performance and computational complexity. Because of their iterative nature, the rate-maximizing ZF hybrid precoders pay for their superior performance with a large computational complexity. When this complexity cannot be afforded, one should revert to the MD-type precoders, at the expense of a considerable performance penalty; among the MD-type precoders, the non-iterative ones perform only slightly worse than the iterative ones, but have a significantly smaller computational complexity.


I. INTRODUCTION
In order to exploit the large bandwidths that become available in mm-Wave communication, the large path loss associated with these high frequencies must be overcome by the beamforming gains provided by multi-antenna transceivers. Beamforming at the transmitter (TX) and receiver (RX) is referred to as precoding and combining, respectively. All-digital beamforming is performed at baseband, and offers the largest flexibility: signals can be adjusted in both amplitude and phase. However, this type of beamforming requires one RF chain per antenna element, causing the resulting hardware cost and power consumption of the mm-Wave RF components to become prohibitive when deploying antenna arrays consisting of many elements. To reduce the number of RF chains for a given antenna array, hybrid beamforming has been proposed. The hybrid beamformer, consisting of an analog part (operating at RF) and a digital part (operating at baseband), allows the number of RF chains to be reduced to the number of data streams.
(The associate editor coordinating the review of this manuscript and approving it for publication was Nagendra Prasad Pathak.)
Let $F$ denote the optimum all-digital beamforming matrix, and let $A$ and $B$ denote the analog and digital parts of the hybrid beamformer, so that the hybrid beamforming matrix equals $AB$. Denoting by $N_a$, $K$, and $N_{rf}$ the number of antennas, data streams and RF chains at the TX or at the RX, the dimensions of $F$, $A$, and $B$ in the case of single-carrier (SC) modulation are $N_a \times K$, $N_a \times N_{rf}$, and $N_{rf} \times K$, respectively, with $K \le N_{rf} \le N_a$, so that $F$ is tall. When the elements of $A$ and $B$ are unconstrained, it is clear that $AB$ can be made equal to $F$ for any $N_{rf}$ satisfying $K \le N_{rf}$, in which case the hybrid beamformer provides the same performance as the optimum all-digital beamformer. However, as variable-gain amplifiers operating at RF are expensive and consume much power, one often prefers using a single phase-shifter (SPS) implementation at RF, allowing only phase adjustments; this implies that a unit-magnitude (UM) constraint is imposed on the elements of $A$. Under this UM restriction, $AB$ can still be made equal to $F$ for any $N_{rf}$ satisfying $2K \le N_{rf}$ [8]. When adopting a double phase-shifter (DPS) implementation, where the elements of $A$ are restricted to be the sum of two complex exponentials [6], [8], the condition $AB = F$ can also be met without variable-gain RF amplifiers, for $K \le N_{rf}$. When $K \le N_{rf} < 2K$, the minimization of $\|F - AB\|_F^2$ under the UM constraint gives rise to a non-convex optimization problem, which has no straightforward solution. The authors of [1] make use of orthogonal matching pursuit (OMP), where the columns of $A$ are restricted to the set of array response vectors. Higher rates (at the expense of an increased computational complexity) have been achieved by performing an alternating minimization (AltMin) of $\|F - AB\|_F^2$ with respect to $A$ and $B$, with $A$ belonging to a complex circle manifold [2]. The optimization of $A$ over the complex circle manifold, referred to as manifold optimization (MO), has also been performed in the context of hybrid beamforming for scenarios involving intelligent reflective surfaces [9] and relay networks [10].
Other hybrid beamforming designs under the UM constraint avoid the complexity of manifold optimization by making use of a convexification of the optimization problem [4], [9], [11] or by heuristically applying phase extraction (PE) to determine $A$ [2], [6], [7], [12], [13], [14], [15]. The rate resulting from AltMin manifold optimization in the case of SC modulation turns out to be close to the rate of the all-digital beamformer; therefore, the former rate has been considered as a performance benchmark for UM-constrained hybrid precoders [2].
To further reduce the number of RF components (at the expense of an additional loss in performance), the connectivity of the analog part of the hybrid beamformer can be constrained, i.e., each antenna is connected to only a subset of RF chains [2], [6], [7], [12]. In [7], three different connectivity constraints are defined: full connectivity (FC), group connectivity (GC), and partial connectivity (PC).
In the case of orthogonal frequency division multiplexing (OFDM) modulation, each of the $N_f$ subcarriers requires a separate $N_a \times K$ beamformer, with $K$ denoting the number of data streams per subcarrier. Horizontally stacking the optimum $N_f$ all-digital beamforming matrices yields the $N_a \times KN_f$ matrix $F$; typically, one has $KN_f > N_a$, so that $F$ is wide. With hybrid beamforming, the analog part $A$ of dimension $N_a \times N_{rf}$ is common to all subcarriers. Similarly stacking the $N_f$ hybrid beamforming matrices, one obtains $AB$, where the $N_{rf} \times KN_f$ matrix $B$ contains the digital parts of the hybrid beamformers. The same MD-type approaches as for SC modulation (OMP, optimization over a complex circle manifold, etc.) can be used to design the matrices $A$ and $B$, based on the minimization of $\|F - AB\|_F^2$. However, performance-wise there is a major difference with SC modulation: in the case of OFDM, the matrix $AB$ is rank-deficient when $N_{rf} < N_a$; therefore $AB$ is in general not equal to $F$ for $K \le N_{rf} < N_a$, even when $A$ and $B$ are unconstrained [7], [16]. Because of the rank-deficiency of $AB$, the performance gap between the optimum all-digital beamformer and the hybrid MD-type beamformers with given constraints (DPS or SPS implementation and connectivity constraint) turns out to be considerably larger than that with SC modulation. This observation raises the question of whether the minimization of $\|F - AB\|_F^2$ over the complex circle manifold is still near-optimum in terms of the rate in the case of OFDM transmission.
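The rank argument can be checked numerically. The following NumPy sketch uses random matrices as stand-ins for $F$, $A$, and $B$; the dimensions, the random model, and the helper name `crandn` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N_a, N_rf, K, N_f = 8, 3, 2, 16   # illustrative sizes (assumed), with K*N_f > N_a

def crandn(*shape):
    """Hypothetical helper: i.i.d. circular complex Gaussian entries."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

F = crandn(N_a, K * N_f)   # stand-in for the stacked all-digital beamformers (wide)
A = crandn(N_a, N_rf)      # unconstrained analog part
B = crandn(N_rf, K * N_f)  # stacked digital parts

rank_F = np.linalg.matrix_rank(F)        # generically N_a
rank_AB = np.linalg.matrix_rank(A @ B)   # at most N_rf

# Smallest possible ||F - AB||_F^2 over unconstrained A and B (Eckart-Young):
# the sum of the squared singular values beyond the N_rf largest ones.
s = np.linalg.svd(F, compute_uv=False)
min_residual = np.sum(s[N_rf:] ** 2)

print(rank_F, rank_AB, min_residual > 0)
```

With a wide $F$ of full row rank, the residual stays strictly positive whenever $N_{rf} < N_a$, which is the gap that does not occur for a tall $F$ with $K \le N_{rf}$.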
In [15] a different type of AltMin optimization is performed: the all-digital precoder and combiner are optimized alternatingly between the downlink and uplink until they converge, after which a PE is applied to a low-rank decomposition of the all-digital precoder to find A. Table 1 provides a concise summary of the papers cited above. For each reference, we list the type of precoder used, the communication scenario, the relevant modulation, and the considered hardware constraints that are applied to the analog part. The labels FC and DPS are equivalent to applying no connectivity constraint and no UM constraint, respectively. A comprehensive literature survey on hybrid beamforming can be found in [17].
In this paper, we focus on a MU-MISO-OFDM communication system with hybrid ZF precoding. Our main contributions are the following:
• Whereas in literature the hybrid precoder for the considered scenario has commonly been designed using the above-mentioned MD approach, here we determine the ZF hybrid precoder that directly maximizes the weighted sum rate (WSR), rather than minimizing $\|F - AB\|_F^2$; the WSR maximization is carried out under various hardware constraints on $A$ (i.e., DPS or SPS implementation and connectivity constraint). This WSR-maximizing ZF hybrid precoder will be referred to as 'WSRmax'. To the authors' knowledge, the derivation of this type of precoder is a novel result. The resulting maximized WSR is a useful performance benchmark because it represents an achievable upper bound on the WSR of all ZF hybrid precoders.
• We investigate the spectral efficiency (SE) of the WSRmax precoder, as a function of the signal-to-noise ratio (SNR) and of $N_{rf}$. We make the comparison with the SE of the all-digital ZF precoder and of several MD-type hybrid precoders.
• We compare the computational complexities of the WSRmax precoder, the all-digital ZF precoder, and several MD-type hybrid precoders. An important conclusion is that, for $N_{rf}$ equal to or slightly in excess of $K$, the SE is significantly higher for the WSRmax precoder than for MD-type ZF precoders, whereas some of the latter precoders are much more advantageous in terms of computational complexity.
Related work on hybrid precoders that aim to directly maximize the WSR or SE can be found in [9], [11], and [12]. For a MU-MISO-OFDM system with ZF hybrid precoding and a power loading that imposes the same SNR on all subcarriers for all users, the numerical results obtained from the maximization of the SE (which simplifies to the maximization of the common SNR) over the analog and digital parts of the hybrid precoder are used as a performance benchmark in [12]. In a different context, the authors of [11] have maximized a convexification of the mutual information (between the RX antenna signals and the transmitted symbols) over the hybrid precoder under a total TX power constraint and a UM constraint on the analog part, for a SU-MIMO-SC system; as the hybrid precoder is not rank-deficient, the resulting maximized mutual information is very close to the rate associated with the optimum all-digital precoder. The authors of [9] have maximized a convex approximation of the WSR under a quality-of-service constraint, in a MU-MIMO-SC system with an all-digital precoder at the TX, and intelligent reflective surfaces acting as analog precoders subjected to the UM constraint.
In this paper, the following notations are used: $(A)^T$, $(A)^H$, and $(A)^{-1}$ denote the transpose, the Hermitian transpose (or conjugate transpose), and the inverse of the matrix $A$, respectively. The identity matrix of size $K$ is written as $I_K$, and $\mathrm{Tr}[A]$ and $\|A\|_F$ are the trace and the Frobenius norm of the matrix $A$. The $n$th element of the vector $a$ is denoted $(a)_n$, while $(A)_{m,n}$ is the element on the $m$th row and $n$th column of the matrix $A$. The notation $a \sim CN(0, R)$ means that the random vector $a$ is distributed according to a circularly symmetric complex Gaussian distribution with zero mean and covariance matrix $R$. Lastly, $\angle(z)$ is the angle of the complex scalar $z$, expressed in radians.

II. SYSTEM DESCRIPTION
We investigate the downlink transmission of a MU communication system consisting of an $N_a$-antenna base station (BS) and $K$ single-antenna user terminals (UTs). The modulation is OFDM with $N_f$ subcarriers and a cyclic prefix of $N_{cp}$ samples; the sampling interval $T$ equals the inverse of the useful signal bandwidth. To maintain orthogonality of the subcarriers and to avoid interference between successive OFDM symbols, the OFDM symbol duration $(N_f + N_{cp})T$ must be less than the channel coherence time, and the cyclic prefix duration $N_{cp}T$ must be not less than the delay spread of the channel. We assume $K \le N_a \le KN_f$.
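The TX makes use of linear precoding to transmit to each UT $N_f$ data symbols per OFDM symbol interval. OFDM demodulation at the $k$th UT yields the demodulator outputs $(y_{1,k}, \ldots, y_{N_f,k})$. Defining $y_i = (y_{i,1}, \ldots, y_{i,K})^T$, we obtain
$$y_i = H_i F_i a_i + w_i, \quad i = 1, \ldots, N_f \tag{1}$$
where $H_i \in C^{K \times N_a}$, $F_i \in C^{N_a \times K}$, $a_i = (a_{i,1}, \ldots, a_{i,K})^T$ and $w_i \in C^K$ denote the channel matrix, the precoder matrix, the data symbol vector and the noise vector, all related to the $i$th subcarrier; we assume that $a_i \sim CN(0, I_K)$ and $w_i \sim CN(0, N_0 I_K)$. We impose a constraint on the total energy transmitted during an OFDM symbol interval, i.e.,
$$\sum_{i=1}^{N_f} \mathrm{Tr}[F_i F_i^H] = E_{tr} \tag{2}$$
To eliminate inter-user interference (IUI), the precoders must also satisfy the ZF condition [18]:
$$H_i F_i = \Gamma_i^{1/2}, \quad \Gamma_i = \mathrm{diag}(\gamma_{i,1}, \ldots, \gamma_{i,K}) \tag{3}$$
with $\gamma_{i,k} \ge 0$. This way, (1) reduces to
$$y_{i,k} = \sqrt{\gamma_{i,k}}\, a_{i,k} + w_{i,k} \tag{4}$$
for $k = 1, \ldots, K$, which illustrates that IUI is absent. As a performance measure we consider the weighted sum rate (WSR) resulting from (4), which is given by [19]
$$\mathrm{WSR} = \sum_{i=1}^{N_f} \sum_{k=1}^{K} \omega_{i,k} \log_2\!\left(1 + \frac{\gamma_{i,k}}{N_0}\right) \tag{5}$$
with $\omega_{i,k}$ denoting the non-negative weight associated with the $i$th subcarrier and the $k$th UT.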

III. ALL-DIGITAL VERSUS HYBRID PRECODING
In the case of all-digital precoding, the vectors $x_i = F_i a_i$ are computed at baseband. Defining $x_{i,n} = (x_i)_n$ for $n = 1, \ldots, N_a$, the sequence $(x_{1,n}, \ldots, x_{N_f,n})$ is applied to an OFDM modulator, and the resulting RF signal $x_{RF,n}(t)$ is applied to the $n$th TX antenna. Hence, with all-digital precoding, the number of required RF chains equals $N_a$, the number of TX antennas. For large $N_a$, the implementation cost and power consumption of the RF hardware become excessive [7]. In the case of hybrid precoding, the matrices $F_i$ are decomposed as $F_i = AB_i$, where $A \in C^{N_a \times N_{rf}}$ and $B_i \in C^{N_{rf} \times K}$ denote the analog and digital precoding matrices, with $K \le N_{rf} < N_a$. The vectors $u_i = B_i a_i$ are computed at baseband. Defining $u_{i,m} = (u_i)_m$ for $m = 1, \ldots, N_{rf}$, the sequence $(u_{1,m}, \ldots, u_{N_f,m})$ is applied to an OFDM modulator, which results in the RF signal $u_{RF,m}(t)$. The $n$th TX antenna signal $x_{RF,n}(t)$ is obtained as the sum of scaled and phase-shifted versions of the RF signals $u_{RF,1}(t), \ldots, u_{RF,N_{rf}}(t)$; the scaling factor and phase shift applied to the $m$th RF signal are given by $|A_{n,m}|$ and $\angle A_{n,m}$, respectively, where $A_{n,m} = (A)_{n,m}$. With hybrid precoding, the number of RF chains equals $N_{rf}$, which is less than the number of TX antennas. Hence, implementation complexity and power consumption are reduced compared with all-digital precoding [7]. The disadvantage of the hybrid precoder is a performance loss compared to the all-digital precoder. Let us define $F = (F_1, \ldots, F_{N_f})$ and $B = (B_1, \ldots, B_{N_f})$. For the all-digital precoder, $F$ can represent any $N_a \times KN_f$ matrix that satisfies the TX energy constraint; the hybrid precoder is characterized by $F = AB$, so that the rank of $F$ is limited to $N_{rf}$.
For given $N_{rf}$, the number of RF components can be further decreased by additionally imposing a connectivity constraint on the analog part $A$ of the hybrid precoder. The connectivity constraint involves restricting $A$ to be block-diagonal: $A = \mathrm{blockdiag}(A^{(1)}, \ldots, A^{(N_b)})$ with $A^{(b)} \in C^{(N_a/N_b) \times (N_{rf}/N_b)}$, yielding a total of $N_a N_{rf}/N_b$ connections, with each connection providing a gain $|A_{n,m}|$ and a phase shift $\angle A_{n,m}$. One distinguishes between FC, GC, and PC, which correspond to the cases $N_b = 1$ (i.e., there is no connectivity constraint), $1 < N_b < N_{rf}$, and $N_b = N_{rf}$, respectively [7].
For a given connectivity constraint, we will investigate the SPS and DPS implementations. In the case of the SPS implementation, the UM constraint applies such that the nonzero elements of $A$ satisfy $|A_{n,m}| = 1$; hence, when $A$ is block-diagonal, the analog part of the hybrid precoder contains $N_a N_{rf}/N_b$ phase shifters and no variable-gain amplifiers. When the UM constraint is not imposed, variable-gain amplifiers can still be dispensed with by adopting a DPS implementation, where each nonzero element of $A$ is the sum of two complex exponentials [6], [8]; in this case, $2N_a N_{rf}/N_b$ phase shifters are required. With the DPS implementation, the hybrid precoding matrix $F = AB$ consists of $N_b$ submatrices of dimension $(N_a/N_b) \times KN_f$, which each have rank $N_{rf}/N_b$ but are otherwise unconstrained. We will use the labels 'SPS' or 'DPS' to indicate the implementation considered.
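To illustrate why two phase shifters per connection can replace a variable-gain amplifier, the sketch below splits a complex coefficient into two unit-magnitude exponentials, using the identity $e^{j(\phi+\delta)} + e^{j(\phi-\delta)} = 2\cos(\delta)\, e^{j\phi}$. The function names are hypothetical, and the sketch assumes the coefficient has been scaled so that its magnitude does not exceed 2 (a common scaling of $A$ can be absorbed into the digital part).

```python
import cmath
import math

def dps_phases(a):
    """Return two phases (theta1, theta2) with exp(j*theta1) + exp(j*theta2) = a.

    Assumes |a| <= 2; hypothetical helper for illustration only.
    """
    mag = abs(a)
    if mag > 2.0:
        raise ValueError("coefficient magnitude must not exceed 2")
    phi = cmath.phase(a)           # common phase
    delta = math.acos(mag / 2.0)   # half of the phase difference; 2*cos(delta) = |a|
    return phi + delta, phi - delta

def dps_value(theta1, theta2):
    """Coefficient realized by two phase shifters whose outputs are summed."""
    return cmath.exp(1j * theta1) + cmath.exp(1j * theta2)

a = 0.7 * cmath.exp(1j * 1.2)      # example coefficient (assumed value)
theta1, theta2 = dps_phases(a)
a_rec = dps_value(theta1, theta2)
print(abs(a_rec - a))              # reconstruction error, close to machine precision
```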
In sections IV-VI, several precoders will be discussed; the nomenclature that will be used in the sequel is summarized in Fig. 1. VOLUME 11, 2023

IV. RATE-MAXIMIZING ALL-DIGITAL ZF PRECODER
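Considering the constraints (2) and (3), the optimal all-digital precoders $F_i$ that maximize the WSR (5) are obtained as
$$F_i = H_i^H (H_i H_i^H)^{-1} \Gamma_i^{1/2} \tag{6}$$
and $\Gamma_i$ must satisfy the TX energy constraint (2), which reduces to
$$\sum_{i=1}^{N_f} \sum_{k=1}^{K} \gamma_{i,k} M_{i,k} = E_{tr}, \quad M_{i,k} = \left((H_i H_i^H)^{-1}\right)_{k,k} \tag{7}$$
The coefficients $\gamma_{i,k}$ that maximize the WSR (5) result from waterfilling [20]:
$$\gamma_{i,k} = \left(\frac{\mu\, \omega_{i,k}}{M_{i,k}} - N_0\right)^+ \tag{8}$$
where $(x)^+ = \max(0, x)$, and the water level $\mu$ is determined from
$$\sum_{i=1}^{N_f} \sum_{k=1}^{K} \left(\mu\, \omega_{i,k} - N_0 M_{i,k}\right)^+ = E_{tr} \tag{9}$$
This WSR-maximizing all-digital ZF precoder will be referred to as 'all-digital'. Defining the set $S = \{(i,k) : \mu\, \omega_{i,k} > N_0 M_{i,k}\}$, the resulting WSR is
$$\mathrm{WSR} = \sum_{(i,k) \in S} \omega_{i,k} \log_2\!\left(\frac{\mu\, \omega_{i,k}}{N_0 M_{i,k}}\right) \tag{10}$$
where $\mu$ and $S$ are related by
$$\mu = \frac{E_{tr} + N_0 \sum_{(i,k) \in S} M_{i,k}}{\sum_{(i,k) \in S} \omega_{i,k}} \tag{11}$$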
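The water level can be found by iteratively shrinking the set of active (subcarrier, user) pairs. The sketch below is a minimal illustration, assuming the weighted waterfilling takes the form $\gamma = (\mu\omega/M - N_0)^+$ with the energy constraint $\sum \gamma M = E_{tr}$, where $M$ is the energy cost per unit of $\gamma$; the function name `waterfill`, the flattened list layout, and the numerical values are assumptions made for illustration.

```python
import math

def waterfill(M, w, E_tr, N0):
    """Weighted waterfilling over flattened (subcarrier, user) pairs.

    Maximizes sum_j w[j]*log2(1 + gamma[j]/N0) subject to
    sum_j gamma[j]*M[j] = E_tr and gamma[j] >= 0.
    Assumes E_tr > 0, M[j] > 0, and at least one positive weight.
    Returns (gamma, mu).
    """
    active = [j for j in range(len(M)) if w[j] > 0]
    while True:
        # Water level for the current active set.
        mu = (E_tr + N0 * sum(M[j] for j in active)) / sum(w[j] for j in active)
        keep = [j for j in active if mu * w[j] > N0 * M[j]]
        if len(keep) == len(active):
            break
        active = keep   # drop pairs that would receive a non-positive allocation
    gamma = [0.0] * len(M)
    for j in active:
        gamma[j] = mu * w[j] / M[j] - N0
    return gamma, mu

M = [0.5, 1.0, 4.0, 20.0]   # example cost coefficients (assumed values)
w = [1.0, 1.0, 1.0, 1.0]    # equal weights
E_tr, N0 = 2.0, 1.0
gamma, mu = waterfill(M, w, E_tr, N0)
wsr = sum(wj * math.log2(1.0 + g / N0) for wj, g in zip(w, gamma))
print(gamma, mu, wsr)
```

In this example the two pairs with the largest $M$ receive no energy, and the remaining allocations exhaust the energy budget exactly.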

V. RATE-MAXIMIZING HYBRID ZF PRECODER
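For the hybrid precoder, the TX energy constraint (2) and the ZF condition (3) give rise to
$$\sum_{i=1}^{N_f} \mathrm{Tr}[A B_i B_i^H A^H] = E_{tr} \tag{12}$$
$$H_i A B_i = \Gamma_i^{1/2} \tag{13}$$
Because of (13), the resulting WSR is again given by (5). Let us first maximize the WSR (5) over $B_i$ for given $A$. Introducing the decomposition $A = QP$, where the columns of $Q \in C^{N_a \times N_{rf}}$ form an orthonormal basis of the column space of $A$ and $P \in C^{N_{rf} \times N_{rf}}$ is non-singular, we define $\tilde{B}_i = P B_i$ and $\tilde{H}_i = H_i Q$, so that (12) and (13) can be reformulated as
$$\sum_{i=1}^{N_f} \mathrm{Tr}[\tilde{B}_i \tilde{B}_i^H] = E_{tr} \tag{14}$$
$$\tilde{H}_i \tilde{B}_i = \Gamma_i^{1/2} \tag{15}$$
These constraints are similar to the constraints of the all-digital precoder (see section IV) and result in
$$B_i = P^{-1} \tilde{H}_i^H (\tilde{H}_i \tilde{H}_i^H)^{-1} \Gamma_i^{1/2} \tag{16}$$
Equivalently,
$$B_i = (A^H A)^{-1} A^H H_i^H \left(H_i A (A^H A)^{-1} A^H H_i^H\right)^{-1} \Gamma_i^{1/2} \tag{17}$$
The corresponding WSR (5) is maximized for given $A$ by selecting $\gamma_{i,k}$ according to the waterfilling algorithm (8)-(9) [20], where now $M_{i,k}$ is the following function of $A$:
$$M_{i,k} = \left(\left(H_i A (A^H A)^{-1} A^H H_i^H\right)^{-1}\right)_{k,k} \tag{18}$$
This yields (10)-(11), again with $M_{i,k}$ given by (18). Taking the connectivity constraint into account, we introduce the decomposition $A^{(b)} = Q^{(b)} P^{(b)}$ for $b = 1, \ldots, N_b$, so that
$$M_{i,k} = \left(\left(H_i Q Q^H H_i^H\right)^{-1}\right)_{k,k} \tag{19}$$
where $Q = \mathrm{blockdiag}(Q^{(1)}, \ldots, Q^{(N_b)})$. Next, we (numerically) maximize the WSR (10), with $M_{i,k}$ given by (19), over the analog part $A$ of the hybrid precoder; this optimization is outlined in section VII. Having obtained the optimum $A$, the corresponding digital parts $B_i$ of the hybrid precoder result from (16). As already mentioned in section I, we refer to this rate-maximizing hybrid ZF precoder as 'WSRmax'.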
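The computation of the digital part for a given analog part can be sketched for a single subcarrier as follows. This is an illustration under stated assumptions rather than the authors' implementation: the ZF condition is taken to be $H A B = \Gamma^{1/2}$ with a diagonal $\Gamma$, the decomposition $A = QP$ is obtained from a reduced QR factorization, and the channel, the analog part, and the power allocation are random or arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(1)
N_a, N_rf, K = 8, 3, 2   # illustrative sizes (assumed)

def crandn(*shape):
    """Hypothetical helper: i.i.d. circular complex Gaussian entries."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

H = crandn(K, N_a)       # channel matrix of one subcarrier
A = crandn(N_a, N_rf)    # analog part without hardware constraints (for illustration)

Q, P = np.linalg.qr(A)   # A = Q P; Q has orthonormal columns, P is upper triangular
H_t = H @ Q              # equivalent channel seen by the transformed digital part
G = np.linalg.inv(H_t @ H_t.conj().T)
M = np.real(np.diag(G))  # energy cost per unit of gamma for each user

gamma = np.array([1.5, 0.5])   # example power allocation (would come from waterfilling)
B_t = H_t.conj().T @ G @ np.diag(np.sqrt(gamma))   # minimum-energy transformed digital part
B = np.linalg.solve(P, B_t)                        # B = P^{-1} B_t

effective = H @ A @ B                                      # should be diag(sqrt(gamma))
energy = np.real(np.trace(A @ B @ (A @ B).conj().T))       # should equal sum(gamma * M)
print(np.round(np.abs(effective), 6), energy, np.sum(gamma * M))
```

The off-diagonal entries of the effective channel vanish, so no IUI remains, and the transmitted energy depends on $A$ only through the coefficients $M$.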

VI. MATRIX-DECOMPOSITION HYBRID PRECODER
Commonly, in literature, the hybrid precoder is designed to approximate a given all-digital precoder by minimizing the squared Frobenius norm $\|F_{dig} - AB\|_F^2$ over $A$ and $B$ [1], [2], [3], [4], [5], [6], [7], where $A$ is the analog part of the hybrid precoder, $B = (B_1, \ldots, B_{N_f})$ collects the digital parts $B_i$ of the hybrid precoder associated with the $N_f$ subcarriers, and similarly $F_{dig} = (F_{dig,1}, \ldots, F_{dig,N_f})$ collects the all-digital precoders. The resulting matrix $AB$ represents a low-rank MD of $F_{dig}$. Here, we take for $F_{dig}$ the WSR-maximizing all-digital precoder from section IV. Taking the reduced connectivity into account, we have
$$\|F_{dig} - AB\|_F^2 = \sum_{b=1}^{N_b} \left\|F_{dig}^{(b)} - A^{(b)} B^{(b)}\right\|_F^2 \tag{20}$$
where $F_{dig}^{(b)} \in C^{(N_a/N_b) \times KN_f}$ and $B^{(b)} \in C^{(N_{rf}/N_b) \times KN_f}$ denote the $b$th block rows of $F_{dig}$ and $B$, respectively. Hence, the minimization of (20) reduces to $N_b$ separate minimizations.
In the following, several algorithms aiming at the minimization of (20) will be discussed; the resulting precoders will be referred to as 'MD'. Only the hardware constraints (reduced connectivity and/or UM) on A are taken into account during this minimization; the TX energy constraint will be enforced afterwards by keeping the analog part A resulting from the minimization of (20) but altering the digital precoder.
Two approaches for imposing the TX energy constraint will be considered. The first approach, which has often been used in literature (e.g., [1], [2], [3], [4], [6], [10]), consists of simply applying a proper scaling factor to the digital part $B$ of the precoder matrix resulting from the minimization of (20). However, although the all-digital precoder $F_{dig}$ in (20) satisfies the ZF condition, this is in general not the case for the hybrid precoder resulting from the first approach; consequently, the ensuing IUI gives rise to a floor of the WSR at high $E_{tr}/N_0$. The occurrence of a WSR floor is avoided when taking the second approach, where the digital part $B$ is computed according to (16) so that the ZF condition holds, and waterfilling is applied to obtain the coefficients $\gamma_{i,k}$. This second approach differs somewhat from the ZF approach taken in [6], where IUI is eliminated by selecting (in the case of MU-MISO) the precoder related to the $i$th subcarrier as $AB_iC_i$, where $A$ and $B_i$ result from the minimization of (20), and the $K \times K$ matrix $C_i$ is added to remove IUI. When $N_{rf} = K$, both $H_iA$ and $B_i$ are square matrices, in which case the ZF approach from [6] is mathematically equivalent to our second approach. However, when $N_{rf} > K$, the digital part $B_iC_i$ of the precoder from [6] has its column space restricted to the column space of $B_i$, whereas no such constraint applies to the digital part (16) of the precoder from our second approach; therefore, for $N_{rf} > K$ the ZF approach from [6] is outperformed by our second approach.
The algorithms discussed below differ in the way the analog part A of the precoder is obtained; for each of these algorithms, the label 'noZF' or 'ZF' will be used to indicate whether the first or the second approach is used to determine the digital part of the precoder.

A. NON-ITERATIVE ALGORITHMS
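When reduced connectivity is the only hardware constraint applied to $A$, the minimization of (20) has a closed-form solution, which follows from the Eckart-Young-Mirsky theorem [21]. This yields [6]
$$A^{(b)} = U_1^{(b)}, \quad B^{(b)} = \Sigma_1^{(b)} \left(V_1^{(b)}\right)^H \tag{21}$$
where the diagonal matrix $\Sigma_1^{(b)}$ contains the $N_{rf}/N_b$ largest singular values of $F_{dig}^{(b)}$, and the columns of $U_1^{(b)}$ and $V_1^{(b)}$ are the corresponding left and right singular vectors. Precoders that use (21) for the analog part will be referred to as 'MD_DPS'.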
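When the UM constraint is imposed on the elements of the matrices $A^{(b)}$, the minimization of (20) has no closed-form solution. However, a simple heuristic approach for obtaining the analog precoder $A^{(b)}$ consists in applying PE to $U_1^{(b)}$:
$$\left(A^{(b)}\right)_{n,m} = \exp\!\left(j \angle \left(U_1^{(b)}\right)_{n,m}\right) \tag{22}$$
Given $A^{(b)}$ from (22), (20) is minimized by taking
$$B^{(b)} = \left(\left(A^{(b)}\right)^H A^{(b)}\right)^{-1} \left(A^{(b)}\right)^H F_{dig}^{(b)} \tag{23}$$
The precoder using (22) will be referred to as 'PE-MD'.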
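The two non-iterative designs can be sketched with NumPy for a single block. The sketch assumes a truncated SVD for the unconstrained case, element-wise phase extraction of the dominant left singular vectors for the UM case, and a least-squares digital part; the random matrix standing in for the block of the all-digital precoder and the dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, r = 8, 32, 3   # block rows, K*N_f columns, RF chains of the block (assumed sizes)

F_dig = (rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))) / np.sqrt(2)

# Unconstrained analog part: best rank-r approximation from the truncated SVD.
U, s, Vh = np.linalg.svd(F_dig, full_matrices=False)
U1 = U[:, :r]
A_dps = U1
B_dps = np.diag(s[:r]) @ Vh[:r, :]
err_dps = np.linalg.norm(F_dig - A_dps @ B_dps, "fro") ** 2   # equals sum(s[r:]**2)

# Unit-magnitude analog part: keep only the phases of U1, then least-squares digital part.
A_pe = np.exp(1j * np.angle(U1))
B_pe, *_ = np.linalg.lstsq(A_pe, F_dig, rcond=None)
err_pe = np.linalg.norm(F_dig - A_pe @ B_pe, "fro") ** 2

print(err_dps, err_pe)
```

Because the truncated SVD is optimal among all rank-$r$ approximations, the error of the phase-extracted design can never be smaller than that of the unconstrained design.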

B. ITERATIVE ALGORITHMS
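A more elaborate approach to minimize (20) under the UM constraint is AltMin [7]. The AltMin approach is iterative, and alternatingly minimizes (20) with respect to $A^{(b)}$ and $B^{(b)}$. Denoting by $A^{(b)}(l)$ and $B^{(b)}(l)$ the analog and digital precoding matrices resulting from the $l$th iteration of the AltMin algorithm, we have, for $l = 1, 2, \ldots$,
$$B^{(b)}(l) = \left(\left(A^{(b)}(l-1)\right)^H A^{(b)}(l-1)\right)^{-1} \left(A^{(b)}(l-1)\right)^H F_{dig}^{(b)} \tag{24}$$
$$A^{(b)}(l) = \arg\min_{A \in \mathcal{A}} \left\|F_{dig}^{(b)} - A\, B^{(b)}(l)\right\|_F^2 \tag{25}$$
where $\mathcal{A}$ denotes the set of matrices of dimension $(N_a/N_b) \times (N_{rf}/N_b)$, with their elements constrained to unit-magnitude. As the subproblem (25) has no closed-form solution, an iterative procedure can be adopted to find $A^{(b)}(l)$. In this case, (24)-(25) represents a nested optimization algorithm involving outer iterations (iteration index $l$) and inner iterations (iteration index $l'$). For a given outer iteration index $l$, the inner iterations can be represented generically as
$$A^{(b)}(l, l') = g\!\left(A^{(b)}(l, l'-1),\, B^{(b)}(l)\right), \quad l' = 1, 2, \ldots \tag{26}$$
where $g(\cdot)$ denotes the update performed in one inner iteration, and $A^{(b)}(l)$ in (25) denotes the analog precoder matrix obtained after convergence of the inner iterations. For given $l$, the inner iterations (26) are initialized with $A^{(b)}(l, 0) = A^{(b)}(l-1)$, and the outer iterations start from an initial matrix $A^{(b)}(0)$ derived from $U_1^{(b)}$. It has been shown in [2] that the subproblem (25) can be solved using MO. This optimization is outlined in section VII and the corresponding precoder will be referred to as 'MO-AltMin'.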
According to [3], the computation time of MO-AltMin can be reduced without affecting the WSR performance by updating the digital part of the hybrid precoder after each inner iteration; this results in an algorithm involving only one iteration index $l$, where $B^{(b)}(l)$ and $A^{(b)}(l)$ are obtained according to (24) and
$$A^{(b)}(l) = g\!\left(A^{(b)}(l-1),\, B^{(b)}(l)\right) \tag{27}$$
respectively; hence, the iterations are intertwined rather than nested. The MO-AltMin with intertwined iterations can be interpreted as an MO-AltMin algorithm performing (24) and (26), where for each value of the outer iteration index $l$, only one inner iteration (26) is carried out; the initialization $A^{(b)}(0)$ is the same as for the nested iterations. The nested and intertwined MO-AltMin precoders yield virtually the same WSR performance, but the former gives rise to a larger computational complexity; therefore, only the latter will be considered in the sequel.
To avoid the computational complexity of MO-AltMin, a simplified iterative precoder can be used for the minimization of (20) under a UM constraint. This precoder, denoted 'PE-AltMin', uses (24) and applies PE for obtaining an approximate solution of (25):
$$\left(A^{(b)}(l)\right)_{n,m} = \exp\!\left(j \angle \left(X^{(b)}(l)\right)_{n,m}\right) \tag{28}$$
where
$$X^{(b)}(l) = F_{dig}^{(b)} \left(B^{(b)}(l)\right)^H \tag{29}$$
The PE-AltMin algorithm (24), (28) is similar to the PE-AltMin algorithm derived in [2], but here we do not restrict $B^{(b)}(l)$ to be semi-unitary.
When considering the PC constraint (i.e., $N_b = N_{rf}$), the minimization (25) under the UM constraint has a closed-form solution, which is given by (28). Hence, in this case, the MO-AltMin and PE-AltMin precoders yield the same WSR.
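For the PC case, each block of the analog part is a single unit-magnitude column, so both alternating updates are exact minimizers and the cost cannot increase from one iteration to the next. The sketch below illustrates this for one block, assuming a least-squares update of the digital row followed by a phase-extraction update of the analog column; the random target matrix, the initialization from the phases of its first column, and the iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 4, 24   # antennas in the block and K*N_f columns (assumed sizes)

F = (rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))) / np.sqrt(2)

# Hypothetical initialization: phases of the first column of F.
a = np.exp(1j * np.angle(F[:, :1]))   # n x 1, unit-magnitude entries

costs = []
for _ in range(20):
    # Digital update: least squares for fixed a, using a^H a = n.
    b = (a.conj().T @ F) / n                      # 1 x m
    # Analog update: closed-form minimizer under the unit-magnitude constraint.
    a = np.exp(1j * np.angle(F @ b.conj().T))     # n x 1
    costs.append(np.linalg.norm(F - a @ b, "fro") ** 2)

print(costs[0], costs[-1])
```

The recorded cost sequence is non-increasing, which is the property that makes the alternating scheme well-behaved in this special case.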
When the PC constraint holds, [2] proposes the semidefinite relaxation (SDR) AltMin ('SDR-AltMin') precoder, as an alternative to the MO-AltMin precoder. In each iteration, the SDR-AltMin precoder performs two steps: step (i) determines the analog part of the precoder using PE according to (28)-(29); step (ii) consists of a numerical procedure aiming at finding the digital part of the hybrid precoder which minimizes (20) under the TX energy constraint for a given $A$. Note that the TX energy constraint is imposed in each iteration, instead of being dealt with afterwards. The non-convex quadratic optimization problem of step (ii) in [2] is not solved directly; instead, an SDR of the problem is solved, which consists of ignoring a rank-one constraint. However, it can be shown that step (ii) has an analytical solution, which corresponds to simply applying a proper scaling factor $c(l)$ to each $B^{(b)}(l)$ from (24), so that the TX energy constraint is satisfied in each iteration. The resulting hybrid precoder (when using the analytical solution for step (ii)) is mathematically equivalent to the above PE-AltMin_noZF precoder, which applies the scaling only after convergence of the iterations.
Step (ii), as described in [2], uses the vectorization of matrices F dig and B. This vectorization gives rise to a numerical procedure that involves very large matrices, causing a much higher computational complexity than PE-AltMin_noZF; therefore, SDR-AltMin will not be considered in the sequel.

VII. GRADIENT-DESCENT OPTIMIZATION
When the UM constraint applies to the nonzero elements of $A$, the maximization of the WSR (10) (with $M_{i,k}$ given by (19)) or the minimization of the squared Frobenius norm (20), both with respect to the analog part of the precoder, can be considered as optimizations over a matrix manifold which incorporates the constraints on $A$ [2]. Let us consider the complex circle manifold $CC(p, n)$, defined as
$$CC(p, n) = \left\{X \in C^{n \times p} : |(X)_{i,j}| = 1 \text{ for all } i, j\right\}$$
For given $N_b$, the maximization of (10) is over the Cartesian product of $N_b$ complex circle manifolds $CC(N_{rf}/N_b, N_a/N_b)$, whereas the minimization of (20) reduces to $N_b$ separate minimizations over the complex circle manifold $CC(N_{rf}/N_b, N_a/N_b)$. The iterative optimization over a manifold consists of a loop containing the following steps: (i) the computation of the Euclidean gradient of the cost function at the point on the manifold resulting from the previous iteration; (ii) the projection of the Euclidean gradient onto the tangent space at the considered point, which yields the Riemannian gradient; (iii) taking a step in the tangent space, in the direction determined by the Riemannian gradient and tangent vectors from previous iterations; (iv) the mapping of the resulting point in the tangent space onto the manifold, which yields the manifold point of the current iteration. Given the Euclidean gradient, a detailed mathematical description of these steps can be found in [22] and [23].
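Steps (i)-(iv) can be sketched for one iteration on the complex circle manifold. The sketch below is a simplified illustration and not the procedure of [22], [23]: it uses the Frobenius cost $\|F - AB\|_F^2$ with a fixed $B$ as an example cost function, plain steepest descent (no tangent vectors from previous iterations), an element-wise normalization as the mapping back onto the manifold, and a basic backtracking line search; all matrices, sizes, and function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, m = 6, 2, 20   # rows, columns of A, and columns of F (assumed sizes)

def crandn(*shape):
    """Hypothetical helper: i.i.d. circular complex Gaussian entries."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

F, B = crandn(n, m), crandn(p, m)

def cost(A):
    return np.linalg.norm(F - A @ B, "fro") ** 2

def egrad(A):
    # (i) Euclidean gradient G, with the convention df = Re Tr(G dA^H).
    return -2.0 * (F - A @ B) @ B.conj().T

def rgrad(A, G):
    # (ii) Projection onto the tangent space: remove the radial component per element.
    return G - np.real(G * A.conj()) * A

def retract(X):
    # (iv) Map back onto the manifold by normalizing every element.
    return X / np.abs(X)

A = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, (n, p)))   # starting point on the manifold
G_R = rgrad(A, egrad(A))
f0, step = cost(A), 1.0
while True:
    # (iii) Step in the tangent space along the negative Riemannian gradient.
    A_new = retract(A - step * G_R)
    if cost(A_new) <= f0 - 1e-4 * step * np.linalg.norm(G_R) ** 2 or step < 1e-12:
        break
    step *= 0.5   # backtracking: halve the step until sufficient decrease

print(f0, cost(A_new))
```

The Riemannian gradient has no component along the radial direction of any element, and the retracted point again has unit-magnitude entries, so the hardware constraint is maintained throughout the iterations.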
When the UM constraint is not imposed, the maximization of (10) with respect to $A^{(1)}, \ldots, A^{(N_b)}$ is over the Cartesian product of $N_b$ Euclidean spaces $C^{(N_a/N_b) \times (N_{rf}/N_b)}$. As in this case the elements of the submatrices $A^{(b)}$ are unconstrained, no manifold needs to be introduced, and conventional iterative optimization algorithms based on the Euclidean gradient can be used [24].
The initialization of the iterative MD-type precoders has been considered in section VI-B. For the WSRmax precoders, we obtain A  The minimization of (20) without the UM constraint yields the closed-form solution (21), in which case no iterative optimization is required.
Whereas the Euclidean gradient is commonly represented as a vector, here we use a matrix notation, which has the advantage of reducing computational complexity [3]. Denoting the cost function to be minimized as $f(A)$, the Euclidean gradient $G \in C^{N_a \times N_{rf}}$ relates the differential $\partial f$ of the cost function to the differential $\partial A$ by $\partial f = \mathrm{Re}\!\left[\mathrm{Tr}(G \cdot \partial A^H)\right]$. Taking the connectivity constraint on $A$ into account, this relation can be expressed as
$$\partial f = \sum_{b=1}^{N_b} \mathrm{Re}\!\left[\mathrm{Tr}\!\left(G^{(b)} \cdot \partial \left(A^{(b)}\right)^H\right)\right] \tag{30}$$
where $G^{(b)}$ denotes the Euclidean gradient with respect to $A^{(b)}$. The Euclidean gradient corresponding to the minimization of $-$WSR (with WSR given by (10)) can then be obtained by applying the computational rules from [25]; this yields the expression (31) for $G^{(b)}$, which involves the matrices $Z_i$ defined in (32). For the minimization of (20), the Euclidean gradient is determined by
$$G^{(b)} = -2 \left(F_{dig}^{(b)} - A^{(b)} B^{(b)}\right) \left(B^{(b)}\right)^H \tag{33}$$
As $Z_i$ from (32) depends on the matrices $Q^{(1)}, \ldots, Q^{(N_b)}$, it follows that $G^{(b)}$ from (31) is a function of $A^{(1)}, \ldots, A^{(N_b)}$ rather than only $A^{(b)}$; therefore, the minimization of $-$WSR with respect to $A$ does not reduce to $N_b$ separate minimizations. This contrasts with the minimization of (20) with respect to $A$, which obviously gives rise to $N_b$ separate minimizations.

VIII. OVERVIEW OF PRECODER ALGORITHMS
Table 2 gives an overview of the precoding algorithms described above. For the all-digital precoder, the precoding matrix $F_i$ is given by (6), where the diagonal power allocation matrix $\Gamma_i$ is found by applying the waterfilling algorithm (8)-(9). Note that for the all-digital precoder, the parameter $M_{i,k}$ only depends on the channel matrix $H_i$.
The hybrid precoder WSRmax starts by finding the analog part $A$ by maximizing the WSR (10) with $M_{i,k}$ now given by (19). Note that this $M_{i,k}$ depends both on the channel matrix $H_i$ and the analog part $A$. The WSR (10) is maximized by means of a gradient-descent optimization on a matrix manifold, which ensures that $A$ satisfies the hardware constraints. After the analog part $A$ is found, the optimal ZF digital part $B_i$ is given by (16), where the diagonal power allocation matrix $\Gamma_i$ (with diagonal elements $\gamma_{i,k}$) is again found by applying the waterfilling algorithm (8)-(9) with $M_{i,k}$ given by (19).
The MD precoders try to minimize the squared Frobenius norm (20) using various techniques: rank-decomposition, AltMin, or a heuristic approach. The noZF MD precoders then use the analog part $A$ and a scaled version of the digital part $B$ that follows from this minimization. The rescaling of the digital part is necessary in order to satisfy the TX energy constraint (2). The ZF MD precoders only keep the analog part $A$ that follows from the minimization, while choosing the digital part $B_i$ as the optimal ZF matrix given by (16), where the diagonal power allocation matrix $\Gamma_i$ (with diagonal elements $\gamma_{i,k}$) is again found by applying the waterfilling algorithm (8)-(9) with $M_{i,k}$ given by (19).

IX. COMPUTATIONAL COMPLEXITY
We express the computational complexity of the various precoders as the number of floating-point operations (flops) they require; the flop count has been derived from [26], where the computational complexity of the relevant matrix operations can be found. Here we will only indicate the behavior of the dominant terms in the flop count for large $N_a$, assuming that $N_{rf}$ and $K$ increase proportionally with $N_a$. Table 3 displays the flop count associated with the computation of several variables required by the different precoding algorithms. Next, Table 4 shows the flops for the considered precoders, expressed in terms of the quantities $F_1, F_2, \ldots$, which represent the flop counts from Table 3. The quantities $N_{it}$, $N_{out}$, and $N_{in}$ denote the average number of iterations (for WSRmax and PE-AltMin), the average number of outer iterations, and the average number of inner iterations per outer iteration (both for nested MO-AltMin); the results for intertwined MO-AltMin follow from setting $N_{in} = 1$.
The flops F 1 in Table 3 corresponds to the computation of U  Table 4.
Each iteration of the PE-AltMin precoders, in which the digital part $B^{(b)}(l)$ is updated, gives a contribution $N_{it}(F_2 + F_4 + F_8)$ to these precoders' flop count; the complexity of the PE itself has been neglected. The WSRmax precoders use a gradient-descent algorithm for updating $A$; the corresponding computational complexity is dominated by the evaluations of the cost function and the gradient. In the $l$th iteration of WSRmax, $A(l)$ is obtained from $A(l-1)$; this involves computing the gradient at $A(l-1)$, and executing a backtracking line-search to find a stepsize satisfying the Armijo-Goldstein (AG) condition [22], [23]. To verify the AG condition for a given stepsize, one needs the values of the cost function (10) at $A(l-1)$, and at $A(l)$ corresponding to the considered stepsize; this yields the contribution $N_{it}(N_{ls}F_7 + F_9)$ for WSRmax in Table 4, where $N_{ls}$ denotes the number of cost function evaluations (associated with the line search) per update of $A$. Before the iterations start, the cost at $A(0)$ must be obtained, because its value is needed for verifying the AG condition for $l = 1$; this gives rise to the contribution $F_7$ for WSRmax. Although in Table 4 the flop count expressions for WSRmax_DPS and WSRmax_SPS are the same, it should be noted that these algorithms yield different values for $(N_{it}, N_{ls})$, because the optimization is over different manifolds (i.e., the standard Euclidean space for WSRmax_DPS and the complex circle manifold for WSRmax_SPS); consequently, these algorithms give rise to different flop count values.
For the nested MO-AltMin precoders, which use gradient-descent in the inner iterations, a reasoning similar to that for WSRmax applies. The contribution N out N in (N ls F 8 + F 10 ) in Table 4 stems from the cost function and gradient evaluations during the inner iterations. In the lth outer iteration, the digital part B (b) (l) is computed, and an additional cost function evaluation at A (l−1) (which is taken as the starting point A (l,0) for the inner iterations) is required before starting the inner iterations; the corresponding contribution in Table 4 is N out (F 4 + F 8 ). The results for intertwined MO-AltMin can be obtained by making the substitution N in = 1.
From the above, it follows that for the iterative precoders with gradient-descent, the evaluation of the gradient at some A is always preceded by a cost function evaluation at the same A. Therefore, some matrices obtained during the cost function evaluation can be reused when computing the gradient; the corresponding entries in Table 3 take this into account. An efficient evaluation of the cost function (10) and gradient (31) is outlined in appendices A and B.
Once the analog precoder A has been obtained, the digital precoder can be calculated. The ZF digital part B i (16) can be efficiently computed as where the matrices H i and R eq,i are defined in appendix A.
, and are obtained by solving a system of equations by means of backward substitution (e.g., H H i R −1 eq,i is the solution of XR eq,i = H H i ). When the matrices H i and R eq,i are available, computing B i this way requires F 5 flops; when H i and R eq,i are not available and need to be calculated first, the flop count amounts to F 6 . Table 4 indicates that the flop count F 5 is present in the WSRmax precoders, which reuse H i and R eq,i from the computation of the cost function (10). The MD-type ZF precoders also use B i (16) as the digital part, and require F 6 flops. In contrast, the MD-type noZF precoders obtain the digital part from B (b) (23), which requires F 3 flops.

X. NUMERICAL RESULTS AND DISCUSSION
Numerical results are presented for a mm-Wave frequency-selective channel model. In the time domain, the channel from the nth TX antenna to the kth UT is represented by a tapped delay line consisting of Q taps with spacing T. Each tap corresponds to a cluster of N ray rays, yielding the tap coefficients ȟ k,n (0), . . . , ȟ k,n (Q − 1) given by (35) for q = 0, . . . , Q − 1 [1], [2], [4], [5], [6], [10], [16]. When the kth UT is at distance d k from the BS, the complex-valued gains α k,r (q) in (35) are independently distributed according to CN (0, σ 2 k ), where σ 2 k = C/d 2 k and C is selected such that E[σ 2 k ] = 1; assuming that the UTs are independently and uniformly distributed over an annulus with inner and outer radii d min and d max and with the BS at the center, one obtains $C = (d_{\max}^2 - d_{\min}^2) / (2 \ln(d_{\max}/d_{\min}))$. Numerical results will assume d max /d min = 1000, which corresponds to a path loss difference of 60 dB between UTs at the minimum and maximum distances from the BS.
The vector a tx (φ) in (35) denotes the normalized TX array response vector corresponding to an azimuth angle φ; assuming an N a -element uniform linear array with spacing λ/2, we take $(a_{\mathrm{tx}}(\varphi))_n = \frac{1}{\sqrt{N_a}} e^{j\pi(n-1)\sin(\varphi)}$. The angles of departure φ k,r (q) are independent and have a truncated (to the interval (−π, π)) Laplace distribution with mean φ̄ k (q) and a standard deviation of 10 degrees; the means φ̄ k (q) are independent and uniformly distributed over (0, 2π).
Considering OFDM modulation with a cyclic prefix of Q − 1 samples and N f subcarriers, the channel matrix H i associated with the ith subcarrier is obtained as the N f -point discrete Fourier transform of the tap matrices Ȟ(0), . . . , Ȟ(Q − 1), evaluated at the ith subcarrier, where (Ȟ(q)) k,n = ȟ k,n (q). According to this channel model, the rows of H i are statistically independent, but the elements in the same row are spatially correlated; for a given position of the kth UT, the normalization factor in (35) gives rise to marginal distributions CN (0, σ 2 k ) for the elements in the kth row of H i .
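The normalization of the path loss can be checked numerically. The sketch below assumes an area-uniform UT position over the annulus, so that the distance d has density 2d/(d max ^2 − d min ^2), and verifies that the closed-form constant C = (d max ^2 − d min ^2)/(2 ln(d max /d min )) indeed yields E[σ 2 k ] = 1. The function names and the unit choice d min = 1 are assumptions made for illustration; only the ratio d max /d min = 1000 is taken from the text.

```python
import math

def path_loss_constant(d_min, d_max):
    """Closed-form C such that E[C / d^2] = 1 for UTs uniform over an annulus."""
    return (d_max**2 - d_min**2) / (2.0 * math.log(d_max / d_min))

def mean_inverse_square_distance(d_min, d_max, n=200_000):
    """Midpoint-rule estimate of E[1/d^2] with radial density 2d / (d_max^2 - d_min^2)."""
    h = (d_max - d_min) / n
    area_norm = d_max**2 - d_min**2
    total = 0.0
    for m in range(n):
        d = d_min + (m + 0.5) * h
        total += (1.0 / d**2) * (2.0 * d / area_norm) * h
    return total

d_min, d_max = 1.0, 1000.0                     # arbitrary unit; ratio 1000 as in the text
C = path_loss_constant(d_min, d_max)
mean_sigma2 = C * mean_inverse_square_distance(d_min, d_max)

# Ratio of sigma_k^2 between the nearest and farthest UT, in dB.
path_loss_spread_db = 10.0 * math.log10((d_max / d_min) ** 2)
```

Here `mean_sigma2` should be numerically close to 1, and `path_loss_spread_db` reproduces the 60 dB path loss difference between the nearest and farthest UTs.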
Unless mentioned otherwise, in the sequel we take (N a , K , N rf , N f , Q, N ray ) = (84, 12, 12, 64, 9, 16), and N b = 4 when using the GC constraint. The values of N a and K correspond to a massive MIMO setting: N a ≫ 1 and K /N a < 1 [27].
The iterative algorithms make use of the following stopping criterion: the iterations are halted when in three consecutive iterations the relative change of the cost function is less than σ , with σ = 10 −3 (unless mentioned otherwise); consequently, a minimum of 3 iterations will always be executed before halting. The gradient-descent optimizations were performed using the Manopt toolbox [28] in Matlab using the 'rlbfgs' solver, which implements a limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm [24].
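The stopping criterion can be sketched as follows. The helper names `run_until_converged` and `make_step` are hypothetical, and the relative change is assumed to be normalized by the previous cost value, which the text does not specify. The toy cost sequence 1 + 0.5^l only serves to exercise the rule.

```python
import math

def run_until_converged(step, cost0, sigma=1e-3, patience=3, max_iter=10_000):
    """Iterate until the relative cost change is below sigma in `patience`
    consecutive iterations. `step()` performs one iteration and returns the
    new cost. Returns (number of iterations executed, final cost)."""
    prev = cost0
    streak = 0
    for it in range(1, max_iter + 1):
        cur = step()
        rel = abs(cur - prev) / abs(prev) if prev != 0 else math.inf
        streak = streak + 1 if rel < sigma else 0   # reset on any large change
        prev = cur
        if streak >= patience:
            return it, cur
    return max_iter, prev

def make_step(costs):
    """Wrap a precomputed cost sequence as a step() callable (toy stand-in)."""
    it = iter(costs)
    return lambda: next(it)

costs = [1.0 + 0.5**l for l in range(1, 40)]        # toy sequence converging to 1
n_default, _ = run_until_converged(make_step(costs), cost0=2.0, sigma=1e-3)
n_inf, _ = run_until_converged(make_step(costs), cost0=2.0, sigma=math.inf)
```

With σ = ∞ every iteration passes the test, so exactly three iterations are executed, which matches the stated minimum.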
The nested and intertwined MO-AltMin precoders achieve virtually the same WSR performance, but the former gives rise to a computational complexity which is about 40 % to 80 % larger at σ = 10 −3 ; in the sequel, 'MO-AltMin' will refer to the intertwined version.
The instantaneous SE associated with the channel realization H = (H 1 , . . . , H N f ) is given by
$$\mathrm{SE}(H) = \frac{1}{N_f + Q - 1} \sum_{i=1}^{N_f} \sum_{k=1}^{K} \log_2\left(1 + \mathrm{SINR}_{i,k}\right) \qquad (36)$$
where SINR i,k is the signal-to-interference-plus-noise ratio (SINR) associated with the ith subcarrier and the kth UT (which reduces to γ i,k /N 0 for ZF precoders); the factor 1/(N f + Q − 1) takes into account the presence of a cyclic prefix with N cp = Q − 1, which is the minimum prefix length that avoids interference between successive OFDM symbols. The ergodic SE is defined as SE avg = E[SE(H)], where E[·] denotes the expectation over H. In section X-A, we will discuss the ergodic SE of the MISO communication system, resulting from precoders designed for ω k,l = 1. The numerical value of SE avg is obtained by averaging SE(H) from (36) over 1000 channel realizations.
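As a small worked example, the sketch below evaluates the instantaneous SE under the assumption that it equals the sum over all subcarriers and UTs of log2(1 + SINR i,k ), scaled by 1/(N f + Q − 1). The function name `spectral_efficiency` and the constant SINR value are hypothetical; the values of N f , K and Q are those of the default configuration.

```python
import math

def spectral_efficiency(sinr, q_taps):
    """Instantaneous SE in bps/Hz.

    sinr[i][k] is the linear SINR on subcarrier i for UT k;
    q_taps is the number of channel taps Q (cyclic prefix of Q - 1 samples).
    """
    n_f = len(sinr)
    rate = sum(math.log2(1.0 + s) for row in sinr for s in row)
    return rate / (n_f + q_taps - 1)

n_f, k_users, q_taps = 64, 12, 9
sinr = [[3.0] * k_users for _ in range(n_f)]   # hypothetical: 2 bits per UT and subcarrier
se = spectral_efficiency(sinr, q_taps)
```

In this example the cyclic prefix reduces the SE by the factor N f /(N f + Q − 1) = 64/72 relative to the prefix-free value of 24 bps/Hz. The ergodic SE would follow by averaging such values over independent channel realizations.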

A. SPECTRAL EFFICIENCY
We investigate the SE (for ω k,l = 1) resulting from various hybrid precoders, and make the comparison with the all-digital precoder.

Figs. 2 (DPS implementation) and 3 (SPS implementation) display SE avg versus E tr /N 0 for K = N rf = 12. For both the ZF and noZF precoders, the curves for PE-AltMin and PE-MD are not shown, because they are very close to the curve for MO-AltMin; when the PC constraint is imposed, MO-AltMin and PE-AltMin are even mathematically equivalent. In Fig. 3, SE avg related to the precoder from [15] is also shown, assuming single-antenna RXs and perfect channel estimation. This precoder performs worse than the MD_SPS_ZF precoders with the same connectivity constraint (i.e., FC). Moreover, this precoder requires an iterative optimization, which results in a larger computational complexity than with PE-MD_ZF. Therefore, the precoder from [15] will no longer be considered in the sequel. Table 5 shows the performance degradations of the ZF precoders with respect to the all-digital precoder; this degradation is expressed as a penalty in E tr /N 0 at SE avg = 50 bps/Hz. For a given connectivity constraint, Table 5 indicates that WSRmax_DPS outperforms WSRmax_SPS in terms of SE avg . As the WSRmax precoders are designed to maximize SE(H) under the ZF constraint for given H, it follows that WSRmax_DPS performs better than WSRmax_SPS also in terms of SE(H), because the latter precoder is subjected to the UM constraint.
The situation is different for the MD precoders, as these are designed to minimize $\|F_{\mathrm{dig}} - AB\|_F^2$ rather than maximize SE(H). This is illustrated by the scatter plot in Fig. 4, where the abscissa and ordinate represent the performance penalties (with respect to the MD_DPS precoder) of the MO-AltMin and PE-MD precoders, respectively, at SE(H) = 50 bps/Hz in the case of FC; each scatter point corresponds to a different channel realization. It follows from Table 5 that the degradations (with respect to the MD_DPS precoder) of these precoders at SE avg = 50 bps/Hz are 0.57 dB and 0.89 dB; these are the coordinates of the asterisk in Fig. 4. Although, in terms of SE avg , MD_DPS outperforms MO-AltMin, which in turn outperforms PE-MD, we observe that this ranking of the precoders does not apply to all channel realizations: for scatter points with negative abscissa, MO-AltMin outperforms MD_DPS; for scatter points with negative ordinate, PE-MD outperforms MD_DPS; for scatter points located below the bisector of the first and third quadrants, PE-MD outperforms MO-AltMin. Similar observations can be made when replacing MO-AltMin or PE-MD by PE-AltMin, and/or considering the GC or PC constraints (results not displayed for conciseness).
For all types of precoders, SE avg gets smaller when the connectivity is reduced, because there is less freedom in choosing the analog precoder matrix A.
For the noZF precoders, SE avg flattens at high E tr /N 0 , because of the IUI becoming dominant over the noise. No flattening occurs for the ZF precoders because they eliminate IUI. For a given connectivity constraint, the noZF precoders outperform the ZF precoders at sufficiently small E tr /N 0 : the noZF precoders generate a larger useful component, at the expense of some IUI which at low E tr /N 0 is negligible compared to the noise.
For a given type of digital precoder (ZF or noZF) and a given connectivity constraint, the MD_SPS precoders (i.e., MO-AltMin, PE-AltMin and PE-MD) are very close in SE avg performance: Table 5 indicates that the penalties of the MD_SPS_ZF precoders differ by at most about 0.3 dB.
For a given connectivity constraint, the penalty of the SPS implementation with respect to the DPS implementation is rather small. Comparison of WSRmax_DPS with WSRmax_SPS and MD_DPS with PE-MD in Table 5 shows that these penalties are limited to about 1.2 dB. This penalty is in line with [13] and [14], which show that, for MU-MISO-SC with K = N rf and ZF precoding on i.i.d. Rayleigh fading channels, the degradation of the SPS implementation (which applies PE to the conjugate channel matrix) compared to the all-digital ZF precoder (which for SC modulation is equivalent to hybrid precoding with DPS) amounts to −10 log 10 (π/4) ≈ 1 dB for large N a . Because OFDM transmission gives rise to a rank-deficient AB, the FC precoders with DPS implementation perform considerably worse than the all-digital precoder: Table 5 shows degradations of 5.86 dB for WSRmax_DPS and 13.25 dB for MD_DPS_ZF. This is in contrast to SC transmission, where the FC precoder with DPS implementation is able to achieve the same SE as the all-digital precoder.
For a given implementation (DPS or SPS) and connectivity constraint, the MD_ZF precoders perform considerably worse than the WSRmax precoders: according to Table 5, the penalty of the former compared to the latter ranges between 5.83 dB and 7.39 dB for DPS, and between 4.9 dB and 7.7 dB for SPS. Whereas for SC transmission the MD precoders are quasi-optimum in terms of SE (approaching (for SPS) or achieving (for DPS) the all-digital precoder performance, see for instance [1], [2], [3], [4]), this is clearly not true in the case of OFDM transmission. According to [1], for the MD precoders to be nearly optimal in terms of SE, $\|F_{\mathrm{dig}} - AB\|_F^2$ must be small; however, the rank-deficiency of AB in the case of OFDM causes a significantly larger $\|F_{\mathrm{dig}} - AB\|_F^2$ than with SC transmission, so that the MD precoders can no longer be considered near-optimum. Hence, for OFDM transmission, the
For all configurations, SE avg increases with N rf (at the expense of a growing hardware cost): as the rank of the hybrid precoder AB equals N rf , there is more freedom in selecting AB when N rf gets larger. For N rf = N a , A has dimension N a × N a , so that selecting B = A −1 F dig makes AB equal to the all-digital precoder F dig ; this is confirmed by the numerical results, which show that all hybrid precoders achieve the same SE avg as the all-digital precoder when N rf = N a .
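The rank argument can be illustrated numerically. In the sketch below, the all-digital precoders of all subcarriers are stacked into one N a × K N f matrix, a random unit-modulus A is drawn, and B is taken as the least-squares solution; both matrices are random stand-ins (not actual ZF precoders or optimized analog precoders), and the small dimensions and the function name `best_hybrid_residual` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def best_hybrid_residual(n_a, n_rf, k, n_f):
    """Rank of A B and relative residual ||F - A B||_F^2 / ||F||_F^2
    for the least-squares B, given a random unit-modulus A."""
    shape_f = (n_a, k * n_f)
    # Stand-in for the stacked all-digital precoder [F_1, ..., F_Nf].
    F = rng.standard_normal(shape_f) + 1j * rng.standard_normal(shape_f)
    A = np.exp(1j * rng.uniform(0, 2 * np.pi, (n_a, n_rf)))
    B, *_ = np.linalg.lstsq(A, F, rcond=None)        # B minimizing ||F - A B||_F
    rank_ab = np.linalg.matrix_rank(A @ B)
    resid = np.linalg.norm(F - A @ B) ** 2 / np.linalg.norm(F) ** 2
    return rank_ab, resid

# N_rf < N_a: A B has rank at most N_rf, so a full-rank F cannot be matched.
rank_low, resid_low = best_hybrid_residual(n_a=8, n_rf=2, k=2, n_f=4)
# N_rf = N_a: B = A^{-1} F reproduces F exactly (up to rounding).
rank_full, resid_full = best_hybrid_residual(n_a=8, n_rf=8, k=2, n_f=4)
```

The residual stays well above zero in the first case, whereas it vanishes in the second case, in agreement with the observation that all hybrid precoders match the all-digital precoder when N rf = N a .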
Since AB becomes less rank-deficient when N rf gets larger, AB can approximate F dig more closely, so that minimizing $\|F_{\mathrm{dig}} - AB\|_F^2$ also tends to maximize SE(H). Indeed, the differences in SE avg between the WSRmax and MD precoders get smaller with increasing N rf : whereas the differences are considerable for N rf = K , WSRmax and MO-AltMin_ZF achieve a similar SE avg for N rf ≳ 2K ; much larger values of N rf are required for the MD_noZF precoders to operate close to the WSRmax precoders. From the hardware complexity point of view, however, the interesting range for N rf is K ≤ N rf ≤ 2K ; in this region, the MD_noZF precoders are still much worse than the WSRmax and MD_ZF precoders.
For a given connectivity constraint and a given selection (ZF or noZF) of B, the hybrid precoders with N rf = 2K and SPS implementation significantly outperform the precoders with N rf = K and DPS implementation, but perform worse than the all-digital precoder. This is in contrast to SC transmission, where the former and latter hybrid precoders are both equivalent to the all-digital precoder [6], [8].
Performance differences among MD_ZF precoders and among MD_noZF precoders get smaller when the degree of connectivity decreases. In the case of PC, the curves for MD_DPS_ZF, MO-AltMin_ZF, PE-AltMin_ZF, and PE-MD_ZF virtually coincide; therefore only the curve for MO-AltMin_ZF is displayed; the same holds for the noZF versions of these precoders.

B. PERFORMANCE-COMPLEXITY TRADE-OFF
We consider the effect of the stopping criterion on SE avg and on the computational complexity of the iterative precoders, and make the comparison with the non-iterative precoders. Fig. 8 shows SE avg versus the flop count for different ZF precoders operating at E tr /N 0 = 40 dB, when the parameter σ of the stopping criterion varies. Each iterative precoder yields a curve with six markers; the corresponding values of σ are (from leftmost to rightmost marker) ∞, 10 −1 , 10 −2 , 10 −3 , 10 −4 and 10 −5 ; σ = ∞ corresponds to the execution of exactly 3 iterations. For the AltMin precoders, some of the markers for the larger values of σ nearly overlap, because for these values of σ the average number of iterations is close to 3; hence, only the markers for the smaller values of σ are clearly visible. Each non-iterative precoder is represented by a single marker; note that, for a given connectivity constraint and a given implementation (DPS or SPS), the values of SE avg and flop count of the non-iterative precoder also apply to the starting point of the iterative precoders.
For the WSRmax precoders, decreasing σ improves SE avg at the expense of a larger flop count, because it takes more iterations to converge. Compared to the starting point of the WSRmax precoders, a significant increase in SE avg is achieved when performing only 3 iterations. A suitable tradeoff value is σ = 10 −3 , because further reducing σ to 10 −5 improves SE avg by only 2 % at most, whereas the computational complexity further increases by a factor ranging from about 2 to 8. At σ = 10 −3 , SE avg for WSRmax_DPS is slightly larger (by up to 9 %) than for WSRmax_SPS (as explained in section X-A). Moreover, WSRmax_DPS also has a smaller flop count (maximum decrease of 32 % compared to WSRmax_SPS), but the DPS implementation has a higher hardware cost: it needs twice as many phase shifters to implement the analog part A [6], [8].
For a given flop count and connectivity constraint, MO-AltMin and PE-AltMin yield virtually the same SE avg . For these precoders, the dependence of SE avg on the flop count is much weaker than for WSRmax: for σ = 10 −5 , SE avg is at most 3 % larger than when only 3 iterations are performed, but the computational complexity at σ = 10 −5 is considerably larger, especially when FC or GC applies. Also, performing 3 iterations yields only a small gain (at most about 2 %) in SE avg compared to the starting point, whereas the cost in flops roughly doubles. Therefore, PE-MD represents a suitable trade-off among MD precoders with SPS implementation. Compared to the PE-MD precoder, the MD_DPS precoder offers a slightly higher SE avg (increase of up to 7 %) for the same flop count; however, this benefit comes with an additional hardware cost (i.e., the number of phase shifters doubles).
Table 6 provides a concise overview of the trade-off between SE avg , the computational complexity, and the hardware cost. As the all-digital precoder is not hampered by hardware constraints, it is able to achieve the highest SE avg and lowest computational complexity. The main downside of this precoder is the extremely large hardware cost and power consumption: N a RF chains are required (one per antenna element).
Hybrid precoding reduces the hardware cost and power consumption by only requiring N rf RF chains with N a > N rf ≥ K ; however, this results in a reduction of SE avg . The hardware cost can be lowered even more by reducing the connectivity of the analog part (by increasing N b ), at the cost of an even larger reduction of SE avg . On the positive side, reducing the connectivity also reduces the computational complexity.
For a ZF hybrid precoder with given hardware constraints (connectivity and/or UM), the WSRmax precoder provides the largest SE avg at the cost of a high computational complexity. On the other hand, the non-iterative MD precoders have a low computational complexity but also the lowest SE avg . Lastly, the iterative MD precoders achieve a slightly larger SE avg than the PE-MD precoder, at the cost of a considerable increase in computational complexity.
The noZF MD precoders have similar computational complexity and hardware cost compared to their ZF counterparts while achieving higher SE avg at low SNR. At moderate to high SNRs, the noZF precoders perform much worse in terms of SE avg .

XI. CONCLUSION
In this paper, we presented a novel ZF hybrid precoder for a MU-MISO-OFDM communication system that maximizes the WSR under various hardware constraints. The WSR resulting from these WSRmax precoders is useful for benchmarking the performance of other ZF hybrid precoders. We investigated and compared the trade-off between the performance, computational complexity, and hardware complexity for several ZF hybrid precoders.
When N rf satisfies K ≤ N rf ≪ N a (as is typical of massive MIMO), the SE performance shows large degradations of the WSRmax precoder compared to the optimum all-digital precoder, and of the MD-type precoders compared to the WSRmax precoder. These large degradations are attributed to the inherent rank deficiency of AB in multi-carrier modulations for the considered N rf . Increasing N rf augments the rank of AB and, therefore, reduces these degradations; the degradations completely vanish when N rf = N a , i.e., when AB becomes a full-rank matrix. However, increasing N rf much beyond K is not a proper means to improve the SE, as the purpose of hybrid precoding is to limit hardware cost and power consumption by keeping N rf small.
The performance versus computational complexity trade-off indicates that the WSRmax precoders, while achieving the best possible performance among ZF hybrid precoders, are viable only when their large computational complexity is affordable. When the allowed computational complexity excludes the WSRmax precoders, the non-iterative precoders MD_DPS and PE-MD are preferred; compared to MO-AltMin and PE-AltMin, PE-MD yields a slightly smaller SE avg , but has a much lower computational complexity.

APPENDIX A EFFICIENT EVALUATION OF THE WSR COST FUNCTION
In order to evaluate the cost function (10) with M i,k given by (19), we first need to calculate the positive scalar parameters M i,k for all i = 1, . . . , N f and k = 1, . . . , K , followed by a waterfilling algorithm (9) to find the water level µ and the set S = {(i,k) | µω i,k > M i,k }. Finally, the cost function can be evaluated according to (10).
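A possible form of the waterfilling step is sketched below. It assumes that (9) solves the standard weighted waterfilling problem of maximizing the sum of ω log2(1 + p/M) subject to a total budget on p, whose solution p = max(µω − M, 0) has exactly the active set S given above; the precise normalization of M i,k (e.g., how the noise level enters) is not reproduced here and is an assumption. The function name `weighted_waterfilling`, the bisection search and the numerical values are hypothetical.

```python
def weighted_waterfilling(M, w, budget, iters=100):
    """Return (mu, p) with p_j = max(mu * w_j - M_j, 0) and sum(p) = budget.

    M and w are flat lists over all (i, k) pairs, with M_j > 0 and w_j > 0.
    """
    # At mu = hi every term is at least `budget`, so the budget is exceeded.
    lo, hi = 0.0, max((m + budget) / wj for m, wj in zip(M, w))
    for _ in range(iters):                       # bisection on the water level mu
        mu = 0.5 * (lo + hi)
        used = sum(max(mu * wj - m, 0.0) for m, wj in zip(M, w))
        if used > budget:
            hi = mu
        else:
            lo = mu
    mu = 0.5 * (lo + hi)
    p = [max(mu * wj - m, 0.0) for m, wj in zip(M, w)]
    return mu, p

M = [0.5, 1.0, 4.0, 20.0]        # hypothetical M_{i,k} values, flattened over (i, k)
w = [1.0, 1.0, 1.0, 1.0]         # unit weights
mu, p = weighted_waterfilling(M, w, budget=3.0)
active = [j for j in range(len(M)) if mu * w[j] > M[j]]   # the set S
```

In this example only the two entries with the smallest M are active: the water level settles at 2.25, below the third value M = 4, so the remaining entries receive no power.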
Since evaluating the parameters M i,k is the most computationally intensive, it is important to use an efficient algorithm to calculate them. Algorithm 1 presents an efficient method. The algorithm starts by taking the QR decomposition of A (b) to obtain a semi-unitary matrix Q (b) and an upper-triangular matrix P (b) . Step 2 then computes the matrix product H Step 4 then performs the Cholesky decomposition of the matrices H i H H i , yielding the upper-triangular matrices R eq,i . In step 5 the inverse of the matrices R eq,i is computed by solving the linear system: R eq,i X = I K using backward substitution. In the final step, the parameter M i,k is obtained as the squared norm of the kth row of the matrix R inv,i . Because the cost function is evaluated before the gradient is calculated for a specific analog precoder A, we can reuse some matrices from algorithm 1 to calculate the gradient. These matrices are: R inv,i , H i , Q (b) , P (b) ∀i,k,b. Similarly, the parameters M i,k and the water level µ can also be reused.
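The steps described above can be sketched as follows for a fully-connected analog precoder (single block b). The sketch assumes that the equivalent channel is obtained as the product of H i with the semi-unitary factor Q, and that the Cholesky factor is the upper-triangular R eq,i satisfying R eq,i^H R eq,i = Gram matrix; the function name `zf_power_factors`, the random test data and the dimensions are hypothetical. A general solver is used for the triangular system for brevity, whereas a dedicated backward-substitution routine would exploit the triangular structure.

```python
import numpy as np

def zf_power_factors(A, H_list):
    """M[i, k] = k-th diagonal entry of (H_i Q Q^H H_i^H)^{-1}, via Cholesky."""
    Q, _P = np.linalg.qr(A)                      # A = Q P with Q semi-unitary, P upper-triangular
    M = np.empty((len(H_list), H_list[0].shape[0]))
    for i, H in enumerate(H_list):
        Ht = H @ Q                               # equivalent channel (K x N_rf)
        G = Ht @ Ht.conj().T                     # Gram matrix (K x K, Hermitian)
        R_eq = np.linalg.cholesky(G).conj().T    # upper-triangular, G = R_eq^H R_eq
        R_inv = np.linalg.solve(R_eq, np.eye(R_eq.shape[0]))   # solves R_eq X = I_K
        M[i] = np.sum(np.abs(R_inv) ** 2, axis=1)              # squared norms of the rows
    return M

rng = np.random.default_rng(1)
n_a, n_rf, k, n_f = 8, 3, 3, 4                   # small hypothetical dimensions
A = np.exp(1j * rng.uniform(0, 2 * np.pi, (n_a, n_rf)))
H_list = [rng.standard_normal((k, n_a)) + 1j * rng.standard_normal((k, n_a))
          for _ in range(n_f)]
M = zf_power_factors(A, H_list)
```

The squared row norms of the inverse factor equal the diagonal of the inverse Gram matrix, so the explicit K × K inversion of the Gram matrix is avoided.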
The amount of work required to evaluate the cost function is now dominated by steps 2, 4, and 5, requiring about 8N a N rf 3 K 3 N f , and 4 3 K 3 N f flops, respectively. Assuming that the flops required by all other steps are negligible, the amount of work required to evaluate the cost function using algorithm 1 is about 8N a N rf

APPENDIX B EFFICIENT EVALUATION OF THE WSR GRADIENT
When calculating the gradient G (b) from (31) for a specific analog precoder A, we will assume that the matrices R inv,i , H i , Q (b) , P (b) ∀i,k,b, the parameters M i,k ∀i,k, and the water level µ have already been calculated during the evaluation of the cost function for the respective analog precoder A. An efficient algorithm for calculating these matrices and parameters can be found in appendix A.
An efficient method for calculating the gradient is presented in algorithm 2. In the first two steps we start by calculating the parameters d i,k =
The amount of work required to calculate the Euclidean gradient is now dominated by steps 3 and 4, requiring about $\frac{4}{3} K^3 N_f$ and $8 \frac{N_a N_{rf}}{N_b} K N_f + 8 N_{rf} K^2 N_f + 4 K^3 N_f$ flops, respectively. Again assuming that the flops required by all other steps are negligible, the amount of work required to calculate the gradient using algorithm 2 is about $8 \frac{N_a N_{rf}}{N_b} K N_f + 8 N_{rf} K^2 N_f + \frac{16}{3} K^3 N_f$ flops.