Energy-Efficient Hybrid Symbol-Level Precoding for Large-Scale mmWave Multiuser MIMO Systems

We address the symbol-level precoding design problem for the downlink of a multiuser millimeter wave (mmWave) multiple-input multiple-output (MIMO) wireless system where the transmitter is equipped with a large-scale antenna array. The high cost and power consumption associated with the massive use of radio frequency (RF) chains prohibit fully-digital implementation of the precoder, and therefore, we consider a hybrid analog-digital architecture where a small-sized baseband precoder is followed by two successive networks of analog on-off switches and variable phase shifters according to a fully-connected structure. We jointly optimize the digital baseband precoder and the states of the switching network on a symbol-level basis, i.e., by exploiting both the channel state information (CSI) and the instantaneous data symbols, whereas the phase-shifting network is designed only based on the CSI due to practical considerations. Our approach to this joint optimization is to minimize the Euclidean distance between the optimal fully-digital and the hybrid symbol-level precoders. Remarkably, the use of a switching network allows for power-savings in the analog precoder by switching some of the phase shifters off according to the instantaneously optimized states of the switches. Our numerical results indicate that, on average, up to 50 percent of the phase shifters can be switched off. We provide an analysis of energy efficiency by adopting appropriate power dissipation models for the analog precoder, where it is shown that the energy efficiency of precoding can substantially be improved thanks to the phase shifter selection approach, compared to the fully-digital and the state-of-the-art hybrid symbol-level schemes.

the emerging outdoor/indoor wireless communication deployments, enabling multi-gigabit-per-second data rates thanks to the vast unregulated spectrum resources available within the 30–300 GHz band [1]–[3]. Communication in the mmWave band, however, suffers from an order-of-magnitude increase in free-space path loss, higher shadow fading, and more severe penetration losses compared to legacy lower-frequency systems [4]. On the other hand, the shorter wavelength of mmWave signals makes it possible to pack a larger number of antenna elements into the same physical area, allowing for large-scale spatial multiplexing and highly directional beamforming. Employing large antenna arrays, commonly known as massive multiple-input multiple-output (MIMO), can further provide considerable beamforming gain to compensate for the severe propagation losses at mmWave frequencies [5], which is indispensable for achieving high-quality communication links in mmWave systems.
In traditional MIMO systems, the convention is to perform precoding fully in the digital baseband domain, which enables modification of both the amplitudes and the phases of complex signals [6], [7]. This fully-digital signal processing, however, requires one dedicated radio frequency (RF) chain per antenna element. That requirement is challenging to meet in practical systems with large antenna arrays due to the prohibitive cost and high power consumption of mixed-signal components, especially at mmWave frequencies [8]. Given these practical constraints of mmWave massive MIMO, the design of cost-effective, low-complexity precoding implementations has become an active line of research. Various precoding schemes, mostly aimed at either simplifying the RF chains or reducing their number, have been proposed for both single-user and multiuser MIMO systems. These include analog-only beamforming using RF phase shifters [9]–[11], antenna (subset) selection [12], [13], quantized fully-digital precoding via low-resolution (especially one-bit) digital-to-analog converters (DACs) [14], [15], and hybrid analog-digital beamforming [5], [8], [16]–[18].
Hybrid analog-digital precoding is a cost-effective alternative that enables both multi-stream transmission and large beamforming gains by splitting the signal processing between the digital and analog domains. In hybrid architectures, a small-sized digital precoder is followed by a high-dimensional analog precoder, which is usually implemented using RF phase shifters and/or switches [16]. Such a setup allows for employing fewer RF chains, whose number scales with the number of multiplexed data streams rather than the number of antennas. Specifically, in multiuser mmWave systems, the digital precoder is designed to mitigate the inter-user interference, whereas the analog RF precoder is used to improve the antenna array gain [19]. Nevertheless, while designing the digital precoder is straightforward, the design and implementation of the analog precoder are usually nontrivial.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
For large-scale multiuser mmWave systems, the design of block-level hybrid schemes, where the precoding solution relies solely on the channel state information (CSI), has been extensively addressed. However, symbol-level approaches to hybrid precoding have not yet been well studied. In the latter design approach, the data symbols are further exploited in optimizing the precoder such that the inter-user interference contributes constructively to the received signal of each user. This has led to the notion of constructive interference (CI). CI rests on the observation that a noise-free received signal can be decoded correctly not only when it is close to the target symbol, but whenever it lies within the correct decision region, even if it is farther away from the intended symbol [20], [21].
Symbol-level hybrid precoding design under mmWave hardware limitations has been addressed in some recent works [22]–[24]. In [22], the authors adopt a disjoint, suboptimal approach to optimizing the digital and analog precoders with a focus on the analog precoder design, where different techniques are studied and compared. Power-efficient transmitter architectures, including antenna selection and analog-only designs, are studied for symbol-level precoding in [25]. There, it has been shown that the analog-only design can outperform the other schemes, especially when the transmit array size is much larger than the number of users. An even more cost-effective hybrid structure is considered in [23], where the digitally precoded baseband signal is subject to one-bit quantization due to the use of low-cost one-bit DACs in each RF chain. This stringent constraint, however, may limit the potential gain of symbol-level baseband signal processing. The joint optimization of digital and analog symbol-level precoders is addressed in [24], where the authors exploit the symbol-based design of the phase-shifting network to achieve the performance of the fully-digital precoder. In practice, this design needs to switch the phase states of the variable phase shifters at the symbol rate. Given the target data rates of multiple Gbps in mmWave systems, such a high phase-switching speed might be prohibitive in two respects. First, it significantly increases the power consumption of the analog circuitry. Second, and more importantly, it might be challenging to implement with current RF semiconductor technologies [26]. Among the aforementioned symbol-level precoding techniques, those proposed in [23] and [24] are most closely related to the scope of this work, and are therefore considered in this paper for comparison purposes.
Analog phase shifters and switches are two key components of mmWave systems. A wide variety of hybrid precoding architectures are based on employing phase shifters, switches, or a combination of both in which the phase-shifting network is controlled by a preceding network of switches; see, e.g., [16] and [27], where several possible architectures are described. Combining phase-shifting and switching networks in the analog RF precoder has a two-fold advantage. On the one hand, the switching network provides additional degrees of freedom (DoF) when designing the analog precoder. On the other hand, it allows for potential power savings by switching some of the phase shifters off. From a power consumption perspective, one further needs to take into account the additional power consumed by the switching network. For this purpose, power consumption models such as those introduced in [27] and [14] can be used. Roughly speaking, however, the additional power consumed by operating the switches is relatively small compared to the power reduction in the phase-shifting network. One reason is that, in general, switches consume less power than phase shifters. Furthermore, recent advances in RF circuit design have enabled low-power, high-performance switches operating at mmWave frequencies, making the switching operation even more energy-efficient; see, e.g., [28]–[30]. Therefore, the use of analog switches in combination with the phase-shifting network is an attractive architecture for hybrid mmWave systems. In this line of research, hybrid implementations with so-called phase shifter selection, where a two-state on-off switch precedes each phase shifter, have been studied for conventional block-level precoding; see [31]–[33].
For example, in [31], it has been shown that the combination of phase shifters and switches offers noticeably higher energy efficiency than phase shifter-only architectures, while the spectral efficiency is almost preserved. More specifically, significant reductions in power consumption are possible without sacrificing spectral efficiency even when up to 50% of the phase shifters are turned off [32]. To the best of the authors' knowledge, such an approach has not yet been investigated for hybrid symbol-level precoding.
In this paper, we consider a hybrid analog-digital architecture for symbol-level precoding. The analog precoder is implemented using a network of variable phase shifters preceded by an on-off switching network of the same dimension, according to a fully-connected structure. The phase states of the phase-shifting network are designed solely based on the instantaneous CSI, i.e., they remain unchanged within one channel coherence block. The on-off states of the switches and the baseband digital precoder, on the other hand, are jointly optimized on a symbol-level basis. Our approach to this optimization is to minimize the $\ell_2$-norm distance between the outputs of the hybrid and the optimal fully-digital symbol-level precoders. For the latter precoder, we adopt a power-constrained max-min signal-to-interference-plus-noise ratio (SINR) criterion subject to user-specific CI constraints, where the CI constraints are assumed to be distance-preserving, as characterized in [34]. Accordingly, the main contributions of this paper are as follows:
i. We exploit the notion of CI along with the phase shifter selection approach in designing the hybrid precoder. The CI-based design can improve the symbol detection performance at the receiver side. The phase shifter selection approach brings additional DoF to the design problem and further enables a reduction of the power dissipated in the phase-shifting network. The use of on-off switches, however, makes our design problem an NP-hard binary optimization. We deal with this difficulty by transforming the original problem into a biconvex form using an equivalent continuous-domain implication of the binary constraints. Efficient sub-optimal solutions can then be obtained via a standard block coordinate descent (BCD) algorithm.
ii. We study the convergence of the proposed hybrid precoding algorithm and show that convergence to a stationary point is guaranteed. We further analyze the required computational complexity in the large-system limit, considering both the Newton complexity, i.e., the number of iterations required until the BCD algorithm converges, and the per-iteration complexity. Moreover, we show via simulation results that the BCD algorithm usually converges within a few iterations for practical values of the system parameters, i.e., the array size and the numbers of RF chains and users.
iii. We provide an energy efficiency analysis, incorporating both performance and power consumption, to evaluate and compare different fully-digital and hybrid precoding architectures. For this purpose, we adopt appropriate power consumption models to account for the power dissipated by the transmitter's RF circuitry. According to this analysis, the phase shifter selection mechanism offers significant improvements in the energy efficiency of precoding by switching off up to 50 percent of the phase shifters.
iv. Our design approach is independent of the phase-shifting precision. To evaluate how the precision affects the ultimate precoding performance, however, we consider in our simulations two implementations, using infinite- and finite-resolution phase shifters. It will be shown that implementing the phase-shifter-selection-enabled analog precoder using low-resolution phase shifters can lead to energy efficiency gains of tens of Mbps/Joule per user compared to the case with infinite-resolution phase shifters.
Organization: The rest of this paper is organized as follows. In Section II, we describe the adopted system, signal, and channel models. We begin Section III by designing the phase-shifting network. Then, we study the symbol-level precoding problem for the fully-digital architecture. This is followed by the derivation of the proposed hybrid precoding algorithm and analyses of its convergence and computational complexity. In Section IV, we provide the energy efficiency analysis and describe the power consumption model. Simulation results are presented and discussed in Section V. Finally, we conclude the paper in Section VI.
Notations: We use bold-faced uppercase and lowercase letters to represent matrices and vectors, respectively. For matrices and vectors, $\|\cdot\|$ denotes the Frobenius norm and the $\ell_2$-norm, respectively. For vectors, $\succeq$ and $\preceq$ denote elementwise inequality. Operators $\mathrm{diag}(\cdot)$ and $\mathrm{blkdiag}(\cdot)$ represent diagonal and block-diagonal matrices, and $\mathrm{vec}(\cdot)$ and $\mathrm{vec}^{-1}(\cdot)$ denote the vectorization operation and its inverse, respectively. We use $\mathbf{I}$ to represent the identity matrix, and $\mathbf{0}$ and $\mathbf{1}$ to represent the all-zeros and the all-ones matrices, respectively (or vectors, depending on the context), of appropriate dimensions. Operators $\otimes$ and $\circ$ stand for the Kronecker product and the Hadamard product, respectively. Furthermore, $\mathbb{P}\{\cdot\}$ and $\mathbb{E}\{\cdot\}$ denote probability and statistical expectation.

II. SYSTEM AND CHANNEL MODEL
In this section, we describe the system and channel model considered in the paper.

A. System and Signal Model
We consider a single-cell, single-carrier mmWave multiuser MIMO system where the base station (BS) is equipped with a large-scale antenna array of $N_t$ elements and a (typically) much smaller number of transmit RF chains, denoted by $N_l$. The BS simultaneously communicates independent data streams to $N_u$ single-antenna users, each supporting single-stream transmission. The maximum number of transmitted data streams (i.e., the maximum number of users scheduled within a transmission block) is limited by the number of available RF chains at the BS, which leads to the assumption $N_u \le N_l < N_t$. Due to the limited number of transmit RF chains, a fully-digital implementation of the multiuser precoder is not possible. Therefore, a hybrid digital-analog architecture is employed, in which the digital baseband precoder is followed by the RF chains and an analog RF precoder, as shown in Fig. 1. It is worth noting that the baseband precoder is capable of modifying both the amplitudes and the phases of the input symbols, while the RF precoder adjusts only the phases of the upconverted RF signals.

1) Digital Baseband Precoder:
We consider a (nonlinear) symbol-level baseband precoder that computes the digital outputs specifically for every set of input symbols. Accordingly, the discrete-time $N_u \times 1$ complex modulated symbol vector $\mathbf{s} = [s_1, s_2, \ldots, s_{N_u}]^T$, where $\mathbb{E}\{\mathbf{s}\mathbf{s}^H\} = \mathbf{I}_{N_u}$, is preprocessed in the digital domain by the symbol-level precoder, resulting in the output baseband signal $\mathbf{u}_{BB} \in \mathbb{C}^{N_l \times 1}$. In contrast to linear precoding schemes, the nonlinearly precoded signal $\mathbf{u}_{BB}$ is designed directly and thus may not be explicitly decomposable as a linear combination of the users' precoding vectors. This symbol-by-symbol processing is described in more detail in Section III. The baseband precoded signal $\mathbf{u}_{BB}$ is then passed through the RF chains for upconversion to the carrier frequency.
2) Analog RF Precoder: We assume the analog precoder to be implemented following a fully-connected structure with two successive switching and phase-shifting networks of dimension $N_t \times N_l$ that map the $N_l$ digital outputs to $N_t$ precoded analog signals feeding the transmit antennas; see [16], [32], [33], where similar hybrid architectures have been considered. To be more specific, each RF chain upconverts a digitally precoded signal and feeds it to a phased array with $N_t$ variable phase shifters. Each phase shifter is preceded by a dedicated analog on-off switch that determines whether it is active or deactivated. The outputs of the phase-shifting network are then combined through $N_t$ analog combiners before being fed to the antenna elements. Let $\mathbf{F} \in \mathbb{C}^{N_t \times N_l}$ and $\mathbf{T} \in \mathcal{B}$ represent the phase-shifting network and the on-off states of the switching network, respectively, where
$$\mathcal{B} \triangleq \left\{\mathbf{X} \in \{0,1\}^{N_t \times N_l} : \mathbf{X}\mathbf{1}_{N_l \times 1} \succeq \mathbf{1}_{N_t \times 1},\ \mathbf{X}^T\mathbf{1}_{N_t \times 1} \succeq \mathbf{1}_{N_l \times 1}\right\}.$$
The entire RF precoder can then be represented by $\mathbf{F} \circ \mathbf{T}$. Note that the set $\mathcal{B}$ is defined such that the selection matrix $\mathbf{T}$ has no all-zero row or column. An all-zero row would correspond to an antenna selection scheme, and an all-zero column would exclude an RF chain from the transmitter's analog circuitry; neither is the focus of this paper. We further note that since $\mathbf{F}$ is implemented using analog phase shifters, each of its elements is normalized such that $|f_{k,j}| = 1/\sqrt{N_t}$, with $|f_{k,j}|$ denoting the magnitude of the $(k,j)$-th element of $\mathbf{F}$. Under the described system model, the vector collecting the baseband received signals of all $N_u$ users is given by
$$\mathbf{y} = \sqrt{\rho}\,\mathbf{H}(\mathbf{F} \circ \mathbf{T})\mathbf{u}_{BB} + \mathbf{z},$$
where $\mathbf{y} \in \mathbb{C}^{N_u \times 1}$ is the received signal vector and $\rho$ is the instantaneous transmit power. $\mathbf{H} \in \mathbb{C}^{N_u \times N_t}$ represents the mmWave multiuser channel. $\mathbf{z} \sim \mathcal{CN}(\mathbf{0}, \boldsymbol{\Sigma})$ is a circularly symmetric complex Gaussian noise vector with $\boldsymbol{\Sigma} \triangleq \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_{N_u}^2)$, where $\sigma_j^2$ denotes the noise variance at the receiver of the $j$-th user.
The instantaneous total transmit power is constrained by $\rho$ by enforcing $\|(\mathbf{F} \circ \mathbf{T})\mathbf{u}_{BB}\|^2 = 1$. It is further assumed that the BS has perfect knowledge of the instantaneous channel $\mathbf{H}$. In practical wireless systems, the CSI can be estimated at the receiver via, e.g., pilots or training sequences, and then fed back to the BS. Efficient mmWave channel estimation techniques that exploit the geometric nature of mmWave channels are presented in [35], [36]. At the receiver side, we assume that each user performs optimal single-user decoding of the received signal using the maximum likelihood (ML) detector.
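The signal model above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' simulator. The helper names `in_switch_set` and `received_signal` are hypothetical. The power constraint is enforced by rescaling the analog output, which is equivalent to rescaling $\mathbf{u}_{BB}$ because the map $\mathbf{u}_{BB} \mapsto (\mathbf{F} \circ \mathbf{T})\mathbf{u}_{BB}$ is linear.

```python
import numpy as np

def in_switch_set(T: np.ndarray) -> bool:
    """Check that T is binary with no all-zero row or column (the set B)."""
    binary = np.all((T == 0) | (T == 1))
    return bool(binary and np.all(T.sum(axis=1) > 0) and np.all(T.sum(axis=0) > 0))

def received_signal(H, F, T, u_bb, rho, noise_var, rng):
    """y = sqrt(rho) H (F o T) u_BB + z, with ||(F o T) u_BB||^2 = 1 enforced."""
    assert in_switch_set(T), "T must lie in the set B"
    x = (F * T) @ u_bb                  # analog output feeding the N_t antennas
    x = x / np.linalg.norm(x)           # unit total power, so rho sets the transmit power
    Nu = H.shape[0]
    # z_j ~ CN(0, sigma_j^2): real and imaginary parts each carry half the variance
    z = np.sqrt(noise_var / 2) * (rng.standard_normal(Nu) + 1j * rng.standard_normal(Nu))
    return np.sqrt(rho) * H @ x + z, x
```

For instance, with $N_t = 16$, $N_l = 4$, and $N_u = 2$, switching off every other phase shifter of the first RF chain still gives a valid $\mathbf{T} \in \mathcal{B}$, whereas zeroing an entire row does not.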

B. Multiuser mmWave Channel Model
The mmWave propagation environment is known to feature a limited number of multipath components. To capture this sparse scattering nature, the narrowband clustered channel model based on the Saleh-Valenzuela model is commonly used [37]–[39]. Under this model, the channel vector of a single user is a sum of the contributions of $N_c$ scattering clusters, with each cluster contributing $N_p$ propagation paths between the BS and the user. Assuming that each user sees the same number of clusters and scatterers, the narrowband mmWave channel vector of the $j$-th user can be expressed as
$$\mathbf{h}_j = \sqrt{\frac{N_t}{N_c N_p}} \sum_{i=1}^{N_c} \sum_{l=1}^{N_p} \alpha_{j,i,l}\, \mathbf{a}^H(\phi_{j,i,l}, \theta_{j,i,l}).$$
For the $l$-th path in the $i$-th scattering cluster seen by the $j$-th user, $\alpha_{j,i,l} \sim \mathcal{CN}(0,1)$ denotes the circularly symmetric complex Gaussian gain of the path (i.e., the small-scale fading component). $\phi_{j,i,l}$ and $\theta_{j,i,l}$ are, respectively, the azimuth and elevation angles of departure (AoD), and $\mathbf{a}(\phi_{j,i,l}, \theta_{j,i,l})$ represents the normalized transmit array response vector evaluated at these angles. The array response vector further depends on the array geometry. For a uniform linear array (ULA), where the antenna elements are equally spaced along a line, the array response vector is independent of the elevation angles and follows the Vandermonde structure
$$\mathbf{a}(\phi) = \frac{1}{\sqrt{N_t}}\left[1, e^{\jmath\frac{2\pi}{\lambda}d\sin\phi}, \ldots, e^{\jmath(N_t-1)\frac{2\pi}{\lambda}d\sin\phi}\right]^T,$$
where $\lambda$ and $d$ denote the signal wavelength and the inter-element antenna spacing, respectively, and $\jmath = \sqrt{-1}$. Note that the elevation angles $\theta_{j,i,l}$ appear in the array response vector when a uniform planar array (UPA) is employed [9].
The variance of the path gains $\alpha_{j,i,l}$ and the normalization
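A minimal NumPy sketch of the clustered ULA channel described above follows. Several choices here are assumptions, not taken from this paper:
- the normalization factor $\sqrt{N_t/(N_cN_p)}$, chosen so that $\mathbb{E}\{\|\mathbf{h}_j\|^2\} = N_t$;
- half-wavelength spacing $d = \lambda/2$;
- cluster-mean AoDs drawn uniformly from $[-\pi/2, \pi/2]$;
- per-path offsets drawn uniformly within a hypothetical angular `spread`.

Elevation angles are ignored, consistent with the ULA case.

```python
import numpy as np

def ula_response(phi, Nt, d_over_lambda=0.5):
    """Normalized ULA response a(phi): unit-norm Vandermonde vector."""
    n = np.arange(Nt)
    return np.exp(1j * 2 * np.pi * d_over_lambda * n * np.sin(phi)) / np.sqrt(Nt)

def clustered_channel(Nu, Nt, Nc, Np, rng, spread=np.deg2rad(5)):
    """Rows h_j = sqrt(Nt/(Nc*Np)) * sum_i sum_l alpha_{j,i,l} a(phi_{j,i,l})^H."""
    H = np.zeros((Nu, Nt), dtype=complex)
    scale = np.sqrt(Nt / (Nc * Np))          # assumed normalization: E||h_j||^2 = Nt
    for j in range(Nu):
        for _ in range(Nc):
            mean_aod = rng.uniform(-np.pi / 2, np.pi / 2)   # hypothetical cluster-mean AoD
            for _ in range(Np):
                phi = mean_aod + rng.uniform(-spread, spread)
                alpha = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)
                H[j] += scale * alpha * ula_response(phi, Nt).conj()   # a^H as a row
    return H
```

Because each path gain has unit variance and each $\mathbf{a}(\phi)$ has unit norm, the chosen scale makes the average channel energy per user equal to $N_t$. This can be checked empirically by averaging $\|\mathbf{h}_j\|^2$ over many draws.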

III. HYBRID SYMBOL-LEVEL PRECODER DESIGN
We start by designing the analog phase-shifting network. The matrix $\mathbf{F}$ representing the phase shifters' angles is usually designed to depend solely on the aggregate channel $\mathbf{H}$. Here, we adopt an analog design based on the singular value decomposition (SVD) of $\mathbf{H}$, which can be expressed as
$$\mathbf{H} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^H,$$
where $\boldsymbol{\Sigma}$ is an $N_u \times N_t$ rectangular diagonal matrix with the singular values on its diagonal in descending order. $\mathbf{U} \in \mathbb{C}^{N_u \times N_u}$ and $\mathbf{V} \in \mathbb{C}^{N_t \times N_t}$ are unitary matrices whose columns are the left and the right singular vectors, respectively. We align the angles of the phase-shifting network with those of the first $N_l$ right singular vectors of $\mathbf{H}$, i.e., $\{\mathbf{v}_1, \ldots, \mathbf{v}_{N_l}\}$. An element-wise normalization is applied due to the constant modulus constraint of the phase shifters. Accordingly, denoting by $f_{k,j}$ the $k$-th entry in the $j$-th column of $\mathbf{F}$, we set
$$f_{k,j} = \frac{1}{\sqrt{N_t}}\, e^{\jmath\varphi_{k,j}},$$
where $\varphi_{k,j}$ is the phase of the $k$-th element of $\mathbf{v}_j$. Aligning the angles of the phase-shifting network with the first $N_l$ right singular vectors of $\mathbf{H}$ enables the system to achieve larger array gains. Note that similar alignment schemes based on the SVD of the channel are used in, e.g., [23], [32]. Infinite-resolution phase shifters are required for an exact implementation of this approach. In practice, however, finite-resolution phase shifters are preferred, particularly in systems with large-scale antenna arrays, since the number of phase shifters is proportional to the number of antenna elements. Therefore, in a more realistic implementation with discrete phase shifters, the phase states are quantized to a (typically) small number of bits. We assume a quantization rule such that the phase of each entry of $\mathbf{F}$ is mapped to the nearest phase value in the discrete set $\{2m\pi/2^{b_{PS}} : m = 0, 1, \ldots, 2^{b_{PS}}-1\}$. Accordingly, the quantized phase of $f_{k,j}$, denoted by $\hat{\varphi}_{k,j}$, can be obtained as
$$\hat{\varphi}_{k,j} = \frac{2\pi}{2^{b_{PS}}}\,\mathrm{round}\!\left(\frac{2^{b_{PS}}\varphi_{k,j}}{2\pi}\right),$$
with $b_{PS}$ denoting the number of resolution bits of the phase shifters.
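The SVD-based phase alignment and the nearest-point phase quantization described above can be sketched as follows. This is a NumPy illustration under stated assumptions, not the authors' code. The function name `phase_network` is hypothetical. Phases are first wrapped to $[0, 2\pi)$ so that the rounding index $m$ falls in $\{0, \ldots, 2^{b_{PS}}-1\}$, with $m = 2^{b_{PS}}$ wrapped back to $0$.

```python
import numpy as np

def phase_network(H, Nl, b_ps=None):
    """F[k, j] = exp(1j * phase of k-th entry of v_j) / sqrt(Nt); quantized if b_ps is given."""
    Nt = H.shape[1]
    _, _, Vh = np.linalg.svd(H)            # H = U diag(s) Vh; rows of Vh are v_j^H
    V = Vh.conj().T[:, :Nl]                # first Nl right singular vectors v_1, ..., v_Nl
    phase = np.angle(V)                    # phases in (-pi, pi]
    if b_ps is not None:
        step = 2 * np.pi / 2 ** b_ps
        # nearest point of {2*m*pi / 2^b_ps : m = 0, ..., 2^b_ps - 1}
        m = np.round(np.mod(phase, 2 * np.pi) / step) % 2 ** b_ps
        phase = m * step
    return np.exp(1j * phase) / np.sqrt(Nt)
```

With nearest-point rounding, each quantized phase differs from the unquantized one by at most half a quantization step, i.e., $\pi/2^{b_{PS}}$. The modulus $1/\sqrt{N_t}$ is preserved in both cases.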
Although our design process is independent of the precision of the entries of $\mathbf{F}$, we investigate in Section V the performance of the proposed hybrid precoding scheme for both infinite-resolution and finite-resolution (discrete) phase shifters. Accordingly, for a given symbol vector $\mathbf{s}$, channel matrix $\mathbf{H}$, and phase-shifting network matrix $\mathbf{F}$, our design objective is to jointly and instantaneously (i.e., on a symbol-level basis) optimize the digitally precoded signal $\mathbf{u}_{BB}$ and the states of the switching network, represented by $\mathbf{T}$.
To this end, we first study the symbol-level precoding solution corresponding to a fully-digital architecture, which is essential to our subsequent development of the hybrid precoding approach.

A. Optimal Fully-Digital Precoder
Let us consider a fully-digital architecture for the symbol-level precoder, where each antenna element is driven by a dedicated RF chain, i.e., $N_u \le N_l = N_t$. The optimization problem of the symbol-level precoder corresponding to a power-constrained max-min fair design criterion, subject to a constructive interference (CI) constraint for each user, is given in [40] as problem (7). There, $\mathbf{u}_{FD} \in \mathbb{C}^{N_t \times 1}$ denotes the fully-digital precoded signal and $\min(\cdot)$ denotes the element-wise minimum. Furthermore, the following real-valued definitions are used:
- $\mathbf{A}_j$ denotes an invertible $2 \times 2$ matrix containing the normal vectors of the decision region boundaries associated with $s_j$;
- $w_j = 1$ if the intended symbol of the $j$-th user is an outer constellation symbol, and $w_j = 0$ otherwise;
- the $2N_u \times 1$ vector $\mathbf{d} \triangleq [d_{1,1}, d_{1,2}, \ldots, d_{N_u,1}, d_{N_u,2}]^T$ collects, for all $j = 1, 2, \ldots, N_u$, the orthogonal distances $d_{j,1}$ and $d_{j,2}$ of the noise-free received symbol $\sqrt{\rho}\,\mathbf{h}_j\mathbf{u}_{FD}$ from the boundaries of its corresponding CI region.
The design formulation (7) uses the so-called distance-preserving CI regions, introduced in [34], to exploit the CI at the receivers. By definition, any two points belonging to the distance-preserving CI regions of two different constellation points are separated by at least the distance between these two constellation points. An illustrative example of distance-preserving CI regions is shown in Fig. 2 for a QPSK constellation. Accordingly, the CI constraint $\sqrt{\rho}\,\bar{\mathbf{H}}\bar{\mathbf{u}}_{FD} = \bar{\boldsymbol{\Sigma}}\bar{\mathbf{s}} + \mathbf{A}^{-1}\mathbf{W}\mathbf{d}$ in (7) attempts to instantaneously align the multiuser interference. The goal is for the noise-free signal received by each user to lie within the distance-preserving CI region corresponding to its intended symbol.
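For intuition, the distance-preserving CI region of a QPSK symbol can be checked numerically. This sketch assumes that, for unit-energy QPSK, the region of symbol $s$ is the quadrant-shaped set anchored at $s$ with edges parallel to the axes, as illustrated in Fig. 2. Under that assumption, $\mathbf{A}_j$ reduces to $\mathrm{diag}(\mathrm{sign}\,\Re s_j, \mathrm{sign}\,\Im s_j)$. Received-symbol scaling (e.g., by noise levels) is ignored, and the helper names are hypothetical.

```python
import numpy as np

# Unit-energy QPSK constellation
QPSK = np.array([1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]) / np.sqrt(2)

def ci_distances(r, s):
    """(d1, d2): signed distances of r from the two edges of s's CI region (QPSK case)."""
    a = np.array([np.sign(s.real), np.sign(s.imag)])     # diagonal of A_j (assumed form)
    return a * np.array([r.real - s.real, r.imag - s.imag])

def in_ci_region(r, s):
    """r lies in the CI region of s iff both distances are nonnegative."""
    return bool(np.all(ci_distances(r, s) >= 0))

def ml_detect(r):
    """Minimum-distance (ML) QPSK detection."""
    return QPSK[np.argmin(np.abs(QPSK - r))]
```

A point such as `0.5 * QPSK[0]` is still detected correctly but lies outside the CI region, since it moves toward the decision boundaries. Any point pushed outward from its symbol stays in the region. Two points taken from the regions of different symbols are never closer than the symbols themselves.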
By introducing a slack variable $\gamma$ and applying the change of variable $\mathbf{d} \to \mathbf{t} + \gamma\mathbf{1}_{2N_u \times 1}$, the optimization problem (7) can be equivalently written as problem (8). Given $\bar{\mathbf{u}}_{FD}$ and $\mathbf{t}$, the maximization of (8) over $\gamma$ admits a provably positive solution $\gamma^*$, given in closed form by (9), where $\mathbf{q} \triangleq \mathbf{A}^{-1}\mathbf{W}\mathbf{1}_{2N_u \times 1}$. As a result, we can eliminate the decision variable $\gamma$ from our design formulation by replacing $\gamma$ in (8) with the closed-form expression for $\gamma^*$ in (9). Denote $\mathbf{Q} \triangleq \|\mathbf{q}\|^2\mathbf{I}_{2N_u} - \mathbf{q}\mathbf{q}^T$. After some straightforward algebraic steps, we can rewrite the optimization problem (8) in the convex form (10), which can be efficiently solved via several off-the-shelf algorithms [41]. The optimal fully-digital precoded signal obtained from (10) in fact provides a performance upper bound. This bound is achievable by the symbol-level precoder when the number of RF chains at the BS equals $N_t$. Therefore, we use this optimal yet impractical solution in developing our hybrid symbol-level precoding algorithm and as a performance benchmark for the comparisons in Section V.
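A supporting observation on why terms built from $\mathbf{Q}$ fit into a convex program follows; it is not a derivation of (10) itself. Assume $\mathbf{q} \neq \mathbf{0}$ and take the identity matrix to be of the same size as $\mathbf{q}\mathbf{q}^T$. Then $\mathbf{Q}$ is positive semidefinite and annihilates $\mathbf{q}$, as the following short Cauchy-Schwarz argument shows:

$$
\begin{aligned}
\mathbf{v}^T\mathbf{Q}\mathbf{v} &= \|\mathbf{q}\|^2\|\mathbf{v}\|^2 - (\mathbf{q}^T\mathbf{v})^2 \;\ge\; 0 \quad \text{for every real vector } \mathbf{v},\\
\mathbf{Q}\mathbf{q} &= \|\mathbf{q}\|^2\mathbf{q} - \mathbf{q}\,(\mathbf{q}^T\mathbf{q}) = \mathbf{0},\\
\mathbf{Q} &= \|\mathbf{q}\|^2\left(\mathbf{I} - \frac{\mathbf{q}\mathbf{q}^T}{\|\mathbf{q}\|^2}\right).
\end{aligned}
$$

Hence $\mathbf{Q}$ is $\|\mathbf{q}\|^2$ times the orthogonal projector onto the orthogonal complement of $\mathrm{span}(\mathbf{q})$. Quadratic terms of the form $\mathbf{x}^T\mathbf{Q}\mathbf{x}$ are therefore convex in $\mathbf{x}$, and their negatives are concave.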

B. Hybrid Precoder With Phase Shifter Selection
We use the optimal fully-digital precoded signal to design the hybrid symbol-level precoder. More specifically, let $\mathbf{u}_{FD}$ denote the optimal solution to (10). We aim to find the digital-domain precoded signal $\mathbf{u}_{BB}$ and the selection matrix $\mathbf{T}$ such that the output of the hybrid precoder, i.e., $(\mathbf{F} \circ \mathbf{T})\mathbf{u}_{BB}$, is as close as possible to $\mathbf{u}_{FD}$. The corresponding optimization problem is given by (11). To proceed, we define $\mathbf{g} \triangleq \mathrm{vec}(\mathbf{G})$ with $\mathbf{G} \triangleq 2\mathbf{T} - \mathbf{1}_{N_t \times N_l}$. We then recast (11) into the equivalent form (12), which is more convenient for our later use. Here, $\tilde{\mathcal{B}}$ is a set of vectors $\mathbf{z} \in \mathbb{R}^{N_tN_l \times 1}$ that plays the role of $\mathcal{B}$ for $\mathbf{g}$. The new formulation (12) is derived by using the well-known property $\mathrm{vec}(\mathbf{X}\mathbf{Y}\mathbf{Z}) = (\mathbf{Z}^T \otimes \mathbf{X})\mathrm{vec}(\mathbf{Y})$ for given matrices $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{Z}$, along with the fact that $\mathbf{T} = (\mathbf{G} + \mathbf{1}_{N_t \times N_l})/2$. Problem (12) belongs to the class of quadratic-form minimization over binary vectors (i.e., with binary constraints on the elements of $\mathbf{g}$), which is known to be NP-hard in general [42]. To tackle this difficulty, we use an equivalent implication of the binary constraints given in a biconvex form.
Lemma 1: Let $\mathbf{g}$ and $\mathbf{e}$ be two real-valued vectors of equal length $N_tN_l$. Provided that $-\mathbf{1} \preceq \mathbf{g} \preceq \mathbf{1}$ and $\|\mathbf{e}\|^2 \le N_tN_l$, the condition $\mathbf{g}^T\mathbf{e} = N_tN_l$ implies that $\mathbf{g} = \mathbf{e}$ and, further, $\mathbf{g} \in \{-1, +1\}^{N_tN_l}$.
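The role of the vectorization identity can be made concrete. For $\mathbf{X} = \mathbf{I}_{N_t}$, $\mathbf{Y} = \mathbf{F} \circ \mathbf{T}$, and $\mathbf{Z} = \mathbf{u}_{BB}$, it gives $(\mathbf{F} \circ \mathbf{T})\mathbf{u}_{BB} = (\mathbf{u}_{BB}^T \otimes \mathbf{I}_{N_t})\,\mathrm{diag}(\mathrm{vec}(\mathbf{F}))\,\mathrm{vec}(\mathbf{T})$. This expression is linear in $\mathrm{vec}(\mathbf{T}) = (\mathbf{g} + \mathbf{1})/2$. The NumPy check below illustrates that identity only. The exact (possibly real-valued) form of (12) is not reproduced, and the helper names are hypothetical.

```python
import numpy as np

def vec(X):
    """Column-stacking vectorization, matching vec(.) in the text."""
    return X.reshape(-1, order="F")

def hadamard_as_linear_map(F, u_bb):
    """Return M such that (F o T) @ u_bb == M @ vec(T) for every T shaped like F."""
    Nt = F.shape[0]
    # (u^T kron I_Nt) diag(vec F): scale column (j*Nt + k) of the Kronecker factor by F[k, j]
    return np.kron(u_bb[None, :], np.eye(Nt)) * vec(F)[None, :]
```

Given $\mathbf{M}$, the hybrid output is an affine function of $\mathbf{g}$, namely $\mathbf{M}(\mathbf{g} + \mathbf{1})/2$. The distance to $\mathbf{u}_{FD}$ is therefore a quadratic form in $\mathbf{g}$, which is what makes (12) a binary quadratic program.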
Proof: See [42].
Using Lemma 1, we can rewrite problem (12) in the equivalent form (13), in which all optimization variables take values in continuous domains. The constraint $\mathbf{g}^T\mathbf{e} = N_tN_l$ in (13) is often called the equilibrium constraint. Reformulation (13) is still a non-convex problem due to this biconvex equilibrium constraint. We use a well-known approach, namely the exact penalty method, to efficiently solve (13). Interested readers are referred to [42] and [43] for studies on the accuracy and convergence characteristics of the exact penalty method. Under this method, the biconvex equilibrium constraint is handled by adding the difference $N_tN_l - \mathbf{g}^T\mathbf{e}$, multiplied by $\mu > 0$, to the objective function as a penalty term. This difference acts as a measure of the deviation from the equilibrium constraint. Accordingly, denoting the objective function of (13) by $g(\mathbf{u}_{BB}, \mathbf{g})$, we obtain problem (14), which is our final formulation for the proposed hybrid symbol-level precoding design. It is important to note that, in general, problems (14) and (13) are not equivalent. However, if the penalty parameter $\mu$ is monotonically increased up to a certain threshold, successive solutions of the penalized problem (14) eventually converge to a solution of the original biconvex problem (13). Moreover, for a given $\mathbf{u}_{BB}$, suppose $g(\mathbf{u}_{BB}, \mathbf{g})$ is an $L$-Lipschitz continuous convex function on $-\mathbf{1} \preceq \mathbf{g} \preceq \mathbf{1}$. It can then be shown that problem (14) has the same local and global minima as (13) for $\mu \ge 2L$, where $L$ denotes the Lipschitz constant of $g(\mathbf{u}_{BB}, \mathbf{g})$ with respect to $\mathbf{g}$; see [42, Th. 1]. In the following lemma, we show that $g(\mathbf{u}_{BB}, \mathbf{g})$ is Lipschitz continuous on the domain $-\mathbf{1} \preceq \mathbf{g} \preceq \mathbf{1}$.
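The Cauchy-Schwarz sketch below shows why the penalty term is a sensible measure of deviation and why Lemma 1 holds. It is our own short argument, not the proof in [42]. The shorthand $N \triangleq N_tN_l$ is introduced here for brevity. Since $-1 \le g_i \le 1$, we have $\|\mathbf{g}\|^2 = \sum_i g_i^2 \le N$, and hence, for any $\mathbf{e}$ with $\|\mathbf{e}\|^2 \le N$,

$$
\mathbf{g}^T\mathbf{e} \;\le\; \|\mathbf{g}\|\,\|\mathbf{e}\| \;\le\; \sqrt{N}\cdot\sqrt{N} \;=\; N .
$$

Therefore, the penalty $\mu(N - \mathbf{g}^T\mathbf{e})$ is nonnegative on the feasible set and vanishes exactly when the equilibrium constraint holds. Equality in the chain has two consequences. First, it requires $\|\mathbf{g}\|^2 = N$, which together with $g_i^2 \le 1$ forces $g_i^2 = 1$ for every $i$. Second, it requires $\mathbf{e} = c\,\mathbf{g}$ with $c \ge 0$ and $\|\mathbf{e}\| = \sqrt{N}$, i.e., $c = 1$. Hence $\mathbf{e} = \mathbf{g} \in \{-1, +1\}^{N}$, which is the statement of Lemma 1.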
Lemma 2: Let $\mathbf{u}_{BB}$ be given. Then $g(\mathbf{u}_{BB}, \mathbf{g})$ is a Lipschitz continuous function on $-\mathbf{1} \preceq \mathbf{g} \preceq \mathbf{1}$ with Lipschitz constant $L$ given in (15).
Proof: See Appendix A.
Finally, we exploit the fact that the objective function of the minimization problem (14), i.e., $g(\mathbf{u}_{BB}, \mathbf{g}) + \mu(N_tN_l - \mathbf{g}^T\mathbf{e})$, is a biconvex quadratic function in $\mathbf{g}$ and $\mathbf{e}$, i.e., fixing either $\mathbf{g}$ or $\mathbf{e}$ yields a convex function of the other variable. Together with Lemma 2, it then follows that a standard block coordinate descent (BCD) algorithm can be used to find at least a locally optimal solution to problem (13). Here, a coordinate block refers to any of the vectors $\mathbf{u}_{BB}$, $\mathbf{g}$, or $\mathbf{e}$. To be more specific, the objective function is minimized over one of these vectors while the other two are fixed, and the same procedure is then repeated for the other two blocks. The penalty multiplier $\mu$ should be increased monotonically every $N$ cycles, where the Lipschitz constant $L$ provided in Lemma 2 determines the limit for increasing $\mu$ as a function of the other variables. Accordingly, the BCD algorithm solving (14) performs the following steps in the $n$-th cycle:
i. Updating $\mathbf{g}$: Given $\mathbf{u}_{BB}^{(n-1)}$ and $\mathbf{e}^{(n-1)}$, $\mathbf{g}$ is updated in the $n$-th cycle by solving problem (16), which is a linearly constrained quadratic program.
ii. Updating $\mathbf{e}$: The value of $\mathbf{e}^{(n)}$ is obtained as the solution to the problem
$$\min_{\mathbf{e}}\ \mu\left(N_tN_l - \mathbf{g}^{(n)T}\mathbf{e}\right) \quad \text{s.t.} \quad \|\mathbf{e}\|^2 \le N_tN_l, \tag{17}$$
which is equivalent to a norm-constrained inner product maximization that admits the simple closed-form solution
$$\mathbf{e}^{(n)} = \sqrt{N_tN_l}\,\frac{\mathbf{g}^{(n)}}{\|\mathbf{g}^{(n)}\|}. \tag{18}$$
iii. Updating $\mathbf{u}_{BB}$: Given $\mathbf{g}^{(n)}$ and $\mathbf{e}^{(n)}$, the minimization problem (14) is equivalent to problem (19). Using the method of Lagrange multipliers, one can obtain the solution to (19), which is used as the $n$-th update of $\mathbf{u}_{BB}$ and is given by (20). There, $\mathrm{vec}(\mathbf{G}^{(n)}) = \mathbf{g}^{(n)}$ and $(\cdot)^\dagger$ stands for the Moore-Penrose inverse.
iv. Updating $\mu$: Once every $N$ cycles, the penalty parameter $\mu$ is updated according to (21). Here, $\vartheta > 1$ is a constant design parameter and $L^{(n)}$ is the $n$-th update of the Lipschitz constant $L$, computed by substituting $\mathbf{u}_{BB}^{(n)}$ into (15).
The pseudocode of the described BCD algorithm is presented in Algorithm 1.
Algorithm 1 BCD Algorithm Solving (14)
1: input: $\mathbf{F}$, $\mathbf{u}_{FD}$
2: output: $\mathbf{u}_{BB}$, $\mathbf{T}$
3: initialize: $\mathbf{u}_{BB}^{(0)} = \mathbf{0}_{N_l \times 1}$, $\mu^{(0)}$, $n = 0$
4: set: $\vartheta > 1$
5: while the terminating condition is not met do
6: $\quad n \leftarrow n + 1$
7: $\quad$ compute $\mathbf{g}^{(n)}$ by solving (16)
8: $\quad$ compute $\mathbf{e}^{(n)}$ using (18)
9: $\quad$ compute $\mathbf{u}_{BB}^{(n)}$ using (20)
10: $\quad$ obtain $L^{(n)}$ from (15)
11: $\quad$ if $n$ is a multiple of $N$, update $\mu$ using (21)
12: end while
13: $\mathbf{T} = \left(\mathrm{vec}^{-1}(\mathbf{g}^{(n)}) + \mathbf{1}_{N_t \times N_l}\right)/2$
In what follows, we analyze the convergence behavior of Algorithm 1.
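The BCD cycle described above can be sketched as a runnable NumPy illustration under several stated assumptions. It is not the authors' implementation.
- The objective is taken as $h = \|\mathbf{u}_{FD} - (\mathbf{F} \circ \mathbf{T})\mathbf{u}_{BB}\|^2 + \mu(N_tN_l - \mathbf{g}^T\mathbf{e})$ with $\mathrm{vec}(\mathbf{T}) = (\mathbf{g} + \mathbf{1})/2$.
- The $\mathbf{g}$-update solves only the box-constrained QP, by projected gradient descent as a stand-in for an LCQP solver. The constraint $\mathbf{g} \in \tilde{\mathcal{B}}$ is ignored.
- The $\mathbf{u}_{BB}$-update is a plain least-squares pseudo-inverse step, assumed to match the form of (20).
- The $\mu$ rule (multiply by `theta`, cap at a hypothetical `mu_max`) stands in for the $\mu$-update of step iv.
- The starting point $\mathbf{g} = \mathbf{e} = \mathbf{1}$ (all switches on) is hypothetical.
- The stopping rule is an absolute change in $h$.

```python
import numpy as np

def vec(X):
    """Column-stacking vectorization, vec(X)."""
    return X.reshape(-1, order="F")

def linear_map(F, u_bb):
    """Matrix M with (F o T) @ u_bb == M @ vec(T) for any T shaped like F."""
    return np.kron(u_bb[None, :], np.eye(F.shape[0])) * vec(F)[None, :]

def objective(F, u_fd, u_bb, g, e, mu):
    """h = ||u_FD - (F o T) u_BB||^2 + mu (N - g^T e), with vec(T) = (g + 1) / 2."""
    r = u_fd - linear_map(F, u_bb) @ ((g + 1) / 2)
    return float(np.vdot(r, r).real + mu * (g.size - g @ e))

def update_g(F, u_fd, u_bb, g, e, mu, n_steps=200):
    """Convex QP in g over the box -1 <= g <= 1, solved by projected gradient descent."""
    M = linear_map(F, u_bb)
    # The Hessian is Re(M^H M)/2, so ||M||_2^2 / 2 bounds the gradient's Lipschitz constant
    lip = max(np.linalg.norm(M, 2) ** 2 / 2, 1e-12)
    for _ in range(n_steps):
        r = u_fd - M @ ((g + 1) / 2)
        grad = -(M.conj().T @ r).real - mu * e
        g = np.clip(g - grad / lip, -1.0, 1.0)   # gradient step, then project onto the box
    return g

def update_e(g):
    """Closed form of max g^T e s.t. ||e||^2 <= N: e = sqrt(N) g / ||g||."""
    norm = np.linalg.norm(g)
    return np.sqrt(g.size) * g / norm if norm > 0 else np.zeros_like(g)

def update_u(F, g, u_fd):
    """Least-squares u_BB = (F o T)^+ u_FD with T = (G + 1) / 2 (assumed form of (20))."""
    G = g.reshape(F.shape, order="F")        # vec^{-1}(g)
    return np.linalg.pinv(F * (G + 1) / 2) @ u_fd

def bcd_hybrid(F, u_fd, mu0=0.1, theta=2.0, every=5, mu_max=1e3, eps=1e-6, max_iter=100):
    """Hypothetical BCD sketch returning u_BB, a binary T, and the objective history."""
    Nt, Nl = F.shape
    u_bb = np.zeros(Nl, dtype=complex)       # u_BB^(0) = 0, as in Algorithm 1
    g = np.ones(Nt * Nl)                     # assumed start: all switches on
    e = np.ones(Nt * Nl)                     # assumed e^(0)
    mu = mu0
    history = [objective(F, u_fd, u_bb, g, e, mu)]
    for n in range(1, max_iter + 1):
        g = update_g(F, u_fd, u_bb, g, e, mu)
        e = update_e(g)
        u_bb = update_u(F, g, u_fd)
        history.append(objective(F, u_fd, u_bb, g, e, mu))
        if abs(history[-2] - history[-1]) <= eps:
            break
        if n % every == 0:
            mu = min(theta * mu, mu_max)     # assumed stand-in for the mu-update rule
    T = (g >= 0).astype(float).reshape(Nt, Nl, order="F")   # round g, then map {-1,+1} -> {0,1}
    u_bb = np.linalg.pinv(F * T) @ u_fd      # refit the baseband signal for the binary T
    return u_bb, T, history
```

Setting `every` larger than `max_iter` keeps $\mu$ fixed. In that case each block update cannot increase $h$, so the recorded history should be nonincreasing, in line with the argument of the next subsection. The final least-squares refit guarantees only that the hybrid output is no farther from $\mathbf{u}_{FD}$ than the all-zero output.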

C. Convergence Analysis
The BCD algorithm is a successive optimization approach in which a certain approximate version of the objective function is optimized with respect to one block of variables at a time, while the remaining blocks are fixed [44]. Let $h(\mathbf{u}_{\mathrm{BB}}, \mathbf{g}, \mathbf{e})$ denote the objective function of problem (14). As mentioned earlier in this section, by fixing two of the variables $\mathbf{u}_{\mathrm{BB}}$, $\mathbf{g}$ and $\mathbf{e}$, the function $h(\mathbf{u}_{\mathrm{BB}}, \mathbf{g}, \mathbf{e})$ becomes convex in the remaining one. More precisely, the sub-problem (16) is a convex linearly-constrained quadratic program (LCQP) that can be solved to optimality. Furthermore, the sub-problems (17) and (19) admit closed-form solutions and can therefore be solved to global optimality. This implies that, for a fixed penalty parameter $\mu$, the $n$-th cycle satisfies $h(\mathbf{u}_{\mathrm{BB}}^{(n-1)}, \mathbf{g}^{(n-1)}, \mathbf{e}^{(n-1)}) \ge h(\mathbf{u}_{\mathrm{BB}}^{(n-1)}, \mathbf{g}^{(n)}, \mathbf{e}^{(n-1)}) \ge h(\mathbf{u}_{\mathrm{BB}}^{(n-1)}, \mathbf{g}^{(n)}, \mathbf{e}^{(n)}) \ge h(\mathbf{u}_{\mathrm{BB}}^{(n)}, \mathbf{g}^{(n)}, \mathbf{e}^{(n)})$, where $\mathbf{g}^{(n)}$, $\mathbf{e}^{(n)}$ and $\mathbf{u}_{\mathrm{BB}}^{(n)}$ denote the $n$-th updates of $\mathbf{g}$, $\mathbf{e}$ and $\mathbf{u}_{\mathrm{BB}}$, respectively. As a result, the sequence of objective function values after the update of each block is monotonically non-increasing. Since $h$ is also bounded below (the fitting term is non-negative, and $\mathbf{g}$ and $\mathbf{e}$ lie in bounded sets), Algorithm 1 is guaranteed to converge to a stationary point, which is not necessarily a global minimizer. We further note that the terminating condition for Algorithm 1 is defined with respect to a threshold $\epsilon_o$ for the desired accuracy. In Fig. 3, we illustrate the convergence behavior of Algorithm 1 by plotting the value of the objective function $h(\mathbf{u}_{\mathrm{BB}}, \mathbf{g}, \mathbf{e})$ versus the number of outer iterations (cycles) for phase shifters with different numbers of precision bits $b_{\mathrm{PS}}$, where it is shown that the proposed algorithm converges within a few cycles. In particular, for a desired accuracy of $\epsilon_o = 10^{-2}$, Algorithm 1 converges in no more than 10 iterations in all cases. It can further be seen that the algorithm shows a higher residual error for lower values of $b_{\mathrm{PS}}$.
This is due to the fact that discretizing the states of the phase shifters with fewer precision bits induces a greater discontinuity in the feasible region of the optimization problem, and therefore, it may not be possible to reduce the Euclidean distance between the fully-digital and the hybrid precoders beyond a certain limit.

D. Analysis of Computational Complexity
Using the four-step BCD approach summarized in Algorithm 1, the overall computation cost of solving (14) in terms of the required number of arithmetic operations is composed of two main parts. The first part involves inner iterations to solve the sub-problem (16) over g and updating u BB using (20), and the second part refers to the outer iterations (cycles) over coordinate blocks.
The computation cost of updating $\mathbf{u}_{\mathrm{BB}}$ via (20) is dominated by the arithmetic complexity of the matrix pseudo-inversion $(\mathbf{F}\circ\mathbf{G}^{(n)} + \mathbf{F})^{\dagger}$, which is of order $\mathcal{O}(N_tN_l^2)$ given the dimensions of $\mathbf{F}$ and $\mathbf{G}$. Furthermore, to efficiently solve (16), one may use off-the-shelf algorithms such as (accelerated) projected/proximal gradient methods [45], or quasi-Newton approaches, e.g., L-BFGS-B [46]. In particular, for a Lipschitz smooth (not necessarily strongly) convex objective function as in (16), accelerated projected/proximal gradient methods converge sublinearly, requiring $\mathcal{O}(1/\sqrt{\epsilon_i})$ iterations to reach an $\epsilon_i$-optimal solution. For example, using the accelerated projected gradient descent algorithm, the per-iteration complexity associated with sub-problem (16) is dominated by matrix multiplications of the limiting order $N_l^2N_t^2$, as $N_l, N_t \to \infty$. Therefore, in the limiting case, the total number of operations needed to solve the inner sub-problem (16) with an accuracy of $\epsilon_i$ is of order $\mathcal{O}(N_l^2N_t^2/\sqrt{\epsilon_i})$. Putting this together with the complexity of computing $\mathbf{u}_{\mathrm{BB}}$, every cycle of the BCD algorithm has a dominating complexity of $\mathcal{O}(N_tN_l^2 + N_l^2N_t^2/\sqrt{\epsilon_i})$. On the other hand, the reformulation (14), which is obtained based on the exact penalty method, is guaranteed to converge to a first-order Karush-Kuhn-Tucker (KKT) point with an accuracy of $\epsilon_o$ within a bounded number of cycles given by an expression of the form $\lceil \ln(2L \ldots) \rceil$ [42], where $\mu^{(0)}$ is the initial value of the penalty parameter $\mu$ and $\lceil\cdot\rceil$ denotes the ceiling operation. To have a complete analysis of the complexity, we further need to evaluate the constant $L$. From (15), one can bound $L$ from above, where the equality in that bound follows from the definition of matrix $\mathbf{F}$ in (5), which yields $\|\mathbf{F}\|_F = \sqrt{N_l}$, along with the fact that $\|\mathbf{u}_{\mathrm{FD}}\| = 1$; see (10). It can further be verified that in the large-system limit where $N_t \to \infty$ with $N_l \ll N_t$, we have $\|\mathbf{u}_{\mathrm{BB}}\| \to 1$.
Therefore, the bound on $L$ simplifies further, and in the limiting case with $N_l \to \infty$, the upper bound (28) on the dominating order of $L$ follows. Using this upper bound, the worst-case computational complexity of Algorithm 1 solving the design problem (14) with accelerated inner gradient steps can be obtained as shown in Table I. In practice, however, the outer optimization usually converges in a few cycles, as we will see in Section V. For comparison purposes, the complexities of the hybrid symbol-level precoding techniques proposed in [23] and [24] are also reported in Table I. For the hybrid scheme in [23], the reported complexity order refers to the worst-case complexity of reaching an $\epsilon_o$-optimal solution to a linear problem via the interior-point method; see [47]. Note that, in Table I, $C_{\mathrm{FD}}$ denotes the complexity of solving the fully-digital symbol-level precoding problem, which generally depends on the adopted optimization algorithm. This solution is used as a baseline in designing the hybrid precoder in [24] and in this work. In Section V, we will compare the complexities of the hybrid precoding schemes in Table I by incorporating the solutions to both the fully-digital and hybrid design problems and evaluating the corresponding execution times.
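To make the per-cycle orders concrete, the snippet below evaluates the two terms, $N_tN_l^2$ for the pseudo-inverse in (20) and $N_l^2N_t^2/\sqrt{\epsilon_i}$ for the accelerated inner solver of (16), with all constant factors dropped. This is an order-of-magnitude illustration only: $N_t = 64$ and $N_l = 8$ match the simulation setup of Section V, while $\epsilon_i = 10^{-2}$ is a hypothetical inner accuracy, and the function name is hypothetical. The ratio of the two terms, $N_t/\sqrt{\epsilon_i}$, indicates why the inner sub-problem dominates each cycle.

```python
import math


def per_cycle_orders(Nt, Nl, eps_i):
    """Leading-order operation counts (constants dropped) for one BCD cycle."""
    pinv_ops = Nt * Nl ** 2                                # pseudo-inverse in the u_BB update
    inner_ops = (Nl ** 2) * (Nt ** 2) / math.sqrt(eps_i)   # accelerated projected gradient on (16)
    return pinv_ops, inner_ops


if __name__ == "__main__":
    pinv_ops, inner_ops = per_cycle_orders(64, 8, 1e-2)
    print(pinv_ops, inner_ops, inner_ops / pinv_ops)
```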

IV. ENERGY EFFICIENCY ANALYSIS
Hybrid precoding strategies predominantly focus on reducing hardware cost/complexity and power consumption by delegating part of the signal processing burden to the analog domain. In return, they may sacrifice precoding performance, e.g., spectral efficiency, with respect to fully-digital systems. On the other hand, various hybrid implementations may differ from one another in their complexity and power consumption. In order to compare different hybrid architectures and also to assess their efficiency versus the fully-digital alternative, one needs to incorporate both performance and complexity/power consumption aspects into one single figure of merit. A common choice is energy efficiency, which can simply be expressed as the ratio between spectral efficiency and power consumption. Due to the assumption of finite-alphabet signaling, we measure the spectral efficiency in bits per symbol. Thereby, the energy efficiency of the precoding scheme, in bits per Joule, is defined as the ratio between goodput and power consumption, i.e., $\eta = R(1 - \bar{P}_e)/P$, where $\bar{P}_e = \frac{1}{N_u}\sum_{j=1}^{N_u} P_{e,j}$ is the average symbol error probability across all $N_u$ users, with $P_{e,j}$ denoting the symbol error probability for the $j$-th user. The average per-user spectral efficiency $R$ and the power consumption $P$ are defined as follows.

A. Spectral Efficiency
Using an uncoded transmission scheme with finite-alphabet signaling, the communication rate towards the $j$-th user can be evaluated, in terms of bits per symbol per unit bandwidth, by calculating the average mutual information $I(s_j; y_j)$ between the target symbol $s_j$ and the received signal $y_j$, as given in (30). Assuming transmission at the Nyquist rate over a double-sided bandwidth of $W$ Hz, the maximum allowable symbol rate is $W$ symbols per second, which results in a bit rate of $W \times I(s_j; y_j)$ for user $j$. Putting this together for all $N_u$ users, the average per-user achievable rate of the downlink channel is given by $R = \frac{W}{N_u}\sum_{j=1}^{N_u} I(s_j; y_j)$. It should be noted that deriving closed-form expressions for the conditional probability mass functions in (30) is a cumbersome task. As an alternative, one can obtain empirical probability distributions over sufficiently many independent realizations of the channel and the users' symbols to approximate the mutual information $I(s_j; y_j)$ for each user $j \in \{1, 2, \ldots, N_u\}$.
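The empirical approach described above can be sketched as a plug-in estimate of mutual information from a joint histogram of transmitted and received symbols. This sketch assumes, as a discrete stand-in for $y_j$, that the receiver's hard-decision symbol $\hat{s}_j$ is used, so that all distributions are PMFs; the section does not state how $y_j$ is discretized. The standalone AWGN demo at 0 dB with minimum-distance detection is hypothetical and is not the constructive-interference precoded setup of the paper, and the function name is illustrative. The plug-in estimate is only reliable with sufficiently many samples per symbol pair.

```python
import numpy as np


def empirical_mutual_information(s, s_hat, alphabet_size):
    """Plug-in estimate of I(s; s_hat) in bits from paired integer symbol indices."""
    s = np.asarray(s, dtype=int)
    s_hat = np.asarray(s_hat, dtype=int)
    joint = np.zeros((alphabet_size, alphabet_size))
    np.add.at(joint, (s, s_hat), 1.0)          # joint histogram of (sent, detected)
    joint /= joint.sum()                       # empirical joint PMF
    p_s = joint.sum(axis=1, keepdims=True)     # marginal of s
    p_shat = joint.sum(axis=0, keepdims=True)  # marginal of s_hat
    outer = p_s @ p_shat                       # product of marginals
    mask = joint > 0                           # 0 log 0 = 0 convention
    return float(np.sum(joint[mask] * np.log2(joint[mask] / outer[mask])))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    qpsk = np.exp(1j * (np.pi / 4 + np.pi / 2 * np.arange(4)))  # unit-energy QPSK
    n, snr = 100_000, 1.0                                        # 0 dB (hypothetical)
    s = rng.integers(0, 4, n)
    noise = np.sqrt(1 / (2 * snr)) * (rng.standard_normal(n) + 1j * rng.standard_normal(n))
    y = qpsk[s] + noise
    s_hat = np.argmin(np.abs(y[:, None] - qpsk[None, :]), axis=1)  # minimum-distance detection
    W = 1e9                                                         # 1 GHz bandwidth (Section V)
    I = empirical_mutual_information(s, s_hat, 4)
    print(f"I ~ {I:.3f} bit/symbol, rate ~ {W * I / 1e9:.3f} Gbit/s")
```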

B. Power Consumption
The power dissipated by the BS's RF front-end components accounts for the power consumption at the BS. In the sequel, we first adopt power consumption models for typical components of an RF front-end and then tailor the overall power consumption model to each precoding architecture, namely, fully-digital and hybrid (with and without phase shifter selection). The transmit RF front-end of a multi-antenna system is commonly composed of one baseband processor, several RF chains, each preceded by a pair of DACs (i.e., one DAC for each I/Q channel), and power amplifiers (PAs). The use of analog components such as dividers, combiners, switches, and/or phase shifters is limited to hybrid architectures.
As a rule of thumb, the power consumption of a DAC scales linearly in the sampling rate and exponentially in the number of bits per sample (i.e., resolution bits). We assume the DACs are of the binary-weighted current-steering type [48], whose power consumption $P_{\mathrm{DAC}}$ is approximately given in [49] by (32) as a function of $b_{\mathrm{DAC}}$ and $F_s$, respectively denoting the number of precision bits and the sampling frequency. A typical RF chain includes one mixer, one local oscillator, two low-pass filters and a baseband amplifier. We respectively denote by $P_{\mathrm{M}}$, $P_{\mathrm{LO}}$, $P_{\mathrm{LPF}}$ and $P_{\mathrm{BBA}}$ the power dissipation of these RF chain components. Thereby, the power consumed by a single RF chain is $P_{\mathrm{RF}} = P_{\mathrm{M}} + P_{\mathrm{LO}} + 2P_{\mathrm{LPF}} + P_{\mathrm{BBA}}$. In case all the RF streams are transmitted at the same frequency, it might be possible to share a single local oscillator among all the chains and divide the power consumption $P_{\mathrm{LO}}$ accordingly [27]. Further, let $P_{\mathrm{BB}}$, $P_{\mathrm{PA}}$, $P_{\mathrm{PS}}$ and $P_{\mathrm{SW}}$ respectively denote the power consumption of the baseband processor, a single PA, a single phase shifter and a single analog switch. Note also that, in general, the power dissipation of the RF combining network is very low [50], and thus it is ignored in our modeling. The fully-digital BS architecture requires $2N_t$ DACs and $N_t$ RF chains and PAs, and therefore its power consumption can be modeled as $P_{\mathrm{FD}} = P_{\mathrm{BB}} + 2N_tP_{\mathrm{DAC}} + N_t(P_{\mathrm{RF}} + P_{\mathrm{PA}})$. On the other hand, the hybrid architecture with a fully-connected phase-shifting network can be implemented using $2N_l$ DACs, $N_l$ RF chains, $N_t$ PAs, and $N_t \times N_l$ phase shifters. The resulting power dissipation is thus given by $P_{\mathrm{H}} = P_{\mathrm{BB}} + 2N_lP_{\mathrm{DAC}} + N_lP_{\mathrm{RF}} + N_tP_{\mathrm{PA}} + N_tN_lP_{\mathrm{PS}}$. To calculate the power consumption of the hybrid architecture with fully-connected networks of phase shifters and switches, i.e., with phase shifter selection, we assume that the associated RF processes are turned off while a phase shifter is deactivated and, further, that the phase shifter has negligible static power dissipation. Under this assumption, a deactivated phase shifter consumes no power.
Denoting the average fraction of active phase shifters at a symbol instant by $\beta$, the power consumed by the entire phase-shifting network is then $\beta N_tN_lP_{\mathrm{PS}}$. As illustrated in Fig. 1, the phase shifter selection mechanism is implemented through a network of $N_t \times N_l$ switches. Therefore, the power consumption of the hybrid precoder with phase shifter selection can be obtained as $P_{\mathrm{H\text{-}PSS}} = P_{\mathrm{BB}} + 2N_lP_{\mathrm{DAC}} + N_lP_{\mathrm{RF}} + N_tP_{\mathrm{PA}} + \beta N_tN_lP_{\mathrm{PS}} + N_tN_lP_{\mathrm{SW}}$. Recall from Section II that the selection matrix $\mathbf{T}$ is constrained to have no all-zero row (column), i.e., at least one phase shifter corresponding to a specific antenna element (RF chain) must be active at a symbol instant. As a consequence, the number of active phase shifters during any symbol period is never less than $\max\{N_l, N_t\} = N_t$, from which it follows that $1/N_l \le \beta \le 1$. Our simulation results in Section V further indicate that $\beta$ is usually smaller than 0.75 for the proposed hybrid symbol-level precoder in (14), regardless of the phase-shifting precision. This may lead to significant reductions in the power consumption of the analog phase-shifting network. It is also important to note that by employing low-power yet efficient mmWave switches, the additional power consumption due to the switching operation can be made negligible compared with the power reduction of the phase shifters.
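The power models above can be collected into a small calculator. This is a sketch under stated assumptions: all powers are in mW, the defaults are the reference values listed in Section V, $P_{\mathrm{BB}} = P_{\mathrm{DAC}}$ as assumed there, and $P_{\mathrm{DAC}}$ is a user-supplied input because the DAC model (32) is not reproduced in this section. The energy-efficiency helper follows the goodput-over-power definition of Section IV, $R(1 - \bar{P}_e)/P$. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PowerParams:
    """Component powers in mW; p_dac must be supplied since the DAC model (32) is not reproduced."""
    p_dac: float
    p_rf: float = 40.0  # per RF chain (Section V reference value)
    p_pa: float = 20.0  # per power amplifier
    p_ps: float = 30.0  # per phase shifter
    p_sw: float = 1.0   # per switch (conservative choice in Section V)

    @property
    def p_bb(self) -> float:
        return self.p_dac  # P_BB = P_DAC, as assumed in Section V


def p_fully_digital(Nt, pp):
    return pp.p_bb + 2 * Nt * pp.p_dac + Nt * (pp.p_rf + pp.p_pa)


def p_hybrid(Nt, Nl, pp):
    return pp.p_bb + 2 * Nl * pp.p_dac + Nl * pp.p_rf + Nt * pp.p_pa + Nt * Nl * pp.p_ps


def p_hybrid_pss(Nt, Nl, beta, pp):
    if not 1.0 / Nl <= beta <= 1.0:
        raise ValueError("beta must satisfy 1/Nl <= beta <= 1")
    return (pp.p_bb + 2 * Nl * pp.p_dac + Nl * pp.p_rf + Nt * pp.p_pa
            + beta * Nt * Nl * pp.p_ps + Nt * Nl * pp.p_sw)


def energy_efficiency(rate_bps, avg_ser, power_mw):
    """Goodput per unit power in bit/Joule: R (1 - average SER) / P."""
    return rate_bps * (1.0 - avg_ser) / (power_mw * 1e-3)
```

For example, with $N_t = 64$, $N_l = 4$ and $\beta = 0.45$, the phase-shifting network alone saves $(1 - \beta)N_tN_lP_{\mathrm{PS}} = 4224$ mW, of which $N_tN_lP_{\mathrm{SW}} = 256$ mW is spent on the switches, independently of the assumed $P_{\mathrm{DAC}}$.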
Using the above power consumption models with appropriate parameter selection, we will compare the power consumed by different fully-digital/hybrid architectures in Section V.

V. SIMULATION RESULTS
In this section, we present simulation results to evaluate the performance of the proposed hybrid symbol-level precoding approach and to compare it with existing schemes. The simulation setup is as follows. We consider the hybrid analog-digital precoding architecture depicted in Fig. 1 for a downlink mmWave massive multiuser MIMO system, performing uncoded transmission with QPSK signaling at a carrier frequency of 60 GHz over a bandwidth of 1 GHz. We assume unit noise variance at the receivers of all the users, i.e., $\sigma_j^2 = 1$, $\forall j = 1, 2, \ldots, N_u$. As described in Section II, we adopt a geometric model for the mmWave propagation environment with $N_c = 1$ cluster and $N_p = 12$ scatterers between the BS and each user. For all the propagation paths, the azimuth angles of departure $\phi_{j,i,l}$ are drawn independently from a uniform distribution over $[0, 2\pi)$. To initialize Algorithm 1, we set $N = 1$ and $\vartheta = 1.1$ to avoid overshooting, and choose $\mu^{(0)} = 10^{-4}$ as a reasonable starting point.
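A geometric channel with $N_c$ clusters and $N_p$ scatterers per user can be sketched as follows. Since the array geometry and path-gain statistics of Section II are not reproduced here, this sketch assumes a half-wavelength uniform linear array, i.i.d. $\mathcal{CN}(0, 1)$ path gains, and a $\sqrt{N_t/(N_cN_p)}$ normalization; only $N_c = 1$, $N_p = 12$ and the uniform angles of departure over $[0, 2\pi)$ are taken from the setup above. The function names are hypothetical. Under these assumptions, $\mathbb{E}\|\mathbf{h}_j\|^2 = N_t$.

```python
import numpy as np


def ula_steering(phi, Nt):
    """Unit-norm response of an (assumed) half-wavelength ULA toward azimuth phi."""
    n = np.arange(Nt)
    return np.exp(1j * np.pi * n * np.sin(phi)) / np.sqrt(Nt)


def geometric_channel(Nt, Nc=1, Np=12, rng=None):
    """One user's channel vector: a sum of Nc * Np paths with CN(0, 1) gains (assumed)."""
    rng = np.random.default_rng() if rng is None else rng
    h = np.zeros(Nt, dtype=complex)
    for _ in range(Nc):
        for _ in range(Np):
            alpha = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)
            phi = rng.uniform(0.0, 2.0 * np.pi)  # azimuth AoD ~ U[0, 2*pi)
            h += alpha * ula_steering(phi, Nt)
    return np.sqrt(Nt / (Nc * Np)) * h


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H = np.stack([geometric_channel(64, rng=rng) for _ in range(4)])  # Nu = 4 users
    print(H.shape)
```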
We consider the fully-digital Wiener filter (WF) precoding [51] and the optimal fully-digital symbol-level precoding (SLP) as our performance benchmarks, and further provide comparisons with the block-level hybrid precoding technique PZF and its quantized variant QPZF in [18], the block-level hybrid precoding with phase shifter selection in [32], and the hybrid symbol-level precoders in [23] and [24]. Note that the PZF and QPZF techniques are applicable only to fully-loaded systems, i.e., when $N_u = N_l$, and therefore their performance has been evaluated only in the relevant scenarios. We further note that the method in [23] performs a symbol-based optimization of the digital baseband precoder subject to one-bit DACs and adopts a CSI-only design for the phase-shifting network. On the other hand, the hybrid scheme in [24] jointly optimizes both the digital baseband precoder and the phase-shifting network on a symbol-level basis. Accordingly, we name the methods in [23], [24], and the proposed scheme after the adopted hybrid architecture and the precoder design approach. To summarize, throughout this section, the hybrid precoding techniques of interest are referred to as:
- Hybrid BB SLP: the hybrid symbol-level precoding in [23] with symbol-level baseband precoder optimization subject to one-bit DACs and a CSI-only phase-shifting network
- Hybrid BB+PS SLP: the hybrid symbol-level precoding in [24] with joint baseband precoder and phase-shifting network optimization
- Hybrid PSS BLP: the block-level hybrid precoding with phase shifter selection in [32]
- Hybrid BB+SW SLP: the proposed hybrid symbol-level precoding with joint baseband precoder and switching network optimization (Algorithm 1)
- Hybrid BB+SW SLP-NOPSS: the proposed hybrid symbol-level precoding with baseband precoder optimization and no phase shifter selection
In our simulations, the power consumption is calculated according to the model introduced in Section IV, in which we consider reference values of $P_{\mathrm{RF}} = 40$ mW, $P_{\mathrm{PA}} = 20$ mW, $P_{\mathrm{PS}} = 30$ mW, and $P_{\mathrm{BB}} = P_{\mathrm{DAC}}$, as in [27]. As for the power consumption of switches, nFET switches are known to have zero static power dissipation. On the other hand, silicon-germanium (SiGe) based switches have been shown to achieve high performance while consuming less than 1 mW [29].
Therefore, based on the available technology for the implementation of RF switches, a fairly conservative choice would be P SW = 1 mW. Moreover, the power consumption of DACs is calculated via (32) assuming a sampling frequency of F s = 1 GHz which should be sufficient for mmWave systems. We further assume b DAC = 12 for those architectures employing high-resolution DACs.
In Fig. 4, we compare the power consumption of various hybrid precoding implementations with that of the fully-digital architecture as a function of the number of BS RF chains $N_l$, while fixing the numbers of transmit antennas and users to $N_t = 64$ and $N_u = 4$, respectively. As might be expected, the power consumption of the hybrid implementations increases with $N_l$, as a consequence of requiring more RF elements, phase shifters, and/or switches. This implies that increasing the number of RF chains beyond a certain limit makes the hybrid implementation consume more power than the fully-digital architecture. Nevertheless, for $N_l \le 10$, all the hybrid implementations consume less power than the fully-digital precoder. Remarkably, among the hybrid symbol-level precoding schemes in Fig. 4, the proposed hybrid BB+SW SLP consumes the least power, with either infinite-precision or discrete phase shifters. This is due to the adopted phase shifter selection mechanism. In particular, the proposed hybrid precoder has the smallest power consumption with $b_{\mathrm{PS}} = 1$ due to the large percentage of deactivated phase shifters, as we will see later in this section. Note also that the differences in power consumption between the hybrid precoding schemes in Fig. 4 increase with $N_l$. In a scenario with $N_t = 64$ and $N_u = 4$, the percentages of deactivated phase shifters, i.e., $(1 - \beta) \times 100$, for the proposed hybrid precoding approach are shown versus the number of RF chains in Fig. 5 for different values of $b_{\mathrm{PS}}$. The results show that, with low-resolution phase shifters, a higher percentage of the phase shifters can be switched off, and hence more power savings are possible.
In particular, for $N_l = 4$, up to 55% of the phase shifters with $b_{\mathrm{PS}} = 1$ can be turned off, which roughly translates to a power reduction of $(1 - \beta)N_tN_lP_{\mathrm{PS}} = 0.55 \times 64 \times 4 \times 30$ mW $\approx 4200$ mW in the phase-shifting network. It can further be observed from Fig. 5 that the percentage of deactivated phase shifters decreases with increasing $N_l$. One can justify these observations by considering that increasing the number of phase-shifting precision bits reduces the discontinuity in the feasible region of the optimization problem (14), while increasing the number of RF chains increases the design degrees of freedom. In both cases, this enables the algorithm to achieve lower values of the objective function by activating a larger fraction of the phase shifters. The former case can also be verified from Fig. 3, where the residual error in the objective function is shown to be smaller for phase shifters with more resolution bits.
In Fig. 6, we plot the average users' symbol error rate (SER) achieved by the precoding techniques of interest with either fully-digital or hybrid architectures versus the transmit SNR for a system with $(N_t, N_l, N_u) = (64, 8, 8)$. The proposed hybrid symbol-level precoder is evaluated for various implementations with infinite-precision and discrete phase shifters, where in the latter case we assume $b_{\mathrm{PS}} = 1$ and $b_{\mathrm{PS}} = 2$ bits of precision. It can be seen that, for the case with $b_{\mathrm{PS}} = 2$, both the hybrid BB+PS SLP and the hybrid BB+SW SLP schemes perform close to the fully-digital SLP, while requiring far fewer RF chains to process the transmitted signal. The corresponding losses at SER $= 10^{-2}$ are around 0.1 dB and 0.5 dB, respectively. Using phase shifters with $b_{\mathrm{PS}} = 1$, the hybrid BB+SW SLP scheme still offers a reasonable performance with a loss smaller than 1 dB at SER $= 10^{-2}$ compared with the fully-digital SLP, as opposed to the hybrid BB+PS SLP scheme, which shows a significantly deteriorated performance. It can further be seen from Fig. 6 that both the hybrid BB+PS SLP and hybrid BB+SW SLP approaches outperform the PZF technique, which is a result of designing the precoded signal specifically for each instantaneous combination of the users' target symbols. Overall, Fig. 6 shows that the hybrid BB+PS SLP scheme offers the best SER performance among the hybrid SLP schemes of interest. Nevertheless, as demonstrated in Fig. 4, this superior performance comes with increased power consumption.
In Fig. 7, the average per-user spectral efficiencies of the precoding schemes of interest are shown for the system parameter set $(N_t, N_l, N_u) = (64, 8, 8)$. As can be seen, the spectral efficiency results follow the same relative trend as the SER results. The hybrid BB+PS and BB+SW SLP schemes are more spectrally efficient than the PZF and QPZF techniques, which is a result of the constructive interference (CI)-based positioning of the received signals. Remarkably, the spectral efficiencies achieved by the hybrid BB+PS SLP scheme with $b_{\mathrm{PS}} = 2$ and by the hybrid BB+SW SLP scheme with $b_{\mathrm{PS}} = 1$, $b_{\mathrm{PS}} = 2$ or $b_{\mathrm{PS}} = \infty$ are close to those of the fully-digital WF and SLP. The maximum loss with respect to the fully-digital SLP corresponds to the hybrid BB+SW SLP scheme with $b_{\mathrm{PS}} = 1$, which is around 0.06 bps/symbol/Hz at SNR = 0 dB. On the other hand, hybrid BB+PS SLP is shown in Fig. 7 to be the most spectrally efficient approach among the hybrid symbol-level precoders of interest.
Up to this point, we have seen that, among the hybrid symbol-level precoders of interest, different approaches perform best in terms of power consumption, symbol error rate, or spectral efficiency. For an all-inclusive comparison, we use the energy efficiency measure defined in Section IV, which incorporates all the aforementioned figures of merit into a single measure of the overall precoding performance. The results are shown in Fig. 8, where the energy efficiencies of different fully-digital and hybrid multiuser precoders are plotted as a function of the transmit SNR for a system with $(N_t, N_l, N_u) = (64, 8, 8)$. As can be seen, almost all of the hybrid symbol-level precoders are more energy-efficient than the fully-digital SLP, while the proposed hybrid BB+SW SLP approach with phase shifter selection outperforms the other schemes with either infinite- or finite-resolution phase shifters. The most energy-efficient scheme is shown to be hybrid BB+SW SLP with $b_{\mathrm{PS}} = 1$, which achieves energy efficiency gains of up to 75 Mbps/Joule per user over the fully-digital SLP. In contrast to the hybrid BB+PS SLP scheme, employing phase shifters with fewer precision bits improves the energy efficiency of hybrid BB+SW SLP. This is because more phase shifters can be switched off using low-precision phase shifters, which leads to larger reductions in power consumption. It is important to note that in our power consumption model, we consider the same reference value for phase shifters with any number of precision bits. This is a simplistic approach since, in practice, higher-resolution phase shifters consume more power. In such a case, the power consumption and energy efficiency results for the proposed hybrid precoder with low-resolution phase shifters would show an even higher gain compared to the other schemes of interest. It can further be seen from Fig. 8 that the proposed hybrid algorithm with $b_{\mathrm{PS}} = \infty$ outperforms the hybrid PSS BLP scheme, where both techniques employ a phase shifter selection mechanism via a switching network, but on a symbol-level and a block-level basis, respectively. In particular, the hybrid BB+SW technique shows higher energy efficiency gains against the hybrid PSS BLP scheme at low SNRs.
We are further interested in the behavior of energy efficiency as a function of the number of available RF chains $N_l$, which is plotted in Fig. 9 for fixed numbers of transmit antennas $N_t = 64$ and users $N_u = 4$ at an SNR of −5 dB. A common trend across all the hybrid symbol-level precoders is that their energy efficiency decreases as $N_l$ increases. This is in accordance with the power consumption results in Fig. 4, indicating that, for a fixed number of antennas, a hybrid precoding implementation becomes less energy-efficient than its fully-digital counterpart whenever the number of RF chains exceeds an upper limit. This upper limit is shown in Fig. 9 to be larger for the proposed hybrid BB+SW SLP approach. On the other hand, comparing the proposed hybrid symbol-level precoder with the case where all the phase shifters are active, i.e., with no phase shifter selection, we can conclude that applying the phase shifter selection mechanism can substantially improve the energy efficiency of hybrid symbol-level precoding. The results in Fig. 9 show that gains of up to 37 Mbps/Joule per user can be achieved using the hybrid BB+SW SLP method compared to its counterpart scheme without phase shifter selection.
Following the analytic complexity analysis provided in Section III, we numerically evaluate the computational complexity of the proposed hybrid SLP algorithm in both scenarios with infinite-resolution and discrete phase shifters. The complexity results, in terms of the required number of outer iterations (i.e., block cycles) for convergence, are shown in Fig. 10 as a function of the number of RF chains $N_l$. Note that the complexity of solving the inner sub-problem (16) is not considered here since, as mentioned earlier, this problem is a typical linearly-constrained quadratic program that can be solved efficiently using many existing algorithms. As might be expected, the number of outer iterations until convergence of the proposed hybrid SLP algorithm increases with $N_l$ in all cases, due to the corresponding growth in the problem size. Moreover, the computation cost increases as the precision of the phase shifters is reduced. This observation is not surprising, since having discrete possible phase states causes a discontinuity in the feasible region of the optimization problem, and consequently, more cycles are needed for convergence to a stationary point. We further compare the complexities of different hybrid symbol-level precoding techniques in Fig. 11, where the execution time of each hybrid algorithm is plotted as a function of the number of users while the numbers of transmit antennas and RF chains are fixed at $N_t = 64$ and $N_l = 8$. It should be noted that for the hybrid algorithm in [24] and the proposed scheme, the results in Fig. 11 incorporate the solutions to both the fully-digital and hybrid symbol-level precoding problems. Note further that the execution times shown in Fig. 11 have been obtained using the MATLAB timing functions, and that the fully-digital symbol-level precoding problems and the optimization problem corresponding to the hybrid scheme in [23] have been solved using the fmincon tool. It can be seen that the execution times of the hybrid BB+PS SLP algorithm and the proposed hybrid BB+SW SLP technique (with high-resolution phase shifters) grow slightly with the number of users, whereas the complexity of the hybrid BB SLP method shows a larger growth rate with the number of users.
The proposed hybrid BB+SW SLP technique requires higher execution times than the other two methods, particularly for a small number of users. Moreover, the proposed hybrid precoding scheme shows higher execution times with low-precision phase shifters, which is in accordance with the results shown in Fig. 10. Similarly, this observation is a consequence of having increased discontinuity in the solution space of the precoding problem due to discrete phase shifters.

VI. CONCLUDING REMARKS
In this paper, we proposed a hybrid analog-digital precoding scheme for large-scale multiuser mmWave downlink systems. The multiuser precoding operation is split between the digital and analog domains, where processing in the analog domain is carried out through fully-connected networks of switches and phase shifters. The use of on-off switches enables us to perform phase shifter selection in the analog precoder. We adopted a CSI-only design approach for the phase-shifting network, whereas the digital baseband precoder and the switching network are optimized in a symbol-level manner, i.e., by exploiting the instantaneous data symbols to enable constructive interference at the receiver side. We formulated our design problem to minimize the Euclidean distance between the hybrid symbol-level precoder and its optimal fully-digital counterpart, where a power-constrained max-min SINR design criterion subject to constructive interference constraints was adopted. Our design approach led to an intractable binary optimization problem. We tackled this difficulty by transforming the original problem into an equivalent continuous-domain biconvex form, which can be solved efficiently for a sub-optimal solution via the standard block coordinate descent (BCD) algorithm. We evaluated the computational complexity of the proposed scheme, and numerical results showed that the adopted BCD algorithm needs only a few (normally fewer than ten) cycles to converge. To assess and compare different fully-digital/hybrid precoding schemes from both performance and power consumption points of view, we provided an analysis of energy efficiency by considering appropriate models for the power dissipation of the RF elements. Our simulation results indicated that, by applying the phase shifter selection approach, up to half of the phase shifters can be switched off, allowing for reductions of several watts in the power consumption of the analog circuitry.
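This power consumption reduction can significantly improve the energy efficiency of precoding, as compared to the fully-digital and the state-of-the-art hybrid symbol-level techniques. Moreover, we evaluated the proposed hybrid precoding scheme with both infinite- and finite-precision phase shifters. It was shown that using phase shifters with fewer precision bits, on the one hand, degrades the spectral efficiency but, on the other hand, allows for more power savings due to a larger number of deactivated phase shifters, and is therefore more energy-efficient.

APPENDIX A
PROOF OF LEMMA 2
Given $\mathbf{u}_{\mathrm{BB}}$, $g(\mathbf{u}_{\mathrm{BB}}, \mathbf{g})$ is an $\ell_2$-norm function of $\mathbf{g}$ and is therefore continuously differentiable everywhere. Let $\mathbf{g}_1, \mathbf{g}_2 \in \mathbb{R}^{N_tN_l \times 1}$ be any two distinct inputs to the function $g(\mathbf{u}_{\mathrm{BB}}, \mathbf{g})$ such that $-\mathbf{1} \preceq \mathbf{g}_1, \mathbf{g}_2 \preceq \mathbf{1}$. Let us further denote $\boldsymbol{\Theta} \triangleq (\mathbf{u}_{\mathrm{BB}}^T \otimes \mathbf{I}_{N_t})\,\mathrm{diag}(\mathrm{vec}(\mathbf{F}))$. Then, the difference $g(\mathbf{u}_{\mathrm{BB}}, \mathbf{g}_1) - g(\mathbf{u}_{\mathrm{BB}}, \mathbf{g}_2)$ can be written in terms of $\boldsymbol{\Theta}$. According to the matrix/vector operator norm inequality, a chain of inequalities then holds, where inequalities (a) and (b) both follow from the (reverse) triangle inequality. Since $-\mathbf{1} \preceq \mathbf{g}_1, \mathbf{g}_2 \preceq \mathbf{1}$, we always have $\|\mathbf{g}_1 - \mathbf{g}_2\|_2 \le 2\sqrt{N_tN_l}$, where equality is achieved when either $\mathbf{g}_1$ or $\mathbf{g}_2$ equals $\mathbf{1}$ while the other equals $-\mathbf{1}$. Using the latter inequality and the one in (38), we obtain an upper bound from which it immediately follows that $|g(\mathbf{u}_{\mathrm{BB}}, \mathbf{g}_1) - g(\mathbf{u}_{\mathrm{BB}}, \mathbf{g}_2)| \le L\|\mathbf{g}_1 - \mathbf{g}_2\|_2$, where $L$ is a positive real constant in either case with $\boldsymbol{\Theta} = \mathbf{0}$ or $\mathbf{F}\mathbf{u}_{\mathrm{BB}} - 2\mathbf{u}_{\mathrm{FD}} = \mathbf{0}$, implying the Lipschitz continuity of the function $g(\mathbf{u}_{\mathrm{BB}}, \mathbf{g})$ on $-\mathbf{1} \preceq \mathbf{g} \preceq \mathbf{1}$.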