PAPR Reduction in MIMO-OFDM via Power Efficient Transmit Waveform Shaping

In this paper, we revisit the long-standing problem of peak-to-average power ratio (PAPR) minimization in MIMO-OFDM systems, taking a new angle on a well-known scheme. Utilizing the principles of tone reservation, we place dummy symbols, i.e., complex coefficients, on unused space-frequency resources with the aim of jointly minimizing the transmit signal PAPR and the self-power consumption of the dummy symbols. This joint minimization is solved using three proposed algorithms exhibiting varying degrees of computational complexity and PAPR reduction performance. Our proposed framework utilizes the strict PAPR expression, i.e., we take into account the average transmit power of each antenna, to simultaneously reduce the PAPR on all antennas while keeping the self-power consumption of the scheme minimal. Our simulation results show that this optimization objective provides better worst-case PAPR reduction and dummy symbol power consumption performance than the peak power minimization objective widely utilized in the tone reservation literature. Finally, we propose a novel take on a well-known block-diagonalization algorithm that exploits knowledge of the dummy symbol allocations, resulting in high-gain data streams in downlink transmission.


I. INTRODUCTION
Technologies leveraging multiple transmit and receive antennas (MIMO, multiple-input multiple-output) and orthogonal frequency division multiplexing (OFDM) have become a staple of state-of-the-art cellular systems. MIMO offers high data rates through beamforming and spatial multiplexing, along with increased diversity by better exploiting multipath propagation. Similarly, the frequency diversity of OFDM provides robustness against frequency selectivity, as it leverages high bandwidths that are divided into independent, narrow-band subcarriers [1]. This multiplexing of frequency resources also makes OFDM a very efficient multiple access scheme, which is one reason it has been adopted in the 3rd Generation Partnership Project 5G New Radio (NR) standard as a basis for both downlink (DL) and uplink (UL) communications [2]. OFDM, however, has a major drawback. Due to the constructive and destructive combining of complex-valued transmit data and beamformers in the inverse Fourier transform, the time-domain transmit waveform is known to exhibit a high dynamic range in signal power. Thus, the transmit signal suffers from a high peak-to-average power ratio (PAPR), which causes problems especially in the design of radio frequency (RF) components [1].
High PAPR is detrimental to power amplifiers (PA) through high power consumption and low drain efficiency (ratio of RF output power to direct current input power [3]). To avoid distortions in the transmit signal, such as spectral broadening, the PA has to have highly linear (i.e., low efficiency) amplification characteristics, and a back-off on the operating point is required to prevent the high signal peaks from being distorted by the PA saturation. Furthermore, high PAPR increases the power consumption of digital-to-analog converters [4]. Due to the adoption of OFDM as the basis for communications in the 5G NR standard [2], PAPR reduction is still considered a highly relevant problem, especially in scenarios where power efficiency and consumption are important factors.
To combat the effects of PAPR, a rich body of research literature has accumulated over the years, and it is still an active research topic [1]. One of the best-known methods is amplitude clipping [5]-[7], a simple method which unfortunately causes clipping noise and signal distortions. The schemes for selected mapping [8], [9] and partial transmit sequences [10], [11] create low-PAPR permutations of the transmit signal but require transmission of side information or changes in the receiver structure. Active constellation extension (ACE), see for example [12], considers a distorted constellation for data symbols to obtain a low-PAPR signal.
Tone reservation (TR) [13]-[19] is a PAPR reduction scheme where subcarriers (tones) are reserved to transmit a peak-reducing signal alongside the data signal. The scheme does not induce distortions in the data signal, but reduces spectral efficiency due to the reserved subcarriers. Many of the referenced TR techniques are single-antenna peak power minimization schemes, and do not translate well to MIMO systems as they ignore the antenna-specific average transmit power. As a result, antennas with lower average transmit power can be left with a higher signal PAPR, because the peak reduction signal only targets the highest absolute peaks.
Methods similar to tone reservation but which optimize the whole transmit signal are proposed in [20], [21]. In [20], the average power is also considered (i.e., the strict PAPR, not just the peak power) through an error vector magnitude (EVM) based linear lower bound. The authors also propose an interior-point method based algorithm, which requires computationally intensive Newton steps to be solved. The latter reference, [21], exploits the degrees of freedom of a massive MIMO system to jointly optimize precoding, modulation and peak reduction. Under a general MIMO setting, however, the scheme neglects antennas with lower average transmit power, which can still exhibit high PAPR. Furthermore, in [21], multiple system parameters (modulation, beamforming, PAPR) are optimized at once. For this reason, application of the scheme might require a significant overhaul of already existing systems.
For other PAPR reduction methods leveraging the spatial domain for signal shaping, see for example [22], [23], where the former considers a PAPR constrained power allocation strategy in single-carrier frequency division multiple access systems. The method provides good PAPR performance at high signal-to-noise ratio (SNR), but has prohibitively high computational complexity. The latter reference exploits leftover spatial modes from waterfilling for peak reduction, similar to our proposed case. However, their approach follows the peak power minimization principle, solved with both highly complex interior point methods and slow to converge descent methods.
In this article, we consider a PAPR reduction scheme that operates on the tone reservation principles. However, we utilize the term 'dummy symbols' instead of reserved tones, for two reasons. First, a reserved tone can be misunderstood to mean that a whole subcarrier is reserved, which is misleading, as some spatial streams on the subcarrier could still carry data. Second, we argue the term 'dummy symbol' better describes the scheme's operating principle: use of optimized complex coefficients (symbols) that carry no information (dummy), applied on unused resources to affect the time-domain waveform.
Contributions: We propose a PAPR reduction framework that uses the principles of tone reservation to minimize the strict PAPR of all transmit antennas jointly with the self-power consumption of the scheme, which we call 'dummy symbol power reserve' (or 'power reserved for the dummy symbols'). Our approach differs significantly from the general peak power minimization used in the TR literature, as we also optimize the antenna-wise average transmit power. Thus, instead of using the $\ell_\infty$-norm as the objective, we minimize the ratio between the $\ell_\infty$ and $\ell_2$ norms for all antennas, which improves the overall PAPR performance of a general MIMO-OFDM system by not only focusing on the absolute highest power peaks. This approach also makes the problem more difficult to solve due to its non-convexity. The prior art also does not consider the self-power consumption of the schemes explicitly, with the exception of [20], where it is proposed as an additional constraint. Implicitly, however, minimizing power peaks can be done by lowering the average transmit power, which has an effect on the self-power consumption. Our results show that considering the self-power consumption jointly with the PAPR is very beneficial in our approach.
Furthermore, unlike the prior art (see for example [21]), our approach is modular, i.e., it can operate on any given beamformers and frequency-domain symbol allocation, as long as there are some unused space-frequency resources (empty subcarriers on some streams) to exploit. The modularity of our scheme is beneficial in terms of implementation, as it can easily be applied over existing system configurations.
We derive three different algorithms with varying levels of PAPR reduction capability and computational complexity to minimize the proposed joint objective of strict PAPR and dummy symbol power reserve. We derive one baseline algorithm using successive convex approximation (SCA), solved via interior point methods, and two more computationally tractable low-complexity iterative schemes that are solved via the alternating direction method of multipliers (ADMM).
Our proposed PAPR reduction framework generalizes our previous work, see [24], [25]. As we previously applied the framework in uplink systems, in this article we consider a downlink scheme instead. However, it should be emphasized that the proposed PAPR reduction framework is specific to neither UL nor DL. We propose a novel approach to the well-known iterative block-diagonalization scheme [26], [27] that leverages knowledge about the dummy symbol placements to jointly optimize the data beamformers within a larger interference-free space. This leaves low-gain spatial modes for dummy symbol allocations, but as the dummy symbols carry no information, this does not affect the rate performance.
Finally, we propose a simple rate maximization search algorithm to better observe the additional benefits obtained by our PAPR reduction scheme. The search algorithm finds, utilizing a MIMO-modification for the well-known Hughes-Hartogs bit and power loading algorithm (HH) [28], [29], the highest possible rate that can be achieved under a fixed peak power constraint over the antennas, symbol realization, and given system parameters. This search algorithm is novel but very straightforward, and we consider the main contribution of this paper to be in the low-complexity joint PAPR and dummy symbol power reserve minimization algorithm.
In summary, our contributions are as follows:
• We propose a novel PAPR reduction framework based on tone reservation principles, which minimizes the strict PAPR of the transmit signal instead of only the highest peak power. The proposed scheme is formulated as an optimization problem to find a PAPR reducing dummy symbol allocation. Our proposed scheme also jointly minimizes the dummy symbol power reserve, i.e., the self-power consumption of the scheme. Through simulations, our proposed scheme is shown to outperform the classical approach of peak power minimization in terms of worst-case PAPR across all antennas.
• We derive three algorithms to solve the optimization problem, with varying degrees of PAPR reduction capability and computational complexity.
• We propose a novel take on the well-known iterative block-diagonalization algorithm to provide high-gain streams for data, and low-gain streams for the dummy symbol allocation.

Organization: This article can be considered in two parts. First, in Sections II-IV we derive our proposed PAPR reduction framework in a general form. We begin with the time-domain transmit signal for a generic MIMO-OFDM transmitter in Section II, including the dummy symbols, after which we define the joint PAPR and power reserve minimization problem in Section III. The algorithms to solve this problem are derived in Section IV. In the second part (Sections V-VII), we first propose a beamformer design in Section V that leverages a pre-determined dummy symbol resource allocation mask to jointly optimize the data beamformers and find the dummy symbol beamformers. Then, in Section VI we derive a simple rate maximization search algorithm that finds the highest transmit power possible (benefiting from reduced PAPR), subject to a given peak power constraint. Finally, the performance of the PAPR reduction framework is investigated through Monte Carlo simulations in Section VII.
Notation: We use general vector notation: $a$, $\mathbf{a}$ and $\mathbf{A}$ denote a scalar, a vector and a matrix, respectively. We use $\mathbb{R}$ and $\mathbb{C}$ to denote the real and complex sets. $(\cdot)^T$, $(\cdot)^H$ and $(\cdot)^*$ denote the transpose, Hermitian transpose and complex conjugate, respectively, while $\|\cdot\|_p$ denotes the $\ell_p$-norm and $\mathrm{diag}(\mathbf{a})$ denotes a diagonal matrix with the elements of $\mathbf{a}$ on the diagonal. $\mathrm{Re}[\cdot]$ denotes the real part, and $(\cdot)^+ = \max(\cdot, 0)$. Finally, we denote iteration indices with bracketed superscripts.

II. TRANSMIT SIGNAL MODEL
We consider a generic MIMO-OFDM system where the transmitter is equipped with $N_T$ antennas and transmits data (possibly to multiple users) on $L$ parallel spatial streams. We assume that a total of $N_C$ subcarriers are available. Let $p_{c,l} \geq 0$ and $\mathbf{v}_{c,l} \in \mathbb{C}^{N_T \times 1}$ denote the power and direction of the transmit precoder associated with the $l$-th data stream on subcarrier $c$, and let $d_{c,l}$ denote the corresponding data symbol. Then, the transmitted frequency-domain signal vector on the $c$-th subcarrier can be expressed as

$$\mathbf{x}_c = \sum_{l=1}^{L} \sqrt{p_{c,l}}\, \mathbf{v}_{c,l}\, d_{c,l}. \tag{1}$$

To define the transmit waveform PAPR of the OFDM system, we need to find the time-domain representation of the transmitted signal (1). To do this, let us first compactly denote the frequency-domain signal vectors $\{\mathbf{x}_c\}_{c=1,\ldots,N_C}$ as $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_{N_C}]^T \in \mathbb{C}^{N_C \times N_T}$, and equivalently express (1) as

$$\mathbf{X} = \sum_{l=1}^{L} \mathbf{P}_l^{1/2} \mathbf{D}_l \tilde{\mathbf{V}}_l, \tag{2}$$

where we use the notation $\tilde{\mathbf{V}}_l \in \mathbb{C}^{N_C \times N_T}$ to denote a matrix obtained by stacking the stream-wise transmit precoders associated with the $l$-th data stream over the $N_C$ subcarriers, i.e., $\tilde{\mathbf{V}}_l = [\mathbf{v}_{1,l}, \mathbf{v}_{2,l}, \ldots, \mathbf{v}_{N_C,l}]^T$. Similarly, $\mathbf{D}_l = \mathrm{diag}(d_{1,l}, d_{2,l}, \ldots, d_{N_C,l})$ and $\mathbf{P}_l = \mathrm{diag}(p_{1,l}, p_{2,l}, \ldots, p_{N_C,l})$.

Then, by defining the discrete Fourier transform (DFT) matrix with $\alpha$-times oversampling as $\mathbf{F} \in \mathbb{C}^{N_C \times \alpha N_C}$ [5], the time-domain representation of the considered MIMO-OFDM system can be expressed as

$$\mathbf{S}^{\text{data}} = \mathbf{F}^H \mathbf{X} = \mathbf{F}^H \sum_{l=1}^{L} \mathbf{P}_l^{1/2} \mathbf{D}_l \tilde{\mathbf{V}}_l. \tag{3}$$

Note that the size of the matrix $\mathbf{S}^{\text{data}}$ is $\alpha N_C \times N_T$, and hence each column of $\mathbf{S}^{\text{data}}$ represents the samples of the time-domain transmit waveform from a single antenna. In order to minimize the PAPR of the transmitted waveform, we shape the transmit waveform (3) by exploiting the unused space-frequency resources (i.e., any stream $l$ on subcarrier $c$ with no scheduled data). That is, dummy symbols (complex coefficients) are placed on empty subcarriers or non-interfering spatial modes. This transmit waveform shaping mechanism can be achieved by updating expression (3) as

$$\mathbf{S} = \mathbf{F}^H \sum_{l=1}^{L} \left( \mathbf{P}_l^{1/2} \mathbf{D}_l + \mathrm{diag}(m_{1,l}, \ldots, m_{N_C,l}) \right) \tilde{\mathbf{V}}_l, \tag{4}$$

where $m_{c,l} \in \mathbb{C}$ is a dummy symbol that is placed on the $c$-th subcarrier of data stream $l$. Note that the dummy symbols $\{m_{c,l}\}$ are applied in the frequency domain, precoded with the matrices $\{\tilde{\mathbf{V}}_l\}$, and then transformed to the time-domain representation using the DFT matrix $\mathbf{F}$ in expression (4) (i.e., by following the same steps as in the case of data symbols).

For a compact representation of expression (4), let $\mathbf{m}_l = [m_{1,l}, \ldots, m_{N_C,l}]^T$ for all $l = 1, \ldots, L$, and denote the $n$-th column of $\tilde{\mathbf{V}}_l$ by $\tilde{\mathbf{v}}_{l,n}$. Then the $n$-th column (i.e., the antenna $n$ waveform) of $\mathbf{S}$ can be equivalently expressed as

$$\mathbf{s}_n = \mathbf{s}_n^{\text{data}} + \mathbf{F}^H \sum_{l=1}^{L} \mathrm{diag}(\tilde{\mathbf{v}}_{l,n})\, \mathbf{m}_l, \tag{5}$$

where $\mathbf{s}_n^{\text{data}}$ is the $n$-th column of $\mathbf{S}^{\text{data}}$. We assume that the data-carrying resource indices are known, and as a corollary the indices $(c, l)$ for which $p_{c,l} = 0$ are also known. This lets us obtain a compact reformulation of (5) where we consider only the indices of the resources on which it is possible to place non-zero dummy symbols. To do this, let us use $\tilde{\mathbf{m}} \in \mathbb{C}^{N_D \times 1}$ to denote a vector obtained by stacking $\{m_{c,l}\}$, among all $c = 1, \ldots, N_C$ and $l = 1, \ldots, L$, for which $p_{c,l} = 0$, i.e., a vector of $N_D$ possible dummy symbols.

Then, expression (5) can be compactly expressed as

$$\mathbf{s}_n = \mathbf{s}_n^{\text{data}} + \mathbf{A}_n \tilde{\mathbf{m}}, \tag{6}$$

where the matrix $\mathbf{A}_n \in \mathbb{C}^{\alpha N_C \times N_D}$ is obtained by concatenating the columns of the matrix $\mathbf{F}^H [\mathrm{diag}(\tilde{\mathbf{v}}_{1,n}), \ldots, \mathrm{diag}(\tilde{\mathbf{v}}_{L,n})]$ that correspond to the indices $(c, l)$ for which $p_{c,l} = 0$. Note that the index pair $(c, l)$ selects the $c$-th column of the matrix $\mathbf{F}^H \mathrm{diag}(\tilde{\mathbf{v}}_{l,n})$.
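As a numerical sanity check of the compact form (6), the following NumPy sketch builds the per-antenna matrix $\mathbf{A}_n$ and confirms that $\mathbf{s}_n^{\text{data}} + \mathbf{A}_n \tilde{\mathbf{m}}$ matches the waveform computed directly from the frequency-domain signal. It is only an illustration under assumptions the text does not fix: $\mathbf{F}^H$ is taken as an IDFT matrix scaled by $1/\sqrt{\alpha N_C}$ with the subcarriers mapped to the lowest frequency bins, the dummy symbols are stacked stream by stream, and all function names are hypothetical.

```python
import numpy as np

def oversampled_idft(n_c, alpha):
    """Assumed F^H: (alpha*n_c) x n_c IDFT matrix with 1/sqrt(alpha*n_c) scaling."""
    n = alpha * n_c
    t = np.arange(n)[:, None]
    c = np.arange(n_c)[None, :]
    return np.exp(2j * np.pi * t * c / n) / np.sqrt(n)

def transmit_waveform(FH, V, p, d, m):
    """Time-domain signal (alpha*N_C x N_T); V[l, c, :] holds v_{c,l},
    while p, d, m are N_C x L arrays of powers, data and dummy symbols."""
    coeff = np.sqrt(p) * d + m                 # per-resource coefficient
    X = np.einsum("cl,lcn->cn", coeff, V)      # frequency-domain signal, N_C x N_T
    return FH @ X

def build_dummy_matrix(FH, V, mask, n):
    """A_n: columns of F^H diag(v_{l,n}) on resources where mask[c, l] is True."""
    cols = [FH[:, c] * V[l, c, n]
            for l in range(V.shape[0])
            for c in range(V.shape[1]) if mask[c, l]]
    return np.stack(cols, axis=1)

def scatter_dummy(m_tilde, mask):
    """Place the stacked dummy symbols back on the N_C x L grid (stream-major)."""
    m = np.zeros(mask.shape, dtype=complex)
    k = 0
    for l in range(mask.shape[1]):
        for c in range(mask.shape[0]):
            if mask[c, l]:
                m[c, l] = m_tilde[k]
                k += 1
    return m

rng = np.random.default_rng(0)
n_c, n_t, n_l, alpha = 8, 3, 2, 4
FH = oversampled_idft(n_c, alpha)
V = rng.standard_normal((n_l, n_c, n_t)) + 1j * rng.standard_normal((n_l, n_c, n_t))
p = np.ones((n_c, n_l))
p[5:, 1] = 0.0                                 # second stream is empty on three subcarriers
mask = p == 0
d = np.exp(2j * np.pi * rng.random((n_c, n_l)))  # unit-modulus data symbols
n_d = int(mask.sum())
m_tilde = rng.standard_normal(n_d) + 1j * rng.standard_normal(n_d)

S_data = transmit_waveform(FH, V, p, d, np.zeros((n_c, n_l)))
S = transmit_waveform(FH, V, p, d, scatter_dummy(m_tilde, mask))
max_err = max(
    np.max(np.abs(S[:, n] - (S_data[:, n] + build_dummy_matrix(FH, V, mask, n) @ m_tilde)))
    for n in range(n_t)
)
```

Here `max_err` is the largest deviation between the direct waveform and the compact form over all antennas; it should be at the level of floating-point round-off.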

III. PROBLEM FORMULATION
Our objective is to minimize the PAPR of an existing time-domain transmit signal in a power-efficient manner. This is done by minimizing the ratio between the transmit signal power peak and the average transmit power, given by $\|\mathbf{s}_n\|_\infty^2$ and $(1/\alpha N_C)\|\mathbf{s}_n\|_2^2$, respectively, for each antenna $n$. The PAPR minimization is achieved by exploiting unused subcarriers to transmit dummy symbols, which affect the final transmit waveform $\mathbf{s}_n$, $\forall n$, but which also require additional power. Therefore, in addition to finding dummy symbols that reduce the PAPR of the transmitted waveforms, we also aim to jointly minimize the power reserved for these symbols.

With expression (6), the crest factor (cf, i.e., the linear domain equivalent of PAPR) of the signal transmitted by the $n$-th antenna element can be expressed as

$$p_{\text{cf}}(\mathbf{s}_n) = \frac{\|\mathbf{s}_n\|_\infty^2}{(1/\alpha N_C)\|\mathbf{s}_n\|_2^2}, \quad \mathbf{s}_n = \mathbf{s}_n^{\text{data}} + \mathbf{A}_n \tilde{\mathbf{m}}. \tag{7}$$

The allocation of dummy symbols requires increased transmit power, similar to the data symbols. We call this additional power increase the dummy symbol power reserve, defined as a factor by which the already allocated data power is increased to enable the dummy symbols. To this end, let $p_{\text{data}}$ denote the total allocated data power (see expression (1)), i.e.,

$$p_{\text{data}} = \sum_{c=1}^{N_C} \sum_{l=1}^{L} p_{c,l}. \tag{8}$$

Then, the factor of increase is given by

$$p_{\text{res}}(\tilde{\mathbf{m}}) = \frac{p_{\text{data}} + \|\tilde{\mathbf{m}}\|_2^2}{p_{\text{data}}}, \tag{9}$$

where $\|\tilde{\mathbf{m}}\|_2^2$ is the total extra transmission power due to the placement of dummy symbols.

Using (7) and (9), the joint PAPR and power reserve minimization can be cast as the following optimization problem:

$$\text{minimize} \quad \log\left(\max_n\, p_{\text{cf}}(\mathbf{s}_n)\right) + \log\left(p_{\text{res}}(\tilde{\mathbf{m}})\right) \tag{10a}$$
$$\text{subject to} \quad \mathbf{s}_n = \mathbf{s}_n^{\text{data}} + \mathbf{A}_n \tilde{\mathbf{m}}, \quad \forall n, \tag{10b}$$

with variables $\{\tilde{\mathbf{m}}, \mathbf{s}_{n=1,\ldots,N_T}\}$. We use the max-operator in the objective to consider the worst-case PAPR among all antennas. We also utilize the log-transformation in the objective because power amplifier input-output characteristics are usually analyzed in the logarithmic domain, where the PAPR can be straightforwardly added to the input power. Similarly, the factor of increase in the power allocation, caused by the dummy symbols, can be added to the input power, provided that the power reserve is translated into the log-domain. This power summation objective is illustrated in Figure 1 for ideal amplifiers. In practice, the amplification curve is not linear (i.e., signals with low mean power but high PAPR get distorted even in the linear region), and the saturation happens gradually.
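The quantities entering the objective (10a) are simple to evaluate. The sketch below is a minimal illustration with hypothetical function names, assuming the crest factor is the peak power over the mean power per antenna and the power reserve is the ratio of total to data power; the toy signals are made up for the example. It shows why the strict PAPR differs from the peak power: the antenna with the highest absolute peak is not the one with the worst crest factor.

```python
import numpy as np

def crest_factor(s):
    """Peak power over average power of one antenna waveform (linear scale)."""
    power = np.abs(s) ** 2
    return power.max() / power.mean()

def power_reserve(m_tilde, p_data):
    """Factor by which the data power grows once dummy symbols are added."""
    return (p_data + np.linalg.norm(m_tilde) ** 2) / p_data

def joint_objective(S, m_tilde, p_data):
    """log(worst-case crest factor) + log(power reserve), natural logarithm."""
    worst_cf = max(crest_factor(S[:, n]) for n in range(S.shape[1]))
    return np.log(worst_cf) + np.log(power_reserve(m_tilde, p_data))

# Toy example: antenna 0 is strong and flat, antenna 1 is weak but peaky.
S_toy = np.column_stack([
    np.ones(4, dtype=complex),                        # peak power 1.0, crest factor 1
    0.1 * np.array([1, 1, 1, 3], dtype=complex),      # peak power 0.09, crest factor 3
])
peak_powers = np.max(np.abs(S_toy) ** 2, axis=0)
crest_factors = np.array([crest_factor(S_toy[:, n]) for n in range(2)])
objective = joint_objective(S_toy, np.array([1.0, 1.0j]), p_data=2.0)
```

In this toy case a pure peak power criterion would focus on antenna 0, whereas the worst-case crest factor is set by antenna 1.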

IV. ALGORITHM DERIVATION
By substituting the expressions (7) and (9) for $p_{\text{cf}}(\mathbf{s}_n)$ and $p_{\text{res}}(\tilde{\mathbf{m}})$, problem (10) can be written equivalently as

$$\text{minimize} \quad \log(t) + \log(r) \tag{11a}$$
$$\text{subject to} \quad \frac{\|\mathbf{s}_n\|_\infty^2}{(1/\alpha N_C)\|\mathbf{s}_n\|_2^2} \leq t, \quad \forall n, \tag{11b}$$
$$\frac{p_{\text{data}} + \|\tilde{\mathbf{m}}\|_2^2}{p_{\text{data}}} \leq r, \tag{11c}$$
$$\mathbf{s}_n = \mathbf{s}_n^{\text{data}} + \mathbf{A}_n \tilde{\mathbf{m}}, \quad \forall n, \tag{11d}$$

with variables $\{\tilde{\mathbf{m}}, t, r, \mathbf{s}_{n=1,\ldots,N_T}\}$. Problem (11) is non-convex due to the concave objective (11a) and the convex-over-convex PAPR constraint (11b). In this section, we propose and derive three different algorithms, with varying PAPR reduction performance and computational complexity, to find (sub-optimal) solutions for (11). The first algorithm utilizes first-order Taylor series approximations for the non-convex expressions of (11), and solves a linearized problem iteratively using successive convex approximation. The second algorithm ignores average antenna transmit powers, and replaces the logarithmic relationship in (10a) with a trade-off coefficient. This solution structure corresponds to many tone reservation algorithms in the literature [13]-[19], [21], but our approach also includes the dummy symbol power reserve function in the objective, and the solution is obtained via ADMM. The last proposed algorithm combines the two approaches to find an iterative heuristic solution.

[Figure 1: Power summation objective for ideal amplifiers. Labels: average antenna input power (data), $10\log_{10}(p_{\text{res}}(\tilde{\mathbf{m}}))$, $10\log_{10}(p_{\text{cf}}(\mathbf{s}_n))$; axis: time-domain input signal power.]
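To handle the non-convexities, let us first introduce slack variables $w_n$, $q_n$, $\forall n$, and extra (relaxed) constraints $\|\mathbf{s}_n\|_\infty \leq w_n$, $\forall n$, and $q_n \leq \|\mathbf{s}_n\|_2^2$, $\forall n$. This allows us to reformulate constraint (11b) as quadratic over linear, resulting in the following equivalent problem:

$$\text{minimize} \quad \log(t) + \log(r) \tag{12a}$$
$$\text{subject to} \quad q_n \leq \|\mathbf{s}_n\|_2^2, \quad \forall n, \tag{12b}$$
$$\alpha N_C\, w_n^2 / q_n \leq t, \quad \|\mathbf{s}_n\|_\infty \leq w_n, \quad \forall n,$$
$$(p_{\text{data}} + \|\tilde{\mathbf{m}}\|_2^2)/p_{\text{data}} \leq r,$$
$$\mathbf{s}_n = \mathbf{s}_n^{\text{data}} + \mathbf{A}_n \tilde{\mathbf{m}}, \quad \forall n,$$

with variables $\{\tilde{\mathbf{m}}, t, r\}$ and $\{w_n, q_n, \mathbf{s}_n\}_{n=1,\ldots,N_T}$. Problem (12) is still non-convex due to the concave objective and constraints (12b), which are non-convex due to the convex R.H.S. terms. To obtain a problem which has a tractable solution, we need to iteratively approximate these non-convex parts using first-order Taylor series approximation.
The first-order Taylor series approximations on the objective (12a) and constraints (12b) around a fixed local point $\{\hat{t}, \hat{r}, \hat{\mathbf{s}}_{n=1,\ldots,N_T}\}$ are given by

$$\log(t) + \log(r) \leq \log(\hat{t}) + \log(\hat{r}) + \frac{t - \hat{t}}{\hat{t}} + \frac{r - \hat{r}}{\hat{r}}, \tag{13}$$
$$\|\mathbf{s}_n\|_2^2 \geq 2\,\mathrm{Re}[\hat{\mathbf{s}}_n^H \mathbf{s}_n] - \|\hat{\mathbf{s}}_n\|_2^2. \tag{14}$$

Replacing (12a) and the R.H.S. of (12b) with the above approximations, and dropping the constant terms of (13), we can write a convex approximation of problem (11) as

$$\text{minimize} \quad t/\hat{t} + r/\hat{r} \tag{15a}$$
$$\text{subject to} \quad q_n \leq 2\,\mathrm{Re}[\hat{\mathbf{s}}_n^H \mathbf{s}_n] - \|\hat{\mathbf{s}}_n\|_2^2, \quad \forall n,$$
$$\alpha N_C\, w_n^2 / q_n \leq t, \quad \|\mathbf{s}_n\|_\infty \leq w_n, \quad \forall n,$$
$$(p_{\text{data}} + \|\tilde{\mathbf{m}}\|_2^2)/p_{\text{data}} \leq r,$$
$$\mathbf{s}_n = \mathbf{s}_n^{\text{data}} + \mathbf{A}_n \tilde{\mathbf{m}}, \quad \forall n,$$

with variables $\{\tilde{\mathbf{m}}, t, r\}$ and $\{w_n, q_n, \mathbf{s}_n\}_{n=1,\ldots,N_T}$. Problem (15) is convex and can be solved via interior point methods. The solution is then used to update the local point $\{\hat{t}, \hat{r}, \hat{\mathbf{s}}_{n=1,\ldots,N_T}\}$, after which a new solution that is a closer approximation to a solution of (11) can be found. This is repeated until the difference between objective values from one iteration to the next falls under some threshold $\epsilon_{\text{SCA}} > 0$, or the number of iterations reaches a given maximum $j_{\text{SCA}}$. The objective value of (15) can be guaranteed to decrease between SCA iterations due to the linear upper bound approximation [31]. However, because the original problem is non-convex, convergence to the globally optimal solution cannot be guaranteed. The joint PAPR and dummy symbol power reserve minimization algorithm is summarized in Algorithm 1.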
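The two linearizations used by the SCA step are global bounds: the tangent of the concave logarithm lies above it, and the tangent of the convex squared norm lies below it, with a gap of exactly $\|\mathbf{s} - \hat{\mathbf{s}}\|_2^2$. The short NumPy check below illustrates this with random points; the function names are hypothetical and the snippet is not part of the algorithm itself.

```python
import numpy as np

def log_upper_bound(t, t_hat):
    """First-order Taylor expansion of log(t) around t_hat (an upper bound)."""
    return np.log(t_hat) + (t - t_hat) / t_hat

def sq_norm_lower_bound(s, s_hat):
    """First-order expansion of ||s||^2 around s_hat: 2 Re[s_hat^H s] - ||s_hat||^2."""
    return 2.0 * np.real(np.vdot(s_hat, s)) - np.linalg.norm(s_hat) ** 2

rng = np.random.default_rng(0)
t_values = rng.uniform(0.1, 10.0, size=100)
t_hat = 2.5
log_gap = log_upper_bound(t_values, t_hat) - np.log(t_values)     # always >= 0

s = rng.standard_normal(16) + 1j * rng.standard_normal(16)
s_hat = rng.standard_normal(16) + 1j * rng.standard_normal(16)
norm_gap = np.linalg.norm(s) ** 2 - sq_norm_lower_bound(s, s_hat)  # equals ||s - s_hat||^2
```

Because both bounds are tight at the local point, any point feasible for the linearized problem is also feasible for the relaxed original, which is what makes the SCA objective non-increasing.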

B. LOW-COMPLEXITY PEAK POWER MINIMIZATION VIA ADMM
Finding a solution to problem (11) using SCA and interior point methods is computationally taxing and thus not practical. Therefore, finding a low-complexity algorithm with comparable PAPR reduction performance is necessary.

[Algorithm 1: Joint PAPR and dummy symbol power reserve minimization algorithm; each SCA iteration solves problem (15).]
Another approach to tackling the non-convexity of problem (11) is to only consider the numerator of the non-convex PAPR constraint (11b), i.e., peak power minimization. This principle of tackling PAPR minimization has been considered in the literature as it is much simpler than utilizing the strict PAPR expression. Here, we derive the peak power minimization algorithm that also accounts for the dummy symbol power reserve, as a multi-objective optimization problem that can be solved using ADMM [32].
Peak power minimization ignores the relationship between the time-domain signal peak and the antenna average transmit power. It also ignores the logarithmic relationship between the dummy symbol power reserve $p_{\text{res}}(\tilde{\mathbf{m}})$ and the crest factor $p_{\text{cf}}(\mathbf{s}_n)$, $\forall n$. Therefore, we can consider the objective (10a) without the logarithms, but we require a trade-off coefficient $\delta$ to find a good balance between the two functions. Furthermore, as the average transmit powers of the antennas are ignored, the antenna-wise quantities can be stacked over all antennas into the vectors $\bar{\mathbf{s}}$ and $\bar{\mathbf{s}}^{\text{data}}$ and the matrix $\bar{\mathbf{A}}$. With this notation, the peak power minimization problem is formulated as problem (16), whose equality constraint (16b) reads $\bar{\mathbf{s}} = \bar{\mathbf{s}}^{\text{data}} + \bar{\mathbf{A}}\tilde{\mathbf{m}}$.

We use ADMM to solve this problem. We start by writing the complex augmented Lagrangian (17) [33], where $\rho > 0$ is the ADMM penalty parameter and $\boldsymbol{\omega} \in \mathbb{C}^{\alpha N_C N_T \times 1}$ is the vector of dual variables corresponding to constraint (16b). The ADMM steps [32, Ch. 3] are the update of $\bar{\mathbf{s}}$ in step (18), the update of $\tilde{\mathbf{m}}$ in step (19), and the update of the dual variables $\boldsymbol{\omega}$ in step (20).

Following the approach in [21, Sec. IV], step (18) can be written as an unconstrained scalar optimization problem and easily solved via a one-dimensional search, for example, the well-known golden section search. To see this, we denote the constant vector $\mathbf{z}^{(k)} = \bar{\mathbf{s}}^{\text{data}} + \bar{\mathbf{A}}\tilde{\mathbf{m}}^{(k)} - (1/\rho)\boldsymbol{\omega}^{(k)}$, and rewrite step (18) as the chain of equalities (21). The second equality of (21) follows from the definition of the $\ell_2$-norm. The third equality is obtained by matching the element-wise phases of the vectors $\bar{\mathbf{s}}$ and $\mathbf{z}^{(k)}$ (the phases of $\bar{\mathbf{s}}$ do not affect the $\ell_\infty$-norm and can be freely chosen). After matching the phases, we only need to optimize and match the element-wise magnitudes of $\bar{\mathbf{s}}$. These can be found by solving the scalar optimization problem (22) for the clipping level $\bar{\gamma}$, and then setting $|\bar{s}_i| = \min(|z_i^{(k)}|, \bar{\gamma})$, $\forall i$, i.e., we apply element-wise clipping. To see this, i.e., to get from (21) to (22), we first denote $\max_i |\bar{s}_i| = \bar{\gamma}$. Then, for all $i$, the magnitude $|\bar{s}_i|$ can be set so that it matches $|z_i^{(k)}|$ perfectly if $|z_i^{(k)}| \leq \bar{\gamma}$, and leaves the remaining residual otherwise. Another way to write this residual is $\max\{0, |z_i^{(k)}| - \bar{\gamma}\}$.

Next, we move on to find a solution for step (19). The step has a closed-form solution, as it is a minimization of an unconstrained convex function, i.e., the zero-gradient condition gives the minimum. Taking the gradient of the augmented Lagrangian with respect to $\tilde{\mathbf{m}}$ and setting it to zero yields the closed-form update (23). By iteratively solving for the variables $\bar{\mathbf{s}}$, $\tilde{\mathbf{m}}$ and $\boldsymbol{\omega}$, we can find a solution for problem (16). The iterations are stopped once a convergence criterion is met, or a maximum number of iterations $k_{\text{peak}}$ is reached.
To measure convergence, we use the residuals of the primal and dual feasibility conditions of (16) [32, Ch. 3]. As the ADMM iterations proceed, these residuals converge to zero. To set a practical stopping criterion, we stop iterating once the residuals fall below a tolerance threshold $\epsilon_{\text{peak}}$. The joint peak power and dummy symbol power reserve minimization algorithm is summarized in Algorithm 2.
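The scalar search and clipping used for the update of the stacked time-domain vector can be sketched as follows. The snippet assumes the subproblem has the form $\min_{\mathbf{s}} \|\mathbf{s}\|_\infty^2 + (\rho/2)\|\mathbf{s} - \mathbf{z}\|_2^2$; the exact scaling of the penalty term is an assumption, and the function names are hypothetical. After phase matching, the cost depends only on the clipping level, is convex in it, and is minimized here with a golden section search.

```python
import numpy as np

def peak_clip_cost(gamma, mag, rho):
    """Scalar cost after phase matching: gamma^2 + (rho/2) * sum of squared residuals."""
    return gamma ** 2 + 0.5 * rho * np.sum(np.maximum(mag - gamma, 0.0) ** 2)

def peak_clip_update(z, rho, n_iter=100):
    """Assumed s-update: argmin_s ||s||_inf^2 + (rho/2) ||s - z||_2^2."""
    mag = np.abs(z)
    lo, hi = 0.0, float(mag.max())
    inv_phi = (np.sqrt(5.0) - 1.0) / 2.0
    a = hi - inv_phi * (hi - lo)
    b = lo + inv_phi * (hi - lo)
    for _ in range(n_iter):                    # golden section search on [lo, hi]
        if peak_clip_cost(a, mag, rho) < peak_clip_cost(b, mag, rho):
            hi, b = b, a
            a = hi - inv_phi * (hi - lo)
        else:
            lo, a = a, b
            b = lo + inv_phi * (hi - lo)
    gamma = 0.5 * (lo + hi)
    # Keep the phases of z and clip the magnitudes at gamma.
    s = np.minimum(mag, gamma) * np.exp(1j * np.angle(z))
    return s, gamma

rng = np.random.default_rng(0)
z_demo = rng.standard_normal(64) + 1j * rng.standard_normal(64)
rho_demo = 0.5
s_demo, gamma_demo = peak_clip_update(z_demo, rho_demo)
```

A fixed number of search iterations is used instead of a tolerance so that the loop length is predictable; in a full ADMM loop, the previous clipping level could also be used to narrow the search interval.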

C. ITERATIVE RELATIVE PEAK POWER MINIMIZATION VIA ADMM
While the peak power minimization framework of Section IV-B provides a low-complexity solution to reduce the peaks of the transmit waveform, its PAPR reduction performance is worse than that of the SCA approach. This is because the peak power minimization algorithm ignores the average antenna-specific power. Thus, while the peaks of the transmit signal are reduced, the antennas with low mean power can still have a substantially high PAPR. In the following, we derive a low-complexity heuristic approach that also accounts for the average power. Similar to the peak power minimization, we manipulate the denominator of the non-convex PAPR constraint (11b). Instead of ignoring it completely, we fix the denominator to a constant that is updated between SCA iterations. Starting from problem (11), we denote the denominator of (11b) as $(1/\alpha N_C)\|\mathbf{s}_n\|_2^2 = (1/\alpha N_C)\|\mathbf{s}_n^{\text{data}} + \mathbf{A}_n\tilde{\mathbf{m}}\|_2^2 = \hat{\beta}_n$, $\forall n$, and, by applying the first-order Taylor series approximation on the objective, see (13), we can write the relative peak power minimization problem (26), whose objective contains the worst-case (over $n$) peak power relative to the fixed $\hat{\beta}_n$, subject to $\mathbf{s}_n = \mathbf{s}_n^{\text{data}} + \mathbf{A}_n\tilde{\mathbf{m}}$, $\forall n$, with variables $\{\tilde{\mathbf{m}}, \mathbf{s}_{n=1,\ldots,N_T}\}$. Note that we also rolled back the epigraph formulation of (11) to reduce the number of optimization variables. Next, we apply a change of variables that absorbs the fixed $\hat{\beta}_n$ into the time-domain variables, resulting in problem (27) with scaled variables $\tilde{\mathbf{s}}_n$ and equality constraints (27b).

The complex augmented Lagrangian of (27), denoted $L_\sigma(\tilde{\mathbf{m}}, \{\tilde{\mathbf{s}}_n, \boldsymbol{\nu}_n\}_{n=1,\ldots,N_T})$, is given by (28) [33], where $\sigma > 0$ is the ADMM penalty parameter, and $\boldsymbol{\nu}_n \in \mathbb{C}^{\alpha N_C \times 1}$, $\forall n$, are the dual variable vectors corresponding to constraints (27b). Similar to Section IV-B, the ADMM iterations consist of the update of $\{\tilde{\mathbf{s}}_n\}$ in step (29), the update of $\tilde{\mathbf{m}}$ in step (30), and the dual variable updates. ADMM steps (29) and (30) can be simplified in a similar manner to Section IV-B. Starting with (29), let us denote the constant vectors $\tilde{\mathbf{z}}_n$, $\forall n$. We also note that $\max_n \|\tilde{\mathbf{s}}_n\|_\infty^2 = \max_n \max_i |\tilde{s}_{i,n}|^2 = \max_{i,n} |\tilde{s}_{i,n}|^2$.
Due to the max-operator over all $n$, the vectors $\tilde{\mathbf{s}}_n$, $\forall n$, have to be jointly optimized. To this end, ADMM step (29) is rewritten as a single problem over all antennas, which can be solved via scalar optimization and clipping, following the same chain of arguments as in (21).
Step (30) is an unconstrained minimization of a convex function, and has a closed-form solution obtained by setting the gradient to zero. After introducing shorthand notation for the scaled antenna-wise terms, taking the gradient of the Lagrangian (28) with respect to $\tilde{\mathbf{m}}$ and setting it to zero yields the update of $\tilde{\mathbf{m}}$. By iteratively solving for the variables $\tilde{\mathbf{m}}$, $\{\tilde{\mathbf{s}}_n, \boldsymbol{\nu}_n\}_{n=1,\ldots,N_T}$, we can find a solution for problem (26). The iterations are stopped once a convergence criterion is met, or a maximum number of iterations $k_{\text{rel-peak}}$ is reached.
Similar to Algorithm 2, we use the residuals of the primal and dual feasibility conditions, given by (34) and (35) [32, Ch. 3], to measure the convergence. We assume the algorithm has sufficiently converged once both residuals fall below a set tolerance threshold $\epsilon_{\text{rel-peak}}$. However, unlike Algorithm 2, we also require an outer loop for the SCA updates, i.e., the ADMM solution has to be found for each SCA step. By using the solution of the previous SCA step to initialize the subsequent ADMM call, the number of ADMM iterations is significantly reduced. The joint relative peak power and dummy symbol power reserve minimization algorithm is presented in Algorithm 3.

[Algorithm 3: Joint relative peak power and dummy symbol power reserve minimization algorithm. Input: $\{\mathbf{s}_n^{\text{data}}, \mathbf{A}_n\}_{n=1,\ldots,N_T}$, $p_{\text{data}}$, $\epsilon_{\text{SCA}}$, $\epsilon_{\text{rel-peak}}$, $j_{\text{SCA}}$, $k_{\text{rel-peak}}$, $\sigma$. Each SCA step updates $\hat{\beta}_n = (1/\alpha N_C)\|\mathbf{s}_n^{\text{data}} + \mathbf{A}_n\tilde{\mathbf{m}}\|_2^2$, $\forall n$.]

As low-complexity alternatives, Algorithms 2 and 3 require solving the problems (16) and (26), respectively, which is done with ADMM instead of interior point methods. This solution structure requires iteratively solving the ADMM steps, multiple times for Algorithm 3 due to the SCA structure. We can consider the computational complexity of solving the ADMM steps used in Algorithm 2, i.e., (21), (23), and (20), and similar considerations hold for Algorithm 3.
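The role of the outer SCA update can be illustrated with a few lines of NumPy. The sketch below uses hypothetical function names and random stand-in data: it recomputes the per-antenna average powers for a given dummy symbol vector and evaluates the worst-case peak power relative to them. When the average powers are evaluated at the current dummy symbols, the relative peak power coincides with the worst-case crest factor, which is why refreshing them between ADMM calls steers the heuristic towards the strict PAPR.

```python
import numpy as np

def update_beta(S_data, A_list, m_tilde):
    """Per-antenna average power of s_n = s_n^data + A_n m_tilde."""
    return np.array([
        np.mean(np.abs(S_data[:, n] + A @ m_tilde) ** 2)
        for n, A in enumerate(A_list)
    ])

def relative_peak(S, beta):
    """Worst-case (over antennas) peak power relative to the fixed beta_n."""
    return np.max(np.max(np.abs(S) ** 2, axis=0) / beta)

rng = np.random.default_rng(0)
n_samples, n_t, n_d = 32, 2, 4
S_data = rng.standard_normal((n_samples, n_t)) + 1j * rng.standard_normal((n_samples, n_t))
A_list = [rng.standard_normal((n_samples, n_d)) + 1j * rng.standard_normal((n_samples, n_d))
          for _ in range(n_t)]
m_tilde = 0.1 * (rng.standard_normal(n_d) + 1j * rng.standard_normal(n_d))

S = np.column_stack([S_data[:, n] + A_list[n] @ m_tilde for n in range(n_t)])
beta_new = update_beta(S_data, A_list, m_tilde)          # refreshed at the current point
beta_old = update_beta(S_data, A_list, np.zeros(n_d))    # stale value from the data-only signal
worst_cf = max(np.max(np.abs(S[:, n]) ** 2) / np.mean(np.abs(S[:, n]) ** 2) for n in range(n_t))
```

With `beta_old`, the relative peak is only an approximation of the crest factor, and the mismatch shrinks as consecutive SCA points get closer to each other.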
Solving step (21) is a simple one-dimensional search followed by a clipping operation, and step (20) is just a subgradient update. Most of the complexity is in step (23), which requires the inversion of a Hermitian matrix. For the inversion, we can distinguish two cases: when the dummy symbols are assigned on a single stream, or on multiple streams. For the single stream allocation, the non-diagonal term $\bar{\mathbf{A}}^H\bar{\mathbf{A}}$ simplifies to a diagonal matrix, and thus, the inversion is trivial. Then, the most computationally intensive calculations are the evaluation of the matrix-vector product $\bar{\mathbf{A}}\tilde{\mathbf{m}}$ and the part of (23) following the matrix inverse. These products can be shown to be Fourier transforms, with computational complexity in the order of $O(\alpha N_C \log(\alpha N_C))$ if the fast Fourier transform is used. If the dummy symbols are allocated on multiple streams, the non-diagonal term in the inversion can be shown to be a (Hermitian) striped matrix. If we denote the number of allocated dummy symbol streams on subcarrier $c$ by $L_c^{\text{dummy}}$, then the number of non-zero elements on each row of the striped matrix is less than or equal to $L^{\text{dummy-max}} = \max_c (L_c^{\text{dummy}})$. Thus, the inversion can be undertaken by solving a sparse system of equations. We can also reorder the rows and columns of the striped matrix to obtain a block-diagonal matrix with $N_C$ blocks of size at most $L^{\text{dummy-max}} \times L^{\text{dummy-max}}$, which can be inverted separately. The same arguments also hold for Algorithm 3.
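Both single-stream claims can be checked numerically. The following sketch assumes $\mathbf{F}^H$ is an IDFT matrix scaled by $1/\sqrt{\alpha N_C}$ with the subcarriers on the lowest bins (a convention not fixed by the text), uses hypothetical function names and a made-up set of empty subcarriers, and verifies that the stacked Gram matrix is diagonal and that the matrix-vector product equals a zero-padded inverse FFT.

```python
import numpy as np

def dummy_matrices_single_stream(v, idx, alpha):
    """Per-antenna A_n for one stream; v[c, n] is the precoder entry, idx the empty subcarriers."""
    n_c, n_t = v.shape
    n = alpha * n_c
    t = np.arange(n)[:, None]
    c = np.arange(n_c)[None, :]
    FH = np.exp(2j * np.pi * t * c / n) / np.sqrt(n)    # assumed F^H convention
    return [FH[:, idx] * v[idx, ant] for ant in range(n_t)]

def apply_dummy_fft(v_n, idx, m_tilde, alpha):
    """A_n @ m_tilde computed with a zero-padded inverse FFT instead of a matrix product."""
    n = alpha * v_n.shape[0]
    x = np.zeros(v_n.shape[0], dtype=complex)
    x[idx] = v_n[idx] * m_tilde                          # precoded dummy symbols on their subcarriers
    return np.sqrt(n) * np.fft.ifft(x, n=n)

rng = np.random.default_rng(1)
n_c, alpha, n_t = 16, 4, 3
v = rng.standard_normal((n_c, n_t)) + 1j * rng.standard_normal((n_c, n_t))
idx = np.array([2, 5, 11, 12])
m_tilde = rng.standard_normal(idx.size) + 1j * rng.standard_normal(idx.size)

A_list = dummy_matrices_single_stream(v, idx, alpha)
A_bar = np.vstack(A_list)                                # stacked over antennas
gram = A_bar.conj().T @ A_bar
gram_diag = np.sum(np.abs(v[idx, :]) ** 2, axis=1)       # sum_n |v_{c,n}|^2 per dummy subcarrier
direct = A_list[0] @ m_tilde
fast = apply_dummy_fft(v[:, 0], idx, m_tilde, alpha)
```

With unit-norm precoders the diagonal entries reduce to ones, so the matrix to be inverted becomes a scaled identity in that case.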
The steps are solved iteratively until a desired level of convergence or a maximum iteration count is reached. Therefore, the number of required ADMM iterations depends on the choice of $\epsilon_{\text{peak}}$ and $k_{\text{peak}}$ for Algorithm 2 (or $\epsilon_{\text{rel-peak}}$ and $k_{\text{rel-peak}}$ for Algorithm 3). The convergence rate of ADMM also depends on the penalty parameter $\rho$ (or $\sigma$), which is a design parameter. High values of the penalty parameter emphasize fast convergence of the primal residual, while low values put the focus on the dual residual. In our simulations, we found that the dual residual tends to converge much more slowly, so low values for the penalty parameter were chosen. Figure 2 presents a numerical example simulated under an arbitrary channel realization with $N_D = 64$ dummy symbols, providing insight into the required number of iterations (the full details of the system setup are given in Section VII). The top panel shows the decay of the primal and dual residuals, (34) and (35), for the first SCA step, up to a tolerance threshold of $10^{-3}$. The bottom panel shows the number of ADMM iterations required for each SCA step. It can be seen that in the beginning, as the SCA point can change significantly, ADMM requires more iterations to converge to the threshold. Then, as the SCA local point gets closer to a local optimum, the ADMM converges more quickly. It should be noted that the threshold of $10^{-3}$ is overly strict for this application, and in practical applications a looser tolerance ($10^{-2}$ to $10^{-1}$) would suffice. Furthermore, for the first SCA steps, a smaller cap can be set on the number of ADMM iterations to speed up the early (coarse) convergence to a local optimum. Finally, the number of ADMM iterations increases as the number of antennas and dummy symbols increases, as finding a consensus for a (larger) $\tilde{\mathbf{m}}$ that works best for all $N_T$ antennas becomes more difficult.
Motivated by the numerical example, we should also highlight that setting a good initial point can reduce the number of SCA steps. For Algorithm 1, this requires finding a good initial dummy symbol vector m, i.e., selecting over the space $\mathbb{C}^{N_D}$, which is used to calculate $\{t, r, \hat{s}_{n=1,\dots,N_T}\}$. In contrast, for Algorithm 3, the initial point $\{t, r, \beta_{n=1,\dots,N_T}\}$ consists of only $2 + N_T$ real variables. Furthermore, the initial $\{t, r\}$ should be set dynamically based on the averages of previous solutions, especially in time-correlated channels where the transmission parameters can be assumed to change slowly. Judging by the bottom graph of Figure 2, this would also significantly reduce the required ADMM iterations, as only the last SCA steps would be required.
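The dependence of the iteration count on the stopping tolerance and the penalty parameter can be illustrated on a toy problem. The sketch below runs scaled-form ADMM on a small lasso problem, which is a stand-in chosen for brevity and is not the update (20)-(23) of Algorithms 2 and 3; the function name `admm_lasso`, the problem data, and the shared tolerance for both residuals are assumptions made for illustration.

```python
import numpy as np

def admm_lasso(A, b, lam, rho, tol, max_iter=5000):
    """Scaled-form ADMM for min 0.5*||Ax - b||^2 + lam*||z||_1 s.t. x = z."""
    n = A.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    M = np.linalg.inv(A.T @ A + rho * np.eye(n))   # fine for a toy problem
    Atb = A.T @ b
    for k in range(1, max_iter + 1):
        x = M @ (Atb + rho * (z - u))              # quadratic x-update
        z_old = z
        v = x + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)  # soft threshold
        u = u + x - z                              # scaled dual update
        primal = np.linalg.norm(x - z)             # primal residual norm
        dual = rho * np.linalg.norm(z - z_old)     # dual residual norm
        if primal < tol and dual < tol:
            break
    return z, k, primal, dual

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)
_, k_strict, _, _ = admm_lasso(A, b, lam=0.1, rho=1.0, tol=1e-3)
_, k_loose, _, _ = admm_lasso(A, b, lam=0.1, rho=1.0, tol=1e-1)
```

Because both runs follow the same iterates, the looser tolerance can never need more iterations than the strict one. Rerunning with different values of `rho` shows how the penalty parameter shifts the balance between the primal and dual residuals.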

V. DUMMY SYMBOL AWARE BEAMFORMER DESIGN FOR DOWNLINK MULTI-USER MIMO
The joint PAPR and dummy symbol power reserve minimization framework operates on arbitrary beamformers and data signals, provided there are free space-frequency resources for dummy symbol allocation. In our previous work [24], [25], we considered an uplink scenario with joint transmission between cooperating users, assuming low angular separation between the users, which results in low-gain streams that could be exploited through dummy symbol allocations. In this article, we consider a downlink scenario and the corresponding beamformer design. We consider a multi-user MIMO-OFDM system where a base station with $N_T$ antennas serves K users with $N_r$ antennas each, on $N_C$ subcarriers. The number of antennas can be generalized, but for simplicity, we consider a case where all the users have identical configurations. User k is scheduled with $L^{\text{data}}_{c,k}$ data streams on subcarrier c. Furthermore, we assume that the number of spatial modes for dummy symbol allocation, $L^{\text{dummy}}_c$, ∀c, is known (for example, imposed by a standard or defined by the network operator), and the users are scheduled with an equal amount of space-frequency resources based on this knowledge. Thus, the total number of streams is $\sum_{k=1}^{K} L^{\text{data}}_{c,k} + L^{\text{dummy}}_c = L$, ∀c. Finally, as we utilize the Hughes-Hartogs bit and power loading algorithm [28], [29] to construct the transmit signal, a strict no-interference constraint is set in order to decouple beamformer design and power allocation. However, this constraint is not necessary in general.
Our approach utilizes a novel take on an iterative block-diagonalization algorithm [34], [26, Sec. IV], [35] to obtain non-interfering transmit and receive beamformers.³ The first novel aspect comes from the exploitation of known dummy symbol allocations. As the resources allocated for dummy symbols can have arbitrarily low gain, we can optimize the data precoders with increased dimension (i.e., project the data into a larger-dimensional interference nullspace), simply by ignoring the dummy symbol precoders at this point. Thus, we propose scheduling the users with fewer than the maximum number of spatial modes, and using the remaining modes to minimize the PAPR through dummy symbol allocations. This scheduling scheme can be easily justified, for example, in scenarios where the user-specific channel matrix is poorly conditioned, i.e., the majority of the achieved rate is obtained by using only the best eigenmode for every user.⁴
The remaining spatial modes on the transmit side can be exploited to reduce the PAPR by transmitting dummy symbols. Furthermore, the beamforming gains of these excess spatial modes can be made negligibly small (while maximizing the gains of the data-carrying spatial modes), as dummy symbols are not meant to carry information. This is done by projecting the dummy symbol precoders into the nullspace of the already optimized data streams. This is the second novel aspect of our approach. Next, we briefly reproduce the iterative block-diagonalization process to find the transmit and receive beamformers, described in more detail in [26, Sec. IV].
In the expression for the detected signal of user k on subcarrier c, $U_{c,k} \in \mathbb{C}^{N_r \times L^{\text{data}}_{c,k}}$ is the user-specific receive beamformer, $H_{c,k} \in \mathbb{C}^{N_r \times N_T}$ is an arbitrary channel matrix between the base station and user k, and $n_{c,k}$ is the additive white Gaussian noise vector.
The iterative block-diagonalization algorithm is based on a series of consecutive nullspace projections that eliminate interference between streams. The transmit and receive beamformers of the data streams are optimized separately for all subcarriers c, exploiting the $L^{\text{dummy}}_c$ spatial modes allocated for the dummy symbols as additional free dimensions of the interference-free subspace. This results in increased beamforming gain for the data streams, which can freely interfere with the streams carrying dummy symbols.
First, user-specific effective channels are constructed using the left eigenvectors of the channel matrix as $\tilde{H}_{c,k} = U_{c,k}^H H_{c,k} \in \mathbb{C}^{L^{\text{data}}_{c,k} \times N_T}$, where the columns of $U_{c,k}$ are chosen to correspond to the highest $L^{\text{data}}_{c,k}$ eigenvalues of the channel matrix. Then, considering user k, the interfering users' effective channels are concatenated as $\bar{H}_{c,k} = [\tilde{H}_{c,1}^T, \dots, \tilde{H}_{c,k-1}^T, \tilde{H}_{c,k+1}^T, \dots, \tilde{H}_{c,K}^T]^T$, which we can use to find the last $L^{\text{data}}_{c,k}$ right eigenvectors of $\bar{H}_{c,k}$ to form an orthogonal basis for its nullspace $\tilde{V}^{\text{null}}_{c,k}$, i.e., an interference-free basis. Next, by projecting the channel matrix of user k into this interference-free domain, i.e., $\hat{H}_{c,k} = H_{c,k}\tilde{V}^{\text{null}}_{c,k}$, we can obtain the left and right eigenvectors corresponding to the $L^{\text{data}}_{c,k}$ strongest eigenvalues as $\hat{U}^{\text{data}}_{c,k}$ and $\hat{V}^{\text{data}}_{c,k}$, respectively, from the singular value decomposition of $\hat{H}_{c,k}$. The interference-free transmit and receive beamformers for user k are given by $V_{c,k} = \tilde{V}^{\text{null}}_{c,k}\hat{V}^{\text{data}}_{c,k} \in \mathbb{C}^{N_T \times L^{\text{data}}_{c,k}}$ and $U_{c,k} = \hat{U}^{\text{data}}_{c,k}$, respectively. This process is applied for all users in parallel, and to further increase the beamforming gain, the receive beamformer can be used to initialize subsequent iterations following the same process.

⁴ In non-correlated channels, the allocation of data and dummy symbols is a trade-off between the benefits of PAPR minimization and throughput, as the dummy symbols can reserve resources better suited for data transmission.
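One pass of the nullspace projection described above can be sketched with NumPy's SVD. This is a minimal illustration under stated assumptions, not the reference implementation of [26]: the function names are hypothetical, the channels are random, the full nullspace of the stacked interfering effective channels is kept as the projection basis, and the stacked matrix is assumed to have full row rank.

```python
import numpy as np

def leading_left_vectors(H, n_streams):
    """Left singular vectors of H for its n_streams largest singular values."""
    U, _, _ = np.linalg.svd(H)
    return U[:, :n_streams]

def bd_pass(H, U, n_streams):
    """One block-diagonalization pass for all users on one subcarrier."""
    K = len(H)
    eff = [U[k].conj().T @ H[k] for k in range(K)]   # effective channels
    U_new, V_new = [], []
    for k in range(K):
        others = np.vstack([eff[j] for j in range(K) if j != k])
        _, _, Vh = np.linalg.svd(others)             # Vh is N_T x N_T
        V_null = Vh[others.shape[0]:].conj().T       # nullspace basis
        Uh, _, Vhh = np.linalg.svd(H[k] @ V_null)    # projected channel
        U_new.append(Uh[:, :n_streams[k]])
        V_new.append(V_null @ Vhh[:n_streams[k]].conj().T)
    return U_new, V_new

rng = np.random.default_rng(2)
n_t, n_r, K = 6, 2, 2
n_streams = [1, 1]                                   # one data stream per user
H = [rng.standard_normal((n_r, n_t)) + 1j * rng.standard_normal((n_r, n_t))
     for _ in range(K)]
U0 = [leading_left_vectors(H[k], n_streams[k]) for k in range(K)]
U1, V1 = bd_pass(H, U0, n_streams)

# Leakage of user k's precoder into the effective channels it was
# projected against (built from the receivers U0 used in this pass).
leak = max(np.abs(U0[j].conj().T @ H[j] @ V1[k]).max()
           for j in range(K) for k in range(K) if j != k)
```

The leakage is zero by construction only with respect to the receivers used to build the effective channels, which is why the text feeds the updated receivers `U1` back into subsequent passes.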
Algorithm 4 Iterative block-diagonalization algorithm for data and dummy beamformer optimization.
Find $L^{\text{data}}_{c,k}$ left and right eigenvectors corresponding to the highest eigenvalues of $\hat{H}_{c,k}$.

Next, we utilize a strategy similar to successive block-diagonalization [27] to obtain the dummy stream precoders, which are projected into the nullspace of the already optimized data. This way, the dummy streams will not interfere with the data streams, although the opposite can happen. However, as the dummy symbols are ignored in the receiver, this interference does not affect the system performance.
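The dummy stream precoders can be found, on subcarrier c, from the last $L^{\text{dummy}}_c$ right eigenvectors spanning the nullspace of the already optimized data streams, denoted as $V^{\text{dummy}}_c \in \mathbb{C}^{N_T \times L^{\text{dummy}}_c}$. Using this notation, the complete final transmit beamformer is given by the concatenation $V_c = [V_{c,1}, \dots, V_{c,K}, V^{\text{dummy}}_c] \in \mathbb{C}^{N_T \times L}$. The iterative block-diagonalization algorithm is summarized in Algorithm 4.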
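The dummy precoder step can be sketched in the same way. The example below assumes that the dummy precoders are taken from the nullspace of the stacked effective data channels of all users; the helper name `dummy_precoder`, the random channels, and the placeholder data precoders are hypothetical and serve only to show the dimensions involved.

```python
import numpy as np

def dummy_precoder(H, U, n_dummy):
    """Last n_dummy right singular vectors of the stacked effective channels.

    They lie in the nullspace as long as n_dummy does not exceed
    N_T minus the total number of data streams.
    """
    stacked = np.vstack([U_k.conj().T @ H_k for U_k, H_k in zip(U, H)])
    _, _, Vh = np.linalg.svd(stacked)                # Vh is N_T x N_T
    return Vh[-n_dummy:].conj().T

rng = np.random.default_rng(3)
n_t, n_r, K, n_dummy = 6, 2, 2, 2
H = [rng.standard_normal((n_r, n_t)) + 1j * rng.standard_normal((n_r, n_t))
     for _ in range(K)]
# Placeholder receivers: strongest left singular vector of each channel.
U = [np.linalg.svd(H_k)[0][:, :1] for H_k in H]
V_dummy = dummy_precoder(H, U, n_dummy)

# Dummy streams must not leak into any user's data stream.
leak = max(np.abs(U_k.conj().T @ H_k @ V_dummy).max() for U_k, H_k in zip(U, H))

# Placeholder data precoders (matched filters), only to show the
# concatenation into the complete transmit beamformer.
V_data = [H_k.conj().T @ U_k for U_k, H_k in zip(U, H)]
V_c = np.hstack(V_data + [V_dummy])
```

With two single-stream users and two dummy streams, `V_c` has one column per stream, four in total.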

VI. RATE MAXIMIZATION SEARCH ALGORITHM
There are multiple benefits to reduced PAPR, for example, less strict linearity requirements and reduced signal distortion of power amplifiers when using a high input power operating point, where the power efficiency is maximized. The reduced backoff requirement to combat the signal distortion, a benefit obtained with reduced PAPR, could be translated to increased transmit SNR, providing increased cell size.
In this paper, we also consider this additional SNR increase in terms of increasing the transmit data rate under a given peak power constraint $P_{\text{tx}}$. This means allocating as much power to the data power budget $p_{\text{data}}$ as possible, while ensuring that the antenna-specific peak power remains below $P_{\text{tx}}$ (for example, the dashed black line in Figure 1).
We utilize the well-known Hughes-Hartogs (HH) bit and power loading algorithm [28], [29] to construct the transmit signal. The HH algorithm iteratively allocates bits on subcarriers in a greedy manner based on the channel gains. The algorithm provides an optimal bit allocation for a given bit error rate (BER) and total power budget, provided a strict no-interference constraint is imposed between the streams on any given subcarrier. Using the iterative block-diagonalization algorithm of Section V, we can obtain beamformers that remove interference.
Our implementation of the HH algorithm follows [29], modified to operate in a multi-antenna system with parallel spatial streams. The algorithm iteratively allocates $b_{\text{inc}}$ bits on the stream l and subcarrier c that require the least incremental power $p^{\text{inc}}_{c,l}$, given by (40), where $b_{c,l}$ is the current bit allocation on subcarrier c of stream l, $\tilde{u}_{c,l}$ is the l-th column of the matrix $[U_{c,1} \dots U_{c,K}] \in \mathbb{C}^{N_r \times \sum_k L^{\text{data}}_{c,k}}$, $N_0$ is the noise variance, B is the subcarrier bandwidth, and $\Gamma_{c,l}$ denotes the subcarrier- and stream-wise BER target. The bits are allocated using (40) until all the power available for data allocation is used. Once the bit and power loading is complete, we assign quadrature amplitude modulated (QAM) symbols to the subcarriers according to the bit rate, and obtain the frequency-domain data vectors $d_c$, ∀c.
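The greedy allocation loop can be sketched as follows. Since expression (40) is not reproduced here, the incremental power below uses the common textbook gap approximation, in which going from b to b + b_inc bits costs gap · N0 · B · (2^(b + b_inc) − 2^b) / gain; this formula, the function name `greedy_bit_loading`, and the scalar per-resource gains are assumptions for illustration and may differ from the paper's (40).

```python
import heapq

def greedy_bit_loading(gains, power_budget, gap=1.0, n0=1.0, bw=1.0,
                       b_inc=2, b_max=6):
    """Greedy bit loading: always take the cheapest next b_inc bits."""
    bits = [0] * len(gains)
    power = [0.0] * len(gains)

    def inc_power(i):
        # Assumed textbook cost of adding b_inc bits on resource i.
        b = bits[i]
        return gap * n0 * bw * (2.0 ** (b + b_inc) - 2.0 ** b) / gains[i]

    heap = [(inc_power(i), i) for i in range(len(gains))]
    heapq.heapify(heap)
    used = 0.0
    while heap:
        dp, i = heapq.heappop(heap)
        if used + dp > power_budget:
            break                      # the cheapest step no longer fits
        bits[i] += b_inc
        power[i] += dp
        used += dp
        if bits[i] + b_inc <= b_max:   # stop at 64-QAM (6 bits)
            heapq.heappush(heap, (inc_power(i), i))
    return bits, power

# Three (subcarrier, stream) resources with gains 10, 1 and 0.1.
bits, power = greedy_bit_loading([10.0, 1.0, 0.1], power_budget=10.0)
```

With these inputs, the strongest resource is filled up to 64-QAM, the middle one carries 4-QAM, and the weakest is left empty because its first increment would exceed the budget.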
Our aim is to maximize the transmitted data rate by alternating between bit and power loading and PAPR minimization. After jointly minimizing the PAPR and the power reserve required by the dummy symbols, we compare the antenna-specific peak power to the peak power constraint $P_{\text{tx}}$, and allocate the difference to the data power budget. The antenna-specific peak power is given by $P^{\text{peak}}_n = 10\log_{10}(\|s_n\|_\infty^2)$, ∀n. (41)
Then, if $P^{\text{peak}}_n < P_{\text{tx}}$, ∀n, we allocate the minimum difference to the data power budget as $P_{\text{data}} := P_{\text{data}} + P_{\text{tx}} - \max_n (P^{\text{peak}}_n)$, where $P_{\text{data}} = 10\log_{10}((1/N_C)\,p_{\text{data}})$ is the average SNR. The bit and power loading and PAPR minimization are repeated using the new data power budget.
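The peak power evaluation and the budget update can be sketched in a few lines. The function names and the toy two-antenna signal are hypothetical; all power quantities are handled in dB, as in the update rule above.

```python
import numpy as np

def peak_power_db(s):
    """Per-antenna peak power in dB; s has shape (antennas, samples)."""
    return 10.0 * np.log10(np.max(np.abs(s) ** 2, axis=1))

def update_data_budget_db(p_data_db, p_tx_db, s):
    """Add the smallest headroom over all antennas to the data budget.

    The budget is returned unchanged if any antenna already reaches or
    exceeds the peak power constraint.
    """
    headroom = p_tx_db - peak_power_db(s).max()
    return p_data_db + headroom if headroom > 0 else p_data_db

# Toy time-domain signals for two antennas.
s = np.array([[1.0, 2.0j], [0.5, 1.0]])
peaks = peak_power_db(s)                          # about [6.02, 0.0] dB
new_budget = update_data_budget_db(20.0, 10.0, s)
unchanged = update_data_budget_db(20.0, 5.0, s)   # constraint violated
```

Only the antenna with the highest peak determines the headroom, which is why the update uses the maximum over n.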
Once the antenna-wise peak power constraint is violated, i.e., $P^{\text{peak}}_n > P_{\text{tx}}$ for some n, we backstep individual bit allocations and re-optimize the PAPR until $P^{\text{peak}}_n < P_{\text{tx}}$, ∀n. This provides the highest possible data rate for Hughes-Hartogs bit and power loading, under a given peak power constraint and QAM-symbol realization. The rate maximization search algorithm is summarized in Algorithm 5.
We acknowledge that the proposed rate maximization search algorithm is not necessarily appealing in practice, as it requires constructing the transmit signal multiple times while continuously re-optimizing the PAPR. For the purposes of this paper, the algorithm is used to simulate the performance increase that can be obtained using the PAPR reduction framework. In a more practical scenario, the data power budget would be set directly according to statistics on past PAPR reduction capabilities, such as 99-percentile results for $P_{\text{cf}} + P_{\text{res}}$, without utilizing a search algorithm.

5: Assign symbols according to $b_{c,l}$, ∀c, l to find $d_c$, ∀c.
6: PAPR reduction using any of Alg. 1-3.
7: Evaluate (41), ∀n, update data power budget with $P_{\text{data}} := P_{\text{data}} + P_{\text{tx}} - \max_n (P^{\text{peak}}_n)$.
8: end while
9: while $P^{\text{peak}}_n > P_{\text{tx}}$, for any n do
10: Remove $p^{\text{inc}}_{c,l}$ from $p_{c,l}$, where indices {c, l} correspond to the latest allocation.
11: Remove $b_{\text{inc}}$ from $b_{c,l}$.
12: PAPR reduction using any of Alg. 1-3.
13: Evaluate (41), ∀n.
14: end while

VII. NUMERICAL RESULTS
We investigate the performance of the proposed joint PAPR and power reserve minimization framework using Monte Carlo simulations. We first simulate the complementary cumulative distribution function (CCDF) for the different PAPR reduction algorithms and compare them to the OFDM baseline. Afterward, we investigate the effect that the number of dummy symbols $N_D$ has on the 99-percentile PAPR reduction performance and the achieved rate when using the different algorithms. Finally, we compare the PAPR-rate performance of the algorithms for various peak power constraints.
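For readers who wish to reproduce the evaluation metrics, the empirical CCDF and the 99-percentile PAPR can be computed as in the sketch below. It covers only a single-antenna OFDM baseline with 4-QAM symbols and zero-padding in the middle of the spectrum for oversampling; these choices, the function name, and the threshold grid are illustrative assumptions and do not reproduce the MIMO setup of this section.

```python
import numpy as np

def ofdm_papr_db(n_c=64, alpha=4, n_symbols=2000, seed=0):
    """PAPR in dB of random 4-QAM OFDM symbols, oversampled by alpha."""
    rng = np.random.default_rng(seed)
    bits = rng.integers(0, 2, size=(n_symbols, n_c, 2))
    X = ((2 * bits[..., 0] - 1) + 1j * (2 * bits[..., 1] - 1)) / np.sqrt(2)
    # Oversample by inserting zeros in the middle of the spectrum.
    pad = np.zeros((n_symbols, (alpha - 1) * n_c), dtype=complex)
    X_os = np.concatenate([X[:, :n_c // 2], pad, X[:, n_c // 2:]], axis=1)
    x = np.fft.ifft(X_os, axis=1)
    p = np.abs(x) ** 2
    return 10.0 * np.log10(p.max(axis=1) / p.mean(axis=1))

papr_db = ofdm_papr_db()
thresholds = np.linspace(4.0, 12.0, 33)
# Empirical CCDF: fraction of symbols whose PAPR exceeds each threshold.
ccdf = np.array([(papr_db > th).mean() for th in thresholds])
# 99-percentile PAPR: the level exceeded by only 1% of the symbols.
p99 = np.percentile(papr_db, 99)
```

The 99-percentile value corresponds to the point where the CCDF curve crosses 0.01, which is the worst-case figure of merit used in the comparisons.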
We compare our proposed algorithms to other TR techniques in the literature [13]-[19], [21], [23], in terms of worst-case PAPR performance. As many of the existing schemes are designed for single-antenna systems and might not readily translate to the multi-antenna case, we base our comparison on the operating principles, i.e., on the optimization objectives of peak power minimization and strict PAPR minimization. This is done through (16) by setting δ = 0, which corresponds to peak power minimization (i.e., the general literature approach) translated to a general MIMO framework. Furthermore, as [20] uses a linear approximation of the average transmit power without updating the local point, we use a $j_{\text{SCA}} = 1$ setting of Algorithm 1 with δ = 0 to represent the performance of a similar system. We also compare our algorithms to active constellation extension (ACE) (solving [12, Eq. (6)]), as its operating principle is similar to that of tone reservation.
Direct comparison to other, non-TR-based techniques is difficult due to differing objectives and design constraints, such as allowed performance deficiencies. For example, amplitude clipping [5]-[7] can provide significant PAPR reduction if clipping noise and signal distortions can be tolerated. Our approach causes only negligible noise and distortions due to the orthogonal resources of the dummy symbols. Furthermore, methods that trade EVM or BER for PAPR reduction (such as [20], [21]) perform poorly if the BER target is strict. Our proposed approach has a negligible impact on EVM/BER. Thus, setting strict constraints for signal distortions and BER, i.e., the mere selection of system parameters, can make one method appear superior to another. Hence, we have limited our comparisons to methods with similar operating principles.

A. CHANNEL MODELS
For the rate maximization simulations, we generate the channel matrix in two ways. As the iterative block-diagonalization algorithm to find the beamformers is well-justified in scenarios where the channels are correlated, we generate the channel matrix by using a MIMO version of the multipath channel model for uniform linear arrays [36]. The MIMO modification is obtained by utilizing a rank-1 channel matrix for each path instead of a vector [37]. In the resulting multipath channel expression, $\eta_k$ is the pathloss to user k, M denotes the number of independent (and identically distributed) paths, $\phi_{k,m}$ is the uniformly distributed random phase noise of the path, and the vector $a_t(\theta_{k,m})$ is the array signature vector at the transmitter, with angle of departure $\theta_{k,m}$. For simplicity, we consider equidistant users with $\eta_k = 1$, ∀k, and assume that the random phase rotation $\phi_{k,m}$ provides enough averaging to use the same angle $\theta_{k,m}$ for both array vectors, without loss of generality. The antenna spacing and carrier wavelength are normalized to 1. The number of multipaths is set to M = 50.
For comparison, we also consider uncorrelated channels, where all of the channel coefficients are random, independent and identically distributed as $h_{c,k,i,j} \sim \mathcal{CN}(0, 1)$, ∀c, k, i, j.
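Both channel types can be generated along the following lines. This is a sketch under explicit assumptions, since the multipath expression itself is not reproduced above: each path contributes a rank-1 outer product of receive and transmit array vectors with a random phase, the sum is scaled by sqrt(η/M), the array response uses sin(θ) with unit spacing-to-wavelength ratio, and the angular spread is read as the total width of a uniform interval. The function names and these conventions are hypothetical and may differ from the exact model of [36], [37].

```python
import numpy as np

def ula_vector(n_ant, theta, spacing=1.0):
    """Uniform linear array response (spacing in wavelengths, assumed form)."""
    return np.exp(2j * np.pi * spacing * np.arange(n_ant) * np.sin(theta))

def multipath_channel(n_r, n_t, center_deg, spread_deg, n_paths=50,
                      pathloss=1.0, rng=None):
    """Sum of rank-1 paths sharing one angle for both array vectors."""
    rng = np.random.default_rng() if rng is None else rng
    half = np.deg2rad(spread_deg) / 2.0
    theta = np.deg2rad(center_deg) + rng.uniform(-half, half, n_paths)
    phase = rng.uniform(0.0, 2.0 * np.pi, n_paths)
    H = np.zeros((n_r, n_t), dtype=complex)
    for th, ph in zip(theta, phase):
        H += np.exp(1j * ph) * np.outer(ula_vector(n_r, th),
                                        ula_vector(n_t, th).conj())
    return np.sqrt(pathloss / n_paths) * H

def iid_channel(n_r, n_t, rng=None):
    """Uncorrelated channel with CN(0, 1) entries."""
    rng = np.random.default_rng() if rng is None else rng
    return (rng.standard_normal((n_r, n_t))
            + 1j * rng.standard_normal((n_r, n_t))) / np.sqrt(2.0)

rng = np.random.default_rng(4)
# Two users 40 degrees apart, each with a 20 degree angular spread.
H_corr = [multipath_channel(2, 6, c, 20.0, rng=rng) for c in (-20.0, 20.0)]
H_iid = [iid_channel(2, 6, rng=rng) for _ in range(2)]
H_single = multipath_channel(2, 6, 20.0, 20.0, n_paths=1, rng=rng)
```

A single path yields a rank-1 matrix, while summing many paths within a narrow angular spread yields a correlated channel whose weaker eigenmodes have low gain.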

B. COMMON SYSTEM PARAMETERS
We consider a base station with $N_T$ antennas serving K = 2 users, each equipped with $N_r = 2$ antennas, on $N_C = 64$ subcarriers. The users are on a circle around the base station, with a 40 degree angle between the users. Both users have a uniformly distributed angular spread of 20 degrees. The noise power and subcarrier bandwidth are set to $N_0 = 1$ and B = 1, respectively. The BER target for Hughes-Hartogs bit and power loading is set to $\Gamma_{c,l} = 10^{-3}$, ∀c, l, and we consider 4-, 16-, and 64-QAM alphabets, i.e., $b_{\text{inc}} = 2$, for the bit loading.

C. SIMULATIONS
We investigate the PAPR reduction performance of the proposed joint PAPR and dummy symbol power reserve minimization algorithms. The base station has $N_T = 6$ antennas, with a transmit SNR of $P_{\text{data}} = 20$ dB, and the number of dummy symbols is fixed to $N_D = 64$. We simulate the algorithm performance for minimizing $P_{\text{cf}} + P_{\text{res}}$ over 10000 channel instances, plotted as a complementary cumulative distribution function in Figure 3. To provide sufficiently accurate results, the oversampling factor was set to α = 4. Figure 3 shows the strong PAPR reduction capabilities of the proposed algorithms, especially Algorithm 3, i.e., the iterative relative peak power minimization. Even for a single SCA iteration, the algorithm performs better than the peak power minimization schemes, with performance comparable to single-iteration Algorithm 1. With additional SCA iterations it even outperforms Algorithm 1, with significantly reduced computational complexity, as discussed in Section IV-D. As the only principal difference between these algorithms is in the handling of the denominator (11b), we conclude that the better performance is caused by a better choice for the fixed SCA local point, i.e., fixing the denominator at each SCA step instead of using a linear approximation. For Algorithm 2, note that we have used δ = 0, which was found to provide the best results in terms of $P_{\text{cf}} + P_{\text{res}}$. Finally, when compared to active constellation extension (ACE) [12], our proposed scheme performs significantly better in terms of worst-case PAPR. This is due to the peak power minimization objective of ACE and our use of high-order constellations. Also, due to the power requirement of ACE, its worst-case PAPR can be worse than the OFDM baseline, where no additional power is added to the signal. We investigate the effect of δ next.
It is of interest to see how the performance behaves for δ = {0, 0.1, 0.5, 1, 5, 10}, and for comparison, we also provide results for Algorithm 1, where we have modified the objective (15a) to t/t + δ∥m∥ 2 2 /(p datar ). As Algorithm 3 operates on the same principle as Algorithm 1, the results are assumed to behave similarly, and are thus omitted to avoid cluttering. The results are presented in Figure 4.
It can be seen that when using the strict PAPR expression, i.e., Algorithms 1 and 3, it is best to give equal priority to the PAPR and dummy symbol power reserve objectives. However, for the peak power minimization problem, smaller values of δ provide better results. This is intuitive, since the signal peaks can also be reduced by reducing the average power of the signal. Thus, the objective of minimizing the dummy symbol power reserve, which adds to the average signal power, is already implicit in the peak power objective. The same effect does not hold for the strict PAPR expression, where the PAPR can be reduced either by lowering the peaks or by increasing the average power. It is evident that setting δ = 0, i.e., the approach taken in the literature, provides the best results for Algorithm 2. However, our proposal for Algorithm 2 still provides the possibility of trading between the power peaks and the allocated dummy symbol power (i.e., transmit SNR) with low computational complexity, which is not considered in the literature. This trade-off would be beneficial in, for example, low-SNR applications, where the low average transmit power keeps the signal peaks low enough to avoid distortion from PA saturation. Then, a bigger part of the power budget could be allocated for data instead of the dummy symbols, and the subsequent loss in peak reduction performance can be tolerated at the PA. Table 1 lists the average additional transmit power required for the dummy symbols, and the 99-percentile PAPR that is achieved, for different values of δ. We can observe that the dummy symbol power requirement can be quite significant with lower values of δ, i.e., in the region of best performance for Algorithm 2. Figures 3 and 4 are simulated using $N_D = 64$ dummy symbols. Next, we investigate how the number of dummy symbols affects the algorithm performance.
We consider the 99-percentile $P_{\text{cf}} + P_{\text{res}}$, calculated from results simulated over 1000 channel iterations. We set δ = 0 for Algorithm 2.
The oversampling factor was set to α = 1 for quicker simulations, as it affects the behavior of all algorithms similarly. Figure 5 presents the results. It can be seen that increasing the number of dummy symbols has diminishing returns. Algorithm 3 has the best performance overall, offering significant PAPR reduction even at low dummy symbol counts. Algorithm 2 has the worst performance, possibly even worse than the OFDM baseline, at low dummy symbol counts. This is caused by the increase in $P_{\text{res}}$, combined with the low PAPR reduction capability when only a few resources are allocated for the algorithm to operate with. As the number of dummy symbols is increased, the algorithms based on the strict PAPR expression begin to saturate, whereas the performance of Algorithm 2 could still benefit from more dummy symbols. It should be noted that $N_D = 128$ corresponds to two full streams allocated for dummy symbols only, which in this setup could be done without affecting the data allocation.
Finally, we investigate the rate performance of the different proposed PAPR and dummy symbol power reserve minimization algorithms through the use of Algorithm 5. Our simulation setup has strict constraints due to our use of Hughes-Hartogs bit and power allocation in the signal construction. Thus, we limit our investigation to the differences between the algorithms under the same setup instead of simulating a realistic scenario. We consider a system where the base station has $N_T = 4$ antennas and serves K = 2 users with $N_r = 2$ antennas each. Thus, there is a trade-off in space-frequency resources between dummy and data symbols. The results are simulated over 1000 channel realizations, and since we are only interested in the performance differences between the algorithms, the oversampling factor was set to α = 1, and we set a peak power constraint $P_{\text{tx}} = 20$ dB for Algorithm 5. Here, we also consider the effect of correlated channels to show the benefits of having low-gain streams to exploit. Figure 6 presents the achieved average rate as a function of the number of dummy symbols, with separate plots for correlated (solid line) and uncorrelated channels (dashed line). We can immediately make two main observations. The first is the trade-off in resources between data and dummy symbols, which is evident in the case of uncorrelated channels around the point $N_D = 64$. Allocating more dummy symbols takes high-gain subcarriers away from data allocation, and the power increase provided by PAPR minimization is unable to compensate, thus reducing the rate. The second observation is that the rate reduction is not yet present in correlated channels, where the low-gain streams can be allocated for dummy symbols to provide an additional power boost, and thus an increased rate, with minimal penalty.
Comparing the different algorithms, the peak power minimization algorithm (Algorithm 2) performs the best in terms of achieved rate, and the algorithms utilizing the strict PAPR expression (Algorithms 1 and 3) have a lower but comparable performance. The higher rate performance of Algorithm 2 is explained by the objective of (16), i.e., the focus on minimizing the highest signal peak (see (41)), which is the limiting factor when attempting to allocate as much power as possible for data under a given common peak power constraint $P_{\text{tx}}$ over all antennas.
Despite the highest performance in terms of rate, the PAPR performance of Algorithm 2 is still worse due to the high PAPR of the antennas with lower average transmit power. To illustrate this, Figure 7 plots the 99-percentile PAPR and achieved rate for different algorithms for $N_D = 64$. In the figure, the points correspond to different peak power constraints, $P_{\text{tx}} = \{5, 10, 15, 20, 25, 30\}$ dB, set over all antennas.
It can be seen that Algorithm 2, operating under the peak power minimization principle, has the highest rates (rightward direction), but at the same time, the highest PAPR (upward direction). For a fixed rate, the lowest PAPR is obtained with Algorithm 3. The worse behavior of peak power minimization at lower peak power constraints is explained by $P_{\text{res}}$, which corresponds to the factor of increase on the allocated data power. Thus, as the data power is very low, this factor can be very high, even more significant than the reduction in $P_{\text{cf}}$.

VIII. CONCLUSIONS
We proposed a joint PAPR and dummy symbol power reserve minimization framework to combat the high PAPR of MIMO-OFDM systems. Our proposed method operates on the strict PAPR expression, which also accounts for the PAPR of antennas with lower average transmit power. We were motivated by the fact that the peak power minimization principle, common in the existing literature, does not necessarily translate well to multi-antenna systems, as it can ignore the antennas with lower than maximum average transmit power. We also account for the self-power requirement of the scheme, which has not been widely considered. We derived low-complexity iterative algorithms to solve the proposed problem, and also proposed an approach to beamforming in which we leveraged knowledge of dummy symbol allocations to provide higher-gain data streams at the cost of lower-gain streams for dummy symbol allocations. Our simulation results show the benefits of using the strict PAPR expression instead of peak power minimization, in terms of worst-case PAPR over all antennas. We also see that the dummy symbol power reserve should be accounted for if the strict PAPR expression is used, whereas for peak power minimization this is not necessary. Finally, the results show that the proposed PAPR reduction scheme is very beneficial in terms of rate, especially under correlated channels where low-gain streams are common.