Inverse Covariance Matrix Estimation for Low-Complexity Closed-Loop DPD Systems: Methods and Performance

In this article, we study closed-loop digital predistortion (DPD) systems and associated learning algorithms. Specifically, we propose various low-complexity approaches to estimate and manipulate the inverse of the input data covariance matrix (CM) and combine them with the so-called self-orthogonalized (SO) learning rule. The inherent simplicity of the SO algorithm, combined with the proposed solutions, allows for remarkably reduced complexity in the DPD system while maintaining similar linearization performance compared to other state-of-the-art methods. This is demonstrated with thorough over-the-air (OTA) mmW measurement results at 28 GHz, incorporating a state-of-the-art 64-element active antenna array, and very wide channel bandwidths up to 800 MHz. In addition, complexity analyses are carried out, which together with the measured linearization performance demonstrates favorable performance–complexity tradeoffs in linearizing mmW active array transmitters through the proposed solutions. The techniques can find application in systems where the power amplifier (PA) nonlinearities are time-varying and thus frequent or even constant updating of the DPD is required, good examples being mmW adaptive antenna arrays as well as terminal transmitters in 5G and beyond networks.


Inverse Covariance Matrix Estimation for I. INTRODUCTION
C ONTEMPORARY radio communication systems, such as the recently introduced 5G new radio (NR) mobile networks, build on multicarrier modulation-most notably orthogonal frequency-division multiplexing (OFDM). Multicarrier modulations are known to provide spectrally efficient waveforms but also possess high peak-to-average power ratio (PAPR) [1], [2], which complicates the operation of power Manuscript  amplifiers (PAs) close to saturation. In order to control the quality of the transmit waveform while ensuring high power efficiency in the transmit chain, digital predistortion (DPD) is a well-known technical approach-see, e.g., [1], [3]- [5], and the references therein. DPD aims to suppress the unwanted out-of-band (OoB) emissions and in-band nonlinear distortion by applying specific nonlinear preprocessing to the digital transmit waveform. Especially in combination with efficient PAPR mitigation methods [6], DPD can significantly improve the transmitter power efficiency while maintaining at the same time low levels of unwanted nonlinear distortions. A particularly timely DPD application is the linearization of active antenna arrays, being an essential technology for instance in the emerging 5G NR base stations and mobile devices, especially at the millimeter-wave (mmW) bands. These operating frequencies are also referred to as the socalled frequency range 2 (FR-2) [7] in the 3GPP terminology. Timely examples of articles focusing on this topic include, e.g., [8]- [13]. In such systems, the nonlinear distortion is known to be beam-dependent, stemming from the loadmodulation phenomenon [8]. Fast DPD adaptation is thus required such that the nonlinear distortions can be suppressed, while the beam is steered. This issue, along with the extremely high processing rates and channel bandwidths at mmW frequencies, calls for reduced complexity DPD approaches and associated parameter learning algorithms.
Another relatively new DPD use case is mobile device linearization. In general, mobile device power efficiency is very important, which is why the PA is typically operated close to saturation. Thus, despite the more relaxed linearity requirements compared to base-station transmitters, linearization can be pursued for maximum power efficiency. Furthermore, in mobile devices, the transmitted waveforms are very dynamic, as the resource block allocation and output power can change on a per-slot basis [14]. On the other hand, the computational resources in mobile devices are much more limited than in base stations. Thus overall, both the adaptability and the complexity of the DPD system are becoming increasingly important in mobile devices, forming one main motivation for this work and the methods described in this article.
A wide range of DPD models can be found in the existing literature, with some of the most common approaches being the memory polynomial (MP) [1], [5], [15] and the generalized MP (GMP) [3], [15], [16]. These approaches can be seen as subsets of the Volterra series [17], [18]. In such methods, the DPD coefficients are usually estimated by using block-based regression algorithms, such as least squares (LS) or Gauss-Newton (GN) [4], [19]. These learning solutions provide reliable coefficient estimation, but they also involve relatively high computational complexity-especially if onchip estimation were pursued. Alternatively, adaptive filtering algorithms can be used, such as least mean squares (LMS) or some of its variants [20]. The input data basis functions (BFs), however, need to be typically prewhitened or orthogonalized before the actual processing since the elements of the input data are highly correlated, leading to large eigenvalue spread of the input signal covariance matrix (CM) [21].
Literature targeting explicitly low-complexity DPD approaches and associated learning methods is, in turn, somewhat less common. Techniques aiming at this direction are, for instance, [21]- [27]. In [22], a closed-loop MP DPD model was presented, where the model coefficients were estimated with damped GN in combination with a signed regressor algorithm (SRA). However, the signed CM in the SRA suffered from rank-deficiencies, and additional Walsh-Hadamard matrix transformations were required, further increasing the computational complexity. In [21] and [23], a block-adaptive LMS-type algorithm was proposed. However, additional BF orthogonalization or prewhitening had to be applied to ensure fast and reliable convergence. In [27], an SO solution was applied and demonstrated to achieve a similar performance as other state-of-the-art techniques. However, the CM was assumed fixed, and the computational cost of its recalculation was high. In [24]- [26] and [28], cascaded Hammerstein and Wiener structures were proposed. Such approaches typically have less free parameters, leading to solutions with lower associated complexity. However, they are best applied in scenarios where physical knowledge of the system under study is available, which is not always the case. Finally, works in [29] and [30] presented lookup table (LUT)-based approaches. The LUTs aimed at substituting the high-order polynomials in the DPD model, relaxing the overall complexity. Nonetheless, their modeling capabilities are somewhat limited, depending on the size of the LUTs. To enhance the performance, the LUT sizes can be increased to better describe the nonlinear characteristics, but this leads to higher memory requirements and slower convergence speeds.
In this article, we adopt a closed-loop MP structure in combination with a modified version of the SO learning to update the DPD model coefficients. We adopt the MP structure, in contrast to other more complex solutions, due to the relaxed linearity requirements of the mmW NR FR-2 systems [7], [14] and also because its reduced complexity makes it appealing for various DPD applications, such as the terminal PA linearization. The overall DPD structure is shown in Fig. 1, in the context of mmW active arrays.
The main technical contributions of this article are the development of several low-complexity solutions for estimating and manipulating the inverse CM (ICM) needed in SO learning algorithms. The ICM gathers statistical information of the DPD input data and the corresponding BF samples, and it is an important ingredient in the learning path to ensure good linearization performance and convergence. This is the most computationally heavy element in the SO learning rule and needs to be recalculated whenever the transmit signal type or its spectral characteristics are modified, to ensure optimal DPD execution. Specifically, we propose the following four methods: 1) method to efficiently modify the ICM by removing rows and columns for a smaller DPD parameterization; 2) method to efficiently modify the ICM when the input signal is frequency-shifted, reflecting, e.g., a change in resource allocation inside the channel bandwidth; 3) method to estimate the ICM from the autocorrelation function of the input data, assuming that it is a complex proper Gaussian distributed random signal. 4) Method to efficiently approximate the technique in 3) by utilizing the Bussgang decomposition. Techniques 1) and 2) can be applied to modify a previously calculated ICM with low computational effort, while techniques 3) and 4) require only an estimate of the autocorrelation function of the input data to build the ICM. The SO learning algorithm in combination with the proposed methods allows for fast DPD coefficient estimation with minimal complexity in time-varying scenarios. Hence, on-chip implementations and continuous tracking of the DPD can become feasible.
Extensive over-the-air (OTA) RF measurement results at the 28 GHz mmW center frequency-the 5G NR band n257 [7]are carried out and reported in order to test and validate the proposed solutions. All measurements feature a state-ofthe-art 64-element active antenna array and 5G NR compliant OFDM waveforms, with signal bandwidths ranging from 200 to 400 MHz. In addition, further measurements with an aggregated signal bandwidth of 800 MHz are also conducted, with the aim of pushing the performance boundaries of the considered DPD system. Altogether, the obtained results, alongside with the complexity analyses, indicate very favorable performance-complexity tradeoffs of the proposed solutions when comparing against the state of the art. Finally, we note that although the experiments in this article consider mmW active array linearization, the developed solutions are applicable also in any more classical single-input-singleoutput DPD system and PA linearization context.
The rest of this article is organized as follows. Section II presents the adopted MP structure and also clarifies the assumptions taken for the mmW setup. Sections III and IV present the proposed low-complexity methods to estimate the ICM, needed for the SO learning rule. Complexity analyses and comparisons are then provided in Section V. Section VI presents the deployed 28 GHz measurement setup and the corresponding FR-2 RF measurement results and their analyses. Finally, conclusions are drawn in Section VII.

II. CLOSED-LOOP DPD SYSTEM
In this section, the MP DPD model utilized in this work is described. Both the processing and learning paths are detailed, where the actual digital predistortion and the DPD coefficient estimation are executed. The MP model is adopted because it is a widely used model and is known to strike a good balance between linearization performance and computational complexity [1], [3], [5], [21], [23]. In addition, the DPD system builds on a closed-loop structure, where the DPD coefficients are estimated from the input signal, x[n], and the observed learning signal, y[n], [31] according to the notation followed in Fig. 1.

A. DPD Main Path
The complete closed-loop MP DPD scheme is presented in Fig. 1. Here, x[n] is the original baseband signal, x DPD [n] is the signal after predistortion, and y 1 (t), y 2 (t), . . . , y R (t) are the individual PA output signals to be transmitted. Following this notation and adopting the classical MP model structure, the input-output relation of the digital predistorter can be written as where M denotes the number of memory taps considered in the model, P corresponds to the polynomial order, and α m, p is the corresponding PA model coefficient for a specific memory tap and polynomial order. Following this configuration, the model has C = (P/2) M coefficients. It is noted that while only odd orders are considered in the above model, there are works [32], [33], which have shown that the use of both odd and even orders can have certain benefits, such as the ability to reduce the overall polynomial order P for given linearization performance targets. The ICM estimation methods derived and described in this article focus on MP systems with odd orders, while the extension to cover also even-order polynomials is an interesting topic for future work in this area. Next, in order to facilitate the closed-loop processing, the input-output relation obtained in (1) can be alternatively expressed with matrix notation as the complex multiplication of the MP BF matrix and the DPD coefficient vector. This expression reads where α α α ∈ C C×1 contains the PA model coefficients and ∈ C N ×C is the original monomial BF matrix with N denoting the total number of the available input data samples.
can be further expressed as = x 1,0 x 1,1 · · · x 1,M−1 x 3,0 x 3,1 · · · x 3,M−1 · · · · · · x P,0 x P,1 · · · x P,M−1 with each vector x p,m ∈ C N ×1 being defined as The order of the BF terms in (3) becomes important for the formulation taken in the methods presented in Sections III and IV. We finally note that in the first DPD iteration, the vector α α α is usually initialized with a one in the first element and zeros in the rest (i.e., α α α = [1 0 · · · 0] T ) such that only the unmodified linear term is passed through the predistorter.

B. DPD Learning Path
In closed-loop DPD systems, the observation signal y[n] is nonlinear in the DPD parameters. Therefore, nonlinear LS techniques need to be applied, such as those based on the GN algorithm [29], [34]. At block iteration k, these iterative algorithms build on the error vector e k = y k − x k that is stacking the prevailing error signal samples of the current processing block of size K . In this article, we focus on the SO and the damped GN learning methods [21], [22], [29]. In Sections III and IV, the SO learning rule is studied in detail and combined with different methods to reduce the involved computational complexity, while the GN serves as a reference solution. To this end, the SO and the GN learning rules read respectively, where μ s and μ g are the learning steps for the updates, k ∈ C K ×C is the BF matrix corresponding to the DPD block iteration k, and R ∈ C C×C is the CM of the input BF vector, defined as In above, ψ[n] is the instantaneous input vector at time instant n (i.e., a row of ), and N is the total number of available samples over which R is estimated. If the input signal does not vary significantly within DPD iterations, the CM can be estimated off-line and then kept fixed, avoiding repetitive online calculations. For later purposes, we also already define the ordinary autocorrelation function of the DPD input signal x[n], as well as its sample estimate, as In general, it is important to note that the term ( H k k ) −1 in the GN learning rule essentially calculates the (conjugate of the) ICM at every DPD iteration. This provides an accurate description of the statistics of the input signal at every iteration but also involves heavy computational complexity. For this reason, the GN method is considered as the baseline highperformance solution against which the linearization performance is compared. At the same time, the SO approach aims at drastically reducing the computational complexity by obtaining the ICM with different alternatives, described in Sections III and IV, while still achieving a linearization performance close to GN. We further note that the learning solutions presented in (5) and (6) are generic learning approaches for linear-in-parameters models and can also be applied to other models, such as GMP [3], Volterra DDR [35], [36], or even LUT-based solutions [29], [37].
Finally, it is noted that one common alternative in the DPD developments is to deploy BF orthogonalization or whitening [21], [23]. Such an approach can also facilitate utilizing low-complexity learning algorithms, even block-LMS [22]. However, the BF whitening also implies increased DPD main path complexity-even more so in case of dynamic input signals, implying that also the whitening matrix should be recalculated. Hence, in this work, we do not consider BF whitening in the DPD main path.

C. Observation Receiver Configuration in mmW Active Array Systems
In the context of mmW phased-array systems, there exist various alternatives for arranging the ORX in order to obtain the combined learning signal-y[n] in Fig. 1-used for DPD parameter learning. One particular approach is the hardwarebased method, in which each PA output signal is coupled, phase-aligned, and combined in hardware [11], [12], [38]. Another plausible solution is to configure a separate ORX, which captures the combined OTA received signal and feeds it back to the DPD systems for DPD learning [13], [39]- [41]. These two alternatives are presented, as plausible solutions, in Fig. 1. Both of these alternatives mimic the far-field combined signal at a distant receiver, in case of line-of-sight propagation.
It is important to clarify that the DPD learning solutions and methods presented within this article do not depend on the actual way of obtaining the combined learning signal used for parameter estimation. It is noted, however, that the hardwarebased approach does not suffer from the common OTA-related challenges, such as ORX misalignment and positioning challenges, environmental dependencies, and beam dependence of radiated nonlinear distortion. We also note that there exist several works in the literature [39]- [41], where the OTA ORX beam misalignment challenges are studied, proposing different solutions to provide the far-field signal in the main beam direction. Furthermore, our adopted mmW measurement setup, featured in Section VI, aims at mimicking the hardware-based feedback system by carefully aligning a horn antenna in the main beam, acting as the OTA ORX.

III. LOW-COMPLEXITY METHODS FOR MODIFYING THE ICM
If the input signal to the SO algorithm remains static, the ICM can be precalculated off-line, saving complexity in the DPD learning path. However, whenever the input signal changes, the ICM needs to be estimated again in order to provide optimal linearization performance. The process of calculating the ICM-or, alternatively, the term ( H k k ) −1 in the GN learning rule (6)-is computationally heavy, as it involves calculating the inverse of the product between the BF matrix and its Hermitian transpose. The size of the BF matrix, k , is K × C, which is typically large in DPD implementations. To ease this calculation, we propose in Sections III-A and III-B two methods for modifying the ICM when the input signal or the parameterization of the DPD is changed. We finally note that these solutions are applicable to any DPD models that are linear in the parameters, not just the MP.

A. Reducing the Dimensionality of the ICM
The first considered method avoids the need of recalculating the ICM if a simpler parameterization is desired in the DPD system. A simpler parameterization might be adopted when lowering the transmit power or when changing from a high-order modulation to a lower order modulation with less stringent EVM requirements. Furthermore, in mmW active array linearization, where the severity of the nonlinear distortions varies as a function of the beam direction, such a technique may be particularly useful in optimizing the power consumption of the transmitter.
To this end, we first generate a generic ICM obtained with a large DPD parameterization. If a smaller polynomial order or memory depth is desired, the proposed method then removes the appropriate rows and columns from the generic ICM and obtains a new ICM submatrix, which corresponds to the new chosen parameterization. The new submatrix can then be used in the SO learning equation to estimate the new DPD coefficients.
The proposed method can be described as follows. Let us consider a generic ICM, obtained from the complete set of BFs,˜ . This matrix is known and it is used to find the reduced ICM subset The BF matrices,˜ and , are mutually related as where v ∈ C N ×1 represents a column in˜ that needs to be removed. The generic CM can be now written as and its block inverse can be directly obtained by using the Schur complement [42] of (11) as where Finally, the system of linear equations in (13) can be directly solved to obtain ( H ) −1 , which corresponds to the desired ICM subset, expressed as We note that removing one column from the BF matrix,˜ , corresponds to removing one row and one column from the ICM,Q −1 . The proposed method requires moving the row and column to be removed to the end of the matrix and can be executed iteratively to cut out more than one row/column. In Algorithm 1, we present an example pseudocode that shows how to remove l rows and columns fromQ −1 ∈ CC ×C , obtaining the new ICM submatrix

Algorithm 1
Pseudocode showing how l rows and columns can be removed fromQ −1 to obtain the subset Q −1 . The generic ICM, given as input data, is conceptually presented in (12) Permute row/column to the end ofQ −1 ;

B. Frequency Shifting the ICM
In the second considered scenario, we study how to avoid the ICM recalculation when the DPD input signal is digitally frequency shifted. In such a case, the spectral components of the signal are modified, and thus, its CM and ICM will no longer be the same. In order to provide accurate linearization performance, the new ICM needs to be estimated again by some means. To that end, we propose a novel method that shifts the original ICM, denoted asQ −1 = (R * ) −1 , in the frequency domain to obtain the resulting shifted ICM, denoted as Q −1 = (R * ) −1 . The computational complexity of this method is very low compared to recalculating the whole ICM again, as shown in Section V. A good example use case of such frequency shifting is when the uplink spectral allocation of a terminal/UE transmitter changes within a given channel bandwidth.
The proposed method is constructed as follows. Let us denote the original unshifted digital input signal asx [n] and the corresponding shifted signal as x [n]. A frequency shift of f , in Hz, can be mathematically expressed in the time domain as the multiplication with a sampled complex exponential, and thus, the shifted digital signal reads where ω = (2π f )/ f s denotes the normalized angular frequency shift, while f s refers to the sample rate. The next step is to obtain the shifted BF matrix, , exclusively as a function of the unshifted signalx[n] and the normalized frequency shift ω. This can be done by substituting (19) into (3) and (4). For simplicity, we just present the first row of the resulting , shown in (9) at the bottom of the next page, considering that e j ω n = 1. Due to the formulation taken in (9), Q can also be expressed as a function of Q and a multiplying factor that depends on the frequency shift ω. This factor appears in the BF terms with different delays, caused by the distinct memory terms appearing in the BF matrix. Specifically, we observe 0-sample-delay terms, 1-sample-delay terms, and, in general, m-sample-delay terms, appearing in the CM, expressed as 0 -sample-delay: The CM is generally a Hermitian-Toeplitz matrix, and the way how the different sample delay terms are scattered through it can also be seen from (28), where the value of τ within the autocorrelation function R X (τ ), defined formally in (8), indicates the corresponding delay. Hence, the zerodelay samples correspond to the main diagonal of the M × M submatrices within the CM, the one-delay samples correspond to the second diagonal term of these submatrices, and so on. From this structure, it is then relatively easy to see that the frequency-shifted CM can be written as where the diagonal matrix P ∈ C C×C applies the right shift weights for each column ofQ, and it is defined as Defining the CM as in (21) will allow for applying simple yet efficient matrix inverse properties. As a result, the shifted ICM can be directly obtained from (21) as Note that only the original ICM,R −1 , and the frequency shift, ω, are needed to obtain the final shifted ICM, R −1 .
IV. LOW-COMPLEXITY ICM ESTIMATION FOR GAUSSIAN SIGNALS This section presents two novel methods to estimate the ICM in a computationally efficient manner. These methods build on the autocorrelation function of the input signal and provide a set of precomputed coefficients that can be utilized to easily estimate the ICM. The first method is an application of Reed's theorem [43] for calculating the CM, while the second technique utilizes the Bussgang theorem and other stochastic analysis to provide an approximate CM. In the latter case, the resulting CM can be shown to consist of a specific subblock structure expressible as a Kronecker product, allowing for efficient inversion. We note that, since we consider complex Gaussian process theory, we assume the Gaussian distributed input signals in this section. OFDM signals are known to converge toward Gaussian, as the number of subcarriers is large [44]. This claim is further substantiated in Section VI.

A. ICM Estimation With the Autocorrelation Function
The first proposed method utilizes the moment theorem for complex-circular Gaussian signals presented in [43]. We specifically aim at expressing the CM as a function of the autocorrelation function of the input signal. To that end, the autocorrelation function is first estimated over the desired span with (8). Reed's theorem states that all odd-order central product moments are zero, while all even-order moments can be obtained through sums of products of the second-order moments, expressed in the general case as [43] Here, for notational simplicity, x m = x[n − m], and ζ constitutes a permutation of the set {0, 1, . . . , (k − 1)}. Noting that all terms in the CM have an even order, and following the specific BF order assumed in (3), each term in the CM can be obtained by applying (25).
For clarity, an example of the application of Reed's theorem to the fourth-and sixth-order moments is presented in (18), as shown at the bottom of the previous page, while other higher order terms can be obtained similarly. The definition of the autocorrelation function in (8) has been used to arrive at the last expressions in (18). It is noted that the expression presented in (25) allows to obtain the higher order terms recursively. It is also noted that the specific terms of the form E{|x[n]| 2 p } can be alternatively obtained as which may simplify their calculation.
Once Reed's theorem has been applied to each individual term in the CM, we can express the CM with a specific subblock matrix structure. Specifically, the CM can be expressed with a (P/2) × (P/2) submatrix structure, each of size M × M, written as presented in (24), as shown at the bottom of the page. Here, the submatrix R k,l is in turn defined as where the subscript in ψ[n] indicates the corresponding element within the instantaneous input vector. Due to the notation utilized in (24) and (27), such matrices can accommodate any polynomial order, P, and memory depth, M, chosen in the DPD system. Note that only the calculation of the upper triangular submatrices in R is required to build the complete CM. The CM is then inverted to generate the resulting ICM, thereon used in the SO learning equation.

B. ICM Approximation With Bussgang Coefficients
In the second proposed method, we aim at simplifying the previous method by using Bussgang's decomposition. This theorem states that the cross correlation of a Gaussian signal x[n] and a signal y[n] = f (x[n]) that has passed through an instantaneous nonlinear function f (·) can be expressed as the product of the autocorrelation function of x[n] and a scaling constant. Formally, this is expressed as where is the definition of the cross correlation function, while the autocorrelation function R X (τ ) reads as in (8). The Bussgang coefficient, ξ , is obtained through the complex proper Gaussian probability distribution function (PDF) as where f (·) represents the nonlinear amplitude distortion and σ 2 x is the variance of x[n].
In general, the second-order terms appearing in the CM, on its first row, can be expressed directly as a function of The higher order terms appearing in the CM (i.e., x[n]|x[n]| p , p = 2, 4, . . .) can be seen as nonlinear functions of x[n] and can thus be expressed as a function of the autocorrelation according to the Bussgang theorem [45].
Using (29), we can now express all the remaining high-order terms in the CM uniquely as a function of R X (τ ) as where the coefficients ξ 1 , ξ 2 , . . . can be obtained using (31). By having all the terms as a function of R X (τ ) and the Bussgang coefficients, the complete CM can be expressed through them. A concrete example is presented in (28), as shown at the bottom of the next page, which is showing the CM structure in the specific case of P = 3 and M = 1 while also assuming that ξ 0 = 1. Each matrix element, which depends on the input data signal, can be then obtained by using the explicit expressions shown in (7) and (31).
We next note that the order of the BFs in the BF matrix becomes important to replicate this particular structure. By further studying (28), we realize that it can be formulated with a specific subblock matrix notation that will allow for efficient Kronecker inversion. The CM can be alternatively simplified in M × M subblocks, which are weighted with a different Bussgang coefficient and are propagated through R as where the subscripts indicate the corresponding Bussgang coefficients. In addition, the submatrix R k ∈ C M×M reads Note that only the submatrices appearing on the first row and the last column of (39) need to be calculated to build the whole CM.
Finally, due to the well-structured formulation in (39), the CM can be efficiently inverted using different matrix inversion methods. One possible solution is to use blockrecursive matrix inverse algorithms, such as the one in [46], which poses drastically reduced complexity when compared to ordinary matrix inversion. An alternative and yet more efficient solution is to use the Kronecker inversion [47], [48]. To this end, we first note that the CM can be equivalently expressed with the Kronecker product as where ∈ R (P/2) × (P/2) contains the complete set of Bussgang coefficients. Then, the ICM is directly obtained as the inverse of its elements as The Kronecker inversion further reduces the computational complexity of inverting R, as demonstrated in Section V.

V. COMPLEXITY ANALYSIS AND COMPARISON
In this section, we present the computational complexity involved in the proposed DPD and ICM estimation solutions. The processing complexity is divided in three different stages-the DPD main path (Section II-A), the DPD learning path (Section II-B), and the ICM-related calculations (Sections III and IV). All the analyses are carried out in terms of real multiplications per K -sized DPD iteration (i.e., for one block iteration with data size of K samples). Multiplications constitute, in general, the most resource-intensive operations in digital signal processing (DSP) implementations, while additions can be considered to be essentially free [15], [49]. In this study, it is assumed that one complexcomplex multiplication is calculated with four real multiplications, and one real-complex multiplication is calculated with two real multiplications. In addition, the matrix inversion of an m × m matrix is assumed to cost 4 m 3 real multiplications [42], [46].
As noted, the first considered stage is the DPD main path, where the actual predistortion is applied. Here, the complete BF matrix [i.e., in (3)] is first calculated from the input signal x[n], assuming its recursive calculation through previously obtained values. Later, the actual predistortion processing is carried out as in (2). The second considered stage is the DPD learning path, where the closed-loop learning equations are applied to update the DPD coefficients. Here, we consider both the reference GN algorithm and the SO learning approach. In this analysis, in case of SO learning, we express the learning complexity first for a given ICM while noting the ICM calculation complexity then separately. The GN approach, in turn, needs to calculate the ICM estimate ( H k k ) −1 in every DPD iteration-by its very definition. Finally, a complexity assessment and comparison between the classical ICM calculation and the proposed ICM estimation methods are provided. The classical ICM estimation complexity directly results from the calculation of ( H ) −1 . The complexity of the proposed ICM estimation methods, in turn, is determined following the exact processing principles presented in Sections III and IV. Table I gathers the obtained complexity analysis results for the considered methods. The table presents the required real multiplications in each stage of the DPD solution, in a general form as a function of the modeling parameters. In addition, the final column presents a specific numerical example when P = 9, M = 4, C = 20, l = 5, and K = 20 000, in order to provide a concrete and representative example of the involved complexities. Such DPD parameterization is similar to the one chosen to perform the RF experiments and validation in Section VI. From the numerical example, it can be clearly observed how the computational complexity is reduced against the reference GN method when utilizing the SO learning solution in combination with the proposed ICM estimation methods. In all cases, the learning complexity is drastically reduced while still achieving a similar linearization performance, as demonstrated in Section VI. The quantitative results also demonstrate the very large reduction in the needed multiplications, achievable through the proposed ICM estimation methods, compared to the direct ICM calculation. For reference and comparison purposes, we also address the complexity of a classical MP model with BF prewhitening or orthogonalization while considering block-LMS as the DPD coefficient update rule. With block-LMS, BF prewhitening becomes a necessary ingredient stemming from the large eigenvalue spread of the input signal CM. This process ensures that reasonably fast and stable learning convergence can be obtained by the simple LMS algorithm, which essentially uses a diagonal matrix to approximate the ICM [22]. The BF prewhitening is done in the DPD main path and is assumed here to build on the well-known Cholesky decompositionbased method [23], [42], which provides an upper triangular orthogonalization matrix. The overall main path complexity of such an approach is then the one presented on the fourth row of Table I plus the extra cost of the prewhitening stage that is a matrix-matrix product. Due to the triangular orthogonalization matrix, the extra complexity is of the form 4K (C(C + 1)/2). In the numerical example case of Table I, the corresponding overall main path complexity is 18.6M real multiplications, reflecting a substantial increase in comparison to the methods that do not require prewhitening. On the other hand, the cost of updating the DPD coefficients in the block-LMS learning rule [50] is given by 4C K + 2C, which leads to 1 600 040 real multiplications-a number that is of the same complexity order as the SO solution when combined with any of the proposed ICM estimation methods. This analysis thus shows that the main path complexity increases substantially when BF orthogonalization is deployed, while the corresponding decrease in the learning complexity through using block-LMS is fairly minor when the ICM calculations in the reference SO solutions build on the proposed methods. In addition, we note that in these complexity calculations, the prewhitening matrix has been considered to be precalculated off-line. However, when considering DPD applications for example in terminal transmitters, where the transmitted waveforms are very dynamic (the resource block allocation and/or output power can change on a per-slot basis), such an assumption may not be feasible. This, in turn, means that in such dynamic scenarios, also the prewhitening matrix may need to be recalculated online during the learning procedures, further increasing the corresponding online processing burden.
Finally, based on the results in Table I, we separate three possible alternatives regarding the SO learning approach in combination with the ICM estimation.
1) The SO learning rule is applied in a dynamic waveform scenario using classical ICM calculation based on the sample estimate of the CM in (7). The ICM is thus calculated each time the waveform changes. If this happens frequently, the complexity approaches that of the GN. 2) The SO learning rule is applied in a dynamic waveform scenario using the proposed ICM estimation methods. Computational complexity may be drastically reduced compared to the previous when waveform changes are frequent.
3) The SO learning rule is applied in a static waveform scenario. In this case, the ICM is fixed and only needs to be computed once. Thus, the overall complexity is the smallest.

VI. RF MEASUREMENTS
In order to test, verify, and validate the proposed methods and algorithms, an extensive set of RF measurements is provided, in the context of an FR-2 mmW device. The deployed system features a state-of-the-art 28 GHz 64-element active phased array, acting as the nonlinear element which the proposed methods aim at linearizing. The measured 1-dB compression point of the array is +41 dBm, and the operating FR-2 band is 5G NR band n257. Once the measurement results are obtained, they are analyzed together with the quantitative complexity analysis results provided in Section V, to assess the performance-complexity tradeoffs of the proposed solutions.
In the context of mmW measurements with multipleelement phased arrays, several issues are to be noted. First, in an R-antenna array, there are also R parallel PAs, one per antenna unit. Each of these amplifiers will basically have unique nonlinear characteristics, and hence, the predistorter can only guarantee an efficient linearization performance in the main beam direction. In the rest of the angles, the intrinsic beampattern of the array suppresses the OoB nonlinear distortion [11]. Another concern is the load modulation of the PAs, occurring due to coupling between the antenna elements [8]. This makes the effective nonlinear characteristics of the antenna array beam-dependent, thus tying the DPD solution to the beam direction. Consequently, linearization methods must take considered. This issue can be tackled with, for instance, real-time DPD tracking, which is able to estimate the DPD coefficients as the beam is steered. Since the parameter estimation needs to be done in real time, it is crucial to explore ways of minimizing the involved computational complexity while at the same time being able to provide the needed linearization performance.
In this work, the DPD performance is evaluated through the well-known EVM [15], normalized mean square error (NMSE) [15], and the TRP-ACLR metrics since an OTA system is considered [7]. The TRP-ACLR measures the OoB performance by computing the ratio between the filtered mean power centered on the assigned channel frequency and the filtered mean power centered on an adjacent channel frequency, measured by integrating the powers over the whole beamspace while keeping the beamforming angle fixed [7], [8].

A. 28 GHz Active Array Experimental Setup
The complete 28 GHz measurement setup is shown in Fig. 2. The experimental setup is configured as follows. First, a Keysight M8195A arbitrary waveform generator (AWG) is deployed to output the modulated I/Q waveform at 2.5 GHz intermediate frequency (IF). Then, a Keysight N5183B signal generator provides the LO signal at 24.5 GHz that together with a Marki Microwave T310401741 mixer further upconverts the signal to the 28 GHz band. A Marki Microwave FB3300 bandpass filter (BPF) is applied immediately afterward to suppress the mixer-induced image frequencies. Two preamplifiers, AD HMC499LC4 and AD HMC1131, are deployed to guarantee a sufficiently high power at the input of the active antenna array such that it can be pushed close toward saturation. The 64-element active array-the Anokiwave AWMF-0129-is then configured to transmit the signal OTA toward the horn antenna of the ORX. Both antennas are mounted on different electrical tripods capable of providing the required rotation in azimuth and elevation. For DPD learning and verification, both antennas are perfectly aligned when transmitting/receiving, as noted in Section II-C. For simplicity, the beamforming angle of the TX antenna array is set at 0 • . The combined learning signal captured by the ORX is then attenuated and fed back to a Keysight UXR0402AP oscilloscope, which is acting as the digitizer. Finally, the digital signal is sent to a host PC for further processing and/or performance assessments. We note that the Anokiwave AWMF-0129 does not allow for actual hardware-based combiners for feedback, and hence, we adopt the carefully aligned OTA ORX to mimic such hardware-based processing.
The modulated signals adopted in the following experiments are 3GPP 5G NR Release-15 FR-2 compliant OFDM waveforms, with the subcarrier spacing (SCS) and RB allocation specified in each particular experiment. The sampling rate of the signals and DPD execution is 2 GHz. As an additional ingredient, which aims at pushing the performance boundaries of the DPD system, we also consider a wider, nonstandardcompliant, signal bandwidth of 800 MHz in some of the experiments. This is obtained by doubling the number of active subcarriers and also the OFDM waveform processing FFT size, compared to the standard-compliant 400 MHz signal. In addition, the initial PAPR of the generated signals is approximately 12 dB when measured at the 0.01% point of the instantaneous PAPR complementary cumulative distribution function (CCDF) [2]. This value is then limited through well-known iterative clipping and filtering-based processing to 8 dB, measured also at the 0.01% point of the instantaneous PAPR CCDF. Clipping the waveform, however, strictly speaking degaussianizes the input signal distribution, as it basically Fig. 3. PDF of the generated 5G NR FR-2 OFDM waveform with 400 MHz carrier bandwidth, including also PAPR mitigation to 8 dB through iterative clipping and filtering. The shown PDF reflects the real part of the complex I/Q waveform, while that of the imaginary part is essentially identical. For reference, also the theoretical normal PDF is shown. removes the high-and low-end values of the amplitude PDF. The difference to the Gaussian distribution is nevertheless very small, as can be observed through the illustration in Fig. 3, which shows the PDF of the generated NR-compliant OFDM signal with clipping and filtering versus an ideal Gaussian PDF. Finally, after the PAPR mitigation, additional timedomain windowing is also applied to suppress the inherent OFDM signal sidelobes.
In each DPD iteration, a block of K = 20 000 pseudorandomly generated OFDM signal samples is circularly transmitted, received, and utilized to estimate the DPD coefficients. The K -sized closed-loop error signal vector, e k , is generated as the subtraction between the transmitted and received data samples and essentially contains the prevailing PA distortion samples. This term is then used as an input of (5) or (6) to estimate the DPD coefficients. Such transmit/receive process is then repeated until the algorithms reach convergence. In all the experiments, the MP DPD model is configured with a ninth-order polynomial (i.e., P = 9) and with four taps of memory (i.e., M = 4). The DPD coefficients are initialized as α α α = [1 0 · · · 0] T such that just the linear term is obtained after predistortion in the first iteration. In the measurements where the ICM is precalculated off-line, the CM is estimated from a 500 ksamples sequence and inverted before the DPD processing. It will remain fixed for the rest of the iterations unless otherwise mentioned.

B. Baseline DPD Performance
In this section, we study the linearization performance of the SO and the reference GN learning methods. The measurement results are carried out with both signal bandwidths of 800 MHz (120 kHz SCS, 528 RBs) and 400 MHz (120 kHz SCS, 264 RBs), measured at a highly nonlinear operation point of the active phased array. Specifically, this experiment is performed with effective isotropic radiated power (EIRP) of approximately +40.5 dBm, corresponding to an input power of approximately −6 dBm. This leads to an initial TRP-ACLR of approximately +22 dBc and an EVM of some 11.6%, reflecting a highly nonlinear operation point.
First, Fig. 4(a) presents the obtained linearization results with a signal bandwidth of 800 MHz for both SO and reference GN learning methods. We assume for the SO method a precalculated ICM, which is then kept fixed through the DPD iterations. The GN learning rule, in contrast, calculates the ICM estimate ( H k k ) −1 in every DPD iteration. As can be seen from the figure, the linearization performance of the SO method is practically matching that of the GN, with TRP-ACLR numbers of about +32.5 dBc and EVM values below 6%. This performance is also achieved with the SO solution regardless of the rough ICM off-line estimation. Secondly, Fig. 4(b) presents the results with a signal bandwidth of 400 MHz. Similar conclusions can be drawn regarding the behavior of the SO and GN learning methods. In this case, the TRP-ACLR is around +35 dBc and the EVM is below 5.5% for both SO and GN methods. What is more important, we also include a third measurement that shows the linearization behavior of the SO learning method when the ICM is calculated from the previous 800 MHz signal (i.e., the same as the one utilized in Fig. 4(a)). As shown in the figure, the DPD performance is only slightly degraded in comparison to the normal SO learning rule, with the TRP-ACLR and EVM being around +34 dBc and 5.7%, respectively. This is because the ICM has been obtained from a wider signal than the one being transmitted. However, the SO with an ICM obtained from the 400 MHz signal would not work to linearize the wider 800 MHz signal. We finally note that all measured methods satisfy the TRP-ACLR limit of +28 dBc and EVM limit of 8% (for 64-QAM), as stated in 3GPP specifications [7].

C. Reducing the ICM Dimensionality
In this second experiment, we study and highlight the obtained numerical precision of applying the method, which removes rows/columns from the ICM to get a new covariance set for a DPD system with reduced parameterization. We recall that the rows/columns that are to be removed from the ICM are shifted to the last row/column, and then, this algorithm is repeated iteratively until all the targeted elements are successfully removed.
To this end, we perform the following experiment. We first generate an ICM of size 36 × 36 corresponding to P = 11 and M = 6. We then execute the proposed algorithm described in Section III-A iteratively until l = 16 rows and columns are removed, leading to a final ICM of size 20 × 20 elementscorresponding to P = 9 and M = 4. Parallel to that, we mathematically calculate, using (7), the corresponding ICM of size 20 × 20 (also corresponding to P = 9 and M = 4) and compare it to the estimated version. The obtained numerical error when comparing the two versions is in the order of 10 −13 . This order of magnitude can be considered negligible in any DPD implementation, thus demonstrating the effectiveness of the considered technique. A similar experiment is next repeated for reducing the DPD parameterization down to P = 7 and M = 3, corresponding to l = 24. The numerical error in this case between the ideal and   5. 400 MHz 5G-NR OTA linearization performance at EIRP ≈ +40.5 dBm for the SO method, with emphasis on the ICM shifting technique proposed in Section III-B, when a frequency shift of (a) −200 and (b) +400 MHz is applied. The linearization performance of the GN method is also included for reference. Note that the "SO DPD, original ICM" and "SO DPD, est. shifted ICM" curves practically overlap. estimated ICMs is in the order of 10 −11 , reflecting again very high accuracy. Finally, we note that no PSDs are visualized along this experiment, as with such very low numerical errors, the PSDs corresponding to the calculated and estimated ICMs are essentially completely overlapping.

D. Frequency Shifting the ICM
In this third experiment, we study the linearization performance of the SO learning rule in combination with the ICM shifting method proposed in Section III-B. The measurements are carried out with a signal bandwidth of 400 MHz (120-kHz SCS, 264 RBs), while the EIRP, the array's input power, and thus initial TRP-ACLR and initial EVM values are maintained as before.
The experiments to test the proposed method are carried out as follows. First, the input 5G NR 400 MHz baseband signal is generated, and then, two different example digital frequency shifts of −200 and +400 MHz are applied to it. This generates the resulting shifted RF signals, which are centered at 27.8 and 28.4 GHz. We report the linearization performance of the SO solution with: 1) the unshifted ICM; 2) the original ICM (i.e., estimated from the shifted signal); and 3) the ICM generated with the proposed method, presented in Section III-B. The performance of the GN solution is also presented as a reference. As can be seen from the obtained results in Fig. 5, the SO with the unshifted ICM does not converge to any solution since the ICM is not able to describe the statistics of the shifted signal. Second, the SO with the original ICM provides the optimal linearization performance, essentially equal to GN, since the ICM is directly calculated from the shifted signal. Third, the SO with the ICM estimated using the proposed method in Section III-B also provides optimal performance, despite the substantially lower associated complexity. The TRP-ACLR and the EVM Fig. 6. OTA linearization performance at EIRP ≈ +40.5 dBm for the SO method when the ICM is estimated through the two novel methods presented in Section IV, with the signal bandwidth of (a) 200, (b) 400, and (c) 800 MHz. The GN method is also measured and shown for reference. numbers for the proposed method are in both cases maintained above +31 dBc and below 5.6%, respectively, thus satisfying the 3GPP standards [7] at FR-2. This performance is equal to that of the GN method, which involves heavy complexity in each DPD iteration. These results verify the effectiveness of the proposed ICM shifting method in terms of linearization performance.

E. Estimating ICM From the Autocorrelation Function
This fourth experiment validates the two novel methods presented in Section IV, which estimates the ICM from the autocorrelation function of the input data. These techniques are tested with three different signal bandwidths of 200 MHz (120 kHz SCS, 132 RBs), 400 MHz (120 kHz SCS, 264 RBs), and 800 MHz (120 kHz, 528 RBs). In all cases, the ICM is estimated using the proposed approach described in Section IV and then injected in the SO learning rule. The GN method is also measured for reference, while all the remaining system parameters are maintained as explained before.
The obtained measurement results are presented in Fig. 6. With a bandwidth of 200 MHz, the SO DPD with autocorrelation ICM estimation achieves an excellent linearization performance, very close to that of the GN approach. The SO DPD with the Bussgang ICM estimation lies somewhat behind but still obtains a good amount of linearization. This difference is reduced when considering the wider bandwidths of 400 and 800 MHz, in which the performance of the proposed novel solutions is very close to each other and also to the reference GN model. It is noted that there is no direct theory-based reason why the performance gap between the autocorrelation-based ICM estimation and the Bussgang ICM estimation methods is largest in the case of 200 MHz bandwidth. The difference can be stemming, e.g., from a slight change in the hardware setup, or by a minor movement in the position of the person handling the measurements. In all cases, the SO solution in combination with the methods proposed in Section IV successfully satisfies the +28 dBc TRP-ACLR and the 8% EVM limits while showing a remarkable complexity reduction in estimating the ICM.

F. Convergence Analysis
We continue the experimental results with a convergence analysis of the proposed DPD methods in a static waveform scenario. For this analysis, we consider the signal bandwidth of 800 MHz (120-kHz SCS, 528 RBs, 625 ksamples) and present the convergence behavior of the SO method when: 1) ICM is estimated normally from the input signal; 2) ICM is estimated with the autocorrelation method presented in Section IV-A; and 3) ICM is estimated with the Bussgangbased method presented in Section IV-B. The convergence speeds of the remaining methods are the same as that of 1), and thus, they are not shown separately. The remaining DPD parameters are maintained as stated above.
The convergence behavior of the DPD models is presented in Fig. 7, in terms of the TRP-ACLR and the NMSE. As can be seen from the figure, the TRP-ACLR convergence of the GN, the SO with classical ICM calculation, and the SO with autocorrelation-based ICM estimation is very similar, reaching the steady-state performance in around 15 CL iterations, after which the DPD behavior stabilizes at around +32 dBc TRP-ACLR. The convergence speed of the SO with the Bussgang ICM estimation is slightly slower, reaching the steady state in around 20 block iterations. However, the final values of TRP-ACLR are similar compared to the previous methods. This behavior is expected since the Bussgang-based method provides only an approximate ICM. Similar conclusions can be drawn from the second figure, which shows that the NMSE obtained with the different methods. These results show the favorable performance of the proposed SO solutions, which together with the achieved complexity reductions demonstrates very appealing performance-complexity tradeoffs.

G. DPD Learning Complexity With Different ICM Methods
In this final subsection, we gather the quantitative DPD learning complexity numbers obtained with the proposed ICM estimation methods and corresponding to the DPD parameterization of P = 9, M = 4, C = 20, l = 5, and K = 20 000. Here, we measure the complexity in terms of average number of real multiplications per linearized sample, assuming that the DPD learning algorithms are executed over 15 closed-loop iterations-which is generally the number of iterations that the Fig. 7. OTA convergence performance with 800 MHz waveform bandwidth and EIRP ≈ +40.5 dBm for the SO method and its variants, in terms of (a) TRP-based ACLR and (b) NMSE. The GN method is also measured and shown for reference.  Table I, together with the corresponding relative complexity reduction with respect to the reference GN method, are presented in Table II. As can be seen from the table, excellent complexity savings can be obtained through the proposed solutions. In all cases, the numerical average complexity is drastically reduced when comparing with the reference GN method. This is also reflected in the achieved percentage reduction, which is in most cases larger than 90%. Keeping in mind that all the proposed methods satisfy the 3GPP transmit waveform quality specifications [7], as shown through the obtained results in Section VI-B-E, the combination of efficient linearization performance and low processing and learning complexity is paving the way toward fast and continuous DPD adaptation with on-chip real-time implementations in commercial systems.

VII. CONCLUSION
In this article, various methods to efficiently estimate the inverse of the input data CM were proposed and combined with SO closed-loop learning in a DPD-based linearization context. The inherent low complexity of the SO learning combined with the proposed methods, allowed for remarkably reduced complexities in the DPD system, while maintaining a similar linearization performance compared to state-of-theart solutions. To substantiate this, complexity analyses dealing with the proposed solutions were performed, and thorough RF measurement results at 28 GHz mmW band, featuring a 64-element active antenna array, were presented. The obtained results, both in terms of performance and complexity, indicated very favorable performance-complexity tradeoffs of the proposed methods when comparing against the state of the art. The proposed methods are thus promising candidates for linearizing mmW phased-array transmitters as well as cellular terminal transmitters, the processing complexity being a key concern in both applications. Since 2016, he has been a University Researcher with the Department of Electrical Engineering, Tampere University. From 2016 to 2017, he was a Visiting Research Fellow with Aalto University, Finland. His research interests are in radio communications and signal processing, with a focus on the radio implementation challenges in systems such as 5G, full-duplex radio, and large-scale antenna systems.