Sparse Flexible Reduced-Volterra Model for Power Amplifier Digital Pre-Distortion

The Volterra framework plays an important role in digital pre-distortion (DPD) for power amplifier (PA) linearization. In practice, the full-Volterra (FV) model is avoided due to the so-called curse of dimensionality, which has motivated several Volterra pruning techniques in the literature. However, choosing the pruned-Volterra structure with a given complexity that most accurately describes a PA or DPD with unknown characteristics often requires trial and error. The motivation of this article is to propose a reduced-Volterra model that extends traditional memory polynomials, e.g., the generalized memory polynomial (GMP), by including higher-dimensional terms in a flexible and controlled way, avoiding the curse of dimensionality of the Volterra model and resulting in a model that applies to a wide range of PAs and DPDs. Furthermore, the article investigates the parsimonious sizing of the proposed model using least absolute shrinkage and selection operator (LASSO) and sparse-group LASSO (SGL) convex optimization, significantly reducing the model's running cost while preserving its performance. In our experimental validation, the sparse versions of the proposed model achieved 3 dB better adjacent channel power ratio (ACPR) than the conventional GMP model for the same or lower DPD running cost, at the expense of a more costly estimation, which is performed off-line.


I. INTRODUCTION
The Volterra framework [1], [2] plays an important role in digital pre-distortion (DPD) and baseband modelling of high-power and wideband nonlinear (NL) power amplifiers (PAs) [3], [4], [5], [6], although, in practice, the full-Volterra (FV) model is avoided due to the large number of coefficients to be estimated, the length of the corresponding filter and its on-line running cost. The so-called FV curse of dimensionality has motivated several Volterra pruning techniques in the literature, the most prominent being the memory polynomial (MP) [7], the generalized memory polynomial (GMP) [8], the dynamic deviation reduction (DDR) [9], [10] and block-oriented [11], [12] behavioural models. Moreover, several specialized models have also been proposed, each taking into account additional nonlinearities, e.g., [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], and [23]. Nevertheless, choosing among several candidates with a given complexity the most accurate pruned structure to describe a PA or DPD with unknown characteristics often requires trial and error. It is not guaranteed a priori that any such memory polynomial or block-oriented model adequately fits a NL system when no physical knowledge is available. This is one major challenge with pruned-Volterra behavioural models, such as the MP, GMP and DDR, compared to models developed from physical knowledge of the systems, such as PA operating/environmental conditions. The latter, in contrast, are difficult to derive, typically more complex and less general [12]. Another challenge with behavioural modelling concerns sizing, or dimensioning, i.e., setting appropriate values for the model parameters.
Typically, the search for the best features (or regressors) to be added to a model of a given complexity is a combinatorial task, performed by exhaustive search [24], becoming a non-deterministic polynomial-time (NP-hard) problem [25]. Indeed, although heuristic methods, such as hill-climbing [26], particle swarm optimization (PSO) [27], [28] and evolutionary algorithms [29], have been proposed, they still suffer from scalability issues [24], [30].
Furthermore, PA/DPD models usually admit parsimonious/sparse approximations [31], [32], [33], [34], [35], i.e., only a few of the PA/DPD model coefficients are relevant, either because the physical system is sparse or because the original model is over-parametrized. The search for the optimum sparse set of model coefficients based on the ℓ0 pseudo-norm penalty is a known NP-hard problem; thus, among several techniques [36], most of the attention has been focused on the convex ℓ1-norm relaxation, i.e., the least absolute shrinkage and selection operator (LASSO) [37], [38], and on ℓ0 approximations, such as the orthogonal matching pursuit (OMP) [39], [40]. In some PA/DPD papers, sparse techniques have been applied to the FV model, such as LASSO in [31] and [32] and greedy OMP in [34]; however, using the FV model is more costly than considering a pruned model. In [33], LASSO is applied to the parallel Hammerstein (PH) model; in this case, the obtained performance may suffer from limitations of the initial model. Finally, in [35], the sparse compressive sampling matching pursuit (CoSaMP) technique is considered with pruned Volterra models, being advantageous in comparison to principal component analysis (PCA). In our paper, we choose LASSO-based techniques, especially the sparse-group LASSO (SGL) [41], due to their group-sparse inducing properties, convexity, robustness and strong theoretical guarantees [39], [40]. Moreover, we also investigate LASSO and SGL as a convex optimization approach for the sizing of the proposed behavioural model. For the SGL technique, we first represent the model in terms of a pre-defined block-modular structure. Next, SGL bi-level optimization selects the most relevant model blocks and, simultaneously, the dominant regressors within these blocks, leading to a more interpretable complexity-reduced model.
After the training phase is completed, only the sparse active set of blocks/regressors needs to be implemented for on-line PA linearization, thus saving DPD running cost. The performance of the proposed model is validated through simulations and experimental comparison with well-known DPD behavioural models. Finally, we should mention that in [23] we also use LASSO and SGL, but with the Wiener-Hammerstein with feedback (WHFB) model, which is proposed based on physical knowledge specifically for DPD of PAs subject to impedance mismatch, and therefore has a more limited application than the model proposed here.
The contribution of this paper is twofold: (a) we propose a flexible reduced-Volterra polynomial structure for NL PA and DPD, in the sense that the memory depth and polynomial dimension (i.e., the maximum number of cross-samples) of the polynomials to be included in the model are defined independently for each NL order. This strategy leads to a more flexible and generalizable model, while controlling the running complexity, as presented in this paper. The resulting model is more flexible, by allowing regressors from traditional memory polynomials (such as the MP and GMP), as well as higher-dimension regressors, to be included, and is also economical, by allowing higher-order model parameters to be controlled; (b) furthermore, we propose a block-wise modular structure for the proposed model and apply SGL estimation to select the subset of the most relevant blocks/regressors, which leads to model sizing and reduction.
This paper is organized as follows. Section II provides background on the existing pruned-Volterra models and sparse estimation techniques; in Section III, the proposed DPD model is derived. Section IV proposes the modular structure of the proposed model and its SGL estimation. Section V provides the simulation and experimental results, followed by Section VI, where final conclusions are drawn.

II. SYSTEM MODEL
The baseband full-Volterra (FV) series model [42], [43], truncated to memory depth M and NL order K, is given by:

x̃_FV(n) = Σ_{k=1, k odd}^{K} Σ_{m_1=0}^{M} ··· Σ_{m_k=0}^{M} h̃_k(m_{1:k}) Π_{i=1}^{(k+1)/2} ũ(n−m_i) Π_{i=(k+3)/2}^{k} ũ*(n−m_i),   (1)

where ũ(n) and x̃_FV(n) are, respectively, the DPD complex-valued, discrete-time, baseband input and modelled output signals, (·)* is the conjugate operator, and h̃_k(m_{1:k}) are the k-th-order FV kernels, with m_{1:k} = m_1, ..., m_k. In Equation (1), the baseband Volterra kernels for k ≥ 3 present a double symmetry, i.e., all permutations of the indices within the product of non-conjugated samples ũ(n−m_i), and likewise within the product of conjugated samples ũ*(n−m_i), result in the same components. Therefore, the FV model can be equivalently replaced by the triangular baseband FV model, more economical in the number of coefficients [43]. Moreover, in [44], a modified triangular Volterra (MV) model is proposed, in which the memory depths of each NL order are independent. This way, shorter memory spans can be assigned to higher NL orders, thus reducing the model size, e.g., see [32]. The MV model [44] is given by:

x̃_MV(n) = Σ_{k=1, k odd}^{K} Σ_{0≤m_1≤···≤m_{(k+1)/2}≤M_k} Σ_{0≤m_{(k+3)/2}≤···≤m_k≤M_k} h̃_k(m_{1:k}) Π_{i=1}^{(k+1)/2} ũ(n−m_i) Π_{i=(k+3)/2}^{k} ũ*(n−m_i),   (2)

where M_k is the memory depth for the k-th-order polynomials. This way, its required number of coefficients is:

C_MV = Σ_{k=1, k odd}^{K} C(M_k + (k+1)/2, (k+1)/2) · C(M_k + (k−1)/2, (k−1)/2),   (3)

where C(n, r) denotes the binomial coefficient. Although reducing the FV model size, the doubly-combinatorial structure in Equation (2) still grows very fast with the NL order and memory depths. This has motivated several other pruned-Volterra models in the literature; among the most prominent are:
• The MP model [7], which is limited to one-dimensional polynomials, i.e., built using a single input sample.
• The GMP model [8], which includes some two-dimensional polynomials, i.e., allows cross-sample products. The GMP model is given by:

x̃_GMP(n) = Σ_{k=0}^{K_1−1} Σ_{m=0}^{M_1} a_{km} ũ(n−m)|ũ(n−m)|^k + Σ_{k=1}^{K_2−1} Σ_{m=0}^{M_2} Σ_{j=1}^{J_2} b_{kmj} ũ(n−m)|ũ(n−m−j)|^k + Σ_{k=1}^{K_3−1} Σ_{m=0}^{M_3} Σ_{j=1}^{J_3} c_{kmj} ũ(n−m)|ũ(n−m+j)|^k,   (4)

where parameters M_1 and K_1 are, respectively, the memory depth and NL order of the terms between the aligned signal and powered envelope, M_2, J_2 and K_2 are, respectively, the memory depth, envelope time-shift and NL order of the terms between the signal and lagging powered envelope, and M_3, J_3 and K_3 are, respectively, the memory depth, envelope time-shift and NL order of the terms between the signal and leading powered envelope.
• The DDR model [9], which controls the power orders of the delayed samples in the polynomials, as described in [10], instead of limiting the polynomial dimension. For example, the polynomials ũ(n−m_1)|ũ(n)|^2 and ũ(n−m_1)ũ(n−m_2)ũ*(n)|ũ(n)|^2 belong, respectively, to the first- and second-order DDR models.
Even more sophisticated models have also been derived in recent papers from the FV model, such as in [15], [16], [17], [18], and [19]; among them, the complexity-reduced Volterra (CRV) model (5) proposed in [18] combines instantaneous and lagging samples, where (·)* is the conjugate operator and K, Q and M are, respectively, the NL order of instantaneous samples, the NL order of lagging samples and the memory depth.
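As a concrete illustration of the GMP structure in Equation (4), the sketch below assembles a GMP regression matrix column by column; the function name, the zero-padding convention for delayed/advanced samples and the test signal are our own assumptions, not from the article:

```python
import numpy as np

def gmp_matrix(u, K1, M1, K2, M2, J2, K3, M3, J3):
    """Assemble a GMP regression matrix for a complex baseband input u.
    Columns: aligned, lagging-envelope and leading-envelope terms."""
    N = len(u)

    def d(sig, m):
        # Delayed (m > 0) or advanced (m < 0) copy, zero-padded at the edges
        out = np.zeros(N, dtype=complex)
        if m >= 0:
            out[m:] = sig[:N - m]
        else:
            out[:N + m] = sig[-m:]
        return out

    cols = []
    for k in range(K1):                    # aligned signal and powered envelope
        for m in range(M1 + 1):
            cols.append(d(u, m) * np.abs(d(u, m))**k)
    for k in range(1, K2):                 # signal and lagging powered envelope
        for m in range(M2 + 1):
            for j in range(1, J2 + 1):
                cols.append(d(u, m) * np.abs(d(u, m + j))**k)
    for k in range(1, K3):                 # signal and leading powered envelope
        for m in range(M3 + 1):
            for j in range(1, J3 + 1):
                cols.append(d(u, m) * np.abs(d(u, m - j))**k)
    return np.column_stack(cols)

u = np.exp(1j * np.linspace(0, 2 * np.pi, 64))
Phi = gmp_matrix(u, K1=3, M1=2, K2=3, M2=1, J2=1, K3=3, M3=1, J3=1)
```

For these toy parameters the matrix has 9 aligned, 4 lagging and 4 leading columns, illustrating how the three GMP index sets contribute separately to the model size.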

A. ESTIMATION TECHNIQUES
During the DPD training phase and following the indirect learning architecture (ILA) [45], we measure, down-convert and process the length-N signals x̃ and ỹ, respectively, the PA undistorted input and NL output. In the ILA, the DPD training postdistorter, placed after the PA, estimates the inverse model directly, by interchanging the roles of the signals above, i.e., using ũ = ỹ/G_PA as the post-inverse normalized input (G_PA is the PA linear gain) and minimizing the least squares error between x̃ and the modelled post-inverse output x̂, i.e., ‖x̃ − x̂‖_2^2, where ‖·‖_2 is the Euclidean norm [46]. After the training phase is completed, the postdistorter inverse model is copied to the DPD predistorter [1]. Using the MV model from Equation (2), which is linear in the coefficients, the post-inverse output signal can be modelled as x̂ = Φβ, where Φ is the regression matrix and β are the coefficients to be estimated by ordinary least squares (OLS), as follows:

β̂_OLS = argmin_β ‖x̃ − Φβ‖_2^2,   (6)

where N is the number of samples, R is the number of regressors, Φ is the N × R model-specific regression matrix and β is the R × 1 coefficient vector to be estimated. Finally, the least squares solution is given by β̂_OLS = (Φ^H Φ)^{−1} Φ^H x̃, assuming Φ^H Φ is non-singular. However, as the model dimensions increase, Φ frequently becomes ill-conditioned (due to high correlation among regressors) and/or the OLS overfits the training data; in both cases, the solution of Equation (6) is compromised in terms of accuracy and stability [47], and ridge regression regularization [48] leads to better model prediction. Moreover, PA/DPD models usually admit parsimonious/sparse approximations [31], i.e., only a few of the PA/DPD model coefficients are relevant, either because the physical system is sparse or the original model is over-parametrized. In this work, we use LASSO-based techniques to select the subset of the most relevant regressors, while avoiding ill-conditioning and over-fitting.
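The ILA post-inverse estimation just described can be sketched end to end; the toy PA nonlinearity, the signal length and the memoryless two-term basis below are illustrative assumptions of ours, not the article's measured setup:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000
# Undistorted PA input x and a toy PA with linear gain and 3rd-order compression
x = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
G_PA = 10.0
y = G_PA * (x - 0.05 * x * np.abs(x)**2)

# ILA: normalize the PA output and use it as the post-inverse input
u = y / G_PA

# Minimal memoryless basis (orders 1 and 3) for the post-inverse model
Phi = np.column_stack([u, u * np.abs(u)**2])

# OLS solution of Equation (6); lstsq is numerically safer than forming Phi^H Phi
beta, *_ = np.linalg.lstsq(Phi, x, rcond=None)
x_hat = Phi @ beta
nmse_db = 10 * np.log10(np.sum(np.abs(x - x_hat)**2) / np.sum(np.abs(x)**2))
```

After convergence, `beta` would be copied to the predistorter, mirroring the ILA training-then-copy procedure described above.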
LASSO-based techniques are chosen due to their group-sparse inducing properties, convexity, robustness and strong theoretical guarantees [39], [40], compared to greedy techniques, such as OMP. The LASSO batch estimator solves the following convex optimization problem:

β̂_LASSO = argmin_β (1/2)‖x̃ − Φβ‖_2^2 + λ‖β‖_1,   (7)

where ‖β‖_1 is the ℓ1 norm, computed in terms of the real and imaginary parts of the coefficients, and λ is the sparsity parameter. One typically finds the LASSO solution by solving Equation (7) over a pre-defined range of λ values and obtaining the best λ by cross-validation [38] or more economical methods, such as the Akaike information criterion (AIC) [49]. Without loss of generality, we assume that the matrix Φ is standardized, such that each column φ_r is centered and scaled to unit norm. Examples in the literature that have applied LASSO to reduce the dimensionality of the FV model are [31], [32], and [33]. Furthermore, when it is possible to organize the model into blocks, the regression matrix becomes block-modular, i.e., Φ is divided into G disjoint blocks, and group-sparse LASSO extensions, such as group-LASSO [50], [51], [52] and SGL [41], can be applied. In this paper, we apply the bi-level SGL selection to explore the sparsity of the model both at the individual and block levels. The SGL cost function is given by:

β̂_SGL = argmin_β (1/2)‖x̃ − Φβ‖_2^2 + αλ‖β‖_1 + (1 − α)λ Σ_{g=1}^{G} w_g ‖β_g‖_2,   (8)

where Φ_g is the g-th group of regressors from Φ, β_g is the corresponding coefficient vector, w_g is the group weight, often equal to √d_g (where d_g is the group length), λ is the sparsity parameter and α ∈ [0, 1] is the additional SGL parameter. Equivalently, one can use λ_1 = αλ and λ_2 = (1 − α)λ.
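A minimal proximal-gradient (ISTA) solver for the complex LASSO problem of Equation (7) can be sketched as follows; the solver, the λ value and the synthetic unstandardized data are our own illustrative assumptions (the article uses the SLEP package instead):

```python
import numpy as np

def soft(z, t):
    """Complex soft-thresholding: shrink coefficient magnitudes by t."""
    mag = np.abs(z)
    return np.where(mag > t, (1 - t / np.maximum(mag, 1e-12)) * z, 0.0)

def lasso_ista(Phi, x, lam, n_iter=500):
    """Minimize 0.5*||x - Phi b||_2^2 + lam*||b||_1 by proximal gradient."""
    L = np.linalg.norm(Phi, 2)**2            # Lipschitz constant of the gradient
    b = np.zeros(Phi.shape[1], dtype=complex)
    for _ in range(n_iter):
        b = soft(b - (Phi.conj().T @ (Phi @ b - x)) / L, lam / L)
    return b

rng = np.random.default_rng(1)
Phi = rng.standard_normal((200, 20)) + 1j * rng.standard_normal((200, 20))
b_true = np.zeros(20, dtype=complex)
b_true[[2, 7, 11]] = [1.0, -0.5j, 0.8 + 0.3j]    # sparse ground truth
x = Phi @ b_true + 0.01 * (rng.standard_normal(200) + 1j * rng.standard_normal(200))

b_hat = lasso_ista(Phi, x, lam=2.0)
active = np.flatnonzero(np.abs(b_hat) > 1e-3)    # selected (active) regressors
```

With an adequate λ, the ℓ1 penalty drives the irrelevant coefficients exactly to zero, which is the selection property exploited throughout this work.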

III. PROPOSED FLEXIBLE VOLTERRA-BASED MODEL
This section presents the proposed reduced-Volterra model, whose parameters are defined independently for each NL order. This allows the model parameters, such as the NL order, polynomial dimensions and memory depths, to be increased in a controlled way, generalizing existing memory polynomials, such as the MP and GMP, and avoiding the curse of dimensionality. Moreover, by exploring higher polynomial dimensions, not assessed by the GMP model, the proposed model is expected to achieve better normalized mean square error (NMSE) and adjacent channel power ratio (ACPR). Our pruning strategy, to be described below, explores the fading memory behaviour of physical systems [44], which allows the memory depths of higher NL orders of the model to be reduced.
Starting from the MV model in Equation (2), let, for each NL order k, the collection of input sample multi-indices, drawn from the finite set of integers {0, ..., M_k}, be represented as follows:

(m_1, M^(k), M^(k)*),   (9)

in which m_1 is the first non-conjugated sample delay index, and M^(k) and M^(k)* are multisets, i.e., unordered collections of elements in which repetitions are allowed [53], [54], each with cardinality (k−1)/2, containing, respectively, the delay multi-indices of the remaining non-conjugated and conjugated input samples. Additionally, the triangular model imposes a non-decreasing ordering on the delay indices.
The pruning strategy is described in two steps, as follows. In step 1, the doubly-combinatorial structure in Equation (2) is replaced by a single-combinatorial one, inspired by memory polynomials, such as the MP and GMP. This way, following the MP/GMP structure, we only retain the polynomials from Equation (2) in which non-conjugated and conjugated indices appear in pairs of the same delays. This corresponds to imposing in Equation (9) that, for each k:

M^(k) = M^(k)*,   (10)

i.e., the two multisets are equal. This way, the vector of multi-indices reduces to (m_1, M^(k)), where the multiset M^(k), of cardinality (k−1)/2, now contains the sample delay multi-indices of squared envelope terms.
Thus, the reduced-Volterra model RV_1, after step 1, is given by:

x̃_RV1(n) = Σ_{k=1, k odd}^{K} Σ_{m_1=0}^{M_k} Σ_{M^(k)} h̃_k(m_1, M^(k)) ũ(n−m_1) Π_{m_i ∈ M^(k)} |ũ(n−m_i)|^2,   (11)

This way, the k-th-order polynomials have up to (k−1)/2 squared envelope terms with distinct delays (cross-samples) and their maximum polynomial dimension is (k+1)/2. The required number of coefficients in Equation (11) is:

C_RV1 = Σ_{k=1, k odd}^{K} (M_k + 1) · ((M_k + 1; (k−1)/2)),   (12)

where ((n; r)) is the multiset binomial coefficient and is equal to C(n + r − 1, r). Next, in step 2, for each NL order k greater than or equal to three, the k-th-order polynomials from Equation (11) are limited to have polynomial dimension up to L_k + 1, or, equivalently, up to L_k distinct powered envelope terms, where the parameter L_k is user-defined in the range from 1 to (k−1)/2. Thus, once L_k is chosen, the model only keeps polynomials with up to L_k distinct powered envelope terms. For example, let us consider the 7th-order polynomials ũ(n)|ũ(n)|^6, ũ(n)|ũ(n−1)|^6, ũ(n)|ũ(n−1)|^4 |ũ(n−2)|^2 and ũ(n)|ũ(n−1)|^2 |ũ(n−2)|^2 |ũ(n−3)|^2, which have, respectively, polynomial dimensions one, two, three and four. Now, if L_k (with k = 7) is set to two, the last polynomial is discarded from the model.
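The multiset structure behind Equations (11)-(12) can be checked numerically: enumerating, for each odd order k, the (k−1)/2-element envelope-delay multisets with `itertools.combinations_with_replacement` reproduces the multiset binomial count. The function and parameter names below are our own:

```python
from itertools import combinations_with_replacement
from math import comb

def rv1_terms(K, M):
    """Enumerate RV1 regressor index tuples (k, m1, envelope-delay multiset)
    for odd orders up to K, with per-order memory depths M[k]."""
    terms = []
    for k in range(1, K + 1, 2):
        Mk = M[k]
        r = (k - 1) // 2                     # number of squared-envelope terms
        for m1 in range(Mk + 1):
            for ms in combinations_with_replacement(range(Mk + 1), r):
                terms.append((k, m1, ms))
    return terms

M = {1: 3, 3: 2, 5: 1}                        # example memory depths per order
terms = rv1_terms(5, M)

# Closed-form count from Equation (12):
# sum_k (Mk + 1) * multichoose(Mk + 1, (k-1)/2), multichoose(n, r) = C(n+r-1, r)
count = sum((Mk + 1) * comb(Mk + (k - 1) // 2, (k - 1) // 2)
            for k, Mk in M.items())
```

The enumeration and the closed form agree, confirming that the single-combinatorial step 1 retains one coefficient per (m_1, multiset) pair.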
Let the multiset M^(k,l) = (m_2, m_3, ..., m_{l+1}) contain the sample delay indices of the polynomials (k, l), i.e., the k-th-order polynomials with exactly l distinct powered envelope terms. In this case, the l sample delays in M^(k,l) are not repeated, that is, m_{l+1} > m_l > ··· > m_3 > m_2. Furthermore, let the vector v^(k,l) contain the exponents of the powered envelope terms in the polynomials (k, l). Table 1 shows the exponents v^(k,l) for k up to 13 and l = 1, ..., L_k (with maximum L_k equal to (k−1)/2), as well as the corresponding polynomial terms. In practice, to avoid redundancy in the model, we do not consider all exponent vectors from Table 1, nor their permutations. In fact, we only consider in the model the first exponent vector of each (k, l) tuple from the table, that is, we apply only the highest possible exponents to the least delayed samples, inspired by the DDR [9], which also controls the power orders of the delayed samples. Finally, the proposed reduced-Volterra model RV_2 is given by:

x̃_RV2(n) = Σ_{k=1, k odd}^{K} Σ_{m_1=0}^{M_k} Σ_{l=1}^{L_k} Σ_{M^(k,l)} h̃_{k,l}(m_1, M^(k,l)) ũ(n−m_1) Π_{i=2}^{l+1} |ũ(n−m_i)|^{v_i^(k,l)},   (13)

where v^(k,l) are the exponent vectors given in Table 1 (as discussed above) and the required total number of coefficients is:

C_RV2 = (M_1 + 1) + Σ_{k=3, k odd}^{K} (M_k + 1) Σ_{l=1}^{L_k} C(M_k + 1, l),   (14)

where L_1 = 0 and C(M_k + 1, l) counts the choices of the l distinct sample delays in M^(k,l).
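Under the reading above (strictly ordered delays in M^(k,l) and a single exponent vector per (k, l) pair), the count of Equation (14) can be evaluated directly; this interpretation and the example parameter values are our own and should be checked against Table 1:

```python
from math import comb

def rv2_count(M, L):
    """Number of RV2 coefficients: (M1+1) linear terms, plus, for each odd
    k >= 3, (Mk+1) choices of m1 times C(Mk+1, l) strictly-ordered envelope
    delay sets for each dimension l = 1..Lk (one exponent vector per (k, l))."""
    total = M[1] + 1
    for k in sorted(M):
        if k >= 3:
            total += (M[k] + 1) * sum(comb(M[k] + 1, l)
                                      for l in range(1, L[k] + 1))
    return total

# Example sizing (values inspired by the experimental section)
M = {1: 9, 3: 9, 5: 6, 7: 5}      # memory depth per odd NL order
L = {1: 0, 3: 1, 5: 2, 7: 2}      # maximum number of envelope terms per order
n_coef = rv2_count(M, L)
```

Varying the per-order pairs (M_k, L_k) in this count shows how the model size is controlled order by order, instead of exploding combinatorially as in Equation (3).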

IV. PROPOSED SPARSE-GROUP LASSO SOLUTION
This section proposes to generate the RV_2 model with a block-wise modular structure, so that group-sparse estimation techniques can be applied for model sizing and reduction. This section is motivated by the reduction in running cost that is expected from the use of a block-modular structure and group-sparse estimation, compared to LASSO, due to the reuse of previously computed blocks, as verified numerically in Section V. Let us consider the RV_2 model in Equation (13), whose parameters are K and the sets {M_k} and {L_k}, defined for k = 1, ..., K (k odd). Given N samples of the measured training signals, the model's regression matrix Φ can be assembled modularly by concatenating sub-matrices Φ^(k) that contain the k-th-order model regressors. In turn, each sub-matrix Φ^(k) is assembled by concatenating the disjoint blocks Φ^(k,l), for l = 1, ..., L_k, that is:

Φ = [Φ^(1) Φ^(3) ··· Φ^(K)],   (15)
where each k-th-order sub-matrix is given by:

Φ^(k) = [Φ^(k,1) Φ^(k,2) ··· Φ^(k,L_k)],   (16)

as illustrated in Figure 1. Each block Φ^(k,l) is composed of row vectors φ^(k,l)(n) of the regressors at instant n. Note that the size of Φ^(k) is N × R_k, whereas the size of Φ^(k,l) is N × R_{k,l}, where R_{k,l} is the number of regressors of the polynomials (k, l) and R_k = Σ_l R_{k,l}.
A truncated Kronecker operator is defined in [55] and [56] that allows an efficient implementation of triangular Volterra models, such as in Equation (2). Likewise, we next propose the modified truncated Kronecker operator, which allows each regression block Φ^(k,l) of our model to be implemented separately and efficiently, avoiding redundancy due to sample delay permutations. Let us define the modified truncated Kronecker operator of an m × 1 vector a = [a_1 a_2 ··· a_m]^T by itself as:

a ⊖ a = [a_1 ā_1^T  a_2 ā_2^T  ···  a_m ā_m^T]^T,   (17)

where ā_i = [a_i a_{i+1} ··· a_m]^T and ⊖ denotes the modified truncated Kronecker operator. Recursively, the q-th-order modified truncated Kronecker operator ⊖^q a ≡ a ⊖ ··· ⊖ a (q factors) is given by:

⊖^q a = [a_1 (⊖^{q−1} ā_1)^T  a_2 (⊖^{q−1} ā_2)^T  ···  a_m (⊖^{q−1} ā_m)^T]^T,   (18)

where ⊖^0 a = 1 and ⊖^1 a = a. For the RV_2 model, we first define the input squared envelope vector, adjusted to the length M_k + 1 for each NL order k, as follows:

ũ^(2)(n) = [|ũ(n)|^2  |ũ(n−1)|^2  ···  |ũ(n−M_k)|^2]^T,   (19)

then, the modified truncated Kronecker product of ũ^(2)(n) by itself is given by:

ũ^(2)(n) ⊖ ũ^(2)(n) = [|ũ(n)|^2 (ũ^(2)_0(n))^T  |ũ(n−1)|^2 (ũ^(2)_1(n))^T  ···  |ũ(n−M_k)|^2 (ũ^(2)_{M_k}(n))^T]^T,   (20)

where:

ũ^(2)_i(n) = [|ũ(n−i)|^2  |ũ(n−i−1)|^2  ···  |ũ(n−M_k)|^2]^T,   (21)

and, recursively, the corresponding q-th-order modified truncated Kronecker product results:

⊖^q ũ^(2)(n) = [|ũ(n)|^2 (⊖^{q−1} ũ^(2)_0(n))^T  ···  |ũ(n−M_k)|^2 (⊖^{q−1} ũ^(2)_{M_k}(n))^T]^T.   (22)

Therefore, the row vectors φ^(k,l)(n) of the blocks Φ^(k,l) (k ≥ 3) of the proposed model can be efficiently computed as:

φ^(k,l)(n) = (ũ(n) ⊗ [ũ^(k+1−2l)(n) ⊖ (⊖^{l−1} ũ^(2)(n))])^T,   (23)

where ũ(n) = [ũ(n) ũ(n−1) ··· ũ(n−M_k)]^T and ⊗ is the conventional Kronecker product. In Equation (23), since the exponent (k + 1 − 2l) is even, ũ^(k+1−2l)(n) is an element-wise powered version of ũ^(2)(n), thus:

ũ^(k+1−2l)(n) = [|ũ(n)|^(k+1−2l)  |ũ(n−1)|^(k+1−2l)  ···  |ũ(n−M_k)|^(k+1−2l)]^T.   (24)

Finally, before applying group-sparse techniques, each RV_2 model block Φ^(k,l) is further sub-divided into M_k + 1 non-overlapping sub-blocks, through the sample delay index m = m_1 in Equation (13), i.e., the elements of ũ(n) in Equation (23). The length of each sub-block of Φ^(k,l) is given by:

d_(k,l,m) = R_{k,l}/(M_k + 1),   (25)

and corresponds to the d_g values in Equation (8). This way, using the bi-level SGL selection we can perform the sizing of the model in a systematic way, by convex optimization, obtaining the active set of sub-blocks {(k, l, m)} and the indices of the most relevant regressors within the selected sub-blocks. In the next section, we discuss the numerical/experimental results.
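The recursion of Equations (17)-(18) can be implemented in a few lines; this is a minimal numpy sketch with our own function name, generating only non-decreasing index combinations (no permutation redundancy):

```python
import numpy as np

def trunc_kron_q(a, q):
    """q-th-order modified truncated Kronecker product of vector a with itself.
    Follows the recursion of Equation (18): each element a_i multiplies the
    (q-1)-th-order product of its truncated tail [a_i, ..., a_m], so only
    non-decreasing index combinations are generated."""
    a = np.asarray(a)
    if q == 0:
        return np.ones(1, dtype=a.dtype)
    if q == 1:
        return a
    return np.concatenate([a[i] * trunc_kron_q(a[i:], q - 1)
                           for i in range(len(a))])

# For a = [a1, a2], the 2nd-order product is [a1*a1, a1*a2, a2*a2]
print(trunc_kron_q(np.array([2.0, 3.0]), 2))  # [4. 6. 9.]
```

The output length equals the multiset binomial coefficient of Equation (12), which is exactly the saving over a full Kronecker power.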

V. NUMERICAL VALIDATION
The numerical validation of the proposed RV 2 DPD model in Equation (13) is performed by comparison to well-known models, such as the triangular Volterra, MP, GMP and others. This task consists of an initial proof of concept performed through computer simulations, followed by an experimental test using hardware that validates the performance of the model.
In both steps, the PA input and output signals are generated, simulated/measured and stored in separate datasets for training and validation. The ability of the proposed DPD technique to compensate for in-band and out-of-band PA distortions and its performance are evaluated using the validation dataset. The DPD training algorithms are implemented off-line using Matlab and, additionally, LASSO and its extensions use the package Sparse Learning with Efficient Projections (SLEP) [57], also in Matlab, although other implementations are available, e.g., [58] and [59]. Moreover, during sparse estimation, the LASSO algorithm is run for a range of pre-defined values of the sparsity parameter λ, using the SLEP package with up to 5,000 iterations per run and a tolerance of 10^−7. The best λ value is obtained using the Akaike information criterion (AIC) [49]:

AIC(λ_j) = N ln(‖x̃ − Φβ̂(λ_j)‖_2^2 / N) + 2‖β̂(λ_j)‖_0,   (26)

where ‖β̂(λ_j)‖_0 is the ℓ0 pseudo-norm of the coefficient vector from LASSO with candidate λ_j. The AIC is a widely used selection criterion for choosing the best among several competing models, trading off their fitting capabilities and required running complexity [60]. Thus, we run the optimization in Equation (7) for a pre-defined range of λ values and choose the λ that minimizes the AIC.
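The AIC-based λ selection can be sketched as follows; to keep the example short, a hard-thresholded least-squares fit stands in for each LASSO solve (an assumption of ours), while the scoring follows the AIC of Equation (26):

```python
import numpy as np

def aic_score(x, Phi, b, eps=1e-12):
    """AIC as in Equation (26): N*ln(RSS/N) plus twice the l0 pseudo-norm."""
    N = len(x)
    rss = np.sum(np.abs(x - Phi @ b)**2)
    return N * np.log(rss / N + eps) + 2 * np.count_nonzero(np.abs(b) > 1e-8)

rng = np.random.default_rng(2)
Phi = rng.standard_normal((400, 30))
b_true = np.zeros(30)
b_true[[1, 4, 9]] = [2.0, -1.5, 1.0]            # sparse ground truth
x = Phi @ b_true + 0.05 * rng.standard_normal(400)

b_ls, *_ = np.linalg.lstsq(Phi, x, rcond=None)
lambdas = np.logspace(-3, 1, 25)
# Crude stand-in for the LASSO path: hard-threshold the LS solution at each lambda
candidates = [np.where(np.abs(b_ls) > lam, b_ls, 0.0) for lam in lambdas]
scores = [aic_score(x, Phi, b) for b in candidates]
best = candidates[int(np.argmin(scores))]
```

As in the article's procedure, the minimum-AIC candidate trades a marginal fitting loss for a much smaller active set of coefficients.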
As discussed in the next subsections, the main advantages of the proposed model, compared to conventional models from the literature, are: flexibility, i.e., parameter values can be independently defined for each NL order; generality, i.e., higher polynomial dimensions can be added to the model as required; modularity, i.e., its block-modular structure allows group-wise sparse estimation (further reducing the running cost, as in Subsection V-C); and parsimony, i.e., the curse of dimensionality of Volterra-based models is avoided.

A. SIMULATION RESULTS
In this subsection, we carry out simple computer simulations to evaluate the performance of the proposed DPD model. For this, we describe the PA in the complex baseband domain by the well-known Wiener-Hammerstein (WH) model [11], [12]. As in Figure 2, the WH is a block-oriented model consisting of the series interconnection of a linear time-invariant (LTI) filter, a memoryless NL block and another LTI filter. In the literature, the NL block is described by memoryless models, such as complex polynomials, nonlinear functions, the Saleh [61] and Rapp [62] models, or directly by amplitude-to-amplitude modulation (AM/AM) and amplitude-to-phase modulation (AM/PM) look-up tables. In our simulations, we account for the NL block through three different options, as follows:
• The 5th-order memoryless polynomial given in [47], with coefficients β_1 = 14.9740 + j0.0519, β_3 = −23.0954 + j4.9680 and β_5 = 21.3936 + j0.4305, extracted from a Class AB PA.
• A memoryless arctan NL function.
• The Saleh model [61], with parameter values from [63]: α_a = 2, β_a = 2.2, α_φ = 2 and β_φ = 1.
The LTI blocks in the WH model can be either finite impulse response (FIR) or infinite impulse response (IIR) filters. We adopt the transfer functions from [3], i.e., H_i(z) = (1 + 0.25z^−2)/(1 + 0.4z^−1) for the input filter and H_o(z) = (1 − 0.1z^−1)/(1 − 0.2z^−1) for the output filter, to be used in combination with any of the NL block options above. In the simulations, the input signal is a single-carrier 64-quadrature amplitude modulation (64-QAM) waveform. In each case, the input power level is adjusted to drive the PA model into the nonlinear range. Tables 2, 3 and 4 present the simulation results for the distinct NL options of the WH PA model, respectively, the polynomial, arctan and Saleh, using the DPD models: MP, GMP, dense RV_2 and sparse RV_2 LASSO. The DPD performance evaluation is given in terms of the required numbers of coefficients and the achieved NMSE and ACPR. The DPD model sizing is performed by exhaustive search with the GMP model, choosing the model parameters based on the NMSE and ACPR achieved at each search grid point defined by K, M and J in the ranges K = 5, 7, 9; M = 4, 5, 6 and J = 2, 3, 4, 5, and assuming, for simplicity, in Equation (4), that K = K_1 = K_2 = K_3, M = M_1 = M_2 = M_3 and J = J_2 = J_3. The same model sizing procedure is performed in Subsection V-B, where it is presented in more detail. The RV_2 parameters follow the ones for the GMP, i.e., the same K value and maximum M_k equal to M + 2J (which is the GMP memory span), with lower M_k values for higher NL orders. The RV_2 model is run for two sets of L_k parameters: {0, 1, ..., 1} and {0, 1, 2, ..., 2}, as indicated in each case in Tables 2, 3 and 4. The RV_2 LASSO estimation is run for L_k = {0, 1, 2, ..., 2} and a pre-defined range of the sparsity parameter λ, and the best λ value is obtained in each case using the AIC score in Equation (26). This procedure is also discussed in more detail in Subsection V-B.
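The simulated WH PA above can be reproduced in a few lines, using the quoted filter transfer functions and 5th-order polynomial coefficients; the direct-form IIR helper and the test excitation are our own:

```python
import numpy as np

def iir(b, a, x):
    """Direct-form IIR difference equation
    y[n] = sum_i b[i] x[n-i] - sum_{i>=1} a[i] y[n-i], with a[0] = 1 assumed."""
    y = np.zeros(len(x), dtype=complex)
    for n in range(len(x)):
        acc = sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
        acc -= sum(a[i] * y[n - i] for i in range(1, len(a)) if n - i >= 0)
        y[n] = acc
    return y

rng = np.random.default_rng(3)
u = 0.02 * (rng.standard_normal(2048) + 1j * rng.standard_normal(2048))

# Input LTI filter H_i(z) = (1 + 0.25 z^-2) / (1 + 0.4 z^-1)
v = iir([1.0, 0.0, 0.25], [1.0, 0.4], u)
# Memoryless 5th-order polynomial NL with the Class AB coefficients from [47]
b1, b3, b5 = 14.9740 + 0.0519j, -23.0954 + 4.9680j, 21.3936 + 0.4305j
w = b1 * v + b3 * v * np.abs(v)**2 + b5 * v * np.abs(v)**4
# Output LTI filter H_o(z) = (1 - 0.1 z^-1) / (1 - 0.2 z^-1)
y = iir([1.0, -0.1], [1.0, -0.2], w)
```

Scaling the input amplitude plays the role of the average power adjustment described above, driving the polynomial NL further into compression.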
From our simulations, we can verify that the proposed model is able to outperform the conventional models under study and, in addition, the RV_2 LASSO outperforms the GMP model with approximately the same number of coefficients. Figures 3, 4 and 5 show the normalized power spectral densities (PSD) of the compensated output signals obtained using the MP, GMP, RV_2 (L_k = {0, 1, 2, ..., 2}) and RV_2 LASSO DPD models for, respectively, the polynomial, arctan and Saleh NL models, with P_avg = −14 dBm (polynomial and Saleh) and P_avg = −9 dBm (arctan). As we can see in the figures, the proposed model can effectively linearize the PA, further reducing the spectral regrowth compared to the conventional models.
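The ACPR figure of merit used throughout this section can be computed from an FFT-based PSD; the channel layout (20 MHz channels at ±20 MHz offsets), the toy band-limited signal and the 3rd-order regrowth term below are our own assumptions for illustration:

```python
import numpy as np

def acpr_db(y, fs, bw, offset):
    """ACPR from an FFT-based PSD: average adjacent-channel power (channels
    centered at +/- offset, width bw) over main-channel power (centered at 0)."""
    psd = np.abs(np.fft.fftshift(np.fft.fft(y)))**2
    f = np.fft.fftshift(np.fft.fftfreq(len(y), 1 / fs))
    main = psd[np.abs(f) <= bw / 2].sum()
    adj = psd[np.abs(np.abs(f) - offset) <= bw / 2].sum()
    return 10 * np.log10(adj / main / 2)

rng = np.random.default_rng(4)
N, fs = 8192, 100e6
# Band-limited signal occupying ~20 MHz, built directly in the frequency domain
X = np.zeros(N, dtype=complex)
band = np.abs(np.fft.fftfreq(N, 1 / fs)) <= 10e6
X[band] = rng.standard_normal(band.sum()) + 1j * rng.standard_normal(band.sum())
u = np.fft.ifft(X)
u = u + 1e-8 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))  # noise floor

# A 3rd-order NL term adds spectral regrowth into the adjacent channels
y = u + 0.1 * u * np.abs(u)**2 / np.mean(np.abs(u)**2)

acpr_u = acpr_db(u, fs, 20e6, 20e6)   # clean signal: very low ACPR
acpr_y = acpr_db(y, fs, 20e6, 20e6)   # distorted signal: degraded ACPR
```

The same metric, applied to the compensated PA output, is what the tables and PSD figures in this section report.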

B. EXPERIMENTAL RESULTS
Next, we validate the performance of the proposed DPD model through experimental tests, using a 10 W Gallium Arsenide (GaAs) metal-semiconductor field-effect transistor (MESFET) RF PA with 40 dB of nominal (linear) gain, designed at the Laboratory of Microelectronics (LME) of the University of São Paulo (USP) and measured in Class AB with the test set-up shown in Figure 6. As detailed in the block diagram, an N5182B MXG vector signal generator from Keysight Technologies is used to synthesize, at the carrier frequency of 5.9 GHz, the PA input signal for this experiment, an orthogonal frequency division multiplexing (OFDM) waveform with 1,024 sub-carriers modulated with 64-QAM and 20 MHz of bandwidth. This bandwidth complies with the orthogonal frequency division multiple access (OFDMA) downlink of the long term evolution (LTE) 3GPP standard [64] and is also considered in recent papers, e.g., [2], [15], [16], [20], [65], and [66]. Furthermore, note that memory polynomials in general, of which the proposed model is an extension, are also applicable for DPD of configurations with wider bandwidths, as for example in [35]. An N9020B MXA vector signal analyzer from Keysight Technologies is used to digitize and capture the complex I/Q data streams of the PA input and output waveforms. Both the MXG and MXA are connected to a PC via LAN. The PA input x̃ and output ỹ complex envelope datasets are time-aligned by cross-correlation in Matlab and split into disjoint training and validation partitions, each with N = 40,000 samples. In the next items, the RV_2 model and its sparse versions are compared, in terms of the numbers of required coefficients and DPD performance, to well-known behavioural models, such as the GMP and MV, all estimated using the same training dataset. Additionally, we compare the proposed model to the MP, CRV (Equation (5)) and RV_1 (Equation (11)). Finally, the RV_2 model with block-wise structure and SGL estimation is validated.
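The cross-correlation time alignment of the captured datasets can be sketched as follows (envelope-based correlation in numpy rather than Matlab; the synthetic signal and delay are our own):

```python
import numpy as np

def align_lag(x, y):
    """Estimate the integer lag of y relative to x by cross-correlating the
    signal envelopes, as done when time-aligning captured I/Q datasets."""
    xc = np.correlate(np.abs(y), np.abs(x), mode="full")
    return int(np.argmax(xc)) - (len(x) - 1)

rng = np.random.default_rng(5)
x = rng.standard_normal(1000) + 1j * rng.standard_normal(1000)   # "input" capture
delay = 7
y = np.concatenate([np.zeros(delay, dtype=complex), x])[:1000]   # delayed "output"
lag = align_lag(x, y)
```

Once the lag is known, one capture is shifted accordingly before splitting into the training and validation partitions.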

1) INPUT POWER LEVEL
Initially, in order to choose an appropriate PA input power level, the PA is driven by the modulated input signal from the MXG with three average power levels: P_avg = −17, −12 and −8 dBm. As the PA input power increases, the amplifier reaches its NL region and shows gain compression, as in Figure 7, where the PA gain is plotted versus the instantaneous power P_in of the input signal. As expected, NL behaviour and memory effects become more pronounced as the PA average power level increases, reaching our region of interest, where spectral regrowth, modulation distortion and nonlinear memory effects are visible. Based on this analysis, P_avg = −12 and −8 dBm are chosen for the DPD model comparison.

2) GMP MODEL SIZING
In this item, the GMP model sizing is performed by exhaustive search, as follows. A search grid is defined by varying the model parameters K, M and J in the ranges K = 1, ..., 11 (K odd); M = 1, ..., 6 and J = 1, ..., 3, and also assuming in Equation (4) that M = M_1 = M_2 = M_3, K = K_1 = K_2 = K_3 and J = J_2 = J_3. Using the measured data for P_avg = −12 and −8 dBm, the model is estimated at each grid point by least squares and the corresponding NMSE and ACPR performances are captured. The results of this search are shown in Figure 8.
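The exhaustive-search loop over the (K, M, J) grid can be sketched as follows; the toy data and the placeholder evaluation (a memoryless polynomial fit standing in for the full GMP estimation) are our own simplifications:

```python
import numpy as np
from itertools import product

def nmse_db(x, x_hat):
    return 10 * np.log10(np.sum(np.abs(x - x_hat)**2) / np.sum(np.abs(x)**2))

# Grid from the text: K = 1..11 (odd), M = 1..6, J = 1..3,
# with K1 = K2 = K3 = K, M1 = M2 = M3 = M and J2 = J3 = J
grid = list(product(range(1, 12, 2), range(1, 7), range(1, 4)))

def evaluate(K, M, J, u, x):
    """Placeholder evaluation of one grid point: fit a model and score it.
    A memoryless odd-order polynomial stands in for the full GMP fit here,
    so M and J are carried along but unused in this toy version."""
    Phi = np.column_stack([u * np.abs(u)**(k - 1) for k in range(1, K + 1, 2)])
    b, *_ = np.linalg.lstsq(Phi, x, rcond=None)
    return nmse_db(x, Phi @ b)

rng = np.random.default_rng(6)
u = (rng.standard_normal(500) + 1j * rng.standard_normal(500)) / np.sqrt(2)
x = u - 0.1 * u * np.abs(u)**2          # toy post-inverse target

results = {(K, M, J): evaluate(K, M, J, u, x) for K, M, J in grid}
best = min(results, key=results.get)    # grid point with the lowest NMSE
```

In the actual procedure each grid point also records the ACPR, and the chosen sizing is read from the resulting performance surfaces.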

3) DENSE RV_2 RESULTS
Next, we discuss the RV_2 model results. In this item, the dense RV_2 model is considered, i.e., not yet exploring sparsity; the RV_2 LASSO and RV_2 SGL models are discussed in the next items. Figures 9 and 10 show, respectively, for P_avg = −12 and −8 dBm, the residual ACPR after compensation versus the number of coefficients obtained with the RV_2 model, having the GMP model as a benchmark. In this analysis, the GMP parameters are run in the ranges K = 1, ..., 11 (K odd); M = 1, ..., 6 and J = 0, ..., 3. The RV_2 model parameters follow the ones for the GMP, i.e., the same K value, maximum M_k equal to M + 2J (which is the GMP memory span) and lower M_k values for higher NL orders. The RV_2 model is run with parameters L_k = {0, 1, 2, ..., 2} and the first exponent vectors from Table 1. For each combination of parameters, the GMP and RV_2 models are estimated by least squares. In addition, ridge regression is also applied when necessary to avoid overfitting the training data. The figures include, for each model size (number of coefficients), the combination of parameters that gives the best performance, i.e., the Pareto front. Comparing the residual performance provided by the models, the RV_2 (dense) model is capable of achieving lower ACPR levels, which is very relevant, since wireless transmitters must comply with spectral emission masks. This is also verified for the in-band performance (NMSE) (not shown). However, the performance improvements require much larger model sizes, which impacts the on-line DPD running cost. Tables 5 and 6 summarize, respectively, for P_avg = −12 and −8 dBm, the DPD performance of the proposed RV_2 model compared to the MV, RV_1, MP, GMP and CRV models.
Model parameters adopted in Table 5 are K = 7, M = 5 and J = 2 (GMP), with M_k = {9, 9, 6, 5} and L_k = {0, 1, 2, 2} (RV_2), whereas Table 6 adopts a different set of parameter values. Please note that we can flexibly include higher-dimension and more delayed terms in each NL order of the RV_2 model by adjusting M_k and L_k. In Table 6, the size of the MV model grows very fast, so we were only able to estimate the model up to K = 9 and M_k = {10, 10, 6, 4, 3}, or 7,499 coefficients. The RV_2 model replaces the MV doubly-combinatorial expansion in Equation (2) by Equation (13), substantially reducing the model size, estimation and running costs. In addition, the RV_2 model allows the maximum polynomial dimension to be flexibly defined for each NL order. In conclusion, the proposed RV_2 model approximates the DPD performance of the MV, while being more affordable for practical implementation. The MP model is a subset of the GMP, whereas the CRV is an extended version in which a larger number of two-dimensional polynomials is allowed. Compared to the MP, GMP and CRV models, the dense RV_2 model offers 2 to 3 dB superior performance at the cost of higher complexity, which is addressed by the sparse techniques considered next.
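To give a feel for the MV model's rapid growth, the sketch below counts distinct kth-order baseband Volterra regressors under the common symmetric-kernel convention, with (k+1)/2 non-conjugated and (k−1)/2 conjugated delayed inputs drawn (with repetition) from M memory taps. This is a generic illustrative count; it is not claimed to reproduce the paper's Equation (2) or the 7,499-coefficient figure.

```python
from math import comb

def mv_terms(k, M):
    # Distinct k-th order baseband Volterra regressors with memory depth M
    # under kernel symmetry: combinations with repetition of (k+1)/2
    # non-conjugated and (k-1)/2 conjugated delayed inputs over M taps.
    # (Illustrative count; the paper's Equation (2) may enumerate differently.)
    p, q = (k + 1) // 2, (k - 1) // 2
    return comb(M + p - 1, p) * comb(M + q - 1, q)

# Per-order counts for the Table 6 memory setting M_k = {10, 10, 6, 4, 3}:
for k, M in zip((1, 3, 5, 7, 9), (10, 10, 6, 4, 3)):
    print(k, mv_terms(k, M))
```

Even with memory depth reduced at higher orders, the per-order counts grow combinatorially in both k and M, which is why the per-order memories M_k must be throttled so aggressively for the MV model.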

4) RV_2 LASSO RESULTS
In this item, the LASSO sparse estimation from Equation (7) is applied to the dense RV_2 model, in order to reduce its complexity and DPD running cost. For P_avg = −12 and −8 dBm, the previously assembled RV_2 regression matrix is standardized. In Figures 9 and 10, we run RV_2 LASSO over a predefined range of the sparsity parameter λ and visualize the ACPR values achieved versus the number of coefficients. In the figures, the ACPR performance of the RV_2 LASSO is very close to that of the dense RV_2, with the advantage of reduced complexity. Moreover, Figure 11 shows, for P_avg = −12 dBm, the NMSE and AIC values achieved with the RV_2 LASSO best-fit solutions for a predefined range of λ values, indicated in the figure. The best λ value, where a good trade-off between accuracy and complexity is achieved, corresponds to the minimum AIC score on the curve. Note that by making λ too large, LASSO is not able to capture the underlying behaviour of the system. As λ is swept, the elbow in the NMSE curve indicates a range of economical solutions for which only very small degradations of the model performance are observed. Tables 5 and 6 also present sparse estimation results, as follows. We compare the RV_2 LASSO and RV_2 SGL models run for two sets of sparsity parameters: the one that corresponds to the best AIC scores and another that implies a more reduced model complexity. The RV_2 LASSO and RV_2 SGL models outperform the GMP and CRV models, as well as the sparse selection of regressors based on the greedy OMP [34], [35], [67], as shown in the tables. The LASSO sparse estimation technique preserves the model's performance, while significantly reducing its running cost. In the next item, we discuss the RV_2 SGL model results.
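As a hedged sketch of the λ sweep with AIC-based selection, the code below solves the LASSO problem with a simple proximal-gradient (ISTA) loop on synthetic complex data. The paper's actual solver and RV_2 regression matrix are not reproduced; the AIC form N ln(RSS/N) + 2k is one standard Gaussian-error variant.

```python
import numpy as np

def lasso_ista(Phi, y, lam, n_iter=500):
    # Proximal-gradient (ISTA) solver for 0.5*||y - Phi h||^2 + lam*||h||_1,
    # with complex soft-thresholding so it applies to baseband data.
    L = np.linalg.norm(Phi, 2) ** 2          # Lipschitz constant of gradient
    h = np.zeros(Phi.shape[1], dtype=complex)
    for _ in range(n_iter):
        g = h - (Phi.conj().T @ (Phi @ h - y)) / L
        mag = np.abs(g)
        h = g / np.maximum(mag, 1e-12) * np.maximum(mag - lam / L, 0.0)
    return h

def aic(y, y_hat, n_active):
    # AIC under a Gaussian error model: N*ln(RSS/N) + 2*k
    N = len(y)
    return N * np.log(np.sum(np.abs(y - y_hat) ** 2) / N) + 2 * n_active

rng = np.random.default_rng(1)
Phi = rng.standard_normal((400, 30)) + 1j * rng.standard_normal((400, 30))
h_true = np.zeros(30, dtype=complex)
h_true[[0, 4, 9]] = [1, 0.5j, -0.2]          # sparse ground truth
y = Phi @ h_true + 0.01 * (rng.standard_normal(400) + 1j * rng.standard_normal(400))

# Sweep lambda, score each fit by AIC, and keep the minimum-AIC solution.
scores = []
for lam in (50, 10, 2, 0.5, 0.1):
    h = lasso_ista(Phi, y, lam)
    n_active = int(np.sum(np.abs(h) > 1e-6))
    scores.append((aic(y, Phi @ h, n_active), lam, n_active))
best_aic, best_lam, n_act = min(scores)
print(best_lam, n_act)
```

As in Figure 11, overly large λ values degrade the fit, overly small ones keep nearly all coefficients, and the minimum-AIC λ sits at the accuracy/complexity trade-off point.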

5) RV_2 SGL RESULTS
Next, the bi-level SGL sparse estimation from Equation (8) is applied to the original RV_2 model. The SGL estimation is run over a predefined search grid of the sparsity parameters (λ_1, λ_2) using the SLEP package [57] with a tolerance of 10^−6, and the best values of (λ_1, λ_2) are found through the AIC score. Note that this task becomes more complex compared to LASSO, due to the additional SGL parameter. As already mentioned, the SGL criterion performs the sizing of the original model, selecting the sparse set of the most relevant sub-blocks, as well as the dominant regressors. Tables 5 and 6 show that both the RV_2 LASSO and RV_2 SGL models outperform the ACPR of the GMP model by 3 dB for similar DPD running costs (at the expense of a more costly estimation). For P_avg = −12 and −8 dBm, the active sets selected by SGL resulted in, respectively, 11 out of 37 and 18 out of 53 sub-blocks. Figure 12 illustrates, for P_avg = −12 dBm, the normalized PSD of the compensated output signals using the GMP, dense RV_2, RV_2 LASSO and RV_2 SGL models. From Table 5 and Figure 12, the sparse versions of the proposed model achieved better ACPR and NMSE performance than the GMP model for the same or lower numbers of coefficients. Moreover, Figure 13 (a) and (b) presents, respectively, the AM-AM gain (in dB) and AM-PM plots, and we can see that the techniques are able to compensate for the nonlinearities. This way, the large size of the dense RV_2 model is addressed using sparse techniques, leading to a more affordable model for practical implementation. Note that the model extraction using LASSO and SGL is more costly than with ordinary least squares, but it is performed off-line. In the next subsection, we compare the running complexity of the GMP, RV_2 LASSO and RV_2 SGL models.
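The bi-level selection behaviour of SGL can be illustrated through its per-sub-block proximal operator: elementwise soft-thresholding (regressor-level sparsity) followed by group soft-thresholding, which can zero an entire sub-block. The SLEP package implements the full solver; this is only a sketch of the underlying operator, with illustrative values.

```python
import numpy as np

def sgl_prox(g, lam1, lam2):
    # Proximal operator of the sparse-group penalty lam1*||h||_1 + lam2*||h||_2
    # for one sub-block g: within-block (elementwise) soft-thresholding,
    # then group soft-thresholding, which may remove the whole sub-block.
    mag = np.abs(g)
    s = g / np.maximum(mag, 1e-12) * np.maximum(mag - lam1, 0.0)   # L1 step
    norm = np.linalg.norm(s)
    return np.maximum(1.0 - lam2 / max(norm, 1e-12), 0.0) * s      # group step

blocks = [np.array([0.05, -0.02, 0.03]),     # weak sub-block: zeroed entirely
          np.array([1.2, 0.01, -0.6])]       # strong sub-block: kept, pruned
out = [sgl_prox(b, lam1=0.04, lam2=0.5) for b in blocks]
print(out)
```

The first sub-block survives the elementwise step only marginally and is then removed by the group step (sub-block selection), while the second sub-block is retained with its weakest regressor pruned (regressor selection), which is the bi-level behaviour exploited above.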

C. COMPLEXITY ANALYSIS
In this subsection, we consider the DPD computational complexity of the models, particularly the running cost. As in [68], the total DPD complexity can be classified into estimation, adaptation and running costs. The estimation and adaptation costs refer, respectively, to the DPD training and updating tasks, both performed off-line in a digital signal processor (DSP). The running cost refers to the real-time tasks of regression matrix generation and DPD filtering, performed in hardware. In [68], expressions for the running cost in floating-point operations (FLOPs) per sample are developed for several DPD behavioural models, based on the FLOPs required per operation, such as amplitude squared, complex-complex and complex-real multiplications, complex addition, and so on. This way, the GMP model running cost expressions (in FLOPs per sample) are: (a) regressors generation: 3 + (2J + 1)(K − 1); (b) DPD filtering: 8N_gmp − 2, where N_gmp is the number of GMP coefficients. Next, we derive running cost expressions in FLOPs per sample for the RV_2 LASSO and RV_2 SGL models, as follows: • In both cases, DPD filtering cost: 8N_model − 2.
• RV_2 LASSO regressors generation cost: where N^(k)_lasso is the number of kth-order polynomials. Note that each kth-order regressor involves (k − 3)/2 real multiplications plus one complex-real multiplication.
• RV_2 SGL regressors generation cost: where N^(k,l)_sb is the number of selected sub-blocks of a block (k, l) and d^(k,l)_sb is the length of each of these sub-blocks. In the proposed modular structure, all sub-blocks of a given block can reuse the same computations stored in memory, and the computation of higher-order sub-blocks partially reuses previously calculated ones. Table 7 compares the running complexities in FLOPs per sample of the GMP, RV_2 LASSO and RV_2 SGL models. We can conclude that, even with the RV_2 LASSO and RV_2 SGL models having approximately the same number of coefficients, the proposed block-modular structure is more advantageous for on-line DPD implementation, further reducing the required running cost.
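The GMP running cost expressions quoted above can be evaluated directly; the parameter values below (K = 7 and J = 2 as in Table 5, plus an assumed coefficient count of 90 for illustration) give a feel for the FLOPs-per-sample budget.

```python
def gmp_running_cost(K, J, n_coef):
    # GMP running cost in FLOPs per sample, per the expressions above:
    # regressor generation: 3 + (2J + 1)(K - 1); filtering: 8*N_gmp - 2.
    gen = 3 + (2 * J + 1) * (K - 1)
    filt = 8 * n_coef - 2
    return gen, filt

gen, filt = gmp_running_cost(K=7, J=2, n_coef=90)
print(gen, filt, gen + filt)   # 33 718 751
```

Note that filtering dominates: the cost scales linearly with the coefficient count, which is why the sparsity delivered by LASSO/SGL translates directly into on-line savings.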

VI. CONCLUSION
This paper proposed a flexible reduced-Volterra polynomial model that extends existing memory polynomial models, such as the MP and GMP, by including higher-dimensional terms in a controlled way, making it applicable to a wide range of PAs and DPDs. In our experimental validation, the sparse versions of the proposed model delivered better in-band and out-of-band performance than conventional models for the same or lower DPD running cost. The main advantages of the proposed model, compared to conventional models from the literature, are: flexibility, i.e., parameter values can be independently defined for each NL order; generality, i.e., higher polynomial dimensions can be added to the model as required; modularity, i.e., its block-modular structure allows group-wise sparse estimation (further reducing the running cost, as in Subsection V-C); and parsimony, i.e., the curse of dimensionality of Volterra-based models is avoided. Future work will consider a generalization of the proposed model for the joint compensation of NL PA and in-phase and quadrature (I/Q) modulator imbalance, and for beamforming and phased-array architectures.