Decoupling Channel Estimation for FDD Massive MIMO Systems Utilizing Joint Sparsity

Effective channel estimation for massive multiple-input-multiple-output (MIMO) systems based on frequency division duplex (FDD) protocol is a crucial problem. To reduce pilot overhead, we consider the joint sparsity of the multi-path channel in delay-angle domain. Based on the joint sparsity, we propose a decoupling pilot design scheme. In the proposed scheme, the pilot design is decoupled into two parts. In the first part, a global optimal selection greedy iterative algorithm (GOS-GIA) is proposed to obtain the near-optimal pilot subcarrier pattern. In the second part, a random Rademacher distribution pilot matrix is used as the angle-domain pilot matrix. Accordingly, a two-stage channel estimation strategy is also provided and analyzed. The first stage of the strategy is concerned with retrieving the positions of non-zero dominant taps in the delay domain. The second stage focuses on estimating the channel coefficients at these taps. Algorithm complexity analysis shows that the GOS-GIA can reduce the computational complexity at least one order of magnitude, and the proposed channel estimation strategy reduces the computational complexity by more than two orders of magnitude. Simulation results verify that the proposed pilot design scheme achieves better performance with lower pilot overhead.


I. INTRODUCTION
Massive multiple-input-multiple-output (MIMO) systems have attracted much attention in recent years, which can enhance the spectrum efficiency and energy efficiency by orders of magnitude [1], [2]. Due to these benefits depend on the accuracy of estimating the channel state information (CSI) at the base station, the estimation of CSI has become a great challenge [3].
TDD protocol based on channel reciprocity is usually applied to estimate the CSI of massive MIMO systems. By utilizing the channel reciprocity, downlink channel estimation can be realized through uplink training. The number of orthogonal pilot sequences required is proportional to the number of users instead of the number of BS antennas [1], [2], [4]. While, TDD massive MIMO systems also bring a lot of challenges, such as pilot contamination and reciprocity calibration of hardware chains.
The associate editor coordinating the review of this manuscript and approving it for publication was Zhen Gao . Compared to the TDD protocol, FDD protocol can provide better CSI estimation performance in terms of transmission delay, communication range, and mobility support [5]- [10]. The challenge of CSI estimation for FDD protocol is serious pilot overhead [11]. That is because the number of training pilots required for channel estimation increases rapidly with the number of BS antennas. How to effectively design pilots and estimate the CSI is essential to FDD massive MIMO systems.

A. RELATED WORK
The traditional pilot designs are based on orthogonal schemes, such as time-domain orthogonal pilot design, frequency-domain orthogonal pilot design, code-domain orthogonal pilot design, and time-frequency orthogonal pilot design [12]. All these orthogonal pilot designs require the number of pilots to be proportional to the number of BS antennas and the quantized channel delay spread. The pilot overhead is prohibitively high in massive MIMO systems. VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ To reduce the pilot overhead, some non-orthogonal pilot designs and channel estimation techniques based on channel statistics have been proposed for massive MIMO systems [9], [10], [13]- [16]. Based on the long-term channel statistics at the user side, [17] proposed an open-loop and closed-loop channel estimation strategy for massive MIMO by exploiting the practical MIMO channels correlation in time and space. An optimal pilot beam pattern design for channel estimation was proposed in [18] by considering the antenna correlation and temporal correlation. By utilizing the structure of the second-order statistics of the channel vectors and the correlation of users, the author [19] proposed a joint spatial division and multiplexing (JSDM) based channel estimation. In [20], [21], blind channel estimation schemes were proposed based on the covariance matrix of the received signals and subspace partitioning. Nevertheless, in these statistics based channel estimation schemes, the acquisition of the perfect long-term channel statistics is challenging in practice. And the pilot overhead is still very high.
To further reduce the pilot overhead, compressive sensing (CS) techniques have also been used to guide the pilot design and channel estimation by exploiting channel sparsity in recent works of literature. In [21]- [26], it was pointed out that the channel impulse response (CIR) in the delay domain tends to be sparse due to the limited local scatters at the BS. By exploiting the channel sparsity in the delay domain, one superimposed pilot design was employed to reduce pilot overhead in [22]. In [24], the authors presented a novel channel estimation approach that utilizes the sparsity and common support properties in the delay domain. The channel has also been explored to be sparse in the angle domain [12]. Paper [5], [27] considered the hidden joint common and individual sparsity in angle domain among the user channel matrices due to the shared local scatterers in the physical propagation environment. A nonorthogonal downlink pilot design was proposed in [8], [12], [14], [28] by exploiting the spatially common sparsity in angle domain and time correlation of massive MIMO channels. The channel sparsity in angle domain with partial support information was utilized in [29], [30] to acquire compressed CSI. However, all the works including the above-mentioned ones have rarely considered the joint sparsity in the delay-angle domain for massive MIMO systems. To the best of our knowledge, there are no related works of literature about the pilot design and channel estimation for massive MIMO systems with joint sparsity in the delay-angle domain.

B. CONTRIBUTIONS AND ORGANIZATION
In this paper, we explore the joint sparsity of multi-path massive MIMO channel in the delay-angle domain. By exploiting the joint sparsity, a decoupling pilot design scheme with low overhead and low complexity is proposed for FDD massive MIMO systems. In the proposed pilot design scheme, the pilot design is decoupled into two parts. In the first part, a global optimal selection greedy iterative algorithm (GOS-GIA) is developed to obtain the near-optimal pilot subcarrier pattern. The proposed GOS-GIA algorithm is based on the mutual incoherence property (MIP) of the pilot matrix. In the second part, a random Rademacher distribution pilot matrix is used as the angle-domain pilot matrix. Based on the proposed pilot design scheme, one corresponding two-stage channel estimation strategy is provided. The strategy is based on CS with exploiting the joint sparsity in the delay-angle domain. The first stage of the strategy is concerned with retrieving the positions of non-zero dominant taps in the delay domain. To retrieve the non-zero taps, one block subspace pursuit (B-SP) algorithm is developed by exploiting the sparse common support (SCS). The second stage focuses on estimating the channel coefficients at these taps. One parallel subspace pursuit (P-SP) algorithm is developed to estimate the coefficients. Additionally, we analyze the algorithm complexity and system performance of the proposed pilot design scheme and the proposed channel estimation. Algorithm complexity analysis shows that the GOS-GIA can reduce the computational complexity at least one order of magnitude, and the proposed channel estimation strategy reduces the computational complexity by more than two orders of magnitude. Simulation results verify that the proposed scheme achieves better performance with lower pilot overhead.
The contributions of this work are summarized as follows: • The decoupling pilot design scheme: By exploiting joint sparsity in the delay-angle domain of massive MIMO systems, we propose a decoupling pilot design scheme with low overhead and low complexity.
• The novel pilot subcarrier pattern design algorithm: One novel GOS-GIA algorithm is developed based on the MIP of the pilot matrix. The algorithm can obtain the near-optimal pilot subcarrier pattern with low complexity.
• The corresponding channel estimation strategy: Based on the proposed pilot design scheme, one corresponding two-stage channel estimation strategy is provided. The proposed strategy is based on CS with exploiting the joint sparsity in the delay-angle domain.
• Better performance with algorithm complexity analysis and simulation: Algorithm complexity analysis shows that the proposed channel estimation strategy reduces the computational complexity by more than two orders of magnitude. Simulation results verify that the proposed scheme achieves better performance with lower pilot overhead. The rest of this paper is organized as follows. In Section II, we explore the system model and joint sparsity of multi-path massive MIMO channel in the delay domain and angle domain. Section III explore the pilot decoupling scheme. The pilot design and the two-stage channel estimation strategy are proposed in Section IV and V, respectively. Section VI and VII provide algorithm complexity analysis and numerical simulation of the proposed pilot design scheme and the proposed channel estimation strategy. Conclusions are drawn in Section VIII. Notations: Capital and small bold letters represent matrices and vectors respectively. The transpose, conjugate and Hermitian transpose of the matrix are denoted by (·) T , (·) * and (·) H , respectively. · F and · 2 denote the Frobenius norm and the second-order norm respectively. C M ×N is the set of complex matrices with M rows and N columns. CN (µ, σ 2 ) denotes the circular symmetric complex Gaussian distribution with mean µ and standard deviation σ . vec(X ) denotes the vectorization of the matrix X formed by stacking the columns of X into a single column vector. A\B denotes the difference of sets A and B.

II. SYSTEM AND CHANNEL MODEL
In this paper, we consider a cell from a massive MIMO cellular system. In the cell, a BS equipped with M antennas simultaneously serves K single-antenna users on N subcarriers. We consider FDD downlink systems, where each frame includes the pilot training phase, feedback phase, and data transmission phase. In the pilot training phase, the BS transmits pilot sequences to users to estimate channel vectors. Then the estimated channel vectors are fed back to the BS in the feedback phase. In the data transmission phase, the BS uses the estimated channel matrix for precoding. In this paper, we focus on the pilot training phase.
To simplify the notations, in the sequel we take an arbitrary user in the cell as an example to introduce the channel model and different representations. In this paper, the block-fading channel is assumed, where the channel gain remains static in each frame but may vary from one to another. Each frame contains R consecutive OFDM symbols.
The multi-path channel is considered. There are L channel paths between the transmit antennas and the user. The CIR of the l-th path between the m-th transmit antenna and the user is denoted as h l,m . Then, the channel matrix between BS antennas and the user is Due to the limited scattering in the propagation environment of typical cellular massive MIMO systems with tower-mounted BS, the channels are highly correlated in both frequency and space domains. To explore the correlation of H, we discuss the sparsity of the channel matrix in the delay domain and angle domain respectively.

1) CHANNEL SPARSITY IN DELAY DOMAIN
The CIRs of L paths between the m-th transmit antenna and the user are given by h m is the delay channel vector, and h m is the m-th column of the channel matrix H. For multi-path channels, numerous experimental results and theoretical analyses have confirmed that h m is sparse in delay domain [22]- [26], [31]- [33].
Meanwhile, the CIRs measured at the user from different antennas share a sparse common support (SCS) [34].
The support of h m is written as Since different antennas share the SCS channel, the positions of the non-zero elements of h m are the same for m = 1, 2, · · · , M , i.e., and where T is the number of no-zero entries of all L quantized channel paths, which stands for the number of resolvable paths in the analog domain. As the channel sparsity in the delay domain, L T .

2) CHANNEL SPARSITY IN ANGLE DOMAIN
The l-th path of the channel between BS antennas and the certain user can be denoted as And g l is the l-th row of the channel matrix H.
According to the spatial channel model (SCM) in [33], the channel vector g l is expressed as where β l,q denotes the complex gain; θ l is the mean angle of arrival (AoA) of the l-th path; Q is the number of subpaths within each path; θ l,q is the random angular deviation; and a(θ) is the array response vector. The channel vector g l can be modeled using the virtual angular representation [35] where A H l ∈ C M ×M is the unitary matrix representing the transformation matrix of the virtual angular domain at the BS side; and g a l ∈ C 1×M is the angle domain channel vector.

VOLUME 8, 2020
The p-th entry of g a l is non-zero indicates that there is a spatial path from the p-th transmit direction of the BS to the user. As the BS is usually elevated high with few scatterers around, the angle spread at the BS side is small. As indicated by the experimental study [36], since the angle spread is limited at the BS, only a few of the entries in g a l show a significant magnitude. That's to say, g a l is in general sparse [5], [12], [36]- [38].
Denote S l as the support of vector g a l , i.e., S l = {m : |g a l | > 0}.
As the angle spread for different paths is similar, we assume that The unitary matrix A l is determined by the geometrical structure of the BS antenna array. When the BS antenna array is uniform linear array (ULA) or uniform planar array (UPA), the unitary matrix A l becomes a discrete Fourier transform (DFT) matrix [35], [37]. To illustrate the basic principle of the proposed pilot design scheme, in the sequel we consider ULA at the BS. Then, the array response vector can be expressed as where λ is the wavelength and d is the antenna spacing. And where F M is a standard M -point DFT matrix. Combining with Section II-.1 and Section II-.2, the channel matrix H is joint sparse in the delay-angle domain. And it can be expressed as where H a = [g aT 0 · · · g aT l · · · g aT L−1 ] T ∈ C L×M is the delay-angle domain channel matrix. The matrix H a is joint sparse in row and column, and rows are block-sparse.

III. PILOT DECOUPLING
In downlink massive MIMO systems, the amount of pilot overhead, which is proportional to the number of BS antennas, makes channel estimation difficult to implement. However, the sparse structure of H a makes low overhead channel training feasible by exploiting CS techniques. By exploiting the joint sparsity of H a in the delay-angle domain, we propose a decoupling pilot design scheme to reduce the pilot overhead.

A. CHANNEL TRAINING
As we mentioned before, the block-fading channel is assumed [39]- [42]. Each frame contains R consecutive OFDM symbols, in each frame, G (G ≤ R) successive OFDM symbols are used to transmit pilots. And pilots are inserted into subcarriers, in each OFDM symbol, P (P ≤ N ) subcarriers are assigned to transmit pilots and the remaining N − P subcarriers are still assigned to transmit data.
As all BS antennas use identical subcarriers to transmit pilot at the same time, the received pilot signal y(i) ∈ C P×1 at the user in i-th OFDM symbol is expressed as where P = I N (P ζ , :) is a P × N binary subcarrier selection matrix with only one ''1'' in each row and PP T = I P ; and the pilot subcarrier pattern is the index set of pilot subcarriers in i-th OFDM symbol.
is the independent and identically distributed (i.i.d.) additive Gaussian white noise.

B. CHANNEL ESTIMATION PROBLEM FORMULATION
We collect all the G OFDM symbols together to estimate the channel matrix. The collected received signal matrix can be expressed as where Y = [y(1), · · · , y(g), · · · , y(G)] ∈ C P×G is the collected received signal matrix of all the G OFDM symbols.
By exploiting the channel sparsity, CS-based channel estimation with non-orthogonal pilot design can provide the efficient compression and reliable recovery of the sparse channel matrix. The CS-based channel estimation problem can be formulated as follows: The pilot matrix Y should be designed for channel estimation. The pilot overhead is characterized by P and G. To reliably estimate the channel, some traditional orthogonal pilot designs are usually adopted, such as time-domain orthogonal pilot design, frequency-domain orthogonal pilot design, and time-frequency orthogonal pilot design [12]. Time-domain orthogonal pilot design requires G ≥ M and P ≥ L, frequency-domain orthogonal pilot design requires G = 1 and P ≥ ML and time-frequency orthogonal pilot 81554 VOLUME 8, 2020 design requires PG ≥ ML. In general, all orthogonal pilot designs require PG ≥ ML. Massive MIMO systems suffer from prohibitively high pilot overhead with orthogonal pilot design. CS-based channel estimation with non-orthogonal pilot design can reduce the pilot overhead. In order to ensure channel estimation accuracy with the lowest pilot overhead, the pilot matrix Y should be optimized.
Remark 1 (Reduce Algorithm Complexity): As the size of the pilot matrix Y is very large, the joint optimization of Y is a challenge. It's difficult to implement with enormous algorithm complexities. To reduce the pilot overhead and algorithm complexity, we propose a decoupling pilot design scheme to design Y.

C. PROPOSED DECOUPLING DOWNLINK PILOT DESIGN SCHEME
In the proposed pilot design scheme, different BS antennas use P completely identical frequency-domain subcarriers to transmit pilots over G successive OFDM symbols. By exploiting the joint sparsity of the channel matrix H in the delay-angle domain, non-orthogonal pilots are considered for CS-based channel estimation.
Proposition 1 (Decoupling Downlink Pilot Design Scheme): If we assume that transmitted power vectors of all BS antennas are the same, we can decouple the large scale size pilot matrix Y ∈ C P×MLG into two small size matrices and , where And the channel estimation problem can be rewritten as Proof: If we assume that transmitted power vectors of all BS antennas are the same, the pilot signal x m (i) can be divided into two parts: where s m (i) ∈ C 1×1 is normalized angular pilot signal transmitted by the m-th antenna in i-th OFDM symbol. And φ ∈ C P×1 is transmitted power vector over the pilot subcarrier pattern P L ζ . According the decoupling expression (21), s m (i) is independent of the pilot subcarrier index ζ p but depends on the antenna index m and the OFDM symbol index i. While, φ and P are independent of m and i.
Then X i can be simplified as where s(i) = [s 1 (i), · · · , s m (i), · · · , s M (i)] T . The equation (16) can be rewritten as By using the matrix equations property of Kronecker product ( [43], Lemma 4.3.1), vec(AXB) = (B T ⊗ A)vec(X), the above formula can be changed as The binary subcarrier selection matrix P, the vector φ and the matrix S should be designed. In formula (24), diag{φ}P is on the left side of the channel matrix H a , and F M S is on the right side of H a .
In summary, we have decoupled the large scale size pilot matrix Y ∈ C P×MLG into two small size matrices and , where and By decoupling the large scale size pilot matrix Y into two small size matrices and , instead of design the pilot matrix Y, we can explore the design schemes of pilot matrices and respectively. As the sizes of and are much less than Y, the computational complexity of pilot design can be greatly reduced.

IV. PILOT DESIGN
In this section, we explore the design schemes of pilot matrices and respectively.

A. DESIGN OF PILOT MATRIX
First of all, we explore the design scheme of the pilot matrix . The matrix Y is rewritten as where G a = H a H ∈ C L×G is block-sparse with only T rows are non-zero.
The measurement matrix plays an important role in the sparse signal recovery. Various results have been proposed about what conditions the measurement matrix should satisfy to guarantee a robust signal recovery. Restricted isometry property (RIP) [44] and mutual incoherence property (MIP) [45] are widely used. The MIP condition is stronger than the RIP in that the MIP implies the RIP but the converse is not true [45]. In this paper, we consider the measurement matrix based on MIP.
The MIP of the measurement matrix is defined as the largest normalized inner product between any two columns of , and it is shown as where l 1 and l 2 are the column indexes of . VOLUME 8, 2020 According to the MIP condition in [45], the recovery performance of sparse signals is better with smaller MIP. is determined by P ζ and φ. So, we should design P ζ and φ to minimize the MIP, and the problem becomes arg min We assume that the transmit power of the pilot symbols among all subcarriers is equal, i.e., Specifically, we design where ψ p has the i.i.d. uniform distribution in (−π, π], namely, the i.i.d U(−π, π]. Let r = l 2 − l 1 . Then, µ( ) is simplified as where is the mutual correlation between two columns of when the difference between the two column index is r. From (33), µ( ) only depends on P ζ and it can be denoted as the function P ζ µ( ) η(P ζ ).
As we assume that the transmit power of the pilot symbols among all subcarriers is equal, the problem of minimizing MIP can be translated into the pilot subcarrier pattern optimization problem. The pilot subcarrier pattern P ζ can be designed to minimize the MIP, i.e., There are some pilot subcarrier patterns have been proposed for the problem (36). Intuitively, by exhaustively searching over all possible pilot patterns, the optimal pilot pattern can be obtained. But it is computationally prohibitive to search from all N P candidates when N and P are large. For example, if N = 1024 and P = 12, there are 1024 12 = 2.6 × 10 27 different pilot patterns. Random pilot subcarrier pattern and uniform pilot subcarrier pattern are widely adopted [22], [30]. Unfortunately, random pilots and uniform pilots cannot guarantee the performance in practical systems [46]. In [23], it has been shown that the pilot subcarrier pattern generated from the cyclic different set (CDS) is optimal. Nevertheless, the CDS only exists for some specific pair of (N , P). Some iterative algorithms were given in [23], [46]- [49] to obtain near-optimal pilot subcarrier patterns. However, these iterative algorithms still need high algorithm complexities.
In this paper, based on the MIP of , we propose a greedy-iterative algorithm to obtain near-optimal pilot patterns for any given pair of (N , P) and for any value of L. The algorithm consists of two layers of loops. And it is based on a greedy search, which searches for the near-optimal pilot pattern with two loops of iterations. The main idea of the proposed algorithm is listed as follows: Step 1: In the outer loop, we randomly generate pilot patterns as the initializations of the inner loop.
Step 2: In the inner loop, we iteratively update the resulting pilot pattern in a greedy manner. Different from the two stochastic search schemes in [23], in each iteration of the inner loop, we select the worst subcarrier and best subcarrier respectively, and the worst subcarrier is replaced by the best subcarrier.
Step 3: The pilot subcarrier pattern with the minimum MIP is selected from the results of inner loops. Given the maximum numbers of iterations for the outer loop, i.e. N 1 , the global optimal selection greedy iterative algorithm (GOS-GIA) is described as follows.
• In each iteration of the outer loop, we randomly generate a pilot subcarrier pattern P t ζ as the initialization of the inner loop. And with N = {1, 2, · · · , N }.
• In each iteration of the inner loop, we select the worst subcarrier and best subcarrier respectively, and the worst subcarrier is replaced by the best subcarrier.
-First of all, we aim to delete one pilot subcarrier from P t ζ . According to the greedy idea, we should delete the worst pilot subcarrier index ζ d p from P t ζ , which results in the minimum MIP with pilot pattern P t ζ \ζ p , i.e., η(P t ζ \ζ p ).
-Then, we aim to add one pilot subcarrier from set N \P t ζ . The best pilot subcarrier index ζ * in set N \P t ζ , which results in the minmial MIP with pilot pattern {P t ζ \ζ d p } ∪ {ζ }, is given as • With N 1 outer loop iterations, we obtain N 1 optimized inner loop results, from which we select the one with the minimum MIP as the final pilot pattern. The detailed algorithm is presented in Algorithm 1. As the inner loop is a greedy iteration, when η(P t ζ ) > η(P * ζ ), it means the inner loop has converged. For the outer loop, initialization has a great impact on the result of the inner loop. So, we repeat the inner loop N 1 times. For each initial pilot pattern in the outer loop, we obtain a corresponding optimized pilot pattern. With N 1 outer loop iterations, we then obtain N 1 optimized results, from which we select the one with the minimum MIP as the final result. Initialization: Outer iteration: Repeat the next steps N 1 times. 1) Outer iteration initialization: • Select the worst pilot subcarrier: Obtain ζ d p according to (38).
• Update the pilot subcarrier pattern: • Repeat the above steps until η(P t ζ ) > η(P * ζ ) 3) Update the best pilot subcarrier pattern : if η(P * ζ ) < η(P ζ ) P ζ = P * ζ B. DESIGN OF PILOT MATRIX Then, we explore the design scheme of pilot matrix in this section. The received signal matrix Y is rewritten as where = S T F M ∈ C G×M is the angle-domain measurement matrix; and H a = H aT (diag{φ}PF L N ) T ∈ C M ×P is sparse with S non-zero entries in each column.
The measurement matrix is also designed based on MIP. And, we have It implies that the MIP only depends on the S. Then, the optimal S is given by There are also many results for the problem (42). Reference [50] proves the measurement matrix with its elements following the i.i.d. complex Gaussian distribution enjoys a satisfying performance in compressing and recovering sparse signals. The measurement matrix with its elements following the Rademacher distribution is also commonly adopted due to its simplicity, and it also enjoys a satisfying performance. In this paper, the Rademacher distribution is adopted and S is i.i.d. drawn from {− √ 1/M , √ 1/M } with equal probability.

V. CORRESPONDING CHANNEL ESTIMATION
With the received compressive measurement Y at the user, one corresponding channel estimation strategy is provided in this section. The channel estimation can be processed either at the user side or at the BS side. If the channel estimation is processed at the user side, the user needs to have prior knowledge about P ζ , φ and S. To reduce the computational complexity and the storage capacity at the user side, we process the channel estimation at the BS side. The user only needs to feed the compressive measurements Y directly back to the BS. Then the BS should recover the channel matrix H a from the compressive measurements Y with the prior knowledge of P ζ , φ and S.
The problem of channel estimation is formulated as The channel matrix H a is joint sparse in row and column, and rows are block-sparse. According to the multidimensional compressive sensing based algorithm [51], the twodimensional problem can degenerate into an one-dimensional problem, which is expressed as in which y = vec(Y) ∈ C P×G , = ⊗ ∈ C PG×ML , h a = vec(H a ) ∈ C ML×1 and z = vec(Z).
As the size of PG × ML is very large, the complexity of the CS-based algorithm will be very enormous. To solve the channel estimation problem (43), we propose one two-stage channel estimation strategy with low algorithm complexity. In the proposed strategy, the large-size problem is decomposed into two small-size problems.
The main idea of the two-stage channel estimation strategy is listed as follows: Stage 1: The first stage is concerned with retrieving the positions of non-zero dominant taps in the delay domain through exploiting the SCS channel. Stage 2: The second stage focuses on estimating the channel coefficients at these taps through exploiting the sparsity in the angle domain.

A. FIRST STAGE
In the first stage, we estimate the positions of non-zero dominant taps in delay domain. The estimated problem can be formulated as The problem in (45) is a CS problem with the signal matrix G a is block-sparse. Only T rows of G a are non-zero. Many CS-based algorithms can be adopted to estimate G a , such as orthogonal matching pursuit (OMP), subspace pursuit (BP). In this paper, by exploiting the SCS of the channel, one Block-BP algorithm is applied. The detail of Block-BP algorithm is presented in Algorithm 2. Note that there are three stopping criterions for iteration, R t+1 F > R t F , R t+1 F < γ th and t > N 3 . The first criterion means the iteration has converged. The second criterion means the estimation error is small enough, and the last criterion guarantees the maximum computation time. Initialization: • Initialize the chunk support T 0 = ∅. • Initialize the residue matrix R 0 = Y. repeat • Update the support as follows: • Estimate the signal with LS, i.e., G a p ( T t+1 , :) = ( (:, T t+1 )) † Y, G a p ({1, · · · , L}\ T t+1 , :) = 0.
• Update the residue, i.e., In the second stage, we would like to recover H a from G a . We assume G a = G a + E, where E is the estimation error. The set of the non-zero rows of G a is denoted as T , i.e., T = {l : | G a (l, :)| 2 > 0, l = 1, 2, · · · , L}. We define a T × L binary selection matrix T = I L ( T , :) with only one ''1'' in each row and TT H = I T . Then, we need to estimate H a based on T G a . According to G a = H a T , we have where Y = (T G a ) T ∈ C G×T , X = (TH a ) T ∈ C M ×T and Z = −(TE) T . Then the estimation problem can be expressed as The problem in (47) is also a CS problem with X is sparse. Each column of X has S non-zero entries. In this paper, one Parallel-BP algorithm is applied to exploit the sparse structure. The detail of the Parallel-BP algorithm is presented in Algorithm 3. The stopping criterions in this algorithm is similar to Algorithm 2.
Then, we can acquire the estimation of the channel H

VI. COMPUTATIONAL COMPLEXITY ANALYSIS
In this section, we analyze the computational complexities of the proposed decoupling pilot design scheme and the corresponding channel estimation strategy.

A. ALGORITHM COMPLEXITY OF THE PROPOSED DECOUPLING PILOT DESIGN SCHEME
The computational complexity of the proposed decoupling pilot design scheme mainly depends on Algorithm 1. In each inner iteration of Algorithm 1, the computational complexity is mainly determined by the searching for the worst index ζ d p in (38) and the best index ζ * in (39). The searching for the worst index ζ d p has the complexity on the order of O(L + P). The searching for the best index ζ * has the complexity on the order of O((N − P)LP) by directly computing (39).
The best pilot subcarrier index ζ * in set N \P t ζ , which results in the minimal MIP with pilot pattern is the mutual correlation between two columns of with pilot pattern {P t ζ \ζ d p } ∪ {ζ } when the difference between the two column index is r. It means min Then, the ζ * can be computed by one-step iteration, which is shown in the following where • repeat -Update the support as follows: -Estimate the signal with LS, i.e., -Refine the support as follows: -Update the signal with LS, i.e.,

end if end for
The (50) is a one-step iteration with the complexity on the order of O(1). Hence, the computational complexity of searching for the best index ζ * has been reduced to the order of O((N − P)L). The inner iteration number is denoted as L 2 , and it's determined by the speed of convergence of the inner iteration. The outer iteration number is N 1 .
Then the computational complexity of Algorithm 1 is the order of O(N 1 L 2 (N − P)L). We discuss the selection of the outer iteration number N 1 and the maximum number of inner iteration N 2 in Section VII. From the result, it can be seen that the convergence speed of the inner iteration and outer iteration is very fast, and the values of N 1 and N 2 can be less than 10. Algorithm complexities of different pilot subcarrier pattern design algorithms are indicated in Table 1. The two-layer greedy iteration algorithm (TLGIA) is the result of our previous work in [52], and the SSS algorithm is the result of [23]. Overall, the proposed pilot design scheme has lower computational complexities. It can reduce the computational complexity at least one order of magnitude compared to the SSS algorithm. For example, when N = 1024, P = 12, and L = 73, our proposed GOS-GIA algorithm has almost the same algorithm complexity as the TLGIA algorithm, but at least 10 times less than the SSS algorithm. Overall, our proposed two-stage channel estimation strategy has the computational complexity in the order of O(PGT 3 +GTS 3 ). If the channel estimation is done by degenerating the two-dimensional problem as a one-dimensional problem, as shown in (44), the computational complexity is the order of O(PG(TS) 3 ).
The proposed two-stage channel estimation strategy reduces the computational complexity by the order of magnitude O(min(S 3 , PT 2 )). Obviously, the proposed two-stage channel estimation strategy reduces the computational complexity by more than two orders of magnitude. For example, when S = 12, P = 12, and T = 6, our proposed two-stage channel estimation strategy reduces the computational complexity by more than 400 times.

VII. SIMULATION RESULTS
In this section, a simulation study is carried out to investigate the performance of the proposed decoupling pilot design scheme and the corresponding two-stage channel estimation strategy. We consider typical massive MIMO OFDM systems [12], [32], [33], [53] with the basic simulation parameters listed in Table 2. The number of simulation for obtaining VOLUME 8, 2020   the figures is 1000. We keep these parameters in the following simulation unless otherwise stated.

A. PERFORMANCE OF THE PROPOSED PILOT PATTERN DESIGN
In this section, we analyze the effect on the MIP of the inner iteration and the outer iteration in Algorithm 1. And we discuss the selection of the inner iteration number and the outer iteration number.
In Fig. 2, we compare the average MIP of different pilot subcarrier patterns as the function of the outer iteration number N 1 in Algorithm 1. The curve of random generation represents the random generation pilot subcarrier pattern. The curve of one-layer iteration represents the pilot subcarrier pattern result of the inner iteration in Algorithm 1. The curve of GOS-GIA represents our proposed pilot subcarrier pattern result of Algorithm 1. The TLGIA is the result of our previous work in [52], and the SSS algorithm is the result of [23]. In this simulation, the maximum number of inner iterations N 2 = 10, the subcarrier overhead P = 12 and  SNR = 20 dB is adopted. From this figure, it shows that there is a great gap between the MIP of the random generation and the one-layer iteration result. This phenomenon means that the inner iteration can reduce the MIP significantly compared with the random pilot subcarrier pattern. Compared with the SSS algorithm, our proposed GOS-GIA algorithm has a little performance loss. This is because our algorithm greatly reduces the algorithm complexity. The curve of the proposed algorithm is almost converged when N 1 ≥ 6, and the convergence speed of the outer iteration is very fast. Fig. 3 describes the average number of inner iteration under various subcarrier overhead P. It approximately increases linearly with the subcarrier overhead P. Compared with the SSS algorithm in [23], our proposed algorithm greatly reduces the number of inner iteration. Our proposed algorithm has also a fewer number of inner iteration compared with our previous work in [52]. As the average number of iteration is a very small value, the convergence speed of the inner iteration is very fast and the algorithm complexity of the pilot design is very low. Fig. 4 compares the average MIP of different pilot subcarrier patterns under various subcarrier overhead P. It is intuitive that the MIP decreases with the overhead P. Compared with the random pilot pattern, the one-layer iteration reduces the subcarrier overhead by 40%, and our proposed algorithm achieves similar MIP performance with half of the subcarrier overhead P. This means the inner iteration and the outer iteration in Algorithm 1 are effective to reduce the pilot overhead. Compared with the SSS algorithm, our proposed algorithm has a slight loss of performance, but greatly reduces the number of inner iteration.

B. PERFORMANCE OF THE PROPOSED CHANNEL ESTIMATION STRATEGY
In this section, we investigate and analyze the performance of the proposed channel estimation strategy.
We compare the mean square error (MSE) of the proposed channel estimation strategy with different pilot subcarrier patterns. Random pilot subcarrier patterns, the proposed GOS-GIA pilot patterns, and pilot patterns with the SSS algorithm are considered. We also take LS channel estimation as a reference. In the LS channel estimation, the sparse support in the delay-angle domain is assumed fully known in advance at the receiver side.
In Fig. 5, we show the channel estimation MSE performance under various subcarrier overhead P. In this simulation, G = 48 and SNR = 20dB are adopted. Compared with the random pilot subcarrier pattern, our proposed GOS-GIA pilot subcarrier pattern design achieves significant MSE performance gain when P < 16. This is because the proposed pilot subcarrier pattern design has better MIP performance. When P ≥ 16, the MSE curves tend to unchanged. The reason is the detection probability in the delay domain approximate to zero, and the MSE performance is mainly determined by the estimation error in the angle domain which is independent with P. Compared with the pilot patterns with the SSS algorithm and the referenced LS channel estimation, our proposed channel estimation with the proposed pilot pattern has almost the same performance. This means the proposed channel estimation and the proposed pilot pattern are near-optimal for various P. Fig. 6 shows the channel estimation MSE performance under various signal to noise ratio (SNR). In this simulation, P = 12 and G = 48 are adopted. Compared with the  random pilot subcarrier pattern, our proposed GOS-GIA pilot subcarrier pattern design achieves significant MSE performance gain. The gain increases with the SNR. This is because when SNR is high, the influence of noise weakens and the pilot pattern becomes the dominant factor. Compared with the pilot patterns with the SSS algorithm and the referenced LS channel estimation, our proposed channel estimation with the proposed pilot pattern has almost the same performance. This means the proposed channel estimation and the proposed pilot pattern are near-optimal for various SNR.
In Fig. 7, we show the channel estimation performance under various OFDM symbols overhead G. In this simulation, P = 12 and SNR = 20dB are adopted. Compared with the random pilot subcarrier pattern, our proposed GOS-GIA pilot subcarrier pattern design can achieve significant MSE performance gain. The performance gain increases with G. This is because when G is small, the MSE is mainly influenced by the estimation error in the angle domain, and when G is large, the MSE is mainly influenced by the estimation error in the delay domain. And the proposed GOS-GIA pilot pattern is in the delay domain. Compared with the pilot patterns with the SSS algorithm, our proposed channel estimation with the proposed pilot pattern has almost the same performance. This means the proposed channel estimation and the proposed pilot pattern are near-optimal for various G. Compared with the referenced LS channel estimation, our proposed channel estimation with the proposed GOS-GIA pilot pattern has almost the same performance when G ≥ 40, while there exists a large performance gap when G is small. The reason is the sparse support detection error in the angle domain has a great effect on performance.

VIII. CONCLUSION
This paper has addressed the issue of pilot design for FDD massive MIMO systems by utilizing the joint sparsity in the delay-angle domain. Based on the joint sparsity, a decoupling pilot design scheme has been proposed, in which the pilot design has been decoupled into two domains. In the delay domain, a GOS-GIA algorithm has been proposed to obtain the near-optimal pilot subcarrier pattern. In the angle domain, a random Rademacher distribution pilot matrix has been adopted. At the receiver, the corresponding channel estimation strategy with two-stage has been provided. Algorithm complexity analysis has shown that the GOS-GIA can reduce the computational complexity at least one order of magnitude, and the proposed two-stage channel estimation strategy reduces the computational complexity by more than two orders of magnitude. Simulation results have verified the proposed decoupling FDD pilot design scheme with the corresponding channel estimation strategy achieves better performance with lower pilot overhead compared with the random pilot pattern. He is currently a Full Professor with the Department of Electronic Engineering and Information Systems, USTC. His research interests include wireless communication, microwave and millimeter, and radar technology. He is a member of the Committee of Optoelectronic Technology, Chinese Society of Astronautics. VOLUME 8, 2020