Basis Pursuit Denoising via Recurrent Neural Network Applied to Super-Resolving SAR Tomography

Finding sparse solutions of underdetermined linear systems commonly requires solving an $L_{1}$-regularized least-squares minimization problem, also known as basis pursuit denoising (BPDN). BPDN is computationally expensive because it cannot be solved analytically. An emerging technique known as deep unrolling provides a good combination of the descriptive ability of neural networks, explainability, and computational efficiency for BPDN. Many unrolled neural networks for BPDN, e.g., the learned iterative shrinkage thresholding algorithm and its variants, employ shrinkage functions to prune elements with small magnitude. Through experiments on synthetic aperture radar tomography (TomoSAR), we discover that the shrinkage step leads to unavoidable information loss in the dynamics of the network and degrades the performance of the model. We propose a recurrent neural network (RNN) with novel sparse minimal gated units (SMGUs) to solve the information loss issue. The proposed RNN architecture with SMGUs benefits from incorporating historical information into the optimization and thus effectively preserves full information in the final output. Taking TomoSAR inversion as an example, extensive simulations demonstrate that the proposed RNN outperforms the state-of-the-art deep learning-based algorithm in terms of super-resolution power and generalization ability. It achieves a 10%–20% higher double-scatterer detection rate and is less sensitive to the phase difference and amplitude ratio between scatterers. A test on real TerraSAR-X spotlight images also shows a high-quality 3-D reconstruction of the test site.


A. Motivation
Sparse solutions are ordinarily desired in a multitude of fields, such as radar imaging, medical imaging, and acoustic signal processing. Compressive sensing theory states that, in the absence of noise, the exact solution is the signal with the minimum $L_{0}$-norm that still fulfills the forward model. As $L_{0}$-norm minimization is NP-hard, it is often relaxed to $L_{1}$-norm minimization. The unconstrained form for a linear system can be formulated as follows:
$$\min_{x} \; \|Ax - b\|_{2}^{2} + \lambda \|x\|_{1}, \tag{1}$$
where $A$, $x$, and $b$ are the sensing matrix, the signal to be retrieved, and the measurements, respectively, and $\lambda$ is the regularization parameter. Solving Eq. (1) is an unconstrained convex optimization problem whose objective function is non-differentiable. It is also known as basis pursuit denoising (BPDN) [1]. In the field of remote sensing, sparse signals are widely expected. Therefore, BPDN is broadly employed to exploit the sparsity prior in various remote sensing applications, including but not limited to pan-sharpening [2], spectral unmixing [3], microwave imaging [4], and synthetic aperture radar tomography (TomoSAR) [5]. In this work, we focus on addressing BPDN in TomoSAR inversion, but our findings are applicable to general sparse reconstruction problems in other fields as well. Generic solvers for BPDN are either first- or second-order compressive sensing (CS) [6], [7], [8] based methods. First-order methods are typically based on a linear approximation of the gradient, e.g., the iterative shrinkage thresholding algorithm (ISTA) [9], coordinate descent (CD) [10], and the alternating direction method of multipliers (ADMM) [11]. Second-order methods usually perform much better than first-order methods. An example of a second-order method is the primal-dual interior-point method (PDIPM) [12]. It was demonstrated in [13], [5] that CS-based methods achieve unprecedented super-resolution ability and location accuracy compared to conventional linear algorithms [14], [15]. In spite of their good performance, CS-based methods often suffer from a heavy computational burden due to their iterative nature and are hard to extend to practical use.
In the past years, the advent of deep neural networks has attracted the interest of many researchers and triggered extensive studies due to their excellent learning and expression power. Deep neural networks have demonstrated their applicability and advanced the state of the art for many problems. More recently, an emerging deep learning technique coined deep unfolding [16] was proposed to provide a concrete and systematic connection between iterative physical-model-based algorithms and deep neural networks. Inspired by this concept, various neural networks were proposed to solve BPDN in CS problems by unrolling iterative CS solvers. The first work on deep unfolding dates back to the learned iterative shrinkage thresholding algorithm (LISTA) [17], which was designed for sparse recovery. LISTA unrolls ISTA, one of the most popular algorithms, and substantially improves the computational efficiency and parameter tuning. [18] proposed ADMM-CSnet by unrolling the ADMM algorithm into a deep hierarchical network architecture and applied ADMM-CSnet to magnetic resonance imaging (MRI) and natural image CS. Results in [18] indicate that ADMM-CSnet performs favorably at high computational speed. For remote sensing applications, CSR-net [19] was proposed by combining deep unfolding structures and convolutional neural network modules and achieved fast and accurate 3-D microwave imaging. In addition, [20] proposed AF-AMPNet by unrolling approximate message passing with phase error estimation (AF-AMP) into a deep neural network. AF-AMPNet was employed in sparse aperture (SA) inverse SAR (ISAR) imaging and accelerated the imaging process. Inspired by the encouraging achievements of deep unfolding, the TomoSAR community started to design deep neural networks by unrolling iterative optimization solvers for solving BPDN in TomoSAR inversion. [21] unrolled and mapped vector AMP (VAMP) [22] into a neural network (L-VAMP) for line spectral estimation and applied it to TomoSAR inversion. Results in [21] show that L-VAMP is able to separate overlaid scatterers. γ-Net was proposed in [23] by tailoring the complex-valued LISTA network. γ-Net introduced the weight coupling structure [24] and the support selection scheme [24] to each iteration block in LISTA and replaced the conventional soft-thresholding function with a piecewise linear function. It was demonstrated in [23] that γ-Net improves the computational efficiency by 2 to 3 orders of magnitude compared to the state-of-the-art second-order TomoSAR solver SL1MMER [13] while showing no degradation in super-resolution ability and location accuracy.
arXiv:2305.14209v1 [eess.SP] 23 May 2023
However, unrolled neural networks do not consider historical information in their updating rules. To be exact, the output of each layer is generated exclusively from the output of its previous layer. This kind of learning architecture leads to an error propagation phenomenon, where errors in the first few layers are propagated and even amplified in the subsequent layers. Moreover, when unrolled neural networks are designed for sparse reconstruction, shrinkage steps are usually required to promote sparsity. The shrinkage step uses thresholding functions to prune elements with small magnitude to zero, and such pruning causes information loss in the dynamics of the neural network. Once useful information is discarded in earlier layers, the subsequent layers no longer have any chance to use it, which degrades the performance of the neural network and sometimes leads to a large error in the final output.

B. Contribution of this paper
In this paper, we aim to address the problem of information loss caused by shrinkage steps in unrolled neural networks designed for sparse reconstruction. To this end, we propose a novel architecture, termed the sparse minimal gated unit (SMGU), to incorporate historical information into the optimization so that we can promote sparsity using thresholding functions and preserve full information simultaneously. Additionally, we extend the SMGU to the complex-valued (CV) domain as the CV-SMGU and use it to build a gated recurrent neural network (RNN) for solving TomoSAR inversion. The main contributions of this paper are listed below:
1) We address the problem of information loss in unrolled neural networks for sparse reconstruction with a novel gated RNN. The gated RNN is built using SMGUs, which incorporate historical information into the optimization. The proposed gated RNN is able to promote sparsity by employing shrinkage thresholding functions. Simultaneously, the pruned information is retained in the cell state of the SMGUs, so that full information is preserved in the dynamics of the network.
2) We extend the SMGU to the complex-valued domain, yielding the CV-SMGU, and apply the gated RNN built with CV-SMGUs to solve TomoSAR inversion. To the best of our knowledge, this is the first attempt to bridge gated RNNs and TomoSAR inversion. It may provide novel insights and open a new prospect for future deep learning-based TomoSAR inversion.
3) We carry out a systematic evaluation to demonstrate that the proposed gated RNN outperforms the state-of-the-art deep learning-based TomoSAR algorithm γ-Net in terms of super-resolution power as well as generalization ability for TomoSAR inversion.
The remainder of the paper is outlined as follows. The TomoSAR imaging model and γ-Net are briefly reviewed in Section II. Section III provides an overview of the formulation of SMGUs as well as CV-SMGUs with application to TomoSAR inversion. Results of a systematic evaluation, using simulated and real data, are presented in Section IV. Section V discusses the generalization ability w.r.t. baseline discrepancy and analyzes the model convergence. Finally, the conclusion of this paper is drawn in Section VI.

II. BACKGROUND

A. TomoSAR imaging model
In this section, we briefly introduce the TomoSAR imaging model. Fig. 1 illustrates the SAR imaging model at a fixed azimuth position. A stack of complex-valued SAR acquisitions over the illuminated area is obtained at slightly different orbit positions (the elevation aperture). The complex-valued measurement $g_{n}$ of the $n$th acquisition is the integral of the reflectivity profile $\gamma(s)$ along the elevation direction $s$. The discrete TomoSAR imaging model can be written as:
$$g = R\gamma + \varepsilon, \tag{2}$$
where $g \in \mathbb{C}^{N \times 1}$ is the complex-valued SAR measurement vector, $\gamma \in \mathbb{C}^{L \times 1}$ denotes the discrete reflectivity profile uniformly sampled at elevation positions $s_{l}$ ($l = 1, 2, \ldots, L$) along the elevation direction, and $\varepsilon$ is the measurement noise. $N$ is the number of measurements and $L$ is the number of discrete elevation indices. $R \in \mathbb{C}^{N \times L}$ is the irregularly sampled discrete Fourier transformation mapping matrix with $R_{nl} = \exp(-j 2\pi \xi_{n} s_{l})$, where $\xi_{n}$ is the frequency proportional to the perpendicular baseline of the $n$th acquisition. Readers can refer to [14] for more details of the SAR imaging model.
Fig. 1: The elevation synthetic aperture is built up by acquisitions from slightly different incidence angles. The flight direction is orthogonal to, and points into, the plane.
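The discrete forward model can be illustrated with a minimal NumPy sketch. The function names and all numerical values below (spatial frequencies, elevation grid, scatterer positions) are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

def steering_matrix(xi, s):
    """Build R with R[n, l] = exp(-j * 2 * pi * xi[n] * s[l])."""
    return np.exp(-2j * np.pi * np.outer(xi, s))

def simulate_measurements(R, gamma, noise_std=0.0, rng=None):
    """Return g = R @ gamma + noise, with circular complex Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    n = R.shape[0]
    noise = noise_std * (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)
    return R @ gamma + noise

# Hypothetical example values (assumptions, not the paper's configuration)
xi = np.linspace(-0.01, 0.01, 25)      # N = 25 spatial frequencies in 1/m
s = np.arange(-20.0, 301.0, 1.0)       # L = 321 elevation samples in m
R = steering_matrix(xi, s)

gamma = np.zeros(s.size, dtype=complex)  # sparse reflectivity profile
gamma[40] = 1.0                          # first scatterer
gamma[70] = 2.0 * np.exp(1j * 0.5)       # second scatterer
g = simulate_measurements(R, gamma)      # noise-free measurements
```

Because only two entries of the profile are nonzero, the measurement vector is a weighted sum of just two columns of `R`, which is the sparsity that BPDN exploits.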
Since the reflectivity profile $\gamma$ is sufficiently sparse in urban areas [5], retrieving $\gamma$ is a sparse reconstruction problem. Accordingly, $\gamma$ in the presence of measurement noise $\varepsilon$ can be estimated by BPDN optimization, which is formulated as follows:
$$\hat{\gamma} = \arg\min_{\gamma} \left\{ \|g - R\gamma\|_{2}^{2} + \lambda \|\gamma\|_{1} \right\}, \tag{3}$$
where $\lambda$ is a regularization parameter balancing the sparsity and data-fitting terms. It should be adjusted according to the noise level as well as the desired sparsity level. The choice of a proper $\lambda$ is described in great detail in [1].
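As a reference point for the unrolled networks discussed later, a first-order solver for this BPDN problem can be sketched as plain ISTA. This is a minimal sketch that assumes the objective $\|g - R\gamma\|_{2}^{2} + \lambda\|\gamma\|_{1}$; the function names, the step-size rule, and the iteration count are illustrative choices, not the paper's implementation.

```python
import numpy as np

def soft_threshold(x, theta):
    """Complex soft-thresholding: shrink magnitudes by theta, keep the phase."""
    mag = np.abs(x)
    scale = np.maximum(mag - theta, 0.0) / np.maximum(mag, 1e-12)
    return x * scale

def ista(R, g, lam, n_iter=200):
    """Plain ISTA for min ||g - R @ gamma||_2^2 + lam * ||gamma||_1."""
    # Step size 1/Lipschitz, where the gradient 2 R^H (R gamma - g)
    # has Lipschitz constant 2 * ||R||_2^2.
    beta = 1.0 / (2.0 * np.linalg.norm(R, 2) ** 2)
    gamma = np.zeros(R.shape[1], dtype=complex)
    for _ in range(n_iter):
        # Gradient step on the data-fitting term
        c = gamma + 2.0 * beta * (R.conj().T @ (g - R @ gamma))
        # Shrinkage step (proximal operator of the L1 term)
        gamma = soft_threshold(c, beta * lam)
    return gamma
```

Each iteration depends only on the previous estimate, which is the property that unrolled networks such as LISTA inherit layer by layer.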

B. Review of γ-Net
As briefly mentioned previously, conventional CS-based BPDN solvers for Eq. (3) are extremely computationally expensive. To overcome the heavy computational burden and make super-resolving TomoSAR inversion feasible for large-scale processing, γ-Net was proposed in [23]. It tailors LISTA, the first unrolled ISTA network, to mimic a CS-based BPDN solver. To be specific, γ-Net introduces the weight coupling structure and the support selection scheme and replaces the conventional soft-thresholding function with a piecewise linear function. Fig. 2 illustrates the architecture of the $i$th layer in γ-Net. SS in γ-Net indicates a special thresholding scheme called support selection, which selects the $\rho_{i}$ percent of entries with the largest magnitude and trusts them as the "true support". The "true support" is fed directly to the next layer, bypassing the shrinkage step. $\eta_{pwl}$ is a novel thresholding function, called the piecewise linear function, that executes the shrinkage in γ-Net. It contributes to improving the convergence rate and reducing the reconstruction error. More details about the γ-Net formulation and the full model structure can be found in Appendix A. However, as one can see in Fig. 2, γ-Net inherits the learning architecture of LISTA despite the modifications made to improve the performance. Therefore, γ-Net can be expected to suffer from the same problem as LISTA. Specifically, in the learning architecture of γ-Net, each output is generated directly and only from the previous output. As a natural consequence, the final output can only use the information from the second-to-last layer. When useful or important information is pruned by shrinkage steps in the intermediate layers, the discarded information can no longer contribute to the final output. Consequently, a large reconstruction error in the final output can be expected.
Fig. 3 shows an example in which the two scatterers were assumed to have identical phase and amplitude and were spaced by 0.6 Rayleigh resolution, i.e., in the super-resolution regime, and the SNR level was set to 6 dB. In general, if the overlaid double scatterers cannot be resolved, the reflectivity profile should have a dominant amplitude peak between the true elevation positions of the double scatterers, as shown by the estimate of the non-super-resolving algorithm SVD-Wiener [14] in Fig. 3. However, γ-Net was able to detect one of the double scatterers with very high localization accuracy but failed to find the other one. From our perspective, this was abnormal, and we supposed that the unsuccessful separation of the double scatterers should be attributed to the information loss caused by the shrinkage steps in γ-Net. Inspecting the intermediate layers of γ-Net, we discovered that the information of the second scatterer gradually diminished after each shrinkage step. By the second-to-last layer, the information of the second scatterer had vanished completely. As a result, the final output of γ-Net, i.e., the estimate of $\gamma$, did not contain the information of the second scatterer. Hence, the second scatterer could not be detected.

A. Adaptive ISTA and sc2net
In the optimization community, it has been extensively studied and proved [25], [26], [27] that incorporating historical information contributes to improving algorithm performance. Inspired by the high-level ideas of these studies, adaptive ISTA was proposed in [28] to integrate and make use of historical information by introducing two adaptive momentum vectors $f$ and $i$ into each ISTA iteration, which is formulated as follows:
$$\begin{aligned} \tilde{c}^{(t)} &= W_{2}\hat{\gamma}^{(t-1)} + W_{1} g, \\ c^{(t)} &= f^{(t)} \odot c^{(t-1)} + i^{(t)} \odot \tilde{c}^{(t)}, \\ \hat{\gamma}^{(t)} &= \eta_{st}\big(c^{(t)}\big), \end{aligned} \tag{4}$$
where $\eta_{st}$ indicates the conventional soft-thresholding function, whose complex-valued version reads, component-wise with threshold $\theta$:
$$\eta_{st}(x_{i}) = \begin{cases} \dfrac{x_{i}}{|x_{i}|}\max\big(|x_{i}| - \theta, 0\big), & |x_{i}| \neq 0, \\ 0, & \text{otherwise}. \end{cases} \tag{5}$$
Compared to ISTA, whose update rule can be equivalently expressed as $\hat{\gamma}^{(t)} = \eta_{st}(\tilde{c}^{(t)})$ using the same notation, adaptive ISTA takes not only the current information but also the previous information into consideration. To be exact, at the $t$th iteration of adaptive ISTA, the estimate is generated by linearly combining the historical information $c^{(t-1)}$ from the previous iteration and the current information $\tilde{c}^{(t)}$ from the current iteration. The historical information $c^{(t-1)}$ and the current information $\tilde{c}^{(t)}$ are weighted by the adaptive momentum vectors $f^{(t)}$ and $i^{(t)}$, respectively. In this way, the final estimate of adaptive ISTA accumulates historical information weighted by different $f^{(t)}$ and $i^{(t)}$ at different iterations.
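The role of the two momentum vectors can be made concrete with a short sketch of the adaptive ISTA recursion. The matrices `W1` (acting on the measurements) and `W2` (acting on the previous estimate), the fixed threshold, and the zero initialization are assumptions made for illustration; in the hand-tuned setting the momentum vectors would be supplied per iteration as done here.

```python
import numpy as np

def soft_threshold(x, theta):
    """Complex soft-thresholding applied component-wise."""
    mag = np.abs(x)
    return x * np.maximum(mag - theta, 0.0) / np.maximum(mag, 1e-12)

def adaptive_ista(W1, W2, g, f, i, theta, n_iter):
    """Adaptive ISTA sketch: f[t] and i[t] weight history and current update."""
    L = W2.shape[0]
    gamma = np.zeros(L, dtype=complex)   # current estimate
    c = np.zeros(L, dtype=complex)       # accumulated (cell) state
    for t in range(n_iter):
        c_tilde = W2 @ gamma + W1 @ g    # current information
        c = f[t] * c + i[t] * c_tilde    # mix in historical information
        gamma = soft_threshold(c, theta) # shrinkage acts on the mixed state
    return gamma
```

Setting every `f[t]` to zero and every `i[t]` to one removes the history term and recovers the plain ISTA recursion, which is a convenient sanity check.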
However, one problem of adaptive ISTA is that the two momentum vectors in each iteration are difficult to determine. So far, there has been no analytical way to determine the values of the adaptive momentum vectors $f^{(t)}$ and $i^{(t)}$. Usually, they are selected by tedious hand-crafted tuning, which takes a lot of time and cannot guarantee optimal performance. To address this issue, sc2net was proposed in [28] by recasting adaptive ISTA as a recurrent neural network that parameterizes the two momentum vectors and learns them from data. The sc2net is built from sparse long short-term memory (SLSTM) [28] units, as shown in Fig. 4. Each SLSTM unit represents an individual layer of sc2net. At the $t$th layer of sc2net, the input gate and forget gate correspond to the momentum vectors $i^{(t)}$ and $f^{(t)}$ in each adaptive ISTA iteration, respectively. Hence, we use the same notation in SLSTM units to describe the input and forget gates. The two gates in each SLSTM unit are parameterized with the input data $g$ and the output $\hat{\gamma}^{(t-1)}$ of the previous layer as follows:
$$\begin{aligned} i^{(t)} &= \sigma\big(W_{i2}\hat{\gamma}^{(t-1)} + W_{i1} g\big), \\ f^{(t)} &= \sigma\big(W_{f2}\hat{\gamma}^{(t-1)} + W_{f1} g\big). \end{aligned} \tag{6}$$
To clarify, the SLSTM unit does not have an output gate like conventional LSTM units. By substituting Eq. (6) into Eq. (4), we obtain the formal definition of the SLSTM unit, as listed in Table I. $W_{i1}$, $W_{i2}$, $W_{f1}$, $W_{f2}$ denote the four trainable weight matrices that determine the input and forget gates in each SLSTM unit. It is worth mentioning that the weight matrices $W_{1}$ and $W_{2}$ are also learned from data, but they are shared by all SLSTM units in an individual sc2net. $\sigma(\cdot)$ indicates the conventional sigmoid function, which is expressed as:
$$\sigma(x) = \frac{1}{1 + e^{-x}}. \tag{7}$$
The sparse activation function employed in the SLSTM to promote sparse codes is the double hyperbolic tangent function, which is abbreviated as $\eta_{dt}(\cdot)$ and defined as follows:
$$\eta_{dt}(x, s, \theta) = s \cdot \big[\tanh(x + \theta) + \tanh(x - \theta)\big], \tag{8}$$
where $s$ and $\theta$ denote two trainable parameters. It is worth noting that the double hyperbolic tangent function can be viewed as a smooth and continuously differentiable alternative to the conventional soft-thresholding function. Its advantages are mainly two-fold. On the one hand, its second derivative is sustained over a long span, thus helping to address the vanishing gradient problem caused by the cell recurrent connection [29]. On the other hand, it is able to effectively imitate the soft-thresholding function within a certain interval. Fig. 5 shows the double hyperbolic tangent function and compares it to the soft-thresholding function.
TABLE I: Formal definition of the $t$th layer in different models and comparison of their differences. γ-Net has no gated expression. The SLSTM unit introduces forget and input gates to incorporate historical information. The SMGU has the minimal number of gates while maintaining the performance of the SLSTM unit. The CV-SMGU extends the SMGU to the complex-valued domain. Its forget gate is activated on the magnitude using the tanh function instead of the sigmoid function to guarantee that the activation value ranges from 0 to 1.
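To make the comparison between the two activations tangible, the following sketch evaluates a double hyperbolic tangent next to real-valued soft-thresholding. It assumes the form $s\,[\tanh(x+\theta) + \tanh(x-\theta)]$ for the double hyperbolic tangent; the parameter values are arbitrary illustrations, not trained values from the paper.

```python
import numpy as np

def double_tanh(x, s, theta):
    """Smooth sparse activation, assumed form: s * (tanh(x + theta) + tanh(x - theta))."""
    return s * (np.tanh(x + theta) + np.tanh(x - theta))

def soft_threshold(x, theta):
    """Real-valued soft-thresholding for comparison."""
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

x = np.linspace(-5.0, 5.0, 101)
y_dt = double_tanh(x, s=1.0, theta=2.0)   # odd, smooth, saturates at +/- 2s
y_st = soft_threshold(x, theta=2.0)       # exactly zero on [-theta, theta]
```

Unlike soft-thresholding, the smooth activation is only approximately zero near the origin and it saturates for large inputs, so the resemblance holds on a limited input range and depends on the learned `s` and `theta`.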
To sum up, sc2net inherits the advantage of adaptive ISTA, which incorporates historical information into the optimization. The cell state $c^{(t)}$ in each SLSTM unit of sc2net acts as an "eye" that supervises the optimization in two respects. First, the long-term dependence on previous outputs can be captured and maintained. Second, important information is automatically accumulated, whereas useless or redundant information is forgotten, in the dynamics of sc2net.
However, when we tried to apply sc2net to TomoSAR inversion, we discovered a drawback that impedes its application. As is well known, a complicated RNN model hinders theoretical analysis and empirical understanding. It also implies that more parameters have to be learned and more components have to be tuned. As a natural result, more training sequences, which mean more training time, and (perhaps) larger training datasets are required. When sc2net is applied to TomoSAR inversion, we need to learn four weight matrices $W_{f1}$, $W_{f2}$, $W_{i1}$, $W_{i2}$, two of dimension $L \times L$ and two of dimension $L \times N$, to determine the forget gate $f^{(t)}$ and the input gate $i^{(t)}$ in each individual SLSTM unit. Moreover, SAR data is complex-valued. Hence, the weight matrices to be learned should be complex-valued as well, which doubles the number of trainable components and parameters, since two weight matrices need to be learned simultaneously as the real and imaginary parts of each complex-valued weight matrix. Through our research and experiments, we found that such a large number of high-dimensional weight matrices makes the training procedure time-consuming. More seriously, it makes it difficult for the model to converge during training.

B. Complex-valued Sparse Minimal Gated Unit
To address the aforementioned issue and better leverage the power of incorporating historical information for solving TomoSAR inversion, it is necessary to reduce the number of components and simplify the model architecture. Recently, studies and evaluations in [30], [31], [32] demonstrated that gated units significantly improve the performance of an RNN compared to one without any gated unit. However, this does not mean that more gates lead to better RNN performance. Based on this observation, an RNN model with only one gate, termed the minimal gated unit (MGU), was proposed, revealing that fewer gates reduce the complexity but not necessarily the performance.
Inspired by the valuable works in [33], [34], we propose the sparse minimal gated unit (SMGU), as illustrated in Fig. 6, by coupling the input gate to the forget gate, thus further simplifying the SLSTM unit. The detailed equations defining the SMGU are listed in Table I.
In the $t$th layer of an RNN with SMGUs, we first compute the forget gate $f^{(t)}$. Next, the short-term response $\tilde{c}^{(t)}$ is generated by combining the input data $g$ and the "forgotten" portion $(f^{(t)} \odot \hat{\gamma}^{(t-1)})$ of the output from the previous layer. The new hidden state $c^{(t)}$ of the current layer is then formed by combining part of $\hat{\gamma}^{(t-1)}$ and the short-term response $\tilde{c}^{(t)}$, weighted by $(1 - f^{(t)})$ and $f^{(t)}$, respectively. Finally, the sparse activation function, i.e., the double hyperbolic tangent function, is applied to the current hidden state $c^{(t)}$ for shrinkage and thresholding to promote sparsity of the output. In this formulation, the SMGU simultaneously executes a two-fold task with only one forget gate. On the one hand, the SMGU allows a compact representation by enabling the hidden state $c^{(t)}$ to discard irrelevant or redundant information. On the other hand, the SMGU is capable of controlling how much information from the previous layer is taken over. Additionally, comparing the formulation of the SMGU to that of the SLSTM in Table I, we can see that the parameter size of the SMGU is only about half of that of the SLSTM, since the weight matrices $W_{1}$ and $W_{2}$ are shared across the layers of a network. The main advantage of this significant reduction of trainable parameters is that the requirements for training data, training time, and architecture tuning are reduced.
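The layer update described above can be sketched for the real-valued case as follows. This is a minimal sketch under stated assumptions: the gate is parameterized as a sigmoid of a linear map of the previous output and the input (weights `Wf1`, `Wf2`), the shared matrices are named `W1` and `W2`, and the sparse activation takes the form $s\,[\tanh(x+\theta) + \tanh(x-\theta)]$. The exact definitions are those of Table I, which may differ in detail.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def double_tanh(x, s, theta):
    return s * (np.tanh(x + theta) + np.tanh(x - theta))

def smgu_layer(gamma_prev, g, W1, W2, Wf1, Wf2, s, theta):
    """One real-valued SMGU layer (sketch of the textual description)."""
    f = sigmoid(Wf2 @ gamma_prev + Wf1 @ g)         # single forget gate
    c_tilde = W2 @ (f * gamma_prev) + W1 @ g        # short-term response
    c = (1.0 - f) * gamma_prev + f * c_tilde        # new hidden state
    return double_tanh(c, s, theta), f              # sparse output and gate

def smgu_network(g, W1, W2, gate_weights, s, theta):
    """Stack of SMGU layers; W1 and W2 are shared, gate weights are per layer."""
    gamma = np.zeros(W2.shape[0])
    gates = []
    for Wf1, Wf2 in gate_weights:
        gamma, f = smgu_layer(gamma, g, W1, W2, Wf1, Wf2, s, theta)
        gates.append(f)
    return gamma, gates
```

Because the weight on the previous output is $1 - f^{(t)}$, a single gate decides both how much history is retained and how much of the new response is admitted, which is what removes the separate input gate of the SLSTM.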
In addition to the improvements brought by the SMGU, an extension of the SMGU to the complex domain is required. The complex-valued SMGU (CV-SMGU) has essentially the same structure as the SMGU except for two differences. First, each neuron in the CV-SMGU has two channels representing the real and imaginary parts of a complex number, respectively. Second, the real and imaginary parts are usually not activated directly. Instead, the activation is performed on the magnitude of the complex number. Hence, it is no longer appropriate to use the sigmoid function to generate the forget gate: since the magnitude is never negative, the result after activation would never fall below 0.5. To tackle this problem, we employ the tanh function instead of the sigmoid to guarantee that the values of the forget gate vector range from 0 to 1 after activation, as originally designed. With these adaptations, we obtain the formulation of the CV-SMGU, which is also listed in Table I. The symbols $\tilde{W}_{*}$, $\tilde{g}$, and $\tilde{\gamma}_{*}$ represent
$$\tilde{W}_{*} = \begin{bmatrix} \mathrm{Re}(W_{*}) & -\mathrm{Im}(W_{*}) \\ \mathrm{Im}(W_{*}) & \mathrm{Re}(W_{*}) \end{bmatrix}, \quad \tilde{g} = \begin{bmatrix} \mathrm{Re}(g) \\ \mathrm{Im}(g) \end{bmatrix}, \quad \tilde{\gamma}_{*} = \begin{bmatrix} \mathrm{Re}(\gamma_{*}) \\ \mathrm{Im}(\gamma_{*}) \end{bmatrix}, \tag{9}$$
where $\mathrm{Re}(\cdot)$ and $\mathrm{Im}(\cdot)$ denote the real and imaginary operators, respectively. $\eta_{cv\text{-}dt}(\cdot)$ is the complex-valued version of the double hyperbolic tangent function, applied component-wise and expressed as follows:
$$\eta_{cv\text{-}dt}(x_{i}, s, \theta) = \begin{cases} \dfrac{x_{i}}{|x_{i}|}\, s \cdot \big[\tanh(|x_{i}| + \theta) + \tanh(|x_{i}| - \theta)\big], & |x_{i}| \neq 0, \\ 0, & \text{otherwise}. \end{cases} \tag{10}$$
Table II summarizes and compares the features of different unrolled RNNs. Through experiments, we found that gated unrolled RNNs require significantly fewer layers to achieve comparable or even better performance. Moreover, the SMGU simplifies the model structure by coupling the two gates, thus significantly reducing the number of free trainable parameters. Even though the CV-SMGU doubles the number of parameters for determining the gate, it does not induce a serious memory burden or computational expense.
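The two-channel representation and the magnitude-based activations can be checked numerically with a small sketch. The helper names are hypothetical, the block layout of the real-valued weight matrix is the standard real representation of a complex matrix (assumed here), and the complex activation assumes the form $s\,[\tanh(|x|+\theta) + \tanh(|x|-\theta)]$ applied to the magnitude with the phase kept.

```python
import numpy as np

def to_real_matrix(W):
    """Real-valued block form of a complex matrix W."""
    return np.block([[W.real, -W.imag], [W.imag, W.real]])

def to_real_vector(v):
    """Stack real and imaginary parts of a complex vector."""
    return np.concatenate([v.real, v.imag])

def magnitude_gate(z):
    """Gate activation on the magnitude: tanh(|z|) lies in [0, 1)."""
    return np.tanh(np.abs(z))

def cv_double_tanh(x, s, theta):
    """Shrink the magnitude with the double tanh and keep the phase."""
    mag = np.abs(x)
    phase = x / np.maximum(mag, 1e-12)   # zero entries stay zero
    return phase * s * (np.tanh(mag + theta) + np.tanh(mag - theta))

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4))
v = rng.standard_normal(4) + 1j * rng.standard_normal(4)
# The real-valued product reproduces the complex product W @ v
real_product = to_real_matrix(W) @ to_real_vector(v)
```

A sigmoid applied to a non-negative magnitude would be confined to the interval [0.5, 1), whereas `magnitude_gate` covers [0, 1), which is the motivation given above for the switch to tanh.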

A. Simulation setup and model training
In the simulation, we applied the same settings as [23], i.e., 25 regularly distributed spatial baselines in the range of −135 m to 135 m were simulated. The corresponding inherent elevation resolution, i.e., the Rayleigh resolution, amounts to about 42 m.
In the experiment, about 4 million training samples, half of which are single scatterers and the other half two-scatterer mixtures, were simulated to generate the training dataset. To make the training dataset adequate and the simulation more realistic, we randomized many parameters, i.e., the SNR level and the amplitude, phase, and elevation position of the scatterers, when simulating the training samples. The simulation details for single scatterers and double scatterers are listed below.
In addition, the amplitude $A$ of the scatterer is simulated to be uniformly distributed in the range of (1, 4). The complex-valued scattering coefficient $\gamma$ is then generated by $\gamma = A \cdot \exp(j\phi)$. The elevations of the simulated scatterers are regularly distributed on a 1 m grid between −20 m and 300 m.
Once the elevation is determined, the echo signal $g \in \mathbb{C}^{N \times 1}$ […] [35]. The employed optimizer was Adam [36]. The learning rate was set to be adaptive according to the number of training epochs, with an initial value of 0.0001. The loss function over the training data $\{(g_{i}, \gamma_{i})\}_{i=1}^{T}$ is the mean square error (MSE) loss, which is defined as follows:
$$\mathcal{L}(\Psi) = \frac{1}{T}\sum_{i=1}^{T} \big\|\hat{\gamma}(\Psi, g_{i}) - \gamma_{i}\big\|_{2}^{2}, \tag{11}$$
where $\Psi$ denotes the set of all parameters to be learned from data. To determine the optimal structure of the network, we validated the performance of the network with different numbers of CV-SMGUs in terms of the normalized mean square error (NMSE) on a validation dataset. The validation dataset was composed of 50000 noise-free samples simulated using the same settings introduced above, and the NMSE is defined as follows:
$$\mathrm{NMSE} = \frac{1}{T}\sum_{i=1}^{T} \frac{\|\hat{\gamma}_{i} - \gamma_{i}\|_{2}^{2}}{\|\gamma_{i}\|_{2}^{2}}. \tag{12}$$
As we can see from Table III, the NMSE gradually converges as the number of CV-SMGUs increases. Moreover, beyond 6 CV-SMGUs, further increasing the number of CV-SMGUs leads to only marginal performance improvement while imposing a heavier computational burden. Therefore, the network we designed is composed of 6 CV-SMGUs.
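A training sample under these settings could be generated along the following lines. Several ingredients are assumptions made for this sketch and are not stated in the text: the spatial frequencies are derived from the baselines via $\xi_{n} = b_{n} / (\rho_{s}\,\Delta b)$ (which follows from the standard relations $\rho_{s} = \lambda r / (2\Delta b)$ and $\xi_{n} = 2 b_{n} / (\lambda r)$), the phase is drawn uniformly from $(-\pi, \pi)$, and the SNR is defined relative to the mean power of the noise-free signal.

```python
import numpy as np

RHO_S = 42.0                                   # Rayleigh resolution in m
DELTA_B = 270.0                                # baseline extent in m
baselines = np.linspace(-135.0, 135.0, 25)     # 25 regular baselines
xi = baselines / (RHO_S * DELTA_B)             # assumed frequency mapping
s_grid = np.arange(-20.0, 301.0, 1.0)          # 1 m elevation grid
R = np.exp(-2j * np.pi * np.outer(xi, s_grid))

def simulate_sample(n_scatterers, snr_db, rng):
    """Return (g, gamma) for one sample with 1 or 2 scatterers."""
    gamma = np.zeros(s_grid.size, dtype=complex)
    idx = rng.choice(s_grid.size, size=n_scatterers, replace=False)
    amp = rng.uniform(1.0, 4.0, n_scatterers)          # amplitude in (1, 4)
    phase = rng.uniform(-np.pi, np.pi, n_scatterers)   # assumed distribution
    gamma[idx] = amp * np.exp(1j * phase)
    g_clean = R @ gamma
    # Assumed SNR definition: mean signal power over noise power
    noise_power = np.mean(np.abs(g_clean) ** 2) / 10.0 ** (snr_db / 10.0)
    noise = np.sqrt(noise_power / 2.0) * (
        rng.standard_normal(g_clean.size) + 1j * rng.standard_normal(g_clean.size)
    )
    return g_clean + noise, gamma
```

Calling `simulate_sample(2, 6.0, np.random.default_rng(0))` yields one two-scatterer mixture at 6 dB SNR; repeating the call with randomized arguments would build up a dataset.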
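The two metrics used for training and validation can be written compactly in NumPy. This sketch assumes that estimates and references are stored as arrays of shape (T, L), one reflectivity profile per row, and that the NMSE is averaged over per-sample ratios; the function names are illustrative.

```python
import numpy as np

def mse_loss(gamma_hat, gamma_true):
    """Mean over samples of the squared L2 error per sample."""
    err = np.sum(np.abs(gamma_hat - gamma_true) ** 2, axis=1)
    return float(np.mean(err))

def nmse(gamma_hat, gamma_true):
    """Mean over samples of squared error normalized by reference energy."""
    err = np.sum(np.abs(gamma_hat - gamma_true) ** 2, axis=1)
    ref = np.sum(np.abs(gamma_true) ** 2, axis=1)
    return float(np.mean(err / ref))
```

The normalization makes the NMSE insensitive to the absolute amplitude of the scatterers, so an all-zero estimate always scores 1 regardless of the sample.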

B. Performance assessment and comparison to γ-Net
In this section, we carry out experiments to systematically evaluate the performance of the proposed algorithm in terms of super-resolution power, estimation accuracy, and generalization ability with respect to different amplitude ratios and phase differences of the scatterers.

Super-resolution power and estimation accuracy
The first experiment set out to study the super-resolution power and estimation accuracy of the proposed algorithm via a TomoSAR benchmark test [5], [14]. In the experiment, we mimicked a facade-ground interaction by simulating two-scatterer mixtures with increasing elevation distance between them. The double scatterers were simulated to have identical phase and amplitude, i.e., the worst case for TomoSAR processing [13]. The proposed algorithm and γ-Net were employed to resolve overlaid double scatterers at two SNR levels, i.e., SNR ∈ {0, 6} dB, which represent typical SNR levels of a high-resolution spaceborne SAR image. We use the effective detection rate defined in [23] to fairly evaluate the super-resolution power. An effective detection should satisfy the following three criteria: 1) the hypothesis test correctly decides for two scatterers for a double-scatterer signal; 2) the estimated elevations of both detected scatterers are within ±3 times the CRLB w.r.t. their true elevations; 3) both elevation estimates are also within ±0.5 $d_{s}$ w.r.t. their true elevations, where $d_{s}$ indicates the distance between the double scatterers. Fig. 7 compares the effective detection rate $P_{d}$ of the proposed algorithm and γ-Net. It is presented as a function of the normalized distance $\alpha$, which is the ratio of the scatterer distance to the Rayleigh resolution, $\alpha = d_{s}/\rho_{s}$. For each combination of SNR and $\alpha$, we simulated 0.2 million Monte Carlo trials. From Fig. 7, one can see that the proposed algorithm and sc2net with CV-SLSTMs (CV-sc2net) have quite similar performance in terms of effective detection rate. This is as expected, since the CV-SMGU is constructed by simplifying the CV-SLSTM. The purpose of the CV-SMGU is to reduce the network components while maintaining the performance. The advantages of the proposed algorithm compared to CV-sc2net are analyzed and discussed in the "DISCUSSION" section. When we compare the proposed algorithm and CV-sc2net to γ-Net, we can see that both outperform γ-Net by a fair margin at both SNR levels. Specifically, they deliver a 10%–20% higher effective detection rate in moderate super-resolving cases at 6 dB SNR. In the noisy case at 0 dB SNR, the proposed algorithm and CV-sc2net gradually approach about 90% effective detection rate with increasing normalized distance, whereas γ-Net reaches only about 70%. The superior performance of the proposed algorithm and CV-sc2net is attributed to the fact that they overcome the information loss in the dynamics of the network by incorporating historical information and preserving full information. As mentioned in the previous section, the detection of double scatterers is affected by the information loss: a scatterer whose information has been discarded cannot be detected.
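The three criteria of an effective detection can be expressed as a small predicate. This sketch assumes that the number of detected scatterers comes from a separate hypothesis test, that the CRLB is given as a bound on the elevation estimate in the same unit as the elevations, and that estimates are matched to true elevations by sorting; the function name is illustrative.

```python
import numpy as np

def is_effective_detection(n_detected, est_elev, true_elev, crlb):
    """Check the three effective-detection criteria for a double-scatterer signal."""
    # 1) the detector must decide for exactly two scatterers
    if n_detected != 2:
        return False
    est = np.sort(np.asarray(est_elev, dtype=float))
    true = np.sort(np.asarray(true_elev, dtype=float))
    d_s = true[1] - true[0]                 # distance between the scatterers
    err = np.abs(est - true)
    within_crlb = np.all(err <= 3.0 * crlb)       # 2) within +/- 3 x CRLB
    within_half_ds = np.all(err <= 0.5 * d_s)     # 3) within +/- 0.5 d_s
    return bool(within_crlb and within_half_ds)
```

The effective detection rate for one combination of SNR and normalized distance would then be the fraction of Monte Carlo trials for which this predicate returns `True`.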
To better show how the incorporation of historical information improves the performance, we simulated 2000 samples containing double scatterers with increasing scatterer distance at 6 dB SNR. We made a scatter plot of their elevation estimates and color-coded the points by the detector decision in Fig. 8. The x-axis refers to the true normalized elevation distance of the scatterers. The y-axis shows their normalized elevation estimates. The ideal reconstruction would be a horizontal and a diagonal straight line, which represent the ground truth of the simulated ground and facade. The green lines refer to the ground truth ±3 times the CRLB of the single-scatterer elevation estimate. The blue dots indicate the detected double scatterers, whereas the red dots represent samples that were detected as single scatterers, meaning that the second scatterer was lost in the network output. Fig. 8 clearly shows that (1) γ-Net produces many more red dots located within ±3 times the CRLB of the ground truth, meaning that it occasionally detects only one of the double scatterers but is able to estimate its elevation with high precision. We ascribe this problem to the information loss caused by the learning structure of γ-Net. In contrast, the proposed algorithm uses CV-SMGUs to preserve full information, thus avoiding the loss of any significant information; (2) the proposed algorithm is able to resolve double scatterers at much smaller scatterer distances. Specifically, the proposed algorithm starts to separate double scatterers from about 0.15 Rayleigh resolution, whereas γ-Net can detect double scatterers only beyond about 0.3 Rayleigh resolution.
The elevation estimates of the simulated facade and ground are plotted in Fig. 9 w.r.t. the normalized true elevation distance. The red horizontal and slanted lines indicate the ground truth of the ground and the facade, respectively. The black dashed curves represent the ground truth ±1× CRLB. The error bars indicate the standard deviation of the elevation estimates, with the midpoint depicting the mean value of the elevation estimates at a given normalized true elevation distance. We discarded the points with an effective detection rate below 5% in the figures. Due to the strict criteria of the effective detection, both the proposed algorithm and γ-Net provide high elevation estimation accuracy, especially at 6 dB SNR, where the bias of the elevation estimates derived by both methods approaches 0. However, in the extremely noisy case, the proposed algorithm is able to estimate the elevation with slightly lower bias than γ-Net.

Performance w.r.t. amplitude ratio
In this experiment, we study how the proposed algorithm performs at different amplitude ratios of the double scatterers. The double scatterers were set to have identical phase. The SNR level was set to 6 dB. Fig. 10 compares the effective detection rate of the proposed algorithm and γ-Net at different amplitude ratios. As can be seen, the effective detection rate of both algorithms degrades as the amplitude ratio increases. The reason for this degradation is two-fold. First, dark scatterers suffer from increasingly large bias as the amplitude ratio increases, since their elevation estimates tend to approach the other, more prominent scatterer. Second, at high amplitude ratio, the energy of the second scatterer is closer to the noise level. In real-world applications, we usually regard dark scatterers at high amplitude ratio (≥ 4) as noise. However, by comparing the two algorithms, we can see from Fig. 10 that the proposed algorithm performs much better than γ-Net as the amplitude ratio increases, despite the fact that its effective detection rate is also seriously affected. From our perspective, the better performance of the proposed algorithm is attributable to the fact that its estimates preserve the full information, so that we have a higher chance of retrieving the weak signal of dark scatterers.

Performance w.r.t. phase difference
As investigated in [13], the super-resolution power depends strongly on the phase difference when double scatterers are spaced within the Rayleigh resolution. To evaluate how the proposed algorithm performs w.r.t. the phase difference of double scatterers in super-resolving cases, we vary the phase difference of the simulated double scatterers in this experiment and test the effective detection rate. The double scatterers are simulated with identical amplitude. Fig. 11 illustrates the effective detection rate of the proposed algorithm and γ-Net for the case N = 25, SNR = 6 dB and α = 0.6. As can be seen, both algorithms have the worst performance at Δϕ = 0 and perform better as Δϕ approaches 180°. Compared to γ-Net, the proposed algorithm is less sensitive to the phase difference. When Δϕ = 0, the proposed algorithm delivers an effective detection rate about 20% higher than that of γ-Net.
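To make the simulation setting of the amplitude-ratio and phase-difference experiments concrete, the following Python sketch generates one noisy measurement vector for a ground/facade scatterer pair. It assumes the standard TomoSAR forward model g_n = Σ_l γ_l exp(−j2π ξ_n s_l) with ξ_n = −2 b_n/(λ r) and the Rayleigh resolution ρ_s = λ r/(2 Δb), which are not restated in this section. The function name, the SNR convention (noise power relative to the brighter scatterer) and all numeric values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def simulate_double_scatterer(baselines, wavelength, slant_range, alpha,
                              amp_ratio=1.0, phase_diff_deg=0.0,
                              snr_db=6.0, rng=None):
    """Return (g, rho_s): one noisy measurement vector and the Rayleigh resolution.

    alpha is the elevation distance normalized to the Rayleigh resolution.
    """
    rng = np.random.default_rng() if rng is None else rng
    b = np.asarray(baselines, dtype=float)
    # Rayleigh resolution in elevation for an aperture extent of max(b) - min(b).
    rho_s = wavelength * slant_range / (2.0 * (b.max() - b.min()))
    # Spatial (elevation) frequencies; the sign convention is an assumption.
    xi = -2.0 * b / (wavelength * slant_range)
    # Ground at elevation 0, facade at alpha * rho_s.
    s = np.array([0.0, alpha * rho_s])
    # Unit-amplitude first scatterer; the second one is weaker by amp_ratio
    # and shifted in phase by phase_diff_deg.
    gamma = np.array([1.0, np.exp(1j * np.deg2rad(phase_diff_deg)) / amp_ratio])
    steering = np.exp(-2j * np.pi * np.outer(xi, s))  # shape (N, 2)
    # Circular complex Gaussian noise; SNR is taken w.r.t. the unit-amplitude scatterer.
    noise_power = 10.0 ** (-snr_db / 10.0)
    noise = np.sqrt(noise_power / 2.0) * (
        rng.standard_normal(b.size) + 1j * rng.standard_normal(b.size))
    return steering @ gamma + noise, rho_s

# Hypothetical acquisition geometry (values chosen for illustration only).
baselines = np.linspace(-135.0, 135.0, 25)
g, rho_s = simulate_double_scatterer(baselines, wavelength=0.031, slant_range=7.0e5,
                                     alpha=0.6, phase_diff_deg=0.0,
                                     rng=np.random.default_rng(0))
```

Sweeping `phase_diff_deg` from 0 to 180 or `amp_ratio` from 1 upward, and feeding the resulting vectors to a solver and detector, reproduces the structure of the two experiments; the detection step itself is not sketched here.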

C. Practical demonstration
For the real data experiment, we used the test data stack over the city of Las Vegas covering the Paris Hotel. Fig. 13 shows an optical image from Google Earth and the SAR mean intensity image of the test site. The stack is composed of 50 TerraSAR-X high-resolution spotlight images with a slant-range resolution of 0.6 m and an azimuth resolution of 1.1 m, whose spatial baseline distribution is shown in Fig. 12. The images were acquired between 2008 and 2010. More details of the data stack are listed in Table IV.
We employed DLR's integrated wide area processor (IWAP) [37] to carry out preprocessing, such as co-registration of the multiple SAR images and phase calibration. In addition, a coherence point on the ground was chosen as reference. We used the baselines of the test data stack to simulate training data. The simulation was conducted in the same way as introduced in the previous section, and 4 million training samples were generated. Once the network was well trained, the proposed algorithm was directly applied to reconstruct the elevation of the test site.
The reconstruction results of the test site are shown in Fig. 14 and compared to the results derived by γ-Net. In Fig. 14, (a) and (b) illustrate the color-coded elevation of the single scatterers detected by both algorithms, while (c)-(f) depict the reconstruction of the detected double scatterers.
To provide a more intuitive comparison of the super-resolution power of both algorithms, we summarized the scatterer detection of both algorithms in Table V. As shown in Table V, most pixels are detected as containing 0 scatterers by the two algorithms because the fountain and many low infrastructures in the test site exhibit no strong scattering, which can be seen in Fig. 13(b). Compared to γ-Net, the proposed algorithm detected fewer single scatterers (33.30%) but more double scatterers. A comparison between the double scatterers detected by both algorithms shows that the proposed algorithm is able to detect 95.2% of the double scatterers detected by γ-Net. Moreover, it detects 50% more double scatterers than γ-Net.
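The statistics quoted above (share of pixels with 0, 1 or 2 scatterers, fraction of γ-Net's double scatterers that are also found by the proposed algorithm, and relative gain in double scatterers) can be computed from two per-pixel scatterer-count maps. The sketch below is only an illustration of that bookkeeping; the function name and the toy arrays are hypothetical and do not come from the test site.

```python
import numpy as np

def detection_summary(counts_a, counts_b):
    """Compare two per-pixel scatterer-count maps (values 0, 1 or 2).

    counts_a: counts from algorithm A (e.g., the proposed algorithm)
    counts_b: counts from algorithm B (e.g., the reference algorithm)
    Assumes B detects at least one double scatterer.
    """
    a = np.asarray(counts_a)
    b = np.asarray(counts_b)
    # Percentage of pixels detected as 0, 1 and 2 scatterers by each algorithm.
    share_a = {k: 100.0 * np.mean(a == k) for k in (0, 1, 2)}
    share_b = {k: 100.0 * np.mean(b == k) for k in (0, 1, 2)}
    double_a = a == 2
    double_b = b == 2
    # Percentage of B's double scatterers that A also detects as double.
    recovered = 100.0 * np.sum(double_a & double_b) / np.sum(double_b)
    # Relative increase of A's double-scatterer count over B's, in percent.
    gain = 100.0 * (np.sum(double_a) - np.sum(double_b)) / np.sum(double_b)
    return share_a, share_b, recovered, gain

# Toy example with six pixels (not real data).
toy_a = [2, 2, 2, 1, 0, 0]
toy_b = [2, 2, 1, 1, 0, 0]
share_a, share_b, recovered, gain = detection_summary(toy_a, toy_b)
```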
Further investigation was conducted to inspect the improvement in double-scatterer detection. The histogram of the elevation difference between the detected double scatterers from the proposed algorithm and γ-Net is shown in Fig. 15. In the non-super-resolution region, especially when the distance between double scatterers is larger than twice the Rayleigh resolution, the two

V. DISCUSSION

A. Generalization ability against baselines discrepancy
The effective baseline in a SAR image varies according to the range and azimuth location. A deep learning model trained with a fixed set of baselines may perform poorly when applied to the whole image stack, as baseline discrepancies between training and testing data may cause a data domain shift. In this experiment, we verify the generalization ability against baseline discrepancies. The network with 6 CV-SMGUs is trained using 25 regularly distributed baselines as introduced in the simulation setup. Then we add a random perturbation uniformly distributed in the range [5 m, 10 m], i.e., about [7%, 14%] of the standard deviation of the 25 regularly distributed baselines, to these baselines. In this way, 100 different baseline distributions were generated. For each baseline distribution, we carry out a Monte Carlo simulation at 6 dB SNR with 0.2 million Monte Carlo trials at each discrete normalized distance. Fig. 16 shows the effective detection rate of the proposed algorithm when we apply the pre-trained network to the data generated with baseline perturbations. The red line represents the reference, i.e., the pre-trained network applied to data simulated with the same baseline distribution. The green line indicates the average effective detection rate of the 100 Monte Carlo simulations, with the blue error bars depicting the standard deviation. As one can see, the proposed algorithm shows a good generalization ability against baseline discrepancy, with the effective detection rate decreasing by only 5% to 8% compared to the reference. Therefore, we see the proposed algorithm as a promising tool for large-scale TomoSAR processing, since the largest baseline difference within a typical spaceborne SAR image will not exceed the perturbation we simulated.
However, for baselines with large perturbation or even a completely different distribution, the proposed algorithm is not an estimation-efficient method. We carried out an additional experiment to test the boundary of the generalization ability by further increasing the baseline perturbation. As we can see in Fig. 17, with the increase of the baseline discrepancy, the effective detection rate decreases slowly at first. However, when the perturbation is larger than 15 m, the performance of the proposed algorithm degrades dramatically. The test result indicates that 15 m might be the boundary for the proposed algorithm to have reasonable performance for the baseline setting in this simulation.
When we set our sights on global urban mapping using TomoSAR, the huge discrepancy between the baselines of different data stacks will be a severe challenge. We still need to explore a more general and also computationally efficient algorithm.
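The perturbation step of this experiment can be sketched as follows. The snippet draws, for each baseline, a perturbation whose magnitude is uniform in [5 m, 10 m] and generates 100 perturbed baseline sets. Two points are assumptions, since the text does not specify them: the perturbation direction is chosen at random, and the ±119 m reference extent is a hypothetical value picked only so that 5 m to 10 m corresponds to roughly 7% to 14% of the baseline standard deviation. The function name is also illustrative.

```python
import numpy as np

def perturbed_baselines(reference, low=5.0, high=10.0, n_sets=100, rng=None):
    """Return an (n_sets, N) array of perturbed copies of the reference baselines."""
    rng = np.random.default_rng() if rng is None else rng
    reference = np.asarray(reference, dtype=float)
    # Perturbation magnitude in meters, uniform in [low, high).
    magnitude = rng.uniform(low, high, size=(n_sets, reference.size))
    # Assumption: each baseline is shifted up or down with equal probability.
    sign = rng.choice([-1.0, 1.0], size=magnitude.shape)
    return reference + sign * magnitude

# Hypothetical reference: 25 regularly distributed baselines in meters.
reference = np.linspace(-119.0, 119.0, 25)
sets = perturbed_baselines(reference, rng=np.random.default_rng(0))
# Perturbation bounds relative to the standard deviation of the reference baselines.
relative = np.array([5.0, 10.0]) / reference.std()
```

Each row of `sets` would then define its own steering matrix for one Monte Carlo simulation, while the network itself stays fixed at the weights trained on `reference`.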

B. Convergence analysis
In this section, we investigate the influence of CV-SMGUs on the convergence performance in comparison with CV-SLSTMs. We use an RNN with 6 CV-SLSTMs as a baseline. Fig. 18 compares the objective loss (Eq. (10)) with increasing training epochs. From Fig. 18, we can observe that CV-SMGUs contribute to faster convergence. To be specific, the RNN with CV-SMGUs needs only about 500 epochs to converge, while the RNN with CV-SLSTMs requires more than 1000 epochs. Furthermore, CV-SMGUs lead to a slightly lower overall cost than CV-SLSTMs.

C. Requirement of training data
As we have clarified in previous sections, the CV-SMGU has only one gate, i.e., the minimum number of gates; thus, it has fewer trainable parameters and a simpler structure. In this experiment, we study how this simpler model contributes to reducing the requirement of training data. We compare two RNNs with 6 CV-SMGUs and 6 CV-SLSTMs, respectively, in terms of effective detection rate at 6 dB SNR. The distance between double scatterers was fixed at 0.6 Rayleigh resolution, and the double scatterers were set to have identical phase and amplitude. The result is shown in Fig. 19. As can be seen, the RNN with CV-SMGUs performs better when the two RNNs are trained with the same number of training samples. In addition, the RNN with CV-SMGUs requires clearly fewer training samples to achieve optimal performance.

Fig. 17: Effective detection rate as a function of α at baselines with increasing perturbation. At first, the effective detection rate decreases slowly with the increase of the baseline perturbation. However, when the perturbation is larger than 15 m, the performance of the proposed algorithm degrades dramatically.

VI. CONCLUSION
In this paper, we proposed a novel gated-RNN-based BPDN solver for sparse reconstruction. The proposed gated RNN adopts a novel architecture, termed the sparse minimal gated unit (SMGU), which avoids the information loss caused by shrinkage by incorporating historical information into the optimization. With the assistance of SMGUs, we are able to capture and maintain long-term dependence on information from previous layers. To be specific, important information is automatically accumulated, while useless or redundant information is forgotten in the dynamics of the network. Moreover, we extended the SMGU to the complex-valued domain (CV-SMGU) and applied it to solve TomoSAR inversion.
denotes the set of parameters to be learned for the piecewise linear function in the $i$-th layer. $W_i$ indicates the trainable weight matrix in the $i$-th layer, and it is initialized using the system steering matrix $R$ with $W_i = \beta R^{H}$, where $\beta$ is the step size. Usually, a proper step size can be taken as $1/L_s$, with $L_s$ being the largest eigenvalue of $R^{H}R$. $\gamma_i$ is the output of the $i$-th layer. Re(•) and Im(•) denote the real and imaginary operators, respectively.
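As a minimal sketch of the initialization described above, the step size β = 1/L_s and the initial weight matrix W_i = βR^H can be computed with NumPy as follows. The function name and the small test matrix are illustrative assumptions; only the two formulas are taken from the text.

```python
import numpy as np

def init_weight(R):
    """Return (W, beta) with beta = 1 / L_s and W = beta * R^H.

    L_s is the largest eigenvalue of R^H R.
    """
    R = np.asarray(R, dtype=complex)
    gram = R.conj().T @ R
    # R^H R is Hermitian, so eigvalsh returns its real eigenvalues.
    L_s = np.linalg.eigvalsh(gram).max()
    beta = 1.0 / L_s
    return beta * R.conj().T, beta

# Small illustrative steering matrix (not a real TomoSAR geometry).
R_example = np.array([[2.0, 0.0], [0.0, 1.0]], dtype=complex)
W_example, beta_example = init_weight(R_example)
```

Since the largest eigenvalue of R^H R equals the squared spectral norm of R, this choice of β corresponds to the usual ISTA step size of one over the Lipschitz constant of the gradient of the data-fidelity term.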
SS in γ-Net indicates a special thresholding scheme called support selection, which is formally defined as follows: In the $i$-th layer, the support selection selects the $\rho_i$ percent of entries with the largest magnitude and trusts them as the "true support", which is fed directly to the next layer, bypassing the shrinkage step. The remaining entries go through the shrinkage step as usual. The shrinkage is executed using the piecewise linear function $\eta_{pwl}$, which is a novel shrinkage thresholding function that promotes sparsity while improving the convergence rate and reducing the reconstruction error, and is expressed as:
Fig. 20: Illustration of the learning architecture of a K-layer γ-Net.
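The support selection step described above can be sketched in Python as follows. Because the expression of the piecewise linear function η_pwl is not reproduced here, the sketch swaps in plain complex soft-thresholding as a stand-in shrinkage; this substitution, the function names and the threshold parameter `theta` are assumptions made for illustration only.

```python
import numpy as np

def soft_threshold(x, theta):
    """Complex soft-thresholding: shrink magnitudes by theta, keep the phase."""
    mag = np.abs(x)
    scale = np.maximum(mag - theta, 0.0) / np.maximum(mag, 1e-12)
    return x * scale

def support_selection(x, rho_percent, theta):
    """Shrink all entries except the rho_percent largest ones in magnitude.

    The trusted entries bypass the shrinkage and are passed on unchanged.
    """
    x = np.asarray(x, dtype=complex)
    k = int(np.floor(rho_percent / 100.0 * x.size))
    out = soft_threshold(x, theta)  # stand-in for eta_pwl (assumption)
    if k > 0:
        trusted = np.argsort(np.abs(x))[-k:]  # indices of the k largest magnitudes
        out[trusted] = x[trusted]
    return out

# Toy input: with rho = 25% of four entries, only the largest one is trusted.
x_toy = np.array([3.0 + 0.0j, 0.5j, -2.0, 0.1])
y_toy = support_selection(x_toy, rho_percent=25.0, theta=1.0)
```

In the toy example the entry with magnitude 3 bypasses the shrinkage, the entry with magnitude 2 is shrunk to magnitude 1, and the two small entries are set to zero, which is the pruning behavior that motivates the discussion of information loss.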

Fig. 1: The SAR imaging geometry at a fixed azimuth position. The elevation synthetic aperture is built up by acquisitions from slightly different incidence angles. The flight direction is orthogonal to the plane.

Fig. 2: Illustration of the i-th layer in γ-Net.

Fig. 3: An example of unsuccessful detection of double scatterers caused by information loss. γ-Net detects one of the double scatterers with very high localization accuracy but fails to find the other one.

Fig. 4: Sc2net and detailed learning architecture of the SLSTM unit. Each SLSTM unit builds an individual layer of sc2net.

Fig. 5 demonstrates an example of the double hyperbolic

Fig. 6: Structure of the proposed SMGU. f indicates the only gate in each SMGU.

Fig. 7: Effective detection rate of the proposed algorithm, CV-sc2net and γ-Net as a function of the normalized elevation distance between the simulated facade and ground with SNR = 0 dB and 6 dB under 0.2 million Monte Carlo trials.

Fig. 8: Normalized estimated elevation of facade and ground at increasing elevation distance, with SNR = 6 dB and N = 25. The double scatterers were simulated to have identical phase and amplitude. The true positions are a horizontal line referring to the ground and a diagonal line referring to the scatterers at variable elevation. The green lines depict the true positions ±3 times the CRLB of elevation estimates for single scatterers. Red dots represent samples detected as single scatterers. Blue dots indicate detected overlaid double scatterers.

Fig. 9: Estimated elevation of simulated facade and ground, (a) SNR = 0 dB with the proposed algorithm, (b) SNR = 0 dB with γ-Net, (c) SNR = 6 dB with the proposed algorithm, (d) SNR = 6 dB with γ-Net. Each dot has the sample mean of all estimates as its y value and the corresponding standard deviation as error bar. The red line segments represent the true elevation of the simulated facade and ground. The dashed curves denote the true elevation ±1× CRLB normalized w.r.t. the Rayleigh resolution.

Fig. 10: Effective detection rate of the two algorithms w.r.t. the normalized elevation distance at different amplitude ratios.

Fig. 11: Effective detection rate ρ_d of the two algorithms as a function of phase difference Δϕ under the case: N = 25, SNR = 6 dB and α = 0.6.
In Fig. 14, (a) and (b) illustrate the color-coded elevation of single scatterers detected by both algorithms. (c)-(f) depict the reconstruction of detected double scatterers of both algorithms.

Fig. 14: Reconstructed and color-coded elevation of detected scatterers. From left to right: elevation estimates derived by the proposed algorithm and γ-Net, respectively. From top to bottom: color-coded elevation of detected single scatterers, top layer of detected double scatterers and bottom layer of detected double scatterers, respectively.

Fig. 15: Histogram of the elevation distance between the detected double scatterers from the proposed algorithm and γ-Net. The proposed algorithm shows significantly more detections in the super-resolution region.

Fig. 19: Effective detection rate vs. number of training samples. The RNN with CV-SMGUs requires fewer training samples to achieve optimal performance.

TABLE II: Comparison of different unrolled RNNs for sparse reconstruction.

TABLE III: The performance of the network with different numbers of SMGUs. Beyond 6 SMGUs, the performance improvement from adding more SMGUs is marginal, while the additional SMGUs lead to a heavier computational burden.

TABLE IV: System parameters of the TerraSAR-X high-resolution spotlight image stack.

TABLE V: Percentage of scatterers detection for the two algorithms.
-SMGU has only one gate, i.e. the minimum number of gates, thus it has less trainable parameters and simpler structure.In this experiment, we study how this simpler model contributes to reducing the requirement of training data.We compare two RNNs with 6 CV-SMGUs and 6 CV-SLSTMs, respectively, in term of effective detection rate at 6dB SNR.The distance