Hessian Free Convolutional Dictionary Learning for Hyperspectral Imagery With Application to Compressive Chromo-Tomography

Convolutional dictionary learning (CDL) is an unsupervised learning method that seeks a translation-invariant sparse representation for signals, and has gained considerable interest in various image processing and computer vision applications. However, 3D hyperspectral images pose unique challenges due to their high dimensionality and complex structure, which makes both the optimization of the dictionary and its application to inverse problems difficult. This paper proposes an efficient CDL algorithm in which neither explicit evaluation of the Hessians nor their inversion is required during optimization, leading to substantial acceleration and memory savings. Furthermore, we exploit the learned kernels as a convolutional sparse coding (CSC) image prior for the compressive chromo-tomographic (CCT) reconstruction problem, and examine the usability and performance of the proposed method for CCT reconstruction. Numerical experiments show that, for CCT, 1) the proposed CSC can provide an efficient representation for HSI using several tens of 3D filters; 2) the learned convolutional dictionary has reliable generalization capability; and 3) the proposed CSC-based method outperforms the classical reconstruction method built on an analytic sparsifying basis.


I. INTRODUCTION
Hyperspectral imagery (HSI) measures the radiation intensity of every pixel in the scene over many wavelength bands. These spatially registered spectral images exhibit three important attributes: local spatial smoothness, high spectral correlation, and nonlocal spatial similarity, which make dictionary learning (DL) a well-suited approach to seeking a sparse and redundant representation model for various HSI analysis and processing tasks, including hyperspectral classification [1]-[3], hyperspectral image denoising [4], anomaly detection [5], hyperspectral image super-resolution [6]-[8], and hyperspectral compressive sensing [9], [10], among others.
The associate editor coordinating the review of this manuscript and approving it for publication was Jinjia Zhou .
The DL problem probes for a set of prototype signals or features, constituting a usually overcomplete dictionary, from a training database. This data-dependent methodology distinguishes itself from classical analytic transforms, such as the Wavelet [11] and Contourlet [12], in that the resultant dictionary can better adapt to any family of signals via learning, whereas analytic transforms are usually designed for some particular class of signals, for example, piece-wise constant signals or image edges. The flexibility and adaptability of DL have made it a rather appealing sparse representation approach, and its evolution has displayed three important stages worthy of mention: i) Patch-based dictionary learning extracts the so-called dictionary atoms from overlapped image patches, forming the columns of the dictionary D, and approximates every image patch with a linear combination x ≈ Dz, where z is the sparse vector that plays an important role in the unique representation and in the interpretability. This data-driven image modeling has demonstrated much improved performance over generic image models in many practical applications [13]. However, the patch-wise manipulation gives rise to two restrictions on D: unstructuredness and shift-variance. The former means that explicit matrix-vector operations dominate the numerical expense, while the latter means that dictionaries may contain many shifted versions of the same atoms. These two issues impose severe computational and memory burdens on dictionary learning algorithms, such as the method of optimal directions (MOD) [14], the K-SVD [15], and the generalized PCA (GPCA) [16]. A novel PCA-based dictionary learning technique has been proposed for multi-modality image fusion, where multiple sub-dictionaries are trained, each for a different cluster of pixels sharing common high-frequency features [17].
ii) Double-sparsity dictionary learning trains a sparse atom representation matrix A over a preselected base dictionary Φ, e.g., the Discrete Cosine Transform (DCT) or a wavelet transform, so that the dictionary takes the form D = ΦA [18]. Some follow-up methods extend this idea to multiscale [19], orthogonal [20], or online [21] dictionary learning. When Φ has a fast computation algorithm, such a dictionary structure allows for more efficient operations with D and its adjoint D*. Owing to their genes inherited from the patch-based learning techniques, these methods still suffer from explicit matrix manipulations, restricting their deployment to high dimensional signals.
iii) Convolutional dictionary learning (CDL) uses convolution as the basic image synthesis operation, X = Σ_{k=1}^{K} D_k * Z_k, where {D_k}_{k=1}^{K} is the set of convolutional kernels learned from training data, and {Z_k}_{k=1}^{K} are the corresponding code maps. The most prominent property of this image model is shift-invariance, contributing two merits over its patch-based counterparts: a compact kernel set capable of global image representation [22], and a structured dictionary with fast operations using the Fast Fourier Transform (FFT) [23]. The convolution operation, however, causes the image boundary problem, which can undermine the learning performance. The seminal work [24] by Heide et al. introduced a mask matrix to account for the boundary effect and employed the alternating direction method of multipliers (ADMM) to flexibly and efficiently solve the sparse coding and filter learning sub-problems in an alternating manner.
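As a concrete illustration of the synthesis model X = Σ_k D_k * Z_k, the following numpy/scipy sketch (sizes and random data are purely illustrative, not the paper's setup) builds an image from K kernels and code maps via FFT-based convolution:

```python
import numpy as np
from scipy.signal import fftconvolve

# Illustrative sizes: K kernels of size h x w, code maps of size H x W.
rng = np.random.default_rng(0)
K, H, W, h, w = 4, 32, 32, 5, 5
D = rng.standard_normal((K, h, w))   # convolutional kernels D_k
Z = rng.standard_normal((K, H, W))   # code maps Z_k (dense here; sparse in CDL)

# Synthesis X = sum_k D_k * Z_k, computed with FFT-based convolution.
X = sum(fftconvolve(Z[k], D[k], mode="full") for k in range(K))
```

The "full" output has size (H + h − 1) × (W + w − 1), which matches the zero-padded size used for boundary handling later in the paper.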
As the aforementioned DL algorithms were proposed under the configuration for 2D images, they need some type of adaptation to the 3D data structure of HSI. Some works [4], [8] directly learn a spatial dictionary on each spectral band, neglecting the spectral correlation in HSI. A more widely adopted approach is to reshape the 3D data cube 𝒳 ∈ R^{H×W×B} into a 2D matrix X ∈ R^{B×N}, where N = H×W, and H, W and B represent the image height, width and channel number, respectively. Then some works [1]-[3], [5]-[7], [9] conduct DL along the spectral dimension, only capturing the spectral sparsity by expressing a pixel (a column in X) as x_n = Dz_n. An attempt to partially consider the joint spatial-spectral sparsity is to employ patch-based DL on small overlapped 2D patches in X [4] or on vectorized 3D cubes [10], [25], which still cannot guarantee the 3D signature integrity.
Computational complexity and memory consumption entailed by the curse of dimensionality are two critical issues that every DL algorithm has to cope with. As to the practice of patch-based DL in hyperspectral applications, suppose that the dictionary D consists of N atoms, each having dimension M. As discussed previously, the basic operation is matrix-vector multiplication, resulting in a complexity of O(MN), which prohibits large atom numbers as well as large patch sizes. On the other hand, redundant representation requires a sufficient quantity of atoms to ensure diversity, which usually forces a tradeoff between the dictionary scale and the computational and memory complexities. In contrast, CDL employs convolution as the basic operation, which captures local image features through the global sliding of filters, yielding a small set of convolutional kernels to be stored in memory; the convolution itself can be computed fast using the FFT. Although in methodology concise representation and fast operation are two advantages of CDL over its patch-based counterparts, in practice two questions remain: how can we design efficient optimization algorithms that fulfill the CDL training and testing tasks within reasonable memory and running time budgets? And how can we exploit this learned sparse representation in inverse HSI problems, and how much benefit can it bring?
To address these problems, this paper introduces the Preconditioned ADMM (PADMM) framework [26], [27] into CDL for HSI and compressive chromo-tomography (CCT) [28], and further proposes two efficient diagonal preconditioner design methods that allow for Hessian-free solutions, resulting in substantial algorithm acceleration and memory savings. In particular, we make the following contributions: • We prove that the subproblems of CDL, sparse coding and filter updating, can both be cast within the PADMM framework, leading to efficient numerical solutions with convergence guarantees. Furthermore, this preconditioning technique is also introduced to the CCT problem, which makes the reconstruction of high dimensional HSI numerically more tractable.
• We present two efficient diagonal preconditioner selection methods, contributing to the proposed Hessian-free CDL algorithm, which reduces the computational complexity of the involved linear systems to O(n²), compared with the O(n³) complexity of Hessian-based matrix inversion approaches, and at the same time removes the storage requirement for the Hessians.
• We experimentally demonstrate the effectiveness of our learned dictionary for CCT, which exhibits robust noise behavior, reliable generalization capability, and superior reconstruction performance over analytic transform bases.
The remainder of this paper is organized as follows. A brief survey on CDL is given in Section II. Section III discusses the biconvex optimization framework for hyperspectral CDL. Section IV proposes the algorithm acceleration and memory reduction techniques using PADMM. Application of the learned dictionary to compressive chromo-tomography reconstruction is detailed in Section V. Experimental results are provided in Section VI, and Section VII concludes this paper.

II. RELATED WORK
In this section, we first introduce the classical CDL formulation for single channel images, then extend to the multichannel scenario.

A. SINGLE CHANNEL CDL
In the literature, a widely used mathematical model for CDL can be formulated as the following optimization problem:

arg min_{{D_k},{Z_{j,k}}} (1/2) Σ_{j∈[J]} ‖ X_j − Σ_{k∈[K]} D_k * Z_{j,k} ‖_F² + λ Σ_{j∈[J]} Σ_{k∈[K]} ‖Z_{j,k}‖_1, s.t. ‖D_k‖_F ≤ 1, ∀k ∈ [K], (1)

where D_k ∈ R^{h×w} is the k-th member in a set of K convolutional kernels, and {X_j} and {Z_{j,k}} ∈ R^{H×W} denote the J training images and their corresponding J×K code maps with respect to {D_k}, respectively. The symbol * is the 2D convolution operator, and [K] represents the set {1, …, K}. The sparsity of Z_{j,k} is controlled by the regularization weight λ, and the normalization constraints on D_k help reduce the ambiguity of energy distribution between Z_{j,k} and D_k. Most CDL algorithms employ an alternating minimization strategy to solve problem (1), since it is biconvex w.r.t. the variable sets {D_k} and {Z_{j,k}}. The whole optimization process alternates between the following subproblems, each of which can be solved using ADMM [23], [24], [29], [30] or the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) [31], [32].
1) SPARSE CODING SUBPROBLEM (FIXING D_k)

arg min_{{z_{j,k}}} (1/2) Σ_{j∈[J]} ‖ x_j − Σ_{k∈[K]} D_k z_{j,k} ‖_2² + λ Σ_{j∈[J]} Σ_{k∈[K]} ‖z_{j,k}‖_1, (2)

where we follow the notations and definitions on convolution matrix construction in [33]: D_k ∈ R^{(H−h+1)(W−w+1)×HW} is the ''cropped'' Toeplitz-block-Toeplitz matrix constructed from D_k, and z_{j,k} ≜ vec(Z_{j,k}) ∈ R^{HW} is the vectorization of Z_{j,k}, such that D_k z_{j,k} = D_k * Z_{j,k} keeps the ''valid'' part of their convolution. Likewise, x_j ≜ vec(X_j). One should bear in mind that the matrix-vector representation of convolution is just to facilitate algebraic derivations, whereas the true computation usually resorts to the FFT in the frequency domain.
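To make the matrix-vector notation concrete, here is a small numpy/scipy check (illustrative sizes, not the paper's code) that a convolution matrix built column-by-column from unit basis images reproduces the ''valid'' convolution D_k z_{j,k} = D_k * Z_{j,k}:

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(1)
H, W, h, w = 8, 8, 3, 3
Dk = rng.standard_normal((h, w))          # one 2D kernel

# Build the "cropped" convolution matrix column by column: column j is the
# valid convolution of the j-th unit basis image with Dk, vectorized.
cols = []
for j in range(H * W):
    e = np.zeros(H * W)
    e[j] = 1.0
    cols.append(convolve2d(e.reshape(H, W), Dk, mode="valid").ravel())
Dmat = np.stack(cols, axis=1)             # (H-h+1)(W-w+1) x HW

Z = rng.standard_normal((H, W))
lhs = Dmat @ Z.ravel()                    # matrix-vector form
rhs = convolve2d(Z, Dk, mode="valid").ravel()  # direct valid convolution
```

In practice this matrix is never formed explicitly; as noted in the text, the actual computation resorts to the FFT.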
2) FILTER UPDATE SUBPROBLEM (FIXING Z_{j,k})

arg min_{{d_k}} (1/2) Σ_{j∈[J]} ‖ x_j − Σ_{k∈[K]} Z_{j,k} d_k ‖_2² + Σ_{k∈[K]} ι_C(d_k), (3)

where Z_{j,k} ∈ R^{(H−h+1)(W−w+1)×hw} is constructed from Z_{j,k} and d_k ≜ vec(D_k). Here, the indicator function ι_C(y) takes the place of the filter constraints in (1),

ι_C(y) = 0 if y ∈ C, and +∞ otherwise. (4)

The sparse coding step in (2) can be directly decomposed into J independent subproblems w.r.t. each training image x_j. Nevertheless, the filter update step involves all x_j, which is much more computationally challenging. A recent work [34] employed the multiconvex optimization strategy [35] to sequentially update each filter and its corresponding sparse codes, in which a block proximal gradient method using a majorizer is used instead of ADMM for lower memory consumption.
Another line of research proposes to use non-convex regularization (the ℓ_0 norm) and non-convex constraints (‖D_k‖_F² = 1) [36], [37], performing joint optimization of {D_k} and {Z_{j,k}} and relying on local Lipschitz constant estimation and proper initialization to achieve global convergence.

B. MULTI-CHANNEL CDL
While plenty of research on single channel CDL has been conducted as discussed above, multi-channel CDL has received relatively little investigation [38], [39]. Basically, when extending CDL from the single channel to the multi-channel scenario, two kinds of approaches can be used: a single channel dictionary combined with independent multi-channel code maps, or a multi-channel dictionary combined with a shared set of code maps. Since the latter case provides a much more memory efficient representation, we adopt this convolutional representation for HSI in our work.
The mathematical model of multi-channel CDL can be formulated as:

arg min_{{D_{k,b}},{Z_{j,k}}} (1/2) Σ_{j∈[J]} Σ_{b∈[B]} ‖ X_{j,b} − Σ_{k∈[K]} D_{k,b} * Z_{j,k} ‖_F² + λ Σ_{j∈[J]} Σ_{k∈[K]} ‖Z_{j,k}‖_1, s.t. ‖D_k‖_F ≤ 1, ∀k ∈ [K], (5)

where D_{k,b} is channel b of filter k, and Z_{j,k} is the k-th code map shared by all B channels X_{j,b} of the training image X_j. When training a convolutional dictionary for HSI, a practical manipulation is to crop the HSI in the spatial dimensions into small cubes to alleviate computation and storage problems. Therefore, the image boundary problem should be properly treated to avoid its undermining influence on dictionary learning. However, existing works [38], [39] on multi-channel CDL employ the model (5), which does not consider the boundary effect. Moreover, their ADMM solutions to the filter updating subproblem employ the Iterated Sherman-Morrison (ISM) formula to solve the linear system, leading to a quadratic complexity in the number of image channels, and are therefore not suitable for hyperspectral CDL.
Our work introduces a masking matrix into (5) to account for the boundary effect, and takes advantage of the flexible variable splitting capability of ADMM to solve the sparse coding and filter updating subproblems in an alternating manner. To acquire further acceleration, we employ PADMM to address the most computationally challenging part, the linear system solution, by circumventing explicit Hessian matrix calculation and inversion in the optimization process.

III. BI-CONVEX OPTIMIZATION FRAMEWORK
Now, we are ready to extend the single channel CDL problem to the 2D-representation and 3D-filter mode for HSI to account for the 3D data-cube structure. Such a scheme allows for a memory efficient representation. We also employ the bi-convex optimization framework and take advantage of the algebraic structures of the involved subproblems to effectively reduce the computational complexity.

A. SPARSE CODING SUBPROBLEM
Since convolution boundary handling has been recognized as an important issue for CDL [24], [34], hereafter all images are zero-padded in the spatial dimensions to size (H + h − 1) × (W + w − 1), denoted as N. We then introduce a mask matrix M ∈ R^{N×N} into (2) and rewrite it as follows,

arg min_{{z_j}} (1/2) Σ_{j∈[J]} ‖ x_j − M D z_j ‖_2² + λ Σ_{j∈[J]} ‖z_j‖_1, (6)

where D ≜ [D_1, …, D_K] and z_j ≜ [z_{j,1}^T, …, z_{j,K}^T]^T. Problem (6) is convex but nonsmooth, therefore we resort to ADMM for variable splitting. Introducing the auxiliary variables s_j = Dz_j and t_j = z_j, we write the augmented Lagrangian function for (6) as

L = Σ_{j∈[J]} { (1/2)‖x_j − M s_j‖_2² + λ‖t_j‖_1 + (ρ_α^{(z)}/2)‖D z_j − s_j + α_z‖_2² + (ρ_β^{(z)}/2)‖z_j − t_j + β_z‖_2² }, (7)

where α_z and β_z are the (scaled) dual variables, and ρ_α^{(z)} and ρ_β^{(z)} are the respective penalty factors. Then ADMM using the scaled dual variables [40] performs the sequential variable updates

z_j^{i+1} = (D^T D + r^{(z)} I)^{-1} [ D^T (s_j^i − α_z^i) + r^{(z)} (t_j^i − β_z^i) ],
s_j^{i+1} = (M^T M + ρ_α^{(z)} I)^{-1} [ M^T x_j + ρ_α^{(z)} (D z_j^{i+1} + α_z^i) ],
t_j^{i+1} = S_{λ/ρ_β^{(z)}} (z_j^{i+1} + β_z^i),
α_z^{i+1} = α_z^i + D z_j^{i+1} − s_j^{i+1},
β_z^{i+1} = β_z^i + z_j^{i+1} − t_j^{i+1}, (8)

where r^{(z)} ≜ ρ_β^{(z)}/ρ_α^{(z)} and S_κ(·) is the soft thresholding operator [40].
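The t_j-update above relies on the soft thresholding operator; a minimal numpy version (illustrative, following the standard definition in [40]) is:

```python
import numpy as np

def soft_threshold(x, kappa):
    """S_kappa(x) = sign(x) * max(|x| - kappa, 0): the proximal operator of
    kappa * ||.||_1, applied elementwise in the t_j-update."""
    return np.sign(x) * np.maximum(np.abs(x) - kappa, 0.0)

out = soft_threshold(np.array([-2.0, -0.3, 0.0, 0.5, 3.0]), 1.0)
```

Entries with magnitude below the threshold are zeroed, which is what produces sparsity in the code maps.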
Since M is a diagonal matrix, solving for s_j is cheap. However, the true challenge lies in the matrix inversion [D^T D + r^{(z)} I]^{-1} due to its huge size KN × KN. An existing wisdom on this problem is to resort to the frequency domain solution,

ẑ_j^{i+1} = (D̂^H D̂ + r^{(z)} I)^{-1} ( D̂^H ξ̂_j^i + r^{(z)} ζ̂_j^i ), (9)

where x̂ represents the DFT of x, ξ_j^i ≜ s_j^i − α_z^i, and ζ_j^i ≜ t_j^i − β_z^i. Unfortunately, unlike single channel CDL, whose 2D filters lead to an efficient solution using the matrix inversion lemma [41], the 3D filter structure demands a higher computational cost.

1) VARIABLE REORDERING FOR EQ. (9)
Observing that D̂^H D̂ has a sparse structure composed of K × K blocks of diagonal matrices of size N × N (10), (9) can be decomposed into N independent smaller problems

ẑ_{j,n}^{i+1} = (Ĥ_n + r^{(z)} I)^{-1} ( D̂_n^H ξ̂_{j,n}^i + r^{(z)} ζ̂_{j,n}^i ), (11)

where D̂_n ∈ C^{B×K} samples all the K filters' frequency spectra on each of the B channels at position n, Ĥ_n = D̂_n^H D̂_n ∈ C^{K×K} contains all the n-th diagonal elements of every block in (10), and ξ̂_{j,n}^i ∈ C^B and ζ̂_{j,n}^i ∈ C^K are extracted in the same way. Note that the intractable inversion of D^T D + r^{(z)} I has first been transformed into the frequency domain and then further decomposed into N matrix inversions of a much smaller size K × K.
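The per-frequency solves in (11) can be sketched as follows (illustrative sizes and random complex data; Dhat, xi and zeta stand in for the DFT-domain quantities):

```python
import numpy as np

rng = np.random.default_rng(2)
B, K, N, r = 6, 4, 16, 0.5            # illustrative sizes and penalty ratio
Dhat = rng.standard_normal((N, B, K)) + 1j * rng.standard_normal((N, B, K))
xi   = rng.standard_normal((N, B)) + 1j * rng.standard_normal((N, B))
zeta = rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))

# N independent K x K systems: (D_n^H D_n + r I) z_n = D_n^H xi_n + r zeta_n
z = np.empty((N, K), dtype=complex)
for n in range(N):
    Hn = Dhat[n].conj().T @ Dhat[n]   # small K x K Gram block
    z[n] = np.linalg.solve(Hn + r * np.eye(K),
                           Dhat[n].conj().T @ xi[n] + r * zeta[n])
```

Each index n is solved independently, which is what makes the decomposition attractive for parallel implementation (Section III-C).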
2) COMPLEXITY OF EQ. (11)
The computational overhead of (11) includes three items: i) the FFT of D, O(KBN log N); ii) the computation of D̂^H D̂ in (10), O(BNK²); and iii) the inversions of Ĥ_n + r^{(z)} I, O(NK³). The main complexity of (11) itself lies in the two matrix-vector multiplications, resulting in O(JNK(K + B)). Note that the overhead can be split among the inner iterations of ADMM, at the cost of extra storage for the inverted matrices.

B. FILTER UPDATE SUBPROBLEM
The extension of the single channel model in (3) to the 2D-representation and 3D-filter mode can be formulated as

arg min_d (1/2) ‖ x − M Z d ‖_2² + ι_C(d), (12)

where the whole batch is considered together by defining x ≜ [x_1^T, …, x_J^T]^T, Z ≜ [Z_1^T, …, Z_J^T]^T and d ≜ [d_1^T, …, d_K^T]^T, and the indicator function ι_C combines two constraints on the learned filters: their original spatial support and the convex unit-norm ball.
Problem (12) can also be solved using ADMM, resulting in the following iterations,

d^{i+1} = (Z^T Z + r^{(d)} I)^{-1} [ Z^T (s^i − α_d^i) + r^{(d)} (t^i − β_d^i) ],
s^{i+1} = (M^T M + ρ_α^{(d)} I)^{-1} [ M^T x + ρ_α^{(d)} (Z d^{i+1} + α_d^i) ],
t^{i+1} = proj_C (d^{i+1} + β_d^i),
α_d^{i+1} = α_d^i + Z d^{i+1} − s^{i+1},
β_d^{i+1} = β_d^i + d^{i+1} − t^{i+1}, (13)

where r^{(d)} ≜ ρ_β^{(d)}/ρ_α^{(d)}, s = Zd and t = d are the auxiliary variables, and proj_C is the projection onto the constraint set C. (14)

Similarly, the most computationally difficult part in (13) is the inversion of the matrix Z^T Z + r^{(d)} I, which involves all training samples and thus becomes more challenging than the z-subproblem in (8). The frequency domain solution of the d-subproblem reads

d̂^{i+1} = (Ẑ^H Ẑ + r^{(d)} I)^{-1} ( Ẑ^H ξ̂^i + r^{(d)} ζ̂^i ), (15)

where ξ^i ≜ s^i − α_d^i and ζ^i ≜ t^i − β_d^i; the efficient computation choice depends on which of the image number J and the filter number K is bigger.

1) VARIABLE REORDERING FOR EQ. (15)
Ẑ^H Ẑ has a block sparse structure composed of K × K blocks of diagonal matrices of size BN × BN. (16) Again by variable reordering, we can decompose (15) into BN independent smaller problems

d̂_n^{i+1} = (Ĝ_n + r^{(d)} I)^{-1} ( Ẑ_n^H ξ̂_n^i + r^{(d)} ζ̂_n^i ), (17)

where Ẑ_n ∈ C^{J×K} samples the frequency spectra of all K coefficient maps for each of the J training samples at position n, and Ĝ_n = Ẑ_n^H Ẑ_n ∈ C^{K×K}. In (17), all spectral bands with the same n use a common Ĝ_n.

2) COMPLEXITY OF EQ. (17)
The computational overhead of (17) parallels that of (11): the FFT of Z, the computation of Ĝ_n in (16), and the inversions of Ĝ_n + r^{(d)} I, O(NK³); the main complexity of (17) itself lies in the two matrix-vector multiplications. The overhead can also be split among the ADMM iterations, at the cost of extra storage for the inverted matrices.

C. COMMENTS ON COMPUTATIONAL COMPLEXITY
We present three comments on the computational complexity of our CDL algorithm: (1) The computational overheads of the matrix inversions in (11) and (17) can be split among the multiple ADMM inner iterations in each subproblem, at the storage cost of the extracted smaller inverted matrices. (2) The decomposed smaller linear systems of equations, (11) and (17), are completely independent at all frequency indices; therefore our algorithm is highly suitable for parallel implementation. (3) The computations of D̂^H D̂ and Ẑ^H Ẑ, and the corresponding matrix inversions, impose quadratic or even cubic dependence on K, rendering the CDL problem still rather computationally expensive.
In the following section, we will introduce the preconditioned ADMM (PADMM) into CDL, which we call PCDL, having a computational complexity of O(2NBKJ) (see Section IV.D for the complexity analysis).

Fig. 1. The proposed PCDL framework: J HSI samples are used to train a 3D convolutional dictionary consisting of K filters, while every sample has a set of K 2D feature maps, each shared by all spectral channels of the corresponding 3D filter. PADMM is employed in both the convolutional sparse coding and the filter updating sub-problems for algorithm acceleration and memory saving.

IV. HESSIAN FREE CDL USING PADMM
Although the previously described variable reordering technique in the frequency domain has made CDL computationally tractable, the explicit matrix inversions in (11) and (17) still demand a great amount of computational effort as well as the corresponding memory storage. In order to acquire a lightweight and efficient CDL optimization method for HSI, we introduce the preconditioned ADMM (PADMM) [26], [27] into CDL, which we call PCDL. The proposed PCDL does not need explicit calculation or storage of the Hessians D^T D and Z^T Z, eliminating the quadratic or higher order computational complexity in the problem scale (N, B, K, J), as well as the quadratic memory requirement in the filter number K. The framework of PCDL is shown in Fig. 1.
A. PRELIMINARIES ON PADMM
Consider the generic two-block problem

min_{p,q} F(p) + G(q), s.t. Ap + Bq = c, (19)

with the augmented Lagrangian L(p, q, γ) = F(p) + G(q) + ⟨γ, Ap + Bq − c⟩ + (ρ/2)‖Ap + Bq − c‖², where γ ∈ W is the dual variable, and ρ is the penalty parameter. Let * and ∂ denote the adjoint and the subgradient operators, respectively. The optimality conditions of (19) involve the evaluations of (ρA*A + ∂F)^{-1} and (ρB*B + ∂G)^{-1}, which are computationally challenging and can be circumvented by employing the preconditioned ADMM [26].
Lemma V.1 (PADMM, [26, Theorem 3.1]). Assume problem (19) has a saddle point. Then the sequence (p^i, q^i, γ^i) generated by

p^{i+1} = arg min_p F(p) + (ρ/2)‖Ap + Bq^i − c + γ^i/ρ‖² + (1/2)‖p − p^i‖²_{Q − ρA*A},
q^{i+1} = arg min_q G(q) + (ρ/2)‖Ap^{i+1} + Bq − c + γ^i/ρ‖² + (1/2)‖q − q^i‖²_{L − ρB*B},
γ^{i+1} = γ^i + ρ(Ap^{i+1} + Bq^{i+1} − c), (20)

converges to an optimal solution of problem (19) if the following conditions hold: (a) ρA*A + ∂F and ρB*B + ∂G are strongly monotone; (b) Q − ρA*A ⪰ 0 and L − ρB*B ⪰ 0, where Q and L are linear, bounded, self-adjoint, and positive definite operators; (c) A or B is compact.
When Q and L are chosen properly, (20) can be solved efficiently due to the added proximal regularization terms of the form (1/2)‖p − p^i‖²_{Q − ρA*A}. This preconditioning strategy has also been proposed in the primal-dual framework [42], [43]. Below, we will demonstrate how to employ this method to reduce the numerical burden of the CDL algorithm.

B. PRECONDITIONED SPARSE CODING
Now we present the first contribution of this paper.
Proposition V.2. If we take a bounded and positive definite matrix Q ⪰ ρ_α^{(z)} D^T D and L = I, then the CSC subproblem (6) can be cast within the PADMM framework, resulting in the preconditioned solution

z_j^{i+1} = (Q + ρ_β^{(z)} I)^{-1} [ Q z_j^i − ρ_α^{(z)} D^T (D z_j^i − s_j^i + α_z^i) + ρ_β^{(z)} (t_j^i − β_z^i) ]. (21)

Proof: We first transform the augmented Lagrangian in (7) so as to comply with the standard form in (19), by choosing the splitting variables and operators accordingly. Conditions (a) and (b) then hold by construction, and (c) holds because our construction of A and B leads to their compactness. Finally, the right hand side of EQ. (21) is the closed-form solution of the proximal-regularized least squares problem thereof. By Parseval's theorem, we can also precondition the z-subproblem in the frequency domain,

ẑ_j^{i+1} = (Q̂ + ρ_β^{(z)} I)^{-1} [ Q̂ ẑ_j^i − ρ_α^{(z)} D̂^H (D̂ ẑ_j^i − ŝ_j^i + α̂_z^i) + ρ_β^{(z)} (t̂_j^i − β̂_z^i) ]. (22)

Using the same variable reordering manipulation as in (11), the reordered version of (22) at each frequency index n reads

ẑ_{j,n}^{i+1} = (Q̂_n + ρ_β^{(z)} I)^{-1} [ Q̂_n ẑ_{j,n}^i − ρ_α^{(z)} D̂_n^H (D̂_n ẑ_{j,n}^i − ŝ_{j,n}^i + α̂_{z,n}^i) + ρ_β^{(z)} (t̂_{j,n}^i − β̂_{z,n}^i) ]. (23)

If we choose the Q̂_n to be diagonal matrices, then the main complexity of (23) lies in the two matrix-vector multiplications with D̂_n ∈ C^{B×K}, resulting in O(2JNBK) over all J samples and N frequencies. Next we present two approaches to selecting feasible diagonal matrices Q̂_n that do not require the computation of the Hessian D̂^H D̂.
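The effect of such a preconditioner can be checked numerically. The sketch below (illustrative sizes; a scalar Richardson-type Q, not the paper's full PADMM loop) iterates the preconditioned quadratic update using only matrix-vector products with D and D^T, and converges to the exact solution of the regularized normal equations:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 20, 12
D = rng.standard_normal((m, n))
c1, c2 = rng.standard_normal(m), rng.standard_normal(n)
rho_a, rho_b = 1.0, 1.0

# Exact minimizer of (rho_a/2)||Dz - c1||^2 + (rho_b/2)||z - c2||^2
z_star = np.linalg.solve(rho_a * D.T @ D + rho_b * np.eye(n),
                         rho_a * D.T @ c1 + rho_b * c2)

# Hessian-free iteration with scalar q >= rho_a * sigma_1(D)^2 (Richardson):
# z <- (q z - rho_a D^T(D z - c1) + rho_b c2) / (q + rho_b)
q = rho_a * (np.linalg.svd(D, compute_uv=False)[0] ** 2 + 1e-3)
z = np.zeros(n)
for _ in range(5000):
    z = (q * z - rho_a * D.T @ (D @ z - c1) + rho_b * c2) / (q + rho_b)
```

Note that D^T D is never formed or stored; only its spectral radius (one singular value of D) is needed, which is the essence of the Hessian-free claim.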

C. DIAGONAL PRECONDITIONER SELECTION
Next, we present the second contribution of this paper.

1) RICHARDSON PRECONDITIONER
For n = 1, …, N, let θ_n^{(z)} = σ₁²(D̂_n) be the spectral radius of D̂_n^H D̂_n, and ε a small positive constant. When taking

Q̂_n = ρ_α^{(z)} ( θ_n^{(z)} + ε ) I, (24)

we have Q̂_n − ρ_α^{(z)} D̂_n^H D̂_n ≻ 0, which corresponds to the simplest Richardson preconditioner.
To avoid the computation of D̂_n^H D̂_n, we use the singular value decomposition (SVD) to get the biggest singular value of D̂_n, denoted σ₁(D̂_n). Since D̂_n is of size B × K, and the SVD of a B × K matrix has complexity O(BK min(K, B)) when only singular values are desired [45], [46], computing all the θ_n^{(z)} has complexity O(NBK min(K, B)). A byproduct of the block-wise SVD is the spectral radius of D^T D, θ^{(z)} = max_n σ₁²(D̂_n), which holds because neither the FFT nor the permutation matrix changes the eigenvalues of D^T D.
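A small numpy check of the Richardson construction (illustrative sizes; eigvalsh is used only to verify positive definiteness, not in the algorithm itself):

```python
import numpy as np

rng = np.random.default_rng(4)
B, K, rho, eps = 6, 4, 1.3, 1e-6
Dn = rng.standard_normal((B, K)) + 1j * rng.standard_normal((B, K))

sigma1 = np.linalg.svd(Dn, compute_uv=False)[0]  # largest singular value only
Qn = rho * (sigma1 ** 2 + eps) * np.eye(K)       # Richardson preconditioner

# Verification: Qn - rho * Dn^H Dn should be positive definite
min_eig = np.linalg.eigvalsh(Qn - rho * Dn.conj().T @ Dn).min()
```

The smallest eigenvalue of the difference equals ρε by construction, so positive definiteness holds with any ε > 0.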

2) DIAGONAL MAJORIZING PRECONDITIONER
The following proposition provides an efficient way to construct a diagonal majorizing preconditioner.
Proposition V.3. Let D ∈ C^{B×K}, and construct the diagonal matrix Q = diag(q_1, …, q_K), where

q_k = Σ_{b=1}^{B} |D_{b,k}| Σ_{k'=1}^{K} |D_{b,k'}|. (25)

Then Q ⪰ D^H D.
Proof: One can find that the {q_k} are nothing but the row sums of |D^H||D|. Therefore the Hermitian matrix Q − D^H D is diagonally dominant with nonnegative diagonal entries. Then by Gershgorin's circle theorem, every eigenvalue λ(Q − D^H D) ≥ 0, which completes the proof.
Note that constructing Q using (25) has complexity O(BK), and is thus more efficient than using the SVD.
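Proposition V.3 can be verified numerically; the sketch below (illustrative sizes) computes the {q_k} as row sums of |D^H||D| in O(BK) operations and checks positive semidefiniteness, as guaranteed by Gershgorin's theorem:

```python
import numpy as np

rng = np.random.default_rng(5)
B, K = 6, 4
D = rng.standard_normal((B, K)) + 1j * rng.standard_normal((B, K))

# q_k = sum_b |D_{b,k}| * (sum_{k'} |D_{b,k'}|): the row sums of |D^H||D|,
# computed in O(BK) without ever forming the K x K product.
row_abs = np.abs(D).sum(axis=1)                 # per-channel sums
q = (np.abs(D) * row_abs[:, None]).sum(axis=0)

# Gershgorin: diag(q) - D^H D is diagonally dominant, hence PSD
min_eig = np.linalg.eigvalsh(np.diag(q) - D.conj().T @ D).min()
```

The explicit |D|^T |D| product appears only in the verification; the algorithm itself never needs it.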

D. PRECONDITIONED FILTER UPDATE
As the ADMM solution for the filter update subproblem has a similar structure to the CSC scenario, the arguments in Sec. IV.C also apply to the filter update subproblem. Here we directly give the preconditioned solution for the filter update,

d̂_n^{i+1} = (Q̂_n + ρ_β^{(d)} I)^{-1} [ Q̂_n d̂_n^i − ρ_α^{(d)} Ẑ_n^H (Ẑ_n d̂_n^i − ŝ_n^i + α̂_n^i) + ρ_β^{(d)} (t̂_n^i − β̂_n^i) ], (26)

where now Q̂_n ⪰ ρ_α^{(d)} Ẑ_n^H Ẑ_n. The main computational complexity of (26) lies in the two matrix-vector multiplications with Ẑ_n ∈ C^{J×K}, resulting in O(2JNBK) for the total BN blocks.

E. SUMMARY OF PCDL
Our technical avenue for acceleration can be divided into four stages: • Firstly, the Fourier trick is exploited to decompose the original large linear systems involving D^T D (KN × KN) and Z^T Z (KBN × KBN) into respectively N and NB independent smaller linear systems of scale K × K, which makes direct matrix inversion feasible.
• Secondly, we further employ PADMM to precondition each frequency block rather than using a global preconditioner, which results in tighter bounds for the preconditioners and thus faster convergence rate.
• Thirdly, we design diagonal preconditioning matrices whose construction complexities are only O(NBK) for sparse coding and O(NJK) for filter update, respectively; by comparison, the direct inversions cost O(NK³).
• Fourthly, by using diagonal preconditioners, neither computation nor storage of the Hessians D^T D and Z^T Z is required, which further saves O(BNK²) and O(NJK²), respectively. The whole procedure of PCDL is shown in Algorithm 1.

V. COMPRESSIVE CHROMOTOMOGRAPHY RECONSTRUCTION USING CSC
As the third main contribution of this paper, we demonstrate the usefulness of our learned convolutional dictionary to CCT reconstruction.

A. IMAGING MODEL
Chromo-tomographic hyperspectral imaging spectrometer (CTHIS) was first proposed by Mooney et al. in [47] to achieve a high optical throughput, which is of significant interest for weak-target hyperspectral measurement and also for the development of hyperspectral imagers with a small form factor.

A schematic of CTHIS is shown in Fig. 2, where a rotating direct view prism (DVP) is implanted into the collimated space to effectively displace the spectral images on the focal plane array (FPA) with wavelength-dependent shifts in the w and h directions, as described by the dispersion relation (27), where k is the dispersion coefficient of the DVP, ω is the wavelength, and ω_c and ϕ denote the undeviated wavelength and the rotation angle of the DVP, respectively. By calibration, the optical parameters k and ω_c are translated into pixel numbers and spectral channel indices, respectively.
The spectral images, shifted according to (27), are integrated along the chromatic dimension by the FPA [28], [47], which mathematically follows the Radon transform and is comparable to parallel beam X-ray tomography, hence the name chromo-tomography. The original reconstruction algorithm of CTHIS amounts to solving an overdetermined linear system of equations in the Fourier domain, while this work exploits compressed sensing theory [48], [49] to pursue compressive chromo-tomographic reconstruction from a number of chromo-tomographic shots smaller than the number of spectral bands.
Denote the lexicographically stacked representation of the original datacube by x ∈ R^{HWB}, and the measurement vector captured at DVP angle ϕ_l by g_l ∈ R^{HW}. Then the discrete imaging model of CCT can be formulated as

g_l = A_l x, (28)

where the forward system matrix A_l ≜ [A_l^1, …, A_l^B] stands for the summation of all the B shifted spectral channels, and A_l^b, b ∈ [B], implements the spatial shift of channel b [28],

A_l^b = F^{-1} Λ_l^b F, (29)

where F is the Fourier transform matrix and Λ_l^b is the diagonal matrix realizing, in the frequency domain, the image shifts δ caused by the dispersion in (27).
Finally, suppose we have L measurements defined in (28); then they can be concatenated to form

g = A x, (30)

where g ≜ [g_1^T, …, g_L^T]^T and A ≜ [A_1^T, …, A_L^T]^T.
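A toy numpy sketch of the CCT forward model (illustrative sizes; circular shifts via np.roll stand in for the calibrated dispersion shifts, whereas a real system would crop rather than wrap):

```python
import numpy as np

rng = np.random.default_rng(6)
B, H, W = 8, 16, 16
cube = rng.random((B, H, W))              # toy hyperspectral datacube

def cct_shot(cube, shifts):
    """One measurement g_l: shift every spectral band by its wavelength-
    dependent (dh, dw) offset, then integrate along the chromatic dimension."""
    g = np.zeros(cube.shape[1:])
    for b, (dh, dw) in enumerate(shifts):
        g += np.roll(cube[b], shift=(dh, dw), axis=(0, 1))
    return g

# e.g. dispersion along w only (a DVP angle with no h-component)
shifts = [(0, b - B // 2) for b in range(B)]
g = cct_shot(cube, shifts)
```

Rotating the DVP changes the shift directions, giving a different linear projection A_l of the same cube per shot; concatenating L such shots yields (30).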

B. COMPRESSIVE RECONSTRUCTION ALGORITHM
It is well known that CSC is a rather effective representation method for the high-frequency (HF) components of images, which has been demonstrated in different image processing tasks [50]-[56]. When dealing with the low-frequency (LF) components in single channel CSC applications, a common method is to add a low-pass Gaussian filter [24], [29], [57] or a Dirichlet function [51], [53] into the learned filter set. However, a direct extension of this manipulation to hyperspectral images cannot effectively cope with the variations among spectral channels.
In this section, we first acquire an HSI estimate via a least squares-total variation (LS-TV) algorithm, then retrieve its LF component, which is further fed into an LS-TV-CSC framework for HF component enhancement.

1) LS-TV RECONSTRUCTION
We first solve the following LS-TV problem for an intermediate image reconstruction result,

arg min_x (1/2) ‖ A x − g ‖_2² + τ ‖x‖_TV, (31)

where A and g are defined in (30), and ‖x‖_TV ≜ ‖Δx‖_{2,1} with Δ ≜ I_B ⊗ Δ̃. Thereof, ⊗ is the Kronecker product and Δ̃ ∈ R^{2HW×HW} is the total finite difference operator [58].
Introducing the auxiliary variable s = Δx, the ADMM solution to (31) is obtained through the following alternations,

x^{i+1} = arg min_x (1/2)‖Ax − g‖_2² + (ρ_α/2)‖Δx − s^i + α^i/ρ_α‖_2²,
s^{i+1} = arg min_s τ‖s‖_{2,1} + (ρ_α/2)‖Δx^{i+1} − s + α^i/ρ_α‖_2²,
α^{i+1} = α^i + ρ_α (Δx^{i+1} − s^{i+1}), (32)

where ρ_α is the penalty factor and α is the dual variable.
x-subproblem: The x-subproblem in (32) involves the big inverse matrix [A^T A + ρ_α Δ^T Δ]^{-1}, so we employ the preconditioning trick introduced in Section IV and obtain the preconditioned x-subproblem as follows,

arg min_x (1/2)‖Ax − g‖_2² + (ρ_α/2)‖Δx − s^i + α^i/ρ_α‖_2² + (1/2)‖x − x^i‖²_{N_A − A^T A} + (1/2)‖x − x^i‖²_{N_Δ − ρ_α Δ^T Δ}, (33)

where N_A ⪰ A^T A and N_Δ ⪰ ρ_α Δ^T Δ are diagonal preconditioners. Note that by the definition of A_l^b in (29), Â^H Â has the same structure as D̂^H D̂ in (10); thus a diagonal preconditioner N_A can be obtained in the same way as in the filter update problem, using (24) or (25). As to Δ^T Δ = I_B ⊗ Δ̃^T Δ̃, we use an established bound on the operator norm, ‖Δ̃‖² ≤ 8 [59], leading to the diagonal choice N_Δ = 8ρ_α I. Therefore, the solution of (33) reads

x^{i+1} = (N_A + N_Δ)^{-1} [ (N_A + N_Δ) x^i − A^T (A x^i − g) − ρ_α Δ^T (Δ x^i − s^i + α^i/ρ_α) ], (34)

s-subproblem: the solution of s in (32) is given by the group soft thresholding

s_n^{i+1} = max( ‖θ‖_2 − τ/ρ_α, 0 ) · θ/‖θ‖_2, (35)

where θ = Δ̃_n x_b^{i+1} + α_n^i/ρ_α, and Δ̃_n ∈ R^{2×HW} extracts the horizontal and vertical differences at pixel n in the spectral band image x_b.
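The operator-norm bound ‖Δ̃‖² ≤ 8 used for N_Δ can be checked numerically. The sketch below uses an assumed forward-difference realization of Δ̃ with zero boundary and estimates the largest eigenvalue of Δ̃^T Δ̃ by power iteration:

```python
import numpy as np

def grad2d(x):
    """Forward finite differences (zero boundary): stacks vertical and
    horizontal differences, an assumed realization of Delta-tilde."""
    dh = np.zeros_like(x); dh[:-1, :] = x[1:, :] - x[:-1, :]
    dw = np.zeros_like(x); dw[:, :-1] = x[:, 1:] - x[:, :-1]
    return np.stack([dh, dw])

def grad2d_adjoint(y):
    """Adjoint of grad2d, i.e. Delta-tilde^T (a negative divergence)."""
    dh, dw = y
    x = np.zeros_like(dh)
    x[1:, :] += dh[:-1, :]; x[:-1, :] -= dh[:-1, :]
    x[:, 1:] += dw[:, :-1]; x[:, :-1] -= dw[:, :-1]
    return x

# Power iteration for the spectral radius of Delta^T Delta
rng = np.random.default_rng(7)
x = rng.standard_normal((32, 32))
for _ in range(300):
    x = grad2d_adjoint(grad2d(x))
    x /= np.linalg.norm(x)
lam = np.sum(grad2d(x) ** 2)   # Rayleigh quotient with ||x|| = 1
```

On any grid the estimate stays below 8 and approaches it as the grid grows, consistent with the bound cited from [59].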
Once the stopping criterion of (32) is reached, we take the estimate x as the intermediate result and extract its LF part v by low-pass filtering,

v = G(x), (36)

where G(·) stands for low-pass Gaussian filtering.
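The LF/HF split in (36) can be sketched with a band-wise Gaussian low-pass (scipy's gaussian_filter; the σ value 4.773 follows the experimental setup in Section VI, everything else here is illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(8)
x = rng.random((31, 64, 64))      # toy HSI estimate, laid out as (B, H, W)

# v = G(x): low-pass each band spatially; the residual is the HF part
v = gaussian_filter(x, sigma=(0.0, 4.773, 4.773))
hf = x - v
```

Only the HF residual is modeled by the CSC prior, matching the training protocol in Section VI where LF components are removed before CDL.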

2) LS-TV-CSC ENHANCEMENT
The HF parts of the compressive measurements are extracted by

g̃ = g − A v. (37)

Since inaccurate filters and structure loss may exist in the learned dictionary D [50], we augment the CSC prior with a TV regularization to suppress possible artifacts, and reconstruct the HF part by solving the following problem,

arg min_z (1/2) ‖ A D z − g̃ ‖_2² + λ ‖z‖_1 + τ ‖Dz‖_TV, (38)

where z is the CSC map corresponding to the HF components of the unknown hyperspectral image x.
Introducing the auxiliary variables u = Dz and t = z, the ADMM solution to (38) is obtained through alternations over z, t and u, with dual variables β and γ and penalty factors ρ_β and ρ_γ attached to the two constraints, respectively. (39) z-subproblem: One can find that the z-update in (39) is nothing but the single image version of the z-solution in (8) for CSC. Therefore, here we directly give its preconditioned solution, similar to (21),

z^{i+1} = (N_D + ρ_γ I)^{-1} [ N_D z^i − ρ_β D^T (D z^i − u^i + β^i/ρ_β) + ρ_γ (t^i − γ^i/ρ_γ) ], (40)

where N_D ⪰ ρ_β D^T D can be constructed using (24) or (25) in the same way, and the frequency domain solution in (23) is also applicable to (40).
t-subproblem: The t-update is given by the soft thresholding

t^{i+1} = S_{λ/ρ_γ}(b), (41)

where b = z^{i+1} + γ^i/ρ_γ. u-subproblem: The u-update needs to solve a proximal-regularized LS-TV problem similar to (31), and we leave its solution to the next subsection.

3) SOLUTION OF U-SUBPROBLEM
Introducing the auxiliary variable e = Δu, we conduct an inner ADMM alternation for the u-subproblem in (39),

u^{j+1} = arg min_u (1/2)‖Au − g̃‖_2² + (ρ_β/2)‖D z^{i+1} − u + β^i/ρ_β‖_2² + (ρ_φ/2)‖Δu − e^j + φ^j/ρ_φ‖_2²,
e^{j+1} = arg min_e τ‖e‖_{2,1} + (ρ_φ/2)‖Δu^{j+1} − e + φ^j/ρ_φ‖_2²,
φ^{j+1} = φ^j + ρ_φ (Δu^{j+1} − e^{j+1}), (43)

where ρ_φ is the penalty factor and φ is the dual variable. u-update: The u-update in (43) has a similar structure to the x-subproblem in (32). Here, we also add the two preconditioning terms (1/2)‖u − u^j‖²_{N_A − A^T A} and (1/2)‖u − u^j‖²_{N_Δ − ρ_φ Δ^T Δ} to the u-update objective, so that A^T A and Δ^T Δ cancel out for efficient computation. The resultant solution is given by

u^{j+1} = (N_A + N_Δ + ρ_β I)^{-1} [ (N_A + N_Δ) u^j − A^T (A u^j − g̃) + ρ_β (D z^{i+1} + β^i/ρ_β) − ρ_φ Δ^T (Δ u^j − e^j + φ^j/ρ_φ) ], (44)

where N_A and N_Δ are the same as those in (34). e-update: The solution of e in (43) parallels the s-solution in (35),

e_n^{j+1} = max( ‖θ‖_2 − τ/ρ_φ, 0 ) · θ/‖θ‖_2, (45)

where θ = Δ̃_n u_b^{j+1} + φ_n^j/ρ_φ. The ADMM alternation for the u-subproblem is detailed in Algorithm 3.

VI. EXPERIMENTS AND ANALYSIS
In this section, we first verify the convergence performance of the proposed PCDL algorithm under different dictionary parameter settings. Then numerical simulation experiments using the learned convolutional dictionaries are carried out to demonstrate the effectiveness of CSC for compressive chromo-tomography reconstruction.

A. CDL EXPERIMENTAL SETUP
All CDL algorithms in this section were implemented in MATLAB and run on a workstation with an Intel Xeon E5-2620 CPU and 64 GB RAM. For a fair comparison, all algorithms were fed with identical randomly initialized feature maps and filters.

1) ALGORITHM NAMING CONVENTIONS
We compare four ADMM-based implementations of CDL for HSI with convex constraints, including: • HCDL: This is the implementation of the Hessian-based algorithm described in Section III, which requires the calculation and inversion of Hessian matrices and can be regarded as a 3D extension of the fast and flexible CDL in [24]. Since accurate matrix inversion is directly employed, this version can serve as a benchmark for convergence rate.
• HCDL-greedy: This is a greedy version of HCDL that performs only one ADMM iteration in each step, which can be regarded as a 3D extension of the fast CDL [23] and the efficient CSC using ISM [32].
• PCDL1: This is the implementation of the proposed Algorithm 1, using the Richardson-type preconditioner in (24).
• PCDL2: This is the implementation of the proposed Algorithm 1, using the diagonal majorizing preconditioner in (25).

2) HSI DATASET SPECIFICATION
We use the Columbia CAVE dataset [60] as our training set. The original HSIs are of size 512 × 512 × 31, covering the spectral range from 400 nm to 700 nm. We downsampled the original images by a factor of 2 and cropped them into small samples of size 128 × 128 × 31. All samples were normalized to [0, 1] and then low-pass filtered by a 13 × 13 Gaussian filter with the standard deviation σ empirically set to 4.773. Finally, their LF components were removed and the resultant HF parts were used for CDL.
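The preprocessing pipeline just described can be summarized in a short NumPy sketch. The function name and the per-sample normalization are illustrative assumptions; the downsampling factor, crop size, filter size, and σ follow the text.

```python
import numpy as np

def preprocess_sample(img):
    """Prepare one HSI training sample: downsample by 2, crop to
    128 x 128 x bands, scale to [0, 1], then keep only the HF residual
    after a 13 x 13 Gaussian low-pass with sigma = 4.773."""
    x = img[::2, ::2, :]                            # spatial downsampling by 2
    x = x[:128, :128, :]                            # crop a 128 x 128 patch
    x = (x - x.min()) / (x.max() - x.min() + 1e-12) # normalize to [0, 1]
    r = np.arange(13) - 6.0                         # 13-tap Gaussian window
    k = np.exp(-0.5 * (r / 4.773) ** 2)
    k /= k.sum()
    lf = np.apply_along_axis(lambda a: np.convolve(a, k, mode="same"), 0, x)
    lf = np.apply_along_axis(lambda a: np.convolve(a, k, mode="same"), 1, lf)
    return x - lf                                   # HF part used for CDL
```

Removing the LF component before learning lets the small convolutional kernels spend their capacity on the structured HF detail rather than on smooth backgrounds.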

3) CDL ALGORITHM PARAMETER SETTING
The parameters in Algorithm 1 were set as follows, where T is the maximum intensity of the input images [61]. We used S = V = 5 inner iterations for each filter and feature-map update, respectively.

4) STOPPING CRITERION
Iterations are terminated if either of the following conditions is satisfied, where F (d, z) is the objective function defined in (5), and l is the iteration index of the outer loop in Algorithm 1. The parameter tol was set to $10^{-3}$.
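A relative-decrease test is one common form of such a criterion; the exact conditions in the paper are not reproduced here, so the sketch below is an assumed illustration of how a tol-based stop on the objective F(d, z) could look:

```python
def should_stop(F_prev, F_curr, tol=1e-3):
    """Terminate when the objective changes by less than tol in
    relative terms between consecutive outer iterations."""
    return abs(F_prev - F_curr) <= tol * abs(F_prev)
```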

B. CDL CONVERGENCE COMPARISON
In this experiment, we conduct two CDL tasks of different scales for every algorithm described above. The first uses a training set of moderate size with J = 32 samples to train K = 25 filters; the second learns K = 50 filters from J = 100 samples. The cost-function value vs. iteration number curves are drawn in Fig. 3(a) and (c), which show that the two PCDL algorithms converge at slightly slower rates than the HCDL versions that use exact matrix inversions. However, the corresponding running-time curves in Fig. 3(b) and (d) reveal that PCDL achieves the same optimization performance as HCDL at a much faster speed. This experiment verifies the reliable convergence of PCDL compared with methods resorting to exact Hessian inversions. Both the runtime and the computational complexity analysis are presented in Table 1 for a comprehensive comparison.

C. LEARNED FILTERS
The 3D filters learned by PCDL2 in the two tasks are displayed in Fig. 4. We assemble the filters in each channel to observe their intra-channel diversity, and also provide a 2D color version obtained by dimensionality reduction to reveal the chromatic variation among the 3D filters. This experiment shows that the proposed CDL method successfully captures the spatial-spectral features hidden in the HSI samples. Note that in the color display of the 25-filter dictionary, there exist some grey filters, which correspond to features common to all spectral channels and thus have a constant appearance. For the 50-filter dictionary, however, we do not observe this phenomenon, perhaps because the larger filter number allows similar spectral variations to disperse into more filters. An interesting question therefore arises: do more filters lead to better results for inverse problems? We investigate this question through the following experiments on CCT reconstruction.
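The paper does not specify which dimensionality reduction produces the 2D color view; one plausible choice, sketched below as an assumption, is to project the spectral axis of each kernel onto its top-3 principal components and map the scores to RGB:

```python
import numpy as np

def filters_to_rgb(filters):
    """False-color view of a 3D filter bank: project the spectral axis
    onto the top-3 principal components and rescale to [0, 1] for display.
    filters: (K, h, w, B) array of K kernels with B spectral channels."""
    K, h, w, B = filters.shape
    X = filters.reshape(-1, B)                        # pixels x channels
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Y = Xc @ Vt[:3].T                                 # top-3 PC scores
    Y = (Y - Y.min()) / (Y.max() - Y.min() + 1e-12)
    return Y.reshape(K, h, w, 3)
```

Under this mapping, kernels that vary little across bands project near the data mean in all three components and therefore render as the grey patches noted above.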

D. CCT RECONSTRUCTION PERFORMANCE 1) EFFECT OF FILTER NUMBERS
This experiment was designed to compare the CCT reconstruction performances of the two dictionaries above. To take the generalization capability of the dictionaries into account, two test sets were compared: Set1, consisting of 4 images ((1)-(4) in Fig. 5) randomly picked from the training set, and Set2, consisting of 4 images ((10)-(13) in Fig. 5) randomly picked from the ICVL dataset [62]. All test hyperspectral images are of size 256 × 256 × 31. For each test image, we simulated multiple 2D chromo-tomographic measurements via the imaging model (30) as the inputs of Algorithm 2. The reconstruction PSNRs averaged over the two test sets are listed in Table 2. First, the reconstruction results of Set2 under each measurement setting show no obvious inferiority to those of Set1, and are even better with 4 shots. This validates that CDL can retrieve an image prior with generalization capability. Second, in almost all trials, using 50 filters yields slightly worse performance than using 25 filters. This may indicate that, on the one hand, 25 filters are already sufficient to learn the intrinsic HSI features, and on the other hand, more filters mean more unknown coefficient maps to estimate, which affects the final reconstruction. We will use the 25-filter dictionary in the subsequent experiments for further exploration.
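The PSNR figure of merit used in Table 2 is the standard one for data normalized to [0, 1]; a minimal sketch:

```python
import numpy as np

def psnr(ref, est, peak=1.0):
    """PSNR in dB between a reference cube and its reconstruction,
    for data scaled to the range [0, peak]."""
    mse = np.mean((ref - est) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```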

2) ROBUSTNESS TO MEASUREMENT NOISE
This experiment examines the proposed algorithm's robustness to measurement noise. We added Gaussian noise to the previous test sets at four SNR levels, ranging from 10 dB to 40 dB. Fig. 6 shows the average relative MSE in dB versus the SNR. First, reconstruction from a single shot is clearly infeasible because the sampling rate is as low as 1/31. In the extremely noisy 10 dB case, the proposed algorithm failed to recover signals from equally strong noise, regardless of the number of shots. Starting from 20 dB SNR, however, the error curves clearly show the desirable behavior between SNR and MSE. The error reductions of the two groups of curves, four shots vs. eight shots and eight shots vs. twelve shots, maintain an approximately constant 2:1 ratio, which corresponds to the improvement obtained by increasing the number of measurements accordingly.

3) RECONSTRUCTION RESULTS OF LARGE HSI
In this experiment, we test the reconstruction performance on the 16 test images shown in Fig. 5, but at a higher spatial resolution of 512 × 512. Since Fig. 6 suggests a desirable result when using 12 shots with 30 dB SNR, Table 3 collects the PSNR results under this experimental setting and compares them with the performance of CACCT (Coded Aperture Compressive Chromo-Tomography) using an analytic sparsifying basis [28]. The proposed method outperforms CACCT consistently. Fig. 7 shows the reconstruction results of four hyperspectral images from the Harvard dataset [63]. Compared with CACCT, our method recovers the HF components better, owing to the fusion of the learned CSC hyperspectral prior with the proposed optimization strategy.

E. FEATURE REDISCOVERING PERFORMANCE
The last experiment is designed to examine two aspects of CDL: its information-preserving capability and its feature-rediscovering capability. We first conduct a CDL task using 32 HSI samples of size 128 × 128 × 31, yielding a dictionary D consisting of 25 kernels of size 11 × 11 × 31. Then the resultant feature maps and the learned dictionary D are used to generate a new training set, which should be similar to the original one if the CDL algorithm incurs only minute information loss. Finally, a new round of CDL is started to train a new dictionary A, whose kernels should resemble those in D if the CDL algorithm has reliable global convergence behavior.
We use the following atom coherence to measure the similarity of the kernels in A and D, defined as the normalized absolute inner product between atom pairs. A pseudo-color plot of C is shown in Fig. 8. The average of the 25 largest values C(i, j) is 0.6814, and the mutual coherence between A and D is 0.9680. These statistics reflect that the atoms in A and D have a relatively high ensemble similarity; the remaining difference is probably due to the non-convex nature of the CDL problem, even though identical initializations were fed into the two trials of the CDL algorithm.
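Taking the coherence to be the normalized absolute inner product between vectorized kernels (the standard definition; the exact formula in the paper's equation is not reproduced here), the matrix C and the mutual coherence can be computed as:

```python
import numpy as np

def coherence_matrix(A, D):
    """C(i, j) = |<a_i, d_j>| / (||a_i|| * ||d_j||) between two kernel
    sets, each given as a (K, ...) array of K (possibly 3D) kernels."""
    Af = A.reshape(A.shape[0], -1)
    Df = D.reshape(D.shape[0], -1)
    Af = Af / np.linalg.norm(Af, axis=1, keepdims=True)
    Df = Df / np.linalg.norm(Df, axis=1, keepdims=True)
    return np.abs(Af @ Df.T)

def mutual_coherence(A, D):
    # largest absolute normalized inner product between the two sets
    return coherence_matrix(A, D).max()
```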

VII. CONCLUSION
In this paper, we cast the convolutional sparse coding and filter-updating problems within the PADMM framework, leading to efficient and convergent numerical solutions to CDL for HSI. Two efficient construction methods for diagonal preconditioners have been proposed, which contribute to a Hessian-free CDL algorithm for HSI with substantial acceleration and memory savings. Furthermore, we also employ PADMM to solve the CCT problem, which allows for the reconstruction of high-dimensional HSI. We then experimentally examined the usability of the learned convolutional dictionary in the CCT reconstruction problem. Numerical experiments validate that the proposed 3D CSC model is effective and generalizable in representing the HF components of HSI. The CCT reconstruction comparisons demonstrate that the CSC prior outperforms analytic sparsifying transforms, which may be of interest for other compressive hyperspectral imaging schemes. Future work will examine the proposed method in real-world applications and exploit a learning-analytic hybrid dictionary for CCT reconstruction. Additionally, with appropriate algebraic structure adaptation, the proposed preconditioning technique can be extended to train convolutional dictionaries for other multi-dimensional signals, such as video and light-field feature learning [29]. Another interesting direction is to explore the use of CSC as a sparse channel representation in wireless communication [65].