Hyperspectral Compressive Image Reconstruction With Deep Tucker Decomposition and Spatial–Spectral Learning Network

Hyperspectral compressive imaging has taken advantage of compressive sensing theory to capture spectral information of the dynamic world in recent decades, where an optical encoder is employed to compress high-dimensional signals into a single 2-D measurement. The core issue is how to reconstruct the underlying hyperspectral image (HSI). Although deep neural network methods have achieved much success in compressed sensing image reconstruction in recent years, they still have some unsolved issues, such as tradeoffs between performance and efficiency and the accurate exploitation of cubic structure information. In this article, we propose a deep Tucker decomposition and spatial–spectral learning network (DS-Net) to learn the tensor low-rank structure features and spatial–spectral correlation of HSI to promote reconstruction quality. Inspired by tensor decomposition, we first construct a deep Tucker decomposition module to learn the principal components from different modes of the image features. Then, we cascade a series of decomposition modules to learn multihierarchical features. Furthermore, to jointly capture the spatial–spectral correlation of HSI, we propose a spatial–spectral correlation learning module in a U-net structure for more robust reconstruction performance. Finally, experimental results on both synthetic and real datasets demonstrate the superiority of the proposed method over several state-of-the-art methods in both quantitative assessment and visual effects.


I. INTRODUCTION
HYPERSPECTRAL imaging aims at sampling the spectral reflectance of a scene to collect a 3-D dataset with two spatial dimensions (h, w) and one spectral dimension λ, called a data cube (h, w, λ) [1], [2]. Generally, it has tens to hundreds of discrete bands with high spectral resolution. The rich spectral details are beneficial to various computer vision tasks, such as classification [3], [4], [5], super-resolution [6], medical diagnosis [7], [8], and anomaly detection [9], [10]. To obtain the 3-D hyperspectral image (HSI), many different techniques have emerged [11], [12], [13]. However, these conventional imaging systems scan the scene with multiple exposures: they usually capture only one or two dimensions of the data cube at a time and then use additional scans to cover the remaining dimensions. During the scanning process, motion artifacts and low light efficiency may arise, which degrade the imaging quality and make these systems unsuitable for dynamic scenes. In the last few years, numerous snapshot hyperspectral imaging systems have been developed [8], [14], [15]. Based on the compressive sensing (CS) theory [16], [17], snapshot spectral imaging spectrometers collect both spectral and spatial information simultaneously and have attracted increasing attention due to their promising ability to capture dynamic targets.
Coded aperture snapshot spectral imaging (CASSI) [18] is a well-known hyperspectral snapshot imaging system that takes advantage of the CS theory and compresses the 3-D HSI into a 2-D snapshot measurement by random linear projection. It is worth noting that the number of samples required by CASSI is far fewer than that of scan-based spectrometers. Correspondingly, it needs an optimization algorithm to reconstruct the spectral scenes. However, the bottleneck of CASSI is the quality of the algorithm that reconstructs the 3-D HSI from the 2-D compressive measurement. Since this problem is underdetermined, HSI reconstruction from the snapshot measurement is an ill-posed inverse problem. Two kinds of methods, i.e., model-driven-based methods and data-driven-based methods, have been developed for HSI reconstruction to address this problem.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

The model-driven-based reconstruction methods are designed by utilizing prior knowledge of the HSI, such as total variation [19], [20], the sparsity prior [18], the low-rank prior [21], [22], and nonlocal self-similarity [23]. The reconstruction results can be obtained by solving these prior-regularized optimization problems with iterative optimization algorithms. Among these priors, the tensor low-rank prior has been widely utilized for HSI reconstruction to model the high contextual correlation in high-dimensional structures. However, these conventional model-driven-based optimization algorithms suffer from low computational efficiency. In addition, these priors are designed empirically, which relies on empirical knowledge and thus also limits the reconstruction quality. With the success of deep convolutional neural networks in natural image restoration and their powerful learning capabilities [24], [25], [26], deep data-driven-based HSI reconstruction methods have been proposed to directly learn a mapping function from the 2-D compressive measurement to the 3-D HSI and have been shown to provide much better performance. However, they learn a brute-force mapping to reconstruct spectral images and thereby ignore the image structure information. Recently, researchers in the field of deep neural networks have applied image priors [27] (e.g., the sparsity prior) for HSI reconstruction and achieved promising performance. However, they seldom consider the tensor 3-D structure information to fully excavate deep spatial-spectral cubic features, and there is still large room for improvement in the reconstruction. Zhang et al. [28] proposed integrating the deep canonical polyadic (CP) decomposition into an optimization-inspired network and achieved good performance, which validated that exploiting the tensor structure to learn a low-rank prior is a promising direction.
However, the mentioned method has some limitations, i.e., information loss in extraction, neglect of texture details, and insufficient utilization of spectral information.
Therefore, in this article, we propose a hyperspectral compressive image reconstruction method by designing a deep Tucker decomposition module to learn a deep low-rank prior and preserve more cubic structure information, as well as a spatial-spectral correlation learning (SSCL) module to exploit the spatial-spectral correlation for HSI reconstruction. Specifically, our method is inspired by the deep CP decomposition in [28]. We propose a degenerated Tucker decomposition (DTD) module in which a 3-D tensor is decomposed into three factor matrices and a core tensor. The factor matrix on each mode can be regarded as the principal components of the tensor for that mode in the corresponding subnetwork. We learn the tensor low-rank prior of the cubic patches by capturing the primary context information from each mode. The mode product of the factor matrices and the core tensor yields a 3-D feature map, which can be regarded as a global attention map over the three modes. Meanwhile, global hierarchical features are learned adaptively by cascading the features of all Tucker decomposition blocks. Furthermore, to exploit the spatial-spectral correlation in the HSI, a U-net-structured network named the SSCL module is constructed from spatial and spectral attention submodules, feature mappings, and spatial and spectral self-attention [29] at the minimum-scale layer, which can effectively boost the reconstruction performance. We refer to our deep Tucker decomposition and spatial-spectral learning network as DS-Net for short. The main contributions can be summarized as follows.
1) We propose an end-to-end DS-Net for hyperspectral compressive image reconstruction, where a residual deep DTD module is designed to learn the tensor low-rank prior and an SSCL module for learning the deep spatial-spectral correlations of HSI.
2) The residual deep DTD module is designed to characterize the tensor low-rank prior by learning the main context information of the tensor for each mode. The mode product with the core tensor then yields a 3-D feature map, which can be regarded as a global attention map expressing the low-rank property. Finally, a residual connection is used to preserve high-frequency and spectral information.
3) The SSCL module is used to characterize the correlation between the spatial and spectral dimensions of HSI. Using the U-net structure as the backbone, SSCL learns and extracts more diverse multiscale features. Specifically, SSCL learns a more accurate spatial-spectral correlation through the element-wise product of spectral and spatial attention features at different scales, thus improving the reconstruction quality.

The rest of this article is organized as follows. In Section II, we review related works. In Section III, we introduce the CASSI system, and Section IV describes the proposed DS-Net in detail, including the network architecture and network learning. Extensive simulated and real-data experiments and analyses are presented in Section V. Finally, Section VI concludes this article.

II. RELATED WORKS
In this section, we briefly review popular HSI compressive snapshot reconstruction methods, which can be mainly divided into two categories: model-driven-based and data-driven-based reconstruction methods.

A. Conventional Model-Driven-Based Reconstruction Methods
Reconstructing the 3-D HSI from the 2-D compressive measurement is the core of the CASSI system, but the inverse problem is ill-posed. According to the CS theory, the desired image can be reconstructed by solving a convex optimization problem consisting of a data fidelity term and prior regularization terms. Researchers have attempted to represent the intrinsic properties of HSI with various prior regularization terms, such as the total variation prior, the sparsity prior, and the low-rank prior. Bioucas-Dias et al. [19] and Yuan et al. [20] employed the total variation prior to propose a two-step iterative shrinkage/thresholding (TwIST) method and a generalized alternating projection (GAP) method for model optimization, respectively, but the reconstructed results may be oversmoothed, thus losing detailed structures. The sparsity prior maintains the sparse property of HSI in a fixed transform domain or learns an overcomplete dictionary [30] to represent the underlying HSI data cube. Then, low-rank matrix approximation was proposed [31] to exploit the nonlocal correlation, and Liu et al. [21] proposed a method named DeSCI, which integrated the nonlocal self-similarity of HSI and the rank minimization approach with the SCI sensing process. Zhang et al. [32] developed a dimensional-discriminative low-rank tensor recovery model based on a weighted nuclear norm. Generally, those methods expressed the 3-D HSI as either a 1-D vector or a 2-D matrix, which inevitably broke the high-dimensional structure and was not sufficient to fit the data diversity.

B. Deep Data-Driven-Based Reconstruction Methods
In recent years, deep neural networks, with their powerful learning ability, have been applied to and achieved state-of-the-art results for a variety of image vision tasks, including CS reconstruction [33], [34], [35], [36]. Unlike model-driven-based algorithms, data-driven-based methods directly learn a nonlinear mapping by training a network on datasets and capturing the inherent statistical characteristics of images. Xiong et al. [37] proposed HSCNN to upsample the undersampled measurement to the same dimension as the original HSI and then treated the HSI reconstruction as an image enhancement task. The self-attention mechanism [38] is widely used to capture long-range interactions, and many attention modules [39], [40], [41] have shown great potential. For instance, Miao et al. [42] proposed a network named λ-net to reconstruct the HSI through a two-stage procedure, where the HSI was first reconstructed by a generative adversarial network with self-attention and then refined in a second stage for further improvement. Meng et al. [43] designed a spatial-spectral attention (TSA) module in a backbone U-net to exploit the spatial-spectral correlation for reconstruction. Most recently, Zhang et al. [44] proposed a plug-and-play (PnP) method that incorporated deep denoisers as regularization priors into the optimization process. Meanwhile, Meng et al. [45] developed a framework integrating the deep image prior (DIP) into a PnP regime, resulting in a self-supervised network that achieved state-of-the-art results. Wang et al. [46] learned a data-driven regularization prior that exploited spatial and spectral correlations and incorporated the regularizer by unfolding the half-quadratic splitting method. Furthermore, they also developed a deep nonlocal unrolling (DNU) method [47] to further improve the reconstruction quality by combining a spatial attention block with a local sparsity block. Meng et al.
[48] developed a deep unfolding method, named GAP-net, which unfolds the GAP algorithm and uses a deep network to estimate the desired signal at each stage. Later, Zhang et al. [28] learned the tensor low-rank prior through deep CP decomposition to extract contextual correlation and integrated the learned prior into an iterative optimization algorithm to complete the reconstruction. Huang et al. [49] proposed a maximum a posteriori estimation framework that employed a learned Gaussian scale mixture (GSM) model to learn a scale prior and estimated the local means of the GSM via a filter generator. Building on the powerful learning capability of deep neural networks, in this article, we design an SSCL module to learn hierarchical structure features and the spatial-spectral correlation of HSI.
Due to the sparsity and low-rank nature of HSI data, low-rank tensor recovery attempts to estimate the desired tensor under low-rank constraints through various tensor decomposition models. The most commonly used tensor decomposition models are Tucker decomposition, t-SVD decomposition, and CP decomposition [50]. Tucker decomposition represents the input tensor as the product of several factor matrices and a core tensor, where each factor matrix contains the principal components of the corresponding mode. In this article, we first leverage the deep DTD module to learn the low-rank prior and 3-D structure features and then obtain 3-D feature maps through the mode product with the core tensor, which can be regarded as attention maps. Then, a series of decomposition modules is cascaded to obtain multihierarchical features. Finally, an SSCL module is designed to learn multiscale features and feature mappings to characterize the spatial-spectral correlation of HSI, thus further improving the reconstruction performance.

III. CODED APERTURE SNAPSHOT SPECTRAL IMAGING
In the CASSI system, the 3-D hyperspectral data are encoded into a 2-D compressive measurement. As shown in Fig. 1, the incident light from a spectral scene X ∈ R^{H×W×Λ}, where h and w are the spatial indices (1 ≤ h ≤ H, 1 ≤ w ≤ W), H × W is the spatial size, λ is the spectral index (1 ≤ λ ≤ Λ), and Λ is the number of spectral bands, is first spatially modulated by a transmission function T(h, w) created by the coded aperture and then spectrally dispersed at each wavelength with a wavelength-dependent dispersion function Ψ(λ) by a dispersive prism. Finally, the dispersed data are captured by a 2-D imaging sensor, forming a 2-D compressive measurement that mixes the information of all wavelengths. The final snapshot measurement can be represented as

Y(h, w) = Σ_{λ=1}^{Λ} T(h, w − Ψ(λ)) X(h, w − Ψ(λ), λ) + N(h, w)

where N(h, w) denotes the noise at the sensor. In summary, the matrix-vector form of the CASSI imaging process can be formulated as

y = Φx + n

where x ∈ R^N and y ∈ R^M denote the vectorized representations of the underlying 3-D HSI X and the 2-D compressive measurement Y, respectively, N = HWΛ and M = H(W + Λ − 1), Φ ∈ R^{M×N} denotes the measurement matrix of the CASSI system, implemented by the coded aperture and disperser, and n ∈ R^M represents the measurement noise.
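The shift-and-sum forward model above can be sketched in NumPy. This is an illustrative simulation, not the paper's implementation: the function name `cassi_measurement`, the binary mask, and the one-pixel-per-band dispersion are assumptions for the sketch.

```python
import numpy as np

def cassi_measurement(cube, mask, step=1):
    """Simulate a CASSI snapshot: mask-modulate each band, shift it along
    the width axis by a wavelength-dependent offset, and sum all bands.

    cube: (H, W, L) hyperspectral scene; mask: (H, W) coded aperture.
    Returns a 2-D measurement of size (H, W + step * (L - 1)).
    """
    H, W, L = cube.shape
    y = np.zeros((H, W + step * (L - 1)))
    for lam in range(L):
        shift = step * lam                      # dispersion offset of band lam
        y[:, shift:shift + W] += mask * cube[:, :, lam]
    return y

# Tiny example: a 4 x 4 x 3 cube with a random binary mask.
rng = np.random.default_rng(0)
cube = rng.random((4, 4, 3))
mask = rng.integers(0, 2, size=(4, 4)).astype(float)
y = cassi_measurement(cube, mask)
```

Note that the measurement width W + Λ − 1 matches the dimension M = H(W + Λ − 1) in the matrix-vector form above.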

IV. PROPOSED METHOD
In this section, we describe the proposed DS-Net, which reconstructs the underlying HSI X from a measurement Y. A deep DTD module learns contextual information along both the spatial and spectral dimensions to characterize the tensor low-rank prior. Meanwhile, an SSCL module is developed to further extract spatial-spectral correlation information by learning multiscale features and feature mappings. The overview of the proposed DS-Net is shown in Fig. 2, which contains the following four parts: a feature encoding part, a residual degenerated Tucker decomposition (RDTD) module, an SSCL module, and a feature decoding part.
First, we employ several convolution (Conv) layers to encode the image features of the input HSI X_0. In the feature encoding part, we first use a 3 × 3 Conv to adjust the number of input channels to 64 to increase the spectral redundancy of the intermediate embedding, followed by a Conv-rectified linear unit (ReLU)-Conv block to generate the features of the input images

X_1 = f_{3×3}(ReLU(f_{3×3}(X_0)))

where X_0 is the input feature, X_1 is the encoded feature, and f_{3×3}(·) denotes the 3 × 3 convolution layer. Since rich redundancy and high contextual correlation exist in the image features, we employ a cascade of four RDTD modules to extract different hierarchical features and contextual information, where each module contains a residual block (RB) and a DTD module.
The RBs extract the image features before the coded image features are delivered into the proposed DTD module to learn 3-D low-rank features. After extracting hierarchical features with the RDTD modules, we conduct feature fusion, including multihierarchical feature fusion and a residual connection. Feature fusion makes full use of the features from the previous layers, and the residual connection preserves the network's stable and high-level hierarchical features. After extracting and fusing the high-frequency structure features, we turn to the high spatial-spectral correlation. Here, we employ a U-net structure with a spectral attention module in the encoder and a spatial attention module in the decoder to effectively learn discriminative features in both the spatial and spectral dimensions for better reconstruction performance. The network ends with a Conv-ReLU layer that adjusts the number of channels and decodes the refined features into the reconstructed HSI.

A. RDTD Module
RBs have performed well in image feature representation, so we first employ an RB to enhance the effectiveness of feature learning in the DTD module

X^{res}_{RDTD} = f_{RB}(X^{in}_{RDTD})

where f_{RB}(·) denotes the RB and X^{in}_{RDTD} is the input feature of the RDTD module. Specifically, we choose a DTD to learn the tensor low-rankness, and the Tucker decomposition is given as follows.
An N-dimensional tensor X ∈ R^{I_1×I_2×···×I_N} can be decomposed into the product of a core tensor and a series of factor matrices, written as

X = G ×_1 A^{(1)} ×_2 A^{(2)} ··· ×_N A^{(N)}

where A^{(n)} ∈ R^{I_n×r_n} is a factor matrix whose columns can be characterized as principal components of the nth mode, r_n denotes the number of components in each factor matrix, G ∈ R^{r_1×r_2×···×r_N} is the core tensor, and ×_n denotes the mode-n product. A 3-D tensor Tucker decomposition model is illustrated in Fig. 3. As mentioned above, Tucker decomposition can be regarded as a high-dimensional extension of image principal component analysis, where the factor matrices contain the distinctive structures of the different modes. Therefore, Tucker decomposition can effectively characterize discriminative information in the image features. Meanwhile, the key to Tucker decomposition is learning the factor matrices and the core tensor; for brevity, we fix the core tensor as an identity tensor, so the factor matrix of each mode has the same rank. Inspired by the convolutional block attention module, we design a block that captures the important information of each mode to generate the factor matrix, as shown in Fig. 4. In detail, given the input cubic features, we first employ a Conv-ReLU-Conv block to learn highly representative features from the input features of the previous block and then apply average pooling and max pooling to characterize the principal information and global context representation. Finally, a Conv-Sigmoid block generates a nonlinear projection of the pooling results. The formulas are as follows:

X^{lf}_{RDTD} = f_{3×3}(ReLU(f_{3×3}(X^{in}_{RDTD})))

A^{(i)} = Sigmoid(f_{1×1}([p_{MP}(X^{lf}_{RDTD}), p_{AP}(X^{lf}_{RDTD})])), i = 1, 2, 3

where X^{in}_{RDTD} and X^{lf}_{RDTD} denote the input feature and local feature of the RDTD module, respectively, A^{(i)} (i = 1, 2, 3) denotes the factor matrix of mode i, f_{1×1}(·) is the 1 × 1 convolution function, p_{MP}(·) and p_{AP}(·) denote max pooling and average pooling, and [·] denotes the concatenation operation.
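The mode product and Tucker reconstruction above can be sketched in NumPy. The helper names `mode_product` and `tucker_reconstruct` are illustrative; a generic random core tensor is used rather than the identity core the DTD module fixes.

```python
import numpy as np

def mode_product(tensor, matrix, mode):
    """Mode-n product: contract dimension `mode` of `tensor` with the
    second axis of `matrix` (shape (J, I_mode)) and put the new axis back."""
    t = np.moveaxis(tensor, mode, 0)            # bring mode to the front
    out = np.tensordot(matrix, t, axes=(1, 0))  # (J, ...)
    return np.moveaxis(out, 0, mode)

def tucker_reconstruct(core, factors):
    """X = G x_1 A^(1) x_2 A^(2) ... x_N A^(N)."""
    x = core
    for n, a in enumerate(factors):
        x = mode_product(x, a, n)
    return x

rng = np.random.default_rng(0)
core = rng.random((2, 2, 2))                    # core tensor G (rank 2 per mode)
factors = [rng.random((4, 2)), rng.random((5, 2)), rng.random((6, 2))]
x = tucker_reconstruct(core, factors)           # full tensor of shape (4, 5, 6)
```

With a rank of 2 per mode, the (4, 5, 6) tensor is represented by far fewer parameters than its 120 entries, which is the low-rank property the DTD module exploits.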
All factor matrices are generated using independent convolution kernels; each of them learns contextual information from a different mode and is output as contextual features. Owing to the nonlinearity of the factor matrix, each element can be considered as the weight of a certain kind of contextual information, which also satisfies the definition of the structure. The context features should not be linearly related, so that each of them can represent different information. Finally, normalization is applied to generate the final factor matrix for each mode. That is, we normalize the generated matrices A^{(i)} = {a^{(i,1)}, . . . , a^{(i,J_n)}}, a^{(i,j)} ∈ R^{I_n}, to keep the properties and semantic meanings of each basis

â^{(i,j)} = a^{(i,j)} / (‖a^{(i,j)}‖_2 + ε)

where ε is set to 10^{−6} to avoid the denominator being zero. Unlike previous work, we simultaneously collect the contextual distribution of each mode. Then, we employ the mode product of the factor matrices and the core tensor to generate the low-rank features X_LR, which can be regarded as the feature response in both the spatial and spectral dimensions. Furthermore, a Conv layer is applied to adaptively learn the weights along the different dimensions. The low-rank features contain rich contextual information; they can be regarded as a 3-D attention map modeling the global correlation across different dimensions, and they convey the contextual information in the spatial and spectral dimensions jointly.
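The basis normalization above can be sketched as follows; this assumes per-column l2 normalization with ε = 10⁻⁶, matching the description, and the function name `normalize_factors` is illustrative.

```python
import numpy as np

def normalize_factors(a, eps=1e-6):
    """L2-normalize each column (basis vector) of a factor matrix,
    adding eps to the denominator so a zero column cannot divide by zero."""
    norms = np.linalg.norm(a, axis=0, keepdims=True)  # (1, J) column norms
    return a / (norms + eps)

# Example: the first column has norm 5; the second is all zeros.
a = np.array([[3.0, 0.0],
              [4.0, 0.0]])
a_hat = normalize_factors(a)
```

After normalization each nonzero basis has (near) unit length, so the magnitude of an element reflects its relative weight within the basis rather than the basis scale.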
Here, we employ the Hadamard (element-wise) product between the low-rank feature maps and the image features to obtain deeper image features. Finally, we utilize a skip connection to obtain the final components

X_{RDTD} = X^{in}_{RDTD} + X_{LR} ⊙ X^{res}_{RDTD}

where ⊙ denotes the Hadamard product. The skip connection can faithfully transfer valuable information and reconstruct finer-grained structures, thus avoiding overfitting and facilitating back-propagation.
After extracting contextual features at different levels with a set of RDTD modules, we further employ feature fusion, concatenating the cascaded contextual features of all RDTD modules to extract global features. Feature fusion consists of global feature fusion and a residual connection, which takes full advantage of the features of all previous layers and also preserves the flow of information. Specifically, we concatenate the feature maps produced by the cascaded RDTD modules to extract the global features, followed by a 1 × 1 Conv to adaptively adjust the features at different levels

X_{GF} = f_{1×1}([X^{(1)}_{RDTD}, . . . , X^{(n)}_{RDTD}])

where X_{GF} is the global fusion feature and X^{(i)}_{RDTD} (i = 1, 2, . . . , n) denotes the ith RDTD feature. Then, a 3 × 3 Conv further extracts features to facilitate the residual connection. We then employ a residual connection between the shallow features and the fusion features to reconstruct finer-grained spatial structures and spectral characteristics in the latent HSI. Finally, the output features X_2 are obtained via a 3 × 3 Conv

X_{FF} = X_1 + f_{3×3}(X_{GF}), X_2 = f_{3×3}(X_{FF})

where X_{FF} is the fused feature. The output features X_2 will be further fed into the SSCL module to boost the reconstruction performance.

B. SSCL Module
Spatial-spectral correlation is an inherent characteristic of hyperspectral data and exhibits a multiscale structure. The U-net performs well in reconstruction tasks owing to its excellent learning and multiscale representation capabilities. Therefore, we propose an SSCL module that uses a U-net structure to extract multiscale characteristics and learn 3-D feature mappings at different scales to represent the HSI more effectively.
As shown in Fig. 5, the initial input of the module is the previously fused feature, and the proposed SSCL module performs 3-D attention feature mapping and feature representation at three scales. The encoding unit is built from an RB and a spectral attention module. We employ the attention module to extract abstract features, pay more attention to the relations along the spectral dimension, and adaptively emphasize useful features. Meanwhile, we employ the RB to increase the flow of information from the previously fused feature and contribute to the prediction of the result

X^{res}_{RSE} = f_{RB}(X_{2+i})

where X_{2+i} denotes the input feature at the ith scale and X^{res}_{RSE} denotes the corresponding RB output feature. As shown in Fig. 6, in the spectral attention module, we first employ a Conv-ReLU-Conv block to extract local spatial features

X^{lf}_{SE} = f_{3×3}(ReLU(f_{3×3}(X_{2+i})))

where X^{lf}_{SE} is the local feature of the spectral attention module. Then, a global max pooling (GMP) p_{GMP} and a global average pooling (GAP) p_{GAP} are utilized to aggregate the information in the spatial dimension of the features, and the results are fed into two fully connected layers f_{FC} that share parameters and use ReLU as the intermediate activation function

X^{gm}_{SE} = f_{FC}(p_{GMP}(X^{lf}_{SE})), X^{ga}_{SE} = f_{FC}(p_{GAP}(X^{lf}_{SE}))

where X^{gm}_{SE} and X^{ga}_{SE} denote the corresponding 1-D features. Then, we add the two output 1-D features and use a Sigmoid function to obtain the weight of each input feature layer, which can be considered as the response values in the spectral dimension. After obtaining the weights, we multiply them with the input features to obtain the new features X_{RSE}

X_{RSE} = X_{2+i} ⊙ Sigmoid(X^{gm}_{SE} + X^{ga}_{SE}).

We can obtain the corresponding features X^k_{RSE} (k = 1, 2, 3) at the different scales. In this way, we improve the network's ability for discriminative learning and make it more aware of relevant information and crucial features. Then, we employ the residual learning strategy to achieve fast and stable training and pass the low-frequency features to the end.
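The spectral attention computation above can be sketched in NumPy. This is a minimal sketch under stated assumptions: the shared two-layer MLP weights `w1`/`w2` stand in for the two fully connected layers, and the Conv-ReLU-Conv local-feature step is omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spectral_attention(x, w1, w2):
    """Channel (spectral) attention: GMP and GAP over the spatial dims,
    a shared two-layer MLP (w1, w2) with ReLU in between, summed and
    squashed by a sigmoid into one weight per band.

    x: (C, H, W) feature map; w1: (C//r, C); w2: (C, C//r).
    """
    gmp = x.max(axis=(1, 2))                   # (C,) global max pooling
    gap = x.mean(axis=(1, 2))                  # (C,) global average pooling
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0)  # shared FC-ReLU-FC
    weights = sigmoid(mlp(gmp) + mlp(gap))     # (C,) per-band responses
    return x * weights[:, None, None]          # reweight each band

rng = np.random.default_rng(0)
x = rng.random((8, 6, 6))                      # 8 bands, 6 x 6 spatial
w1 = rng.standard_normal((2, 8)) * 0.1         # reduction ratio r = 4
w2 = rng.standard_normal((8, 2)) * 0.1
out = spectral_attention(x, w1, w2)
```

Because the sigmoid maps every weight into (0, 1), each band of the output is a softly scaled copy of the input band.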
Specifically, we use a 3 × 3 Conv with stride 2 (i.e., f^{s=2}_{3×3}) instead of a pooling function as a learnable downsampling operation to compute the downsampled features X_j (j = 3, 4)

X_j = f^{s=2}_{3×3}(X^{j−2}_{RSE}), j = 3, 4.

In the minimal-scale layer of the module, the input feature is X^3_{RSE}, and we capture long-range dependencies by concatenating a spatial self-attention f_{SAA} and a spectral self-attention f_{SEA} as a global encoding method, which merges the long-range features into X_5

X_5 = [f_{SAA}(X^3_{RSE}), f_{SEA}(X^3_{RSE})].

We then employ this method to further excavate deep spectral correlation information and spatial context for the subsequent decoding part.
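The long-range encoding above relies on standard scaled dot-product self-attention. A minimal NumPy sketch follows; the projection matrices and single-head form are illustrative assumptions, and the same routine serves as spatial self-attention (tokens = pixel vectors) or spectral self-attention (tokens = flattened band maps).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Scaled dot-product self-attention over a set of tokens.
    x: (N, D) tokens; wq, wk, wv: (D, D) projection matrices."""
    q, k, v = x @ wq, x @ wk, x @ wv
    attn = softmax(q @ k.T / np.sqrt(x.shape[1]))  # (N, N) long-range weights
    return attn @ v                                # weighted mix of all tokens

rng = np.random.default_rng(0)
tokens = rng.random((16, 8))                   # e.g., a 4 x 4 spatial grid, 8 bands
wq, wk, wv = (rng.standard_normal((8, 8)) * 0.1 for _ in range(3))
out = self_attention(tokens, wq, wk, wv)
```

Every output token is a convex combination of the projected input tokens, which is what lets the minimal-scale layer relate distant pixels or bands in one step.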
In the decoding part, we first utilize bilinear interpolation f_{BLI} rather than transpose convolution to upsample the previous features by a factor of two, that is,

X^u_{up} = f_{BLI}(X_{4+u}), u = 1, 2

where X^u_{up} (u = 1, 2) are the upsampled features. The decoding unit is constructed from a spatial attention part and an RB. The spatial attention module shown in Fig. 6 generates an attention map focused on spatial context extraction to provide global spatial information for subsequent processing, thus helping to reconstruct the spatially coherent structures of the HSI. The spatial attention first employs both max pooling p_{MP} and average pooling p_{AP} along the spectral dimension to aggregate the spectral information of each feature. Different from previous designs, we further concatenate the pooled maps and employ a Conv layer with a 3 × 3 kernel and a Sigmoid activation function to finally obtain a 2-D feature weight coefficient X^{Wk}_{SA} (k = 1, 2) as follows:

X^{Wk}_{SA} = Sigmoid(f_{3×3}([p_{MP}(X^u_{up}), p_{AP}(X^u_{up})])).

Then, we multiply it with the input features, which directs the model's attention to regions of interest. Finally, we add the input features to obtain the output features X^k_{SA} (k = 1, 2), passing the low-frequency information to the end

X^k_{SA} = X^u_{up} + X^u_{up} ⊙ X^{Wk}_{SA}.

Specifically, we generate the features through a Sigmoid function as 3-D attention feature maps. After extracting the spatial attention feature maps, we employ a Conv to compute features X^k_{SE} (k = 1, 2) from the same-scale encoding part to adjust the features adaptively and then carry out the Hadamard product with the 3-D spatial attention feature maps to obtain the final output features X^k. Correspondingly, we obtain the enhanced attention features and the spatial-spectral correlation through the Hadamard product with the 3-D attention feature maps.
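The spatial attention weight map above can be sketched in NumPy. This is a minimal sketch, assuming a single 3 × 3 kernel over the two pooled maps with zero padding; the function name `spatial_attention` and the explicit convolution loop are illustrative.

```python
import numpy as np

def spatial_attention(x, kernel):
    """Spatial attention: max- and average-pool along the spectral axis,
    stack the two 2-D maps, convolve with a 3x3 kernel (2 input maps,
    1 output map), and squash with a sigmoid into an (H, W) weight map.

    x: (C, H, W) feature cube; kernel: (2, 3, 3)."""
    pooled = np.stack([x.max(axis=0), x.mean(axis=0)])  # (2, H, W)
    pad = np.pad(pooled, ((0, 0), (1, 1), (1, 1)))      # zero-pad spatially
    H, W = x.shape[1:]
    logits = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            logits[i, j] = np.sum(pad[:, i:i + 3, j:j + 3] * kernel)
    weights = 1.0 / (1.0 + np.exp(-logits))             # (H, W), values in (0, 1)
    return x * weights                                   # broadcast over all bands

rng = np.random.default_rng(0)
x = rng.random((8, 5, 5))
kernel = rng.standard_normal((2, 3, 3)) * 0.1
out = spatial_attention(x, kernel)
```

The single (H, W) weight map is shared by every band, so the module highlights spatial regions rather than individual spectral channels.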
Furthermore, we concatenate the enhanced features and the spatial features to better fuse the spatial and spectral information at different scales and then use an RB to further boost the generated results X_j (j = 6, 7). Finally, we employ a Conv-ReLU layer to adjust the number of final output channels to match the number of input bands and decode the refined features to obtain the output HSI.

Algorithm 1: Main Steps of DS-Net for HSI Compressive Reconstruction
5: Obtain residual features X^res_n from X^1_n.
6: Extract the principal information matrix of each mode from X^res_n and obtain the 3-D attention maps X^att_n by the mode product with each matrix.
7: Conduct the Hadamard product of X^att_n with X^res_n to obtain X^{RDTD_k}_n and deliver it into the next RDTD module.
8: end for
9: Concatenate all X^{RDTD_k}_n and employ a Conv to fuse all features.
10: Use a residual connection between X^1_n and the fused features to obtain X^2_n and input it into the SSCL module.
11: In the SSCL module, the forward features are processed by the residual block and the spectral attention module to obtain X^{SE_k}_n (k = 1, 2, 3), and the features are then downsampled.
12: Conduct spatial self-attention and spectral self-attention on X^{SE_3}_n and fuse them to obtain X^5_n.
13: Upsample X^5_n and employ the spatial attention module to obtain X^{SA_k} (k = 2, 1); after a Sigmoid product with X^{SE_j}_n and a residual block, obtain X^j_n (j = 6, 7).
14: Output X^out_n through the decoding part and obtain the refined image.
15: end for
16: Use the test set with the trained model to get the predicted data.

C. Network Training
We learn the network parameters Θ of our HSI reconstruction model with an end-to-end training strategy. All parameters are optimized by minimizing an l_1-based loss function, which can be represented as

Θ̂ = arg min_Θ (1/C) Σ_{c=1}^{C} ‖F(y_c, Θ) − x_c‖_1

where Θ is the set of network parameters, C denotes the total number of training samples, F(y_c, Θ) represents the output of the proposed network, and x_c is the ground-truth HSI. In our work, we employ PyTorch as the framework and use the Adam optimizer [51] with β_1 = 0.9, β_2 = 0.999, and ε = 10^{−8} to train the proposed network. The parameters of the convolution layers are initialized by Xavier initialization [52]. The learning rate is set to 1.28 × 10^{−4} and decays by a factor of 0.95 every ten epochs. The proposed network is executed on an NVIDIA 3060Ti GPU, and its main steps for HSI compressive reconstruction are summarized in Algorithm 1.
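The training objective and schedule above can be sketched as two small helpers; the function names are illustrative, and the loss is shown on NumPy arrays rather than the actual PyTorch tensors.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute error between predictions and ground truth,
    i.e., the l1-based training objective averaged over all elements."""
    return np.mean(np.abs(pred - target))

def learning_rate(epoch, base_lr=1.28e-4, gamma=0.95, step=10):
    """Base learning rate decayed by a factor of 0.95 every ten epochs,
    matching the schedule described in the text."""
    return base_lr * gamma ** (epoch // step)
```

In PyTorch this schedule would correspond to a step decay applied to the Adam optimizer every ten epochs.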

V. EXPERIMENTS

A. Experimental Setup
We conduct simulation experiments on two public hyperspectral datasets, CAVE [53] and KAIST [54], to demonstrate the effectiveness of the proposed DS-Net. The CAVE dataset consists of 32 HSIs with a spatial size of 512 × 512 and 31 spectral bands. The KAIST dataset has 30 HSIs with a spatial size of 2704 × 3367 and 31 spectral bands. Following the settings of TSA-Net [43] and DGSM [49], we employ a real mask of size 256 × 256 for simulation, and the CAVE dataset is used for network training. To match the wavelengths of the real system [43], the training and test data are modified by spectral interpolation to 28 spectral bands ranging from 450 to 650 nm. For the test set, ten different scenes with a spatial size of 256 × 256 from the KAIST dataset are used for comparison with the other reconstruction methods.
During training, we randomly extract 96 × 96 × 28 patches from the training dataset as training labels; to make the training as robust as possible, random flipping and rotation are both used. We then randomly extract 96 × 96 patches from the real mask to generate the simulated data. Meanwhile, the simulated data are spatially shifted at a step of two pixels, and the spectral dimension of the shifted data is summed up to generate 2-D measurements of size 96 × 150. The peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [55] are employed to evaluate the HSI reconstruction quality. PSNR reflects the spectral reflectance accuracy, and SSIM emphasizes the reconstructed spatial structures.
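The measurement size above follows directly from the two-pixel dispersion step, and PSNR has a closed form; both can be checked with a short sketch (function names are illustrative, and PSNR is shown for signals scaled to [0, 1]).

```python
import numpy as np

def measurement_width(w, bands, step):
    """Width of the shift-and-sum measurement: each of the `bands` bands
    is shifted `step` pixels further along the width than the previous one."""
    return w + step * (bands - 1)

def psnr(ref, rec, peak=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, peak]."""
    mse = np.mean((ref - rec) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

# 96 x 96 x 28 patches with a two-pixel step: 96 + 2 * (28 - 1) = 150 columns,
# giving the 96 x 150 measurements described above.
```

A uniform error of 0.1 on a [0, 1] image, for instance, corresponds to an MSE of 0.01 and hence a PSNR of 20 dB.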

B. Comparison With State-of-the-Art Methods
We evaluate the reconstruction performance of DS-Net against several state-of-the-art reconstruction methods, including model-driven methods, i.e., TwIST [19], GAP-TV [20], and DeSCI [21], and data-driven methods, i.e., λ-net [42], HSSP [46], TSA-Net [43], PnP-DIP [45], and DGSM [49]. We use the source codes released by their authors to reproduce the experimental results. Table I shows the reconstruction results of these methods on the ten scenes from KAIST, where we can see that the deep data-driven methods outperform the model-driven methods. In addition, the proposed method ranks first on almost every testing sample in terms of PSNR and SSIM and largely surpasses the other deep data-driven methods. Specifically, our method outperforms the second-best method, DGSM, by 1.24 dB in average PSNR and 0.0175 in average SSIM. Compared with the self-attention method TSA-Net, the improvement by the proposed method is 1.78 dB on average; compared with PnP-DIP and HSSP, the improvements are 1.90 dB and 3.15 dB on average, respectively. DGSM learns the spatial-spectral prior with spatially adaptive GSM models but pays less attention to the image structure, while HSSP and TSA-Net also try to learn the spatial-spectral correlations of HSI but without emphasizing image edges and textures. In contrast, our DS-Net extracts the low-rank prior of the HSI with the DTD module and learns the global spatial-spectral correlation of HSIs with the SSCL module.

Fig. 7 plots the spectral curves of the reconstructed HSI at the specified positions; moreover, the correlation coefficients between the reconstructed spectral signatures and the ground truths are shown in the legends. We also visualize the reconstructed results for the six deep data-driven methods and the model-driven methods and display 4 out of the 28 spectral channels of the reconstruction results.
To clearly compare the details of the reconstructed images, we also provide zoomed-in views of selected image areas marked by rectangles. It can be seen that the spatial details reconstructed by the deep data-driven methods are better than those of the model-driven algorithms, whose results suffer from blurry artifacts produced by the coded measurements, which are caused by the disperser in the hardware system. Moreover, from the four chosen spectral channels, we can observe that the HSI reconstructed by DS-Net maintains more details and fewer undesirable visual artifacts than the other methods, which demonstrates the capability of DS-Net to utilize the inherent characteristics of HSI. Similarly, as shown in another example in Fig. 8, the reconstructed results of our DS-Net outperform those of all the other algorithms in terms of high-frequency image detail recovery and spectral recovery consistency.
We further conduct experiments on the same training and testing datasets with the same mask setting as DLTR [28] and compare DS-Net with the DLTR [28] and DNU [47] methods. Table II shows the average PSNR and SSIM results of these three methods on the ten scenes. Compared with the results in Table I, although DS-Net shows a decrease in PSNR, it still surpasses the two competing methods by a large margin; for example, it surpasses the second-place method by 2.15 dB in PSNR and 0.042 in SSIM.

C. Ablation Study
We conduct several ablation studies to verify the impact of each component of DS-Net, including the influence of the different modules, the number of RDTD blocks, and the rank of the HSI tensor. We first conduct the module influence experiments, in which we consider the RDTD module, the residual connection, and the SSCL module. Table III reports the experimental results for four combinations of these modules. It can be seen that removing any module causes a degradation of performance, and the combination of the RDTD and SSCL modules achieves the best reconstruction quality. Specifically, comparing case 1 and case 4 shows that the SSCL module has a great effect on the experimental results: the PSNR increases by 2.53 dB and the SSIM by 0.0546. Comparing case 2 and case 3 shows that the RDTD module also has a positive effect: the PSNR increases by 2 dB and the SSIM by 0.0368. In summary, each module contributes to the improvement of the final results of the network. Fig. 9 shows the PSNR and SSIM results as a function of the rank value; as the rank increases, the reconstruction quality first improves and then decreases slightly. The improvement flattens out after the rank reaches 4, and thus we set the rank to 4 in our implementation. The results for different numbers of blocks are also shown in Fig. 9, from which we observe that increasing the number of blocks leads to better performance; still, as the number continues to increase, the performance becomes slightly worse and the computation time grows. Notably, compared with Table I, even when the number of blocks is set to 1, the performance improves significantly, which further demonstrates the effectiveness of our module in improving the results.
Finally, we set the number of blocks to 4 in our implementation to achieve a good tradeoff between reconstruction performance and computational complexity.

D. Real Data Results
Real HSI reconstruction is considered more challenging than that of simulation data. In addition to the achievements on the synthetic dataset, we also validate the effectiveness of DS-Net on HSIs collected from real scenes [43], which are captured with 28 wavelengths ranging from 450 to 650 nm and a 54-pixel dispersion in the column dimension; thus, the measurements captured by the real system have a spatial size of 660 × 714. Meanwhile, the captured measurements unavoidably contain noise as well as more details. Therefore, similar to DGSM [49], we expanded the previous training set by adding 30 HSI images from the KAIST dataset and retrained DS-Net. In addition, to simulate the real measurements, we injected 11-bit shot noise during training. Fig. 10 shows an example of leaf (top) and plant (bottom) scenes on three spectral channels, comparing our method with three reconstruction methods, DeSCI [21], TSA-Net [43], and DGSM [49]. As can be seen, the reconstructed results of our method contain fine texture details, while the output of DeSCI is relatively blurry, some regions lack detailed texture, and the result of TSA-Net is totally distorted. In Fig. 11, we provide complete reconstruction results on a real-captured scene, lego flower. As can be seen, different colored regions are emphasized at distinct wavelengths accordingly, which indicates that DS-Net has a significant effect on spectral response.
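The 11-bit shot-noise injection mentioned above can be approximated as in the sketch below. The exact noise model is not specified here, so this Poisson sampling over 2^11 intensity levels, and the helper name, are our assumptions.

```python
import numpy as np

def inject_shot_noise(meas, bit_depth=11, rng=None):
    """Approximate shot noise by Poisson sampling at 2**bit_depth levels.
    NOTE: this noise model is an assumption, not the authors' exact recipe."""
    rng = rng or np.random.default_rng()
    peak = 2 ** bit_depth - 1
    scale = meas.max() or 1.0                    # avoid divide-by-zero
    counts = rng.poisson(meas / scale * peak)    # photon-count noise
    return counts.astype(np.float64) / peak * scale
```

Because the Poisson variance equals its mean, brighter regions receive proportionally stronger noise, which mimics the signal-dependent noise of a real sensor.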

E. Model Size
We further compare the model size (number of trainable parameters) and floating-point operations (FLOPs) of DS-Net and three deep data-driven methods for hyperspectral compressive image reconstruction. As compared in Table I, the deep data-driven methods achieve promising performance in HSI compressive reconstruction; however, one of their problems is the model size, which limits their deployment in real applications. As shown in Table IV, DS-Net uses less than 50% of the parameters and 12% of the FLOPs of DGSM while achieving a 1.24 dB gain in average PSNR. Comparing DS-Net with TSA-Net and λ-net, although there is no significant reduction in FLOPs, DS-Net has only 4% and 2.8% of their parameters, respectively, and delivers better reconstruction performance.
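Trainable-parameter counts such as those in Table IV can be obtained in PyTorch with a short helper; the function name is ours.

```python
import torch.nn as nn

def count_trainable_params(model: nn.Module) -> int:
    """Number of trainable parameters, i.e., the model-size figure."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```

For instance, a single 3 × 3 convolution mapping 28 input bands to 64 channels contributes 64 × 28 × 3 × 3 weights plus 64 biases, i.e., 16,192 parameters.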

F. Time Efficiency
We also analyze the time complexity of DS-Net and the other three data-driven methods, i.e., λ-Net, TSA-Net, and DGSM. Table V shows the average running time of each method to reconstruct the ten 256 × 256 HSIs of the KAIST dataset. These methods are all implemented in Python and run on a GPU. According to Table V, λ-Net is relatively fast, but its reconstruction performance is unsatisfactory. DGSM performs well but has the highest computational complexity; compared with DGSM, the proposed DS-Net has a shorter running time and better reconstruction performance. The running time of TSA-Net is optimal, but its reconstruction quality is worse than that of the proposed DS-Net.
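A fair GPU timing of this kind must account for asynchronous CUDA execution; a minimal sketch of how such average running times could be measured is shown below (the helper name is ours, and the authors' exact benchmarking script may differ).

```python
import time
import torch

@torch.no_grad()
def average_runtime(model, measurements):
    """Average per-image reconstruction time in seconds; synchronizes
    on GPU so asynchronous CUDA kernels are included in the timing."""
    model.eval()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for y in measurements:
        model(y)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / len(measurements)
```

In practice, a few warm-up forward passes are usually run before timing so that one-time CUDA initialization costs are excluded.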

VI. CONCLUSION
In this article, we proposed DS-Net for compressive image reconstruction of HSIs. Inspired by tensor decomposition, we proposed a deep DTD module to learn the tensor low-rank prior of the cubic patches by capturing the main context information from each mode of the tensor. Then, we conducted the mode product of the factor matrices and the core tensor as a 3-D attention map to extract the global information. A series of deep RDTD modules was cascaded to learn multihierarchical features, and a residual connection was used to protect high-frequency and spectral information. Furthermore, we proposed an SSCL module to jointly learn the spatial and spectral correlation at multiple scales, which was embedded into a U-net architecture to complete the reconstruction and improve the reconstruction quality. The experimental results on both synthetic and real datasets validated that the proposed method achieves superior reconstruction results and outperforms several existing state-of-the-art algorithms.