Meta-TR: Meta-Attention Spatial Compressive Imaging Network With Swin Transformer

As a flourishing research topic in the field of remote sensing, spatial compressive imaging (SCI) can utilize prior knowledge to recover high-dimensional signals from low-resolution measurements through joint sampling and compression, thus reducing the bandwidth required for information transmission. However, most existing deep-learning-based SCI methods cannot effectively utilize prior information and have difficulty performing deep extraction of image features, so their reconstruction quality is unsatisfactory at low sampling ratios. To address these difficulties, we propose an SCI network based on meta-attention (MA) and the swin transformer, named Meta-TR. We adopt the swin transformer as the network backbone and, through the extensive use of self-attention mechanisms, achieve deeper extraction of image features, thereby improving the reconstruction quality at low sampling ratios. In addition, we design an MA module, which adopts the Squeeze-Excitation architecture to convert the metadata of the SCI image degradation process into attention vectors. The attention vectors are then used for channel modulation of the network feature maps to guide network training. Extensive experiments on different benchmark remote sensing datasets and at different sampling ratios confirm the superiority of the proposed Meta-TR method.


I. INTRODUCTION
COMPRESSIVE sensing is an epoch-making technology in the field of signal transmission, which can recover the original signal at a lower sampling ratio than Nyquist sampling [1]. Spatial compressive imaging (SCI), as an application of compressed sensing (CS) theory to image spatial compression, aims to reconstruct high-resolution (HR) images from low-resolution (LR) measurements by employing prior information [2]. With SCI algorithms, more signal information can be recovered using low-cost hardware, which reduces the requirements on the sensor and on the data transmission bandwidth. Therefore, SCI has been favored in IR imaging [3], MRI [4], radar imaging [5], and other application fields [6]-[8]. With the emergence of extensive remote sensing tasks in recent years, such as resource exploration, climate monitoring, and environmental protection, the availability of remote sensing data has also increased. However, the explosive growth of HR remote sensing data has put great pressure on data compression and reconstruction. For this reason, some super-resolution (SR) methods, which benefit from the mapping from a low-dimensional space to a high-dimensional one, have been applied in the field of remote sensing. Molini et al. [9] proposed DeepSUM, which uses a self-registration method to achieve LR-to-HR reconstruction. Salvetti et al. [10] designed a lightweight SR method with 3-D convolution and an attention mechanism. Hang et al. [11] designed an SR method using the internal correlation and projection properties of hyperspectral images. Compared with SR, SCI has some advantages in image compression and reconstruction, mainly because the sensing matrix is applied in the reconstruction process, which enables the compression and reconstruction of sparse signals at a sampling ratio far below that required by Nyquist sampling.
Therefore, SCI algorithms can effectively relieve the data transmission pressure of remote sensing systems and contribute to the development of HR earth observation applications [12], [13]. Mallat and Zhang [14] first proposed the use of a redundant dictionary to represent sparse signals and perform reconstruction. Orthogonal matching pursuit, which solves the sparse approximation problem on redundant dictionaries, can be used to reconstruct an object at a faster speed [15]. In addition, the works in [16] and [17] use nonconvex sparse regularization methods to calculate globally optimal solutions. In [18], the rank residual minimization algorithm is used to recover the original signal by exploiting the nonlocal self-similarity prior and the low-rank characteristics of a signal. Although high-quality reconstructions can be obtained, a main drawback of these methods is their long running time, which is due to their iterative calculations. In addition, the reconstruction quality degrades rapidly as the sampling ratio decreases, which also limits their application.
To address the above issues, scholars have turned to the deep learning methods used for vision tasks [19], [20]. In [21], a convolutional neural network (CNN) is used for SCI, and the reconstructions are applied to target tracking to prove that sufficient semantic information is maintained after compression and reconstruction. Some networks [22]-[25] are specifically designed to be hardware friendly and to have low storage requirements by jointly optimizing compression and reconstruction during training. In the deep residual reconstruction network (DR2-Net) [26], the time complexity of the network is greatly reduced by using multiple residual blocks, while the reconstruction quality is improved. Although the neural networks discussed above achieve better reconstruction quality than traditional algorithms, they are hard to interpret and rely too heavily on the dataset while ignoring the imaging process. Thus, some works, such as the iterative shrinkage-thresholding algorithm network (ISTA-Net) [27], combine traditional methods with neural networks by replacing the linear or nonlinear steps in each iteration of a traditional method with designed CNN units. Cui et al. and Sun et al. use the nonlocal self-similarity prior in the measurement domain and the multiscale feature domain, respectively, to find similar vectors in a size-limited vector space, fill in each other's missing information, and reconstruct the original image [28], [29]. In [30], the rank residual minimization algorithm is combined with deep network units to obtain highly competitive reconstruction results.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
However, the above networks still have some problems. First, for SCI in the remote sensing field, most reconstruction networks use basic CNN units and residual connections, which limits the deep extraction of global image features. At low sampling ratios, a traditional CNN cannot achieve satisfactory results in some visual tasks because of its limited representation ability. Accordingly, exploring a network backbone with stronger and deeper extraction capabilities is key to the progress of SCI networks. Second, these previous networks do not make effective use of prior information (such as the sensing matrix in SCI), so the training process depends too heavily on the dataset, which results in problems such as overfitting and poor transferability. Therefore, how to exploit the metadata of the image degradation process is also vital to SCI reconstruction.
To deal with the above issues, we study an end-to-end SCI network based on meta-attention (MA) and the swin transformer, named Meta-TR. Compared with previous SCI networks, the proposed Meta-TR can calculate the internal autocorrelation of the input measurement frames through the self-attention mechanism [31]. In this way, the network is able to mine deeper image information for a better reconstruction, which gives it clear superiority at low sampling ratios. In addition, we design an MA module, which uses the Squeeze-Excitation network (SeNet) [32] to convert the image degradation metadata (the sensing matrix in SCI) into attention vectors, which are used to modulate the channels in each feature extraction module of the network. In this way, Meta-TR can make full use of the image degradation metadata to guide network training, and the multilevel sharing also keeps the convergence of the weights at each level consistent. The main contributions of this study are summarized as follows.
1) We adopt the swin transformer as the network backbone to extract higher-level information from LR measurements by calculating self-attention within shifted windows.
2) We design a novel MA module to guide network training, which employs dual-path pooling and SeNet to convert metadata into attention vectors.
3) The proposed Meta-TR outperforms representative SCI methods on benchmark datasets with different bands and sampling ratios, and also achieves an efficient balance among reconstruction performance, parameter size, and running time.

II. METHODOLOGY
In this section, the proposed SCI method is elucidated. For better understanding of the proposed Meta-TR, a brief review on the SCI problem formulation is given first. Then, we will introduce the structure and principle of swin transformer. Finally, we will introduce the proposed Meta-TR network architecture in detail.

A. SCI Problem Formulation
Conventionally, SCI aims to reconstruct the original high-dimensional object $x \in \mathbb{R}^{n}$ from $m$ ($m \ll n$) random measurements $y \in \mathbb{R}^{m}$ [33]. Mathematically, the imaging process can be described as follows:
$$y = \Phi x \tag{1}$$
where $\Phi$ represents the sensing matrix of size ($m \times n$), which satisfies the restricted isometry property (RIP) criterion [38]. However, because $m \ll n$, the number of unknowns in (1) is much larger than the number of equations, so (1) has infinitely many solutions. Therefore, solving this underdetermined problem requires the original object $x$ to be sparse in a transform domain, which is expressed as follows:
$$x = \Psi s, \qquad y = \Phi \Psi s = \Theta s \tag{2}$$
where $\Psi$ represents the transformation matrix, which also satisfies the RIP criterion. The parameter $s$ is the representation of the original object $x$ in the transform domain, which is sparse [39]. The parameter $\Theta$ represents the product of $\Phi$ and $\Psi$. Through this transformation, the solution of (1) can be converted into a constrained optimization problem on the $\ell_{0}$ norm [40], [41], as follows:
$$\min_{s} \|s\|_{0} \quad \text{s.t.} \quad y = \Theta s \tag{3}$$
where $\|s\|_{0}$ represents the $\ell_{0}$ norm of $s$. In this way, the complexity of the calculation is greatly reduced. Owing to its strong learning ability and operational efficiency [34], [35], a neural network can fit its parameters on a dataset to carry out the solution process of (3), i.e., to solve for $s$ from $y$. Compared with traditional SCI algorithms, network-based algorithms have lower complexity and higher accuracy [36], [37]. In this article, we adopt Meta-TR to perform SCI, as shown in Fig. 1. After training on a dataset, Meta-TR can reconstruct HR objects from LR measurements in an end-to-end way.
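The underdetermined nature of the measurement model can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: the sizes `n`, `m`, `k`, the Gaussian sensing matrix, and the orthonormal basis obtained from a QR factorization are all illustrative assumptions. It shows that a solution consistent with the measurements (here the minimum-norm least-squares solution) need not equal the original object, which is why a sparsity prior or a learned prior is required.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): signal length n, measurements m << n, sparsity k
n, m, k = 64, 16, 3

# Hypothetical orthonormal transform basis Psi, obtained from a QR factorization
Psi, _ = np.linalg.qr(rng.standard_normal((n, n)))

# k-sparse coefficient vector s and the object x = Psi s
s = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
s[support] = rng.standard_normal(k)
x = Psi @ s

# Random Gaussian sensing matrix Phi (an assumption) and measurements y = Phi x
Phi = rng.standard_normal((m, n)) / np.sqrt(m)
y = Phi @ x
Theta = Phi @ Psi  # so that y = Theta s

# Minimum-norm least-squares solution: it reproduces y but generally differs from x
x_min_norm = np.linalg.pinv(Phi) @ y

print("measurement residual:", np.linalg.norm(Phi @ x_min_norm - y))
print("reconstruction error:", np.linalg.norm(x_min_norm - x))
```

The measurement residual is numerically zero while the reconstruction error is not, so the measurements alone do not determine the object.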

B. Swin Transformer Architecture
In this subsection, we introduce the swin transformer architecture, which is the backbone of the Meta-TR network.
The transformer was originally used in natural language processing [42], and in recent years it has also shown its superiority in remote sensing image processing [43]-[45]. However, the original transformer must attend to all pixels of the image, which sharply increases the computational cost and restricts deployment and application. For this reason, the swin transformer uses window multihead self-attention (W-MSA) instead of global multihead self-attention (MSA), which greatly reduces the amount of computation. In addition, to preserve the correlation information between windows, the swin transformer extends W-MSA to shifted-window multihead self-attention (SW-MSA) [46]. As shown in Fig. 2, the left side represents W-MSA, while the right side represents SW-MSA. In SW-MSA, additional cyclic shift operations and their inverse operations are used to ensure that the windows used in the self-attention calculation are consistent with those in W-MSA.
The working mechanism of the swin transformer is as follows. An input of size ($H \times W \times C$) is divided into $HW/M^{2}$ nonoverlapping windows $X$ of size ($M^{2} \times C$). Each window $X$ is first processed by layer normalization (LN), and then the self-attention within the window is computed as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\left(\frac{QK^{T}}{\sqrt{d}} + B\right)V \tag{4}$$
where $Q, K, V \in \mathbb{R}^{M^{2} \times d}$ are the query, key, and value matrices computed from $\mathrm{LN}(X)$, respectively, $d$ represents the dimension of the key, and $B$ represents the relative position encoding. After that, the residual structure and the multilayer perceptron (MLP) are applied to the self-attention result as follows:
$$\begin{aligned} X_{1} &= F_{\mathrm{wMSA}}(\mathrm{LN}(X)) + X \\ X_{2} &= F_{\mathrm{MLP}}(\mathrm{LN}(X_{1})) + X_{1} \\ X_{3} &= F_{\mathrm{swMSA}}(\mathrm{LN}(X_{2})) + X_{2} \\ X_{4} &= F_{\mathrm{MLP}}(\mathrm{LN}(X_{3})) + X_{3} \end{aligned} \tag{5}$$
where $F_{\mathrm{wMSA}}$, $F_{\mathrm{MLP}}$, and $F_{\mathrm{swMSA}}$ represent W-MSA, MLP, and SW-MSA, respectively, and $X_{1}$ to $X_{4}$ denote the outputs of the successive sublayers. Note that the self-attention operation applied to the input is the same in W-MSA and SW-MSA; only the selection and shifting of the windows differ. Details of (5) can also be found in Fig. 3(a).
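The computational saving of W-MSA over global MSA can be made explicit with the complexity estimates reported in the swin transformer paper [46]. For a feature map of size (H × W × C) and window size M, and omitting the SoftMax cost as in [46]:

```latex
\begin{aligned}
\Omega(\mathrm{MSA})       &= 4HWC^{2} + 2(HW)^{2}C, \\
\Omega(\text{W-MSA})       &= 4HWC^{2} + 2M^{2}HWC.
\end{aligned}
```

The first term is shared, while the attention term drops from quadratic to linear in HW, i.e., by a factor of HW/M^2. As an illustration only, for a (24 × 24) input with an assumed window size M = 8 (a value not stated in this article), the attention term shrinks by a factor of 576/64 = 9.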
A Gaussian error linear unit (GELU) activation function is used in front of the MLP. As shown in Fig. 3(a), each W-MSA is followed by an SW-MSA, so the two always appear in pairs. The shifting distance of the window is (M/2, M/2).
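The window partition, the windowed self-attention, and the cyclic shift described in this subsection can be sketched in NumPy as follows. This is a simplified single-head illustration under stated assumptions, not the authors' implementation: the sizes, the random projection weights, and the zero relative position bias `B` are placeholders, and the attention mask that the swin transformer applies to wrapped-around regions in SW-MSA is omitted for brevity.

```python
import numpy as np

def window_partition(x, M):
    """Split an (H, W, C) map into HW/M^2 non-overlapping windows of shape (M*M, C)."""
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, M * M, C)

def window_reverse(windows, M, H, W):
    """Inverse of window_partition: merge windows back into an (H, W, C) map."""
    C = windows.shape[-1]
    x = windows.reshape(H // M, W // M, M, M, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def window_attention(X, Wq, Wk, Wv, B):
    """Single-head attention inside each window: SoftMax(Q K^T / sqrt(d) + B) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    return softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d) + B) @ V

rng = np.random.default_rng(0)
H, W, C, M = 8, 8, 4, 4                      # illustrative sizes (assumptions)
feat = rng.standard_normal((H, W, C))
Wq, Wk, Wv = (rng.standard_normal((C, C)) for _ in range(3))
B = np.zeros((M * M, M * M))                 # placeholder relative position bias

# W-MSA: attention inside regular windows
out_w = window_reverse(window_attention(window_partition(feat, M), Wq, Wk, Wv, B), M, H, W)

# SW-MSA: cyclic shift by (M/2, M/2), windowed attention, then the inverse shift
s = M // 2
shifted = np.roll(feat, shift=(-s, -s), axis=(0, 1))
out_sw = window_attention(window_partition(shifted, M), Wq, Wk, Wv, B)
out_sw = np.roll(window_reverse(out_sw, M, H, W), shift=(s, s), axis=(0, 1))

print(out_w.shape, out_sw.shape)
```

Because the shift is cyclic, the same partition routine serves both W-MSA and SW-MSA, and only the roll before and after the attention differs.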

C. Meta-TR Network
In this subsection, we will first introduce the functional parts (shallow information extraction, deep information extraction, and MA) of Meta-TR, and then introduce the construction process of the loss function.
1) Shallow Information Extraction: In this part, we use convolutional layers to perform shallow extraction on the input image and retain most of the original information for subsequent deeper processing. After shallow information extraction, we apply layer normalization to prevent vanishing gradients and to speed up network convergence. This part is formulated as follows:
$$I_{\mathrm{SIE}} = F_{\mathrm{SIE}}(I_{\mathrm{LR}}) \tag{6}$$
where $I_{\mathrm{LR}}$ represents the input of the network, $F_{\mathrm{SIE}}$ represents the shallow information extraction module, and $I_{\mathrm{SIE}}$ represents the output of $F_{\mathrm{SIE}}$.
2) Deep Information Extraction: After shallow information extraction, Meta-TR employs $N_{1}$ residual swin transformer blocks (RSTBs) for deep information extraction.
In each RSTB, as shown in Fig. 3, short residual connections are used to aggregate features from different levels. Each RSTB contains $N_{2}$ swin transformer layers (STLs), and the structure of an STL is described in (5). Note that W-MSA is used in odd-numbered STLs and SW-MSA is used in even-numbered STLs. The two attention calculations alternate so that shifted windows can be used to reduce the computational complexity of the network, which is also the core of the swin transformer. In the formulation of the deep information extraction part, $F_{C}$, $F_{\mathrm{STL}}$, $F_{\mathrm{RSTB}}$, and $F_{\mathrm{DIE}}$ represent the convolutional layer, STL, RSTB, and deep information extraction, respectively, and $I_{\mathrm{DIE}}$ represents the output of $F_{\mathrm{DIE}}$.
3) Meta-Attention: In this part, we introduce the framework of MA. MA aims to utilize the metadata of image degradation to guide the overall training of the network.
In the design of MA, we mainly adopt max pooling, average pooling, and the SeNet structure. The two pooling structures extract the maximum and mean information of the sensing matrix and reduce the 3-D tensor to a 1-D vector. After the pooling, each element has a global receptive field, and global features are obtained. Then, SeNet uses the global features to capture the nonlinear relationship between channels, and finally produces a series of modulation factors in (0, 1) to guide the training of each RSTB module. In SCI, the most critical factor of image degradation is the sensing matrix. The following is a detailed description of the transformation process of the sensing matrix in MA. As shown in Fig. 4, a sensing matrix of size (K × K × 1) passes through the dimension expansion module consisting of convolutional layers, and the output tensor has size (K × K × D), where D is equal to the number of channels in each RSTB. After that, MA applies average pooling and max pooling to compress the core information and outputs two vectors of size (1 × 1 × D). SeNet is then used to modulate the vectors, which extracts the core features with a small number of parameters. Finally, the modulated vectors are added and activated using the sigmoid function, resulting in a final attention vector of size (1 × 1 × D), named the meta-attention output (MA-OUT). At this point, MA-OUT represents the core information of the sensing matrix, and we use it to channel-modulate the output of each RSTB. In this way, the reconstruction quality can be improved by making the network pay more attention to the feature maps that carry more important information.
4) Loss Function: Finally, after the upsample module, Meta-TR outputs a reconstruction with the same size as the original object.
In this article, we use maximum a posteriori (MAP) estimation to construct the loss function. In the MAP objective (10), $x$ represents the output of Meta-TR, $\sigma$ is the noise level, and $R(x)$ is a regularization term. The objective (10) can be rewritten as a function of the parameters $y$, $\Phi$, $\sigma$, and $\Theta$, where $\Theta$ represents the parameters of MAP inference. Based on this, we design the loss function for Meta-TR, in which $n$ represents the batch size of samples in training.
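The data flow of the MA module described in this subsection (dimension expansion, dual-path pooling, Squeeze-Excitation bottleneck, sigmoid, and channel modulation) can be sketched as follows. This is a minimal NumPy illustration under several assumptions that are not stated in this article: the dimension expansion is modeled as a single 1 × 1 convolution, the Squeeze-Excitation bottleneck is a two-layer perceptron with a ReLU and reduction ratio `r` that is shared by both pooling paths, and all weights are random placeholders rather than trained values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def meta_attention(phi, w_expand, w1, w2):
    """Map a (K, K) sensing matrix to a length-D attention vector (MA-OUT)."""
    # Dimension expansion (K, K, 1) -> (K, K, D); a 1x1 convolution is assumed
    expanded = phi[:, :, None] * w_expand[None, None, :]
    # Dual-path pooling over the spatial dimensions -> two vectors of length D
    avg_vec = expanded.mean(axis=(0, 1))
    max_vec = expanded.max(axis=(0, 1))
    # Squeeze-Excitation bottleneck D -> D/r -> D (sharing across paths is assumed)
    def squeeze_excite(v):
        return np.maximum(v @ w1, 0.0) @ w2
    # Add the two modulated vectors and squash to (0, 1)
    return sigmoid(squeeze_excite(avg_vec) + squeeze_excite(max_vec))

rng = np.random.default_rng(0)
K, D, r = 4, 8, 2                              # illustrative sizes (assumptions)
phi = rng.standard_normal((K, K))              # placeholder sensing matrix
w_expand = rng.standard_normal(D)
w1 = 0.1 * rng.standard_normal((D, D // r))
w2 = 0.1 * rng.standard_normal((D // r, D))

ma_out = meta_attention(phi, w_expand, w1, w2)

# Channel modulation of a hypothetical RSTB output of size (24, 24, D)
rstb_out = rng.standard_normal((24, 24, D))
modulated = rstb_out * ma_out                  # broadcasts over the channel axis

print(ma_out.shape, modulated.shape)
```

Since MA-OUT depends only on the sensing matrix, it can be computed once and reused to modulate the output of every RSTB.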

III. EXPERIMENTS
In this section, we compare the proposed Meta-TR with state-of-the-art SCI methods on remote sensing datasets with multiple bands and sampling ratios. First, the datasets and training details are introduced. Then, our method is compared with other SCI methods in terms of visual effects and evaluation metrics. After that, ablation experiments on the MA module and its internal structure are performed to confirm its effectiveness. Finally, the parameter size and running time of the network are discussed.

A. Datasets and Training Details
Datasets: In this article, we train and test on two datasets detailed below: 1) Project for on-board autonomy vegetation (PROBA-V) [47]; 2) Satellite dataset I (global cities) [48].
PROBA-V is an earth observation satellite used to map global land and vegetation cover. This dataset has been released by the Advanced Concepts Team of the European Space Agency. The PROBA-V dataset includes LR images of size (128 × 128) and HR images of size (384 × 384). All images in the dataset are single-channel with a 14-bit depth. Additionally, this dataset contains 1160 scenes, 566 from the near infrared (NIR) band and 594 from the visible red (RED) band.
Satellite dataset I (global cities) is a subset of the Wuhan University (WHU) building dataset, which is collected from remote sensing resources around the world and mainly consists of urban building clusters. This dataset includes 204 red-green-blue (RGB) images of size (512 × 512). In addition to satellite sensor differences, factors such as atmospheric conditions and seasonal changes make this dataset more informative and suitable for neural network training.
Training Details: In the experiments, Meta-TR is trained on the PROBA-V and Satellite datasets. The PROBA-V dataset contains 1160 images, of which 1000 form the training set and 160 form the test set. The Satellite dataset contains 204 images, of which 153 form the training set and 51 form the test set. As described in Section II-A, this article proposes an SCI method, and the LR input in the SCI process is obtained from the HR image through sensing matrix modulation and downsampling. The core component of the SCI compression process (the sensing matrix) can be regarded as the metadata of image degradation. Therefore, for the PROBA-V dataset and Satellite dataset I (global cities), we only need the HR images, and the LR images are generated manually by using the sensing matrix. During dataset preparation, for sampling ratios of 1/4, 1/16, and 1/36, we use sensing matrices of size (2 × 2), (4 × 4), and (6 × 6), respectively, to generate the LR datasets by sliding the sensing matrix over the HR images and computing dot products.
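The LR generation step can be sketched as a block-wise dot product. The following NumPy example is an illustration rather than the authors' preprocessing code: it assumes that the sensing matrix slides with a stride equal to its size K (non-overlapping blocks), which is inferred here from the stated sampling ratio of 1/K^2, and it uses random placeholder images and sensing matrices.

```python
import numpy as np

def compress(hr, phi):
    """Generate an LR measurement from a single-channel HR image.

    Each non-overlapping K x K block of hr is multiplied element-wise by the
    (K, K) sensing matrix phi and summed, giving one LR pixel per block.
    """
    K = phi.shape[0]
    H, W = hr.shape
    assert H % K == 0 and W % K == 0, "HR size must be a multiple of K"
    blocks = hr.reshape(H // K, K, W // K, K)   # blocks[i, k, j, l] = hr[i*K + k, j*K + l]
    return np.einsum("ikjl,kl->ij", blocks, phi)

rng = np.random.default_rng(0)
for K in (2, 4, 6):                             # sampling ratios 1/4, 1/16, 1/36
    hr = rng.random((24 * K, 24 * K))           # placeholder HR patch
    phi = rng.random((K, K))                    # placeholder sensing matrix
    lr = compress(hr, phi)
    print(f"sampling ratio 1/{K * K}: HR {hr.shape} -> LR {lr.shape}")
```

With this convention, a (24 × 24) LR patch corresponds to HR patches of (48 × 48), (96 × 96), and (144 × 144) for the three sampling ratios.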
The LR patch size is set to (24 × 24), and the batch size is set to 16. It follows that, for sampling ratios of 1/4, 1/16, and 1/36, the corresponding HR patch sizes are (48 × 48), (96 × 96), and (144 × 144). Data augmentation is performed with rotation and cropping during training. The evaluation metrics of network reconstruction performance are the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [49]. The Adam optimizer with β1 = 0.9 and β2 = 0.999 is adopted to train Meta-TR [50]. The initial learning rate is set to 1e−4. We train Meta-TR on an Nvidia RTX 3090 GPU for approximately two days to achieve the optimal results.
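For reference, PSNR follows the standard definition 10 log10(peak^2 / MSE). The sketch below is a generic NumPy illustration, not the evaluation script used in this article; in particular, the peak value of 2^14 − 1 for 14-bit PROBA-V images is an assumption, since the normalization used in the experiments is not stated.

```python
import numpy as np

def psnr(reference, estimate, peak):
    """Peak signal-to-noise ratio in dB for images with maximum value `peak`."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
peak_14bit = 2 ** 14 - 1                       # assumed peak for 14-bit data
ref = rng.integers(0, peak_14bit, size=(384, 384)).astype(np.uint16)
est = ref + 1                                  # every pixel off by one level, so MSE = 1

print(f"PSNR: {psnr(ref, est, peak_14bit):.2f} dB")
```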

B. Comparing SCI Methods
In this subsection, we compare Meta-TR with representative SCI methods from recent years, including the total variation augmented Lagrangian alternating direction algorithm (TVAL-3), ReconNet+res, the modified super-resolution residual network (MSRResNet), residual attention multi-image super-resolution (RAMS), Joinput-CiNet, Meta-CiNet, and the residual channel attention network (RCAN). For a fair comparison, all methods use the same sensing matrix and dataset, and the networks are trained to convergence. These methods are described in detail below.
TVAL-3 [51]: This is a classic traditional algorithm, which adopts an augmented-Lagrangian-based total variation regularization model to achieve iterative SCI reconstruction.
ReconNet+res [21]: ReconNet is a block-to-block SCI network. For remote sensing datasets, a residual structure is added in this article to enhance the reconstruction performance.
MSRResNet [52]: The original MSRResNet is a modified version of the super-resolution reconstruction residual network. This article trains it to perform SCI reconstruction.
RAMS [10]: This is a representative lightweight network for remote sensing image reconstruction, which builds feature and temporal attention modules through 3-D convolution and achieves excellent results on the PROBA-V dataset.
Joinput-CiNet [2]: This is an SCI network with joint input of degradation maps and LR measurements, which uses principal component analysis to extract sensing matrix information to guide reconstruction.
Meta-CiNet [54], [55]: This is an improved version of Joinput-CiNet, which extracts more dimensional information of the sensing matrix than the former.
RCAN [53]: This is one of the most representative CNN-based SR networks, which uses a residual-in-residual structure and a channel attention mechanism to build a very deep network that achieves high-quality reconstruction.

C. Results on PROBA-V Dataset
In this subsection, we train and test all methods on the PROBA-V dataset. Experiments are carried out at sampling ratios of 1/4, 1/16, and 1/36. Table I summarizes the reconstruction PSNR and SSIM values obtained with TVAL-3, ReconNet+res, MSRResNet, RAMS, Joinput-CiNet, Meta-CiNet, RCAN, and Meta-TR. In the table, NIR, RED, and ALL represent the reconstruction results of each method in the near infrared band, the visible red band, and all bands, respectively. The images in the RED band show better results than the NIR images at each sampling ratio, because the RED images have lower average brightness than the NIR images. Overall, Meta-TR achieves the best PSNR/SSIM values at all sampling ratios and on all datasets. At sampling ratios of 1/4, 1/16, and 1/36, Meta-TR achieves average improvements of 0.68 dB/0.0057, 0.43 dB/0.0104, and 0.25 dB/0.0092, respectively, over the classic MSRResNet method. It is worth mentioning that the number of parameters of Meta-TR is about 1/16 of that of RCAN, yet its reconstruction quality still exceeds that of RCAN on different datasets and at different sampling ratios. These quantitative results demonstrate the superiority of the proposed Meta-TR for SCI.
Figs. 5 and 6 show the visual reconstruction results of different SCI methods in the RED and NIR bands, respectively. In each band, the reconstruction quality of all methods degrades as the sampling ratio decreases. Compared with other methods, the reconstructions of the proposed Meta-TR contain more detailed information (rivers, mountains, etc.), which is beneficial to the subsequent identification and analysis of remote sensing images. In addition, our method shows superiority in both the RED and NIR bands, confirming that it works well at different wavelengths.

D. Results on Satellite Dataset I (Global Cities)
Similar to the above subsection, we train and test all methods on Satellite dataset I (global cities). The dataset consists of RGB images, which lie in the visible light band and mainly reflect the information of urban building groups. Table II shows the PSNR/SSIM values of the reconstruction results of different methods at sampling ratios of 1/4 and 1/16. Meta-TR again achieves the best metrics at both sampling ratios. Compared with the second-best method, Meta-TR achieves improvements of 0.16 dB/0.0043 and 0.16 dB/0.0128 in PSNR/SSIM at sampling ratios of 1/4 and 1/16, respectively. Figs. 7 and 8 show the visual reconstruction results of different methods at sampling ratios of 1/4 and 1/16, respectively. From the figures, we can see that, compared with other methods, the reconstructions of Meta-TR contain more detailed information on buildings and roads, which is beneficial to the follow-up target monitoring and terrain mapping of remote sensing data.

E. Ablation Experiments of MA Module
In this subsection, the validity of the MA module and its internal structure is verified. As shown in Fig. 3(b), the core part of MA consists of Maxpool, Avgpool, and SeNet. Therefore, we train the following four versions of Meta-TR on the PROBA-V dataset with a sampling ratio of 1/16: 1) Baseline (without MA); 2) Avgpool (MA that only contains Avgpool); 3) Maxpool (MA that only contains Maxpool); 4) Avgpool+Maxpool (with complete MA).
The trained models are tested in the NIR and RED bands. As shown in Table III, Meta-TR (Avgpool+Maxpool) improves PSNR and SSIM by about 0.11 dB and 0.0034, respectively, over Meta-TR (Baseline). This shows that the MA module can boost the reconstruction metrics of Meta-TR. Furthermore, Meta-TR (Avgpool+Maxpool) surpasses the models using Avgpool or Maxpool alone, which confirms the rationality of the MA internal structure in this article. The bold entries represent the method proposed in this paper. Fig. 9 shows the visual reconstructions of the two network versions. Meta-TR (w/ MA) has an advantage in detail reconstruction and shows improvement in both bands. The ablation experiments in this section confirm that the MA module can make full use of the degradation information to guide network training and demonstrate its effectiveness for SCI.

F. Comparison of Parameters and Running Time
In this subsection, the parameter quantities and running times of different SCI networks are compared and discussed. Table IV shows the parameters of ReconNet+res, MSRResNet, RAMS, Joinput-CiNet, Meta-CiNet, RCAN, and Meta-TR at sampling ratios of 1/4, 1/16, and 1/36. As the sampling ratio decreases, the parameters of all networks increase, which is mainly due to the increase in the number of layers in the upsampling module. Furthermore, Meta-TR has the third-smallest number of parameters among all methods but achieves the best reconstruction results (according to Sections III-C and III-D). This shows that Meta-TR can achieve better reconstruction results with fewer parameters, which is more conducive to model deployment and application, and that it achieves an excellent balance between network performance and parameter quantity. Finally, the reconstruction running times of different SCI methods are compared. As shown in Table V, except for TVAL-3, the reconstruction time of the SCI methods for an image of size (384 × 384) is between 20 and 60 ms, which can basically meet the needs of real-time imaging. This illustrates that Meta-TR also achieves a good balance between reconstruction time and quality. The bold entries represent the method proposed in this paper.

IV. CONCLUSION
In this article, we propose an SCI network employing MA and the swin transformer. The proposed Meta-TR uses the swin transformer as the network backbone to extract global information inside the image block by using self-attention, which improves the depth of network information extraction while keeping the number of parameters from becoming excessive. Furthermore, we design an MA module to extract key information from the image degradation metadata in SCI through the Squeeze-Excitation structure and to perform channel modulation on the feature maps of Meta-TR. With this module, additional prior information can be used to guide the network training process, which improves the reconstruction quality and interpretability of the network. Extensive experiments on remote sensing benchmark datasets with different bands and different sampling ratios confirm the superiority of the proposed Meta-TR in both reconstruction metrics and visual effects.