Crossed Dual-Branch U-Net for Hyperspectral Image Super-Resolution

Hyperspectral images have achieved great success in many fields, but their low spatial resolution limits their effectiveness in applications. Hyperspectral image super-resolution has emerged as a popular research trend, in which high-resolution hyperspectral images are obtained by combining low-resolution hyperspectral images with high-resolution multispectral images. In this process of multimodality data fusion, it is crucial to ensure effective cross-modality information interaction. To generate higher quality fusion results, a crossed dual-branch U-Net is proposed in this article. Specifically, we adopt the U-Net architecture and introduce a spectral–spatial feature interaction module to capture cross-modality interaction information between the two input images. To narrow the gap between the downsampling and upsampling processes, a spectral–spatial parallel Transformer is designed as the skip connection. This design simultaneously learns long-range dependencies in both spatial and spectral information and provides detailed information for the final fusion. In the fusion stage, we adopt a progressive upsampling strategy to refine the generated images. Extensive experiments on several public datasets are conducted to demonstrate the performance of the proposed network.

Jingyi Zhang, Jianjun Liu, Member, IEEE, Jinlong Yang, and Zebin Wu, Senior Member, IEEE

I. INTRODUCTION
HYPERSPECTRAL images (HSIs) have numerous spectral bands that carry a wealth of spectral information, which enables them to identify object materials. Accordingly, HSIs have been employed in many computer vision tasks, ranging from image classification [1], [2] and environmental monitoring [3] to anomaly detection [4], [5]. However, limited by sensor devices, obtaining HSIs with high spatial and spectral resolution concurrently is difficult. To alleviate this problem, a number of HSI super-resolution methods have been proposed [6], [7], [8]. There are usually two solutions, namely, single-image and fusion-based HSI super-resolution. The fusion-based strategy, which is the preferred solution, obtains high-resolution hyperspectral (HRHS) images by merging low-resolution hyperspectral (LRHS) images with high-resolution multispectral (HRMS) images. Existing fusion-based methods for HSI super-resolution generally fall into four classes: component substitution (CS), multiresolution analysis (MRA), model-based, and deep learning-based methods.
The CS methods attempt to generate HRHS images simply by replacing the spatial details in LRHS images with those of the corresponding HRMS images [9], [10]. The MRA methods extract spatial detail information from HRMS images by multiresolution decomposition and yield HRHS images by incorporating the obtained spatial details into LRHS images [11], [12]. Both CS and MRA methods have advantages, such as low computational cost and fast implementation, but they often suffer from spectral or spatial distortion.
Model-based methods typically establish an optimization function to model the fusion problem, and the function is solved with iterative algorithms [13], [14], [15], [16], [17], [18]. In HSI super-resolution tasks, the optimization function generally includes two parts: data fidelity terms and regularization terms. The data fidelity terms mainly serve to stabilize the model and reduce the differences between input and output images in spatial and spectral information. The regularization terms constrain the fusion result based on prior knowledge. This prior knowledge is often based on the latent statistics of HSIs, such as the sparsity prior [19], [20], the low-rank prior [21], and the total variation prior [22]. Model-based methods have the advantage of interpretability, but they usually rely too much on handcrafted priors, which leaves many parameters to be tuned.
Deep learning-based methods have attracted extensive interest from researchers owing to their powerful feature extraction capabilities. These methods typically build end-to-end deep neural networks to learn the underlying relationships between inputs and outputs. In recent years, many CNN-based architectures have been employed in HSI super-resolution tasks, such as ResNet [23], U-Net [24], [25], DenseNet [26], and GAN [27]. Because of the limited receptive field of the convolution operation, CNNs fail to effectively utilize global information. To overcome this disadvantage, the Transformer has been developed and has become a promising solution [28]. The Transformer relies on the self-attention mechanism to handle long-range dependencies in images and has been applied successfully in HSI super-resolution tasks.
There are two shortcomings in current deep learning-based methods. One is that these methods cannot fully utilize local and global features. The other is that these methods often ignore the correlations in spectral information, leading to suboptimal fusion. Given these issues, we propose a crossed dual-branch U-Net for HSI super-resolution based on CNN and Transformer. Specifically, we design two CNN-based branches to fully extract local and shallow features of the images. To make full use of these features, a feature interaction module that consists of convolution operations and matrix multiplications is designed to merge the spectral and spatial features and generate interaction information. In particular, we propose a spectral-spatial parallel Transformer (SSPT) that includes a spatial self-attention and a spectral self-attention and thus considers both spatial and spectral correlations. The major contributions of this study are as follows.
1) A novel HSI super-resolution method named crossed dual-branch U-Net is proposed, which combines CNN and Transformer to effectively utilize local details and global information.
2) To facilitate the interaction of information between branches, we introduce a spectral-spatial feature interaction module (SFIM), which improves the quality of fusion.
3) We introduce an SSPT as the skip connection to supplement global relevance features, which models global spatial information and takes into account the dependencies between adjacent spectral bands.
The rest of this article is organized as follows. In Section II, we review existing works on HSI super-resolution. Section III describes the proposed network and its components. The experimental results are presented and analyzed in Section IV. Finally, Section V concludes this article.

II. RELATED WORKS
We give a brief review of the model-based and deep learning-based methods for HSI super-resolution.

A. Model-Based Methods
In general, model-based methods fall into two categories: nonfactorization-based and factorization-based methods. Nonfactorization-based methods aim to obtain target images via prior knowledge. For example, Wei et al. [29] utilized the probability information within the scene and proposed a Bayesian fusion method. A fast fusion method integrating the Sylvester equation was presented, which dramatically reduced computational complexity [30]. Factorization-based methods mainly decompose the target image and then build an optimization model to solve. The factorization-based methods include matrix factorization-based methods [31], [32], [33], [34] and tensor decomposition-based methods [16], [35], [36], [37], [38], [39], [40], [41], [42]. Matrix factorization-based methods primarily transform the fusion task into an estimation of the spectral basis and the corresponding coefficients. Dian et al. [32] formulated an optimization model in conjunction with a sparse prior and estimated the spectral basis and coefficients simultaneously. Considering the subspace low-rank relationships between HRMS and LRHS images, Xue et al. [21] proposed a subspace clustering-based approach that formulated a variational optimization model. Since the original HSIs are 3-D cubes, tensor decomposition-based methods can better handle multidimensional information. Popular tensor decomposition methods include Tucker decomposition, CP decomposition, and tensor-ring decomposition. For example, Jin et al. [38] presented a tensor network that fuses the high-order tensors corresponding to LRHS and HRMS images and designed a new regularization term named weighted graph regularization. In response to the noise and nonsmoothness problems, Guo et al. [39] inserted two different operators to design a tensor decomposition network. Based on tensor-ring decomposition, He et al. [40] designed a model that iteratively obtains the corresponding core tensors from LRHS and HRMS images. A regularization method was proposed by Xu et al. [42], which integrated two priors simultaneously to estimate the tensor subspace and tensor coefficients and obtained excellent super-resolution results.

B. Deep Learning-Based Methods
Deep CNNs have powerful feature extraction capabilities and are extensively used in a variety of deep learning tasks. In the last few years, many efficient CNN-based HSI super-resolution methods have been proposed [43], [44], [45], [46], [47], [48], [49], [50]. Yang et al. [43] introduced a network with two branches, where one branch was dedicated to extracting spatial features of the HRMS image while the other extracted spectral features of the LRHS image. To fully utilize multiscale features, Zhan et al. [44] proposed a network incorporating octave convolution with an attention mechanism and designed a multisupervised loss function. To further improve the interpretability of pure deep networks, model-driven methods have been suggested. Specifically, these methods solve the iterative algorithm by building a deep network [45]. Combining effective mathematical theoretical guidance, Dong et al. [46] suggested a dual spatial-spectral optimization strategy and introduced two optimization branches based on spatial and spectral priors, respectively. Based on the U-Net architecture, Wang et al. [49] proposed an approach incorporating spectral and spatial attention that employed dense multiscale links as skip connections to obtain finer feature information. Ran et al. [51] presented a fusion network capable of handling different resolution enhancement tasks and incorporated multiscale high-resolution guidance to yield promising fusion results.
The Transformer was initially applied in natural language processing. Owing to its outstanding performance, it has gradually been introduced to other fields as well [52], [53]. Likewise, many Transformer-based HSI super-resolution methods have been proposed [54], [55], [56], [57]. Hu et al. [54] directly fed the upsampled LRHS image concatenated with the HRMS image to a vision Transformer and achieved excellent results. Wang et al. [55] presented a Transformer-based network that utilized cross-attention for information fusion and enabled multilevel feature extraction and aggregation. Deng et al. [56] proposed a pyramid network based on window self-attention; they considered information interaction between patches and addressed the computational complexity problem by fixing a smaller window size.

III. METHODOLOGY
In this section, we provide a thorough overview of the proposed network and loss function.

A. Overall Network Architecture
For brevity, the LRHS image is represented by $Y \in \mathbb{R}^{h \times w \times C}$, where $h \times w$ and $C$ correspond to its spatial resolution and number of bands, respectively. $Z \in \mathbb{R}^{H \times W \times c}$ indicates the HRMS image, where $H$, $W$, and $c$ stand for its height, width, and number of bands, respectively. The HRHS image to be generated is denoted as $X \in \mathbb{R}^{H \times W \times C}$. The primary goal of our method is to generate HRHS images that share as much spectral information as possible with the input LRHS images and as much spatial information as possible with the input HRMS images.
The proposed network is illustrated in Fig. 1 and contains four primary modules: the feature extraction module (FEM), SFIM, SSPT, and the multiscale fusion module (MFM). To match the sizes of the two inputs, we first upsample the LRHS image by the commonly used bicubic interpolation. Then, two FEMs are employed to extract features from the upsampled LRHS image and the HRMS image, where each FEM is composed of two identical 3 × 3 convolutional layers with a stride of 1. These spectral and spatial features are fused by the designed SFIM to realize cross-modality information interaction. Motivated by U-Net, we introduce an SSPT as the skip connection, which allows the network to capture long-range dependencies and compensate for information loss. Finally, MFM gradually incorporates multiscale fusion information by successive stacking and upsampling; the numbers of channels of the feature maps are 64, 96, and 128, respectively. The proposed network balances spectral and spatial information, generating accurate and high-quality HRHS images.

B. Spectral-Spatial Feature Interaction Module
HSIs are integrated data cubes of imagery and spectrum, so both spectral and spatial information are important. LRHS images have richer spectral information, while HRMS images contain more spatial information. To integrate this spectral and spatial information effectively, we introduce an SFIM at each scale. The feature maps of the LRHS and HRMS images are denoted as $Y_i$ and $Z_i$, respectively. The details of SFIM are shown in Fig. 2. In SFIM, we first extract spectral features from $Y_i$ and spatial features from $Z_i$ by using a 3 × 3 convolution operation. After that, the extracted features are fused by a matrix multiplication (Matmul) to obtain the interaction feature $O_i^1$. Notably, considering $O_i^1$ as the first-order interaction feature, we can further obtain the second-order interaction feature by performing similar operations. Specifically, the same convolutional operations are performed on $Y_i$ and $Z_i$ again, and then a matrix multiplication is performed between each of the obtained features and $O_i^1$, resulting in $Y_i^1$ and $Z_i^1$, respectively. Finally, $Y_i^1$ and $Z_i^1$ are added to generate the second-order interaction feature, that is, $O_i^2 = Y_i^1 + Z_i^1$. In addition, $Y_i$ and $Z_i$ are added together to retain detailed information, which enhances the network's capability to preserve spatial and spectral information.
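The two interaction orders can be traced with a small shape-checking sketch. It is only one shape-consistent reading of the description, not the authors' implementation: the feature maps are assumed to be flattened to HW × C matrices, the 3 × 3 convolutions are assumed to have been applied already (their outputs are passed in directly), and the operand order and transpose in the first Matmul are assumptions. The function name `sfim_sketch` is hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sfim_sketch(y_feat, z_feat, y_feat2, z_feat2):
    """All inputs are HW x C matrices standing in for convolved feature maps.

    y_feat, z_feat   : features used for the first-order interaction
    y_feat2, z_feat2 : features from the second round of convolutions
    """
    # First-order interaction: C x C (ASSUMED layout: Y^T Z).
    o1 = matmul(transpose(y_feat), z_feat)
    # Second-order interaction: each branch is multiplied by O1, then summed.
    y1 = matmul(y_feat2, o1)  # HW x C
    z1 = matmul(z_feat2, o1)  # HW x C
    o2 = add(y1, z1)
    return o1, o2


# Toy example with HW = 3 pixels and C = 2 channels.
y = [[1, 0], [0, 1], [1, 1]]
z = [[1, 2], [3, 4], [5, 6]]
o1, o2 = sfim_sketch(y, z, y, z)
print(o1)  # [[6, 8], [8, 10]]
print(o2)  # [[28, 36], [58, 74], [92, 118]]
```

Under this reading the first-order feature is a channel-by-channel interaction matrix, and the second-order feature returns to the HW × C layout of the inputs, so it can be combined with the residual sum of $Y_i$ and $Z_i$.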

C. Spectral-Spatial Parallel Transformer
The Transformer is well known for its ability to capture long-distance dependencies between spatial locations. Given that spectral and spatial information are both important for HSIs, we design a spectral-spatial parallel Transformer, which takes both global spatial correlations and spectral correlations into consideration.
As shown in Fig. 3, SSPT includes a spectral self-attention, a spatial self-attention, and a feedforward network. Taking the spatial self-attention as an example, the input $X_i \in \mathbb{R}^{H \times W \times C}$ is first projected and reshaped into $P \in \mathbb{R}^{HW \times C}$, and then $P$ is projected into $Q \in \mathbb{R}^{HW \times C}$, $K \in \mathbb{R}^{HW \times C}$, and $V \in \mathbb{R}^{HW \times C}$ by linear layers. The spatial self-attention can be formulated as
$$Q = P W_Q, \quad K = P W_K, \quad V = P W_V$$
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$
where $Q$, $K$, and $V$ are the query, key, and value matrices, respectively. Their corresponding learnable projection matrices are $W_Q$, $W_K$, and $W_V \in \mathbb{R}^{C \times C}$. $d_k$ is the dimension of $K$, and $Q K^{T}$ computes the attention scores by dot product. Multihead attention divides $Q$, $K$, and $V$ into multiple heads, each of which calculates self-attention and captures different aspects of the information in the data. Specifically, the self-attention is computed $h$ times in parallel, with $h$ being the number of heads, and these heads are then combined to obtain the multihead attention. The multihead self-attention is formulated as follows:
$$\mathrm{head}_j = \mathrm{Attention}(Q_j, K_j, V_j), \quad j = 1, \ldots, h$$
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W_O$$
where $W_O$ is a learnable projection matrix, and $h$ is set to 4 in the experiments.
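The multihead computation described above can be sketched in a few lines of dependency-free Python. This is a generic scaled dot-product attention with channel-wise head splitting, written under the assumption that SSPT follows the standard Transformer formulation; identity projection matrices and the toy token values are placeholders chosen only so that the example runs, and all function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(m):
    """Row-wise softmax with max-subtraction for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    return matmul(softmax_rows(scaled), v)


def multi_head_attention(p, w_q, w_k, w_v, w_o, num_heads):
    """Project tokens, split channels into heads, attend, concatenate, project."""
    q, k, v = matmul(p, w_q), matmul(p, w_k), matmul(p, w_v)
    head_dim = len(q[0]) // num_heads
    heads = []
    for i in range(num_heads):
        cols = slice(i * head_dim, (i + 1) * head_dim)
        heads.append(attention([r[cols] for r in q],
                               [r[cols] for r in k],
                               [r[cols] for r in v]))
    concat = [[x for head in heads for x in head[t]] for t in range(len(p))]
    return matmul(concat, w_o)


# Three tokens with C = 4 channels; identity projections are placeholders.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
tokens = [[0.1, 0.2, 0.3, 0.4],
          [0.4, 0.3, 0.2, 0.1],
          [0.5, 0.5, 0.5, 0.5]]
out = multi_head_attention(tokens, identity, identity, identity, identity, num_heads=4)
print(len(out), len(out[0]))  # token count and channel count are preserved
```

With four heads and four channels, each head attends over a single channel, which makes it easy to see that every output entry is a convex combination of the values in its own column.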
The spectral self-attention calculates the correlations among spectral bands, and the spatial self-attention calculates the correlations among pixels. Their calculation processes are similar, and the corresponding illustrations are shown in Fig. 4. Different from the spatial self-attention, the three projected matrices of the spectral self-attention are reshaped into $Q$, $K$, and $V \in \mathbb{R}^{C \times HW}$.
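The practical difference between the two attention types is the size of the attention map, which follows from which axis plays the role of the tokens. The sketch below omits the learnable projections and uses toy sizes (not the paper's) purely to make the shapes visible; `score_shape` is a hypothetical helper.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def score_shape(tokens):
    """Shape of Q K^T when Q = K = tokens (projections omitted for brevity)."""
    scores = matmul(tokens, transpose(tokens))
    return len(scores), len(scores[0])


height, width, bands = 4, 4, 3  # toy sizes, not the paper's
# Flattened feature map of shape HW x C.
features = [[float(p + b) for b in range(bands)] for p in range(height * width)]

spatial_shape = score_shape(features)              # tokens are pixels
spectral_shape = score_shape(transpose(features))  # tokens are bands
print(spatial_shape, spectral_shape)  # (16, 16) (3, 3)
```

The spatial map grows quadratically with the number of pixels, whereas the spectral map depends only on the number of channels, which is one reason the two branches have very different costs at high resolution.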

D. Multiscale Fusion Module
After the above process, we obtain spectral-spatial feature maps at different scales. The spatial sizes of these feature maps are 64, 32, and 16, respectively. To make full use of these feature maps and generate the HRHS image, we adopt a progressive fusion strategy and design the MFM, whose structure is depicted in Fig. 5.
In the MFM, feature maps from different scales are gradually upsampled by a block consisting of a 2 × 2 transposed convolutional layer and a 3 × 3 convolutional layer, whose strides are set to 2 and 1, respectively. The upsampled feature maps are concatenated to generate the HRHS image. This strategy is inspired by Zhang et al. [58]; we employ the transposed convolution for upsampling and achieve better super-resolution performance. In Section IV, ablation experiments against a direct upsampling strategy demonstrate the effectiveness of this progressive strategy.
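The effect of one upsampling block on the spatial size can be checked with the usual output-size formulas for (transposed) convolutions. In this sketch the zero padding of the transposed convolution and the padding of 1 for the 3 × 3 convolution are assumptions (the text gives only kernel sizes and strides), and the function names are hypothetical.

```python
def transposed_conv_out(size, kernel=2, stride=2, padding=0):
    """Output length of a transposed convolution along one axis
    (dilation 1, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def conv_out(size, kernel=3, stride=1, padding=1):
    """Output length of an ordinary convolution along one axis.
    padding=1 is ASSUMED so that the 3 x 3 layer preserves the size."""
    return (size + 2 * padding - kernel) // stride + 1


def upsample_block(size):
    """2 x 2 transposed convolution (stride 2) followed by 3 x 3 convolution (stride 1)."""
    return conv_out(transposed_conv_out(size))


sizes = [16]
while sizes[-1] < 64:
    sizes.append(upsample_block(sizes[-1]))
print(sizes)  # [16, 32, 64]
```

Each block doubles the spatial size, so two blocks take the coarsest 16-pixel feature map to the 64-pixel scale, matching the three scales listed above.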

E. Loss Function
We adopt the L1 loss as the loss function, which is commonly used to compute the pixel-level difference between the desired image and the fused image. The L1 loss function is formulated as follows:
$$\mathrm{Loss} = \lVert X - \hat{X} \rVert_1 \tag{10}$$
where $X$ and $\hat{X}$ denote the reference and fused images, respectively.

IV. EXPERIMENTAL RESULTS
In this section, we present the datasets employed in our work, their data processing procedures, and the evaluation metrics. In addition, we perform several ablation and comparative experiments to evaluate our approach.

A. Datasets Introduction
1) CAVE: The CAVE dataset includes 32 HSIs, each containing 31 spectral bands with a resolution of 512 × 512. In this dataset, the wavelength ranges from 400 to 700 nm, and the spectral resolution is 10 nm. In our experiments, the first 20 images were assigned to the training set, while the remaining 12 images were used as the test set.
2) Harvard: The Harvard dataset includes 50 different scenes with a resolution of 1392 × 1040. Every image has 31 spectral bands, with wavelengths ranging from 420 to 720 nm and a spectral resolution of 10 nm. In our experiments, we cropped the upper left corner of each image, resulting in images of size 1360 × 1024. We selected the first 30 images as the training set and the remaining images as the test set.
3) Pavia Center (PC): The PC dataset consists of an HSI of size 1096 × 640 with 115 bands, which was captured by the ROSIS sensor. After removing 13 noisy bands, 102 bands are left. In our experiments, we cropped 40 nonoverlapping subimages of size 128 × 128 from the original image. The first 28 subimages were used as the training set, and the rest were used as the test set.
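The count of 40 subimages for the PC dataset can be verified with a short tiling sketch. The row-major ordering of the patches and the rule of discarding incomplete border tiles are assumptions, since the text does not state how the subimages were ordered; `patch_origins` is a hypothetical helper.

```python
def patch_origins(height, width, patch):
    """Top-left corners of nonoverlapping patches, in row-major order.
    Incomplete tiles at the borders are discarded (an assumption)."""
    return [(row, col)
            for row in range(0, height - patch + 1, patch)
            for col in range(0, width - patch + 1, patch)]


origins = patch_origins(1096, 640, 128)
train, test = origins[:28], origins[28:]
print(len(origins), len(train), len(test))  # 40 28 12
```

A 1096 × 640 image holds 8 × 5 complete 128 × 128 tiles, which agrees with the 40 subimages and the 28/12 split reported above.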
The three simulated datasets were processed following Ranchin and Wald's protocol [59]. Specifically, we applied a Gaussian filter with a scale factor of 4 to the original HSIs to generate the LRHS images, and the HRMS images were generated via a spectral response function (SRF). For the first two datasets, the SRF was derived from a Nikon D700 camera, while for the PC dataset the SRF was taken from the IKONOS satellite. In our experiments, we cropped image patches of size 64 × 64 and 16 × 16 from the observed HRMS and LRHS images, respectively, as input.
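The simulation of the two observations can be illustrated with a minimal sketch: spatial degradation by Gaussian blurring followed by decimation, and spectral degradation by a weighted sum over bands. The kernel size, the standard deviation, the border handling, the sampling offset, and the toy SRF weights are all assumptions made for illustration; the actual filter and SRF values are not given in the text, and the function names are hypothetical.

```python
import math


def gaussian_kernel(size, sigma):
    """Normalized 2-D Gaussian kernel (size and sigma are ASSUMED values)."""
    half = size // 2
    kernel = [[math.exp(-((i - half) ** 2 + (j - half) ** 2) / (2 * sigma ** 2))
               for j in range(size)] for i in range(size)]
    total = sum(map(sum, kernel))
    return [[v / total for v in row] for row in kernel]


def blur_and_downsample(band, ratio=4, size=5, sigma=2.0):
    """Blur one band (2-D list) and keep every `ratio`-th pixel."""
    h, w = len(band), len(band[0])
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    out = []
    for i in range(0, h, ratio):
        row = []
        for j in range(0, w, ratio):
            acc = 0.0
            for di in range(size):
                for dj in range(size):
                    # Replicate border pixels outside the image.
                    y = min(max(i + di - half, 0), h - 1)
                    x = min(max(j + dj - half, 0), w - 1)
                    acc += kernel[di][dj] * band[y][x]
            row.append(acc)
        out.append(row)
    return out


def apply_srf(spectrum, srf):
    """Map a C-band spectrum to c multispectral values; srf has c rows of C weights."""
    return [sum(w * s for w, s in zip(weights, spectrum)) for weights in srf]


band = [[1.0] * 64 for _ in range(64)]  # one 64 x 64 band of a reference patch
low = blur_and_downsample(band)
print(len(low), len(low[0]))  # 16 16

toy_srf = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]  # hypothetical weights
print(apply_srf([1.0, 2.0, 3.0, 4.0], toy_srf))  # [1.5, 3.5]
```

A scale factor of 4 turns a 64 × 64 reference patch into a 16 × 16 LRHS patch, which is consistent with the patch sizes used as network input.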

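B. Quantitative Assessment Metrics
1) Spectral Angle Mapper (SAM): SAM is a commonly used metric that quantifies image quality along the spectral dimension. A lower SAM value indicates lower spectral distortion
$$\mathrm{SAM}(X, \hat{X}) = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \arccos\left(\frac{\langle X(i,j), \hat{X}(i,j) \rangle}{\lVert X(i,j) \rVert_2 \, \lVert \hat{X}(i,j) \rVert_2}\right) \tag{11}$$
where $X(i,j)$ denotes the spectral vector at position $(i, j)$ of $X$.
2) Peak Signal-to-Noise Ratio (PSNR): PSNR is a general metric that measures the pixel-level similarity between a pair of images. Higher PSNR values indicate better results
$$\mathrm{PSNR}(X, \hat{X}) = 10 \lg \frac{\max(X)^2}{\frac{1}{HWC} \lVert X - \hat{X} \rVert_F^2} \tag{12}$$
where $\max(X)$ represents the largest pixel value in $X$.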
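3) Root-Mean-Squared Error (RMSE): RMSE assesses the average pixel-wise difference between $X$ and $\hat{X}$. For images normalized to [0, 1], its value ranges from 0 to 1, and a smaller RMSE value indicates a better result
$$\mathrm{RMSE}(X, \hat{X}) = \sqrt{\frac{1}{HWC} \sum_{k=1}^{C} \sum_{i=1}^{H} \sum_{j=1}^{W} \left(X_k(i,j) - \hat{X}_k(i,j)\right)^2} \tag{13}$$
where $X_k(i,j)$ denotes the pixel value at position $(i, j)$ of the $k$th band of $X$.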

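4) Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS): ERGAS is an evaluation index that is used to assess the overall quality of an image. A lower ERGAS value indicates better fusion quality
$$\mathrm{ERGAS}(X, \hat{X}) = \frac{100}{r} \sqrt{\frac{1}{C} \sum_{k=1}^{C} \frac{\frac{1}{HW} \lVert X_k - \hat{X}_k \rVert_F^2}{\mu(X_k)^2}}$$
where $r$ denotes the downsampling factor and $\mu$ is a function that calculates the mean value.
5) Structure Similarity Index Measure (SSIM): SSIM evaluates the structural similarity between two images. A higher SSIM value indicates better quality of the fused image
$$\mathrm{SSIM}(X, \hat{X}) = \frac{1}{C} \sum_{k=1}^{C} \frac{\left(2 \mu_{X_k} \mu_{\hat{X}_k} + a_1\right)\left(2 \sigma_{X_k \hat{X}_k} + a_2\right)}{\left(\mu_{X_k}^2 + \mu_{\hat{X}_k}^2 + a_1\right)\left(\sigma_{X_k}^2 + \sigma_{\hat{X}_k}^2 + a_2\right)}$$
where $a_1$ and $a_2$ are constants, $\mu_{X_k}$ and $\mu_{\hat{X}_k}$ denote the mean values of $X_k$ and $\hat{X}_k$, respectively, and $\sigma_{X_k}$ and $\sigma_{\hat{X}_k}$ denote the standard deviations of $X_k$ and $\hat{X}_k$, respectively. $\sigma_{X_k \hat{X}_k}$ is the covariance between $X_k$ and $\hat{X}_k$.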
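For readers who want to reproduce the numerical metrics, the following dependency-free sketch computes SAM, PSNR, RMSE, and ERGAS using their commonly used global definitions. It is not the authors' evaluation code: images are assumed to be stored as lists of pixels, each a list of band values, SAM is reported in degrees, zero-norm spectra are skipped by convention, and PSNR is computed over the whole cube rather than averaged per band. SSIM is omitted because it is normally computed over local windows. All function names are hypothetical.

```python
import math


def sam_degrees(ref, fused):
    """Mean spectral angle (degrees) between corresponding pixel spectra."""
    angles = []
    for x, y in zip(ref, fused):
        norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
        if norm == 0.0:
            continue  # skip all-zero spectra (a convention chosen for this sketch)
        cos = sum(a * b for a, b in zip(x, y)) / norm
        angles.append(math.acos(max(-1.0, min(1.0, cos))))
    return math.degrees(sum(angles) / len(angles)) if angles else 0.0


def mse(ref, fused):
    """Mean squared error over all pixels and bands."""
    count = len(ref) * len(ref[0])
    return sum((a - b) ** 2 for x, y in zip(ref, fused) for a, b in zip(x, y)) / count


def rmse(ref, fused):
    return math.sqrt(mse(ref, fused))


def psnr(ref, fused):
    """Global PSNR in dB, with the peak taken as the largest reference value."""
    err = mse(ref, fused)
    if err == 0.0:
        return math.inf
    peak = max(max(pixel) for pixel in ref)
    return 10.0 * math.log10(peak ** 2 / err)


def ergas(ref, fused, ratio):
    """ERGAS with per-band MSE normalized by the squared reference band mean."""
    bands = len(ref[0])
    total = 0.0
    for k in range(bands):
        ref_band = [pixel[k] for pixel in ref]
        fused_band = [pixel[k] for pixel in fused]
        band_mse = sum((a - b) ** 2 for a, b in zip(ref_band, fused_band)) / len(ref_band)
        mean = sum(ref_band) / len(ref_band)
        total += band_mse / mean ** 2
    return 100.0 / ratio * math.sqrt(total / bands)


# Two pixels with two bands each.
reference = [[1.0, 0.0], [0.0, 1.0]]
estimate = [[1.0, 0.0], [1.0, 1.0]]
print(round(sam_degrees(reference, estimate), 4))  # 22.5
print(rmse(reference, estimate))                   # 0.5
print(round(psnr(reference, estimate), 4))         # 6.0206
print(ergas(reference, estimate, ratio=4))         # 25.0
```

In the toy example only the second pixel differs, by a 45° rotation of its spectrum, so the mean spectral angle over the two pixels is 22.5°.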

C. Comparison Methods
To thoroughly demonstrate the effectiveness of our approach, we conducted comparisons against eight methods, including four model-based methods, namely, FUSE [30], HySure [60], CNMF [13], and GSA [61], and four deep learning-based methods, namely, SSR-Net [62], HSRNet [63], Guided-Net [51], and MCT-Net [55]. For a fair comparison, all experiments were implemented using the same training set and test set. The four deep learning-based methods were all executed in the PyTorch framework with a GeForce RTX 3090 Ti 24 GB GPU. During training, we chose the Adam optimizer, trained for 200 epochs, and set the learning rate to 0.0001. The four model-based methods were implemented in MATLAB 2019a. The parameter settings of all comparison methods were consistent with their respective original papers.

D. Ablation Study
In this section, we conducted multiple ablation experiments on SSPT and its components, on SFIM, and on MFM, all using the CAVE dataset. All ablation experiments were conducted under the same environmental settings.

1) Influence of Components: We investigated the influence of the important modules in the model by removing them individually. From the results presented in Table I, we observe that the quantitative metrics declined significantly when either SFIM or SSPT was removed. When SSPT was removed, all metrics became worse, with particularly substantial changes in PSNR and SAM, which indicates that SSPT plays an essential role in capturing both spatial and spectral information from a global perspective. Similarly, the absence of SFIM led to suboptimal fusion results, which shows that effective cross-modality information interaction can enhance performance. In conclusion, SFIM and SSPT are both effective for the proposed network, and the network performs best when SFIM and SSPT are employed simultaneously.
2) Influence of Self-Attention: Different attention mechanisms are employed in SSPT, and the results are displayed in Table II. When SSPT contained only a spatial self-attention, both PSNR and SAM deteriorated, indicating the ability of the spectral self-attention to extract global spectral features. Similarly, when SSPT consisted of only a spectral self-attention, PSNR, SAM, and ERGAS changed significantly, which demonstrates the effectiveness of the spatial self-attention in capturing global spatial features. The optimal fusion results are obtained when SSPT includes both of them.
3) Influence of MFM: Two image reconstruction approaches are compared: one directly upsamples the feature maps to the same size and then fuses them, and the other upsamples and fuses them progressively. The results are presented in Table III. Direct upsampling clearly achieves worse fusion results than progressive upsampling, which indicates that more information is lost during direct cross-scale fusion. Therefore, progressive upsampling and fusion is more effective at preserving information and achieves better fusion results.

E. Results of Comparison Experiments on Simulated Datasets
1) Results on CAVE: Table IV gives the results on the CAVE dataset, where the optimal results are shown in bold. Our method outperforms the others on all quantitative evaluation metrics. This suggests that our method can improve spatial resolution while retaining spectral information. For a more intuitive representation of the reconstruction results of each method, we display the fused images and their corresponding error images for the sponges image in Fig. 6. We have marked the meaningful areas with red boxes. The error images visualize the difference between the reference and fused images. It is evident that spectral distortion and detail loss are common problems in HySure, CNMF, SSR-Net, and Guided-Net, and our method has the best fusion quality among all comparison methods. The PSNR values of all bands are plotted in Fig. 7(a), where our proposed method has the highest PSNR values on all bands, demonstrating the superiority of our method.
2) Results on Harvard: Table V presents the results of all comparison methods on the Harvard dataset. Our network obtains the best results on all quantitative evaluation indicators, followed by Guided-Net. There are significant differences between model-based methods and deep learning-based methods on the Harvard dataset. We pick the imgf1 image from the Harvard dataset for visualization in Fig. 8. The images show obvious distortions for FUSE, HySure, CNMF, GSA, and MCT-Net, while our method has the best visual results with the smallest differences. Fig. 7(b) shows the PSNR values of all spectral bands. Although there is an overall decreasing trend in the PSNR values on the Harvard dataset, the best performance is achieved by our method. This suggests that our method is able to recover spatial and spectral information simultaneously.
3) Results on PC: The results of all the methods on the PC dataset are presented in Table VI. From the table, we find that, among the model-based methods, GSA performs best in terms of PSNR, while CNMF performs best on the SAM metric. Our method obtains better values than all the other comparison methods on the five metrics, followed by Guided-Net. Fig. 9 gives the fused images and their corresponding error images on band 61 for the nine methods. From the visualized results, we can see that the model-based methods universally suffer from serious spectral and spatial distortion, followed by SSR-Net, HSRNet, and MCT-Net. Guided-Net and our method achieve better fusion results. Fig. 7(c) provides a comparison of the PSNR on each spectral band for all methods. The difference between the fusion quality of model-based and deep learning-based approaches is obvious. Among the deep learning-based approaches, our proposed method yields better quantitative and qualitative results on the PC dataset than the other methods.

F. Experimental Results on Real Dataset
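We performed further experiments on the WV2 dataset to demonstrate the effectiveness of our proposed method in real-world scenarios. The WV2 dataset consists of an LRHS image and an RGB image with sizes of 419 × 658 × 8 and 1676 × 2632 × 3, respectively. In our experiments, we cropped four sets of nonoverlapping images, where the size of the HRMS images was 512 × 512 and the size of the LRHS images was 128 × 128. The first three subimages were used as the training set and the remaining one as the test set. Since there are no available reference images, we regenerated the experimental data according to Ranchin and Wald's protocol [59]. Specifically, we regarded the original images as references and generated the HRMS and LRHS images using filters estimated by HySure [60]. In the training phase, we cropped the HRMS and LRHS images with patch sizes of 32 and 8, respectively. In the testing phase, we directly fed the original images into the network. Fig. 10 illustrates the visualization results on the WV2 dataset. The meaningful regions are zoomed in red boxes.
From the visualization results, especially the error images, it is apparent that our method yields the best visual detail and is closest to the original LRHS image. This performance in real scenarios further confirms the contributions of our method.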

G. Computational Efficiency
To provide a comprehensive comparison, it is necessary to analyze the efficiency and computational cost of the deep learning-based methods. Table VII lists the number of parameters, the FLOPs, and the test time for the deep learning-based methods. From the results in Table VII, we can see that the proposed method has a higher number of parameters than the other deep learning-based methods. The FLOPs of our model are lower than those of Guided-Net and slightly higher than those of HSRNet. This is because the proposed model is composed of multiple SSPTs, which inevitably leads to a suboptimal computational cost. The test time of our method for a single image is shorter than that of Guided-Net and MCT-Net, but longer than that of SSR-Net and HSRNet.

V. CONCLUSION
This article proposes a crossed dual-branch U-Net for HSI super-resolution. The network adopts a dual-branch structure based on U-Net, with the two branches extracting spatial features from HRMS images and spectral features from LRHS images, respectively. An SFIM is designed between the two branches to achieve cross-modality information interaction. In particular, we introduce an SSPT as the skip connection, which efficiently supplements correlative features and helps to restore detailed information in the upsampling process. Finally, we employ a progressive upsampling fusion strategy to further enhance the final fusion quality. Extensive comparison and ablation experiments are conducted on different datasets, and the outcomes confirm that our approach outperforms many advanced techniques.
Although our method achieves excellent fusion results, the network contains multiple Transformer modules, resulting in a large number of parameters and high computational complexity. In future work, we will strive to balance performance and computational cost.

Fig. 1. Illustration of the proposed crossed dual-branch U-Net. The structures of FEM, upsampling, and downsampling are shown at the bottom right, where k represents the kernel size and s represents the stride. LReLU indicates LeakyReLU.

TABLE I
ABLATION STUDY OF THE SFIM AND SSPT ON THE CAVE DATASET

TABLE II
ABLATION STUDY OF THE SELF-ATTENTION IN SSPT ON THE CAVE DATASET

TABLE VII
NUMBER OF PARAMETERS, FLOPS, AND TEST TIME OF THE DEEP LEARNING-BASED METHODS