A Multiresolution Details Enhanced Attentive Dual-UNet for Hyperspectral and Multispectral Image Fusion

The fusion-based super-resolution of hyperspectral images (HSIs) has drawn increasing attention as a way to overcome the hardware constraints intrinsic to hyperspectral imaging systems in terms of spatial resolution. A low-resolution HSI (LR-HSI) is combined with a high-resolution multispectral image (HR-MSI) to obtain a high-resolution HSI (HR-HSI). In this article, we propose a multiresolution details enhanced attentive dual-UNet to improve the spatial resolution of HSI. The network contains two branches. The first branch is the wavelet detail extraction module, which performs a discrete wavelet transform on the MSI to extract spatial detail features and then passes them through an encoding–decoding structure. Its main purpose is to extract the spatial features of the MSI at different scales. The second branch is the spatio-spectral fusion module, which injects the detail features of the wavelet detail extraction network into the HSI to better reconstruct the HR-HSI. Moreover, the network uses an asymmetric feature selective attention model to focus on important features at different scales. Extensive experimental results on both simulated and real data show that the proposed network architecture achieves the best performance, both qualitatively and quantitatively, compared with several leading HSI super-resolution methods.


I. INTRODUCTION
Hyperspectral images (HSIs) are images that provide dense spectral sampling at each pixel [1]. Compared with natural images, HSIs cover a wider spectral range, the channel division of the spectrum is much more detailed, and the number of channels can reach tens to hundreds. HSIs can discriminate between similar materials and are suitable for remote sensing applications such as classification [2], [3], object recognition [4], change detection [5], [6], disaster monitoring [7], and biodiversity [8]. Nevertheless, due to the limitations of the imaging characteristics of the hyperspectral camera itself and of the image acquisition environment in practical scenarios, the direct acquisition of HSIs with high spatial resolution is difficult [9]. As such, there is increasing interest in fusing a low-resolution HSI (LR-HSI) with a high-resolution multispectral image (HR-MSI) to obtain an HR-HSI, i.e., computationally enhancing the quality of the LR hyperspectral image [10].

(The authors are with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China (e-mail: fangjian@njust.edu.cn; yang123jx@mail.nwpu.edu.cn; abdol-raheem@njust.edu.cn; xiaoliang@mail.njust.edu.cn). Digital Object Identifier 10.1109/JSTARS.2022.3228941.)

HSI fusion methods can be categorized into component substitution (CS), multiresolution analysis (MRA), model-driven, and deep learning methods. The CS methods [11], [12] and the MRA methods [13], [14] inherit from traditional pan-sharpening methods. The CS methods decompose the LR-HSI into spectral and spatial information, replace the spatial information with that of the HR-MSI, and finally invert the decomposition to obtain the HR-HSI. The MRA methods employ multiscale decomposition to obtain HR-MSI spatial detail information, which is injected into the corresponding bands of the HSI. Although the CS and MRA fusion methods are effective in injecting the spatial detail of the MSI into the HSI, they tend to cause more severe spectral distortion.
Model-driven methods are based on mathematical models for HSI-MSI fusion; representative approaches include Bayesian-based methods, matrix factorization, and tensor representations. Bayesian HSI fusion methods [15], [16], [17] use a Bayesian dictionary and sparse coding to reconstruct the HSI. Taking advantage of the fact that the target images lie in low-dimensional subspaces, Wei et al. [15] proposed a variational method to fuse HSI and MSI. Based on matrix decomposition, the work in [18], [19], [20], [21], and [22] utilized the high correlation between spectral bands to decompose the HS image into a coefficient matrix and a spectral basis, which turns the HSI fusion problem into one of estimating the coefficient matrix and the spectral basis. In [20], the spectral basis was extracted from the LR-HSI, the sparse codes were extracted from the HR-MSI using G-SOMP+, and the HR-HSI was finally obtained from the sparse codes and the spectral basis. Based on tensor representation, the work in [23], [24], and [25] treats HSIs as tensors without destroying the spatial-spectral structure, so tensor decomposition can be a better fit for the image fusion problem. Dian et al. [23] proposed a nonlocal sparse tensor decomposition HSI super-resolution method, which decomposes HSIs into sparse core tensors and dictionaries, where the dictionaries and core tensors are learned from the LR-HSI and HR-MSI. Model-based approaches make full use of image priors, such as sparsity, low rank, and global similarity. Despite their good interpretability, the complex correlations and nonlinear features of HSIs are difficult to capture with these handcrafted priors, and the fusion performance is therefore limited.

(This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.)
With the vigorous development of deep learning, more and more researchers have turned to deep HSI fusion. Some works [26], [27], [28] use deep networks to reconstruct the HR-HSI while learning the degradation model. Dian et al. [26] used convolutional neural networks (CNNs) to learn image priors and then combined these priors with traditional HSI fusion algorithms. In [29], the observation model and the estimated fusion process were optimized jointly, yielding a blind deep learning algorithm for HSI fusion. A CNN denoising-based method (CNN-Fus) using subspace representation and CNN denoising was proposed in [30]. Other researchers have proposed deep HSI fusion algorithms guided by model-driven methods. An iterative HSI super-resolution algorithm with a deep HSI denoiser was suggested in [31], based on the likelihood and on deep-image-prior domain knowledge. Since deep learning requires a large number of training samples, which are scarce in real scenarios, some researchers have presented unsupervised deep HSI fusion methods. Wei et al. [32] exploited deep neural networks to capture image statistics and proposed an unsupervised recursive HSI super-resolution method with pixel-aware refinement. While deep learning-based fusion methods have achieved excellent performance, there is still room for improvement in terms of spatial detail. Embedding a multiscale spatial feature extraction module in a deep network has been shown to be effective in alleviating problems such as blurred boundaries [33]. In addition, a multiscale feature module fuses feature information from several different scales, which suppresses the noise passed on by shallow features, recovers the spatial structure details of the fused image more effectively in the decoding stage, and improves the fusion performance of the model.
However, most current deep learning methods use the connectivity of HSI and MSI along the spectral channel as the input to the network; this does not fully take into account the underlying multiscale spatial information.
Several researchers have proposed fusing the LR-HSI and HR-MSI at different scales to obtain the HR-HSI. Zhou et al. [34] proposed a pyramid fully convolutional network for the MSI-HSI fusion problem. This network comprises two subnetworks: the first extracts the spectral information of the LR-HSI with convolutional kernels and encodes it as deep features; the second combines HR-MSI pyramids with the encoded deep features to acquire the HR-HSI. The method proposed in [35] also addresses the HR-MSI and LR-HSI fusion problem: the deep features of the LR-HSI are gradually enlarged by deconvolution and then fused with the deep features of the HR-MSI at different scales. However, this structure ignores the basic, shallow features of the MSI. Therefore, the work in [36] introduced a dual UNet (DUNet) fusion method, which first uses an encoding-decoding network to extract MSI spatial features at different scales and then injects these scale features into the UNet network. Previous work has used pooling, convolution, and upsampling operations to extract multiscale information from HR-MSIs and LR-HSIs to fuse HR-HSIs. Such an approach requires a large number of parameters to learn the detailed information of the MSIs, and the spatial details learned are not necessarily those needed in the fused images. In contrast, discrete wavelets have been shown experimentally to extract the spatial details of images well in single-HSI super-resolution. Therefore, the method proposed in this article uses the multiscale wavelet details extracted by the discrete wavelet transform and combines them with convolution to extract multiscale information from the MSIs. Motivated by the above observations, this article designs a multiresolution details enhanced attentive dual-UNet (MDA-DUNet). As shown in Fig. 1, the network can be divided into four parts. The first part is the detail extraction network, which extracts the spatial detail information of the MSI.
The second part is the spatio-spectral encoding module, which integrates the details of the above network and the detail extraction encoding module. The third part is the asymmetric feature selective attention module (AFSAM), which selects the vital information from the multiscale information of the spatio-spectral encoding module. Finally, the spatio-spectral decoding module incorporates the obtained features from the AFSAM with the features from the detail extraction decoding module and the spatio-spectral encoding module to produce the ultimate fused image.
The main contributions of this article are listed below.
1) An MDA-DUNet network is proposed to fuse the LR-HSI and HR-MSI to obtain the HR-HSI. The proposed framework can fully exploit the multiscale information of the MSI and HSI for better spatio-spectral fusion.
2) A wavelet detail extraction module is designed to learn wavelet detail features using a deep network with an encoder-decoder structure. The discrete wavelet transform is used in this module to extract multiscale detail features from the multispectral image, and the encoder and decoder structures are combined to refine these features. In this way, the extracted wavelet features are involved not only in the encoding process but also in the decoding process, thus maximizing the use of the spatial detail features of the MSI.
3) An attention module for asymmetric feature selection is designed. Asymmetric features refer to the deep features in the UNet network whose spatial size and channel number differ across scales. These asymmetric features are selected by a spatial-spectral attention mechanism, which provides a significant performance improvement over simple concatenation.
The rest of this article is organized as follows. The dual-branch network with asymmetric attention and wavelet sub-band injection is described in Section II, followed by simulation experimental results and analysis in Section III, real-data experimental results and analysis in Section IV, and finally, Section V concludes this article.

II. METHODOLOGY
In this section, our proposed method is described in detail. The structure of the proposed network is shown in Fig. 1. Let X ∈ R^{H×W×b} denote the HR-MSI, where H, W, and b represent the spatial height, width, and number of spectral bands, respectively. Let Y ∈ R^{h×w×B} denote the LR-HSI, where h, w, and B represent the spatial height, width, and number of spectral bands, respectively. The proposed method consists of four parts, namely, the wavelet detail extraction module, the spatio-spectral encoding module, the asymmetric feature selective attention module, and the spatio-spectral decoding module. High-frequency details are extracted in the wavelet detail extraction network by the discrete wavelet transform, and high-frequency spatial information at different depths and scales is then extracted purely from the MSI by an encoding-decoding structure. The spatio-spectral fusion module is responsible for injecting the spatial information from all stages of the high-frequency detail extraction network into the HSI for detail enhancement. The asymmetric feature selective module takes the asymmetric features of the spatio-spectral encoding module, extracts the important features with a spatio-spectral attention mechanism, and finally integrates the extracted features into the decoding.

A. Wavelet Detail Extraction Module
Wavelet transforms [37] are effective tools for analyzing an image's information since they decompose the image into a low-pass sub-band image and multiscale directional high-frequency sub-band images. According to [38], using the wavelet transform in a CNN is favorable for single-image super-resolution. A wavelet residual network was proposed in [39] for computed tomography image reconstruction, which uses wavelet details to enhance image quality.
The discrete wavelet transform [40] is used to extract high-frequency detail features from the HR-MSI. In this article, the Haar discrete wavelet transform (filter bank "DB1") is used to extract multiresolution high-frequency details from the MSI, where the low-pass and high-pass filters are denoted by Φ and Ψ, respectively. Passing the image through the low-pass filter along both rows and columns yields the low-frequency sub-band image at scale d:

C_d = ΦΦ(C_{d−1})

where C_d represents the low-frequency sub-band image at the dth scale. The high-frequency sub-band images in the three directions are defined as

W^1_d = ΦΨ(C_{d−1}), W^2_d = ΨΦ(C_{d−1}), W^3_d = ΨΨ(C_{d−1})

where ΦΨ(C) represents the convolution of C with the separable filter ΦΨ, C_{d−1} represents the low-pass sub-band image at the (d−1)th scale, and W^1_d, W^2_d, and W^3_d represent the high-frequency sub-band images in the horizontal, vertical, and diagonal directions at the dth scale, respectively. We extract features from the image using a wavelet feature fusion module, as shown in Fig. 1. We concatenate the three high-frequency detail features W^1_i, W^2_i, and W^3_i and further refine the concatenated features with a 3 × 3 convolutional layer:

W̃_i = conv_{3×3}(cat(W^1_i, W^2_i, W^3_i))

where conv_{3×3} represents a convolution with kernel size 3 × 3, and cat indicates concatenation along the channel dimension. The low-frequency sub-band image C_1 and the high-frequency sub-band images W^1_1, W^2_1, and W^3_1 are obtained from the MSI through the discrete wavelet transform. The three high-frequency sub-band images are concatenated along the channel dimension, and the first feature of the encoding, denoted by W̃_1, is obtained using a 3 × 3 convolution.
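The single-level Haar decomposition above can be sketched per band in NumPy. This is a hedged illustration only: the paper does not specify boundary handling or normalization, so the standard orthonormal Haar ("db1") convention on even-sized inputs is assumed.

```python
import numpy as np

def haar_dwt2(img):
    """One level of the 2-D Haar ('db1') DWT on a single band (H and W even).
    Returns (C, (W1, W2, W3)): the low-pass sub-band and the horizontal,
    vertical, and diagonal high-frequency sub-bands, each at half resolution."""
    a = img[0::2, 0::2]   # top-left pixel of each 2x2 block
    b = img[0::2, 1::2]   # top-right
    c = img[1::2, 0::2]   # bottom-left
    d = img[1::2, 1::2]   # bottom-right
    ll = (a + b + c + d) / 2.0   # low-pass in both directions (C_d)
    w1 = (a - b + c - d) / 2.0   # horizontal details
    w2 = (a + b - c - d) / 2.0   # vertical details
    w3 = (a - b - c + d) / 2.0   # diagonal details
    return ll, (w1, w2, w3)
```

Applying `haar_dwt2` again to the returned low-pass sub-band gives the second-scale sub-bands C_2 and W^1_2, W^2_2, W^3_2, matching the recursive decomposition in the text.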
Since the low-frequency sub-band image C_1 still contains high-frequency information, the discrete wavelet transform is applied again to it, yielding the low-frequency sub-band image C_2 and the high-frequency sub-band images W^1_2, W^2_2, and W^3_2. These three high-frequency sub-band images are concatenated along the channel dimension into one high-frequency image, which then undergoes a 3 × 3 convolution, yielding W̃_2.
The deep features W̃_1 are first subjected to a max pooling operation, concatenated with the high-frequency deep features W̃_2 along the channel dimension, and then passed through a 3 × 3 convolutional layer to obtain the second output of the encoding module. Denoting the nth output of the detail extraction encoding module by WE_n (with WE_1 = W̃_1), this process is represented as

WE_n = F(maxp(WE_{n−1}), W̃_n)

where F(X_1, X_2) = conv_{3×3}(cat(X_1, X_2)), X_1 and X_2 are deep features, and maxp(·) is the spatial max pooling operation. A wavelet detail extraction decoding module is designed to complement the information of the spatio-spectral encoding module. This decoding module consists of a deconvolution layer with stride 2 and a convolution layer with kernel size 3 × 3. The deep features of the three outputs of the decoding module are expressed as

WD_n = conv_{3×3}(dec(WD_{n−1}))

where dec(·) is the deconvolution used to up-sample features, WD_0 is taken as the last output of the detail extraction encoding module, and WD_n represents the output deep feature of the decoding module at the nth stage of detail extraction.
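The recursive encoder step F(X_1, X_2) = conv_{3×3}(cat(X_1, X_2)) can be illustrated with a naive NumPy sketch. This is purely illustrative: the kernel values, zero padding, and channel-last layout are assumptions, not the authors' implementation.

```python
import numpy as np

def maxp(x):
    """Spatial 2x2 max pooling with stride 2 on an H x W x C feature map."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def conv3x3(x, kernels):
    """Naive 'same' 3x3 convolution; kernels has shape (3, 3, C_in, C_out)."""
    h, w, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))          # zero padding
    out = np.zeros((h, w, kernels.shape[-1]))
    for i in range(h):
        for j in range(w):
            # contract the 3 x 3 x C_in window against all output kernels
            out[i, j] = np.tensordot(xp[i:i + 3, j:j + 3, :], kernels, axes=3)
    return out

def fuse(x1, x2, kernels):
    """F(X1, X2): channel concatenation followed by a 3x3 convolution."""
    return conv3x3(np.concatenate([x1, x2], axis=-1), kernels)
```

In the encoder, `fuse(maxp(WE_prev), W_tilde_n, kernels)` would produce the next-stage feature; in a real network the kernels are learned rather than fixed.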

B. Spatio-Spectral Encoding Module
The LR-HSI is first preprocessed; that is, it is spatially upsampled so that its spatial size matches that of the MSI:

Ȳ = Up(Y)

where Ȳ represents the LR-HSI after upsampling, and Up(·) represents a spatial upsampling operation. The upsampling method is bilinear interpolation with a scale factor of 8.
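The upsampling operator Up(·) can be sketched as plain bilinear interpolation in NumPy. The half-pixel (align_corners=False style) sampling grid used here is an assumption; framework implementations may differ at the borders.

```python
import numpy as np

def upsample_bilinear(img, scale):
    """Bilinear upsampling of an H x W x C image by an integer factor."""
    h, w, c = img.shape
    # half-pixel sampling grid, clipped to the valid range
    ys = np.clip((np.arange(h * scale) + 0.5) / scale - 0.5, 0, h - 1)
    xs = np.clip((np.arange(w * scale) + 0.5) / scale - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :, None]   # horizontal interpolation weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy
```

For an h × w × B LR-HSI and scale factor 8, this returns an 8h × 8w × B image, i.e., the Ȳ fed into the spatio-spectral encoder.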
To address the problem that the training error increases rather than decreases after too many layers are added, He et al. [41] proposed the residual network. Residual blocks are used in this step to extract deep features, so they are briefly described below. A residual block consists of two 3 × 3 convolution layers and the ReLU activation function:

RB(X_in) = X_in + conv_{3×3}(δ(conv_{3×3}(X_in)))

where X_in is the input feature and δ is the ReLU function. The up-sampled HSI and the MSI are first concatenated along the channel dimension as the original input; the concatenated image is then processed by a 3 × 3 convolutional layer and a residual block, giving the first output feature of the deep feature encoding module. At subsequent stages, the deep detail features (denoting the nth output of the wavelet detail extraction encoding by WE_n) are concatenated with the max-pooled features of the (n−1)th output of the encoding module, and the concatenated deep features are processed by a 3 × 3 convolutional layer and a residual block to obtain the nth output feature of the encoding module:

MUe_n = RB(conv_{3×3}(cat(WE_n, maxp(MUe_{n−1}))))

where MUe_n represents the output feature of the encoding module at the nth stage.
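The residual block RB(·) above can be sketched functionally in NumPy. This is a hedged sketch: kernels are passed in as fixed arrays rather than learned, and the conv-ReLU-conv ordering with an identity skip is the standard convention assumed here.

```python
import numpy as np

def conv3x3(x, kernels):
    """Naive 'same' 3x3 convolution; kernels has shape (3, 3, C_in, C_out)."""
    h, w, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, w, kernels.shape[-1]))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.tensordot(xp[i:i + 3, j:j + 3, :], kernels, axes=3)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, k1, k2):
    """RB(x) = x + conv3x3(relu(conv3x3(x, k1)), k2).
    The skip connection guarantees that a block whose kernels are all
    zero reduces exactly to the identity mapping."""
    return x + conv3x3(relu(conv3x3(x, k1)), k2)
```

The identity-at-initialization property is precisely what lets deeper stacks avoid the degradation problem described in the text.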

C. AFSAM
In the encoding-decoding structure [42], the encoding part performs layer-by-layer downsampling using pooling layers, and the decoding part performs layer-by-layer upsampling using deconvolution. The spatial information in the original input image is gradually recovered along with the details in the image, and the resulting LR feature map is eventually mapped to a pixel-level HR image. To compensate for the information lost during downsampling in the encoding stage, the UNet [43] uses a concatenation operation between the encoding and decoding paths to fuse the feature maps at corresponding positions. The decoder is thus able to retain more of the HR detail information contained in the high-level feature maps during upsampling and to better recover the spatial detail information of the original image.
An asymmetric feature fusion module (AFFM) was proposed in [44] to improve image deblurring performance using multiscale features. This module transforms the multiscale encoder features to the same spatial dimension, concatenates them along the channel dimension, convolves the concatenated features, and finally concatenates the result with the decoder features. Although this method makes good use of multiscale features, simple concatenation cannot exploit them to full advantage. Multikernel networks [45] were proposed to extract scale-important features. This article proposes an asymmetric selective attention mechanism by combining multikernel networks with the asymmetric fusion module. Fig. 2 shows an example of the AFSAM based on MUe_0. First, deconvolution and convolution are used to change the spatial size and channel number of the input deep features to the same size as MUe_0, yielding three resized features Su_0, Su_1, and Su_2. Then, the three deep features are added element by element:

Su = Su_0 + Su_1 + Su_2

where Su denotes the element-by-element sum of the three deep features.
A process for extracting spatial and channel attention weights is then introduced so that the important features can be selected by applying spatial and channel attention mechanisms to the multiresolution features.
The global average pooling operation is first used to extract the global receptive field of Su, so that the channel attention weights of the deep feature Su can be obtained, with each feature channel abstracted into a single feature point:

gp(Su)

where gp(·) represents the average pooling operation over the spatial dimensions.
A two-layer multilayer perceptron is then used to perform a nonlinear feature transformation that models the correlation between feature maps:

Sv = B(MLP(gp(Su)))

where Sv is the channel attention weight and B represents batch normalization [46]. To obtain the spatial attention weights of the deep features, two deep features with unchanged spatial dimensions and a single channel are first obtained by average pooling and max pooling along the channel dimension of Su. The two deep features are then concatenated along the channel dimension:

Ss = cat(avgp(Su), maxp(Su))

where avgp(·) is the average pooling operation along the channel dimension. The spatial attention weight is obtained by applying a 7 × 7 convolution layer to Ss:

Sz = conv_{7×7}(Ss)

where Sz is the spatial attention weight. The obtained spatial and channel attention weights are multiplied to obtain the spatial-spectral attention weight:

Sw = Sv × Sz

Three 1 × 1 convolutions then yield three spatio-spectral attention weights:

Sw_i = conv_{1×1}(Sw), i = 0, 1, 2

where conv_{1×1} represents a convolution layer with kernel size 1 × 1.
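The two weight branches can be sketched in NumPy as follows. This is an illustrative sketch under assumptions: the MLP weights and 7 × 7 kernel are fixed inputs here, and a sigmoid is used to squash the raw responses (the text does not state the activation, so this is an assumption).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(su, w1, w2):
    """Channel weights: global average pooling over space (gp), then a
    two-layer MLP. su is H x W x C; w1 is (C, C//r); w2 is (C//r, C)."""
    g = su.mean(axis=(0, 1))                      # gp(.): one value per channel
    return sigmoid(np.maximum(g @ w1, 0.0) @ w2)  # shape (C,)

def spatial_attention(su, k7):
    """Spatial weights: channel-wise avg & max pooling, concatenation,
    then a 7x7 convolution. k7 has shape (7, 7, 2)."""
    ss = np.stack([su.mean(axis=-1), su.max(axis=-1)], axis=-1)  # H x W x 2
    h, w, _ = ss.shape
    sp = np.pad(ss, ((3, 3), (3, 3), (0, 0)))
    sz = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            sz[i, j] = np.sum(sp[i:i + 7, j:j + 7, :] * k7)
    return sigmoid(sz)                            # H x W, values in (0, 1)
```

Broadcasting the (C,) channel weights against the H × W spatial weights produces the per-position, per-channel spatial-spectral weight Sw described above.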
The softmax function is applied across the three spatio-spectral attention weights so that Sa + Sb + Sc = 1:

[Sa, Sb, Sc] = softmax([Sw_0, Sw_1, Sw_2])

where Sa, Sb, and Sc represent the spatial-spectral attention weights of Su_0, Su_1, and Su_2, respectively. Each attention term is obtained by multiplying a spatio-spectral attention weight with the corresponding deep feature, and the output of the module is obtained by adding the three attention terms element by element:

AFSAM = Sa × Su_0 + Sb × Su_1 + Sc × Su_2

AFSAM_1 is based on MUe_0: the spatial size and channel number of MUe_1 and MUe_2 are changed to match MUe_0 by deconvolution and convolution operations, and Su is then obtained by addition. Similarly, AFSAM_0 is based on MUe_1: the spatial size and channel number of MUe_0 and MUe_2 are changed to match MUe_1 by pooling or deconvolution together with convolution operations, and Su is obtained by addition.
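The cross-scale normalization can be sketched as a softmax over the scale axis, guaranteeing the weights sum to one at every position (a minimal sketch; the weight maps are assumed to share one shape after resizing):

```python
import numpy as np

def scale_softmax(weights):
    """Softmax across the scale axis: the weight maps (identical shapes)
    are normalized so they sum to 1 at every position (Sa + Sb + Sc = 1)."""
    w = np.stack(weights, axis=0)            # scales x H x W
    w = w - w.max(axis=0, keepdims=True)     # subtract max for stability
    e = np.exp(w)
    return list(e / e.sum(axis=0, keepdims=True))
```

A larger raw weight at a given position yields a proportionally larger share of that scale's feature in the AFSAM output.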

D. Spatio-Spectral Decoding Module
The AFSAM and the detail extraction decoding module are used to construct the spatial-spectral decoding module, and the fusion result is obtained through the ReLU activation function. The first output of the decoding module is obtained by processing MUe_2 with a convolution, a residual block, and a deconvolution. Afterward, the output AFSAM_{n−1} of the AFSAM, the output MUd_{n−1} of the spatio-spectral decoding module, and the output WD_{n−1} of the detail extraction decoding module are concatenated; a 3 × 3 convolution layer and residual blocks are then applied; and the nth output of the spatio-spectral decoding module is finally obtained by deconvolution:

MUd_n = dec(RB(conv_{3×3}(MUe_2))), n = 0
MUd_n = dec(RB(conv_{3×3}(cat(AFSAM_{n−1}, MUd_{n−1}, WD_{n−1})))), n = 1, 2

where MUd_n represents the output of the decoding module at the nth stage. After that, the output MUd_2 of the decoding and the output WD_2 of the detail extraction decoding are concatenated along the channel dimension, and features are extracted by convolution. Then, the up-sampled HS image Ȳ and the extracted features are added element by element, and the fused image is obtained through the ReLU activation function:

Ẑ = δ(Ȳ + conv_{3×3}(cat(MUd_2, WD_2)))

where Ẑ denotes the reconstructed image.

E. Loss Function
In our MDA-DUNet network, training is achieved by minimizing a reconstruction loss between the fused image Ẑ and the reference HR-HSI.

III. SIMULATED DATA EXPERIMENTS

A. Comparison Methods
The proposed method is compared with eight current mainstream HSI super-resolution methods, including three traditional methods, namely, coupled nonnegative matrix factorization (CNMF) [18], the subspace regularized method (HySure) [48], and coupled spectral unmixing (CSU) [19], and five deep learning methods, namely, the deep HSI sharpening method (DHSIS) [26], the deep blind iterative fusion network (DBIN) [29], CNN-Fus [30], the model-guided deep convolutional network (MoG-DCN) [31], and the dual U-Net (DUNet) [36]. For a fair comparison, the same data preprocessing is used in all methods; the deep learning-based methods are trained on the same training data using the code provided by their authors with the recommended parameters; and the same protocol is used for evaluating the experimental results of all methods.

B. Experimental Dataset
Four publicly available hyperspectral datasets are used to verify the performance of the proposed method: the Columbia Computer Vision Laboratory (CAVE) dataset, the Harvard dataset, the Interdisciplinary Computational Vision Lab (ICVL) dataset, and the Chikusei dataset. With a spatial size of 512 × 512, a band range of 400-700 nm, a wavelength interval of 10 nm, and 31 spectral bands, the CAVE dataset comprises 32 indoor HSIs. The Harvard dataset consists of 50 indoor and outdoor HSIs, with a spatial size of 1040 × 1392, a band range of 420-720 nm, a wavelength interval of 10 nm, and 31 spectral bands. The ICVL dataset comprises 201 HSIs with a spatial size of 1300 × 1392, a band range of 400-700 nm, a wavelength interval of 10 nm, and 31 spectral bands. For convenience, in the experiments, we crop the top-left 1024 × 1024 pixels from the Harvard and ICVL datasets for training and testing the proposed method. The Chikusei dataset contains airborne HSI taken by visible and near-infrared imaging sensors over agricultural and urban areas in Chikusei, Ibaraki Prefecture, Japan. This hyperspectral dataset has 128 bands in the spectral range of 363-1018 nm, and the scene consists of 2517 × 2335 pixels. After removing black borders from the spatial domain, the centered 2048 × 2048 pixels were cropped for use in our experiments. Partial images of the test sets of these four datasets are shown in Fig. 3.
The LR-HSI for the four datasets is acquired by applying an r × r Gaussian filter (mean 0, standard deviation 2) to each band of the reference image and then down-sampling every r pixels in the vertical and horizontal directions; that is, the decimation factor is r × r. The HR-MSI of the same scene is simulated by spectrally downsampling the HR-HSI using the spectral sampling matrix R, where R adopts the Nikon D700 camera response function. For the Chikusei dataset, given the diversity of hyperspectral sensors, the spectral response function R of the IKONOS satellite was used to generate the HR-MSI. At the same time, the observed images from these datasets are used as reference images. In the experiments, we performed spatial enhancements of factors 8, 16, and 32. The first 20 HSIs from the CAVE dataset are used for training, and the last 12 HSIs for testing. For the Harvard dataset, the first 30 HSIs are used for training, and the last 20 HSIs are used as testing images. For the ICVL dataset, 50 images are selected from the 201 for the experiments; the first 30 HSIs are used as the training set, and the next 20 HSIs as the testing set. Since deep learning needs a large amount of training data, blocks of these training HSIs are used as training samples for the proposed network.

TABLE I: Average MPSNR, RMSE, ERGAS, SAM, UIQI, and MSSIM results of the above methods on the CAVE dataset with Gaussian blur kernel and scaling factors of 8, 16, and 32.
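The degradation pipeline above can be sketched in NumPy. This is a hedged sketch: a separable Gaussian kernel with edge padding is assumed, and the exact filter support and boundary handling of the authors' protocol may differ.

```python
import numpy as np

def gaussian_kernel(size, sigma=2.0):
    """Separable 1-D Gaussian kernel of the given size (zero mean)."""
    x = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def simulate_lr_hsi(hr_hsi, r):
    """Blur each band of an H x W x B reference with an r x r Gaussian
    (std 2), then decimate every r pixels in both directions."""
    k = gaussian_kernel(r)
    pad = r // 2
    out = []
    for b in range(hr_hsi.shape[2]):
        band = np.pad(hr_hsi[:, :, b], pad, mode='edge')
        band = np.apply_along_axis(lambda m: np.convolve(m, k, 'same'), 0, band)
        band = np.apply_along_axis(lambda m: np.convolve(m, k, 'same'), 1, band)
        band = band[pad:-pad, pad:-pad] if pad else band
        out.append(band[::r, ::r])
    return np.stack(out, axis=-1)

def simulate_hr_msi(hr_hsi, R):
    """Spectral downsampling: R has shape (b_msi, B_hsi)."""
    return hr_hsi @ R.T
```

With r = 8 and a 3 × 31 camera-response matrix R, these two functions produce the LR-HSI / HR-MSI pair that the networks are trained on, with the original HR-HSI as the reference.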
In the case of upscaling factor 8, the LR-HSI block is 4 × 4 × 31, the HR-MSI block is 32 × 32 × 3, and the HR-HSI block is 32 × 32 × 31; for factor 16, the blocks are 2 × 2 × 31, 32 × 32 × 3, and 32 × 32 × 31, respectively; for factor 32, they are 1 × 1 × 31, 32 × 32 × 3, and 32 × 32 × 31, respectively.
In the Chikusei dataset, we selected a 1024 × 2048-pixel region from the top of the image for training, while cropping the rest into nine nonoverlapping 512 × 512 × 128 blocks as test data. For upscaling factor 8, the LR-HSI block is 4 × 4 × 128, the HR-MSI block is 32 × 32 × 4, and the HR-HSI block is 32 × 32 × 128; for factor 16, the blocks are 2 × 2 × 128, 32 × 32 × 4, and 32 × 32 × 128, respectively; for factor 32, they are 1 × 1 × 128, 32 × 32 × 4, and 32 × 32 × 128, respectively.
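Cutting the aligned training blocks listed above can be sketched as follows. The stride and the non-overlapping tiling are assumptions for illustration; the key invariant is that an hr_block × hr_block window in the HR images corresponds to an (hr_block / r)-sized window in the LR-HSI.

```python
import numpy as np

def extract_blocks(lr_hsi, hr_msi, hr_hsi, r, hr_block=32, stride=32):
    """Cut aligned (LR-HSI, HR-MSI, HR-HSI) training blocks.
    lr_hsi is (H//r) x (W//r) x B; hr_msi is H x W x b; hr_hsi is H x W x B."""
    lb = hr_block // r                       # matching LR block size
    samples = []
    for i in range(0, hr_hsi.shape[0] - hr_block + 1, stride):
        for j in range(0, hr_hsi.shape[1] - hr_block + 1, stride):
            samples.append((
                lr_hsi[i // r:i // r + lb, j // r:j // r + lb, :],
                hr_msi[i:i + hr_block, j:j + hr_block, :],
                hr_hsi[i:i + hr_block, j:j + hr_block, :],
            ))
    return samples
```

For factor 8 this yields the 4 × 4 × 31 / 32 × 32 × 3 / 32 × 32 × 31 triplets described in the text.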

C. Quantitative Indicators
This article uses six evaluation metrics to quantitatively evaluate the difference between the fused images and the reference images: the mean peak signal-to-noise ratio (MPSNR), spectral angle mapper (SAM) [49], mean structural similarity index (MSSIM) [50], erreur relative globale adimensionnelle de synthèse (ERGAS) [51], root mean square error (RMSE), and universal image quality index (UIQI) [52]. In contrast to MPSNR, MSSIM, and UIQI (larger is better), RMSE, ERGAS, and SAM are negatively correlated with image quality (smaller is better).
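Three of these metrics have compact definitions and can be sketched directly in NumPy (standard formulations are assumed; per-paper variants may differ in averaging conventions and peak value):

```python
import numpy as np

def rmse(ref, fused):
    """Root mean square error over all pixels and bands."""
    return float(np.sqrt(np.mean((ref - fused) ** 2)))

def mpsnr(ref, fused, peak=1.0):
    """Mean PSNR: PSNR is computed per band, then averaged across bands."""
    mses = np.mean((ref - fused) ** 2, axis=(0, 1))
    return float(np.mean(10 * np.log10(peak ** 2 / np.maximum(mses, 1e-12))))

def sam(ref, fused):
    """Spectral angle mapper: angle between spectral vectors at each pixel,
    averaged over pixels, reported in degrees."""
    dot = np.sum(ref * fused, axis=-1)
    denom = np.linalg.norm(ref, axis=-1) * np.linalg.norm(fused, axis=-1)
    cos = np.clip(dot / np.maximum(denom, 1e-12), -1.0, 1.0)
    return float(np.degrees(np.mean(np.arccos(cos))))
```

RMSE and MPSNR measure per-band intensity fidelity, while SAM measures spectral distortion independently of per-pixel brightness, which is why the two families are reported together.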

D. Experimental Results
Tables I-IV show the evaluation results of the different methods. For the CAVE, Harvard, ICVL, and Chikusei datasets, the different fusion methods are first evaluated on 10, 20, 20, and eight test images, respectively, and the means of the evaluation metrics are then calculated. According to Tables I-IV, the proposed method achieves the highest MPSNR, UIQI, and MSSIM and the lowest RMSE, ERGAS, and SAM. This means that the HSI reconstructed by the proposed method has a better spatial structure and less spectral distortion than the comparison methods.
1) Results on CAVE Dataset: Since the fused HR-HSIs shown in Figs. 4, 6, 8, and 10 are close to each other, visual heat maps of the mean squared error are also provided. The per-band SAM between the reference image and the fused image is shown in Fig. 14(a). As can be seen from the figure, our method has the lowest index on each band.

TABLE II: Average MPSNR, RMSE, ERGAS, SAM, UIQI, and MSSIM results of the above methods on the Harvard dataset with Gaussian blur kernel and scaling factors of 8, 16, and 32.

TABLE III: Average MPSNR, RMSE, ERGAS, SAM, UIQI, and MSSIM results of the above methods on the ICVL dataset with Gaussian blur kernel and scaling factors of 8, 16, and 32.
2) Results on Harvard Dataset: As can be seen from Fig. 7, the fusion results of CNMF, HySure, CSU, MoG-DCN, and the remaining compared methods are shown. As can be seen from these figures, the proposed method has the highest index on each band. The per-band SAM between the reference image and the fused image is shown in Fig. 14(b). According to this figure, the proposed method has the lowest index on each band.

TABLE IV: Average MPSNR, RMSE, ERGAS, SAM, UIQI, and MSSIM results of the above methods on the Chikusei dataset with Gaussian blur kernel and scaling factors of 8, 16, and 32.

5) Results on Different Noise Levels:
In fusion tasks, MSIs and HSIs are often affected by noise [53], so noise is added to the images in this article. When simulating the HSI and MSI from the HR-HSI, Gaussian noise is added to both, with the signal-to-noise ratio (SNR) set to 10, 20, and 30 dB. For each noise level, we calculated the evaluation metrics on the CAVE dataset and then averaged them, as shown in Table V.

Fig.: (b) CNMF [18]. (c) HySure [48]. (d) CSU [19]. (e) DHSIS [26]. (f) DBIN [29]. (g) CNN-Fus [30]. (h) MoG-DCN [31]. (i) DUNet [36]. (j) MDA-DUNet.

In terms of running time and FLOPs, the DUNet method achieves the best performance. The MDA-DUNet is nevertheless a lightweight framework with low running time and low FLOPs, demonstrating the effectiveness and efficiency of the proposed method.
From the above experimental results, it can be concluded that the proposed method has good spatial and spectral reconstruction capabilities on the simulated datasets.

1) Function of Each Component of the Proposed MDA-DUNet:
To analyze the role of the different parts of the MDA-DUNet, different variants are trained on the same training data, i.e., the CAVE dataset. Fig. 15(a)-(e) shows the structures of UNet-1 to UNet-5, respectively. The performances of these methods on the CAVE dataset are shown in Table VII. From the ablation experiments, we can see that the wavelet detail extraction module and the AFSAM proposed in this article benefit the network, and we therefore consider them effective.
2) Comparison of the Proposed AFSAM With the AFFM: From Table VIII, it can be seen that the performance of the AFSAM proposed in this article is improved by 1.6 dB in terms of MPSNR compared to the AFFM. This demonstrates the effectiveness of using spatial and spectral attention selection mechanisms to extract important features from asymmetric features.

IV. REAL DATA EXPERIMENT
In the following, the real dataset is used to further verify the proposed method's effectiveness. We use the LR-HSI acquired by the Hyperion sensor onboard the EO-1 satellite and the HR-MSI acquired by the Sentinel-2 satellite. The Hyperion HSI has a spectral range of 400-2500 nm, including 242 bands, with a spatial resolution of 30 m. After removing the water vapor and noise bands from the HSI, 89 bands remain. The Sentinel-2 (S2) MSI has a total of 13 bands, from which the four bands at 490, 560, 665, and 842 nm are selected as the HR-MSI with a spatial resolution of 10 m.
This section aims to fuse the 30-m HSI and 10-m MSI data to obtain a 10-m HSI. Since the proposed network is trained with supervised learning, 10-m HSI data are required as a reference image, but no 10-m HSI exists in the real scene. Therefore, we follow the strategy in [54] and [55]: the 30-m HSI and 10-m MSI are downsampled to form the network inputs, and the original 30-m HSI is used as the reference image for training; in the testing phase, the original 30-m HSI and 10-m MSI are fused to obtain the 10-m HSI data.
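The training-pair construction above (degrade both observations by the resolution ratio so the original HSI can serve as the reference) can be sketched as follows. This is a minimal sketch; the Gaussian-blur-then-decimate degradation and the sigma heuristic are common choices in Wald-protocol-style simulation, not necessarily the exact operators of [54] and [55]:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def wald_downsample(img, ratio, sigma=None):
    """Blur each band of an (H, W, B) image with a Gaussian kernel,
    then decimate by `ratio` (Wald-protocol-style degradation)."""
    sigma = ratio / 2.0 if sigma is None else sigma  # common heuristic
    blurred = np.stack(
        [gaussian_filter(img[..., b], sigma) for b in range(img.shape[-1])],
        axis=-1)
    return blurred[::ratio, ::ratio, :]

# Training pairs: degrade the observed 30-m HSI and 10-m MSI by the same
# ratio (3), so the original 30-m HSI can serve as the reference image.
# hsi_input = wald_downsample(hsi_30m, 3)   # 90-m HSI input
# msi_input = wald_downsample(msi_10m, 3)   # 30-m MSI input; target: hsi_30m
```

At test time the degradation is skipped and the original 30-m HSI and 10-m MSI are fed to the trained network directly.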
The Hyperion HSI has a spatial size of 2350 × 990, and the size of the S2 MSI is 7050 × 2970. For the experiment, 200 × 200 pixels from the HSI data and the corresponding 600 × 600 pixels from the MSI data are cropped as the test set. The rest of the data are used as the training dataset. The test images are shown in Fig. 16.
From Fig. 17, it can be seen that the fusion image generated by the CNMF method not only contains much noise but also shows considerable color distortion in the river region. In the image generated by the HySure method, the green region is distorted, most of the white area has disappeared, and other regions are also distorted. The green region of the fusion image generated by the CSU method is distorted into gray. The fusion result of the DHSIS method exhibits spectral distortion in the white area as well as striped noise. The image obtained by the DBIN method contains more noise. The fusion image of the CNN-Fus method retains only rough outlines with blurred details. There is a small amount of noise in the images fused by the MoG-DCN and DUNet methods. Compared with the other methods, the result obtained by the method proposed in this article has less distortion.
As shown in Fig. 19, the fused images generated by the CNMF and HySure methods are blurred and too bright. Spectral distortion occurs in CSU, with red roofs and a gray background. The colors of the images generated by the DHSIS and CNN-Fus methods are distorted, while the images generated by the DBIN and DUNet methods are blurry. The image generated by the MoG-DCN method has mesh noise, whereas the proposed method produces less spectral distortion and less noise.
From the above analysis, it can be concluded that the MDA-DUNet has better spatial and spectral reconstruction capabilities on real datasets.

V. CONCLUSION
In contrast to present CNN-based approaches, the proposed fusion technique helps the CNN sufficiently explore the spatial information of the HR-MSI and progressively incorporate the extracted spatial information into the latent image to rebuild the HR-HSI in a global-to-local pattern. This article proposes a dual-UNet to improve the spatial resolution of HSI. The first branch is the wavelet detail extraction module, an encoding-decoding structure whose main purpose is to extract the spatial features of the MSI at different scales. The other branch is the spatio-spectral fusion module, which aims to inject the features of the detail extraction network into the HSI to better reconstruct the HSI. Moreover, the network uses an asymmetric feature selective attention model to focus on essential features at different scales. The experimental results on simulated and real data indicate that the proposed model qualitatively and quantitatively outperforms the existing state-of-the-art methods.
Since the transformer [56], [57] can effectively mine the nonlocal correlation of images, it has been widely used for image restoration and classification. The literature [58], [59] also introduces the transformer to HSI fusion. However, these methods do not fully exploit the multiscale information of HSI and MSI. Therefore, by combining the transformer with the UNet, both the multiscale and nonlocal information of HSI and MSI could be exploited simultaneously, which we leave for future work.