A Deep-Shallow Fusion Network With Multidetail Extractor and Spectral Attention for Hyperspectral Pansharpening

Hyperspectral (HS) pansharpening aims at fusing a low-resolution HS image with a high-resolution panchromatic (PAN) image to obtain an HS image with both high spectral and spatial resolutions. However, existing HS pansharpening algorithms are mainly based on multispectral pansharpening approaches, which cannot adequately restore the spectral information in the continuous spectral bands and much broader spectral range of HS images, leading to spectral distortion and spatial blur. In this paper, we develop a new hyperspectral pansharpening network architecture (called Hyper-DSNet) to fully preserve latent spatial details and spectral fidelity via a deep-shallow fusion structure with a multidetail extractor and spectral attention. First, to solve the problem of spatial ambiguity, five types of high-pass filter templates are used to fully extract the spatial details of the PAN image, constructing a so-called multidetail extractor. Then, a multiscale convolution module and a deep-shallow fusion structure, which reduces parameters by decreasing the number of output channels as the network goes deeper, are applied sequentially. Finally, a spectral attention module is applied to preserve the rich spectral information of HS images. Visual and quantitative experiments on three commonly used simulated datasets and one full-resolution dataset demonstrate the effectiveness and robustness of the proposed Hyper-DSNet against recent state-of-the-art hyperspectral pansharpening techniques. Ablation studies and discussions further verify our contributions, e.g., better spectral preservation and spatial detail recovery.


I. INTRODUCTION
HYPERSPECTRAL (HS) images record hundreds of narrow continuous bands of the same scene simultaneously [3] and thus contain rich spectral information, making HS images widely applied in many fields such as military surveillance [4], environmental monitoring [5], mineral exploration [6], [7], agriculture [8], [9], and change detection in commercial products [10]. However, due to the physical limitations of sensors, expanding the spectral range also brings a reduction in spatial resolution. When compared to panchromatic (PAN) images, HS images typically have a lower spatial resolution, which may be insufficient in some practical applications where both high spatial and spectral resolutions are desired [11]. Therefore, HS pansharpening, aiming to merge the HS and PAN images to generate a fused HS image with both higher spectral and spatial resolution, is of great significance from many perspectives and has received great attention from the remote sensing and image processing communities [12].
In the recent decade, a number of data fusion techniques have been developed to improve the spatial resolution of HS imagery. They can be roughly classified into five categories: component substitution (CS), multiresolution analysis (MRA), Bayesian, matrix factorization, and deep learning (DL) based approaches.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The Bayesian approach depends on the usage of the posterior distribution of the required high-resolution HS (HRHS) image for the given low-resolution HS (LRHS) and PAN images [12]. Among them, Gaussian prior (Bayesian sparse) [31], Bayesian naive Gaussian prior (Bayesian naive) [32], and Bayesian HySure [33] are typical Bayesian approaches. Moreover, the matrix factorization based method first models the observed data with a signal subspace representation and then utilizes an optimization tool to factorize the related matrices; a representative method is the coupled nonnegative matrix factorization (CNMF) [34]. Besides, there are other typical variational methods that also belong to the variational optimization (VO) based methods [35], [36], [37], [38], [39], [40], [41]. The Bayesian and matrix factorization based methods are often constrained by their insufficient representation ability, and serious quality degradation may occur if the prior assumptions do not fit the situation. Furthermore, the majority of available fusion models are solved iteratively, which is time-consuming and inefficient.
Over recent years, DL-based methods, particularly convolutional neural network (CNN) based DL techniques, have achieved significant advances in image processing fields, e.g., image resolution reconstruction [42], [43], [44], [45], [46], [47], [48], [49], image classification [50], [51], [52], image denoising [53], image fusion [49], [54], [55], [56], [57], [58], [59], etc. Therefore, many methods [1], [2], [60], [61], [62], [63], [64], [65], [66], [67], [68], [69], [70], [71], [72], [73] based on DL have also been applied to solve the pansharpening problem. Dong et al. [42] originally introduce a shallow three-layer CNN (SRCNN) to learn the mapping between LR and HR patches for single image super-resolution. Based on the effective residual learning technique, Ledig et al. [43] employ a residual network to build a deeper network for image SR. In particular, CNNs have shown promising results not only in single image super-resolution but also in multispectral (MS) pansharpening. More recently, more researchers have made attempts to employ CNNs in HS pansharpening. Masi et al. [60] develop a three-layer CNN architecture for pansharpening, utilizing preinterpolated LR MS images stacked with PAN images as input. This is the first work utilizing a CNN for MS pansharpening, inspired by the SRCNN. Besides, Yang et al. [61] propose a deep network (PanNet) for the pansharpening problem whose main contributions are adding up-sampled MS images to the network output to propagate the spectral information and training the parameters in the high-pass filtering domain rather than the image domain. He et al. [1] introduce a spectrally predictive structure (HyperPNN) to strengthen the spectral prediction capability of the CNN for the task of HS pansharpening. Moreover, Xie et al. [62] handle HS pansharpening as a constrained minimization problem with extra priors learned by a CNN. Furthermore, He et al. [2] develop a new spectral-fidelity CNN architecture (HSpeNet) for HS pansharpening to keep the fidelity of the pansharpened image, focusing on the decomposability of HS details and introducing a spectral-fidelity loss. Recently, some works have achieved good results by directly using a no-reference loss without downsampling to simulate training data. Xiong et al. [74] first designed a loss function that does not need the reference fused image. Based on this, Li et al. [75] combined a CNN with a transformer block to design a CNN + pyramid transformer network with a no-reference loss.
However, some of these approaches ignore the particularity of remote sensing images, especially HS images, because all features extracted from the input images are treated identically, which restricts the ability to exploit relevant information selectively. Moreover, although the HS image covers a wider spectral range than the MS one, most networks are not specifically designed for spectral preservation; they fail to consider the importance and sensitivity of spectral information and are therefore prone to spectral distortion. As for PAN images, previous works often feed them directly into the network together with HS images or use a single fixed high-frequency template for preprocessing, which inevitably loses some spatial information. Furthermore, when it comes to a deep network structure, researchers often only pay attention to the results after multilayer convolution and ignore the importance of shallow features, even though the features extracted from the deep and shallow layers are different and the shallow features usually contain more texture details.
To tackle the problems mentioned above, we propose a so-called Hyper-DSNet, containing a deep-shallow fusion (DSF) structure with a multidetail extractor (MDE) and spectral attention (SA), for the task of HS pansharpening. The main contributions of the work are summarized as follows.
1) To address the challenge of spectral preservation in HS pansharpening, we use an SA module that generates different channel weights to preserve the HS image's rich and sensitive spectral information. This reduces spectral distortion and improves the network's spectral fidelity.
2) We give an MDE module that contains several distinct high-pass filtering templates for extracting different spatial details from the PAN image and injecting them into the network alongside the PAN image. Abundant and diverse high-frequency information with different characteristics promotes better use of the spatial information of the PAN image.
3) After passing through a multiscale convolution, the extracted features go into a specifically designed DSF module, which not only connects the deep and shallow features but also reduces network parameters, for better spatial information recovery.
Experimental results on three benchmark HS datasets demonstrate the superiority of the proposed Hyper-DSNet over recent state-of-the-art (SOTA) HS pansharpening techniques, as shown in Fig. 1. What is more, the best evaluation results at full resolution demonstrate the robustness of our method.

II. RELATED WORKS
In this section, a brief review of several DL-based methods for HS pansharpening, some works related to the proposed architecture and our motivation will be presented.

A. CNN-Based HS Pansharpening Framework
Recently, CNNs have been widely used in the field of image processing and computer vision. They are mainly designed for processing regular grids by sliding-window (kernel) convolution. In the training process, each parameter of the convolution kernel is continuously updated and optimized via forward and backward propagation to minimize the loss function. The main mathematical formulation for a CNN can be summarized as follows:
$$O_l = f(W_l * O_{l-1} + b_l) \tag{1}$$
where $*$ is the convolutional operation, $O_l$ represents the output feature map of the $l$th layer, $W_l$ and $b_l$ stand for the network weights and biases of this layer, respectively, and $f(\cdot)$ means an activation function. In the case of HS pansharpening, a CNN-based framework accepts the observed HS image and the PAN image as input and finally outputs an HRHS image. The PAN image with the size $L \times W$ is denoted as $P_0 \in \mathbb{R}^{L \times W \times 1}$, while the LRHS image with $l \times w$ pixels and $B$ spectral bands is denoted as $H_0 \in \mathbb{R}^{l \times w \times B}$. The expected HRHS output is $H \in \mathbb{R}^{L \times W \times B}$, and the fused output of the CNN-based framework can be written as $\hat{H}$ with the same dimension, i.e.,
$$\hat{H} = M(P_0, H_0; \theta) \tag{2}$$
where $M(\cdot; \theta)$ means the mapping from input to output with all parameters $\theta$ to be optimized. Finally, the network parameters of CNN-based HS pansharpening can be generally updated by minimizing the following $\ell_2$ loss function:
$$\mathcal{L}(\theta) = \| M(P_0, H_0; \theta) - H \|_2^2 \tag{3}$$
where $\| \cdot \|_2$ refers to the $\ell_2$ norm. Once $M(\cdot; \theta)$ is learned, newly observed PAN and HS images $P_0$ and $H_0$ can be fed into the mapping to obtain the predicted HRHS image.
Compared with the general MS pansharpening problem, HS pansharpening faces greater challenges. One is that the spectral range of the HS image [191 bands from 400 to 2400 nm for the Hyperspectral Digital Imagery Collection Experiment (HYDICE) sensor] is wider than that of the MS image (eight bands from 400 to 1040 nm for the WorldView-3 sensor), causing a larger spectral gap between the HS image and the PAN image; the other is that more details in continuous bands with high spectral resolution need to be reconstructed at the same time. These challenges make HS pansharpening more prone to problems such as spectral distortion and place higher requirements on the accuracy of the algorithm and its ability to predict and reconstruct the spectrum.
In view of the characteristics of HS images, many corresponding solutions have been proposed. For instance, HyperPNN [1] adds spectrally predictive layers to strengthen the spectral prediction ability of the network and composes a spectral prediction subnetwork and a spatial-spectral inference subnetwork. Both HSpeNet1 and HSpeNet2 [2] assume the decomposability of HS details and accordingly synthesize those details progressively. Specifically, HSpeNet1 reconstructs HS details from bottom level to top level, and HSpeNet2 synthesizes those details in a manner of band groupwise reconstruction. Besides, FusionNet [76] focuses attention on traditional CS and MRA frameworks and directly extracts details by differencing the single PAN image with each MS band.

B. Image Differential Operator
For the MS pansharpening task, Yang et al. [61] propose a deep network (called PanNet) that adds the up-sampled MS image to the network output and trains the parameters in the high-pass filtering domain rather than the image domain. However, they only use one predefined high-pass template, which may cause the loss of some detailed information. Based on this idea, we expect to use several different high-pass templates to extract more types of high-frequency details for a better fusion process. In this section, the high-pass image differential operators that we will use are first introduced.
The first one is the simplest first-order difference operator. For 2-D images, it contains differences in two directions, i.e., the x-axis and y-axis, which, in standard form, can be represented by the following kernels:
$$\begin{bmatrix} -1 & 1 \end{bmatrix}, \quad \begin{bmatrix} -1 \\ 1 \end{bmatrix}. \tag{4}$$
Also, we can use the following 2-D kernels to describe the differences along the two diagonal directions, i.e., the Roberts operator:
$$\begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}, \quad \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}. \tag{5}$$
However, this kind of operator is not very convenient in practice because there is no center pixel; thus, we intend to use 3 × 3 operators such as the Prewitt operator. When calculating the gradient at the center position, unlike the previous 2 × 2 operator, which uses the positive and negative deviations of only one pair of pixels, the 3 × 3 operator expands outward into three pairs to make it more sensitive to specific directions:
$$\begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix}, \quad \begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix}. \tag{6}$$
On this basis, the Sobel operator applies a weighting that gives the nearest pair of pixels a higher weight, which helps reduce the influence of noise; see the following operators:
$$\begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \quad \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}. \tag{7}$$
In addition, the Laplacian operator is a second-order differential operator that often appears in image enhancement. Compared with the first-order operators, the second-order differential has a stronger edge positioning ability and a better sharpening effect. The Laplacian operator is defined as the result of first performing the gradient operation $\nabla$ on the function $g$ and then the divergence operation $\nabla \cdot$, as follows:
$$\Delta g = \nabla^2 g = \nabla \cdot \nabla g \tag{8}$$
where $g$ is a twice-differentiable function and $\Delta$ is the Laplacian operator.
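To connect the continuous definition of the Laplacian to a filter template, the following short derivation may help. It assumes unit pixel spacing and central second differences, which is the usual textbook choice; the sign convention and any diagonal terms of the template actually used in the network are not specified here and may differ.

```latex
\begin{aligned}
\frac{\partial^2 g}{\partial x^2} &\approx g(x+1,y) - 2g(x,y) + g(x-1,y),\\
\frac{\partial^2 g}{\partial y^2} &\approx g(x,y+1) - 2g(x,y) + g(x,y-1),\\
\Delta g &\approx g(x+1,y) + g(x-1,y) + g(x,y+1) + g(x,y-1) - 4g(x,y)
\;\Longleftrightarrow\;
\begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}.
\end{aligned}
```

On a flat region all five samples are equal and the response is zero, while at an edge or isolated point the response is large, which is why this template acts as a high-pass filter.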

C. Motivations
As mentioned before, an HS pansharpening method must deal with two key issues, i.e., the substantial disparity in spectral coverage between the HS and PAN images, as well as the necessity to recover features in numerous continuous narrow bands simultaneously. Although the methods discussed above present numerous empirical approaches to address these challenges, some limitations have yet to be resolved.
1) The PAN image is an important basis for restoring spatial details, but it is usually used directly as the input of the network. Therefore, the high-frequency information in PAN images cannot be fully utilized. This motivates us to use multiple high-pass filters to construct a so-called MDE module for better detail extraction.
2) Few methods take into account the particularity of the many continuous spectral bands of HS images, which makes the spectral information critical and sensitive. Spectrum preservation operations should be specially designed, motivating us to utilize SA for spectral preservation.
3) A large number of spectral bands also brings an increase in the number of parameters (NoPs), which makes training more difficult. Additionally, low-level feature information needs to be valued more in the image fusion task. Therefore, a special module with reduced parameters can be appropriately designed and embedded in, or substituted into, other networks, which motivates us to develop a DSF module with a decreasing number of channels.
Taking these considerations together, we design our Hyper-DSNet, which will be introduced in detail in what follows.

III. PROPOSED METHODS
Based on the above analysis and motivation, we introduce each part of our proposed Hyper-DSNet in detail in this section, including the main architecture shown in Fig. 2 and the corresponding loss function.
In general, our Hyper-DSNet contains three submodules, that is, MDE module, DSF module, and SA module, which will be described one by one in the following sections.

A. Multidetail Extractor
In the field of image super-resolution, the reconstruction quality of high-frequency information (e.g., edges, contours, and textures) is crucial for the performance. Thus, we expect to extract and utilize the rich high-frequency details of the PAN image instead of training with the original image alone. We believe that this hand-crafted extraction step will bring better efficiency and results. Furthermore, PanNet [61] has noticed the importance of features in the high-pass filtering domain, but integrates only one type of high-pass filter, extracting a single level of detail. This inspires us to adopt a more comprehensive detail extraction method. We believe that multilevel high-pass information favors better performance, and thus propose the so-called MDE module.
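For the MDE module, the PAN image $P_0 \in \mathbb{R}^{L \times W \times 1}$ first goes through five high-pass operators to extract multilevel high-frequency information, which is then concatenated with the PAN image itself to construct the input feature. The five high-pass operators, i.e., the first-order difference operator, Roberts operator, Prewitt operator, Sobel operator, and Laplacian operator, have been shown in (4)-(8) in turn, and here we denote them as $\alpha_{dir}$, $\alpha_{robert}$, $\alpha_{prewitt}$, $\alpha_{sobel}$, and $\alpha_{laplacian}$, respectively; thus, the input high-pass feature $O_P \in \mathbb{R}^{L \times W \times 7}$ is as follows:
$$O_P = \mathrm{Cat}\big(\alpha_{dir}(P_0), \alpha_{robert}(P_0), \alpha_{prewitt}(P_0), \alpha_{sobel}(P_0), \alpha_{laplacian}(P_0), P_0\big) \tag{9}$$
where $\mathrm{Cat}(\cdot)$ denotes concatenation along the channel dimension and the first-order difference operator contributes its x- and y-direction responses. We show the results of applying these five high-pass operators to a PAN image in Fig. 3. As can be seen, each operator extracts significantly different high-frequency information; some results are smoother and some are more delicate, which meets our expectations.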
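The construction of the seven-channel MDE feature can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the kernels are the standard textbook forms, the use of a single direction for the Roberts, Prewitt, and Sobel operators is an assumption made so that the channel count comes out to seven, and `filter2d` and `multi_detail_extractor` are hypothetical helper names.

```python
import numpy as np

# Standard textbook kernels (assumed; the exact templates and signs of the
# operators referred to as (4)-(8) are not reproduced here).
KERNELS = {
    "dir_x": np.array([[-1.0, 1.0]]),
    "dir_y": np.array([[-1.0], [1.0]]),
    "roberts": np.array([[1.0, 0.0], [0.0, -1.0]]),
    "prewitt": np.array([[-1.0, 0.0, 1.0]] * 3),
    "sobel": np.array([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]]),
    "laplacian": np.array([[0.0, 1.0, 0.0], [1.0, -4.0, 1.0], [0.0, 1.0, 0.0]]),
}


def filter2d(image, kernel):
    """Cross-correlate `image` with `kernel`, zero-padded to keep the size."""
    kh, kw = kernel.shape
    top, left = (kh - 1) // 2, (kw - 1) // 2
    padded = np.pad(image, ((top, kh - 1 - top), (left, kw - 1 - left)))
    out = np.zeros(image.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + image.shape[0], j:j + image.shape[1]]
    return out


def multi_detail_extractor(pan):
    """Stack six high-pass responses with the PAN image: (L, W) -> (L, W, 7)."""
    details = [filter2d(pan, k) for k in KERNELS.values()]
    return np.stack(details + [pan], axis=-1)


pan = np.zeros((8, 8))
pan[:, 4:] = 1.0                      # toy PAN image with a vertical step edge
features = multi_detail_extractor(pan)
print(features.shape)                 # (8, 8, 7)
```

On this toy edge the horizontal-difference, Prewitt, Sobel, and Laplacian channels all respond at the step, but with different magnitudes and supports, which is the kind of diversity the MDE module is meant to expose to the network.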

B. Deep-Shallow Fusion Module
In this section, we mainly present the structure of detail extraction, which can be divided into two parts, i.e., the multiscale convolution module and the DSF module, whose goal is to extract effective and crucial spatial-spectral information. Before this, the HS image is first up-sampled to the same size as the PAN image by a polynomial kernel [77]. The output of the MDE module and the up-sampled HS image ($\mathrm{LRHS}^U \in \mathbb{R}^{L \times W \times B}$) are concatenated along the spectral dimension as the input of the structure of detail extraction.
The multiscale convolution module, first introduced in MSDCNN by Yuan et al. [66], is used here to extract multiscale information. Three convolution kernels of different sizes perform feature extraction with diverse receptive fields. This process can be formulated as
$$O_i = \delta\big(W_i * \mathrm{Cat}(O_P, \mathrm{LRHS}^U) + b_i\big), \quad i = 3, 5, 7, \qquad O_b = \mathrm{Cat}(O_3, O_5, O_7) \tag{10}$$
where $W_i$ and $b_i$, respectively, represent the kernel weights and biases, $O_i$ is the output of the corresponding convolutional layer, the subscript $i$ ($i = 3, 5, 7$) means the size of the convolutional kernel, $\mathrm{Cat}(\cdot)$ denotes concatenation along the channel dimension, $O_b$ is the output of this multiscale convolution module, and $\delta(\cdot)$ stands for the rectified linear unit (ReLU) activation function [77]. Here, the channel number of the output feature maps at each layer is set to 16 to reduce the number of parameters.
The multiscale convolution module is followed by a DSF module. In general, shallow convolutions mainly focus on local regions with a small receptive field, yielding fine-grained features that lack contextual information. In comparison, deep layers have larger receptive fields and obtain abstract features with semantic information. However, such features may be too abstract for a low-level vision task that focuses on pixel reconstruction instead of understanding the image content. Therefore, the shallow and deep features are both important in our HS pansharpening task. In previous methods, the result of the deep convolution is often used directly as the final output, which attends only to the deep information and may lose part of the low-level features. Here, the shallow and deep convolution results are concatenated to maintain both types of critical information at each step.
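We first focus on the first convolution layer, which can be viewed as a weighting of the outputs of the three preceding convolutions of different sizes. The subsequent deep convolutions can then be mathematically represented as
$$O_{b_i} = \delta\big(W_{3_i} * O_{b_{i-1}} + b_{3_i}\big) \tag{11}$$
where $O_{b_i}$ means the $i$th convolution's output (with $O_{b_0} = O_b$, the output of the multiscale convolution module), $W_{3_i}$ and $b_{3_i}$ represent the weights and bias of the $i$th 3 × 3 convolution in this part, and $\delta(\cdot)$ is the ReLU activation.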
As mentioned before, we concatenate the shallow and deep convolution results along the channel dimension to keep the useful key information of each step:
$$O_c = \mathrm{Cat}\big(O_{b_1}, O_{b_2}, O_{b_3}, O_{b_4}, O_{b_5}\big) \tag{12}$$
where $O_c$ represents the output of the DSF module and $\mathrm{Cat}(\cdot)$ denotes concatenation along the channel dimension. Furthermore, the low-level spatial information obtained by shallow convolution needs more attention in a pixelwise vision task. Shallow and deep convolution kernels with the same number of features bring a certain amount of information redundancy. Thus, more feature maps are allocated to the low-level information to avoid the redundancy problem, and the number of feature maps decreases as the convolutional layers go deeper. More specifically, the numbers of channels in the DSF module are set to [48, 32, 16, 8, 8] in order, as shown in Fig. 2, which will be further discussed in Section IV-E2.
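The parameter saving from the decreasing channel widths can be checked with a short count. The sketch below assumes that the five 3 × 3 convolutions are chained sequentially and that the first one receives 48 input channels (three multiscale branches of 16 channels each, concatenated); both are readings of the description above rather than confirmed implementation details, and the equal-width baseline is purely hypothetical.

```python
def conv_params(c_in, c_out, k=3):
    """Weights plus biases of one k x k convolution layer."""
    return k * k * c_in * c_out + c_out


def chain_params(c_in, widths, k=3):
    """Total parameters of a sequential stack of k x k convolutions."""
    total = 0
    for c_out in widths:
        total += conv_params(c_in, c_out, k)
        c_in = c_out
    return total


decreasing = [48, 32, 16, 8, 8]   # channel widths reported for the DSF module
constant = [48] * 5               # hypothetical equal-width baseline

dsf_params = chain_params(48, decreasing)
flat_params = chain_params(48, constant)
concat_channels = sum(decreasing)  # channels after concatenating all five outputs

print(dsf_params, flat_params, concat_channels)  # 41008 103920 112
```

Under these assumptions the decreasing schedule uses roughly 40% of the parameters of the equal-width stack while still exposing every intermediate feature map to the later stages through concatenation.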

C. Spectral Attention Module
Compared with other pansharpening tasks, the biggest challenge of HS pansharpening lies in the spectral information, which is rich and sensitive and places higher demands on the spectral fidelity of the fused HS image. For this reason, we argue that a dedicated module is needed to preserve the spectral information during super-resolution.
The feature maps extracted by the previous detail extraction module attach equal importance to each feature channel, ignoring the different degrees of spectral contribution. An attention mechanism is therefore needed to highlight the importance of different channels and remove information redundancy. Among the many attention mechanisms, we adopt the so-called SA module, which is based on the channel attention mechanism proposed in [78], for HS pansharpening because of its low cost and its ability to preserve spectra. The SA module is thus constructed to characterize the relationship among channels.
Specifically, the $\mathrm{LRHS}^U$ image serves as the input of the SA module. First, a global average pooling layer is adopted to aggregate spatial information, which outputs a vector
$$v_b = \frac{1}{L \times W} \sum_{i=1}^{L} \sum_{j=1}^{W} I_b(i, j) \tag{13}$$
where $I_b(i, j)$ is the value at the position $(i, j)$ in the $b$th channel of the $\mathrm{LRHS}^U$ image, and $v_b$ means the $b$th value of the output vector $v$. In this way, the global spectral information is squeezed into a $B$-length vector. To properly and fully capture channel-specific dependencies, we employ a simple gating mechanism with a sigmoid activation
$$s = \sigma\big(g(v, W)\big) \tag{14}$$
where the output $s \in \mathbb{R}^{B}$, $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ (with $C = B$ here) are the weights of two fully connected convolution layers with the kernel size of 1 × 1, and $\sigma$ means the sigmoid activation. In order to reduce the amount of calculation, the number of channels is first reduced with a ratio $r$ and then expanded back to $B$ through two consecutive layers of convolution:
$$g(v, W) = W_2\, \delta(W_1 v) \tag{15}$$
where $\delta(\cdot)$ is the ReLU activation. By applying this SA module, the final output is obtained by rescaling the extracted detail features with $s$ and adding the initial $\mathrm{LRHS}^U$ as the residual part through a skip connection. The target ground truth (GT) can be seen as the $\mathrm{LRHS}^U$ image with more detailed information added. As a result, employing the initial $\mathrm{LRHS}^U$ as a skip connection can preserve its original spectral information, avoid overfitting, prevent degradation as the network depth increases, and speed up convergence, allowing the network to train better and more quickly. This design is inspired by He et al. [79] and has been validated by other pansharpening methods [61], [77].
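The squeeze, gate, and rescale steps of the SA module can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' code: the ReLU between the two layers follows the usual channel-attention design, `details` is a stand-in for detail features that have already been mapped to B bands (the layer performing that mapping is not described above), and the weights are random placeholders for learned parameters.

```python
import numpy as np


def spectral_attention(lrhs_up, w1, w2):
    """Channel weights from global average pooling and a two-layer gate."""
    v = lrhs_up.mean(axis=(0, 1))                 # global average pooling -> (B,)
    hidden = np.maximum(w1 @ v, 0.0)              # reduce to B // r channels, ReLU
    return 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # expand back to B, sigmoid


def fuse(details, lrhs_up, w1, w2):
    """Rescale the detail features per band and add the up-sampled HS image."""
    s = spectral_attention(lrhs_up, w1, w2)
    return details * s + lrhs_up                  # s broadcasts over (L, W, B)


rng = np.random.default_rng(0)
L, W, B, r = 8, 8, 16, 4                          # toy sizes chosen for illustration
lrhs_up = rng.random((L, W, B))
details = rng.standard_normal((L, W, B))          # stand-in for B-band detail features
w1 = rng.standard_normal((B // r, B)) * 0.1       # placeholders for learned weights
w2 = rng.standard_normal((B, B // r)) * 0.1
fused = fuse(details, lrhs_up, w1, w2)
print(fused.shape)                                # (8, 8, 16)
```

Because the weights depend only on the band-wise means of the up-sampled HS image, each band of the detail features is scaled by a single scalar in (0, 1), and the skip connection guarantees that the output reduces to the up-sampled HS image when the details vanish.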

D. Loss Function
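To measure the difference between the network output and the GT, we adopt the $\ell_1$ loss function to optimize the proposed network in the training process. The loss function can be expressed as follows:
$$\mathrm{Loss} = \frac{1}{N} \sum_{n=1}^{N} \big\| \hat{H}^{(n)} - GT^{(n)} \big\|_1 \tag{16}$$
where $\hat{H}^{(n)}$ is the network output for the $n$th training sample, $GT^{(n)}$ is the corresponding GT image, $N$ represents the number of training samples, and $\| \cdot \|_1$ means the $\ell_1$ norm.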
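As a small numerical illustration of this loss, the sketch below averages the per-sample $\ell_1$ norm over a batch. The function name `l1_loss` and the batch layout (N, L, W, B) are assumptions for the example; note that PyTorch's `torch.nn.L1Loss` with its default `reduction="mean"` averages over all elements instead, which differs only by a constant factor.

```python
import numpy as np


def l1_loss(pred, gt):
    """Mean over the N samples of the l1 norm of (pred - gt)."""
    diff = np.abs(pred - gt)
    return diff.reshape(diff.shape[0], -1).sum(axis=1).mean()


pred = np.ones((2, 2, 2, 3))    # toy batch: N = 2 samples of size 2 x 2 x 3
gt = np.zeros((2, 2, 2, 3))
print(l1_loss(pred, gt))        # 12.0
```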

IV. EXPERIMENTAL RESULTS AND ANALYSIS
This section is devoted to experimental evaluation to demonstrate the effectiveness of the given Hyper-DSNet. The proposed method will be compared with some recent SOTA HS pansharpening approaches on benchmark datasets obtained by different sensors.

A. Experimental Setup
This section introduces the details of the experimental setup, including the datasets, data simulation, experimental platform, and hyperparameter settings. To evaluate the effectiveness of our Hyper-DSNet for remote sensing pansharpening, a series of experiments are conducted on three simulated HS datasets, i.e., Washington DC, Pavia Center, and Botswana, and one full-resolution dataset, i.e., FR1, which are described in detail as follows. The various features of the datasets are displayed in Table I. Following [80], the original HS images from the three datasets serve as the reference (REF) images, and the LRHS images are obtained by applying a Gaussian blur and then downsampling the result by selecting one out of every four pixels in both the horizontal and vertical directions. The simulated PAN image is obtained by multiplying the reference HS image, on the left, by a suitably chosen spectral response vector. Next, the down-sampled LRHS image and the simulated PAN image are fed to the various HS pansharpening methods to obtain the estimated HRHS images. Finally, the estimated HS images are compared with the original HS images to obtain quantitative quality measures. The specific simulation process follows the MATLAB toolbox of Loncan et al. [12].
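The simulation procedure described above can be sketched as follows. This is a hedged NumPy illustration, not the toolbox code: the Gaussian kernel size and standard deviation, the edge-replicated padding, and the flat spectral response are all assumptions made for the demo, and `gaussian_kernel`, `blur_band`, and `simulate` are hypothetical helper names.

```python
import numpy as np


def gaussian_kernel(size=9, sigma=2.0):
    """Normalized 2-D Gaussian kernel (size and sigma are illustrative)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()


def blur_band(band, kernel):
    """Blur one band with edge-replicated padding, keeping its size."""
    half = kernel.shape[0] // 2
    padded = np.pad(band, half, mode="edge")
    out = np.zeros(band.shape, dtype=float)
    for i in range(kernel.shape[0]):
        for j in range(kernel.shape[1]):
            out += kernel[i, j] * padded[i:i + band.shape[0], j:j + band.shape[1]]
    return out


def simulate(ref, response, ratio=4):
    """Return (LRHS, PAN) simulated from a reference cube of shape (L, W, B)."""
    kernel = gaussian_kernel()
    bands = [blur_band(ref[..., b], kernel) for b in range(ref.shape[-1])]
    lrhs = np.stack(bands, axis=-1)[::ratio, ::ratio, :]  # keep 1 pixel out of `ratio`
    pan = ref @ response                                  # weighted sum over bands
    return lrhs, pan


rng = np.random.default_rng(0)
ref = rng.random((64, 64, 10))
response = np.full(10, 1.0 / 10)        # flat spectral response, assumed for the demo
lrhs, pan = simulate(ref, response)
print(lrhs.shape, pan.shape)            # (16, 16, 10) (64, 64)
```

The reference cube is kept untouched and serves as the GT, so any fused result produced from `lrhs` and `pan` can be scored against it with reference-based metrics.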
For fair comparisons, all DL-based methods are retrained in Python 3.8.5 with PyTorch 1.9.0 on a Linux system with an NVIDIA GeForce RTX 3080 Ti GPU. We train our Hyper-DSNet for 2000 epochs with an initial learning rate of 0.0001. We use the Adam [81] optimizer to minimize the $\ell_1$ loss function (16), and the weight_decay is set to $1 \times 10^{-7}$. Training our network takes around 6 h.
Several quantitative assessments are carried out to evaluate different HS pansharpening methods with reference images. In this work, we consider six of the most often used metrics to assess the quality of the results, including cross-correlation (CC), spectral angle mapper (SAM), root mean squared error (RMSE), erreur relative globale adimensionnelle de synthèse (ERGAS) [12], structural similarity index (SSIM) [82], and peak signal-to-noise ratio (PSNR) [82]. Among them, CC, SSIM, and PSNR measure spatial distortion; CC, for instance, characterizes the geometric distortion by averaging the cross-correlation over all image bands. SAM is a spectral index defined as the angle between the spectral vectors of the reference and fused images. As global indices, RMSE and ERGAS are computed from the $\ell_2$ norm of the difference between the estimated and reference images, aiming to evaluate the spatial fidelity.
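For concreteness, the reference-based metrics other than SSIM can be sketched in a few lines of NumPy. These follow the commonly used definitions and are not taken from the evaluation code behind the reported tables; details such as the peak value in PSNR, whether PSNR is computed globally or per band, and the unit of SAM are assumptions.

```python
import numpy as np


def sam_degrees(ref, fused, eps=1e-12):
    """Mean spectral angle (degrees) between per-pixel spectra."""
    dot = (ref * fused).sum(axis=-1)
    norms = np.linalg.norm(ref, axis=-1) * np.linalg.norm(fused, axis=-1)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)
    return np.degrees(np.arccos(cos)).mean()


def rmse(ref, fused):
    """Root mean squared error over all pixels and bands."""
    return np.sqrt(np.mean((ref - fused) ** 2))


def psnr(ref, fused, peak=1.0):
    """Global PSNR in dB for data with the given peak value (assumed 1.0)."""
    return 10.0 * np.log10(peak ** 2 / np.mean((ref - fused) ** 2))


def cc(ref, fused):
    """Cross-correlation coefficient averaged over bands."""
    r = ref.reshape(-1, ref.shape[-1])
    f = fused.reshape(-1, fused.shape[-1])
    r = r - r.mean(axis=0)
    f = f - f.mean(axis=0)
    denom = np.sqrt((r ** 2).sum(axis=0) * (f ** 2).sum(axis=0))
    return np.mean((r * f).sum(axis=0) / denom)


def ergas(ref, fused, ratio=4):
    """ERGAS = (100 / ratio) * sqrt(mean over bands of MSE_b / mean_b^2)."""
    band_mse = np.mean((ref - fused) ** 2, axis=(0, 1))
    band_mean = np.mean(ref, axis=(0, 1))
    return 100.0 / ratio * np.sqrt(np.mean(band_mse / band_mean ** 2))


rng = np.random.default_rng(0)
ref = rng.random((16, 16, 8)) + 0.5
fused = ref + 0.1                                  # constant offset of 0.1
print(round(rmse(ref, fused), 3), round(psnr(ref, fused), 1))  # 0.1 20.0
```

Higher is better for CC and PSNR, while lower is better for SAM, RMSE, and ERGAS. Note that SAM is insensitive to a per-pixel scaling of the spectrum, which is why it is usually reported together with an amplitude-sensitive index such as RMSE.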
In addition, to evaluate the performance of all involved methods at full resolution, the QNR, $D_\lambda$, and $D_s$ [83], [84] indexes are applied. The QNR has an ideal value of 1, whereas $D_\lambda$ and $D_s$ have an ideal value of 0.
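For reference, the three full-resolution indexes are related through the commonly used definition of QNR shown below. The exponents $\alpha$ and $\beta$ are weighting parameters that are usually both set to 1; whether the experiments here use exactly this setting is an assumption.

```latex
\mathrm{QNR} = (1 - D_\lambda)^{\alpha} \, (1 - D_s)^{\beta}, \qquad \alpha = \beta = 1 \ \text{(typical choice)}
```

Since $D_\lambda$ (spectral distortion) and $D_s$ (spatial distortion) both lie in $[0, 1]$, QNR reaches its ideal value of 1 only when both distortions vanish.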

C. Experimental Results on Reduced-Resolution Datasets
This section tests the performance of all compared approaches on the three datasets simulated as described before.
1) Dataset of Washington DC Mall: WDC dataset has 191 channels and the test data consists of four 128 × 128 images clipped from the original image; the rest is used to train the network parameters. For the training part, the original PAN and HS images are divided into 921 small patch pairs of 64 × 64 PAN patches and 16 × 16 HS patches, respectively. For validation, we leave 103 patch pairs from the simulated patches.
For the testing, Table II shows the average quantitative assessment of different methods on the HYDICE WDC dataset. The best performance is shown in bold and the second best is underlined. As shown in Table II, all DL-based methods show better results than traditional techniques, with a particularly large margin in the SAM, RMSE, and ERGAS metrics. Moreover, our method also surpasses the other four DL-based methods in all indicators, which verifies the effectiveness of our spectral preservation and the better extraction of spatial details.
To show a visual comparison of all methods, Fig. 4 shows the pansharpened outcomes as pseudocolor images obtained by selecting three bands from all 191 image bands. It can be seen that the result of our Hyper-DSNet is closer to the GT image, especially at the edges and corners of the building in the enlarged part. The corresponding residual maps are shown in Fig. 5.
In the magnified region, bright spots can be seen clearly for most traditional methods, while those of the other DL-based methods are obviously reduced but still visibly present. Our method shows more dark blue and less yellow, which means that our error map is closer to 0. In addition, to perform band-dependent quality evaluations of the fused HS images on the WDC dataset, the CC and PSNR curves as functions of the spectral bands for different methods are presented in Fig. 6. Our results, in dark red, show better performance overall.
2) Pavia Center Dataset: Pavia Center Dataset has 102 channels and the test data consists of two 400 × 400 images clipped from the original image; the rest is used to train the network parameters. For the training part, the big PAN and REF images are divided into 1512 small patches of 64 × 64 with overlapping. For validation, 168 patch pairs are left from the simulated patches.
For the testing, Table III lists the average quantitative assessment of different methods on the Pavia dataset. As shown, our Hyper-DSNet takes first place under the CC, SAM, RMSE, and ERGAS metrics. For visual inspection, Fig. 7 shows the HS pansharpened outcomes as pseudocolor images for different methods. In the enlarged green part, details such as houses and roofs are more clearly restored by our Hyper-DSNet.
For the testing on the Botswana dataset, Table IV and Figs. 9 and 10, respectively, display the results of average quantitative evaluation, visual presentation, and residual analysis. On this different dataset and sensor, we still achieve the best results compared to other methods, further confirming the reliability and generality of our proposed method. In Fig. 9, judging first from the overall color perception, the traditional methods show an obvious color difference compared to the GT image. Near the pink ripple, the red of our method is more vivid and the color contrast is more obvious, which is closer to the GT. At the same time, there are almost no bright spots in our error map in Fig. 10.
In order to further evaluate the spectral preservation capability of different HS pansharpening methods, the spectral difference curves of four random pixels in the previous three datasets are shown in Fig. 11. Our Hyper-DSNet provides lower spectral differences in most bands, which also shows that our algorithm can better reconstruct the details affected by the large spectral gap.

D. Experimental Results on Full-Resolution Datasets
We also test the performance of all compared approaches on the full-resolution dataset FR1. The dataset FR1 has 69 channels and the test data consists of two images (240 × 240 for HS and 60 × 60 for PAN) clipped from the original image, while the rest is used for training after the downsampling simulation mentioned earlier. Similarly, we divide the training part into 734 small patch pairs of 60 × 60 PAN patches and 10 × 10 HS patches, respectively. For validation, we leave 82 patch pairs from the simulated patches.
The quantitative results in terms of all indicators are reported in Table V. Furthermore, the visual comparison in Fig. 12 shows the advantages and disadvantages of each strategy more intuitively. It can be seen that our proposed Hyper-DSNet achieves better results at full resolution, which also shows the effectiveness and robustness of the proposed method.

E. Ablation
1) Multidetail Extractor Module: Hyper-DSNet denotes our proposed method, and the suffix v0 means that only the PAN image is concatenated, as in most common methods. For the first six suffixes a1-a6 of Table VI, we removed one input at a time from the original set, i.e., the Dir-xy, Roberts, Prewitt, Sobel, and Laplacian operators and the PAN image, to test their effects. Furthermore, we test the cases in which only one operator is selected at a time, while keeping the same dimension as the original for fairness, or in which no high-pass module is used at all; these are defined as suffixes b1-b6.
As can be seen from the results in Table VII, all results with high-pass templates are much better than those without a high-pass operator. Hyper-DSNet-v0, the most primitive method without any high-pass template, has the worst ERGAS and the second-worst CC value. Among all methods that add high-pass operators, Hyper-DSNet achieves the best results. It is worth noting that the evaluation indicators also degrade slightly in the a6 group, which omits the PAN image. In addition, the training losses with and without the high-pass module are compared in Fig. 13. The proposed method has a lower loss and converges faster.
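To make the operator set used in this ablation concrete, the following is a minimal NumPy sketch of a multidetail extractor that filters the PAN image with several high-pass templates and stacks the responses with the PAN image itself. The Roberts, Prewitt, Sobel, and Laplacian kernels are the standard textbook templates; the Dir-xy kernels are assumed here to be simple first differences along x and y, and the kernel values, channel order, and edge padding are illustrative assumptions rather than the exact templates of the paper.

```python
import numpy as np

# Textbook high-pass templates; the "dir_*" entries are an assumed
# first-difference form of the Dir-xy operator.
KERNELS = {
    "dir_x": np.array([[-1.0, 1.0]]),
    "dir_y": np.array([[-1.0], [1.0]]),
    "roberts_1": np.array([[1.0, 0.0], [0.0, -1.0]]),
    "roberts_2": np.array([[0.0, 1.0], [-1.0, 0.0]]),
    "prewitt_x": np.array([[-1.0, 0.0, 1.0]] * 3),
    "prewitt_y": np.array([[-1.0, 0.0, 1.0]] * 3).T,
    "sobel_x": np.array([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]]),
    "sobel_y": np.array([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]]).T,
    "laplacian": np.array([[0.0, 1.0, 0.0], [1.0, -4.0, 1.0], [0.0, 1.0, 0.0]]),
}
NAMES = list(KERNELS)

def filter2d(img, kernel):
    """Same-size cross-correlation with edge-replicated borders."""
    kh, kw = kernel.shape
    top, left = (kh - 1) // 2, (kw - 1) // 2
    padded = np.pad(img, ((top, kh - 1 - top), (left, kw - 1 - left)), mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out

def multi_detail_extractor(pan, use_pan=True):
    """Stack the high-pass responses of `pan` (H, W) into (channels, H, W)."""
    pan = np.asarray(pan, dtype=float)
    maps = [filter2d(pan, KERNELS[name]) for name in NAMES]
    if use_pan:  # setting use_pan=False mimics dropping the PAN channel
        maps.append(pan)
    return np.stack(maps, axis=0)
```

Because every template sums to zero, flat regions give no response and only edges and textures survive, which is the spatial detail that the network is then expected to inject into the HS image.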
2) Deep-Shallow Fusion Module: To evaluate the advantage of the DSF module, we replace this module with the forms shown in Fig. 14. We first keep only the shallow layers or only the deep layers to demonstrate the advantage of combining them. Furthermore, we believe that in the fusion task, more attention should be paid to shallow texture information than to deep semantics. Therefore, we set up three further experiments: the same number of feature maps in every layer, more shallow layers, and more deep layers. We summarize the results and the corresponding module parameters in Table VIII.
It is obvious that the last three variants, which contain both deep and shallow layers, perform better than the first two, which means that the deep and shallow layers both carry information we need. It is also noticeable that the SAM and RMSE metrics deteriorate significantly in the shallow-only network. In addition, the variant with more shallow layers performs better than the one with more deep layers, which also suggests that the detailed information in the shallow layers may be more important. Compared with using the same number of feature maps in every layer, decreasing the number of channels with depth not only reduces the parameters but also maintains a fairly good result.
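The parameter saving from decreasing the channel count with depth follows directly from the size of a convolution layer, which has k·k·C_in·C_out weights plus C_out biases. The short sketch below compares a constant-width stack with a decreasing-width stack; the channel widths (64, 48, 32, 16) and the 3 × 3 kernel size are hypothetical values chosen only for illustration and are not the settings of Table VIII.

```python
def conv_params(c_in, c_out, k=3, bias=True):
    """Number of learnable parameters of one k x k convolution layer."""
    return k * k * c_in * c_out + (c_out if bias else 0)

def stack_params(c_in, widths, k=3):
    """Total parameters of a plain stack of convolutions with given output widths."""
    total = 0
    for c_out in widths:
        total += conv_params(c_in, c_out, k)
        c_in = c_out  # the next layer consumes this layer's output
    return total

# Hypothetical widths, for illustration only.
constant = stack_params(64, [64, 64, 64, 64])
decreasing = stack_params(64, [64, 48, 32, 16])
print(constant, decreasing)
```

Since each layer's cost is proportional to the product of its input and output widths, shrinking the deeper layers reduces the total roughly quadratically, while the shallow layers keep their full width.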

3) Multiscale Convolution Module and SA Module:
Finally, we discuss the role of the multiscale convolution module and the SA module. On the basis of the original network, we set up two sets of experiments, each removing the corresponding part. For example, the multiscale convolution module is replaced with a general 3 × 3 convolution with the same number of feature maps. The results are shown in Table IX, which indicates that adding these two modules, especially the SA module, improves the performance.
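As a rough illustration of what a spectral attention step does, the sketch below reweights each band of a feature cube with a gate computed from its global average. It assumes a squeeze-and-excitation-style design (global average pooling, two fully connected layers, sigmoid gating), which is one common way to realize channel-wise attention; the exact SA architecture is not specified in this section, and the function name, weight shapes, and reduction size are hypothetical.

```python
import numpy as np

def spectral_attention(x, w1, w2):
    """Channel-wise reweighting of a (C, H, W) feature cube.

    w1: (C_hidden, C) and w2: (C, C_hidden) are the weights of two
    fully connected layers (assumed design, biases omitted).
    Returns the reweighted cube and the per-band gates in (0, 1).
    """
    x = np.asarray(x, dtype=float)
    squeeze = x.mean(axis=(1, 2))                 # global average per band, (C,)
    hidden = np.maximum(w1 @ squeeze, 0.0)        # ReLU
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # sigmoid, (C,)
    return x * gates[:, None, None], gates
```

In a trained network `w1` and `w2` would be learned, so that bands carrying more useful spectral information receive gates closer to 1.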

F. Parameter Numbers
The NoPs of all the compared DL-based methods and the corresponding test times on the Pavia dataset are presented in Table X. It can be seen that the number of parameters of Hyper-DSNet is not much larger than that of the other compared DL-based methods, yet it achieves the best results, which suggests that our method can fully mine and utilize the available information.
In this article, we propose a new framework named Hyper-DSNet to address two challenges in HS pansharpening, i.e., the spectral distortion caused by the wider spectral range between the HS and PAN images, and the spatial information loss in continuous spectral bands. Specifically, our Hyper-DSNet mainly consists of three parts, i.e., the MDE module, the DSF module, and the SA module. Extensive experiments on three benchmark datasets and one full-resolution dataset acquired by multiple sensors demonstrate that our method achieves both good quantitative indicators and good visual outcomes, surpassing the previous traditional and SOTA CNN-based techniques. We examined in particular the importance of the MDE module and the DSF module, which can also be embedded in other networks. In addition, sufficient ablation studies are given to verify the effectiveness of multiple high-pass operators in the task of HS pansharpening.