Two-Stage Pansharpening Based on Multi-Level Detail Injection Network

Pansharpening is an effective technique for obtaining high-resolution multispectral (HRMS) images by fusing low-resolution multispectral (LRMS) images with high-resolution panchromatic (PAN) images. With the rapid development of deep learning, several deep learning-based pansharpening methods have been proposed. Although they greatly improve the fused images, two problems remain: spectral preservation is not good enough, and the details of the fused images are not rich enough. To address these problems, a two-stage pansharpening method based on convolutional neural networks (CNNs) is proposed. In the first stage, image super-resolution with residual blocks is used to enhance the LRMS image. To preserve spectra, a new spectral loss function inspired by the spectral angle mapper (SAM) index is proposed. The second stage is the fusion stage, in which a detail injection block is proposed by combining detail injection with a CNN. Experiments on WorldView2 and GeoEye1 images demonstrate that, compared with existing methods, our fused images present more spatial details and better spectra.


I. INTRODUCTION
Remote sensing images are widely used in fields such as classification and detection. Panchromatic (PAN) and multispectral (MS) images are acquired simultaneously by satellites such as WorldView2 (WV2), WorldView3 (WV3), QuickBird (QB), and GeoEye1 (GE1). Owing to practical acquisition constraints, the PAN image has high spatial resolution but contains little spectral information. Although the MS image carries a large amount of spectral information, its spatial resolution is usually only one fourth of that of the corresponding PAN image. Neither the PAN image nor the MS image alone can meet practical needs, so high-resolution multispectral (HRMS) images are required. Pansharpening aims to provide HRMS images by fusing low-resolution multispectral (LRMS) images and PAN images [1].
In the past few decades, many pansharpening methods have been proposed. There are four representative categories: component substitution (CS) [2], multi-resolution analysis (MRA) [3], sparse representation and deep learning.
The associate editor coordinating the review of this manuscript and approving it for publication was Inês Domingues .
CS-based methods first transform the MS image into another space in which the spatial structure and the spectral information are separated into different components. The component carrying the spatial structure of the transformed MS image is then replaced by the PAN image. Classic CS-based methods include the intensity-hue-saturation (IHS) fusion method [4], the principal component analysis (PCA) fusion method [5], and the Gram-Schmidt (GS) fusion method [6]. CS-based methods can obtain rich details, but the spectral distortion is usually serious. The core of MRA-based methods is multi-scale detail extraction and injection. In general, the spatial details are first extracted from the PAN image by MRA and then injected into the up-sampled multispectral (UPMS) image. Widely used MRA tools include the Laplacian pyramid [8], the wavelet transform [9]-[11], the curvelet transform [12], the non-subsampled contourlet transform [13], [14], the shearlet transform [15], and the non-subsampled shearlet transform (NSST) [42]. Compared with CS-based methods, MRA-based methods preserve spectra better. To combine the advantages of different pansharpening methods, some hybrid approaches [16], [17], [49] have been proposed. In [49], Kwan et al. proposed a fusion strategy for WV3 satellite images and a new no-reference image index, GQNR, which combines the remote sensing image index $D_\lambda$ and the natural image quality index (NIQE).
In the past few years, sparse representation has drawn significant research interest [32]. Its core idea is that an image can be represented as a linear combination of the fewest possible atoms from an over-complete dictionary. Several pansharpening methods based on sparse representation were proposed in [32]-[36]. Ayas et al. took texture information into account in the fusion process, which better protects spectra and details [35]. Gogineni et al. proposed a multi-scale learned dictionary for the high-frequency component [36]. Although pansharpening methods based on sparse representation achieve good performance, they are usually time-consuming.
In recent years, remote sensing fusion methods based on convolutional neural networks (CNNs) have received considerable attention. Several CNN-based pansharpening methods have been proposed, e.g., pansharpening by convolutional neural networks (PNN) [19], Target-PNN [20], the multi-scale and multi-depth network (MSDCNN) [21], the deep network for pansharpening (PanNet) [22], remote sensing image fusion with a deep convolutional neural network (RSIFNN) [23], and convolutional autoencoder-based MS fusion (CAE) [41]. In [48], a CNN is used to estimate the degradation blur kernel of MS images, which improves the adaptivity of the pansharpening method. Although MS/hyperspectral (HS) image fusion is a relatively new topic in remote sensing, a number of relevant works have been published. In [43], a 3-D CNN was used to fuse MS and HS images. A two-branch network was proposed in [44]: before the MS and HS images are fused, the two branches extract spectral and spatial information from the HS and MS images, respectively. Han et al. proposed an HS and MS image fusion method that combines clustering and multi-branch neural networks [50]. Super-resolution and hybrid color mapping were combined to fuse a high-resolution color image and a low-resolution HS image in [51]. Compared with traditional CS-based and MRA-based algorithms, CNN-based methods significantly improve pansharpening performance. However, some problems remain. For example, both Target-PNN and RSIFNN lack dedicated detail processing, so the details of their fused images are not sharp enough. Although PanNet sharpens spatial details, it does not consider the relationship among the spectral channels of the MS image, which may result in spectral distortion.
In order to preserve spectra and enrich details of fused images, we propose a two-stage pansharpening method with a new spectral loss function based on the following three motivations.
1. In general, CNN-based pansharpening methods directly fuse the UPMS image with the PAN image. However, this does not make full use of the UPMS image, which may result in spectral distortion. Our method includes a super-resolution stage and a fusion stage. The super-resolution stage, built on residual blocks, enhances the spatial resolution of the UPMS image while preserving spectra. In the fusion stage, a multi-level detail injection network is proposed to further enhance the spatial details of the super-resolved MS image.
2. The idea of detail injection has been used in traditional methods with good performance. We combine a CNN with the idea of detail injection.
3. The mean square error (MSE) is commonly used as the loss function between the super-resolved image and the reference image. However, MSE is a pixel-wise loss that ignores the relationship among spectral bands, which leads to spectral distortion. To reduce spectral distortion, we propose a new spectral loss function inspired by the SAM index.
The remainder of this paper is organized as follows. Section II reviews related work on detail injection for pansharpening and on pansharpening and super-resolution with residual learning. The proposed method is presented in Section III. Section IV gives the experimental results and analysis. Finally, Section V gives the conclusion and future work.

II. RELATED WORK

A. DETAIL INJECTION FOR PANSHARPENING
In traditional methods [1], the details of the PAN image are injected into the UPMS image as follows:

$$\widehat{MS}_k = \widetilde{MS}_k + g_k (P - P_L) = \widetilde{MS}_k + g_k P_{dt}, \quad k = 1, \ldots, K \tag{1}$$

where $\widetilde{MS}_k$ and $\widehat{MS}_k$ represent the k-th band of the UPMS image and of the fused image, respectively, K is the number of bands, and P denotes the PAN image. $P_L$ and $P_{dt}$ represent the approximation and detail parts of the PAN image, respectively, and $g_k$ is the injection weight. According to formula (1), pansharpening can be decomposed into the following steps. First, an appropriate $P_{dt}$ should be obtained, whose spatial resolution is the same as that of the HRMS image. In general, the approximation part $P_L$ is obtained by low-pass filtering, and $P_{dt}$ is created by subtracting this low-pass approximation from the PAN image [37]. However, there are obvious differences between the details of the PAN and MS images, because the spectral range of the PAN image differs from that of each MS band. To obtain the required detail image, $P_{dt}$ is therefore multiplied by a weight $g_k$, which influences the spectral and spatial quality of the fused image. Hence, both $P_{dt}$ and $g_k$ are important for generating good pansharpened images.

Several detail injection-based methods have been proposed. BDSD [18] is a representative injection algorithm. Liu et al. proposed a locally linear detail injection method [38], which is based on the assumption that the spatial details of each band of the MS image can be locally and linearly represented by the spatial details of the PAN image. In [39], the PAN image is decomposed into a low-frequency layer, an edge layer, and a detail layer, and the edge and detail layers are injected into the MS image by a proportional injection model. In [40], the spatial details are first extracted from the MS and PAN images and then sparsely represented. An adaptive weight factor is designed to refine the joint details. Finally, the refined joint details are injected into the MS image through a modulation weight to obtain the fusion result.
Inspired by the idea of detail injection, a multi-level detail injection network is proposed in this paper to achieve image fusion.
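The classical injection scheme of formula (1) can be illustrated with a minimal NumPy sketch. It assumes a mean filter as the low-pass filter and takes the gains $g_k$ as given; the function names `box_filter` and `inject_details` are hypothetical and are not part of any of the cited methods.

```python
import numpy as np

def box_filter(img, size=5):
    """Mean filter with edge replication; img is a 2-D array (H, W)."""
    pad = size // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=np.float64)
    # Sum the size*size shifted copies of the image, then normalize.
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (size * size)

def inject_details(upms, pan, gains, size=5):
    """upms: (H, W, K), pan: (H, W), gains: (K,) -> fused image (H, W, K)."""
    pan = pan.astype(np.float64)
    p_dt = pan - box_filter(pan, size)  # P_dt = P - P_L
    # Each band k receives the same detail map scaled by its own gain g_k.
    return upms + gains[None, None, :] * p_dt[:, :, None]
```

With all gains set to zero, or with a constant PAN image, the fused image equals the UPMS image, since no detail is injected.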

B. PANSHARPENING AND SUPER-RESOLUTION WITH RESIDUAL LEARNING
It is well known that ResNet [30], proposed by He et al., is very effective. Its core idea is to learn a residual through an identity mapping, which passes information to the next level well and reduces the difficulty of network learning. A network with residual learning converges quickly and still performs well as the complexity of the network structure increases. Several ResNet-based methods have been proposed for pansharpening. The first work to use residual learning for pansharpening is the deep residual pansharpening neural network (DRPNN) [52]. Target-PNN [20] is a simple and effective three-layer CNN that adopts residual learning. Researchers have also used more complex structures to design pansharpening networks with stronger learning ability. Yang et al. combined the inception module and residual learning to propose a multi-scale and multi-depth network (MSDCNN) [21]. PanNet [22] also uses the residual module to build its network and focuses on the details of the fused image, which improves its spatial quality. Based on the different characteristics of PAN and MS images, a two-branch network called RSIFNN [23] was proposed, which extracts features from the PAN and MS images separately. RSIFNN adopts residual learning by adding a long shortcut.
In recent years, several ResNet-based methods have been proposed for image super-resolution. A long shortcut is used to learn the residual information in [45], [53], which makes the network converge quickly. Various residual modules have been proposed by combining other techniques with residual learning. The residual dense block (RDB) [54] combines a dense network with residual connections for image super-resolution. The dense residual generative adversarial network (DRGAN) [56] uses the RDB as its basic block for remote sensing super-resolution. Attention mechanisms and residual blocks are combined for image super-resolution in [46], [47]. Residual channel attention [57] is used in remote sensing super-resolution. Owing to the strong performance of residual learning, we also use it to design our super-resolution network.

III. PROPOSED METHOD
This section is divided into three subsections. First, the overall framework of the proposed method is given. Second, the super-resolution stage is described. Third, the fusion stage with multi-level detail injection is introduced.

A. OVERALL FRAMEWORK
Generally, deep learning-based pansharpening methods are supervised. The learning process can be regarded as minimizing, over the parameters w, a loss l between the network output and the reference image X, where f and w denote the network and its parameters, respectively. Fig. 1 shows our pansharpening framework, which consists of a super-resolution (SR) stage and a fusion stage. The UPMS image, the super-resolution MS (SRMS) image and the HRMS image are the input, output and label images of the SR stage, respectively. In the fusion stage, the SRMS image and the panchromatic detail $P_{dt}$ are the inputs, the fused MS image is the output, and the HRMS image is the label image. Although our approach is a two-stage network with a two-stage loss, it is still an end-to-end network. In the first stage, super-resolution is used to enhance the spatial resolution and protect spectra simultaneously. To preserve spectra effectively, a new spectral loss function is proposed. In the fusion stage, the details of the PAN image are injected into the enhanced MS image. An effective detail injection module is proposed for this stage. Multi-level details are obtained by stacking this module, and fused images with richer details are obtained by fusing the multi-level detail features. Training our method can be regarded as minimizing a total loss $l_{all}$ that combines the super-resolution loss $l_{sr}$ and the fusion loss $l_{fusion}$, where $w_1$ is the balance parameter.

B. SUPER-RESOLUTION STAGE
Generally, the inputs of existing CNN-based methods are the PAN and UPMS images. In this way, the UPMS image is not utilized effectively, and a fusion result that depends excessively on the PAN image may suffer from spectral distortion. In this paper, we fully utilize the UPMS image to preserve spectra and improve the spatial resolution by super-resolution. As shown in Fig. 2, the super-resolution stage is composed of feature extraction, non-linear mapping and reconstruction. First, low-resolution features are extracted from the UPMS image by a convolutional layer:

$$Fea\_LR = S_1(UPMS) = Relu(Conv_1(UPMS))$$

where $S_1(\cdot)$ represents low-resolution feature extraction, Fea_LR denotes the low-resolution features, $Conv_i(\cdot)$ is the i-th convolution, and $Relu(\cdot)$ is the ReLU (rectified linear unit) activation function. Then, high-resolution features are obtained by non-linear mapping, which is implemented with residual blocks. The core idea of the residual block is to learn a residual through an identity mapping, which passes information to the next level well and reduces the difficulty of network learning. The non-linear mapping can be expressed as:

$$Fea\_HR = S_2(Fea\_LR) = R_m(R_{m-1}(\cdots R_1(Fea\_LR) \cdots))$$

where $S_2(\cdot)$ represents the non-linear mapping, Fea_HR denotes the high-resolution features, $R_i(\cdot)$ is the i-th residual block, and m denotes the number of residual blocks. Finally, the super-resolution (SR) image is reconstructed from the high-resolution features. A long shortcut is added to the super-resolution network as a residual connection so that the network converges quickly and efficiently. The reconstruction can be expressed as $Pre\_SR = S_3(Fea\_HR)$, where $S_3(\cdot)$ represents reconstruction and Pre_SR denotes the reconstructed super-resolution MS (SRMS) image.

The mean square error (MSE) is commonly used as the loss function between the reconstructed super-resolution image and the reference image. However, MSE is a pixel-wise loss that ignores the relationship among spectral bands, which leads to spectral distortion. The SAM metric calculates the angle between corresponding pixels of the fused image and the reference image, and can therefore quantify the spectral distortion. SAM is defined as:

$$SAM(I, J) = \arccos\left(\frac{\langle I, J \rangle}{\|I\|_2 \, \|J\|_2}\right)$$

where I and J are pixel vectors of size 1 × K, and K is the number of bands in the MS image.
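As an illustration of the SAM definition, the following NumPy sketch computes the mean spectral angle between two images of shape (H, W, K). The function name `sam_degrees`, the averaging over all pixels, the conversion to degrees and the small constant `eps` are choices made for this example only.

```python
import numpy as np

def sam_degrees(fused, reference, eps=1e-12):
    """Mean spectral angle in degrees between two (H, W, K) images."""
    # Flatten the spatial dimensions so that each row is one pixel vector.
    f = fused.reshape(-1, fused.shape[-1]).astype(np.float64)
    r = reference.reshape(-1, reference.shape[-1]).astype(np.float64)
    dot = np.sum(f * r, axis=1)
    norms = np.linalg.norm(f, axis=1) * np.linalg.norm(r, axis=1)
    # Clip guards arccos against rounding slightly outside [-1, 1].
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)
    return float(np.degrees(np.mean(np.arccos(cos))))
```

Because the angle depends only on the direction of the pixel vectors, scaling an image by a positive constant leaves its SAM with respect to the reference essentially unchanged, which is why SAM isolates spectral distortion from brightness changes.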
To facilitate gradient computation and back-propagation, we replace the arccos function with the absolute value function after calculating the spectral correlation between the SRMS image and the HRMS image. The highest value of the correlation is 1, and a higher correlation means better fusion performance. To be consistent with minimization, we subtract the correlation from 1. The proposed spectral loss $l_{spe}$ is thus computed from the normalized inner product $\langle Pre\_SR(i, j), X(i, j) \rangle / (\|Pre\_SR(i, j)\|_2 \, \|X(i, j)\|_2)$ at each pixel position (i, j), where Pre_SR and X are the SRMS image and the reference image, respectively, $\langle \cdot, \cdot \rangle$ denotes the inner product, and both Pre_SR(i, j) and X(i, j) are vectors of size 1 × K. The loss of the super-resolution stage is a weighted average of the MSE loss and the spectral loss, where the weight $w_2$ balances the two kinds of losses.

C. FUSION STAGE
As mentioned in Section II-A, we first need to obtain suitable details $P_{dt}$ of the PAN image. The PAN image is filtered by a 5 × 5 mean filter to get $P_L$, and the PAN details $P_{dt}$ are then obtained by subtracting $P_L$ from the PAN image. However, $P_{dt}$ is a single detail map, which can hardly meet the requirement. It is feasible to extract multiple detail features from $P_{dt}$ with a CNN, owing to its strong learning and non-linear representation ability. Generally, the spatial details of the PAN and MS images are different, and a dedicated injection weight for each band can adjust the injected details to avoid artifacts. The second step is therefore to find appropriate weights. Each band of the MS image has its own characteristics, and the goal of pansharpening is to obtain the HRMS image, so the weights should be obtained from an image that is similar to the HRMS image. Compared with the UPMS image, the SRMS image is more similar to the HRMS image. Therefore, we obtain the weights by extracting features from the SRMS image. Finally, the Hadamard product is used to get the detail features of the fused image. To obtain more details, we stack injection blocks to extract multi-level detail features.
Multi-level detail injection is defined level by level, with each level taking $w_{p-1}$ and $d_{p-1}$ from the previous level as input. $F_d^p$ and $F_w^p$ represent the p-th level detail feature extraction network and adaptive weight extraction network, respectively, and ⊗ denotes the Hadamard product. $w_p$ is the p-th adaptive weight, and $d_p$ is the p-th level of detail features. $w_0$ and $d_0$ are the SRMS image and the details of the PAN image, respectively. They are the inputs of the first-level injection block and also the inputs of the fusion stage. $d_{multi}$ represents the multi-level detail features, C denotes the concatenation of the multi-level detail features along the channel dimension, and P is the number of proposed injection blocks.
The final details are obtained by fusing the multi-level detail features. We add a long shortcut, so the fused details are added to the SRMS image, i.e., $Pre\_Fusion = F_{fusion}(d_{multi}) + Pre\_SR$. Training this stage can be regarded as minimizing the fusion loss $l_{fusion}$ between Pre_Fusion and X, where $F_{fusion}$ denotes the multi-level detail feature fusion, which is built from three simple convolution layers, and Pre_Fusion, Pre_SR and X denote the fused image, the SRMS image and the reference image, respectively.
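The data flow of the fusion stage can be sketched as follows. This is only one plausible reading of the description, under strong simplifications: every learned layer is replaced by a randomly initialized 1 × 1 channel-mixing map (`conv1x1`), the fusion network is a single such map instead of three convolution layers, and the number of levels and channels are arbitrary. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    """Stand-in for a learned convolution: mixes channels only.
    x: (H, W, C_in), w: (C_in, C_out) -> (H, W, C_out)."""
    return x @ w

def relu(x):
    return np.maximum(x, 0.0)

def multi_level_injection(pre_sr, p_dt, levels=3, channels=16):
    """pre_sr: (H, W, K) SRMS image (w_0); p_dt: (H, W, 1) PAN details (d_0)."""
    k = pre_sr.shape[-1]
    w_feat, d_feat = pre_sr, p_dt
    details = []
    for _ in range(levels):
        # Weight branch: features from the SRMS side act as adaptive weights.
        w_feat = relu(conv1x1(w_feat, rng.normal(0, 0.1, (w_feat.shape[-1], channels))))
        # Detail branch: features from the PAN-detail side.
        d_raw = relu(conv1x1(d_feat, rng.normal(0, 0.1, (d_feat.shape[-1], channels))))
        d_feat = d_raw * w_feat  # Hadamard product of details and weights
        details.append(d_feat)
    d_multi = np.concatenate(details, axis=-1)  # concatenation on channels
    fused_details = conv1x1(d_multi, rng.normal(0, 0.1, (d_multi.shape[-1], k)))
    return pre_sr + fused_details  # long shortcut to the SRMS image
```

In this sketch, a zero detail map yields an output identical to the SRMS image, which reflects the role of the long shortcut: the fusion stage only has to learn what to add.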

IV. EXPERIMENTAL RESULTS AND ANALYSES
In this section, we report experiments on GeoEye-1 (GE1) and WorldView-2 (WV2) images. Their spatial resolution and band information are shown in Tables 1 and 2. First, the ablation experiments and parameter selection are given to analyze the network structure. Then our method is compared with four traditional methods (SFIM [17], MTF-GLP-HPM [8], BDSD [18] and ATWT [9]) and three CNN-based methods (Target-PNN [20], RSIFNN [23] and PanNet [22]).
Six indices (Q [25], SAM [26], ERGAS [24], SCC [28], Q4 [27], $Q2^n$ [31]) are used to evaluate the quality of fused images at the reduced scale. Q evaluates the structural similarity between the fused image and the reference image. Q4 is the vector extension of the Q index. $Q2^n$ is suitable for assessing images with more than four spectral bands. The spectral angle mapper (SAM) represents spectral distortion by calculating the average angle between corresponding spectral vectors of the fused image and the reference image. The relative global dimensional synthesis error (ERGAS) reflects the overall distortion of the image. The spatial correlation coefficient (SCC) reflects the correlation between the details of the HRMS image and those of the fused image. Four commonly used indices ($D_\lambda$ [29], $D_s$ [29], QNR [29], and SAM) are used to evaluate the quality of fused images at the full scale. $D_\lambda$ and $D_s$ reflect spectral distortion and loss of spatial detail, respectively. QNR is a comprehensive index that combines $D_\lambda$ and $D_s$.
The training images are generated according to Wald's protocol. We rotate the data sets by 90, 180 and 270 degrees, and extract 9801 samples on WV2 and 12,800 samples on GE1 as the training sets. The patch size is 64 × 64, and the batch size is 64. The test sets include 26 images on GE1 and 55 images on WV2. We use the TensorFlow framework to implement the proposed method and select the Adam optimizer. The weights are initialized with the Xavier uniform initializer.
A long shortcut and local residual connections are used in our method, which makes the network converge quickly. We do not use any tricks such as gradient clipping to deal with vanishing or exploding gradients. The network parameter settings are given in Table 3.
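The reduced-scale training data described in this section can be generated roughly as in the following sketch. It assumes a resolution ratio of 4 and uses block averaging as a simple stand-in for the low-pass filtering and decimation of Wald's protocol; the degradation filter actually used is not specified here, and practical pipelines often use sensor-matched filters instead. The function names are hypothetical.

```python
import numpy as np

def block_mean(img, ratio=4):
    """Downsample by averaging ratio x ratio blocks; img is (H, W) or (H, W, K)."""
    h = img.shape[0] // ratio * ratio
    w = img.shape[1] // ratio * ratio
    img = img[:h, :w].astype(np.float64)
    # Split each spatial axis into (blocks, ratio) and average within blocks.
    shape = (h // ratio, ratio, w // ratio, ratio) + img.shape[2:]
    return img.reshape(shape).mean(axis=(1, 3))

def wald_pair(ms, pan, ratio=4):
    """Return (degraded MS, degraded PAN, reference); the original MS is the label."""
    return block_mean(ms, ratio), block_mean(pan, ratio), ms

def rotations(img):
    """The image plus its 90-, 180- and 270-degree rotations."""
    return [np.rot90(img, k, axes=(0, 1)) for k in range(4)]
```

For example, a 16 × 16 × 4 MS image with a 64 × 64 PAN image yields a 4 × 4 × 4 degraded MS image and a 16 × 16 degraded PAN image, with the original MS image serving as the reference.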

A. PARAMETER SELECTION AND NETWORK STRUCTURE ANALYSIS
The loss of the super-resolution stage is composed of the MSE loss and the spectral loss. The MSE loss focuses on optimizing the spatial part of the MS image, while the spectral loss preserves its spectral part. The parameter $w_2$ balances the two, and we study its influence on the quality of the fused image. The experimental results are given in Table 4. Because ERGAS is a comprehensive image quality index, we use it to analyze the image quality. When $w_2$ is 1, the fusion result is the worst, which shows that our spectral loss function is effective. As $w_2$ rises from 0.1 to 0.6, the value of ERGAS fluctuates, so the fusion result is sensitive to small $w_2$. The best fusion result is obtained when $w_2$ is 0.6. As $w_2$ rises further to 0.9, the value of ERGAS rises only slightly and the image quality remains high, which shows that the fusion result is not sensitive to large $w_2$. Therefore, the proposed spectral loss function is effective, and combining MSE with it improves the image quality. We set $w_2$ to 0.6 according to these results.
In addition, the total loss is composed of the super-resolution loss and the fusion loss, balanced by the parameter $w_1$. We study the influence of $w_1$ on the quality of the fused image, and the experimental results are given in Table 5. When $w_1$ equals 1, the fusion result is the worst, which shows that our two-stage loss is effective. As $w_1$ rises from 0.1 to 0.9, the value of ERGAS rises or drops only slightly. The image quality is high and stable, which shows that the fusion result is not sensitive to $w_1$. The best fusion result is obtained when $w_1$ is 0.6, so $w_1$ is set to 0.6.
Our super-resolution network is mainly composed of the residual module (RM).
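The roles of the balance parameters $w_2$ and $w_1$ studied in this subsection can be illustrated with a short NumPy sketch. The exact forms are assumptions: the spectral loss is taken as one minus the absolute normalized inner product averaged over pixels, the super-resolution loss as $w_2 \cdot$ MSE $+ (1 - w_2) \cdot$ spectral loss (consistent with the remark that $w_2 = 1$ disables the spectral term), the fusion loss as MSE, and the total loss as a convex combination in which $w_1$ scales the fusion loss. Which term $w_1$ scales is not stated in the text, so that choice in particular is hypothetical.

```python
import numpy as np

def spectral_loss(pred, ref, eps=1e-12):
    """Assumed form: mean over pixels of 1 - |<pred, ref>| / (|pred| |ref|)."""
    dot = np.sum(pred * ref, axis=-1)
    norms = np.linalg.norm(pred, axis=-1) * np.linalg.norm(ref, axis=-1)
    return float(np.mean(1.0 - np.abs(dot) / (norms + eps)))

def mse(pred, ref):
    return float(np.mean((pred - ref) ** 2))

def sr_loss(pre_sr, ref, w2=0.6):
    """Assumed form: w2 * MSE + (1 - w2) * spectral loss."""
    return w2 * mse(pre_sr, ref) + (1.0 - w2) * spectral_loss(pre_sr, ref)

def total_loss(pre_sr, pre_fusion, ref, w1=0.6, w2=0.6):
    """Assumed form: w1 * fusion loss (MSE here) + (1 - w1) * SR loss."""
    return w1 * mse(pre_fusion, ref) + (1.0 - w1) * sr_loss(pre_sr, ref, w2)
```

A prediction that is a scaled copy of the reference has a spectral loss close to zero but a non-zero MSE, which shows how the two terms penalize different kinds of error.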
To verify the effectiveness of the super-resolution network, the residual module (RM) used in it is compared with the standard convolution module (SCM) [45], the residual channel attention module (RCAM) [47] and the residual attention module (RAM) [46]. The experimental results are shown in Table 6. Compared with the other modules, RM presents better results.
The proposed detail injection block consists of two branches. Adaptive weights are obtained from the first branch, and the second branch generates the detail feature maps. To verify the effectiveness of this module, we performed comparative experiments. First, we study the necessity of two branches by comparing a single branch (Fig. 4a) with our two branches (Fig. 4c). Second, the way the weights are generated is important. Our three-dimensional weights are obtained directly by the CNN, whereas the squeeze-and-excitation block (SE-block) [55] (Fig. 4b) is widely used to obtain channel attention weights. We study the influence of the weight generation method on fusion performance by comparing the SE-block (Fig. 4b) with ours (Fig. 4c). The experimental results on the GE1 and WV2 datasets are shown in Tables 7 and 8, respectively. The comparison between the single branch and our two branches shows that two branches are necessary. Moreover, our method gives better results than the SE-block, which means that our weights are more appropriate. The fusion result with our detail injection block presents the best performance. Therefore, the proposed detail injection block is effective.

B. ABLATION EXPERIMENTS
Our method consists of two stages, i.e., the super-resolution stage and the fusion stage. The super-resolution stage is used to preserve spectra and enhance spatial resolution simultaneously. The fusion stage, with its multi-level detail injection network, generates richer details. We analyze their impact on the fusion results by comparing the following seven cases.
1. We give the performance of up-sampled multispectral (UPMS) image obtained by bicubic interpolation.
2. The network only includes super-resolution (SR) stage and does not include fusion stage.
3. Firstly, UPMS images are super-resolved by our SR network to obtain super-resolution multispectral (SRMS) images. Then, SRMS images are fused with PAN images by guided filtering (GF) (the GFCS-B method in [58]).
4. The network includes fusion stage with single-level (SL) detail injection and does not include super-resolution stage.
5. The network includes fusion stage with multi-level (ML) detail injection and does not include super-resolution stage.
6. The network includes fusion stage with single-level detail injection and super-resolution stage (SLSR).
7. The network includes fusion stage with multi-level detail injection network and super-resolution stage (MLSR).
The evaluation indices on the WV2 and GE1 datasets are given in Tables 9 and 10, respectively. The best performance is obtained by MLSR, which shows that our method is effective. Comparing SR with bicubic interpolation, the SAM index is decreased by 1.9 and 1.7 on the GE1 and WV2 datasets, respectively, and the SCC index is improved by 0.17 and 0.19, respectively. Therefore, the super-resolution stage can effectively improve image quality. Although the spatial quality of the UPMS image is improved, it is still not sufficient. The spatial resolution ratio between the PAN image and the MS image from the same satellite is usually 4, and it is difficult to improve the resolution of an image by a factor of 4. The fusion stage is used to further improve the quality of the SRMS image. Comparing SR with MLSR, the SAM index is decreased by 0.7 and 1.2 on the GE1 and WV2 datasets, respectively, and the SCC index is improved by 0.19 and 0.24, respectively. MLSR is clearly much better than SR. Therefore, the fusion stage can further improve fusion performance, and the PAN image makes an important contribution to the fusion result. Comparing SL with SLSR, the value of SAM on the GE1 and WV2 datasets decreases by 0.72 and 0.52, respectively. Therefore, the SR stage leads to better spectral preservation. Compared with SRGF (case 3), SLSR presents better fusion performance, which demonstrates that the proposed injection block is effective. Moreover, the spatial quality of multi-level injection is better than that of single-level injection, and SCC is improved significantly, which demonstrates that multi-level detail injection yields richer details than single-level detail injection. In summary, the SR stage can effectively preserve spectra, multi-level detail injection can provide richer details, and the fusion quality is further improved by combining the two stages.

C. EXPERIMENTS AT REDUCED SCALE
The mean values of the evaluation indices of the fused images on GE1 and WV2 are given in Tables 11 and 12, respectively. In the following analyses, our method is compared with the best-performing method among the other methods.
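For reference, ERGAS, one of the reduced-scale indices reported in these tables, can be computed as in the following NumPy sketch. It assumes the commonly used definition, $100 \cdot \frac{h}{l} \sqrt{\frac{1}{K} \sum_k (RMSE_k / \mu_k)^2}$, with $h/l = 1/4$ for a resolution ratio of 4 and $\mu_k$ the mean of the k-th reference band; the function name `ergas` is chosen for this example only.

```python
import numpy as np

def ergas(fused, reference, ratio=4):
    """ERGAS between two (H, W, K) images; ratio is the PAN/MS resolution ratio."""
    f = fused.astype(np.float64)
    r = reference.astype(np.float64)
    rmse2 = np.mean((f - r) ** 2, axis=(0, 1))  # squared RMSE of each band
    mean2 = np.mean(r, axis=(0, 1)) ** 2        # squared mean of each reference band
    return float(100.0 / ratio * np.sqrt(np.mean(rmse2 / mean2)))
```

Lower values are better, and a fused image identical to the reference gives an ERGAS of zero.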
From Table 11, it can be seen that our fusion result gives the best performance. SAM is decreased by 0.48 compared with PanNet, which indicates that our fusion result presents better spectra. SCC is increased by 0.04 compared with PanNet, so more details are injected into the fused images by our method. From Table 12, it can be seen that the SAM of our method is decreased by 0.37 compared with Target-PNN, which shows that the SR stage can effectively protect spectra. SCC is increased by 0.01 compared with PanNet, which shows that the detail injection network generates more abundant details.
A representative fusion result is given for each satellite. First, the fusion results are compared on GE1. The RGB bands are displayed in Fig. 5. From this figure, it can be seen that CNN-based methods present better fused images than traditional methods. To make the differences more obvious, the absolute value of the difference between each fused image and the HRMS image is given in Fig. 6. Our fusion result shows less loss of spatial and spectral information, especially in the red rectangle area.
Then the fusion performance is analyzed on WV2. The RGB bands of the fused images and the residual images are presented in Figs. 7 and 8, respectively. Traditional methods still lose some details. As can be observed from Fig. 7, all CNN-based methods perform well. Compared with other CNN-based methods, our result displays less error in the red rectangle, as can be observed from Fig. 8. Thus, our fusion result shows better spectra and richer details.
If the response range of the PAN image does not cover the spectral range of the MS bands, the pansharpening task is more difficult. From Table 2, it can be seen that the spectral ranges of all MS bands lie within the range of the PAN image for GE1, while the spectral ranges of the Coastal, NIR1 and NIR2 bands are not within the response range of the PAN image for WV2. The indices of these three bands (Coastal, NIR1, NIR2) are shown in Table 13. Our method performs best on all evaluation indices. The NIR2 band of the WV2 images and the corresponding residual images are displayed in Figs. 9 and 10, respectively. It can be seen from Figs. 9 and 10 that our fusion result presents richer details and less information loss. Therefore, our method performs best on both GE1 and WV2. All indices are improved, and the fused images of the proposed method are better than those of the other methods, which demonstrates the superiority of our method.

D. EXPERIMENTS AT FULL SCALE
In this section, experiments and analyses at full scale are given. The mean values of the evaluation indices on GE1 are given in Table 14. Our fusion result gives the best performance on all evaluation indices. $D_\lambda$ is decreased by 0.0064 compared with Target-PNN, which shows that our fusion result presents less spectral distortion. $D_s$ is decreased by 0.0055 compared with RSIFNN, which shows that more details are injected by our method. Our result also gives the best QNR. The mean values of the evaluation indices on WV2 are given in Table 15. Although our fusion result only gives the second-best performance on $D_\lambda$ and $D_s$, the SAM and the comprehensive index QNR of our method are the best. SAM is decreased by 0.27 compared with Target-PNN. Although the details of PanNet are rich, its spectral quality is not good enough. RSIFNN preserves spectra well, but the details of its fusion result are not good enough. Therefore, on the whole, our method gives the best results on both GE1 and WV2.
In the visual comparison part, a pair of source images and their fused images at full scale are presented. The RGB bands of the fused images on GE1 are shown in Fig. 11. It can be observed that the fusion results of CNN-based methods are better than those of the traditional methods.
There is no reference image at full scale. Because the input UPMS images are sharpened by the various methods, the difference between the fusion results and the UPMS images can display the injected details and the spectral enhancement regions. Fig. 12 shows the difference between the RGB bands of the fused images and those of the UPMS images. Compared with other methods, the proposed method injects more edge details. Compared with the results obtained by PanNet, our fusion results display more spectral enhancement regions, especially in the red rectangle region, which indicates that our method better protects spectral information.
The RGB bands of the fused images and the difference between the fused images and the UPMS images on WV2 are given in Figs. 13 and 14, respectively. As Figs. 13 and 14 show, our method presents better visual performance than the other methods, with better spectra and richer details. The evaluation indices of the three bands (Coastal, NIR1, NIR2) are shown in Table 16. Although the D_s index is not the best, the other indices are the best. The NIR2 bands and the difference between the fused images and the UPMS images are given in Figs. 15 and 16, respectively. In Fig. 15, the fusion result of PanNet presents obvious noise, and the results of the traditional methods appear smooth. It can be seen that more details are injected into the UPMS image by our method. Fig. 16 shows clearly that our fusion result has richer details, especially in the red rectangle area. Therefore, our method improves the fusion results in terms of both subjective visual performance and objective indices at full scale.
Table 17 lists the computation time of the different methods for fusing a UPMS image of size 480 × 480 × 8 with a PAN image of size 480 × 480. The traditional methods were measured on a CPU, and the CNN-based methods were measured on a GPU. The table shows that the computation time of the traditional methods is less than that of the CNN-based methods. Our method takes more computation time, but 0.2542 s is still acceptable. The training time of the CNN-based methods is given in Table 18. Although our method needs the longest training time, it takes only 1.1 hours.

V. DISCUSSION
Some networks designed for the super-resolution task are similar to our proposed network, but there are still many differences between them. Both the networks in [45], [53] and our super-resolution network use a long shortcut to learn the residual information, which makes the network converge quickly. The difference is that we add local residual blocks in the non-linear mapping part, which further improves the non-linear mapping ability. Standard residual modules, each containing two convolution layers and an identity mapping, are used in our super-resolution network. Compared with the standard residual block, the residual dense block (RDB) [54] contains many more parameters, and its computational cost is high. Standard residual modules are light and effective. Some experimental results are given in Section IV-A: compared with the network blocks in [45]-[47], the standard residual module gets better fusion results.
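A standard residual module of the kind described above can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not the network used in the paper: the 3 × 3 kernel size, the ReLU between the two convolutions, the channel count, and the helper names `conv3x3` and `residual_block` are all hypothetical.

```python
import numpy as np


def conv3x3(x: np.ndarray, w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Zero-padded 3x3 convolution.

    x: (C_in, H, W), w: (C_out, C_in, 3, 3), b: (C_out,).
    """
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))  # zero padding keeps H and W
    out = np.zeros((w.shape[0], h, wd))
    for i in range(3):
        for j in range(3):
            # Mix input channels for one kernel tap and accumulate.
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], xp[:, i:i + h, j:j + wd])
    return out + b[:, None, None]


def residual_block(x, w1, b1, w2, b2):
    """conv -> ReLU -> conv, plus the identity shortcut (ReLU placement is assumed)."""
    y = np.maximum(conv3x3(x, w1, b1), 0.0)
    y = conv3x3(y, w2, b2)
    return x + y  # identity mapping: the block only learns a residual


rng = np.random.default_rng(0)
channels = 4  # hypothetical feature width
x = rng.standard_normal((channels, 8, 8))
w1 = rng.standard_normal((channels, channels, 3, 3)) * 0.1
w2 = rng.standard_normal((channels, channels, 3, 3)) * 0.1
b = np.zeros(channels)
out = residual_block(x, w1, b, w2, b)
```

With all weights set to zero the block reduces to the identity, which is one way to see why such modules are light: they add only two convolutions on top of a shortcut that costs nothing.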
There are some similarities among RAM [46], RCAM [47], and the proposed detail injection block, but they have essential differences. An attention mechanism is used in RAM and RCAM to recognize where, or which feature map, is important. The network then focuses on optimizing these areas to improve image quality. Pansharpening sharpens the LRMS image by using the high-resolution PAN image. The spatial details extracted from the PAN image already indicate the regions of concern, and thus an attention module is not needed. Because the details of the PAN image and the MS image are not the same, different weights are needed. The proposed method obtains the weight from the MS image in the other branch, whereas RAM and RCAM learn the corresponding weight from the extracted features themselves and are therefore single-branch structures. RCAM obtains one-dimensional channel attention by global average pooling, two 1 × 1 convolution layers, and a softmax operation. RAM generates a one-dimensional channel attention weight by global variance pooling and two 1 × 1 convolution layers, generates a two-dimensional spatial attention weight by channel separation convolution, and then combines them to form a three-dimensional weight, whereas our three-dimensional weight is obtained directly by two convolution layers. Therefore, there are some important differences among our detail injection module, RCAM, and RAM.
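The contrast between a one-dimensional channel attention weight and a three-dimensional weight taken from a second branch can be illustrated with the following NumPy sketch. It is a simplified illustration and not the implementation of RCAM or of the proposed block: the ReLU activations, the use of 1 × 1 kernels for the two convolution layers of the weight branch, the injection form `ms_feat + weight * pan_detail`, and every function and variable name are assumptions made for brevity.

```python
import numpy as np


def softmax(v: np.ndarray) -> np.ndarray:
    e = np.exp(v - v.max())  # subtract the max for numerical stability
    return e / e.sum()


def channel_attention(feat, w1, w2):
    """1-D channel weights: global average pooling -> two 1x1 convs -> softmax.

    feat: (C, H, W). On a pooled (C,) vector a 1x1 convolution is a matrix
    product. The ReLU between the two layers is an assumption.
    """
    pooled = feat.mean(axis=(1, 2))        # (C,)
    hidden = np.maximum(w1 @ pooled, 0.0)  # (C_mid,)
    return softmax(w2 @ hidden)            # (C,)


def injection_weight(ms_feat, w1, w2):
    """3-D weight (C, H, W) computed from the MS branch by two conv layers.

    1x1 kernels are used only to keep the sketch short (assumption).
    """
    hidden = np.maximum(np.einsum("mc,chw->mhw", w1, ms_feat), 0.0)
    return np.einsum("cm,mhw->chw", w2, hidden)


rng = np.random.default_rng(0)
c, c_mid, h, w = 4, 2, 6, 6                    # hypothetical sizes
ms_feat = rng.standard_normal((c, h, w))       # features from the MS branch
pan_detail = rng.standard_normal((c, h, w))    # details from the PAN branch
w1 = rng.standard_normal((c_mid, c))
w2 = rng.standard_normal((c, c_mid))

attn = channel_attention(pan_detail, w1, w2)   # single branch: one scalar per channel
gain = injection_weight(ms_feat, w1, w2)       # two branches: one value per pixel and channel
injected = ms_feat + gain * pan_detail         # assumed detail injection form
```

The shapes carry the point: `attn` holds one value per channel and is computed from the same features it rescales, while `gain` varies per pixel and per channel and is computed from the other branch.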
In addition, our network consists of two stages. The first stage combines the spectral loss function with the super-resolution network to preserve spectra and enhance details, and the second stage integrates the detail injection idea into a CNN to achieve spatial sharpening. The advantages of the two stages are merged into a single framework to sharpen the LRMS images.

VI. CONCLUSION
In this paper, we proposed a two-stage pansharpening network, which includes a super-resolution stage and a fusion stage. In the super-resolution stage, we make full use of the spectral information of the LRMS images to preserve spectra. In the fusion stage, a detail injection block is proposed, which extracts detail features well. The ablation experiment demonstrates the effectiveness of the two stages. The proposed method is compared with other pansharpening methods on GE1 and WV2 satellite image datasets. The experimental results at reduced and full scale verify the superiority of the proposed method in terms of subjective visual performance and objective evaluation indices.
Remote sensing images are becoming more and more abundant, and the tasks involving them are increasingly diverse. For example, the Sentinel-2 satellite provides remote sensing images with three spatial resolutions. Sentinel-2 image fusion is more difficult because the spectral ranges of the bands differ and the number of spatial resolutions is larger. In the future, we will study how to combine the idea of multi-level detail injection with the task of multi-resolution remote sensing image fusion.