Pansharpening Using Unsupervised Generative Adversarial Networks With Recursive Mixed-Scale Feature Fusion

Panchromatic sharpening (pansharpening) is an important technology for improving the spatial resolution of multispectral (MS) images. The majority of existing models are implemented at reduced resolution, leading to unfavorable results at full resolution. Moreover, the complicated relationship between MS and panchromatic (PAN) images is often ignored in detail injection. To address these problems, an unsupervised generative adversarial network with recursive mixed-scale feature fusion for pansharpening (RMFF-UPGAN) is modeled to boost the spatial resolution and preserve the spectral information. RMFF-UPGAN comprises a generator and two U-shaped discriminators. A dual-stream trapezoidal branch is designed in the generator to obtain multiscale information. Further, a recursive mixed-scale feature fusion subnetwork is designed: a prior fusion is performed on the extracted MS and PAN features of the same scale, and a mixed-scale fusion is then conducted on the prior fusion results of the fine and coarse scales. The fusion is executed sequentially in this manner, building a recursive mixed-scale fusion structure and finally generating the key information. An information compensation mechanism is also designed for the reconstruction of the key information. A nonlinear rectification block for the reconstructed information is developed to overcome the distortion induced by neglecting the complicated relationship between MS and PAN images. Two U-shaped discriminators are designed, and a new composite loss function is defined. The presented model is validated using data from two satellites, and the outcomes reveal that it outperforms prevalent approaches in terms of both visual assessment and objective indicators.


I. INTRODUCTION
REMOTE sensing images are extensively utilized in geological exploration, terrain classification, agricultural yield prediction, pest detection, disaster prediction, national defense, environmental change detection, and so on [1], [2]. In these applications, images with high spatial resolution, high spectral resolution, or high temporal resolution are required. However, due to the limitations of sensor technology, we obtain low spatial resolution multispectral or hyperspectral (LRMS/LRHS) images, low temporal resolution multispectral or hyperspectral images, and low spectral resolution panchromatic (PAN) images [3], [4]. This requires fusion technology to fuse LRMS and PAN images to generate high spatial resolution multispectral (HRMS) images. This fusion technology is called panchromatic sharpening (pansharpening). The pansharpening techniques are generally divided into component substitution (CS) approaches, multiresolution analysis (MRA) techniques, variational optimization (VO) methods, and deep learning (DL) models [1], [5], [6].
CS techniques primarily involve intensity-hue-saturation (IHS) and variants [7], Gram-Schmidt (GS) [8], GS adaptive (GSA) [9], principal component analysis (PCA) [10], and band-dependent spatial detail (BDSD) [11]. First, the LRMS images are projected into another spatial domain, then the spatial structure information is extracted and replaced by the high resolution image. Finally, the image is inversely transformed into the original space to obtain the fused image. The strengths of the CS are simplicity, extensive application, integration in individual software, easy implementation, and greatly enhancing the spatial resolution of LRMS images. The drawbacks involve spectral distortion, oversharpening, aliasing, and fuzzy problems.
MRA approaches principally include the smoothing-filter-based intensity modulation (SFIM) [12], Laplacian pyramid (LP) transform [13], generalized LP (GLP) transform [14], curvelet transform [15], contourlet transform [16], nonsubsampled contourlet transform (NSCT) [17], and modulation transfer function-GLP (MTF-GLP) transform and variants [7]. The MRA approaches decompose the LRMS and PAN images, then fuse them through some rules and generate the fused images by inverse transformation. Compared with CS methods, MRA methods can preserve more spectral information and reduce spectral distortion, but their spatial resolution is relatively low. VO methods can be divided into two parts: the energy function and the optimization method. The core is the optimization of the variational model, such as the panchromatic and multispectral image (P+XS) model [18], the nonlocal variational pansharpening model [19], and others [7], [20]. Compared with the CS and MRA methods, the VO methods have higher spectral fidelity, but the calculations are more complex.
Convolutional neural networks (CNNs) and generative adversarial networks (GANs) have been widely applied in image processing. Some achievements have been made in the pansharpening of remote sensing images. Early on, pansharpening by CNN (PNN) with three layers was designed [21] based on superresolution reconstruction. The nonlinear mapping of the CNN is employed to generate HRMS images by feeding LRMS and PAN image pairs into the PNN. The PNN is relatively simple and easy to implement, but it is prone to overfitting. Subsequently, the target adaptive CNN (TA-CNN) [22] was modeled, which utilizes a target adaptive adjustment stage to solve the problems of mismatched data sources and insufficient training data. Yang et al. [23] presented a deep pansharpening network based on ResNet modules, i.e., PanNet, employing the high-frequency information of LRMS and PAN images as the input and outputting the residual between HRMS and LRMS images. Nevertheless, the PanNet overlooks the low-frequency information, causing spectral distortions. Wei et al. [24] modeled a deep residual pansharpening neural network (DRPNN), implemented on the ResNet block. Although the DRPNN is realized by using the powerful nonlinear capability of the CNN, the number of samples required should increase with increasing network depth to avoid overfitting. Regarding training in the spatial domain, the generalization ability of the model still needs to be improved. Deng et al. [25] proposed the FusionNet model based on a CS and MRA detail injection model. The injection details are obtained with a deep CNN (DCNN). Different from other networks, its input is the difference between the PAN image, duplicated to the same number of channels as the LRMS image, and the LRMS image. Thus, this network can introduce multispectral information and reduce spectral distortion. Hu et al. [26] proposed a multiscale dynamic convolutional neural network (MDCNN).
This MDCNN mainly contains three modules: a filter generation network, a dynamic convolution network, and a weight generation network. The MDCNN uses multiscale dynamic convolution to extract multiscale features of LRMS and PAN images and designs a weight generation network to adjust the relationship between features at different scales to improve the adaptability of the network. Although dynamic convolution improves the flexibility of the network, the network design is more complicated. Simultaneously extracting the features of LRMS and PAN images, the network tends to reduce the effective detail information and spectral information. Wu et al. [27] proposed RDFNet based on a distributed fusion structure and residual module, which extracts multilevel features of LRMS and PAN images, respectively. Then, the corresponding-level MS and PAN image features and the fusion result of the previous step are fused gradually to obtain HRMS images. Although the network uses multilevel LRMS and PAN features as much as possible, it is affected by the depth of the network and cannot obtain more details and spectral information. Wu et al. [28] also designed TDPNet based on cross-scale fusion and multiscale detail compensation. GANs offer great potential for generating images [5]. Shao et al. [29] presented a supervised conditional GAN comprising a residual encoder-decoder, i.e., RED-cGAN, which enhances the sharpening ability with the restriction of PAN images. Liu et al. [30] developed a deep CNN-based pansharpening GAN, i.e., PsGAN, consisting of a dual-stream generator and a discriminator, which distinguishes the generated MS image from the reference image. Benzenati et al. [31] introduced a detail injection GAN (DIGAN) constructed with a dual-stream generator and a relativistic average discriminator. RED-cGAN, PsGAN, and DIGAN are supervised approaches trained on degraded-resolution data; nevertheless, the products are not satisfactory when applied to full-resolution data.
Ozcelik et al. [32] constructed a self-supervised learning framework considering pansharpening as colorization, i.e., PanColorGAN, which reduces blurring by color injection and random-scale downsampling. Li et al. [33] put forward a self-supervised approach using a cycle-consistent GAN trained on reduced-resolution data, which builds two generators and two discriminators. The LRMS and PAN images are fed into the first generator to yield the predicted image, and then the predicted image is input to the second generator to acquire the PAN image, which remains consistent with the input PAN. Regarding the problem of having no reference HRMS images, some unsupervised GANs were presented. Ma et al. [34] suggested an unsupervised pansharpening GAN (Pan-GAN) composed of a generator and two discriminators (a spectral discriminator and a spatial discriminator). The generator produces HRMS images from concatenated MS and PAN images. The spectral discriminator is adopted to judge the spectral information between the HRMS and LRMS images and to produce HRMS data with a spectrum consistent with the LRMS data. The spatial discriminator discerns the spatial information between the HRMS and PAN images, enabling the generated HRMS image to agree with the spatial information of the PAN image. Pan-GAN uses two discriminators to better retain spectral information and spatial structure information and solves the problem of the ambiguity caused by downsampling in the supervised training process. However, the input is the concatenated MS and PAN images, resulting in insufficient details and spectral information. Zhou et al. [35] proposed an unsupervised dual-discriminator GAN (PGMAN), which utilizes a dual-stream generator to yield HRMS images and two discriminators to retain spectral information and details individually.
Pan-GAN and PGMAN are trained on the original data directly with no reference images, which obtains better results at full resolution, but the results on the degraded resolution data are not desirable. This reveals the poor generalization ability of the models.
Although various scholars have proposed a variety of pansharpening networks and achieved certain fusion results, a majority of the models are trained on reduced-resolution data, which exhibits problems of spectral distortion and loss of details when fusing full-resolution data due to the change in resolution. Moreover, in the detail injection model, the details are directly added to the upsampled MS image, ignoring the complicated relationship between the MS image and the PAN image, which is likely to lead to spectral distortion or ringing. To address these problems, an unsupervised GAN with recursive mixed-scale feature fusion for pansharpening (RMFF-UPGAN) is modeled to boost the spatial resolution and preserve the spectral information; it is trained on observed data without reference images. The main contributions of this article are as follows.
1) A dual-stream trapezoidal branch is designed in the generator to obtain multiscale information. We employ a ResNeXt block and residual learning blocks to obtain the spatial structure and spectral information of four scales.
2) A recursive mixed-scale feature fusion structure is designed by executing prior fusion and mixed-scale fusion sequentially, which generates the key information.
3) An information compensation mechanism is also designed for the reconstruction of the key information.
4) A nonlinear rectification block for the reconstructed information is developed to overcome the distortion induced by ignoring the complicated relationship between MS and PAN images.
5) Two U-shaped discriminators are designed and a new composite loss function is defined to better preserve spectral information and details.
The rest of this article is organized as follows. Section II describes related work. Section III describes the proposed model in detail. Section IV introduces datasets, evaluation indicators, experimental settings, and comparative experiments. Finally, Section V concludes this article.

II. RELATED WORK
A. MRA-Based Detail Injection Model
MRA methods [36], [37] are a class of image fusion methods and are particularly common in the field of remote sensing. These methods have good multiscale spatial-frequency decomposition characteristics, singularity structure representation abilities, and visual perception characteristics. The efficient filter-bank implementation of the wavelet makes it possible to process large-scale remote sensing image fusion. In MRA methods, first, the image is decomposed into a low-frequency component and a high-frequency component by some decomposition method; then, the high-frequency and low-frequency components are fused according to a fusion rule. Finally, the fused high-frequency and low-frequency components are reconstructed by inverse transformation to generate the fused image. An MRA-based detail injection model can be represented by a general detail injection framework, as shown in the following expression:

F_k = ↑M_k + g_k (P − P_L), k = 1, 2, . . . , N    (1)

where F_k represents the kth-band fused HRMS image, ↑M_k represents the kth-band upsampled LRMS image, g_k is the kth-band detail injection gain, P represents the PAN image, P_L is the low-frequency component of the PAN image, and N is the number of bands of the MS image.
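As a concrete illustration, the detail-injection framework in (1) can be sketched in a few lines of NumPy. The function name and the choice of a per-band gain vector are illustrative, not from the paper:

```python
import numpy as np

def mra_detail_injection(ms_up, pan, pan_low, gains):
    """General MRA detail injection: F_k = ↑M_k + g_k * (P - P_L).

    ms_up:   (H, W, N) upsampled LRMS image
    pan:     (H, W) PAN image
    pan_low: (H, W) low-frequency component of the PAN image
    gains:   length-N per-band injection gains g_k
    """
    details = pan - pan_low  # high-frequency detail plane of the PAN image
    # inject the same detail plane into every band, scaled by that band's gain
    return ms_up + details[..., None] * np.asarray(gains)[None, None, :]
```

Different MRA variants differ only in how P_L is obtained (wavelet, GLP, MTF-matched filter) and how the gains g_k are estimated.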

B. ResNeXt
Xie et al. [38] proposed the ResNeXt structure, which is an improvement of ResNet [39]. The network uses group convolution to reduce the network complexity and improve the expression ability. The core of ResNeXt is the proposal of cardinality, which is used to measure the complexity of the model. ResNeXt proves that, for similar computational complexity and model parameters, increasing the cardinality achieves better expression ability than increasing the depth or width of the network. The ResNeXt network structure [38] takes advantage of the split-transform-merge idea, and the convolution topology of each path is the same, which reduces the computational complexity. The mathematical expression is as follows:

y = x + Σ_{i=1}^{C} T_i(x)    (2)

where C is the cardinality, i.e., the number of identical paths; x represents the input and y represents the output; and T_i() represents the function of the ith path.
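The aggregated transform in (2) can be sketched generically, with each path supplied as a callable. This is a minimal stand-in for the real block (which would use grouped 1×1/3×3 convolutions), not the paper's implementation:

```python
import numpy as np

def resnext_unit(x, transforms):
    """Aggregated residual transform: y = x + sum_i T_i(x).

    `transforms` is a list of C functions with identical topology;
    C = len(transforms) is the cardinality.
    """
    return x + sum(t(x) for t in transforms)
```

With C identical linear paths of weight 0.5, an all-ones input maps to 1 + 0.5·C, making the role of cardinality in the aggregation easy to see.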

III. METHODOLOGY
RMFF-UPGAN is modeled to improve spatial resolution and retain spectral information. It is trained directly using the raw full-resolution data to decrease the effect of resolution variation on the results. The overall architecture of the RMFF-UPGAN is illustrated in Fig. 1 and is composed of one dual-stream generator and two U-shaped relativistic average discriminators (i.e., U-RaLSD_pe and U-RaLSD_pa). In Fig. 1, M and P stand for the raw MS and PAN images, ↑M refers to the upsampled MS image, and HM is the fused image. For the generator, first, a dual-stream trapezoidal branch is designed to obtain multiscale information. A ResNeXt block extracts low-level semantic information at the fine scale, and residual learning blocks extract high-level semantic information at the mesoscale and coarse scale, obtaining the spatial structure and spectral information of four scales. Second, a recursive mixed-scale feature fusion subnetwork is designed via residual learning. A prior fusion is performed on the extracted MS and PAN features of the same scale, and a mixed-scale fusion is then conducted on the prior fusion results of the fine and coarse scales. The fusion is executed sequentially in this manner, building a recursive mixed-scale fusion structure and finally generating the key information. Then, the key information is reconstructed, and a supplemental information structure is designed to compensate for information lost during the reconstruction. Finally, a rectification block for the reconstructed information is developed to obtain the fused image, which overcomes the distortion induced by neglecting the complicated relationship between MS and PAN images. Two U-shaped discriminators are designed to better preserve spectral information and details. The U-RaLSD_pa discriminator differentiates the details of the HM image from the details in the P image and prompts the details of the HM to be consistent with those in the P image.
The U-RaLSD_pe discriminator is applied to distinguish the spectral information of the HM from the spectral information in the M image, which drives the spectral information of the HM to be consistent with that in the M image.

A. Dual-Stream Generator
The designed dual-stream generator consists of a dual-stream trapezoidal multiscale feature extraction module, a recursive mixed-scale feature fusion module, a dual-stream multiscale feature reconstruction module, and a reconstructed information rectification module. The architecture of each module is explained in detail as follows.

1) Dual-Stream Trapezoidal Multiscale Feature Extraction (DSTMFE):
The structure of the DSTMFE branch of the generator is shown in Fig. 2, which consists of two independent branches and differs from our previous work TDPNet [28]. We substitute the maxpooling operation with Conv4, i.e., a convolution operation with a kernel size of 4 and a stride of 2. The split-transform-merge structure of the ResNeXt provides multiple branches of convolution, which offers a better way to retain information. This can increase the cardinality and improve the network accuracy while reducing network complexity. Therefore, to retain more original information and to reduce network complexity, the ResNeXt modules extract the first-scale features P_1 and M_1, respectively. In the latter three scales, residual learning blocks and downsampling operations (i.e., Conv4) extract the P_2–P_4 and M_2–M_4 features, respectively. The structures of the ResNeXt block [38] and residual learning block [39] used in the RMFF-UPGAN are depicted in Fig. 3(a) and 3(b). In Fig. 3(a), among the parameters used in the ResNeXt block, i.e., 1(4), 1 × 1, 4, the parameter 1(4) represents the number of channels of the PAN (MS) image, and 1 × 1 and 4 represent the kernel size and the number of convolution kernels. In Fig. 3(b), the leaky ReLU (LReLU) function is employed.
The expressions that extract the features of the PAN and MS images using the ResNeXt module are displayed in (3) and (4), respectively:

P_1 = RX(P; W_RX_P)    (3)

M_1 = RX(M; W_RX_M)    (4)

The expressions that extract the features of the PAN and MS images using the residual learning module are

P_i = RL(Conv4(P_{i−1}); W_RL_P_i), i = 2, 3, 4    (5)

M_i = RL(Conv4(M_{i−1}); W_RL_M_i), i = 2, 3, 4    (6)

where RX() and RL() denote the functions of the ResNeXt block and the residual learning block, respectively, and the W terms represent the corresponding parameters.

2) Recursive Mixed-Scale Feature Fusion: For the four-scale features of the MS image and PAN image, the prior fusion block (PFB) is designed to aggregate the information of the MS image and PAN image. The PFB is helpful for the learning of multimodal information and the fusion of preliminary features of the MS image and PAN image. A "concatenate+Conv3+residual block" mode is employed to build the PFB, illustrated in Fig. 5(a). Conv3 is a convolution operation followed by the LReLU function to implement the primary fusion and adaptively adjust the number of channels; the residual block then implements further fusion. The kernel size of the Conv3 and residual block is 3 × 3, and the stride is 1. The numbers of the convolution kernels are 32, 64, 128, and 256, respectively. The mixed-scale fusion block (MSFB) performs the fusion of information from different scales, as displayed in Fig. 5(b). The MSFB is constructed using a scale transfer block (STB), concatenation, Conv3, and a residual block, where H_i represents a fine-scale image and L_{i+1} a coarse-scale image. The STB is shown in Fig. 6. The fine-scale image H_i is downsampled by the STB to generate an image with the same scale as L_{i+1}, and is then fused with L_{i+1}. The downsampling operation is conducted by Conv4, and the numbers of kernels are 64, 128, and 256, respectively. The mixed-scale fusion yields three-scale results, i.e., Mix_f_5, Mix_f_9, and Mix_f_13.
As illustrated in Fig. 4, first, the same-scale features M_i and P_i (i = 1, 2, 3, 4) are fused by the PFB to generate P_M_i (i = 1, 2, 3, 4). Then, the MSFB fuses the prior fusion result P_M_i (i = 1, 2, 3) with the next-scale result P_M_{i+1} (i = 1, 2, 3) to generate the feature Mix_f_{i+4} (i = 1, 2, 3) with the same scale as P_M_{i+1} (i = 1, 2, 3). The mixed-scale information fusion is realized in the aforementioned manner sequentially, and the recursive fusion is carried out to generate the key information Mix_f_13. The entire fusion subnetwork constitutes a recursive mixed-scale fusion architecture, which utilizes the information of the MS and PAN images across modalities and scales to reduce the loss of information in the MS and PAN images.
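One simplified reading of this recursion, folding each fine-scale result into the next coarser scale, can be sketched as follows. `prior_fuse` and `mixed_fuse` are hypothetical stand-ins for the learned PFB and MSFB (the real MSFB also downsamples the fine input through the STB), so this shows only the data flow, not the paper's exact recursion over Mix_f_5 to Mix_f_13:

```python
def recursive_mixed_scale_fusion(ms_feats, pan_feats, prior_fuse, mixed_fuse):
    """Sketch of the recursive fusion subnetwork.

    ms_feats, pan_feats: same-scale feature lists, fine -> coarse.
    prior_fuse(m, p):    stand-in for the PFB (same-scale fusion).
    mixed_fuse(fine, coarse): stand-in for the MSFB (cross-scale fusion).
    Returns the coarsest mixed-scale result (the key information).
    """
    # prior fusion of each same-scale MS/PAN pair -> P_M_i
    pm = [prior_fuse(m, p) for m, p in zip(ms_feats, pan_feats)]
    # fold each result into the next coarser prior-fusion result
    mixed = pm[0]
    for coarse in pm[1:]:
        mixed = mixed_fuse(mixed, coarse)
    return mixed
```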
The expression of the PFB is as follows:

P_M_i = PF_i(P_i, M_i; W_PF_i), i = 1, 2, 3, 4

where P_M_i represents the prior fusion result of the ith-scale features P_i and M_i, PF_i indicates the function of the PFB, and W_PF_i represents the parameter.
The expression of the MSFB is as follows:

Mix_f_{i+4} = MF_i(STB(H_i), L_{i+1}; W_MF_i), i = 1, 2, 3

where MF_i indicates the function of the MSFB, W_MF_i represents the parameter, H_i is the fine-scale input, and L_{i+1} is the coarse-scale input.

3) Dual-Stream Multiscale Reconstruction (DSMR): The MRB is presented in Fig. 5(c). Compared with the scale of the information to be reconstructed, H represents finer-scale information, S represents same-scale information, and L represents coarser-scale information. Multiscale information needs to be converted into information of the same scale before reconstruction, and the STB is presented in Fig. 6. Coarse-scale information is converted to fine-scale information through a deconvolution operation, and fine-scale information is converted to coarse-scale information through a downsampling operation. The size of the convolution kernels of the Conv3 and residual learning block used by the MRB is 3 × 3, the stride is 1, and the numbers of kernels are 128, 64, and 32, respectively.
The proposed DSMR structure reuses the extracted low-level features for reconstruction through multiscale skip connections. The low-level features contain rich details, such as edges and contours, which can reduce the loss of details. In this way, the loss of details in the PAN image and MS image is reduced, and the spatial resolution is upgraded.

4) Reconstructed Information Rectification:
Owing to the different physical imaging mechanisms of the sensors, the relationship between the MS image and the PAN image is very complex. The band ranges of the MS image and PAN image do not exactly overlap, so a linear combination of MS image bands cannot accurately express the PAN image [4]. The detail injection model directly adds the injected details to the upsampled MS image, as in expression (1), ignoring the complex relationship between the PAN image and the MS image, which may result in spectral distortion. Therefore, we design a "concatenate+Conv1+conv(3 × 3)" mode to construct a simple reconstructed information rectification block (RIRB), which builds a nonlinear injection relation. The RIRB is displayed in the orange box in Fig. 7. The kernel size of Conv1 is 1 × 1 and the number of kernels is 12, followed by the LReLU function. The kernel size of conv(3 × 3) is 3 × 3 and the number of kernels is 4. The HM image is generated by the nonlinear mapping of the ↑M image and the reconstructed information T_R.
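A minimal sketch of this rectification idea follows. For brevity, both convolutions are reduced to per-pixel channel mixing (a 1×1 conv is exactly that; the final 3×3 conv is approximated the same way), and the channel count of T_R is an assumption, so this illustrates only the nonlinear concatenate-then-mix pattern, not the paper's trained block:

```python
import numpy as np

def lrelu(x, alpha=0.2):
    """Leaky ReLU activation."""
    return np.where(x > 0, x, alpha * x)

def rirb(ms_up, recon, w1, w2):
    """Sketch of the RIRB: concatenate ↑M with the reconstructed
    information T_R, mix channels (1x1 conv + LReLU), then project
    to the 4 output bands.

    ms_up: (H, W, 4) upsampled MS; recon: (H, W, C_r) reconstructed info
    w1: (4 + C_r, 12) Conv1 weights; w2: (12, 4) final projection weights
    """
    x = np.concatenate([ms_up, recon], axis=-1)  # (H, W, 4 + C_r)
    x = lrelu(x @ w1)                            # 1x1 conv = channel matmul
    return x @ w2                                # 4-band HM image
```

The point of the nonlinearity is that HM is no longer constrained to be ↑M plus a detail plane, which is exactly what expression (1) assumes.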
The expression for the generator of the pansharpening model is as follows:

HM = G_P(M, P; W_P)

where HM denotes the fused HRMS image, G_P indicates the function of the designed generator, and W_P is the parameter.

B. U-Shaped Relativistic Average Least-Squares Discriminator
To promote the performance and stability of the pansharpening model, we employ a relativistic average discriminator to distinguish the relative probabilities between the generated image and the real image and optimize the model using a least-squares loss function, i.e., the relativistic average least-squares discriminator (RaLSD). The architecture of the RaLSD is similar to that of Real-ESRGAN [40], enhancing the capability of the RaLSD using a U-shaped structure. The differences are that the residual structure is applied to replace the existing convolution operation, and we utilize the "concatenate+SN(conv1-1)+LReLU" mode to substitute the sum operation to increase the discriminative capacity of the network in the skip connection part. SN(conv1-1) indicates spectral normalization (SN) [41] for the convolution operation with a kernel size of 1 and a stride of 1. The structure of the proposed U-shaped RaLSD (U-RaLSD) network is illustrated in Fig. 8; it consists of a spectral discriminator U-RaLSD_pe and a detail discriminator U-RaLSD_pa, whose structures are the same. The interpretation of the colored arrows in the U-shaped structure is presented in Fig. 8, where the SN operation is conducted for every convolution operation except that in the last layer. The architectures of the DRB and URB employed in the U-shaped structure are displayed in Fig. 9(a) and 9(b). In the DRB and URB, we utilize a convolution with a stride of 2 instead of a maxpooling operation for downsampling, i.e., SN(conv3-2) refers to a convolution operation with a kernel size of 3 and a stride of 2 on which an SN operation is performed. Moreover, we employ a deconvolution with a stride of 2 instead of an interpolation operation for upsampling, i.e., SN(deconv3-2) refers to a transposed convolution operation with a kernel size of 3 and a stride of 2 on which an SN operation is performed.
The FURB operation is performed by a simple fusion, i.e., the "concatenate+SN(conv1-1)+LReLU" mode, followed by the URB. The original MS image or DHM_pa, which is the spatially reduced version of the HM image, is fed into the U-RaLSD_pe to generate relativistic probabilities. The U-RaLSD_pa takes the original PAN image or DHM_pe, which is the spectrally reduced version of the HM image, as input.
The expressions of the U-RaLSD are given in (12) and (13):

D_URaLS(Z_m) = σ(C(Z_m) − E_{Z_g∼Q}[C(Z_g)])    (12)

D_URaLS(Z_g) = σ(C(Z_g) − E_{Z_m∼R}[C(Z_m)])    (13)

where D_URaLS means the relative probability of the U-RaLSD, σ represents the sigmoid function, and C() denotes the untranslated output of the U-RaLSD. R and Q indicate the distributions of the real data Z_m (the M or P image in Fig. 8) and the fake data Z_g (the DHM_pa or DHM_pe image in Fig. 8). E_{Z_m∼R} and E_{Z_g∼Q} indicate the mean operation over the real data and fake data in a batch.
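The relativistic average outputs in (12) and (13) are straightforward to compute from the raw critic scores. A sketch, with the batch mean standing in for the expectations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relativistic_avg(c_real, c_fake):
    """Relativistic average probabilities: how much more realistic each
    real sample is than the average fake, and vice versa.

    c_real, c_fake: arrays of raw (untranslated) critic outputs C(.) per batch.
    """
    d_real = sigmoid(c_real - c_fake.mean())  # D(Z_m), Eq. (12)
    d_fake = sigmoid(c_fake - c_real.mean())  # D(Z_g), Eq. (13)
    return d_real, d_fake
```

Because each score is judged relative to the opposite batch's mean, the discriminator cannot saturate by pushing all scores toward one extreme, which is what stabilizes training.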

C. Composite Loss Function
We build a new composite loss function composed of a spatial consistency loss function, a spectral consistency loss function, a no-reference loss function, and two adversarial loss functions.
The loss function of the spatial consistency is presented as follows:

L_pc = (1/T) Σ_{t=1}^{T} (||h(HM_t) − h(P_t)||_F^2 + ||∇(HM_t) − ∇(P_t)||_F^2)

where L_pc means the spatial consistency loss function, T refers to the batch size, HM_t is the tth generated image, P_t is the tth PAN image, and ||·||_F means the Frobenius norm. h() and ∇() indicate a high-pass filter and a gradient operator to obtain the high-frequency information and gradient information of the image. The goal is to integrate the spatial information of PAN images into MS images. Since a reference image does not exist, we boost the spatial information of MS images by utilizing the high-frequency information and gradient information of PAN images.
To maintain the consistency of spectral information between the HM_t and raw MS images, the spectral consistency loss function is described as follows:

L_mc = (1/T) Σ_{t=1}^{T} ||ds(HM_t) − M_t||_F^2

where L_mc indicates the spectral consistency loss function, ds represents the down-resolution operation consisting of a blurring operation and a downsampling operation, and M_t is the raw MS image.
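The down-resolution operator `ds` and the resulting spectral consistency term can be sketched as follows. Average pooling is used here as a simple stand-in for the blur-then-decimate operation (in practice an MTF-matched blur would be used), so the function names and the pooling choice are assumptions:

```python
import numpy as np

def ds_blur_downsample(img, ratio=4):
    """Down-resolution operator `ds`: block-average (blur) and decimate.
    Assumes the spatial dimensions are divisible by `ratio`."""
    H, W = img.shape[:2]
    return img.reshape(H // ratio, ratio, W // ratio, ratio, -1).mean(axis=(1, 3))

def spectral_consistency_loss(hm_batch, ms_batch, ratio=4):
    """L_mc sketch: mean squared Frobenius distance between ds(HM_t)
    and the raw MS_t over a batch."""
    return float(np.mean([np.linalg.norm(ds_blur_downsample(hm, ratio) - ms) ** 2
                          for hm, ms in zip(hm_batch, ms_batch)]))
```

Driving ds(HM) toward M is what lets the model train at full resolution without a reference HRMS image.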
Since no reference data exist, we adopt the no-reference index QNR to measure the quality of the generated image. The desired value of the QNR is 1, i.e., the generated image has neither spectral loss nor spatial detail loss. Therefore, the expression of the no-reference loss function is as follows:

L_q = 1 − QNR

where L_q stands for the no-reference loss function. The QNR relates to the spectral loss metric D_λ and the spatial loss indicator D_S, and its representation is as follows:

QNR = (1 − D_λ)^l (1 − D_S)^v

where the expressions for D_λ and D_S are (18) and (19), and l and v are constants, generally set to 1.
D_λ = ((1 / (B(B − 1))) Σ_{n=1}^{B} Σ_{m=1, m≠n}^{B} |Q(M_n, M_m) − Q(F_n, F_m)|^p)^{1/p}    (18)

D_S = ((1/B) Σ_{n=1}^{B} |Q(F_n, P) − Q(M_n, P_L)|^q)^{1/q}    (19)

where B is the number of bands; M_n and F_n are the nth-band LRMS image and generated HRMS image, respectively; P refers to the PAN image; P_L is the low-resolution version of the PAN image; and p and q are exponents, generally set to 1. Q is the image quality index, and its representation is as follows:
Q(h, k) = (4σ_hk · h̄ · k̄) / ((σ_h^2 + σ_k^2)(h̄^2 + k̄^2))

where h and k are the inputs, σ_hk denotes the covariance between h and k, σ_h^2 and σ_k^2 represent the variances of h and k, and h̄ and k̄ indicate the means of h and k.
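Putting the pieces together, the QNR loss can be computed directly from the band images. This is a plain NumPy sketch following the standard QNR definition (p = q = l = v = 1); the function names are ours:

```python
import numpy as np

def uiqi(h, k):
    """Universal image quality index Q between two single-band images."""
    hm, km = h.mean(), k.mean()
    cov = ((h - hm) * (k - km)).mean()
    return 4 * cov * hm * km / ((h.var() + k.var()) * (hm ** 2 + km ** 2))

def qnr(ms, fused, pan, pan_low, p=1, q=1, l=1, v=1):
    """Spectral loss D_lambda, spatial loss D_S, QNR, and L_q = 1 - QNR.

    ms, fused: (H, W, B) and (H', W', B) band stacks;
    pan, pan_low: PAN plane and its low-resolution version.
    """
    B = ms.shape[-1]
    d_lam = 0.0
    for n in range(B):                 # inter-band Q differences
        for m in range(B):
            if m != n:
                d_lam += abs(uiqi(ms[..., n], ms[..., m])
                             - uiqi(fused[..., n], fused[..., m])) ** p
    d_lam = (d_lam / (B * (B - 1))) ** (1 / p)
    d_s = (sum(abs(uiqi(fused[..., n], pan) - uiqi(ms[..., n], pan_low)) ** q
               for n in range(B)) / B) ** (1 / q)
    qnr_val = (1 - d_lam) ** l * (1 - d_s) ** v
    return d_lam, d_s, qnr_val, 1 - qnr_val
```

When the fused image preserves the inter-band relations of the MS image and relates to the PAN image as the MS does to its low-resolution PAN, both losses vanish and QNR reaches 1.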
We optimize the adversarial model using a relativistic average least-squares (RaLS) loss function to improve the performance and stability of the model. The adversarial loss of the generator with the U-RaLSD_pe and U-RaLSD_pa discriminators is expressed as

L_G^URaLS = L_GDpe^URaLS + L_GDpa^URaLS

where L_G^URaLS represents the adversarial loss of the network, L_GDpe^URaLS denotes the adversarial loss with the U-RaLSD_pe discriminator, and L_GDpa^URaLS denotes the adversarial loss with the U-RaLSD_pa discriminator.
The expressions for L URaLS GD pe and L URaLS GD pa are presented as follows: where M refers to the raw MS image, HM pa indicates the spatially reduced-resolution version of HM, i.e., DHM pa in Fig. 8. P denotes the raw PAN image, and HM pe means the spectrally reduced-resolution version of HM, i.e., DHM pe in Fig. 8. The RaLS loss function for the U − RaLSD pe and U − RaLSD pa discriminators are given as follows: The total loss function is expressed as where L t is the total loss, and λ, μ, ξ, κ, and ρ are the coefficients.

A. Datasets
To verify the pansharpening performance of the designed RMFF-UPGAN model, we employ data from Gaofen-2 (GF-2) and QuickBird satellites. For visual observation, red, green, and blue bands are used as R, G, and B channels.
The four-band GF-2 data were acquired from the regions of Beijing and Shenyang, China, with a total of seven large images; one of them is used for testing and the other six for training and validation. The resolution ratio of the MS and PAN images is 4, the spatial resolutions of the PAN and MS images are 1 and 4 m, and the radiometric resolution is 10 bit. We generate 12 000 samples by randomly cropping the six training images, of which 9600 serve for training and 2400 for validation. We strictly adhere to Wald's protocol [42] to create the reduced-resolution and full-resolution testing data, with 286 samples each.
The four-band QuickBird data were acquired from the regions of Chengdu, Beijing, Shenyang, and Zhengzhou, China, with a total of eight large images; one of them is used for testing and the other seven for training and validation. The resolution ratio of the MS and PAN images is 4, the spatial resolutions of the PAN and MS images are 0.6 and 2.4 m, and the radiometric resolution is 11 bit. We generate 8000 samples by randomly cropping the seven training images, of which 6400 serve for training and 1600 for validation. We create the reduced-resolution and full-resolution testing data, with 158 samples each.
The sizes of the MS and PAN images for training and validation are 64 × 64 × 4 and 256 × 256 × 1. The sizes of the MS and PAN images of the testing data at the degraded and full resolutions are 100 × 100 × 4 and 400 × 400 × 1.
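Wald's protocol, used above to build the reduced-resolution test data, degrades both inputs by the resolution ratio so that the original MS image can act as the reference. A sketch, with average pooling standing in for the sensor-MTF blur used in practice:

```python
import numpy as np

def wald_degrade(ms, pan, ratio=4):
    """Wald's protocol sketch: return (LRMS, LR PAN, reference HRMS).

    ms:  (H, W, B) original MS image; pan: (rH, rW) original PAN image.
    Assumes spatial dimensions are divisible by `ratio`; average pooling
    is a simple stand-in for the MTF-matched blur and decimation.
    """
    def pool(img):
        H, W = img.shape[:2]
        return img.reshape(H // ratio, ratio, W // ratio, ratio, -1).mean(axis=(1, 3))

    return pool(ms), pool(pan)[..., 0], ms  # the original MS is the reference
```

This is why full-reference metrics such as SAM and ERGAS are only available at reduced resolution, while the full-resolution evaluation must rely on the QNR family.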

B. Quality Evaluation Metrics
To verify the designed RMFF-UPGAN model, we carry out two types of experiments, i.e., reduced-resolution pansharpening and full-resolution pansharpening. In addition, subjective assessment and objective index evaluation of the pansharpening results are conducted. The subjective assessment mainly compares a pansharpened image (PI) with a reference image (RI), judging the retention of spatial details and spectral information. The objective evaluation indexes include full-reference metrics and no-reference indexes.
The full-reference indexes are applied to evaluate the reduced-resolution pansharpening and compare the PI with the RI. The metrics we utilize include the structural correlation coefficient (SCC) [42], structural similarity (SSIM) [43], universal image quality index (UIQI, abbreviated as Q) [44] extended to n bands (Qn) [45], [46], spectral angle mapping (SAM) [47], and erreur relative globale adimensionnelle de synthèse (ERGAS) [48]. Specifically, the SCC determines the structural correlation between the PI and RI. The SSIM measures the similarity between the PI and RI in three aspects: luminance, contrast, and structure. The Q comprehensively measures the difference between the PI and RI in terms of correlation loss, luminance distortion, and contrast distortion; the smaller the difference, the closer the Q is to 1 and the better the PI is. The SAM evaluates the angle between the spectral vectors of the PI and RI; the smaller the angle, the closer the SAM is to 0 and the closer the PI is to the RI. The ERGAS evaluates the spectral quality of the bands over the spectral range, representing the overall condition of the spectral changes; the closer the value is to 0, the better the pansharpening effect in the spectral range.
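As a concrete reference for the two spectral measures, SAM and ERGAS can be computed as below. This is a schematic pure-Python version under simplified input conventions (flat pixel/band lists); it is not the official implementations of [47] and [48].

```python
import math

def sam(pred, ref):
    """Mean spectral angle (radians) between pansharpened and reference pixels.
    pred/ref: lists of per-pixel spectral vectors, e.g. [[b1, b2, b3, b4], ...]."""
    total = 0.0
    for p, r in zip(pred, ref):
        dot = sum(a * b for a, b in zip(p, r))
        norm_p = math.sqrt(sum(a * a for a in p))
        norm_r = math.sqrt(sum(b * b for b in r))
        # Clamp to [-1, 1] to guard acos against floating-point drift.
        total += math.acos(max(-1.0, min(1.0, dot / (norm_p * norm_r))))
    return total / len(pred)

def ergas(pred, ref, ratio=4):
    """ERGAS over spectral bands; pred/ref are lists of bands,
    each band a flat list of pixel values; `ratio` is MS/PAN resolution ratio."""
    acc = 0.0
    for pb, rb in zip(pred, ref):
        rmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(pb, rb)) / len(pb))
        mean = sum(rb) / len(rb)
        acc += (rmse / mean) ** 2
    return 100.0 / ratio * math.sqrt(acc / len(pred))
```

Both measures hit their ideal value of 0 when the PI equals the RI, consistent with the index descriptions above.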
The quality-with-no-reference (QNR) indicators do not need the RI when evaluating the pansharpening performance; they include D_λ, D_S, and QNR [49]. The ideal value of the SCC, SSIM, Q, Qn, and QNR is 1: the closer to 1, the better the pansharpening effect. The ideal value of the SAM, ERGAS, D_λ, and D_S is 0: the closer to 0, the better the pansharpening effect.
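The no-reference protocol of [49] builds on the Q (UIQI) index [44] computed between band pairs. The sketch below shows Q and the way QNR combines the two distortion terms, assuming the customary exponents α = β = 1; the aggregation of Q values over all band pairs into D_λ and D_S is omitted for brevity.

```python
def uiqi(x, y):
    """Universal image quality index Q between two flat band vectors:
    correlation, luminance, and contrast terms combined into one score."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 4.0 * cxy * mx * my / ((vx + vy) * (mx * mx + my * my))

def qnr(d_lambda, d_s, alpha=1.0, beta=1.0):
    """QNR combines the spectral (D_lambda) and spatial (D_S) distortions;
    ideal case: both distortions 0, hence QNR = 1."""
    return (1.0 - d_lambda) ** alpha * (1.0 - d_s) ** beta
```

For identical inputs Q evaluates to 1, and zero distortions give QNR = 1, matching the ideal values stated above.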

C. Implementation Details
The implementation framework is TensorFlow, and the experimental setup involves an Intel Xeon CPU and an NVIDIA Tesla V100 PCIe GPU with 16 GB of video memory. In the training phase, we optimize the model using the Adam optimizer [50], setting the batch size to 8 and the number of epochs to 40. The initial learning rate is set to 2 × 10^-4. The factors in (26) are set following [35] as follows: λ = 2 × 10^-4, μ = 1 × 10^-4, and ξ = κ = ρ = 1.
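For reference, the Adam update [50] with the initial learning rate 2 × 10^-4 can be written out explicitly. This scalar re-implementation is only illustrative: β1, β2, and ε are the standard Adam defaults, which the text does not specify, and the actual training uses TensorFlow's built-in optimizer.

```python
import math

def adam_step(theta, grad, m, v, t, lr=2e-4, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update over lists of scalar parameters.
    m, v: running first/second moment estimates; t: 1-based step index."""
    new_theta, new_m, new_v = [], [], []
    for p, g, mi, vi in zip(theta, grad, m, v):
        mi = b1 * mi + (1 - b1) * g          # biased first moment
        vi = b2 * vi + (1 - b2) * g * g      # biased second moment
        mhat = mi / (1 - b1 ** t)            # bias correction
        vhat = vi / (1 - b2 ** t)
        new_theta.append(p - lr * mhat / (math.sqrt(vhat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v
```

On the first step the bias-corrected update is approximately lr · sign(g), i.e., a parameter moves by about 2 × 10^-4.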
To compare the pansharpening results of various approaches more fairly, the CNN-based approaches are completed on the GPU and the CS/MRA-based approaches are performed on the CPU.

D. Reduced Resolution Experiments
To evaluate the pansharpening performance of the presented RMFF-UPGAN model, we carry out comparative experiments against advanced traditional methods and GAN-based methods. The compared traditional methods are GSA [9], BDSD [11], SFIM [12], and MTF-GLP [51]. The GAN-based methods include RED-cGAN [29], PsGAN [30], PGMAN [35], and DIGAN [31]. We conduct the comparisons on the GF-2 and QuickBird data at both the reduced and full resolutions. The averages of the quantitative evaluation of the GF-2 and QuickBird results at the reduced resolution are listed in Tables I and II, from which it is noticed that the proposed RMFF-UPGAN is optimal in all six indicators, demonstrating superior pansharpening capability.
We display and analyze the experimental outcomes on GF-2 and QuickBird as follows. For better observation of the differences among the comparative approaches, we represent the differences between the PI and RI via the average spectral difference map (ASDM), i.e., the per-pixel spectral angle between the PI and RI, and the average intensity difference map (AIDM). The pansharpening results of all the compared models on the GF-2 testing data at the degraded resolution are shown in Fig. 10. Fig. 10(a) is the upsampled degraded-resolution MS image, Fig. 10(b) is the corresponding PAN image, Fig. 10(c)-(k) illustrate the pansharpening results of the GSA, BDSD, SFIM, MTF-GLP, RED-cGAN, PsGAN, PGMAN, DIGAN, and RMFF-UPGAN models, respectively, and Fig. 10(l) is the reference image, i.e., the ground truth (GT). To visualize the details more distinctly, we magnify the contents of the red and yellow boxes in Fig. 10, as depicted in Fig. 11. From Fig. 10, it is evident that the pansharpening result of the RMFF-UPGAN model is the most similar to the GT regarding both spectrum and structure.
From Fig. 11, it is clearly observed that the result of the GSA approach exhibits spectral distortion in comparison with the GT. For the contents in the red box, spectral distortion caused by a blue rendering occurs for all the approaches except the proposed RMFF-UPGAN model. Furthermore, the outcomes of the RED-cGAN, PGMAN, and DIGAN approaches suffer from blurred edges. The result of the RMFF-UPGAN approach is visually the closest to the GT. As to the contents in the yellow box, the outcomes of the BDSD, RED-cGAN, and PGMAN are rather fuzzy. Moreover, only the RMFF-UPGAN accurately captures the spectral information in the white ellipse, whereas the others fail to express it precisely.
Figs. 12 and 13 depict the ASDM and AIDM. To facilitate comparison, the differences are highlighted in colors: the color bar varies from dark blue to red, and the values vary from 0 to 1. From Figs. 12 and 13, we can evidently observe that the difference between the PI of the RMFF-UPGAN approach and the GT is the smallest, and its fusion quality is optimal.
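The two difference maps can be sketched as per-pixel statistics. The version below is schematic, assuming the PI and RI are given as lists of per-pixel spectral vectors; the color scaling used for display in Figs. 12 and 13 is a separate normalization step not reproduced here.

```python
import math

def asdm(pred, ref):
    """Per-pixel spectral-angle map between the PI and RI (angles in radians)."""
    out = []
    for p, r in zip(pred, ref):
        dot = sum(a * b for a, b in zip(p, r))
        denom = (math.sqrt(sum(a * a for a in p))
                 * math.sqrt(sum(b * b for b in r)))
        out.append(math.acos(max(-1.0, min(1.0, dot / denom))))
    return out

def aidm(pred, ref):
    """Per-pixel mean absolute intensity difference across bands."""
    return [sum(abs(a - b) for a, b in zip(p, r)) / len(p)
            for p, r in zip(pred, ref)]
```

A perfect PI yields an all-zero map under both measures, i.e., a uniformly dark-blue rendering under the color bar described above.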
Table III lists the objective assessment indicators between the PI and GT on the GF-2 testing data at the reduced resolution. In Table III, the bold indicators reflect that the RMFF-UPGAN is superior for all the indices and has the best pansharpening performance amongst all the comparative approaches. The SAM and ERGAS of the RMFF-UPGAN approach are minimal, signifying the least spectral distortion and a result that retains more spectral information. Combining Figs. 10-13 with the SCC and SSIM indices in Table III, we can distinctly see that the RMFF-UPGAN also preserves the spatial structure most faithfully.

E. Full-Resolution Experiments
This section presents the pansharpening experiments on the full-resolution testing data of GF-2 and QuickBird. From the quantitative results, it is evident that the proposed RMFF-UPGAN approach is optimal for the three metrics, D_λ, D_S, and QNR, revealing that it best preserves spectral and structural information and achieves the optimal pansharpening performance. The results of the full-resolution experiments performed on the GF-2 and QuickBird data are described as follows. Fig. 17 presents the visual assessment of the pansharpening results of the comparative approaches on the GF-2 testing data at the full resolution. Fig. 17(a) is the raw MS image, Fig. 17(b) is the upsampled MS image, Fig. 17(c) is the raw PAN image, and Fig. 17(d)-(l) are the pansharpening results of all the comparative models. Table VI lists the objective assessment indexes of the PIs of the various comparative models on the GF-2 testing data at the full resolution. From Fig. 17, we evidently note that the results of the GSA and MTF-GLP exhibit rather severe spectral distortion, as presented in the yellow ellipse of the figure. The outcomes of the BDSD and SFIM are blurry and exhibit ringing. Compared to the MS image, the results of the RED-cGAN and PsGAN exhibit darker colors and less sharp edges in Fig. 17. The pansharpening results of the PGMAN and RMFF-UPGAN models are better in detail retention; however, for the preservation of spectral information, the RMFF-UPGAN approach is superior. Furthermore, from Table VI, it is noticed that the D_λ, D_S, and QNR of the RMFF-UPGAN model are optimal; therefore, the designed RMFF-UPGAN method is superior.
Fig. 18 is the visual assessment of the pansharpening results on the QuickBird testing data at the full resolution. To provide a better overview of the details, we magnify the contents of the red, yellow, and orange rectangles in Fig. 18; the enlargements are depicted in Fig. 19. Spectral distortion and structural distortion (as highlighted in the upper left of Fig. 19) exist in the result of the SFIM approach, and artifacts arise in the results of the SFIM and MTF-GLP approaches. The result of the DIGAN model is also imperfect. It can be observed from Table VII that the result of the RMFF-UPGAN is the best in the D_λ, D_S, and QNR indexes, illustrating better conservation of spectral information and details and a more satisfactory pansharpening capability.

F. Ablation Experiment
Four ablation experiments are conducted to verify the usefulness of the structure of the presented network, all completed on the QuickBird testing data at the reduced and full resolutions. The first validates the effectiveness of the ResNeXt block in the RMFF-UPGAN model, the second proves the usefulness of the CIM in the reconstruction phase, the third verifies the validity of the RIRB in the RMFF-UPGAN model, and the fourth confirms the validity of the U-shaped structure in the discriminator.

1) Effectiveness of the ResNeXt Module:
The ResNeXt block is a vital component of the proposed RMFF-UPGAN model. We replace the ResNeXt block with a plain residual block, keeping the other parameters of the network invariant, recorded as NX. The objective metrics in Table VIII reveal that the results of the RMFF-UPGAN are better than those of the NX on the QuickBird testing data with reduced and full resolutions, demonstrating the positive impact of the ResNeXt block on feature extraction.
2) Validity of the CIM: The CIM in the reconstruction stage serves to complement the spectral and structural information. We remove the CIM for comparison, with the other parameters of the network remaining invariant, denoted as NMF. As presented in Table VIII, the pansharpening property of the RMFF-UPGAN outperforms the NMF on the reduced-resolution data. For the full-resolution data, the NMF is preferable to the RMFF-UPGAN for the D_λ index, but for the D_S and QNR metrics, the RMFF-UPGAN is superior to the NMF and preserves more spatial structure information. It can also be noticed that, for most metrics, the NX is better than the NMF, indicating that the CIM plays a more effective role in the reconstruction stage for retaining spectral and structural information.
3) Effectiveness of the RIRB: We generate HM by replacing the RIRB with a convolutional layer and an addition operation, keeping the other parameters of the network unchanged, named NRM. The capability of the RMFF-UPGAN is superior to that of the NRM on data with decreased and full resolutions, as reflected by the metrics in Table VIII, revealing that the RIRB reduces spectral distortion and structural distortion.

4) Validity of the U-Shaped Discriminator:
We remove the upsampling structure from the U-shaped discriminator and only retain the downsampling component to form a plain discriminator, holding the other parameters of the network invariant, denoted as NU. As presented in Table VIII, the pansharpening capability of the RMFF-UPGAN is better than that of the NU on the reduced- and full-resolution data, signifying that the U-shaped discriminator boosts the capability of the network. Furthermore, regarding the contribution to enhancing the performance of the network, the CIM contributes the most, followed by the U-shaped discriminator, the ResNeXt block, and the RIRB.

G. Computation and Time
The model complexity is discussed in terms of model parameters and test time, as presented in Table IX. For the traditional models, only the test time is given; both the parameters and the test time of the DL-based models are provided. The test time is obtained by averaging over the 286 pairs of GF-2 test data at the full resolution. The sizes of the MS and PAN images of the GF-2 test data are 100 × 100 × 4 and 400 × 400 × 1, respectively. The experiments of the first four traditional models are implemented on the above-mentioned CPU, and those of the latter five DL-based models on the previously mentioned GPU. The unit M of the parameters in Table IX denotes 10^6, and the unit of time is seconds (s). From Table IX, it can be observed that the GSA and SFIM approaches are relatively simple and fast, while the BDSD and MTF-GLP approaches are relatively complex and slow. Among the DL-based models, the RMFF-UPGAN model has the most parameters, followed by the PGMAN, while the PGMAN has the least test time; this is because the PGMAN processes only the high-frequency information of the MS and PAN images, while the other models process the entire image. The parameters of the RED-cGAN and PsGAN are very close, and their test times are similar. Because of the downsampling in the RMFF-UPGAN model, the image size is reduced, so the computation remains relatively small. In terms of model size and speed, the RMFF-UPGAN model is efficient.
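The per-pair test time reported above amounts to a simple wall-clock average over the test set. A minimal sketch of such a measurement is shown below; the function name and the callback interface are our own, not taken from the paper's code.

```python
import time

def average_test_time(run_once, n_pairs):
    """Average wall-clock seconds of one pansharpening call over n_pairs pairs.
    `run_once` is a zero-argument callable performing one inference."""
    start = time.perf_counter()
    for _ in range(n_pairs):
        run_once()
    return (time.perf_counter() - start) / n_pairs
```

For the GF-2 setting above, `n_pairs` would be 286; `time.perf_counter` is used because it is monotonic and has the highest available resolution for interval timing.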

V. CONCLUSION
In this article, we model the RMFF-UPGAN to boost spatial resolution and preserve spectral information. RMFF-UPGAN comprises a generator and two U-shaped discriminators. A dual-stream trapezoidal branch is designed in the generator to obtain multiscale information, and the ResNeXt and residual learning blocks are employed to obtain the spatial structure and spectral information at four scales. Further, a recursive mixed-scale feature fusion subnetwork is designed: a prior fusion is performed on the extracted MS and PAN features of the same scale, and a mixed-scale fusion is then conducted on the prior fusion results of the fine and coarse scales. The fusion is executed sequentially in this manner, building a recursive mixed-scale fusion structure and finally generating the key information. The CIM is also designed for the reconstruction of the key information to compensate for the information loss. The nonlinear RIRB is developed to overcome the distortion induced by neglecting the complicated relationship between the MS and PAN images. Two U-shaped discriminators are designed and a new composite loss function is defined. The RMFF-UPGAN model is validated on the GF-2 and QuickBird datasets, and the experimental outcomes reveal that it outperforms the prevalent approaches regarding both visual assessment and objective indicators. The RMFF-UPGAN model has superior performance in enhancing the spatial resolution and retaining the spectral information, which boosts the fusion quality. Our further work will investigate unsupervised models to further strengthen the capability of the pansharpening network.