Integrating Spectral and Spatial Bilateral Pyramid Networks for Pansharpening

Yue Que, Member, IEEE, Hanqing Xiong, Xue Xia, Jie You, and Yong Yang, Senior Member, IEEE

Multispectral (MS) image pansharpening is a significant technology for remote sensing image analysis, aiming to restore a high-resolution MS image by merging a high-resolution panchromatic image with a low-resolution MS image. While convolutional neural networks (CNNs) have garnered considerable attention for their exceptional fusion capabilities in recent years, existing CNN-based methods cannot effectively integrate spectral–spatial information. In this article, we introduce a novel MS pansharpening framework that integrates spectral and spatial networks in a bilateral pyramid form, allowing for the extraction of hierarchical spectral–spatial information. The proposed reduced residual dense (RRD) module serves as the fundamental building block of the spatial network. The RRD module gradually reduces the dimension of feature maps and combines the concatenation of its block outputs with global residual learning for comprehensive feature representation. In the spectral network, we present a cooperative attention fusion module to further enhance the correlation between spectral and spatial features. Through extensive experiments conducted on benchmark simulated and real datasets, our proposed framework consistently outperforms state-of-the-art methods, demonstrating its effectiveness in MS image pansharpening applications.



I. INTRODUCTION
Remote sensing images are obtained through satellite imaging sensors, enabling the capture and depiction of the Earth's surface features [1]. These images play a crucial role in various Earth observation applications, including agriculture [2], territorial management [3], traffic management [4], and many other fields. However, obtaining high-resolution remote sensing images directly is challenging due to various physical and technical limitations [5].
Pansharpening addresses this limitation by integrating a low-resolution MS image with a high-resolution panchromatic (PAN) image [6]. This integration produces an output that maintains spatial resolution comparable to the PAN image while preserving the spectral information of the MS image [7]. The resulting high-resolution MS (HRMS) images offer abundant spectral and spatial information, facilitating precise observation and identification of geometric features, such as the size and shape of target objects. Moreover, they also provide detailed insights into the intrinsic physical characteristics of the objects [8]. Several categories of techniques have been proposed to address this challenge, including classical component substitution (CS), multiresolution analysis (MRA), variational optimization (VO), machine learning (ML), and deep learning (DL) based methods [7], [9], [10].
Classical CS approaches include principal component analysis [11], intensity–hue–saturation [12], and Gram–Schmidt [13]. These methods utilize linear transformations to decompose low-resolution upsampled MS images into spectral and spatial components and then replace the spatial components with the PAN image. Despite their simple physical meaning, rapid execution, and easy implementation, CS methods frequently neglect the differences between the two image types during fusion, which can destroy low-frequency information in the images and result in severe spectral distortion.
The MRA provides an alternative strategy to CS: it uses a linear decomposition technique to extract spatial structures from a PAN image and does not require training data [14]. These structures are then injected into the low-resolution MS image to generate the fused imagery. This category of methods encompasses Laplacian pyramids [15], the wavelet transform [16], and the contourlet transform [17]. These methods are commonly known as spatial methods because they largely avoid spectral distortion. However, they may degrade spatial quality to some extent, and their performance depends on the accuracy of image registration, which can cause problems such as aliasing and edge artifacts.
VO approaches mainly exploit the correlation among the input PAN image, the original MS image, and the target HRMS image to produce an appropriate fusion scheme. Representative VO methods encompass a pansharpening model based on the construction of a sparse prior and regularization term [18], regularization models emphasizing piecewise smoothness [19], and image-based nonlocal similarity [20]. Wu et al. [21] proposed a low-rank tensor-completion-based framework for MS image pansharpening, in which a local-similarity-based dynamic detail mapping regularizer was developed to better characterize spatial structure information. However, the execution speed of VO methods is relatively slow.
With the development of computer software and hardware, ML-based methods have demonstrated significant potential in pansharpening. ML methods encompass dictionary learning [22], [23], compressive sensing techniques [24], K-means clustering for parameter estimation [25], and conditional random fields [26].
Recently, with the flourishing of DL [27], [28], DL-based PAN and MS image fusion algorithms have become a new research hotspot. Owing to the superior nonlinear fitting capabilities of DL, DL-based approaches can achieve better fusion results. Masi et al. [29] introduced a novel pansharpening technique utilizing CNNs and achieved very promising results. However, their network architecture is limited to three convolutional layers, so their algorithm cannot fully exploit the fitting capabilities of deep neural networks and leaves considerable room for optimization and improvement. Yang et al. [30] introduced a deep network (PanNet) designed to generate high-frequency information and inject it into upsampled MS images. Building on this, Yuan et al. [31] introduced a multiscale and multidepth CNN named MSDCNN for remote sensing image pansharpening in 2018; their approach incorporated multiscale feature extraction and residual learning into the foundational CNN architecture, contributing to advancements in the field. Later, He et al. [32] introduced a detail-injection-based pansharpening DL architecture that effectively learns the MS details in an end-to-end pattern. This approach resembles the physical process of traditional filtering methods and can attain a minimal initial loss, resulting in accelerated convergence and outstanding pansharpening performance. Wang et al. [33] constructed a 44-layer network using densely connected structures, which increased the depth of the network and improved its performance to some extent. However, as network depth increases, training becomes increasingly challenging, and simply adding layers cannot fully exploit the characteristics of the input image.
Yang et al. [34] introduced an innovative progressive cascade deep residual network comprising two residual subnetworks to address the loss of details. Although these algorithms have been continuously enhanced in network depth and width and achieve relatively good performance, they do not fully exploit the multiscale features of the images. Fang et al. [35] proposed a parallel pyramid CNN for pansharpening and achieved superior performance on various datasets. Yang et al. [36] proposed a dual-stream CNN with residual information enhancement to strengthen the information exchange and sharing between features of images with different resolutions. Li et al. [37] put forth a pansharpening method built on a CNN–pyramid transformer structure with a no-reference loss. Liang et al. [38] introduced a DL architecture that leverages parallel multiscale attention mechanisms (PMACNet) to enhance the pansharpening process, offering improved spatial and spectral quality in HRMS imagery. Deng et al. [39] proposed a pyramid shuffle-and-reshuffle transformer for the MS and hyperspectral image fusion task; this model combines the shuffle-and-reshuffle strategy with multiscale feature extraction to reduce the required computational resources. Hou et al. [40] presented a bidomain modeling paradigm for pansharpening (BiMPan), which combines local spectral specificity with global spatial detail. However, these approaches treat the features of PAN and MS images equivalently and cannot comprehensively extract the distinct features of each, which can reduce pansharpening quality. In addition, the pansharpening community commonly employs multiscale-network-based methods to extract global structure in the spatial domain, but such methods usually suffer from limited receptive fields.
Most existing DL-based pansharpening networks overlook the intrinsic correlation between PAN and MS images. These approaches frequently emphasize enhancing performance by increasing network depth and complexity, resulting in tremendous computational demands and information redundancy. To tackle these limitations, this study aims to achieve a better balance between performance and complexity for remote sensing pansharpening. Instead of simply stacking network layers, we propose a novel framework that enables end-to-end fusion of multiple branches and multiple inputs. Inspired by the bilateral structure [41], which is widely applied in lightweight vision models, we develop a structure that integrates spectral and spatial networks to enhance the efficiency of pansharpening. Different from typical bilateral models that employ different branches to model the spatial and spectral features of the same input, we vary the forms of the spectral and spatial branch inputs. Furthermore, the spectral branch can be regarded as a super-resolution task; feeding heterogeneous information into the spectral and spatial branches and constructing an auxiliary learning task in another domain further aids the training of our deep network.
Our main contributions can be summarized as follows.

1) We propose a novel bilateral pyramid framework to integrate spectral and spatial networks for efficient remote sensing pansharpening.

2) We introduce a reduced residual dense (RRD) module for extracting deep features with an expandable receptive field and hierarchical information. This module improves the performance of the spatial network while saving computational cost.

3) We design a cooperative attention fusion (CAF) module, which leverages the internal relations between the spectral and spatial subnetworks. It combines multiple attention mechanisms to aggregate and enhance features learned from the spatial branch.

The rest of this article is organized as follows. In Section II, the proposed bilateral pyramid pansharpening architecture is introduced, followed by a detailed presentation of the proposed RRD module and CAF module. Section III provides a comprehensive report and analysis of the experimental results. The ensuing discussion is presented in Section IV. Finally, Section V concludes this article.

II. PROPOSED METHODS
Building on the analysis and motivation discussed above, this section provides a comprehensive description of each component of our proposed network, including an in-depth exploration of the bilateral pyramid architecture and the construction of its branches. In this article, we propose a novel framework aimed at addressing the challenges of spectral distortion and detail loss in remote sensing pansharpening. As illustrated in Fig. 1, a spectral branch systematically enhances spectral information at multiple pyramid levels, from low to high resolution. Concurrently, a spatial branch directly processes the full-resolution input, focusing on enhanced high-frequency details. Furthermore, we introduce a novel feature fusion module that incorporates branchwise cooperative attention for the effective fusion of branch features.

A. Overview
Many existing methods initiate the pansharpening process by interpolating and upsampling a low-resolution MS image to match the size of the PAN image. The upsampled MS image and PAN image are then merged along the spectral dimension to create the network input. Notably, this input is already the size of the target high-resolution image, obviating the need for any upsampling operations in the intermediate layers of the network.
To fully integrate spectral–spatial features, we propose a bilateral pyramid framework integrating a pair of spectral and spatial branches. In the spatial branch, we concatenate the upsampled MS image and the PAN image with its high-frequency residuals (the PAN Laplacian pyramid) as inputs, which compels the spatial branch to learn supplementary spatial details. The high-frequency residuals are computed by the Laplacian pyramid

$$L_i = P_i - \mathrm{up}\big(P_{i+1}\big), \qquad P_{i+1} = \big(g(P_i)\big)\!\downarrow_2, \qquad P_1 = P \tag{1}$$

where P represents the PAN image, g(·) denotes the Gaussian filter, i indexes the levels of the pyramid, ↓2 denotes dyadic downsampling, and up(·) is the corresponding upsampling. The outputs of the spatial branch are four sets of feature maps at different resolutions of the concatenated input image. The pyramid can be viewed as a suite of high-pass filter templates engineered to extract diverse spatial details from PAN images. The PAN image forms a pivotal basis for spatial detail restoration; however, conventional methods often employ it directly as a network input, so the high-frequency information inherent in PAN images is not fully utilized. This observation motivates the PAN Laplacian pyramid: by integrating the extracted details back into the network alongside the PAN image, our method significantly enriches the utilization of spatial information.
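As a concrete illustration, the following is a minimal NumPy/SciPy sketch of the residual computation in (1); the Gaussian width, level count, and interpolation order are our assumptions, not values reported in the article.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def pan_laplacian_pyramid(pan: np.ndarray, levels: int = 3, sigma: float = 1.0):
    """Return the high-frequency residuals L_1..L_levels of a 2-D PAN image."""
    residuals = []
    current = pan.astype(np.float32)                    # P_1 = P
    for _ in range(levels):
        blurred = gaussian_filter(current, sigma)       # g(P_i)
        down = blurred[::2, ::2]                        # P_{i+1}: dyadic downsampling
        up = zoom(down, 2, order=1)                     # up(P_{i+1}), bilinear
        up = up[: current.shape[0], : current.shape[1]] # guard odd sizes
        residuals.append(current - up)                  # L_i in (1)
        current = down
    return residuals
```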
The spectral branch consists of three pyramid levels used for super-resolving the MS image with a scale factor of 4. Unlike existing methods [40], [53], which only use interpolation to preupsample the MS image to PAN size, our spectral network directly extracts multilevel spectral features from original-resolution MS images. To better fuse the detail features from the spatial branch with the spectral branch, a cooperative attention-based feature fusion method is introduced. A standard convolutional operation then generates the final pansharpened image from the fused feature maps.

B. RRD Module
The RRD module is a critical component of the spatial network. Fig. 2 illustrates the structure of the RRD module (the stride-2 convolutions are removed when downsampling is not required). Each module consists of four blocks, each comprising a convolutional layer, a batch normalization layer, a leaky ReLU activation layer, and a convolutional layer with a kernel size of 1. We use ConvB_i to represent the operations of the ith Conv block. Hence, the output of the ith block is formulated as

$$x_i = \mathrm{ConvB}_i^{k_i}(x_{i-1}) \tag{2}$$

where k_i is the kernel size of the convolutional layer, and x_{i−1} and x_i are the input and output of the ith block, respectively. In the RRD module, we use a kernel size of 1 for the first block and 3 for the rest. Inspired by Fan et al. [42], if the number of channels at the input of the RRD module is C, the number of channels at the output of the ith Conv block is C/2^i, except that the last Conv block has the same number of filters as the previous one. In most high-level computer vision tasks, it is common to increase the number of channels in deeper layers. For low-level vision tasks, however, our emphasis lies on achieving an expandable receptive field and integrating multiscale information. Shallow layers require an adequate number of channels to encode finer grained information with a smaller receptive field, whereas deeper layers, with their larger receptive fields, prioritize the integration of higher level information; using as many channels in deep layers as in shallow ones may cause information redundancy. Downsampling operations, when required, are placed in the second block. To enrich feature information, skip-path connections link the output feature maps of each block. Before concatenation, the feature maps produced before the second block are downsampled to a consistent spatial size through a convolutional operation with a stride of 2, as shown in Fig. 2.
Finally, we further design a global residual learning path to develop hierarchical features. We first use two convolutions with a batch normalization layer to convert the size of the input feature to that of the densely concatenated feature. Residual learning has been demonstrated to be beneficial for constructing deeper network structures; it not only facilitates faster convergence but also enhances stability compared with a direct mapping network. Following the global residual learning, we can generate complete multiscale features. In our network, the final output of the RRD module is

$$x_O = \big[x_1, x_2, x_3, x_4\big] + \mathcal{F}(x_I) \tag{3}$$

where x_I and x_O denote the RRD module input and output, respectively, x_1, x_2, x_3, and x_4 are the feature maps from the four blocks, [·] denotes channelwise concatenation, and F(·) is the two-convolution projection of the input described above. Considering efficiency, we chose concatenation as our fusion method. The design of the RRD module is motivated by two considerations. First, it significantly reduces computational complexity by shrinking the filter numbers of the blocks in a geometric progression. Second, by connecting the features of all blocks and combining them with the global residual, the final output of the RRD module preserves an expandable receptive field and hierarchical information: in RRD, each block has a receptive field of a different size, and the output collects multiscale information from all blocks.
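To make the block structure concrete, here is a minimal PyTorch sketch of an RRD module as we read this subsection; the leaky ReLU slope, padding choices, and the exact form of the alignment and projection convolutions are our assumptions.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Sequential):
    """Conv -> BN -> LeakyReLU -> 1x1 Conv, one RRD block as described above."""
    def __init__(self, c_in, c_out, k, stride=1):
        super().__init__(
            nn.Conv2d(c_in, c_out, k, stride=stride, padding=k // 2),
            nn.BatchNorm2d(c_out),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(c_out, c_out, 1),
        )

class RRD(nn.Module):
    def __init__(self, c: int, downsample: bool = False):
        super().__init__()
        s = 2 if downsample else 1
        # Channel widths shrink geometrically: C/2, C/4, C/8, then C/8 again.
        self.b1 = ConvBlock(c, c // 2, k=1)
        self.b2 = ConvBlock(c // 2, c // 4, k=3, stride=s)  # optional downsampling
        self.b3 = ConvBlock(c // 4, c // 8, k=3)
        self.b4 = ConvBlock(c // 8, c // 8, k=3)
        # Bring x1 to the same spatial size before concatenation (stride-2 conv).
        self.align = (nn.Conv2d(c // 2, c // 2, 3, stride=2, padding=1)
                      if downsample else nn.Identity())
        # Global residual: project the input to the concatenated width (= C here).
        cat_c = c // 2 + c // 4 + c // 8 + c // 8
        self.proj = nn.Sequential(
            nn.Conv2d(c, cat_c, 3, stride=s, padding=1),
            nn.BatchNorm2d(cat_c),
            nn.Conv2d(cat_c, cat_c, 1),
        )

    def forward(self, x):
        x1 = self.b1(x)
        x2 = self.b2(x1)
        x3 = self.b3(x2)
        x4 = self.b4(x3)
        dense = torch.cat([self.align(x1), x2, x3, x4], dim=1)
        return dense + self.proj(x)                         # eq. (3)
```

For example, RRD(64, downsample=True) maps a 64-channel feature map to a 64-channel output at half the spatial resolution.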

C. Bilateral Pyramid Network Structure
The spatial branch consists of four stages, each downsampling the input spatial resolution by a factor of 2. Stage 1 is typically considered the shallow layer for extracting appearance features. For efficiency, we use only two Conv blocks with a kernel size of 3 in Stage 1, while each of the other stages contains two RRD modules. In these RRD-based stages, the first RRD module downsamples the input feature resolution by a factor of 2, and the remaining RRD modules keep the feature resolution constant. The resolution of the final output feature of the spatial branch is one-eighth of the input, which fully extracts the multiscale features of the input. Table I details the design of the proposed spatial network.
The spectral branch is similar to a simple image super-resolution network: the input is an MS image, and the output is an HRMS image. In our architecture, the shallow features of the MS image are first extracted using two convolutional layers and then fused with the features of the last two stages of the spatial branch. Next, a set of operations is repeated twice, upsampling the fused features by a factor of 2 with transposed convolutional layers initialized with a 4×4 bilinear kernel. The upsampled features are then fused with the output features of stages 2 and 1 of the spatial branch through the fusion module to generate higher resolution features. The whole model is a stacked CNN in which each layer shares the same structure. We jointly train the upsampling layers with all other layers to improve the learning capability of the transposed convolutions.
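The bilinear initialization of a stride-2 transposed convolution can be written compactly; the snippet below is a hedged sketch in which the channel count of 64 is a placeholder and a one-to-one channel mapping is assumed.

```python
import torch
import torch.nn as nn

def bilinear_kernel(channels: int, k: int = 4) -> torch.Tensor:
    """Build a (channels, channels, k, k) bilinear interpolation kernel."""
    factor = (k + 1) // 2
    center = factor - 1 if k % 2 == 1 else factor - 0.5
    og = torch.arange(k, dtype=torch.float32)
    filt = 1 - torch.abs(og - center) / factor          # 1-D bilinear profile
    kernel2d = filt[:, None] * filt[None, :]            # separable 2-D kernel
    w = torch.zeros(channels, channels, k, k)
    for c in range(channels):
        w[c, c] = kernel2d                              # identity channel mapping
    return w

# Stride-2 upsampling layer; its weights start as bilinear interpolation and
# are then trained jointly with the rest of the network.
up = nn.ConvTranspose2d(64, 64, kernel_size=4, stride=2, padding=1, bias=False)
with torch.no_grad():
    up.weight.copy_(bilinear_kernel(64))
```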

D. CAF Module
Using concatenation or addition to integrate features from various branches is common practice. In some works, an attention mechanism is employed to independently reweight the channels of each feature map [43], [44]. Nevertheless, for the pansharpening task, assuming equal contributions of the spectral and spatial branch features to feature integration is unreasonable. Guo et al. [45] mine the relationship between detailed spatial information and high-level semantic information and design a relation-aware feature fusion method. Furthermore, the convolution operation plays a crucial role in extracting informative features by amalgamating cross-channel and spatial information. To leverage the cooperative correlation between spatial detail information and spectral information, we introduce a CAF module, as illustrated in Fig. 3. In the proposed CAF module, the feature maps from the two branches are first fused along the channel dimension, and spatial attention is then adopted to emphasize meaningful features along the spatial dimensions.
Let F_S ∈ R^{H_S×W_S×C} and F_D ∈ R^{H_D×W_D×C} represent the feature maps from the spectral and spatial branches, respectively. First, channel attention is calculated as

$$\mathrm{CA} = \mathrm{ConvB}\big(\mathrm{GAP}_s(F)\big)$$

where GAP_s denotes average pooling over the spatial dimensions and ConvB is the same block introduced in Section II-B. Applying this to F_S and F_D yields attention vectors CA_s, CA_d ∈ R^C, which are subsequently separated in order into k groups of length r, denoted G_s, G_d ∈ R^{k×r}. We then define the correlation matrix between G_s and G_d by taking the inner product of each group pair. The adjustment factor M ∈ R^C is obtained by applying the sigmoid function σ, scaled by a learnable parameter γ, to these group correlations, and the fused features F_1 are computed by elementwise multiplication (denoted ∘) of M with the branch features.
Next, we compute a spatial attention map for the reweighted features R_s and R_d by leveraging the interspatial relationships within the features. In contrast to the channel attention mechanism, spatial attention focuses on where the informative parts are located. The computation begins with average pooling along the channel axis

$$\mathrm{SA} = \mathrm{GAP}_c(R)$$

where GAP_c denotes average pooling in the channel dimension. This yields spatial attention maps SA_s, SA_d ∈ R^{H_S×W_S×1} for R_s and R_d, respectively. These maps are then concatenated to form an efficient representation of the relationship between R_s and R_d and convolved by a standard convolutional layer, generating the modulation factor M ∈ R^{H_S×W_S×1}

$$M = \sigma\big(\mathrm{conv}_{7\times7}\big([\mathrm{SA}_s, \mathrm{SA}_d]\big)\big)$$

where conv_{7×7} is a convolution operation with a kernel size of 7 × 7. The fused features F_2 are then obtained by modulating R_s and R_d with M. Finally, we combine the fused features of the two parts to obtain the final fused feature F.
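The spatial-attention half of the module can be sketched as follows; since the exact rule by which M modulates the two branch features is not fully spelled out above, the complementary blend M ∘ R_s + (1 − M) ∘ R_d in this snippet is our assumption.

```python
import torch
import torch.nn as nn

class SpatialCoAttention(nn.Module):
    """Channel-pooled maps from both branches -> 7x7 conv -> sigmoid gate."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, r_s: torch.Tensor, r_d: torch.Tensor) -> torch.Tensor:
        sa_s = r_s.mean(dim=1, keepdim=True)    # GAP_c over the channel axis
        sa_d = r_d.mean(dim=1, keepdim=True)
        m = torch.sigmoid(self.conv(torch.cat([sa_s, sa_d], dim=1)))  # M
        return m * r_s + (1.0 - m) * r_d        # assumed fusion rule for F_2
```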

E. Loss Function
To measure the dissimilarity between the result and the ground truth (GT), a hybrid loss is designed to optimize the proposed model during training. In comparison with other loss functions, the multiscale structural similarity index (MS-SSIM) is less sensitive to variations in brightness or color while preserving high-frequency information more effectively. Conversely, the L_1 norm is better at retaining color and brightness features, although it may lack contrast compared with MS-SSIM. These loss functions are therefore combined, and the comprehensive loss function is

$$\mathcal{L}(x, y) = \alpha\,\mathcal{L}_{\mathrm{MS}}(x, y) + (1 - \alpha)\,\mathcal{L}_{1}(x, y) \tag{15}$$

where L_MS and L_1 are the content losses evaluating MS-SSIM and the one-norm distance between the output x and the GT y, and α is the coefficient that balances the different loss terms. The MS-SSIM loss is

$$\mathcal{L}_{\mathrm{MS}}(x, y) = 1 - \big[l_M(x, y)\big]^{\alpha_M}\prod_{j=1}^{M}\big[c_j(x, y)\big]^{\beta_j}\big[s_j(x, y)\big]^{\gamma_j} \tag{16}$$

In (16), l_M, c_j, and s_j are the luminance, contrast, and structure comparison measures [56] at scales M and j, respectively, and α_M, β_j, and γ_j are given nonzero values used to adjust the relative importance of the different components. The L_1 loss is

$$\mathcal{L}_{1}(x, y) = \frac{1}{N}\sum_{i=1}^{N}\left|x_i - y_i\right| \tag{17}$$

In (17), i is the index of the pixel and N is the total number of pixels.
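A compact PyTorch version of this hybrid loss might look as follows; it relies on the third-party pytorch_msssim package for the MS-SSIM term, and the balance coefficient α = 0.84 is a common choice from the literature, not a value reported in this article.

```python
import torch
from pytorch_msssim import MS_SSIM  # third-party: pip install pytorch-msssim

class HybridLoss(torch.nn.Module):
    def __init__(self, channels: int, alpha: float = 0.84):
        super().__init__()
        self.alpha = alpha
        self.ms_ssim = MS_SSIM(data_range=1.0, channel=channels)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        loss_ms = 1.0 - self.ms_ssim(x, y)               # eq. (16)
        loss_l1 = torch.mean(torch.abs(x - y))           # eq. (17)
        return self.alpha * loss_ms + (1 - self.alpha) * loss_l1  # eq. (15)
```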

III. EXPERIMENTAL RESULTS
In this section, the efficacy of the proposed model is assessed through a series of experiments. The proposed model is compared against a range of pansharpening methods using benchmark datasets obtained from three satellite sensors.

A. Experimental Setup
The three datasets utilized in our experiments are sourced from the PanCollection introduced by Deng et al. [7]. PanCollection provides multiple pansharpening datasets, accompanied by comprehensive instructions for data simulation; these datasets can be downloaded from the authors' website. In this article, we utilized datasets collected from three satellite sensors: WorldView-3 (WV3), QuickBird (QB), and GaoFen-2 (GF2). They consist of PAN and MS images, where the MS images in the WV3 dataset have eight bands, and those in the QB and GF2 datasets have four bands each. WV3 comprises 9714 training samples and 1080 validation samples, QB comprises 17 139 training samples and 1905 validation samples, and GF2 comprises 19 809 training samples and 2201 validation samples. We trained the proposed network for 200 epochs.

B. Compared Methods and Quantitative Metrics
To facilitate meaningful comparison, our model is benchmarked against a carefully selected set of traditional methods, including BT-H [46], BDSD-PC [47], MTF-GLP-FS [48], and TV [49]. The implementation codes for these methods are publicly accessible through the MATLAB toolbox provided by Loncan et al. [50]. Furthermore, eight recently published pansharpening networks based on deep neural networks were used for comparison: PanNet [30], MSDCNN [31], DiCNN [32], BDPN [51], FusionNet [52], LAGConv [53], PMACNet [38], and BiMPan [40]. All implementations were conducted using PyTorch, adhering strictly to the respective network designs and parameters as reported in the corresponding literature or code. For a fair comparison, all DL-based models were retrained on a Windows system with an NVIDIA GeForce RTX 3090.
In this study, a comprehensive set of quantitative evaluations was conducted to assess the various pansharpening methods. For reduced-resolution datasets, this assessment included key metrics such as the multiband extension of the universal image quality index (UIQI) [54], the spectral angle mapper (SAM), the Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS) [50], the spatial correlation coefficient (SCC) [55], and the structural similarity index measure (SSIM) [56].
1) Q2^n: The UIQI is an image quality assessment metric utilized to characterize fusion performance by quantifying the correlation, brightness, and contrast similarity between the result and the reference image. The UIQI metric is also referred to as the Q-index and is formally defined as

$$Q = \frac{\sigma_{FR}}{\sigma_F \sigma_R} \cdot \frac{2\mu_F\mu_R}{\mu_F^2 + \mu_R^2} \cdot \frac{2\sigma_F\sigma_R}{\sigma_F^2 + \sigma_R^2} \tag{18}$$

where μ and σ are the means and standard deviations of the input images, respectively, and σ_FR represents the covariance between the fused image F and the reference image R. From left to right, the three terms convey the respective physical meanings of correlation, average brightness similarity, and contrast similarity between the two images. A higher Q indicates a greater similarity between two images. The Q4 index is an extension of the Q-index to four-channel vectors, which means it further considers spectral distortion, as proposed by Alparone et al. [57]. First, the Q4 index models each pixel as a quaternion

$$z(i) = F_1(i) + K_1 F_2(i) + K_2 F_3(i) + K_3 F_4(i) \tag{19}$$

where K_1, K_2, and K_3 are the weighted coefficients for the respective channels. The Q4 index is then computed as

$$Q4 = \frac{\left|\sigma_{z_F z_R}\right|}{\sigma_{z_F}\sigma_{z_R}} \cdot \frac{2\sigma_{z_F}\sigma_{z_R}}{\sigma_{z_F}^2 + \sigma_{z_R}^2} \cdot \frac{2\left|\mu_{z_F}\right|\left|\mu_{z_R}\right|}{\left|\mu_{z_F}\right|^2 + \left|\mu_{z_R}\right|^2} \tag{20}$$

where the statistics are computed over the quaternion representations of the fused and reference images. Similarly, further extending the number of channels beyond four yields the Q8 and Q2^n indices [54].
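For reference, a global NumPy version of the single-band Q-index in (18) can be written as below; practical implementations typically average Q over sliding windows rather than computing it globally, so this is only a sketch.

```python
import numpy as np

def q_index(f: np.ndarray, r: np.ndarray) -> float:
    """Global Q-index between a fused band f and a reference band r."""
    f = f.astype(np.float64).ravel()
    r = r.astype(np.float64).ravel()
    mu_f, mu_r = f.mean(), r.mean()
    sig_f, sig_r = f.std(), r.std()
    cov_fr = ((f - mu_f) * (r - mu_r)).mean()
    corr = cov_fr / (sig_f * sig_r)                     # correlation term
    lum = 2 * mu_f * mu_r / (mu_f**2 + mu_r**2)         # brightness similarity
    con = 2 * sig_f * sig_r / (sig_f**2 + sig_r**2)     # contrast similarity
    return corr * lum * con
```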
2) SAM: The SAM utilizes the angle between the spectral vector of an image pixel and the corresponding reference spectrum to quantify the similarity between spectral curves

$$\mathrm{SAM}(v, \hat{v}) = \arccos\left(\frac{\langle v, \hat{v}\rangle}{\|v\|_2\,\|\hat{v}\|_2}\right) \tag{21}$$

In our experiments, the reported value of SAM is obtained by averaging over all image pixels

$$\mathrm{SAM} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{SAM}(v_i, \hat{v}_i) \tag{22}$$

The ideal value of SAM is 0.

3) ERGAS: The ERGAS is a metric widely used to evaluate remote sensing images. It measures the balance between image reconstruction accuracy and the maintenance of spatial resolution and is defined as

$$\mathrm{ERGAS} = \frac{100}{c}\sqrt{\frac{1}{B}\sum_{b=1}^{B}\left(\frac{\mathrm{RMSE}_b}{\mu_b}\right)^2} \tag{23}$$

where c is the resolution ratio between the PAN and MS images, B is the number of bands, RMSE_b is the root-mean-square error of band b, and μ_b is the mean of band b in the reference image. Lower ERGAS values indicate better fusion quality.
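Hedged NumPy sketches of (21)–(23) follow; images are assumed to be (H, W, B) arrays, SAM is reported in degrees, and ratio is the PAN/MS resolution ratio (4 for the datasets used here).

```python
import numpy as np

def sam(fused: np.ndarray, ref: np.ndarray, eps: float = 1e-12) -> float:
    """Mean spectral angle (degrees) over all pixels, per (21)-(22)."""
    f = fused.reshape(-1, fused.shape[-1]).astype(np.float64)
    r = ref.reshape(-1, ref.shape[-1]).astype(np.float64)
    cos = (f * r).sum(1) / (np.linalg.norm(f, axis=1) * np.linalg.norm(r, axis=1) + eps)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean())

def ergas(fused: np.ndarray, ref: np.ndarray, ratio: float = 4.0) -> float:
    """ERGAS per (23): band-wise relative RMSE, scaled by the resolution ratio."""
    f = fused.astype(np.float64)
    r = ref.astype(np.float64)
    mse_b = ((f - r) ** 2).mean(axis=(0, 1))            # per-band MSE
    mu2_b = r.mean(axis=(0, 1)) ** 2                    # per-band squared mean
    return float(100.0 / ratio * np.sqrt((mse_b / mu2_b).mean()))
```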

4) SCC:
The SCC index proves effective in measuring the similarity of spatial information between the fused result and the PAN image. Specifically, SCC first extracts high-frequency detail information from both the PAN image and the fused image using a two-dimensional high-pass filter and then computes the correlation coefficient between the two sets of high-frequency details. The SCC index has a range of −1 to 1, where a higher SCC value indicates that the fused image retains more spatial information.

5) SSIM:
The SSIM index is a metric used to measure the similarity between two images [56] and is defined as

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} \tag{24}$$

where μ and σ² are the pixel sample means and variances of the input images, respectively, σ_xy is the covariance of the two images, and c_1 and c_2 are constants that stabilize the division when the denominators are close to zero.
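In practice, the SSIM in (24) is typically computed with a library implementation; for instance, scikit-image exposes it as follows (data_range must match the images' dynamic range).

```python
import numpy as np
from skimage.metrics import structural_similarity

fused = np.random.rand(64, 64)   # placeholder images in [0, 1]
ref = np.random.rand(64, 64)
score = structural_similarity(fused, ref, data_range=1.0)
```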
In addition, to evaluate performance on full-resolution datasets, the hybrid quality with no reference (HQNR) index is employed. This index is determined through the combination of the spatial distortion index D_S [58] and the spectral distortion index D_λ [59].
1) D_S: D_S is a metric that measures the spatial detail loss between two images, defined as

$$D_S = \sqrt[q]{\frac{1}{B}\sum_{b=1}^{B}\left|Q\big(F_b, \mathrm{PAN}\big) - Q\big(M_b, \mathrm{PAN}_{\mathrm{LP}}\big)\right|^q} \tag{25}$$

where B represents the number of spectral bands of the full-resolution MS images, and PAN_LP represents the PAN image degraded to the same dimensions as the MS image. During the computation, it is essential to ensure that the downsampled PAN image is spatially aligned with the MS image to avoid any biases in this metric. Furthermore, q is used to highlight large differences during the computation. Following the method proposed by Alparone et al. [58], the value of q is set to 1.
2) D_λ: D_λ is a metric that measures the spectral loss between two images, defined as

$$D_\lambda = \sqrt[p]{\frac{1}{B(B-1)}\sum_{b=1}^{B}\sum_{\substack{r=1 \\ r \neq b}}^{B}\left|Q\big(F_b, F_r\big) - Q\big(M_b, M_r\big)\right|^p} \tag{26}$$

where p is a positive integer that emphasizes significant spectral differences. In this study, p is set to 1. The smaller the D_λ value, the less the spectral distortion.
3) HQNR: The HQNR is a quality index that considers both spectral and spatial aspects, relying on D_λ and D_S, and is defined as

$$\mathrm{HQNR} = (1 - D_\lambda)^{\alpha}(1 - D_S)^{\beta} \tag{27}$$

where α = β = 1. Since the optimal value of both D_λ and D_S is 0, the optimal value of HQNR is 1.
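Composing the index is then straightforward; a tiny helper for (27) under α = β = 1:

```python
def hqnr(d_lambda: float, d_s: float, alpha: float = 1.0, beta: float = 1.0) -> float:
    """HQNR per (27); 1.0 is optimal, attained when both distortions are 0."""
    return (1.0 - d_lambda) ** alpha * (1.0 - d_s) ** beta
```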

C. Experimental Results on Reduced-Resolution Cases
In this section, the performance of all compared methods is tested on the three simulated datasets. Each dataset consists of 20 PAN images of size 256 × 256 and 20 MS images of size 64 × 64, cropped from downsampled versions of the original images. During training, the original PAN and MS images are cropped into small patch pairs of size 64 × 64 for PAN and 16 × 16 for MS. The eight CNN-based pansharpening methods were first trained on the three training datasets and then tested on both simulated and real datasets.
1) Dataset of WV3: We initially evaluated the performance of all methods on the test dataset of the WV3 satellite. Table II presents the quantitative evaluation results for the various comparison methods on the WV3 dataset. In the table, mean and standard deviation summarize the obtained performance, and the best results are in bold. Notably, the DL-based methods exhibit superior average metrics compared with traditional methods, underscoring the advantages of DL-based approaches due to the shared data structure and image characteristics between the training and test datasets. Specifically, our proposed method outperforms the other DL-based methods across all metrics, highlighting the effectiveness of our spectral preservation and superior spatial detail extraction capabilities. This underscores the robust nonlinear fitting and feature extraction capabilities of our method during the training process, facilitating the seamless learning of the end-to-end relationship for the pansharpening task.
For a more intuitive visual comparison between methods, we selected an example from the WV3 test data to illustrate the visual effects of the different sharpening methods. As depicted in Fig. 4, the visual results align with the quantitative indicators in Table II, showing that DL-based methods exhibit stronger spectral fidelity and spatial detail description abilities than traditional methods. In particular, the images generated by the proposed method effectively preserve spectral information and retain the texture and other details of the image. In addition, we generated residual maps between the fusion results and the GT images, as shown in Fig. 5. Ideally, a residual map should approach zero; the richer the content of the residual map, the blurrier the fusion result. The residual maps of the four traditional methods appear quite unclean. Among the DL methods, the residual map of our proposed method contains less content than those of the other methods, indicating that our result preserves more detail. To highlight the differences, two enlarged subregions are depicted in the residual maps. It is evident that our method produces fused images that closely resemble the GT images, with cleaner residual maps.
2) Dataset of QB: This section demonstrates the performance of all methods on the QB simulation dataset. As presented in Table III, all CNN-based DL methods exhibit exceptional quantitative results, surpassing traditional methods by a significant margin. Notably, the proposed method stands out with the most outstanding performance, attributed to its excellent hierarchical feature extraction ability and data fitting capabilities. For a visual validation of the pansharpening results obtained by different methods, Fig. 6 showcases enlarged details of the HRMS images within the blue rectangles. In the enlarged blue box, our method reconstructs details, such as paths, more clearly. Similarly, the residual maps for the different methods, shown in Fig. 7, also validate the superiority of our method: the paths are least visible in our residual map.
3) Dataset of GF2: This section presents the performance of all methods on the GF2 simulation dataset. As indicated in Table IV, akin to the results for the WV3 and QB datasets, all DL-based methods quantitatively outperform traditional methods. Even on datasets collected by different sensors, our method consistently achieves better results than the comparison methods, underscoring the effectiveness and robustness of the proposed model. For an intuitive comparison, the fusion results of each method are showcased in Fig. 8. The fused high-resolution images obtained by all DL-based methods demonstrate competitive visual results. However, upon closer inspection of the blue zoomed box, it becomes apparent that all the DL methods used for comparison struggle to reconstruct details successfully. In contrast, our proposed method clearly exhibits spatial superiority, featuring clearer details, so the exceptional performance of the proposed network can be easily observed.

D. Experimental Results on Full-Resolution Cases
We also evaluate the performance of all compared methods on the full-resolution datasets, including WV3, QB, and GF2. The three full-resolution test datasets consist of 20 image pairs (128 × 128 for MS and 512 × 512 for PAN) cropped from the original images, respectively. Quantitative results for all metrics on the three datasets are shown in Tables V–VII. It is evident that the advantages of DL-based methods over some traditional methods are not as pronounced at full resolution. However, DL-based methods generally outperform traditional methods on the D_S metric, illustrating the excellent spatial fidelity of DL methods. Notably, our method achieves competitive results in terms of the overall quality metric at full resolution. BiMPan also gives good results, but its model complexity is much higher than ours.
Furthermore, the advantages and disadvantages of each method can be displayed more clearly through the visualization experiments in Figs. 9 and 10. Compared with the comparison methods, the proposed method captures more image details and richer image colors. Overall, the proposed method consistently achieves superior results at full resolution, indicating the effectiveness and robustness of the proposed model.

IV. DISCUSSION
In this section, we first conduct an in-depth discussion of the proposed model, focusing on the bilateral structure and the impact of the feature fusion module. It is important to note that, for brevity, we use a reduced-resolution WV3 dataset with a spatial size of 256 × 256 for PAN images and 64 × 64 for MS images as the test example.

A. Effectiveness of Our Bilateral Architecture
To validate the effectiveness of our pansharpening network, we build two variants of the proposed architecture. The first variant retains the dual-branch network structure, but its spectral branch removes the original-scale MS input and the corresponding two convolutional blocks; this variant is designed to confirm the advantage of using the original-scale MS as input for spectral retention. The second variant also retains the dual-branch network structure but removes the high-frequency residuals from the inputs of the spatial branch, simply combining the PAN image and the upsampled MS image as inputs; this variant is designed to confirm the advantage of using the high-frequency residuals (PAN Laplacian pyramid) as inputs for learning complementary spatial details.
We conducted experiments on the reduced-resolution cases of WV3, and the results, presented in Table VIII, indicate that the proposed architecture produces the best overall performance, underscoring the effectiveness of our method. This observation also suggests that the dual-branch network proves more effective in capturing various information compared with a single input.

B. Effectiveness of Our Feature Fusion Method
We perform an ablation study to evaluate the effectiveness of the proposed CAF module. The comparison strategies include direct addition, concatenation, channel attention-based fusion only (F_1), and the proposed CAF module. As shown in Table IX, the concatenation strategy failed to complete the training process, crashing during training, and the performance of the simple addition strategy drops significantly. Moreover, our cooperative attention strategy achieves better performance than the channel attention-based fusion strategy. Therefore, this module is well suited for fusing the different features of the spectral and spatial branches.

C. Model Efficiency Analysis
Finally, we compare the model efficiency of four DL-based algorithms published over the past two years. The results are presented in Table X, which includes a detailed comparison of testing time, memory usage, number of epochs, and overall training time for these methods on the WV3 full-resolution test dataset. The data in Table X clearly demonstrate that our method outperforms the others in inference speed, requiring only 3.4 s to process 20 test images. In addition, our method shows superior efficiency in the training process, marked by a lower number of epochs and reduced memory usage, all while maintaining state-of-the-art performance. Notably, the total training time for our method is significantly less than that of the comparative methods. We attribute this enhanced efficiency to the strategic design of our model, particularly the decrement in the number of feature map channels in each RRD module. This design choice significantly boosts the efficiency of our model relative to other contemporary methods.

V. CONCLUSION
In this article, we present a pansharpening algorithm that integrates spectral and spatial networks for MS pansharpening. We introduce an improved RRD module to construct the spatial network, enabling the extraction of deep features with an expandable receptive field and hierarchical information. To further enhance spatial details, we propose incorporating the high-frequency residuals of the PAN image as an additional input. In addition, to leverage the relationship information between the spectral and spatial branches, we introduce a novel CAF module. Extensive experiments and visual results validate the effectiveness of the proposed pansharpening method, which achieves state-of-the-art performance on datasets collected from three different sensors.
In the future, there are areas where the proposed method can be further refined. First, while our designed feature fusion module has proven effective, its complexity presents an opportunity for optimization; simplifying this module could make the method more efficient. Second, although our method achieves high metrics on reduced-resolution simulated data, there is room for improvement in the accuracy of full-resolution remote sensing fusion. These aspects will be focal points of our future research and development efforts.


Fig. 1. Overall architecture of the introduced spectral–spatial bilateral pyramid network.

Fig. 3. Illustration of the CAF module. F_S and F_D denote the feature maps generated by the spectral and spatial branches, respectively.

TABLE I. DETAILED ARCHITECTURE OF SPATIAL NETWORKS, TAKING THE WV3 DATASET AS AN EXAMPLE

TABLE II. PERFORMANCE EVALUATION AT REDUCED RESOLUTION ON 20 WV3 TEST SAMPLES

TABLE III. PERFORMANCE EVALUATION AT REDUCED RESOLUTION ON 20 QB TEST SAMPLES

TABLE IV. PERFORMANCE EVALUATION AT REDUCED RESOLUTION ON 20 GF2 TEST SAMPLES

TABLE V. PERFORMANCE EVALUATION AT FULL RESOLUTION ON 20 WV3 TEST SAMPLES

TABLE VI. PERFORMANCE EVALUATION AT FULL RESOLUTION ON 20 QB TEST SAMPLES

TABLE VII. PERFORMANCE EVALUATION AT FULL RESOLUTION ON 20 GF2 TEST SAMPLES

TABLE VIII. PERFORMANCE ASSESSMENT BY VARYING THE INPUT STRUCTURE

TABLE IX. RESULTS OF QUANTITATIVE METRICS FOR DIFFERENT METHODS

TABLE X. COMPARISON OF THE MODEL EFFICIENCY FOR DIFFERENT METHODS