Multispectral Pansharpening Based on Multisequence Convolutional Recurrent Neural Network

Multispectral (MS) pansharpening is the fusion of the spatial information in a panchromatic (PAN) image with the spectral information in an MS image. In this work, we propose an MS pansharpening method based on a multisequence convolutional recurrent neural network (MCRNN). The proposed MCRNN contains two subnetworks: a shallow feature extraction subnetwork and a deep feature fusion subnetwork. In the shallow feature extraction subnetwork, the PAN and MS images are superimposed in the spectral dimension as multisequence data, and a convolutional neural network based on residual learning is used to obtain feature maps from the multisequence data. In the deep feature fusion subnetwork, since the MS and PAN images are highly correlated, a convolutional recurrent neural network is used to model the adjacent-band and across-band relationships between these feature maps, capturing the local and global correlations of the features in different bands. Global average pooling is then performed on the output to yield the pansharpening result. Several datasets are tested in reduced-resolution and full-resolution experiments, and the experimental results show that the performance of the proposed MCRNN is superior to that of the traditional pansharpening methods.


I. INTRODUCTION
HIGH-RESOLUTION multispectral (MS) images have wide application prospects in engineering evaluation, land management, and urban planning [1], [2], [3]. In addition, MS images are also critical in various remote sensing tasks, such as classification, spectral unmixing, and superresolution mapping [4], [5], [6], [7]. However, due to the limitations of remote sensing satellite sensors, it is difficult to acquire MS images with both high spatial and high spectral resolution [8], [9], [10]. The fusion of panchromatic (PAN) and MS images is known as MS pansharpening. Since most MS remote sensing platforms are equipped with PAN imaging equipment, MS pansharpening has become a popular approach for obtaining a fusion result with high spectral and spatial resolutions [11].
A variety of MS pansharpening methods have been presented in the past decades, including component substitution (CS), multiresolution analysis (MRA), variational optimization (VO), and deep learning (DL) [12]. The CS-based methods first transform an MS image into the intensity-hue-saturation space and then use the PAN image to replace the spatial component. Finally, the inverse transform is applied to obtain the pansharpening result. Well-known CS-based methods include Gram-Schmidt (GS) fusion [13], GS adaptive (GSA) [14], and band-dependent spatial-detail [15]. The spatial information in the pansharpening result of the CS-based methods is enhanced, but there are severe spectral distortions. The MRA-based methods inject the detailed spatial information extracted from the PAN image into the MS image to improve the spatial resolution and reduce the spectral distortion. These methods include the wavelet transform (WT) [16], the Laplacian pyramid [17], additive wavelet luminance proportional [18], and the à trous WT [19]. Although the MRA-based methods preserve the spectral information effectively, their computational complexity is relatively high. Recently, the VO-based methods have been presented. These methods formulate an energy function based on assumptions or priors and then minimize the energy function to obtain the pansharpening result. The main VO-based methods include nonlocal optimization based on the k-means clustering algorithm [20], Bayesian posterior probability [21], and adaptive regularization based on normalized Gaussian distribution total variation operators [22]. However, the VO-based methods sometimes have low efficiency, which seriously hinders their application.
In recent years, DL has performed remarkably well in the field of computer vision. Inspired by the convolutional neural network (CNN) [23], Masi et al. [24] proposed pansharpening based on CNN (PNN) to learn the mapping relationship between the low-resolution MS image and the high-resolution PAN image. In order to make full use of the high nonlinearity of DL methods, Wei et al. [25] used the concept of residual learning to design a deep residual PNN (DRPNN) that improves the performance of PNN. Yang et al. [26] proposed a deep network architecture for pansharpening (PanNet), which was trained in the high-pass filter domain to achieve spatial fidelity. Xu et al. [27] implemented pansharpening based on detail injection by using a CNN (GPPNN), which mined the MS details to achieve rapid convergence. Zhou et al. [28] proposed a novel mutual-information-driven pansharpening framework. This framework first projects the PAN and MS images into modality-aware feature spaces independently and then imposes mutual information minimization to explicitly encourage complementary information learning. Wu et al. [29] proposed a dynamic cross-feature fusion network, which boosts the performance of pansharpening. Generative adversarial networks (GANs) have also shown good competitiveness in the field of pansharpening due to their unsupervised learning characteristics. Ma et al. [30] proposed a new GAN-based method (pan-GAN) to achieve unsupervised pansharpening. Zhou et al. [31] introduced the combination of the transformer and the invertible neural network into pansharpening for the first time to achieve a good fusion result. The performance of DL-based pansharpening can also be further improved by combining it with the traditional pansharpening methods [32], [33], [34]. Recently, some researchers have treated the MS image as multisequence data and used a recurrent neural network (RNN) to learn features and obtain spectral information.
RNN is good for modeling multisequence data and learning long-term correlations. In addition, RNN tends to have fewer trainable parameters as compared to CNN [35]. The RNN has been successfully applied for feature extraction, feature fusion, and hyperspectral image classification [36]. In addition, Fu et al. [37] have exploited the structure of RNN to pass feedback information to achieve pansharpening.
In this work, we propose an MS pansharpening method based on a multisequence convolutional RNN (MCRNN). The proposed MCRNN comprises two subnetworks: a shallow feature extraction subnetwork and a deep feature fusion subnetwork. In the shallow feature extraction subnetwork, the PAN and MS images are processed from the perspective of multisequence data. In order to fully extract the image details, a CNN is used in this subnetwork to extract the spatial and spectral features from the PAN and MS images. In addition, residual blocks with shared parameters are introduced in this subnetwork to improve the performance. In the deep feature fusion subnetwork, a convolutional gated recurrent unit (ConvGRU) [38], a convolutional variant of the RNN, is introduced to capture the local and global correlations between the spatial and spectral features extracted from the PAN and MS images to achieve pansharpening.
The major contributions of this work are presented as follows.
1) The proposed MCRNN introduces ConvGRU into pansharpening. ConvGRU models the adjacent-band and across-band relationships between the feature maps extracted from the input multisequence data. Therefore, ConvGRU can capture the local and global correlation, i.e., the correlation between adjacent bands and across bands [39], improving the pansharpening performance.
2) In the shallow feature extraction subnetwork, the PAN and MS images are modeled as multisequence data. The shallow feature extraction subnetwork extracts the spatial and spectral features from the multisequence data band by band to obtain more detailed information.
3) In the deep feature fusion subnetwork, ConvGRU significantly improves the convergence speed due to its small number of parameters (NOPs). Compared with a CNN, the RNN requires fewer iterations to converge under the same training conditions. The experimental results show that the proposed MCRNN achieves better performance while maintaining a small NOP and converges quickly with small loss values. Therefore, the proposed MCRNN is lightweight.

The rest of this article is organized as follows. The related work and motivation are introduced in Section II. The proposed method is presented in Section III. Sections IV and V present the experimental results and discussion, respectively. Finally, Section VI concludes this article.

II. RELATED WORK AND MOTIVATION

A. Recurrent Neural Network
The RNN is a neural network that receives and processes feedback information [40], i.e., the past and current inputs jointly affect the current output. This feedback structure distinguishes the RNN from the traditional feedforward structure of the CNN. The good memory of the RNN makes it suitable for natural language processing, machine translation, and speech recognition.
Recently, the RNN has also been applied to MS images for obtaining the spectral information. The schematic diagram of the RNN is shown in Fig. 1. Suppose x_i (i = 1, 2, ..., M) is the multisequence data from an MS image with M bands. The output activation h_i of the hidden layer is expressed as

h_i = σ(W_h x_i + U_h h_{i−1} + b_h)  (1)

where σ is the activation function, W_h and U_h represent the weight matrices from the current input layer to the hidden layer and from the previous hidden layer to the current hidden layer, respectively, h_{i−1} is the output activation of the previous hidden layer, and b_h is the bias of the hidden layer. The output activation h_i of the hidden layer is used to compute the output y_i as follows:

y_i = W_y h_i + b_y  (2)

where W_y represents the weight matrix from the hidden layer to the output layer and b_y is the bias of the output layer. Therefore, the network is able to perform predictions beyond the power of a standard multilayer perceptron.
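As a concrete illustration, the RNN recurrence described above (h_i computed from the current input x_i and the previous hidden activation h_{i−1}, and y_i computed from h_i) can be sketched in a few lines of NumPy. This is a minimal sketch with hypothetical dimensions, using tanh as the activation σ; it is not the implementation used in this work.

```python
import numpy as np

def rnn_step(x_i, h_prev, W_h, U_h, b_h, W_y, b_y):
    """One RNN step: the current input x_i and the previous hidden
    activation h_prev jointly determine the new hidden activation h_i
    and the output y_i."""
    h_i = np.tanh(W_h @ x_i + U_h @ h_prev + b_h)  # hidden-layer update
    y_i = W_y @ h_i + b_y                          # output layer
    return h_i, y_i
```

Iterating rnn_step over the M bands of an MS image processes the image as multisequence data, with each band's output depending on all previous bands.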

B. RNN-Based Spectral Image Processing
The good memory of the RNN makes it suitable for natural language processing and machine translation [41]. Spectral images are usually composed of multiple bands and have a sequence of pixels in the channel dimension, so spectral images can also be regarded as a special kind of sequence data. Some researchers have therefore begun to use the RNN to process spectral images. Mou et al. [42] introduced the RNN into hyperspectral image classification and showed good performance, which opened the window for RNN applications in hyperspectral image classification. Later, the RNN was also applied to hyperspectral image superresolution. To effectively extract structural information, Fu et al. [39] proposed a bidirectional quasi-recurrent pooling module.

C. Motivation
Recently, CNN-based methods have played an important role in the field of MS pansharpening and achieved satisfactory results. However, most of these methods concatenate the PAN and MS images in the spectral dimension and then input them into the network directly [24]. This results in the extraction of fewer image details and the underutilization of the feature correlation among the different bands of the PAN and MS images, thus affecting the quality of the pansharpening result [25].
Mou et al. [42] used the RNN for hyperspectral image classification, which showed the effectiveness of the RNN in modeling spectral multisequence data and learning local and global correlations. Although MS pansharpening is not a multisequence data problem by nature, we can treat the PAN and MS images as multisequence data. Therefore, we present the MCRNN to extract more detailed information and focus on the contextual semantic features of all bands in the PAN and MS images.

III. PROPOSED NETWORK STRUCTURE
In this section, the proposed MCRNN, an end-to-end network composed of a shallow feature extraction subnetwork and a deep feature fusion subnetwork, is introduced in detail. The former extracts shallow features from the PAN image and each band of the MS image by using a CNN based on residual blocks. The latter captures the local and global correlations between the spatial and spectral features obtained from the PAN and MS images by using ConvGRU. The flowchart of the proposed MCRNN is shown in Fig. 2.

A. Shallow Feature Extraction Subnetwork
The PAN and MS images are collectively treated as multisequence data, because the shallow feature extraction subnetwork of the proposed MCRNN extracts the spatial and spectral features from the multisequence data band by band. We can take advantage of this to train MS images with different numbers of bands (four or eight) without changing the network structure.
Suppose the multisequence data x ∈ R^{B×H×W×(M+1)}, where B is the batch size, H is the height of the data, W is the width of the data, and M + 1 is the number of data samples. The multisequence data are obtained as follows. The M bands of the upsampled MS image and the PAN image, as shown in (3), are stacked and regarded as multisequence data x_i (i = 1, 2, ..., M + 1), which are used as the input data, as shown in (4).

The CNN is selected as the basic structure of the shallow feature extraction subnetwork. The PAN image has more spatial information and less spectral information, whereas the MS image has more spectral information and less spatial information, but both images contain spatial-spectral features. This subnetwork is used to extract the spatial-spectral features from the MS and PAN images. Therefore, the subnetwork is designed with the same structure and shared parameters for each sequence element, which reduces the nonadaptive situation and keeps the number of network parameters as small as possible. In addition, since the shallow feature extraction subnetwork is applied band by band, the convolutional layers are processed serially.

Residual learning has demonstrated its effectiveness for extracting features from images [43]. Thus, we introduce the residual block in the shallow feature extraction subnetwork. Fig. 3 shows the residual block as the main structure of the shallow feature extraction network. The residual block includes two convolutional layers and a skip connection, and a ReLU activation function is embedded after each convolutional layer. The working principle of the residual block is mathematically expressed as follows:

y_i = h(x_i) + R(x_i)  (5)
x_{i+1} = F(y_i)  (6)

where x_i and x_{i+1} are the input and output of the ith residual block, R(·) is the residual function, F(y_i) is an activation function, and h(x_i) is an identity mapping function. In this work, the parameters of the residual blocks are shared to preserve the accuracy and reduce the NOPs. The multisequence data x_i are used as the input of the shallow feature extraction network φ to obtain the feature maps f_i (i = 1, 2, ..., M + 1), expressed as

f_i = φ(x_i)  (7)

where all the convolution layers use 3 × 3 convolution kernels without any normalization layers, and the parametric ReLU is used as the activation function [44].
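A band-by-band residual block of this kind can be sketched as follows. For readability, the sketch assumes single-channel feature maps and a naive 3 × 3 "same" convolution; the kernels k1 and k2 are hypothetical weights that would be shared across all bands, as in the proposed subnetwork.

```python
import numpy as np

def conv2d_same(img, kernel):
    """Naive single-channel 'same' convolution (cross-correlation)."""
    kh, kw = kernel.shape
    p = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(img, dtype=float)
    H, W = img.shape
    for r in range(H):
        for c in range(W):
            out[r, c] = np.sum(p[r:r + kh, c:c + kw] * kernel)
    return out

def residual_block(x, k1, k2):
    """Identity skip connection plus a two-layer residual branch:
    the residual R(x) is conv -> ReLU -> conv, then the output is the
    activation of h(x) + R(x), with h the identity mapping."""
    r = np.maximum(conv2d_same(x, k1), 0)  # first conv + ReLU
    r = conv2d_same(r, k2)                 # second conv
    return np.maximum(x + r, 0)            # skip connection, then activation
```

Because the same residual_block (with the same kernels) is applied to every element of the sequence, the parameter count does not grow with the number of bands.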

B. Deep Feature Fusion Subnetwork
After the shallow feature extraction subnetwork, the feature maps of all bands, including the spectral information of the MS image and the spatial information of the PAN image, are extracted. Most of the existing DL-based pansharpening methods directly concatenate these feature maps to obtain the spectral and spatial features. However, these feature maps are highly correlated and contain complementary information, so it is important to model the adjacent-band and across-band relationships between them. Various RNN variants, such as long short-term memory [45], the gated recurrent unit (GRU) [46], and ConvGRU, have been proposed to capture the correlation of spectral and spatial features between different bands. In particular, ConvGRU is suitable for processing two-dimensional multisequence data. Therefore, ConvGRU is utilized as the basis of the deep feature fusion subnetwork to process the feature maps of all bands, which are regarded as two-dimensional multisequence data in this work. The proposed deep feature fusion subnetwork preserves the spatial resolution from input to output, while modeling the within- and across-band relationships to capture the local and global correlations among different bands.
The schematic diagram of ConvGRU is shown in Fig. 4. Suppose that f_i (i = 1, 2, ..., M + 1) is the ith feature map, and the update gate and reset gate are z_i and r_i, respectively. The update gate z_i is a logic gate used when updating the activation h_i, and is expressed as

z_i = σ(W_z ∗ f_i + U_z ∗ h_{i−1})  (8)

where σ is the activation function. In the update gate, W_z and U_z represent the weight matrices from the current input layer to the hidden layer and from the previous hidden layer to the current hidden layer, respectively, h_{i−1} is the output activation of the previous hidden layer, and ∗ represents the convolution operation. The reset gate r_i decides whether to ignore the previous activation h_{i−1} when determining the candidate activation h̃_i:

r_i = σ(W_r ∗ f_i + U_r ∗ h_{i−1})  (9)

where W_r and U_r represent the corresponding weight matrices of the reset gate. The candidate activation h̃_i receives [f_i, h_{i−1}], and is expressed as

h̃_i = tanh(W_h ∗ f_i + U_h ∗ (r_i ∘ h_{i−1}))  (10)

where W_h and U_h represent the corresponding weight matrices of the candidate activation. Finally, the output activation h_i of the hidden state of the cell is obtained by using the following expression:

h_i = (1 − z_i) ∘ h_{i−1} + z_i ∘ h̃_i  (11)

where ∘ represents the Hadamard operator. The size of the convolution kernels of ConvGRU is 3 × 3.
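A single ConvGRU step can be sketched as follows. To keep the sketch short, the 3 × 3 convolutions are replaced by per-gate scalar weights (equivalent to 1 × 1 convolutions), and the output uses the standard GRU convex combination of the previous and candidate activations; the weights in W and U are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def convgru_step(f_i, h_prev, W, U):
    """One ConvGRU step on a 2-D feature map f_i. W and U hold the
    input-to-hidden and hidden-to-hidden weights for the update gate,
    the reset gate, and the candidate activation."""
    z = sigmoid(W["z"] * f_i + U["z"] * h_prev)              # update gate
    r = sigmoid(W["r"] * f_i + U["r"] * h_prev)              # reset gate
    h_tilde = np.tanh(W["h"] * f_i + U["h"] * (r * h_prev))  # candidate
    return (1.0 - z) * h_prev + z * h_tilde                  # new activation
```

Running convgru_step over the M + 1 feature maps in sequence lets each band's hidden state accumulate information from all previously seen bands, which is how the local and global correlations are captured.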
The feature maps of all bands f are fed to the deep feature fusion subnetwork to obtain the feature fusion results f̂ by using the expressions presented in (8)-(11). The global average pooling is then used to make the output feature fusion result fixed at the target resolution along the multisequence dimension, and is expressed as

f̂ = (1/(M + 1)) Σ_{i=1}^{M+1} h_i  (12)

The pansharpening result is obtained by a convolutional layer with a 1 × 1 convolution kernel. This convolutional layer acts like a decoder to restore the feature fusion results into a high-resolution MS image.
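The fusion stage described above can be sketched as a global average pooling over the sequence dimension followed by a 1 × 1 convolution that decodes the pooled feature channels into MS bands. The tensor shapes and the decoder weights below are hypothetical; a 1 × 1 convolution is implemented here as a channel-mixing matrix multiplication.

```python
import numpy as np

def fuse_and_decode(h_seq, w_decoder, b_decoder):
    """h_seq: (M+1, C, H, W) fused feature maps from ConvGRU.
    Global average pooling over the sequence dimension collapses the
    M+1 states to one (C, H, W) map; a 1x1 conv (channel-mixing matrix
    w_decoder of shape (bands_out, C)) then decodes it to MS bands."""
    pooled = h_seq.mean(axis=0)                  # average over the sequence
    C, H, W = pooled.shape
    flat = pooled.reshape(C, -1)                 # 1x1 conv == matrix multiply
    out = w_decoder @ flat + b_decoder[:, None]  # mix channels per pixel
    return out.reshape(-1, H, W)
```

Note that the spatial resolution (H, W) is untouched: only the sequence and channel dimensions are reduced, which matches the role of the decoder described in the text.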

IV. EXPERIMENTAL RESULTS

A. Experimental Design
The experimental datasets are publicly available and were published by Meng et al. [47]. These datasets are acquired from three different sensors, i.e., QuickBird, World-view4, and World-view2. The information about the three datasets is summarized in Table I [22].
9) PanNet: A deep network architecture for pansharpening [24].
10) MSDCNN: A multiscale and multidepth CNN [53].
11) TFNet: Two-stream fusion network [54].
12) GPPNN: Detail injection of CNNs [25].
The source codes of all the benchmark methods are available in the public domain [55]. All the DL methods are trained by using Python 3.8.2 with PyTorch 11.2 on a desktop PC equipped with two NVIDIA GeForce GTX 3080Ti GPUs with 11 GB of memory.
The performance of the pansharpening methods is evaluated from two aspects, i.e., the reduced-resolution and full-resolution experiments. The reduced-resolution evaluation uses a reference image to evaluate the results. We use three quality assessment indices in the reduced-resolution experiments: the spectral angle mapper (SAM) [55], which evaluates the spectral quality of the image; the relative dimensionless global error in synthesis (ERGAS) [56], which extends the root-mean-square error to multidimensional arrays to evaluate the spatial quality; and the Q-index (Q2^n) [57], [58], which is the average representation over all bands. Generally, low values of SAM and ERGAS and a high value of Q2^n indicate good performance. The full-resolution evaluation does not involve a reference image. In this work, we also consider three quality assessment indices: the spectral distortion index D_λ [59] and the spatial distortion index D_s [60] are used to evaluate the spectral and spatial quality of the result, respectively, and the QNR [61] is obtained by the combination of D_λ and D_s. These indices are summarized in Table II.
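The two reference-based indices used at reduced resolution can be sketched in NumPy as follows. The SAM and ERGAS functions below follow their standard definitions; they are illustrations, not the exact evaluation code used in the experiments.

```python
import numpy as np

def sam_degrees(ref, fused, eps=1e-12):
    """Spectral Angle Mapper: mean angle (degrees) between the reference
    and fused spectral vectors at each pixel; 0 means identical spectral
    directions. ref, fused: (H, W, B) arrays."""
    dot = np.sum(ref * fused, axis=-1)
    norm = np.linalg.norm(ref, axis=-1) * np.linalg.norm(fused, axis=-1)
    cos = np.clip(dot / (norm + eps), -1.0, 1.0)
    return np.degrees(np.arccos(cos)).mean()

def ergas(ref, fused, ratio=4):
    """ERGAS: 100/ratio times the root of the band-averaged squared
    relative RMSE; lower is better."""
    rmse = np.sqrt(np.mean((ref - fused) ** 2, axis=(0, 1)))  # per band
    mean = np.mean(ref, axis=(0, 1))
    return 100.0 / ratio * np.sqrt(np.mean((rmse / mean) ** 2))
```

Note that SAM is invariant to per-pixel scaling of the spectra (it measures only spectral direction), which is why it is paired with ERGAS to also penalize radiometric errors.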
The parameters of the comparison methods are set according to the public codes, and the parameters of the proposed method are chosen to obtain the best result. In this work, we use Adam [62] to optimize the loss function of the proposed MCRNN. The mean square error (MSE) [63] is used as the loss function, the initial learning rate is 0.00001, the batch size is 64, the total number of epochs is 400, and the learning rate decays by a factor of 0.97 every 10 epochs. The total number of iterations is 1.6 × 10^5. During the training phase, we extract 25 600 PAN, low-resolution MS, and high-resolution MS patch pairs of size 32 × 32, and split them into 70%, 20%, and 10% subsets for training, testing, and validating the model, respectively.
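The training schedule stated above can be written down directly. The arithmetic below reproduces the reported total number of iterations under the assumption that all 25 600 patches are visited once per epoch; the helper function is a hypothetical expression of the stated step decay, not the training code itself.

```python
# Hyperparameters as stated in the text
patches, batch_size, epochs = 25_600, 64, 400
steps_per_epoch = patches // batch_size     # 400 batches per epoch
total_iterations = steps_per_epoch * epochs # 160 000 = 1.6 x 10^5, as reported

def learning_rate(epoch, base_lr=1e-5, decay=0.97, every=10):
    """Step decay: multiply the learning rate by 0.97 every 10 epochs."""
    return base_lr * decay ** (epoch // every)
```

In PyTorch, the same schedule would typically be expressed with torch.optim.lr_scheduler.StepLR(step_size=10, gamma=0.97) attached to the Adam optimizer.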

B. Reduced-Resolution Experiments
In the reduced-resolution experiments, the reduced-resolution input MS and PAN images are obtained by performing low-pass filtering at a spatial resolution ratio of 4. Fig. 8 shows the pansharpening results of the QuickBird dataset in the reduced-resolution experiment. In order to reflect the differences in the results, the enlarged subregions are highlighted in Fig. 8. Brovey and CNMF have obvious spectral distortion due to spatial transformation, which leads to the loss of spectral information. All the DL-based pansharpening methods perform better when compared with the traditional pansharpening methods. In particular, the result of the proposed MCRNN in Fig. 8(m) is the closest to the reference image presented in Fig. 5(a). In order to highlight the differences between the pansharpening results, the SAM error maps are shown in Fig. 9. As shown in Fig. 9(n), the proposed MCRNN produces fewer errors than the other methods.
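The reduced-resolution degradation at a ratio of 4 can be sketched as follows. A simple block-mean filter and decimation stand in for the sensor-specific low-pass filtering, so this is an illustration of the protocol rather than the exact degradation used in the experiments.

```python
import numpy as np

def simulate_reduced_resolution(img, ratio=4):
    """Degrade a single-band image: low-pass filter (here, averaging over
    ratio x ratio blocks, not the sensor MTF) followed by decimation."""
    H, W = img.shape
    H2, W2 = H - H % ratio, W - W % ratio  # crop to a multiple of ratio
    blocks = img[:H2, :W2].reshape(H2 // ratio, ratio, W2 // ratio, ratio)
    return blocks.mean(axis=(1, 3))        # one mean value per block
```

Applying this to the original MS and PAN images yields the reduced-resolution inputs, while the original MS image serves as the reference for the quality indices.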
Three quality assessment indices for the reduced-resolution experiments are listed in Table III. It can be seen that Q2^n is equal to Q4 for the QuickBird dataset because its MS image has four spectral bands. The boldface entries in Table III show that the proposed MCRNN has the best performance on all three quality evaluation indices.
In order to further evaluate the generalization ability of the proposed MCRNN, the performances of all pansharpening methods on the World-view4 and World-view2 datasets are tested. The MS image in the World-view2 dataset has eight bands; therefore, we also assess the ability of the proposed MCRNN for different numbers of bands. The size and the number of testing and training images in the World-view4 and World-view2 datasets are consistent with those in the QuickBird dataset. Obvious spectral distortion can be observed in Fig. 11(d) and (e), and the spatial structure information is poorly maintained. The visual results of the DL-based pansharpening methods are very similar. Therefore, the SAM error maps of all pansharpening results are shown in Figs. 12 and 13 to reveal the differences. It is observed that the result of the proposed MCRNN has the fewest errors when compared with the reference image.
Tables IV and V show the three quality assessment indices on the two datasets. In particular, Q2^n is equal to Q8 for the World-view2 dataset because its MS image has eight spectral bands. The experimental results show that the proposed MCRNN obtains the best quality assessment indices. The experimental results at reduced resolution demonstrate the good generalization ability of the proposed MCRNN.

C. Full-Resolution Experiments
We use the same datasets to perform the full-resolution experiments, in which there is no simulated reduced-resolution process. Figs. 14-16 show the three datasets in the full-resolution experiments, including the original MS and PAN images. Fig. 17 shows the results of the 13 pansharpening methods on the QuickBird dataset at full resolution, and the obvious subregions of these results are enlarged. The DL-based pansharpening methods depict better visual effects as compared to the traditional pansharpening methods.

V. DISCUSSIONS
Various factors related to the network structure affect the performance of the proposed MCRNN. The QuickBird dataset at reduced resolution is selected as a test example for the sake of brevity. Similar conclusions can be drawn from the other datasets as well.

A. Number of Residual Blocks
In the shallow feature extraction subnetwork, the residual blocks are used to improve the pansharpening result. In this section, different numbers of residual blocks are used to discuss the effect of the residual blocks on the pansharpening result.
In order to perform a fair comparison, the number of hidden layers in ConvGRU is always set to 2. Table IX shows the assessment of the proposed MCRNN with varying numbers of residual blocks. Comparing the first row of Table IX with the second row shows that the residual blocks improve the accuracy of the pansharpening result of the proposed MCRNN. When the number of residual blocks is 2, the accuracy of the pansharpening result is the best. In addition, the time cost per epoch when training the MCRNN with different numbers of residual blocks is shown in Fig. 20. Increasing the number of residual blocks increases the computational complexity. In terms of accuracy and computation time, two residual blocks yield the best performance for the proposed MCRNN.

B. Number of Hidden Layers of ConvGRU Units
In the deep feature fusion subnetwork, the number of hidden layers is a key parameter for ConvGRU units. It affects the accuracy of pansharpening result and training time.
In order to perform a fair comparison, the number of residual blocks is always set to 2 here. Table X shows the assessment of the proposed MCRNN for different numbers of hidden layers in the ConvGRU units. In addition, the time cost per epoch when training the MCRNN with different numbers of hidden layers is shown in Fig. 21. As the number of hidden layers increases, the quantitative evaluation indicators show a small improvement, but the training time increases significantly. When the number of hidden layers is 2, the proposed MCRNN achieves the best pansharpening result in terms of accuracy and time.

C. Number of Parameters
The NOPs of six DL-based pansharpening methods are compared in Table XI. Table XI shows that the NOP of the proposed MCRNN is significantly less than those of most of the other methods. Although the NOP of the proposed MCRNN is not the lowest, it is close to the lowest NOP, which is obtained by PNN. Therefore, the proposed MCRNN is a good lightweight network for pansharpening.

D. Convergence Analysis
In this section, we train all the DL-based pansharpening methods and obtain their MSE curves. As shown in Fig. 22, in order to clearly show the differences between the curves, the ordinate of the MSE graph is set to a logarithmic scale. For readability, only the first 250 000 iterations are shown. The results show that the proposed MCRNN achieves faster convergence with a smaller loss as compared to the other DL-based pansharpening methods. Therefore, this result further proves that the proposed MCRNN has a good lightweight performance.

E. Running Time
Since the traditional pansharpening methods run on a CPU, whereas the DL-based pansharpening methods run on a GPU, their running times are less comparable. Therefore, we only report the running time of the DL-based pansharpening methods, measured from the input of the test data to the output of the result. As shown in Table XII, we list the running times for the QuickBird dataset. Among the six DL methods, PNN has the shortest running time, whereas the proposed MCRNN has the longest. This is mainly because the proposed MCRNN performs feature extraction band by band in the first part.

F. Limitations
Although the proposed MCRNN achieves good results, there is still room for improvement in some respects. First, the proposed MCRNN is an MS pansharpening method. Namely, the MCRNN can be trained on MS images, but for images with more spectral bands, such as hyperspectral images, the performance of the proposed method degrades. This is because extracting the features of spectral images band by band consumes more GPU memory. In addition, because the structure of ConvGRU is relatively simple, it cannot model the relationships between hyperspectral bands efficiently. Therefore, in future work, it is worth designing a new RNN model to achieve hyperspectral pansharpening.
Second, the direction of the spectral relationship may be bidirectional to improve the performance of the proposed method. However, to maintain lightweight performance, we do not consider this detail, and the spectral relationship is unidirectional in this article. In the future, the bidirectional RNN could be used to model the relationship between bands to improve the quality of pansharpening result.
Finally, the proposed MCRNN is tested in small-scale images in this article. In future work, the performance of the proposed method in large-scale images is also worth further study.

VI. CONCLUSION
In this work, an MS pansharpening method based on MCRNN is proposed. The proposed MCRNN contains a shallow feature extraction subnetwork and a deep feature fusion subnetwork. The shallow feature extraction subnetwork extracts the shallow features from the PAN and MS images. The deep feature fusion subnetwork models the within- and across-band relationships between bands to capture the local and global correlations, improving the performance of MS pansharpening. The extensive experiments demonstrate that the proposed MCRNN outperforms the traditional pansharpening methods and the state-of-the-art DL-based pansharpening methods. The authors would like to thank the handling editor and the anonymous reviewers for their valuable comments.