Siamese Networks Based Deep Fusion Framework for Multi-Source Satellite Imagery

A critical aim of pansharpening is to fuse coherent spatial and spectral features from panchromatic and multispectral images, respectively. This study proposes a deep siamese network based pansharpening model as a two-stage framework in a multiscale setting. In the first stage, a siamese network learns a common feature space between the panchromatic and multispectral bands. In the second stage, the output feature maps of the siamese network are fused. The parameters of these two stages are shared across scales so that spatial information is added consistently at every scale, while spectral information is preserved through skip connections from the input multispectral image. This multi-level parameter sharing mechanism in the pyramidal reconstruction of the pansharpened image better preserves spatial and spectral details simultaneously. Experiments show that the proposed deep siamese network in a multi-scale setting (which captures inter-band similarity among different sensor data) outperforms several recent pansharpening methods.


I. INTRODUCTION
Remote sensing imaging systems mainly provide two types of images: one enriched in spectral information and the other in spatial information. Owing to technological limitations and costs, satellite sensors cannot capture images with both high spatial and high spectral resolution. Multispectral (MS) images consist of multiple low resolution color bands, while single-channel panchromatic (PAN) images carry high resolution spatial information but lack spectral content. It is therefore desirable to fuse low resolution multispectral (LR-MS) and PAN images to produce high resolution multispectral (HR-MS) images. The challenge is to transfer the maximum amount of information from the two sensors simultaneously. In this regard, various state-of-the-art schemes are discussed below. Existing methods fall into four main categories: component substitution (CS) methods, multiresolution analysis (MRA) based on image decomposition, variational optimization (VO), and deep learning (DL) based on convolutional neural networks (CNNs) and residual blocks.
The associate editor coordinating the review of this manuscript and approving it for publication was Gianluigi Ciocca.
Wu et al. [1] proposed a pansharpening algorithm based on detail injection with overflow minimization and adaptive spectral contribution of the MS bands to reduce spectral distortion. However, its real-time application is limited due to the use of a guided filter (with less detailed information). In [2], bidirectional data properties are used to carry out blind model estimation for observation and hyperspectral image fusion. The scheme produces high quality results with sharp edges.
In [3], conditional random fields are used to model state and transition functions related to the blur and spatial relationships of the HR-MS image (corresponding to the upsampled MS and PAN images). The scheme efficiently produces impressive results while avoiding spectral distortions; however, convergence of the state function requires more time compared to the modulation transfer function (MTF). In [4], variance and regression are used to define superpixel segments for texture descriptors of the PAN image. Injection coefficients are then computed from the superpixels along with an MTF-matched generalized Laplacian pyramid method. Xu et al. [5] used a feed-forward CNN with a progressive loss function to boost network training and maintain ground truth consistency. The scheme produces favorable results while avoiding gradient disappearance. In [6], multilinear regression is used for polynomial estimation of the relationship between the PAN image and the details that are missing in the MS image. Injection coefficients based on a generalized linear approach, along with MTF filter estimation, produce robust results; however, the scheme sometimes yields slightly poor spectral quality.
Palubinskas [7] used LR-MS and PAN images to estimate the energy imbalance across different bands; image adjustment is then used to improve the results. In [8], adaptive high dimensional components of the MS image, along with scaling factors, are used to minimize spectral distortion in the sharpened image. The model is quite successful; however, it requires optimal parameter values for the best results. In [9], high pass filtering is used for contrast enhancement of the PAN image, and regression coefficients between the MS bands and the PAN image are then used to generate a fused image with better clarity. A hybrid pansharpening method, based on IHS and à-trous wavelet decomposition, injects spatial information into the MS bands while preserving spectral details [10]; however, its computational complexity is higher than that of other CS methods.
Imani [11] proposed a multiresolution pansharpening framework using morphological operators (with different structuring elements) to extract structural details from the MS and PAN images; a max-absolute fusion rule is then used to generate the fused results. Jiao and Wu [12] proposed a restoration framework based on blind deblurring and back projection that avoids ringing artifacts. The scheme uses Tikhonov regularization to perform blurring filter estimation, and high-pass modulation to transfer spatial information while minimizing spatial distortions. In [13], convolution based features of the PAN image are used as prior information in a multi-scale setting; structural loss minimization is then employed to improve structural quality. Lai et al. [14] proposed an encoder-decoder setting for feature extraction from the MS and PAN images (from the coarsest to the finest level). These features and primitive information are then used to reconstruct high resolution MS images in a multi-scale setting. In [15], serially concatenated features of the MS and PAN images are used to select matching correspondences while suppressing noise; the output features are then used to restore spectral and spatial details.
Variational methods transform pansharpening into an observation model definition and optimization problem to obtain the HR-MS image. These methods can be classified into observational models and sparse representation models. Sparse representation based methods use dictionary learning in the reconstruction of the target HR-MS image; the iterative dictionary learning process is generally based on the LR-MS, PAN, and intermediate HR-MS images and therefore incurs higher computational complexity. Lu et al. [16] used the MS and PAN images to obtain a difference image; fast shrinkage optimization and a total generalized variation based model are then used to improve the spectral details of the pansharpened image. Khademi and Hassan [17] proposed a pansharpening model based on a primal-dual representation to balance the spectral and spatial information in the fused image. The scheme is efficient compared to other vector based methods while avoiding spectral distortion. In [18], CNN based proximal details along with spatially dependent weights are used as a prior-information regularizer. Fu et al. [19] formulated the HR-MS image as a variational optimization problem based on regression and the gradient difference of the HR-PAN and LR-MS images (for spatial preservation).
A shallow architecture [20] uses three convolutional layers (with different kernel sizes) to improve the HR-MS result. In [21], a target-adaptive deep fusion network is proposed for pansharpening; residual learning and an L1 loss function, along with a fine-tuning pass, show robust pansharpening performance compared to PNN [20]. Wei et al. [22] proposed a two-stage deep residual learning based PNN (DRPNN) using multiple convolutional layers. The scheme shows favorable results for pansharpening based on residual estimation with the provision of skip connections. Pan-GAN [23] uses a generator-discriminator setting in an unsupervised deep learning framework; the scheme uses two discriminators to preserve the spectral and spatial details of the MS and PAN images in the resultant HR-MS image. Likewise, Zhou et al. [24] used a QNR (quality with no reference) based loss function along with discriminators for pansharpening. These methods produce good quality results without ground truth information. However, the down-sampling methods affect the performance evaluation of down-sampled HR-MS images and may be further explored to avoid distortions.
Dian et al. [25] formulated the reconstruction of a high-resolution hyperspectral image (HR-HSI) as an optimization problem solved through deep learning; residual learning priors are learnt to construct the target HR-HSI. However, the performance could be enhanced using nonlocal and low-rank prior information. In [26], dilated and concatenated feature maps in a multi-scale setting are used to improve the spatial content of the target HR-MS image; in addition, the scheme removes batch normalization and thus improves the time complexity of the network. Hu et al. [27] incorporate a super resolution method in a deep learning framework to generate an HR-MS image from the LR-MS image. The pansharpened image is then obtained from the HR-MS image and details extracted using super resolution and total loss methods. However, the scheme could be further improved by exploring a multi-scale detail injection method. The scheme proposed in [28] uses kernels of different sizes and channel-wise rescaled maps to improve feature extraction; it combines a channel attention mechanism with feature extraction and residual learning in a multi-scale fashion to improve network convergence time.
Recently, Shamsolmoali et al. [29] proposed a multi-scale architecture with multi-patch learning and gradient norms for robust object detection; the cross-scale connections enable adaptive selection of channels and feature map sizes, thus improving computational efficiency. Wang et al. [30] used wavelet decomposition, a three dimensional CNN, and dimension amplification to improve super-resolution. The scheme could be further improved by following the Heisenberg uncertainty principle (to overcome shortcomings of the wavelet transform). Similarly, a rotation feature pyramid framework [31] uses multibox detection and a region proposal network to improve region-of-interest estimation and object localization; scale adaptation could further enhance detection capability (for small objects) in aerial scenes. Yang et al. [32] use an upsampled MS image for spectral correction of high-pass components; their PanNet addresses the shortcomings of ResNet while preserving spectral and spatial information. A combined VO and DL based method [33] uses convolutional sparse coding to extract shift-invariant features in a variational model. The pansharpened image synthesis step is then merged with the training phase, along with traditional optimization methods, to boost fusion performance.
To overcome the limitations of the existing schemes, the proposed scheme synchronously transfers spatial and spectral details into the resultant image. The proposed siamese fusion network bank is initialized with the up-sampled MS (UP-MS) and down-sampled PAN images. The siamese network bank then learns inter-band similarity among the different MS and PAN bands to maximize the details available for fusion. The resulting maps are passed to a global feature fusion block, which adopts a local-global fusion strategy to estimate the pansharpened image at a particular level. Skip connections ensure utilization of the source information. To simultaneously preserve spectral and spatial content in the reconstructed HR-MS image, network parameters are shared across the multi-level local and global fusion blocks. Likewise, fusion loss optimization is carried out using the L2 norm and ground truth information at different levels. The major contributions of the proposed scheme are:
• The proposed work employs siamese fusion networks in a multiscale pyramidal setting to obtain an inter-band similarity prior among different MS and PAN bands.
• A local-global fusion strategy is adopted to estimate the pansharpened image at each level from the concatenated feature maps.
• Multi-level parameter sharing helps in preserving the spectral and spatial content simultaneously (in the target HR-MS image).
The remainder of the paper is structured as follows: the proposed system model and mathematical formulations are discussed in Section II. Simulation results and experimental findings are discussed in Section III. Finally, Section IV concludes the paper along with some future directions.

II. PROPOSED SCHEME
The proposed model is a two-stage framework. In the first stage, a siamese network learns a common feature space between the panchromatic and multispectral bands. In the second stage, the output feature maps of the siamese network are fused. The parameters of these two stages are shared across scales to add spatial information consistently at every scale, while the spectral information is preserved by adding appropriate skip connections from the input multispectral image.
The up-sampled MS image $X^M$ is obtained from the captured MS image $I^M$ as
$$X^M = (I^M \uparrow) * g,$$
where $*$ denotes the convolution operator and $g$ represents a linear interpolation filter [34] bearing symmetric and separable properties. Let $I_m$ and $X_m$ represent the $m$-th band of $I^M$ and $X^M$ respectively, and let $I^P$ denote the PAN image. The captured MS images have one quarter of the resolution of the PAN images. To make the originally captured images workable, the PAN image at scale $\omega$ is downsampled by a factor of 2 as
$$X^P_{\omega+1} = \left(X^P_{\omega} * g\right)\downarrow_2.$$
The siamese network feeds the deep fusion network with the corresponding stacked scale components $S_\omega$. To minimize the spectral and spatial distortions, global and local features are jointly processed. The scale-space approximation of the target HR-MS image is given as
$$\hat{X}^M_{\omega} = X^M_{\omega} + \mathrm{FuseNet}(S_\omega),$$
where $\mathrm{FuseNet}(S_\omega)$ provides a deep convolutional mapping, as shown in Fig. 2. The same network parameters are shared across the different scales. The framework uses multispectral skip connections $X^M_\omega$ for the interpolated estimate of the associated input image. The cascaded siamese network and $\mathrm{FuseNet}(\cdot)$ are then recursively applied using $X^M_{\omega+1}$ and $X^P$. The final reconstructed HR-MS image can be represented as
$$\hat{X}^M = \hat{X}^M_{\omega_0},$$
where $\omega_0$ indicates the desired scale for the fused target.
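As a minimal sketch of the pyramidal decomposition described above (assuming a binomial [1, 2, 1]/4 kernel as a stand-in for the symmetric, separable filter $g$; the paper does not specify the exact taps), the PAN image can be smoothed and decimated by a factor of 2 at each level:

```python
import numpy as np

def smooth(img, g=np.array([0.25, 0.5, 0.25])):
    """Apply the symmetric, separable 1-D filter g along rows, then columns."""
    tmp = np.apply_along_axis(lambda v: np.convolve(v, g, mode="same"), 0, img)
    return np.apply_along_axis(lambda v: np.convolve(v, g, mode="same"), 1, tmp)

def pan_pyramid(pan, levels):
    """Build {X^P_0, ..., X^P_levels}: smooth with g, then decimate by 2."""
    pyramid = [pan]
    for _ in range(levels):
        pyramid.append(smooth(pyramid[-1])[::2, ::2])
    return pyramid
```

After two levels the PAN image reaches the resolution of the captured MS bands, matching the stated 4:1 PAN-to-MS resolution ratio.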

B. DEEP FUSION NETWORK
In order to minimize the spectral and spatial distortions, the proposed residual learning framework simultaneously processes the global and local features. The proposed fully convolutional FuseNet is shown in Figure 3. The output $Y(\ell)$ at the $\ell$-th layer is calculated as
$$Y(\ell) = \alpha\big(W(\ell) * Y(\ell-1) + \delta(\ell)\big),$$
where $\alpha$ represents the activation function while $W(\ell)$ and $\delta(\ell)$ indicate the filter weights and bias associated with the $\ell$-th layer, respectively. The first layer performs shallow feature extraction on the $\omega$-th input stack $S_\omega$, i.e., $Y(0) = S_\omega$. The extracted output $Y(1)$ is then passed to a stack of $R$ residual learning blocks, each of which learns local features for fusion.
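A literal numpy rendering of the layer recursion above may help fix the shapes: a zero-padded "same" convolution plus a leaky rectifier standing in for $\alpha$ (the 0.2 slope and the 32-channel width are taken from the training details later in the paper; the padding mode is an assumption):

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    """Activation alpha: identity for positive inputs, scaled negative part."""
    return np.where(x >= 0, x, slope * x)

def conv2d_same(x, w, b):
    """'Same' 2-D convolution: x is (C_in, H, W), w is (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    h, wd = x.shape[1:]
    out = np.zeros((c_out, h, wd))
    for o in range(c_out):
        for c in range(c_in):
            for i in range(k):
                for j in range(k):
                    out[o] += w[o, c, i, j] * xp[c, i:i + h, j:j + wd]
        out[o] += b[o]                      # bias delta(l)
    return out

def layer(y_prev, w, b):
    """Y(l) = alpha(W(l) * Y(l-1) + delta(l))."""
    return leaky_relu(conv2d_same(y_prev, w, b))
```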

C. LOCAL FEATURES MAP ESTIMATION
Each residual learning block estimates a local feature fusion map by linear weighting of the cascaded channels. The input to the residual learning block is sequentially passed through convolutional layers with filter size $f_s = 5 \times 5$. The concatenated outputs of the subsequent layers, along with the block input, are then passed through a final convolutional layer with $f_s = 1 \times 1$ to estimate the linearly weighted feature map.
Since the filter weights and biases are shared across the $R$ blocks, the output of the $\ell$-th layer in the $r$-th local fusion block is determined as
$$Y_r(\ell) = \alpha\big(W(\ell) * Y_r(\ell-1) + \delta(\ell)\big).$$
The third layer in an arbitrary local feature block can then be represented as
$$Y_r(3) = W(3) * \big[\,Y_r(0),\, Y_r(1),\, Y_r(2)\,\big] + \delta(3),$$
where $[\cdot]$ denotes channel-wise concatenation of the block input and the outputs of the preceding layers.
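The "linear weighting of the cascaded channels" reduces to a 1×1 convolution, i.e., a channel-mixing matrix applied at every pixel. A sketch under that reading, with the 5×5 convolutional layers abstracted as callables (their exact count and widths are assumptions here):

```python
import numpy as np

def local_fusion_block(y_in, layers, w1x1, b1x1):
    """Cascade the block input through `layers`, concatenate every
    intermediate map with the input, then mix channels with a 1x1 conv."""
    feats = [y_in]
    for f in layers:                          # stand-ins for the 5x5 conv layers
        feats.append(f(feats[-1]))
    stacked = np.concatenate(feats, axis=0)   # cascaded channels, (C_total, H, W)
    # 1x1 convolution == per-pixel linear weighting across channels
    return np.tensordot(w1x1, stacked, axes=1) + b1x1[:, None, None]
```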

D. GLOBAL FEATURES MAP ESTIMATION
The shallow feature map $Y(1)$ provides a skip connection and is passed as an additional input to each local fusion block, i.e.,
$$Y_r(0) = \big[\,Y_{r-1}(3),\, Y(1)\,\big].$$
The estimated local feature maps are then used for global feature map estimation. The concatenated outputs of the $R$ residual blocks are expressed as
$$Y_R = \big[\,Y_1(3),\, Y_2(3),\, \ldots,\, Y_R(3)\,\big].$$
The feature maps $Y_R$ are convolved using $f_s = 1 \times 1$ and passed through another convolutional layer with $f_s = 5 \times 5$ to generate an $M$-channel output. The provision of skip connections is useful for model convergence: deep architectures with skip connections omit some layers and feed the output of one layer directly as input to later layers. The network parameters of FuseNet are optimized by minimizing the $L_2$ norm loss function defined as
$$\mathcal{L} = \sum_{\omega} \big\| Y^M_\omega - G^M_\omega \big\|_2^2,$$
where $G^M$ is the ground truth HR-MS image and $Y^M$ is the approximation of the target output at the corresponding scale $\omega$. The ground truth HR-MS image is downsampled to compute the loss function at each scale. The bank outputs with skip connections are then used by the global feature fusion network to learn global features. The difference at the next stage is that the eight MS bands are upscaled again to make them workable with the original full-scale HR PAN image. The loss function uses the ground truth HR-MS image in a multi-scale setting to preserve spectral and spatial information simultaneously. The corresponding siamese block parameters are shared across scales to better learn spatial information; similarly, across-scale parameter sharing for the global fusion network ensures better transfer of spectral details.
Figure 3 gives an insight into the fusion block with learnable layers. The final layer of the implemented deep fusion network outputs $M$ channels; apart from the final layer, each layer operates with 32 feature maps. The learning mechanism uses the Xavier procedure [35] for parameter initialization. A batch size of 20 is used along with a patch size of $192 \times 192$. The number of iterations and epochs are set empirically to 8000 and 16, respectively.
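The multi-scale $L_2$ objective described above can be sketched as a plain sum of squared errors across pyramid levels (the per-level weighting, if any, is not specified in the text and is assumed uniform):

```python
import numpy as np

def multiscale_l2_loss(outputs, references):
    """Sum ||Y^M_w - G^M_w||_2^2 over all pyramid levels w, where each
    reference is the ground-truth HR-MS image downsampled to that level."""
    return sum(float(np.sum((y - g) ** 2)) for y, g in zip(outputs, references))
```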
For non-linear processing, leaky rectified linear units (with a negative-part slope of 0.2) are used as the activation function. Loss minimization is carried out using the gradient-based Adam optimizer [36] with the learning rate kept at $10^{-3}$. The execution time on the CPU and the parameter counts are listed in Table 1. To train and evaluate the proposed fusion framework against several state-of-the-art schemes, we used a CPU with 12 GB of RAM. The table data show that the proposed architecture takes less time to perform pansharpening than many other state-of-the-art architectures. Table 1 also provides the number of network parameters for the existing schemes and the proposed model.
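For reference, a single Adam update [36] with the stated learning rate of $10^{-3}$ looks as follows (the β1, β2, and ε defaults are assumptions, as the paper does not list them):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: moment estimates, bias correction, parameter step."""
    m = b1 * m + (1 - b1) * grad           # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2      # second-moment (variance) estimate
    m_hat = m / (1 - b1 ** t)              # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```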

III. EXPERIMENTS & EVALUATIONS
To fully assess the robustness of the proposed scheme, a large number of experiments are carried out using the WorldView-2 (WV-2) [37] and WorldView-3 (WV-3) [38] datasets. The fusion quality of a pansharpened image essentially depends on the inclusion of spectral and spatial information while avoiding distortions in the fused output. In this regard, comparisons are made with recent deep learning models, including CNN based pansharpening (PNN) [20], enhanced PNN (PNN+) [21], deep residual PNN (DRPNN) [22], and PanNet [32]. In addition to deep learning models, the recently proposed variational pansharpening with local gradient constraints (VPLG) [19] is also included in the comparison. Pre-trained models of all of these schemes are available online.

A. 8-BAND PANSHARPENING
The presented work considers pansharpening using PAN images and LR-MS images having 8 bands. Over 430 image pairs from the WV-2 [37] and WV-3 [38] datasets are used in this evaluation. The satellite images provided by the WV-2 and WV-3 sensors have different spatial resolutions and a size of 512 × 512. Nearly 250 samples belong to WV-2 while the remaining are captured by WV-3. Nearly one third of the acquired samples (selected at random) are used for the training phase, and the remaining samples are used during the testing phase. Although these sensors provide different types of imagery, the proposed model learns network parameters for both sensors simultaneously.

B. REDUCED SCALE FUSION EVALUATION
The proposed method is compared for both reduced-scale and full-scale pansharpening assessment. Image quality at high and low resolutions is closely related [43]. As ground-truth HR-MS images are not available, fusion quality assessment is carried out by considering the original low resolution MS image as the ground truth $G^M$. Similar to [32], the PAN and MS images are downsampled, and the fused output is then compared at the resolution of the available MS image.
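This reduced-scale (Wald-style) protocol can be sketched as follows, with plain decimation standing in for the MTF-matched low-pass filtering commonly used in practice (the filtering choice is an assumption, not taken from the paper):

```python
import numpy as np

def reduced_scale_pair(ms, pan, ratio=4):
    """Degrade both inputs by `ratio`; the original MS becomes the reference.
    ms: (B, H, W) multispectral stack; pan: (ratio*H, ratio*W) PAN image."""
    ms_lr = ms[:, ::ratio, ::ratio]    # stand-in for MTF filtering + decimation
    pan_lr = pan[::ratio, ::ratio]
    return ms_lr, pan_lr, ms           # fuse (ms_lr, pan_lr), score against ms
```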
The various metrics used for evaluating pansharpening at reduced scale are briefly described here. The spectral angle mapper (Q_SAM) [39] compares the pixel-wise difference in spectral information, while the universal image quality index (Q_UIQI) [40] computes distortions in spectral bands using factors such as correlation and luminance. Likewise, the global adimensional relative error of synthesis (Q_ERGAS) [42] computes a global error, while the spatial correlation coefficient (Q_SCC) [41] computes the correlation with the reference image. Lower values of Q_SAM and Q_ERGAS correspond to less distortion, whereas higher values of Q_UIQI and Q_SCC indicate better quality.
Figure 5 presents an example of reduced scale pansharpening for visual illustration using WV-3 satellite image pairs. Figure 5(a) shows the reference MS image while figures 5(b)-(g) present the results for the compared schemes and the proposed model. The corresponding differences (for each scheme against the reference image) are shown in figure 6. Among the compared schemes, DRPNN [22] (figure 5(d)) and VPLG [19] (figure 5(f)) produce visually better quality results. However, a blurring effect can be seen in some areas of the produced results, and DRPNN [22] further suffers from spectral distortion. PanNet [32] (figure 5(e)) renders pleasant output, even though the results are slightly over-sharpened. The PNN+ [21] result (figure 5(c)) shows some lack of clarity and suffers from blurring artifacts with some distortions. Similarly, PNN [20] (figure 5(b)) suffers from color distortion. In comparison to the existing schemes, the proposed framework (figure 5(g)) outputs a good quality fused image that is close in appearance to the reference image.
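Two of the reduced-scale metrics described above have compact closed forms; a numpy sketch of Q_SAM (mean spectral angle in degrees) and Q_ERGAS (with the standard 100/ratio scaling) under their usual definitions:

```python
import numpy as np

def q_sam(ref, fused, eps=1e-12):
    """Mean per-pixel spectral angle (degrees); ref/fused are (B, H, W)."""
    dot = np.sum(ref * fused, axis=0)
    norms = np.linalg.norm(ref, axis=0) * np.linalg.norm(fused, axis=0)
    angles = np.arccos(np.clip(dot / (norms + eps), -1.0, 1.0))
    return float(np.degrees(angles.mean()))

def q_ergas(ref, fused, ratio=4):
    """Global relative error of synthesis; lower values are better."""
    rmse = np.sqrt(np.mean((ref - fused) ** 2, axis=(1, 2)))   # per band
    means = np.mean(ref, axis=(1, 2))                          # per band
    return float(100.0 / ratio * np.sqrt(np.mean((rmse / means) ** 2)))
```

A perfect fusion yields 0 for both indices, which is consistent with "lower values correspond to less distortion".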
A quantitative comparison of the existing and proposed schemes is provided in Table 2, which reports mean values over the pansharpened test images using WV-3 and WV-2 satellite data. The obtained metric values illustrate that the proposed method outperforms recent pansharpening methods while minimizing spatial and spectral distortions simultaneously.

C. FULL SCALE FUSION EVALUATION
The learnt model is also tested for full scale pansharpening. In the absence of ground truth, the input LR-MS and PAN images are used as spectral and spatial references, respectively. For full scale pansharpening evaluation, no-reference quality metrics, computed with the aid of the degraded PAN image, are used for assessment. In this regard, the spectral and spatial distortions, denoted by $D_\lambda$ [44] and $D_s$ [44], are computed. Additionally, the quality with no reference (QNR) [45] index is used to estimate global distortion. Lower values of $D_\lambda$ and $D_s$ indicate better quality results (with minimum distortions), whereas a higher value of QNR indicates better fusion of the target HR-MS image.
Figure 7 illustrates the full scale pansharpening case using WV-2 sensor images. The input PAN and interpolated MS images are shown in figures 7(a) and 7(b), respectively. PNN [20] (figure 7(c)) generates reasonable output; however, the scheme suffers from color distortion. PNN+ [21] (figure 7(d)) has better colors but suffers from artifacts, which are quite evident in the rooftop areas. DRPNN [22] (figure 7(e)) produces good spatial features (edges are preserved); however, the spectral quality suffers and several objects on the right side of the image are not easily distinguishable. PanNet [32] (figure 7(f)) renders good quality output in which objects are easily distinguishable and colors are preserved; still, some spectral details are missing and the image is slightly over-sharpened. In the resultant image of VPLG [19] (figure 7(g)), color details are preserved to a great extent, but the output suffers from some artifacts. In comparison to the existing schemes, the proposed model (figure 7(h)) renders a good quality fused image in all aspects. Table 3 presents the measures for spectral, spatial and global distortions. In comparison to the other schemes, VPLG [19] and the proposed scheme have better values, and the corresponding table entries for these two schemes are similar.
However, careful observation of the metric values indicates that VPLG [19] tends to better minimize spectral distortion in terms of $D_\lambda$, while the proposed model better minimizes spatial distortion, $D_s$.
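The trade-off above is combined by the QNR index, which under its standard definition multiplies the two distortion complements (the exponents α = β = 1 are the usual defaults, assumed here):

```python
def qnr(d_lambda, d_s, alpha=1.0, beta=1.0):
    """Quality with No Reference: 1.0 means no spectral/spatial distortion."""
    return (1.0 - d_lambda) ** alpha * (1.0 - d_s) ** beta
```

So a method with slightly lower $D_\lambda$ and one with slightly lower $D_s$ can land on nearly identical QNR values, consistent with the similar table entries reported for VPLG and the proposed scheme.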

IV. CONCLUSION
This work presents a pansharpening framework, named DSN-PAN, employing a siamese network in a multi-scale setting. The siamese network extracts feature maps based on inter-band similarities among the adjacent multispectral bands and the spatially enriched panchromatic band. A local-global fusion network generates the pansharpened image at each level of the Gaussian pyramid with shared learning parameters. The fusion performance of the proposed method validates its superiority for reduced and full scale pansharpening. In future work, we may explore different backbones for the proposed framework.