No Reference Pansharpened Image Quality Assessment Through Deep Feature Similarity

Pansharpening refers to the process of enhancing the spatial resolution of a multispectral image with the help of a high spatial resolution panchromatic (PAN) image. Quality assessment (QA) of pansharpened images provides a formal framework for the analysis and design of pansharpening methods and is thus extremely important. However, the lack of a reference multispectral image makes QA of pansharpening algorithms a challenging task. Given the popular use of QA algorithms that use a reference, this article focuses on predicting such quality measures under a "no-reference" (NR) setting. Specifically, a learning-based NR pansharpened image quality assessment (IQA) approach is adopted to predict state-of-the-art reference-based measures such as $Q2^{n}$ and the spectral angle mapper without the need of a reference. We design an end-to-end deep pansharpening IQA network that computes the similarity of deep features fused from the PAN and input low-resolution multispectral images with similar features extracted from the given pansharpened image. To train and test our learning-based approach, we create a large corpus of pansharpened images belonging to different satellites and thematic scenes by applying different pansharpening algorithms. Our experiments demonstrate that our NR pansharpened IQA algorithm achieves excellent performance and generalizes well across different satellites and resolutions.


I. INTRODUCTION
PANSHARPENING refers to the problem of fusing low spatial resolution multispectral images with a high-resolution panchromatic (PAN) image to obtain high spatial resolution multispectral images [1]. While the problem has been widely studied in the literature, the quality assessment (QA) of pansharpened images has received much less attention. QA is important in benchmarking the performance of pansharpening algorithms, apart from providing a formal approach to optimize these algorithms. QA also has applications in the quality control of images to determine their suitability for further use in downstream applications. While several measures of quality, such as the relative dimensionless global error (ERGAS) [2], the spectral angle mapper (SAM) [3], and $Q2^n$ [4], are popularly used to evaluate pansharpened images, these measures require a reference high-resolution multispectral (HRMS) image for QA. This assumption is a severe limitation in the application of QA when such reference images may not be available, and it motivates the study of no reference (NR) pansharpened image quality assessment (IQA).

Indeed, the problem of NR pansharpened IQA has been studied in the literature to some extent. The quality without reference (QNR) index is one of the earliest NR measures; it evaluates spectral distortion using image quality index (QI) [5] values calculated between pairs of multispectral bands before and after fusion, and spatial distortion using QI values calculated between each multispectral band and the PAN image before and after fusion [6]. The hybrid quality without reference (HQNR) index [7] improves the QNR index by combining elements of the QNR index with Wald's protocol [8]. The idea of extrapolating multiscale measurements at lower resolutions to the desired resolution through polynomial curve fitting has also been used in NR pansharpened IQA [9]. Vivone et al. [10] further designed its Bayesian extension for NR pansharpened image QA. Further, such measures have also been used as loss functions while training pansharpening algorithms [11], [12]. Nevertheless, the main drawback of such measures is that they tend to be inconsistent when comparing different families of pansharpening algorithms [1].
Despite the existence of various approaches inspired by QNR and multiscale extrapolations, the role of learning-based methods in designing NR pansharpened IQA methods has not been studied much to the best of our knowledge. Further, the success of deep learning in pansharpening itself has not been sufficiently leveraged in the QA of pansharpened images. Recently, quality-aware natural scene statistics based features were studied for learning-based perceptual QA of pansharpened images [13]. However, such an approach requires elaborate human studies to collect human opinion scores to evaluate NR pansharpened IQA algorithms.
Instead, we focus on a slightly different learning framework for NR pansharpened IQA. We consider the particular setting where a reference HRMS image is available for generating ground truth quality scores while training on one satellite, but the algorithm is expected to work without the need of a reference image on a target satellite. In particular, we ask whether our NR pansharpened IQA algorithm can predict the quality of pansharpened images at a similar resolution on other satellites without the need for a reference. We believe this situation is quite common in practice. We seek to design NR algorithms to predict state-of-the-art full reference measures, such as $Q2^n$ and SAM, which can be computed using the reference in the training dataset. We choose these full reference measures as the ground truth quality since they are considered more reliable quality scores when a reference HRMS image is available. We note that our problem formulation is a little different from the classical settings of full resolution and reduced resolution studied in pansharpened IQA.
We develop a novel framework based on deep learning to predict $Q2^n$ and SAM without the need for a reference. In particular, our deep pansharpened IQA (DPIQA) measure is designed by computing the similarity of deep features fused from the PAN and input low-resolution multispectral (LRMS) images with similar features extracted from the given pansharpened image. The premise behind our approach is to learn quality features from the PAN and LRMS images that can approximate the features extracted from the reference to evaluate the quality of the pansharpened image. Further, we progressively fuse the features from the PAN and multispectral images, allowing for a rich representation of features for quality comparisons.
We conduct detailed experiments on a large corpus of images from four different satellites, namely, WorldView-2, WorldView-3, WorldView-4, and Ikonos. More specifically, we create a large database by applying several different pansharpening algorithms on images from these satellites to obtain pansharpened images of varying quality. Since the paired multispectral and PAN images are publicly available, we will publicly release this pansharpened image dataset to enable a standard benchmark for evaluating pansharpened image QA algorithms. The source code and the dataset will be made available at https://github.com/neerajbadal/DPIQA.
We show through several experiments that our DPIQA method achieves state-of-the-art performance in terms of predicting both $Q2^n$ and SAM. In particular, we evaluate the performance of our algorithms by training on just one of the satellites, yet show very good performance on different satellites with similar resolutions of pansharpened images. Our experiments reveal that the NR pansharpened IQA algorithms we design are quite robust and generalize well across different satellites.
In summary, our contributions are as follows.
1) We introduce an NR quality assessment framework where we learn to predict full reference measures without using an HRMS image at test time.
2) We design a DPIQA measure based on computing the similarity of quality representations extracted from the PAN/LRMS images and the pansharpened image, respectively.
3) We create a dataset of pansharpened images from different satellites and pansharpening algorithms for experiments, which we will make publicly available.
4) We show through detailed experiments that our DPIQA method achieves very good generalization performance in terms of predicting $Q2^n$ and SAM on images from test satellites different from the training satellite.
The rest of this article is organized as follows. Section II reviews the literature on different classes of pansharpening approaches and the quality indexes (QIs) used in pansharpened IQA. Section III describes the problem statement and the construction of the pansharpened image database used for training and testing the deep learning framework. Section IV then describes the proposed deep learning network in detail. Performance comparisons and ablation studies are reported in Section V. Finally, Section VI concludes this article and discusses future work.

II. RELATED WORK

A. Pansharpening Algorithms
Existing pansharpening methods are mainly divided into a few categories: component substitution (CS), multiresolution analysis (MRA), variational and sparsity based approaches, and learning-based methods. The CS-based algorithms include intensity-hue-saturation (IHS) [14], generalized IHS (GIHS) [15], the Brovey transform [16], principal component analysis (PCA) based injection [17], [18], [19], and band-dependent spatial detail with local parameter estimation (BDSD) [20]. These methods perform pansharpening by substituting components of the transformed multispectral input with the PAN image. The substituted multispectral image is then transformed back to the original space [1]. The MRA-based methods comprise algorithms such as high pass filter (HPF)-based injection [21], the generalized Laplacian pyramid [22], and additive wavelet luminance proportional [23]. These methods first compute the spatial information as the difference between the PAN image and its low-pass version. The spatial information is then injected into the upsampled LRMS image to achieve pansharpening while preserving spectral information to a large extent.
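To make the MRA principle concrete, the following is a minimal PyTorch sketch of HPF-style detail injection; the box low-pass filter, its kernel size, and the unit injection gain are simplifying assumptions, as the methods above [21], [22], [23] use carefully designed filters and gains.

```python
import torch
import torch.nn.functional as F

def hpf_pansharpen(lrms: torch.Tensor, pan: torch.Tensor, c: int = 4) -> torch.Tensor:
    """Toy MRA/HPF injection: lrms is (1, b, h, w), pan is (1, 1, c*h, c*w)."""
    # Upsample the LRMS image to the PAN scale.
    m_up = F.interpolate(lrms, scale_factor=c, mode="bicubic", align_corners=False)
    # Low-pass the PAN image; the high-pass residual carries the spatial detail.
    pan_low = F.avg_pool2d(pan, kernel_size=2 * c + 1, stride=1, padding=c)
    detail = pan - pan_low
    # Inject the same detail into every spectral band (unit gain is an assumption).
    return m_up + detail
```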
Pansharpening methods have also been designed based on variational approaches [24], [25], Bayesian models [26], and sparsity-based compressed sensing ideas [27]. The success of deep learning in image super-resolution has enabled researchers to explore this avenue for pansharpening. Popular models are based either on convolutional neural networks, such as CNN-based pansharpening (PNN) [28] and the image super-resolution inspired PanNet [29], or on generative adversarial learning, such as pansharpening based on a generative adversarial network (PanGAN) [30] and a generative adversarial network for remote sensing image pansharpening known as PsGAN [31].

B. QA of Pansharpened Images
Pansharpened image QA can be broadly classified into full reference and NR QA, similar to the natural IQA literature [32]. Full reference QA refers to the scenario where the quality of a pansharpened image is assessed by comparing it to a reference HRMS image. Measures such as ERGAS [2], SAM [3], the universal image quality index (UIQI) [5], and $Q2^n$ [4] can be evaluated by comparing the pansharpened image with a corresponding HRMS image.
However, often when pansharpening is applied, a reference HRMS image is not available for QA. This motivates the need to study NR pansharpened IQA. Early efforts in this space used the original LRMS image as a reference for evaluating the pansharpened image through Wald's protocol [8]. The premise behind such an approach was that any addition of spatial details from the PAN image should not lead to spectral distortions in the fused image. There are broadly two main principles that govern the design of pansharpened IQA strategies, namely, the consistency and synthesis properties of the pansharpened image. Consistency requires the fused image to be as similar as possible to the original LRMS image. Note that this is only a necessary condition since it does not evaluate the appropriateness of the spatial resolution enhancement. The synthesis property requires that the pansharpened MS image in each spectral band be as close as possible to the MS image that would have been observed at that resolution. Further, the mutual relations among the spectral bands are required to be similar to what the sensor would observe at that resolution [33]. Thus, the synthesis property considers the evaluation of both the spectral and the spatial quality of the fused image. The two properties have been considered in detail in the literature [34].
NR algorithms in the literature include methods such as the QNR index [6], the method by Khan et al. [35], and HQNR [7]. QNR and HQNR are primarily designed to predict the full reference $Q2^n$ measure. QNR measures the spectral and spatial distortions separately and then combines them. The spatial distortion, in turn, is measured by computing the similarity of each band with the PAN image and comparing such similarities before and after pansharpening. While computing the similarity before pansharpening, a downsampled PAN image is used for comparison. The spectral distortion is assessed by comparing similarities among pairs of spectral bands before and after pansharpening.
The method by Khan et al. [35] uses a combination of Wald's protocol and their own measure in the spatial domain. In particular, they compute the UIQI between the high pass information of the pansharpened spectral bands and the PAN image at two scales. The absolute difference in UIQI at the two scales measures spatial quality. The HQNR approach refines the QNR approach by combining the spatial distortion measure from QNR with the spectral consistency measure of Khan's protocol [35]. There are several other extensions with modifications in the similarity measure [36] and the use of the natural image quality evaluator [37]. The spatial and spectral distortions have also been jointly measured using a multivariate Gaussian model [38].
Attempts have also been made to evaluate reference-based measures at full scale through multiscale extrapolations [9]. However, their success has been limited to the prediction of $Q2^n$. The use of quality aware natural scene statistics-based features to evaluate perceptual quality at full scale is promising, but it requires detailed human studies to collect human opinion scores to evaluate NR IQA algorithms [13].
While deep learning has impacted the pansharpening process itself [39], we observe that there is a lack of significant work in deep learning based pansharpened IQA. Deep networks have been used in the natural IQA literature [40], but the question of how to extract quality features from the pansharpened image and use the input LRMS and PAN images for QA with deep networks remains unexplored. Our work aims to fill this gap by learning deep features from the pansharpened, LRMS, and PAN images to predict the reference-based metrics $Q2^n$ and SAM without the need of a reference. We train such a deep algorithm on a large custom corpus of pansharpened images comprising images from different satellites and pansharpening algorithms.

III. PROBLEM STATEMENT AND DATASET CONSTRUCTION
This section first introduces the notation and problem statement followed by a description of the datasets and their generation details.

A. Notation and Problem Statement
Let $M^* \in \mathbb{R}^{h \times w \times b}$ denote the HRMS image with height h, width w, and b spectral bands. Let c be the spatial resolution ratio between the HRMS and LRMS images. We denote the LRMS image as $M \in \mathbb{R}^{(h/c) \times (w/c) \times b}$ and the PAN image as $P \in \mathbb{R}^{h \times w}$. The pansharpened image is denoted as $X \in \mathbb{R}^{h \times w \times b}$. Let $\tilde{M}$ denote the LRMS image upsampled to the scale of PAN using bicubic interpolation. We particularly focus on predicting $Q2^n$ and SAM without using a reference, on account of their popular usage in evaluating pansharpening methods.
To validate the NR prediction of SAM and $Q2^n$ against ground truth scores in our experimental setup, we simulate the LRMS image M by downsampling the HRMS image by a factor c. We also simulate the PAN image P by downsampling $P^* \in \mathbb{R}^{ch \times cw}$, a higher-resolution PAN image available in a given database. The pansharpening algorithms are applied on M and P to obtain the pansharpened image X. The HRMS image $M^*$ serves as a reference MS image to compute the ground truth $Q2^n$ and SAM scores for training and performance evaluation. The goal of NR pansharpened image quality methods is to predict these $Q2^n$ and SAM measures given M, P, and the pansharpened image X. While the HRMS image $M^*$ is available during the training phase for us to compute the ground truth, during testing on a different satellite, a pansharpened IQA method can only take as input M, P, and X. The training and testing scenarios are described in Fig. 1.
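The following minimal sketch illustrates this reduced-resolution simulation protocol. The use of bicubic filtering for the simulated downsampling is an assumption; the article specifies bicubic interpolation only for upsampling.

```python
import torch.nn.functional as F

def make_reduced_resolution_pair(hrms, pan_hr, c: int = 4):
    """hrms: reference M* of shape (1, b, h, w); pan_hr: P* of shape (1, 1, c*h, c*w).
    Returns the simulated LRMS M and PAN P on which pansharpening is applied."""
    lrms = F.interpolate(hrms, scale_factor=1.0 / c, mode="bicubic", align_corners=False)
    pan = F.interpolate(pan_hr, scale_factor=1.0 / c, mode="bicubic", align_corners=False)
    return lrms, pan  # a pansharpening algorithm maps (M, P) to X; M* gives the ground truth
```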

B. Datasets
To facilitate learning-based QA, large-scale databases are required for training. For effective learning and for verifying the robustness of the quality score prediction, we collect images belonging to different satellite sensors comprising different thematic surfaces with urban areas, green vegetation, and water bodies. Details of the satellite sensors for the data samples we collect are shown in Table I. The WorldView-3 data are taken from the SpaceNet Challenge datasets [41], while the WorldView-4 and Ikonos data are taken from the Harris Geospatial website [42].
While WorldView-2 and WorldView-3 provide eight-band multispectral imagery, the other two datasets consist of four bands. To remain consistent with respect to spectral bands across the different satellite sensors, the commonly matched red (630-690 nm), blue (450-510 nm), green (510-580 nm), and infrared (770-895 nm) bands were picked from the WorldView-2 and WorldView-3 multispectral data. We choose a downsampling ratio c = 4 to create M from $M^*$ and P from $P^*$.

C. Pansharpening Algorithms
To evaluate the quality of pansharpened images, we create a database of pansharpened images by applying different pansharpening algorithms to images from different satellites. In particular, we obtain pansharpened images generated from six different pansharpening algorithms. The first set of these methods includes plain LRMS upsampling (EXP) [9], GIHS [15], PCA-based substitution [17], [18], [19], and the Brovey transform [16], all categorized as component substitution based methods. We also include HPF injection from the multiresolution analysis family of methods [21] and one deep learning based method, PanNet [39]. Fig. 3 highlights a sample image patch from the WorldView-3 dataset along with its ground truth $Q2^n$ score for the Brovey, HPF, PCA, and EXP pansharpening algorithms. Recall that the ground truth quality scores corresponding to the reference-based metrics $Q2^n$ and SAM are generated for the pansharpened image X using $M^*$ as the reference HRMS image. Figs. 4(a)-(d) and 5(a)-(d) show the distributions of the ground truth $Q2^n$ and SAM quality scores, respectively, for all four satellite datasets.

IV. DEEP FEATURE SIMILARITY FOR NR QA
We now discuss the proposed deep learning based QA model DPIQA in detail. We first describe the network architecture in terms of the design choice of feature extractors, feature fusion strategy, and quality score compute block. We then introduce the loss function and implementation details of the training process.

A. Network Architecture
The proposed network is an end-to-end learning framework that estimates the quality score of an input pansharpened image fed along with the corresponding PAN and LRMS images. The PAN and LRMS images are used to extract quality features approximating those of a reference HRMS image. These features are then compared against the pansharpened image features to carry out quality score regression.
As shown in Fig. 6, the proposed model consists of three feature extractor streams followed by a quality score compute block. The feature extraction block design is further illustrated in Fig. 7. We design $\Phi_{pr}$ as a network that fuses features from the PAN and LRMS images to approximate the features of the reference multispectral image in the form of a pseudo-reference. The features from the LRMS image are in turn obtained through $\Phi_m$ and fused progressively with $\Phi_{pr}$. The network $\Phi_x$ denotes the feature extractor function for the pansharpened image. We adopt a slow fusion strategy at each layer in $\Phi_{pr}$ to fuse the features from the PAN and LRMS images. Note that the LRMS and pansharpened images share the same spectral information for a given scene at different spatial resolutions. Thus, we share the weights of $\Phi_m$ and $\Phi_x$. This helps reduce model complexity and more directly compare the features of the pansharpened image with our approximation of the features of the HRMS image. To apply the same network to both the LRMS and pansharpened images, the LRMS image is upsampled to the resolution of the pansharpened image using bicubic interpolation before passing it through the network. Recall that this upsampled image is denoted as $\tilde{M}$. Although we describe the architecture for both $\Phi_m$ and $\Phi_x$ below, we note that their network parameters are shared.

B. Feature Extraction Module
Residual Block Design: We first define a residual block [43] before describing its usage in our architecture. The design of each residual block is shown in Fig. 8. If x is the input, the output x' is realized as
$$x' = \sigma\left(\mathcal{F}(x) + \mathrm{Conv}_3(x)\right)$$
where $\mathcal{F}(x) = \mathrm{Conv}_2(\sigma(\mathrm{Conv}_1(x)))$ and $\sigma(y)$ is the ReLU nonlinear activation function. $\mathrm{Conv}_1$ is the first convolutional layer, and it uses a stride of 2 to reduce the height and width of the input x by half. $\mathrm{Conv}_2$ is the second convolutional layer with a stride of 1. Both convolutional layers use kernels of size 3 × 3. $\mathrm{Conv}_3$ is a 1 × 1 convolutional block with stride 2 deployed to adjust the channels and resolution of the input to be compatible with $\mathcal{F}(x)$.

Features: The feature extraction modules are a series of convolutional layers arranged as stacks of residual blocks inspired by the ResNet model [43] to facilitate efficient training of deep networks. Both $\Phi_{pr}$ and $\Phi_m$ are built by stacking L = 4 residual blocks, which extract 16, 64, 128, and 256 feature maps, respectively, for a given input. We denote the residual blocks of $\Phi_{pr}$, $\Phi_m$, and $\Phi_x$ as $r^l_{pr}$, $r^l_m$, and $r^l_x$, respectively, where $l \in \{1, 2, \ldots, L\}$ and L denotes the number of blocks. The features at the lth layer of $\Phi_m$, $\Phi_{pr}$, and $\Phi_x$ for $l = 2, 3, \ldots, L$ are obtained as
$$\phi^l_m = r^l_m\left(\phi^{l-1}_m\right), \quad \phi^l_{pr} = r^l_{pr}\left(\left[\phi^{l-1}_{pr}, \phi^{l-1}_m\right]\right), \quad \phi^l_x = r^l_x\left(\phi^{l-1}_x\right)$$
where $[\cdot, \cdot]$ denotes channel-wise concatenation. The first layer outputs are given as
$$\phi^1_m = r^1_m(\tilde{M}), \quad \phi^1_{pr} = r^1_{pr}(P), \quad \phi^1_x = r^1_x(X).$$
Our approach intends to learn quality features from the PAN and LRMS image that could approximate features extracted from the reference. As we intend to learn a final quality score mapping based on the local and global distortions, a slow fusion approach of progressively fusing the PAN and LRMS image features at multiple intermediate layers is followed. Specifically, the residual block outputs $\phi^{l-1}_{pr}$ and $\phi^{l-1}_m$ are concatenated and fed as input to the lth residual block of $\Phi_{pr}$. This ensures rich fusion at multiple layers, and the final fused feature space can have both spatial and spectral information at the local and global level. We denote the fused pseudo reference HRMS features as $F_{pr}$; note that $F_{pr} = \phi^L_{pr}$. We also denote the features extracted from the pansharpened image as $F_{ps}$; mathematically, $F_{ps} = \phi^L_x$.
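A minimal PyTorch sketch of this feature extraction design is given below, assuming the stated layer widths and strides; the padding choices and feeding only P to the first block of $\Phi_{pr}$ are assumptions where the text is not fully explicit.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block of Fig. 8: 3x3 stride-2 Conv1, ReLU, 3x3 stride-1 Conv2,
    with a 1x1 stride-2 Conv3 on the skip path to match channels and resolution."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1)
        self.conv3 = nn.Conv2d(in_ch, out_ch, 1, stride=2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        f = self.conv2(self.relu(self.conv1(x)))
        return self.relu(f + self.conv3(x))

class SlowFusionExtractor(nn.Module):
    """Sketch of Phi_m/Phi_x (shared weights) and Phi_pr with slow fusion (L = 4)."""
    def __init__(self, bands: int = 4, widths=(16, 64, 128, 256)):
        super().__init__()
        m_in = [bands] + list(widths[:-1])
        pr_in = [1] + [2 * w for w in widths[:-1]]  # PAN (1 band), then concatenated features
        self.r_m = nn.ModuleList(ResidualBlock(i, o) for i, o in zip(m_in, widths))
        self.r_pr = nn.ModuleList(ResidualBlock(i, o) for i, o in zip(pr_in, widths))

    def forward(self, pan, m_up, x):
        phi_pr, phi_m, phi_x = pan, m_up, x
        for l, (blk_pr, blk_m) in enumerate(zip(self.r_pr, self.r_m)):
            fused = phi_pr if l == 0 else torch.cat([phi_pr, phi_m], dim=1)
            phi_pr = blk_pr(fused)  # fuses with phi_m from the previous layer
            phi_m = blk_m(phi_m)
            phi_x = blk_m(phi_x)    # weight sharing: Phi_x reuses Phi_m's blocks
        return phi_pr, phi_x        # F_pr and F_ps
```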

C. Quality Score Computation
Let $F_{pr}$ and $F_{ps}$ be represented as 3-D matrices of size $s \times t \times k$, where $s \times t$ denotes the spatial dimension and k denotes the number of channels. Each spatial location (i, j) can then be treated as a k-dimensional vector, represented as $f_{pr}(i, j)$ and $f_{ps}(i, j)$ for the pseudo HRMS reference and pansharpened features, respectively. A quality score map q is then generated at each location (i, j) as
$$q(i, j) = \Psi\left(f_{pr}(i, j), f_{ps}(i, j)\right).$$
Depending on the choice of the reference-based pansharpening QA metric, the function $\Psi(\cdot, \cdot)$ needs to be suitably designed.
In our work, we demonstrate quality score prediction for the two metrics $Q2^n$ and SAM as separate models through
$$\Psi_Q\left(f_{pr}(i, j), f_{ps}(i, j)\right) = \frac{2\left\langle f_{pr}(i, j), f_{ps}(i, j)\right\rangle}{\left\|f_{pr}(i, j)\right\|_2^2 + \left\|f_{ps}(i, j)\right\|_2^2} \qquad (11)$$
and
$$\Psi_S\left(f_{pr}(i, j), f_{ps}(i, j)\right) = \cos^{-1}\left(\frac{\left\langle f_{pr}(i, j), f_{ps}(i, j)\right\rangle}{\left\|f_{pr}(i, j)\right\|_2 \left\|f_{ps}(i, j)\right\|_2}\right), \qquad (12)$$
respectively. Note that $\langle x, y\rangle$ denotes the dot product between vectors x and y, and $\|x\|_2$ denotes the two norm of x. As $Q2^n$ closely resembles the similarity computation between an HRMS and a pansharpened image [4], $\Psi$ is defined through (11). Similarly, SAM determines spectral similarity by calculating the angle between the HRMS and pansharpened image spectra [3]; hence, $\Psi$ is defined as (12). The final prediction for the quality score is obtained by performing global average pooling given by
$$q_{pred} = \frac{1}{st} \sum_{i=1}^{s} \sum_{j=1}^{t} q(i, j). \qquad (13)$$
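A sketch of this score computation in PyTorch, under the forms of (11) and (12) reconstructed above; the features here are assumed to be laid out as (batch, k, s, t).

```python
import torch
import torch.nn.functional as F

def quality_map_q(f_pr, f_ps, eps: float = 1e-8):
    """Per-location similarity of (11) between pseudo-reference and pansharpened features."""
    num = 2.0 * (f_pr * f_ps).sum(dim=1)
    den = (f_pr ** 2).sum(dim=1) + (f_ps ** 2).sum(dim=1)
    return num / (den + eps)

def quality_map_sam(f_pr, f_ps):
    """Per-location spectral angle of (12)."""
    cos = F.cosine_similarity(f_pr, f_ps, dim=1)
    return torch.acos(cos.clamp(-1.0 + 1e-7, 1.0 - 1e-7))

def predict_score(q_map):
    """Global average pooling of (13) over the s x t quality map."""
    return q_map.mean(dim=(-2, -1))
```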

D. Loss Function
Suppose we are given a set of pansharpened images $X_n$ along with their corresponding PAN $P_n$ and LRMS $M_n$ images as input, for $n = 1, 2, \ldots, N$, where N is the total number of training samples. Our DPIQA network with parameters $\omega$ predicts the quality score $q^n_{pred}$ for the nth training example through (13). The goal of training this network is to arrive at a parameter set that minimizes the overall mean-squared error between $q^n_{pred}$ and the ground truth score $q^n_{gt}$ over all the pansharpened images in the training dataset. The overall optimization is given by $\min_{\omega} L(\omega)$, where the loss function L is defined as
$$L(\omega) = \frac{1}{N} \sum_{n=1}^{N} \left(q^n_{pred} - q^n_{gt}\right)^2. \qquad (14)$$

E. Training Details
Recall that we train two separate models to predict the reference-based quality metrics $Q2^n$ and SAM. Due to the limited availability of data from other sensors, we choose the WorldView-3 dataset for training, with a 75:25 train-test split on WorldView-3 for both quality metrics. Further, this setup also enables us to evaluate the robustness and generalizability of the model by testing on different satellites. The models are trained end-to-end over 31 104 WorldView-3 image patches by back-propagation using the adaptive moment estimation (ADAM) [44] optimizer with $\beta_1 = 0.6$ and $\beta_2 = 0.99$ and an initial learning rate of $4 \times 10^{-7}$. We trained the networks for 400 and 540 epochs for the $Q2^n$ and SAM predictions, respectively.
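A hypothetical sketch of this training configuration, reusing the extractor and score functions sketched above (the data loader and patch pipeline are assumptions):

```python
import torch

model = SlowFusionExtractor()  # stand-in for the full DPIQA network of Fig. 6
optimizer = torch.optim.Adam(model.parameters(), lr=4e-7, betas=(0.6, 0.99))
criterion = torch.nn.MSELoss()

for epoch in range(400):  # 400 epochs for Q2^n (540 for SAM)
    for pan, m_up, x, q_gt in train_loader:  # WorldView-3 patches; loader assumed
        f_pr, f_ps = model(pan, m_up, x)
        q_pred = predict_score(quality_map_q(f_pr, f_ps))  # (11) and (13)
        loss = criterion(q_pred, q_gt)  # the MSE loss of (14)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```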

V. EXPERIMENTAL RESULTS
We now discuss different experiments conducted to assess the performance of our proposed model. We first explain the experimental setting and evaluation measures. Section V-B details the benchmarking methods against which the model is compared. This is followed by performance comparisons in Section V-C and a discussion on different ablation studies performed on the proposed model in Section V-D.

A. Experimental Setup
The models trained for $Q2^n$ and SAM score prediction are evaluated on different test datasets to assess the performance across different satellite resolutions and geographical locations. While our models are trained on the WorldView-3 dataset, the performance is demonstrated on image patches from WorldView-2, WorldView-4, Ikonos, and the test set of the WorldView-3 satellite. We include results on the test set of the WorldView-3 satellite to get a sense of the comparison with the other datasets. The prediction performance is compared based on the Spearman rank order correlation coefficient (SRCC), the Pearson linear correlation coefficient (PLCC), and the root-mean-square error (RMSE) between the predicted $q_{pred}$ and ground truth $q_{gt}$ quality scores. We choose these evaluation measures since they are popularly used to evaluate QA for natural images [45].
PLCC is a measure of linear correlation between two sets of scores in terms of the normalized covariance between them. For a given pair of measurements $(x_i, y_i)$, $i = 1, 2, \ldots, N$, PLCC $\rho$ is defined as
$$\rho = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}} \qquad (15)$$
where $\bar{x}$ and $\bar{y}$ denote the sample means. It takes values in the range $-1$ to 1, with a value of 1 indicating perfect positive correlation and a value of 0 indicating uncorrelated pairs of variables.
SRCC is defined as the PLCC between the ranks of the given pairs of scores. Given pairs of raw scores, let $r(x_i)$ and $r(y_i)$ be their respective ranks. The SRCC $r_s$ is then defined as
$$r_s = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N(N^2 - 1)}, \quad d_i = r(x_i) - r(y_i). \qquad (16)$$
A higher value indicates better prediction of a given quality metric with respect to its ground truth. The SRCC measure is particularly useful in understanding how a given quality measure ranks a set of images in terms of their quality and whether the ranking order correlates with the ground truth ranking.
RMSE indicates how different the predicted and ground truth scores are from each other. For a given set of N samples with a ground truth score $y_i$ and a predicted score $x_i$ for each ith sample, the RMSE is calculated as
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2}. \qquad (17)$$
We evaluate these measures on each dataset by taking all the pansharpened images from different algorithms as a single set of N samples. Such an approach is helpful in evaluating quality at the image level. Later, in Section V-C, we also adopt a slightly different approach of evaluating how our DPIQA method is useful in predicting the performance of different pansharpening algorithms.
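For concreteness, the three measures can be computed with NumPy/SciPy as in the following sketch:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate_predictions(pred, gt):
    """Return (SRCC, PLCC, RMSE) between predicted and ground truth quality scores."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    srcc, _ = spearmanr(pred, gt)  # rank correlation of (16)
    plcc, _ = pearsonr(pred, gt)   # linear correlation of (15)
    rmse = float(np.sqrt(np.mean((pred - gt) ** 2)))  # (17)
    return srcc, plcc, rmse
```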

B. Benchmarks for Comparisons
We compare our model with several interesting benchmarks for predicting the $Q2^n$ and SAM scores. We first compare with quality estimation by fitting (QEF) [9], which is a multiscale extrapolation method. Similar to prior work [9], we also compare with QNR and HQNR for $Q2^n$ prediction, since the nature of their predictions bears a lot of similarity to the $Q2^n$ computation. Apart from these, we compare with two sets of approaches for NR pansharpened IQA. In particular, we benchmark the performance of features based on natural scene statistics [46] and the use of a plain vanilla ResNet model [43].
1) Features Based on Statistical Models: While natural scene statistics (NSS) [47], [48] based features have often been used in other contexts for perceptual QA [13], their performance has not been benchmarked for the task of predicting $Q2^n$ or SAM. Here we adapt the NSS-based features described in the blind/referenceless image spatial quality evaluator (BRISQUE) model [46] for QA of natural images. In particular, we extract NSS features from each of the multispectral image channels and concatenate them. The combined features from all the image channels are then regressed against the quality scores using a linear support vector regression (SVR) model [49]. We evaluate two regression models based on this feature extraction: 1) BRISQUE feature regression using only the pansharpened image (BRISQUE-X); 2) BRISQUE feature regression using both the pansharpened and LRMS images (BRISQUE-XM). While BRISQUE-X uses features only from the pansharpened image, BRISQUE-XM uses the features from the pansharpened and LRMS images concatenated with each other. These models are trained on the WorldView-3 train dataset to predict the chosen reference-based quality score on the different test sets.
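A sketch of the BRISQUE-XM pipeline follows; the helper brisque_features is a hypothetical stand-in for the BRISQUE feature extractor of [46], and the patch layout is assumed to be (h, w, bands).

```python
import numpy as np
from sklearn.svm import LinearSVR

def brisque_features(channel: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the BRISQUE feature extractor of [46]."""
    raise NotImplementedError

def brisque_xm_features(x: np.ndarray, m: np.ndarray) -> np.ndarray:
    """Concatenate per-band BRISQUE features from the pansharpened (x) and LRMS (m) images."""
    feats = [brisque_features(x[..., b]) for b in range(x.shape[-1])]
    feats += [brisque_features(m[..., b]) for b in range(m.shape[-1])]
    return np.concatenate(feats)

# Regress the concatenated features against the ground truth quality scores:
# svr = LinearSVR(); svr.fit(X_train, q_train); q_pred = svr.predict(X_test)
```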
2) ResNet-Based Deep Features: Here we train a ResNet model named ResNet-IQA from scratch for the task of predicting the $Q2^n$ or SAM scores. In particular, we deploy a ResNet-21-based architecture by replacing the final class-prediction layer with a global average pooling layer and two dense layers of sizes 512 and 1, which output the quality prediction. The input to the ResNet model consists of four color channels corresponding to the pansharpened image. We train ResNet-IQA on the WorldView-3 train dataset by back-propagation using the ADAM [44] optimizer with $\beta_1 = 0.6$ and $\beta_2 = 0.99$ and an initial learning rate of $4 \times 10^{-7}$ for 300 epochs. Separate ResNet-based models are trained for the $Q2^n$ and SAM quality score predictions, respectively.

Fig. 9. Benchmark scatter plots for $Q2^n$ predicted versus ground truth quality scores.
While computing the PLCC and RMSE for all the above measures, we process the quality measures that are not obtained through a learning method as follows. Since the relationship between QNR/HQNR and the ground truth $Q2^n$ may be nonlinear, both the QNR and HQNR scores are passed through a five-parameter logistic nonlinearity [45] to map them to the $Q2^n$ space before computing PLCC and RMSE.
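A sketch of this mapping, using the standard five-parameter logistic of [45]; the initialization heuristics are assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic5(x, b1, b2, b3, b4, b5):
    """Five-parameter logistic commonly used in IQA performance evaluation [45]."""
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (x - b3)))) + b4 * x + b5

def map_to_q2n_space(raw, gt):
    """Fit the logistic from raw QNR/HQNR scores to the ground truth and map the scores."""
    raw, gt = np.asarray(raw, float), np.asarray(gt, float)
    p0 = [np.ptp(gt), 1.0, np.mean(raw), 0.0, np.mean(gt)]  # heuristic initialization
    popt, _ = curve_fit(logistic5, raw, gt, p0=p0, maxfev=20000)
    return logistic5(raw, *popt)
```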

C. Performance Comparisons
We show the performance comparison results for the proposed DPIQA model and the other models in Tables II and III for $Q2^n$ and SAM, respectively. The best and second best results are boldfaced and underlined, respectively. These tables also include a column for the average performance on the three datasets from satellites that are different from the training dataset. The $Q2^n$ prediction results in Table II indicate that our DPIQA method achieves the best performance on the WorldView-3, WorldView-2, and Ikonos datasets. Its performance is still competitive on the WorldView-4 dataset. The average performance indicates our model as a clear winner in terms of all the metrics.
The analysis of the SAM predictions in Table III indicates that our proposed DPIQA model provides superior performance on almost all the satellite datasets in terms of all three performance measures. Our model predictions have the lowest RMSE among all the benchmarking methods for the WorldView-2 dataset and are the second best in terms of SRCC. A composite analysis of all the datasets from a different testing satellite, interpreted through the average scores, indicates that our DPIQA model achieves the best performance across all datasets. Since QNR and HQNR are primarily designed to approximate $Q2^n$, we do not compare them with the SAM scores.
The above results are further evident from the scatter plots of predicted versus ground truth $Q2^n$ and SAM values on the WorldView-2 test dataset in Figs. 9 and 10, respectively. The performance ranks observed in Tables II and III for the WorldView-2 dataset match the scatter plots shown for the benchmarking methods.
The drastic drop in performance of the BRISQUE-based features on the WorldView-3, WorldView-4, and Ikonos datasets is probably because these datasets include patches of water, dense vegetation, and sometimes a mix of both along with building structures. Scenes consisting purely of water or dense vegetation may not follow the statistical characteristics of natural images that BRISQUE-based NR IQA methods rely on.
Figs. 11 and 12 show a visual comparison of the predicted $Q2^n$ and SAM values for a sample image patch from the WorldView-2 dataset for the Brovey transform, PCA, and EXP based pansharpening techniques, with varied levels of ground truth $Q2^n$ and SAM quality scores. The performance ranks seen visually for the sample image patch under the respective pansharpening algorithms match the predicted $Q2^n$ and SAM score trends.
All the above results demonstrate the robustness of the proposed DPIQA model on different satellite datasets. Further, a single framework that we design can reliably predict different reference-based quality scores through minor modifications in the output computation.
To verify the ability of the quality prediction methods to compare different pansharpening algorithms, we perform the following experiment, similar to [10]. For each QA method, we compute the average predicted quality score for each pansharpening algorithm. In other words, we compute the predicted quality scores for all images pansharpened by a given algorithm and average them. We then compute the RMSE between the average predicted and average ground truth quality scores across the different pansharpening algorithms. Tables IV and V report this RMSE performance for the different methods, where the best and second best results are boldfaced and underlined, respectively. Since we are interested in relative comparisons on a given dataset, we ensure that the predicted scores on different datasets are globally aligned with the ground truth scores while computing the RMSE.
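This algorithm-level evaluation can be sketched as follows; the global alignment step mentioned above is left out of this sketch, as its exact form is not described here.

```python
import numpy as np

def algorithm_level_rmse(pred, gt, algo_ids):
    """RMSE between per-algorithm averages of predicted and ground truth scores.
    pred, gt: (N,) image-level scores; algo_ids: (N,) pansharpening-algorithm labels."""
    pred, gt, algo_ids = map(np.asarray, (pred, gt, algo_ids))
    algos = np.unique(algo_ids)
    avg_pred = np.array([pred[algo_ids == a].mean() for a in algos])
    avg_gt = np.array([gt[algo_ids == a].mean() for a in algos])
    return float(np.sqrt(np.mean((avg_pred - avg_gt) ** 2)))
```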

D. Ablation Studies
We now conduct a series of ablation experiments to evaluate the contribution of each component of the proposed DPIQA network. The evaluation setup is the same as before, where the proposed DPIQA architecture is trained on the WorldView-3 train dataset and evaluated over different satellites. We use the image-level prediction setup for these studies. The different ablation models are trained under the same parameter settings as the original DPIQA and for 500 training epochs. The ablation results are shown in Tables VI and VII for the $Q2^n$ and SAM metrics, respectively.
Influence of PAN Image: In this experiment, we study the need for the PAN image and its features in pansharpened IQA. We conduct this experiment by excluding the PAN image from the training setup; as a consequence, the fusion branch $\Phi_{pr}$ is no longer required. The modified model, termed DPIQA without panchromatic image features (DPIQA-WP), thus treats the LRMS image deep features as the pseudo HRMS deep feature reference. Since the $Q2^n$ metric incorporates both spatial and spectral distortions, we expect the absence of the PAN image during training to degrade prediction performance. The ablation results in Table VI reveal this trend, with a drop in performance observed for the $Q2^n$ prediction task. As the SAM metric is more sensitive to spectral distortions [6], the results reveal no significant drop in SAM prediction performance, as seen in Table VII.
Influence of LRMS Image: Here we explore the need for the LRMS image by excluding the LRMS image feature extractor branch $\Phi_m$ from the setup. As the spectral distortion component is essential to both $Q2^n$ and SAM, a drastic drop in prediction performance is seen for these measures in the modified model, termed DPIQA without LRMS image features (DPIQA-WLRMS), as seen in Tables VI and VII.

Influence of Weight Sharing: In this experiment, the feature extractor of the LRMS image $\Phi_m$ does not share weights with the feature extractor of the pansharpened image $\Phi_x$. We refer to this model as DPIQA with no weight sharing (DPIQA-NWS). The input to $\Phi_m$, however, is still the bicubically interpolated upsampled LRMS image. The prediction performance under this setting is similar to that obtained under the original DPIQA architecture. However, we see a small drop in SRCC values for the Ikonos $Q2^n$ prediction task and an increase in RMSE values for the WorldView-4 SAM prediction task. These results demonstrate that we do not suffer much by sharing the weights between $\Phi_m$ and $\Phi_x$, while such weight sharing allows us to store fewer parameters, thereby reducing the size of the model.

Slow Fusion versus Late Fusion: In this experiment, the mixing of the LRMS image features with the PAN image features is delayed until the last layer of the feature extraction. The pseudo HRMS features $F_{pr}$ are obtained by concatenating the outputs of the last residual blocks $r^L_{pr}$ and $r^L_m$, followed by two 3 × 3 convolutional layers with the same number of filters as used in the last residual blocks of $r_{pr}$ and $r_m$. This model, termed DPIQA with late fusion (DPIQA-LF), performs similarly to the original DPIQA model and even better for the WorldView-4 SAM predictions. However, we see a drop in $Q2^n$ prediction performance for the WorldView-4 and Ikonos datasets compared to the original DPIQA model. Moreover, since the number of channels is larger in deeper layers, concatenation at this point increases the size of the model: this design uses 4 166 144 model parameters, whereas the original DPIQA uses only 2 816 000 parameters.

Role of Effective Contrast Stretching of Dataset: In this experiment, we explore the impact of image contrast on the prediction performance of the proposed DPIQA model. Specifically, only linear contrast stretching is performed on all the satellite datasets, instead of the nonlinear contrast stretching described in Section III-B. In particular, before the extraction of the P and M patches, the original PAN and MS images are normalized to have pixel values within the range [0, 1] based on the minimum and maximum values of the scenes from which the patches are obtained. The training and cross-satellite evaluation datasets are then prepared in the same manner as explained in Section III-B. The original DPIQA model is then trained on this linearly contrast stretched dataset under the same parameter settings and training steps as the original DPIQA; we refer to this model as DPIQA trained on data with linear contrast stretching (DPIQA-LSE). The model trained on the linearly contrast stretched dataset performs well for the SAM predictions. However, we see a drop in performance for the WorldView-4 and Ikonos $Q2^n$ predictions.

VI. CONCLUSIONS AND FUTURE WORK
In this article, we presented a novel deep learning framework to predict the state-of-the-art reference-based measures $Q2^n$ and SAM without the need for a reference. We achieve this by computing the similarity of deep features fused from the PAN and input LRMS images with similar features extracted from the given pansharpened image. As part of the learning process, a large corpus of pansharpened images was built containing different thematic scenes from four different satellites, obtained by applying six different pansharpening techniques. Our detailed experiments on this large corpus of images from the WorldView-2, WorldView-3, WorldView-4, and Ikonos satellites show the superior performance of our deep learning based NR pansharpened IQA model in terms of correlating well with the $Q2^n$ and SAM measures. The robustness and generalization ability of our DPIQA model is further evident from the fact that it is trained on just one of the satellites, yet it shows very good performance on different satellites with different resolutions.
Based on the challenging cases evident from Section V-C, an important future goal is to further improve the prediction accuracy for image patches consisting of mixed thematic scenes of water, vegetation, and buildings, which were observed in the WorldView-4 and Ikonos cases. For this purpose, a design choice incorporating multiscale score predictions, as in QEF, may be explored. The custom generated pansharpened image database is currently limited to the spatial and spectral distortions observed while applying different pansharpening algorithms. It can be expanded by including pansharpened images that suffer from misregistration of spatial features due to registration errors between the input LRMS image channels and coregistration distortion between the input PAN and LRMS images [50]. The same DPIQA framework, with minor modifications in the overall network design, can also be utilized to predict error-based reference metrics such as ERGAS.