UPSNet: Unsupervised Pan-Sharpening Network With Registration Learning Between Panchromatic and Multi-Spectral Images

Recent advances in deep learning have shown impressive performances for pan-sharpening. Pan-sharpening is the task of enhancing the spatial resolution of a multi-spectral (MS) image by exploiting the high-frequency information of its corresponding panchromatic (PAN) image. Many deep-learning-based pan-sharpening methods have been developed recently, surpassing the performances of traditional pan-sharpening approaches. However, most of them are trained in lower scales using misaligned PAN-MS training pairs, which has led to undesired artifacts and unsatisfying visual quality. In this paper, we propose an unsupervised learning framework with registration learning for pan-sharpening, called UPSNet. UPSNet can be effectively trained in the original scales, and implicitly learns the registration between PAN and MS images without any dedicatedly designed registration module involved. Additionally, we design two novel loss functions for training UPSNet: a guided-filter-based color loss between network outputs and aligned MS targets; and a dual-gradient detail loss between network outputs and PAN inputs. Extensive experimental results show that our UPSNet can generate pan-sharpened images with remarkable improvements in terms of visual quality and registration, compared to the state-of-the-art methods.


I. INTRODUCTION
With the advent of deep learning, many deep-learning-based methods have been proposed to solve various image restoration problems, e.g., super-resolution [7], [18], [20], [22], [35], showing state-of-the-art performance in terms of reconstruction quality. Likewise, deep learning has recently seen growing use in satellite imagery research. Satellite imagery contains various scenes from around the world. Its research areas include prediction of forest growth, classification of crops, buildings and roads, environmental monitoring, and many other applications. To achieve high performance on such problems, it is essential to obtain high-quality, high-resolution satellite image datasets. However, due to the constraints of intrinsic satellite sensor resolutions and transmission bandwidths, most satellites acquire multi-spectral images with varying resolutions for the same geographical regions. In general, satellite images comprise pairs of low-resolution (LR) multi-spectral (MS) images of a larger ground sample distance (GSD) and high-resolution (HR) panchromatic (PAN) images of a smaller GSD. Pan-sharpening or pan-colorization is the task of generating pan-sharpened (PS) multi-spectral images which have the same spatial resolution as the PAN images, by fusing the high-frequency details from the PAN images and the color information from the MS images. Fig. 1 shows an example PAN-MS pair and the PS results from various pan-sharpening approaches, including the proposed method.
The associate editor coordinating the review of this manuscript and approving it for publication was Xiaohui Yuan.
[33], [38], [43], [44]. These methods are based on supervised learning (Fig. 2-a), which often requires a degradation model to prepare a training dataset of PAN-MS pairs. For this, the original PAN-MS pairs are degraded (down-scaled) to LR PAN-MS pairs which are then used as inputs to the networks, and the original MS images are used as pseudo ground truth for training. In doing so, the networks are trained to output down-scaled PS images at the input MS scale in such a lower scale scenario. Therefore, these networks perform poorly when tested under the original scale scenario, where they must yield PS images at the input PAN scale. To overcome the scale (resolution) mismatch between training and testing, we propose an effective unsupervised learning framework for pan-sharpening, where a ground truth is not required for training. This enables the network to be trained and tested on the same scales, resulting in better visual quality.
Since ground truth data are not available in pan-sharpening, conventional supervised PS methods have had to rely on the lower scale scenario. These methods optimize their PS outputs with a mean absolute error (MAE) or mean squared error (MSE) loss using pseudo ground truth MS images. In our unsupervised PS (Fig. 2-b), where no ground truth image is required, we design two novel loss functions so that our UPSNet can effectively learn the high-frequency details from PAN inputs and color information from MS inputs in the original scale scenario without any pseudo ground truth: one is a dual-gradient detail loss between network outputs and PAN inputs; and the other is a guided-filter-based color loss between network outputs and our aligned MS targets.
One of the main difficulties of the pan-sharpening task is the misalignment between PAN and MS image pairs. PAN and MS images are often misaligned by several pixels due to inherent limitations in satellite sensor arrays and acquisition time differences. A misaligned dataset used for training often entails undesired artifacts in pan-sharpened results such as double edge and color spread artifacts. To remedy this problem, we incorporate a preprocessing step, used only during training, where each MS image is registered to its corresponding PAN image in the sense of correlation maximization. The aligned MS images are not used as inputs to the network but are used as targets for the color loss. By doing so, our UPSNet can learn to implicitly match the high-frequency information from PAN inputs and color information from misaligned MS inputs during training, without any dedicatedly designed registration module. The trained UPSNet can then properly handle misaligned PAN-MS input pairs during testing. As shown in Fig. 1, the output image from UPSNet shows that structures and colors of the objects are better aligned compared to the other five methods. We can also observe that the pan-sharpened image produced by UPSNet has the color most similar to that of the input MS image while preserving the strong edges from the corresponding PAN image.
Furthermore, we found that a patch-based normalization can effectively deal with non-stationary PAN and MS input images, whose pixel intensity distributions vary with geographical features and often lead to color distortion in the pan-sharpened results. Similar to batch normalization [13], this reduces the internal covariate shift and enables faster and more stable training of the network, which could possibly result in higher performance. Besides, applying local normalization helps maintain the color information of the MS input. This allows the network trained on the images acquired by a specific satellite to generalize well to unseen images of other satellites.
201200 VOLUME 8, 2020
Our contributions can be summarized as follows:
• We propose a novel unsupervised learning framework for pan-sharpening where our proposed UPSNet can achieve state-of-the-art performance for most metrics and shows significantly better visual quality when tested on the original scale.
• Two novel loss functions for pan-sharpening are proposed, which effectively fuse the high-frequency details from PAN images and color information from MS images: a dual-gradient detail loss and a guided-filter-based color loss. The dual-gradient detail loss can appropriately handle different characteristics of PAN and MS image signals, so that UPSNet can effectively learn the details of PAN images. The guided-filter-based color loss allows UPSNet to effectively learn the color information from aligned and upscaled target MS images.
• With a preprocessing step of correlation-based alignment between PAN and MS images only for training, UPSNet can be trained to implicitly handle the inherent misalignment between PAN and MS input images without the preprocessing step in testing.
• We propose a simple yet very effective patch-based normalization technique that boosts the generalization capability of our UPSNet for PAN-MS images of various satellites.

II. RELATED WORKS

A. TRADITIONAL PAN-SHARPENING METHODS
Before the advent of deep-learning, pan-sharpening algorithms were based on component substitution, multiresolution analysis, and model learning. Component substitution methods [5], [9], [17], [34], [42] apply spectral transformations on an interpolated MS input, and its spatial channel is replaced with a modified PAN. Multiresolution analysis based methods [27], [36] fuse the high-frequency details of PAN images into up-sampled MS input images. To decompose such high-frequency components, wavelet or undecimated decomposition techniques are utilized. Then these decomposed components are incorporated into interpolated MS input images to form pan-sharpened images. These methods have relatively low computational complexity but tend to produce the resulting images with mismatched spectral information and artifacts because they do not consider local properties of MS and PAN images. Model learning-based methods [11], [29], [31] learn pan-sharpening models by using regularization terms. In these methods, pan-sharpening is defined as an ill-posed problem, where a certain model is optimized to generate an output image so that a similarity metric between the output and target pan-sharpened image is maximized. These methods tend to produce pan-sharpened images with better quality having well-preserved spectral information, but require high computational complexity compared to the previously mentioned methods.

B. DEEP-LEARNING-BASED PAN-SHARPENING METHODS
Recent pan-sharpening methods incorporate various types of CNN structures. Pan-sharpening CNN (PNN) [28] is known to be the first CNN-based pan-sharpening method, showing competitive performance compared to conventional methods. The PNN adopted a shallow 3-layered network structure from SRCNN [7], which is the first super-resolution method to use CNN. Inspired by the success of ResNet [10] in classification, Yang et al. [43] proposed PanNet that has adopted the ResNet structure as their backbone network, where residual connection enables the network to focus on preserving the high-frequency details. PanNet applies high-pass filtering to MS and PAN inputs, and their edge components are used as network inputs. This enables better network generalization, being robust for unseen satellite datasets. By adopting the network architecture of the state-of-the-art SR network, EDSR [22], Lanaras et al. [19] proposed a deep network (DSen2) and a deeper network (VDSen2) for super-resolution of the Sentinel-2 satellite images. DSen2 and VDSen2 are not exactly pan-sharpening methods since they super-resolve the images in 9 lower-resolution bands using the images in 4 higher-resolution bands as guidance. PAN images are not included in the Sentinel-2 dataset. PanNet and DSen2 show top performance in various quantitative metrics, producing PS images with high visual quality. Zhang et al. proposed a bidirectional pyramid network [45] that processes the MS and PAN images in two separate branches, which allows the spatial detail features from the PAN branch to be fused into the spectral information features of the MS branch, finally generating the output pan-sharpened images. This type of feature fusion has improved the preservation of high-frequency spatial information from PAN images.
Recently, Choi et al. proposed an S3 [6] loss, which considers the correlation between PAN and MS images. The S3 loss is devised to be applied adaptively to areas according to the correlation values between MS and PAN images, thus reducing the ghosting artifacts around moving objects such as cars on the roads. Although the aforementioned deep-learning-based methods have greatly enhanced the performance and visual quality over the traditional methods, they share the limitation of being trained in lower scales in a supervised manner, resulting in suboptimal PS outputs.
Recently, a few attempts have been made to tackle the drawbacks that come from supervised learning with pseudo ground truth. Ma et al. [26] proposed an unsupervised scheme based on a generative adversarial network with spatial and spectral discriminators. PercepPan [46] adopted an autoencoder architecture into their unsupervised PS network design, and utilized a perceptual loss to improve visual quality. Qu et al. incorporated a self-attention mechanism [32] that estimates spatially varying detail extraction and injection functions. Luo et al. also proposed an unsupervised pan-sharpening method [25] with an iterative fusion network. Although these unsupervised PS methods resolved the drawbacks of training in lower scales, none of them considered the inherent misalignment between MS and PAN inputs.

III. PROPOSED METHOD
As mentioned above, pan-sharpening (PS) is defined as the task of obtaining high-quality PS images using high-resolution (HR) PAN images and their corresponding low-resolution (LR) MS images. The resulting PS images should match the high-frequency detail information of the PAN images and the color information of the MS images as closely as possible.
To avoid the drawbacks that come from training PS networks using pseudo ground truth images, our UPSNet learns pan-sharpening in the original scale scenario, as shown in Fig. 2(b). Another root cause of the inferior visual quality of previous pan-sharpening methods is a misaligned PAN-MS input pair. To allow UPSNet to implicitly handle the misalignment between PAN and MS images, which we call "registration learning", a data preparation step is introduced with a correlation-based alignment between PAN and MS images, which is only used during training. To effectively train our UPSNet, we present two different types of loss functions, which allow the network to learn spatial information from PAN inputs and spectral information from MS inputs to produce high-quality PS images. Note that UPSNet is trained in the original scales of PAN and MS images, where testing also takes place. In order to handle the diverse characteristics of PAN and MS images taken from different satellites, we propose a simple but very effective patch-based normalization technique that gives UPSNet a generalization capability for PAN-MS images from various satellites. The loss functions, registration method, and normalization are explained in detail in the following subsections.

A. FORMULATIONS
In general, satellite imagery datasets include PAN images of higher resolution (smaller GSD), denoted as P_0, and the corresponding MS images of lower resolution (larger GSD), denoted as M_1. The subscript number denotes a level of resolution, where a smaller number indicates a higher resolution. Our final goal in pan-sharpening is to utilize P_0 and M_1 to generate a high-quality pan-sharpened image S_0 which has the same resolution as P_0 and similar spectral information to M_1. This requires a pan-sharpening model g which takes P_0 and M_1 as inputs and yields a pan-sharpened image S_0 as an output. In conventional CNN-based pan-sharpening based on supervised learning, the models are trained using P_0 and M_1 as targets and their down-scaled versions P_1 and M_2 as inputs, so their training is done in a lower scale scenario.
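To make the notation concrete, the following is a minimal sketch of how a lower-scale training sample (P_1, M_2) with pseudo ground truth M_1 could be prepared. The degradation model is not specified in this section, so plain average pooling is used here purely as an assumption, and the helper names `downscale` and `make_lower_scale_sample` are hypothetical.

```python
import numpy as np

def downscale(img, factor=4):
    """Average-pool an (H, W) or (H, W, C) image by an integer factor.

    Average pooling is an assumed stand-in for the degradation model.
    """
    h, w = img.shape[0] // factor, img.shape[1] // factor
    img = img[: h * factor, : w * factor]
    return img.reshape(h, factor, w, factor, *img.shape[2:]).mean(axis=(1, 3))

def make_lower_scale_sample(p0, m1, factor=4):
    """Return ((P_1, M_2), M_1): degraded inputs and the pseudo ground truth."""
    return (downscale(p0, factor), downscale(m1, factor)), m1

p0 = np.random.rand(256, 256)     # PAN at the original scale (level 0)
m1 = np.random.rand(64, 64, 3)    # MS at 1/4 of the PAN resolution (level 1)
(p1, m2), target = make_lower_scale_sample(p0, m1)
print(p1.shape, m2.shape, target.shape)
```

In the unsupervised setting described in this paper, no such degradation is needed: P_0 and M_1 are fed to the model directly.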

B. UNSUPERVISED LEARNING FRAMEWORK FOR PAN-SHARPENING
One of the main limitations of the previous CNN-based pan-sharpening methods is that the PAN-MS pairs are downscaled to enable supervised learning. These networks are only trained in the lower scale scenario, so they perform poorly when tested in the original scale scenario, which is the realistic case. Since the misalignment between MS and PAN images is more severe in their original scales, the networks trained in such a lower scale scenario are not able to appropriately handle PAN and MS input images with larger misalignment.
On the contrary, the proposed unsupervised learning framework can overcome this problem, as our network is trained and tested under the same original scale scenario. The conceptual difference between conventional methods and the proposed framework is depicted in Fig. 2.
Unlike the conventional methods in Fig. 2-(a) for pan-sharpening that are trained under a lower-scale scenario, UPSNet is trained and tested under the same original scale as depicted in Fig. 2-(b). For the training, unlike the lower-scale scenario, the original PAN images are used as targets for a detail loss, and the aligned MS images of the same scale as PAN images are used as targets for a color loss. By doing so, our UPSNet can be trained in the original scale scenario. Here, one of the main points is how to obtain the aligned MS images of the same scale as the PAN and PS images. This will be detailed in the following subsections.

C. REGISTRATION
The conventional pan-sharpening methods that were trained with L1 or L2 loss functions on misaligned datasets tend to produce PS images of inferior visual quality, including double edge and spread color artifacts. To remedy this, it is necessary to use aligned datasets for the training of pan-sharpening networks. For the alignment between PAN and MS images, we propose a novel correlation-based PAN-MS registration on the PAN scale, which is done off-line. The resulting MS images have the same size as the PAN images and are aligned to them. It should be noted that the aligned MS images are used as targets in the color loss function during training, not as inputs to the network. In doing so, UPSNet internally learns the registration for the misaligned PAN-MS input pairs. That is, the aligned MS image is not required during the test. Fig. 3 shows the off-line alignment steps. For a given pair of an original PAN image P_0 and a grayed MS image M_1^g, a PAN-sized aligned MS image M_0 is constructed via a correlation-based searching process. For each pixel location of the PAN image, an optimal multi-channel (e.g., RGB) pixel value in the MS image is selected and placed in the corresponding pixel location of the aligned MS image of the PAN scale. The optimal pixel is the center pixel of the searching window that yields the highest correlation value between the PAN and grayed MS images, found by sliding the window over the grayed MS image within a search region. When the searching window size for the grayed MS image is M × M, the corresponding window size for the PAN image is set to dM × dM, where d is a dilation equivalent to the scale difference between PAN and MS images.
The details of the searching process are as follows: First, we obtain a grayed MS image (Fig. 3-(b)), to which a searching window of size 27 × 27 with dilation 1 is applied. The searching window slides in a pixel-wise manner with stride 1 within a pre-defined search region of size 7 × 7. The corresponding window applied to the PAN image is of size 27 × 27 with dilation 4, due to the 4 times resolution difference between PAN and MS images. The goal of this registration is to assign to every PAN pixel location its best matching MS pixel, so that we can get MS images that are well aligned to their corresponding PAN images. Therefore, for a current pixel location of the PAN image, we search for the best matching patch with the highest correlation value in a search region of the grayed MS image. The 49 correlation values are calculated by sliding the searching window of size 27 × 27 with stride 1 in the 7 × 7 pixel grid (search region) centered at the current pixel location. When the best match is found, the MS pixel corresponding to the center pixel of the searching window is placed in the corresponding pixel location of the aligned MS image of the PAN scale. The searching process is repeated for all pixel positions of the PAN images. The aligned MS images of the PAN scale are then used as MS targets for the color loss during training.
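The searching process described above can be sketched in NumPy as follows. This is an illustrative, unoptimized sketch and not the authors' implementation: the correlation measure is assumed to be the Pearson correlation coefficient, the grayed MS image is assumed to be the channel mean, borders are handled by reflect padding, and a PAN pixel is mapped to the MS grid by integer division. The function names `pearson` and `align_ms_to_pan` are hypothetical.

```python
import numpy as np

def pearson(a, b, eps=1e-8):
    """Pearson correlation between two equally sized patches (assumed measure)."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.sqrt((a * a).sum() * (b * b).sum()) + eps))

def align_ms_to_pan(pan, ms, scale=4, win=27, search=7):
    """Build a PAN-sized aligned MS image by correlation maximization.

    pan: (H*scale, W*scale) array, ms: (H, W, C) array.
    """
    r, s = win // 2, search // 2
    gray = ms.mean(axis=2)                          # grayed MS (assumption)
    pad_ms = r + s
    gray_p = np.pad(gray, pad_ms, mode="reflect")
    pan_p = np.pad(pan, r * scale, mode="reflect")
    span = 2 * r * scale + 1                        # extent of the dilated PAN window
    height, width = pan.shape
    aligned = np.zeros((height, width, ms.shape[2]), dtype=ms.dtype)
    for y in range(height):
        for x in range(width):
            # win x win PAN window centered at (y, x), sampled with dilation `scale`
            pan_win = pan_p[y:y + span:scale, x:x + span:scale]
            i, j = y // scale, x // scale           # MS location under this PAN pixel
            best, best_off = -np.inf, (0, 0)
            for dy in range(-s, s + 1):
                for dx in range(-s, s + 1):
                    ci, cj = i + dy + pad_ms, j + dx + pad_ms
                    ms_win = gray_p[ci - r:ci + r + 1, cj - r:cj + r + 1]
                    corr = pearson(pan_win, ms_win)
                    if corr > best:
                        best, best_off = corr, (dy, dx)
            ti = min(max(i + best_off[0], 0), ms.shape[0] - 1)
            tj = min(max(j + best_off[1], 0), ms.shape[1] - 1)
            aligned[y, x] = ms[ti, tj]              # center pixel of the best window
    return aligned
```

With the default arguments, the function uses the 27 × 27 window, dilation 4, and 7 × 7 search region stated above; smaller values are convenient for quick experiments because the loops are written for clarity rather than speed.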
The above correlation-maximization-based registration involves two hyper-parameters, the searching window size (27 × 27) and the search range size (7 × 7), which are set empirically through extensive experiments. The searching window should be large enough to capture a sufficient amount of local structure for correlation calculation, but a larger window comes at the expense of computational complexity. A window that is too small ignores neighboring pixel correlation, and one that is too large may ignore some misaligned pixels in correlation matching because the amount of misalignment becomes relatively small. A larger search range is beneficial in handling larger misalignment, but also comes at the expense of computational complexity. The search range size of 7 × 7 is large enough to handle the inherent misalignment between PAN and MS images in our experiments because it can cope with up to 3-pixel misalignment in the MS scale, which corresponds to a maximum of 12-pixel misalignment in the PAN scale.
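The numbers quoted above follow directly from the two hyper-parameters and the scale ratio. The short calculation below reproduces them; the variable names are chosen here for illustration only.

```python
scale = 4      # PAN/MS resolution ratio
window = 27    # searching window size (taps per side)
search = 7     # search region size in the MS scale

max_shift_ms = (search - 1) // 2          # largest offset from the region center
max_shift_pan = max_shift_ms * scale      # same offset expressed in PAN pixels
num_candidates = search * search          # correlation values per PAN pixel
pan_extent = (window - 1) * scale + 1     # PAN pixels spanned by the dilated window

print(max_shift_ms, max_shift_pan, num_candidates, pan_extent)
```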
It is worthwhile to mention some other alignment options to perform alignment in the MS scale. In this case, the computation of color loss can have two possible options for matching the scale (resolution), where PS images and MS images have different resolutions. The first option is to downscale the PS images to the MS scale by applying a degradation model, which causes the resulting trained PS networks to yield PS outputs with checkerboard artifacts. The second option is to upscale the aligned MS images (aligned in the MS scale) to have the same resolution as PS images. However, this causes a new misalignment due to the upscaling process, thus leading to the degraded quality of the PS output. The experimental results for these options are provided in Sec. IV-C2.

D. LOSS FUNCTIONS
Previous deep-learning-based methods in supervised learning have applied a degradation model to the input images P_0 and M_1, which yields P_1 and M_2. Then, the network output S_1 is compared to the pseudo ground truth MS images M_1 by using a simple L1 or L2 loss between them. On the other hand, to train the network in an unsupervised manner, we propose two different types of loss functions: first, a detail loss that enforces the network output to have similar details (high-frequency information) to the PAN images P_0; second, a color loss that helps the network match the spectral information of the network output S_0 and the aligned PAN-resolution MS image M_0. The proposed loss functions are explained in detail in the following.

1) DETAIL LOSS
We now define a detail loss that minimizes spatial distortions between network outputs S_0 and PAN inputs P_0. We first obtain grayed PS outputs S_0^g. In general, a vanilla detail loss, which is a simplified version of the spatial loss [6], can be defined between the gradient maps d(S_0^g) and d(P_0), where d(·) is a gradient extractor using horizontal and vertical difference operators (e.g., [1, -1]).
One of the difficulties in pan-sharpening tasks is the inherent difference in image signal characteristics between the PAN and MS images. PAN images generally cover a wide range of wavelengths by merging a broad spectrum of visible light into a single-channel image. Therefore, luminance values in MS images differ considerably from those in the PAN images. For example, certain objects that appear bright in an MS image (e.g., water) can appear dark in the corresponding PAN image or vice versa (e.g., trees, grass). When we consider the three bands (R, G, B) in MS images separately, the luminance difference between each of the bands and the PAN images is even larger than when comparing with the grayscale versions of the MS images.
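As a rough illustration of the vanilla detail loss, the sketch below compares the [1, -1] difference maps of the grayed PS output and the PAN input. The exact norm and graying used in the paper are not recoverable from this section, so a mean absolute difference and a channel-mean graying are assumed; `gradients` and `vanilla_detail_loss` are hypothetical names.

```python
import numpy as np

def gradients(img):
    """Horizontal and vertical first differences, i.e. a [1, -1] operator."""
    dx = img[:, 1:] - img[:, :-1]
    dy = img[1:, :] - img[:-1, :]
    return dx, dy

def vanilla_detail_loss(ps_rgb, pan):
    """Mean absolute difference between d(S_0^g) and d(P_0) (assumed L1 form)."""
    gray = ps_rgb.mean(axis=2)          # grayed PS output (assumed graying)
    loss = 0.0
    for g_s, g_p in zip(gradients(gray), gradients(pan)):
        loss += np.abs(g_s - g_p).mean()
    return loss

pan = np.random.rand(32, 32)
same = np.stack([pan, pan, pan], axis=2)       # PS whose gray version equals PAN
inverted = -same                               # PS with reversed gradients
print(vanilla_detail_loss(same, pan), vanilla_detail_loss(inverted, pan))
```

The second call shows why this loss alone is problematic: a PS output whose edges are equally sharp but of opposite polarity is penalized heavily.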
This inherent luminance difference between PAN and MS images generates not only dissimilar luminance values but also opposite directions of intensity gradients between them, which hinders deep-learning networks from properly learning the task of pan-sharpening. To solve this, we propose a novel loss function, called a dual-gradient detail loss, which is specially designed to handle such opposite gradient directions. This loss is utilized, together with the vanilla detail loss, to enforce the PS outputs to have edge details similar to those of the PAN images. Our dual-gradient detail loss is defined over the gradient map of the PAN input d(P_0), its reversed gradient map −d(P_0), and the gradient maps of S_0^R, S_0^G and S_0^B, which are the R, G, B channels of the network output PS image, respectively. The gradient map of the output PS image is compared to both the gradient map and the reversed gradient map of the PAN image. Then, the smaller gradient differences (in absolute value) are chosen to be included in the loss computation. The proposed dual-gradient detail loss enables the network to handle the opposite directions of gradients which frequently occur between PAN and each channel of an MS image. The loss then enforces the PS output to have edge details similar to those of PAN, while preserving the gradient directions of the color channels. This prevents the double edge artifact which happens due to the gradient direction mismatch, resulting in better visual quality.
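The selection of the smaller gradient difference can be sketched as below. This is a minimal sketch under assumptions: the differences are averaged per channel (the normalization in the paper is not recoverable from this section), and the names `gradients` and `dual_gradient_detail_loss` are hypothetical.

```python
import numpy as np

def gradients(img):
    """Horizontal and vertical first differences, i.e. a [1, -1] operator."""
    return img[:, 1:] - img[:, :-1], img[1:, :] - img[:-1, :]

def dual_gradient_detail_loss(ps_rgb, pan):
    """Per channel, keep the smaller of |d(S) - d(P)| and |d(S) - (-d(P))|."""
    loss = 0.0
    for c in range(ps_rgb.shape[2]):
        for g_s, g_p in zip(gradients(ps_rgb[..., c]), gradients(pan)):
            diff_same = np.abs(g_s - g_p)      # same gradient direction as PAN
            diff_flip = np.abs(g_s + g_p)      # reversed gradient direction
            loss += np.minimum(diff_same, diff_flip).mean()
    return loss / ps_rgb.shape[2]

pan = np.random.rand(32, 32)
mixed = np.stack([pan, -pan, pan], axis=2)     # one channel has flipped edges
flat = np.zeros((32, 32, 3))                   # no edges at all
print(dual_gradient_detail_loss(mixed, pan), dual_gradient_detail_loss(flat, pan))
```

A channel whose edges match the PAN edges up to polarity incurs no penalty, whereas an output without edges is still penalized.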

2) COLOR LOSS
In addition to the two detail loss functions, we propose a guided-filter-based color loss to impose color similarity between the MS input and the network PS output. Here we utilize the previously aligned PAN-resolution MS images M_0 as color targets to avoid any artifact that comes from the misalignment between P_0 and M_1. The previous deep-learning-based methods in supervised learning have used an L1 or L2 loss between the network PS output S_1 and the pseudo ground truth MS image M_1, under the assumption that those two have similar high-frequency details and colors.
However, in our unsupervised learning setting (original scale scenario), there exists no ground truth, and the network output S_0 is supposed to have high-frequency details learned from the PAN image P_0, which are not present in the input MS image M_1. To ensure that the network produces a PS output S_0 having colors similar to those of the aligned PAN-resolution MS image M_0 without losing the high-frequency information, we first apply a guided filter to the network output S_0 using the previously aligned MS target M_0 as guidance. Then the resulting guided-filtered PS output GF(S_0) is compared with the aligned MS target M_0 using an L1 loss. Without the guided-filtering step, this becomes a direct comparison between the network output S_0 and the aligned MS image M_0, which would result in a substantial loss of the high-frequency details that are learned from the PAN image P_0. Our guided-filter-based color loss is defined in terms of GF(S_0), a guided filtering operation on the network PS output S_0 with guidance M_0, and b(·), a Gaussian blurring operation with a filter size of 3 and σ = 2/3. These values are set empirically to apply a mild blur, as a strong blur often leads to a loss of detail information. The Gaussian blur is applied to reduce the pixel blocking artifact introduced during the alignment and upscaling operation described in Sec. III-C. The proposed guided-filter-based color loss enforces the PS output to have a color similar to that of the MS image, while avoiding the checkerboard artifacts that may come from a down-sampling operation and the loss of high-frequency details due to a direct comparison between PS and MS images.
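A possible realization of the color loss is sketched below. Several choices are assumptions rather than details taken from the paper: the guided filter is the standard box-filter formulation applied channel by channel with the matching channel of M_0 as guidance, its radius `r` and regularizer `eps` are placeholder values, the Gaussian blur b(·) is applied to the aligned MS target, and the result is averaged as an L1 loss. All function names are hypothetical.

```python
import numpy as np

def box_mean(x, r):
    """Mean over a (2r+1) x (2r+1) window with edge padding."""
    h, w = x.shape
    p = np.pad(x, r, mode="edge")
    out = np.zeros((h, w))
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out += p[dy:dy + h, dx:dx + w]
    return out / (2 * r + 1) ** 2

def guided_filter(guide, src, r=2, eps=1e-4):
    """Standard single-channel guided filter; r and eps are assumed values."""
    mean_i, mean_p = box_mean(guide, r), box_mean(src, r)
    cov_ip = box_mean(guide * src, r) - mean_i * mean_p
    var_i = box_mean(guide * guide, r) - mean_i ** 2
    a = cov_ip / (var_i + eps)
    b = mean_p - a * mean_i
    return box_mean(a, r) * guide + box_mean(b, r)

def gaussian_blur3(x, sigma=2.0 / 3.0):
    """Separable 3-tap Gaussian blur with edge padding."""
    k = np.exp(-np.array([-1.0, 0.0, 1.0]) ** 2 / (2.0 * sigma ** 2))
    k /= k.sum()
    p = np.pad(x, 1, mode="edge")
    rows = k[0] * p[:, :-2] + k[1] * p[:, 1:-1] + k[2] * p[:, 2:]
    return k[0] * rows[:-2] + k[1] * rows[1:-1] + k[2] * rows[2:]

def color_loss(ps_rgb, ms_aligned):
    """Mean |GF(S_0) - b(M_0)| over pixels and channels (assumed form)."""
    total = 0.0
    for c in range(ps_rgb.shape[2]):
        filtered = guided_filter(ms_aligned[..., c], ps_rgb[..., c])
        total += np.abs(filtered - gaussian_blur3(ms_aligned[..., c])).mean()
    return total / ps_rgb.shape[2]
```

Because the guided filter transfers only the structures present in the guidance M_0, details of S_0 that are absent from M_0 are smoothed before the comparison, so they are not penalized directly.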

3) TOTAL LOSS
The total loss function to train the network is defined as a weighted sum of the aforementioned loss functions, where the weights w_dg and w_c are empirically set to 1 and 2, respectively. Our total loss function is simple yet effective.
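A minimal sketch of the weighted sum is given below. It assumes that the vanilla detail loss enters with unit weight, that w_dg scales the dual-gradient detail loss, and that w_c scales the color loss; the function name `total_loss` is hypothetical.

```python
def total_loss(l_detail, l_dual_grad, l_color, w_dg=1.0, w_c=2.0):
    """Weighted sum of the three loss terms (unit weight on l_detail is assumed)."""
    return l_detail + w_dg * l_dual_grad + w_c * l_color

print(total_loss(0.1, 0.2, 0.3))
```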

E. NORMALIZATION
Throughout the whole training and test processes, the inputs are normalized by the mean and standard deviation values at each pixel computed within a local patch around the pixel. We have conducted extensive experiments for various types of normalization, such as uniform normalization for all images using the dataset statistics, and global normalization by computing mean and standard deviation values for each image. However, local normalization per patch has proven to be the most effective method. As mentioned earlier, PAN and MS input images are non-stationary, having various pixel intensity distributions depending on geographical features. Also, pixel intensity distributions can be very different according to satellite sensor types. It is time-consuming and costly to train dedicated PS networks for different satellite datasets. Motivated by this, we propose a simple but effective patch-based normalization technique that allows the network trained on the images acquired by a specific satellite to generalize well to unseen images of other satellites. Applying our normalization helps maintain the color information of the MS input.
Our proposed normalization downscales the PAN and aligned MS images to the MS scale, and computes the mean and variance values in a local window of size 9 × 9 over the downscaled images for complexity reduction. Then the upscaled mean and variance maps are used to normalize the PAN and MS input images. Denormalization is applied to the network output to yield the final PS result images. The values that are used for the denormalization are the upscaled mean and variance maps of the MS input that were used for normalization. The local window size should be large enough to capture the regional characteristics of geographical features, as the goal of local normalization is generalization to unseen datasets. However, the computational complexity increases quadratically as the window size grows. Through a set of experiments, we found that a local window of size 9 × 9 shows good generalizability without being computationally too expensive. Our normalization technique can be easily adopted in any PS network. It allows the network to maintain the low-frequency color information of the MS input, having a similar effect to the residual connection.
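The normalization steps can be sketched for a single-channel image as follows. The downscaling and upscaling operators are not specified in this section, so average pooling and nearest-neighbor repetition are assumed, as are the reflect padding and the small constant `eps`; all function names are hypothetical.

```python
import numpy as np

def avg_pool(x, f):
    """Downscale an (H, W) image by average pooling (assumed operator)."""
    h, w = x.shape[0] // f, x.shape[1] // f
    return x[:h * f, :w * f].reshape(h, f, w, f).mean(axis=(1, 3))

def local_stats(x, win=9):
    """Per-pixel mean and variance in a win x win window (reflect padding)."""
    r = win // 2
    h, w = x.shape
    p = np.pad(x, r, mode="reflect")
    s1, s2 = np.zeros((h, w)), np.zeros((h, w))
    for dy in range(win):
        for dx in range(win):
            patch = p[dy:dy + h, dx:dx + w]
            s1 += patch
            s2 += patch * patch
    mean = s1 / win ** 2
    var = np.maximum(s2 / win ** 2 - mean ** 2, 0.0)
    return mean, var

def local_norm_maps(img, scale, win=9):
    """Statistics computed at the MS scale, then upscaled back (nearest)."""
    mean, var = local_stats(avg_pool(img, scale), win)
    up = lambda m: np.repeat(np.repeat(m, scale, axis=0), scale, axis=1)
    return up(mean), up(var)

def normalize(x, mean, var, eps=1e-6):
    return (x - mean) / np.sqrt(var + eps)

def denormalize(y, mean, var, eps=1e-6):
    return y * np.sqrt(var + eps) + mean

pan = np.random.rand(64, 64)
mean, var = local_norm_maps(pan, scale=4)
pan_norm = normalize(pan, mean, var)
restored = denormalize(pan_norm, mean, var)
```

In the pipeline described above, the network output would be denormalized with the maps derived from the MS input rather than from the PAN image; the round trip on `pan` here only demonstrates that the two operations are inverses.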

F. NETWORK ARCHITECTURE AND TRAINING DETAILS
Our network, UPSNet, comprises 28 residual blocks, each of which has one leaky ReLU (negative slope = 0.1), one convolution layer, and one identity mapping. In total, our network has 30 convolution layers with about 1M filter parameters. To reduce the computational complexity, a single-channel PAN input is de-shuffled and transformed into an MS-sized 16-channel input, which is the opposite operation of the subpixel convolution layer [35]. The de-shuffled PAN image is then concatenated with the 3-channel MS input. Therefore, the MS-sized 19-channel data is fed into the first convolution layer of UPSNet. The last convolution layer generates 48-channel feature maps, which are then converted to a PAN-sized 3-channel (if MS is RGB) residual output via a shuffle layer. Finally, a nearest-neighbor interpolated MS image is added to the residual output to generate the final pan-sharpened image.
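The de-shuffle and shuffle operations, and the channel counts quoted above, can be checked with a small NumPy sketch. The channel ordering inside the shuffled tensor is an assumption, and `deshuffle` and `shuffle` are hypothetical helper names; the convolutional body of the network is omitted.

```python
import numpy as np

def deshuffle(x, r=4):
    """Space-to-depth: (H*r, W*r) -> (H, W, r*r)."""
    h, w = x.shape[0] // r, x.shape[1] // r
    return x.reshape(h, r, w, r).transpose(0, 2, 1, 3).reshape(h, w, r * r)

def shuffle(x, r=4):
    """Depth-to-space: (H, W, C*r*r) -> (H*r, W*r, C)."""
    h, w, c = x.shape[0], x.shape[1], x.shape[2] // (r * r)
    return x.reshape(h, w, r, r, c).transpose(0, 2, 1, 3, 4).reshape(h * r, w * r, c)

pan = np.random.rand(256, 256)                    # PAN input
ms = np.random.rand(64, 64, 3)                    # MS input (RGB)

net_in = np.concatenate([deshuffle(pan), ms], axis=2)    # MS-sized, 16 + 3 channels
features = np.random.rand(64, 64, 48)             # stand-in for the last conv output
residual = shuffle(features)                      # PAN-sized, 3 channels
ms_up = np.repeat(np.repeat(ms, 4, axis=0), 4, axis=1)   # nearest-neighbor upscaling
ps = ms_up + residual                             # final pan-sharpened image
print(net_in.shape, residual.shape, ps.shape)
```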

2) TRAINING
We trained UPSNet using the AdamW [23] optimization technique with an initial learning rate of 10^-4 and a weight decay of 10^-7. For training other deep-learning-based PS methods, we followed the training details provided in their original papers. We employed the uniform weight initialization technique in [14] for training. All the networks were implemented using TensorFlow [1], and were trained and tested on an NVIDIA TITAN RTX GPU. Our network was trained for 10^6 iterations, with the learning rate lowered by a factor of 10 after 5 × 10^5 iterations. The mini-batch size was set to 2. Training of UPSNet takes about 10 hours, and testing takes 0.237 seconds on average for an image of size 648 × 648 (PAN).
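The step schedule described above can be written as a small framework-independent function. Whether the drop takes effect exactly at iteration 5 × 10^5 or one step later is an assumption, and the name `learning_rate` is hypothetical.

```python
def learning_rate(step, base_lr=1e-4, drop_step=500_000, factor=10.0):
    """Initial rate until drop_step, then lowered by the given factor."""
    return base_lr if step < drop_step else base_lr / factor

total_steps = 1_000_000
weight_decay = 1e-7
batch_size = 2
print(learning_rate(0), learning_rate(drop_step := 500_000), learning_rate(total_steps - 1))
```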

b: LOWER-SCALE VALIDATIONS
Due to the unavailability of ground-truth pan-sharpened images, we evaluate the performances of UPSNet and other PS methods under two different settings: lower-scale and full-scale (original-scale) validations. We use the full-reference metrics under the lower-scale validation following Wald's protocol [40]. For this, the downscaled versions of PAN and MS images are fed as input to all the methods under comparison, and the resulting lower-scale output PS images are compared with their corresponding pseudo-ground-truth original MS images. Four different metrics are used for the lower-scale validations: (i) spatial correlation coefficient (SCC) [47]; (ii) erreur relative globale adimensionnelle de synthèse (ERGAS) [24]; (iii) Q index [41]; and (iv) peak signal-to-noise ratio (PSNR).
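For reference, two of the full-reference metrics can be sketched with their commonly used definitions; these are not necessarily the exact implementations used for the reported scores. PSNR assumes images scaled to a known peak value, and ERGAS is computed as 100 · (1/ratio) · sqrt(mean over bands of (RMSE_k / mean_k)^2), with ratio = 4 for the PAN/MS pairs considered here. SCC and the Q index are omitted.

```python
import numpy as np

def psnr(ref, out, peak=1.0):
    """Peak signal-to-noise ratio in dB for images with the given peak value."""
    mse = np.mean((ref - out) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))

def ergas(ref, out, ratio=4):
    """ERGAS for (H, W, C) images; lower is better, 0 for identical images."""
    rmse = np.sqrt(np.mean((ref - out) ** 2, axis=(0, 1)))   # per-band RMSE
    mean = np.mean(ref, axis=(0, 1))                          # per-band mean of reference
    return float(100.0 / ratio * np.sqrt(np.mean((rmse / mean) ** 2)))

ref = np.full((16, 16, 3), 0.5)     # pseudo ground truth (original MS)
out = ref + 0.1                     # lower-scale PS output with a constant error
print(psnr(ref, out), ergas(ref, out))
```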

c: FULL-SCALE VALIDATIONS
For the full-scale validation, SCC is also measured between the original PAN inputs and grayscale versions of the PS output images. The SCC values measured at full scale indicate how well a pan-sharpening method maintains the sharpness of the input PAN images in the PS output images. We also measure the quality-with-no-reference (QNR) [2] index, which is a no-reference metric for pan-sharpening, and another no-reference metric called the joint quality measure (JQM) [30], which is known to coincide better with the perceived visual quality of PS output images than QNR.
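A rough sketch of a full-scale SCC computation is given below. It assumes that SCC is the correlation coefficient between high-pass-filtered images, that a 4-neighbor Laplacian is an acceptable stand-in for the high-pass filter (the exact kernel in [47] may differ), and that the grayscale PS image is the plain band average. All names are hypothetical.

```python
import numpy as np

def laplacian(img):
    """4-neighbor Laplacian evaluated on interior pixels only (no padding)."""
    return (img[:-2, 1:-1] + img[2:, 1:-1] + img[1:-1, :-2] + img[1:-1, 2:]
            - 4.0 * img[1:-1, 1:-1])

def scc_full_scale(pan, ps):
    """pan: (H, W), ps: (H, W, bands). Correlation of high-pass PAN and grayscale PS."""
    gray = ps.astype(np.float64).mean(axis=2)      # assumed grayscale conversion
    a = laplacian(pan.astype(np.float64)).ravel()
    b = laplacian(gray).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0
```

Because the MS image does not enter this computation, the full-scale SCC is unaffected by which MS image (original or aligned) is used as a reference for the other metrics.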

d: MISALIGNMENT ISSUE BETWEEN PAN AND MS IMAGES
In general, the PAN and MS images are misaligned due to the inevitable acquisition time difference and mosaicked sensor arrays. However, none of the above seven metrics for lower- and full-scale validations considers the inherent misalignment between PAN and MS images. UPSNet, in contrast, is designed to correct this misalignment by aligning the color (MS) of an object with the object's details (PAN), so it can produce output PS images whose colors and shapes are very well aligned. In this case, directly measuring the spectral distortion of the PS output with respect to the colors of the original MS input is not meaningful, because the colors of the PS output generated by UPSNet have been moved (aligned) to match the shapes (details). Therefore, in addition to such conventional direct measures with respect to the original MS inputs, we also measure the distortions with respect to the aligned MS images created by the alignment method in Section III-C for a fair and meaningful comparison.
Tables 1 and 2 show the average metric scores for 100 randomly chosen test image pairs from the WorldView-3 dataset, measured with respect to the original MS input without alignment and with respect to the aligned MS image, respectively. ↑ and ↓ indicate that higher and lower values are better, respectively, for each metric. Note that the full-scale SCC values in Table 1 are the same as those in Table 2. This is because MS images are not used in measuring the SCC metric, as mentioned earlier. As shown in Tables 1 and 2, UPSNet performs the best in terms of SCC. DSen2 shows the highest QNR value in Table 2; however, it shows poor perceived visual quality, which will be discussed later in Sec. IV-B3. In Table 2, UPSNet outperforms all other PS methods in terms of all quality metrics except QNR, and UPSNet (w/o align) achieves the highest value of QNR.
Figs. 6 and 7 show visual comparisons of our UPSNet against the previous state-of-the-art methods. It is clearly shown in Fig. 6-(p) and 7-(p) that the PS output image from UPSNet preserves the high-frequency details of the PAN inputs well and keeps the color information as close as possible to that of the MS inputs, with minimal distortions. The effectiveness of the registration (alignment) learning by our UPSNet can be clearly seen around the pool area in Fig. 6-(h). Since the pool is located slightly up and to the right in the MS image (Fig. 6-(b)) compared to the PAN image (Fig. 6-(a)), most of the previous SOTA PS methods show artifacts due to this misalignment (the color of the water is placed slightly up and to the right of the shape of the pool), but the output PS image of UPSNet shows no such artifacts. Also, UPSNet produces the colors most similar to those of the original MS images, especially the color of the water in the pool (Fig. 6-(h)). The effectiveness of the registration learning is even more evident in Fig. 7-(h). As can be seen in Fig. 7-(a) and (b), the color of the orange roof in the MS image is placed slightly upward compared to its shape in the PAN image. UPSNet is the only method that is able to fuse the colors of the orange roof from the MS image with their appropriate shapes in the corresponding PAN image. More visual comparisons are provided in Figs. 13 and 14.

3) CONSIDERATIONS FOR NO-REFERENCE METRICS: QNR AND JQM
In this paper, we have utilized two full-scale no-reference metrics, QNR and JQM. However, several previous works have pointed out the drawbacks and unexpected properties of QNR [16], [30], [40], especially when perfect alignment between the MS and PAN images is not assured. Since the PAN and MS images in the WorldView-3 dataset are known to be not well-aligned, the values of the QNR metric can be expected to disagree with the observed visual quality.
We have intensively investigated this discrepancy between the QNR metric and the subjective quality of PS outputs. Figs. 8 and 9 show visual comparisons of PS outputs obtained by various pan-sharpening methods. It is important to note that, although the PS output images of PNN, PanNet and DSen2 exhibit relatively higher QNR scores than those of PanNet-S3, DSen2-S3 and UPSNet, their perceived visual qualities are much worse, showing severe ghost artifacts in Fig. 8 and misalignment between colors and shapes (details) in Fig. 9. It is also worthwhile to point out that the PS output of UPSNet in Fig. 9 shows the best visual quality but has the lowest QNR value.
To remedy this problem, we additionally adopted another metric (JQM), which is known to agree better with the perceived visual quality of PS images [32]. As shown in Figs. 8 and 9, the values of the JQM metric agree very well with the perceived visual qualities of the PS outputs. As opposed to the QNR metric, PNN, PanNet and DSen2 exhibit relatively lower JQM scores than PanNet-S3, DSen2-S3 and UPSNet in Figs. 8 and 9. In both figures, the PS outputs from our UPSNet yield the highest JQM scores, coinciding with the perceived visual quality. The PS outputs produced by DSen2-S3 and PanNet-S3 are ranked second and third in terms of JQM values, which is a reasonable ranking in agreement with their perceived visual qualities.
The discrepancy between QNR and perceived visual quality comes from the fact that QNR does not directly reflect the spectral and spatial distortions in its calculation [2]. The spectral distortion term ($D_\lambda$) of QNR obtains the spectral distortion index indirectly, by taking the difference between the inter-band similarity measures of the MS and PS images. Similarly, the spatial distortion term ($D_S$) of QNR is measured indirectly by taking the difference between two relations: (i) the relation between each channel of an MS image and its corresponding low-pass-filtered and downscaled PAN image; and (ii) the relation between each channel of a PS output image and a PAN image. On the other hand, JQM [30] directly measures both the spectral distortion between MS and downscaled PS images, and the spatial distortion between PAN and fused PS images. JQM was argued to agree better with perceived visual quality than QNR [30]. Throughout our intensive experiments, we have also found that JQM is better correlated with perceived visual quality for various PS output images, as shown in Figs. 8 and 9.
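For reference, the indirect structure of QNR described above corresponds to the following commonly cited formulation of the index in [2]. The notation is an assumption introduced here for illustration: $M_l$ and $\hat{M}_l$ denote the $l$-th band of the MS and PS images, $P$ the PAN image, $\tilde{P}$ its low-pass-filtered and downscaled version, $Q(\cdot,\cdot)$ the Q index [41], $N$ the number of bands, and $p$, $q$, $\alpha$, $\beta$ positive exponents that are usually set to 1.

```latex
\begin{aligned}
D_\lambda &= \sqrt[p]{\frac{1}{N(N-1)} \sum_{l=1}^{N} \sum_{\substack{r=1 \\ r \neq l}}^{N}
             \left| Q(\hat{M}_l, \hat{M}_r) - Q(M_l, M_r) \right|^{p}}, \\
D_S &= \sqrt[q]{\frac{1}{N} \sum_{l=1}^{N}
             \left| Q(\hat{M}_l, P) - Q(M_l, \tilde{P}) \right|^{q}}, \\
\mathrm{QNR} &= (1 - D_\lambda)^{\alpha} \, (1 - D_S)^{\beta}.
\end{aligned}
```

Both terms compare similarity scores with each other rather than comparing the PS image with the MS or PAN image directly, which is the sense in which QNR measures the distortions only indirectly.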

C. ABLATION STUDIES
Ablation studies have been conducted in a few different settings to show the effectiveness of the key aspects of our proposed UPSNet. In each experiment, only one component is changed, and the others remain the same. The different models are evaluated at full scale, using the original MS and PAN images as inputs to the network. We use two criteria for measuring the quality of the output PS images: high-frequency detail similarity with PAN images (SCC) and color similarity with MS images (ERGAS). ERGAS is measured between aligned MS images and PS output images. We denote this as ERGAS-A.

1) LEARNING FRAMEWORK
First, we provide ablation study results on the learning framework, including unsupervised learning, training in the original scales, and alignment. The experiment conditions are as follows.
Condition 1 is training on lower scales using our unsupervised framework and testing on the original scales. Condition 2 is training without alignment, using the bicubic-interpolated original MS image as a target for the color loss. In Condition 3, we train UPSNet in a supervised manner, similarly to PanNet [43] and DSen2 [19], where each training pair of PAN and MS images is downscaled by a factor of 4, and the original MS input is used as a pseudo ground truth. The network for Condition 3 is regularized by an L1 loss between the output PS images and the original MS inputs to have settings similar to those of PanNet [43] and DSen2 [19].
As shown in Table 3, all conditions entail substantial performance drops in terms of all metrics. Fig. 10 shows the visual comparison for Conditions 1, 2, and 3. Due to the scale mismatch between training and testing, and the absence of alignment between MS and PAN images, it is clear that the results in Fig. 10-(b), (c), and (d) suffer from misaligned colors, especially in the areas indicated by the red arrows. As can be seen in Table 3, UPSNet trained in a supervised manner shows a substantial performance drop, especially in terms of SCC. Fig. 10-(d) clearly shows that supervised training in lower scales leads to inferior visual quality, with artifacts in the homogeneous region.

2) REGISTRATION SCALE
In Sec. III-C, we have discussed other possible alignment options that perform the registration step in the MS scale. In the proposed method, the aligned MS image, which is the output of the registration step, is only used as a target for the proposed guided-filter-based color loss and has the same size as the PAN image, as explained in Sec. III-C. The PS output images from UPSNet and their corresponding aligned MS images are then compared by the guided-filter-based color loss function without any scale conversion. However, when the registration is performed in the MS scale, the aligned MS images have the same size as the input MS images. Therefore, there is a scale mismatch between the PS images and the corresponding aligned MS images. In this case, there are two possible options for matching the scale (resolution) when computing the color loss.
The first option is to downscale the PS images to the MS scale by applying a degradation model, and the second option is to upscale the aligned MS images (aligned in the MS scale) to have the same resolution as their corresponding PS images. Since both options require scale conversion, a new type of misalignment is inevitably introduced during the scale matching process. Table 4 provides the quantitative experiment results for UPSNet and its variants trained under the two options mentioned above. Both the ERGAS-A and SCC scores degrade under the two options. Fig. 11 shows the artifacts introduced by the scale conversion. UPSNet can effectively handle the misalignment between the PAN and MS images, especially on the moving cars, but the variants of UPSNet that include scale conversion (Fig. 11-(c), (d)) fail because they cannot properly learn to handle the misalignment. The overall experiment results show that registration in the PAN scale yields the best pan-sharpening performance from both quantitative and qualitative perspectives.

3) LOSS FUNCTIONS
In this section, we discuss the effectiveness of the proposed loss functions. Two loss functions have been newly proposed to train our UPSNet: a guided-filter-based color loss ($L_c$) between network outputs and our aligned MS targets; and a dual-gradient detail loss ($L_{dg}$) between network outputs and PAN inputs.
Ablation studies have been conducted under two different conditions to show the effectiveness of the proposed loss functions. Condition 1 is training the network without the dual-gradient detail loss. Condition 2 is applying a Gaussian blur kernel instead of the guided filter used for the color loss. We denote the Gaussian-blur-kernel-based color loss as $L_{gb}$. The parameters of the Gaussian blur kernel are adjusted so that the PS images, after the Gaussian blur kernel is applied, have visual quality similar to that of the corresponding aligned MS images. Table 5 shows the average metric scores of ERGAS-A and SCC. Performance drops are observed for both Condition 1 and Condition 2, showing that the proposed loss functions are essential for training our UPSNet. Fig. 12 shows the visual comparison regarding the ablation study on loss functions. Condition 1 seems to produce reasonable visual quality, but it tends to over-enhance the local contrast, introducing some artifacts in the output PS images. Condition 2 introduces rainbow-like artifacts in all images. Both quantitative and qualitative experiments show the effectiveness of our proposed loss functions in training UPSNet.
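The Gaussian-blur-based color loss used in Condition 2 could look roughly like the sketch below. The text only states that the PS output is blurred before it is compared with the aligned MS target, so the L1 distance, the kernel width `sigma`, the edge padding, and all function names are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at about 3 sigma."""
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def gaussian_blur(img, sigma):
    """Separable Gaussian blur of an (H, W, C) image with edge padding."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    padded = np.pad(img.astype(np.float64), ((r, r), (r, r), (0, 0)), mode="edge")
    conv = lambda v: np.convolve(v, k, mode="valid")
    tmp = np.apply_along_axis(conv, 0, padded)   # blur along rows: (H, W + 2r, C)
    return np.apply_along_axis(conv, 1, tmp)     # blur along columns: (H, W, C)

def color_loss_gb(ps, aligned_ms, sigma=2.0):
    """Assumed form of L_gb: mean absolute difference between the blurred PS
    output and the PAN-sized aligned MS target."""
    return float(np.mean(np.abs(gaussian_blur(ps, sigma) - aligned_ms)))
```

A Gaussian kernel blurs across object edges regardless of image content, whereas a guided filter is edge-preserving, which is one plausible reason why the two color losses behave differently.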

4) CROSS-DATASET EXPERIMENT
Cross-dataset experiments have been conducted to show the generalization capability of our UPSNet. Each pan-sharpening network is trained and tested in four different settings using the datasets acquired from two different satellites, KOMPSAT-3A (K3A) and WorldView-3, as described in Tables 6, 7, 8, and 9. The upward and downward arrows (↑, ↓) indicate that higher and lower values imply better performance, respectively. The best and second-best results are highlighted in bold and underline, respectively. It can be seen that UPSNet shows good generalization capability, while the other methods show performance drops when tested on a dataset that is different from the training dataset.

V. CONCLUSION
In this work, we propose an effective unsupervised learning framework with registration learning for pan-sharpening, called UPSNet. To resolve the misalignment between PAN and MS images, we first propose a simple correlation-based PAN-MS registration to obtain an aligned MS target of PAN resolution from each misaligned PAN-MS input pair. The aligned MS target is then used as the target for the color loss, which forces the network to learn how to handle the misalignment between PAN and MS images. It should be noted that the registration used for training is no longer required in testing. Additionally, we designed two loss functions for the training of our network: a guided-filter-based color loss between the network's PS outputs and our aligned MS targets; and a dual-gradient detail loss between the network's PS outputs and PAN inputs. Intensive experimental results show that our UPSNet can generate pan-sharpened images with remarkable improvements in terms of color similarity and texture details compared to the state-of-the-art pan-sharpening methods.