SelfS2: Self-Supervised Transfer Learning for Sentinel-2 Multispectral Image Super-Resolution

The multispectral image captured by the Sentinel-2 satellite contains 13 spectral bands with different resolutions, which may hinder some of the subsequent applications. In this article, we design a novel method to super-resolve the 20- and 60-m coarse bands of the S2 images to 10 m, achieving a complete dataset at the 10-m resolution. To tackle this inverse problem, we leverage the deep image prior expressed by the convolutional neural network (CNN). Specifically, a plain ResNet architecture is adopted, and the 3-D separable convolution is utilized to better capture the spatial–spectral features. The loss function is tailored based on the degradation model, enforcing the network output to obey the degradation process. Meanwhile, a network parameter initialization strategy is designed to further mine the abundant fine information provided by the existing 10-m bands. The network parameters are inferred solely from the observed S2 image in a self-supervised manner without involving any extra training data. Finally, the network outputs the super-resolution result. On the one hand, our method utilizes the high model capacity of CNNs and works without the large amounts of training data required by many deep learning techniques. On the other hand, the degradation process is fully considered, and each module in our work is interpretable. Numerical results on synthetic and real data illustrate that our method outperforms state-of-the-art methods.

classification [4], and change detection [5], [6], [7]. For many reasons, such as the physical limitations of the radiometric resolution of the imaging sensors, design considerations, and achieving a higher signal-to-noise ratio (SNR), the spatial resolution [or say, the ground sampling distance (GSD)] of a single remote-sensing MSI captured by some well-known satellites, including the Moderate Resolution Imaging Spectroradiometer [8], the Advanced Spaceborne Thermal Emission and Reflection Radiometer [9], and the recently deployed Sentinel-2 (S2) launched by the European Space Agency (ESA) [10], might be different across different spectral bands. As such resolution differences would not go away with hardware improvements [11], computationally super-resolving those coarser bands so as to have all bands available at the highest spatial resolution is of paramount importance.
Without loss of generality, this article focuses on the S2 satellite. An MSI collected by the S2 satellite contains 13 spectral bands (443–2190 nm), covering the visible, near-infrared, and short-wave infrared wavelengths with three different resolutions, as shown in Table I. The super-resolution problem of S2 MSIs is to increase the resolution of the bands at 20- and 60-m spatial resolution to the maximal resolution (i.e., 10 m). The presence of fine bands makes super-resolving S2 images similar to the conventional pansharpening problem. In [12], existing component substitution [13], [14], [15] and multiresolution analysis [16], [17], [18], [19], [20] methods designed for pansharpening are extended to super-resolve the 20-m bands of S2 images by integrating the four fine bands into a single band. A geostatistical approach, the area-to-point regression kriging approach [21], [22], has also been extended. However, as pointed out in [11] and [23], the super-resolution of S2 images differs from the conventional pansharpening problem [20], [24], [25], [26], [27], [28] and the hyperspectral image/MSI fusion problem [28], [29], [30], [31]: in those two tasks, the high-spatial-resolution data (i.e., the panchromatic image in pansharpening and the MSI in hyperspectral image/MSI fusion) spectrally overlap the lower resolution bands, whereas this condition is not met by S2 images, since the bands at the 10-m resolution and the bands at the 20-/60-m resolutions are almost nonoverlapping (see Table I). Thus, novel methods tailored for S2 image super-resolution are needed.
Recent works super-resolve the coarse bands by inverting the observation model that depicts the degradation (downsampling, blurring, and noise) of S2 images in the imaging process. In Lanaras et al. [11], the downsampling and blurring processes are characterized as two block diagonal matrices with each block associated with a specific band. In particular, when the blur is assumed to be spatially invariant, each block of the blur matrix is a block-circulant circulant-block matrix associated with the point spread function (PSF) of the corresponding band. In [11], the variance of the Gaussian blur is computed from the calibrated modulation transfer function released by the ESA [32]. This observation model has also been rewritten in the tensor form by Wang and Ji [33].
The inversion of the observation model is obviously ill-posed [34]. Therefore, introducing prior knowledge of the underlying fine bands via regularization is required. Three types of prior knowledge can be found in the literature. In [11], the spatial local discontinuity of the four existing fine bands is encoded as a weighted total variation (TV) regularizer to sharpen the 20- and 60-m bands. Furthermore, Paris et al. [35] utilize the spatial nonlocal self-similarity under the plug-and-play framework. In [35], a CBM3D [36] denoiser is plugged in, and this version of CBM3D computes the patch similarity from a reference image obtained as a linear combination of the four fine bands. In [23] and [33], the nonlocal self-similarity is exploited by formulating explicit regularizers. Another important property of the underlying high-resolution S2 images is that they are globally correlated. That is, MSIs live in a low-dimensional subspace. Thus, the low-rank subspace representation is employed in [11], [23], and [35], and Ulfarsson et al. [37] turn to reducing the rank of the data matrix.
Lanaras et al. [38] directly learn a deep convolutional neural network (CNN), which can be viewed as the inverse mapping of the observation model mentioned above, with the training data synthesized based on the scale invariance assumption. For example, when super-resolving the 20-m bands, the original 10-m and 20-m bands are downsampled by a factor of 2. The existing 20-m bands then serve as the ground-truth "high-resolution" images at this reduced scale, and the resulting pairs are gathered for training/validation/testing. This dataset then serves for the fully supervised training of the 2× super-resolution. However, in [38], the 2× super-resolution and the 6× super-resolution are conducted separately, and the network architecture for the 6× super-resolution is much deeper. Palsson et al. [39], [40] and Gargiulo et al. [41] also adopt this strategy, taking the deep residual network (ResNet) [42] and the generative adversarial network [43] into consideration.
Generally, deep-learning-based methods are believed to possess a high model capacity to represent the information in need [44]. However, existing S2 super-resolution approaches using deep CNNs, e.g., [38], [39], [40], face an unavoidable challenge to their generalization ability. The model-based methods [11], [23], [33], [35], [37] can be scene adaptive, as the prior knowledge (global/local/nonlocal) they consider is widespread across different scenes. Thus, to yield excellent performance with global applicability, the methods in [38], [39], and [40] need to collect as much training data from different scenes as possible. This, in turn, leads to a large volume of training data and a correspondingly high computational burden for training.
To address the above issues, this article proposes a self-supervised learning method for S2 MSI super-resolution. Our contributions can be summarized as follows.
1) First, to tackle the Sentinel-2 image super-resolution problem, we employ the deep image prior expressed by a CNN with a high model capacity. Specifically, we adopt a plain deep residual CNN architecture with the 3-D separable convolution. This structure is expected to adaptively capture a great deal of low-level MSI statistics for different scenes.

2) Second, the loss function is established based on the degradation model of S2 images. Thus, our results naturally conform to the degradation. Meanwhile, as the fine bands at the 10-m resolution are available, we initially infer the network parameters with the fine bands and then transfer them for super-resolving the coarser bands. The overall learning stage is fully self-supervised, which means that the network parameters are inferred solely from the observed S2 MSI, and no extra training data are needed.

Extensive experiments are carried out to illustrate the effectiveness of the proposed method. The rest of this article is organized as follows. Section II presents the proposed method in terms of the network architecture, the loss function, and details of learning. Section III reports experimental results and discussions. Finally, Section IV concludes this article.

II. METHOD
In this section, we first review the degradation model of S2 images. Then, we establish the framework of our super-resolution method, including the loss function and the specific network architecture. Finally, we tailor the learning scheme for S2 images.

A. Problem Formulation
As the input and output of our network are both in the tensor format in our implementation, we describe the degradation model of S2 images in the tensor format. Hence, before formulating the S2 super-resolution problem, we first introduce the tensor notations used throughout this article. Lowercase letters are used for scalars, e.g., x; boldface lowercase letters for vectors, e.g., x; boldface uppercase letters for matrices, e.g., X; and Euler script letters for tensors, e.g., X. Given a third-order tensor X ∈ R^(n1×n2×n3), we use X_ijk to denote its (i, j, k)th element. The kth frontal slice of X is denoted as X^(k) (or X(:, :, k), X_k).
Note that the spatial sizes and the PSFs of the blurring matrices for different bands of S2 images can be different, which makes formulating the degradation model of S2 images tricky [23]. In the following, we reformulate the degradation model proposed in [11] in the tensor format with reference to [33]. For notational simplicity, we first conduct plain upsampling via the Kronecker product to ensure that the observed images are of the same spatial size. For example, for a band at the 20-m resolution, denoted as X ∈ R^((M/2)×(N/2)), we upsample it via

Y^(j) = X ⊗ E_2 (1)

where ⊗ denotes the Kronecker product and E_2 ∈ R^(2×2) has 1 in its top-left entry and 0 elsewhere. Then, for S2 images having B = 12 spectral bands¹ and the spatial size M × N (at the 10-m resolution), the observation after the plain upsampling can be written as a third-order tensor Y ∈ R^(M×N×B). The order of the bands in Y is from B1 to B12 (except B10), in accord with Table I. Then, the degradation model, up to the noise, can be expressed as

Y = M ⊙ (B ⊛ X) (2)

where Y ∈ R^(M×N×B) is the observed tensor, M ∈ R^(M×N×B) is the binary mask tensor representing the downsampling process, ⊙ denotes the elementwise multiplication, B ∈ R^(h×h×B) is the Gaussian kernel tensor, ⊛ indicates the slicewise convolution, and X ∈ R^(M×N×B) is the underlying MSI with the highest spatial resolution across each band. The ith frontal slice of B (i.e., B^(i)) is the Gaussian blur kernel of the ith band. The ith frontal slice of M is constructed as

M^(i) = ones(M/r_i, N/r_i) ⊗ E_{r_i} (3)

where r_i ∈ {1, 2, 6} is the resolution ratio of the ith band, E_{r_i} ∈ R^(r_i×r_i) has 1 in its top-left entry and 0 elsewhere, and ones(m, n) indicates generating a matrix of the size m × n with all entries equal to 1. The construction of M also indicates that the downsampling is conducted via selecting the element in the top-left corner.
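As a concrete illustration, the band-wise degradation Y = M ⊙ (B ⊛ X) can be sketched in a few lines of NumPy. The function and argument names (`degrade`, `top_left`, `sigmas`, `ratios`) are hypothetical and this is only a sketch of the model, not the authors' implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def top_left(r):
    """r x r block with 1 in the top-left entry and 0 elsewhere."""
    e = np.zeros((r, r))
    e[0, 0] = 1.0
    return e

def degrade(x, sigmas, ratios):
    """Apply Y = M * (B conv X) band by band: Gaussian blur followed by a
    binary mask keeping the top-left pixel of each r x r block."""
    m, n, nb = x.shape
    y = np.zeros_like(x)
    for b in range(nb):
        blurred = gaussian_filter(x[:, :, b], sigmas[b]) if ratios[b] > 1 else x[:, :, b]
        mask = np.kron(np.ones((m // ratios[b], n // ratios[b])), top_left(ratios[b]))
        y[:, :, b] = mask * blurred
    return y
```

A 10-m band (ratio 1) passes through unchanged, while a 20-m band (ratio 2) is blurred and keeps only one sample per 2 × 2 block, matching the construction of M in (3).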

B. Proposed Method
The super-resolution of S2 images aims at recovering the high-resolution MSI X from the observation Y. On the one hand, with a pre-estimated B and a given M, previous model-based S2 super-resolution methods [11], [23], [33], [35], [37] can be formulated as

X̂ = argmin_X L(M ⊙ (B ⊛ X), Y) + φ(X) (4)

where L is the fidelity term and φ(X) is the regularization term. L enforces the super-resolution result to be in accordance with the degradation model, and the regularization term expresses the prior knowledge of the underlying result. On the other hand, deep-learning-based methods can be viewed as learning a mapping function from a large amount of training pairs {(Y_i, X_i)}, i.e.,

θ̂ = argmin_θ Σ_i L(f_θ(Y_i), X_i). (5)

Our method attacks the S2 super-resolution problem in another way and can be formulated as

θ̂ = argmin_θ L(M ⊙ (B ⊛ f_θ(Z)), Y) (6)

where f_θ(·) is a specific CNN with the network parameters θ, L(·) is the loss function, and Z is the network input. As for the network input Z, we follow the strategy of using the degraded image in [46]. However, directly inputting the Y constructed via the Kronecker product in (2) would not be helpful for extracting spatial and spectral features since there are too many zero entries. Therefore, instead of directly using Y, we use a simple spatial bicubic interpolation [47] to fill those zero entries within the 20- and 60-m bands. Meanwhile, to better utilize the global low dimensionality of the MSI, the bicubic interpolation result is then projected to a low-dimensional subspace detected via the HySime algorithm [48] and projected back. For simplicity, this preprocessing step is denoted as Z = Preprocessing(Y). As we will show in the experimental part, other preprocessing techniques also work. We then explain the distinctions and connections of our method with respect to (4) and (5). For convenience, we rewrite (6) as

min_{θ, X} L(M ⊙ (B ⊛ X), Y), s.t. X = f_θ(Z) (7)

where X is an auxiliary variable representing the network output.
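The spectral subspace step of the preprocessing can be sketched as follows. Here a truncated SVD stands in for the HySime subspace identification, and the bicubic interpolation is assumed to have been applied beforehand; `preprocess` and `rank` are names introduced for this sketch only:

```python
import numpy as np

def preprocess(y, rank):
    """Project an (M, N, B) interpolated cube onto a rank-k spectral
    subspace and back, exploiting the global low dimensionality of the MSI."""
    m, n, nb = y.shape
    z = y.reshape(-1, nb)                  # pixels x bands matrix
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    p = vt[:rank].T                        # spectral subspace basis (B x k)
    z_proj = z @ p @ p.T                   # project and project back
    return z_proj.reshape(m, n, nb)
```

If the cube truly lies in a rank-k spectral subspace, the projection leaves it unchanged; otherwise, it suppresses components outside the subspace, including some noise.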
On the one hand, our method differs from previous CNN-based methods, which directly map the observed low-resolution image to a high-resolution result. The parameters of the CNN in (5) are learned from large datasets of images by computing the loss between the network output and the synthetic high-resolution image via gradient descent and backpropagation [49]. Our method instead infers the network parameters solely from a single observed low-resolution image by minimizing the distance between the network output after degradation and the low-resolution image. That is, the network input Z in (7), which is obtained from the observed MSI via simple preprocessing techniques, is kept fixed throughout the overall training procedure. Our training set can thus be viewed as containing only one image, and this single image is used repeatedly many times. Hence, our method works in a self-supervised learning manner, differing from previous fully supervised deep learning methods.
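The self-supervised inference loop described above can be sketched in PyTorch. The input `z` is fixed, and the loss compares the degraded network output with the observation `y`; `self_supervised_fit` and `degrade_fn` are placeholder names, and an L1 loss stands in for the full loss of (8):

```python
import torch

def self_supervised_fit(net, z, y, degrade_fn, epochs=1000, lr=0.02):
    """Infer network parameters from a single observed image: minimize the
    distance between the degraded network output and the observation."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        out = net(z)                                   # z stays fixed
        loss = torch.nn.functional.l1_loss(degrade_fn(out), y)
        loss.backward()
        opt.step()
    return net(z).detach()                             # super-resolved estimate
```

No external training pairs enter the loop; the single observation plays the role of the whole training set.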
On the other hand, our method is closer to the model in (4). The first term L(M ⊙ (B ⊛ X), Y) plays the same role as the fidelity term in (4). Both of them enforce the result X to be in accordance with the degradation model (2). The main difference lies in the regularization term. In (4), φ(·) represents a regularizer expressing prior knowledge, such as the TV for the spatial local discontinuity and CBM3D for the nonlocal self-similarity. In our method, f_θ(·) serves to express the deep image prior [44], which widely exists in many types of visual data. That is, the CNN f_θ(·) itself serves as the regularization term. Meanwhile, (4) can be solved via traditional optimization methods, while this is not suitable for (7).
The super-resolution result is obtained by X̂ = f_θ(Z) after (6) is optimized.² The flowchart of our method is shown in Fig. 1.

C. Network Architecture and Loss Function
These two parts, i.e., f_θ(·) and L(·), corresponding to the regularization term and the fidelity term, respectively, are vital for a successful recovery. In the following, we introduce the specific architecture of f_θ(·) and tailor the loss function L(·).

1) Network Architecture: As pointed out by Ulyanov et al. [44], the image statistics prior can be well captured by the structure of a CNN independent of learning. That is, the structure of the CNN itself can express the prior knowledge of natural images. Hence, the selection of the CNN structure is of great importance. In this article, we adopt a simple deep ResNet [42] structure with separable 3-D convolutional blocks. Meanwhile, the input of our network is Z constructed from the observation Y. These choices differ from [44], in which a generator network with a U-net-like [50] structure is employed and random noise is taken as the input. For an input Z ∈ R^(M×N×12), the first two 2-D convolution blocks extract 2-D features and extend the feature dimension to 256. Then, a ResNet structure with 34 separable 3-D convolutional blocks follows. Finally, two 2-D convolution blocks in series decrease the feature dimension back to 12. Each 2-D convolution block consists of a cascade of a 2-D convolution layer, a 2-D batch normalization [51] layer, and a LeakyReLU [52] nonlinear activation function.
Considering the cubic nature of the underlying MSI and that the pattern along the spectral direction differs from that along the spatial modes, we exploit separable 3-D convolution [53] blocks³ to construct the main pipeline of f_θ(·). We will illustrate in the experimental part that the separable 3-D convolution is superior to the 2-D convolution and the plain 3-D convolution. The (separable) 3-D convolution block (Conv3D Block) is shown in Fig. 1. It is implemented by separating a 3-D convolution operation into a 2-D spatial convolution and a 1-D spectral convolution in sequence. The separable 3-D convolution is expected to simultaneously capture the spatial and spectral information with fewer filter coefficients.
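A separable 3-D convolution block of this kind can be sketched in PyTorch with `Conv3d` kernels of size (1, 3, 3) for the spatial step and (3, 1, 1) for the spectral step. The class name, channel count, residual connection, and LeakyReLU slope are assumptions for this sketch, not the exact configuration of the paper:

```python
import torch
import torch.nn as nn

class Sep3DBlock(nn.Module):
    """Separable 3-D convolution: a 2-D spatial convolution followed by a
    1-D spectral convolution, wrapped in a residual connection."""
    def __init__(self, ch):
        super().__init__()
        self.spatial = nn.Conv3d(ch, ch, (1, 3, 3), padding=(0, 1, 1))
        self.spectral = nn.Conv3d(ch, ch, (3, 1, 1), padding=(1, 0, 0))
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):  # x: (batch, ch, bands, H, W)
        return x + self.act(self.spectral(self.act(self.spatial(x))))
```

Compared with a full k × k × k 3-D kernel, the two factored kernels use fewer filter coefficients while still mixing spatial and spectral information.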
2) Loss Function: In our work, minimizing the loss function L(M ⊙ (B ⊛ f_θ(Z)), Y) rectifies the deviation between the network output after degradation and the observation. It enforces the network output f_θ(Z), which involves the deep image prior expressed by the network structure, to comply with the observation model. From the perspective of maximum a posteriori estimation, the fidelity term should be designed based on the probability of the observation given the underlying high-quality MSI, which is related to the noise distribution. For example, the ℓ2 norm in [11], [23], and [35] or the Frobenius norm in [33] accounts for independent and identically distributed Gaussian noise. In this article, inspired by a recent work on image super-resolution [54], in which the ℓ1 norm has been shown to better preserve image edges and textures, we adopt the ℓ1 norm in our loss function. In the meantime, the structural similarity (SSIM) [55] loss is added to enhance the structural feature extraction ability of f_θ for a better preservation of the structural information. Thus, our loss function is

L = α1 ‖M ⊙ (B ⊛ f_θ(Z)) − Y‖_1 + α2 (1 − SSIM(M ⊙ (B ⊛ f_θ(Z)), Y)) (8)

where ‖·‖_1 is the ℓ1 norm defined as the sum of absolute values, and α1 and α2 are both set to 1. The choice of α1 and α2 is discussed in Section III-C5.
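A combined ℓ1 + SSIM loss of this shape can be sketched in PyTorch. For brevity, the sketch uses a single-window (global) SSIM over the whole tensor rather than the standard sliding-window SSIM the paper relies on; all names here are introduced for illustration:

```python
import torch

def ssim_global(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM computed over one global window (a simplification of the
    sliding-window SSIM)."""
    mu_a, mu_b = a.mean(), b.mean()
    da, db = a - mu_a, b - mu_b
    va, vb = (da * da).mean(), (db * db).mean()
    cov = (da * db).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (va + vb + c2))

def loss_fn(pred_degraded, y, a1=1.0, a2=1.0):
    """L = a1 * ||.||_1 + a2 * (1 - SSIM), mirroring the form of Eq. (8);
    pred_degraded stands for M * (B conv f_theta(Z))."""
    l1 = (pred_degraded - y).abs().mean()
    return a1 * l1 + a2 * (1.0 - ssim_global(pred_degraded, y))
```

For identical inputs the loss vanishes, and both terms grow as the degraded output drifts from the observation.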

D. Parameter Initialization and Implementation Details
The ultimate goal of our work is to infer a nonlinear mapping function f_θ(·), which maps the observed MSI with different spatial resolutions to a desired MSI with the highest resolution across all bands. Empirically, the network f_θ is expected to learn high-resolution features from the 10-m bands and to help super-resolve the 20- and 60-m bands. Considering that the observation contains four 10-m bands, we propose the following network parameter initialization strategy to further take advantage of those high-resolution bands.
First, we simulate pseudo low-resolution bands from those 10-m bands. For the jth band (j ∈ s_2 ∪ s_6) of the observation, i.e., Y^(j) or Y(:, :, j), we solve the following linear regression problem:

min over {β_i^j} of ‖ M^(j) ⊙ (B^(j) ⊛ Σ_{i∈{2,3,4,8}} β_i^j Y^(i)) − Y^(j) ‖_F² (9)

where M^(j) and B^(j) are the jth frontal slices of M and B, respectively, and Y^(i) is the ith frontal slice of Y. Equation (9) aims at finding the optimal linear combination of high-resolution bands to express the low-resolution band. For a given j ∈ s_2 ∪ s_6, we obtain the regression coefficients β_2^j, β_3^j, β_4^j, and β_8^j. Then, a pseudo low-resolution input Y_low-pseudo is simulated by replacing each coarse band with

Y_low-pseudo^(j) = M^(j) ⊙ (B^(j) ⊛ Σ_{i∈{2,3,4,8}} β_i^j Y^(i)), j ∈ s_2 ∪ s_6. (10)

Correspondingly, we could simulate a pseudo high-resolution MSI Y_high-pseudo as

Y_high-pseudo^(j) = Σ_{i∈{2,3,4,8}} β_i^j Y^(i), j ∈ s_2 ∪ s_6. (11)

The above strategy to simulate the pseudo high-resolution MSI is indeed the band-synthesized scheme of the hypersharpening framework proposed in [56]. Then, the parameters of our network f_θ are randomly initialized and subsequently trained via minimizing the following loss function:

L(f_θ(Preprocessing(Y_low-pseudo)), Y_high-pseudo) (12)

using the ADAM optimizer [57]. After 1000 epochs of training with a learning rate of 0.02, we stop and use the network parameters at this point as the initialization for minimizing (8) in the main training phase. This initialization stage directly enforces the network output to be close to a (pseudo) high-resolution MSI when inputting the (pseudo) low-resolution observation. As we will illustrate in the experimental part, it significantly improves the super-resolution ability of f_θ. Our method is implemented with the PyTorch [58] framework on a desktop with an NVIDIA GeForce RTX 2080Ti GPU with 11-GB GDDR6 RAM. For each iteration (epoch), following the setting in [44], an independently generated zero-mean Gaussian noise with the standard deviation σ = 1/30 is added to the input (after the bicubic interpolation and the subspace projection). Finally, in the main training stage, the learning rate is set to 0.02 with 1000 epochs.
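The band regression of (9) is an ordinary least-squares problem, sketched below with NumPy. `degrade_to_coarse` stands for the band-specific operator M^(j) ⊙ (B^(j) ⊛ ·) and is assumed given; all names are hypothetical:

```python
import numpy as np

def fit_band_weights(hr_bands, coarse_band, degrade_to_coarse):
    """Least-squares weights beta expressing one coarse band as a linear
    combination of the degraded 10-m bands, as in Eq. (9)."""
    A = np.stack([degrade_to_coarse(b).ravel() for b in hr_bands], axis=1)
    beta, *_ = np.linalg.lstsq(A, coarse_band.ravel(), rcond=None)
    return beta

def pseudo_pair(hr_bands, beta, degrade_to_coarse):
    """Pseudo high-/low-resolution band pair from the fitted weights,
    mirroring Eqs. (10) and (11)."""
    high = sum(w * band for w, band in zip(beta, hr_bands))
    return high, degrade_to_coarse(high)
```

When the coarse band really is a linear combination of the degraded fine bands, the recovered weights match the generating coefficients exactly.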

III. EXPERIMENTS
In this section, we compare our method with other state-of-the-art methods on synthetic data and real data. The compared methods are listed as follows: 1) the simple bicubic interpolation; 2) SupReME [11], a method utilizing the discontinuities of the higher resolution bands via a weighted TV regularizer; 3) MuSA [35], a method employing the CBM3D denoiser to express the nonlocal self-similarity; 4) NSTMR [33], a method explicitly formulating the nonlocal self-similarity regularizer in the tensor format; and 5) DSen2 [38], a CNN-based method directly learning the mapping from degraded images to high-resolution results. DSen2 is implemented on the TensorFlow framework with the pretrained network⁴ provided by the authors. As DSen2 is pretrained on data synthesized from real S2 images under the scale invariance assumption, directly applying DSen2 to our synthetic data would be unfair. Thus, we have retrained DSen2 following the training scheme provided in [38]. The only difference is that we use the degradation model (2) to generate the low-resolution and high-resolution MSI pairs. The remaining model-based methods are implemented in MATLAB. For all the model-based methods, the parameters are manually tuned for the best performance. We exhibit the results on the synthetic data and real data and then conduct ablation studies to examine the effect of every single module of the proposed method.

A. Results on the Synthetic Data
As the ground-truth super-resolved S2 images with the 10-m resolution at all 12 bands are inaccessible, we need to simulate synthetic high-resolution S2 images. However, the simulation is tricky since both the spectral and spatial properties should be simultaneously considered. Fortunately, previous research works have addressed this in a reasonable way. For example, Paris et al. [35] employ the hyperspectral images captured by the NASA Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor [59]. The detailed steps can be seen in [35, Sec. IV-A] and [23, Sec. IV-B]. We restate the key steps here for the readers' convenience. In this work, following [35], we select four AVIRIS images corresponding to spatial regions of a city (at 3.5-m GSD), a coast (at 5-m GSD), crops (at 3.2-m GSD), and a mountain (at 5-m GSD). The first step is to spatially filter those AVIRIS images with a Gaussian kernel (with a standard deviation of 1.5 for the city data, 1.2 for the coast and mountain data, and 1.6 for the crop data). The second step is the spatial downsampling (with a factor of 3 for the city data, 2 for the coast and mountain data, and 3 for the crop data). After this stage, the hyperspectral images approximately reach the 10-m GSD. The AVIRIS hyperspectral sensor provides 224 narrow contiguous spectral bands from 0.4 to 2.5 μm [59], covering the spectral range of Sentinel-2 images. Thus, the final step is to simulate the spectral properties of S2 images by applying their spectral response to the 224-band hyperspectral images. Consequently, we obtain 12-band simulated ground-truth S2 images (denoted as "AVIRIS-City," "AVIRIS-Coast," "AVIRIS-Crops," and "AVIRIS-Mountain," respectively). Their spatial sizes are all 406 × 108 at the 10-m bands.
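The spatial part of this ground-truth synthesis (band-wise Gaussian filtering followed by decimation) can be sketched as follows; `simulate_gt` is a name introduced for this sketch, and the spectral-response step is omitted:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def simulate_gt(hsi, sigma, factor):
    """Bring a finer hyperspectral cube to roughly 10-m GSD: Gaussian
    filtering per band, then spatial decimation by the given factor
    (e.g., sigma=1.5 and factor=3 for the city scene)."""
    blurred = np.stack([gaussian_filter(hsi[:, :, k], sigma)
                        for k in range(hsi.shape[2])], axis=2)
    return blurred[::factor, ::factor, :]
```

The spectral-response projection onto the 12 S2 bands would then be applied to the decimated cube.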
In the meantime, three other simulated ground-truth images⁵ from [33] are also taken as the ground truth. Two of them are generated from the HYDICE images of the Washington DC Mall and Terrain (denoted as "HYDICE-WDC" and "HYDICE-Terrain," respectively) with a 2.8-m spatial resolution, and the remaining one comes from the airborne prism experiment (APEX) [60] image of Baden (denoted as "APEX-Baden") with a 1.8-m spatial resolution. The spatial sizes of these three synthetic S2 images at the 10-m bands are 96 × 96. These three ground-truth images are generated by downsampling the hyperspectral images to a spatial resolution of approximately 10 m and selecting 12 bands with the same wavelengths as the S2 satellite.
Then, the simulated S2 images are generated from the synthetic ground-truth data following the degradation model in (2). Specifically, the size of the Gaussian blur kernels is 10 × 10 across the different low-resolution bands, and the standard deviations are set per band according to ESA's data quality report on Sentinel-2 satellite products [32]. We remark that we also set the kernel size to 10 × 10 when running the compared methods in [11], [33], and [35] for a fair comparison. For each band, Gaussian noise is added to control the SNR to 40 dB.
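Adding noise at a prescribed SNR works by scaling the noise variance to the signal power; a small sketch (function name and the fixed seed are assumptions of this sketch):

```python
import numpy as np

def add_noise_snr(band, snr_db, rng=np.random.default_rng(0)):
    """Add white Gaussian noise so the band reaches a target SNR in dB:
    noise power = signal power / 10^(SNR/10)."""
    p_signal = np.mean(band ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    return band + rng.normal(0.0, np.sqrt(p_noise), band.shape)
```

At 40 dB the noise power is four orders of magnitude below the signal power, so the perturbation is mild.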
After having the pairs of ground-truth images and simulated S2 images, we utilize five quality metrics to quantitatively measure the results of different methods: 1) the SSIM [55]; 2) the root-mean-square error (RMSE); 3) the signal-to-reconstruction-error ratio (SRE); 4) the universal image quality index (UIQI) [61]; and 5) the spectral angle mapper (SAM). For i ∈ s_2 ∪ s_6, we denote the ith band of the ground truth as X_i ∈ R^(M×N) and the ith band of the super-resolved result as X̂_i ∈ R^(M×N). Then, the SSIM value of the ith band is computed via

SSIM_i = ((2 μ_x μ_x̂ + c_1)(2 σ_xx̂ + c_2)) / ((μ_x² + μ_x̂² + c_1)(σ_x² + σ_x̂² + c_2))

where μ_x and μ_x̂ are the average values of X_i and X̂_i, respectively, σ_x, σ_x̂, and σ_xx̂ are the standard deviations of X_i and X̂_i and their covariance, respectively, and c_1 and c_2 are two constants, respectively, defined as (k_1 L)² and (k_2 L)² with the dynamic range L of the pixel values and the constants k_1 and k_2.⁶ Then, the SRE value of the ith band is computed as

SRE_i = 10 log_10( ‖X_i‖_F² / ‖X_i − X̂_i‖_F² ).

Next, the UIQI value of the ith band is computed as

UIQI_i = (4 σ_xx̂ μ_x μ_x̂) / ((σ_x² + σ_x̂²)(μ_x² + μ_x̂²)).

We compute the average values of the SSIM, SRE, and UIQI across the bands in s_2 ∪ s_6. After denoting the ground truth as X ∈ R^(M×N×B) and the super-resolved result as X̂ ∈ R^(M×N×B), the RMSE and SAM values are, respectively, computed via

RMSE = ( ‖X − X̂‖_F² / (MNB) )^(1/2)

and

SAM = (1/(MN)) Σ_{m,n} arccos( ⟨X(m, n, :), X̂(m, n, :)⟩ / (‖X(m, n, :)‖_2 ‖X̂(m, n, :)‖_2) ).

In Table II, we exhibit the quantitative metrics of the super-resolved results by different methods. We can see that our method is comparable to NSTMR on the data simulated from the AVIRIS hyperspectral images, as they rank in the first and second places in most cases. As for the data simulated from the HYDICE and APEX hyperspectral images, our method achieves the best performance on all the quantitative metrics, while NSTMR and SupReME alternately obtain the second-best values. We can also see that the performance of DSen2 is not the best, although it employs a CNN with massive parameters.
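The simpler of these metrics can be sketched directly in NumPy (SSIM and UIQI are omitted here for brevity; function names are introduced for this sketch):

```python
import numpy as np

def sre(x, xh):
    """Signal-to-reconstruction-error ratio in dB for one band."""
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum((x - xh) ** 2))

def rmse(x, xh):
    """Root-mean-square error over the whole cube."""
    return np.sqrt(np.mean((x - xh) ** 2))

def sam(x, xh, eps=1e-12):
    """Mean spectral angle (radians) between per-pixel spectra."""
    a = x.reshape(-1, x.shape[-1])
    b = xh.reshape(-1, xh.shape[-1])
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) *
                                   np.linalg.norm(b, axis=1) + eps)
    return np.mean(np.arccos(np.clip(cos, -1.0, 1.0)))
```

SRE and RMSE penalize per-pixel intensity errors, whereas SAM isolates spectral distortion: it is zero whenever each estimated spectrum is a positive scaling of the true one.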
This further validates our earlier analysis that DSen2 cannot fit the degradation model (2) well, since all the data in this subsection are degraded in accordance with (2).
In Figs. 2–5, we exhibit the pseudo-color images composed of three bands of the results on AVIRIS-City, AVIRIS-Crops, HYDICE-WDC, and HYDICE-Terrain, respectively. Considering that the spatial sizes of AVIRIS-City and AVIRIS-Crops are relatively large, we only clip square areas of the size 108 × 108 from the results for better visualization and space saving. The selected bands for the pseudo-color images always include at least one 60-m band. In the meantime, the corresponding error images are also shown. From Figs. 2 and 3, we can see that the errors are generally related to edges. For AVIRIS-City, SupReME and our SelfS2 perform well on the B5 band, whereas MuSA and our SelfS2 obtain better results on the B9 band. For the 60-m band B1, the error of the result by our method is the lowest. For AVIRIS-Crops, all the compared methods perform well on the B6 band, while our method obtains the lowest errors on both the B1 and B11 bands. In Fig. 4, the results on the B6 and B7 bands by our method and SupReME are the best, and our method super-resolves the B9 band (60 m) of HYDICE-WDC better than the compared methods. It can be easily found in Fig. 5 that our method achieves the best performance on all of the illustrated bands. From these visualizations, we can see that the superiority of our method is more obvious on the 60-m bands.

B. Results on Real Data
In this part, we test all the methods on two real datasets: Verona and Malmo. Their spatial size is 180 × 180 at the 10-m bands. For the real data, we only display the visual results. The pseudo-color images composed of three bands on Malmo and Verona are shown in Figs. 6 and 7, respectively. We can see that all the methods work on the real data and generate sharper results than the bicubic interpolation.
From Fig. 6, it can be found that the result by MuSA is so smooth that some image details are blurred. We can also see color distortions in the result by DSen2, and its result looks a little blurry. The color styles of the results by the bicubic interpolation, MuSA, and DSen2 tend to be similar, and this color difference would be related to their spatial blurring, as these three results are visually blurry. SupReME, NSTMR, and our SelfS2 all obtain good results. We can see from the area boxed by the red-dashed line that our result is clearer.
In Fig. 7, we highlight two areas of all the results with red-dashed circles. The results by DSen2 and MuSA are blurrier on this real data than those of the other methods. We can see from the center of the large red-dashed circles that MuSA, NSTMR, and our SelfS2 preserve the orange round area well, and only our method and NSTMR fully reconstruct the small line, composed of red and cyan blocks, right above the orange round area. From the small red-dashed circles, the color distortion of the result by NSTMR is obvious compared with the bicubic interpolation. More specifically, the dot, which is orange in the bicubic interpolation result, tends to be red in the result of NSTMR.

C. Ablation Study and Parameter Analysis
In this part, we conduct the ablation study and parameter analysis to test the effects of the important modules of our framework.

1) Network Input: As the initial work on the deep image prior [44] adopts a generator network with a random input, we test our method with a network input randomly drawn from a Gaussian distribution with the standard deviation σ = 1/10. Meanwhile, we also adopt the bicubic interpolation and the results by MuSA, SupReME, and NSTMR as the network input. Table III shows the results of our method with different network inputs. We can see that, for this S2 super-resolution problem, the bicubic interpolation is more suitable than the random input. After being projected into the low-dimensional subspace detected by HySime [48] (although this projection step may remove some spectral anomalies), the bicubic interpolation helps our method obtain the best performance.
2) Loss Function: The loss function rectifies the deviation between the network output after degradation and the observation, enforcing the network output to obey the degradation model. To test the effect of different types of loss functions, we select four frequently used loss functions, i.e., the ℓ1 loss, the ℓ2 (also known as the mean square error) loss, the SSIM loss, and the loss in [63]. All these loss functions are computed between M ⊙ (B ⊛ f_θ(Z)) and Y as in (8). In Table IV, we report the quantitative metrics of the results by our method with different loss functions and their combinations. We can see that the SSIM loss is helpful for obtaining better SSIM and UIQI values. Using the combination of the ℓ1 loss and the SSIM loss achieves the best performance.

3) Network Structure: As reported in [44], the priors expressed by different network structures could be slightly different, and both the U-net-like (or hourglass) network and the ResNet are adequate for natural images. Thus, in this part, we test our method with these two typical network structures. In the meantime, three types of basic convolution blocks are tested: the common 2-D convolution, the 3-D convolution, and the separable 3-D convolution. Table V shows the performance of our method with different network structures. We can see that, when the overall structure is the hourglass, using the 3-D convolution achieves better performance than using the separable 3-D convolution. The ResNet outperforms the hourglass network, and it obtains the best result when using the separable 3-D convolution.

4) Network Parameter Initialization: In Section II-D, we elaborate a parameter initialization strategy in which the network parameters are first initialized by fully utilizing the high-resolution bands. Table VI shows the results of our method with this strategy versus directly using random initialization of the network parameters. We can see that random initialization also works, while our initialization strategy largely improves the performance. This comparison reveals that how the high-resolution bands are exploited for super-resolving the low-resolution bands can be of great importance in the S2 super-resolution task.
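The spirit of this initialization, fitting parameters on the existing 10-m bands before the self-supervised stage, can be illustrated with a toy stand-in: degrade the high-resolution bands, then fit a mapping from the degraded version back to the originals. The CNN is replaced here by a simple least-squares linear map across bands, and `init_from_highres` is a hypothetical name, so this is only a sketch of the idea, not the paper's procedure.

```python
import numpy as np
from scipy.ndimage import zoom

def init_from_highres(hr_bands, scale=2):
    """Fit parameters mapping a self-degraded version of the 10-m
    bands back to the originals (linear stand-in for pretraining)."""
    b = hr_bands.shape[0]
    # Simulate the degradation: downsample, then bicubic back up.
    lr = zoom(hr_bands, (1, 1 / scale, 1 / scale), order=3)
    lr_up = zoom(lr, (1, scale, scale), order=3)
    A = lr_up.reshape(b, -1)          # degraded bands (bands, pixels)
    B = hr_bands.reshape(b, -1)       # target bands
    # Least-squares linear map acting across bands.
    return B @ np.linalg.pinv(A)

W = init_from_highres(np.random.rand(4, 16, 16))
print(W.shape)  # (4, 4)
```

In the actual method, the fitted object would be the CNN's weights rather than a band-mixing matrix, but the training signal comes from the same place: the fine detail already present in the four 10-m bands.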

5) Weighting Parameters in Loss Function: Our loss function is constructed as a combination of the ℓ1 loss and the SSIM loss. The weighting parameters α1 and α2 are set to 1 throughout all the experiments except for this part. We conduct an experiment to examine the performance of our method with different values of α1 and α2. First, we fix α2 = 1 and vary α1 from 10^−2 to 10^2. Then, α1 is fixed as 1 and the value of α2 varies. The SSIM values of the results are shown in Fig. 8. We can see that our method is more sensitive to the value of α2, and the best performance is obtained when α1 = α2 = 1.
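A minimal sketch of the weighted combination α1·ℓ1 + α2·(1 − SSIM), assuming the SSIM term enters the loss as 1 − SSIM so that it is minimized at perfect similarity. The single-window `ssim_global` below is a simplification of the usual sliding-window SSIM, and both function names are hypothetical:

```python
import numpy as np

def ssim_global(a, b, c1=1e-4, c2=9e-4):
    """Global (single-window) SSIM, a simplified illustration."""
    mu_a, mu_b = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a**2 + mu_b**2 + c1) * (va + vb + c2))

def combined_loss(pred, target, alpha1=1.0, alpha2=1.0):
    """alpha1 * l1 + alpha2 * (1 - SSIM); both weights are 1 in the
    paper's default setting."""
    l1 = np.abs(pred - target).mean()
    return alpha1 * l1 + alpha2 * (1.0 - ssim_global(pred, target))

x = np.random.rand(8, 8)
print(combined_loss(x, x))  # 0.0 for identical images
```

Sweeping `alpha1` (or `alpha2`) over 10^−2 to 10^2 while fixing the other at 1 reproduces the experiment's setup.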
6) Network Input Regularization Parameter: In this article, we adopt the same network input regularization strategy as [44], i.e., adding zero-mean Gaussian noise with a fixed standard deviation to the input. This strategy has been shown to yield better image recovery and super-resolution results [44]. In this part, we conduct experiments with different values of the standard deviation σ. Meanwhile, we also consider parameterizing the input noise by its SNR in dB. Table VII shows the performance of our model with different standard deviations. We can see that large values of σ bring about inferior results, while small values help our method generate satisfactory results. Although parameterizing the noise with different SNRs can obtain the best results, we still choose to add zero-mean Gaussian noise with a fixed standard deviation, since this strategy is more stable across different datasets.
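The two parameterizations above differ only in how σ is obtained: fixed directly, or derived from a target SNR via σ = sqrt(mean(z²) / 10^(SNR/10)). A small sketch (the function name `perturb_input` is hypothetical):

```python
import numpy as np

def perturb_input(z, sigma=None, snr_db=None, rng=None):
    """Add zero-mean Gaussian noise to the network input, with either
    a fixed standard deviation (the paper's choice) or a sigma
    derived from a target SNR in dB."""
    rng = np.random.default_rng() if rng is None else rng
    if sigma is None:
        # SNR (dB) -> noise power: sigma^2 = P_signal / 10^(SNR/10).
        sigma = np.sqrt(np.mean(z**2) / 10 ** (snr_db / 10))
    return z + rng.normal(0.0, sigma, z.shape)

z = np.ones((64, 64))
z_fixed = perturb_input(z, sigma=0.1)     # fixed-sigma strategy
z_snr = perturb_input(z, snr_db=20)       # SNR-based strategy
```

For this unit-power input, an SNR of 20 dB corresponds to σ = 0.1, so the two calls are equivalent here; on data with varying dynamic range the SNR-based σ changes with the signal, which is why the fixed-σ variant is the more stable choice across datasets.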
7) Running Time: In Table VIII, we report the running time (in seconds) of all the competing methods on the synthetic datasets. As a self-supervised learning method, our method learns the network parameters solely from the observed data, so the network requires many iterations to converge and is therefore time consuming.

IV. CONCLUSION
In this article, to fulfill the super-resolution of S2 images, we propose a novel method resorting to the deep image prior. The structure of the selected CNN can well capture low-level image statistics, and the network parameters are obtained solely from the observed MSI, making the method adaptive to different scenes. The degradation of S2 images is taken into account via the loss function, making the network output accord with the degradation model. Moreover, as the observed S2 images contain four high-resolution bands, a novel network parameter initialization strategy is designed to further utilize the fine features within those bands. After this initialization, the network parameters are learned in a self-supervised manner solely from the observed S2 image. Experiments are conducted on simulated and real S2 data, and the comparison with state-of-the-art methods shows the effectiveness of our method. However, as our method needs to infer all the parameters solely from the observed MSI in a self-supervised manner, it is quite time consuming. Thus, we will consider accelerating our method in the future.