Sentinel-2 60-m Band Super-Resolution Using Hybrid CNN-GPR Model

Sentinel-2 image super-resolution (SR) has proven advantageous in multiple data analysis pipelines, leading to a more comprehensive assessment of different environment-related metrics. This research aims to provide a method for super-resolving the 60-m bands provided by Sentinel-2 up to 10-m spatial resolution, using Gaussian process regression (GPR). While common GPR methods directly operate on raw data using carefully designed kernels, we propose a convolutional neural network (CNN)-based feature extraction kernel to directly process the input 10-m patches, applied in constructing the elements of the integrated covariance matrices. For each scene, a small number of training patches are sampled to optimize the CNN parameters and to construct the predictive mean function, the latter being further used for predicting super-resolved pixels for new input areas. We prove that our method is a reliable SR mechanism by assessing its performance both quantitatively, using metrics against other methods from literature, and qualitatively, through visual analysis of the results.

deep-learning-based methods [1], [8]. While the latter have proven to generalize well for a variety of environments, they require extensive training using synthetic data to capture the wide variety of spatial and spectral relationships. On the opposite end, previous methods relied on optimizing their internal parameters separately for each scene, fusing the information contained in the high-resolution bands with the radiometric properties of low-resolution ones, along with incorporating sensor-specific degradation-based operators to mimic the construction of low-resolution bands.
In this letter, we propose a Gaussian process regression (GPR) method for super-resolving the 60-m Sentinel-2 bands up to 10-m spatial resolution. The proposed method will incorporate a feature extractor based on a convolutional neural network (CNN) to extract high-resolution features, further used for constructing the covariance matrices through which the predictive mean can be computed for new input locations. To the best of our knowledge, this is the first SR method for Sentinel-2 bands based on GPR. The rest of this letter is organized as follows. Section II provides a theoretical description of GPR, followed by the proposed method. Section III discusses the validation process for our model, along with a comparison to other SR methods. Section IV highlights the concluding remarks and future developments.

A. Gaussian Process Regression (GPR)
Let us consider an unknown function f : X → Y which we are trying to model using GPs. The set of input points and their observed output will be denoted as X = {x i } 1≤i≤N and Y = {y i } 1≤i≤N , respectively, where x i ∈ X and y i ∈ Y. Constructing a distribution over function values f (x i ) requires defining a mean function m(x i ) and a covariance function k(x i , x j ), formally written as f (X) ∼ GP(m(X), k(X, X)), where k(X, X) = {k(x i , x j )} 1≤i, j≤N denotes the covariance matrix between all pairs of elements from X. Setting a prior over function f (usually Gaussian, for regression analysis) allows for sampling function values at different locations in its domain, given the mean and covariance functions [9]. For a set of new input points X * , the inference of its corresponding output set Y * is performed by defining a joint Gaussian distribution on the values of f (X) and f (X * ), followed by conditioning it on the initially known input-output pairs (x i , y i ), as follows: Equation (2) denotes the posterior distribution, while (3) and (4) give the predictive mean and covariance values for the new input points X * . In some scenarios, it is reasonable to assume the existence of underlying i.i.d. noise ϵ ∼ N (0, σ n I ) for observations Y, that is, Y = f (X) + ϵ, which leads to the adjustment k(X, X) → k(X, X) + σ n I in the previous equations. A prior of m(X) = 0 is usually assumed for computational ease. While powerful methods, GPs suffer from a computational bottleneck associated with the inversion of an N ×N matrix (3) and (4), scaling with the number of points N as O(N 3 ). Multiple techniques have been previously developed to find a good tradeoff between the power of representation and time complexity for GPR [10], the majority of which is addressing the characteristics of k(X, X) and its inversion process. One of the most common techniques works by constructing a joint distribution over the function values for a set Z = {z i } 1≤i≤M , z i ∈ X , of M ≪ N inducing points, further used for establishing a family of posterior distributions by conditioning the prior of f on these M function values [11], [12]. These latter methods make use of the approximationk(X, X) ≈ k(X, Z)k(Z, Z) −1 k(X, Z) T to reduce the computational expense of computing the predictive mean and covariance functions, resulting in a complexity of O(N M 2 ).
An important step in constructing representative GP models is the choice of covariance (kernel) function k(x i , x j ), which encodes the similarity between input points x i , x j ∈ X . The most commonly used covariance function is the squared exponential, defined as k( which is an isotropic function as it only depends on |x i − x j |. Most of these functions include a set of hyperparameters θ k (e.g., the squared exponential kernel has θ k = {σ, l}) which needs to be tuned s.t. the posterior distribution fits the training data as good as possible. Titsias [12] provides a way for learning both the hyperparameters θ k and the inducing inputs Z, which are considered to be variational parameters during the maximization of a variational lower bound to the marginal likelihood. In [11] and [13], the Sparse Variational Gaussian Process (SVGP) model provides a way to optimize the variational parameters (hyperparameters + inducing variables) by following the ascent direction given the natural gradients of the evidence lower bound (ELBO) of p( f (X)|Y), written in a form that allows for mini-batch optimization. The SVGP uses variational inference s.t. p( f (X)|Y) is approximated with a variational term q( f (X)), which in conjunction with the usage of inducing variables Z, results in the following maximization objective: which allows for computing the partial derivatives with respect to the kernel hyperparameters θ k and inducing points Z. Max-imizing L implies maximizing the first term-optimizing q(·) s.t. it becomes a good approximation for f (·) on all input locations X-and minimizing the Kullback-Leibler divergence, which serves the same goal for the inducing points Z.

B. Proposed Method
The proposed framework is illustrated in Fig. 1. Our method is based on the previously described SVGP model, optimized for learning an input-output mapping from 10-m k × k patches, extracted from available 10-m bands B2, B3, B4, and B8, to super-resolved 10-m pixels values for 60-m bands B1 and B9. More precisely, given the 10-m patches, the model is trained to produce a residual map, which added over bicubically upsampled 60-m bands should yield valid 10-m predictions. The idea was to retain the radiometric distribution of the original 60-m bands (through bicubic interpolation) while increasing their high-resolution content.
Directly comparing pixel values through common kernel functions would not yield results robust with respect to translation and rotation, thus falling short of being a good proxy for the similarity between different patches. Driven by these shortcomings, we propose to include in the learning process an additional feature extractor based on a CNN, to obtain 1-D representations for 10-m input patches. The intrinsic characteristics of the convolution allow for obtaining similar embeddings for relatively close spatial regions, and also for areas exhibiting local similar patterns, driving these representations to be translation/location-invariant. This further allows for constructing covariance matrices that better encode pair-wise similarities between different patches. Let us denote our neural network function parametrized by θ net as T θ net (·) : R k×k×4 → R d e , acting on concatenated k × k patches from all 10-m bands and computing d e -dimensional embeddings. Given two different inputs x i , x j ∈ R k×k×4 , their corresponding entry in the covariance matrix is now computed as k(T (x i ), T (x j )), where k(·, ·) is chosen to be the commonly used squared exponential. The set of parameters to be optimized is now extended as θ = {θ k , θ net }, which, along with the inducing point Z, will be modified according to the ascent direction of the loss described in (5).
To optimize the previously discussed set of parameters, the model was trained in a reduced-resolution context by generating synthetic input-output pairs, with spatial resolutions of 60, and 10 m, respectively. Given a set of 10-m input bands X 10 ∈ R H ×W ×4 , one can simulate the degradation process by a factor of 6 using the sensors' modulation transfer function (MTF), which encompasses a depth-wise convolution operation between X 10 and Gaussian kernel g σ with variance σ , followed by a depth-wise convolution with a 6 × 6 averaging filter d, applied with stride 6 where X 10 ↓ 6 denotes the simulated 60-m bands, and the variance for the Gaussian kernel is chosen as σ = 3, following the protocol described in [14]. For a k × k × 4 patch extracted from X 10 ↓ 6 , the prediction target was chosen to be the corresponding center pixel from 60-m bands (see Fig. 1 for illustration of the process). Since the model predicts residual values to be added over bicubically magnified low-resolution bands, the target values were constructed by first degrading the 60-m bands (using (6)), followed by applying ×6 bicubic upsampling and subtracting the result from the original 60-m bands.
For each scene, a number of patches have to be sampled for training the SR model, balancing the representational power and the computational complexity of k(X, Z). Given that these N training points should be as representative as possible for the current scene, we propose a sampling process based on perpatch variance, to avoid near-flat areas: 1) compute the variance of all p × p nonoverlapping 60-m patches; 2) normalize each to their sum and construct a discrete probability distribution over their center locations by assigning the normalized variances to their probabilities; and 3) sample N points from the constructed distribution to serve as training data.

III. EXPERIMENTS A. Sentinel-2 Data and Evaluation Metrics
To validate the performance of our model, we used Level-1C products provided by Sentinel-2, each spanning an approximate area of 100 × 100 km 2 . Three areas were selected for performance assessment, further referenced by their geographic location: Bucharest, Romania 1 ; coastline of the Tyrrhenian sea, Italy 2 ; Dilo, Ethiopia. 3 As for the numerical evaluation framework, we adopted Wald's protocol [15] for assessing the performance in reduced-resolution conditions. 1  This coincides with how we create our synthetic training data, taking the 60-m bands as targets and the degraded 10-m bands as inputs. Since the target consists of 60-m spatial resolution data, we further denote this process as 360-60 m SR. Our target represented a residual value, which, added over the bicubically upsampled 60-m bands, yielded the 10-m superresolved response. Before any processing, all bands were divided by their corresponding maximum value. To measure the error between our prediction and the original 60-m bands, we used as evaluation metrics root-mean-square error (RMSE), signal-to-reconstruction error (SRE), and spectral angle mapper (SAM).

B. Performance Comparison and Analysis
For the SR framework presented in Fig. 1, we choose the following architecture for the CNN implementing feature extraction T θ net (·): three 2-D convolutional layers, with 16-32-64 filters of spatial dimension 3 × 3, followed by two fully connected layers with 256-128 units (d e = 128). ReLU activation is used after each layer, except for the last one. The simplicity of the chosen architecture is motivated by two factors: the use of relatively small patches for training-to avoid taking into account redundant spatial details for prediction-and to bypass the possibility of overfitting. Squared exponential is used for measuring the distance between two 128-D embedding vectors. For each of the three Sentinel-2 scenes, we trained a different SR model by sampling N = 8000 input-output pairs, from which we selected M = 200 inducing points by applying k-means on all N sampled points, selecting the k = M centroids as our initial Z values. The models were optimized using 10k iterations of the L-BFGS-B algorithm [16] to adapt all parameters θ. We used patches of size 13 × 13, as this resulted in the best performance. Several methods were used to compare  2. Results on full-resolution evaluation for 60 → 10 m SR. The illustrated super-resolved images cover an approximate area of 6 × 6 km 2 , with the zoomed-in regions covering 1 × 1 km 2 . The first, third, and fourth rows represent results for band B9, while the second is for band B1. The first area is extracted from the Romania tile, the next two from Italy, and the last one from Ethiopia. our model's performance to: bicubic upsampling, SSSS [7], DSen2, and its deeper version VDSen2 [8] and ATPRK [6]. It is important to note that all numerical evaluations were performed by excluding the N sampled training points, ensuring a fair comparison between methods.
Numerical outcomes for reduced-resolution evaluation (360 → 60 m SR) are presented in Table I, for all methods and separately for each Sentinel-2 area. All results are reported in the original range of values (nonnormalized), for each band.
Our method resulted in the best performance in all three areas for band B9, followed by ATPRK with the secondbest results. On band B1, the proposed method gave the best performance on super-resolving Italy and Ethiopia tiles, while achieving the second-best performance for the Romania area, being surpassed by ATPRK. To investigate this, we measured the Pearson correlation between our prediction and ATPRK's on band B1, and the available 10-m bands. This resulted in a mean coefficient of 0.625 for ATPRK and 0.547 for ours, the main difference coming from band B4 where ATPRK resulted in a coefficient of 0.977, while ours in 0.728. Since band B4 is representative of urban areas, along with vegetation crops, and since Romania tile is highly reflected in these characteristics, we concluded that an increased performance for ATPRK was achieved through the ease of controlling its correlation with other 10-m bands. Therefore, ATPRK may be a better fit in the case of areas highly represented by the specifics of an available 10-m band, while falling short in environmentally heterogeneous areas, where linear combinations of these 10-m bands are not sufficient. Note that the results obtained by SSSS for this band are fairly similar to the ones obtained through bicubic interpolation, indicating a deviation from the reflectance distribution of the original band. DSen2 takes third place in almost all evaluations, surpassing its deeper version VDSen2 by a large margin, showcasing better generalization.
In the case of full-scale SR, that is, applying the previous algorithms on the original 10-m bands, the evaluation relies solely on visual inspection for the predicted bands. While any numerical evaluation for SR with no reference can lead to divergent conclusions, several efforts have been conducted to develop such evaluations. One of them is quality no-reference (QNR) [17] which combines spatial and spectral distortion in a single metric, extended in [14] for S2 SR. Table II presents the QNR for full-scale SR, given as percentages. Our method and ATPRK result in the best performances with, however, a very small difference between them. Some visual results sampled from the three scenes are presented in Fig. 2, along with the 10-m RGB representation for detailed comparison. DSen2 results in somewhat blurry details, especially for the last, partially clouded area from the Ethiopian desert. SSSS caused fairly different radiometric distributions for band B1, relative to the original bands, supporting the high RMSE magnitudes for reduced-resolution evaluation from Table I. Our method and ATRPK obtained the best overall visual results, aligning with the observed high-frequency details in the 10-m RGB images and with the results from Table II, thus validating the quality of induced high-resolution information. Finally, we would like to enhance that, even if the QNR can sometimes provide a proxy for the quality of SR results, not relying on a reference image may lead to inconsistent comparisons during the evaluation phase. Thus, greater emphasis should be placed on evaluation protocols that incorporate reference data.

IV. CONCLUSION
In this research, we proposed a Sentinel-2 60-m band SR method using GPR based on a CNN feature extractor.
Our method was optimized on synthetically degraded image patches and was tested in degraded and full-resolution contexts, obtaining top performance in both quantitative and qualitative evaluations. The main drawback of our system is given by the necessity of being optimized separately for each scene, which could otherwise contribute to significant growth of the covariance matrix, leading to impractical solutions. Future advancements cover the development of techniques aimed at increasing the generalization capabilities, for the same model to be used on multiple areas and the use of predicted covariance values in a postprocessing step targeting high-covariance regions.