Hyperspectral Image Superresolution via Subspace-Based Deep Prior Regularization

Hyperspectral imaging provides a finer characterization of material properties than conventional imaging systems. In practice, however, an optical system operating at video rates can only generate data with high spatial resolution but low spectral resolution, or vice versa. As a result, the problem of fusing low-resolution hyperspectral and high-resolution multispectral images has gained great attention. However, most fusion approaches depend purely on either hand-crafted regularizers or data-driven priors, which leads to tricky parameter selection or poor interpretability. In this work, a subspace-based deep prior regularization (SDPR) is proposed to tackle these problems, which takes both the hand-crafted regularizer and the data-driven prior into account. Specifically, we leverage the spectral correlation of the images and transfer them from the original space to the subspace domain, within which a modified U-net-based deep prior learning network (SDPL-net) is designed for the fusion problem. Moreover, instead of taking the output of SDPL-net directly as the result, we feed the output back to the model-based optimization. Under such prior regularization, the recovered high-resolution hyperspectral image is highly consistent with its inherent structure and hence tends to be more reliable and accurate. Experimental results on simulated and real data show that the proposed method outperforms other state-of-the-art methods in both quantitative metrics and visual quality.


I. INTRODUCTION
HYPERSPECTRAL images (HSIs) comprise many bands acquired by the sensor at different wavelengths, which convey information about real scenes more reliably than conventional images with only a few bands. The rich spectrum of HSIs can facilitate the characterization of imaging scenes and greatly improve the performance of numerous computer vision applications, e.g., object detection, classification, tracking, and segmentation [1], [2], [3], [4], [5], [6], [7], [8]. However, owing to the finite incident energy, there exists a crucial tradeoff between spatial and spectral resolution in real-world applications. Specifically, a typical optical system may only generate images with high spatial resolution but a limited number of spectral bands [i.e., high-resolution multispectral (HRMS) images], or images with plenty of spectral bands but a low spatial resolution [i.e., low-resolution hyperspectral (LRHS) images] [9]. Therefore, the research topic of HSI superresolution, which fuses these two kinds of images to produce high-resolution hyperspectral (HRHS) images, has received much attention [10], [11]. In practice, following a physical degradation model, the input LRHS image and HRMS image are considered as spatially and spectrally degraded observations (i.e., linearly down-sampled versions) of the latent HRHS image, respectively. Commonly, HSI fusion is a highly ill-posed problem due to large scaling factors in both the spectral and spatial dimensions. Most traditional approaches incorporate a certain hand-crafted prior to bound the solution space, which holds the potential of achieving acceptable performance without the support from extra training samples [12], [13].
According to the structure of the prior knowledge employed in optimization, available methodologies can be broadly categorized into three types: spectral unmixing-based methods [12], [14], [15], [16], sparse representation-based methods [17], [18], [19], [20], [21], and tensor decomposition-based methods [13], [22], [23], [24], [25], [26], [27]. However, these prior constrained approaches are still inherently flawed in three ways. First, the involved optimization often requires excessive iterations, which further demands prolonged time to recover an HRHS image. Second, a single prior tends to express only one aspect of the visual properties, hence is insufficient to fully reveal the complex structure of HSIs. Third, the blend of multiple hand-crafted priors may pose further hardships in dealing with the concerned optimization issue.
In recent years, inspired by the strong fitting capacity of deep learning (DL) in many computer vision applications, many convolutional neural networks (CNNs) have been introduced to the superresolution field [28], [29], [30], [31], [32], [33]. In comparison to optimization methodologies based on a predefined regularizer, the DL approaches demand few assumptions on the prior knowledge of general HSIs and instead learn latent information from the training data in an end-to-end fashion. However, these approaches often ignore the blurring and down-sampling operators as well as the spectral response function in the degradation process, even though these components have a clear physical interpretation that relates the LRHS and HRMS images to the HRHS image. To relieve this problem, more recent end-to-end methods [34], [35], [36] succeed in improving the quality of the recovered images by learning the spectral response function and the down-sampling operator. Moreover, several deep prior-based methods have lately been presented [37], [38], [39], which marry the merits of both degradation-model-based and DL approaches.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Despite their encouraging performance, DL-based superresolution methods still suffer from several drawbacks. The most critical one is the neglect of explicit priors, i.e., inherent properties of most HSIs such as the spectral low-rank property [40]. This may cause the network output to deviate from the general prior configuration and degrade the recovery accuracy. Another significant challenge facing current deep prior-based methods is overfitting. Because of labor and hardware costs, only a limited number of paired LRHS and HRMS images can usually be collected, so a mismatch between training and test data often exists in practical scenarios and overfitting tends to occur. A further drawback lies in the need to design the network specifically for two inputs that differ in size and features, which inevitably results in complex architectures, large numbers of parameters, and information distortion.
Against the abovementioned issues, we attempt to perform hyperspectral fusion by using jointly the handcrafted prior as well as the data-driven knowledge, which naturally holds the opportunity of providing better performance and meanwhile generalizing well [10], [41]. Note that the joint consideration also taps the potential of using simpler network architecture for the fusion problem. The specific and primary contributions of this work can be highlighted as follows.
1) Since the bands of an HRHS image are highly correlated, the spectral vectors often live in a low-dimensional subspace. On that basis, we propose a subspace-based deep prior regularization (SDPR) to recover the HRHS images using two factor matrices. The first is produced via a simple singular value decomposition (SVD) of the LRHS image, which leverages the spectral correlation and retains the low-rank property. The second matrix can be readily learned by a relatively concise network owing to the reduced number of bands.
2) To avoid the hand-crafted regularizer used in most traditional approaches, we develop a U-net-based model, namely SDPL-net, to learn deep priors in the subspace domain instead of the original one. Note that to highlight the individual features of each band, the residual channel attention block (RCAB) [42] is further introduced, permitting the low-frequency structures to be bypassed and enabling more concentration on high-frequency information. As a result, the recovered HRHS image complies closely with the intrinsic visual structure and offers better reliability, benefitting from both the explicit low-rank property and the implicitly learned prior.
3) Instead of using SDPL-net directly as the outcome producer, we feed its output, which behaves as a prior-learned regularizer, back to the optimization-based model. Such a prior-inspired method enjoys promising generalization capability. When the spatial and spectral information in the training and test images is constrained, the proposed method can automatically extract low-rank and deep priors from latent HRHS images. For tougher circumstances in which data are captured from diverse optical scenarios, or even by different sensors with varying spatial and spectral information, SDPR can be readily generalized to new test samples and still performs well.
The rest of this article is organized as follows. The relevant literature is reviewed in Section II. In Section III, the proposed HSI fusion method is described in detail. Section IV presents and analyzes experiments on several publicly available datasets. Finally, Section V concludes this article and outlines some future investigations.

II. RELATED WORK
This section provides a brief review of several existing approaches relevant to our work.

A. Spectral Unmixing Based Methods
Incorporating priors into spectral unmixing has been shown experimentally to benefit HSI fusion, mostly under certain constraints such as nonnegativity and sum-to-one. The work [14] develops a coupled nonnegative matrix factorization for alternately unmixing LRHS and HRMS images in order to recover the HRHS one. With a similar framework, [15] decouples the optimization problem into two constrained least-squares problems, followed by a solver that jointly unmixes the two input images. Concerning the endmembers and their abundances, in [16], the latent HRHS image is reconstructed by employing the alternating direction method of multipliers [43]. By fully exploiting the specific properties of matrix decomposition, a four-stage fusion framework dubbed MDF is proposed in [12].

B. Sparse Representation Based Methods
Sparse representation is another promising scheme for the fusion of LRHS and HRMS images, which sparsely encodes the latent HRHS image using a proper spectral dictionary acquired from the inputs. For instance, [17] merges hyperspectral and multispectral images following a variational framework, in which a sparse regularizer is designed by factorizing the scenes over a given dictionary. More recently, a nonnegative sparse coding scheme is proposed [18], which employs not only the pixelwise sparsity but also the nonlocal spatial similarity, yielding an improved performance. Similarly, in [20], a fusion method based on nonlocal low-rank tensor approximation (LRTA) and sparse representation is given, in which the sparsity of abundances is highlighted by a delicately devised constraint. Depending on a newly structured low-rank representation, the approach in [21] represents images as linear combinations of bases in an appropriately learned dictionary, inducing sparse property inferred from certain subspace decomposition of the affinity matrix.

C. Tensor Decomposition-Based Methods
Another crucial technique for HSI fusion relies on tensor decomposition, which has been widely employed to reveal the correlations among different modes. In [22], the fusion issue is formulated as a coupled sparse Tucker decomposition, alternately updating dictionaries of three modes as well as the sparse core tensor. The work in [23] develops a low tensor train rank for the learning of correlations among spatial, spectral, and nonlocal modes of all analogous cubes. A novel subspace-based low-tensor multirank regularization (LTMR) is proposed in [24], which leverages both the spectral correlation and the nonlocal similarity in HRHS images. Through a coupled tensor canonical polyadic decomposition, [25] probes into the intrinsic relationship between HRHS and LRMS images. The approach in [26] develops a novel low-rank tensor ring decomposition for HSI superresolution. By exploiting the idea from the field of tensor completion, [13] develops an LRTA method to directly enforce spatial and spectral low rankness while avoiding computationally complicated steps, such as patch clustering and dictionary learning.

D. End-to-End Learning Methods
In [34], an unsupervised adaptation learning (UAL) method is proposed, which develops a two-stage CNN to leverage characteristics of latent HRHS targets and estimate the unknown down-sampling operator simultaneously. With the aid of attention mechanism and pixel shuffling, in [35], a simple yet effective CNN architecture is presented holding a high reliability of outputting spatially high-quality details. In [36], a new fusion model is formulated that considers both spectral and spatial degradation. Accordingly, a deep CNN dubbed MHF-net is raised to tackle the fusion issue iteratively. Unfortunately, these end-to-end learning methods still have evident drawbacks. For instance, given limited samples, the data-driven approaches are sensitive to variations in spatial and spectral information, suffering severely from the mismatch between training and test sets [35]. Such an overfitting problem happens frequently, since real data are generally acquired under changing environmental settings or even with different sensors.

E. Deep Prior-Based Methods
The work in [37] shows that neural networks can naturally capture the visual prior implicitly. On that basis, acceptable results can be guaranteed in standard inverse problems, such as denoising, superresolution, and inpainting. Following this idea, in [38], a deep hyperspectral image sharpening (DHSIS) method is presented to directly learn deep prior via CNN-based residual learning. In [39], a novel network architecture is developed to leverage both the spectral and spatial characteristics of latent HRHS images, whose outputs serve to regularize the superresolution issue. Albeit with acceptable performance, the explicit priors, such as spectral correlation, possessed by most HSIs, are often ignored by such "black-box" deep models, leaving considerable room for further performance improvements. Our approach is inspired by this line of research. As a whole, we aim to construct a framework that incorporates the explicitly hand-crafted regularizer and the implicitly learned prior in our investigation.

A. Preliminaries
The desired HRHS image is denoted as $\mathcal{X} \in \mathbb{R}^{W \times H \times S}$, where $W$, $H$, and $S$ are the width, height, and band number of the target image, respectively. Accordingly, $\mathcal{Y} \in \mathbb{R}^{w \times h \times S}$ denotes the observed LRHS image with $w \times h$ pixels ($w < W$, $h < H$) and $S$ spectral bands. Following the physical degradation model, $\mathcal{Y}$ is a spatially degraded observation of $\mathcal{X}$:

$$\mathbf{Y}_{(3)} = \mathbf{X}_{(3)} \mathbf{B} \mathbf{S} \tag{1}$$

where the matrices $\mathbf{X}_{(3)} \in \mathbb{R}^{S \times WH}$ and $\mathbf{Y}_{(3)} \in \mathbb{R}^{S \times wh}$ are acquired by unfolding $\mathcal{X}$ and $\mathcal{Y}$ along the third mode, respectively. $\mathbf{S} \in \mathbb{R}^{WH \times wh}$ denotes the down-sampling matrix (written in bold to distinguish it from the band number $S$). The blurring matrix is represented by $\mathbf{B} \in \mathbb{R}^{WH \times WH}$. Practically, $\mathbf{B}$ can be diagonalized as

$$\mathbf{B} = \mathbf{F} \mathbf{K} \mathbf{F}^{H} \tag{2}$$

where $\mathbf{F}$ denotes the fast Fourier transform (FFT) matrix, $\mathbf{F}^{H}$ its conjugate transpose, and $\mathbf{K}$ a diagonal matrix.

Within the same scene, the HRMS image with $W \times H$ pixels and $s$ spectral bands ($s < S$) is represented by $\mathcal{Z} \in \mathbb{R}^{W \times H \times s}$. $\mathcal{Z}$ can be modeled as a spectrally degraded observation of $\mathcal{X}$, that is

$$\mathbf{Z}_{(3)} = \mathbf{R} \mathbf{X}_{(3)} \tag{3}$$

where $\mathbf{Z}_{(3)} \in \mathbb{R}^{s \times WH}$ is the matrix acquired by unfolding $\mathcal{Z}$ along the third mode and $\mathbf{R} \in \mathbb{R}^{s \times S}$ is the spectral response matrix.

Because an HSI exhibits a close correlation among different bands, the global spectral low-rank property is a dominant prior for most subsequent applications. In particular, the spectral vectors, i.e., the pixels located at the same spatial position across the different bands, often live in a low-dimensional subspace whose dimension is much smaller than that of the original space. Therefore, it is natural to reveal the spectral low-rank property by using subspace decomposition, i.e.,

$$\mathbf{X}_{(3)} = \mathbf{D} \mathbf{C}_{(3)} \tag{4}$$

Here, $\mathbf{D} \in \mathbb{R}^{S \times L}$ denotes the domain switching matrix that records the mapping from the low-rank subspace domain to the original image domain, and $L$ is the number of atoms resulting from the subspace representation. The matrix $\mathbf{C}_{(3)} \in \mathbb{R}^{L \times WH}$ is the mode-3 unfolding of the coefficient tensor $\mathcal{C} \in \mathbb{R}^{W \times H \times L}$. Regarding the subspace, two remarks are made:
a) The value of $L$ is small, that is, $L < S$, which means that the spectral vectors lie in a low-dimensional subspace. Generally, the reduction of dimension favors a more efficient computation.
b) The columns of $\mathbf{D}$ are orthonormal, implying that each spectral vector $\mathcal{X}(i, j, :)$ has the same Euclidean (Frobenius) norm as the corresponding coefficient vector $\mathcal{C}(i, j, :)$. In this manner, self-similarity among spectra can be translated into the subspace domain.

Thus, the superresolution problem can be rephrased as estimating the switching matrix $\mathbf{D}$ and the coefficients $\mathcal{C}$ from the two input images. Since the LRHS image retains most of the spectral information of the HRHS image, the two images are supposed to lie in the same spectral subspace. Therefore, we can estimate the domain switching matrix from the LRHS image by a simple SVD

$$\mathbf{Y}_{(3)} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^{T} \tag{5}$$

where $\mathbf{U}$ and $\mathbf{V}$ are both orthogonal matrices and the diagonal matrix $\boldsymbol{\Sigma}$ contains the singular values in descending order. By keeping the $L$ largest singular values and discarding the remaining ones, the learned subspace matrix $\mathbf{D}$ is given as

$$\mathbf{D} = \mathbf{U}(:, 1\!:\!L) \tag{6}$$

With all the preliminaries prepared, the overall flowchart of our method is given in Fig. 1. Generally, there are three main steps: a preacquisition of the initial estimate, the learning of the subspace-based deep prior, and the integration of the prior into the degradation model. In the sequel, we first provide the integration idea and then present the individual steps in order.
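The subspace estimation in (5) and (6) can be illustrated with a minimal NumPy sketch. The function name `estimate_subspace`, the toy cube, and the pixel ordering used for the unfolding are assumptions made here for illustration; they are not taken from the authors' implementation.

```python
import numpy as np

def estimate_subspace(Y, L):
    """Estimate an S x L domain switching matrix D from an LRHS cube Y (w x h x S)."""
    w, h, S = Y.shape
    # Mode-3 unfolding: every column holds the spectrum of one pixel (S x wh).
    Y3 = Y.reshape(w * h, S).T
    # Thin SVD; NumPy returns the singular values in descending order.
    U, sigma, Vt = np.linalg.svd(Y3, full_matrices=False)
    # Keep the left singular vectors of the L largest singular values.
    return U[:, :L]

# Hypothetical toy data: a 16 x 16 cube with 31 bands whose spectra are exactly rank 4.
rng = np.random.default_rng(0)
basis = rng.standard_normal((31, 4))
coeff = rng.standard_normal((4, 16 * 16))
Y = (basis @ coeff).T.reshape(16, 16, 31)

D = estimate_subspace(Y, L=4)
Y3 = Y.reshape(-1, 31).T
C3 = D.T @ Y3  # subspace coefficients (L x wh), using D^T D = I_L
```

Because the columns of `D` are orthonormal, projecting with `D.T` preserves the norm of every spectral vector that lies in the subspace, which is the property stated in remark b).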

B. Integrating Prior Into the Degradation Model
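Using the degradation model in (1), (3), and (4), $\mathcal{C}$ can be estimated by introducing appropriate priors and minimizing the following biobjective function:

$$\arg\min_{\mathbf{C}_{(3)}} \; J_1(\mathbf{C}_{(3)}) + \mu J_2(\mathbf{C}_{(3)}) \tag{7}$$

$$J_1(\mathbf{C}_{(3)}) = \| \mathbf{Y}_{(3)} - \mathbf{D} \mathbf{C}_{(3)} \mathbf{B} \mathbf{S} \|_F^2 + \| \mathbf{Z}_{(3)} - \mathbf{R} \mathbf{D} \mathbf{C}_{(3)} \|_F^2 \tag{8}$$

where $\| \cdot \|_F$ represents the matrix Frobenius norm. $J_1(\mathbf{C}_{(3)})$ denotes the data fidelity term, and $J_2(\mathbf{C}_{(3)})$ represents a regularizer that shrinks the solution space and enforces certain prior properties. Moreover, $\mu > 0$ is a hyperparameter that balances $J_1$ and $J_2$. As shown in (7), the subspace-based deep prior structure of latent HRHS images is encoded in $J_2(\mathbf{C}_{(3)})$. It is nontrivial to handcraft a sophisticated regularizer. Recently, benefiting from the rapid development of the variable splitting technique, plug-and-play approaches have been introduced to address diverse HSI inversion problems [44], [45], [46]. In contrast, in this article, we design a regularizer that leverages the network output $\hat{\mathcal{C}} \in \mathbb{R}^{W \times H \times L}$ to keep the solution of (7) and (8) close to the learned subspace-based deep prior. Concretely, we let $J_2(\mathbf{C}_{(3)})$ be the squared Euclidean distance between $\mathcal{C}$ and $\hat{\mathcal{C}}$ in the subspace coefficient domain

$$J_2(\mathbf{C}_{(3)}) = \| \mathbf{C}_{(3)} - \hat{\mathbf{C}}_{(3)} \|_F^2 \tag{9}$$

where $\hat{\mathbf{C}}_{(3)} \in \mathbb{R}^{L \times WH}$ is the matrix resulting from unfolding $\hat{\mathcal{C}}$ along the third mode. Then (7) can be rewritten as

$$\arg\min_{\mathbf{C}_{(3)}} \; \| \mathbf{Y}_{(3)} - \mathbf{D} \mathbf{C}_{(3)} \mathbf{B} \mathbf{S} \|_F^2 + \| \mathbf{Z}_{(3)} - \mathbf{R} \mathbf{D} \mathbf{C}_{(3)} \|_F^2 + \mu \| \mathbf{C}_{(3)} - \hat{\mathbf{C}}_{(3)} \|_F^2 \tag{10}$$

Compared to other possible formulations, cost function (10) allows a straightforward solution owing to the differentiability of the squared Frobenius norm. Employing a promising deep prior structure for $\hat{\mathcal{C}}$ is the key to achieving a satisfactory recovery of the latent HRHS image.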

C. Preoptimization
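Although a deep CNN can efficiently learn the prior from existing HSIs, the two inputs $\mathcal{Y}$ and $\mathcal{Z}$, which contain redundant information and differ in size, are difficult to map to $\mathcal{C}$ directly. Not only are the spatial and spectral features tough to extract individually, but the information distortion that arises when features of different dimensions are merged is also hard to avoid. We tackle this issue by preacquiring an estimate $\tilde{\mathcal{C}} \in \mathbb{R}^{W \times H \times L}$ that adequately exploits $\mathcal{Y}$ and $\mathcal{Z}$, and then feeding it directly into SDPL-net. Similar to (10), the practical implementation simply replaces the third term of (10), i.e., $J_2$, with the squared Euclidean distance between $\mathbf{D} \tilde{\mathbf{C}}_{(3)}$ and $\mathbf{Y}^{\mathrm{up}}_{(3)}$ in the original image domain

$$\arg\min_{\tilde{\mathbf{C}}_{(3)}} \; \| \mathbf{Y}_{(3)} - \mathbf{D} \tilde{\mathbf{C}}_{(3)} \mathbf{B} \mathbf{S} \|_F^2 + \| \mathbf{Z}_{(3)} - \mathbf{R} \mathbf{D} \tilde{\mathbf{C}}_{(3)} \|_F^2 + \lambda \| \mathbf{D} \tilde{\mathbf{C}}_{(3)} - \mathbf{Y}^{\mathrm{up}}_{(3)} \|_F^2 \tag{11}$$

where $\tilde{\mathbf{C}}_{(3)} \in \mathbb{R}^{L \times WH}$ and $\mathbf{Y}^{\mathrm{up}}_{(3)} \in \mathbb{R}^{S \times WH}$ are, respectively, the unfolded versions of the expected estimate $\tilde{\mathcal{C}}$ and the upsampled LRHS image $\mathcal{Y}^{\mathrm{up}} \in \mathbb{R}^{W \times H \times S}$, and $\lambda > 0$ is the tradeoff parameter. In our implementation, bicubic interpolation is selected as the upsampling operation. Note that $\tilde{\mathbf{C}}_{(3)}$ and $\hat{\mathbf{C}}_{(3)}$ denote different stages of the subspace coefficient $\mathbf{C}_{(3)}$ while the algorithm is running, so different accents are used for ease of identification. That is to say, $\tilde{\mathbf{C}}_{(3)}$ and $\hat{\mathbf{C}}_{(3)}$ also satisfy the properties of $\mathbf{C}_{(3)}$, e.g., $\mathbf{X}_{(3)} = \mathbf{D} \mathbf{C}_{(3)}$, during the optimization stage. To solve (11), we set the derivative of (11) with respect to $\tilde{\mathbf{C}}_{(3)}$ to zero. As a result, the solution of (11) is obtained from the following Sylvester equation:

$$\mathbf{H}_1 \tilde{\mathbf{C}}_{(3)} + \tilde{\mathbf{C}}_{(3)} \mathbf{H}_2 = \mathbf{H}_3 \tag{12}$$

Since the domain switching matrix is acquired by SVD, it fulfills $\mathbf{D}^T \mathbf{D} = \mathbf{I}_L$. Thus, we are able to derive the following expressions:

$$\mathbf{H}_1 = (\mathbf{R}\mathbf{D})^T \mathbf{R}\mathbf{D} + \lambda \mathbf{I}_L, \quad \mathbf{H}_2 = (\mathbf{B}\mathbf{S})(\mathbf{B}\mathbf{S})^T, \quad \mathbf{H}_3 = (\mathbf{R}\mathbf{D})^T \mathbf{Z}_{(3)} + \mathbf{D}^T \mathbf{Y}_{(3)} (\mathbf{B}\mathbf{S})^T + \lambda \mathbf{D}^T \mathbf{Y}^{\mathrm{up}}_{(3)} \tag{13}$$

where $\mathbf{I}_L$ is the identity matrix of size $L \times L$. According to the theoretical analysis in [47], the Sylvester equation (12) has a unique solution if and only if the sum of any eigenvalue of $\mathbf{H}_1$ and any eigenvalue of $\mathbf{H}_2$ is nonzero. Since $(\mathbf{R}\mathbf{D})^T \mathbf{R}\mathbf{D}$ is positive semidefinite and $\lambda \mathbf{I}_L$ is positive definite, $\mathbf{H}_1$ is positive definite. Moreover, considering that $\mathbf{H}_2$ is a positive semidefinite matrix, the sum of any eigenvalue of $\mathbf{H}_1$ and any eigenvalue of $\mathbf{H}_2$ is strictly larger than zero, guaranteeing a unique solution of (12). On that basis, we borrow the idea from [48] to obtain the solution $\tilde{\mathbf{C}}_{(3)}$ analytically and efficiently. The concrete solver is given in Algorithm 1.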
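In Algorithm 1, the convolution blurring $\mathbf{B}$ can be factorized as illustrated in (2), in which the diagonal matrix $\mathbf{K} \in \mathbb{C}^{WH \times WH}$ is given as

$$\mathbf{K} = \begin{bmatrix} \mathbf{K}_1 & & \\ & \ddots & \\ & & \mathbf{K}_d \end{bmatrix} \tag{14}$$

where $\mathbf{K}_i \in \mathbb{C}^{wh \times wh}$ and $d$ denotes the spatial down-sampling factor (so that $WH = d \cdot wh$). The eigendecomposition of $\mathbf{H}_1$ yields the unitary matrix $\mathbf{Q}$ and the diagonal matrix $\boldsymbol{\Lambda}$, i.e.,

$$\mathbf{H}_1 = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^{-1}, \quad \boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \ldots, \lambda_L) \tag{15}$$

with $\lambda_l$ representing the $l$th eigenvalue of $\mathbf{H}_1$ and $\mathbf{Q}$ holding the eigenvectors of $\mathbf{H}_1$ in its columns. o l denotes the lth row represents a vector whose elements are all 1.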
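To make the uniqueness condition tangible, the following NumPy sketch solves a small Sylvester equation of the form $\mathbf{H}_1 \mathbf{C} + \mathbf{C} \mathbf{H}_2 = \mathbf{H}_3$ by eigendecomposing both symmetric matrices. This is a generic dense solver for illustration, not Algorithm 1, which additionally exploits the FFT structure of $\mathbf{B}$. The function name, the toy sizes, the random stand-in for $\mathbf{B}\mathbf{S}$, and the use of the exact $\mathbf{D}\mathbf{C}$ as the upsampled image are all assumptions, and the preoptimization cost is assumed to have unweighted data-fidelity terms.

```python
import numpy as np

def solve_sylvester_sym(H1, H2, H3):
    """Solve H1 @ C + C @ H2 = H3 for symmetric H1 (L x L) and symmetric H2 (n x n)."""
    lam, Q = np.linalg.eigh(H1)  # H1 = Q diag(lam) Q^T
    gam, P = np.linalg.eigh(H2)  # H2 = P diag(gam) P^T
    # In the eigenbases the equation decouples entrywise: (lam_i + gam_j) * C'_ij = H3'_ij.
    denom = lam[:, None] + gam[None, :]
    if np.any(np.isclose(denom, 0.0)):
        raise ValueError("no unique solution: some lam_i + gam_j is zero")
    return Q @ ((Q.T @ H3 @ P) / denom) @ P.T

# Hypothetical toy problem: S=10 bands, s=3 MS bands, L=4 atoms, 12 HR and 6 LR pixels.
rng = np.random.default_rng(1)
D, _ = np.linalg.qr(rng.standard_normal((10, 4)))  # orthonormal columns, D^T D = I
R = rng.random((3, 10))                            # placeholder spectral response
BS = rng.standard_normal((12, 6))                  # dense stand-in for blur + down-sampling
C_true = rng.standard_normal((4, 12))
Y3, Z3, Y_up = D @ C_true @ BS, R @ D @ C_true, D @ C_true
lam_reg = 1e-2                                     # tradeoff weight on the third term

H1 = (R @ D).T @ (R @ D) + lam_reg * np.eye(4)
H2 = BS @ BS.T
H3 = (R @ D).T @ Z3 + D.T @ Y3 @ BS.T + lam_reg * D.T @ Y_up
C_est = solve_sylvester_sym(H1, H2, H3)
```

In this noise-free toy setting the true coefficients make all three residuals vanish, so the unique solution of the Sylvester equation coincides with `C_true`.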

D. Subspace-Based Deep Prior Learning
Instead of using a hand-crafted regularizer, we propose to learn the prior in the subspace coefficient domain of latent HRHS images from hyperspectral datasets. Note that, after preoptimization, we obtain the initial estimate $\tilde{\mathcal{C}}$, which has the same size as the target $\mathcal{C}$ in the subspace coefficient domain. Taking $\tilde{\mathcal{C}}$ and $\mathcal{C}$ as input and output, respectively, we construct an end-to-end deep network dubbed SDPL-net following a U-Net architecture.
U-Net has been proven to be an effective tool for various tasks, including magnetic resonance spectroscopy imaging [49] and biomedical image segmentation [50]. In contrast to another well-known model Resnet [51], U-Net primarily operates by performing pixelwise transformation on input images, which is applicable to the superresolution problem. Using convolutional layers, U-Net first continuously down samples the input image to a small size, which aids in extracting globally valuable features. Next, the image is scaled up through a reverse operation. This process helps to identify locally vital features so as to refine the image at a finer resolution. Unfortunately, the features in U-Net architecture contain redundant low-frequency information equally across all channels, which inevitably hinders the representability. To solve this problem, we employ RCAB in our network. In particular, RCAB permits the low-frequency structure to be bypassed via the multiple skip connections, enabling CNN to concentrate more on high-frequency information. Meanwhile, RCAB adaptively rescales channelwise features taking interdependencies among channels into account.
As shown in Fig. 2, the overall network follows an encoder-decoder design. First, a convolution layer without an activation function is employed to extract the shallow features. The encoder and decoder modules are then appended sequentially. Note that the encoder module intends to learn the blurring and down-sampling operators, which is much simpler than the learning of the upscaling mapping performed by the decoder module. In practice, the encoder block is designed with only two convolutional layers and a LeakyReLU activation layer, which is considerably efficient yet functions well in our experiments. As for the decoder block, unlike the baseline U-Net, RCAB is utilized to boost the model capacity. Besides, shortcut connections, which account for the hierarchical nature of the features, are employed in the feature extraction process. We adopt these connections since typical convolutional networks tend to finalize their output with high-level features, neglecting the importance of low-level features [8]. In general, high-level features refer to semantic knowledge, whereas low-level features mostly carry information about contours, boundaries, etc. We argue that low-level features are indispensable for HSI superresolution. Therefore, the features from the encoder block are merged into the decoder block in the feature extraction process.
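It has been shown experimentally that the $\ell_1$-norm is more suitable than the $\ell_2$-norm in low-level visual tasks since it delivers improved performance [52]. Therefore, a loss function based on the $\ell_1$-norm is employed

$$\mathcal{L}(\Theta) = \frac{1}{M} \sum_{m=1}^{M} \| \mathcal{F}(\tilde{\mathcal{C}}_m; \Theta) - \mathcal{C}_m \|_1 \tag{16}$$

where $\{(\tilde{\mathcal{C}}_m; \mathcal{C}_m)\}_{m=1}^{M}$ represents the $M$ training pairs used to learn the parameters $\Theta$ of SDPL-net $\mathcal{F}$.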
The superresolution result derived from our SDPL-net, denoted $\hat{\mathcal{C}}$ in the subspace domain, is again employed in the regularizer $J_2$, as shown in the biobjective optimization formulation (10). More specifically, $\hat{\mathcal{C}}$ behaves as a spatial-spectral prior and contributes positively to a finer estimation of $\mathcal{C}$. The solver of (10) is similar to that of (11) and will not be repeated here. Finally, the desired HRHS image $\mathcal{X}$ is obtained by calculating (4).
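For reference, the stationarity condition of (10) can be worked out along the same lines as for (11). The sketch below assumes that both data-fidelity terms in (10) are unweighted, writes $\hat{\mathbf{C}}_{(3)}$ for the unfolded SDPL-net output, and uses $\mathbf{D}^T\mathbf{D} = \mathbf{I}_L$; the primed names $\mathbf{H}_1'$ and $\mathbf{H}_3'$ are introduced here for illustration only.

```latex
\begin{aligned}
&\nabla_{\mathbf{C}_{(3)}}\Big( \|\mathbf{Y}_{(3)} - \mathbf{D}\mathbf{C}_{(3)}\mathbf{B}\mathbf{S}\|_F^2
 + \|\mathbf{Z}_{(3)} - \mathbf{R}\mathbf{D}\mathbf{C}_{(3)}\|_F^2
 + \mu \|\mathbf{C}_{(3)} - \hat{\mathbf{C}}_{(3)}\|_F^2 \Big) = \mathbf{0} \\
&\Longrightarrow\quad \mathbf{H}_1' \mathbf{C}_{(3)} + \mathbf{C}_{(3)} \mathbf{H}_2 = \mathbf{H}_3', \\
&\mathbf{H}_1' = (\mathbf{R}\mathbf{D})^T \mathbf{R}\mathbf{D} + \mu \mathbf{I}_L, \qquad
 \mathbf{H}_2 = (\mathbf{B}\mathbf{S})(\mathbf{B}\mathbf{S})^T, \\
&\mathbf{H}_3' = (\mathbf{R}\mathbf{D})^T \mathbf{Z}_{(3)} + \mathbf{D}^T \mathbf{Y}_{(3)} (\mathbf{B}\mathbf{S})^T + \mu \hat{\mathbf{C}}_{(3)}.
\end{aligned}
```

Under these assumptions, the only differences from preoptimization are that $\mu$ takes the place of $\lambda$ and that the regularization target enters the right-hand side directly in the subspace domain, so the same Sylvester solver applies.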
Overall, the joint usage of low-rank decomposition, deep priors, and degradation knowledge is considered in our model. Recall that the hand-crafted prior-based optimization methods could guarantee an acceptable performance given no training samples [12], [13]. Besides, using a network to recover a subspace variable instead of the original one would compress the model parameters. Both of these two merits favor the relief of the overfitting issue. Moreover, the fast inference of Sylvester equation given in Algorithm 1 would guarantee an efficient solution.

A. Data and Experimental Settings
In this study, experiments are implemented on two simulated datasets to assess the efficacy of our presented approach: Columbia computer vision laboratory (CAVE) [53] and Harvard [54]. Moreover, we also utilize the University of Houston (UH) [55] as a real dataset to evaluate the model performance.
The CAVE dataset holds 32 high-quality indoor images recorded by a generalized assorted pixel camera. These HSIs have a spatial resolution of 512 × 512 and 31 spectral bands with wavelengths ranging from 400 to 700 nanometers. Since the first two bands of CAVE suffer from ambiguity, we remove them to avoid bias. Twenty HSIs are selected at random to train the network, and the remaining 12 HSIs are used for testing. Example HSIs from CAVE are given in Fig. 3.
The Harvard dataset consists of 77 HSIs covering indoor and outdoor scenarios of various targets, materials, and scales under daylight illumination. Each image has 31 bands with wavelengths ranging from 420 to 720 nanometers and a spatial size of 1392 × 1040. For the test, ten images are randomly selected, whose representatives are shown in Fig. 4.
The UH dataset was provided by the IEEE Geoscience and Remote Sensing Society in 2018 and includes an HRMS image and an LRHS image of size 83440 × 24040 × 3 and 4172 × 1202 × 48, respectively. For uniformity, the HRMS image is resized to 33376 × 9616 × 3, so that the spatial ratio of the HRMS image to the LRHS image becomes 8. Moreover, we crop a 1024 × 1024 subimage and the corresponding 128 × 128 subimage from the HRMS and LRHS images, respectively, for testing.
The images from two simulated datasets are employed as the ground truths (GTs). LRHS images are obtained from each simulated dataset by employing an 8 × 8 Gaussian filter with a standard deviation of 2 and then down sampling with a factor of 8. The spectral down-sampling matrix R for the CAVE and Harvard datasets is obtained from the response of the Nikon D700 camera, which produces HRMS image as RGB image. In the nonblind case, we treat the convolutional blur B and the spectral response matrix R as known prior knowledge to guide the generation of superresolution images. Yet in the blind case, we follow most well-known works, such as [12], [24] that borrow HySure [56] to estimate B and R.
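The simulation protocol described above (Gaussian blur, down-sampling by a factor of 8, and spectral degradation by a response matrix) can be sketched as follows. The circular boundary handling, the sampling offset, the helper names, and the random placeholder for $\mathbf{R}$ are assumptions made for illustration; in particular, the placeholder is not the Nikon D700 response.

```python
import numpy as np

def gaussian_kernel(size=8, sigma=2.0):
    """Normalized size x size Gaussian kernel."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-ax**2 / (2.0 * sigma**2))
    k = np.outer(g, g)
    return k / k.sum()

def simulate_lrhs(X, kernel, factor=8):
    """Blur each band by circular convolution (via FFT), then keep every factor-th pixel."""
    W, H, S = X.shape
    kh, kw = kernel.shape
    pad = np.zeros((W, H))
    pad[:kh, :kw] = kernel
    # Shift the kernel so that it is (approximately) centered at the origin.
    pad = np.roll(pad, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    K = np.fft.fft2(pad)
    blurred = np.real(np.fft.ifft2(np.fft.fft2(X, axes=(0, 1)) * K[:, :, None], axes=(0, 1)))
    return blurred[::factor, ::factor, :]

def simulate_hrms(X, R):
    """Apply an s x S spectral response matrix R along the band axis."""
    return X @ R.T

# Hypothetical ground truth: a random 64 x 64 cube with 31 bands.
rng = np.random.default_rng(0)
X = rng.random((64, 64, 31))
R = rng.random((3, 31))
R /= R.sum(axis=1, keepdims=True)  # each row of the placeholder response sums to one

Y = simulate_lrhs(X, gaussian_kernel(8, 2.0), factor=8)  # LRHS: 8 x 8 x 31
Z = simulate_hrms(X, R)                                  # HRMS: 64 x 64 x 3
```

Circular convolution is chosen here because it is the boundary condition under which a blurring matrix is diagonalized by the FFT; other boundary treatments would give slightly different LRHS images near the borders.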

B. Quality Measures
Four prevalent metrics are selected in the experiment to assess the quality of superresolution results.
1) Peak signal-to-noise ratio (PSNR): The average spatial similarity between produced and reference images across all bands can be described by the PSNR. The greater the PSNR, the better the spatial quality.
2) Spectral angle mapper (SAM): SAM measures the spectral angle averaged across the whole spatial domain to determine the spectral quality of the fusion result. The lower the angle, the less the spectral distortion.
3) Structural similarity index measure (SSIM): SSIM computes the average structural similarity in the spatial domain between produced and reference images. The higher the SSIM, the better the preservation of spatial structure.
4) Root-mean-squared error (RMSE): The discrepancy between produced and reference images can be described by the RMSE. The smaller the value, the better the result.
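Three of these metrics can be written compactly in NumPy, as in the sketch below. The peak value, the per-band averaging of PSNR, and the use of degrees for SAM are conventions assumed here; papers differ in these details, so the numbers are only comparable under a shared convention. SSIM is omitted because it requires windowed local statistics.

```python
import numpy as np

def rmse(ref, est):
    """Root-mean-squared error over all pixels and bands."""
    return float(np.sqrt(np.mean((ref - est) ** 2)))

def psnr(ref, est, peak=1.0):
    """Mean PSNR over bands in dB; `peak` is the assumed maximum pixel value."""
    mse = np.mean((ref - est) ** 2, axis=(0, 1))  # one MSE value per band
    return float(np.mean(10.0 * np.log10(peak**2 / mse)))

def sam(ref, est, eps=1e-12):
    """Mean spectral angle in degrees over all pixels."""
    dot = np.sum(ref * est, axis=2)
    norms = np.linalg.norm(ref, axis=2) * np.linalg.norm(est, axis=2)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)  # guard against rounding outside [-1, 1]
    return float(np.degrees(np.mean(np.arccos(cos))))

# Hypothetical check: a constant offset of 0.1 on a flat cube with values in [0, 1].
ref = np.full((4, 4, 3), 0.5)
est = ref + 0.1
print(rmse(ref, est), psnr(ref, est), sam(ref, est))
```

Note that `psnr` divides by the per-band MSE, so it is undefined for a band that is reconstructed exactly.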

C. Parameter Selection
In our method, three key parameters need to be manually adjusted: the subspace dimension L and the two tradeoff hyperparameters λ and μ. To examine the effect of the number of atoms L on CAVE and Harvard, Fig. 5(a) shows the mean PSNR curves of the fusion results against varying L. As can be observed, with L varied from 4 to 8, the PSNR values climb rapidly for both CAVE and Harvard. The results remain relatively stable as L increases further. It is well known that Harvard is less challenging than CAVE. Accordingly, the PSNR curve for the former is substantially smoother than that for CAVE. Consequently, setting the subspace dimension to 8 is sufficient to retain most of the spectral information, validating the claim that spectral vectors indeed lie in a low-dimensional subspace.
Regarding the quality of X (X_{(3)} = DC_{(3)}), parameter λ is employed for preoptimization. Similarly, parameter μ influences the quality of the final superresolution result X. Therefore, these two parameters need to be properly tuned for a pleasing performance. Both λ and μ must be greater than 0, so we traverse these parameters over {1e-5, 1e-4, ..., 1} to find a good setting. As illustrated in Fig. 5(b), as λ increases, the PSNR values on the two simulated datasets first remain flat and then decrease. Note that a larger PSNR for X does not guarantee a better final result, which depends more on the prior learning in the subsequent operations. That is, λ only needs to be adjusted within a coarse range. In our implementation, λ = 1e-5 is chosen to ensure an acceptable overall quality. Likewise, in Fig. 5(c), the PSNR values remain stable as μ rises. We attribute this to the fact that the estimated C is quite close to the true C in the subspace, which reduces the fluctuation of the PSNR values. In all our experiments, we simply select an intermediate value, i.e., μ = 0.001, to produce the fusion results. Note that the traversed span is notably wide for practical applications, which demonstrates that our method is insensitive to these parameters. Finer tuning would likely improve the performance further, yet we keep a relaxed setting for easy model selection.

1) Performance Comparison on CAVE: Table I displays the average PSNR, SAM, SSIM, and RMSE for the 12 test HSIs, with the best values highlighted in bold and the second-best ones underlined for clarity. The table shows that the proposed SDPR approach outperforms the other competing methods across all evaluation measures. Presumably, our approach can better maintain both spectral and spatial structures. Furthermore, taking hairs (an image in CAVE) as an example for visual comparison of the competing methods, Fig. 6 displays the fused pseudocolor results and their corresponding error maps. The pseudocolor images are formed by bands 29, 19, and 1 of the recovered images, while the error maps are the differences between the recovery and the GT. For ease of observation, a representative portion of each fused HSI is marked and magnified by a factor of 2. Together with the error maps, this shows that SDPR recovers complex structures more effectively and with less distortion than all other competitors. Fig. 7(a) further shows the PSNR curves of all competing approaches to compare the quality along different bands. As can be seen from this subfigure, SDPR outperforms the other competitors in almost all spectral bands.
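Three of the reported metrics (PSNR, SAM, and RMSE) and the per-band PSNR curves of Fig. 7 can be sketched as follows. This is a minimal NumPy sketch rather than the evaluation code used in the experiments: the function names, the assumed [0, 1] intensity range, and the band-wise averaging of PSNR are assumptions, and the scaling conventions behind the values in the tables may differ.

```python
import numpy as np


def psnr_per_band(reference, estimate, peak=1.0):
    """PSNR in dB of every spectral band; inputs are (height, width, bands)."""
    mse_per_band = np.mean((reference - estimate) ** 2, axis=(0, 1))
    return 10.0 * np.log10(peak ** 2 / mse_per_band)


def mean_psnr(reference, estimate, peak=1.0):
    """PSNR averaged over the spectral bands."""
    return float(np.mean(psnr_per_band(reference, estimate, peak)))


def rmse(reference, estimate):
    """Root-mean-square error over all pixels and bands."""
    return float(np.sqrt(np.mean((reference - estimate) ** 2)))


def sam_degrees(reference, estimate, eps=1e-12):
    """Mean spectral angle between corresponding pixel spectra, in degrees."""
    ref = reference.reshape(-1, reference.shape[-1])
    est = estimate.reshape(-1, estimate.shape[-1])
    cosine = np.sum(ref * est, axis=1) / (
        np.linalg.norm(ref, axis=1) * np.linalg.norm(est, axis=1) + eps
    )
    angles = np.arccos(np.clip(cosine, -1.0, 1.0))
    return float(np.degrees(np.mean(angles)))


# Toy usage on random data (assumption): a ground truth and a noisy recovery.
rng = np.random.default_rng(0)
ground_truth = rng.random((16, 16, 31))
recovered = ground_truth + 0.01 * rng.standard_normal(ground_truth.shape)

band_curve = psnr_per_band(ground_truth, recovered)   # one value per band
summary = {
    "PSNR": mean_psnr(ground_truth, recovered),
    "SAM": sam_degrees(ground_truth, recovered),
    "RMSE": rmse(ground_truth, recovered),
}
```

PSNR and RMSE measure spatial fidelity per band, whereas SAM compares the direction of each pixel's spectrum and is therefore unaffected by a uniform rescaling of that spectrum.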
2) Performance Comparison on Harvard: Table II shows the average performance of all competing approaches on the 10 test HSIs of Harvard. We would like to point out that the DL-based approaches in this experiment are all trained on CAVE and then tested on the other datasets without any retraining or fine-tuning. Therefore, the results of these methods on Harvard are representative of their generalization ability [35]. As can be seen from the table, SDPR remains the best in most cases. Fig. 8 illustrates the fused images and the corresponding error maps from all competing approaches. A representative region is also marked and magnified. In this figure, the results achieved by LTMR and SDPR are similar and very close to the GT. Evidently, they generate the minimum fusion error in both the edges and the smooth areas. Fig. 7(b) further depicts the PSNR curves against the spectral bands for the whole Harvard dataset. As can be seen, despite a similar visual appearance, LTMR lags behind SDPR numerically in all spectral bands. On the whole, both the quantitative metrics and the qualitative visuals demonstrate that most DL-based methods (e.g., UAL and MHF-net) exhibit weaker generalization than our model. This is mainly due to overfitting: the model parameters fit closely to the features of CAVE, so the models perform suboptimally on the Harvard dataset.

3) Performance Comparison on UH:
In this case, both the convolutional blurring B and the spectral response matrix R are unknown. For those methods that require B and R as input, we estimate them via HySure. Fig. 9 depicts a portion of the superresolution results, shown as RGB images rendered with the estimated spectral response function. Visual inspection clearly shows that the output achieved by SDPR is more realistic and presents much sharper details.
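Rendering a hyperspectral cube through a spectral response matrix amounts to the linear map Y_(3) = R X_(3) applied to every pixel spectrum. The following is a minimal NumPy sketch of that step. The Gaussian-shaped response, the band count of 40, the channel centers, and the function names are hypothetical placeholders chosen for illustration; in the experiment above, R is estimated via HySure rather than constructed this way.

```python
import numpy as np


def gaussian_response(num_bands, centers, width):
    """Hypothetical response matrix R (channels x bands); each row sums to 1."""
    band_index = np.arange(num_bands)[None, :]
    centers = np.asarray(centers, dtype=float)[:, None]
    response = np.exp(-0.5 * ((band_index - centers) / width) ** 2)
    return response / response.sum(axis=1, keepdims=True)


def render_with_response(hsi, response):
    """Apply Y_(3) = R X_(3); hsi is (height, width, bands), R is (channels, bands)."""
    return hsi @ response.T


# Toy usage (assumption): a random 40-band cube rendered to three channels.
rng = np.random.default_rng(0)
cube = rng.random((8, 8, 40))
response = gaussian_response(40, centers=(30, 20, 8), width=4.0)
rgb = render_with_response(cube, response)            # shape (8, 8, 3)
```

Because each row of this R is normalized to sum to one, a spectrally flat pixel keeps its intensity in every output channel.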

1) Computational Efficiency:
To assess the computational efficiency of SDPR, we examine the running times of each competing method on the three datasets. A GeForce RTX 3080 16 GB graphics card is used for all DL-based methods. All fusion [...] Table III shows the average running times of each competing approach. The results demonstrate that our SDPR method is faster than all traditional methods, including LRTA, MDF, and LTMR, and than most DL-based methods, such as UAL and DHSIS. In other words, SDPR is second only to MHF-net in terms of running efficiency. We attribute this advantage mainly to applying the network to the subspace factor, which remarkably reduces the dimension of the target variable. In addition, the concise design of the network architecture also saves computational cost. Inevitably, generating the initial estimate with preoptimization, integrating the prior into the degradation model, and estimating B and R add some computational effort. Overall, we believe that the running time of SDPR is acceptable given the improved fusion quality.
2) Effectiveness of Different Steps: As mentioned, our approach can be divided into three steps: preoptimization to generate the initial estimate, using SDPL-net to learn the subspace-based deep prior, and integrating the prior into the degradation model. Table IV provides the average quantitative results of Y_up, the preoptimization result X, the SDPL-net output X (X_{(3)} = DC_{(3)}), and the final result X on the two simulated datasets, evaluating the effectiveness of these three steps objectively. The table shows that the preoptimization result X exceeds Y_up by a large margin on both datasets, demonstrating that the preoptimization step is able to retain both spatial and spectral structures when the two inputs are initially integrated as a whole. The result X generated by SDPL-net yields better scores than the preoptimization result X, which indicates that prior learning plays an effective and helpful role in the superresolution task. Furthermore, the final superresolution result X, which is based on the SDPL-net output, achieves a further gain, showing that integrating the prior into the degradation model has a positive impact on quality. Overall, all three steps are indispensable and progressively improve the performance. Note that more iterations of the three steps would bring further benefit; however, the improvement is marginal and comes at considerably higher computational cost.

3) Ablation Study of Subspace Learning Scheme:
We further carry out an ablation study of the subspace learning scheme, and the quality of the fusion results is shown in Table V. Taking MHF-net [36], which ranks second in most experiments, as an example, we also equip it with the subspace learning scheme to validate the scheme's general transferability. Specifically, with the same SVD operation, the output of MHF-net is mapped to the subspace domain, which can be further fed into (10) for the final result. As the table shows, the models equipped with the subspace learning scheme clearly produce improved results in comparison to the baselines. In particular, the gaps between our SDPR with and without the scheme are 6.77, 3.01, 0.07, and 0.62 in PSNR, SAM, SSIM, and RMSE, respectively, on the Harvard dataset.
Furthermore, taking watercolors (an image in CAVE) as an example, Fig. 10 visualizes some feature maps as well as the fused images with and without the subspace scheme. As can be seen, more focused and clearer features are learned by the model equipped with the subspace scheme. For instance, in the first row of Fig. 10(b), more scattered hot areas can be observed around the central regions. In the first row of Fig. 10(c), the features of most trees are closely tangled with the border areas. All these findings demonstrate the advantage of the subspace learning scheme in capturing more semantic information and improving the generalization capability.

4) Comparison Between U-Net and SDPL-Net: By replacing SDPL-net with a standard U-Net, we also conduct an ablation study and report the superresolution results in Table VI. Evidently, compared to U-Net, SDPL-net yields better performance on both datasets. In particular, on the CAVE dataset, the improvement of SDPL-net over U-Net is 1.43, 0.69, and 0.11 in terms of PSNR, SAM, and RMSE, respectively. These improvements demonstrate that the introduction of RCAB indeed benefits feature learning and enriches the prior knowledge.

V. CONCLUSION
In this article, we present an explicit-implicit prior-based HSI superresolution approach called SDPR. Instead of learning a deep prior in the original image domain, we leverage the spectral correlation to transform the image into a low-rank subspace domain. In addition, rather than using a hand-crafted prior constraint, we utilize explicit-implicit joint knowledge learned by a newly proposed SDPL-net. Moreover, the output of SDPL-net is leveraged to regularize the ill-posed fusion problem. In contrast to most existing DL approaches, which often overfit the training data, our SDPR method can capture invariant reconstruction rules under response variations. This enables it to generalize well to the test HSIs, even when substantial changes occur in band number, illumination intensity, and spectral information. Experiments conducted on a variety of synthetic and real-world datasets have confirmed the superiority of our SDPR method over both the traditional model-based and the more prevalent DL-based methods.
Full-resolution quality remains an open problem, and there remains considerable room for further performance improvement. On the one hand, it makes sense to integrate more traditional explicit priors and deeper implicit knowledge to solve the ill-posed problem. On the other hand, ways to model the data fidelity terms by estimating the spectral response function and the convolution blur matrix will be investigated. These efforts will further improve the adaptability of the presented approach to real-world circumstances.