A Review of Spatial Enhancement of Hyperspectral Remote Sensing Imaging Techniques

Remote sensing technology has undeniable importance in various industrial applications, such as mineral exploration, plant detection, defect detection in aerospace and shipbuilding, and optical gas imaging, to name a few. Remote sensing technology has been continuously evolving, offering a range of image modalities that can facilitate the aforementioned applications. One such modality is hyperspectral imaging (HSI). Unlike multispectral images (MSI) and natural images, HSI consist of hundreds of bands. Despite their high spectral resolution, HSI suffer from low spatial resolution in comparison to their MSI counterpart, which hinders the utilization of their full potential. Therefore, spatial enhancement, or super resolution (SR), of HSI is a classical problem that has been gaining rapid attention over the past two decades. The literature is rich with various SR algorithms that enhance the spatial resolution of HSI while preserving their spectral fidelity. This article reviews and discusses the most important algorithms relevant to this area of research between 2002 and 2022, along with the most frequently used datasets, HSI sensors, and quality metrics. Metaanalysis are drawn based on the aforementioned information, which is used as a foundation that summarizes the state of the field in a way that bridges the past and the present, identifies the current gap in it, and recommends possible future directions.


I. INTRODUCTION
F OR several decades, the importance and impact of remote sensing applications have been rapidly increasing. In fact, the first practical remote sensing application dates back to the 1840 s, when balloonists photographed the ground using newly invented cameras [1]. Pigeon photography then came in 1907 by Julius Neubronner as an aerial photography technique, where a lightweight miniature camera was attached to pigeons. The biggest leap in the field of remote sensing was taken when the world's first satellite to orbit the space, Sputnik, was launched in 1957. The field of remote sensing imagery has been evolving ever since, especially after the first image was captured by Landsat in 1972 [1], which further raised the interest in using satellites to monitor the earth's surface. Nowadays, the field of remote sensing is vast and technologically advanced, with hundreds of journal papers and conferences to further exploit the full potential of remote sensing instruments. This increasing interest in remote sensing comes from the fact that it covers a wide range of applications. Some of these applications include geology [2], [3], [4], [5], [6], [7], vegetation [8], [9], [10], [11], land cover land use [12], [13], [14], [15], and oceanography [16], [17], [18], [19]. Each application requires different spatial, spectral, and temporal resolutions depending on its objective. Satellites exist with various resolutions, such as medium-and high-resolution satellites to facilitate the various requirements of different applications. In addition, some satellites are designed for specific tasks, such as weather satellites [20], ocean satellites [21], and earth observation satellites [22], [23], [24]. In order to exploit remote sensing data, meaningful information must be extracted from remote sensing imagery, and this is where the role of image processing techniques becomes important. Remote sensing applications require several tasks that need to be performed with high accuracy in order to guarantee meaningful results. Some of these tasks include object detection [25], [26], [27], [28], classification [29], [30], [31], and semantic segmentation [32], [33]. The more the details can be obtained about an object, the higher the accuracy of the results produced from these tasks. This is where the importance of hyperspectral imagery (HSI) arises.
The goal of HSI is to obtain the spectrum for each pixel in the image of a scene, with the purpose of finding objects, identifying materials, or detecting processes. In HSI, the recorded spectra have fine wavelength resolution and cover a wide range of wavelengths. HSI measure continuous spectral bands, as opposed to multispectral images (MSI), which measure spaced spectral bands ranging from three to a few dozens of bands. In contrast, HSI have hundreds of bands and they capture signals that offer unique signatures to certain objects.
Due to manufacturing tradeoffs (i.e., capturing 3-D signals with a 2-D sensor), it is hard to achieve both at the same time [34]. Hence, spatial and spectral resolutions have inherently inverse relation between them. It is worth mentioning that spatial resolution is also a function of the distance between the sensor and the target. Even a high-quality pushbroom HS sensor in short-wave infrared can only capture 200- Thus, there exists tradeoff between the distance from target and the field of view. However, this article only deals with spatial and spectral resolution that are related to sensor techniques. Assuming that an MS sensor and an HS sensor have the same distance from the target, the resulting MSI will have high spatial resolution but low spectral resolution, a mirror opposite of HSI, as illustrated in Fig. 1. Also, there exists system tradeoff between data volume and signal-to-noise ratio (SNR), which prevents achieving both simultaneously. HSI have low SNR due to reduced illumination by narrow band filters, and they are noisy if a long exposure time is not guaranteed. Practical image processing applications in the context of remote sensing ideally require images having both high spectral and spatial resolution. Therefore, efforts have been exerted in the literature to improve the spatial resolution of HSI while exploiting its high spectral resolution simultaneously. One such example that demonstrates the importance of having high spectral-spatial resolution is seen in a study conducted by Kwan et al. [35]. Kwan et al. [35] enhanced the spatial resolution of the left imager onboard the Curiosity rover using various pansharpening methods. The goal is to have high spatial and high spectral image data cube, which will greatly contribute to the understanding of Mars.
This research study investigates and highlights the most effective techniques and the latest trends in this research field, and explains the intended approaches to enhance HSI spatially, otherwise known as HSI super resolution (HSI-SR). This study also explores the available HSI datasets, the most frequently used HSI sensors, and the most reliable quality metrics. The rest of this article is organized as follows. Section II discusses key concepts and mathematical background relevant to HSI-SR techniques. Section III illustrates metaanalysis based on more than 200 studies conducted between 2002 and 2022. Sections IV and V explore Fusion and single image super resolution (SISR) methods for HSI-SR, respectively. Section VI discusses the challenges surrounding this area of research and recommends the future direction accordingly. Finally, Section VII concludes this article.

II. KEY CONCEPTS AND MATHEMATICAL BACKGROUND
There are several perspectives to categorize HSI-SR methodologies. From our perspective, the most straightforward categorization of high-resolution HSI (HR-HSI) construction is broadly based on the availability of auxiliary data, Fusion and SISR. Fusion methods initially started as a pansharpening problem of MSI, which was then extended to HSI. However, the Fusion method is a much more complicated problem for HSI due to its large spectral information. Thus, method-based (also called optimization-based) approaches appeared [194]. Nowadays, both pansharpening and method-based approaches are considered as traditional approaches, and DCNNs have been the most widely researched approach in this area [38], as will be evident in Section III. Similar to Fusion methods, SISR approaches also started with grayscale and RGB images, and some of these methods were used for HSI-SR, such as interpolation and super resolution mapping (SRM). Both interpolation and SRM approaches are nowadays mostly considered obsolete; they have been almost completely overshadowed by DCNNS, as will be shown in Section III. The proposed taxonomy of HSI-SR methods is given in Table I.
An interesting perspective on HSI-SR categorization is provided in [195]. Kwan et al. [195] suggested categorizing the methods into four groups. Group 1 requires HR panchromatic (PAN) band in addition to the point spread function (PSF). Group 2 requires HR-PAN band only. Group 3 is SISR, which does not require any extra information. Finally, Group 4 is SISR with supplementary information, such as dictionary learning [196], [197], [198], [199], [200]. It is worth mentioning that when any HSI-SR method does not assume prior knowledge or require auxiliary information, it is referred to as blind SR.
The next sections delve into the details of key concepts and mathematical background that are directly related to the spatial enhancement of HSI, followed by a review of some examples from the literature related to Fusion in Section IV and SISR in Section V.

A. Image Resolution
An image resolution refers to the amount of details provided by an image. There are four different ways to describe image resolution: spatial, spectral, temporal, and radiometric. Spatial resolution refers to the smallest feature that can be depicted by a pixel in a satellite image [201]. On the other hand, spectral resolution refers to the satellite sensors' ability to measure certain wavelengths from the electromagnetic spectrum [201]. High spectral resolution means that the satellite is capable of capturing hundreds of bands of the same spatial area, and the wavelength for each band is typically narrow. Temporal resolution refers to the time period of a satellite's ability to capture images of the same scene consecutively. Finally, radiometric resolution refers to the sensor's ability to distinguish between the electromagnetic signals reflected by various objects within the same spectral band [202]. For this study, only spatial and spectral resolutions are of particular importance, as will be evident in the next sections.  I  TAXONOMY OF THE MAIN APPROACHES USED IN THE LITERATURE TO ACHIEVE HSI-SR   TABLE II

B. Datasets of Hyperspectral Images
HSI are constructed using hyperspectral cameras that are capable of capturing hundreds of imaging bands at different wavelengths for the same spatial area [203]. Typically, hyperspectral sensors capture images between 400 and 2500 nm wavelength with regular sampling interval of 4-15 nm. On the other hand, the spatial resolution can be as coarse as 30 m. Low spatial resolution implies that several objects may be captured within the same pixel, which makes them difficult to identify. This is a key concept behind spectral unmixing, which will be discussed in Section IV-B1.
Some examples of the publicly available HSI datasets for studying spatial enhancement of HSI are given in Table II. A sample of Pavia University dataset is shown in Fig. 2. Other datasets include NUS [204], Kawakami [115], University of Houston [205], Moffett Field [206], Paris [207], San  (11,9).
Francisco [208], and Real Hyperspectral dataset, which consists of Samson, Jasper Bridge, Urban, and Cuprite [209], [210]. In 2019, a new HSI dataset called ICONES [211] became publicly available. Due to its recent availability, it has not been used in other research studies thus far, but it is worth mentioning due to the large size and variety of HSI it provides. Some studies on HSI use CAVE [212] and Harvard [213] datasets; however, the images in both datasets are not captured using remote sensing devices. Furthermore, CAVE dataset is listed as a multispectral dataset rather than a hyperspectral one [214].

C. Quality Metrics
The quality of enhanced images requires verification for the purpose of evaluation and benchmarking. Verifying the quality of an image with visual inspection is a subjective process that depends on several factors, including screen size and illumination. Therefore, quantitative evaluation is more reliable, in which the quality of the enhanced image (also called estimated image) is assessed by comparing it to the ground truth (GT) image (also called reference or target image). One example of such metrics is peak signal-to-noise ratio (PSNR), which is expressed as follows [229]: where Y is the GT HSI and Y is the estimated HR-HSI from low-resolution HSI (LR-HSI, also called source HSI). Both images have height M and width N . MAX(Y) refers to the maximum possible value a pixel in the GT HSI can take. For instance, the maximum value for images of type 8-bit unsigned integer is 255. Mean squared error (mse) [229] computes the cumulative error between the GT HSI and the estimated HR-HSI, whereas PSNR computes the maximum possible power of a signal to the power of distortion noise in dB. In ideal cases where both images are identical, PSNR result would be infinite because mse reaches zero [230]. Even though PSNR provides a pixel by pixel comparison, it ignores human visual perception. Structural similarity index measurement (SSIM) [231] allows the inclusion of human visual perception by assessing the errors of three factors: correlation, luminance, and contrast. SSIM is expressed as follows: where μ Y , μ Y , σ Y , σ Y , and σ Y Y represent local means, standard deviation, and cross covariance for HSI Y and Y. SSIM value ranges between 0 and 1, where 0 indicates no similarity and 1 indicates that Y and Y are identical. In the special case when C 1 = C 2 = 0, SSIM is referred to as universal image quality index (UIQI) [232], which is considered the predecessor of SSIM. SSIM is preferred over UIQI due to the fact that the latter can lead to unstable results. Another quantitative metric that measures spatial similarity is cross correlation (CC) [35], [194], [195], [233], which is expressed as follows: (y (i,:,k) − μ y (i,:,k) )( y (i,:,k) − μ y (i,:,k) ) where B refers to the total number of HS bands. CC value ranges between −1 and 1, where 1 is the ideal value if both images are identical.
There are quantitative metrics that are specific to HSI, such as spectral angle mapper (SAM) [234], which measures spectral similarity between the spectra of the GT HSI and the spectra of the enhanced HSI. SAM is expressed as follows: SAM value should be as close to 0 as possible. Another example of a quantitative metric specific to HSI is relative dimensionless global error (ERGAS) [235]. ERGAS is computed as follows: where h and l are the high and low spatial resolutions, respectively, expressed in meters. ERGAS relies on computing the normalized average error of each band in the enhanced image; therefore, low ERGAS values indicate high quality. Other evaluation metrics include degree of distortion [159], [197], [236], Q2 n [237], [238], [239], subpixel CC [240], and spectral angle error [241], [242]. In addition, some researchers assess the quality of their enhanced HSI by observing the performance of standard classification algorithms on the enhanced HSI as opposed to the GT HSI in terms of overall accuracy and Kappa [243], [244], [245], [246], [247].

D. Basic Concepts of Convolutional Neural Networks (CNNs)
Artificial neural networks (ANNs) are a branch of machine learning, which is in turn a branch of artificial intelligence (AI). ANNs consist of an input layer, hidden layers, and an output layer. When an ANN has one hidden layer, it is known as a shallow neural network; otherwise, it is known as a deep neural network. A particular class of ANNs that is designed to perform image processing tasks is called CNNs, which was first introduced in the 1990 s by Yann LeCunn and Bengio [248], [249]. CNNs require high computational resources and processing power, which computers could not achieve at that time. Nowadays, with the rapid development of technology, CNNs started gaining more attention, especially after a CNN successfully won ImageNet challenge of classifying 1.2 million images in 2014 [250]. CNNs are now used for various other image processing tasks, including object detection, semantic segmentation, and SR. A CNN that performs SR tasks typically consists of a combination of two or more of the following: convolutional layer, activation function, and pooling layer. CNNs can include other types of layers, such as batch normalization. For the context of this article, only the layers relevant to SR will be discussed. For the rest of this article, all operations are assumed 2-D unless stated, otherwise.
Convolution is the product of elementwise multiplication between an image and a filter that consists of one or more kernels. When the filter consists of one kernel, the two terms can be  For an LR image X with a single band of size N × N and a kernel K of size M × M , convolution at pixel position (x, y) can be expressed as following: where F (x,y) is the output feature, X (x+i,y+j) is the input that includes the original pixel and the neighboring pixels within the offset range (i, j), K (i,j) is the weight at location (i, j) that corresponds to the input, b is the bias, and f is the activation function. Some of the most commonly used activation functions are sigmoid and rectified linear activation (ReLU), which are seen in Fig. 3. According to the literature, ReLU is the most suitable activation function for CNNs [251]. The result of the convolution operation is a feature map that summarizes key features of the convolved image. In the case where the image has multiple bands, the filter convolves each band individually. Fig. 4 shows how convolution works on multiband images. A variation of the convolution layer called separable convolution achieves the same outcome with less number of computations, making them more efficient than standard convolution layers. For more details about separable convolution, the reader is referred to [252].
Pooling is another key operation in CNNs, which is the process of downsampling an image by selectively discarding features and preserving the important ones. There are two types of commonly used pooling: max pooling and average pooling. For example, for an LR image X with a single band of size 6 × 6 and a kernel K of size 2 × 2 with stride 2, the max  pooling kernel passes through the image to produce a feature map by preserving only the highest values and discarding the lower ones. The result is as illustrated in Fig. 5. Images lose dimensionality after convolution and pooling unless they are padded. The simplest way of padding an image is by adding zeroes at the border [253]. In addition, convolution and pooling can be reversed using transpose convolution and upsample operations, respectively, which are commonly used in a special type of CNNs called autoencoders. Autoencoders learn spatial mappings from one image to another, and can be repurposed to be used for spatial enhancement. Convolution and pooling operations can be extended to 3-D such that the calculations are performed over the entire HSI cube rather than processing each band individually. 3-D convolution spans all three directions: height, width, and bands. Therefore, 3-D convolution is an adequate solution to accommodate spectral context. For an LR-HSI denoted X of size N × N × B and a kernel K of size M × M × C, 3-D convolution at position (x, y, z) can be expressed in the following: Fig. 6 provides a visual illustration of 3-D convolution. Recently, 3-D CNNs have been commonly utilized and showed effectiveness in HSI-SR, as will be seen in Sections IV-C and V-B. All the aforementioned layers can be connected together in different topologies, such as feed forward [254], skip (or residual) connections [255], attention mechanism [256], and recursive neural networks [257], which can enhance the  performance of the network depending on its purpose either in terms of output quality or computation complexity.

III. METAANALYSIS
A web scraping tool was developed using python programming language to retrieve all the relevant research papers related to HSI-SR. The tool was used to retrieve articles from IEEE Xplore Digital Library related to the following keywords: hyperspectral SR, hyperspectral spatial enhancement, hyperspectral reconstruction, hyperspectral Fusion, hyperspectral SISR, and hyperspectral pansharpening. The tool retrieved various information about the research papers, including title, type of publication (e.g. conference, or journal), authors, keywords, DOI, and publication year. The results were later verified with visual inspection and more entries were added manually from other sources, including SPIE remote sensing conference proceedings and MDPI remote sensing journal. A visual summary of all the retrieved results can be seen in Figs. 7 and 8. From the yearly total numbers of papers in both figures, it can be observed that the interest in HSI-SR increased over the years. It is expected that the number of published papers will increase further by the end of 2022. Furthermore, although HSI-SISR studies have been increasing for the past two decades, Fig. 7 shows that in every year, there is a wide gap between the number of Fusion and SISR studies. This can be attributed to the lack of data disadvantage, which is discussed in more detail in Section VI-B. Also, Fig. 8 shows that the interest in DCNNs for HSI-SR has been gradually increasing since 2017, and the number of publications in DCNN  HSI-SR has been exceeding the number of that in traditional methods for the past 3-4 years. Since SISR methods utilize DCNNs much more than traditional approaches, it seems that there is correlation between the rise of DCNN methods and SISR methods.
According to Fig. 9, the five most used datasets are Pavia University, Washington DC, CAVE, Harvard, and Indian Pines. As for the most used HS sensor, Fig. 10 shows that AVIRIS prevails over other sensors by a large margin, followed by HYDICE, Hyperion, and ROSIS. Even though Pavia University is the most used dataset, AVIRIS remains more widely used than ROSIS because the total number of datasets collected using AVIRIS is more than the number of those collected using ROSIS. In addition, several papers mention using AVIRIS for dataset collection without specifying any of the standard datasets mentioned in this article. This also explains why Hyperion sensor seems to be widely used, but not Botswana dataset. As for the most used metric for quality assessment, Fig. 11 shows that SAM is used at least 22% of the time, which is an expected result, since it is the simplest formula that gives indications of spectral distortion. Other metrics with high percentage of usage include PSNR, RMSE, ERGAS, and SSIM. The word cloud seen in Fig. 12 gives a visual indication of the mostly used terminologies in HSI-SR research literature. Consistent with Fig. 7, the vast majority of the terminologies are related to Fusion methods and their extensions, which further asserts the fact that Fusion methods are currently the center of attention in this research area. Finally, to give a summary of the HSI-SR timeline, Fig. 13 shows the evolution of HSI-SR techniques throughout the years by highlighting the most prominent approaches. This timeline further asserts the fact that DCNNS have been of central interest since 2017. It is worth noting that publications in traditional HSI-SISR techniques stopped around 2016, while the field of traditional Fusion methods remains active.

IV. FUSION
Image fusion is the process of combining information from multiple images, such that the final product reveals more information than the individual input images. The pioneer work in Fusion methods dates back to 1999 [258]. Fusion-based methods combine the observed HR-MSI, and LR-HSI of the same scene. According to the literature, using an LR-HSI with the corresponding HR-MSI image to obtain an HR-HSI has shown promising performance. Most approaches use RGB; therefore, HR-MSI and HR-RGB will be used interchangeably. The existing approaches can be roughly divided into two categories. The first one is to design a specific system based on standard RGB cameras. Exploiting time-multiplexed illumination source, multiple color cameras, and a tube of faced reflectors can be used to complete the reconstruction [259], [260], [261]. However, this method relies rigorously on environmental conditions and extra equipment, which makes it impractical and costly. Therefore, HR-MSI and LR-HSI Fusion is the favorable approach, but it is considered as an ill-posed problem due to the amount of lost information. Nonetheless, image Fusion is still possible due to the existence of high correlation between RGB and their corresponding HS radiance. In this study, image Fusion approaches are divided into three categories: pansharpening, method-based, and deep convolutional neural networks (DC-NNs). Method-based approaches are further categorized into matrix factorization (MF) and spectral unmixing, tensor based, and Bayesian based, as given in Table I.

A. Pansharpening
One example of image Fusion is pansharpening, which transforms an LR-HSI to HR-HSI by fusing it with a PAN band extracted from an MSI. Pansharpening methods can be broadly grouped into four categories: component substitution (CS), multiresolution analysis (MRA), variational methods, and hybrid approaches.

1) Component Substitution (CS):
One of the most widely used CS methods for pansharpening is Brovey transform (BT) [87]. BT is based on spectral modeling and was developed to increase the visual contrast in the high and low ends of the data's histogram, such as shadows, water, and high reflectance areas. It uses a method that multiplies each resampled HS pixel by the ratio of the corresponding PAN pixel intensity to the sum of all the multispectral intensities. It assumes that the spectral range spanned by the PAN image is the same as that covered by the HS bands. The basic procedure of BT first multiplies each HS band by the HR-PAN band, and then divides each product by the sum of the MS bands. In the case of RGB, BT can be described using the following equation: where DN is the digital number (pixel value), DN b1,2,3 are LR-RGB bands, and DN fused1 is the resultant HR band. This equation is repeated for each HS band individually. BT is limited to three bands only, and there are constant suitable weights for each band that are different for each satellite.
Gram-Schmidt (GS) pansharpening technique was first invented by Laben and Brower [88]. A synthetic PAN image is acquired by using a GS mode, of which there are three available. In the first mode, the average of HS bands is taken as a synthetic image. In the second one, low-pass filtered version of PAN image is received as a synthetic PAN image. For the last mode, a synthetic PAN image is achieved by using least square regression analysis. By subtracting this synthetic PAN image from the original PAN image, spatial details are obtained. The extracted details are injected into the HS bands, which are upsampled to PAN resolution in order to generate the pansharpened image. The injection gain factor G k of band number k in GS pansharpening is defined as follows: where cov is the covariance, var is the variance, and I is the intensity component of the HSI. Aiazzi et al. [89] attempted to enhance this approach by introducing GS adaptive (GSA). Intensity hue saturation (IHS) [90] method is a standard CS method that was developed based on the assumption that spectral information is mostly contained within the hue and saturation components, leaving the spatial information in the intensity component. The basic approach in IHS method is to replace the intensity component with HR-PAN image, which is histogram matched to the intensity component, to obtain the spatial detail  matrix. These details are injected into each HS band separately in order to obtain the pansharpened image.
PCA technique [91] transforms the HSI to feature space in order to obtain principal components. The first principal component is assumed to contain most of the energy or most of the spatial information. This term is replaced by the PAN image, which is histogram matched to this component. By using inverse PCA transform the pansharpened HSI is obtained. In 2021, Dong et al. [262] suggested a more effective CS technique by achieving the intensity component and injection gain using binary partition tree and image segmentation. Other works in this area include [92] and [93]. Generally, CS algorithms produce remarkable results in terms of spatial resolution; however, they cause spectral distortions, which is their major drawback. Choi et al. [94] attempted to overcome this drawback regardless of the type of satellite sensor by introducing partial replacement adaptive CS. Nonetheless, MRA-based algorithms have better resistance to spectral distortions. These algorithms are discussed in the next section.
2) Multiresolution Analsis (MRA): MRA-based approaches can be generally conveyed in the following: where Y k denotes the kth band of the upsampled (interpolated) HSI, ⊗ denotes elementwise multiplication, P is the PAN image, and P L is the low-pass version of P. Each MRAbased algorithm extracts P L and G k differently. For instance, smoothing filter-based intensity modulation (SFIM) [95] is a pansharpening technique that obtains P L from P using a linear time invariant low-pass filter, such as averaging filter. For G k , high-pass modulation (HPM) is utilized. The result is then added to each HS band separately.
Other techniques utilize multiresolution image decomposition to obtain P L . For instance, wavelet transform (WT) is used for decomposing an image into its high-and low-frequency components, which is a powerful technique for multiple image processing tasks, including denoising and SR. Thus, there is a variation of WT that can be used for pansharpening. In the first step, histogram matching is performed between the PAN image and each HS band. Afterward, WT is applied to the resulting histogram-matched PAN image. The result of this decomposition is categorized into image details extracted through high-pass filtering, and an approximation image extracted through low-pass filtering. Only the approximation part is considered, and the other decompositions are set to zero. Hence, inverse WT yields low-resolution PAN image. This result is then subtracted from the original PAN image to create the detail matrix. Pansharpened bands are finally obtained by adding this matrix to each HS band [96]. This approach has been explored and expanded by several studies in the literature [97], such as decimated WT using additive injection model (Indusion) [98], additive A trous wavelet transform [99], A trous wavelet transform using the Model 3 [100], and additive wavelet luminance proportional [101].
Similar to WT, Laplacian pyramid (LP), which is derived from Gaussian pyramid, also decomposes the image into its high-and low-frequency components. LP was improved into what is known as enhanced LP (ELP), and then generalized LP (GLP) was derived from ELP. GLP is used as a pansharpening technique by extracting the PAN image details from LP, and then injecting the details into an upsampled version of the HSI. This method has been extended to GLP with modulation transfer function matched filter (MTF-GLP) [102], GLP with MTF-matched filter and context-based decision injection scheme (MTF-GLP-CBD) [103], and Gaussian MTF-matched filter with HPM injection model (MTF-GLP-HPM) [99], [104].
MRA methods perform well in terms of robustness and efficiency, but suffer from aliasing effect and spatial details distortion [108], [263].
3) Variational Methods: The first variational model framework known as "P+XS Image Fusion" was introduced in [105]. This approach uses the assumption that the geometry of spectral bands of HSI is related to the topographic map of the corresponding PAN image. The goal is to minimize the energy function, which comprises of three components, by using gradient descent algorithm. This variational framework carries out by minimizing the sum of integrals HSI and its low-pass filtered version and tangent vector multiplied gradient of HS bands, and also the integral of the sum of subtraction the PAN image from alpha values multiplied HS bands.
where Y k represents the kth band of the pansharpened image Y, and θ denotes the normal vector field of the PAN image. The first component, E g , that must be minimized is the spatial fidelity term, which is expressed as where Ω ⊂ R 2 denotes an open, bounded domain with a Lipschitz boundary. The second component, E r , that must be minimized is based on the image relation hypothesis. This hypothesis states that PAN image is the product of weighted sums of each MSI band, where each weight corresponds to the energy of the respective spectral band. Thus, the second component is expressed as follows: where α are the weighting coefficients of each band, and P denotes the PAN image. The final component that must be minimized, E f , is based on the assumption that the LR image is the product of convolution between the HR image and a low-pass filter. Thus, this component is expressed as follows: where S is a Dirac's comb defined by the grid S, K k is the convolution kernel of the low-pass filter, and X k represents the kth band of the LR-HSI. All three components are finally expressed as follows [105]: where γ k , λ, η > 0, γ k allows controlling the relative weight of each band, and λ and η allow controlling weights of each component of the equation.
The most notable methods that extended this framework are nonlocal variant, and nonlocal variant with band-decouple, both proposed by Duran et al. [106], [107], respectively. In 2018, Huang et al. [108] enhanced this framework by proposing a variational pansharpening method for HSI constrained by spectral shape and GS transformation. First, Huang et al. [108] utilized the spectral shape feature of the neighboring pixels with a new weight distribution strategy to reduce spectral distortions caused by the change in spatial resolution. Then, the correlation fidelity term uses the result of GSA to constrain the correlation, thereby preventing the low correlation between the pansharpened image and the reference image. Then, the pansharpening is formulated as the minimization of a new energy function, which produces the final pansharpened image. Huang et al. [108] claimed that this method outperforms GSA, guided filter PCA, MTF, SFIM, intensity modulation, the classic and the banddecoupled variational methods. These methods do not limit the number of bands, but suffer from high computational cost and large spectral distortion. Other works in this area include [109], [110], and [111]. 4) Hybrid Approaches: One of the main challenges for fusing LR-HS and HR-PAN/RGB data is to find an appropriate balance between spectral and spatial preservation. Hybrid approaches use mixture of CS and MRA methods to overcome this challenge. Liao et al. [112], [113] designed a guided filter in the PCA domain (GFPCA). Instead of using CS, which may cause spectral distortions, GFPCA uses a high-resolution PAN/RGB image to guide the filtering process aimed at obtaining SR. In this way, GFPCA does not only preserve the spectral information from the original HSI, but also transfers the spatial structures of the high-resolution PAN/RGB image to the enhanced HSI. GFPCA first uses PCA to decorrelate the bands of the HSI, and to separate the information content from the noise. The first PCA channels contain most of the energy of an HSI, and the remaining PCA channels mainly contain noise. When GF is applied to these noisy channels, it amplifies the noise and causes a high computational cost in processing the data, which is undesirable. Therefore, guided filtering is used to enlarge only the first PCA channels, preserving the structures of the PAN/RGB image, whereas bicubic interpolation is used to upsample the remaining channels.

B. Method-Based Fusion 1) Matrix Factorization and Spectral Unmixing:
Spectral unmixing is the procedure by which the measured spectrum of a mixed pixel is decomposed into a collection of constituent spectra, or endmembers, and a set of corresponding fractions or abundances, that indicate the proportion of each endmember present in the pixel [114]. The basic principle of MF [115] is to associate the Fusion problem with "linear spectral unmixing"; the data can be described by a linear combination of spectral signals, also called reflectance function basis. Each signal uniquely corresponds to a material present in the scene. Spectral unmixing refers to the problem of finding the number of endmembers in an HSI, their spectral signatures, and their perpixel abundances. It is the inverse of spectral mixing described as follows: where r j is the spectral vector expressed by a linear combination of several endmember vectors h, p is the number of endmembers in the image, L is the number of pixels, and w ij is a scalar representing the fractional abundance of endmember vector h j in the pixel r i . H is of size B × p mixing matrix, where B is the number of bands and p L. An example of this approach for HS unmixing is demonstrated in [116], where the authors proposed coupled nonnegative matrix factorization (CNMF) for HS and MS data Fusion, and studied its effect on HSI classification. Their approach unmixed both sources of data to find the signatures and abundance of the endmembers, as described earlier. The relationship between low and high spatial resolution in HS can be described as follows: where D s is the spatial transform matrix and E s is the residual error. D s is determined by image registrations and estimation of PSF. Similarly, the relationship between low and high spectral resolution in MS can be described as follows: where U is the high spectral resolution MSI, Z is the low spectral resolution MSI, D r is the spectral transformation matrix, and E r is the residual error. D r is derived from radiometric calibration to obtain spectral response function (SRF). With references to (18), HS and MS can be expressed as follows: where W and H are abundance and endmember matrices, respectively, and E Y and E U are residual error matrices. NMF spectral unmixing is commonly performed to minimize the squared Frobenius norm of the residual matrix in the linear spectral mixture model expressed as This principle can be applied to estimate the upsampled HSI. Assuming HS and MS images capture the same scene, their endmembers should be the same, and the abundance map of LR-HSI data should match that of MSI. The abundance matrix can be extracted from the MSI, and then used to enhance the spatial resolution of HSI [116], [117], [118], [119]. HR-HSI can then be approximated as CNMF for HSI-SR was extended in [120] to test its effect on target detection. Yokoya and Iwasaki [120] experiment showed that CNMF can restore pure spectra, which contributes to accurate target detection. Despite the effectiveness of CNMF approach, the obtained solution is usually not unique, which can lead to an unsatisfactory outcome. Licciardi et al. [121] attempted to extend the aforementioned methods where endmembers are extracted directly from downsampled HSIs. The derived endmembers are used as an input to an unmixing algorithm applied to the MSI. The obtained abundances are then used to reconstruct the HR-HSI. Another approach that uses spectral unmixing is depicted in [122]. The method is split into two stages: spatial upsampling and spectral substitution. Spatial upsampling is done by estimating an optimal linear combination on exemplar patches for SR reconstruction, followed by evaluation using learned local spectrum dictionary. This approach unmixes HS observation within a pixel using guidance from HR-RGB image. In the spectrum substitution stage, sparse coding is adopted. This stage refines the spectrum obtained in the first stage based on the limited materials assumption within a local region of a scene. Other variations of NMF include sparse nonnegative matrix factorization [123]. HSI can be unmixed using various mathematical functions, learning algorithms, and probabilistic frameworks, such as K-SVD [119], Bayesian sparse representation (BSR) [124], [162], HySure [125], [126], maximum a posteriori (MAP) [127], [128], and GSOM+ [129]. An approach similar to [116] was followed in [118], but the mixing matrix was replaced with a dictionary learned using a nonnegative matrix factorization with sparsity regularization code. Another sparse representation dictionary learning method was used in [117], where two dictionaries were learned from the HSI and MSI, and then dictionary-pair learning method was used to establish correspondence between them. Motivated by the successful applications of sparse representation, Dong et al. [130] proposed a nonnegative structured sparse representation (NSSR) approach for taking consideration of the spatial structure. Dong et al. [130] then conducted optimization procedure with the alternative direction multiplier method (ADMM) technique [264]. NSSR achieved a large margin on HSI recovery performance compared with the other state-of-the-art approaches. A similar approach was devised recently by Xue et al. [131] and Xu et al. [132]. The effectiveness of dictionary approaches strongly depends on how these dictionaries can be obtained. Lanaras et al. [133] proposed HSI-SR by integrating coupled spectral unmixing strategy into HSI-SR and conducted optimization procedure with the proximal alternating linearized minimization method. Other works in this area include [134], [135], [136], [137], [138], [139], [140], [141], [142], [143], [144], [145], and [146]. One shortcoming of this method and MF approaches in general is that they require good initial points of the two decomposed reflectance signatures to provide satisfactory results. Furthermore, most work generally assumes that the number of the pure materials in the observed scene is smaller than the spectral band number, which is not always satisfied in the real application. In addition, MF approaches generally suffer either from spectral distortions or high computation time.
2) Tensor-Based Approaches: A tensor is considered as a generalization form of a matrix. Tensors can be used in the context of HSI-SR by addressing the nonuniqueness of tensor rank. Imposing prior information or regularization are examples of ways to answer this nonuniqueness [78], [147]. HSI are low rank and self-similar [126], [148], [162]. Therefore, applying low-rank regularization on the core of a tensor addresses this nonuniqueness and avoids the necessity of obtaining an exact value of tensor rank. A tensor-based observation model can be expressed as follows: where D 1 and D 2 denote the degradation matrices of the spatial resolution, which can be constructed by downsampling Toeplitz matrix if the PSF is known. D 3 represents the degradation in the spectral resolution, which can be constructed if the SRF of the HS and MS sensors are known. Using this model, tensor decomposition can then be used to estimate Y. For instance, following Tucker decomposition, Y can be decomposed as follows: where T is the decomposed core tensor, and E z is the error term.
Consequently, the Fusion model can be formulated as follows: Prior information must be incorporated in order to regularize (27). For instance, Ma et al. [149], took advantage of the similarities between adjacent bands as well as neighboring pixels, and imposed graph regularization on spatial and spectral matrices separately to minimize the effects of distortion. Another example is demonstrated by Dian and Li [265], where they proposed a low tensor multirank regularization method that exploits high correlation among spectral bands, as well as nonlocal spatial similarities. Ding et al. [150] followed a similar strategy, but instead of Tucker decomposition, they use coupled tensor LL1based decomposition framework to estimate HR-HSI due to its connection to linear mixture models, as suggested by Qian et al. [151]. Li et al. [148] considered HR-HSI as a 3-D tensor, and formulated the Fusion problem by estimating the core tensor and three dictionary modes through coupled sparse tensor factorization approach. They also incorporate a regularizer to model the high spectral-spatial correlations. Other examples include [152], [132], [153], [154], [155], [156], [157], [158], [159], [266], [267], [268], [269], [270], [271], [272], and [273].

3) Bayesian-Based Approaches:
The first known approach that utilizes Bayesian Fusion was devised by Zhang et al. [160] in 2009. The Fusion framework of this approach takes place in the wavelet domain, and is referred to as wavelet MAP. Zhang et al. [160] assumed additive noise imaging model for the HSI, and interpolation is used as a priori to bypass the need to estimate the spatial degradation operator and perform SR in a blind manner. Performing MAP [127] to approximate the enhanced image in the wavelet domain rather than the spatial domain allows for scale-specific and subband-specific estimations. The authors compare their approach to spatial domain estimation, in addition to some of the most commonly used pansharpening techniques. Another blind Bayesian-based approach is HySure devised by Simões et al. [125], [126].
Most optimization-based approaches rely on explicit parameter turning for each different dataset or sensor. To solve this problem, Akhtar et al. [161] attempted to avoid this problem by utilizing nonparameteric BSR over four stages. First, the probability distributions and the proportions of the material reflectance spectra are extracted. Second, a dictionary is estimated and transformed based on the spectral quantization of the HR-MSI. Third, the sparse codes of the HR-MSI are computed by using the proposed Bayesian sparse coding. Finally, the HR-HSI is estimated using the dictionary from the second stage and the sparse coding from the third stage. Another BSR approach was devised in [162], where the authors learn dictionaries from the observed HSI and MSI, and then solve the optimization problem with respect to the target image and the sparse code by using split augmented Lagrangian shrinkage algorithm (SALSA) [163], which is an instance of ADMM. According to Wei et al. [162], "SALSA enables a huge nondiagonalizable quadratic problem to be decomposed into a sequence of convolutions and pixel decoupled problems, which can be solved efficiently." This approach outperforms MAP [127] and wavelet MAP [160].
Fusion approaches assume that the HSI and MSI are perfectly coregistered, which is an impractical assumption. In an attempt to overcome this limitation, Bungert et al. [164] devised a blind Bayesian approach with directional total variation that is robust against imprecise registrations between the HSI and MSI.

C. Deep Learning-Based Fusion
DCNNs have recently shown great success in various image processing and computer vision applications. DCNNs have also been applied to RGB image SR and achieved promising performance [34], [274], [275]. Since the correlation between MSI and HSI is highly nonlinear, DCNNs have high potential to achieve HR-HS with high accuracy if HR-RGB image is used. Some researchers in the literature focused on utilizing DCNNs to obtain HR-HSI from its LR version only [276], an approach known as SISR. In this case, the CNN is known as spatial-CNN. However, their enhancement factor is limited to 8 at maximum compared with using observed RGB or, more generally, MS data. HSI require a much higher enhancement factor (e.g., 32). Further elaboration on this can be found in Section V-B. Other researchers improved the spectral resolution of LR-RGB images using DCNN, in which case it is known as spectral-CNN [34], [277]. This approach ignores the HS attribute that correlates the narrowband and spectra, which leads to unsatisfactory results.
Spatial-DCNN and spectral-DCNN improve the image in one dimension only. Therefore, it is desirable to design a network architecture that performs enhancements in both dimensions in order to generate an HR-HSI. This will be explored in the next sections from supervised and unsupervised learning perspectives.
1) Supervised: Traditional pansharpening approaches, albeit primitive, can be elevated using DCNNs as well. A notable example of such case is deep HSI sharpening model demonstrated in [165], which learns image priors and incorporates them into the Fusion framework. First, the HR-HSI is initialized by solving Sylvester equation. Then, a one-to-one mapping between the initialized HR-HSI and the reference HR-HSI is learned via deep residual CNN. The priors learned from this network are then utilized in the Fusion framework to obtain the final estimated HR-HSI. This approach shows superiority against traditional pansharpening and MF approaches. A spatial and spectral Fusion network (SSF-Net) for HR-HSI reconstruction was proposed in [166]. The results of the network were promising in spite of the simple concatenation of the upsampled LR-HSI and the HR-RGB image. However, the upsampling of the LR-HSI and the simple concatenation cannot effectively integrate the existing spatial structure and spectral property without high computational cost. In addition, precise alignment is needed for the input of LR-HSI and HR-RGB images, and it is extremely difficult to attain due to the large difference of spatial resolution in the LR-HSI and HR-RGB images.
In 2019, inspired by the success of SSF-Net, Han et al. [167] devised multilevel and multiscale SSF-Net (MS-SSFNet), which fuses LR-HSI with HR-RGB. The authors' proposed that DCNN relies on the gradual reduction of the feature sizes of the HR-RGB while increasing the feature sizes of the LR-HSI. Furthermore, DCNNs often suffer from vanishing gradient problem during the training, and the authors alleviate this problem by integrating multilevel cost functions into MS-SSFNet architecture. Other works in this area that also tackle Fusion with spectral-spatial context include [168], [169].
2) Unsupervised: Supervised learning algorithms for image Fusion require a large size of HSI dataset perfectly registered with their MSI counterparts, which is unrealistic. Unsupervised learning offers the possibility to bypass this limitation, as it has the potential to achieve remarkable results with small datasets compared with supervised learning approaches. Qu et al. [291] were the first to attempt this task for HSI-SR using CNN. Their network consists of two encoder-decoders that are coupled by the same decoder in order to preserve spectral information. Sparse Dirichlet distribution naturally covers the physical constraints of HSI and MSI. This allows minimizing the angular difference between HSI and MSI representation, which reduces spectral distortions. The resulting network is referred to as uSDN. As previously mentioned, one of the major challenges that faces this network and image Fusion in general is the assumption that HR-MSI and LR-HSI are accurately registered. The performance of the Fusion typically relies on the registration accuracy. Therefore, Qu et al. [292] attempted to overcome the shortcomings of uSDN by projecting both HR-MSI and LR-HSI into the same statistical space. This representation is assumed to follow Dirichlet representation as well. The authors also exploit mutual information (MI) between both images to capture any nonlinear statistical dependencies between them. Maximizing MI leads to maximizing spatial correlations, which leads to minimizing spectral distortions. The authors test their approach on CAVE and Harvard datasets using ERGAS, PSNR, and SAM evaluation metrics. It can be observed that this approach offers an advantage over conventional Fusion methods as well as uSDN.
In a similar approach, Lei et al. [293] took advantage of image prior and utilized it for unsupervised learning that consists of two-stage SR. The first stage is a Fusion model that is pretrained on synthetic data to generate a general spatial-spectral HSI prior. The second stage is a degeneration model that makes the general HSI prior more specific, which is trained in an unsupervised way. The algorithm performance shows superiority against traditional Fusion algorithms, as well as the results demonstrated in [294] and [168].
Another unsupervised approach is demonstrated in [295]. The authors utilize spectral unmixing and obtain the initial value of high-resolution abundance by interpolation. The exact interpolation method is not mentioned. A deep learning model is trained by utilizing the prior knowledge of hyperspectral and multispectral images to optimize abundances and target HSI. The authors test their algorithm using CAVE and Harvard datasets, but they argue that their method should be tested on remote sensing datasets as well because they are more complicated and may require a stricter model to deal with atmospheric influences. The authors also mention the model's inability to generalize across different HSI datasets.
Li et al. [296] attempted to overcome the limitations of relying on prior knowledge and large datasets. They propose a deep learning architecture that is capable of adaptively learning degradation models. The architecture is mainly divided into three stages. These stages effectively learn spatial and spectral transformations. The authors test their algorithm against uSDN  [133]. (c) CNMF [116]. (d) NSSR [130]. (e) HySure [125]. (f) uSDN [291]. (g) MIAE [298]. and some of the traditional fusion methods, such as CNMF and HySure, and their approach shows superiority when tested on Houston and Chikusei datasets. Inspired by the recent success of unsupervised DCNNs, Liu et al. [297] embedded NMF into their approach and developed a model inspired autoencoder for unsupervised HSI-SR. NMF's task is to preserve the intrinsicality of the estimated HR-HSI, such that the autoencoder takes each individual HSI pixel as an input sample for the encoder side, and outputs spectral and spatial matrices at the decoder side. However, the value of the input pixel is unknown, so the LR-HSI and the HR-MSI are used as inputs in a pixelwise manner that is solved by using gradient descent. The loss function of the autoencoder is formulated based on spectral and spatial degradations. Instead of assuming this degradation as shallow priors, the authors propose an additional blind estimation network to estimate the PSF and SRF. The approach outperforms traditional Fusion approaches in addition to uSDN [291], as seen in Fig. 14.

V. SINGLE IMAGE SR
The mathematical formulation of SISR can be constructed using an observation model, which is expressed as follows: where D is the downsampling operation, G is the blurring filter, and E is the additive noise. HR-HSI can be estimated by minimizing the Forbenius norm of the difference between Y and Y, as follows: The past two decades have witnessed impressive advances in this area of research [307]. This section discusses these advances, starting from the simple ones and building up to more sophisticated approaches. Similar to Fusion methods, SISR approaches that do not assume prior knowledge regarding the degradation kernel are referred to as "blind SISR."

A. Traditional Methods
SISR methods rely mainly on the observed LR-HSI and do not require auxiliary MSI. Some SISR make use of prior assumptions to reconstruct HR-HSI. The earliest HSI-SR method and the pioneer in this field is the work proposed in [80]. Akgun et al. [80] proposed a system for capturing HSI, and based on the proposed system, they design HSI-SR framework as an inverse problem. Assuming that the degradation kernel is known, projection onto convex sets (POCS) [308] can be used to estimate the HR-HSI, such that when the estimated HR-HSI is degraded, it will give a result identical to the observed LR-HSI. The proposed POCS gives more accurate results and the more additional constraints can be added from prior information or assumptions. POCS can also be used to estimate the PSF of LR images, which aids the estimation of the HR counterpart, as seen in [81].
Regularization-based methods, also referred to as reconstruction-based methods, reconstruct HR-HSI from LR-HSI in addition to prior assumptions. For instance, Villa et al. [75] utilized spectral unmixing for SISR-HSI. They extract the end members using vortex component analysis. Then, they use fully constrained least squares (FCLS) algorithm to determine the abundance fraction of the endmembers within each pixel. Afterward, each pixel is divided into subpixels according to the required scale factor. The authors assume that each endmember is spatially close to the same family of endmembers in the surrounding pixels. Based on that, they chose simulated annealing as a mapping function that minimizes the perimeter of the areas that belong to the same endmember. The authors test the effectiveness of their method by comparing the classification map of the enhanced HSI to that of the GT. Another reconstruction-based approach was adapted in [76]. The authors present maximum a posteriori-Markov random fields (MAP-MRF)-based approach. Similar to [75], the first step is to extract the endmembers, and then estimate the abundance maps using FCLS. The reconstruction is performed on the estimated abundance maps using MAP-MRF. The authors consider this approach an improvement to their previous one presented in [77]. Other examples include [244], [309], and [310].
Tensor-based approaches, while predominantly used in Fusion methods, can also be used for SISR. For example, Wang et al. [311] argued that HSI can be modeled as a 3-D tensor to exploit global (spectral) correlations between HSI bands in addition to local (spatial) correlation among HSI patches. These correlations can be modeled by a nonconvex low-rank tensor, which is an optimization problem that can be solved using local linear approximation and ADMM. The authors' approach show superiority against spectral unmixing analysis and various interpolation approaches. Similar examples can be found in [78], [79], [312], and [313].
Some SISR approaches rely on SRM, a concept that was first introduced in [82]. SRM is a technique based on spectral unmixing to estimate the class composition of image pixels. Then, by using algorithms to show how these classes are dispersed spatially within the neighboring area represented by the pixel, the spatial resolution is enhanced. According to Atkinson [314], these approaches can be divided into two categories: optimization and learning based. A dictionary-learning example is presented in [83]. The authors propose a multidictionary-based sparse representation approach, where the proposed feature vector expresses the significant information about spatial dependence. Consequently, multiple distribution dictionaries are learned via sparse representation. The feature vector is then reconstructed by every dictionary. It is also assigned to a class according to reconstruction errors and spectrum distortions. The authors assert that learning-based SRM is robust to noise. Their approach also avoids overfitting problems that can be potentially encountered with neural networks. The recent SRM for HSI-SISR approaches are utilized in conjunction with DCNNs, such as [84], [85], and [86].
SISR approaches that do not require auxiliary MSI and do not impose prior assumptions, especially if the blurring kernel is an unknown function, are referred to as blind SISR. The simplest blind SISR approach in the literature is interpolation. Interpolation is a term that can be used interchangeably with resampling, in the sense that it involves transforming an image from one coordinate system to another. The accuracy of the interpolation depends on the selection of a proper interpolation kernel. Some of the most common interpolation methods include nearest neighbor, bicubic, and bilinear interpolation [71], and others [72], [73], [74]. Even though interpolation methods are widely used in commercial software, they are not favorable because they introduce artifacts and blurriness, and they introduce spectral distortions in HSI. Nonetheless, there are several examples in HSI-SR research where bicubic interpolation is used as a benchmark for performance comparison, or as an initial step in the designed approach [253], [297]. The vast majority of blind SISR approaches are performed using DCNNs, which have been the predominant approaches in HSI SISR from 2017 onward, as will be discussed in the next section.

B. Deep Learning-Based SISR
As discussed in Section II-D, CNNs consist of automatic feature extractors that omit the requirement of having manual hand-crafted features or human intervention. Instead, CNNs are capable of learning one-to-one mapping between an LR image and its corresponding HR version. In the case of natural images (e.g., RGB images), many efforts were exerted to improve their spatial resolution via DCNNs. Some of the most prominent methods include SR convolutional neural network [315], very deep SR [316], [317], super resolution generative adversarial network (SRGAN) [318], enhanced deep residual networks for single image SR [319], UNet [320], and autoencoders [321], [322]. Unlike natural images, HSI can suffer from spectral distortions upon spatial enhancement. Therefore, HR-HSI CNNs must be developed while taking spectral context into consideration.
Some algorithms develop SISR DCNNs while taking inspiration from Fusion methods to minimize spectral distortions. For example, Yuan et al. [36] used transfer learning technique to repurpose SR DCNN that was originally trained for natural images. In addition, they utilize CNMF to capture the spectral relation between spectral bands. The authors compare their results to Fusion methods as well as interpolation methods, and it shows superiority in terms of RMSE, PSNR, SSIM, ERGAS, UIQI, and SAM. Some authors argue that 3-D CNNs capture spectral information better than 2-D CNNs and, thus, they use 3-D convolution as their primary approach for blind HSI-SR. An example of this can be seen in [37], where the authors used a 3-D full CNN (3D-FCNN) to learn both spatial and spectral correlations simultaneously. They further extended their work and improved their algorithm by including one extra convolution layer [38]. This method is sensor specific, as it is a way to avoid the necessity of having a large dataset. Therefore, this method works on images acquired by the same sensor only. For instance, Pavia Center and Pavia University datasets were acquired by ROSIS sensor, so the algorithm needs to be trained for one of them only. A similar approach was adapted in [39] by using 3D-FCNN with residual connections to enhance spectral and spatial characteristics simultaneously. Another approach that utilizes 3D-FCNN was explored in [40]. The authors decomposed LR-HSI into three groups of wavelet coefficients according to their frequency similarity, and hence, the network is referred to as frequency separated 3DCNN. This is done to suppress spectral distortion while maintaining the high-frequency information. The feature cubes are extracted using 3-D convolution and the details are reconstructed by 3-D deconvolution. The final HR-HSI is obtained by inverse wavelet transformation. This method shows superiority against 3D-FCNN and bicubic interpolation in terms of PSNR, SSIM, and SAM. Some researchers use attention mechanism to amplify the important features extracted from 3-D convolution. For instance, Dou et al. [41] devised a 3-D attention-based SRGAN (3DASRGAN) that utilizes SAM as a part of the loss function to guarantee the minimization of spectral distortions. 3DASRGAN prevails over bicubic interpolation, 3D-FCNN, and the original SRGAN. The loss function is an important aspect of CNNs that directly affect spectral fidelity. Other examples of incorporating SAM within the loss function can be seen in grouped deep recursive residual network devised in [42].
All of the aforementioned HSI-SISR approaches suffer from data scarcity. That is, HSI datasets exist as a single scene, which is insufficient to train DCNNs. In SISR approaches, an HSI scene is often divided into patches to train a DCNN. For instance, dividing Botswana dataset into patches of 64 × 64 yields 92 patches of which 70% is typically used for training and the remaining for testing. Even though this is the most common approach, data scarcity is still a challenge that needs to be overcome. In addition, the aforementioned HSI-SISR approaches perform well on homogeneous datasets that are captured by the same sensors, and cannot generalize across different sensors. In 2021, Li et al. [69] addressed the problem of HSI's high dimensionality and scarcity of training samples, which result in undesirable behaviors, such as overfitting. Their work is built based on the concept that there exists high correlation between HSI and their corresponding RGB. Thus, RGB and HSI can be trained jointly, such that RGB-SISR can provide additional supervision. This approach minimizes the amount of HSI dataset required for training, and provides the ability to be applied on heterogeneous datasets. In addition, the authors also devise a novel augmentation algorithm called spectral mixup to increase the amount of training samples. The most recent version of this approach was published by the same authors in 2022 [70]. This method outperforms [42], [43], [44], and [46]. It is worth mentioning that this is the only research work that addresses the problem of data scarcity for HSI-SISR.
The main drawbacks of the aforementioned algorithms are limiting their experiments to scaling factors of ×2 and ×4. In [44] and [45], the scaling factor goes up to ×8 at most, and HSI-SR requires higher scaling factors in order to be put into practical use. Furthermore, SISR techniques have no unsupervised approaches associated with DCNNs, as the only semisupervised approach that tackles data scarcity problem is the one mentioned in [69] and [70].

VI. CHALLENGES AND FUTURE DIRECTIONS
In order to define the future direction of this field of research, a proper analysis of the existing challenges must be done. These challenges will be discussed in terms of each method's drawbacks that must be overcome in order to lead to successful breakthroughs in the field of HSI-SR.

A. Fusion Challenges
The general drawback of all Fusion methods is their dependency on prior knowledge, as the quality of the outcome is conditioned upon the existence of HR-MSI that is accurately coregistered with LR-HSI. With regard to pansharpening approaches, they all suffer from spectral distortions in various degrees. That is, CS techniques suffer from spectral distortions the most, followed by MRA, and then variational methods [333]. Even though MRA and variational methods cause less spectral distortions, they fail to reduce spatial distortions compared to CS. Generally, there is always tradeoff between spectral distortions and spatial distortions. Hybrid approaches achieve balance between spatial and spectral distortions. Method-based approaches are able to overcome the barrier of spectral distortion and achieve remarkable balance between spatial and spectral preservation better than hybrid approaches. However, the main issue with these methods is that they require prior knowledge about the degradation kernel and/or PSF and SRF. The drawbacks of DCNN methods are different for supervised and unsupervised approaches. Supervised approaches require a large amount of data, and the performance often end up being data dependent. That is, the trained network performs well on data captured by a certain sensor, but may not achieve the same performance given data from other sensors. Meanwhile, unsupervised methods do not require a large dataset, but they still impose impractical assumptions and require prior knowledge.

B. SISR Challenges
SISR omit the need for an accurately coregistered HR-MSI, but they have their own set of challenges. For instance, PCOS assumes prior knowledge and it is highly sensitive to noise [334]. In addition, the limitation of regularization methods differs between stochastic and deterministic techniques. The former mostly suffer from high computational cost, and the latter mostly suffer from noise sensitivity and artifacts [335].
As mentioned in Section V, SRM technique is based on spectral unmixing. The effectiveness of this method relies solely on the accuracy of subpixel classification map. If it is inaccurate, the resulting spatial and spectral resolutions will be distorted as well.
Interpolation drawbacks depend on the exact methodology used. Here, bicubic and bilinear interpolation will be discussed, as they are the most used ones for HSI-SR. Since both methods are designed to operate in 2-D, they do not preserve the spectral signature of HSI. Furthermore, bilinear interpolation causes blurriness and artifacts in the outcome. This effect is less noticeable in bicubic interpolation, but it is still present.
As for DCNN-based SISR methods, they suffer from two main shortcomings. The first one is data scarcity. The available datasets at the moment consist of only one HS scene, which is not enough to train a DCNN. The common approach is dividing the scene into patches, which can produce a few dozens of HSI to use for training, validation, and testing. This is still considered a small amount of data and can lead to overfitting during the training procedure. The second shortcoming is similar to that of DCNNs used for Fusion, which is the networks' inability to generalize.

C. Summary and Recommendations
It is worth observing the quantitative results obtained by the recent studies. To narrow it down, the results are studied for the two most used datasets: Pavia University and Washington DC, and three of the most used metrics: PSNR, SAM, and ERGAS. Fig. 16 lists the span of PSNR for Pavia University and Washington DC datasets obtained by the following studies [116], [145], [146], [147], [186], [190]. The PSNR overall ranges between 26 and 43 dB approximately. On average, an acceptable PSNR value that can compete with state-of-the-art results would be around 37 dB. A large span of PSNR shows that the obtained result highly depends on certain conditions, or the dataset itself. Study [146] is an example of such large span.     17 shows comparisons between SAM results for the same studies with the same datasets. The worst overall value is 5.57, while the best is 1.78. An acceptable SAM value would be on average 3.7. Similar to PSNR case, SAM results can also show inconsistencies across different datasets. For instance, study [186] shows vastly different SAM results for Pavia University and Washington DC datasets.
Comparisons between ERGAS values shown in Fig. 18 indicate that most studies show consistency in the results obtained for different datasets, with the exception of study [145] showing large ERGAS value for Washington DC. Such value is  19. Analysis summary of Fusion methods, according to [194] findings. considered as an outlier. Generally, an acceptable ERGAS value is about 2.4 on average.
Since there is no ideal method for enhancing HSI, the best method can be chosen according to the purpose of enhancement and available information. For example, if a large scale factor is required and a corresponding MSI or PAN band is available with precise coregistration, then Fusion is the favorable approach. In that case, if there exists plenty of data, supervised DCNN approaches can be recommended; otherwise, traditional or unsupervised methods would be favorable. In cases where no auxiliary image is available for Fusion, SISR is the only option that can be followed, and it can produce decent outcomes as long as the scale factor does not exceed 4. Whether to use supervised DCNNs or traditional approaches depends on the size of dataset. For small datasets, traditional approaches are more suitable than DCNNs.
Based on the discussion presented in Sections VI-A and VI-B, the overall drawbacks of all the studies discussed in both Sections IV and V are summarized in Table III. Some of the drawbacks are specific to a particular method, whereas other methods share similar drawbacks. According to a detailed quantitative analysis of Fusion methods conducted by Luncan et al. [194], Fusion methods that successfully minimize spectral distortions in the enhanced HSI suffer from high computational complexity. Their findings can be summarized in Fig. 19. By looking at both Table III and Fig. 19, it can be concluded that the most important drawbacks to consider while designing a new algorithm are spectral distortions and computational complexity for Fusion methods, and scaling factor, which needs to exceed 8, in addition to data scarcity for SISR methods.

VII. CONCLUSION
This article presented a thorough literature review of HSI-SR approaches by studying more than 200 papers published within the last two decades. In summary, HSI-SR approaches can be viewed from different perspectives, but the most common one is the broad categorization of Fusion and SISR. Fusion methods mostly use auxiliary MSI and may impose further prior knowledge. On the other hand, SISR do not require a supplementary MSI. Furthermore, algorithms that do not assume prior knowledge are known as blind SR. The common objective of these various approaches is to improve HSI spatially without compromising its spectral resolution. The quality of the outcome is commonly measured using spatial metrics, such as PSNR and SSIM, and spectral ones, such as SAM and ERGAS. Each approach to HSI-SR has its own advantages and disadvantages. In order to achieve a breakthrough in this area of research, at least one of these disadvantages, such as exceeding scale factor of 8 for SISR, need to be overcome. In addition, Fusion approaches seem to suffer from a tradeoff between spectral distortion and computational complexity.
There are several HSI datasets for testing and benchmarking. ICONES dataset is worth investigating, as it is recent, massive, and has not been used in research papers thus far. For both Fusion and SISR directions, DCNNs have taken over the field during the past four years, and it is expected that they will continue to do so since evidence proves their effectiveness over traditional algorithms. Furthermore, the number of studies in the area of Fusion far exceeds those in the area of SISR. This has been an ongoing trend since 2002, which points to a gap in the direction of SISR particularly. Finally, generalization imposes one of the greatest restrictions on HSI-SR, as algorithms typically perform well on a particular dataset or a particular type of sensor, but falls short when it comes to others. This is a crucial challenge to take into consideration in the future HSI-SR studies, especially with respect to data scarcity. Jaime Zabalza (Member, IEEE) received the M.Eng.