Hyperspectral Image Joint Super-Resolution via Local Implicit Spatial-Spectral Function Learning

Hyperspectral image (HSI) super-resolution (SR) in both spatial and spectral dimensions is one of the most attractive research topics in HSI processing. Although recent advances in deep learning (DL) frameworks have greatly improved the performance of spatial-spectral SR reconstruction, existing methods learn discrete representations of HSI, ignoring real-world signals' continuous nature. Recently, Implicit Neural Representation (INR) has been applied to 3D surface reconstruction and image SR for continuous representation and has attracted increasing attention. In this paper, we propose the Local Implicit Spatial-spectral Function (LISSF), which learns a local continuous representation of high spatial resolution hyperspectral images (HR-HSI) from the discrete inputs. The model consists of a deep feature encoder and a spatial-spectral intensity decoder. The encoder converts the low spatial resolution multispectral image (LR-MSI) into deep features and the decoder predicts the intensity values at the given coordinates as output. Since the spatial-spectral coordinates are continuous, LISSF can achieve spatial-spectral SR in arbitrary scales, even extrapolating to higher resolutions not covered by the training data. Extensive experiments on spatial-spectral SR, spatial SR, and spectral SR demonstrate that LISSF can achieve superior performance in comparison with state-of-the-art methods. Moreover, ablation studies are performed on the effects of individual components of LISSF.


I. INTRODUCTION
HYPERSPECTRAL images (HSIs) contain reflectance or transmittance information of objects in hundreds of spectral bands over a continuous wavelength range. Compared to commonly used RGB images, HSIs reveal more intrinsic properties of object materials. Therefore, hyperspectral imaging is an indispensable scientific tool in many fields such as remote sensing [1], [2], [3], medical imaging [4], [5], [6], and industrial inspection [7], [8], [9]. In these applications, both high spatial resolution and high spectral resolution are required. However, there is a trade-off between spatial and spectral resolution due to the limitations of imaging sensor technology and time constraints. For example, multispectral imaging (MSI) systems on remote sensing satellites such as GeoEye-1, MODIS, the Landsat series, and the GF series have lower spatial resolution than panchromatic or RGB imaging systems, and lower spectral resolution than HSI systems. Neither the spectral resolution nor the spatial resolution of these MSI systems can satisfy the requirements of emerging remote sensing applications. In recent years, to take full advantage of the available multispectral data, deep learning (DL) frameworks have been introduced to enhance the resolution in the spatial and spectral dimensions.
There are mainly three approaches for obtaining data with high spatial and spectral resolution using DL-based SR methods: 1) spatial SR, 2) spectral SR, and 3) spatial-spectral SR. The spatial SR approach only enhances the spatial resolution of MSIs. A typical paradigm is training a deep convolutional neural network (CNN) to extract deep features from low spatial resolution inputs and reconstruct their high spatial resolution counterparts. The spectral SR approach only enhances the spectral resolution of MSIs. Sparse representations and dictionary learning are commonly used conventional techniques for spectral SR, and CNN-based models have become popular in recent years. A special case of spectral SR is to reconstruct HSIs from RGB images, which has become an important task in computer vision challenges such as NTIRE [10], [11]. Unlike the above approaches, the spatial-spectral SR approach increases both the spatial resolution and the spectral resolution of the input. Therefore, it can make better use of the available multispectral data and adapt to more situations. At the same time, it is a more challenging task due to its highly ill-posed nature.
Although several DL-based spatial-spectral SR methods have been proposed, these methods still suffer from a number of issues that hinder their performance. On the one hand, most existing methods treat the hyperspectral image as discrete voxels in the 3D spatial-spectral space, ignoring the continuous nature of signals. On the other hand, existing methods are trained and applied at a fixed SR scale, which is inconvenient in practical use.
To address these issues, and inspired by the recent progress in INR, we propose a framework, termed Local Implicit Spatial-spectral Function (LISSF), that learns the local continuous spatial-spectral representation of the HR-HSI from discrete input. LISSF consists of two parts, an encoder and a decoder. The encoder is a transformer-based U-shaped network that extracts deep features from the LR-MSI input. The decoder takes an MLP as its core and uses the deep features to estimate the intensity values at given continuous spatial-spectral coordinates on the HR-HSI. As a spatial-spectral SR model, LISSF can improve the resolution of the LR-MSI in both spatial and spectral dimensions. Furthermore, it can scale the LR-MSI input to an arbitrary size regardless of whether the scaling ratio is covered by the training process, which greatly improves its practicality.
To evaluate the performance of LISSF, detailed experiments on the CAVE and ARAD HS datasets are carried out. Joint spatial-spectral SR experiments validate the state-of-the-art performance of LISSF. The strong performance on spatial SR and spectral SR also demonstrates the good generalization ability and flexibility of LISSF. Moreover, ablation studies are carried out to validate the effectiveness of the individual components of LISSF.
The main contributions of this article can be summarized as follows.
1) A novel framework termed LISSF is proposed for joint spatial-spectral SR. [12] which is capable of arbitrary spatial-spectral SR is used for comparison.

A. Spatial Super-Resolution
HSIs are data in 3D form, containing two spatial dimensions and one spectral dimension. Spatial SR, i.e., increasing the resolution of HSIs in the spatial dimensions, has the same roots as the single-image super-resolution (SISR) task in computer vision. Numerous SISR methods can be directly used to perform HSI spatial SR, such as SRCNN [13], VDSR [14], SRGAN [15] and EDSR [16]. However, these methods treat images of different bands as independent, ignoring their correlation. Motivated by these approaches, networks dedicated to HSI spatial SR have recently been developed. Yuan et al. [17] were the first to propose a CNN-based method for HSI spatial SR. They regarded the problem as a transfer learning task and transferred a pre-trained SISR model to perform HSI spatial SR. Later, Mei et al. [18] proposed a 3D-FCNN model to directly increase the spatial resolution of HSIs. Li et al. [19] designed a generative adversarial network (GAN) framework for HSI spatial SR to reconstruct more texture details and proposed a band attention mechanism to explore the correlation of spectral bands. Li et al. [20] designed a novel mixed network with both 2D and 3D convolutions to jointly exploit the information from different bands. Jiang et al. [21] proposed to use spatial-spectral blocks (SSB) to exploit the spatial and spectral priors. Liu et al. [22] employed a new spectral attention mechanism for group convolutions to rescale grouped features with holistic spectral information. Li et al. [23] alternately employed 2D and 3D units to solve the problem of structural redundancy by sharing spatial information during the reconstruction.

B. Spectral Super-Resolution
Spectral SR aims to enhance the resolution of hyperspectral images in the spectral dimension. Most previous research on spectral SR is based on sparse representation. Han et al. [24] proposed a spectral library-based dictionary learning method to achieve HSI spectral SR, which estimates the band matching matrix, spectral dictionary, and sparse coefficients simultaneously. Yi et al. [25] designed a framework involving spectral improvement strategies and spatial preservation strategies for HSI spectral SR. In recent years, many CNN-based methods have been proposed and have achieved excellent spectral SR performance. Gewali et al. [26] proposed to reconstruct HSIs from MSIs using an end-to-end fully convolutional residual neural network architecture. Arun et al. [27] integrated sparse representation into a CNN-based encoder-decoder architecture to improve the fidelity of spectral SR reconstruction. Zheng et al. [28] proposed a spatial-spectral residual attention network (SSRAN) that simultaneously explores the spatial and spectral information of MSIs to reconstruct HSIs. In particular, reconstruction of HSIs from RGB images can be regarded as a special case of spectral SR. It has become a hot topic in the field of computer vision [11], attracting the attention of many researchers. Shi et al. [29] proposed two advanced CNNs for RGB spectral SR, one using residual blocks and the other using dense blocks with a novel fusion scheme. Li et al. [30] proposed an adaptive weighted attention network (AWAN) for RGB spectral SR, which integrates an adaptive weighted channel attention (AWCA) module and a patch-level second-order non-local (PSNL) module. Cai et al. [31] designed a Transformer-based method, Multi-stage Spectral-wise Transformer (MST++), for efficient spectral reconstruction. The model achieves state-of-the-art performance while consuming much less computation and memory.

C. Spatial-Spectral Super-Resolution
Although spatial SR and spectral SR have been widely explored in recent years, few studies consider joint spatial-spectral SR. Mei et al. [32] were the first to propose a spatial-spectral joint SR (SSJSR) model, which learns an end-to-end mapping from an LR-MSI to the HR-HSI with a full 3D CNN. Ma et al. [33] proposed a CNN-based model named unfolding spatiospectral super-resolution network (US3RN). US3RN solves both the spatial SR and spectral SR problems via the alternating direction method of multipliers (ADMM). Ma et al. [34] presented a deep spatial-spectral feature interaction network (SSFIN) that uses a spatial-spectral feature interaction block (SSFIB) to make the spatial SR task and the spectral SR task benefit each other. Our work also belongs to the spatial-spectral SR methods. Moreover, it can achieve arbitrary SR in both spatial and spectral domains.

D. Implicit Neural Representation
INR is an emerging technology that can generate continuous, memory-efficient implicit representations for objects such as shapes [35], [36], scenes [37], [38], [39] or images [40]. Objects are represented as multi-layer perceptrons (MLPs) that map coordinates to signal values. Genova et al. [35] chose an implicit surface representation based on a combination of local shape elements to allow for widely varying geometry and topology of shapes. Peng et al. [39] combined a convolutional encoder with an implicit occupancy decoder to incorporate inductive biases for structured reasoning in 3D space. Recently, many studies have focused on sharing the function space of implicit representations across different objects, rather than learning an independent INR for each object [41]. Sitzmann et al. [41] proposed a meta-learning-based method for sharing the function space. Chen et al. [42] proposed the Local Implicit Image Function (LIIF) to generate continuous representations for images and achieve arbitrary spatial scaling of images. Xu et al. [43] and Zhang et al. [44] proposed to represent hyperspectral images with INR and perform spectral reconstruction from RGB images. However, both studies focus on learning independent INRs and achieve SR only in the spectral domain. Our work proposes to learn the local image function in a shared spatial-spectral space and can achieve arbitrary scaling in both spatial and spectral domains.

III. METHODOLOGY
To simplify the presentation, we use the abbreviations listed in Table I. In this section, we introduce the proposed LISSF model as shown in Fig. 2. The model takes an LR-MSI $I$ as input and produces an HR-HSI $O$ as output. The model is mainly divided into two parts: an encoder and a decoder. The encoder converts the input $I$ into a 3D deep feature $F_d$ through a deep neural network. Then, the decoder maps each continuous spatial-spectral coordinate to the corresponding intensity value of $O$.

A. Spatial-Spectral Feature Representation
Let $I \in \mathbb{R}^{h \times w \times d}$ denote the input LR-MSI, where $h$, $w$ and $d$ represent the height, width and number of spectral bands, respectively. The encoder transforms $I$ into a 3D deep feature $F_d \in \mathbb{R}^{h \times w \times d \times C}$, where $C$ represents the channel number. It can be formulated as
$$F_d = f_{\mathrm{encoder}}(I), \tag{1}$$
where $f_{\mathrm{encoder}}$ is the map function of the encoder.
In LISSF, we use a transformer-based network as the encoder, as shown in Fig. 2. To cope with different numbers of input spectral bands and to exploit the feature representation ability of 3D data, we use 3D convolution layers as the basic components of the encoder. The encoder first applies a 3D convolution with kernel size of 3, whose map function is denoted $f_{\mathrm{conv3}}$, to extract shallow features from the input. Afterwards, these shallow features are transformed into deep features through a 3-level U-shaped structure. In each level, multiple transformer blocks are stacked to effectively extract features. Starting from the original input, the encoder hierarchically reduces the spatial size while keeping the number of spectral bands and expanding the channel size. In this process, $N_1$, $N_2$ and $N_3$ transformer blocks, each with map function $f_{\mathrm{trans}}$, are stacked in the three levels, and a downsampler with map function $f_{\mathrm{down}}$ connects consecutive levels.
The outputs of the three levels are the deep features at different levels. Then, the encoder hierarchically expands the spatial size while keeping the number of spectral bands and reducing the channel size. Here, $f_{\mathrm{up}}$ denotes the map function of the upsampler and $f_{\mathrm{conv1}}$ is the map function of a 3D convolution with kernel size of 1; the deep features after this processing include $F_5 \in \mathbb{R}^{h \times w \times d \times C}$. Afterwards, multiple transformer blocks are stacked to refine the features. At last, a skip connection from the shallow features is used to generate the final deep features.

Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

1) Transformer block: In typical transformer-based models, the transformer block is composed of a Multi-head Self-Attention (MSA) block, a Feed-Forward Network (FFN) block and the corresponding layer normalization blocks. To improve the performance of the transformer block, and inspired by the structures proposed in Restormer [45], we propose the 3D multi-dconv head transposed attention (MDTA3D) and the 3D gated-dconv feed-forward network (GDFN3D). The architecture of the transformer block is illustrated in Fig. 3. With $X$ as the input, the map function of the transformer block is built from $f_{\mathrm{MDTA3D}}$ and $f_{\mathrm{GDFN3D}}$, the map functions of MDTA3D and GDFN3D, respectively.

2) Channel-Wise Multi-Head Transposed Attention: With $X \in \mathbb{R}^{\hat{H} \times \hat{W} \times \hat{D} \times \hat{C}}$ as the input of MDTA3D, $X$ is first projected into query ($Q \in \mathbb{R}^{\hat{h} \times \hat{H}\hat{W}\hat{D} \times \hat{c}}$), key ($K \in \mathbb{R}^{\hat{h} \times \hat{c} \times \hat{H}\hat{W}\hat{D}}$) and value ($V \in \mathbb{R}^{\hat{h} \times \hat{H}\hat{W}\hat{D} \times \hat{c}}$), where $\hat{h}$ is the head number and $\hat{c}$ is the channel number in each head. This is achieved by applying convolutions with kernel size of 1 to expand the channel dimension, and then using group channel-wise (depth-wise) convolutions with kernel size of 3 to encode the spatial-spectral context. In this projection, $f^{1}_{\mathrm{chunk}}$, $f^{2}_{\mathrm{chunk}}$ and $f^{3}_{\mathrm{chunk}}$ split the feature into three chunks, $f_{\mathrm{dconv3}}$ is the map function of the 3D depth-wise convolution with kernel size of 3, and $f^{Q}_{\mathrm{reshape}}$, $f^{K}_{\mathrm{reshape}}$ and $f^{V}_{\mathrm{reshape}}$ are the reshape functions corresponding to $Q$, $K$ and $V$. The map function of MDTA3D then computes the attention from $Q$, $K$ and $V$ with the softmax activation function $f_{\mathrm{softmax}}$. By using the channel-wise multi-head attention mechanism and channel-wise group convolution, the memory usage and computation can be greatly reduced.
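To make the tensor shapes above concrete, the following is a minimal NumPy sketch of channel-wise (transposed) attention. It assumes that $Q$, $K$ and $V$ have already been produced by the convolutional projections, and it follows the Restormer-style form $V \cdot \mathrm{softmax}(K Q / \alpha)$; the scaling factor `alpha`, the function names and the random inputs are illustrative assumptions rather than details taken from this paper. The sketch shows that the attention map has size $\hat{c} \times \hat{c}$ per head, independent of the number of voxels $\hat{H}\hat{W}\hat{D}$.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def transposed_attention(q, k, v, alpha=1.0):
    """Channel-wise attention sketch.

    q, v: (heads, N, c) with N = H*W*D voxels; k: (heads, c, N).
    alpha is a hypothetical scaling factor (an assumption here).
    """
    attn = softmax(np.matmul(k, q) / alpha, axis=-1)  # (heads, c, c)
    return np.matmul(v, attn)                         # (heads, N, c)

rng = np.random.default_rng(0)
heads, c, H, W, D = 2, 4, 3, 3, 5
N = H * W * D
q = rng.standard_normal((heads, N, c))
k = rng.standard_normal((heads, c, N))
v = rng.standard_normal((heads, N, c))

out = transposed_attention(q, k, v)
attn_shape = np.matmul(k, q).shape  # size of the per-head attention map
```

Because the attention map is computed across channels, its cost grows linearly with the number of voxels, which is consistent with the memory and computation savings claimed above.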
3) Channel-Wise Gated Feed-Forward Network: With $X \in \mathbb{R}^{\hat{H} \times \hat{W} \times \hat{D} \times \hat{C}}$ as the input of GDFN3D, the feature is expanded in the channel dimension and split into two parallel paths to achieve the gating mechanism. Then the gated feature is reduced to the original size in the channel dimension. In the map function of GDFN3D, $f_{\mathrm{gating}}$ denotes the map function of the gating mechanism and $f_{\mathrm{GELU}}$ is the GELU (Gaussian error linear unit) activation function.
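The expand, split, gate and reduce steps described above can be sketched as follows. This is a simplified illustration under stated assumptions: the convolutions with kernel size of 1 are modeled as matrix products over the channel axis, the 3D depth-wise convolution is omitted, and applying GELU to the first path before the element-wise product follows the Restormer design rather than anything confirmed by the text. The names `gated_ffn`, `w_in` and `w_out` are hypothetical.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the Gaussian error linear unit.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def gated_ffn(x, w_in, w_out):
    """x: (N, C) voxel features; w_in: (C, 2*E); w_out: (E, C)."""
    h = x @ w_in                      # expand channels (stands in for a kernel-1 conv)
    h1, h2 = np.split(h, 2, axis=-1)  # two parallel paths
    gated = gelu(h1) * h2             # gating: element-wise product of the paths
    return gated @ w_out              # reduce back to C channels

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 4))
w_in = rng.standard_normal((4, 16))
w_out = rng.standard_normal((8, 4))
y = gated_ffn(x, w_in, w_out)
```

The second path acts as a learned gate: wherever it is zero, the corresponding activated feature is suppressed before the channel reduction.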

B. Continuous Spatial-Spectral Reconstruction
Let $O \in \mathbb{R}^{H \times W \times D}$ denote the output HR-HSI, where $H$, $W$ and $D$ represent the height, width and number of spectral bands, respectively. It is clear that $H \ge h$, $W \ge w$ and $D \ge d$ for the spatial-spectral SR task. In typical spatial SR and spectral SR models, the upsampling module is crucial for mapping the deep features from the low-resolution space to the high-resolution space. However, the scaling ratio of the upsampling module is fixed. To achieve arbitrary scaling in the spatial and spectral domains, we map each continuous spatial-spectral coordinate $s \in S$ to its corresponding hyperspectral pixel value $O(s)$ with the deep feature $F$, formulated as
$$O(s) = f_{\mathrm{decoder}}(F, s), \tag{18}$$
where $f_{\mathrm{decoder}}$ is the map function of the decoder. Inspired by previous research on the Local Implicit Image Function (LIIF) [42], the reconstruction process of LISSF can be divided into the following four parts.

1) Local feature extraction: To map the continuous coordinate $s \in S$ to the HR-HSI value $O(s)$, the feature vector $t$ needs to be extracted first. As the deep feature $F$ is represented in the discrete space, $t$ can be obtained by indexing $F$ at the nearest (in Euclidean distance) discrete coordinate $s^*$, formulated as
$$t = F(s^*). \tag{19}$$
Then (18) can be rewritten as
$$O(s) = f_{\mathrm{MLP}}(f_{\mathrm{cat}}(t, s - s^*)), \tag{20}$$
where $f_{\mathrm{MLP}}$ is the mapping function of the MLP and $f_{\mathrm{cat}}$ refers to the concatenation of vectors. In (20), $t$ and $s - s^*$ make up the 1D input, which is transformed to $O(s)$ by the MLP. Using the residual value $s - s^*$ instead of $s$ prevents the MLP from relying on the absolute value of $s$, allowing it to learn the local continuous representation.
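As a small worked example of the nearest-coordinate lookup and the relative offset $s - s^*$, the sketch below assumes that every axis is normalized to $[-1, 1]$ and that the $n$ discrete samples of an axis sit at cell centers $-1 + (2k + 1)/n$. This coordinate convention is borrowed from LIIF-style implementations and is an assumption; the paper's spectral axis, which aligns corners, may use a different one. The helper names are hypothetical.

```python
def nearest_center(s, n):
    """Return (index, center, offset) of the sample nearest to coordinate s
    on one axis with n samples whose centers are at -1 + (2k + 1) / n."""
    k = int((s + 1.0) * n / 2.0)  # cell that contains s
    k = min(max(k, 0), n - 1)     # clamp to the valid index range
    center = -1.0 + (2 * k + 1) / n
    return k, center, s - center

def local_query(s, shape):
    """s: continuous (x, y, z) coordinate; shape: (h, w, d) of the feature grid.
    Returns the index of s* and the residual s - s* fed to the MLP."""
    parts = [nearest_center(c, n) for c, n in zip(s, shape)]
    index = tuple(p[0] for p in parts)
    offset = tuple(p[2] for p in parts)
    return index, offset

index, offset = local_query((-0.9, 0.1, 0.99), (4, 4, 6))
```

With this convention the residual on each axis never exceeds half a cell ($1/n$), so the MLP only sees small local offsets regardless of where the query lies in the data cube.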
2) Feature enhancement: Although (20) is enough to train the decoder, neighboring information is ignored when only one coordinate is used. To achieve a better representation of the local information, we apply a feature enhancement scheme. The deep features in the 3×3×3 neighboring area of $s^*$ are concatenated to generate the final feature vector $\hat{t}$, where $l \in \{-1, 0, 1\}$, $m \in \{-1, 0, 1\}$ and $n \in \{-1, 0, 1\}$ denote the variations of the discrete spatial-spectral coordinates. After the feature enhancement, $t$ is replaced by $\hat{t}$ in subsequent processes, so (20) can be rewritten as
$$O(s) = f_{\mathrm{MLP}}(f_{\mathrm{cat}}(\hat{t}, s - s^*)). \tag{22}$$
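The neighborhood concatenation can be illustrated with a short NumPy sketch. It enhances a whole feature grid at once, so that each voxel ends up with a vector of length $27C$. Zero padding at the borders and the ordering of the 27 neighbors are assumptions made for the illustration; the function name `enhance_features` is hypothetical.

```python
import numpy as np

def enhance_features(feat):
    """feat: (h, w, d, C) -> (h, w, d, 27 * C) by concatenating the features
    of each voxel's 3x3x3 neighborhood along the channel axis.
    Zero padding at the borders is an assumption."""
    h, w, d, _ = feat.shape
    padded = np.pad(feat, ((1, 1), (1, 1), (1, 1), (0, 0)))
    neighbors = [
        padded[1 + l:1 + l + h, 1 + m:1 + m + w, 1 + n:1 + n + d]
        for l in (-1, 0, 1) for m in (-1, 0, 1) for n in (-1, 0, 1)
    ]
    return np.concatenate(neighbors, axis=-1)

feat = np.arange(2 * 3 * 4 * 2, dtype=float).reshape(2, 3, 4, 2)
enhanced = enhance_features(feat)
```

In this ordering the offset $(l, m, n) = (0, 0, 0)$ is the 14th of the 27 blocks, so the original feature of each voxel is preserved inside its enhanced vector.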

3) Local ensemble: There is still an issue in (22) that hinders the continuous prediction of pixel values. Since the pixel value is predicted by querying the nearest feature vector with the decoder, when $s$ moves across the boundary between adjacent discrete coordinates, the nearest discrete coordinate $s^*$ and its corresponding feature vector $\hat{t}$ change, and the decoder's prediction changes accordingly. As illustrated in Fig. 4(a), the sudden switch happens when $s$ crosses the red interfaces. As long as the decoder map function $f_{\mathrm{decoder}}$ is not perfect, discontinuous predictions can appear at these interfaces when the sudden switches occur. Therefore, we refine (22) to
$$O(s) = \sum_{i} \frac{v_i}{v} \, f_{\mathrm{MLP}}(f_{\mathrm{cat}}(\hat{t}_i, s - s_i^*)), \tag{23}$$
where $s_i^*$ ($i \in \{000, 001, 010, 011, 100, 101, 110, 111\}$) is one of the 8 nearest discrete coordinates of $s$, $\hat{t}_i$ is the corresponding enhanced feature vector of $s_i^*$, $v_i$ is the volume of the subspace enclosed by $s$ and $s_i^*$, and $v = \sum_i v_i$ is the total volume of the 8 subspaces. Equation (23) combines the 8 neighboring predictions of the decoder to obtain the final prediction. With the normalized volumes of the 8 subspaces as weights, the final prediction switches smoothly around the interfaces, as shown in Fig. 4(b).
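The weighting in the local ensemble can be checked numerically with the sketch below, which computes the 8 normalized volume weights for a query inside one cell. One detail is an assumption: for nearer coordinates to receive larger weights, the sketch takes the weight of each corner from the box spanned by $s$ and the diagonally opposite corner, as is done for areas in LIIF [42]; the text above does not spell out this pairing. The function name and the unit-cube example are illustrative.

```python
from itertools import product

def ensemble_weights(s, lo, hi):
    """Normalized weights of the 8 corners of the cell [lo, hi] containing s.

    The weight of a corner is the volume of the box spanned by s and the
    diagonally opposite corner, divided by the cell volume (assumption,
    following LIIF), so that nearer corners get larger weights.
    """
    total = 1.0
    for a, b in zip(lo, hi):
        total *= (b - a)
    weights = {}
    for corner in product((0, 1), repeat=3):  # bit 0 -> lo, bit 1 -> hi
        vol = 1.0
        for axis, bit in enumerate(corner):
            opposite = lo[axis] if bit else hi[axis]
            vol *= abs(s[axis] - opposite)
        weights["".join(str(b) for b in corner)] = vol / total
    return weights

w = ensemble_weights((0.25, 0.5, 0.75), (0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
w_at_corner = ensemble_weights((0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
```

The weights sum to one and vary continuously with $s$, and a query that coincides with a discrete coordinate relies on that coordinate alone, which is what removes the sudden switches at the interfaces.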
4) Grid decoding: Since LISSF is used for arbitrary spatial-spectral SR, the decoder should exhibit different properties at different scaling ratios. For example, the model should tend to reconstruct more textures at lower ratios and more low-frequency features at higher ratios. Therefore, taking scale-dependent information as additional input features will improve the reconstruction performance of the decoder. We finally extend (23) as
$$O(s) = \sum_{i} \frac{v_i}{v} \, f_{\mathrm{MLP}}(f_{\mathrm{cat}}(\hat{t}_i, s - s_i^*, g)), \tag{24}$$
where $g = [g_h, g_w, g_d]$ specifies the grid size of the input data cube in the spatial and spectral dimensions.

A. Experimental Setups
1) Datasets: In the experiments, we use two hyperspectral datasets, CAVE [46] and ARAD HS (NTIRE 2020) [11]. CAVE is used for training and testing, while ARAD HS is used for testing only.
The CAVE [46] dataset consists of 32 hyperspectral images covering a variety of materials and objects, such as skin, fruits, drinks, feathers, paintings, etc. The CAVE dataset was captured with a tunable filter and a cooled CCD camera (Apogee Alta U260) under controlled indoor illumination conditions. All images are saved in 16-bit format to preserve the high dynamic range. Each image has 31 spectral bands covering the wavelength range of 400-700 nm at 10 nm intervals and a spatial resolution of 512 × 512 pixels. 24 HSIs in CAVE are used for training and the other 8 are used for testing.
The ARAD HS dataset was built for the NTIRE 2020 challenge on spectral reconstruction from RGB images. It contains two parts, Track 1 "Clean" and Track 2 "Real World", each of which contains 450 training images and 10 validation images. The ARAD HS dataset was collected with a Specim IQ mobile hyperspectral camera. Each image has 482 × 512 pixels and 31 bands from 400 nm to 700 nm with a 10 nm step. We use the 10 validation images in the "Clean" track for testing.

2) Implementation details: As the datasets contain only HR-HSIs, we obtain the corresponding LR-MSI input by downsampling the HR-HSI. We use bicubic interpolation for spatial downsampling and linear interpolation for spectral downsampling. Due to the difference between the spatial and spectral pixel definitions, we align corners when interpolating the spectral dimension, but not when interpolating the spatial dimensions. Therefore, the definition of magnification in the spectral dimension is different from that in the spatial dimensions. For example, ×2 represents 16 channels to 31 channels, and ×3 represents 11 channels to 31 channels. Since LISSF is capable of arbitrary SR in the spatial and spectral dimensions, the scaling factors are not fixed during training. The spatial scaling factor ranges from 1 to 4 and the spectral scaling factor ranges from 1 to 5.
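The two channel-count examples are consistent with an align-corners relation of the form scale $= (D - 1) / (d - 1)$, where $D$ and $d$ are the output and input band counts. The short sketch below works through this arithmetic; the relation is inferred from the two examples above rather than stated explicitly, and the helper names are hypothetical.

```python
def spectral_bands_in(bands_out, scale):
    """Input band count for an align-corners spectral scale factor,
    assuming scale = (bands_out - 1) / (bands_in - 1)."""
    return round((bands_out - 1) / scale) + 1

def spatial_size_in(size_out, scale):
    """Input size for a spatial scale factor without aligned corners,
    assuming scale = size_out / size_in."""
    return round(size_out / scale)

bands_x2 = spectral_bands_in(31, 2)  # 16 channels -> 31 channels
bands_x3 = spectral_bands_in(31, 3)  # 11 channels -> 31 channels
bands_x5 = spectral_bands_in(31, 5)
side_x4 = spatial_size_in(512, 4)
```

Under this reading, the largest spectral training factor of 5 corresponds to 7 input bands for a 31-band output, while a spatial factor of 4 maps a 128 × 128 input to 512 × 512.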
During the training phase, the input patch has a size of 48×48×6. The ground truth patch is randomly cropped from the original HR-HSI and its size is determined by the scale factors. A total of 20 patches are randomly selected from each image for training. Image flipping and rotation are randomly applied for data augmentation. The proposed model and the other methods used for comparison are all trained for 200 epochs. We use the AdamW optimizer to train LISSF with $\beta_1 = 0.9$, $\beta_2 = 0.999$ and a weight decay of $2 \times 10^{-4}$. The learning rate is initialized as $3 \times 10^{-4}$ and halved every 20 epochs. For a fair comparison, the channel number of all methods is set to 64. The proposed model and the other methods are all implemented using the PyTorch framework and trained on an NVIDIA GeForce RTX 3090 GPU.
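The step decay described above can be written as a one-line function. This is only a sketch of the stated schedule; whether epochs are counted from 0 and whether the halving is applied at the start of epoch 20 are assumptions, and the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=3e-4, step=20, gamma=0.5):
    """Learning rate at a 0-based epoch index: base_lr halved every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

lr_first = learning_rate(0)
lr_after_one_step = learning_rate(20)
lr_last = learning_rate(199)
```

In PyTorch, the same behavior is provided by `torch.optim.lr_scheduler.StepLR` with `step_size=20` and `gamma=0.5`. Over 200 epochs the rate is halved 9 times, ending at roughly $5.9 \times 10^{-7}$.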
3) Evaluation metrics: To quantitatively evaluate the performance of the proposed method, we use three widely used metrics: peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and spectral angle mapper (SAM). PSNR is the ratio between the maximum possible power of an image and the power of the distortion noise that affects the quality of its reconstruction. It is suitable for evaluating the overall reconstruction performance of different methods. SSIM is a perception-based model that considers image degradation as a perceived change in structural information. It is suitable for evaluating the spatial reconstruction performance of different methods. SAM determines the similarity between the estimated spectra and the reference spectra by calculating the angle between them. It is suitable for evaluating the spectral reconstruction performance of different methods.
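For reference, PSNR and SAM can be computed as in the NumPy sketch below. Several choices are assumptions made for illustration and may differ from the evaluation code used in the experiments: intensities are taken to lie in $[0, 1]$, PSNR is computed over the whole data cube rather than averaged per band, and SAM is averaged over pixels and reported in degrees.

```python
import numpy as np

def psnr(ref, est, max_val=1.0):
    """Peak signal-to-noise ratio in dB over the whole array."""
    mse = np.mean((ref - est) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(max_val ** 2 / mse))

def sam_degrees(ref, est, eps=1e-12):
    """Mean spectral angle in degrees; spectra lie along the last axis."""
    dot = np.sum(ref * est, axis=-1)
    norms = np.linalg.norm(ref, axis=-1) * np.linalg.norm(est, axis=-1)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

ref = np.full((4, 4, 31), 0.5)
psnr_value = psnr(ref, ref + 0.1)            # constant error of 0.1
angle_scaled = sam_degrees(ref, 2.0 * ref)   # same spectral shape, different gain
angle_orthogonal = sam_degrees(np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]]))
```

The example highlights the complementary roles of the two metrics: a pure intensity scaling leaves the spectral angle at essentially zero, while PSNR penalizes any intensity error regardless of spectral shape.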

4) State-of-the-art methods: To evaluate the performance of the proposed method under different conditions, we use three state-of-the-art spatial-spectral SR methods, SSJSR [32], US3RN [33] and SSFIN [34], for comparison. We also construct four spatial-spectral SR methods by combining state-of-the-art HSI spatial SR methods (MCNet [20] and ERCSR [23]) with HSI spectral SR methods (AWAN [30] and MST++ [31]). The interpolation method, which upsamples with bicubic interpolation in the spatial dimensions and linear interpolation in the spectral dimension, is used as a baseline. Besides, we modify the MetaSR [12] method into a 3D form, making it capable of spatial-spectral SR and comparable with LISSF. The modified MetaSR3D model applies the same encoder as LISSF.

B. Spatial-Spectral SR Results on CAVE Dataset
1) Setup: To evaluate the spatial-spectral SR performance of LISSF, experiments are carried out on the CAVE dataset under two scaling factor settings. The first increases the spatial resolution by 2 times and the spectral resolution by 2 times. The second increases the spatial resolution by 4 times and the spectral resolution by 5 times. It should be noted that the LISSF and MetaSR3D models are trained only once, while the other models are retrained for each scale.
2) Results: For quantitative comparison, the average PSNR, SSIM and SAM metrics of all methods on the CAVE dataset are shown in Table II, where bold indicates the best results and underline indicates the second-best results. The interpolation method serves as a baseline and performs well under low scaling ratios. Compared with the four combined methods, the three previously proposed state-of-the-art methods, SSJSR [32], US3RN [33] and SSFIN [34], have no obvious advantages. This may be because the four combined models have significantly larger network sizes than the three independent models. As can be seen from Table II, the AWAN [30] & ERCSR [23] method achieves the best performance and LISSF achieves the second best of all methods. As mentioned earlier, LISSF is trained only once, while the other methods except MetaSR3D are trained individually for each scale. For qualitative comparison, Fig. 5 provides examples of visual reconstruction for all methods under the two scaling factor settings. The corresponding metrics are listed below the pictures and the regions of interest (ROIs) are magnified. It can be seen that the results of LISSF contain abundant details and little reconstruction error. We also plot the spectral intensity of points of interest in Fig. 6, where the reconstructed spectrum of LISSF is very close to the ground truth. All these experimental results demonstrate the effectiveness of the proposed LISSF method.
In addition, the quantitative metrics of the interpolation method, MetaSR3D and LISSF under another two scaling factor settings are provided in Table III. In this experiment, the MetaSR3D and LISSF models are the same as those in the experiments above. This experiment confirms that LISSF can achieve arbitrary scaling of the spatial and spectral dimensions with only one training run. LISSF can achieve stable and excellent spatial-spectral SR even with scaling factors outside the range of the training process (spatial factors larger than 4 and spectral factors larger than 5). This brings great convenience to the practical application of the model.

C. Spatial-Spectral SR Results on ARAD HS Dataset
1) Setup: In order to evaluate the generalization ability of LISSF, we perform spatial-spectral SR on the ARAD HS dataset, which is not included in the training dataset. All comparison methods are exactly the same as the models in the previous section and are not retrained. Evaluations are also carried out under two scaling factor settings.
2) Results: The average PSNR, SSIM and SAM metrics of all methods on the ARAD HS dataset are shown in Table IV. As shown in Table IV, LISSF achieves the best result on all 6 quantitative metrics, and MetaSR3D achieves the second-best result on all 6. Visual reconstruction examples of all methods under the two scaling factor settings are provided in Fig. 7, where the corresponding metrics are listed below the pictures and the ROIs are magnified. Fig. 7 shows that LISSF achieves the richest details and the least reconstruction error. The spectral intensities of points of interest are plotted in Fig. 8, where LISSF achieves the results closest to the ground truth spectrum. Interestingly, LISSF does not achieve the best performance on the CAVE dataset, but it does on the ARAD HS dataset. This suggests that LISSF learns the underlying nature of spatial-spectral SR during training, whereas the other methods overfit the CAVE dataset to a certain degree. The strong performance of MetaSR3D also indirectly demonstrates the effectiveness of the encoder design, because MetaSR3D shares the same encoder structure with LISSF. It can be concluded that the proposed LISSF method has excellent generalization ability and can achieve state-of-the-art performance in the spatial-spectral SR task.

D. Spatial SR Results on CAVE and ARAD HS Dataset
1) Setup: Spatial SR can be regarded as a special case of spatial-spectral SR with a spectral scale of 1. We carry out spatial SR experiments to quantitatively evaluate the performance of LISSF and other methods. Both the CAVE and ARAD HS datasets are used and the spatial scaling factor is 4. The LISSF and MetaSR3D models in these experiments are still the same as the models used in the experiments above, while the other models are retrained for this task.
2) Results: The average PSNR, SSIM and SAM metrics of all methods are shown in Table V. SSJSR [32] achieves the worst spatial SR result among all methods, even worse than the interpolation method, the baseline of the experiments. All other DL-based methods achieve spatial SR performance no worse than the interpolation method. LISSF achieves the best performance on spatial SR of the CAVE dataset, which demonstrates the effectiveness of LISSF. Besides, LISSF achieves the best

E. RGB Spectral SR Results on CAVE and ARAD HS Dataset
1) Setup: Spectral SR from RGB images can be regarded as a special case of spatial-spectral SR with a spatial scale of 1 and a fixed number of 3 input spectral bands. We conduct RGB spectral SR experiments to quantitatively evaluate the performance of LISSF and other methods. Both the CAVE and ARAD HS datasets are used and the number of output spectral bands is 31. As RGB images are not provided in ARAD HS, we extract bands 28, 13 and 5 as R, G and B in the CAVE and ARAD HS datasets, which is a widely used approach in previous studies. However, this creates a problem: the number of input spectral bands differs from the numbers used to train LISSF and MetaSR3D in the experiments above (as few as 6 bands). Therefore, we retrain LISSF and MetaSR3D in this experiment for a fair comparison. The other methods are also retrained for this task. During training, the actual channel order used is 5-13-28 to maintain the order from short wavelength to long wavelength.
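The band selection described above amounts to indexing the spectral axis. The sketch below illustrates it on a synthetic cube whose values equal the band wavelengths; treating the band numbers 5, 13 and 28 as 1-based indices into the 400-700 nm range is an assumption, as the text does not state the indexing convention, and the function name is hypothetical.

```python
import numpy as np

def hsi_to_pseudo_rgb(hsi, bands=(5, 13, 28), one_based=True):
    """hsi: (H, W, 31) cube. Returns (H, W, 3) with the selected bands in
    ascending-wavelength order (5-13-28). One-based band numbering is an
    assumption."""
    idx = [b - 1 if one_based else b for b in sorted(bands)]
    return hsi[..., idx]

wavelengths = 400 + 10 * np.arange(31)                # 400 nm to 700 nm, 10 nm step
cube = np.broadcast_to(wavelengths, (2, 2, 31)).astype(float)
rgb_input = hsi_to_pseudo_rgb(cube)
```

Under the 1-based reading, the three input channels correspond to 440 nm, 520 nm and 670 nm, which are unevenly spaced within the 400-700 nm range.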
2) Results: The average PSNR, SSIM and SAM metrics of all methods are shown in Table VI. As the number of input bands is much smaller than the number of output bands and the three channels of the RGB images are not evenly spaced, the interpolation method performs worst. Among all DL-based methods, LISSF achieves the best performance on both the CAVE and ARAD HS datasets. This experiment further verifies the effectiveness and generalization ability of the LISSF method.

F. Ablation Study
In addition, we conduct an ablation study to validate the effectiveness of each proposed component.
1) Effect of encoder structure: In LISSF, the encoder extracts deep features from the input LR-MSI, which has a significant impact on the final performance of the model. To evaluate the effectiveness of the proposed transformer-based encoder, we train LISSF with other encoder models, including VDSR [14], UNet [47], CARN [48] and RRDB [49]. All models are modified to 3D form to adapt to LISSF. The spatial-spectral SR metrics of all methods with a spatial ratio of 3 and a spectral ratio of 3 on the CAVE and ARAD HS datasets are provided in Table VII. Among all encoder choices, LISSF with the proposed transformer-based encoder achieves the best reconstruction results, which validates the effectiveness of the encoder design.
2) Effect of transformer block structure: As the basic components of the encoder, transformer blocks have a significant impact on the performance of LISSF. In this section, we compare the performance of different transformer block variants. The MDTA3D block is replaced by MTA3D and CNN3D (3D convolutions). In MTA3D, the 3D depth-wise convolutions are replaced by standard 3D convolutions. The GDFN3D block is replaced by FN3D and CNN3D. In FN3D, the gating mechanism is removed and the 3D depth-wise convolutions are replaced by standard 3D convolutions. The spatial-spectral SR metrics of all variants with a spatial ratio of 3 and a spectral ratio of 3 on the CAVE and ARAD HS datasets are provided in Table VIII. Replacing MDTA3D and GDFN3D causes an obvious drop in performance, which demonstrates the effectiveness of the transformer block architecture.
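To make the difference between depth-wise and standard 3D convolutions in these variants concrete, the sketch below counts kernel weights for both. The channel count (48) and kernel size (3) are illustrative assumptions, not the settings of the paper, and the `groups` parameter follows the grouping convention common in deep learning libraries.

```python
def conv3d_weight_count(in_channels, out_channels, kernel_size, groups=1):
    """Number of kernel weights (biases excluded) of a 3D convolution layer."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    return (in_channels // groups) * out_channels * kernel_size ** 3


channels, kernel = 48, 3  # hypothetical values chosen only for illustration

# Standard 3D convolution: every output channel sees every input channel.
standard = conv3d_weight_count(channels, channels, kernel)

# Depth-wise 3D convolution: one k x k x k filter per channel (groups == channels).
depthwise = conv3d_weight_count(channels, channels, kernel, groups=channels)

print(standard, depthwise, standard // depthwise)
```

Under these assumptions the standard layer carries 48 times as many kernel weights as the depth-wise one, so the ablation variants are not lighter than the proposed blocks.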
3) Effect of decoder design: Besides the encoder, the other part that affects performance the most is the decoder, where we apply several dedicated designs. In this section, we compare the performance of LISSF with its variants without feature unfolding, local ensemble and cell decoding. As MetaSR3D shares the same encoder structure with LISSF, it is also used as a baseline to evaluate the performance of the different variants of LISSF. The spatial-spectral SR metrics of all methods under two scaling factor settings are provided in Table IX. It is clear that, compared with the complete LISSF, LISSF-g (without grid decoding), LISSF-l (without local ensemble) and LISSF-f (without feature enhancement) all suffer significant performance degradation. This shows that all these components play important roles for the MLP to effectively decode deep features and perform continuous spatial-spectral reconstruction. Furthermore, not only LISSF but also LISSF-g, LISSF-l and LISSF-f perform better than MetaSR3D. This shows that the excellent performance of LISSF is indeed due to the joint action of the encoder and decoder.
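The local ensemble component examined in this ablation can be illustrated with a short sketch. It assumes that each of the 8 nearest coordinates is weighted by the volume of the box spanned by the query and the diagonally opposite corner, a convention carried over from 2D local implicit image functions; the paper describes the weights only as normalized volumes, and the function names here are hypothetical.

```python
from itertools import product


def local_ensemble_weights(query, lower, upper):
    """Normalized volume weights for the 8 corners of the cell [lower, upper].

    Each corner is weighted by the volume of the box between the query and the
    diagonally opposite corner, so nearer corners receive larger weights.
    """
    if any(upper[d] <= lower[d] for d in range(3)):
        raise ValueError("upper must exceed lower on every axis")
    weights = {}
    for picks in product((0, 1), repeat=3):
        corner = tuple(upper[d] if picks[d] else lower[d] for d in range(3))
        opposite = tuple(lower[d] if picks[d] else upper[d] for d in range(3))
        volume = 1.0
        for d in range(3):
            volume *= abs(query[d] - opposite[d])
        weights[corner] = volume
    total = sum(weights.values())
    return {corner: v / total for corner, v in weights.items()}


def ensemble_prediction(weights, corner_predictions):
    """Blend the per-corner predictions with the local ensemble weights."""
    return sum(weights[c] * corner_predictions[c] for c in weights)


lower, upper = (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)
weights = local_ensemble_weights((0.25, 0.5, 0.5), lower, upper)

# Toy per-corner predictions: each corner predicts its own first coordinate.
toy_predictions = {corner: corner[0] for corner in weights}
blended = ensemble_prediction(weights, toy_predictions)
print(weights[(0.0, 0.0, 0.0)], weights[(1.0, 0.0, 0.0)], blended)
```

Because the weights vary continuously with the query coordinate, the blended prediction changes smoothly when the query crosses from one cell into the next, which is the stated purpose of the local ensemble.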

V. CONCLUSION
In this article, we propose the LISSF model, which can achieve arbitrary-scale super-resolution in both spatial and spectral dimensions. Different from spatial-spectral SR methods that learn a fixed mapping from LR-MSI to HR-HSI, LISSF learns a local continuous representation from the discrete LR-MSI input, independent of a specific scale. To achieve this goal, we design a transformer-based encoder and an MLP-based decoder. First, the encoder transforms the input LR-MSI into deep features containing both local and global information in the spatial-spectral domain. Then, the HR-HSI to be reconstructed is decomposed into individual coordinates for processing, and a feature vector is generated for each coordinate. Finally, the decoder projects the feature vectors to intensity values at specific spatial-spectral coordinates. Detailed comparisons and ablation studies were carried out to validate the effectiveness of LISSF. Experiments show that LISSF can achieve better spatial-spectral SR results at arbitrary scales than state-of-the-art methods retrained for a specific scale, which is a significant convenience in practical applications.
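The step of decomposing the target HR-HSI into individual coordinates can be sketched as follows. This is a minimal sketch under assumptions not confirmed by the text: each axis is normalized to [-1, 1] and a coordinate is placed at the center of every output cell, as is common in implicit image representations. The helper names are hypothetical.

```python
def cell_centers(n, lo=-1.0, hi=1.0):
    """Centers of n equal cells covering the interval [lo, hi]."""
    if n < 1:
        raise ValueError("n must be positive")
    step = (hi - lo) / n
    return [lo + step * (i + 0.5) for i in range(n)]


def query_coordinates(height, width, bands):
    """All (y, x, wavelength) query coordinates of a target cube of the given size."""
    ys, xs, ls = cell_centers(height), cell_centers(width), cell_centers(bands)
    return [(y, x, l) for y in ys for x in xs for l in ls]


# Hypothetical example: a 4 x 4 input with 8 bands, queried at a non-integer
# spatial scale of 1.5 (6 x 6 pixels) and at 31 output bands.
coords = query_coordinates(6, 6, 31)
print(len(coords), coords[0])
```

Since the target size enters only through the number of query coordinates, the same trained decoder can be queried at any output resolution, including non-integer scales.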

Yanan Zhang, Jizhou Zhang, and Sijia Han
Index Terms: Hyperspectral image (HSI), spatial-spectral super-resolution, implicit neural representations (INR), local implicit spatial-spectral function (LISSF).

Fig. 1. Schematic diagram of Local Implicit Spatial-spectral Function (LISSF). LISSF learns the local continuous representation from discrete input and achieves arbitrary super-resolution in both spatial and spectral dimensions.

Fig. 2. Diagram of LISSF for spatial-spectral SR. First, the encoder, a transformer-based U-shape network, transforms the LR-MSI input into deep features. Then, for a specific continuous spatial-spectral coordinate, the decoder extracts a local feature and enhances it. Finally, an MLP is used to estimate the corresponding intensity value. Besides, the local ensemble strategy is used to ensure a smooth reconstruction.

Fig. 4. LISSF with local ensemble. (a) 8 nearest discrete coordinates $s_i^*$ of $s$ and their interfaces. (c) Normalized volumes as local ensemble weights.

Fig. 5. Qualitative spatial-spectral SR example results on the CAVE dataset (composite images of the HSI with bands 28-13-5 as R-G-B) and the corresponding reconstruction error. The first row shows the reconstruction results of the "balloons_ms" sample with ×2 spatial scale and ×2 spectral scale (16 bands to 31 bands). The third row shows the reconstruction results of the "flowers_ms" sample with ×4 spatial scale and ×5 spectral scale (7 bands to 31 bands). The second and fourth rows show the normalized reconstruction error corresponding to the first and third rows.

Fig. 7. Qualitative spatial-spectral SR example results on the ARAD HS dataset (composite images of the HSI with bands 28-13-5 as R-G-B) and the corresponding reconstruction error. The first row shows the reconstruction results of the "ARAD_HS_0453" sample with ×2 spatial scale and ×2 spectral scale (16 bands to 31 bands). The third row shows the reconstruction results of the "ARAD_HS_0456" sample with ×4 spatial scale and ×5 spectral scale (7 bands to 31 bands). The second and fourth rows show the normalized reconstruction error corresponding to the first and third rows.

TABLE I ABBREVIATIONS AND NOTATIONS
s ∈ S to the hyperspectral pixel value O(s) using an MLP. Finally, all hyperspectral pixel values are synthesized and reshaped to generate O.

TABLE III QUANTITATIVE SPATIAL-SPECTRAL SR RESULTS OF METASR3D AND LISSF ON CAVE DATASET WITH ARBITRARY SCALING FACTORS

TABLE IV QUANTITATIVE SPATIAL-SPECTRAL SR RESULTS OF ALL METHODS ON ARAD HS DATASET
TABLE V QUANTITATIVE SPATIAL SR RESULTS ON CAVE AND ARAD HS DATASET

TABLE VI QUANTITATIVE RGB SPECTRAL SR RESULTS ON CAVE AND ARAD HS DATASET

TABLE VII QUANTITATIVE SPATIAL-SPECTRAL SR RESULTS OF METASR3D AND LISSF ON THE CAVE DATASET WITH ARBITRARY SCALING FACTORS
TABLE VIII QUANTITATIVE SPATIAL-SPECTRAL SR RESULTS OF METASR3D AND LISSF ON THE CAVE DATASET WITH ARBITRARY SCALING FACTORS
TABLE IX QUANTITATIVE SPATIAL-SPECTRAL SR RESULTS OF METASR3D AND LISSF ON THE CAVE DATASET WITH ARBITRARY SCALING FACTORS