Tree-Structured Neural Network for Hyperspectral Pansharpening

Hyperspectral (HS) pansharpening refers to fusing low spatial resolution HS (LRHS) images with the corresponding panchromatic (PAN) images to create high spatial resolution HS (HRHS) images. Most of the existing HS pansharpening methods overlook the spatial and spectral imbalance of the ground objects of different types in the observed scenes. To address the dilemma, in this article we develop a novel tree-structured neural network (Tree-SNet) to form an adaptive spatial-spectral processing for HS pansharpening. The Tree-SNet method maps a convolutional neural network (CNN) onto a hierarchical tree structure, where routing nodes automatically tune the data distributed to tree paths, which is adaptive to the local characteristics of the data, while spatial enhancement (SpatE) and spectral enhancement (SpecE) modules are dynamically performed in the tree paths to further strengthen the adaptive processing. The proposed Tree-SNet is evaluated on several datasets, and the experimental results verify its superiority.


Lin He, Member, IEEE, Hanghui Ye, Dahan Xi, Jun Li, Fellow, IEEE, Antonio Plaza, Fellow, IEEE, and Mei Zhang

I. INTRODUCTION
Satellite remote sensing seeks to provide accurate observations of the Earth's surface with sufficient information. However, owing to the inherent physical constraints of satellite sensors, the spatial resolution and the spectral resolution of the acquired data trade off against each other. For example, a panchromatic (PAN) image usually comprises only one band with a high spatial resolution, while a hyperspectral (HS) image comprises hundreds of bands but often with a low spatial resolution. Considering that HS data at full resolution are required for many practical applications [1], [2], [3], [4], [5], [6], [7], we have to resort to the HS pansharpening technique, which has drawn much attention in recent years. It reconstructs high-resolution HS (HRHS) images by fusing matched pairs of low-resolution HS (LRHS) images and PAN images.
The HS pansharpening technique was originally extended from traditional multispectral (MS) pansharpening methods [2], [3], [8], [10], [11], [12]. Methods such as Gram-Schmidt (GS) [8], intensity-hue-saturation (IHS) [13], principal component analysis (PCA) [14], smoothing filter-based intensity modulation (SFIM) [15], wavelet transform [16], the modulation transfer function generalized Laplacian pyramid (MTF-GLP) [17], and its high-pass variant [18] can be approximately categorized into component substitution (CS) and multiresolution analysis (MRA). Other pansharpening approaches can be attributed to variational optimization (VO)-based processing, which regards pansharpening as an ill-posed inverse problem [9], [19], [20], [21]. Representative VO-based methods include matrix factorization [2], [22] and Bayesian estimation [2], [23], [24]. Matrix factorization methods usually decompose low-resolution MS/HS images and high-resolution PAN images to obtain basis matrices and coefficient matrices, respectively, and then form high-resolution MS/HS images by multiplying the basis matrices with the coefficient matrices. In contrast, Bayesian estimation methods treat MS/HS pansharpening as an optimization problem in the Bayesian framework, often regularized by a prior. Since convolutional neural networks (CNNs) have shown significant potential for enhancing the spatial resolution of data, CNN-based methods have been proposed for the pansharpening task. CNN-based pansharpening was pioneered by the pansharpening neural network (PNN) [25], which targets MS pansharpening and was followed by further work [26], [27], [28], [29]. An HS image is usually characterized by a markedly higher spectral resolution and contains many more spectral bands than an MS image. Therefore, CNNs need to be designed more elaborately to deal with HS pansharpening. Zheng et al.
[32] designed a CNN structure for HS pansharpening that makes full use of the relationship between spatial and spectral information through a residual block. Xie et al. [33] discovered deep priors to predict the HRHS images, which are used as a minimum constraint during HRHS image restoration. He et al. [34], [35] built a spectral prediction structure to enhance spectral fidelity [34] and proposed an HS pansharpening framework that produces reliable spatial detail through a physically interpretable skip connection.
Existing HS pansharpening CNNs have achieved great progress [25], [26], [30], [31], [32], [33], [34], [35], [36], [42], and usually attain better quality indicators than traditional methods. However, the spatial and spectral attributes of an HS image, and their coupling, may vary spatially in real-world situations, while most HS pansharpening methods do not explicitly consider such spatially varying characteristics of HS data, which may introduce distortion into the pansharpened HRHS image, especially when large-scale networks are absent. Intuitively, scenes with more spatial information, such as residential areas, and scenes with more spectral information, such as vegetation, should require different degrees of spatial or spectral recovery. Inspired by this, we propose a novel HS pansharpening CNN that incorporates a CNN into a tree structure, where spectral-spatial processing is automatically adjusted through different pansharpening paths. The major innovative contributions of our work can be identified as follows.
1) A tree-structured neural network (Tree-SNet). Tree-SNet maps a CNN onto a hierarchical tree structure. Routing nodes automatically tune input data to different paths corresponding to the spatially local characteristics of the data, fulfilling an adaptive spatial adjustment, while in the tree paths the data go through additional dynamic processing to reinforce the adaptiveness.
2) A spatial enhancement (SpatE) module and a spectral enhancement (SpecE) module. SpatE and SpecE modules are designed in the paths of the tree to further enhance the spatial and spectral components of the data, respectively, with the involvement of dynamic convolutions along the spatial and spectral dimensions.
3) A patch integration rule connected to the data routing. After the spatial and spectral processing on the tree paths, we design a self-adjustable integration rule to fuse all the processed patches from the paths. The rule is associated with the data routing on the tree and collaboratively accomplishes the adaptive spatial and spectral processing.
The rest of this article is organized as follows. Section II proposes our pansharpening neural network, i.e., Tree-SNet. Section III shows our experimental results in detail. Finally, Section IV concludes this article.

II. METHODOLOGY
To handle the spatially varying characteristics of HS data during pansharpening, a feasible way is to design a pansharpening procedure capable of adaptively and automatically adjusting the spectral-spatial processing. More specifically, in this section, we exploit an innovative Tree-SNet to deal with the problem.

A. Overall Design
HS pansharpening CNNs are usually connected to a learning process that reconstructs HRHS images by fusing LRHS images and PAN images, which consists of a training stage and a subsequent inferring stage. The two stages can be formulated as

θ_opt = arg min_θ ℓ(Φ(X_train; θ), Y_train),   Ŷ = Φ(X; θ_opt)

where Φ(·) denotes the pansharpening model, X is the stacked PAN/LRHS input, Y_train is the reference HRHS data, ℓ is the loss, and θ_opt denotes the trained parameters. The overall design of our pansharpening CNN, i.e., Tree-SNet, is graphically presented in Fig. 1. As shown in the figure, we utilize a hierarchical tree to guide the construction of Tree-SNet. The input PAN and preinterpolated LRHS are first stacked along the spectral dimension and then cropped to form patch samples for the training stage or the inferring stage. In the tree, routing nodes are employed to dynamically tune patch data, i.e., distribute the data to the paths of different spectral and spatial processing with adaptive weights. Thus, adaptive tuning of the pansharpening is achieved. In the practical fulfillment of the training stage, patch samples of training data are directly used as samples for the adaptive moment estimation (Adam) optimizer, which trains the pansharpening parameters θ_opt with respect to a loss; once θ_opt is determined, the pansharpening model Φ(·) with fixed parameters θ_opt is used to infer the pansharpening of the test or incoming data.
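The stacking-and-cropping step described above can be sketched in a few lines of numpy; the 20 × 20 patch size matches the training samples reported later in Section III, while the image sizes and band count used here are illustrative only:

```python
import numpy as np

def make_patch_samples(pan, lrhs_up, patch=20, stride=20):
    """Stack the PAN band with the preinterpolated LRHS cube along the
    spectral dimension, then crop the stack into patch samples.
    pan: (H, W) array; lrhs_up: (H, W, B) array, already upsampled to PAN size."""
    stacked = np.concatenate([lrhs_up, pan[..., None]], axis=-1)  # (H, W, B+1)
    H, W, _ = stacked.shape
    patches = [stacked[r:r + patch, c:c + patch]
               for r in range(0, H - patch + 1, stride)
               for c in range(0, W - patch + 1, stride)]
    return np.stack(patches)  # (N, patch, patch, B+1)

pan = np.random.rand(40, 40)
lrhs_up = np.random.rand(40, 40, 102)
samples = make_patch_samples(pan, lrhs_up)
print(samples.shape)  # (4, 20, 20, 103)
```

In practice the patches would be fed to the Adam optimizer as training samples; data augmentation (rotation, flipping) would be applied before stacking into batches.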

B. Routing Nodes

Routing nodes N_i^j of the Tree-SNet tree serve as routers that determine how to distribute input data into the paths of the tree. More specifically, routing nodes measure the spatial and spectral components of the observed scene and accordingly tune the settlement of the succeeding spatial-spectral processing with enhancement on the tree paths. In such a way, the spatial details and spectral fidelity of the scene can be enhanced while being kept balanced on the whole. Fig. 2 graphically shows the architecture of a routing node. The routing node comprises a main path and two skip paths. The main path is a convolutional path, which is composed of consecutive convolutional layers, one of which is a two-stride layer, followed by a global average pooling, several fully connected (FC) layers, and a sigmoid activation. The final FC layer is connected to a sigmoid activation to produce a value ϕ_i in (0, 1). The output of the main path, i.e., ϕ_i, is used as a factor to tune the skip paths. Once the convolution parameters of the main path are determined, the factor ϕ_i adaptively follows the characteristics of the input data of the routing node. The mathematical representation associated with the main path is as follows:

ϕ_i = S(FC(F(GAP(Conv(I_0)))))

where Conv, GAP, and FC represent the convolutional, global average pooling, and fully connected layers, and F and S stand for the ReLU function and the sigmoid function, respectively; ϕ_i is the output of the routing node. For brevity, this process for the ith node can be formulated as ϕ_i = N(I_0; θ_N), where θ_N denotes the parameters of all layers in this process.
As regards the two skip paths, one skip path directly multiplies the input patch data I_0 by the factor ϕ_i, while the other multiplies I_0 by the factor (1 − ϕ_i). The outputs of the two skip paths, i.e., ϕ_i · I_0 and (1 − ϕ_i) · I_0, are fed into the subsequent spatial enhancement path and spectral enhancement path, respectively, which adaptively tunes the spatial and spectral processing of the local patch. Then, we have

F_in1 = ϕ_i · I_0,   F_in2 = (1 − ϕ_i) · I_0

where F_in1 and F_in2 are the outputs of the two skip paths and denote the inputs of the spatial enhancement path and the spectral enhancement path, respectively.
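The routing behavior above can be illustrated with a toy numpy sketch; the single FC layer and random weights are simplifications (the paper's node uses several convolutional and FC layers), but the gating arithmetic is as described:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def routing_node(patch, w_fc, b_fc):
    """Toy routing node: a global average pool summarizes the patch, one FC
    layer plus sigmoid produces the routing factor phi in (0, 1); the patch
    is then split between the spatial and spectral enhancement paths."""
    feat = patch.mean(axis=(0, 1))          # global average pooling -> (C,)
    phi = sigmoid(feat @ w_fc + b_fc)       # scalar routing weight in (0, 1)
    f_in1 = phi * patch                     # input to spatial enhancement path
    f_in2 = (1.0 - phi) * patch             # input to spectral enhancement path
    return phi, f_in1, f_in2

rng = np.random.default_rng(0)
patch = rng.random((20, 20, 8))
phi, f1, f2 = routing_node(patch, rng.standard_normal(8), 0.0)
assert 0.0 < phi < 1.0
np.testing.assert_allclose(f1 + f2, patch)  # the two paths partition the input
```

Note that the two path inputs always sum back to the original patch, so the node only redistributes (rather than creates or destroys) information between the two enhancement paths.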

C. Spatial Enhancement and Spectral Enhancement Paths
After the adaptive data distribution by the routing nodes, we further enhance the spatial and spectral information. As shown in Fig. 1, in the left downward path of a node (i.e., a spatial enhancement path), the spatial information is restored by a spatial enhancement (SpatE) module, while the spectral information is restored by two spectral enhancement (SpecE) modules in the right path (i.e., a spectral enhancement path). The SpatE and SpecE modules are constructed as follows.
1) SpatE Module: The module aims to enhance the spatial components of observed scenes via automatic spatial adjustment. We resort to spatially dynamic convolution to achieve such an adjustment. More specifically, the proposed SpatE module learns a mapping function M_1 : F_in1 → (W_s, b_s) that transforms an input feature map F_in1 into the weights W_s and biases b_s of kernels associated with every pixel. The learned parameters are then applied to adaptively modulate the upcoming input feature maps along the spatial dimension to strengthen spatial details. The related mathematical representation is of the form

(W_s, b_s) = M_1(F_in1; θ_S)

where θ_S denotes the parameters of M_1. Fig. 3(a) and (b) shows the structure of the SpatE module and the implementation of the corresponding spatially dynamic convolution. In the module, several consecutive convolutional layers with ReLU activations perform the mapping M_1, whose outputs are reshaped into kernel weights and biases to fulfill the spatial convolution. Spatial enhancement is conducted by applying these kernels and biases to the input feature maps along the spatial dimension in a channelwise and pixelwise manner, as shown in Fig. 3(b).
2) SpecE Module: In contrast to the SpatE module, the SpecE module is introduced to improve the spectral fidelity of observed scenes. This module also learns a mapping function M_2 : F_in2 → (W, b) that transforms the input feature maps F_in2 into kernel parameters. But different from SpatE, SpecE uses these kernel parameters to form the spectral kernels of a spectral convolution operation to strengthen spectral fidelity. The connected formulation is as follows:

(W, b) = M_2(F_in2; θ_λ)

where θ_λ denotes the parameters of M_2. Fig.
3(a) and (c) displays the structure of the SpecE module and the implementation of its associated spectral convolution. As can be seen, the SpecE module is built on a structure similar to that of the SpatE module, except that the dynamic convolution is performed with 1 × 1 × k spectral kernels. In the spatial and spectral enhancement paths, the SpatE and SpecE modules use spatially and spectrally dynamic convolutions to lift the spatial and spectral information; these are to some degree also spatially self-adjustable and thus collaborate with the data distribution by routing nodes and the connected reconstruction to achieve the spatial adaptiveness of our Tree-SNet.
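The channelwise, pixelwise application of predicted kernels in the SpatE module can be sketched as follows; in Tree-SNet the kernels would come from the learned mapping M_1, whereas here they are supplied directly to keep the sketch self-contained:

```python
import numpy as np

def dynamic_spatial_conv(x, kernels, biases, k=3):
    """Apply a per-pixel k x k kernel to each channel of x (channelwise and
    pixelwise, as in the SpatE module).
    x: (H, W, C); kernels: (H, W, k*k); biases: (H, W)."""
    H, W, C = x.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.empty_like(x)
    for r in range(H):
        for c in range(W):
            win = xp[r:r + k, c:c + k, :]         # (k, k, C) neighborhood
            ker = kernels[r, c].reshape(k, k, 1)  # same kernel for all channels
            out[r, c] = (win * ker).sum(axis=(0, 1)) + biases[r, c]
    return out

x = np.random.rand(8, 8, 4)
# Identity kernels (1 at the center tap, zero bias) must return the input.
ident = np.zeros((8, 8, 9))
ident[:, :, 4] = 1.0
np.testing.assert_allclose(dynamic_spatial_conv(x, ident, np.zeros((8, 8))), x)
```

The SpecE module follows the same pattern but applies 1 × 1 × k kernels along the band dimension instead of spatial windows.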

D. Final Reconstruction
After the data distribution via routing nodes and the associated spatial and spectral enhancements on the paths, a final reconstruction needs to be fulfilled to achieve the desired pansharpened HS image. The final reconstruction is composed of three sequential phases: path-based spectral reconstruction (PSR), patch integration (PI), and stitching (as shown in Fig. 1).
In PSR, the output of each path is processed with two 1 × 1 convolutional layers (as shown in the bottom right part of Fig. 1) to restore the spectral information of each patch and yield pansharpened patches, which can be formulated as

L^j = L(F^j; θ_L)

where L(·) denotes the two convolutional layers that reconstruct the spectral information, θ_L represents the parameters of those layers, F^j is the output of the jth path, and L^j stands for the final output features of the patch. PI seeks to integrate the set of patches from all the paths into a final patch by a fusion rule, which corresponds to the data distribution by the routing nodes. The associated mathematical representation is as follows:

N_out = Σ_j ϕ_i^j · L^j

where ϕ_i^j and N_out stand for, respectively, the routing-derived weight of the jth path and the final HRHS patch obtained by weighting the final output features.
Stitching aims at reconstructing the pansharpened HRHS image by combining all the final patches according to their corresponding positions, which can be formulated as

Ĥ = ST({N_out})

where ST denotes the stitching process, which averages the sum of all the final patches over the matching positions. In brief, the final reconstruction can be formulated as

Ĥ = FR(F; θ_L)

where FR represents the three consecutive operations of path-based spectral reconstruction, patch integration, and stitching, which collaboratively fulfill the reconstruction of the desired HRHS image. In particular, in the patch integration, we design the combining rule in a subtle way (as shown in Fig. 1) to integrate the outputs of all paths, corresponding to the routing of the distributed data on the tree.
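The stitching step, which averages overlapping final patches at their matching positions, can be sketched as:

```python
import numpy as np

def stitch(patches, positions, out_shape):
    """Stitch final patches back to image coordinates, averaging where
    patches overlap (the ST step). patches: list of (p, p, B) arrays;
    positions: list of (row, col) top-left corners."""
    acc = np.zeros(out_shape)
    cnt = np.zeros(out_shape[:2] + (1,))
    for patch, (r, c) in zip(patches, positions):
        p = patch.shape[0]
        acc[r:r + p, c:c + p] += patch
        cnt[r:r + p, c:c + p] += 1.0
    return acc / np.maximum(cnt, 1.0)  # average over overlap counts

# Two overlapping constant patches: the overlap region averages to 1.5.
p1 = np.ones((4, 4, 2))
p2 = 2.0 * np.ones((4, 4, 2))
img = stitch([p1, p2], [(0, 0), (0, 2)], (4, 6, 2))
assert img[0, 0, 0] == 1.0 and img[0, 5, 0] == 2.0 and img[0, 3, 0] == 1.5
```

Averaging over the overlap counts keeps the reconstruction seam-free when patches are cropped with overlapping strides.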

E. Summary
Our Tree-SNet maps a CNN onto a tree structure, where routing nodes are responsible for data distribution corresponding to the spatially local characteristics of the HS image. Meanwhile, dynamic spatial enhancement and dynamic spectral enhancement are built into the paths of the tree to further strengthen the adaptiveness of the spatial-spectral processing, and the connected reconstruction is designed to form the final pansharpening data. Algorithm 1 outlines the practical implementation of our Tree-SNet.

III. EXPERIMENTS
In this section, experimental results are presented to verify the effectiveness of the proposed Tree-SNet.

A. Experiment Setup
We tested our proposed Tree-SNet on four datasets. Table II summarizes the main properties of those datasets.
1) The Pavia University dataset, collected by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor over the University of Pavia, Italy, contains an HS image with a spatial size of 610 × 340 and 102 spectral bands covering the range 0.4-0.9 μm. The bottom left part of size 200 × 200, the right part of size 410 × 340, and the remaining parts are selected for testing, training, and validation, respectively. In total, 7680 training samples and 768 validation samples of size 20 × 20 are generated with the use of data augmentation (including rotation and flipping).
2) The Houston University dataset, acquired by the Compact Airborne Spectrographic Imager-1500 (CASI-1500) sensor over the University of Houston, USA, contains an HS image with a spatial size of 340 × 1325 and 144 spectral bands ranging from 0.4-1.0 μm. The top left part of size 200 × 200, the top right part of size 410 × 340, and the remaining part are selected for testing, training, and validation, respectively. In total, 7680 training samples and 768 validation samples of size 20 × 20 are produced.
3) The Merced dataset, gathered by the Hyperion spectrometer and the Advanced Land Imager (ALI) sensor over agricultural and vegetated areas of Merced, California, contains an HS image with a spatial size of 2100 × 180 and 134 spectral bands ranging from 0.4-2.5 μm, together with an accompanying PAN image with a spatial size of 6300 × 540. Parts of the HS data of sizes 150 × 150, 600 × 150, and 400 × 150, together with the connected PAN data, are selected for testing, training, and validation, respectively. In total, 9600 training samples and 960 validation samples of size 20 × 20 are produced.
4) The Changping dataset, also acquired by the Hyperion spectrometer and the ALI sensor, but over an area of Changping, Beijing, comprises an HS image with a spatial size of 2900 × 180 and 137 spectral bands ranging from 0.4-2.5 μm, together with an accompanying PAN image with a spatial size of 8700 × 540. Parts of the HS data of sizes 180 × 180, 1000 × 180, and 200 × 180, together with the connected PAN data, are selected for testing, training, and validation, respectively. In total, 9600 training samples and 960 validation samples of size 20 × 20 are generated for our experiments.
In our experiments, two kinds of datasets are used: real datasets (the Merced dataset and the Changping dataset) and simulated datasets (the Pavia University dataset and the Houston dataset). A real dataset comprises both the observed LRHS data and the connected PAN data, whereas a simulated dataset contains only the observed LRHS data (i.e., no PAN data). Therefore, for simulated datasets, we have to use the observed LRHS data to artificially build the PAN data,

TABLE II PROPERTIES OF THE DATASETS
whose spatial information is the same as that of the observed LRHS data. The observed LRHS images then need to be degraded to a lower spatial resolution so that the relatively high-resolution PAN can be used to perform reduced-resolution experiments. For real datasets, both the observed PAN data and the observed LRHS data are available, where the PAN images have the same spatial resolution as the target pansharpened images. The observed PAN images can then naturally be used to pansharpen the observed LRHS images to the spatial resolution of the PANs, i.e., the full-resolution experiment. In addition, the observed PAN data and the observed LRHS data in real datasets can be degraded to a lower spatial resolution by blurring and downsampling, and then also be used to conduct a reduced-resolution experiment, similar to that for the simulated datasets. Therefore, in our experiments, simulated datasets are only used for reduced-resolution experiments, while real datasets are used for both reduced-resolution and full-resolution experiments.
In reduced-resolution experiments, the observed LRHS images are used as the ground-truth images, which can then be utilized for calculating the evaluation metrics CC, SAM, RMSE, ERGAS, and Q2n. In full-resolution experiments, however, ground-truth images are absent and these metrics cannot be obtained. In this situation, the QNR metric, which does not require ground-truth images, is utilized for evaluating the results of full-resolution experiments.
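As an illustration of the reference-based evaluation, the SAM metric mentioned above can be computed as the mean spectral angle between the reference and estimated cubes; this is the standard formulation, sketched here in numpy (not code from the paper):

```python
import numpy as np

def sam_degrees(ref, est, eps=1e-12):
    """Mean spectral angle (in degrees) between reference and estimated
    HS cubes of shape (H, W, B); 0 means identical spectral shapes."""
    dot = (ref * est).sum(axis=-1)
    norm = np.linalg.norm(ref, axis=-1) * np.linalg.norm(est, axis=-1)
    ang = np.arccos(np.clip(dot / (norm + eps), -1.0, 1.0))
    return np.degrees(ang).mean()

ref = np.random.rand(10, 10, 30)
assert sam_degrees(ref, ref) < 1e-3          # identical cubes -> ~0 degrees
assert sam_degrees(ref, 2.0 * ref) < 1e-3    # SAM ignores per-pixel scaling
```

Because SAM compares only the direction of each pixel's spectral vector, it measures spectral distortion independently of brightness, which is why it complements intensity-based metrics such as RMSE and ERGAS.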
In our experiments, we utilize Wald's protocol [37] to arrange the simulated data of Pavia University and Houston University, where LRHS images are generated by blurring and downsampling and PAN images are generated by averaging the visible spectral bands [38]; we also employ Wald's protocol on the Merced and Changping data to generate the associated LRHS images.
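A reduced-resolution setup in the spirit of Wald's protocol can be sketched as follows; note that the box blur and the choice of visible bands are placeholders (the protocol typically uses a sensor-matched MTF blur, and the exact band ranges depend on the instrument):

```python
import numpy as np

def wald_degrade(hs, ratio=5, blur=3):
    """Blur the observed HS cube with a small box filter, then downsample
    by `ratio` to obtain the simulated LRHS image."""
    pad = blur // 2
    hp = np.pad(hs, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    blurred = np.zeros_like(hs)
    for dr in range(blur):          # accumulate the box-filter taps
        for dc in range(blur):
            blurred += hp[dr:dr + hs.shape[0], dc:dc + hs.shape[1]]
    blurred /= blur * blur
    return blurred[::ratio, ::ratio]

def simulate_pan(hs, visible_bands):
    """Simulated PAN: average of the visible spectral bands."""
    return hs[..., visible_bands].mean(axis=-1)

hs = np.random.rand(100, 100, 102)
lrhs = wald_degrade(hs)                     # (20, 20, 102)
pan = simulate_pan(hs, list(range(40)))     # (100, 100)
```

The resolution ratio of 5 matches the degraded HS / simulated PAN sizes reported in the experiments below.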

B. Parameter Setting
We conducted all experiments on an NVIDIA GeForce RTX 2060 SUPER GPU. The convolutional layers employed the ReLU (rectified linear unit) activation function. The loss function used in the experiments was the l1 loss between the pansharpened data and the ground truth. The training phase consisted of 100 000 iterations with a batch size of 64. The initial learning rate was set to 0.0001, decaying exponentially every 2000 iterations, and we utilized the Adam optimizer to minimize the loss function. Additionally, the tree structure of our Tree-SNet was empirically set to three layers, considering the relatively low computational burden.
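The learning-rate schedule described above can be sketched as a stepwise-exponential decay; the decay factor is not reported in the text, so the 0.9 below is an assumed value:

```python
def learning_rate(iteration, base_lr=1e-4, decay=0.9, step=2000):
    """Stepwise-exponential schedule: start at base_lr and multiply by
    `decay` every `step` iterations (decay=0.9 is assumed, not reported)."""
    return base_lr * decay ** (iteration // step)

assert learning_rate(0) == 1e-4          # initial rate
assert learning_rate(1999) == 1e-4       # unchanged within the first window
assert abs(learning_rate(2000) - 9e-5) < 1e-12  # first decay step
```

Over 100 000 iterations this gives 50 decay steps, so the final rate is base_lr x decay^50.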

C. Simulated Image Experiments
These experiments are conducted on the Pavia University and Houston University datasets. PAN and LRHS images were generated using Wald's protocol, as previously mentioned.
1) Pavia University Dataset: The spatial size of the degraded HS image is 122 × 68 pixels and that of the simulated PAN image is 610 × 340 pixels, so their ratio is 5. Table III shows the quantitative results of all the methods on the Pavia University dataset. As seen from the table, our proposed Tree-SNet achieves the best quality metrics among all these methods, with great improvements on all quantitative assessments (i.e., CC, SAM, RMSE, ERGAS, and Q2n).
To identify artifacts and distortions in the pansharpened results, Fig. 4(a)-(o) shows the pseudocolor images achieved by the different methods. Compared with the other methods, the pansharpened image of our proposed method exhibits the best visual performance, with no noticeable artifacts or distortions, as shown in the enlarged area. The image of GFPCA performs the worst in spatial preservation, with much blurring. The image of SFIM has a large color difference compared with the ground-truth image, meaning it performs the worst in spectral restoration. Fig. 5(a)-(o) shows the error images of the different methods. The error image of our method shows smaller errors than the other error images in the high-frequency areas, which means the pansharpened image of our method is more similar to the ground-truth image, with good spatial and spectral restoration.
To further demonstrate the effectiveness of the tree-structure mechanism in our method, we graphically illustrate the feature maps produced by only the spatial enhancement path and by only the spectral enhancement path in Fig. 6. Each of the figures is a pseudocolor image composited from the three visible-light bands. As shown in the figure, the feature maps from the different enhancement paths exhibit different spatial and spectral characteristics. The feature map from the spatial enhancement path contains a large amount of detail, e.g., sharp edges and clear structure contours, but yields poor color restoration compared to the ground truth [as shown in Fig. 4(a)], which indicates obvious spectral distortion. In contrast, the feature map from the spectral enhancement path performs well in keeping spectral fidelity; however, it is blurred, which corresponds to the loss of spatial details. Therefore, it is verified that the spatial enhancement path tends to extract and improve spatial information, while the spectral enhancement path tends to preserve spectral information; thus, the validity of our tree-structure method is further indicated.
2) Houston University Dataset: The spatial size of the degraded HS image is 68 × 265 pixels and that of the simulated PAN image is 340 × 1325 pixels, so their ratio is 5. The comparison of quality metrics for the different methods on the Houston University dataset is shown in Table IV, from which our Tree-SNet performs the best on these scenes for all the mentioned metrics. GFPCA and MTF-GLP obtain much worse performance than the other methods, while BayesSparse and SFIM achieve the better performance among the traditional methods. To assess the visual performance of the different methods, Fig. 7(a)-(o) shows their pansharpening results. From these images, the result generated by our method exhibits the best performance, with no blurring or color difference in the enlarged area. GFPCA performs worst in spatial restoration while SFIM performs worst in spectral restoration compared with the other methods. In general, the traditional methods exhibit worse performance in both spatial and spectral restoration than the CNN methods.
In addition, Fig. 8(a)-(o) shows the error images of the different methods, which depict the difference between the pansharpened images and the ground-truth images. From Fig. 7, the pansharpening result of our proposed method obtains the best visual performance, which means it is the most similar to the ground-truth image among the mentioned methods. Generally, the results of the CNN methods achieve much smaller errors than those of the traditional methods.

D. Real Image Experiments
The Merced dataset and the Changping dataset are utilized for evaluation in the real image experiments. Two kinds of experiments are carried out on the two datasets: the reduced-resolution experiment and the full-resolution experiment.
1) Merced Dataset: In the reduced-resolution experiment, the HS and PAN images are downsampled simultaneously and then fused to yield HRHS images. Similar to the simulated-data experiments, the original HS images serve as reference images to evaluate the performance of the different methods. The dimension of the test PAN image is 450 × 450 pixels, and the size of the experimental HS image is 150 × 150 pixels. Fig. 9(a)-(o) shows the visual results of all the compared pansharpening methods for the simulated Merced dataset, and Fig. 10(a)-(o) shows the error images with respect to the ground-truth image. As observed from these images, our Tree-SNet obtains the best performance in both spatial and spectral restoration.
In the full-resolution experiment, the HS and PAN images are used as the inputs to generate a real HRHS image. The dimension of the test PAN image is 300 × 300 pixels, and the size of the experimental HS image is 100 × 100 pixels.
Table V gives the quantitative evaluation results for both the reduced-resolution and the full-resolution experiments. CC, SAM, RMSE, ERGAS, and Q2n are for the reduced-resolution experiments, while the quality-with-no-reference (QNR) protocol [46] is used to evaluate the full-resolution pansharpened images. As shown in this table, our proposed Tree-SNet achieves the best performance with respect to four quality metrics in the reduced-resolution experiment, while HSpeNet1 yields better results under the SAM metric and the QNR metric. HSpeNet1 is specifically designed for recovering spectral information, and the Merced dataset may comprise relatively less spatial information but relatively more spectral information. Thus, HSpeNet1 performs better under SAM, which is a spectral-preserving metric. As regards the QNR metric, its calculation considers both spectral distortion and spatial distortion. As the Merced scene may contain relatively less spatial information but relatively more spectral information, the QNR metric on that dataset is especially sensitive to spectral distortion. Therefore, HSpeNet1 tends to achieve high QNR performance on the Merced dataset. However, considering the experimental results across all the quantitative quality metrics in Table V, our Tree-SNet still achieves excellent overall performance. Fig. 11(a)-(n) graphically illustrates the pansharpening results of the different methods for the real Merced dataset. The result of our Tree-SNet achieves better visual performance than the other methods, with less blurring and spectral distortion.
2) Changping Dataset: In the reduced-resolution experiment, the size of the test PAN image is 180 × 180, and the size of the test HS image is 60 × 60. Fig. 12(a)-(o) shows the pseudocolor images of all the compared pansharpening methods for the reduced-resolution experiments, and Fig. 13(a)-(o) shows the error images. Our method exhibits the best performance in spatial restoration.
In the full-resolution experiment, the original HS and the connected PAN images are used to yield a real HRHS image. The size of the test PAN image is 360 × 360, and the size of the connected HS image is 120 × 120.
Table VI gives the quantitative evaluation results for both the reduced-resolution and the full-resolution experiments. As shown in this table, our proposed Tree-SNet achieves the best performance with respect to five quality metrics (i.e., CC, RMSE, ERGAS, Q2n, and QNR). Similar to the experimental results on the Merced dataset, HSpeNet1 produces better results under SAM. This may also be because the Changping dataset contains relatively less spatial information but relatively more spectral information. Fig. 14(a)-(n) gives the visual results of the different methods for the full-resolution Changping dataset. The result of our Tree-SNet exhibits the best visual performance, with the least blurring and spatial distortion.

E. Ablation Study
To further examine the contributions of specific structures in the proposed Tree-SNet, we conduct ablation studies on the Pavia University, Houston University, Merced, and Changping datasets. We mainly test the effectiveness of our Tree-SNet strategy, which comprises both spatial enhancement paths and spectral enhancement paths. Tables III, IV, V, and VI report the results of removing the spectral enhancement paths and of removing the spatial enhancement paths, where SpatE refers to the variant with the spectral enhancement paths removed, while SpecE denotes the variant with the spatial enhancement paths removed. As observed from the tables, when either the spectral enhancement paths or the spatial enhancement paths are removed from our Tree-SNet, its performance drops significantly.
With the removal of the spatial enhancement paths or the spectral enhancement paths, the routing nodes in the tree structure lose their dynamic tuning function, i.e., they are unable to distribute the data, with adaptive weights, to paths performing different spectral and spatial processing. Thus, the adaptive tuning of the pansharpening becomes unavailable. In this situation, only spatial enhancement or only spectral enhancement may occur, which leads to spectral/spatial distortion in the pansharpening result. Therefore, this experiment validates the structure of Tree-SNet, which is composed of both spatial enhancement paths and spectral enhancement paths.
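The adaptive weighting described above can be illustrated conceptually as follows. This is a minimal NumPy sketch with a hypothetical sigmoid gate, not the authors' implementation (whose routing node is detailed in Fig. 2): a routing node pools the incoming feature block, derives a scalar weight, and sends complementary weighted copies to the spatial and spectral enhancement paths.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def routing_node(features, w, b=0.0):
    """Soft routing: pool the (channels, H, W) feature block globally,
    derive a gate p in (0, 1), and distribute adaptively weighted
    copies of the input to the two child paths (SpatE and SpecE)."""
    pooled = features.mean(axis=(1, 2))        # (channels,)
    p = sigmoid(float(pooled @ w + b))         # routing probability
    return p, p * features, (1.0 - p) * features

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))           # feature block
w = 0.1 * rng.standard_normal(8)               # hypothetical gate weights
p, to_spate, to_spece = routing_node(x, w)
print(0.0 < p < 1.0, np.allclose(to_spate + to_spece, x))  # True True
```

Removing one path degenerates the gate: the data can no longer be balanced between the two kinds of processing, which mirrors the distortion observed in the ablation.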

IV. CONCLUSION
In this article, we develop an innovative Tree-SNet method to deal with the nonstationary spatial-spectral coupling in HS pansharpening. Our design introduces a hierarchical tree structure to guide the pansharpening CNN toward spatial-spectral adaptiveness. In the tree-structure-guided neural network, routing nodes are employed to automatically tune the data delivered to tree paths according to the local data characteristics, whereas SpatE and SpecE modules are dynamically run in the tree paths to further reinforce the adaptive spatial-spectral processing. Our method achieves remarkable performance on the HS pansharpening task.

Fig. 2. Structure of a routing node in the proposed Tree-SNet.

Algorithm 1: Tree-SNet.

The Houston University dataset was acquired over the University of Houston, USA, and contains an HS image with a spatial size of 340 × 1325 and 144 spectral bands ranging from 0.4 to 1.0 μm. The top-left part of size 200 × 200, the top-right part of size 410 × 340, and the remaining part are selected for testing, training, and validation, respectively. In total, 7680 training samples and 768 validation samples of size 20 × 20 are produced.
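The exact patch-sampling scheme (stride, overlap, augmentation) is not stated here; as a minimal sketch under the assumption of non-overlapping tiling, extracting 20 × 20 samples from a region shaped like the 410 × 340 × 144 training part could look like:

```python
import numpy as np

def extract_patches(image, patch=20, stride=20):
    """Tile an (H, W, bands) region into patch x patch samples; with
    stride == patch the tiles are non-overlapping (an assumption, since
    the sampling stride is not specified in the text)."""
    h, w, _ = image.shape
    return [image[i:i + patch, j:j + patch]
            for i in range(0, h - patch + 1, stride)
            for j in range(0, w - patch + 1, stride)]

region = np.zeros((410, 340, 144))      # hypothetical training region
samples = extract_patches(region)
print(len(samples), samples[0].shape)   # 340 tiles of shape (20, 20, 144)
```

A non-overlapping tiling yields 20 × 17 = 340 tiles, so reaching the reported 7680 training samples would require denser sampling or augmentation, which the text does not detail.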

Fig. 5(a)-(o) shows the error images of the different methods. The error image of our method shows smaller errors than the other error images in the high-frequency areas, which means the pansharpened image of our method is more similar to the ground-truth image, with good spatial and spectral restoration. To further demonstrate the effectiveness of the tree-structure mechanism in our method, we graphically illustrate the feature maps produced by only the spatial enhancement path and by only the spectral enhancement path in Fig. 6. Each of the figures is a pseudocolor image composed of the three visible-light bands. As shown in the figure, the feature maps from the different enhancement paths exhibit different spatial and spectral characteristics. The feature map from the spatial enhancement path contains a large amount of detail, e.g., sharp edges and clear structural contours, but yields poor color restoration compared to the ground-truth [as shown in Fig. 4(a)], which indicates obvious spectral distortion. In contrast, the feature map from the spectral enhancement path performs well in keeping spectral fidelity; however, it is blurred, which reflects the loss of spatial details. Therefore, it is verified that the spatial enhancement path tends to extract and improve spatial information, while the spectral enhancement path tends to extract and improve spectral information; thus, the validity of our tree-structure method is further confirmed.

2) Houston University Dataset: The spatial size of the degraded HS image is 68 × 265 pixels, and the simulated PAN image is 340 × 1325 pixels, i.e., a resolution ratio of 5. The quality metrics of the different methods on the Houston University dataset are shown in Table IV, from which it can be seen that our Tree-SNet performs the best on these scenes for all mentioned metrics. GFPCA and MTF-GLP obtain much worse performance than the other methods, while BayesSparse and SFIM achieve relatively better performance among the traditional methods.
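A pseudocolor composite of the kind used in Fig. 6 can be sketched as follows; the band indices are hypothetical, since the text does not specify which visible-light bands are selected:

```python
import numpy as np

def pseudocolor(cube, bands):
    """Compose a displayable pseudocolor image from three selected bands
    of an (H, W, B) cube, stretching each band independently to [0, 1]."""
    rgb = cube[:, :, list(bands)].astype(float)
    for k in range(3):
        ch = rgb[:, :, k]
        lo, hi = ch.min(), ch.max()
        rgb[:, :, k] = (ch - lo) / (hi - lo + 1e-12)
    return rgb

cube = np.arange(2 * 2 * 4, dtype=float).reshape(2, 2, 4)
img = pseudocolor(cube, (0, 1, 2))     # hypothetical visible-light bands
print(img.shape)                       # (2, 2, 3)
```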

Fig. 6. Feature maps for the Pavia University dataset produced by (a) the spatial enhancement path and (b) the spectral enhancement path.

TABLE IV. QUALITY METRICS OF DIFFERENT HYPERSPECTRAL PANSHARPENING METHODS ON THE HOUSTON UNIVERSITY DATASET

TABLE V. QUALITY METRICS OF DIFFERENT HYPERSPECTRAL PANSHARPENING METHODS ON THE MERCED DATASET

TABLE VI. QUALITY METRICS OF DIFFERENT HYPERSPECTRAL PANSHARPENING METHODS ON THE CHANGPING DATASET