Multiscale Fusion Network Based on Global Weighting for Hyperspectral Feature Selection

Feature selection (FS) is an important way to achieve high-precision and efficient classification of hyperspectral remote sensing images. However, most existing FS methods extract features at a fixed scale and ignore the relationship between the spatial and spectral dimensions, although this correlation is useful for classification. In this article, a multiscale feature fusion network based on global weighting (MSFGW) is proposed, in which a global weighting mechanism is explored to capture spatial–spectral information at multiple scales. First, a multiscale feature extraction module composed of group convolution and dilated convolution is used to extract multiscale features. As the dilation rate increases, the module captures spatial differences at varying scales. Second, a 3-D weighting mechanism combines the correlated spatial and spectral information to reduce homologous and heterologous interference and to boost feature discrimination. Then, the multiscale weighted features are fused to integrate the internal information of all bands at different scales. Finally, a band reconstruction network is used to select representative bands according to their entropy. Experimental comparisons with state-of-the-art FS algorithms on four widely used hyperspectral datasets demonstrate that the features selected by MSFGW have clear advantages in classification with only a few training samples.


I. INTRODUCTION
HYPERSPECTRAL imaging is one of the most important remote sensing detection methods because it effectively integrates target spectral acquisition and spatial imaging. The full name of hyperspectral remote sensing is "hyperspectral resolution remote sensing," which more intuitively reflects its ability to characterize details in the spectral dimension. Therefore, hyperspectral remote sensing is not only an important means of Earth observation but also an indispensable component of the spatial information network. In addition, it has played an active role in many tasks, such as urban mapping [1], geological exploration [2], and military surveillance [3]. However, hyperspectral remote sensing images have a large number of bands, and the bands are highly correlated. As a result, analysis and processing based on hyperspectral images (HSIs) carry a heavy computational burden. In practical classification applications, once the number of feature dimensions exceeds a certain threshold, classification performance deteriorates as more features are added, which is called the "Hughes phenomenon" [4], [5]. Accordingly, dimension reduction by feature extraction or band selection is a good way to overcome these problems [6].
Feature extraction [7] achieves the purpose of dimensionality reduction by performing different forms of function mapping on the original features. Compared with feature extraction techniques, the features obtained by feature selection (FS) [8], [9] are subsets of the original set of bands, so the physical meaning of the original bands is preserved. Therefore, this article mainly discusses the related issues of band selection.
According to the strategy used, band selection methods are divided into rank-based, cluster-based, search-based, etc. [10]. The rank-based FS method usually quantifies and sorts all the bands according to a certain evaluation criterion and selects the high-priority bands according to a threshold on the sorting index. Examples include maximum variance principal component analysis [11], sparse representation (SpaBS) [12], [13], and geometry-based BS (OPBS) [14]. The selection result is mainly influenced by the ranking criterion. The search-based FS method regards FS as a multiobjective optimization problem, which is essentially the optimization of a criterion function, e.g., the multiobjective evolutionary algorithm [15], the quantum search algorithm [16], and particle swarm optimization [17], [18]. The search process is usually time-consuming. The cluster-based FS method divides the spectrum into multiple clusters according to the task requirements, based on the similarity between the bands. Typical examples are sparse nonnegative matrix factorization clustering (SNMF) [19], affinity propagation clustering [20], [21], and K-means clustering [22], [23]. Nevertheless, the existing FS methods have problems, such as heavy computation, high similarity among the selected bands, and a tendency to fall into local optimal solutions.
Recently, convolutional neural networks (CNNs) [24], [25] in deep learning have received extensive attention because they transform initial "low-level" feature representations into abstract "high-level" representations through multilayer convolutional network structures. This characteristic is suitable for solving the complex problem of band FS in HSI data. Zhan et al. [26] were the first to apply CNNs to band selection, modeling the relationship between bands by a simple combination of convolution and pooling. BSNet-Conv [27] first introduced the attention mechanism into band selection; it uses attention to consider global information and models the global nonlinear correlation between spectral bands instead of estimating each band independently. However, the method is weak in capturing long-range contextual information in both the spatial and spectral directions. On this basis, Roy et al. [28] utilize a dual attention mechanism (DAM) to capture long-range nonlinear contextual information in the spectral and spatial directions and achieve information weighting in both channel and spectral dimensions. Objects at different spatial scales have their specific spectral features, but most algorithms do not address the spatial scales of various objects. They use a uniform spatial scale to measure the feature information of all objects, which may affect the choice of representative bands.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Inspired by the above research, in this article, a multiscale fusion network based on global weighting (MSFGW) is proposed to solve the mentioned problem. Based on the assumption that all bands can be completely reconstructed from a subset of bands, the appropriate band subset is selected according to its contribution to the band reconstruction. First, MSFGW applies a multibranch module consisting of group convolutions and dilated convolutions to extract representative features, fully considering the spatial structure and spatial correlation of the target object. Then, to enhance the impact of useful features and reduce information dispersion, 3-D convolution is used to strengthen the information interaction between the channel and spatial dimensions. Finally, the features at different scales are fused by integrating the information flow from different branches, revealing the inner connections of all bands. The proposed model achieves state-of-the-art results on several challenging datasets, demonstrating the effectiveness and superiority of the method. The main contributions of this article are as follows.
1) The multibranch convolution module is employed to extract various spectral features at different spatial scales. A series of spectral features are extracted through the combined effect of the dilation rate and grouped convolution. As the dilation rate increases, the obtained feature data cubes capture the spectral information of ground objects at progressively larger scales.
2) The information selection mechanism of human vision is simulated by using 3-D attention to realize spatial attention and channel attention together. Weights are applied across the channel, spatial width, and spatial height dimensions to combine spatial and spectral information and reduce information loss.
The rest of this article is organized as follows. We first define the notations and review the basic concepts of group convolution and dilated convolution in Section II. Second, we introduce the proposed MSFGW for hyperspectral band selection in Section III. Next, in Section IV, we explain the experiments on four hyperspectral datasets and compare the results with many existing FS methods. Finally, Section V concludes this article.

A. Definition and Notations
For convenience, in this article, the 3-D HSI cube is represented as $I \in \mathbb{R}^{W \times H \times C}$, where $W$ and $H$ are the length and width of the band image, respectively, and $C$ is the number of bands. Thus, $I$ can be regarded as a set of $C$ band images $B = \{B_1, B_2, \ldots, B_C\}$. The goal of band selection is to select a subset $D$ from the set $B$ that meets the task requirements, where $D$ consists of $b$ bands and satisfies $D \subseteq B$, $D \in \mathbb{R}^{W \times H \times b}$. In addition, considering only individual pixels ignores the information related to the spatial arrangement of pixels in the scene. Therefore, in the data processing stage, a 3-D neighborhood block $P \in \mathbb{R}^{S \times S \times C}$ is extracted from the original image $I$, centered on the spatial position $x(i, j)$, where $i = 1, 2, \ldots, W$ and $j = 1, 2, \ldots, H$; the ground-truth label of the block is that of its center pixel. For convenience, the input and output of the neural network are represented by tensors. For example, the input of the convolutional layer is represented as $X \in \mathbb{R}^{N \times M \times C}$, where $N \times M$ is the spatial size of the input feature map, and $C$ is the number of channels.
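The neighborhood-block extraction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `extract_patch` is hypothetical, the cube is a nested list indexed as `image[row][col][band]` with 0-based indices (the text uses 1-based indices), and positions outside the image are zero-filled because the text does not state how borders are handled.

```python
def extract_patch(image, i, j, size):
    """Return the size x size x C block centred on pixel (i, j).

    `image` is a nested list indexed as image[row][col][band].
    Positions outside the image are filled with zeros (an assumption;
    the border handling is not specified in the text).
    """
    if size % 2 == 0:
        raise ValueError("size must be odd so that the patch has a centre pixel")
    rows, cols, bands = len(image), len(image[0]), len(image[0][0])
    half = size // 2
    patch = []
    for r in range(i - half, i + half + 1):
        patch_row = []
        for c in range(j - half, j + half + 1):
            if 0 <= r < rows and 0 <= c < cols:
                patch_row.append(list(image[r][c]))
            else:
                patch_row.append([0.0] * bands)
        patch.append(patch_row)
    return patch


# Toy cube: 4 x 5 pixels, 3 bands; each value encodes (row, col, band).
toy = [[[100 * r + 10 * c + b for b in range(3)] for c in range(5)] for r in range(4)]

# 3 x 3 x 3 block around the top-left pixel; its label would be that of toy[0][0].
patch = extract_patch(toy, 0, 0, 3)
```

The centre entry `patch[1][1]` is the spectrum of the labelled pixel, and the surrounding entries carry its spatial context.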

B. Group Convolution
HSIs have the typical characteristics of a large amount of data, so the convolution processing based on HSIs is computationally intensive. Group convolution [29] can effectively alleviate this problem without affecting the results.
The inspiration for group convolution comes from Inception [30] and AlexNet [29], which separate the convolution along the channel dimension from that along the spatial dimension. The feature maps obtained by different convolutional paths are less coupled with each other and focus on different features, so better results can be obtained. As the convolution can be split into multiple paths, the model can be trained on multiple GPUs in parallel. Moreover, the number of model parameters decreases as the number of groups increases, which makes training more efficient.
As shown in Fig. 1, Fig. 1(a) represents the standard convolution operation. Suppose the size of the input feature map is $H_1 \times W_1 \times C_1$, the size of each convolution kernel is $h_1 \times w_1 \times C_1$, and the number of kernels is $C_2$. The final output is $H_2 \times W_2 \times C_2$, and the number of parameters of the convolutional layer is $h_1 \times w_1 \times C_1 \times C_2$. Fig. 1(b) represents the group convolution operation. Assuming that the input feature map is divided into two groups, the input feature size of each group is $H_1 \times W_1 \times (C_1/2)$, the size of each convolution kernel is $h_1 \times w_1 \times (C_1/2)$, and the number of kernels per group is $C_2/2$. The output feature map size of each group is $H_2 \times W_2 \times (C_2/2)$. The total number of parameters of the two groups is $h_1 \times w_1 \times (C_1/2) \times (C_2/2) \times 2 = (h_1 \times w_1 \times C_1 \times C_2)/2$. From this example, it can be concluded that the number of parameters of a group convolution is $1/g$ of that of the regular convolution, where $g$ is the number of groups ($g = 2$ in Fig. 1(b)).
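The parameter count above can be checked with a short Python sketch. The helper name `conv_params` and the layer sizes (3 × 3 kernels, 200 input channels, 64 output channels) are illustrative assumptions, not values taken from the proposed network; bias terms are ignored, as in the text.

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels, groups=1):
    """Number of weights (bias excluded) in a 2-D convolution layer."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("groups must divide both channel counts")
    per_group = kernel_h * kernel_w * (in_channels // groups) * (out_channels // groups)
    return per_group * groups


standard = conv_params(3, 3, 200, 64)           # regular convolution
grouped = conv_params(3, 3, 200, 64, groups=2)  # two groups, as in Fig. 1(b)

print(standard, grouped, standard // grouped)
```

With two groups the count is halved, in line with the $1/g$ rule. The divisibility check also shows why a layer whose channel count has no suitable divisor cannot be grouped evenly.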

C. Dilated Convolution
In deep convolutional networks, downsampling (such as pooling) is frequently performed to increase the receptive field and reduce the computation. Although the receptive field can be increased in this way, the spatial resolution is reduced, which directly affects the subsequent application of HSIs. To expand the receptive field without losing resolution, dilated convolution [31], [32] is utilized to retain more feature map information. Let $\alpha$ denote the dilation rate, which represents the degree of convolution kernel expansion. The size $K$ of the dilated convolution kernel is related to the kernel size $k$ of the original convolution by $K = k + (k - 1)(\alpha - 1)$. The receptive fields of the convolution kernel for $\alpha = 1$, 2, and 3 are shown in Fig. 2.
As shown in Fig. 2, although the three convolutional kernels have the same size, that is, 3×3, the receptive field observed by the model is different. In Fig. 2(a), when $\alpha = 1$, the size of the convolution kernel is 3×3, which is the same as the general convolution. In Fig. 2(b), when $\alpha = 2$, the size of the dilated convolution kernel is 5, and the receptive field is 7×7. Similarly, when $\alpha = 3$, the convolution kernel size becomes 7, and the receptive field can grow to 11×11. Thus, dilated convolution can obtain a larger receptive field without increasing the number of parameters.
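As a quick check of the kernel-size relation $K = k + (k - 1)(\alpha - 1)$, the following Python sketch lists the expanded kernel size and the sampling offsets along one axis for a 3-tap kernel. The function names are hypothetical and only the kernel-size formula from the text is used.

```python
def effective_kernel_size(k, dilation):
    """Expanded kernel size K = k + (k - 1)(dilation - 1)."""
    return k + (k - 1) * (dilation - 1)


def dilated_taps(k, dilation):
    """Offsets (relative to the first tap) sampled along one axis."""
    return [n * dilation for n in range(k)]


for alpha in (1, 2, 3):
    print(alpha, effective_kernel_size(3, alpha), dilated_taps(3, alpha))
```

In every case only three taps per axis are used, so the number of weights stays at 3 × 3 while the covered extent grows with the dilation rate.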

III. PROPOSED NETWORK
This section introduces the backbone structure and the components of the proposed band selection network, including the multiscale feature extraction part, the 3-D feature weighting, the feature fusion part, and the band reconstruction network. The main idea is first to extract the target features at multiple spatial scales. Second, the different branches are weighted in both the channel and spatial dimensions, considering the importance of cross-dimensional interactions between the different bands. Then, to obtain the complete information of all bands, the output features of each branch are fused by summation. Finally, the fused features are applied to the band reconstruction, and the subset of bands that contributes most to the band reconstruction is selected as the final result.

A. Multiscale Feature Extraction
Most existing CNN-based algorithms use a combination of convolution and pooling operations to learn features, and the features extracted by convolution are at a single scale. However, in reality, the size and shape of objects in images differ, so features extracted at a uniform scale are not enough to meet the needs of complex situations. Receptive fields of different sizes are therefore required to obtain contextual information. For images containing different objects or images with different resolutions, learning object features at different scales gives a more complete description of the spatial structure of objects.
Inspired by the Inception model [30] and other work [33], [34], this article attempts to capture target characteristics at different scales. The Inception model was the first to use multibranch convolution with different kernel sizes, which extends the convolution operation between layers of the neural network and results in receptive fields of different sizes. Similar to the Inception structure, this section designs a multibranch module composed of dilated convolutions to extract features from different receptive fields. The module decomposes the feature extraction of each image patch into three parts (branch A, branch B, and branch C) with increasing dilation rates to describe the spatial characteristics at different scales.
Specifically, the multibranch feature extraction module is shown in Fig. 3. The module consists of three branches, which extract characteristics at increasing scales from the input image cube. For the input image cube $X \in \mathbb{R}^{S \times S \times C}$, a different operation is applied in each branch:
$$U_1 = F_A(X), \quad U_2 = F_B(X), \quad U_3 = F_C(X)$$
where $F_A$, $F_B$, and $F_C$ are the operations of branches A, B, and C, each composed of grouped convolution and dilated convolution. The kernel size of all three branches is 3 × 3, but the dilation rate is 1 for branch A, 2 for branch B, and 3 for branch C. As discussed in Section II, the dilation rate is directly related to the receptive field obtained by convolution: the larger the dilation rate, the larger the receptive field. In this article, different dilation rates are used so that the multibranch module extracts multiscale features. With these dilation rates, the three branches cover the same spatial extent as convolution kernels of size 3×3, 5×5, and 7×7, respectively. The extracted features therefore contain context information from multiple receptive fields without significantly increasing the number of parameters. In addition, group convolution is adopted to reduce the amount of calculation. After the image cube is fed into the different branches, the same grouping is performed in the channel dimension. Each group is convolved separately, and the results are then concatenated into a complete feature map. The outputs of the multiscale feature extraction module are $U_1$, $U_2$, and $U_3$, which contain the feature information at different scales for the next step. However, since the number of bands in the UP dataset is 103, which is a prime number, the channels cannot be divided evenly into groups. Therefore, grouped convolution is not applied in the experiments on this dataset, and ordinary convolution is used instead.
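The effect of the three dilation rates can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions: a single channel, a fixed all-ones 3 × 3 kernel instead of learned weights, zero padding chosen so that each branch keeps the 9 × 9 input size, and no grouped convolution. The function names are hypothetical.

```python
def dilated_conv2d(image, kernel, dilation):
    """Same-size 2-D cross-correlation with zero padding and a dilated kernel."""
    rows, cols = len(image), len(image[0])
    k = len(kernel)
    half = k // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for u in range(k):
                for v in range(k):
                    rr = r + (u - half) * dilation
                    cc = c + (v - half) * dilation
                    if 0 <= rr < rows and 0 <= cc < cols:  # zero padding elsewhere
                        acc += kernel[u][v] * image[rr][cc]
            out[r][c] = acc
    return out


def footprint(feature_map):
    """Side length of the bounding box of the non-zero responses."""
    hit_rows = [r for r, row in enumerate(feature_map) if any(v != 0.0 for v in row)]
    return max(hit_rows) - min(hit_rows) + 1


# A single bright pixel in the centre of a 9 x 9 patch.
size = 9
impulse = [[0.0] * size for _ in range(size)]
impulse[4][4] = 1.0
ones = [[1.0] * 3 for _ in range(3)]

# Branches A, B, and C: same 3 x 3 kernel, dilation rates 1, 2, and 3.
branches = {alpha: dilated_conv2d(impulse, ones, alpha) for alpha in (1, 2, 3)}
for alpha, u in branches.items():
    print(alpha, footprint(u))
```

Each branch uses nine weights, yet the region influenced by the centre pixel spans 3, 5, and 7 pixels per side, which is the multiscale behaviour the module relies on.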

B. Feature Weighting and Fusion
The rich features extracted by the multibranch module describe the characteristics of the target at different scales, but inevitably not all of them are useful for the task. Therefore, it is important to measure the importance of the features in order to retain the useful ones. When feature importance is prioritized, correlation operations are usually performed along two dimensions, channel and spatial [35], [36]. However, such methods often ignore the global interaction between the spectral and spatial information of the bands.
In contrast to previous weighting mechanisms, we adopt a 3-D convolution method that takes into account the interaction between spatial and channel information. The main difference between 2-D convolution and 3-D convolution is the number of dimensions along which the filter slides. In 3-D convolution, the 3-D filter can move in all three directions (height, width, and channel). At each position, the elementwise multiplication and addition produce one value. Because the filter slides through a 3-D space, the output is also 3-D. The advantage of 3-D convolution is that it describes object relationships in 3-D space; it also reduces information dispersion while capturing important features in three dimensions. The size of the convolution kernel is set to (1, 3, 3), the stride is set to (1, 1, 1), and the padding is set to (1, 1, 0).
In the three branches, the outputs of the multiscale module, $U_1$, $U_2$, and $U_3$, are each subjected to a 3-D convolution operation
$$V_t = \mathrm{conv3d}(U_t; \Theta_b^t), \quad t = 1, 2, 3$$
where $\Theta_b^t$ denotes the trainable parameters involved in the conv3d and $V_t$ represents the output of the $t$-th branch. The spatial and channel 3-D feature weights obtained by the weighting module can be used to measure the characterization capability of the features. In addition, since the importance of the features is considered from a global perspective, the fusion of spatial and channel information retains as much of the 3-D information of the features as possible. To create an interaction between the original input and the weights, the output of the multiscale feature extraction module is multiplied by the weight matrix to strengthen important features and suppress unwanted features
$$Y_t = U_t \odot V_t, \quad t = 1, 2, 3.$$
The information contained in $Y_1$, $Y_2$, and $Y_3$ from the three branches is different, so a single output cannot make full use of the contextual information. To make the final output contain features at different scales, a cross-channel connection is adopted for information fusion. The weighted features from each branch are integrated by summation to achieve the complementarity of spectral features at different spatial scales
$$Y = Y_1 + Y_2 + Y_3.$$
The final output $Y$ fuses the multiscale features, which helps uncover the real structure of all spectral bands.
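The weighting and fusion steps can be sketched as follows. This is a minimal illustration, not the trained network: the weights are fixed numbers rather than the output of a learned 3-D convolution, the feature cubes are flattened into short lists, and the function names are hypothetical. The helper `conv_output_size` applies the usual convolution output-size formula per axis; under that formula and a (depth, height, width) axis order, which is an assumption here, a (1, 3, 3) kernel with stride 1 preserves the input shape when the padding is (0, 1, 1), so the padding triple reported in the text may follow a different axis convention.

```python
def conv_output_size(n, kernel, stride=1, padding=0, dilation=1):
    """Output length along one axis for the usual convolution arithmetic."""
    return (n + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def weight_and_fuse(features, weights):
    """Element-wise weighting followed by summation over the branches."""
    length = len(features[0])
    fused = [0.0] * length
    for u_t, v_t in zip(features, weights):
        if len(u_t) != length or len(v_t) != length:
            raise ValueError("all branches must share one shape")
        for idx, (u, v) in enumerate(zip(u_t, v_t)):
            fused[idx] += u * v  # Y_t = U_t * V_t, accumulated into Y
    return fused


# Shape check per axis for a 9-sample axis.
same_spatial = conv_output_size(9, kernel=3, padding=1)  # 3-tap axis, padding 1
same_depth = conv_output_size(9, kernel=1, padding=0)    # 1-tap axis, padding 0

# Toy branch outputs U_1..U_3 and weights V_1..V_3 (illustrative values).
u = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
v = [[1.0, 0.0, 0.5], [0.5, 1.0, 0.0], [0.0, 0.5, 1.0]]
fused = weight_and_fuse(u, v)
print(same_spatial, same_depth, fused)
```

Element-wise multiplication requires the weight cube to have the same shape as the branch output, which is why the shape-preserving padding matters for this step.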

C. Reconstruction Network
The evaluation of the selected bands is based on the assumption that the spectral bands can be reconstructed sparsely from a small number of informative bands. If the band reconstruction from a selected band set is better, the set must contain more useful information. Therefore, to demonstrate the representativeness of the selected bands, a reconstruction network is applied to reconstruct the original spectral bands from the weighted fused features [27]. The bands with the highest entropy ranking in the final reconstructed band set are selected. For convenience, the band reconstruction network is defined as a function $\Phi$ with the multiscale weighted output $Y$ as input
$$\hat{X} = \Phi(Y; \Theta_c)$$
where $\Theta_c$ denotes the trainable parameters involved in the reconstruction network. The reconstruction network is a completely symmetric encoder-decoder structure in which the encoder mainly analyzes the object information, and the decoder then maps the parsed information to the final image form. The encoder is mainly composed of convolution, pooling, and batch normalization layers, whereas the decoder adopts deconvolution. The encoder classifies and analyzes the low-level local pixel values of the image to obtain higher level semantic information. The decoder upsamples the feature map containing high-order information and then convolves the upsampled map to restore the geometric shape of the object. The effectiveness of the network is measured by the mean-square error
$$\mathcal{L} = \frac{1}{S_{tra}} \sum_{i=1}^{S_{tra}} \lVert x_i - \hat{x}_i \rVert_2^2$$
where $x \in X$, $\hat{x} \in \hat{X}$, $\hat{X} \in \mathbb{R}^{S \times S \times C}$ is the reconstructed output for the given input $X \in \mathbb{R}^{S \times S \times C}$, and $S_{tra}$ is the number of training samples. For quantitative analysis of the selected band subsets, the entropy and mean spectral divergence (MSD) of the reconstructed bands are calculated. According to Shannon's entropy theorem, entropy is related to the image information contained in the band.
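The entropy of a band $B$ is
$$H(B) = -\sum_{h} p(h) \log p(h)$$
where $h$ is the gray level of the histogram bins in a band consisting of $S \times S$ pixels and $p(h)$ is the probability that $h$ occurs. The larger the MSD value, the less redundancy between the selected bands [37], [38]
$$\mathrm{MSD} = \frac{2}{b(b-1)} \sum_{i=1}^{b} \sum_{j=i+1}^{b} D_{SKL}(C_i \,\|\, C_j)$$
where $D_{SKL}$ is the symmetrical Kullback-Leibler divergence, which measures the dissimilarity between bands $C_i$ and $C_j$. Specifically, $D_{SKL}$ is defined as follows:
$$D_{SKL}(C_i \,\|\, C_j) = D_{KL}(C_i \,\|\, C_j) + D_{KL}(C_j \,\|\, C_i)$$
where $D_{KL}(C_i \,\|\, C_j)$ is calculated from gray-level histogram bins.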
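A minimal sketch of the two band-quality measures, assuming the usual definitions: Shannon entropy of a band's gray-level histogram, and MSD as the average pairwise symmetric Kullback-Leibler divergence between band histograms. The function names, the natural logarithm, the number of bins, and the small constant `eps` that guards against empty bins are all assumptions made for illustration.

```python
import math


def histogram_probs(pixels, bins, lo, hi):
    """Normalised gray-level histogram of one band (values assumed in [lo, hi])."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for p in pixels:
        idx = min(int((p - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(pixels) for c in counts]


def entropy(probs):
    """Shannon entropy; empty bins contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence between two histograms (eps avoids log(0))."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def skl(p, q):
    """Symmetric KL divergence."""
    return kl(p, q) + kl(q, p)


def msd(band_probs):
    """Mean spectral divergence: average SKL over all band pairs."""
    b = len(band_probs)
    total = sum(skl(band_probs[i], band_probs[j])
                for i in range(b) for j in range(i + 1, b))
    return 2.0 * total / (b * (b - 1))


# Three toy bands with 8 pixels each and gray levels in [0, 4).
band_a = histogram_probs([0, 0, 1, 1, 2, 2, 3, 3], bins=4, lo=0, hi=4)
band_b = histogram_probs([0, 0, 1, 1, 2, 2, 3, 3], bins=4, lo=0, hi=4)
band_c = histogram_probs([0, 0, 0, 0, 0, 0, 3, 3], bins=4, lo=0, hi=4)

print(entropy(band_a), entropy(band_c))
print(msd([band_a, band_b]), msd([band_a, band_b, band_c]))
```

In this toy case the uniformly distributed band has the higher entropy, and a subset made of two identical bands has zero MSD, which matches the intuition that it is fully redundant.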

IV. EXPERIMENTAL RESULTS AND DISCUSSION
To verify the effectiveness of the proposed method, we evaluate it on four widely used datasets. In addition to qualitative analysis, three popular quantitative criteria, overall accuracy (OA), average accuracy (AA), and the Kappa coefficient (Kappa), are used as evaluation indicators. To illustrate the advantages of the proposed algorithm, it is compared with state-of-the-art band selection methods, such as LvaHAI [39], EGCSR_BS [40], IBRA-GSS [41], and NGNMF-E2DSSA [42]. Furthermore, experiments using all bands are added as a reference for analyzing the performance. Following [43], we uniformly extract 3-D patches of size 9 × 9 × C for training, where the number of bands C is 200, 103, and 204 for the Indian Pines (IP), University of Pavia (UP), and Salinas (SA) datasets, respectively. FDSSC [44] is used as the classifier to verify the classification performance of all band selection algorithms.
The entire framework is implemented in PyTorch with CUDA 10.1. For all datasets, each FS method is run independently ten times, with a learning rate of 0.0001, 200 epochs, and a batch size of 32.
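For reference, the three evaluation indicators can be computed from a confusion matrix as in the sketch below. It uses the standard definitions of OA, AA (mean of the per-class accuracies), and Cohen's Kappa; the function name and the toy confusion matrix are illustrative and are not results from this article.

```python
def classification_scores(confusion):
    """Return (OA, AA, Kappa) for confusion[i][j] = true class i predicted as j."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))
    oa = correct / total

    # AA: mean of per-class accuracies (classes without samples are skipped).
    row_sums = [sum(row) for row in confusion]
    per_class = [confusion[i][i] / row_sums[i] for i in range(n_classes) if row_sums[i] > 0]
    aa = sum(per_class) / len(per_class)

    # Kappa: agreement corrected for the agreement expected by chance.
    col_sums = [sum(confusion[i][j] for i in range(n_classes)) for j in range(n_classes)]
    expected = sum(row_sums[i] * col_sums[i] for i in range(n_classes)) / (total * total)
    kappa = (oa - expected) / (1 - expected)
    return oa, aa, kappa


# Toy two-class example: 10 test samples per class.
oa, aa, kappa = classification_scores([[8, 2], [1, 9]])
print(oa, aa, kappa)
```

OA weights every test sample equally, AA weights every class equally, and Kappa discounts chance agreement, so the three values can rank methods differently on class-imbalanced scenes.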

A. Hyperspectral Datasets
In this section, we use four well-known HSI datasets (i.e., Indian Pines, University of Pavia, Salinas Scene, and WHU-Hi-LongKou) to demonstrate the classification performance of the proposed method.
Indian Pines is the earliest test dataset for HSI classification. In 1992, the airborne visible/infrared imaging spectrometer (AVIRIS) imaged the Indian Pines test site in Indiana, USA. The scene was imaged in 220 contiguous wavebands. However, since bands 104-108, 150-163, and 220 lie in water-absorption regions, the 200 bands remaining after the removal of these 20 bands are generally taken as research objects. The data size is 145×145, including 16 classes of ground objects, as shown in Fig. 4 and Table I. The spatial resolution of the image is about 20 m, so mixed pixels are easily generated, which makes classification difficult.
The Pavia University data are a part of the hyperspectral data of Pavia, Italy, imaged by the German airborne reflective optics system imaging spectrometer (ROSIS) in 2003. The imager continuously imaged 115 wavebands, and the spatial resolution of the resulting image is 1.3 m. Among them, 12 bands are deleted due to the influence of noise, so the image formed by the remaining 103 spectral bands is generally used, and the data size is 610×340. The information on the nine classes of main ground objects is shown in Fig. 5 and Table II.
The Salinas data were taken by the AVIRIS imaging spectrometer over Salinas Valley in California, USA. Unlike the Indian Pines dataset, its spatial resolution is 3.7 m. The image initially has 224 bands. Generally, 204 bands remain after bands 108-112, 154-167, and 224, which lie in water-absorption regions, are removed. The size of the image is 512×217, and it is divided into 16 classes, as shown in Fig. 6 and Table III.
The WHU-Hi-LongKou dataset was collected in July 2018 in Longkou Town, Hubei Province, China. The data were acquired with an 8-mm focal length Headwall Nano-Hyperspec imaging sensor mounted on a DJI Matrice 600 Pro (DJI M600 Pro) UAV platform. The image size is 550 × 400 pixels with 270 bands, and the spatial resolution of the UAV hyperspectral imagery is about 0.463 m. The dataset is shown in Fig. 7 and Table IV.

B. Results on Indian Pines Dataset
To prove the effectiveness of the proposed algorithm, we conducted two experiments using various FS methods. First, hyperspectral classification is performed using band subsets of different sizes, ranging from 5 to 30. Second, training sets of different sizes are used for classification, ranging from 1% to 25% of the samples. To ensure the reliability of the experimental results, the training and test sets are randomly selected, and each method is run ten times independently.

1) Classification Performance With Different Numbers of Selected Bands:
Fig. 8(a)-(c) shows the average comparison results of OA(%), AA(%), and Kappa, respectively. It can be seen that MSFGW achieves the best OA(%), AA(%), and Kappa. When the number of selected bands is greater than 15, the classification accuracy of LvaHAI [39] is higher than that of BSNet-Conv [27]. The results show that the graph learning algorithm has advantages in mining the band clustering structure of HSIs over using spectral information alone. Judging from the trend of the results, the algorithms based on deep learning are more powerful than the other methods. In addition, the experimental results verify the aforementioned Hughes phenomenon: the classification accuracy does not always increase with the number of bands. For example, both BSNet-Conv [27] and NGNMF-E2DSSA [42] show a clear downward trend when the number of bands exceeds 15. This phenomenon occurs earlier in SpaBS [12], which shows a downward trend once the number of bands is greater than 10. However, the phenomenon appears later for MSFGW than for the other algorithms, which means that the algorithm in this article is more robust and efficient. The classification accuracy is best in the interval of 20-25 bands, which is consistent with the results of virtual dimensionality (VD) analysis [45], [46] evaluated using the false alarm probability $P_F = 10^{-5}$.
2) Classification Performance With Different Proportions of Training Samples: Fig. 8(d)-(f) shows the classification performance under different percentages of training samples. In the experiment, we fixed the band subset to 25 and changed the training size from 1% to 25% in 5% intervals. The results showed that MSFGW significantly outperformed other FS methods in terms of OA(%), AA(%), and Kappa. The classification accuracy keeps increasing as the number of training samples increases.
Table V presents the detailed classification performance of the different methods when the best 25 bands are selected and 1% of the samples are used for training. MSFGW achieves the highest OA (87.36%), AA (88.32%), and Kappa (0.85). It achieves the best accuracy in 8 of the 16 classes, and in individual classes it exceeds the second-best result by 10% or 20%. The second-best algorithm wins three classes, but most of its results in these classes are close to those of MSFGW, so its advantage is not obvious. According to the analysis, the poor results obtained in class 4 are due to the Corn class being similar to the Corn-min-till class and the Grass-pasture-mowed class, which easily leads to incorrect classification results. Overall, MSFGW has no obvious disadvantages and has a strong ability to classify various targets with a small number of training samples. Fig. 9 shows the classification results of different FS methods. The shapes of the various objects are well preserved, and the internal smoothness is higher. This demonstrates the capability of the features extracted by MSFGW and their effectiveness for HSI classification.

C. Results on Pavia University Data
To prove the applicability of the model, we repeated the same experiments on the Pavia University dataset.

1) Classification Performance With Different Numbers of Selected Bands:
We show the obtained OA(%), AA(%), and Kappa in Fig. 10(a)-(c), respectively. As can be seen, the proposed FS network significantly outperforms the other FS methods. In particular, once the number of selected bands is greater than 10, the classification accuracy and average accuracy are even superior to those obtained with all bands, with improvements of up to about 2%. The fact that the accuracy first increases and then decreases as the number of bands increases confirms the Hughes phenomenon. MSFGW produces a larger improvement than BSNet-Conv [27], which also indicates that the combination of spatial and spectral information better characterizes the internal structure of hyperspectral data. The results of LvaHAI [39] are similar to those of BSNet-Conv [27], but its optimal result appears earlier as the number of selected bands increases. The other algorithms have poorer classification results with large fluctuations on this dataset, so they are not as robust. From the curves, it can be derived that MSFGW achieves the best classification performance when the band subset size is around 15. Similarly, according to the VD analysis [45], [46] with the false alarm probability set to $P_F = 10^{-5}$, the optimal subset size for the University of Pavia dataset is 13, which is consistent with the experimental results in Fig. 10.
2) Classification Performance With Different Proportions of Training Samples: Fig. 10(d)-(f) shows the classification performance using different numbers of training samples. Compared with other FS methods, MSFGW achieves the best classification performance. Specifically, MSFGW and all bands show comparable performance. Under the same training samples, such as 10% and 15%, the results are similar to those of all bands overall. In Table VI, we compare the detailed classification performance with the band subset size set to 15 and the training size set to 1%. Under the premise of limited samples, our network achieves an OA of 90.23%, an AA of 91.36%, and a Kappa of 0.86. Meanwhile, the OA for all bands is 89.02%, the AA is 90.2%, and the Kappa is 0.88. In the classification of the nine classes, our algorithm wins four classes and the second-best algorithm wins one class. It can be seen from the classification results that MSFGW has obvious advantages, and its results over the nine classes are far superior to those of the other algorithms. At the same time, although its accuracy on the remaining classes is not the best, the gap is relatively small, which proves the effectiveness of the algorithm. Fig. 11 shows the classification results of different FS methods.

D. Results on Salinas Dataset
Similarly, we also verified the algorithm on the Salinas dataset. As with the Indian Pines and Pavia University datasets, the experiments use band subsets and training sets of different sizes.

1) Classification Performance With Different Numbers of Selected Bands: We show the obtained OA(%), AA(%), and Kappa in Fig. 12(a)-(c), respectively. The proposed FS network achieves better classification performance than SpaBS [12], EGCSR_BS_Clustering [40], IBRA-GSS [41], and the other networks. Moreover, the performance of MSFGW remains stable as the band subset size changes. The results of BSNet-Conv [27] and LvaHAI [39] continue to outperform those of the other algorithms except MSFGW, which suggests that deep learning algorithms have clear advantages on datasets of different resolutions. SNMF [19] gives the worst results, with a large gap to the other algorithms, indicating that it cannot adapt to data as complex as hyperspectral images. These comparisons also demonstrate the capability of the proposed MSFGW. When the band subset size is greater than 20, the classification accuracy of most FS methods no longer increases, which corresponds to the estimate of the VD analysis [45], [46].
2) Classification Performance With Different Proportions of Training Samples: Fig. 12(d)-(f) shows the classification performance under different percentages of training samples. In this experiment, we fixed the band subset size to 20 and varied the training size from 1% to 25% in 5% intervals. Even with 1% training samples, MSFGW obtains an improvement of at least 2% over EGCSR_BS_Clustering [40], IBRA-GSS [41], and BSNet-Conv [27]. As the number of training samples increases, the classification accuracy also grows, and the gap between MSFGW and all bands gradually narrows. In Table VII, we show the detailed classification performance of the different methods when the best 20 bands are selected and 1% of the samples are used for training. The Salinas dataset contains 16 classes, of which MSFGW wins eight and the suboptimal algorithm wins three. In particular, class 6 reaches 100% classification accuracy even though the training samples are limited. For classes such as 7, 10, and 13, the results are close to the best ones, although they are not optimal. Overall, MSFGW reaches the best OA (90.63%), AA (94.86%), and Kappa (0.92). Fig. 13 shows the classification results of different FS methods.
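To make the training-proportion protocol concrete, the sketch below draws a fixed fraction of labelled pixels from each class for training and keeps the rest for testing. It is only an illustration under stated assumptions: this section does not say whether sampling is stratified by class, whether a minimum of one sample per class is enforced, or which random seed is used, and the helper name `stratified_split` and the label vector are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction, seed=0):
    """Split labelled pixel indices into train/test sets, class by class.

    labels: list of integer class ids, one per labelled pixel.
    train_fraction: for example 0.01 for the 1% setting.
    Assumption: at least one sample per class is kept for training.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, cls in enumerate(labels):
        by_class[cls].append(idx)

    train, test = [], []
    for cls in sorted(by_class):
        indices = by_class[cls][:]
        rng.shuffle(indices)
        n_train = max(1, round(train_fraction * len(indices)))
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return train, test


# Hypothetical label vector: three classes with 1000, 300, and 40 pixels.
labels = [0] * 1000 + [1] * 300 + [2] * 40
for fraction in (0.01, 0.05, 0.10, 0.15, 0.20, 0.25):
    train, test = stratified_split(labels, fraction)
    print(f"{fraction:.0%}: {len(train)} train / {len(test)} test")
```

Sampling per class rather than over the whole image keeps rare classes represented at the 1% setting, where a purely random draw could leave a small class with no training pixels at all.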

E. Results on WHU_Hi_LongKou Dataset
To prove the effectiveness of the proposed algorithm, in addition to the above classic datasets, the newer WHU_Hi_LongKou dataset is also tested. To be fair, the same verification procedure as in the above experiments is used, that is, the classification performance with different numbers of bands and with different proportions of training samples.

1) Classification Performance With Different Numbers of Selected Bands: Fig. 14(a)-(c) shows the comparison results of OA(%), AA(%), and Kappa, respectively. It can be seen that, even on this difficult dataset, MSFGW still achieves the best OA(%), AA(%), and Kappa. When the band subset size is greater than 15, MSFGW achieves better results than all bands, showing that the classification accuracy does not always increase with the number of bands, which is consistent with the discussion in Section I. Throughout the experiment, even with the minimum of 5 bands, MSFGW achieved the best results among the FS methods and was close to the result of all bands. When the number of bands is around 20, the classification accuracy of most algorithms is optimal. Among the comparison methods, EGCSR_BS_Clustering [40] and IBRA-GSS [41] perform better than the other algorithms and can achieve results comparable to all bands. Overall, deep learning has more advantages than traditional algorithms in band selection. When the number of bands continues to increase, the accuracy tends to decrease, which confirms the necessity of band selection.
2) Classification Performance With Different Proportions of Training Samples: Fig. 14(d)-(f) shows the classification performance under different percentages of training samples. In the experiments, we fixed the band subset size to 20 and varied the training size from 1% to 25% in 5% intervals. The results demonstrate the strong capability of the bands selected by MSFGW for hyperspectral classification. When the proportion of training samples is greater than 5%, the OA(%), AA(%), and Kappa of MSFGW surpass those of all bands, giving the best classification accuracy. As the number of training samples increases, the accuracy keeps increasing. Table VIII gives the detailed classification performance of the different methods when the best 20 bands are selected and 1% of the samples are used for training. MSFGW achieves the best OA (97.75%), AA (94.49%), and Kappa (0.96). Compared with all bands, which give OA (97.27%), AA (92.97%), and Kappa (0.96), this indicates that the complete band set contains many noisy bands that damage the classification performance. Out of nine classes, MSFGW wins four and the suboptimal algorithm wins two. The advantage of the suboptimal algorithm is not prominent: in the two classes it wins, its results are similar to those of MSFGW. The results for the other classes are also close to those of MSFGW. This shows that the bands selected by MSFGW also have strong classification ability on the WHU_Hi_LongKou dataset, where the spectra of different classes are similar. Fig. 15 shows the classification results of different methods. The proposed MSFGW generates more uniform and smoother classification maps while preserving edges. From the details in the white boxes, it is clear that MSFGW performs well on irregular shapes and small ground objects.

F. Ablation Experiment
In this section, the effectiveness of 3-D attention is analyzed by comparison with attention mechanisms of other structures. Specifically, we adopt PAM [35], CAM [47], DAM [36], and no attention mechanism (NAM) in turn to weight the features extracted by the multiple branches, and compare the classification results with those of the proposed algorithm. For fairness, 1% of the samples are used for training on all datasets, while the number of selected bands varies among datasets: the fixed number of bands is 20 for the Salinas and WHU_Hi_LongKou datasets, 15 for the Pavia University dataset, and 25 for the Indian Pines dataset. In both cases, the down arrow (↓) indicates that MSFGW performed significantly better, and the up arrow (↑) indicates that the comparison method performed significantly better than the proposed MSFGW. Table IX presents the classification results obtained with different weighting mechanisms. Compared with the method without an attention mechanism, the classification results of the other variants are improved to a certain extent, which proves the effectiveness of the attention mechanism in FS. DAM [36] achieves better classification performance than PAM [35] or CAM [47], indicating that using spatial and channel information together has an advantage over using either alone. With 3-D attention, FS is performed in the spatial and spectral dimensions jointly to reduce information loss, so it is better than the attention mechanism with a parallel structure. It can be seen from the table that the proposed algorithm achieves the best OA(%) on all four datasets.
Table X presents the classification results when each branch alone is used as the feature extraction module. It can be seen from the table that extracting features with a single-size convolution is not sufficient to characterize the complexity of HSIs. Compared with the multibranch fusion structure of the proposed algorithm, the classification results of the single-branch structures are worse. In addition, different scales have different effects on images of different resolutions, so the multibranch structure of the proposed algorithm is more adaptable to different datasets and more robust.
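The difference between branches comes from how much spatial context one dilated kernel covers. As a hedged illustration, the sketch below uses the standard relation for a kernel of size k and dilation rate d, whose effective extent is k + (k - 1)(d - 1), and applies a 1-D dilated filter to a unit impulse. The kernel size of 3 and the dilation rates 1, 2, and 3 are assumptions chosen for the example, not values stated in this section, and the function names are hypothetical.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by one dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation):
    """'Valid' 1-D dilated cross-correlation without padding."""
    span = effective_kernel_size(len(kernel), dilation)
    return [
        # Kernel taps are spaced `dilation` samples apart.
        sum(w * signal[start + i * dilation] for i, w in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]


# Unit impulse: the output shows which positions each branch can "see" it from.
signal = [0, 0, 0, 0, 1, 0, 0, 0, 0]
kernel = [1, 1, 1]
for d in (1, 2, 3):
    print(d, effective_kernel_size(len(kernel), d), dilated_conv1d(signal, kernel, d))
```

With the same three weights, a larger dilation rate responds to structure spread over a wider neighbourhood, which is one way to read the observation that no single branch suits images of every resolution.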

G. Discussion
Aiming at the problems of information redundancy and the curse of dimensionality in HSIs, this article proposes MSFGW for FS of HSIs. From the experimental results, it can be seen that the multiscale feature extraction module of MSFGW, composed of multibranch convolutions, can effectively exploit the advantages of different scales and their complementarity. For example, on the Indian Pines dataset, compared with the other existing models, MSFGW achieved the best results both for class 9, which has few samples, and for class 2, which has more samples. In addition, compared with single-channel attention [27], MSFGW can effectively extract and integrate spectral and spatial information in HSIs using 3-D convolution. On the four datasets, the proposed MSFGW increased the OA by 2%, 4%, 2.5%, and 1.4%, respectively. This shows that the features extracted by MSFGW are more representative and have better classification performance for HSIs.
In the comparative classification experiments on the four datasets, the proposed method achieved outstanding results for the categories of urban buildings, plants, and roads. However, due to the influence of various factors, some categories, such as the fourth category in Table I and the first category in Table III, still fall short of the optimal results. Categories such as Brocoli_green_weeds_1 and Brocoli_green_weeds_2 in the Salinas dataset, as well as Corn and Corn-min-till in the Indian Pines dataset, are difficult to distinguish in terms of semantic features. This difficulty also interferes with the feature extraction of deep CNN models and therefore has a certain impact on the classification performance of the overall model. At the same time, owing to its multiscale structure, the network has efficient multiscale feature perception capabilities. In this complex situation, our method significantly improves the classification of datasets with different resolutions and irregular shapes compared with other methods. Hyperspectral datasets usually contain a large number of homogeneous regions, where spatial features are particularly effective, while spectral features characterize the spectral properties of ground objects. Therefore, a model that jointly extracts spatial and spectral features can yield more robust hyperspectral visual features. In future research, more attention will be paid to improving the classification performance of the model.

V. CONCLUSION
This article presents a new FS method, MSFGW, for hyperspectral remote sensing images to address the spatial scale problem of various ground objects. Considering the interaction between channels and spectral information in different bands from a global perspective, the proposed MSFGW adopts a multiscale dilated convolution module to extract the spectral features of objects at different spatial scales. Then, the complementary information from the multiscale spectral features is combined into a consistent map. Finally, the bands are selected according to their contribution to the band reconstruction task. The experimental results on four public hyperspectral datasets show that the proposed MSFGW performs better than the other state-of-the-art comparison methods and indicate its effectiveness in FS for HSIs.