Nonlocal Band Attention Network for Hyperspectral Image Band Selection

Band selection (BS) is a foundational problem in the analysis of high-dimensional hyperspectral image (HSI) cubes. Recent developments in visual attention mechanisms make it possible to model the complex relationships among different components explicitly. Inspired by this, this article proposes a novel band selection network, termed the nonlocal band attention network (NBAN), which uses a nonlocal band attention reconstruction network to calculate band weights adaptively. The framework consists of a band attention module, which extracts long-range attention and reweights the original spectral bands, and a reconstruction network, which restores the reweighted data, resulting in a flexible architecture. The resulting BS network captures the nonlinear and long-range dependencies between spectral bands, making it more effective and robust in selecting informative bands automatically. We compare NBAN with six popular band selection methods on three hyperspectral datasets, and the results show that the long-range relationship is helpful for band selection. In addition, the classification performance shows that the advantage of NBAN is particularly obvious when the selected band subset is small. Extensive experiments show that the proposed NBAN method consistently outperforms many current models on three popular HSI datasets.

bands [7], thus leading to huge data redundancy. High-dimensional and redundant HSI data incur large storage and computing costs. Moreover, such data often suffer from the so-called curse of dimensionality [8], [9], which impairs the classification ability of classifiers.
Feature extraction and band selection (BS) are the two most common ways to reduce the dimensionality of HSI data [10]. Feature extraction methods are widely used in HSI data processing [11]-[13]. Their core idea is to find a mapping from the high-dimensional space to a low-dimensional space. However, feature extraction changes the original feature space and loses the physical characteristics of HSI data [10]. The basic idea of BS is to select the most representative bands from the original data. Compared with feature extraction [14], BS largely preserves the main physical attributes of the data and retains as much of the original information as possible [15].
BS methods can be classified into supervised and unsupervised methods. Because they need no prior knowledge and are more robust, unsupervised BS methods have attracted a great deal of attention, and many have been proposed over the past decade [16]. Some BS methods view band selection as a combinatorial optimization problem and solve it with a heuristic search, such as multiobjective optimization-based band selection (MOBS) [17]-[19]. Others are cluster-based methods, which cluster the spectral bands and select the target bands, such as subspace clustering (ISSC) [7], [20]. These methods consider the similarity between spectral bands and have achieved good results in recent years [7], [20]. Still other BS methods are based on band ranking, which assigns a rank to each spectral band by assessing its score, e.g., maximum-variance principal component analysis (MVPCA) [21], sparse representation (SpaBS) [22], [23], and geometry-based band selection (OPBS) [24].
Many existing BS methods view every spectral band as an independent feature. However, a nonlinear relationship exists between bands [7], [25]. Cai et al. [25] proposed an end-to-end framework (BS-Net), which uses a convolution layer and an attention module to reconstruct the original data and to find the connections between bands. However, because of the limited size of the convolution kernel, BS-Net cannot explore the nonlinear relationship between bands over a long distance.
Recently, deep neural networks (DNNs) [26], [27] have attracted increasing attention in HSI processing. Owing to their ability to find nonlinear relationships between features, DNNs have been widely applied in HSI classification and feature extraction [25]. The convolutional neural network (CNN) [28], a variant of the DNN, has proven powerful at extracting spatial relationships in images and has become one of the most popular models for HSI processing [29]. CNNs are widely used in many neural network models and architectures. For example, convolutional auto-encoders are often used for image reconstruction. In addition, attention mechanisms [30] have attracted increasing attention for image classification because they make the whole framework focus on salient information. Many channel weighting methods are equipped with attention modules, for example, the residual attention network [31] and spatial transformer networks [32]. DNN variants can also be used to extract the correlation between spectral bands, as in [25]. BS-Net considers the nonlinear correlation between the spectral bands, which is why it performs better than other existing BS methods. Our proposed framework continues the idea of BS-Net and uses a DNN to find the relationship between bands, which we discuss in detail in Section III.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In this article, we develop a band selection framework that considers the global relationship among all bands, called the nonlocal band attention network (NBAN). Specifically, we assume that there is a long-range relationship that helps an informative band subset restore the complete spectral band set. Instead of evaluating only the connection between adjacent bands, our framework extracts the long-range relationship through an attention score matrix generated by an attention module. Finally, NBAN is end-to-end trainable, so it can be viewed as a unified framework and combined with many popular networks.
To sum up, the main contributions of this work are as follows.
1) By assuming that a long-range relationship exists between the spectral bands, we propose a novel method for HSI band selection called NBAN. The method measures the significance of each band by calculating the contribution of the target band to restoring the other bands, and an attention score matrix is used to extract the long-range relationship between the spectral bands. The attention score matrix is then applied to hyperspectral band selection directly, which offers a new idea for unsupervised band selection.
2) We introduce nonlocal attention into the module of BS-Net [25] to account for the long-range relationship, which gives the band selection process a larger receptive field. The long-range relationship gives NBAN a global metric rather than one that considers only a small range of band relationships. This enables NBAN to achieve better performance when selecting a small band subset.
3) We show that the proposed method can better shield against noise bands and achieves good results on three HSI datasets. The experimental results show that our method achieves the best performance not only in classification but also in the correlation between the selected bands. At the end of the experiments, we use the information entropy of the three datasets to analyze why our framework avoids selecting noise bands and achieves better classification performance than other BS methods.
The rest of the article is structured as follows. In Section II, we describe the motivation and review related work. In Section III, we define the notation and present the details of the proposed method. In Section IV, we design experiments to compare with existing BS methods and discuss the results. Finally, we conclude with a summary and final remarks in Section V.

A. Attention Mechanism
The inspiration for attention mechanisms comes mainly from human beings. The core idea is to make a module ignore extraneous information and focus on key information. Attention mechanisms are widely applied in natural language processing [33], [34] and image processing [29], [32], [35], [36]. In this article, we mainly focus on their application in image processing. An attention module can be considered as a function f that measures the significance of the features and produces an attention map, which serves as a reference for reweighting the raw data. In image processing, the module takes a feature map Z ∈ R^{mn×b} and generates a score vector a ∈ R^b over the features; combining a with Z yields the resulting feature map H ∈ R^{mn×b}. The attention module f is usually implemented as a neural network, so that a can capture the nonlinear relationships in the original feature map.
Throughout training, the network focuses on the key information and generates an attention map. By combining the attention map a with the original feature map Z, H focuses on the key information and pays less attention to extraneous information.
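The reweighting role of the attention map can be illustrated with a small NumPy sketch. It is only an illustration under assumptions that are not taken from the article: the scoring function f is a hypothetical one-layer network applied to spatially averaged bands, the scores are squashed by a sigmoid, and a and Z are combined by scaling each band (column) of Z by its score.

```python
import numpy as np

def band_attention(Z, W, bias):
    """Toy attention module f: score each of the b bands, then reweight Z.

    Z    : (mn, b) feature map, one column per band
    W    : (b, b) weights of a hypothetical one-layer scoring network
    bias : (b,) bias of that layer
    """
    pooled = Z.mean(axis=0)                          # squeeze the spatial axis -> (b,)
    a = 1.0 / (1.0 + np.exp(-(pooled @ W + bias)))   # sigmoid scores in (0, 1)
    H = Z * a                                        # scale every band (column) by its score
    return a, H

rng = np.random.default_rng(0)
Z = rng.random((6, 4))            # 6 pixels, 4 bands
W = rng.normal(size=(4, 4))
a, H = band_attention(Z, W, np.zeros(4))
```

Bands with scores close to 0 are suppressed in H, while bands with scores close to 1 pass through almost unchanged.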
Depending on the object of concern, attention modules can be classified into spatial attention, channel attention [35], and joint attention [37], [38]. Spatial attention learns the relationship between spatial pixels. In practice, convolution kernels are widely used in spatial attention modules because of their ability to extract information between adjacent pixels. The convolution operation is also applied in channel attention. For example, Hu et al. [35] proposed a simple network branch that uses an average pooling layer and a convolution layer to squeeze the spatial information and obtain channel attention. However, because of the limited kernel size, most spatial attention modules built on convolution kernels cannot consider the long-range relationship between elements. Although spatial attention and channel attention have different focuses, this shortcoming of convolution kernels limits the ability of both to extract contextual relations. To solve this problem, Wang et al. [39] proposed a network that calculates the similarity between all pixels and learns the long-range relationship from the data, thereby obtaining a more comprehensive relationship between all pixels.
To sum up, the attention mechanism has great potential in feature selection. In this article, we use not only the traditional channel attention but also some concepts from spatial attention to find the long-range relationship between bands. In the following section, we present the attention module of our proposed framework and discuss how it works.

B. Auto-Encoder
As a DNN structure, the auto-encoder is widely applied in natural language processing [40], [41] and image processing [42], [43]. With the development of CNNs, the convolutional auto-encoder can extract information from images better than the original one. In this article, we mainly focus on the convolutional auto-encoder. We define an auto-encoder as a function f that takes a tensor X as input and outputs a tensor Y, that is, $Y = f(X; \Theta)$, where Θ denotes the trainable parameters of the auto-encoder. Training proceeds in two stages: feedforward and backward. In the feedforward stage, the auto-encoder transforms the input tensor X into a latent space with its encoder layers. To extract information, the encoder usually performs convolution operations in image processing. The decoder then tries to restore the data and produces the output Y. The encoder and decoder are composed of multiple convolution kernels of different sizes, and each convolution operation is followed by an elementwise function. In the backward stage, the auto-encoder updates its parameters by gradient descent. A cost function $\mathcal{L}(\Theta)$, such as the mean square error (MSE), measures the cost between the original tensor X and the resulting tensor Y, and training minimizes this cost. The parameters Θ are updated by $\Theta \leftarrow \Theta - \eta \, \partial \mathcal{L} / \partial \Theta$, where η is the learning rate and ∂ denotes the partial derivative operation.
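The two training stages can be sketched in a few lines of NumPy. This is a deliberately simplified stand-in, not the convolutional auto-encoder discussed in the article: the encoder and decoder are plain linear maps, and the function name, layer sizes, learning rate, and step count are all illustrative assumptions.

```python
import numpy as np

def train_linear_autoencoder(X, k, eta=0.05, steps=500, seed=0):
    """Gradient descent on a linear auto-encoder Y = (X @ W_enc) @ W_dec."""
    rng = np.random.default_rng(seed)
    S, d = X.shape
    W_enc = rng.normal(scale=0.1, size=(d, k))   # encoder parameters
    W_dec = rng.normal(scale=0.1, size=(k, d))   # decoder parameters
    losses = []
    for _ in range(steps):
        # Feedforward: encode to the latent space, then decode.
        H = X @ W_enc
        Y = H @ W_dec
        err = Y - X
        losses.append(float(np.mean(err ** 2)))  # MSE cost between X and Y
        # Backward: gradients of the MSE with respect to both weight matrices.
        G = 2.0 * err / err.size
        grad_dec = H.T @ G
        grad_enc = X.T @ (G @ W_dec.T)
        # Update rule: theta <- theta - eta * dL/dtheta.
        W_dec -= eta * grad_dec
        W_enc -= eta * grad_enc
    return W_enc, W_dec, losses

X = np.random.default_rng(1).random((32, 8))     # 32 samples, 8 features
W_enc, W_dec, losses = train_linear_autoencoder(X, k=3)
```

The recorded `losses` list makes it easy to check that the reconstruction cost falls as the parameters are updated.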

C. Motivation
The purpose of BS is to select a few representative bands to improve computational efficiency. This article is based on the assumption that a band set selected with global characteristics performs better than one that considers only local relationships. However, most BS methods divide the whole band set into several categories or evaluate each band as an independent feature [21]-[24]. This limits the expressive ability of the selected band set and makes the result fall into a trivial solution. One way to solve this problem is to enlarge the receptive field of the network, for example by extracting the long-range relationship of the whole band set. By assuming that a band can be jointly represented by the other bands, the original band set can be written as XC = X, where C is a score matrix that reveals the significance of each band to the other bands. The score matrix C can therefore serve as an important reference for selecting the most informative bands. As a deep learning method, BS-Net uses convolutional neural networks as its band attention module and reconstruction network, which gives it an advantage over other BS methods. However, it also has the following shortcomings. 1) The expressive ability of the selected band subset is limited, especially when the subset is small. 2) The score matrix C cannot extract enough information from the whole band set because of the limited convolution kernel size. Hence, this article attempts to establish a new nonlocal evaluation framework that selects more global bands by extracting the long-range relationship between the spectral bands.

III. PROPOSED NETWORK
We denote an HSI dataset consisting of b spectral bands and n × m pixels as U ∈ R^{n×m×b}. For convenience, we regard U as a set B of b band images. Our goal is to find a function ψ : Ω = ψ(B) that produces a subset Ω containing the most representative bands. In this section, we lay out an end-to-end trainable framework for BS and describe how it works. We first summarize the structure of the model and then present the details of each module in the following sections.

A. Architecture of NBAN
The core idea of NBAN is to rank the significance of the bands during sparse band reconstruction in a nonlocal way. We try to restore the whole band set using only a few informative bands. During reconstruction, the bands that can represent the vast majority of bands should receive more attention. To select the most influential bands, we propose a framework that consists of a band attention module and a reconstruction network.
The schema of NBAN is shown in Fig. 1. To rank the significance of the bands, we first consider the long-range relationship between the bands and design a band attention module. The attention module extracts the correlation from the input data and generates an attention score matrix. The band attention module is a branch network that combines characteristics of spatial attention and channel attention.
In this module, we use a matrix C, called the attention score matrix, to collect the long-range relationship between bands. The original data are then reweighted by a matrix operation with reference to C. The details of the attention module are given in part B. Next, a reconstruction network restores the original spectral bands from the reweighted bands. The details of the reconstruction network are given in part C. During reconstruction, the attention module adjusts the weights of the bands and thereby measures the significance of each band. After training, the final attention score matrix can be used to select the representative bands.

B. Nonlocal Attention Module
A nonlocal attention module is an embedded unit that reweights the original data into a new feature map O. To make the selected spectral bands more global, the reweight operation calculates the correlation between spectral bands in a nonlocal way, which lets us measure the relationship with a larger receptive field. Fig. 2 shows the details of the nonlocal channel attention module. A feature map X ∈ I, which consists of d × d pixels and b spectral bands, is given as input. The reweighted data O can be defined as $O = X \otimes C$, where X ∈ R^{d^2×b}, ⊗ denotes the reweight operation, and C is an attention score matrix used to extract the relationships between the spectral bands.
1) Attention Score Matrix: In contrast to the attention module in BS-Net, we employ an attention score matrix to record the relationships extracted from the spectral bands, so that the framework can learn more information from the original feature map.
We obtain the attention score matrix by attending to all pixels in each band and taking their weighted average in an embedding space. This follows the design of [30], [39], and the nonlocal operation is given by (6). We simplify the embedded Gaussian nonlocal module [39] and use the modified version for band attention. The similarity between spectral bands is calculated in an embedded Gaussian way, and the attention score matrix measures the significance of each spectral band to the others. Specifically, two (1 × 1) kernels with a stride of (1 × 1) learn the correlation between the spectral bands. To standardize the data, we follow them with a sigmoid function. The similarity function f(x_i, x_j) is defined in (7), where 1/H(X) denotes a normalization factor, i is the index of the target position, and j is the index that enumerates all other positions. f denotes the similarity between two pixels.
To reweight the band set, we view each band as a combination of all bands and calculate the weights with which bands restore each other. Specifically, the greater the similarity between two bands, the greater the reconstruction weight. To standardize the generated data, we normalize the band weights to sum to 1 by applying a softmax function along each column of the attention score matrix. Each column of the matrix can thus be regarded as the reconstruction cost of the corresponding band, and each row represents the reconstruction weight of the corresponding band for the other bands. The final attention score matrix is given by (8). Here, σ(X) = W_σ X and φ(X) = W_φ X, where W_σ and W_φ are the learnable parameters of the convolution layers. The attention score matrix represents the relationship between pixels, and all values in the matrix are positive.
2) Band Reweighting: Next, we describe the reweight operation. To reweight the data, we take ⊗ as the reweight operator and use the attention score matrix as a reference. Each element of the reweighted data is a combination of the elements at the same position in other bands, with C_ij denoting the constituent weight, where C_ij is the element in the ith row and jth column of the attention score matrix. The output element O_ij can then be written as (9). Finally, the reweighted data O can be calculated by (10). Here, O_ij is the element in the ith row and jth column of O.
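The attention score matrix and the reweight operation described above can be sketched as follows. This is one possible reading of the text rather than the authors' implementation: the two 1 × 1 convolutions are modeled as matrix products over the band axis, the similarity is averaged over pixels before the sigmoid, the softmax is taken down each column, and ⊗ is treated as matrix multiplication. The function names and the random weights are illustrative assumptions.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def attention_score_matrix(X, W_sigma, W_phi):
    """Build a (b, b) band attention score matrix from X of shape (d*d, b)."""
    # A 1x1 convolution with stride 1 acts on each pixel independently,
    # so it reduces to a matrix product over the band axis.
    S = X @ W_sigma                      # embedding sigma(X), shape (d*d, b)
    P = X @ W_phi                        # embedding phi(X),   shape (d*d, b)
    # Pairwise band similarity, averaged over pixels and squashed by a sigmoid.
    F = sigmoid(S.T @ P / X.shape[0])    # shape (b, b)
    # Softmax down each column so that every column sums to 1.
    E = np.exp(F - F.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def reweight(X, C):
    """Reweight operation: O[i, j] = sum_k X[i, k] * C[k, j]."""
    return X @ C

rng = np.random.default_rng(0)
d, b = 7, 5                              # 7 x 7 patch, 5 bands
X = rng.random((d * d, b))
C = attention_score_matrix(X, rng.normal(size=(b, b)), rng.normal(size=(b, b)))
O = reweight(X, C)
```

Because every column of C sums to 1, each reweighted band in O is a convex combination of the original bands at the same pixel, and O keeps the shape of X.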

C. Reconstruction Net
Following the attention operation, we employ a reconstruction net (RN) to restore the reweighted spectral bands. The RN can be defined as a function f that takes the reweighted data O as input and outputs a restored dataset $\hat{X}$ as $\hat{X} = f(O; \Theta_c)$. Here, Θ_c denotes the trainable parameters of the RN. The MSE is used as the cost function to help recover the data. We define the cost function L as in (12), where X denotes the original feature map and S is the number of training samples. Equation (12) can be optimized with a gradient descent method, such as stochastic gradient descent (SGD) or adaptive moment estimation (Adam). The details of the RN can be seen in Fig. 1. First, the reweighted data O are processed by a (1 × 1) convolution kernel with a stride of (1 × 1). To restore the data while taking the vanishing gradient into account, we simplify the auto-encoder to a single convolutional encoder (Conv1) and a single deconvolutional decoder (Deconv1) that up-samples the feature maps. The reconstruction error between the prediction and the original data is fed back to adjust the network and form the final attention score matrix C, which then serves as the reference for selecting bands.
Fig. 2. Overall network structure of NBAN. The framework consists of a nonlocal attention module and a reconstruction network. ⊗ is an operator that denotes matrix multiplication. O is the data reweighted by the attention score matrix and has the same shape as X. The differences between X and $\hat{X}$ are calculated to feed back to the framework and update the parameters.
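For reference, the reconstruction step and an MSE cost over S training samples can be written in the following form. The averaging constant and the sample index s are assumptions chosen for illustration, so the cost may differ from the article's own equation by a constant factor.

```latex
\hat{X} = f(O; \Theta_c), \qquad
\mathcal{L} = \frac{1}{S} \sum_{s=1}^{S} \left\lVert \hat{X}_s - X_s \right\rVert_2^2
```

A constant factor in front of the sum does not change the minimizer; it only rescales the effective learning rate of the gradient descent update.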

D. Informative Band Subset Selection
In this step, our goal is to measure the significance of each spectral band. To select the informative band subset, we evaluate the importance of the bands with a vector ω = [ω_1, ω_2, ..., ω_b], where ω_i refers to the importance of the ith band. As mentioned in part B, we view each row of C as the reconstruction weight of the corresponding band for the other bands. In other words, the greater the reconstruction weight, the more important the band is to the other bands. Accordingly, ω_i is computed from the ith row of C, where i is the row index and j is the column index of the attention score matrix. We then sort ω and select the informative band set. The pseudocode of NBAN is shown in Algorithm 1.
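The selection step can be sketched as follows, assuming that ω_i is the sum of the ith row of C. The row sum is an assumption consistent with the description above, not a formula taken from the article, and `select_bands` is a hypothetical helper name.

```python
import numpy as np

def select_bands(C, k):
    """Rank bands by the row sums of the attention score matrix and keep the top k."""
    omega = C.sum(axis=1)              # omega[i]: total weight band i contributes to all bands
    top_k = np.argsort(-omega)[:k]     # indices of the k largest scores
    return sorted(int(i) for i in top_k)

# Toy 3-band score matrix whose columns each sum to 1.
C = np.array([[0.7, 0.6, 0.5],
              [0.2, 0.1, 0.1],
              [0.1, 0.3, 0.4]])
selected = select_bands(C, 2)          # bands 0 and 2 have the largest row sums
```

In this toy matrix the row sums are roughly 1.8, 0.4, and 0.8, so the first and third bands are kept.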

IV. EXPERIMENT AND DISCUSSION
In this section, we explore the use of NBAN and discuss how it works on three real datasets. In part A, we introduce the three datasets, the training details, and the evaluation criteria. In part B, we test NBAN with a classifier, analyze the convergence of the framework, and compare its performance with six popular BS methods. Finally, in part C, we investigate why NBAN performs better by examining the selected band subsets.

A. Dataset and Training Details
To evaluate NBAN, we employ Indian Pines, Pavia University, and Salinas as testbeds for exploring the performance. To better evaluate the selected band subsets, a support vector machine (SVM) [44]-[46] is used as the classifier. We randomly select 5% of the labeled samples from each of the three datasets as the training set and set the window size to 7 × 7 for each dataset. We train the network for 80 epochs on Pavia University and 100 epochs on Indian Pines and Salinas. The kernel size of Conv1 and Deconv1 is 3 × 3 × 128. The learning rate used for all three datasets is 0.00001. Overall accuracy (OA), average accuracy (AA), and the Kappa coefficient (Kappa) are calculated for NBAN over 20 independent runs.
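The three classification criteria can be computed from a confusion matrix, as in the sketch below. It uses the standard definitions of OA, AA (mean per-class recall), and Cohen's Kappa; the function name and the toy labels are illustrative and not taken from the article.

```python
import numpy as np

def classification_scores(y_true, y_pred, n_classes):
    """Return (OA, AA, Kappa) from integer class labels in [0, n_classes)."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)           # rows: true class, columns: predicted class
    total = cm.sum()
    oa = np.trace(cm) / total                    # fraction of correctly labeled samples
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))   # mean per-class recall; assumes every class occurs in y_true
    pe = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / total ** 2   # chance agreement
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])
oa, aa, kappa = classification_scores(y_true, y_pred, 3)
```

With these toy labels, five of six samples are correct, so OA and AA are both 5/6 and Kappa is 0.75.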
To analyze the selected band set, information entropy and mean spectral angle (MSA) [15], [47], [48] are also calculated. The information entropy of a band B_i is computed from P(i), which is taken over the grey-level histogram bins of B_i. The larger the entropy, the more information the band contains.
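The entropy criterion can be sketched with the Shannon entropy of a band's grey-level histogram. The number of bins and the base-2 logarithm are assumptions made for illustration, and `band_entropy` is a hypothetical helper name.

```python
import numpy as np

def band_entropy(band, bins=256):
    """Shannon entropy (in bits) of the grey-level histogram of one band."""
    hist, _ = np.histogram(band, bins=bins)
    p = hist / hist.sum()               # empirical probability of each histogram bin
    p = p[p > 0]                        # empty bins contribute nothing to the sum
    return float(-(p * np.log2(p)).sum())

flat_band = np.full((4, 4), 3.0)                        # constant band: no information
two_level_band = np.array([[0.0, 0.0], [1.0, 1.0]])     # two equally likely grey levels
h_flat = band_entropy(flat_band)
h_two = band_entropy(two_level_band, bins=2)
```

A constant band has zero entropy, while a band split evenly between two grey levels carries one bit, which matches the intuition that low-entropy bands hold little information.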
MSA is an average measure of how closely the bands in a band set match each other. The MSA of a band subset B is the average of α(i, j) over the band pairs in B, where α(i, j) denotes the spectral angle between the ith and jth bands. The larger the MSA, the less redundancy the band subset contains.
The methods are evaluated with Python 3.5 running on an Intel Xeon E5-2620 2.10 GHz CPU with 32 GB RAM [25]. We implement all methods with TensorFlow-GPU 1.6.1 and accelerate them on an NVIDIA RTX-2080TI GPU with 11 GB of graphics memory. The hyperparameters of the compared methods are listed in Table I.
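The MSA criterion described earlier in this part can be sketched as below. The spectral angle is taken as the arccosine of the normalized inner product of two flattened bands, which is its usual definition; averaging over all unordered band pairs and reporting radians are assumptions, since the exact normalization and unit are not recoverable from the text.

```python
import numpy as np
from itertools import combinations

def spectral_angle(u, v):
    """Angle in radians between two flattened bands."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))   # clip guards against rounding error

def mean_spectral_angle(bands):
    """Average pairwise spectral angle for bands of shape (pixels, k)."""
    k = bands.shape[1]
    angles = [spectral_angle(bands[:, i], bands[:, j])
              for i, j in combinations(range(k), 2)]
    return float(np.mean(angles))

orthogonal = np.array([[1.0, 0.0],
                       [0.0, 1.0]])       # two bands with no overlap
duplicated = np.array([[1.0, 1.0],
                       [2.0, 2.0]])       # two identical bands
msa_orthogonal = mean_spectral_angle(orthogonal)
msa_duplicated = mean_spectral_angle(duplicated)
```

Orthogonal bands give the largest angle in this example (π/2), while duplicated bands give an angle of essentially zero, in line with the statement that a larger MSA means less redundancy.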

B. Experiment Results
In this part, we design experiments to demonstrate the effectiveness of NBAN. We employ ISSC, SpaBS, MVPCA, MOBS, OPBS, and BS-Net-Conv as the comparison methods. The performance of using all bands is also compared with our method as an important reference.
1) Analysis of Convergence of NBAN: In this part, we discuss the convergence of NBAN on different HSIs by visualizing the loss curves and the classification accuracy. For Indian Pines, the loss and classification curves are shown in Fig. 3(a). When we train NBAN on Indian Pines, the reconstruction error decreases while the SVM accuracy increases. The loss of NBAN is close to 0.002 after 20 iterations, with a large improvement in accuracy over the same period. The accuracy finally stabilizes around 73% after 40 iterations. This is an improvement of nearly 13% in accuracy, which indicates that our method is effective for band selection. The long-range attention mechanism makes the loss converge faster than in other BS methods, which is another advantage of our method. The loss curves for Pavia University and Salinas are shown in Figs. 4(a) and 5(a). Fig. 3(b) shows the selection process of NBAN on Indian Pines. For convenience, we scale the band weights into the range [0, 1] and find that the importance of the bands changes over the training iterations. Almost all spectral bands have the same weight at first, but as training proceeds, some become prominent compared with the others, and this can be viewed as a process of band selection. To explore the relationship between the correlation matrix and the selected bands, we further visualize the correlation matrix of the trained network on Indian Pines. As shown in Fig. 3(c), the informative bands and the trivial bands are distinguished by the lines in the matrix. The horizontal lines in the attention score matrix indicate that our method successfully enhances the weights of specific bands instead of increasing the weights in the matrix at random. Fig. 4(b) and Fig. 5(b) and (c) show the selection process and the final attention score matrix for Pavia University and Salinas. For Pavia University, the selected bands are mainly concentrated before the 80th band, whereas on Salinas the selected bands are evenly distributed. The specific band distributions are discussed in detail in part C.
2) Performance Comparison: To show the classification performance of our method, we compare the classification results of different BS methods under different band subset sizes. For Indian Pines, Fig. 8 shows that NBAN achieves the best OA when the band subset size is over 5, followed by BS-Net-Conv, MOBS, and the other BS methods. The two deep learning methods, NBAN and BS-Net-Conv, perform better than the other BS methods in most cases, and the OA of NBAN exceeds 70% when the subset size is only 15. The curves also reveal a counterintuitive phenomenon: for some BS methods, selecting more bands does not always improve classification performance. The OA curve of SpaBS trends downward when the subset size is larger than 17, and the classification performance of BS-Net-Conv starts to decrease when the subset size is larger than 23. This is the so-called Hughes phenomenon [8], [9]. In contrast, the OA curve of NBAN rises continuously, which means that NBAN can select more informative bands on Indian Pines. For Pavia University, Fig. 8 shows that NBAN achieves better OA than the other methods when the subsets are smaller than 15. When the subset size is larger than 20, NBAN, MOBS, BS-Net-Conv, and ISSC achieve close OA. Although MOBS performs better when the subset size is larger than 17, our proposed method remains comparable to it. Because of the Hughes phenomenon, the classification accuracy of NBAN first increases and then decreases as more bands are selected. For Salinas, Fig. 8(c) shows that NBAN achieves the best OA when the subset size is larger than 20. When the subset size is larger than 21, the OA of most BS methods except NBAN no longer increases, which means that our method can choose more informative bands. Moreover, when the subset size is larger than 19, the two deep learning methods achieve better classification performance than using all bands.
To examine the performance of each BS method, we show the detailed results for Indian Pines in Table II and the results for Pavia University and Salinas in Table III. For Indian Pines, NBAN achieves the best OA (73.34%), AA (75.21%), and Kappa (0.718). NBAN achieves the best score in 10 classes, and using all bands wins in classes No. 8, No. 11, and No. 14. BS-Net-Conv performs better in classes No. 2, No. 10, and No. 12. For Pavia University, using all bands achieves the best result; NBAN is worse than using all bands but better than the other methods when the subset size is set to 13. For Salinas, NBAN achieves the best performance when the subset size is set to 21, and MOBS and BS-Net-Conv achieve very close results at this subset size. The results on the three datasets show that NBAN performs better than the other BS methods, followed by BS-Net-Conv. BS-Net-Conv takes into account the nonlinear relationship between the spectral bands and achieves good results. NBAN additionally calculates the long-range relationship on this basis, and so obtains the best performance.
In the comparison, we find that the two deep learning methods are more stable than the other BS methods and that the Hughes phenomenon appears later for them. For Indian Pines and Salinas, NBAN and BS-Net-Conv achieve better OA than using all bands, which means that the redundancy among all bands harms classification accuracy and shows that BS is beneficial to data processing. Comparing NBAN with BS-Net-Conv, the curves show that NBAN is more advantageous when the subset size is smaller than 11 and that the two methods achieve close OA when the subset size is larger than 13. This occurs because NBAN considers the long-range relationship across the whole band set, whereas the convolution kernel size limits the receptive field of BS-Net-Conv. Therefore, when only a few bands are chosen by BS-Net-Conv, the bands with the highest scores can represent only the bands within a limited range. Compared with the convolution operation in BS-Net-Conv, nonlocal attention extracts more relationships from the global band set. When the subset is small, a subset that captures the long-range relationship can better represent the whole band set.

C. Analysis of the Selected Band Subset
To verify that the band subset selected by NBAN is more informative, we visualize the selected band subset and the information entropy. The subsets of selected bands for the three datasets are shown in Table V. For the sake of fairness, we avoid the Hughes phenomenon by setting the sizes of the band subsets for the three datasets to 15, 15, and 20.
1) Indian Pines: The distribution of the selected bands for Indian Pines is shown in Fig. 6. To observe the characteristics of the bands, we also visualize the information entropy of each band. Fig. 6 shows that some bands in the band set have low entropy. These bands are called noise bands; they have a negative impact on data processing, and BS methods should avoid selecting them. As shown in Fig. 6, the bands selected by our proposed method are distributed relatively uniformly, and the subset selected by NBAN avoids the bands with low information entropy, such as 0-3, 103-112, and 217-220. Because noise bands differ greatly from normal bands and NBAN has a long-range receptive field when selecting the band subset, our method can better avoid the noise bands. In Table IV, OPBS and MVPCA achieve better MSA than the other BS methods, whereas our method does not perform very well on this measure. This happens because the band subsets of OPBS and MVPCA contain some noise bands, and a noise band may increase the MSA [15] because it differs from the other bands. We also find that the bands selected by OPBS and MVPCA are concentrated. Since the information gap between adjacent bands is always small, the high MSA of these concentrated subsets implies that noise bands have a great influence on MSA. Compared with the other BS methods, the classification performance of OPBS and MVPCA is poor because they select noise bands.
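The per-band information entropy used in this analysis can be sketched as follows. This is a minimal sketch using the Shannon entropy of each band's gray-level histogram; the number of histogram bins, the base-2 logarithm, and the function name `band_entropy` are assumptions, since the text does not state how the entropy curves were computed.

```python
import numpy as np

def band_entropy(cube, bins=256):
    """Shannon entropy (in bits) of every band of an HSI cube.

    cube has shape (height, width, num_bands). The bin count is an
    assumption; the resulting values depend on it.
    """
    num_bands = cube.shape[2]
    entropy = np.empty(num_bands)
    for k in range(num_bands):
        counts, _ = np.histogram(cube[:, :, k], bins=bins)
        p = counts[counts > 0] / counts.sum()   # drop empty bins: 0*log(0) = 0
        entropy[k] = -np.sum(p * np.log2(p))
    return entropy
```

Under this definition, a band whose pixels take almost the same value everywhere has entropy close to zero, which matches the description of low-entropy bands given above.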
2) Pavia University: Fig. 7 shows the distribution of the selected bands for Pavia University. The entropy curve of Pavia University is smoother than that of Indian Pines and shows an upward trend. According to the information entropy, the band set can be divided into three parts. The first part consists of the bands before the 20th, which have low entropy. The second part, which contains the largest number of bands, lies between the 20th and 80th bands. The last part lies after the 80th band and has the highest entropy; there is also a drop near the 70th band. The bands selected by NBAN mainly lie in the middle part, with two bands after the 80th band. This happens because NBAN considers the long-range relationship and therefore chooses a subset that matches the overall band set as closely as possible, so its selected bands mainly lie between the 20th and 60th bands. Meanwhile, NBAN also avoids selecting bands with very low entropy. Specifically, as shown in Fig. 8(b), NBAN has an advantage when the subset is small because the selected bands are more representative of most bands. When the subset size grows, bands with higher entropy may become more advantageous, but NBAN remains comparable because some of its selected bands also lie in the high-entropy part. To further analyze the correlation between the selected bands, we report their MSA for Pavia University in Table IV. MVPCA achieves the best MSA here, but NBAN is still comparable. Although the MSA of MVPCA is high, MVPCA ignores the bands with higher information entropy during selection, so its classification performance is worse than that of the other BS methods. To sum up, the subset selected by NBAN achieves good results in both classification performance and correlation when the subset size is small.
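To make the correlation measure concrete, the following minimal sketch computes a mean spectral angle over a selected band subset. It assumes the common definition of MSA as the average pairwise angle between the selected bands, each flattened into a vector over all pixels; the exact formula and units used in [15] and in Table IV may differ, and the function name `mean_spectral_angle` is hypothetical.

```python
import numpy as np

def mean_spectral_angle(cube, selected):
    """Average pairwise angle (radians) between the selected bands.

    cube has shape (height, width, num_bands); selected is a sequence of
    at least two band indices. Assumes no selected band is identically zero.
    """
    x = cube.reshape(-1, cube.shape[2])[:, selected].astype(float)
    x = x / np.linalg.norm(x, axis=0, keepdims=True)   # unit-length band vectors
    cosine = np.clip(x.T @ x, -1.0, 1.0)               # guard against rounding
    upper = np.triu_indices(len(selected), k=1)        # each pair counted once
    return float(np.mean(np.arccos(cosine[upper])))
```

Under this definition, a larger value means the selected bands are less similar to one another, which is also why a noise band that differs from every other band can raise the score without improving classification.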
3) Salinas: The distribution of the selected bands for Salinas is shown in Fig. 9. The entropy curve contains some sharply decreasing regions, i.e., 105-107, 146-147, and 200-203. Unlike Pavia University, the entropy of Salinas is stable at about 4.5. As the band distribution shows, the bands selected by NBAN avoid the sharply decreasing regions and lie among the normal bands. Meanwhile, NBAN also ignores the low-entropy bands in 0-25. The MSA of each BS method is given in Table IV. ISSC achieves the best result, and NBAN is in second place. However, as shown in Fig. 8, ISSC achieves worse classification performance than most other BS methods. This happens because ISSC chooses too many contiguous bands, which causes information redundancy. In addition, ISSC also chooses too many noise bands, which have a negative impact. The distribution of MOBS is similar to that of NBAN, so MOBS also achieves good classification performance. However, the low-entropy bands in positions 0-20 make the classification performance of MOBS worse than that of NBAN. The band subset selected by BS-Net-Conv contains some noise bands, which have a negative impact on its classification performance. In a nutshell, NBAN selects representative bands with low correlation and avoids the noise bands.
4) Discussion: From the distribution of the selected band subsets, we find that the entropy value affects classification performance. However, selecting only the bands with high information entropy, without considering the band set as a whole, does not achieve the best classification result. In addition, noise bands have a negative effect on classification performance, although they reduce the correlation within the band subset. Since the similarity between noise bands and normal bands is always low and our method measures the significance of each band by its reconstruction contribution to the whole band set, NBAN can avoid selecting noise bands better than the other BS methods. Meanwhile, both the classification performance and the distribution of the selected spectral bands show that BS-Net-Conv and NBAN achieve better results than the other BS methods. This indicates that capturing nonlinear relationships is important for BS. For BS-Net-Conv, the use of a CNN and a fully connected neural network enables the framework to find a more comprehensive relationship between the spectral bands [25]. However, the antinoise ability and the interpretability of BS-Net-Conv are limited. Compared with BS-Net, the attention module of NBAN uses matrix operations, so the effects of NBAN are easier to interpret. Meanwhile, by calculating the global relationship, the selection result of NBAN can better avoid the interference of noise.

V. CONCLUSION
In this article, we propose a framework called NBAN, which uses a nonlocal attention module to model the long-range relationship across the whole band set. The main idea of the framework is to restore the HSI data by using the correlation among all bands, so that the network can extract the long-range relationship and enlarge its receptive field. The framework consists of two modules, a nonlocal attention module and a reconstruction network, and the whole network is end-to-end trainable. The attention module of NBAN is also a lightweight block, so it can be plugged into many network architectures. We conduct extensive experiments on three real datasets and show that our method is significantly better than many of the compared BS methods in classification performance. NBAN ensures that the selected bands are representative of the whole band set, so our method is more advantageous when the subset size is small. In addition, the use of the attention score matrix makes the band selection process more interpretable. Meanwhile, we also summarize the relationship between noise bands and the degree of correlation between the spectral bands.
Besides, during the experiments we summarize the effects of noise bands and information entropy on band correlation and classification performance. We find that NBAN has a strong ability to avoid noise bands owing to its nonlocal attention module. However, the attention score matrix may contain information that we have not yet exploited. In future work, we will focus on improving the interpretability of the framework and reducing the complexity of the model.