Global-Local Consistency Constrained Deep Embedded Clustering for Hyperspectral Band Selection

Hyperspectral band selection plays a key role in overcoming the curse of dimensionality in the classification of hyperspectral remote sensing images (HSIs). Recently, clustering-based band selection methods have demonstrated great potential to select informative and representative bands for hyperspectral classification tasks. However, most clustering-based methods perform clustering directly on the original high-dimensional data, which reduces their performance. To address this problem, a novel band selection method called global-local consistency constrained deep embedded clustering (GLC-DEC) is proposed in this paper. In GLC-DEC, to simultaneously learn the low-dimensional embedded representation and cluster assignments of all bands in an HSI, the stacked autoencoder is integrated with the K-means method. In addition, to reduce the adverse impact of the limited number of training samples available in HSIs, local and global consistency constraints are imposed on the embedded representation so that a discriminatively consistent representation of all bands is learned. Specifically, local graph regularization and global graph regularization are introduced into the GLC-DEC model, by which the strong correlation between neighboring bands and the manifold structure of all bands are fully exploited. Based on the clustering results provided by GLC-DEC, a group of representative bands is selected by using the minimum noise method. Experimental results on two real datasets demonstrate that the proposed GLC-DEC outperforms several state-of-the-art methods.


I. INTRODUCTION
Hyperspectral remote sensing images (HSIs) usually consist of hundreds of narrow and continuous bands, and thus can provide rich spectral and spatial information about ground objects. Currently, HSIs are used in a wide range of applications such as environmental protection [1], [2], [3], anomaly detection [4], [5], [6], land cover analysis [7], [8], image segmentation [9], and hyperspectral classification [10], [11], [12]. Among these applications, hyperspectral classification is a vital task used to identify different land covers that have distinct spectral differences. However, the high dimensionality of an HSI cube causes the Hughes phenomenon [13], which may lead to performance degradation in hyperspectral classification applications. An effective way to alleviate this problem is band selection, which aims to select a representative subset of bands for hyperspectral classification. In contrast to feature extraction, which may lose critical information of HSIs, band selection is more physically meaningful as it preserves the original spectral information.
The associate editor coordinating the review of this manuscript and approving it for publication was Donato Impedovo.
Band selection methods are classified as supervised [14], [15], [16], semisupervised [17], [18], and unsupervised [19], [20], [21] according to the proportion of labeled and unlabeled samples in the training set. Compared with supervised and semisupervised methods, which need a certain number of labeled samples, unsupervised methods have excellent application prospects because they do not need any labeled samples [22]. In recent years, a large number of unsupervised band selection methods have been proposed. These methods can be classified into three categories: searching-based, ranking-based, and clustering-based methods. Searching-based methods choose the target band subset by optimizing certain specific criteria. Ranking-based methods generally employ predefined metrics to evaluate the significance of all bands, and then select a group of bands with high rankings. To obtain discriminative bands while reducing redundant information, clustering-based methods perform band selection by grouping highly similar bands into a set of clusters and selecting one representative band from each cluster.
Currently, clustering-based band selection methods have demonstrated great potential to solve the band selection problem. However, most clustering-based methods perform clustering directly on the original high-dimensional data, which reduces their performance because distance measures become less meaningful in high-dimensional space. This has motivated researchers to seek interpretable (dis-)similarity between objects in an embedded space [23]. Lately, deep embedded clustering (DEC) has attracted much attention as it can learn the embedded representation and cluster assignments simultaneously. According to the network structure being used, DEC methods can be divided into autoencoder-based, generative-adversarial-network-based, and variational-autoencoder-based methods [24]. Among these, autoencoder-based methods are the most common because they are easy to implement and can avoid trivial solutions. Yang et al. [25] proposed a deep clustering network (DCN), where a stacked autoencoder (SAE) is integrated with the K-means method, allowing the embedded representation to have a clustering-friendly structure. Although existing DEC methods have exhibited significant performance improvements, they are rarely used in band selection tasks and require further study. In addition, it remains challenging to obtain clustering results that are applicable to band selection by designing effective objective functions for DEC based on the prior information of HSIs.
To select a more representative band subset for hyperspectral classification by improving DCN, we propose a novel band selection method, namely global-local consistency constrained deep embedded clustering (GLC-DEC). As there are only a few hundred bands in a typical real HSI, the embedded representations learned by DCN may differ significantly for bands that are similar in the original data, which reduces the effectiveness of clustering for band selection. To maintain better consistency between the original data and the embedded representation, we introduce local and global consistency constraints into the DCN model. These constraints are designed based on the assumption that there exists a strong correlation between neighboring bands as well as a manifold structure among all bands of an HSI. Consequently, the clustering results obtained by the proposed GLC-DEC are expected to help select a more representative subset of bands. Specifically, in GLC-DEC, the prior correlation between neighboring bands is exploited by imposing local consistency constraints on the embedded representation via local graph regularization. In addition, to make full use of the manifold structure among all bands, global graph regularization is further introduced into GLC-DEC to impose global consistency constraints on the embedded representation, which contributes to achieving a globally consistent representation of all bands. Then, the minimum noise method is employed to select representative bands from the clustering results. In the experiments, our method is compared with seven representative methods on two real datasets. The experimental results validate the effectiveness of the proposed method. The main contributions of this paper are listed as follows.
• A deep embedded clustering-based band selection method is proposed. Our method can simultaneously learn the embedded representation and cluster assignments, which enables the embedded representation to have a clustering-friendly structure. To the best of our knowledge, this is the first time that deep embedded clustering has been applied to band selection.
• We introduce global and local consistency constraints into deep embedded clustering. These constraints are designed to make full use of the manifold structure of HSIs and the prior information that there exists a strong correlation between neighboring bands. Consequently, the connectivity from the original space to the representation space can be preserved, which is conducive to obtaining a better embedded representation for clustering.
• We evaluate the proposed GLC-DEC method on two real hyperspectral datasets and compare its performance with seven representative band selection methods. The experimental results demonstrate the effectiveness of the proposed method.

II. RELATED WORK
Unsupervised band selection methods can be classified into three categories: clustering-based, ranking-based, and searching-based methods. Next, some representative band selection methods are briefly reviewed.

A. CLUSTERING-BASED METHODS
To select representative bands, clustering-based methods usually first conduct clustering on the target HSI, and then select one representative band from each cluster. Typical methods include hypergraph spectral clustering [26], multiscale superpixel-level group-clustering [27], and the adaptive subspace partition strategy (ASPS) [28]. However, these methods perform clustering directly on the original high-dimensional data, which reduces their performance because distance measures become less meaningful in high-dimensional space. Recently, several studies have worked on finding the embedded representation of HSIs to obtain effective cluster assignments for band selection. Sun et al. [29] adopted an L2-norm-based regularizer to generate a sparse embedded representation of high-dimensional hyperspectral bands under the linear assumption, so that the subsequent clustering task can be performed on the embedded representation. Wang et al. [30] proposed a region-aware latent features fusion based clustering method. This method employs superpixel segmentation to obtain multiple spatial regions and creates the Laplacian matrix of each region. Then, the low-dimensional feature representation of each spatial region is obtained based on the corresponding Laplacian matrix, which reflects the band-wise similarity of that region. Next, the shared feature representation of an HSI is generated by integrating the latent feature representations of all regions. To fully utilize the spatial information, as well as jointly learn and fuse the latent feature representations in a unified framework, Wang et al. [31] employed a hierarchical strategy to learn the low-dimensional discriminative feature representation of each spatial region. Then, the information entropy is used as a metric to select the representative bands based on the unified feature representation. Recently, deep learning techniques have been applied to band selection to capture the inherent nonlinear structure of high-dimensional data [32]. For instance, the authors of [33] applied a deep convolutional autoencoder to transform HSIs into an embedded space in a nonlinear way. Their work treats representation learning and clustering as independent phases, which may lead to a representation that is unfavorable for the subsequent clustering task.

B. RANKING-BASED METHODS
This kind of method generally selects a group of representative bands with high rankings. To rank all bands, metrics such as variance, entropy, and information divergence are used to assess the importance of each band. For instance, Chang et al. [34] employed principal component analysis to evaluate the priority of bands. However, this method neglects the redundancy between bands. To address this problem, the divergence-based band-decorrelation scheme [35] was designed, which discards bands with differences below a preset threshold. In addition, a band selection network was designed in [36], which assumes that all bands can be sparsely reconstructed by some informative bands. Then, the final bands are selected by ranking the learned sparse weights of all bands. Sun et al. [37] proposed a concrete end-to-end autoencoder-based unsupervised band selection framework (CAE-UBS) by introducing concrete random variables into the autoencoder. The representative bands are obtained by the information entropy criterion. Since ranking-based methods mainly concern the performance of individual bands and ignore the relationship between different bands, the band subsets given by these methods usually have a high level of redundancy.

C. SEARCHING-BASED METHODS
Searching-based methods choose the target band subset by optimizing certain specific criteria. Wang et al. [38] proposed a method that regards band selection as a column subset selection problem. It selects the column subset with the largest volume as the final band subset. Geng et al. [39] proposed a volume-gradient-based band selection method, in which the computational complexity is reduced by avoiding explicit calculation of the volume. In [40], Zhu et al. employed a structure-aware metric to evaluate the significance of all bands, and then selected representative bands by using dominant set extraction. Liu et al. [41] proposed a band grouping-based sparse self-representation band selection method. Specifically, they first divided all bands into multiple non-overlapping band subsets. Next, the representative bands are determined by selecting the subset that has the smallest reconstruction error. In addition, Wan et al. [42] evaluated the redundancy between bands by using a fitness function based on information theory. Although searching-based methods have certain global search capabilities, most of them have high computational complexity since they deal with a nonlinear optimization problem.

III. PROPOSED METHOD
Fig. 1 illustrates the framework of the proposed GLC-DEC method. The GLC-DEC model mainly consists of the DCN model, the local graph regularization, and the global graph regularization. In DCN, the SAE model is integrated with the K-means model so that the embedded representation and cluster assignments of all bands in an HSI are learned simultaneously. In addition, to maintain better consistency between the original data and the embedded representation, we introduce local and global consistency constraints into the DCN model. Specifically, the prior correlation between neighboring bands is exploited by imposing local consistency constraints on the embedded representation using local graph regularization. Meanwhile, to fully utilize the manifold structure among all bands, global graph regularization is further introduced into DCN to impose global consistency constraints on the embedded representation, which contributes to achieving a globally consistent representation of all bands. According to the clustering results available from GLC-DEC, the minimum noise method is employed to select representative bands. Let $\mathcal{M} \subset \mathbb{R}^n$ be a $d$-dimensional manifold embedded in $\mathbb{R}^n$, on which the original data are hypothesized to lie.
Given that the encoder of SAE is expressed as $f : \mathcal{M} \rightarrow \mathbb{R}^d$, and the decoder is represented by $g : \mathbb{R}^d \rightarrow \mathcal{M}$, the objective function of GLC-DEC can be defined as
$$\min_{f, g} \; R(f, g) + \lambda C(f) + \lambda_l F(f) + \lambda_g L(f), \tag{1}$$
where $R(f, g) + \lambda C(f)$ represents the loss of DCN; $F(f)$ and $L(f)$ represent the local graph regularization and the global graph regularization, respectively, used to impose consistency constraints on the space of function $f$; and $\lambda$, $\lambda_l$, and $\lambda_g$ denote the regularization parameters. Next, we introduce the GLC-DEC model in detail.

A. DCN
The loss function of DCN [25] is composed of two parts, namely the SAE loss and the clustering loss. The SAE loss is used to learn feature representations, while the clustering loss encourages these features to be more discriminative for clustering. Specifically, the network structure of SAE includes $(S+1)$ fully connected layers, where $S$ denotes a positive even number. The first $\frac{S}{2}+1$ layers of SAE function as the encoder, which generates the embedded representation of the original data, while the last $\frac{S}{2}+1$ layers form the decoder, which reconstructs the original data from the output of the encoder. Given the original HSI data $X \in \mathbb{R}^{M \times N}$, with $M$ and $N$ denoting the number of pixels and bands, respectively, the input of SAE is expressed as $H^{(0)} = X$, where $h_i^{(0)} = x_i$ denotes the $i$-th band. The output of each layer in SAE is represented as $H^{(s)}$, $s = 0, 1, \ldots, S$, with
$$H^{(s)} = \phi^{(s)}\left(W^{(s)} H^{(s-1)} + b^{(s)}\right), \quad s = 1, 2, \ldots, S, \tag{2}$$
where $\phi^{(s)}(\cdot)$ indicates the activation function; $W^{(s)}$ denotes the parameters between layers $(s-1)$ and $s$; and $b^{(s)}$ represents the bias parameters. To generate an embedded representation of the original HSI data, the encoder maps $X$ into the latent space, and the obtained embedded representation matrix is expressed as
$$H^{(S/2)} = f(X) = [h_1, h_2, \ldots, h_N] \in \mathbb{R}^{B \times N}, \tag{3}$$
where $B$ denotes the dimension of the embedded representation $h_i$. Based on the output of the encoder, the output of the decoder provides the reconstructed data, which is formulated as
$$H^{(S)} = g\left(H^{(S/2)}\right) = g(f(X)). \tag{4}$$
To better reconstruct the original data, the loss function of the SAE is generally defined as
$$R(f, g) = \left\| X - H^{(S)} \right\|_F^2, \tag{5}$$
where $\|\cdot\|_F$ represents the Frobenius norm of matrices.
Although SAE can learn the embedded representation of the original data, the generated embedded representation is not necessarily conducive to clustering. To jointly learn the cluster assignments and the embedded representation, DCN integrates the loss function of K-means into SAE. More formally, the objective function of DCN can be given by [25]
$$\min_{W, b, Z, A} \; \left\| X - H^{(S)} \right\|_F^2 + \lambda \left\| H^{(S/2)} - ZA \right\|_F^2 \quad \text{s.t. } A_{ci} \in \{0, 1\}, \; \mathbf{1}^T a_i = 1, \; i = 1, 2, \ldots, N, \tag{6}$$
where $Z = [z_1, z_2, \ldots, z_c, \ldots, z_C] \in \mathbb{R}^{B \times C}$ represents the matrix formed by $C$ centroids, with $z_c$ denoting the centroid of the $c$-th cluster; and $A \in \mathbb{R}^{C \times N}$ denotes the cluster assignment matrix, in which each column $a_i$ has only one nonzero element. Note that to jointly optimize the objective functions of SAE and K-means according to Equation (6), the embedded representation generated by the encoder is regarded as the input of K-means in each iteration of the optimization process of DCN.
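The K-means part of the DCN objective can be illustrated with a short pure-Python sketch. It treats the encoder output as fixed, assigns each embedded band to its nearest centroid (one nonzero entry per column of the assignment matrix), recomputes the centroids, and evaluates the clustering loss. The function names and toy data are hypothetical, and the plain batch centroid update used here is a simplification that may differ from the update rule in [25].

```python
# Hypothetical sketch: one K-means step on fixed embedded band vectors.
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_clusters(embeddings, centroids):
    """For each embedded band h_i, return the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(h, centroids[c]))
            for h in embeddings]

def update_centroids(embeddings, labels, centroids):
    """Recompute each centroid as the mean of its members (unchanged if empty)."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [h for h, lab in zip(embeddings, labels) if lab == c]
        if members:
            new_centroids.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_centroids.append(list(old))
    return new_centroids

def clustering_loss(embeddings, labels, centroids):
    """Sum of squared distances to assigned centroids (the Frobenius-norm term)."""
    return sum(sq_dist(h, centroids[lab]) for h, lab in zip(embeddings, labels))

# Toy data: six "bands" embedded in B = 2 dimensions, C = 2 clusters.
H = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [1.0, 1.1], [1.1, 0.9], [0.9, 1.0]]
Z = [[0.0, 0.0], [1.0, 1.0]]
labels = assign_clusters(H, Z)
Z = update_centroids(H, labels, Z)
loss = clustering_loss(H, labels, Z)
```

In a full training loop, a step of this kind would alternate with gradient updates of the network parameters, so that the embeddings and the clusters are refined together.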

B. CONSTRAINTS FOR CONSISTENT REPRESENTATION
Although the DCN model in Equation (6) can learn an embedded representation with a clustering-friendly structure, it does not exploit the prior information of HSIs. This may limit the effectiveness of the clustering results for band selection. To address this problem, we impose two constraints on the embedded representation by designing regularization terms based on the assumption that there exists a strong correlation between neighboring bands as well as a manifold structure among all bands. Specifically, to promote a locally consistent representation between neighboring bands, we design a local graph regularization for DCN. Furthermore, to learn a representation with global consistency based on the manifold structure of all bands, we introduce a global graph regularization into the DCN model.

1) GLOBAL CONSISTENCY CONSTRAINT
To preserve the global consistency of the representation based on the manifold structure of all bands, we design a global graph regularization for DCN. This regularization is based on the manifold assumption that the low-dimensional embedded representations of two bands are similar if the bands are close to each other in the original geometric space.
According to [43], the manifold structure of HSIs can be effectively captured by a $K$-nearest neighbor graph, in which each vertex corresponds to a band. Accordingly, given an HSI matrix $X = \{x_1, x_2, \ldots, x_N\}$, we construct a $K$-nearest neighbor graph $G_1 = (V_1, E_1)$, where $V_1 = \{x_1, x_2, \ldots, x_N\}$ denotes the vertex set of $G_1$, while $E_1$ represents the edge set of $G_1$. Each edge $(x_i, x_j)$ in $E_1$ is constructed as follows: each band $x_i$ is connected to every band $x_j$ among its $K$ nearest neighbors, determined by the Euclidean distance in the original space. To measure the similarity between $x_i$ and its neighbor $x_j$, we calculate a weight for the edge $(x_i, x_j)$ by using the heat kernel weighting strategy
$$O_{ij} = \begin{cases} \exp\left(-\dfrac{\|x_i - x_j\|_2^2}{\sigma}\right), & (x_i, x_j) \in E_1, \\ 0, & \text{otherwise}, \end{cases} \tag{7}$$
where $O \in \mathbb{R}^{N \times N}$ represents the weight matrix [44] and $\sigma$ is the heat kernel parameter. Based on the matrix $O$, the global graph regularization is expressed as
$$L(f) = \mathrm{Tr}\left(H^{(S/2)} \mathbf{L} \left(H^{(S/2)}\right)^T\right), \tag{8}$$
where $\mathrm{Tr}(\cdot)$ denotes the trace of matrices, and $\mathbf{L} = D - O$ indicates the Laplacian matrix, with $D$ denoting the diagonal matrix whose diagonal entries are given by
$$D_{ii} = \sum_{j=1}^{N} O_{ij}. \tag{9}$$
With the introduced global graph regularization, the embedded representation learned by the encoder is expected to preserve better global consistency between each band and its neighbors.
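The global graph regularization can be illustrated with a small pure-Python sketch. It builds a K-nearest-neighbor graph over toy band vectors, weights the edges with a heat kernel, and evaluates the Laplacian trace penalty on a set of embeddings in two equivalent ways. The bandwidth `sigma`, the symmetrization of the neighbor graph, and all function names are assumptions made for illustration rather than details taken from the paper.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_heat_kernel_weights(bands, k, sigma):
    """Weight matrix O with heat-kernel weights on K-nearest-neighbor edges."""
    n = len(bands)
    O = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: sq_dist(bands[i], bands[j]))
        for j in others[:k]:
            w = math.exp(-sq_dist(bands[i], bands[j]) / sigma)
            O[i][j] = w
            O[j][i] = w  # assumption: keep the graph symmetric
    return O

def pairwise_penalty(embeddings, O):
    """0.5 * sum_ij O_ij * ||h_i - h_j||^2."""
    n = len(embeddings)
    return 0.5 * sum(O[i][j] * sq_dist(embeddings[i], embeddings[j])
                     for i in range(n) for j in range(n))

def trace_penalty(embeddings, O):
    """Tr(H L H^T) with L = D - O, where row d of H holds coordinate d of every band."""
    n = len(embeddings)
    degree = [sum(row) for row in O]
    total = 0.0
    for d in range(len(embeddings[0])):
        coord = [h[d] for h in embeddings]
        for i in range(n):
            for j in range(n):
                lap_ij = (degree[i] if i == j else 0.0) - O[i][j]
                total += coord[i] * lap_ij * coord[j]
    return total

# Toy data: four bands with three pixels each; bands 0-1 and 2-3 are similar.
bands = [[1.0, 2.0, 3.0], [1.1, 2.1, 3.0], [5.0, 5.0, 5.0], [5.2, 4.9, 5.1]]
O = knn_heat_kernel_weights(bands, k=1, sigma=1.0)
consistent = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.0, 1.1]]
scrambled = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.0], [1.0, 1.1]]
penalty_consistent = pairwise_penalty(consistent, O)
penalty_scrambled = pairwise_penalty(scrambled, O)
```

For a symmetric weight matrix the two functions return the same value, and embeddings that keep neighboring bands close receive a much smaller penalty than embeddings that separate them.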

2) LOCAL CONSISTENCY CONSTRAINT
In this study, to maintain the local consistency between the embedded representations of adjacent bands, we introduce the local graph regularization into DCN. This regularization is designed based on the assumption that neighboring bands are strongly correlated with each other. Accordingly, given an HSI matrix $X = \{x_1, x_2, \ldots, x_N\}$, we construct a bipartite graph $G_2$. In the set $V_2$, $u_i$ denotes the average vector of $x_{i-1}$ and $x_{i+1}$. To measure the similarity between $x_i$ and $u_i$, a weight $\omega_i$ is calculated for each edge $(x_i, u_i)$, where $i = 1, 2, \ldots, N$. Fig. 2 shows an illustration of the bipartite graph $G_2$. To facilitate the calculation of the weights of $G_2$, we constructed the matrix Based on X, the weight $\omega_i$ is defined by the heat kernel weighting strategy Accordingly, the local graph regularization can be expressed as with where h

C. OVERALL MODEL
The overall objective function of GLC-DEC is expressed as
$$\min_{W, b, Z, A} \; R(f, g) + \lambda C(f) + \lambda_l F(f) + \lambda_g L(f), \tag{14}$$
where $\lambda_l$ and $\lambda_g$ denote the regularization parameters. The advantage of GLC-DEC expressed in Equation (14) is that, by introducing local and global consistency constraints, the strong correlation between neighboring bands and the manifold structure of all bands can be effectively exploited. Consequently, the connectivity between the original space and the representation space can be increased. This is conducive to obtaining better embedded representations for clustering. Based on the above analysis, the algorithm steps of GLC-DEC are listed in Algorithm 1.

Algorithm 1
...
Obtain the reconstructed data $H^{(S)}$ according to Equation (4);
Update $W$ and $b$ via Equation (14);
16: end while
17: Select the representative band subset $\Upsilon$ using the minimum noise method;
18: return $\Upsilon$.

D. BAND SELECTION
Many conventional clustering-based methods perform band selection using criteria based on information divergence or on the distance from the centroid to each band in a cluster. However, these methods do not consider noise interference, which limits the performance of hyperspectral classification when the selected bands contain significant noise. To address this problem, the minimum noise method [28] is applied to select representative bands based on the obtained clustering results. Specifically, each band in a cluster is split into $\eta$ blocks, each of which has $R \times R$ pixels. Next, the noise level of each band is evaluated by calculating multiple local variances based on the blocks of the corresponding band. The local variance $LV$ of each block is calculated by
$$LV = \frac{1}{R^2 - 1} \sum_{i=1}^{R^2} \left(T_i - LM\right)^2,$$
where $LM$ indicates the average of all pixels in the current block and $T_i$ indicates the value of the $i$-th pixel in this block.
VOLUME 11, 2023
Next, based on the difference between the maximum and minimum variance of the $\eta$ blocks, $k$ bins with equal width are divided by where $\max LV$ and $\min LV$ denote the maximum and minimum variance, respectively, while $\beta$ denotes the partition granularity. Then, according to the value of its local variance, each block is allocated to one of these bins. The bin that contains the most blocks is an indicator of the approximate level of noise in the current band. Finally, according to the noise estimate of each band, a subset of bands is obtained by selecting the band with the minimum noise in each cluster.
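The block-variance procedure described above can be sketched in pure Python as follows. This is an illustrative sketch under several assumptions: the bins are taken as `k` equal-width intervals between the minimum and maximum local variance, the noise estimate is taken as the mean local variance inside the fullest bin, the partition granularity parameter is not modeled, and all function names are hypothetical.

```python
from statistics import mean, variance

def split_blocks(image, r):
    """Split a 2-D band (list of rows) into non-overlapping r x r blocks, flattened."""
    rows, cols = len(image), len(image[0])
    blocks = []
    for top in range(0, rows - r + 1, r):
        for left in range(0, cols - r + 1, r):
            blocks.append([image[top + i][left + j]
                           for i in range(r) for j in range(r)])
    return blocks

def estimate_noise(image, r, k):
    """Noise estimate of one band from the most populated local-variance bin."""
    # statistics.variance is the sample variance (divides by the block size minus 1).
    local_vars = [variance(block) for block in split_blocks(image, r)]
    lo, hi = min(local_vars), max(local_vars)
    if hi == lo:
        return lo
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for lv in local_vars:
        index = min(int((lv - lo) / width), k - 1)  # clamp the maximum into the last bin
        bins[index].append(lv)
    fullest = max(bins, key=len)
    return mean(fullest)  # assumption: summarize the fullest bin by its mean

def select_bands(band_images, labels, r, k):
    """Return, for each cluster, the index of the band with the smallest noise estimate."""
    best = {}
    for index, (image, cluster) in enumerate(zip(band_images, labels)):
        noise = estimate_noise(image, r, k)
        if cluster not in best or noise < best[cluster][0]:
            best[cluster] = (noise, index)
    return sorted(index for _, index in best.values())

# Toy usage: two clusters, each holding one smooth band and one noisier band.
clean = [[1.0] * 4 for _ in range(4)]
mixed = [[0.0, 0.0, 0.0, 1.0],
         [0.0, 0.0, 0.0, 1.0],
         [0.0, 1.0, 0.0, 10.0],
         [1.0, 0.0, 0.0, 10.0]]
selected = select_bands([mixed, clean, clean, mixed], [0, 0, 1, 1], r=2, k=3)
```

Using the fullest bin rather than the overall mean of the local variances keeps blocks that straddle object edges, which have large variance for reasons unrelated to noise, from dominating the estimate.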

IV. EXPERIMENTS

A. DATASETS
Two real HSI datasets listed in Table 1 were used to evaluate our method. Fig. 3 shows the pseudocolor images corresponding to these two datasets. The Indian Pines dataset was collected by an airborne visible/infrared imaging spectrometer in 1992 at a test site in northwest Indiana, USA. This dataset contains 145 × 145 pixels, with each pixel comprising 220 spectral bands. Before the experiments, the bands corresponding to the water absorption regions, namely bands [104-108], [150-163], and 220, were deleted, and the remaining 200 bands were used. This scene contains 16 categories of objects. Table 2 shows the number of training and testing samples in each class of the Indian Pines dataset in our experiments.
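As a quick check of the band count, the removal of the water absorption bands can be written out explicitly. The snippet below is only an illustration: it assumes 1-based band numbering, as used in the text, and the variable names are hypothetical.

```python
# 1-based band numbers of the original 220-band Indian Pines cube.
removed = set(range(104, 109)) | set(range(150, 164)) | {220}

# Bands kept for the experiments, still in 1-based numbering.
kept = [band for band in range(1, 221) if band not in removed]

# Zero-based column indices, e.g. for slicing an array of shape (pixels, 220).
kept_indices = [band - 1 for band in kept]
```

The three removed ranges contain 5, 14, and 1 bands, so 20 bands are dropped and 200 remain.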
The Pavia Centre dataset was acquired by the ROSIS sensor over Pavia in northern Italy. This dataset contains 102 spectral bands with an image size of 1096 × 715 pixels. The spatial resolution of this scene is 1.3 m. In our experiments, we selected a subscene located within [250-394, 400-544] of the Pavia Centre scene to evaluate our proposed method. This subscene contains 145 × 145 pixels and seven classes. The numbers of training and testing samples in each class of the Pavia Centre dataset are shown in Table 3.

B. COMPARISON METHODS
Our proposed GLC-DEC is compared with seven benchmark methods, whose characteristics are summarized below:
1. E-FDPC [45]: E-FDPC is a clustering-based band selection method. It selects representative bands by introducing the weighted local density and distance into FDPC, and adjusts the representative bands via an exponential-based learning rule.
2. ASPS [28]: ASPS is also a clustering-based band selection method. It first divides the 3D HSI cube into multiple subcubes by conducting clustering on an HSI. After that, ASPS introduces local variance to evaluate the noise of bands. Lastly, the band subset is generated by selecting the band with the lowest noise from each cluster.
3. DSC [33]: DSC is a clustering-based method with two phases. It first conducts clustering on the latent representation learned by a convolutional autoencoder. Then, the representative bands are obtained by selecting, in each cluster, a band that is close to the cluster center.
4. HLFC [31]: HLFC is a clustering-based band selection method. It first employs superpixel segmentation to obtain multiple spatial regions. Then, Laplacian matrices that can learn the latent features of all regions using the hierarchical strategy are generated based on the graphs that reflect the band-wise similarity of each region. The unified feature representation of an HSI is generated by integrating the latent features of all regions. Finally, the optimal band subset is formed by selecting the band that has the maximum information entropy in each cluster given by k-means.
5. BS-Net [36]: BS-Net is a ranking-based method, which regards band selection as a band reconstruction task under the assumption that each band is sparsely reconstructed by some informative bands. Based on the learned sparse weights of bands, the final bands are selected by ranking the average weight of each band.
6. CAE-UBS [37]: CAE-UBS is a ranking-based method, which builds a concrete autoencoder by introducing concrete random variables into the autoencoder. Based on the weights of the concrete autoencoder, CAE-UBS generates candidates for band selection. Then, the representative bands are chosen by the information entropy criterion.
7. MVPCA [34]: MVPCA is a ranking-based method. It first estimates the priority of all bands via a variance-based band power ratio, and then regards the bands with the highest priority as representative bands.

C. EXPERIMENTAL SETUP
To compare the performance of all the methods, we used the classification accuracy given by support vector machine (SVM) and K-nearest-neighbor (KNN) classifiers as the performance metric. In our experiments, we selected the proportion of training samples by cross-validation. According to the result of cross-validation, as shown in Tables 2-3, we randomly chose 20% of the samples from each class as the training data, while keeping the remaining 80% of the samples as the testing data. To demonstrate the overall accuracy of all methods, we tested them on the two datasets while using different numbers of bands (5-50 with an interval of 5). Note that we evaluated the performance of all methods using three accuracy measures: overall accuracy (OA), average overall accuracy (AOA), and kappa coefficient (Kappa).
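For reference, OA and Kappa can be computed from a confusion matrix as sketched below. The confusion matrix and OA values are made-up toy numbers, the function names are hypothetical, and AOA is taken here as the mean of the OA values obtained over the different numbers of selected bands, which is an assumption about how the measure is defined.

```python
def overall_accuracy(cm):
    """Fraction of correctly classified samples; cm[i][j] counts true class i predicted as j."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total

def kappa_coefficient(cm):
    """Cohen's kappa: agreement beyond what is expected by chance."""
    total = sum(sum(row) for row in cm)
    observed = overall_accuracy(cm)
    expected = sum(sum(cm[i]) * sum(row[i] for row in cm)
                   for i in range(len(cm))) / total ** 2
    return (observed - expected) / (1 - expected)

def average_overall_accuracy(oa_values):
    """Mean OA over several band-subset sizes (assumed definition of AOA)."""
    return sum(oa_values) / len(oa_values)

# Toy two-class confusion matrix with 100 test samples.
cm = [[45, 5],
      [10, 40]]
oa = overall_accuracy(cm)
kappa = kappa_coefficient(cm)
aoa = average_overall_accuracy([0.80, 0.90])
```

Kappa is lower than OA for the same predictions because it subtracts the agreement that a random labeling with the same class proportions would reach.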
We implemented the GLC-DEC method using the PyTorch platform. The encoder part of SAE has three hidden layers with 700, 100, and 20 neurons, respectively. The decoding network has a mirrored structure. The learning rate was $1.0 \times 10^{-4}$ and the maximum number of iterations $T$ was 500. The three regularization parameters were set to $\lambda = 0.05$, $\lambda_g = 10{,}000$, and $\lambda_l = 100$, respectively. The network and parameter settings were the same for both the Indian Pines and Pavia Centre datasets. According to reference [28], the relevant parameters in the minimum noise method were set to $\eta = 841$, $R = 5$, $\beta = 3$, and $k = 5$ on the two datasets.

D. EXPERIMENTAL RESULTS
The AOA and Kappa values calculated from the results given by SVM and KNN on the Indian Pines and Pavia Centre datasets are compared in Fig. 4 and Fig. 5, respectively. GLC-DEC exhibits the best performance on both datasets. With the SVM classifier, HLFC achieved the second-best performance, whereas ASPS provided the second-best performance with the KNN classifier. In addition, Fig. 6 (a) and (b) show the OA values of SVM and KNN, respectively, when all methods were tested on the Indian Pines dataset. For the Pavia Centre dataset, Fig. 7 (a) and (b) present the OA curves corresponding to SVM and KNN. Note that the OA curves in Fig. 6 and Fig. 7 show the performance of all the methods when they yielded 5-50 selected bands with an interval of 5.
To further evaluate the performance of GLC-DEC, the classification accuracy for each class on the two datasets using 20 selected bands is listed in Tables 4-5. The optimal values are indicated in red bold font, while the second-best values are indicated in blue italic font. According to Tables 4-5, some band selection methods are slightly unstable across datasets. For instance, BS-Net performs better on the Indian Pines dataset than on the Pavia Centre dataset, and the same trend is observed for the DSC method. Moreover, GLC-DEC provides the best or second-best performance in most cases, which demonstrates that our method performs well on both datasets. In addition, a detailed analysis of the OA values for all methods on the Indian Pines and Pavia Centre datasets is given as follows.
On the Indian Pines dataset, as illustrated in Fig. 6 (a) and (b), our proposed GLC-DEC method outperformed all the alternatives with both the SVM and KNN classifiers. In particular, when selecting 5 and 15 bands, our method exhibited a significant advantage. According to Fig. 6 (a), the OA value of GLC-DEC improves by 6% over the second-best method when 5 bands are selected. As shown in Fig. 6 (b), with the KNN classifier, GLC-DEC outperforms the second-best method by 2% when 15 bands are selected.
As to the Pavia Centre dataset, we can see from Fig. 7 (a) that our proposed GLC-DEC method outperforms the other methods with the SVM classifier except when 10 and 15 bands are selected. Although HLFC performs similarly to GLC-DEC when selecting 10 and 15 bands, GLC-DEC performs slightly better at the other numbers of bands. As shown in Fig. 7     performance, while GLC-DEC has higher OA values on all bands.
In addition, Fig. 8 and Fig. 9 compare the ground truth classification information with the classification maps based on GLC-DEC when 20 bands were selected. These figures demonstrate the classification quality of the GLC-DEC method for both datasets. It is worth noting that only 20 selected bands were used, which account for about 10% and 20% of the bands of the Indian Pines and Pavia Centre datasets, respectively. Thus, our proposed method can substantially reduce data redundancy while maintaining satisfactory classification results for HSIs.

V. DISCUSSION
According to the performance comparison illustrated in Fig. 6 and Fig. 7, the number of selected bands has a significant impact on the classification performance. Specifically, we can see from Fig. 6 and Fig. 7 that the OA values rise slowly or even decrease once the number of selected bands reaches a certain level, instead of continuing to rise significantly as more bands are selected. Previous studies have also shown that classification performance is not exactly proportional to the number of bands [46]. This can be interpreted as more bands leading to more redundancy. Therefore, selecting a reasonable number of bands is necessary to achieve optimum classification performance. The advantage of our method is that GLC-DEC performs well with different classifiers and is robust to the number of selected bands. As shown in Fig. 6 and Fig. 7, our method achieves the best or near-best OA with the SVM and KNN classifiers when the number of selected bands ranges from 5 to 50 with an interval of 5. Compared with ranking-based methods, which neglect the similarity between the selected bands, our method chooses one band from each cluster, which avoids high similarity among the selected bands. Compared with other clustering-based methods (i.e., E-FDPC, ASPS, DSC, and HLFC), our method performs better because GLC-DEC not only jointly learns the embedded representation and cluster assignments of all bands, but also considers the similarity between neighboring bands and the manifold structure of HSIs.
Our study also has some limitations. On the one hand, the conventional K-means clustering method is used in the GLC-DEC method, and the sensitivity of our method to different clustering methods (e.g., spectral clustering and Gaussian mixture models) deserves further investigation. On the other hand, although the minimum noise method accounts for noise interference in band selection, the effect of noise is not considered in our proposed model. In future studies, we will explore deep embedded clustering based on other clustering methods and on noise modeling to obtain more effective clustering results for band selection.

VI. CONCLUSION
In this paper, we proposed a novel hyperspectral band selection method based on deep embedded clustering. Compared with conventional clustering-based band selection methods, our proposed method can achieve more effective clustering results for band selection by simultaneously learning the embedded representation and cluster assignments. Specifically, by introducing consistency-related constraints into the DCN model, the strong correlation between neighboring bands and the manifold structure of all bands of HSIs are fully exploited. Consequently, the clustering results obtained by our proposed method are more applicable to the task of band selection. The experimental results on two real datasets demonstrated that our method outperformed seven state-of-the-art methods.

FIGURE 1. Illustration of the framework of the GLC-DEC method. The GLC-DEC model mainly consists of the DCN model, the local graph regularization, and the global graph regularization. In DCN, the SAE model is integrated with K-means so that the embedded representation and cluster assignments of all bands in an HSI are learned simultaneously. In addition, to maintain better consistency between the original data and the embedded representation, we introduce local and global consistency constraints into the DCN model. Specifically, the prior correlation between neighboring bands is exploited by imposing local consistency constraints on the embedded representation using local graph regularization. Meanwhile, to make full use of the manifold structure of all bands, global graph regularization is further introduced into DCN to impose global consistency constraints on the embedded representation, which contributes to achieving a globally consistent representation of all bands. According to the clustering results obtained from GLC-DEC, the minimum noise method is employed to select representative bands.

FIGURE 2. Illustration of the bipartite graph $G_2$.

$h_i^{S/2}$ represents the embedded representation of band $x_i$, and $r_i$ denotes the average vector of $h_{i-1}^{S/2}$ and $h_{i+1}^{S/2}$; $P = D - \bar{O}$ indicates the Laplace matrix, in which $\bar{O} \in \mathbb{R}^{2N \times 2N}$ represents the weight matrix with $\bar{O}_{i,i+N} = \omega_i$, $i = 1, 2, \ldots, N$, and $D \in \mathbb{R}^{2N \times 2N}$ denotes a diagonal matrix with $D_{mm} = \sum_{j} \bar{O}_{mj}$, $m = 1, 2, \ldots, 2N$. Compared with global graph regularization, which enforces the global consistency of the embedded representation based on the manifold structure, local graph regularization imposes a similarity constraint between the embedded representation of each band and the average of the embedded representations of its neighboring bands.
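To make the construction of the local Laplace matrix concrete, the following sketch builds P = D - Ō from a list of weights ω and computes the neighbor average r_i for one band. It rests on two assumptions that the text does not confirm: only the entries Ō[i][i+N] = ω_i are non-zero, and D is the usual degree matrix holding the row sums of Ō. The function names and the toy weights are hypothetical, and plain Python lists are used to keep the example dependency-free.

```python
def local_graph_laplacian(omega):
    """Build P = D - O for the 2N-node bipartite graph from weights omega.

    Assumptions (not confirmed by the text): only the entries
    O[i][i + N] = omega[i] are non-zero, and D is the diagonal matrix
    whose entries are the row sums of O.
    """
    n = len(omega)
    size = 2 * n
    O = [[0.0] * size for _ in range(size)]
    for i, w in enumerate(omega):
        O[i][i + n] = float(w)
    # Degree entries: D[m][m] = sum over j of O[m][j].
    degree = [sum(row) for row in O]
    return [[(degree[m] if m == j else 0.0) - O[m][j] for j in range(size)]
            for m in range(size)]


def neighbor_average(h_prev, h_next):
    """r_i: element-wise mean of the embeddings of the two neighboring bands."""
    return [(a + b) / 2.0 for a, b in zip(h_prev, h_next)]


# Toy example with N = 3 bands and made-up weights and embeddings.
P = local_graph_laplacian([0.5, 1.0, 2.0])
r = neighbor_average([1.0, 3.0], [3.0, 5.0])
print(len(P), r)
```

Under these assumptions every row of P sums to zero, as expected of a graph Laplacian. Whether Ō should also be symmetrized (Ō[i+N][i] = ω_i) is left open here because the text does not state it.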

Algorithm 1 GLC-DEC Algorithm for Band Selection
1: Input: HSI data $X \in \mathbb{R}^{M \times N}$, the number of selected bands C, maximum iterations T, and regularization coefficients $\lambda$, $\lambda_l$, and $\lambda_g$.
2: Output: The band subset Υ.
3: Compute the Laplace matrix L of global graph regularization by Equation (7) and Equation (9);
4: Calculate the Laplace matrix P of local graph regularization by Equation (11);
5: Randomly initialize W and b according to the given network configuration;
6: Pretrain SAE and obtain initial clustering centroids by running K-means in the embedded space;
7: while model is not convergent or maximum iterations T is not met do
[...] clustering assignments matrix A and cluster centroids matrix Z;
10: Compute the K-means clustering loss;
11: Compute the local graph regularization loss by Equation (

FIGURE 4. Average overall accuracy (AOA) on two datasets for seven band selection methods.

FIGURE 5. Kappa coefficient (Kappa) on two datasets for seven band selection methods.

FIGURE 6. Overall accuracy (OA) for the support vector machine (SVM) and K-nearest-neighbor (KNN) classifiers on the Indian Pines dataset for different numbers of bands.

FIGURE 7. Overall accuracy (OA) for the support vector machine (SVM) and K-nearest-neighbor (KNN) classifiers on the Pavia Centre dataset for different numbers of bands.

FIGURE 8. Ground truth map and classification maps generated by GLC-DEC using two classifiers on the Indian Pines dataset. (a) Ground truth. (b) GLC-DEC by SVM. (c) GLC-DEC by KNN.

FIGURE 9. Ground truth map and classification maps generated by GLC-DEC using two classifiers on the Pavia Centre dataset. (a) Ground truth. (b) GLC-DEC by SVM. (c) GLC-DEC by KNN.

TABLE 1. Detailed information about the two real HSI datasets.

TABLE 2. The number of training and testing samples in each class of the Indian Pines dataset.

TABLE 3. The number of training and testing samples in each class of the Pavia Centre dataset.

TABLE 4. Comparison of the classification accuracy of all methods for each class of the Indian Pines dataset, where GLC-DEC is our method and larger values indicate better performance.

TABLE 5. Comparison of the classification accuracy of all methods for each class of the Pavia Centre dataset, where GLC-DEC is our method and larger values indicate better performance.