Self-Supervised Deep Subspace Clustering for Hyperspectral Images With Adaptive Self-Expressive Coefficient Matrix Initialization

Deep subspace clustering networks have shown their effectiveness in hyperspectral image (HSI) clustering. However, two major challenges remain: 1) the lack of effective supervision for feature learning; and 2) the negative effect caused by the high redundancy of the global dictionary atoms. In this article, we propose an end-to-end trainable network for HSI clustering. Specifically, to ensure that the extracted features are well suited to the subsequent subspace clustering, cluster assignments with high confidence are employed as pseudo-labels to supervise the feature learning process. Then, an adaptive self-expressive coefficient matrix initialization strategy is designed to reduce the dictionary redundancy, where the spectral similarity between each target sample and its neighbors is modeled via a ${k}$-nearest neighbor graph to guide the initialization. Experimental results on three public HSI datasets demonstrate the effectiveness of the proposed method. In particular, our method outperforms several state-of-the-art HSI clustering methods and achieves an overall accuracy of 100% on both the Salinas-A and Pavia University datasets.

task due to the high dimensionality and complex spectral-spatial structures of HSI data [9].
Recently, subspace clustering (SC) [10] has been successfully applied to HSI clustering due to its capability to handle high-dimensional data and its effectiveness in capturing the complex structures of HSI data [11]-[19]. These methods can be grouped into two categories, i.e., SC in the original space and SC in the feature space. The former construct the affinity matrix from raw samples [11]-[16], whereas the latter construct the affinity matrix from the features of samples [17]-[19]. Due to the inherent nonlinear structures of HSIs, SC in a deep feature space can well capture the nonlinear characteristics of the sample distribution [18], [19]. However, two main problems need to be tackled for these deep subspace clustering (DSC) methods. First, since affinity matrix learning and spectral clustering are performed independently in these methods, their feature learning lacks effective supervision. Therefore, the deep features extracted by the encoder are not always suited to the subsequent SC [20]. Second, since these methods employ the global self-expressive dictionary to represent the features of samples, the high dictionary redundancy hinders further improvement of the clustering performance [14].
To address the aforementioned issues, we propose a Self-supervised Deep Subspace Clustering method with Adaptive self-expressive coefficient matrix Initialization (SDSC-AI) for HSI clustering. Specifically, to learn discriminative features for SC, we propose an end-to-end trainable network that combines affinity matrix learning and spectral clustering. In our network, fully connected layers are introduced on top of the encoder to serve as a classifier, which uses the cluster assignments produced by spectral clustering as pseudo-labels to supervise the feature learning process. In this way, affinity matrix learning and spectral clustering are alternately performed, and the whole model is trained in an end-to-end manner. Moreover, to obtain highly confident pseudo-labels, the samples closer to their cluster centers in spectral clustering are selected to train the encoder, and their cluster assignments are considered highly confident.
To reduce the high redundancy of the global dictionary atoms, the correlated atoms need to be selected to express the target features of samples, while the uncorrelated atoms should be suppressed. However, existing DSC-based methods [18], [19] initialize the elements of the self-expressive coefficient matrix with the same nonzero value, which tends to induce all atoms to express each target feature. Therefore, the initialization approach of the self-expressive coefficient matrix in these methods cannot address the issue of high dictionary redundancy. Based on the fact that similar HSI samples are more likely to lie in the same subspace [12], we construct a k-nearest neighbor (KNN) graph to model the spectral similarity between each sample and its neighbors. A nonzero element in the binary adjacency matrix of the KNN graph indicates that the corresponding two samples are similar; therefore, these two samples and their corresponding features are likely to lie in the same subspace. Moreover, since the weights of neural networks are generally initialized to small random values [21], [22], the nonzero elements in the binary adjacency matrix are replaced by random values drawn from a uniform distribution. Finally, the updated adjacency matrix is used to initialize the self-expressive coefficient matrix. In this way, the correlated atoms are induced to express the target features, while the uncorrelated ones are suppressed.
The main contributions of this article are summarized as follows.
1) We propose an end-to-end trainable network that combines affinity matrix learning and spectral clustering. The cluster assignments with high confidence are used as pseudo-labels to supervise the feature learning process. To the best of our knowledge, this is the first attempt to introduce self-supervised learning for HSI clustering.
2) We propose a spectral-similarity-based adaptive self-expressive coefficient matrix initialization strategy to reduce the high redundancy of the global self-expressive dictionary atoms.
3) Experimental results on three benchmark HSI datasets demonstrate the superiority of our method compared with several state-of-the-art clustering methods.
The rest of this article is organized as follows. Some related works are briefly reviewed in Section II. The proposed method is described in Section III. Section IV presents the experimental setup and results in detail. Section V concludes this article.

II. RELATED WORKS
In this section, we briefly review major works on HSI clustering, SC, and self-supervised learning.

A. HSI Clustering
HSI clustering methods have drawn much attention since they do not require any labeled samples during the training phase. Generally, the existing HSI clustering algorithms can be divided into the following categories [23]: 1) centroid-based methods; 2) density-based methods; 3) biological-based methods; 4) graph-based methods; and 5) deep learning-based methods.
Centroid-based methods such as k-means [24], fuzzy c-means (FCM) [25], and fuzzy c-means with spatial constraint (FCM_S) [26] iteratively update the cluster centers until they remain unchanged. These methods are computationally efficient and easy to implement. However, they are sensitive to the initialization state [27]. Density-based methods such as clustering by fast search and find of density peaks [28] and its improved version [29] calculate the local density of each sample, and then select as cluster centers the samples that have both high local density and large distance from samples with higher densities. Biological-based methods such as the automatic fuzzy clustering method based on adaptive multiobjective differential evolution [30] employ a biological model to achieve HSI clustering, transforming the clustering problem into a multiobjective optimization problem. Graph-based methods such as spectral clustering [31], fast spectral clustering with anchor graph [16], and sparse subspace clustering (SSC) [11] construct a graph to represent the similarity of each pair of samples, and then obtain the clustering results by applying spectral analysis to the similarity graph. Deep learning-based methods such as learning the deep embedding based on the set-to-set and sample-to-sample distances (LSSD) [23] embed the raw samples into a low-dimensional feature space and group the deep representations to generate the final clusters.

B. Subspace Clustering
Based on the fact that data in a high-dimensional space can be better represented as subspaces [32], SC-based methods have attracted great research interest due to their capability to handle high-dimensional data and their effectiveness in capturing the complex structures of HSI data [11]-[19]. These methods generally divide the task of clustering into two subproblems. The first is to construct the affinity matrix, and the second is to apply spectral clustering to the affinity matrix. According to whether the affinity matrix is built in the original space or not, these methods can be grouped into two categories, i.e., SC in the original space and SC in the feature space.
1) Subspace Clustering in Original Space: These methods [11]-[13] build the affinity matrix from raw HSI samples based on the assumption that a sample in a union of subspaces can be expressed as a linear combination of other samples in the same subspace (i.e., the self-expressiveness property of the data [11]). To construct an informative affinity matrix, different regularization terms [10] are introduced to regularize the self-expressive coefficient matrix, e.g., the sparse affinity matrix induced by the ℓ1-norm [11] and the low-rank affinity matrix induced by the nuclear norm [33], [34]. In addition to these typical subspace learning methods, spectral-spatial sparse SC (S⁴C) [12] and ℓ2-norm regularized SSC (ℓ2-SSC) [13] have been proposed for HSI clustering to better exploit both the spectral and spatial information.
2) Subspace Clustering in Feature Space: These methods [17]-[19], [35] map the raw samples into a feature space to better capture the nonlinear characteristics of the sample distribution, and then construct the affinity matrix in the feature space. For instance, kernel SC [17] implicitly maps the HSI samples from the original space to a kernelized space. However, this method empirically selects the optimal kernel and thus suffers from the degradation of different nonlinear kernels. Recently, the DSC network [18] was proposed to nonlinearly map the raw samples to a latent space using deep convolutional autoencoders. Besides, DSC uses a novel self-expressive layer to achieve the self-expressiveness property of the features of samples. To preserve the cluster structure in the data space, distribution-preserving subspace clustering [35] introduces a distribution consistency loss to guide the learning of distribution-preserving latent representations. Inspired by the success of the DSC method, Laplacian regularized deep subspace clustering [19] introduces Laplacian regularization to retain the manifold structure of HSI data.

Fig. 1. Our network consists of five modules: a) the feature extraction module is based on deep convolutional autoencoders, where the encoder extracts features and the decoder reconstructs the raw input samples; b) the self-expressiveness module learns the self-expressive coefficient matrix; c) the self-expressive coefficient matrix initialization module provides a good initial self-expressive coefficient matrix for the self-expressiveness module; d) the self-supervised learning-based classification module classifies the features with the pseudo-labels generated from the spectral clustering module; and e) the spectral clustering module generates clustering results and provides pseudo-labels to supervise the feature learning (best viewed in color).

C. Self-Supervised Learning
Self-supervised learning aims at learning general features without using any human-annotated labels [36]. To this end, self-supervised learning generally predefines a pretext task to learn feature representations of the unlabeled data using pseudo-labels that are automatically generated based on the attributes of the unlabeled data. After pretext task training, the deep representations contain rich semantic information and are then transferred to downstream tasks by fine-tuning.
Since the pretext task plays a key role in self-supervised learning, several effective pretext tasks [36], [37] have been designed to yield pseudo-labels from the unlabeled data to guide self-supervised learning. In [38], the prediction of the relative spatial position between the central image patch and its 8-neighbor image patches is used as a pretext task. In [39], the pretext task is designed as the recovery of the positions of spatially shuffled image patches. Prediction of geometrical image transformations such as rotations is also used as a pretext task [40]. Besides, mutual information (MI) maximization is a popular kind of pretext task in self-supervised learning [41], [42]. Recently, instance discrimination [37], [43], [44] has been leveraged as a pretext task and achieves promising performance in downstream tasks.
Clustering is a natural pretext task since data are grouped according to their attributes and can be automatically assigned clustering labels. DeepCluster [45] is a typical clustering-based self-supervised learning method whose training process includes two alternate steps, i.e., 1) train the encoder using cluster assignments as pseudo-labels, and 2) cluster the image features using the k-means algorithm. In this article, we follow the clustering-based method to yield pseudo-labels.

III. PROPOSED METHODOLOGY
In this section, the SDSC-AI method is proposed for HSI clustering. As illustrated in Fig. 1, our method consists of a feature extraction module, a self-expressiveness module, an adaptive self-expressive coefficient matrix initialization module, a spectral clustering module, and a self-supervised learning-based classification module. In the following, we first introduce each module of our network, and then introduce the implementation details.

A. Feature Extraction Module
The foundational component of our method is the feature extraction module, which nonlinearly maps the raw HSI samples into a latent space. To exploit the spatial information, deep convolutional autoencoders are used as the backbone network. Given HSI samples X = [x_1, x_2, ..., x_N] ∈ R^{m×N} drawn from a union of n subspaces ∪_{j=1}^{n} S_j, where m, N, and n denote the spectral dimension, the number of samples, and the number of subspaces, respectively, let Z = [z_1, z_2, ..., z_N] ∈ R^{p×N} denote the deep features of the input samples, i.e., the output of the encoder, where p is the dimension of the deep features. Then, Z is fed into the decoder to reconstruct the input samples X. To ensure that the input samples X can be reconstructed by the autoencoders, the loss function is defined as

L_DAE = (1/2) ||X − X̂||_F^2

where ||·||_F denotes the Frobenius norm, and X̂ denotes the samples reconstructed by the autoencoders.
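As a quick sanity check, the reconstruction loss can be sketched in NumPy, assuming the standard half-squared-Frobenius form; the function name is illustrative:

```python
import numpy as np

def reconstruction_loss(X, X_hat):
    """Reconstruction loss L_DAE = 0.5 * ||X - X_hat||_F^2.

    X     : (m, N) raw samples, one sample per column.
    X_hat : (m, N) samples reconstructed by the decoder.
    """
    return 0.5 * np.linalg.norm(X - X_hat, ord="fro") ** 2

X = np.array([[1.0, 2.0], [3.0, 4.0]])
X_hat = np.array([[1.0, 2.0], [3.0, 3.0]])
print(reconstruction_loss(X, X_hat))  # 0.5 * (4 - 3)^2 = 0.5
```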

B. Self-Expressiveness Module
The self-expressiveness module is used to learn the self-expressive coefficient matrix. It introduces a fully connected (FC) layer (i.e., the self-expressive layer [18]) without activation function or bias between the encoder and the decoder. The weights of the self-expressive layer can be considered as the self-expressive coefficient matrix. The loss of the self-expressiveness module is defined as

L_SE = ||C||_p + (1/2) ||Z − ZC||_F^2, s.t. diag(C) = 0

where C ∈ R^{N×N} denotes the parameters of the self-expressive layer, ||C||_p is a regularization term that encourages C to have a block-diagonal structure, and (1/2)||Z − ZC||_F^2 is the self-expressiveness term. The constraint diag(C) = 0 forces the diagonal elements of C to be zero, which eliminates the trivial solution.
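A minimal NumPy sketch of this objective, assuming an ℓ1 regularizer and an explicitly zeroed diagonal; `lam` and the function name are illustrative:

```python
import numpy as np

def self_expressive_loss(Z, C, lam=1.0):
    """Sketch of the self-expressiveness objective:
    lam * ||C||_1 + 0.5 * ||Z - Z @ C||_F^2, with diag(C) forced to zero."""
    C = C - np.diag(np.diag(C))          # enforce diag(C) = 0 (no trivial self-representation)
    reg = np.abs(C).sum()                # l1 regularizer encourages a sparse C
    rec = 0.5 * np.linalg.norm(Z - Z @ C, ord="fro") ** 2
    return lam * reg + rec
```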

C. Self-Expressive Coefficient Matrix Initialization Module
This module is used to provide a good initial self-expressive coefficient matrix C for the self-expressive layer, so as to address the issue of high dictionary redundancy. For this purpose, the correlated atoms need to be selected to represent the target features, whereas the uncorrelated atoms should be suppressed. However, in recent DSC methods [19], [20], all elements of the self-expressive coefficient matrix are initialized with the same nonzero value, which tends to induce all the atoms in the global self-expressive dictionary to express each target feature. Consequently, this initialization approach cannot address the problem of high redundancy of the self-expressive dictionary atoms, which can degrade the clustering performance. This relationship can be clearly observed from the self-expressiveness property of the features:

z_j = Σ_{i≠j} C_{i,j} z_i + e_j

where z_i and z_j are the ith and jth columns of Z, denoting an atom of the self-expressive dictionary and the target feature, respectively, and e_j denotes the noise. Since all elements are initialized such that C_{i,j} ≠ 0, z_j is linearly expressed by all the atoms.

Based on the fact that similar HSI samples are more likely to lie in the same subspace [12], the KNN graph is used to model the spectral similarity between each sample and its neighbors. Let G = (V, E) be an undirected graph, where V and E denote the set of nodes and the set of edges, respectively. The adjacency matrix A of G is defined as

A_{i,j} = 1 if x_j ∈ N_k(x_i), and A_{i,j} = 0 otherwise

where N_k(x_i) denotes the set of the k nearest neighbors of the sample x_i. If A_{i,j} ≠ 0, sample i and sample j are similar and thus likely to lie in the same subspace; correspondingly, the features of these two samples are also likely to lie in the same subspace. However, it is unreasonable to directly use the adjacency matrix to initialize the self-expressive coefficient matrix, since the self-expressive coefficients are generally smaller than 1.
Moreover, the weights of neural networks are generally initialized to small random values [21], [22]. As a result, we update the adjacency matrix as follows:

A_{i,j} = y if A_{i,j} = 1, and A_{i,j} = 0 otherwise

where y is a small random value sampled from a uniform distribution U[a, b]. In this way, the matrix A not only retains the structure of the adjacency matrix, but also meets the requirement that the coefficients are smaller than 1. Consequently, A can be used to induce the most correlated atoms to represent the target features. The flowchart of the initialization strategy is illustrated in Fig. 2.
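The two steps above can be sketched in NumPy as follows; the neighborhood size `k` and the bounds `low`/`high` are illustrative stand-ins for k and U[a, b]:

```python
import numpy as np

def init_coefficient_matrix(X, k=3, low=1e-4, high=1e-3, seed=0):
    """Sketch of the adaptive initialization: build a kNN graph from spectral
    similarity, then replace its nonzero entries with small uniform random values.

    X : (m, N) samples, one per column.
    """
    rng = np.random.default_rng(seed)
    N = X.shape[1]
    # pairwise Euclidean distances between spectra
    d = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    A = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(d[i])               # ascending distance; position 0 is i itself
        for j in nbrs[1:k + 1]:               # k nearest neighbors, excluding i
            A[i, j] = rng.uniform(low, high)  # small random value drawn from U[a, b]
    return A
```

The resulting A has exactly k small nonzero entries per row and a zero diagonal, so only the spectrally similar atoms start with a nonzero coefficient.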

D. Spectral Clustering Module
The spectral clustering module is used to generate clustering results. The parameters of the self-expressive layer (i.e., C) are employed to construct an affinity matrix, which is formulated as

W = (1/2) (|C| + |C|^T)

where W ∈ R^{N×N} is the affinity matrix, with element W_{i,j} denoting the similarity between the ith and the jth samples. Then, the clustering results are produced by applying spectral clustering [46] to the affinity matrix.
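Assuming the common DSC-style symmetrization W = (|C| + |C|^T)/2, the construction reduces to a one-liner:

```python
import numpy as np

def build_affinity(C):
    """Symmetric affinity from the learned self-expressive coefficients,
    assuming the common construction W = (|C| + |C|^T) / 2."""
    return 0.5 * (np.abs(C) + np.abs(C).T)

C = np.array([[0.0, 0.4], [-0.2, 0.0]])
W = build_affinity(C)  # [[0.0, 0.3], [0.3, 0.0]]
```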

E. Self-Supervised Learning-Based Classification Module
Since affinity matrix learning and spectral clustering are performed independently in DSC-based methods [18], [19], the features extracted from the convolutional encoder cannot be well adapted to the subsequent spectral clustering due to the lack of effective supervision. To handle this problem, we use the cluster assignments generated from the spectral clustering as pseudo-labels to supervise the feature learning.
1) Self-Supervised Feature Learning: Inspired by [20], two FC layers are introduced on top of the encoder as p × n_1 × n_2 × n, where n_1 and n_2 are the numbers of neurons in the first and the second FC layers, respectively (see Fig. 1). The FC layers serve as a classifier that is trained with pseudo-labels and back-propagates to the encoder. The output of the FC layers is a multi-class prediction produced by a softmax function:

P(i | R) = softmax(WR + b)_i

where R is the output of the encoder, W and b are the weights and biases of the FC layers, and P(i | R) denotes the probability that the input belongs to the ith category.
Let Q ∈ {0, 1}^{n×N} denote the output of the spectral clustering, with each column q_i denoting the cluster assignment of the ith sample. The output of the spectral clustering is fed into the FC layers and used as self-supervision information. We combine the cross-entropy loss and the center loss to train the convolutional encoder. The loss function is defined as

L_3 = −Σ_i q_i^T log(y_i) + (1/2) Σ_i ||y_i − c_{y_i}||_2^2

where w denotes the weights of the FC layers, y_i is the output of the FC layers for sample i calculated by (7), and c_{y_i} is the corresponding cluster center of y_i in the deep feature space. The first term of (8) is the cross-entropy loss, which makes the features of different clusters separable, and the second term of (8) is the center loss, which minimizes the distances between the deep features and their cluster centers [47]. Highly confident pseudo-labels are useful for discriminative feature extraction and beneficial to the clustering [48]. However, some of the pseudo-labels produced by spectral clustering are incorrect and may misguide the feature learning. To handle this issue, it is necessary to introduce the confidence of samples to obtain highly confident pseudo-labels.
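The combined objective can be sketched in NumPy as follows; applying the center loss to the deep features and the weight `beta` are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def supervised_loss(probs, pseudo_labels, feats, centers, beta=0.5):
    """Sketch of the cross-entropy + center-loss objective on selected samples.

    probs         : (n_cls, N) softmax outputs of the FC classifier.
    pseudo_labels : (N,) cluster indices from spectral clustering.
    feats         : (p, N) deep features from the encoder.
    centers       : (p, n_cls) cluster centers in the deep feature space.
    """
    N = probs.shape[1]
    # cross-entropy against the pseudo-labels (makes clusters separable)
    ce = -np.mean(np.log(probs[pseudo_labels, np.arange(N)] + 1e-12))
    # center loss: pull each feature toward its assigned cluster center
    diff = feats - centers[:, pseudo_labels]
    center = 0.5 * np.mean(np.sum(diff ** 2, axis=0))
    return ce + beta * center
```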
2) Selection of Samples With High Confidence: Inspired by [5], [49], we iteratively select highly confident pseudo-labels from each cluster to supervise the feature learning. Given the clustering results, the points in each cluster that are closer to their cluster center are assigned high confidence. Let U ∈ R^{N×n} be the matrix containing the first n eigenvectors of the graph Laplacian induced by the affinity matrix W as columns, and let U = {u_j | j = 1, ..., N} denote the points consisting of the rows of the matrix U. Then, the k-means algorithm is applied to the points of U to obtain the clusters A_1, ..., A_n and the corresponding cluster centers θ_1, ..., θ_n. The class-wise confidence is defined as

confidence_i = ρ · max_{u_j ∈ A_i} ||u_j − θ_i||_2^2

where ||·||_2^2 denotes the squared Euclidean distance between point j and its cluster center, and ρ ∈ (0, 1] is a parameter that controls the amount of selected samples. For each cluster A_i, the points with distances smaller than confidence_i have high confidence, and their cluster assignments are considered to be highly confident. Correspondingly, the samples with high confidence are selected in the current clustering to train the encoder. Algorithm 1 illustrates the process of sample selection in the current clustering.

Algorithm 1: Selection of Samples With High Confidence.
Input: Clusters A_1, ..., A_n and cluster centers θ_1, ..., θ_n.
1: S ← ∅.
2: for i = 1, ..., n do
3:   Calculate the class-wise confidence_i by (9).
4:   Calculate the distances between θ_i and the points of A_i.
5:   if the distance of point j is smaller than confidence_i then
6:     Sample j is selected and S ← S ∪ {j}.
7:   end if
8: end for
Output: Indexes of the selected samples S.

Note that the samples selected in the current clustering are merged with the ones selected in the previous clustering, and are no longer used as candidates in the next clustering. Empirically, once the increment of selected samples is less than 0.5% of the total number of samples N, or the number of selected samples reaches 70% of N, the selection is forced to cease.
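The selection procedure can be sketched as follows; taking the class-wise threshold as ρ times the cluster's largest squared distance is an illustrative choice standing in for (9):

```python
import numpy as np

def select_confident(U, labels, centers, rho=0.3):
    """Sketch of the per-cluster confident-sample selection.

    U       : (N, n) rows of the spectral embedding (one point per sample).
    labels  : (N,) k-means cluster assignments.
    centers : (n_clusters, n) k-means cluster centers.
    rho     : in (0, 1]; smaller rho keeps only points closest to the center.
    """
    selected = []
    for i in range(centers.shape[0]):
        idx = np.where(labels == i)[0]
        d = np.linalg.norm(U[idx] - centers[i], axis=1) ** 2
        confidence_i = rho * d.max()              # class-wise threshold
        selected.extend(idx[d < confidence_i].tolist())
    return sorted(selected)
```

Points far from their center (likely wrong pseudo-labels) are excluded, and only the retained indexes feed the classification loss.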

F. Implementation Details
The loss function of the proposed SDSC-AI method is defined as

L = L_DAE + λ_1 L_1 + λ_2 L_2 + λ_3 L_3

where L_1 = ||C||_p, L_2 = (1/2)||Z − ZC||_F^2, L_3 denotes the self-supervised classification loss, and λ_1, λ_2, λ_3 are the weights that balance the contributions of the different terms. Since the size of the datasets for HSI clustering is generally limited (e.g., on the order of thousands of samples), it is hard to directly train a network from scratch using these datasets. Therefore, the proposed network is trained following a two-stage pipeline: 1) pretrain the deep convolutional autoencoders; and 2) train the whole network by alternately performing affinity matrix learning and spectral clustering.
Algorithm 2: Training Process of SDSC-AI.
1: Pretrain the deep convolutional autoencoders.
2: Calculate and update the adjacency matrix according to (4) and (5).
3: Initialize the self-expressive coefficient matrix C.
4: Initialize the FC layers.
5: Train the self-expressive layer and construct W according to (6).
6: Perform spectral clustering to obtain U and Q.
7: while t ≤ T_max do
8:   Select samples S_t by Algorithm 1.
9:   if size(S_t) ≥ 0.005N and size(S) ≤ 0.7N then
10:    S_t ← S ∪ S_t and S ← S_t.
11:  end if
12:  Fix Q and update the remaining parts for T_0 epochs according to (10).
13:  Perform spectral clustering to update Q.
14: end while

1) Pretrain Stage: Pretraining aims to obtain good initialization weights and reduce the reconstruction difficulty in the later fine-tune stage. In the pretrain stage, we only use the reconstruction loss L_DAE to update the autoencoders.
2) Fine-Tune Stage: In this stage, we train our network end-to-end with all the losses defined in (10). The main steps of the fine-tune stage are as follows. First, the self-expressive coefficient matrix is initialized, and the self-expressive layer is trained to learn the self-expressive coefficient matrix. Second, the affinity matrix is constructed and spectral clustering is performed to obtain the cluster assignments. Third, the samples with high confidence and their corresponding cluster assignments are selected. Then, the spectral clustering module is fixed, and the remaining modules of the network are updated for T_0 epochs. Finally, spectral clustering is performed to update the cluster assignments. The affinity matrix learning and spectral clustering are iteratively performed. The training process of our network is illustrated in Algorithm 2.

IV. EXPERIMENTAL RESULTS AND DISCUSSIONS
In this section, we extensively evaluate the performance of the proposed method on three widely used HSI datasets. The proposed method is implemented in TensorFlow on a PC with a 3.0 GHz Intel i7 CPU and 64-GB memory. The network is trained on an Nvidia GeForce RTX 2080 Ti GPU.

1) Datasets and Preprocessing:
Three publicly available HSI datasets (i.e., Indian Pines, Pavia University, and Salinas) are employed to validate the effectiveness of the proposed method. To build the affinity matrix for spectral clustering, most SC methods need to solve large-scale optimization problems and calculate pairwise similarities among all the samples at one time. Therefore, these methods require O(N^2) memory to store the affinity matrix and may suffer from an "out-of-memory" error in the training phase. Following [7], [12], [13], [17], [19], [23], [50], a subset of these datasets is used in our method for computational efficiency. In particular, the subset taken from the Salinas dataset is also known as the Salinas-A dataset.
- The subset of the Pavia University dataset, collected by the reflective optics spectrographic image system (ROSIS) sensor, contains 103 spectral reflectance bands with a size of 200×100 pixels. Eight classes are considered in this dataset, including asphalt, meadows, tree, metal sheet, bare soil, bitumen, bricks, and shadows.
- The Salinas-A dataset, captured by AVIRIS, contains 224 bands with a size of 86×83 pixels. Six different classes of crops are considered in this dataset, including Brocoli_gw1, Corn_sgw, Lettuce_r4, Lettuce_r5, Lettuce_r6, and Lettuce_r7.
We briefly denote the three datasets as IPS, PUS, and SA, respectively. Table I reports the ground truth of the three datasets. For all datasets, principal component analysis is performed before the training process to reduce the computational cost, and we fix the number of reduced spectral bands to 4. Furthermore, we use 15×15, 9×9, and 9×9 as the sizes of the spatial windows to obtain image patches for the three datasets, respectively. The influence of different window sizes is analyzed in Section IV-F.
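The preprocessing (PCA to 4 spectral bands, then per-pixel spatial patch extraction) can be sketched as follows; the reflect padding at the image border is an illustrative assumption:

```python
import numpy as np

def extract_patches(cube, n_components=4, win=9):
    """Sketch of the preprocessing: reduce the spectral bands with PCA, then
    cut a win x win spatial patch around each pixel of the (H, W, B) cube."""
    H, W, B = cube.shape
    flat = cube.reshape(-1, B).astype(float)
    flat -= flat.mean(axis=0)                      # center spectra for PCA
    # PCA via SVD of the centered data matrix
    _, _, Vt = np.linalg.svd(flat, full_matrices=False)
    reduced = (flat @ Vt[:n_components].T).reshape(H, W, n_components)
    pad = win // 2
    padded = np.pad(reduced, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    patches = np.stack([padded[i:i + win, j:j + win]
                        for i in range(H) for j in range(W)])
    return patches                                 # (H*W, win, win, n_components)
```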
2) Compared Methods: We compare the proposed method with several existing HSI clustering methods, including FCM [25], SSC [11], LRSC [34], SSCS [12], S⁴C [12], DLSS [51], TV [14], RMMF [9], LSSD [23], DSC [18], and S²CSC [20]. Since the results of some compared methods are difficult to reproduce on different datasets, we compare the proposed method with the corresponding state-of-the-art methods on different datasets (i.e., the SSCS and S⁴C methods on the IPS dataset; the TV and LSSD methods on the PUS dataset; and the DLSS method on the SA dataset). Both visual results and quantitative metrics [i.e., the overall accuracy (OA), the average accuracy (AA), and the Kappa coefficient (Kappa)] are used for performance evaluation. The OA is defined as

OA = (1/N) Σ_{i=1}^{N} δ(g_i, map(y_i))

where g_i is the ground-truth label, y_i is the cluster assignment of sample x_i generated by the clustering algorithm, δ(a, b) equals 1 if a = b and 0 otherwise, and map is a mapping function that ranges over all possible one-to-one mappings between cluster assignments and ground-truth labels. The optimal mapping function can be computed by the Hungarian algorithm [52]. An implementation of the mapping function can be found in the publicly available code of the DSC method [18].1 In addition, the running time is reported to evaluate the efficiency of the different methods.
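The OA with the optimal cluster-to-label mapping can be sketched with SciPy's Hungarian solver (`linear_sum_assignment`); the function name is illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def overall_accuracy(gt, pred):
    """OA under the best one-to-one mapping between cluster ids and
    ground-truth labels, found by the Hungarian algorithm."""
    n = max(gt.max(), pred.max()) + 1
    cost = np.zeros((n, n), dtype=int)
    for g, p in zip(gt, pred):
        cost[p, g] += 1                      # contingency table: cluster vs. label
    row, col = linear_sum_assignment(-cost)  # negate to maximize matched samples
    mapping = dict(zip(row, col))
    return np.mean([mapping[p] == g for g, p in zip(gt, pred)])

gt = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([1, 1, 0, 0, 2, 2])          # clusters permuted but consistent
print(overall_accuracy(gt, pred))  # 1.0
```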

3) Networks Architecture and Parameter Setting:
Since the number of samples in each dataset for HSI clustering is limited (i.e., 4391, 6445, and 5348 for the IPS, PUS, and SA datasets, respectively), the proposed network is expected to have fewer parameters to avoid overfitting. Therefore, the channel numbers of the deep convolutional autoencoders are set to 32-32-16-16-32-32 (see Table II). The kernel size of all the convolutions in the autoencoders is set to 3 × 3. The rectified linear unit (ReLU) is used as the activation function. Batch normalization is used after the convolutional layers except for the last layer of the encoder and the last layer of the decoder. The ADAM optimizer is used with the learning rate set to 2 × 10^-4. We set the maximum number of training epochs T_max = 200. We update the encoder, the decoder, the self-expressive layer, and the FC layers for T_0 = 20 epochs and then update the spectral clustering once. For the FC layers, we set n_1 = 1024 and n_2 = n. We use all the samples in each dataset to generate a batch during the training phase.
We use the ℓ1-norm to define the sparse regularization term ||C||_p in all experiments. The values of the trade-off parameters λ_1, λ_2, and λ_3 are set to 0.001, 100, and 2000, respectively, and ρ is set to 0.3, 0.1, and 0.1 for the IPS, PUS, and SA datasets, respectively. The number of nearest neighbors k is set to 120, 40, and 120 for the three datasets, respectively.
1 https://github.com/panji1990/Deep-subspace-clustering-networks

B. Results and Analyses
1) Qualitative Results: The cluster maps generated by the different clustering methods are visualized in Figs. 3-5. Compared with the other methods, the cluster maps produced by our method are very close to the ground-truth map, which clearly validates the effectiveness of our method. Taking the IPS dataset as an example, most methods cannot separate the four classes and generate many misclassifications due to the similar spectral signatures of the land-cover classes. In contrast, our method can better distinguish the four land-cover classes. In particular, the "Grass" and "Soybean_n_t" classes are completely separated. For both the PUS and SA datasets, there are no misclassifications in the cluster maps generated by our method.
2) Quantitative Results: The quantitative results achieved by the different methods are summarized in Tables III-V, where the best results in each row are highlighted in italics. The following conclusions can be drawn from these results.
- The proposed method achieves the best clustering performance in terms of OA, AA, and Kappa on all three datasets. It should be noticed that the OAs, AAs, and Kappas achieved by our method are 100% on both the PUS and SA datasets, which outperforms all the compared methods by a notable margin.
- Compared with S²CSC [20], the proposed method significantly improves the clustering performance on all three datasets. Specifically, the OA, AA, and Kappa on the IPS dataset are improved by our method by 20.34%, 11.31%, and 27.15%, respectively. This is because the proposed method can precisely represent each target feature with the correlated atoms, and select highly confident pseudo-labels to facilitate the self-supervised learning of discriminative features.
- The proposed method is robust to class imbalance. Note that the class distribution is imbalanced on all three datasets; e.g., the numbers of samples of the "tree" and "bricks" classes are 63 and 94 on the PUS dataset, respectively, which makes the clustering more challenging. Most methods cannot perform well on these two land-cover classes. For example, the DSC method [20] misclassifies the "Bricks" class into the "Asphalt" and "Bare soil" classes. In contrast, our method accurately recognizes these three classes.
- Compared with the clustering methods that only use spectral information (FCM [25], SSC [11], LRSC [34], TV [14], and DLSS [51]), the spectral-spatial clustering methods (SSCS [12], S⁴C [12], LSSD [23], DSC [18], S²CSC [20], and SDSC-AI) achieve better performance in terms of OA, AA, and Kappa. Taking the IPS dataset as an example, the SC-based methods such as SSC and LRSC achieve relatively low accuracies, especially for the "Corn_no_till" and "Soybeans_n_t" classes. This is because these methods only focus on the spectral information, while the spectral signatures of the four land-cover classes are very similar and difficult to distinguish. In contrast, SSCS, S⁴C, DSC, S²CSC, and our method significantly improve the clustering accuracy by incorporating the spatial neighborhood information.

C. t-SNE Feature Visualization
To investigate whether the features extracted by the proposed method are discriminative and beneficial to the SC, we use the t-distributed stochastic neighbor embedding (t-SNE) [53] approach to visualize the raw samples, the features produced by the DSC method [18], and the features produced by the SDSC-AI method on all datasets for comparison. First, as shown in Fig. 6(a), (d), and (g), the raw samples are mixed on the three datasets (especially on the IPS dataset) due to their similar spectral signatures; therefore, it is difficult to separate the land-cover classes in the original space. Second, even where the interclass boundaries are apparent in the raw sample distribution, the distances between intraclass samples and their cluster centers are large (e.g., the "Corn_sgw," "Lettuce_r4," and "Lettuce_r5" classes on the SA dataset). Third, although the features extracted by the DSC method can be separated on the SA dataset, they cannot be separated on the IPS and PUS datasets [see Fig. 6(b), (e), and (h)] due to the large spectral variability on these two datasets. In contrast, by introducing self-supervised learning, the features extracted by our method are well separated on all three datasets [see Fig. 6(c), (f), and (i)]. In particular, the features are completely separated on both the PUS and SA datasets. It can also be observed that the features learned by our method are both interclass dispersed and intraclass compact, since the center loss penalizes the distances between the deep features and their cluster centers [47].
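A minimal sketch of how such a 2-D embedding is obtained with the scikit-learn implementation of t-SNE; the data here are random stand-ins for spectra or deep features, not the actual HSI datasets.

```python
import numpy as np
from sklearn.manifold import TSNE

# Embed high-dimensional features into 2-D with t-SNE so that cluster
# structure can be inspected visually (e.g., as a colored scatter plot).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))      # stand-in for spectra or deep features
emb = TSNE(n_components=2, init="pca", perplexity=30,
           random_state=0).fit_transform(X)
# emb has one 2-D coordinate per input sample.
```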

D. Affinity Matrix Visualization
Generally, the affinity matrix represents the similarity between each pair of samples and is constructed from the self-expressive coefficient matrix. If a group of samples lies in the same subspace, the corresponding self-expressive coefficients are nonzero; otherwise, they are zero. Hence, an ideal affinity matrix is sparse and block-diagonal, with each block signifying a land-cover class. To further demonstrate the effectiveness of the proposed method, we visualize the affinity matrices learned by both the DSC method [18] and the proposed method on all three datasets. As shown in Fig. 7, for all datasets, we can clearly observe that the affinity matrices obtained from the proposed method are superior to those obtained from the DSC method due to their sparsity and apparent block-diagonal structure. This clearly demonstrates that the proposed method can accurately represent each feature with the correlated atoms in the same subspace, and that the deep features learned by the convolutional encoder benefit the affinity matrix learning. Note that "-AI" denotes methods trained with the adaptive self-expressive coefficient initialization, and "SDSC" denotes methods trained with self-supervised learning using the selected samples.
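The standard construction of an affinity matrix from a self-expressive coefficient matrix C (with X ≈ XC) can be sketched as follows; the symmetrization step is common practice in subspace clustering, though the exact post-processing (thresholding, normalization) varies between methods and may differ from the paper's.

```python
import numpy as np

def affinity_from_coefficients(C):
    """Turn self-expressive coefficients into a symmetric affinity matrix."""
    C = np.abs(C)
    np.fill_diagonal(C, 0.0)     # a sample should not explain itself
    return 0.5 * (C + C.T)       # symmetrize for spectral clustering

# Two ideal subspaces of two samples each -> block-diagonal affinity.
C = np.array([[0.0, 0.9, 0.0, 0.0],
              [0.9, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.8],
              [0.0, 0.0, 0.8, 0.0]])
A = affinity_from_coefficients(C)
```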

E. Ablation Study
To evaluate the impact of the adaptive self-expressive coefficient matrix initialization and the sample selection in self-supervised learning, we perform an ablation study on all three datasets, using the S²CSC method [20] as the baseline. As shown in Table VI, the adaptive self-expressive coefficient matrix initialization significantly improves the clustering accuracies on all three datasets: it yields OA improvements of 18.36%, 1.44%, and 0.75% on the IPS, PUS, and SA datasets, respectively. Compared with the original S²CSC method, the sample selection in self-supervised learning also improves the clustering accuracies, with OA improvements of 0.5%, 1.44%, and 0.47% on the IPS, PUS, and SA datasets, respectively. Note that the clustering OA achieved by using the sample selection alone on the IPS dataset is relatively low. This is because the pseudo-labels obtained by spectral clustering are of low confidence due to the low clustering OA at the beginning of the training process; therefore, it is difficult to select highly confident pseudo-labels. With the help of the adaptive self-expressive coefficient matrix initialization, the clustering accuracies are significantly improved and the pseudo-labels become highly confident.

F. Parameter Sensitivity Analyses
In this section, we conduct experiments to investigate the influence of the important parameters on the clustering performance.
1) Impact of Spatial Window Size: Since the spatial information around the center pixel is crucial to spectral-spatial feature extraction [54], the size of the spatial window can influence the clustering performance. The impact of the spatial window size on the clustering results is presented in Fig. 8. It can be observed that the highest OAs are achieved when the window size is 15 and 9 on the IPS and PUS datasets, respectively. When the window size is larger than 9, the OA achieved by our method is 100% on the SA dataset. Generally, the image patch can exploit more spatial information with an increased window size. However, a large window size increases the computational burden and introduces noise [55]. Therefore, we set the window size to 15, 9, and 9 for all the experiments on the IPS, PUS, and SA datasets, respectively.
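Extracting the w × w spatial window around a center pixel can be sketched as below; this is an illustrative implementation that uses reflect padding at the image borders (the paper does not specify its padding scheme), with a random cube standing in for real HSI data.

```python
import numpy as np

def extract_patch(cube, row, col, w):
    """Return the w x w spatial window centered at (row, col) from an
    HSI cube of shape (H, W, bands), padding the borders by reflection."""
    r = w // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="reflect")
    return padded[row:row + w, col:col + w, :]

cube = np.random.rand(20, 20, 103)   # stand-in for a PUS-like data cube
patch = extract_patch(cube, 0, 0, 9) # corner pixel: padding fills the window
```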
2) Impact of ρ: Parameter ρ controls the number of selected samples in the self-supervised feature learning. Fig. 9 illustrates the impact of ρ on the clustering performance, where we set ρ ∈ {0.1, 0.3, 0.5, 0.7, 0.9, 1.0}. It can be clearly observed that the proposed method is insensitive to ρ, since the clustering results are stable with respect to its different values. The proposed method achieves the highest OAs when ρ ∈ {0.3, 0.5, 0.7}, ρ = 0.1, and ρ = 0.1 on the IPS, PUS, and SA datasets, respectively. Note that when ρ = 1.0, the proposed method uses all the cluster assignments to supervise the training of the feature extraction, and the clustering performance degrades to different degrees on the three datasets. This is because, by setting ρ = 1.0, some low-confidence pseudo-labels are adopted to supervise the network training. Hence, selecting highly confident cluster assignments as pseudo-labels is important to the self-supervised learning.
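One possible realization of confidence-based selection with ratio ρ is sketched below; here, confidence is measured by distance to the cluster center, which is an assumption made for illustration and may differ from the paper's exact criterion.

```python
import numpy as np

def select_confident(features, labels, centers, rho):
    """Keep the rho fraction of samples per cluster that lie closest to
    their cluster center (assumed confidence measure, for illustration)."""
    keep = np.zeros(len(labels), dtype=bool)
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        dist = np.linalg.norm(features[idx] - centers[c], axis=1)
        n_keep = max(1, int(round(rho * len(idx))))
        keep[idx[np.argsort(dist)[:n_keep]]] = True  # most confident samples
    return keep

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 8))
labels = rng.integers(0, 4, size=100)
centers = np.stack([feats[labels == c].mean(0) for c in range(4)])
mask = select_confident(feats, labels, centers, rho=0.5)
```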
3) Impact of Distribution Interval: We randomly select several distribution intervals to investigate the influence of the distribution interval [a, b] on the clustering performance. As shown in Table VII, for all datasets, the proposed method achieves the highest OAs when the distribution interval falls into the range of [0.001, 0.008]. Moreover, when the interval covers zero (e.g., [−0.04, 0.001]), the clustering performance degrades, since the structure of the initialized self-expressive coefficient matrix is changed. Finally, it can be seen that the clustering OAs degrade significantly when a ≥ 0.01. Therefore, we set the distribution interval to [0.001, 0.008] for all the experiments on the three datasets.

4) Impact of k: Since the number of nearest neighbors k plays an important role in constructing the KNN graph and controls the number of correlated atoms of each target feature, we set k ∈ {10, 40, 80, 120, 160, 200, 240} as in [14] and conduct sensitivity experiments on the three datasets. The influence of k on the clustering results is shown in Fig. 10. It can be seen that the optimal value of k varies across datasets. Moreover, the clustering performance tends to saturate with increasing k. However, since a large k increases the computational burden and the dictionary redundancy, we set k = 120, 40, and 120 for all the experiments on the IPS, PUS, and SA datasets, respectively.
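The adaptive initialization discussed above can be sketched as follows: for each sample, the k spectrally nearest neighbors are found, and the corresponding coefficients are drawn uniformly from the interval [a, b] = [0.001, 0.008], with all other entries (including the diagonal) set to zero. This is an illustrative sketch whose details may differ from the paper's implementation.

```python
import numpy as np

def init_coefficients(X, k, a=0.001, b=0.008, seed=0):
    """Initialize the self-expressive coefficient matrix from the KNN graph:
    nonzero entries only at each sample's k nearest spectral neighbors."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Pairwise Euclidean distances between spectra.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude self-connections
    C = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d[i])[:k]      # k nearest neighbors of sample i
        C[i, nbrs] = rng.uniform(a, b, size=k)
    return C

X = np.random.rand(50, 20)               # stand-in for 50 spectra
C0 = init_coefficients(X, k=5)
```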

G. Convergence Analysis
To show the convergence of our network, we conduct a convergence experiment on the IPS dataset. The maximum number of training iterations is set to 200, and the clustering OA is computed every 20 iterations. As shown in Fig. 11, the loss values fluctuate in the early training phase, then decrease rapidly and become stable. Meanwhile, the clustering OA increases gradually and saturates once the number of iterations exceeds 120. Therefore, the network converges well within 200 iterations on the IPS dataset. In all experiments, we report the clustering results of the last iteration.

H. Running Time
We investigate the running time of the different HSI clustering methods. As reported in Table VIII, conventional SC-based methods (SSC [11], LRSC [34], TV [14], SSC-S, and S⁴C [12]) take more time than the other clustering methods since they iteratively compute the representation coefficient matrices. Compared with the deep learning-based methods (LSSD [23] and DSC [18]), our method takes more running time, since most of its computational time is spent on the iterative self-supervised learning process and the initialization. Moreover, compared with S²CSC [20], our method spends additional time on sample selection and initialization. Although the FCM [25], DLSS [51], and RMMF [9] methods are very efficient, they achieve lower accuracies than our method. To sum up, our method achieves a good tradeoff between clustering performance and computational efficiency.

V. CONCLUSION
In this article, we propose an end-to-end trainable network named SDSC-AI for HSI clustering. Specifically, we introduce self-supervised learning for feature extraction to ensure that the learned features are well adapted to the subsequent SC. Moreover, we design a spectral-similarity-based adaptive self-expressive coefficient matrix initialization strategy to enhance the clustering performance. The experimental results demonstrate the superiority of the proposed method compared with several state-of-the-art HSI clustering methods. However, to build the affinity matrix for spectral clustering, the proposed method needs to feed all samples in one batch to train the network, which makes it difficult to scale to large HSI data. This scalability problem will be studied in our future work.