MS$^{2}$A-Net: Multiscale Spectral–Spatial Association Network for Hyperspectral Image Clustering

Remote sensing hyperspectral cameras acquire high spectral-resolution data that reveal valuable composition information on the targets (e.g., for Earth observation and environmental applications). The intrinsic high dimensionality and the lack of sufficient numbers of labeled/training samples prevent efficient processing of hyperspectral images (HSIs). HSI clustering can alleviate these limitations. In this study, we propose a multiscale spectral–spatial association network (MS$^{2}$A-Net) to cluster HSIs. The backbone of MS$^{2}$A-Net is an autoencoder architecture that allows the network to capture the nonlinear relation between data points in an unsupervised manner. The network applies a multistream approach. One stream extracts spectral information by deploying a spectral association unit. The other stream derives multiscale contextual and spatial information by employing dilated (atrous) convolutional kernels. The feature representation generated by MS$^{2}$A-Net is fed into a standard k-means clustering algorithm to produce the final clustering result. Extensive experiments on four HSIs for different types of applications (i.e., geological-, rural-, and urban-mapping) demonstrate the superior performance of MS$^{2}$A-Net over the state-of-the-art shallow/deep learning-based clustering approaches in terms of clustering accuracy.

covers the visible and near-infrared (VNIR) to the shortwave infrared (SWIR) range of the spectrum (0.38-2.50 μm) to enable users to observe and monitor materials and organisms of interest [6]. Visual interpretation and traditional approaches for processing HSIs require large amounts of manpower, time, and expense [7]. The ever-growing demand for utilizing HSIs encourages researchers to design and develop fast, yet robust analytical approaches. Recently, there has been tremendous progress in the development of supervised and unsupervised machine/deep learning approaches to analyze HSIs. Such approaches have been successfully deployed to accomplish various HSI analysis tasks (e.g., feature extraction, classification, clustering). Despite the satisfactory performance obtained by supervised approaches, they require a considerable number of labeled/training samples, which are not always easy to obtain. Thus, in recent years, unsupervised approaches have been receiving more attention [8].
One of the main tasks of unsupervised learning is clustering data points with similar characteristics into separate groups. For HSIs, the main objective is to group pixels that share similar spectral/spatial characteristics into distinct clusters. Overall, HSI clustering approaches can be categorized into two general groups (i.e., conventional shallow learning and deep learning). Conventional shallow learning (CSL) clustering approaches constitute the largest group [8], [9], [10], while deep learning (DL) clustering techniques have been developed more recently [11], [12]. The most widely used CSL-based clustering approach is the k-means clustering algorithm, which iteratively clusters the data points by alternately assigning data points to the nearest cluster centroids and updating the cluster centroids, until convergence [13]. Density-based clustering techniques identify clusters by calculating the local densities of the feature space, assuming that each dense area represents a cluster [14]. In the last decade, sparse subspace clustering (SSC)-based approaches received significant attention [9], [10], [15], [16], [17]. These methods cluster data points based on the self-expressiveness property, which indicates that each data point can be written as a linear combination of other data points from the same subspace [9]. SSC-based approaches initially generate a sparse representation of the data and then compute a similarity matrix on which spectral clustering is applied [18]. Despite the great success of CSL-based clustering approaches, their performance tends to deteriorate when it comes to processing complex datasets (e.g., HSIs), as they merely assume a linear relation between data points [8], [19].
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Due to the recent advances in computational technologies (e.g., GPUs) and inheriting concepts from the human neural network, DL-based clustering approaches have been developed that attain superior performance compared to CSL-based approaches [8], [20]. Autoencoder (AE)-based networks are the pioneers of DL-based clustering approaches. A simple AE architecture consists of an encoder, which extracts latent features from the original dataset, and a decoder, which reconstructs the original dataset from the extracted latent features. Clustering is then performed on the latent features. The deep clustering network (DCN) is a representative DL-based clustering approach, which aims to learn k-means friendly features by minimizing both clustering and reconstruction losses simultaneously [11]. To further improve the performance of AE-based networks, convolutional AE (CAE)-based networks have been proposed [21], which reconstruct the original dataset by exploiting the spatial information. Clustering deep neural networks (CDNNs) merely use the clustering loss to train their network parameters, making it hard to extract informative and abstract features, and making them sensitive to network initialization [20]. Finally, variational AE (VAE)-based clustering approaches are generative models that force the extracted features to follow a predefined distribution [8], [12].
These developments in DL-based clustering found their way into the geoscience and remote sensing community. In [22], a deep clustering approach was proposed, which utilizes an intraclass distance constraint in its clustering loss and employs a reconstruction loss as well, leading to cluster-friendly latent features. In [23], a graph regularized residual subspace clustering network (GR-RSCNet) was proposed, which captures subspace information by learning the nonlinear relation between data points in an HSI. In [24], a spectral-spatial subspace clustering (DS$^{3}$C-Net) approach was proposed to analyze HSIs. DS$^{3}$C-Net is a multiscale approach, feeding patch blocks of different sizes with the same center pixel into parallel autoencoder networks. For the optimization, next to the reconstruction loss and a self-expressiveness loss at each individual stream, a collaborative self-expressiveness loss was employed to capture the subspace structure between various scales. The authors of [25] proposed to deploy 3-D convolutional kernels to capture the spatial information of HSIs and produce the latent features. Similar to DCN, the network parameters are optimized in accordance with both clustering and reconstruction losses. In [26], an HSI clustering network was designed to learn features by computing the set-to-set and sample-to-sample distances (LSSDs), from which the latent features are derived using different extraction approaches (i.e., pairs extraction, joint spectral-spatial feature extraction). Finally, density-based spectral clustering is applied on the learned features to produce the final clustering result.
Most of the aforementioned DL-based clustering approaches merely use the spectral information (e.g., AE, VAE). When spatial information is incorporated (e.g., CAE, DS$^{3}$C-Net), a single fixed convolutional operation has been employed [27] and less attention is paid to the spectral information [28]. To alleviate these shortcomings, and effectively exploit spectral as well as spatial information in the clustering process, the following approaches are suggested:
1) Inspired by recent studies, we propose to employ dilated (atrous) convolutional operations. These operations effectively extract spatial information with a wider field of view, and require fewer learnable parameters [29].
2) To capture subtle spectral information in the training process, we propose to include a spectral association unit into the current backbone of the networks. The spectral association unit is inspired by self-attention mechanisms and squeeze-and-excitation networks that allow the user to capture spectral information effectively [27], [28], [30].
Based on these two propositions, we present a multiscale spectral-spatial association network (MS$^{2}$A-Net) that combines a designed spectral-association stream to extract informative spectral features, and a multiscale spatial stream to capture the spatial information between data points.
The main contributions of this study can be summarized as follows:
1) To cluster HSIs in a more effective and accurate manner, spatial information is extracted with a wider field of view and fewer learnable parameters by using dilated convolutional operations.
2) Spectral information is optimally preserved in the reconstruction process, by deploying a spectral-association stream.
3) The network is optimized by fusing the spectral and spatial features and employing a loss function that contains a reconstruction loss term and a spectral mean constraint on the latent features.
To the best of our knowledge, this study is the first attempt to utilize the concepts of self-attention mechanisms and dilated convolutional operations for the purpose of HSI clustering. Experimental results on four HSIs for different types of applications (i.e., geological-, rural-, and urban-mapping) demonstrate the superior performance of MS$^{2}$A-Net over several state-of-the-art shallow/deep learning-based clustering approaches in terms of clustering accuracy.
The remainder of this article is organized as follows. In Section II, we describe the proposed approach in detail. Section III is devoted to the description of the data. The experimental results are presented and discussed in Section IV. Conclusions are provided in Section VI.
Motivated by the architecture of autoencoder-based networks and dilated convolutions [27], [29], we propose a multiscale spectral-spatial association network (MS$^{2}$A-Net) for HSI clustering. MS$^{2}$A-Net has a simple yet effective architecture, as shown in Fig. 1. MS$^{2}$A-Net aims to extract spatial and contextual information from an HSI at various scales, while preserving the spectral information. In the following section, we describe the two main streams deployed in MS$^{2}$A-Net.

A. Notation
Throughout the article, bold upper case characters ($\mathbf{X} \in \mathbb{R}^{a \times b \times c}$) denote tensors of rank 3, and $\mathbf{X}_i \in \mathbb{R}^{a \times b}$ is a matrix, denoting layer $i$ ($i = 1, \ldots, c$) of that tensor. $\mathbf{X} \in \mathbb{R}^{h \times w \times D}$ and $\mathbf{R} \in \mathbb{R}^{h \times w \times D}$ express an HSI and its reconstructed image, respectively, with spatial dimensions $h$ (height) and $w$ (width), and the number of spectral bands $D$. $\mathbf{X}_s \in \mathbb{R}^{h \times w}$ represents the $s$th spectral band of $\mathbf{X}$. A vector is denoted with a lower case character $x$, and its components as $x_i$.

B. Multiscale Spatial Stream
A normal 2-D convolutional layer on an HSI can be formulated as
$$\mathbf{M}_i = \sigma\left(\mathrm{BN}\left(\mathbf{W}_i * \sum_{s=1}^{D} \mathbf{X}_s + b_i\right)\right), \quad i = 1, \ldots, d_1. \qquad (1)$$
Here, for each value of $i$, all spectral bands of $\mathbf{X}$ are convolved with the same filter $\mathbf{W}_i \in \mathbb{R}^{r \times r}$, with a predefined kernel size $r$, after which all bands are summed up (sum and convolution can be swapped) and a bias $b_i$ is added. Then, a batch normalization (BN) is applied to guarantee a fast and stable training process. Finally, a rectified linear (ReLU) function is applied as the nonlinear mapping function $\sigma$ to obtain the $i$th extracted feature map $\mathbf{M}_i \in \mathbb{R}^{h \times w}$, with $i = 1, 2, \ldots, d_1$. The number of filters ($d_1$) is predefined by the user (in this study, $d_1 = 12$).
The receptive field of the kernels is bound to close-range neighbors. In order to capture spatial information with a wider receptive field, the kernel size of the convolutional layers can be increased, and the extracted feature maps from multiple kernel sizes can be stacked together. However, this strategy requires high computational power and results in a significant increase in the number of learnable parameters. As a remedy to this issue, we deploy dilated convolutional layers. Dilated convolutional layers with different dilation rates enlarge the receptive field and capture multiscale spatial information, while keeping the number of learnable parameters under control [27], [29]. The idea is to apply several streams of 2-D convolutional layers, with different dilation rates $l$:
$$\mathbf{M}_i^l = \sigma\left(\mathrm{BN}\left(\mathbf{W}_i^l * \sum_{s=1}^{D} \mathbf{X}_s + b_i^l\right)\right), \quad i = 1, \ldots, d_1, \qquad (2)$$
where $\mathbf{W}_i^l \in \mathbb{R}^{(lr-l+1) \times (lr-l+1)}$ represents the weights corresponding to the $i$th convolutional filter with dilation rate $l$ ($l = 1$ leads to Eq. (1)). As shown in Fig. 2, we employ three streams (a), (b), and (c), with dilation rates $l = 1$, 2, and 4, respectively. In this study, we fix $r$ to 5. Since we aim to use the multiscale extracted features both in the reconstruction and clustering processes, we concatenate $\mathbf{M}_i^l$, with $i = 1, \ldots, 12$ and $l = 1, 2, 4$, to shape $\mathbf{M} \in \mathbb{R}^{h \times w \times 36}$.
$\mathbf{M}$ is subject to a final 2-D convolutional layer, consisting of filters with $r = 1$ and stride = 2 (see Fig. 1), generating $d_2$ output features. To restore the spatial dimensions of the original HSI, upsampling with a scale factor of 2 is performed before the batch normalization and the nonlinear activation function ($\sigma$). The final extracted multiscale spatial feature map is denoted by $\mathbf{SPF} \in \mathbb{R}^{h \times w \times d_2}$.
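The relation between the dilation rate $l$ and the footprint $(lr - l + 1)$ of a dilated kernel can be illustrated with a short sketch. It is a plain-Python illustration, not the authors' implementation: the helper names `effective_kernel_size` and `dilate_kernel` are hypothetical, and the all-ones kernel is only a placeholder for learned weights.

```python
def effective_kernel_size(r, l):
    """Side length covered by an r x r kernel with dilation rate l: l*r - l + 1."""
    return l * r - l + 1


def dilate_kernel(kernel, l):
    """Spread the taps of a square kernel by inserting l - 1 zeros between them."""
    r = len(kernel)
    size = effective_kernel_size(r, l)
    dilated = [[0.0] * size for _ in range(size)]
    for a in range(r):
        for b in range(r):
            dilated[a * l][b * l] = kernel[a][b]
    return dilated


r = 5  # kernel size fixed in this study
kernel = [[1.0] * r for _ in range(r)]  # placeholder weights (assumption)
for l in (1, 2, 4):  # dilation rates of streams (a), (b), and (c)
    dk = dilate_kernel(kernel, l)
    taps = sum(1 for row in dk for v in row if v != 0.0)
    print(f"l={l}: footprint {len(dk)}x{len(dk)}, nonzero weights {taps}")
```

For $r = 5$, the footprint grows from 5 × 5 to 9 × 9 and 17 × 17, while the number of nonzero (learnable) weights per kernel slice remains 25.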

C. Spectral-Association Stream
Apart from exploiting the spatial information using the multiscale spatial stream, it is important to preserve and effectively incorporate the spectral information from the original HSI during the reconstruction process [29]. For this, MS$^{2}$A-Net employs a spectral association unit (henceforth called the spectral-association stream), which contains two phases (see Fig. 3) [31]. The initial step in the spectral-association stream is to extract the spatial information at the local level. To extract spatial information correlated with each pixel, we deploy a normal close-range 2-D convolutional layer with $d_3$ filters with $r = 3$ on $\mathbf{X}$, yielding the feature tensor $\mathbf{F} \in \mathbb{R}^{h \times w \times d_3}$ [Eq. (3)]. To make the process faster and produce a spectral-association matrix, we deploy a softmax function (soft) to rescale each output feature of $\sigma$ between 0 and 1. Subsequently, a spectral-association matrix $\mathbf{SA} = \mathrm{reshape}(\mathrm{soft}(\mathbf{X}^{T})) \times \mathrm{reshape}(\mathbf{F}) \in \mathbb{R}^{D \times d_3}$ is produced, where reshape(.) unfolds a tensor of rank 3 into a matrix with $h \cdot w$ rows, and $\times$ denotes the matrix multiplication operation. $\mathbf{SA}$ describes the contribution of each spectral band to the extracted spatial features in $\mathbf{F}$, derived in Eq. (3). Originally, the spectral-association matrix was proposed in [31] to reconstruct the original HSI. However, in this study, we propose to use $\mathbf{SA}$ to extract the spectral features $\mathbf{SAF} \in \mathbb{R}^{h \times w \times d_3}$ [Eq. (4)], where $\mathrm{reshape}^{-1}(.)$ folds a matrix into a tensor of rank 3.
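The shape bookkeeping behind the spectral-association matrix can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the softmax is assumed to be taken over the pixels of each band (the text does not specify the axis), the function names are hypothetical, and the toy values stand in for a real HSI and for the convolutional features $\mathbf{F}$.

```python
import math


def softmax(values):
    """Numerically stable softmax of a list of numbers."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def spectral_association(X_flat, F_flat):
    """X_flat: (h*w) x D pixel spectra; F_flat: (h*w) x d3 features.

    Returns a D x d3 matrix: one row per band, one column per feature.
    """
    n_pixels, D = len(X_flat), len(X_flat[0])
    d3 = len(F_flat[0])
    # Assumption: softmax over the pixel axis, so each band's weights sum to 1.
    weights = [softmax([X_flat[p][s] for p in range(n_pixels)]) for s in range(D)]
    return [
        [sum(weights[s][p] * F_flat[p][j] for p in range(n_pixels)) for j in range(d3)]
        for s in range(D)
    ]


# Toy example: 4 pixels (h*w = 4), D = 3 bands, d3 = 2 features.
X_flat = [[0.1, 0.5, 0.9], [0.2, 0.4, 0.8], [0.3, 0.3, 0.7], [0.4, 0.2, 0.6]]
F_flat = [[1.0, 0.0], [1.0, 2.0], [1.0, 4.0], [1.0, 6.0]]
SA = spectral_association(X_flat, F_flat)
print(len(SA), "x", len(SA[0]))
```

Under this assumption, each entry of the resulting matrix is a softmax-weighted average of one feature over all pixels, with weights determined by one spectral band.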

D. Reconstruction Process
To train the network in an unsupervised manner, the original HSI needs to be reconstructed. For this, the extracted spatial ($\mathbf{SPF}$) and spectral ($\mathbf{SAF}$) feature maps are concatenated to form $\mathbf{EF} \in \mathbb{R}^{h \times w \times (d_2 + d_3)}$. To let spectral and spatial information contribute equally to the reconstruction process, we set $d_2 = d_3 = 3$ in this study. However, $d_2$ and $d_3$ can be varied, depending on the application at hand. Thereafter, we feed $\mathbf{EF}$ into the decoder section, which consists of a normal close-range 2-D convolutional layer with $D$ filters of 3 × 3 kernel size, 2-D batch normalization, and the nonlinear mapping function $\sigma$, and which finally generates the reconstructed HSI $\mathbf{R}$ from $\mathbf{EF}$.

E. Optimization of MS$^{2}$A-Net
In order to train MS$^{2}$A-Net in an unsupervised manner and stabilize its performance, we designed the following loss function:
$$L = \|\mathbf{X} - \mathbf{R}\|_F^2 + \lambda \|\bar{\mathbf{X}} - \bar{\mathbf{M}}\|_F^2 \qquad (5)$$
where $L$ represents the loss function, minimized with respect to all weights and biases of the entire network. $\bar{\mathbf{X}} \in \mathbb{R}^{h \times w}$ and $\bar{\mathbf{M}} \in \mathbb{R}^{h \times w}$ denote the averages over the spectral dimension of $\mathbf{X}$ and $\mathbf{M}$, respectively. In addition, $\|.\|_F$ represents the Frobenius norm. In Eq. (5), the first term denotes the mean squared error (MSE) between the original ($\mathbf{X}$) and reconstructed ($\mathbf{R}$) images. The second term defines the spectral mean constraint, which denotes the MSE between the averaged feature values from $\mathbf{M}$ and the spectrally averaged image $\bar{\mathbf{X}}$. $\lambda$ is a tradeoff parameter to control the impact of the spectral mean constraint. Since MS$^{2}$A-Net ultimately aims to cluster the multiscale spatial features ($\mathbf{M}$), the spectral mean constraint on the generated $\mathbf{M}$ is included in the designed loss function. This ensures that the generated latent features have a direct impact on the training process, in such a way that they are forced to preserve the mean spectral information of the original image.
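The two terms of the loss described above can be sketched numerically. The sketch below is an illustration rather than the authors' code: it assumes both terms are averaged over their elements (one possible reading of "MSE"; the exact normalization is not stated here), it works on flattened pixel lists instead of tensors, and the function names are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def row_means(rows):
    """Average each pixel's vector over its last (band or feature) axis."""
    return [sum(row) / len(row) for row in rows]


def reconstruction_with_mean_constraint(X_flat, R_flat, M_flat, lam):
    """X_flat, R_flat: pixels x D; M_flat: pixels x d; lam: tradeoff parameter."""
    flat_x = [v for row in X_flat for v in row]
    flat_r = [v for row in R_flat for v in row]
    reconstruction = mse(flat_x, flat_r)
    # Spectral mean constraint: per-pixel mean of M versus per-pixel mean of X.
    constraint = mse(row_means(X_flat), row_means(M_flat))
    return reconstruction + lam * constraint


# Toy example: 2 pixels, D = 2 bands, d = 3 latent features.
X_flat = [[1.0, 3.0], [2.0, 4.0]]
R_flat = [[1.0, 3.0], [2.0, 6.0]]
M_flat = [[2.0, 2.0, 2.0], [0.0, 0.0, 0.0]]
print(reconstruction_with_mean_constraint(X_flat, R_flat, M_flat, lam=0.1))
```

Setting `lam` to 0 removes the constraint, so the latent features in `M_flat` influence the loss only through the reconstruction.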
As the final step of MS$^{2}$A-Net, the clustering is applied. To be more specific, we apply k-means clustering on the generated multiscale spatial features ($\mathbf{M}$) to cluster $\mathbf{X}$. The optimization of MS$^{2}$A-Net ensures a more effective exploitation of spatial as well as spectral information for a better description of the relations between pixels in an HSI, leading to an improved clustering map.
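The final clustering step can be sketched with a plain-Python version of the standard k-means iteration (alternating nearest-centroid assignment and centroid update). This is a minimal sketch, not the authors' code: the random initialization from the data points, the seed, and the toy 2-D points standing in for the per-pixel feature vectors of $\mathbf{M}$ are all assumptions.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # init from the data
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments unchanged: converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy "pixels": two well-separated groups of 2-D feature vectors.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

In practice, each pixel of an $h \times w$ image would contribute one 36-dimensional feature vector, and the resulting labels would be folded back into an $h \times w$ clustering map.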

III. HYPERSPECTRAL DATA DESCRIPTION
We evaluate the performance of our proposed algorithm on four real HSIs, covering three different application domains (i.e., rural, urban, and geological sites).

A. Trento Dataset
This dataset was acquired by the AISA Eagle sensor over a rural area in the south of the city of Trento, Italy. The HSI is composed of 166 × 600 pixels with a spatial resolution of 1 m, and 63 spectral bands ranging between 0.40 and 0.98 μm. The acquired HSI along with its corresponding ground truth dataset is presented in Fig. 4. The Trento dataset contains six classes: 1) Apple trees, 2) Buildings, 3) Ground, 4) Wood, 5) Vineyard, and 6) Roads.

B. Houston 2013 Dataset
This dataset was acquired over the University of Houston campus and the neighboring urban area by the Compact Airborne Spectrographic Imager (CASI) on June 23, 2012. In this work, we utilized a subset of this scene, composed of 300 × 300 pixels (indices on spatial dimensions range within [40:340,500:800]), with a spatial resolution of 2.5 m and 144 spectral bands ranging between 0.38 and 1.05 μm. The HSI along with its corresponding ground truth dataset is presented in Fig. 5. More details on the Houston 2013 dataset can be found in [32]. The Houston 2013 subset contains six classes: 1) healthy grass, 2) soil, 3) residential area, 4) road, 5) parking lot, and 6) tennis court.

C. Geological Finland Dataset
The geological Finland dataset was captured over an outcrop of the Archean Siilinjärvi glimmerite-carbonatite complex in Finland [33], by a hyperspectral frame-based camera (0.6 Mp Rikola Hyperspectral Imager), mounted on a hexacopter (Aibotix Aibot X6v2). The HSI is composed of 300 × 900 pixels and contains 50 spectral bands covering the range between 0.50 and 0.90 μm. The geological Finland dataset contains five classes: 1) Clay, 2) Glimmerite, 3) Dark-rocks (which is a mixture of soil and Glimmerite), 4) Dust, and 5) Water. The RGB image of the scene and its corresponding reference map are shown in Fig. 6. More detailed information on the geological Finland dataset can be found in [5].

D. Geological Spain Dataset
The Rio Tinto area is located 70 km north of Huelva in the Iberian Pyrite Belt, in Spain. The area has a rich mining history dating back to the Bronze Age, while currently, significant resources remain and mining operations still take place. Panoramic outcrop scans were acquired by an AISA-FENIX camera, mounted on a tripod [34]. The captured HSI is composed of 300 × 1416 pixels and 190 spectral bands, covering the range between 0.38 and 2.50 μm. The geological Spain dataset consists of eight classes: 1) Chlorite rich schist, 2) Sericite rich schist, 3) Chlorite + Sericite rich schist, 4) Chert, 5) Massive Sulphide, 6) Purple Shale, 7) Phyllite, and 8) Saprolite. The RGB image along with the corresponding ground truth of the scene is displayed in Fig. 7.

E. Evaluation Metrics
We evaluate the clustering performance of the studied approaches using three widely used classification metrics: overall accuracy (OA), average accuracy (AA), and Kappa. Note that the proposed clustering approach is entirely unsupervised, i.e., only unlabeled data are used for optimizing the network and the clustering. However, for a quantitative evaluation of the clustering results, labeled data are applied for validation. $Y = [y_1, y_2, \ldots, y_N]$ represents the true class labels. $C = [c_1, c_2, \ldots, c_N]$ denotes the obtained cluster labels, where $c_i \in \{1, \ldots, k\}$, with $k$ the number of clusters. To evaluate the clustering performance, a matching function $\hat{c}_i = \mathrm{bestMap}(y_i, c_i)$ between the predicted cluster labels and true class labels is constructed by the Hungarian algorithm [35]. Subsequently, OA is computed as
$$\mathrm{OA} = \frac{1}{N} \sum_{i=1}^{N} \delta(y_i, \hat{c}_i)$$
where $\delta(a, b)$ equals 1 if $a = b$ and 0 otherwise.
In addition, we report two commonly applied unsupervised evaluation metrics, namely, the normalized mutual information (NMI) and the adjusted Rand index (ARI). NMI is based on the common/mutual information between two clusterings and is defined as
$$\mathrm{NMI} = \frac{\sum_{i,j} n_{ij} \log \frac{n \, n_{ij}}{n_{i+} n_{+j}}}{\sqrt{\left(\sum_{i} n_{i+} \log \frac{n_{i+}}{n}\right) \left(\sum_{j} n_{+j} \log \frac{n_{+j}}{n}\right)}} \qquad (6)$$
where $n_{ij} = |c_i \cap y_j|$, $n_{i+}$ and $n_{+j}$ are defined as $\sum_{j} n_{ij}$ and $\sum_{i} n_{ij}$, respectively, and $n$ is the total number of samples. In order to compare different approaches, the mutual information is normalized between 0 and 1 [36].
ARI computes the similarity (or dissimilarity) between two clusterings and is adapted from the original Rand index [37]. It is defined as
$$\mathrm{ARI} = \frac{\sum_{i,j} \binom{n_{ij}}{2} - \left[\sum_{i} \binom{n_{i+}}{2} \sum_{j} \binom{n_{+j}}{2}\right] \Big/ \binom{n}{2}}{\frac{1}{2} \left[\sum_{i} \binom{n_{i+}}{2} + \sum_{j} \binom{n_{+j}}{2}\right] - \left[\sum_{i} \binom{n_{i+}}{2} \sum_{j} \binom{n_{+j}}{2}\right] \Big/ \binom{n}{2}}$$
The value of ARI is at most 1 and can be negative, in which case the two clusterings are less similar than what can be expected from a random result.
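The three label-based metrics can be sketched in plain Python from the contingency counts $n_{ij}$. This is an illustrative sketch, not the evaluation code used in the study: the best cluster-to-class mapping is found by brute force over permutations (practical only for a handful of clusters) instead of the Hungarian algorithm, and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def overall_accuracy(y_true, y_pred):
    """OA after the best one-to-one cluster-to-class mapping (brute force)."""
    classes = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    # Pad with None so surplus clusters can stay unmatched.
    targets = classes + [None] * max(0, len(clusters) - len(classes))
    best = 0
    for perm in permutations(targets, len(clusters)):
        mapping = dict(zip(clusters, perm))
        best = max(best, sum(mapping[c] == y for y, c in zip(y_true, y_pred)))
    return best / len(y_true)


def nmi(y_true, y_pred):
    """Normalized mutual information from the contingency counts n_ij."""
    n = len(y_true)
    n_ij = Counter(zip(y_pred, y_true))
    n_i, n_j = Counter(y_pred), Counter(y_true)
    num = sum(v * math.log(n * v / (n_i[i] * n_j[j])) for (i, j), v in n_ij.items())
    den = math.sqrt(
        sum(v * math.log(v / n) for v in n_i.values())
        * sum(v * math.log(v / n) for v in n_j.values())
    )
    return num / den if den > 0 else 0.0


def ari(y_true, y_pred):
    """Adjusted Rand index from pair counts."""
    n = len(y_true)
    sum_ij = sum(math.comb(v, 2) for v in Counter(zip(y_pred, y_true)).values())
    sum_i = sum(math.comb(v, 2) for v in Counter(y_pred).values())
    sum_j = sum(math.comb(v, 2) for v in Counter(y_true).values())
    expected = sum_i * sum_j / math.comb(n, 2)
    max_index = 0.5 * (sum_i + sum_j)
    if max_index == expected:  # degenerate case, e.g., a single cluster
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


y = [0, 0, 1, 1, 2, 2]
c = [2, 2, 0, 0, 1, 1]  # same partition, permuted cluster labels
print(overall_accuracy(y, c), nmi(y, c), ari(y, c))
```

All three metrics are invariant to how the clusters are numbered, which is why a relabeled but otherwise identical partition scores as a perfect match.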

A. Implementation Details
We implemented MS$^{2}$A-Net in Python, version 3.8, using the PyTorch library on a workstation with an i9-7900X CPU, 128 GB RAM, and an NVIDIA GeForce RTX 2080 Ti 11 GB GPU. We adopted the Adam optimizer with default parameters for both streams (i.e., spectral-association and spatial multiscale). The parameters of the Adam optimizer are set as: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, and weight decay = 0. The implementation of MS$^{2}$A-Net is available online at: https://github.com/Kasra2020/MS2A-Net.

B. Comparison to State-of-the-art Clustering Approaches
In this section, we provide a quantitative and qualitative assessment of the obtained clustering results from the studied clustering approaches on different datasets. All the experiments are reported and analyzed based on five runs, with different random initialization of the learnable parameters. Please note that for each dataset, the entire ground truth dataset was utilized for validation during the performance evaluation.
We compare the clustering performance of our proposed MS$^{2}$A-Net to ten other representative CSL/DL-based clustering approaches: 1) Traditional CSL-based clustering approaches: k-means [13] and spectral clustering on the sparse coefficients (SC-SC) [18]. 2) Advanced sparse subspace-based clustering approaches that have proven to be effective for analyzing complex datasets: hierarchical sparse subspace clustering (HESSC) [38], scalable exemplar-based subspace clustering (ESC) [10], and elastic net subspace clustering (EnSC) [39]. 3) DL-based clustering approaches: AE [40], CAE [40], VAE [12], DCN [11], and deep multiresolution clustering network (DMC-Net). In AE, CAE, VAE, DCN, and DMC-Net, the clustering results are generated by employing k-means on the latent features. To have a fair comparison, the number of latent features for all aforementioned DL-based approaches is set to 36.
1) Quantitative Results on Trento Dataset: Table I reports the quantitative assessment of the studied clustering approaches applied on the Trento dataset. Among the CSL-based approaches, k-means and HESSC obtained the highest OAs (57.95% and 56.76%, respectively), while among the DL-based clustering approaches, the approaches that incorporate spatial and contextual information (i.e., CAE, DMC-Net, and MS$^{2}$A-Net) attained higher OAs. Among the approaches that merely use spectral information (i.e., AE, VAE, and DCN), DCN is superior and obtained results comparable to CAE. The inferior performance of VAE (OA = 50.12%) indicates that better hyperparameter tuning for such an approach is required. Overall, DMC-Net and MS$^{2}$A-Net yielded the highest OAs (82.35% and 84.16%, respectively). This demonstrates a substantial performance improvement by deploying a multiscale spatial stream. However, integrating spectral information by employing a spectral-association stream improved the clustering performance even further.
With respect to individual class accuracies, most approaches failed to capture the "Ground" class accurately, except for EnSC (97.29%). One can argue that the cause of this problem is insufficient test samples for the "Ground" class.
2) Quantitative Results on Houston 2013 Dataset: We quantitatively assessed the clustering performance of the studied approaches on the Houston 2013 dataset (see Table II). Among the CSL-based approaches, ESC obtained the highest OA (67.18%), revealing that the selected exemplars are sufficiently representative to describe the entire dataset. SC-SC obtained the poorest performance (OA = 43.38%), meaning that the representative dictionary utilized in SC-SC is not well-built, and further tuning of its parameters is required. MS$^{2}$A-Net attained the highest OA (73.06%) among all studied clustering approaches. In the Houston 2013 dataset, all DL-based clustering approaches distinguish the "Healthy grass" class perfectly (100%). Except for VAE (class accuracy = 87.09%), the performance of the DL-based approaches is poor for the "Road" class. In addition, the poor performance of some DL-based approaches could be due to the low number of available test samples.
3) Quantitative Results on Geological Finland Dataset: The quantitative assessment of the different clustering approaches applied on the geological Finland dataset is reported in Table III. HESSC is capable of clustering the geological Finland scene (OA = 70.05%) more accurately than other CSL-based approaches. Despite the good performance of CAE in other datasets, in the geological Finland dataset, it performed slightly weaker (OA = 70.76%) than AE (OA = 68.37%). As in the Houston 2013 dataset, this may be caused by the low number of available test samples. MS$^{2}$A-Net attained the highest OA (80.48%) among all studied clustering approaches. DL-based approaches distinguished the "Dark-rocks" class better than the majority of the CSL-based approaches.

4) Quantitative Results on Geological Spain Dataset: The performance of the studied clustering approaches applied on the geological Spain dataset is reported in Table IV. Interestingly, the availability of a high number of test samples for various classes reveals valuable information. We can observe that some CSL-based approaches (i.e., ESC, SC-SC) are not applicable on this dataset. In the case of SC-SC, deriving a sparse representation for a large-scale dataset is too computationally expensive. In the case of ESC, one can reduce the number of exemplars, but this results in poor performance when the number of selected exemplars is not sufficient to represent the entire dataset. In addition, overall, CSL-based approaches have inferior performance compared to DL-based approaches. Furthermore, incorporating spatial information (i.e., CAE, DMC-Net, MS$^{2}$A-Net) improves the clustering performance of DL-based approaches, compared to when merely spectral information is deployed.
5) Processing Time: All tables report the required processing times on the four datasets. k-means is the fastest clustering approach on all studied datasets, since it merely needs to compute the Euclidean distance between data points and centroids. AE is the fastest DL-based approach. Apart from these two, MS$^{2}$A-Net is faster than all other CSL- and DL-based approaches. Among all approaches, EnSC and DCN are the most time-consuming CSL- and DL-based clustering approaches, respectively.
6) Qualitative Assessment on Trento, Houston 2013, Geological Finland, and Geological Spain Datasets: Figs. 8-11 display clustering maps generated by the most representative CSL- and DL-based clustering approaches, which can handle complex and large-scale datasets, on Trento, Houston 2013, geological Finland, and geological Spain, respectively. There is a general trend, in which the spectral-based approaches (e.g., k-means, AE) tend to produce "noisy" maps in comparison to the ones that incorporate spatial information (e.g., CAE, MS$^{2}$A-Net).
Although CAE, DMC-Net, and MS$^{2}$A-Net generate smooth clustering maps, it can be observed that DMC-Net and MS$^{2}$A-Net provide more detailed clustering maps compared to CAE. This observation reveals the importance of using both spectral and spatial information in the clustering process. Further investigation of Fig. 8(g) and (h) demonstrates that MS$^{2}$A-Net utilizes both spectral and spatial information more efficiently than DMC-Net. For instance, several "Apple trees" pixels have been misclustered as "Wood" by DMC-Net, while this effect is reduced when MS$^{2}$A-Net is employed. The same trend is observed in Fig. 11(f) and (g), where the "Chlorite rich schist" class is not well clustered by DMC-Net, whereas MS$^{2}$A-Net can distinguish this class.

V. DISCUSSION: ABLATION STUDY AND HYPERPARAMETER EVALUATION
In this section, we evaluate the impact of different hyperparameters on the performance of MS$^{2}$A-Net. In order to tune MS$^{2}$A-Net and identify its corresponding optimal hyperparameters, we utilize the Trento dataset that has a richer and more

A. Influence of the Multiscale Spatial Stream on the Number of Parameters
In MS$^{2}$A-Net, dilated convolutions are utilized to cover a larger receptive field while requiring fewer learnable parameters. As reported in Table V, we computed the number of learnable parameters for two scenarios. In the first scenario, dilated convolutions are deployed as described in Section II. In the second scenario, we increase the kernel size of normal convolutions, to cover the same receptive field as their corresponding dilated convolutions at different scales. The results reveal that the deployment of dilated convolutions reduces the required number of learnable parameters (i.e., weights and biases) by approximately a factor of 5. Consequently, the computational effort of MS$^{2}$A-Net is reduced proportionally compared to its version with normal convolutions. Furthermore, in the studied datasets (i.e., rural, geological, and urban applications), pixels that lie within a close-range neighborhood of each other tend to be drawn from the same class. Therefore, to primarily capture the local neighboring information, we limited the receptive fields by selecting the dilation rates {1, 2, 4}. However, depending on the application at hand, other choices for these hyperparameters can be made.
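The order of magnitude of this reduction can be checked with a back-of-the-envelope count. The sketch below is an assumption-laden illustration and does not reproduce Table V: it counts only the three parallel convolutional streams, assumes a standard 2-D convolution with one $r \times r$ weight slice per input band plus one bias per filter, and uses the 63 bands of the Trento dataset as an example input; the function names are hypothetical.

```python
def conv_params(in_channels, kernel_size, n_filters):
    """Weights plus one bias per filter of a standard 2-D convolution."""
    return n_filters * (in_channels * kernel_size * kernel_size + 1)


def compare_streams(in_channels, r=5, rates=(1, 2, 4), n_filters=12):
    """Parameter counts for dilated kernels vs. enlarged normal kernels."""
    # Dilation leaves the number of weights of an r x r kernel unchanged.
    dilated = sum(conv_params(in_channels, r, n_filters) for _ in rates)
    # A normal kernel needs side l*r - l + 1 to cover the same footprint.
    enlarged = sum(
        conv_params(in_channels, l * r - l + 1, n_filters) for l in rates
    )
    return dilated, enlarged


dilated, enlarged = compare_streams(in_channels=63)  # 63 bands, as in Trento
print(dilated, enlarged, round(enlarged / dilated, 2))
```

Under these assumptions, the enlarged-kernel variant needs a little over five times as many parameters in the convolutional streams, which is consistent with the factor reported above.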

B. Impact of the Spectral Mean Constraint
In the proposed approach, the ultimate goal is to cluster the extracted multiscale features ($\mathbf{M}$). To include $\mathbf{M}$ implicitly in the training process, we defined the spectral mean constraint on the generated $\mathbf{M}$, to ensure that the generated latent features have a direct impact on the training process, in such a way that they are forced to preserve the mean spectral information of the original image. We evaluated the impact of the spectral mean constraint in Eq. (5) for the following values of $\lambda$: {0, 0.0001, 0.001, 0.01, 0.1}. In Fig. 12, the results are displayed in terms of OA (%). From the obtained results, one can conclude that $\lambda = 0.1$ leads to the highest OA and the lowest variation.

C. Influence of the Number of Latent Features on the MS$^{2}$A-Net Performance
In Fig. 13, the performance of MS$^{2}$A-Net is validated in terms of OA (%) by utilizing different numbers of extracted latent features ($d$ = 6, 12, 18, 24, 30, 36, 42, 48, 54, 60). One can observe that the best performance is obtained using 36 features. Thus, we propose to use 36 as the number of extracted features for all datasets.

D. Impact of the Multiscale Spatial and Spectral-Association Streams
We evaluated the effectiveness of the streams deployed in MS$^{2}$A-Net by using various alternative scenarios:
1) Alternative 1 (A1): In this alternative, only the spectral-association stream is deployed in the MS$^{2}$A-Net architecture, thereby completely ignoring the effect of the multiscale spatial stream in the training process.
2) Alternative 2 (A2): In this alternative, merely the multiscale spatial stream is deployed in the MS$^{2}$A-Net training process.
3) Alternative 3 (A3): This alternative is the proposed approach that deploys both spectral-association and multiscale spatial streams.
As shown in Fig. 14, poor results are obtained by scenario A1, compared to the other scenarios. The spectral-association stream mainly uses spectral information, and the lack of sufficient spatial information strongly deteriorates the clustering performance. With scenario A2, the clustering performance improves through the beneficial influence of spatial information. However, one can observe that optimal clustering performance is obtained in scenario A3, when both multiscale spatial and spectral-association streams are deployed.

VI. CONCLUSION
HSI clustering is a challenging task, which can provide valuable insight into datasets. Unlike CSL-based clustering approaches, DL-based approaches can capture nonlinear intrinsic relationships between data points in complex datasets. Furthermore, most CSL- and DL-based HSI clustering approaches merely utilize spectral information, while neighboring pixels likely share the same characteristics. Hence, in this article, we proposed a multiscale spectral-spatial association network, which effectively exploits spectral and spatial information for HSI clustering. MS$^{2}$A-Net contains two main streams (i.e., a spectral-association and a multiscale spatial stream). The spectral-association stream aims to efficiently extract spectral information, whereas the multiscale spatial stream deploys dilated convolutions to capture spatial information at various scales. We have demonstrated that MS$^{2}$A-Net outperforms the state-of-the-art CSL- and DL-based clustering approaches with competitive processing times on four real hyperspectral datasets.
In the future, we will work on different optimization strategies to further improve the clustering performance. In addition, we intend to design even lighter networks with a reduced number of learnable parameters and, consequently, a shorter processing time.