SSF-Net: A Spatial–Spectral Features Integrated Autoencoder Network for Hyperspectral Unmixing

In recent years, deep learning has received tremendous attention in the field of hyperspectral unmixing (HU) due to its powerful learning capabilities. Particularly, the unsupervised unmixing method based on an autoencoder (AE) has become a research hotspot. Most of the current AE unmixing networks mainly focus on information about pixels and their neighborhoods in images. However, they make insufficient use of information about spatial heterogeneity and spectral differences of endmembers in hyperspectral image (HSI) data. To this end, an AE HU network with the name of SSF-Net is proposed for fusing the spatial–spectral features. The network first extracts pseudoendmember information from the HSI using a regional vertex component analysis algorithm. Then, a dual-branch feature fusion module incorporating a spatial–spectral attention mechanism is constructed to make full use of the information in the HSI data, thereby improving the network's unmixing performance. It is worth stating that SSF-Net can fuse spatial–spectral information and utilize different attention maps to obtain more significant spectral difference information and more discriminative spatial difference information about the scene. The experimental results on synthetic and real datasets demonstrate that the proposed SSF-Net outperforms state-of-the-art unmixing algorithms.

Bin Wang, Huizheng Yao, Dongmei Song, Jie Zhang, and Han Gao, Member, IEEE

I. INTRODUCTION
A hyperspectral image (HSI) captures detailed spectral information of ground objects in hundreds of contiguous bands, from visible light to short-wave infrared and even wider spectral intervals, while also recording the spatial distribution of those objects. Because of this rich spectral information, HSIs have received a great deal of attention, especially in military investigation, target tracking, target identification, and environmental monitoring [1], [2], [3], [4], [5]. However, due to limited spatial resolution and the complex diversity of natural land surfaces, mixed pixels [6] are common in HSIs. A mixed pixel is composed of several pure material spectra, and its presence strongly degrades the accuracy of hyperspectral remote sensing applications [7]. To address this problem, hyperspectral unmixing (HU) is often used to decompose the mixed pixels into a set of pure material spectra (endmembers) and the coverage ratios (abundances) of those endmembers [8]. Currently, HU is widely used in mineral detection [9], [10] and agricultural detection [11], [12].
HU models can be classified into two categories: the linear mixing model (LMM) [1] and the nonlinear mixing model (NLMM) [13]. The LMM assumes that the electromagnetic wave energy received by the sensor does not undergo secondary scattering during transmission, i.e., the spectrum of a mixed pixel is a linear combination of multiple pure spectra (endmembers) of the ground objects, weighted by certain proportions (abundances). Moreover, considering the physical mechanism in HU, the abundances need to satisfy the abundance nonnegativity constraint (ANC) and the abundance sum-to-one constraint (ASC) [14]. The NLMM is often used to describe the intricate interactions between light scattered by multiple materials within a scene [1]. Although the NLMM is more in line with the actual transmission of electromagnetic waves, it requires consideration of numerous complex factors in implementation. Given the explicit physical mechanism and relatively straightforward solving process of the LMM, mixed spectra can be simulated efficiently. Therefore, this study focuses on LMM-based HU.
Traditional unmixing methods can be mainly categorized into geometry-based, statistical-based, and sparse regression-based unmixing methods. Among the geometry-based unmixing methods, the typical representatives include N-finder (N-FINDR) [15] and vertex component analysis (VCA) [16]. N-FINDR projects the pixels into the feature space to form a simplex, and the endmembers are selected by identifying the pixels that constitute the maximum-volume simplex. VCA iteratively projects the pixels onto a direction orthogonal to the subspace spanned by the already identified endmembers, where the new endmember corresponds to the extreme of the projection. Due to the complexity and variety of natural surfaces, it is a formidable challenge to identify the pure pixels in remote sensing images. Furthermore, geometry-based unmixing methods tend to fall into local optima in HSIs with highly mixed ground objects. In contrast, statistical-based unmixing methods, such as those with a Bayesian framework [17] and nonnegative matrix factorization (NMF) [18], are able to obtain the global optima. Bayesian methods can effectively incorporate a priori information into the unmixing process, thus improving the accuracy of unmixing [19], [20]. Because it learns part-based representations, NMF has become a prominent research focus in the field of HU. Notably, NMF estimates the endmembers and abundances of an HSI simultaneously [21]. Currently, many NMF-based HU methods have been proposed, mainly focusing on improving the unmixing accuracy through the incorporation of regularization constraints or the integration of spectral and spatial information [22], [23]. Besides, there are unmixing methods based on nonnegative tensor factorization [24], [25], which minimize the information loss in the unmixing process. Furthermore, there are model-inspired network-based unmixing approaches [26], [27], [28] that make the unmixing process more physically interpretable by combining it with a physical model. Finally, the sparse regression-based unmixing method [29] estimates the endmembers and their corresponding abundances in an HSI using a priori knowledge of a known spectral library. Although the sparse regression-based unmixing methods can mitigate the adverse effects of inaccurate endmember extraction, they cannot be widely applied to HSIs acquired in complex environments due to the poor transferability of the spectral library.
Deep learning has attracted much attention in the field of HU owing to its powerful feature representation ability. In particular, the autoencoder (AE) and its variants have been applied to HU with excellent unmixing results. Up to now, the AE-based unmixing networks can be classified into two categories: pixel-level unmixing networks and spatial-level unmixing networks. Typical representatives of pixel-level unmixing networks include EndNet [30], uDAS [31], DAEN [32], MUNet [33], CyCU-Net [34], and EGU-Net [35]. Among them, EndNet forms an unmixing network through two AEs and uses spectral angular distance (SAD) and K-L divergence as loss functions. uDAS reduces the effect of noise on unmixing by introducing denoising constraints into the AE and enhances the abundance estimation by introducing the so-called $\ell_{2,1}$-norm into the decoder [31]. To enhance the robustness of the model, DAEN first processes the outliers and noise in the data by using a stacked autoencoder (SAE) and then feeds the processed data into a variational AE to obtain more accurate unmixing results. MUNet constructs a multimodal unmixing network by additionally introducing LiDAR data, which improves the unmixing performance by integrating the elevation differences of the LiDAR data into the HSI. CyCU-Net uses a cycle-consistency network structure with two cycle-connected AEs to reduce the information loss during image processing, thus enhancing the unmixing ability of the network. EGU-Net reduces the influence of spectral variability (SV) on the unmixing results by introducing pseudoendmembers and utilizes the unmixing information derived from pseudoendmembers to guide the unmixing process to improve network performance. With the rise of CNNs in the field of computer vision, many spatial-level unmixing networks have emerged as well. Typically, CNNAEU [36] introduces CNNs into AE-based unmixing networks, omitting any pooling or upsampling operations to maximize the retention of spatial information from the HSI. SSCC-Net [37] utilizes both spectral and spatial information to train the spatial AE networks and spectral AE networks, respectively, in an end-to-end manner. Notably, DeepTrans [38] was the first to use a transformer in combination with the convolutional AE for HU. Although the above methods achieve remarkable performance in HU tasks, they still have some limitations. The pixel-level unmixing networks mainly utilize the spectral information of pixels for unmixing without considering the spectral differences between pseudoendmembers. The spatial-level unmixing networks enlarge the receptive field by introducing convolutional operations to capture spatial contextual information but neglect the spatial differences between different ground objects. Although the joint spectral-spatial networks utilize the contextual information of the spectral and spatial features in the HSI, they still do not take into account the effects of spectral differences between the pseudoendmembers or the spatial heterogeneity of the distributed features. Therefore, how to make full use of the spatial discrepancy in the HSI and the spectral difference between pseudoendmembers to maximize the accuracy of HU has become an urgent challenge.
To this end, this article proposes a novel spatial-spectral features integrated AE HU network, called SSF-Net, which integrates a spectral attention mechanism and a spatial attention mechanism within the AE unmixing framework to extract, in an unsupervised manner, the spatial discrepancy in the HSI and the spectral difference between pseudoendmembers. Compared with the current unmixing methods that only use spectral information or neighboring pixel information, SSF-Net can better extract the feature difference information between different ground objects, thus significantly improving the unmixing accuracy of the network. Specifically, the main contributions of this study can be summarized as follows.
1) SSF-Net improves the unmixing performance by exploiting the spatial difference information in the HSI and the spectral difference information between pseudoendmembers. To the best of our knowledge, this study represents the first use of deep learning to investigate the unmixing task with multifeature fusion.
2) A two-branch feature fusion module incorporating a spatial-spectral attention mechanism is built into SSF-Net to address the problem of underutilization of spatial-spectral information. This module extracts the relevant spatial-spectral information by integrating spatial and channel attention, thus improving the accuracy of the unmixing.
3) An unsupervised learning approach is adopted, which makes the training process no longer dependent on labeled data. The network model is able to learn and extract features from the data itself autonomously, thus improving the generalization ability of the model.
The rest of this article is organized as follows. Section II briefly describes the principles of AE-based unmixing. Section III describes the network structure of SSF-Net in detail. Section IV gives the results of SSF-Net on several datasets. Finally, Section V concludes this article.

II. AE-BASED UNMIXING MODEL
This section introduces the unmixing method based on the LMM, which can usually be expressed as follows:
$$Y = EA + N$$
where $Y \in \mathbb{R}^{l \times n}$ is the HSI expressed in the form of a two-dimensional (2-D) matrix, $l$ is the number of bands, and $n$ is the number of pixels. Besides, $E \in \mathbb{R}^{l \times p}$ is the endmember matrix representing the $p$ endmembers in the HSI, $A \in \mathbb{R}^{p \times n}$ is the abundance matrix corresponding to the $p$ endmembers, and $N \in \mathbb{R}^{l \times n}$ refers to the added noise matrix.
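As a minimal illustration of the LMM, the sketch below builds $Y = EA + N$ for a toy scene in plain Python. The function name `mix_pixels`, the toy values of `E` and `A`, and the choice of i.i.d. zero-mean Gaussian noise for $N$ are assumptions made for illustration only; they are not taken from the paper.

```python
import random

def mix_pixels(E, A, noise_std=0.0, seed=0):
    """Return Y = E A + N for E (l x p) and A (p x n), both nested lists."""
    rng = random.Random(seed)
    l, p, n = len(E), len(A), len(A[0])
    Y = [[0.0] * n for _ in range(l)]
    for band in range(l):
        for pixel in range(n):
            clean = sum(E[band][k] * A[k][pixel] for k in range(p))
            # N is drawn as i.i.d. zero-mean Gaussian noise (an assumption).
            Y[band][pixel] = clean + rng.gauss(0.0, noise_std)
    return Y

# Toy scene: l = 3 bands, p = 2 endmembers, n = 2 pixels (illustrative values).
E = [[0.25, 0.75],
     [0.5, 0.5],
     [1.0, 0.0]]
A = [[1.0, 0.5],   # abundance of endmember 1 in each pixel
     [0.0, 0.5]]   # abundance of endmember 2 in each pixel
Y = mix_pixels(E, A)  # noise-free: pixel 1 is pure, pixel 2 is a 50/50 mixture
```

With `noise_std=0.0` the first column of `Y` equals the first endmember, and the second column is the average of the two endmembers.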
Considering that abundance represents the proportion of different feature types in each mixed pixel, the abundance vector $a_j$ (the $j$th column of $A$) also needs to satisfy the ANC and ASC:
$$a_{kj} \geq 0, \quad \sum_{k=1}^{p} a_{kj} = 1$$
This study primarily focuses on AE-based HU, which simultaneously obtains the endmember matrix $E$ and the abundance matrix $A$ in an unsupervised manner by means of the powerful learning and characterization capabilities of deep neural networks. As illustrated in Fig. 1, a complete AE typically consists of an encoder and a decoder.
Encoder: The encoder is typically a multilayer network structure that converts the input hyperspectral data $\{y_i\}_{i=1}^{n} \in \mathbb{R}^{l}$ into hidden-layer data denoted by $h_i$, as formulated in the following equation:
$$h_i = f\left(W^{(e)} y_i + b^{(e)}\right)$$
where $f(\cdot)$ represents the activation function of the encoder, $W^{(e)}$ represents the weight of the $e$th encoder layer, and $b^{(e)}$ represents the bias of the $e$th encoder layer.
Decoder: The decoder converts the hidden-layer data $h_i$ back to the original input data, denoted by $\{\hat{y}_i\}_{i=1}^{n} \in \mathbb{R}^{l}$, which can be formulated as follows:
$$\hat{y}_i = W^{(d)} h_i$$
where $W^{(d)}$ represents the weight of the $d$th decoder layer.
Conventionally, the mean square error (MSE) in (5) is employed to quantify the reconstruction error of the AE. However, when dealing with hyperspectral data, additional error metrics, such as the SAD in (6), need to be taken into consideration for evaluating the reconstruction accuracy:
$$L_{\mathrm{MSE}} = \frac{1}{n} \sum_{i=1}^{n} \left\| y_i - \hat{y}_i \right\|_2^2 \tag{5}$$
$$L_{\mathrm{SAD}} = \frac{1}{n} \sum_{i=1}^{n} \arccos\left( \frac{\langle y_i, \hat{y}_i \rangle}{\left\| y_i \right\|_2 \left\| \hat{y}_i \right\|_2} \right) \tag{6}$$
Considering the inherent advantages of the AE, such as a simple training process, flexible stacking of layers, and an unsupervised learning paradigm, this study adopts the AE for HU. The encoder transforms the input HSI data into abstract features and then converts them into abundance maps that satisfy the ANC and ASC, so that the decoder, in full compliance with the LMM, reconstructs the HSI data from the abundance maps. At this point, the decoder weights can be interpreted as the desired endmembers. The workflow of AE-based HU is illustrated in Fig. 2, wherein the network enables the simultaneous estimation of both abundances and endmembers. It is noteworthy that the entire AE unmixing process is conducted in an unsupervised manner, which effectively mitigates the issue of insufficient labeled samples in HSI data [39].
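The interpretation of the decoder weights as endmembers can be made concrete with a small forward pass. The sketch below is a simplified, framework-free illustration: it assumes a single-layer encoder, uses a softmax to impose the ANC and ASC, and uses a bias-free linear decoder. The names `encode` and `decode` and all weight values are hypothetical.

```python
import math

def matvec(W, x):
    """Matrix-vector product for a nested-list matrix W and a list x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(z):
    """Numerically stable softmax: outputs are nonnegative and sum to one."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def encode(y, W_e, b_e):
    """One-layer encoder (an assumption): abundances a = softmax(W_e y + b_e)."""
    return softmax([h + b for h, b in zip(matvec(W_e, y), b_e)])

def decode(a, W_d):
    """Bias-free linear decoder: y_hat = W_d a, so column k of W_d is endmember k."""
    return matvec(W_d, a)

W_d = [[0.25, 0.75], [0.5, 0.5], [1.0, 0.0]]   # l = 3 bands, p = 2 endmembers
W_e = [[0.0, 0.0, 2.0], [1.0, 0.0, -1.0]]      # hypothetical encoder weights (p x l)
b_e = [0.0, 0.0]

y = [0.25, 0.5, 1.0]            # one input pixel
a = encode(y, W_e, b_e)         # abundance vector satisfying ANC and ASC
y_hat = decode(a, W_d)          # reconstruction under the LMM
pure = decode([1.0, 0.0], W_d)  # a one-hot abundance returns endmember 1
```

Because the decoder is linear and has no bias, feeding it a one-hot abundance vector returns the corresponding column of `W_d`, which is why the trained decoder weights can be read off as endmember spectra.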

III. PROPOSED METHOD
In this study, a spatial-spectral features integrated AE network, abbreviated as SSF-Net, is proposed for HU, with its overall structure shown in Fig. 3. The network consists of two parts: a spatial-spectral feature fusion encoder and a decoder. The former fuses the deep-level features extracted from both the HSI data and the pseudoendmembers to fully exploit the spatial and spectral characteristics inherent in the original data. The latter employs a commonly used decoder architecture to reconstruct the HSI from the extracted abundances. It is noteworthy that, within the encoder, the in-depth integration of spectral features and spatial features is achieved by a dedicated module known as the spatial-spectral fusion module (SSFM). Through the SSFM, the fused high-level features encompass the intrinsic attributes of the pseudoendmember spectra as well as the global spatial information of the HSI, which together enhance the unmixing accuracy of the network.
To endow the network with abundant spectral features, the regional VCA endmember extraction method is employed to acquire the pseudoendmember spectra from the HSI. Specifically, the HSI data are first partitioned into several subpatches with a certain overlap rate, and the pseudoendmembers of each subpatch are extracted using the VCA algorithm. Subsequently, the K-means clustering algorithm is utilized to eliminate the duplicate pseudoendmembers, and all remaining pseudoendmembers are aggregated into K clusters [35]. The final pseudoendmembers are then obtained by computing the center of each cluster. Notably, the number of subpatches and the value of K can be determined by referring to the literature [40]. In this study, the value of K is deliberately set to about 20% of all hyperspectral pixels according to several trial experiments. It should be noted that the pseudoendmembers obtained by the above process can reduce the influence of SV on the unmixing results because they contain rich spectral information of features, perturbation information, and a certain amount of noise. The following sections give a detailed description of the SSF-Net framework.
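The first step of the regional VCA procedure, partitioning the image into overlapping subpatches, can be sketched as follows. Only the partitioning is shown; VCA and K-means are not implemented here. The function names, the patch size of 40 pixels, and the overlap rate of 0.5 are hypothetical choices for illustration, since the text does not specify them.

```python
def patch_origins(size, patch, overlap):
    """Start indices along one axis for patches of width `patch` whose
    neighbours share a fraction `overlap` of their width."""
    if not 0 <= overlap < 1 or patch > size:
        raise ValueError("need 0 <= overlap < 1 and patch <= size")
    stride = max(1, int(round(patch * (1 - overlap))))
    starts = list(range(0, size - patch + 1, stride))
    if starts[-1] != size - patch:
        starts.append(size - patch)  # extra patch so the image border is covered
    return starts

def subpatches(height, width, patch, overlap):
    """Return (row_start, col_start, row_end, col_end) boxes covering the image."""
    return [(r, c, r + patch, c + patch)
            for r in patch_origins(height, patch, overlap)
            for c in patch_origins(width, patch, overlap)]

# Example: a 95 x 95 image split into 40 x 40 patches with 50% overlap.
boxes = subpatches(95, 95, patch=40, overlap=0.5)
```

Each box would then be passed to an endmember extractor such as VCA, and the pooled candidates clustered with K-means so that the cluster centers serve as pseudoendmembers.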

A. Spatial-Spectral Feature Fusion Encoder
To fully exploit the information contained in the HSI data, the spatial-spectral fusion encoder is designed as a dual-branch structure. The encoder consists of a spectral branch and a spatial branch, which encode the input data from different views to capture the high-level spectral features and spatial features in the HSI. In the spectral branch, the spectral features in the pseudoendmembers are extracted by three consecutive feature extraction blocks. Within each feature extraction block, a 1 × 1 convolution is employed to compress the spectral information of the pseudoendmembers, and an activation function (such as Sigmoid, ReLU, or LeakyReLU) is introduced after the convolution. Since the ReLU activation function may lead to the problem of neuron invalidation [41], the LeakyReLU activation function is employed in this network. Moreover, batch normalization is introduced to alleviate problems such as gradient vanishing and gradient exploding, as well as to improve the overall computational speed of the network. To mitigate overfitting, dropout is added at the end of each block. The high-level spectral features, denoted as $F_{spe}$, are obtained from the pseudoendmembers through the three feature extraction blocks in the spectral branch. Moreover, considering the strong correlation between pixels and their surrounding scenes in the HSI [37], [42], [43], 3 × 3 convolutions are incorporated in the feature extraction blocks of the spatial branch to capture the spatial information of neighboring pixels. The high-level spatial features obtained from the HSI data through these feature extraction blocks are denoted as $F_{spa}$.
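The core of one spectral-branch feature extraction block, a 1 × 1 convolution followed by LeakyReLU, can be sketched in plain Python as below. Batch normalization and dropout are deliberately omitted, so this is not the full block. The negative slope of 0.01 and all weight values are assumptions; the paper does not state them.

```python
def leaky_relu(v, slope=0.01):
    """LeakyReLU keeps a small gradient for negative inputs, unlike ReLU."""
    return v if v > 0 else slope * v

def conv1x1(F, W, b):
    """1 x 1 convolution: the same linear map W (C_out x C_in) plus bias b
    is applied to the channel vector of every pixel of F (C_in x H x W)."""
    c_in, height, width = len(F), len(F[0]), len(F[0][0])
    return [[[b[o] + sum(W[o][c] * F[c][i][j] for c in range(c_in))
              for j in range(width)]
             for i in range(height)]
            for o in range(len(W))]

def feature_block(F, W, b):
    """Convolution followed by LeakyReLU (batch norm and dropout omitted)."""
    return [[[leaky_relu(v) for v in row] for row in ch] for ch in conv1x1(F, W, b)]

# Toy input: 3 channels, 1 x 2 pixels, compressed to 2 channels.
F = [[[1.0, 2.0]], [[0.0, 1.0]], [[-1.0, 0.0]]]
W = [[1.0, 0.0, 1.0], [-1.0, -1.0, 0.0]]  # hypothetical weights
b = [0.0, 0.0]
out = feature_block(F, W, b)
```

Because a 1 × 1 kernel never looks at neighbouring pixels, this block mixes information only across the spectral (channel) axis, which is what distinguishes the spectral branch from the 3 × 3 convolutions of the spatial branch.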
To effectively combine the spectral features from the pseudoendmembers with the spatial features from the HSI, the SSFM is constructed in this study. This module aims to enhance the network's ability to utilize and integrate spectral and spatial information to improve the unmixing accuracy. As shown in Fig. 4, the SSFM contains a channel attention mechanism and a spatial attention mechanism, which are described in detail as follows.
1) Channel Attention Mechanism: During the process of HU, the pseudoendmembers bear a remarkably high resemblance to the pure endmembers. Therefore, in the absence of the pure endmembers, the spectral differences between different pseudoendmembers can be used as a substitute for the spectral differences between the pure endmembers. In view of this, a channel attention mechanism is introduced to enhance or suppress the channel features that are responsive to the differences between different pseudoendmembers, so as to improve the accuracy of abundance estimation during HU. The workflow of the channel attention module is shown in Fig. 5(a). First, the input spectral feature $F_{spe} \in \mathbb{R}^{l \times H \times W}$ is processed using global average pooling and global max pooling to obtain the average-pooled feature $F^{c}_{avg} \in \mathbb{R}^{l \times 1 \times 1}$ and the max-pooled feature $F^{c}_{max} \in \mathbb{R}^{l \times 1 \times 1}$, respectively. Then, to capture the association information between $F^{c}_{avg}$ and $F^{c}_{max}$, they are separately fed into a shared multilayer perceptron (MLP), and the outputs are summed. Finally, the channel attention map $M_C(F_{spe})$ is produced by the sigmoid function.
To reduce the number of parameters, the hidden layer size in the MLP is set to $l/r$, where $r$ is the compression ratio. Through multiple sets of experiments, it is found that when $r$ is 16, the weight operator corresponding to the entire channel attention module can significantly improve the representation ability of the network model for abundance.
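The channel attention module can be expressed as follows:
$$M_C(F_{spe}) = \delta\left(\mathrm{MLP}\left(F^{c}_{avg}\right) \oplus \mathrm{MLP}\left(F^{c}_{max}\right)\right)$$
where $\delta$ represents the sigmoid function, and $\oplus$ refers to the operation of elementwise addition.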
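The channel attention computation described above (global average and max pooling, a shared MLP, summation, sigmoid) can be traced in a few lines of plain Python. This is a sketch under assumptions: the MLP is taken to have one hidden layer with a ReLU and no biases, and the toy sizes (4 channels, ratio $r = 2$) and weights are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def shared_mlp(x, W1, W2):
    """Two-layer MLP with hidden size C / r; the ReLU is an assumption."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)

def channel_attention(F, W1, W2):
    """F is C x H x W (nested lists). Returns one weight in (0, 1) per channel."""
    flat = [[v for row in ch for v in row] for ch in F]
    f_avg = [sum(c) / len(c) for c in flat]   # global average pooling
    f_max = [max(c) for c in flat]            # global max pooling
    summed = [a + m for a, m in zip(shared_mlp(f_avg, W1, W2),
                                    shared_mlp(f_max, W1, W2))]
    return [sigmoid(s) for s in summed]

# Toy feature map: C = 4 channels of 2 x 2 pixels, compression ratio r = 2.
F = [[[0.0, 1.0], [2.0, 3.0]],
     [[1.0, 1.0], [1.0, 1.0]],
     [[4.0, 0.0], [0.0, 0.0]],
     [[0.5, 0.5], [1.5, 1.5]]]
W1 = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.5, -0.1, 0.2]]       # (C / r) x C
W2 = [[0.4, 0.0], [0.0, -0.3], [0.2, 0.2], [-0.1, 0.6]]   # C x (C / r)
M_c = channel_attention(F, W1, W2)
```

The same `W1` and `W2` are applied to both pooled vectors, which is what "shared" means here, and the resulting map has one entry per channel.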
2) Spatial Attention Mechanism: The complex distribution of real-world environments and the susceptibility of the HSI imaging process to interference from external factors cause the spectral profiles of pixels in the HSI to be affected by SV [44], [45], so that HSI pixels from different regions contribute very differently to the unmixing. SV often leads to deviations between the spectral curves of some pixels in the HSI and the ideal spectral curves, thereby affecting the accuracy of HU. To this end, the spatial attention mechanism is introduced to enhance or suppress the importance of pixels in different regions during the HU process to improve the unmixing ability of the network. The workflow of the spatial attention mechanism is shown in Fig. 5(b).
First, the input spatial features $F_{spa} \in \mathbb{R}^{l \times H \times W}$ are subjected to average pooling and max pooling along the channel dimension, which results in the average-pooled feature $F^{s}_{avg} \in \mathbb{R}^{1 \times H \times W}$ and the max-pooled feature $F^{s}_{max} \in \mathbb{R}^{1 \times H \times W}$, respectively. Then, these two feature maps are concatenated along the channel dimension, yielding the fused feature $F^{s}_{A-M} \in \mathbb{R}^{2 \times H \times W}$, which serves for calculating the spatial weights. Subsequently, the fused feature $F^{s}_{A-M}$ is further processed by a 2-D convolution with a kernel size of 7 × 7, yielding the spatial weight operator. Finally, this spatial weight operator is converted to a spatial attention map $M_S(F_{spa})$ using the sigmoid activation function.
The spatial attention module can be formulated as follows:
$$M_S(F_{spa}) = \delta\left(f^{7 \times 7}\left(\mathrm{Concat}\left[F^{s}_{avg}; F^{s}_{max}\right]\right)\right)$$
where $f^{7 \times 7}$ denotes the 2-D convolution with a kernel size of 7 × 7, Concat represents the concatenation of the feature maps along the channel dimension, and $\delta$ refers to the sigmoid function.
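The spatial attention map can likewise be traced step by step. The sketch below pools over the channel axis, stacks the two resulting maps, applies one k × k convolution, and passes the result through a sigmoid. Zero padding (so that the output keeps the H × W size) and the absence of a convolution bias are assumptions, and the kernel values are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def spatial_attention(F, kernel):
    """F is C x H x W; kernel is 2 x k x k with k odd. Returns an H x W map."""
    C, H, W = len(F), len(F[0]), len(F[0][0])
    k = len(kernel[0])
    pad = k // 2
    # Channel-wise pooling: one average map and one max map, each H x W.
    f_avg = [[sum(F[c][i][j] for c in range(C)) / C for j in range(W)] for i in range(H)]
    f_max = [[max(F[c][i][j] for c in range(C)) for j in range(W)] for i in range(H)]
    stacked = [f_avg, f_max]  # concatenation along the channel axis (2 x H x W)
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            acc = 0.0
            for ch in range(2):
                for u in range(k):
                    for v in range(k):
                        r, s = i + u - pad, j + v - pad
                        if 0 <= r < H and 0 <= s < W:  # zero padding outside the image
                            acc += kernel[ch][u][v] * stacked[ch][r][s]
            out[i][j] = sigmoid(acc)
    return out

# Toy input: 2 channels of 2 x 2 pixels and a 7 x 7 kernel that only passes
# the centre of the average map, so the output is sigmoid(average).
F = [[[0.0, 2.0], [4.0, 6.0]],
     [[2.0, 2.0], [0.0, 2.0]]]
kernel = [[[0.0] * 7 for _ in range(7)] for _ in range(2)]
kernel[0][3][3] = 1.0
M_s = spatial_attention(F, kernel)
```

Unlike the channel attention map, which has one weight per channel, this map has one weight per pixel, so it can emphasize or suppress whole regions of the scene.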
3) Fusion of Spectral-Spatial Features: The SSFM facilitates the comprehensive mining and utilization of spectral and spatial information by fusing features through the channel attention mechanism and the spatial attention mechanism at the same time, thus significantly improving the unmixing accuracy. The former mechanism evaluates the importance of channels based on the spectral features to improve the abundance estimation accuracy, while the latter mechanism focuses on elevating the significance of different pixels in space to obtain better endmember extraction results. The fusion process of the SSFM is built from the operations of element-by-element multiplication ($\otimes$) and addition ($\oplus$) and yields the fused features $F_{fused}$.
Furthermore, the abundance maps are obtained by applying a 3 × 3 convolution to $F_{fused}$, where the number of abundance maps is consistent with the number of endmembers. Finally, a softmax function is applied immediately after the convolution layer to ensure that the abundance results satisfy the ANC and ASC.

B. Decoder
The decoder reconstructs the input pixels by combining the estimated abundances and the corresponding endmembers. Such a process can be expressed as follows:
$$\hat{y}_i = \hat{E} \hat{a}_i$$
In the equation above, $\{\hat{y}_i\}_{i=1}^{n} \in \mathbb{R}^{l}$ are the reconstructed pixels, $\hat{E} \in \mathbb{R}^{l \times p}$ represents the estimated endmember matrix, and $\{\hat{a}_i\}_{i=1}^{n} \in \mathbb{R}^{p}$ denotes the generated abundance vectors.
Notably, to reduce the training time, VCA is employed to initialize the weights $W^{(d)}$ of the decoder.

C. Objective Function
To achieve the best possible training results, the loss function of the SSF-Net model is designed to consist of SAD and MSE. The SAD exhibits spectral scale invariance, as it evaluates the similarity between two spectral curves by calculating the angle between the target spectrum and the reference spectrum. A smaller angle between the two spectral curves indicates greater similarity. The SAD is calculated as follows:
$$L_{\mathrm{SAD}} = \frac{1}{n} \sum_{i=1}^{n} \arccos\left( \frac{\langle y_i, \hat{y}_i \rangle}{\left\| y_i \right\|_2 \left\| \hat{y}_i \right\|_2} \right)$$
where $y_i$ and $\hat{y}_i$ denote the input and reconstructed pixel data, respectively, and $n$ denotes the total number of pixels.
Although SAD can improve the accuracy of endmember extraction, it is prone to larger errors in abundance estimation as it only considers the scale invariance of endmembers. To this end, the MSE is also introduced into the objective function to ensure that the network can obtain more accurate abundance estimates:
$$L_{\mathrm{MSE}} = \frac{1}{n} \sum_{i=1}^{n} \left\| y_i - \hat{y}_i \right\|_2^2$$
To strive for better unmixing results, the overall loss function of the network in this study is defined as a weighted combination of the SAD error and the MSE error, as shown in the following equation:
$$L = \alpha L_{\mathrm{SAD}} + \beta L_{\mathrm{MSE}}$$
Here, $\alpha$ and $\beta$ represent the hyperparameters of the loss function.
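The complementary behavior of the two loss terms can be checked numerically. The sketch below implements a weighted SAD plus MSE loss in plain Python. The per-pixel squared L2 norm averaged over pixels is an assumed normalization for the MSE term, and the default weights `alpha=1.0` and `beta=1.0` are placeholders, since the paper's values are not given in this section.

```python
import math

def sad(y, y_hat):
    """Spectral angle (radians) between two spectra; zero vectors give 0."""
    dot = sum(a * b for a, b in zip(y, y_hat))
    norm = math.sqrt(sum(a * a for a in y)) * math.sqrt(sum(b * b for b in y_hat))
    if norm == 0.0:
        return 0.0
    return math.acos(max(-1.0, min(1.0, dot / norm)))  # clamp against rounding

def squared_error(y, y_hat):
    """Squared L2 distance between two spectra."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat))

def unmixing_loss(Y, Y_hat, alpha=1.0, beta=1.0):
    """alpha * mean SAD + beta * mean squared error over n pixels.
    Y and Y_hat are lists of pixel spectra (n x l)."""
    n = len(Y)
    l_sad = sum(sad(y, yh) for y, yh in zip(Y, Y_hat)) / n
    l_mse = sum(squared_error(y, yh) for y, yh in zip(Y, Y_hat)) / n
    return alpha * l_sad + beta * l_mse

# A reconstruction that is a scaled copy of the input: the SAD term is blind
# to the scale error, while the MSE term penalizes it.
Y = [[0.2, 0.4, 0.6]]
Y_scaled = [[0.4, 0.8, 1.2]]
angle_only = unmixing_loss(Y, Y_scaled, alpha=1.0, beta=0.0)
mse_only = unmixing_loss(Y, Y_scaled, alpha=0.0, beta=1.0)
```

In this example the SAD-only loss is essentially zero even though the reconstruction is twice as bright as the input, which illustrates why the MSE term is added.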

IV. EXPERIMENTS
In this section, a comparative experimental analysis is carried out against state-of-the-art HSI unmixing methods to demonstrate the superiority of SSF-Net. The algorithms selected for comparison include three classical methods, namely fully constrained least-squares unmixing (FCLSU) [14], multilayer nonnegative matrix factorization (MLNMF) [46], and spatial group sparsity regularized nonnegative matrix factorization (SGSNMF) [47]; SNMF-Net [48]; and four deep-learning-based methods, namely uDAS [31], DAEU [49], CNNAEU [36], and CyCU-Net [34]. These methods are widely recognized and highly representative in the field of HU. To ensure fairness in the experiments, VCA [16] is first adopted to generate the initial endmembers for all the comparison algorithms.

A. Data Description
1) Synthetic Dataset: The synthetic dataset is composed of five randomly selected spectral curves from the ASTER spectral library, as curated by Jin et al. [50]. This dataset consists of 60 × 60 pixels, with a total of 200 spectral bands spanning from 0.4 to 14 μm. The abundance maps follow a Dirichlet distribution. To simulate the endmember variability in real HSI data, asphalt is used as the background and the remaining four endmembers are randomly scattered in the corners. Moreover, to enhance the realism of the synthetic hyperspectral data, Gaussian noise with different signal-to-noise ratios (SNRs) is added to the synthetic dataset. The data contain a total of five endmembers: limestone, conifer, basalt, concrete, and asphalt. Fig. 6(a) shows the RGB true color image corresponding to this data area.
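Adding Gaussian noise at a prescribed SNR amounts to scaling the noise variance to the mean signal power. The sketch below assumes the common definition SNR (dB) = 10 log10(signal power / noise power) with zero-mean i.i.d. noise; the paper does not spell out its definition, and the function names are hypothetical.

```python
import math
import random

def add_noise(Y, snr_db, seed=0):
    """Add zero-mean Gaussian noise to Y (nested lists) at the target SNR in dB."""
    rng = random.Random(seed)
    values = [v for row in Y for v in row]
    signal_power = sum(v * v for v in values) / len(values)
    noise_std = math.sqrt(signal_power / 10 ** (snr_db / 10))
    return [[v + rng.gauss(0.0, noise_std) for v in row] for row in Y]

def measured_snr_db(Y, Y_noisy):
    """Empirical SNR in dB between a clean and a noisy array of the same shape."""
    signal = sum(v * v for row in Y for v in row)
    noise = sum((a - b) ** 2 for r, rn in zip(Y, Y_noisy) for a, b in zip(r, rn))
    return 10 * math.log10(signal / noise)

# Toy "image": 20 bands x 100 pixels of smooth reflectance-like values.
Y = [[0.2 + 0.001 * j for j in range(100)] for _ in range(20)]
Y_30 = add_noise(Y, snr_db=30)
snr_check = measured_snr_db(Y, Y_30)  # close to 30 for this many samples
```

A lower `snr_db` produces a larger noise standard deviation, which matches the 20 to 40 dB robustness experiments reported later in this section.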
2) Samson Dataset: The Samson dataset is acquired using the SAMSON sensor. The image consists of 952 × 952 pixels with a total of 156 spectral bands ranging from 0.401 to 0.889 μm. Because the original image is too large, an area of 95 × 95 pixels is cropped out, starting from the position of (252,332) pixels in the original image, as the experimental data. Specifically, the cropped data contain three endmembers: Soil, Tree, and Water. Fig. 6(b) presents the corresponding RGB true color image of these data.
3) Jasper Dataset: The Jasper data were collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor. The image is 521 × 614 pixels and contains 224 bands with a spectral range from 0.38 to 2.50 μm. Since the original image is too large, only a cropped subimage containing 100 × 100 pixels is used in this experiment, with its first pixel starting from the position of (252,332) pixels in the original image. After removing the bands affected by high water vapor concentration and atmospheric effects, only 198 channels are retained in these data, which contain four endmembers: Road, Soil, Water, and Tree. Fig. 6(c) shows the RGB true color image corresponding to this data area.
4) Urban Dataset: The Urban data were obtained by the Hyperspectral Digital Imagery Collection Experiment (HYDICE) sensor over the urban area of Copperas Cove, Texas, USA. The image consists of 307 × 307 pixels and contains 210 bands with a spectral range from 0.4 to 2.5 μm. Only 162 bands are retained after removing the bands affected by high water vapor concentrations and atmospheric effects. In these HSI data, there are five endmembers: Asphalt, Grass, Tree, Roof, and Dirt. Fig. 6(d) presents the corresponding RGB true color image of these data.

B. Experimental Settings
1) Hyperparameter Settings: The implementation of the SSF-Net model in this study is based on the PyTorch framework, with an i9-9900K CPU and an NVIDIA 2080 8 GB GPU as the hardware platform. During the training process, the Adam optimizer is used to update the network parameters, where the learning rate is set to $1 \times 10^{-3}$. To further improve the network accuracy, a learning rate decay strategy is adopted, that is, the learning rate is decayed once every 40 epochs of training, and the maximum number of iterations is set to 800.
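The step-wise learning rate schedule described above can be written as a one-line function. The base learning rate of $1 \times 10^{-3}$ and the 40-epoch step come from the text; the decay factor `gamma=0.8` is a hypothetical value chosen for illustration, because the paper does not report it here.

```python
def step_decay_lr(epoch, base_lr=1e-3, step=40, gamma=0.8):
    """Learning rate after `epoch` epochs when it is multiplied by `gamma`
    once every `step` epochs. `gamma` is an assumed value."""
    return base_lr * gamma ** (epoch // step)

# Learning rates at a few epochs of an 800-epoch run.
schedule = {epoch: step_decay_lr(epoch) for epoch in (0, 39, 40, 80, 799)}
```

In PyTorch, the same behavior is provided by `torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=...)` wrapped around an Adam optimizer.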

TABLE I QUANTITATIVE RESULTS FOR THE SYNTHETIC DATASET
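2) Evaluation Metrics: To evaluate the unmixing performance of the network, the following two metrics are used in the experiments: root-mean-square error (RMSE) and SAD, which are defined as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left\| a_i - \hat{a}_i \right\|_2^2}$$
$$\mathrm{SAD} = \arccos\left( \frac{\langle m_i, \hat{m}_i \rangle}{\left\| m_i \right\|_2 \left\| \hat{m}_i \right\|_2} \right)$$
where $a_i$ and $\hat{a}_i$ represent the true abundance vector and the generated abundance vector, respectively, and $m_i$ and $\hat{m}_i$ denote the true endmember and the extracted endmember, respectively.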
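Both evaluation metrics are straightforward to compute. The sketch below is a plain-Python illustration; averaging the squared abundance error over the $n$ pixels is an assumed normalization, and the function names and toy values are hypothetical.

```python
import math

def abundance_rmse(A, A_hat):
    """RMSE between true and estimated abundance vectors (lists of n vectors)."""
    n = len(A)
    total = sum(sum((a - b) ** 2 for a, b in zip(a_i, a_hat_i))
                for a_i, a_hat_i in zip(A, A_hat))
    return math.sqrt(total / n)

def endmember_sad(m, m_hat):
    """Spectral angle (radians) between a true and an extracted endmember."""
    dot = sum(a * b for a, b in zip(m, m_hat))
    norm = math.sqrt(sum(a * a for a in m)) * math.sqrt(sum(b * b for b in m_hat))
    return math.acos(max(-1.0, min(1.0, dot / norm)))  # clamp against rounding

# Two pixels, two endmembers: the first pixel is estimated completely wrong.
A_true = [[1.0, 0.0], [0.0, 1.0]]
A_est = [[0.0, 1.0], [0.0, 1.0]]
rmse = abundance_rmse(A_true, A_est)

# An extracted endmember that differs from the truth only by a scale factor.
angle = endmember_sad([0.1, 0.2, 0.3], [0.2, 0.4, 0.6])
```

For both metrics, smaller values indicate better unmixing; the second example shows that SAD ignores a pure brightness difference between the true and extracted endmember.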

C. Experimental Result and Analysis
1) Synthetic Dataset: The quantitative results of the RMSE and the SAD for each endmember on the synthetic dataset obtained by the different algorithms are presented in Table I. Meanwhile, Figs. 7 and 8 show the abundance maps and the corresponding endmember results extracted by the different algorithms on the synthetic dataset. On the synthetic dataset, SGSNMF achieves better unmixing results than FCLSU and MLNMF. SGSNMF uses spatial information to divide the HSI into spatial groups and incorporates a spatial group sparsity constraint into nonnegative matrix factorization, which results in a superior decomposition structure. Although SNMF-Net is constructed based on a nonnegative matrix model with L-p sparse constraints, it does not take the spatial information into account and therefore shows no significant advantage on the synthetic data. The uDAS algorithm achieves promising results in abundance estimation by incorporating denoising constraints and attaching certain physical constraints. DAEU, on the other hand, only focuses on spectral information and ignores spatial information. Although CNNAEU and CyCU-Net consider spatial information, they ignore the inter-SV, which leads to their unsatisfactory performance on the synthetic dataset. The experimental results of these algorithms demonstrate the importance of taking both spectral and spatial information into consideration to obtain accurate HU results. In contrast, the proposed SSF-Net achieves superior results on the synthetic data, which demonstrates its advantage in the unmixing task.
To verify the robustness of the proposed SSF-Net, noise at SNR levels ranging from 20 to 40 dB is added to the synthetic dataset. The corresponding quantitative results are presented in Table II. In general, the unmixing accuracy of these algorithms tends to decrease as the noise increases. The classical algorithm SGSNMF performs well on the synthetic dataset. Benefiting from its denoising processing, the uDAS network exhibits minimal variation in unmixing accuracy under different noise conditions. In contrast, SNMF, CNNAEU, and CyCU-Net perform poorly in high-noise situations. Most importantly, the proposed SSF-Net obtains remarkable accuracy in both abundance estimation and endmember extraction under varying noise levels, which demonstrates its effectiveness and robustness.
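The noise experiment above can be sketched as follows. This is a minimal illustration that assumes zero-mean white Gaussian noise and an SNR defined as the ratio of mean signal power to mean noise power; the paper's exact noise model is not stated in this section, and the function names are hypothetical.

```python
import numpy as np


def add_noise_at_snr(hsi, snr_db, rng=None):
    """Return a copy of `hsi` corrupted by Gaussian noise at the target SNR (dB)."""
    rng = np.random.default_rng() if rng is None else rng
    hsi = np.asarray(hsi, dtype=float)
    signal_power = np.mean(hsi ** 2)
    # SNR_dB = 10 * log10(signal_power / noise_power), solved for noise_power.
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=hsi.shape)
    return hsi + noise


def empirical_snr_db(clean, noisy):
    """Measure the SNR (dB) actually realized between a clean and a noisy cube."""
    clean = np.asarray(clean, dtype=float)
    noisy = np.asarray(noisy, dtype=float)
    return float(10.0 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cube = rng.uniform(0.0, 1.0, size=(50, 50, 20))  # hypothetical rows x cols x bands
    for snr in (20, 30, 40):
        noisy = add_noise_at_snr(cube, snr, rng)
        print(snr, round(empirical_snr_db(cube, noisy), 2))
```

Lower SNR values correspond to stronger noise, so the 20 dB setting is the hardest case in such a sweep.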
2) Samson Dataset: The RMSE and the per-endmember SAD obtained by the different algorithms on the Samson dataset are presented in Table III. Figs. 9 and 10 show the abundance maps and the corresponding endmember results extracted by the different algorithms on the Samson dataset. The Samson dataset is characterized by a relatively uniform spatial distribution of different materials, so it is widely regarded as a relatively simple unmixing dataset. In general, all the algorithms achieve relatively good results. However, despite the promising results obtained by SGSNMF on the synthetic dataset, its performance on the Samson dataset is subpar. This discrepancy can be attributed to the complex distribution of ground objects and the non-Gaussian nature of the noise in real scenes. Moreover, the unmixing accuracy of all classical algorithms is lower than that of the deep learning models. Specifically, the proposed SSF-Net achieves the best RMSE for each land cover category, and it also outperforms all other methods in terms of the mean SAD over all endmembers.
3) Jasper Dataset: The RMSE and the per-endmember SAD obtained by the different algorithms on the Jasper dataset are presented in Table IV. Figs. 11 and 12 show the abundance maps and the corresponding endmember results extracted by the different algorithms on this dataset. The distribution of ground objects in the Jasper dataset is more complex than that in the Samson dataset. As shown in Fig. 11, unmixing algorithms such as FCLSU, MLNMF, SGSNMF, SNMF, and uDAS fail to extract all the roads accurately: some of the roads are misclassified as water bodies, which degrades the final unmixing accuracy. In contrast, the deep learning algorithms are better able to identify roads and, thus, generate more accurate abundance maps. As can be seen in Table IV, the proposed SSF-Net achieves excellent results on the Jasper dataset. The quantitative results show that it attains the best value for every accuracy metric except SAD_Tree, indicating that SSF-Net performs excellently in the unmixing task.
4) Urban Dataset: The RMSE and the per-endmember SAD obtained by the different algorithms on the Urban dataset are presented in Table V. Figs. 13 and 14 show the abundance maps and the corresponding endmember results extracted by the different algorithms on the Urban dataset. Among the four experimental datasets, the Urban dataset is the most heavily mixed in terms of the mixing degree of ground objects. Visually, the abundance maps generated by the proposed SSF-Net exhibit the highest degree of similarity to the real abundance maps. From the quantitative results, SSF-Net achieves the best accuracy in terms of the Mean_SAD

D. Computational Cost
Table VI presents the run time of all comparison methods on the different datasets. It can be seen that the classical methods, FCLSU, MLNMF, and SGSNMF, have a relatively simple computational process and, thus, a low time overhead. In contrast, deep learning models usually run significantly longer than classical methods due to their complex network structures and large number of parameters. Among these deep learning models, SNMF, uDAS, and DAEU are pixel-level unmixing networks. Since they unmix pixel by pixel, their processing time increases with the number of pixels. CNNAEU and CyCU-Net, by contrast, are spatial-level unmixing networks, which capture spatial context information by introducing a receptive field. However, it is precisely this receptive field that causes their time overhead to significantly exceed that of the pixel-level unmixing networks. Overall, the time expenditure of the proposed method lies between those of the pixel-level and spatial-level unmixing network models mentioned above.
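The effect of the receptive field on cost can be made concrete with a rough operation count. The sketch below assumes a single stride-1, same-padded 2-D convolution layer and compares a 1 x 1 kernel (pixel-level processing) with a 3 x 3 kernel (spatial-level processing). The layer sizes are hypothetical and are not taken from any of the compared networks.

```python
def conv_macs(height, width, c_in, c_out, kernel_size):
    """Multiply-accumulate count of one stride-1, same-padded 2-D convolution."""
    # Each output value needs c_in * kernel_size^2 multiply-accumulates,
    # and there are height * width * c_out output values.
    return height * width * c_in * c_out * kernel_size * kernel_size


# Hypothetical layer: image size, input bands, and hidden channels for illustration only.
H, W, BANDS, HIDDEN = 95, 95, 156, 48

pixel_level = conv_macs(H, W, BANDS, HIDDEN, kernel_size=1)    # no spatial context
spatial_level = conv_macs(H, W, BANDS, HIDDEN, kernel_size=3)  # 3 x 3 receptive field

print(pixel_level, spatial_level, spatial_level / pixel_level)
```

Under these assumptions the per-layer cost grows with the square of the kernel size, which is consistent with spatial-level networks running longer than pixel-level ones; actual run times also depend on depth, batching, and hardware.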
The network model proposed in this study also constructs an SSFM module in the encoder. This module contains spatial attention and spectral attention mechanisms, which help to take into account the spatial contextual information in the HSI data and the spectral disparity information between the pseudoendmembers. Although the attention mechanisms increase the runtime overhead of the model, the proposed method significantly improves the unmixing accuracy.

E. Ablation Analysis
To verify the effectiveness of each module in SSF-Net, ablation experiments are conducted in this section for the spectral attention module and the spatial attention module on the Urban dataset. As can be seen from Table VII, when both the spectral attention module in the spectral branch and the spatial attention module in the spatial branch are removed, the network performs worst, which indicates to a certain extent that the potential information contained in the HSI data is not fully exploited.
In SSF-Net, adding the spectral attention module enhances the network's abundance estimation, mainly because the module focuses on the distinct disparities between different spectral features. The spatial attention module, on the other hand, is more sensitive to the differences between spatial regions, which significantly improves the endmember extraction accuracy of the network. The results of the ablation experiments show that the joint use of the spectral attention module and the spatial attention module in SSF-Net is more effective in mining high-dimensional features from the HSI, thus yielding better unmixing results.

F. Discussion
A quantitative analysis of the experimental results on the four datasets shows that SSF-Net delivers markedly better unmixing performance than the other comparative methods. Meanwhile, the spatial distribution of features in real images is far more complex than that in the synthetic dataset, which results in the poor performance of some NMF-based unmixing methods on the real datasets. SNMF-Net is highly physically interpretable because it is built by unrolling an L-p sparsity-constrained NMF model, so it achieves higher accuracy than the other NMF methods on the real datasets. However, because it performs unmixing only at the pixel level without introducing spatial information, it does not achieve particularly good unmixing accuracy. DAEU is also a pixel-level unmixing method. It pays more attention to endmember extraction, and since a good endmember extraction result further improves the abundance estimation, it also achieves relatively good unmixing accuracy in these comparative experiments. Although CNNAEU introduces spatial information, its loss function considers only SAD, which makes its unmixing results worse than those of the other DL unmixing networks. The accuracy of CyCU-Net is poor because its receptive field does not cover the complete image, so it lacks global information and long-range dependencies. Most importantly, the proposed method achieves the best accuracy on the real datasets, and its unmixing results are visually the closest to the ground truth. This further confirms the strength of the proposed network model in the unmixing task.

V. CONCLUSION
In this study, a convolutional AE HU network called SSF-Net is proposed that integrates both spatial and spectral features. The network first employs a regional VCA algorithm to extract the pseudoendmembers from the HSI data. Then, it uses the spatial attention module and the spectral attention module to learn the spatial difference information contained in the HSI data and the spectral difference information among the pseudoendmembers, respectively. In this way, the network makes the best use of the information inherent in the HSI data, resulting in more reasonable and superior unmixing results. Experiments on synthetic and real hyperspectral datasets confirm the effectiveness of the proposed SSF-Net, which attains higher unmixing accuracy than other state-of-the-art HU methods. The proposed SSF-Net is built on the LMM. However, considering the complexity of the hyperspectral imaging process, the NLMM is more suitable for describing its imaging principles. Thus, our future research aims to develop more general and powerful NLMM-based unmixing networks that integrate the spatial-spectral features. Meanwhile, the introduction of LiDAR data to aid HU has been shown to be feasible. Thus, designing a network architecture that fuses the spatial-spectral features of the HSI with those extracted from the LiDAR point cloud can help to address the unmixing problem more effectively.
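For readers less familiar with the LMM mentioned above, the following minimal sketch assumes its standard noise-free form, in which each pixel spectrum is a nonnegative, sum-to-one combination of the endmember spectra. The function name and array layout are hypothetical, and the sketch does not reproduce the SSF-Net decoder.

```python
import numpy as np


def lmm_mix(endmembers, abundances):
    """Synthesize pixel spectra under the linear mixing model.

    endmembers: array of shape (B, P), one column per endmember spectrum.
    abundances: array of shape (N, P), one row of abundances per pixel.
    Returns an array of shape (N, B) with the mixed pixel spectra.
    """
    M = np.asarray(endmembers, dtype=float)
    A = np.asarray(abundances, dtype=float)
    # Physical constraints usually imposed on abundances: nonnegativity and sum-to-one.
    if np.any(A < 0) or not np.allclose(A.sum(axis=1), 1.0):
        raise ValueError("abundances must be nonnegative and sum to one per pixel")
    # Each pixel is a linear combination of the endmember spectra: x = M a.
    return A @ M.T


if __name__ == "__main__":
    M = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])   # 3 bands, 2 endmembers (toy values)
    A = np.array([[0.25, 0.75]])  # one pixel
    print(lmm_mix(M, A))
```

A nonlinear mixing model would replace the single matrix product with a function that also accounts for interactions between endmembers, which is the direction outlined for future work.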

Fig. 1. Schematic diagram of an AE. The abstract features are first obtained by encoding the input data, and then the abstract features are decoded to reconstruct the input data.

Fig. 7. Abundance maps of five materials from the synthetic data obtained by different algorithms.

Fig. 8. Comparison of endmembers between SSF-Net (blue curves) and the corresponding GT (red curves) on the synthetic dataset.

Fig. 9. Abundance maps of three materials from the Samson data obtained by different algorithms.

Fig. 10. Comparison of endmembers between SSF-Net (blue curves) and the corresponding GT (red curves) on the Samson dataset.

Fig. 11. Abundance maps of four materials from the Jasper Ridge data obtained by different algorithms.

Fig. 12. Comparison of endmembers between SSF-Net (blue curves) and the corresponding GT (red curves) on the Jasper dataset.

Fig. 13. Abundance maps of five materials from the Urban data obtained by different algorithms.

Fig. 14. Comparison of endmembers between SSF-Net (blue curves) and the corresponding GT (red curves) on the Urban dataset.

TABLE II QUANTITATIVE RESULTS OF MEAN_RMSE AND MEAN_SAD FOR THE SYNTHETIC DATASET UNDER DIFFERENT NOISES

TABLE III QUANTITATIVE RESULTS FOR THE SAMSON DATASET

TABLE IV QUANTITATIVE RESULTS FOR THE JASPER DATASET

TABLE VI COMPUTATIONAL COST OF ALL COMPARISON METHODS ON DIFFERENT DATASETS IN TERMS OF SECONDS (S)

TABLE VII ABLATION ANALYSIS OF SSF-NET ON URBAN DATASET COMBINED WITH DIFFERENT NETWORK MODULES