Enhanced Spectral–Spatial Residual Attention Network for Hyperspectral Image Classification

Deep learning has achieved good performance in hyperspectral image classification (HSIC). Many deep learning-based methods use deep and complex network structures to extract rich spectral and spatial features from hyperspectral images (HSIs) with high accuracy. In this process, accurately extracting the features and information from pixel blocks in HSIs is important. However, all spectral features are usually treated equally in classification, and the input of the network often contains much useless pixel information, which lowers classification accuracy. To solve this problem, an enhanced spectral-spatial residual attention network (ESSRAN) is proposed for HSIC in this article. In the proposed network, the spectral-spatial attention network (SSAN), residual network (ResNet), and long short-term memory (LSTM) are combined to extract more discriminative spectral and spatial features. More specifically, SSAN is first applied to extract image features: the spectral attention module emphasizes useful bands and suppresses useless bands, and the spatial attention module emphasizes pixels that belong to the same category as the central pixel. Then, the obtained features are fed into an improved ResNet, which adopts LSTM to learn representative high-level semantic features of the spectral sequences, while the residual structure prevents gradient vanishing and explosion. The proposed ESSRAN model is evaluated on three commonly used HSI datasets and compared with several state-of-the-art methods. The results confirm that ESSRAN effectively improves classification accuracy.


I. INTRODUCTION
Hyperspectral images (HSIs) contain an abundance of narrow and contiguous spectral bands ranging from the visible to the near-infrared and even thermal infrared, and thus capture plentiful physical properties. The 3-D data block of HSIs also contains extensive detailed spatial distribution information. Both spectral signatures and spatial information can be used to accurately characterize and identify the types of objects of interest, resulting in great potential for land cover identification [1], [2], [3], [4]. Hyperspectral image classification (HSIC), which aims to identify the category of each hyperspectral pixel, has been used in many applications, such as geological exploration [5], [6], urbanization analysis [7], precision agriculture [8], environmental monitoring [9], change detection [10], [11], and target detection [12].
In early studies of HSIC, machine learning-based methods, such as support vector machines (SVMs) [13], random forests [14], decision trees [15], neural networks (NNs) [16], and logistic regression [17] were dominant. However, these methods simply extract shallow features based on the spectral information of the HSI, using one single pixel and all of its bands as input. Thus, these linear and nonlinear classifiers do not adapt well to the high dimensionality of the spectrum, limiting their application [18]. Feature extraction (FE) methods are well adapted to the high-dimensionality of the spectrum by mapping the raw HSIs to a low-dimensional space. Some of the more advanced FE methods are geodesic-based sparse manifold hypergraph [19] and multistructure unified discriminative embedding [20], etc. These methods classify HSIs by converting them into low-dimensional structures and extracting the sparse relationships and discriminative features from different structures. When deep learning was introduced into HSIC, it achieved remarkable performance. Typical deep learning-based classification methods include deep belief networks [21], sparse autoencoders [22], recurrent neural networks (RNNs) [23], convolutional neural networks (CNNs) [24], and so on. Different from traditional machine learning algorithms, these deep learning-based methods can automatically extract high-level semantic information from HSIs with no handcrafted FE. Among them, CNN can simultaneously extract high-level spectral and spatial features by convolution, showing better classification performance. This spectral-spatial classification method has gradually developed to solve the complex spatial distribution problem in HSIs and obtain higher classification accuracy [25]. The 1-D CNN model is designed to use the pixel vector along the radiometric dimension as a training sample to extract deep features, which is conceptually called the spectral-based classification approach. 
A 2-D CNN, which is called a spatial-based classification approach, learns spatial information by a convolution operation on the spatial dimension. 3-D CNN combines the advantages of 1-D CNN and 2-D CNN and can extract diagnostic spectral and spatial information from 3-D hypercubes with spectral and spatial continuity, which is also called the spectral-spatial classification method [26].
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The 3-D CNN takes a cube containing the target pixel and several adjacent pixels as input. This cube often contains pixels labeled differently from the central pixel, as well as bands with redundant information. Such bands and pixels degrade CNN classification [27]. Therefore, it is critical to select the informative bands and pixels that are beneficial to HSIC in the end-to-end classification process. Information that facilitates classification should be focused on, while bands with redundant information and pixels with labels different from that of the target pixel should be suppressed. To solve this problem, the spectral-spatial attention (SSA) mechanism is used in HSIC to learn dependent spectral and spatial features. SSA is composed of a spectral attention (SpeA) module and a spatial attention (SpaA) module. It assigns high weights to useful bands and pixels for feature enhancement of the original image [28]. The attention mechanism comes from the study of human vision, where people selectively focus on useful information of interest and ignore other visible information. Thus, this mechanism increases the sensitivity to features that contain the most valuable information. It was first applied to machine translation [29] and later was also widely used in natural language processing [30], image recognition [31], [32], and speech recognition [33]. Mei et al. first introduced the SSA mechanism into hyperspectral classification to capture the high spectral correlation between adjacent spectra and learn spatial dependence in the spatial domain [34]. Later, the attention mechanism was improved or combined with other network structures to improve the classification accuracy of HSI [35]. Pan et al.
[36] designed a joint network with a spectral attention bidirectional RNN branch and a spatial attention CNN branch to extract spectral and spatial features for HSIC. Zhu et al. [37] embedded an SSA module into a residual block to avoid overfitting and accelerate training. Lu et al. [38] used a multiscale spatial-spectral residual network to stack the extracted deep multiscale features and input them into a 3-D attention module to improve the classification accuracy. Some researchers combined SSA with graph convolutional networks (GCNs) [39] to adaptively extract spatial and spectral features from neighboring nodes through a graph attention mechanism [40], [41]. Evidently, using SSA to extract spectral and spatial dependent features and then feeding them into a deep network model for classification is both feasible and advantageous.
The depth of the network is critical to the performance of most models. When the number of network layers is increased, the network extracts more complex features, so theoretically better results could be achieved. However, many experiments have shown that as the depth of the network increases, the CNN model exhibits degradation problems, leading to poor results. Thus, He et al. [42] proposed the residual network (ResNet) for the classification task of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2015. The main contribution of ResNet is the discovery of "degradation" and the invention of a "shortcut connection" that addresses this degradation phenomenon and greatly alleviates the training difficulty of deep NNs. Subsequently, ResNet has been added to deep network models by many scholars to classify HSIs in combination with CNNs. Jiang et al. [43] combined a 3-D separable ResNet with cross-sensor transfer learning to reduce training parameters and achieve better classification performance. Meng et al.
[44] proposed a multipath ResNet that employed multiple residual functions in the residual blocks to make the network wider. Li et al. [45] proposed a depthwise separable ResNet, which can separate both spectral and spatial information and also greatly reduce the network size. The residual network will continue to be used in HSIC due to its powerful feature transformation capability.
Although SSA and ResNet have powerful FE and generalization capabilities, they ignore the potential relationships between adjacent bands, so important spectral features go undetected. ResNet takes the SSA-transformed 3-D feature maps as a whole for training and connects upper-level nodes to lower-level nodes through weights; thus, it ignores the relationships between nodes of the same layer. Long short-term memory (LSTM), a deep learning algorithm mainly used to handle sequence data, can solve this problem. LSTM models the strong dependence between a given sample and the previous one: the activation at each step depends on the previous step in the hidden layer [46]. The simplest way to classify HSIs using LSTM is to use each band of the pixel spectrum as input data at the corresponding time step, serialize the spectral vector band by band, and then extract potential information. Zhou et al. [47] input row vectors of image blocks centered on target pixels into the LSTM model for hyperspectral classification. Liu et al. [48] proposed a bidirectional-convolutional LSTM network to automatically learn the spectral and spatial features from HSI. Tang et al. [49] combined GCNs with bidirectional LSTM to extract both short and long spatial relationships for HSIC. These studies show that adding LSTM to the HSIC network model helps improve the classification accuracy.
Based on the above analysis, we propose an enhanced spectral-spatial residual attention network (ESSRAN) algorithm for HSIC. Moreover, small training samples are selected to test the network, which fully demonstrates the advantages of the proposed method, and the pixel cluster (PC) approach is used to solve the problem of insufficient number of training samples for some categories. This network combines the advantages of SSA, ResNet and LSTM, improving the capabilities of spectral and spatial feature learning and the accuracy of classification. The main contributions of this article are as follows.
1) To address the problem that hyperspectral cubes often contain redundant pixels and bands, the SSA module is applied to extract discriminative and robust spectral and spatial features. In the spectral dimension, it generates a spectral weight vector emphasizing useful bands to improve the performance of classification. In the spatial dimension, it adaptively emphasizes the spatial information of pixels with the same label as the central pixel by generating a spatial weight matrix that represents the significance of neighborhood pixels.
2) To extract potential relationships between adjacent bands, LSTM is added to the ResNet module to obtain the interdependence of long-range nonlinear channels. Specifically, convolution and LSTM operations are used in ResNet to extract the required spectral and spatial information. The feature map after convolution is arranged as spectral sequence data, which is then fed into the LSTM to obtain the relationship between the bands.
3) To adapt to small-sample HSIC, we use the PC method to expand the training samples. This method regroups the training samples in order to obtain new pixel blocks. These pixel blocks are superimposed on the spectral dimension so that the new data block is of the same size as the original one. This method effectively improves the classification accuracy for classes with a small number of samples.
4) We experimentally demonstrate the effectiveness of the proposed deep network modules and illustrate that the proposed ESSRAN outperforms eight compared methods on three HSI datasets.
The rest of this article is organized as follows. Section II introduces the proposed method. Section III evaluates the effectiveness of the proposed method on real hyperspectral datasets, and Section IV draws the conclusion.

II. PROPOSED METHOD
In this section, the framework of the proposed method for HSIC is first described in detail. Second, each basic model in the network is introduced in turn, including SSA, ResNet, and LSTM. Finally, a pixel-cluster-based training sample increasing method is presented in detail.

A. Overview of the Proposed Model
Let $X_{hsi} \in \mathbb{R}^{H \times W \times B}$ represent the original HSI data, where $H$, $W$, and $B$ represent the height and width of the spatial dimensions and the number of spectral bands, respectively. Suppose that the dataset $X_{hsi}$ contains $N$ labeled pixels $X = \{x_1, x_2, \ldots, x_N\} \in \mathbb{R}^{1 \times 1 \times B}$, and their corresponding set of one-hot label vectors $Y = \{y_1, y_2, \ldots, y_N\} \in \mathbb{R}^{1 \times 1 \times K}$, where $K$ is the number of classes. The regions of size $S \times S$ centered at each pixel $x_i$ can be defined as spectral-spatial vectors $Z = \{z_1, z_2, \ldots, z_N\} \in \mathbb{R}^{S \times S \times B}$. In this article, each patch cube $z_i$ in $Z$ is used as input to the proposed model to classify its corresponding center pixel $x_i$ in the HSI [50].
After the HSI data are defined, all available labeled data are randomly divided into training and test datasets denoted by $Z_{train}$ and $Z_{test}$, respectively, and the corresponding label sets are denoted by $Y_{train}$ and $Y_{test}$, respectively. Then, $Z_{train}$ is used to optimize the hyperparameters of the proposed model and obtain the best-trained model through cross-validation. Finally, the best-trained model is applied to $Z_{test}$ to obtain three evaluation metrics of performance and to classify all pixels to form a classification map.
Fig. 1 shows the framework of the proposed ESSRAN network. First, the principal component analysis (PCA) algorithm is used to perform feature transformation on the original HSI, and then a pixel-centric 3-D patch is extracted as the input of the proposed network [51]. Second, SSA is used to extract spectral and spatial features. The spectral attention module assigns a greater weight to the key channels and a smaller weight to the less important channels. The spatial attention module similarly uses a weight matrix to enhance the information of pixels with the same label as the central pixel and weaken that of pixels with different labels. Third, the improved ResNet, which adopts an LSTM with a strong ability to capture contextual information in the spectral sequence, is used to extract more representative and discriminative semantic features. Finally, a fully connected layer with a softmax function is used for classification.
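The pixel-centric patch extraction described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `extract_patch`, the toy cube sizes, and the zero padding at image borders are assumptions (the source does not state how edge pixels are handled), and the PCA step is omitted.

```python
import numpy as np

def extract_patch(cube, row, col, size):
    """Return the size x size x B patch centered on pixel (row, col).

    Border pixels are handled by zero padding, which is an assumption
    made for this sketch only.
    """
    if size % 2 != 1:
        raise ValueError("patch size must be odd so that a center pixel exists")
    half = size // 2
    # Pad only the two spatial axes; the spectral axis is left untouched.
    padded = np.pad(cube, ((half, half), (half, half), (0, 0)), mode="constant")
    # Original pixel (row, col) sits at (row + half, col + half) after padding,
    # so the patch starts at (row, col) in padded coordinates.
    return padded[row:row + size, col:col + size, :]

# Toy cube with H=6, W=5, B=4 (hypothetical sizes).
cube = np.arange(6 * 5 * 4, dtype=np.float64).reshape(6, 5, 4)
patch = extract_patch(cube, row=0, col=2, size=3)
```

The center of the returned patch is the spectrum of the target pixel, so the label of that pixel serves as the label of the whole patch.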

B. Spectral-Spatial Attention Network
The SSA network (SSAN) extracts deep spectral and spatial features from the patch cube $z$ by enhancing useful information and suppressing the effects of interfering information. This is an adaptive attention mechanism, which extracts the weight vector $w$ from the patch cube $z$ itself. The weight vector is a significant spectral and spatial feature. The final SSAN output is represented by $f_{SSAN}$ and is formulated as follows:
$$f_{SSAN} = \sigma(w \otimes z + b_{SSAN})$$
where $\sigma(\cdot)$ represents the activation function, $b_{SSAN}$ represents the bias, and $\otimes$ represents the matrix multiplication. SSAN is composed of two modules: the spectral attention module and the spatial attention module, as represented in Fig. 2. The spectral attention module is utilized to extract spectral features from the patch cube $z$. The spatial attention module is utilized to capture spatial features from the output of the spectral attention module [52].

1) Spectral Attention Module:
The SpeA mechanism emphasizes the spectral bands that help in the extraction of features and the final classification. The SpeA module is abstracted into three procedures: feature aggregation, feature transformation, and feature enhancement [53].
Feature aggregation calculates the average value of the patch cube $z$ in the spatial dimension as the weight of the corresponding spectral dimension. Specifically, the input feature maps $F_{spec\_in} \in \mathbb{R}^{S \times S \times B}$ are fed into an average pooling layer and a new feature map $F_{spec1} \in \mathbb{R}^{1 \times 1 \times B}$ is obtained:
$$F_{spec1} = \mathrm{AvgPool}(F_{spec\_in})$$
Feature transformation learns nonlinear channelwise inner relationships by a multilayer perceptron (MLP) module. The MLP module has two linear fully connected layers $FC$, a ReLU activation function $\sigma_{ReLU}$, and a sigmoid activation function $\sigma_{sigmoid}$. The bottleneck ratio $r$ of the MLP is set to 2 to reduce the computational cost and prevent overfitting. The feature transformation operation in the MLP is expressed as
$$F_{spec2} = \sigma_{sigmoid}(FC(\sigma_{ReLU}(FC(F_{spec1}))))$$
where $F_{spec2} \in \mathbb{R}^{1 \times 1 \times B}$ represents the output spectral attention map. Feature enhancement multiplies the converted spectral features $F_{spec2}$ with the original input $F_{spec\_in}$ to obtain the feature map with enhanced spectral information $F_{spec\_out} \in \mathbb{R}^{S \times S \times B}$:
$$F_{spec\_out} = F_{spec2} \otimes F_{spec\_in}$$

2) Spatial Attention Module:
The SpaA mechanism enhances spatial information from the neighborhood pixels with the same class label as the center pixel, while it suppresses the information from those with different labels. Similar to SpeA, SpaA also has three procedures [54].
Feature aggregation extracts the average and maximum values of each pixel spectrum from the input feature maps $F_{spa\_in} \in \mathbb{R}^{S \times S \times B}$ and obtains new feature maps $F_{spa1} \in \mathbb{R}^{S \times S \times 1}$ and $F_{spa2} \in \mathbb{R}^{S \times S \times 1}$, respectively:
$$F_{spa1} = \mathrm{AvgPool}(F_{spa\_in}), \quad F_{spa2} = \max(F_{spa\_in})$$
where max represents the maximum operation. Feature transformation concatenates the above two feature maps as the input of a new convolutional layer followed by a sigmoid activation function, obtaining the output attention map
$$F_{spa3} = \sigma_{sigmoid}(k * [F_{spa1}, F_{spa2}])$$
where $*$ is the convolution operation and $k$ is the convolution kernel.
Finally, feature enhancement combines the attention map $F_{spa3}$ and the input map $F_{spa\_in}$ and obtains the output map
$$F_{spa\_out} = F_{spa3} \otimes F_{spa\_in}$$
which contains the spatial features of all the positions and highlights the information of important spatial locations.
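The two attention modules described above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the weights are random instead of learned, the 3 × 3 kernel size is hypothetical (the source does not give it), the multiplication in the enhancement step is implemented as elementwise multiplication with broadcasting, and the function names are invented for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spectral_attention(f_in, w1, b1, w2, b2):
    """f_in: (S, S, B); w1: (B, B // r); w2: (B // r, B)."""
    f1 = f_in.mean(axis=(0, 1))              # feature aggregation -> (B,)
    hidden = np.maximum(f1 @ w1 + b1, 0.0)   # first FC layer + ReLU
    f2 = sigmoid(hidden @ w2 + b2)           # second FC layer + sigmoid -> (B,)
    return f_in * f2                         # one weight per band, broadcast over space

def spatial_attention(f_in, kernel, bias=0.0):
    """f_in: (S, S, B); kernel: (k, k, 2) with odd k."""
    f1 = f_in.mean(axis=2)                   # average over the spectrum -> (S, S)
    f2 = f_in.max(axis=2)                    # maximum over the spectrum -> (S, S)
    stacked = np.stack([f1, f2], axis=2)     # concatenated maps -> (S, S, 2)
    k = kernel.shape[0]
    half = k // 2
    padded = np.pad(stacked, ((half, half), (half, half), (0, 0)))
    rows, cols = f1.shape
    response = np.empty((rows, cols))
    # "Same" convolution, written as cross-correlation as in most DL frameworks.
    for i in range(rows):
        for j in range(cols):
            response[i, j] = np.sum(padded[i:i + k, j:j + k, :] * kernel) + bias
    f3 = sigmoid(response)                   # spatial attention map -> (S, S)
    return f_in * f3[:, :, None]             # one weight per pixel, broadcast over bands

rng = np.random.default_rng(0)
S, B, r = 9, 8, 2                            # toy sizes; r is the bottleneck ratio
z = rng.random((S, S, B))
w1, b1 = rng.normal(size=(B, B // r)), np.zeros(B // r)
w2, b2 = rng.normal(size=(B // r, B)), np.zeros(B)
kernel = rng.normal(size=(3, 3, 2))
f_spec_out = spectral_attention(z, w1, b1, w2, b2)
f_spa_out = spatial_attention(f_spec_out, kernel)
```

Both outputs keep the S × S × B shape of the input, so the modules can be chained as in Fig. 2, with the spatial module operating on the output of the spectral module.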

C. Modified Residual Network
ResNet was proposed to solve the problem that the accuracy of a CNN decreases substantially with increasing network depth. A CNN is an NN that extracts nonlinear spectral and spatial features through convolution, pooling, and activation functions. The convolution layer uses convolutional operations to extract deep features in the spectral and spatial dimensions; the pooling layer, which includes average pooling and maximum pooling, reduces the complexity of the network and improves the computational speed; and the activation function, such as the sigmoid function or the ReLU function, improves the ability of the CNN to deal with nonlinear problems [55]. Because all of these operations are nonlinear, it is difficult to achieve an identity mapping between the nonlinear feature map extracted by the deep CNN and the desired label. ResNet connects the original feature map $x$ with the optimized feature map $F(x)$, seeking a balance between linear and nonlinear transformations:
$$H(x) = F(x) + x$$
where $H(x)$ is the desired underlying feature map [42].
To propagate information backward and forward in the network, a deep ResNet is formed by stacking multiple BasicBlocks together. One BasicBlock contains two convolution layers, two batch normalization layers, and two activation functions, while the modified ResNet retains only half of these operations. The feature map after the convolution and batch normalization operations is input to the LSTM module to obtain the contextual relationships between adjacent spectra:
$$x_1 = \mathrm{batch\_norm}(\mathrm{conv}(x))$$
where $x_1$ refers to the feature map extracted by the convolution operation, batch_norm is batch normalization, and conv is the convolution. The channel relationship characteristic obtained by the LSTM is multiplied by $x_1$ to obtain the final feature map. Considering the entire network proposed in this article, spectral and spatial features are extracted using only one residual operation, reducing the redundancy of the original structure. This not only effectively utilizes the advantages of ResNet, but also reduces the computation time and increases the discriminative power of the model.

D. Long-Short Term Memory
LSTM overcomes the problem of gradient explosion or vanishing that RNNs face when dealing with long sequence data. LSTM has a chain-like structure that includes a forget gate, an input gate, and an output gate. The forget gate decides whether to consider the previous cell state; the input gate decides what new information is stored in the cell state; and the output gate regulates the amount of data passed to the next layer. The cell state carries information from the first time step to the last time step, i.e., the footprint of all inputs. Each gate uses a sigmoid activation function, whose output lies between 0 and 1, where 0 indicates forget. The structure of LSTM is shown in Fig. 3. It can be observed that the LSTM cell constantly updates the hidden value and cell value with the help of the three control gates, which are used to discard, retain, or amplify signals to achieve information control and transformation. The calculation process in one LSTM cell at time $t$ is
$$f_t = \sigma(w_{xf} x_t + w_{hf} h_{t-1} + b_f)$$
$$i_t = \sigma(w_{xi} x_t + w_{hi} h_{t-1} + b_i)$$
$$\tilde{c}_t = \tanh(w_{xc} x_t + w_{hc} h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$o_t = \sigma(w_{xo} x_t + w_{ho} h_{t-1} + b_o)$$
$$h_t = o_t \odot \tanh(c_t)$$
where $f_t$, $i_t$, and $o_t$ represent the forget gate, input gate, and output gate, respectively. $\tilde{c}_t$ represents the candidate cell value. $x_t$, $h_t$, and $c_t$ represent the input, hidden, and cell states at time step $t$, respectively. $b_f$, $b_i$, $b_c$, and $b_o$ are bias terms. The weight matrix subscripts have conventional meanings. For instance, $w_{xo}$ is the input-output gate matrix and $w_{hi}$ is the hidden-input gate matrix. $\sigma(\cdot)$ is the activation function, and $\odot$ denotes elementwise multiplication [56], [57]. In HSIC, the spectral vector is serialized band by band, and each spectral band is used as input data for the LSTM model at the corresponding time step, extracting relationship information between the bands.
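As a sketch of how one pixel spectrum is processed band by band, the following NumPy code implements a single LSTM cell step and runs it over a toy spectrum. It assumes the standard LSTM formulation without peephole connections; the hidden size, the random weights, the 20-band spectrum, and the helper names `lstm_step` and `init_params` are all hypothetical choices made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One time step of a standard LSTM cell (no peephole connections)."""
    f_t = sigmoid(params["w_xf"] @ x_t + params["w_hf"] @ h_prev + params["b_f"])
    i_t = sigmoid(params["w_xi"] @ x_t + params["w_hi"] @ h_prev + params["b_i"])
    c_tilde = np.tanh(params["w_xc"] @ x_t + params["w_hc"] @ h_prev + params["b_c"])
    o_t = sigmoid(params["w_xo"] @ x_t + params["w_ho"] @ h_prev + params["b_o"])
    c_t = f_t * c_prev + i_t * c_tilde       # forget old content, add new content
    h_t = o_t * np.tanh(c_t)                 # expose part of the cell state
    return h_t, c_t

def init_params(input_size, hidden_size, rng):
    """Random small weights and zero biases for the four gate blocks f, i, c, o."""
    params = {}
    for gate in "fico":
        params[f"w_x{gate}"] = rng.normal(scale=0.1, size=(hidden_size, input_size))
        params[f"w_h{gate}"] = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
        params[f"b_{gate}"] = np.zeros(hidden_size)
    return params

rng = np.random.default_rng(0)
spectrum = rng.random(20)                    # one pixel with 20 bands (toy data)
params = init_params(input_size=1, hidden_size=4, rng=rng)
h, c = np.zeros(4), np.zeros(4)
for band_value in spectrum:                  # each band is one time step
    h, c = lstm_step(np.array([band_value]), h, c, params)
```

After the loop, `h` summarizes the whole spectral sequence, which is the kind of band-to-band context the text attributes to the LSTM branch.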

E. Theory of Pixel Cluster
In HSIC, problems such as high imbalance between the number of samples of categories and few known labels for some of the features are very common [58]. The small number of training samples of HSI limits the learning ability of deep learning-based models, which makes it difficult to extract the typical features and affects the classification accuracy. Therefore, the PC algorithm is proposed to solve this problem. Pixel clustering is a process of increasing the number of samples using the principle of permutation. This method selects multiple pixel blocks to be combined after disrupting the training samples, and then forms a new data block. A superposition operation is performed on the selected multiple data blocks in the spectral dimension so that the new data block is of the same size as the original block.
Suppose there is a class that has $n$ training samples. One PC is composed of $p$ pixels, which are randomly selected from the training samples. Let $n'$ denote the number of training samples after data augmentation, obtained by the principle of permutation. It is obvious that $n'$ is larger than $n$ when $p \neq 1$, which alleviates the shortage of training samples. In addition, the deep learning model can learn more diverse spatial information from the expanded training samples [59]. For categories with a large number of samples, sample expansion using the PC principle would lead to data redundancy and reduced accuracy. Therefore, only categories with below-average sample sizes are selected for the pixel clustering operation to improve the classification performance of the network. The effects of using PCs will be explained in detail in the experimental section.
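A rough sketch of the PC augmentation for a single class is given below. The source does not specify the superposition operator, so elementwise averaging of the selected blocks is assumed here purely for illustration because it keeps the new block the same size as the originals; the function name `pixel_cluster_augment` and the toy sizes are likewise hypothetical.

```python
import numpy as np

def pixel_cluster_augment(patches, p, n_new, rng):
    """Create n_new synthetic blocks from the patches of one class.

    patches: array of shape (n, S, S, B). Each new block combines p
    distinct, randomly chosen patches. Averaging is an assumed stand-in
    for the superposition operation described in the text.
    """
    n = patches.shape[0]
    if not 1 <= p <= n:
        raise ValueError("p must satisfy 1 <= p <= n")
    new_blocks = np.empty((n_new,) + patches.shape[1:], dtype=np.float64)
    for k in range(n_new):
        idx = rng.choice(n, size=p, replace=False)   # shuffle and pick p samples
        new_blocks[k] = patches[idx].mean(axis=0)    # same size as one original block
    return new_blocks

rng = np.random.default_rng(0)
class_patches = rng.random((5, 9, 9, 8))             # 5 training patches of a small class
augmented = pixel_cluster_augment(class_patches, p=2, n_new=15, rng=rng)
train_b = np.concatenate([class_patches, augmented], axis=0)
```

In this toy run a class with 5 training patches grows to 20, and only classes with below-average sample counts would be passed through such a function.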

III. EXPERIMENTAL RESULTS
In this section, we first introduce three experimental datasets and three factors that obviously influence the performance of the proposed model. After that, the results are compared with some state-of-the-art deep learning methods, fully proving the advantages of the proposed algorithm. Finally, the effects of SSA, LSTM, and PCs on the model are discussed separately.

A. Datasets
Three common HSI datasets, i.e., Indian Pines (IP), Pavia University (PU), and Salinas (SA), are considered in our experiments, as given in Table I. The numbers and names of each category, the number of training samples, and the total number of category samples for each of the three datasets are given in Table II. The false color image, ground truth map, and color code are depicted in Fig. 4. 1) Indian Pines: This dataset was captured by an airborne visible infrared imaging spectrometer (AVIRIS) sensor

B. Experimental Settings
We evaluated the performance of the proposed network model on a server with an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. The code implementation of all methods is based on Python 3.6 with PyTorch 1.7. Several evaluation indicators, including class-specific accuracy, overall accuracy (OA), average accuracy (AA), and kappa coefficient (kappa), are used to evaluate the proposed method.
TABLE II. NUMBER OF TRAINING AND TOTAL SAMPLES OF THE THREE DATASETS
TABLE III. ORIGINAL AND INCREASED NUMBER OF TRAINING SAMPLES
Approximately 5% of the samples are randomly selected as the training set for IP, and 1% of the samples are randomly selected as the training set for PU and SA. For categories with fewer samples, at least five samples are randomly selected for the training set, while the other samples are used as the test set. Each experiment is trained for 100 epochs on the training samples. Each experiment is repeated five times to eliminate bias from randomly selected training samples, and the mean and the standard deviation of each evaluation criterion are reported. In addition, the batch size is set to 32 [60].
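The evaluation indicators named above (class-specific accuracy, OA, AA, and kappa) can be computed from a confusion matrix as in the following sketch. It uses the standard definitions of these metrics; the function name `classification_metrics` and the toy labels are hypothetical, and the sketch assumes every class occurs at least once in the reference labels.

```python
import numpy as np

def classification_metrics(y_true, y_pred, num_classes):
    """Return per-class accuracy, OA, AA, and Cohen's kappa."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                              # rows: reference, columns: prediction
    total = cm.sum()
    per_class = np.diag(cm) / cm.sum(axis=1)       # class-specific accuracy
    oa = np.trace(cm) / total                      # fraction of correctly labeled pixels
    aa = per_class.mean()                          # unweighted mean over classes
    # Chance agreement estimated from the row and column marginals.
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1.0 - pe)
    return per_class, oa, aa, kappa

# Toy labels for three classes (hypothetical data).
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
per_class, oa, aa, kappa = classification_metrics(y_true, y_pred, num_classes=3)
```

Because AA weights every class equally, it is more sensitive than OA to errors on classes with few samples, which is why both are reported for imbalanced datasets such as IP.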
To address the inadequate number of labeled samples in hyperspectral datasets, experiments are performed using the PC method to add new samples. The number of added training samples is given in Table III. In the table, train A denotes the original training samples, and train B denotes the extended training samples. Samples are added only for categories whose number of training samples is smaller than the mean; otherwise, the original training samples are used for training. As seen from the table, the total number of training samples increases to 3 to 4 times the original number.
The following eight methods are used for comparison.
1) Support Vector Machine: A classical machine learning algorithm using kernel functions. The implementation is based on libsvm.
2) Long-Short Term Memory: A method for extracting spectral features by converting spectral values into sequence data.
3) 3-D Convolutional Neural Network: A method for directly extracting spectral and spatial information using 3-D convolutional operations. This method includes a 3-D convolutional layer and a fully connected layer.
4) HybridSN: This method is a hybrid spectral CNN that combines a 3-D CNN extracting spectral and spatial features with a 2-D CNN extracting abstract spatial features.
5) DHCNet: This method introduces deformable convolutional sampling locations based on a 2-D CNN, whose size and shape can be adaptively adjusted according to the complex spatial contexts of HSI.
6) Graph Convolutional Network: This method classifies HSIs by encoding them into graphs and using superpixels instead of pixels as nodes to simulate various spatial structures of land cover on the graphs.
7) RSSAN: This method first uses SSAN to extract spectral and spatial information, and then embeds the attention mechanism into ResNet to accelerate model training and extract features for classification.
8) A2S2K-ResNet: This method improves ResNet based on RSSAN. It extracts spectral and spatial features using selective 3-D convolution kernels and improved 3-D residual blocks, and adopts an efficient feature recalibration mechanism to improve classification performance.

C. Parameter Setting
In this part, three pivotal factors that influence the training progress and classification performance of the proposed model are analyzed. These factors are the learning rate, spatial size, and training size, which are called hyperparameters.
1) Learning Rate: The learning rate controls the rate of gradient descent and affects the convergence of training. A grid search approach is used to find the best learning rate of the proposed model on each dataset. Here, we consider the learning rate set {0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001}. The results of ESSRAN with different learning rates on the three datasets are shown in Fig. 5. Based on these results, the highest accuracy is achieved for the IP dataset when the learning rate is 0.005. For the PU and SA datasets, the highest accuracy is achieved with a learning rate of 0.01.
2) Spatial Size: The spatial size determines how much spatial information around the target pixel is used for FE. Thus, a large set of spatial input sizes {3, 5, 7, 9, 11, 13, 15} is used to evaluate the influence on the performance of the ESSRAN. As shown in Fig. 6, the accuracy on the PU dataset reaches its highest value when the spatial size is 9 × 9 and then decreases as the spatial size increases. For the IP dataset, the accuracy increases smoothly until the spatial size is approximately 11 × 11. For the SA dataset, the larger the spatial size is, the higher the accuracy. It follows that a data cube with a small spatial size cannot provide sufficient spatial information, while a large spatial size affects the classification accuracy due to the presence of other categories at the edges. Consequently, we choose a spatial size of 9 × 9 for the later classification experiments.
3) Training Size: The number of training samples plays a decisive role in supervised HSIC. Therefore, we analyzed the effect of different training sample sizes on the OA. Specifically, 1%, 3%, 5%, 10%, 15%, and 20% of the labeled pixels are selected as the training set to train the ESSRAN. As shown in Fig. 7, the OA increases as the training size increases for all three HSI datasets and all algorithms. Compared with the other eight methods, the proposed method performs the best for most of the training sizes. On the IP dataset in particular, the accuracy obtained by ESSRAN is significantly higher than that of the other methods when the number of training samples is small.

D. Classification Results
The experimental results for the IP dataset are shown in Fig. 8. To clearly show the differences, we place a locally enlarged patch in the corner of each result map, and the same is done for the PU and SA datasets. The proposed ESSRAN method obtains the best classification results visually, with nearly no misclassification. Both A2S2K-ResNet and GCN show impressive results, but GCN shows some consecutive misclassifications at the edges. Among the remaining methods, the CNN-based 3-D CNN, HybridSN, DHCNet, and RSSAN give better classification results than SVM and LSTM. Table IV gives the average OAs, AAs, and kappas (and their standard deviations based on five runs) for the IP dataset. It can be clearly seen that the ESSRAN has the highest OA, AA, and kappa among the nine methods. The average OA of the ESSRAN is 97.69%, AA is 97.19%, and kappa is 97.37%. The three metrics of ESSRAN also have the smallest standard deviations among all methods. The standard deviation of OA is only 0.13%, indicating that the method has the highest stability. In addition, the ESSRAN method achieves the highest classification accuracy in 11 of the 16 classes due to the extraction of more discriminative spatial and spectral features. The class-specific samples in the IP dataset are highly imbalanced. Four feature types (C1, C7, C9, and C16) have fewer than 100 labeled samples and only 5 training samples, in which case ESSRAN achieves the highest classification accuracy. In particular, the accuracy of "oats" (C9) is 89.13%, which is 7.46% higher than the highest accuracy of the other methods. This shows that the proposed algorithm has high recognition accuracy for types with few known samples. The PC algorithm increases the number of samples and contributes significantly to the improvement of classification accuracy. The SVM and LSTM provide worse results than the other methods because they use only spectral information and miss the spatial relationships.
The experimental results for the PU dataset are shown in Fig. 9. Compared with the ground truth, it can be seen that the proposed algorithm handles the details better and classifies accurately. The other algorithms have poor results on this kind of data with scattered feature types, in particular "asphalt" (C1) and "self-blocking bricks" (C8), which are often misclassified, as shown in the enlarged patches of the figure. Table V gives the classification results for the PU dataset. From this table, we can see that ESSRAN obtains the highest classification accuracy, with 95.87% for OA, 95.37% for AA, and 94.51% for kappa. The standard deviations of category accuracy show that categories with high accuracy generally have low standard deviations. The standard deviations of OA, AA, and kappa of the proposed method are the smallest among all methods, all less than 0.7, while those of the other methods are larger than 1. This indicates that the proposed method has high stability and can accurately identify the target feature types. Two-thirds of the categories have accuracies higher than 97%, and 3 categories obtain the highest category accuracy. The category with the most significant accuracy improvement is "bitumen" (C7), which improved by 7.95% over RSSAN. The high classification accuracy obtained with only 1% of the samples used for training shows that the proposed ESSRAN has a strong learning capability when the number of samples is small.
Fig. 10 shows the classification results of the SA dataset. It can be seen that the misclassified feature types are mainly "vinyard_untrained" (C15) and "grapes_untrained" (C8). The classification maps of SVM and LSTM show obvious dot-like noise and are the worst results, and the results of 3-D CNN and the two improved CNN-based methods, HybridSN and DHCNet, also contain many mislabeled pixels. Due to the use of a unique graph structure, GCN achieves visually smooth results, but in reality there are many misclassifications, such as "brocoli_green_weeds_2" (C2) being misclassified as "brocoli_green_weeds_1" (C1). The classification results of RSSAN and A2S2K-ResNet, which use an SSA mechanism, are better than those of the previous methods. In particular, A2S2K-ResNet, which adaptively adjusts the kernel size, achieves high accuracy. Compared with the other methods, ESSRAN generates the most accurate and smooth classification maps, especially at the boundary between two different classes. ESSRAN obtains the highest accuracy, with 98.34% for OA, 98.84% for AA, and 98.15% for kappa (see Table VI). Meanwhile, ESSRAN has a low standard deviation of accuracy, and the average standard deviation of the three metrics is 0.18%. Among the other methods, the maximum standard deviation is 2.42% for OA, 1.42% for AA, and 2.73% for kappa. The proposed method achieves a high level of category accuracy, with an accuracy of over 99% for 11 categories. The proposed method achieves the highest classification accuracy in 8 of the 16 classes, with the accuracy of three categories reaching 100%. The class with the highest accuracy improvement is "lettuce_romaine_7wk" (C14), with an accuracy of 99.34%, which is 2.99% higher than the other methods.

E. Ablation Study
To further validate the effectiveness of the different modules used in the proposed framework, we perform ablation experiments while keeping the other experimental settings unchanged. The four ablation settings are as follows.
1) The SSA, LSTM, and PC are removed from the proposed framework, and FE and classification are performed using ResNet alone.
2) The spectral attention, spatial attention, and spectral-spatial attention are each added separately to ResNet, and the resulting networks are denoted as SpeRAN, SpaRAN, and SSRAN, respectively.
3) The LSTM module is removed from the proposed framework (the resulting model is denoted as PC-SSRAN).
4) The PC module is removed from the proposed framework (the resulting model is denoted as LSTM-SSRAN).
Table VII gives the OA results of the ablation study. With the addition of SpeA and SpaA, the accuracy is improved compared to ResNet, and it is clear that SpeA has a greater effect on improving accuracy. After adding SSA, the accuracy on the IP, PU, and SA datasets increased by 3.4%, 4.23%, and 2.16% compared to ResNet, because the SSA module extracts diagnostic spectral and spatial information, which eliminates the effects of uncorrelated pixels and bands. Moreover, we compare ESSRAN with SSRAN, PC-SSRAN (without LSTM), and LSTM-SSRAN (without PC). The results show that the inclusion of both PC and LSTM is important for OA enhancement, and the accuracy on the IP, PU, and SA datasets is 0.75%, 0.49%, and 0.34% higher than that of SSRAN, respectively. It can be concluded that the sample increase and the extraction of the relationship between adjacent bands are of great significance for the improvement of classification accuracy. For the IP and PU datasets, the addition of the spatial attention mechanism does not bring accuracy improvement to SSRAN, but for the SA dataset, the spatial attention mechanism is indispensable. This is related to the complexity and characteristics of the dataset itself.
Table VIII gives the complexity of the different methods in terms of training time, the number of trainable weight parameters updated during backpropagation, and computational cost. The results show that the proposed method takes more time to train than the other methods due to the use of the LSTM-based cell structure. However, the training time of the proposed method is less than the sum of the training times of LSTM and RSSAN, which indicates that the proposed method does not add time cost beyond that of its component networks. Since the proposed algorithm does not have deep network layers, the number of parameters used for training is small (9.41 × 10^4), larger only than that of GCN. The computational cost is measured in floating-point operations (FLOPs, reported in units of 10^6). The results show that ESSRAN requires far fewer FLOPs than A2S2K-ResNet, which requires 399.57 × 10^6 FLOPs.
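As a rough illustration of how a trainable-parameter count such as the one in Table VIII arises, the sketch below applies the standard closed-form counts for convolutional, LSTM, and fully connected layers. The layer sizes in the example are hypothetical and are not the ESSRAN configuration; the function names are likewise illustrative. Note that frameworks differ in the LSTM bias convention: some keep one bias vector per gate and others two, which the `biases_per_gate` argument exposes as an assumption.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, out_channels, bias=True):
    """Trainable parameters of a 2-D convolution layer."""
    weights = kernel_h * kernel_w * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def lstm_params(input_size, hidden_size, biases_per_gate=1):
    """Trainable parameters of a single-layer, unidirectional LSTM.

    Each of the four gates has an input-to-hidden matrix, a
    hidden-to-hidden matrix, and one or two bias vectors.
    """
    per_gate = (hidden_size * (input_size + hidden_size)
                + biases_per_gate * hidden_size)
    return 4 * per_gate


def dense_params(in_features, out_features, bias=True):
    """Trainable parameters of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


if __name__ == "__main__":
    # Hypothetical layer sizes, chosen only to show the arithmetic.
    total = (conv2d_params(3, 3, 16, 32)
             + lstm_params(input_size=10, hidden_size=20)
             + dense_params(20, 16))
    print(f"total trainable parameters: {total}")
```

Counts of this kind depend only on layer shapes, not on depth of training or input size, which is consistent with a shallow network having relatively few parameters even when its recurrent component makes training slower.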

G. Discussion
First, three hyperparameters (including learning rate, spatial size, and training size) that affect the experimental performance are tested in cross-validation experiments. The learning rate of the network is closely related to convergence and is set to 0.005 for the IP dataset and 0.001 for the PU and SA datasets. As the spatial size increases, the experimental accuracy increases first and then stabilizes. The best accuracy is achieved when the spatial size is 9 × 9, considering the input size of the proposed framework. Furthermore, the performance of all  information; better processing of edge information; and more accurate recognition of categories with small sample sizes.
Third, ablation experiments are used to demonstrate the role of individual structures in the model. We analyzed the impact of using or not using SSAN, LSTM, and PC in the model on the experimental results. The experimental results prove that all the added structures are beneficial to the accuracy. The combination of ResNet and LSTM allows it to better capture contextual information and retain the spectral and spatial features extracted by SSAN, which is extremely important for improving classification performance. In addition, PC has a good effect on the accuracy improvement of small sample categories, which is more obvious on the IP dataset.

IV. CONCLUSION
In this article, we proposed an ESSRAN for HSIC. First, the network uses a spectral-spatial attention mechanism to extract efficient and discriminative spectral and spatial information. Then, deep features are extracted using ResNet with the addition of LSTM, which captures the relationships between adjacent spectral bands. The residual structure combines the original features with the transformed features to obtain a stronger feature representation, which can further improve classification performance. Extensive experiments on three widely used HSI datasets demonstrate that the proposed ESSRAN model outperforms the compared state-of-the-art methods and achieves the highest classification accuracy. The network obtains high classification accuracy with a simple structure, which demonstrates the advantages of the proposed method. In addition, experiments show that the ESSRAN algorithm gives excellent classification results for data with an uneven class distribution and a small number of samples, which helps to address the difficulty of obtaining labeled hyperspectral data. Considering that the proposed algorithm is a supervised classification method, in future research we will explore semisupervised and unsupervised approaches, as well as more novel network models, to make hyperspectral classification more intelligent and accurate, and thus more widely applicable.